Fix bf16/fp16 accuracy issue in sparsecsr addmm #3273
jenniew wants to merge 21 commits into intel:main from
Conversation
Pull request overview
This PR targets the reported bf16/fp16 accuracy gap in XPU SparseCSR addmm by changing how the beta * input term is applied during dense fallback computation.
Changes:
- Updates `addmm_calculation` to scale `input` by `beta` via an explicit multiply before accumulating into the `mm` result.
- Removes use of the fused `add_(tensor, alpha=beta)` form in favor of `add_(tensor * beta)`.
```diff
  if (beta.toComplexDouble() != 0.) {
    Tensor input_dense = input.layout() != kStrided ? input.to_dense() : input;
-   result_dense.add_(input_dense, beta);
+   result_dense.add_(input_dense * beta);
```
result_dense.add_(input_dense * beta) introduces an extra elementwise multiply + a temporary tensor allocation. This is typically both slower and less numerically accurate for bf16/fp16 than the fused add_(input_dense, /*alpha=*/beta) (single kernel, often FMA). If the intent is to improve bf16/fp16 accuracy, consider keeping the fused add_ form and fixing precision/casting in the underlying XPU add kernel (or doing the accumulation in fp32) rather than splitting into mul+add here.
```diff
-   result_dense.add_(input_dense * beta);
+   result_dense.add_(input_dense, beta);
```
```diff
-   result_dense.add_(input_dense, beta);
+   result_dense.add_(input_dense * beta);
```
Could you explain the difference between these two call forms?
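The difference being asked about is one rounding step versus two. A minimal numpy sketch of the rounding behavior (illustrative only, not the actual XPU kernel; the values `beta`, `acc`, and `x` are contrived so the divergence is visible at fp16, and it assumes the fused form accumulates in higher "opmath" precision, as the review comment above suggests):

```python
import numpy as np

def r16(x: float) -> float:
    """Round a Python float to the nearest fp16 value (round-half-to-even)."""
    return float(np.float16(x))

# Contrived values: beta = 9811/32768 is exactly representable in float64.
beta = 0.299407958984375
acc = r16(1.0)   # stands in for an element of result_dense
x = r16(1.0)     # stands in for an element of input_dense

# add_(input_dense, alpha=beta): the fused kernel can evaluate
# acc + beta*x in higher precision and round to fp16 once.
fused = r16(acc + beta * x)

# add_(input_dense * beta): the multiply materializes an fp16 temporary,
# so the value is rounded twice before reaching the accumulator.
split = r16(acc + r16(beta * x))

print(fused)  # 1.2998046875  (one rounding; closer to exact 1.299407...)
print(split)  # 1.298828125   (two roundings)
```

With these inputs the fused, single-rounding result is strictly closer to the exact value, which is why splitting into mul+add does not by itself improve bf16/fp16 accuracy.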
Fix bf16/fp16 accuracy issue in sparsecsr addmm.
Related issue: #3177