@efric (Member) commented Oct 17, 2025

This patch introduces iteration space expansion for reductions in the VectorDistribute path.

Specifically, we:

  1. Add a new attribute, expand_dims, for reductions.
  2. Introduce a new pass, GPUExpandDimensions, which uses expand_dims to expand the iteration space of relevant dimensions.
  3. Refactor common functionality shared between GPUExpandDimensions and BlockDynamicDimensions into reusable utilities.
  4. Refactor encoding helpers from EncodingAttrs.cpp into reusable utilities.

This change also enables chain FMA in matvec codegen as we iterate along the K reduction dimension.
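As an illustration only (this is not the pass itself, which operates on MLIR), the idea of expanding a reduction's iteration space can be sketched in plain Python. The function names and the `inner` parameter are hypothetical; the sketch assumes the reduction dimension K is split evenly into `(K // inner, inner)`, with one running partial sum per inner index and a final combine at the end.

```python
def dot_flat(a, b):
    """Reduce over a single K dimension, one element at a time."""
    acc = 0.0
    for k in range(len(a)):
        acc += a[k] * b[k]
    return acc


def dot_expanded(a, b, inner):
    """Same reduction with K viewed as a (K // inner, inner) iteration space.

    `inner` is a hypothetical expansion factor chosen for this sketch.
    """
    k = len(a)
    assert k % inner == 0, "this sketch only handles evenly divisible K"
    # One accumulator per inner index; each is updated by a chain of
    # multiply-adds as we step along the outer dimension.
    partial = [0.0] * inner
    for k0 in range(k // inner):
        for k1 in range(inner):
            idx = k0 * inner + k1
            partial[k1] += a[idx] * b[idx]
    # Final combine across the inner dimension.
    return sum(partial)
```

With `inner = 1` the expanded form degenerates to the flat loop. For larger `inner`, the same products are accumulated in a different order, which is why the numerics can shift slightly.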


Performance Summary

IREE benchmark module

  • Only expansion: ~4% improvement
  • Expansion + chain FMA: ~11% improvement

rocprof

  • Only expansion: ~13% worse
  • Expansion + chain FMA: ~9% better

Register usage

  • 10% reduction (60 → 54 registers for matvec dispatches)

Instruction latency (post-reduction loop epilogue)

  • 3.5% improvement (340 → 328 total mean latency)
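The register and latency percentages above follow directly from the before/after counts. A small sketch of that arithmetic, where `percent_change` is a helper name introduced here for illustration:

```python
def percent_change(before, after):
    """Signed percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


# Register usage: 60 -> 54
registers = percent_change(60, 54)
# Total mean latency of the post-reduction epilogue: 340 -> 328
latency = percent_change(340, 328)

print(f"registers: {registers:.1f}%")
print(f"latency:   {latency:.1f}%")
```

Negative values mean a reduction, so both figures are improvements.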

Notes

  • As a follow-up, we can explore applying iteration space expansion to the reduction in attention.
  • Right now, we only expand one dimension into two, although the implementation supports expansion to N dimensions.
  • Please note that this PR changes the reduction order, so expect some minor changes to the numerics.
  • This does not improve performance by itself and can cause a regression without chain FMA: [LLVMGPU][Codegen] Emit packed chain FMA from select multi_reductions and contracts #21855
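To see why a changed reduction order can change the numerics, here is a small, deliberately extreme Python sketch. It is not taken from the PR; the function names and the expansion factor `inner` are assumptions for illustration. Floating-point addition is not associative, so grouping the same values differently can round differently.

```python
def sum_flat(values):
    """Accumulate left to right along one dimension."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def sum_expanded(values, inner):
    """Accumulate `inner` interleaved partial sums, then combine them."""
    assert len(values) % inner == 0
    partial = [0.0] * inner
    for i, v in enumerate(values):
        partial[i % inner] += v
    acc = 0.0
    for p in partial:
        acc += p
    return acc


# Large values cancel each other; small values are lost or kept
# depending on when they meet the large ones.
values = [1e16, 1.0, -1e16, 1.0]
print(sum_flat(values))         # 1.0
print(sum_expanded(values, 2))  # 2.0
```

Real workloads rarely show differences this large. The example is constructed so that the effect is visible in a four-element input.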

Traces for matvec dispatches are attached for all variations (original, only expansion, and expansion + chain FMA).

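115_expansion_and_chain.tar.gz (https://github.com/user-attachments/files/23268046/115_expansion_and_chain.tar.gz)
115_nothing.tar.gz (https://github.com/user-attachments/files/23268047/115_nothing.tar.gz)
115_only_expansion.tar.gz (https://github.com/user-attachments/files/23268048/115_only_expansion.tar.gz)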

Fixes: #22153

ci-extra: test_torch

efric added 23 commits October 12, 2025 21:15
Signed-off-by: Eric Feng <[email protected]>
@efric changed the title from "[GPU][Codegen] Expand dimensions based on expand_dims lowering_config" to "[GPU][Codegen] Expand iteration space based on new expand_dims lowering_config" on Oct 17, 2025
efric added 6 commits October 16, 2025 23:20
Signed-off-by: Eric Feng <[email protected]>
@Max191 (Contributor) left a comment

Overall looks good now, just some nits and one comment about the configuration logic. Nice work, and thanks for addressing all my comments so far!

In case I'm OOO by the time you address these changes, I'll remove my request for changes now. It LGTM after these final comments, so anyone else can give the final approval after this round of comments.

Comment on lines 7 to 24
#include "iree/compiler/Codegen/Common/Transforms.h"
#include "iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.h"
#include "iree/compiler/Dialect/LinalgExt/Transforms/Transforms.h"
#include "iree/compiler/Dialect/Util/IR/UtilDialect.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVectorExtras.h"
#include "llvm/Support/DebugLog.h"
#include "llvm/Support/LogicalResult.h"
#include "mlir/Dialect/Affine/IR/AffineOps.h"
#include "mlir/Dialect/Arith/Utils/Utils.h"
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/MemRef/Transforms/Transforms.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"
#include "mlir/IR/AffineExpr.h"
#include "mlir/IR/AffineMap.h"
#include "mlir/IR/BuiltinTypeInterfaces.h"
#include "mlir/IR/OpDefinition.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"
Contributor

nit: Please clean up the includes. I think a lot of them are probably not necessary.

Member Author

I think we need these remaining ones


// -----

func.func @expand_dynamic_dim(%a: tensor<4x?xf16>, %b: tensor<1x?xf16>) -> tensor<4x1xf32> {
Contributor

nit: For this test and the test below, can you use something simple like linalg.add? The tests are quite long, and I don't think it needs to be a matvec to test what is needed.

Member Author

Switched to linalg.add for all except one (I think it is a useful reference since it's the only real use of this atm).

@Max191 Max191 dismissed their stale review December 10, 2025 20:34

I will be OOO, and it looks good to me after the final comments are addressed. Someone else can give the final approval.

efric and others added 11 commits December 16, 2025 15:20
Signed-off-by: Eric Feng <[email protected]>
@Groverkss
Contributor

@efric I added a ci-extra trailer to run test_torch. Can you check?

Signed-off-by: Eric Feng <[email protected]>
@efric force-pushed the static_dimension_expansion branch from edb0ed2 to 8850fc8 on December 18, 2025 23:02
efric added a commit that referenced this pull request Dec 22, 2025
The current binary size for this test is nearing its threshold
(459794/460000). Bump the threshold to 470000 to give future changes
such as #22342 some room (the size goes up to 460306).

Signed-off-by: Eric Feng <[email protected]>
@efric efric merged commit b879ee1 into iree-org:main Dec 23, 2025
130 of 142 checks passed
bangtianliu pushed a commit to bangtianliu/iree that referenced this pull request Dec 23, 2025
bangtianliu pushed a commit to bangtianliu/iree that referenced this pull request Dec 23, 2025
efric added a commit to efric/iree that referenced this pull request Dec 24, 2025
efric added a commit to efric/iree that referenced this pull request Dec 24, 2025
efric added a commit that referenced this pull request Dec 25, 2025
…ms` attribute (#22342) (#22982)

This reverts commit b879ee1.

This temporarily fixes the ONNX models test suite, whose failures are
suspected to stem from [[DispatchCreation] Unit dims not properly folded
away during GlobalOptimization](#22978), and addresses a reported
regression.

Signed-off-by: Eric Feng <[email protected]>
bangtianliu pushed a commit to bangtianliu/iree that referenced this pull request Dec 26, 2025
bangtianliu pushed a commit to bangtianliu/iree that referenced this pull request Dec 26, 2025
bangtianliu pushed a commit to bangtianliu/iree that referenced this pull request Dec 26, 2025
Signed-off-by: Bangtian Liu <[email protected]>
bangtianliu pushed a commit to bangtianliu/iree that referenced this pull request Dec 26, 2025
Signed-off-by: Bangtian Liu <[email protected]>
Development

Successfully merging this pull request may close these issues.

[GPU Codegen] Reduction Optimization: Expand iteration space of innermost reduction dimension
