[MLIR][XeGPU] Remove the transpose attribute from Gather/Scatter ops and clean up the documentation #145389

Merged
132 changes: 36 additions & 96 deletions mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td
@@ -80,9 +80,6 @@ def XeGPU_CreateNdDescOp: XeGPU_Op<"create_nd_tdesc", [Pure, ViewLikeOpInterface
information e.g., memref<?x?xf16>, the strides information has to be explicitly
passed via the "strides" and "const_strides" argument.

In SIMT mode, tensor descriptor is augmented with `LayoutAttr` which describes the
mapping of the tensor descriptor to the work items.

Example 1 (suppose the tensor shape inferred by the compiler is 8x16):
```mlir
%0 = memref.alloc() : memref<1024x1024xf32>
@@ -106,15 +103,6 @@ def XeGPU_CreateNdDescOp: XeGPU_Op<"create_nd_tdesc", [Pure, ViewLikeOpInterface
%c1 = arith.constant 1 : index
%1 = xegpu.create_nd_tdesc %0[%c0, %c0], [%h, %w], [%w, %c1]: ui64 -> TensorDesc<8x16xf32>
```

Example 4 (SIMT mode):
```mlir
%0 = memref.alloc() : memref<1024x1024xf32>
%c0 = arith.constant 0 : index
%c1 = arith.constant 8 : index
%1 = xegpu.create_nd_tdesc %0[%c0, %c0] : memref<1024x1024xf32>
-> !xegpu.tensor_desc<8x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
```
}];

let arguments = (ins
@@ -301,9 +289,7 @@ def XeGPU_LoadNdOp : XeGPU_Op<"load_nd", [
fp32 or fp64. It implies that vnni and transpose cannot exist at the
same time.
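
For illustration only (this sketch is not part of the diff, and the attribute spellings and result shapes below are assumptions based on the surrounding docs): the two hardware features above surface as optional attributes on `load_nd`, and they are mutually exclusive.

```mlir
// Packed (vnni) load of a 16-bit B operand: pairs of rows are interleaved,
// so a 16x16xf16 descriptor is delivered as vector<8x16x2xf16>.
%b = xegpu.load_nd %tdesc_b <{packed}>
    : !xegpu.tensor_desc<16x16xf16> -> vector<8x16x2xf16>
// Transposed load of a 32-bit operand: rows and columns are swapped on load.
%a = xegpu.load_nd %tdesc_a <{transpose = array<i64: 1, 0>}>
    : !xegpu.tensor_desc<16x8xf32> -> vector<8x16xf32>
```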

In SIMT mode, LoadNdOp expects the tensor descriptor to be augmented with `LayoutAttr`
which describes the mapping of the tensor to the work items. In this case, result
vector represents the data to be loaded by each work-item.
In SIMT mode, the result vector represents the data to be loaded by each work-item.

Example 1:
```mlir
@@ -317,8 +303,7 @@
```mlir
xegpu.load_nd %1 {l1_hint = #xegpu.cache_hint<cached>,
l2_hint = #xegpu.cache_hint<uncached>}>
: !xegpu.tensor_desc<8x16xf32,
#xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>> -> vector<8x1xf32>
: !xegpu.tensor_desc<8x16xf32> -> vector<8xf32>
```


@@ -359,9 +344,7 @@ def XeGPU_StoreNdOp : XeGPU_Op<"store_nd", [
of cache, L1, L2 and L3. If hardware does not have a corresponding cache,
the corresponding cache hint attribute will be masked.

In SIMT mode, StoreNdOp expects the tensor descriptor to be augmented with `LayoutAttr`
which describes the mapping of the tensor to the work items. In this case, input
vector represents the data to be stored by each work-item.
In SIMT mode, the input vector represents the data to be stored by each work-item.

Example 1:
```mlir
@@ -375,8 +358,7 @@ def XeGPU_StoreNdOp : XeGPU_Op<"store_nd", [
xegpu.store_nd %3, %2 {l1_hint = #xegpu.cache_hint<uncached>,
l2_hint = #xegpu.cache_hint<write_back>,
l3_hint = #xegpu.cache_hint<write_through>}
: vector<8x1xf16>, !xegpu.tensor_desc<8x16xf16,
#xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
: vector<8xf16>, !xegpu.tensor_desc<8x16xf16>
```


@@ -410,15 +392,10 @@ def XeGPU_UpdateNdOffsetOp : XeGPU_Op<"update_nd_offset",
The offsets are relative to the current position, expressed in number of
elements. The result is a TensorDesc of the same type as the input.

Example 1:
Example:
```
%2 = xegpu.update_nd_offset %1, [0, 16]: !xegpu.tensor_desc<8x16xf32>
```
Example 2 (SIMT mode):
```
%2 = xegpu.update_nd_offset %1, [0, 16]:
!xegpu.tensor_desc<8x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
```
}];

let arguments = (ins
@@ -476,11 +453,6 @@ def XeGPU_CreateDescOp: XeGPU_Op<"create_tdesc", [Pure, ViewLikeOpInterface]> {
match the dimension of offsets. It may also have a second dimension corresponding to
the chunk_size if the chunk size is larger than 1.

In SIMT mode, similar to `create_nd_tdesc` the resulting tensor descriptor is augmented
with `LayoutAttr` which describes the mapping of the tensor descriptor to the work items.
In this case, the first dimension of the tensor descriptor represents the work-items, and
the second dimension represents the chunk size.

Example 1: It assumes subgroup size is 4, and accesses a[0], a[16], a[32], a[64]
```mlir
%a = memref.alloc() : memref<1024xf32>
@@ -505,15 +477,6 @@ def XeGPU_CreateDescOp: XeGPU_Op<"create_tdesc", [Pure, ViewLikeOpInterface]> {
%1 = xegpu.create_tdesc %0, %off : memref<1024xf32>, vector<4xindex>
-> TensorDesc<4x8xf32, #xegpu.scattered_tdesc_attr<chunk_size = 8>>
```

Example 4: SIMT mode
```mlir
%0 = memref.alloc() : memref<1024xf32>
%off = arith.constant dense<[0, 16, 32, 64]> : vector<4xindex>
%1 = xegpu.create_tdesc %0, %off : memref<1024xf32>, vector<4xindex>
-> TensorDesc<4x8xf32, #xegpu.scattered_tdesc_attr<chunk_size = 8>,
#xegpu.layout<lane_layout = [4, 1], lane_data = [1, 1]>>
```
}];

let arguments = (ins XeGPU_BaseAddrType: $source,
@@ -609,54 +572,44 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [
let description = [{ It (aka. load) loads data per work-item. The output
describes the data being loaded at the subgroup level, so its size is
consistent with the number of work-items in a subgroup. When the chunk size
is larger than 2, the output vector is a 2D vector, with dim-1 corresponding
to work-items, and dim-0 corresponding to the chunk size loaded by each work-item.
Specifically, there is a transpose effect on the result (as compared to the TensorDesc)
due to the hardware implementation. Therefore, a transpose attribute is introduced
on purpose, making sure users are aware of this implicit transformation.

is larger than 2, the output vector is a 2D vector, with dim-0 corresponding
to work-items, and dim-1 corresponding to the chunk size loaded by each work-item.
Contributor:

Are we assuming here that there will always be a transpose operation after the load?

I wonder how a user can understand the semantics of this op. What if the user does not want the transpose and wants to use the op in isolation (which is perfectly legal)?

Contributor (author):

There is no transpose. The semantics are that each row corresponds to a lane. In the SIMD lowering pipeline, the transpose will be added when we lower load_gather to the corresponding intrinsic. For SIMT lowering, there is no transpose at all.
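
A minimal sketch of the row-per-lane semantics described above (the shapes are chosen for illustration and mirror Examples 2/3 of the op documentation; not taken verbatim from the PR):

```mlir
// Subgroup view: each of the 16 rows of the result belongs to one lane.
%v = xegpu.load %tdesc, %mask
    : !xegpu.tensor_desc<16x8xf16, #xegpu.scatter_tdesc_attr<chunk_size = 8>>,
      vector<16xi1> -> vector<16x8xf16>
// SIMT view: each lane only sees its own row, i.e. a vector<8xf16>.
```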

charithaintc (Contributor), Jun 24, 2025:

I thought about it again.

It seems like now xegpu.load (with chunk > 1) is just a logical operation, meaning it does not have a matching HW instruction. Logically we can use it without an accompanying transpose operation; that is true.

In practice, it will always come with an accompanying transpose. It will mostly be useful for the A*BT case. In that case we always need an explicit vector.transpose after the xegpu.load. During lowering, the load + transpose are optimized away in both the SIMD and SIMT paths. Essentially we say "we have a HW instruction that can do both of these together, so the transpose here is a nop". No need to do any shuffling for the transpose.

For the A*B case, I think doing multiple loads may be cheaper than doing a load gather and then an in-register transpose; I am not sure about this case.

The A*BT case:

```mlir
func.func @load_gather_with_transpose_effect(%arg0: memref<8x16xf16>, %arg1: memref<256xf16>, %arg2: memref<8x16xf32>) {
  %c0 = arith.constant 0 : index
  %0 = xegpu.create_nd_tdesc %arg0[%c0, %c0] : memref<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
  %1 = xegpu.load_nd %0 : !xegpu.tensor_desc<8x16xf16> -> vector<8x16xf16>
  %cst = arith.constant dense<[0, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240]> : vector<16xindex>
  %cst_0 = arith.constant dense<true> : vector<16xi1>
  %2 = xegpu.create_tdesc %arg1, %cst : memref<256xf16>, vector<16xindex> -> !xegpu.tensor_desc<16x16xf16, #xegpu.scatter_tdesc_attr<chunk_size = 16 : i64>>
  %3 = xegpu.load %2, %cst_0 : !xegpu.tensor_desc<16x16xf16, #xegpu.scatter_tdesc_attr<chunk_size = 16 : i64>>, vector<16xi1> -> vector<16x16xf16> // layout = [16, 1][1, 2]
  %6 = vector.transpose %3, [1, 0] : vector<16x16xf16> to vector<16x16xf16> // this is a NOP // layout = [1, 16][2, 1]
  %4 = xegpu.dpas %1, %6 : vector<8x16xf16>, vector<16x16xf16> -> vector<8x16xf32>
  %5 = xegpu.create_nd_tdesc %arg2[%c0, %c0] : memref<8x16xf32> -> !xegpu.tensor_desc<8x16xf32>
  xegpu.store_nd %4, %5 : vector<8x16xf32>, !xegpu.tensor_desc<8x16xf32>
  return
}
```

The A*B case:

```mlir
func.func @load_gather_with_transpose_effect(%arg0: memref<8x16xf16>, %arg1: memref<256xf16>, %arg2: memref<8x16xf32>) {
  %c0 = arith.constant 0 : index
  %0 = xegpu.create_nd_tdesc %arg0[%c0, %c0] : memref<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
  %1 = xegpu.load_nd %0 : !xegpu.tensor_desc<8x16xf16> -> vector<8x16xf16>
  %cst = arith.constant dense<[0, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240]> : vector<16xindex>
  %cst_0 = arith.constant dense<true> : vector<16xi1>
  %2 = xegpu.create_tdesc %arg1, %cst : memref<256xf16>, vector<16xindex> -> !xegpu.tensor_desc<16x16xf16, #xegpu.scatter_tdesc_attr<chunk_size = 16 : i64>>
  %3 = xegpu.load %2, %cst_0 : !xegpu.tensor_desc<16x16xf16, #xegpu.scatter_tdesc_attr<chunk_size = 16 : i64>>, vector<16xi1> -> vector<16x16xf16>
  %4 = xegpu.dpas %1, %3 : vector<8x16xf16>, vector<16x16xf16> -> vector<8x16xf32>
  %5 = xegpu.create_nd_tdesc %arg2[%c0, %c0] : memref<8x16xf32> -> !xegpu.tensor_desc<8x16xf32>
  xegpu.store_nd %4, %5 : vector<8x16xf32>, !xegpu.tensor_desc<8x16xf32>
  return
}
```

The A*BT case is clear to me, but I am not sure what we do with the A*B case here. Maybe I am still missing something. @Jianhui-Li, can you also clarify these examples? I know that A*B is not a real use case, but I am still confused about how layout propagation works here.

Contributor:

Never mind, it is clear now after discussing with @Jianhui-Li. The A*B case will need a convert_layout because the load is not giving us the layout needed for DPAS B.

Contributor:

For the A*B case, since you use load w/ chunk_size for B, it assumes a [16, 1] [1, 2] layout. The propagation needs to insert an xegpu.convert_layout to convert it to [1, 16][2, 1] before it is fed to DPAS.
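
A hedged sketch of the A*B variant after propagation inserts the conversion (the `xegpu.convert_layout` attribute names and layout syntax below are assumptions, not taken from this PR):

```mlir
// B is loaded with chunk_size, which implies lane_layout = [16, 1], lane_data = [1, 2].
%b = xegpu.load %tdesc_b, %mask
    : !xegpu.tensor_desc<16x16xf16, #xegpu.scatter_tdesc_attr<chunk_size = 16 : i64>>,
      vector<16xi1> -> vector<16x16xf16>
// Propagation inserts a layout conversion to the DPAS-B layout
// (lane_layout = [1, 16], lane_data = [2, 1]); this lowers to cross-lane shuffles.
%b_cvt = xegpu.convert_layout %b <{
    input_layout  = #xegpu.layout<lane_layout = [16, 1], lane_data = [1, 2]>,
    target_layout = #xegpu.layout<lane_layout = [1, 16], lane_data = [2, 1]>}>
    : vector<16x16xf16>
%acc = xegpu.dpas %a, %b_cvt : vector<8x16xf16>, vector<16x16xf16> -> vector<8x16xf32>
```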

Contributor:

So from a lowering perspective we expect two cases:

  1. xegpu.load + vector.transpose: the regular case; just lower to the load-with-chunk-size intrinsic.
  2. xegpu.load + convert_layout: lower to the load-with-chunk-size intrinsic and do cross-lane shuffles.

Contributor:

Yes.
One thing to note in the lowering: in your code example, the user specifies xegpu.load w/ chunk_size, which will be lowered to XeVM.load w/ vector size by default (each lane loads contiguous data).
If the user overrides the layout of xegpu.load w/ chunk_size, say forcing it to take a [1, 16] [2, 1] layout, it will need to be lowered to multiple regular XeVM.loads, since the data loaded by each lane is then not contiguous.

Contributor:

> If the user overrides the layout of xegpu.load w/ chunk_size, say forcing it to take a [1, 16] [2, 1] layout, it will need to be lowered to multiple regular XeVM.loads, since the data loaded by each lane is then not contiguous.

Is the user allowed to do this? I would also like it if we keep it relaxed. But I can see that in this PR we have hard-coded the scattered load layout to [16, 1][1, 2]. Check here:
https://github.com/llvm/llvm-project/pull/145389/files#diff-fcc9cdbf8bb4e5d37e661524b877082aee9b7badb0317f980c1881da564a926dR230-R237

Jianhui-Li (Contributor), Jun 25, 2025:

I meant that the propagation pass will be improved to allow the user to set a layout that overrides the default decision.

The mask operand masks out memory access so that it is safe to pass out-of-boundary
addresses/offsets as long as they are masked. It applies to slots of SIMD lanes.

In SIMT mode, LoadGatherOp expects the tensor descriptor to be augmented with `LayoutAttr`
which describes the mapping of the tensor to the work items. In this case, result vector
represents the data to be loaded by each work-item. Each work-item receives a `chunk_size`
number of elements.
In SIMT mode, the result vector represents the data to be loaded by each work-item.
Each work-item receives a `chunk_size` number of elements.

Example 1:
```mlir
%2 = xegpu.load %1, %0 {l1_hint = #xegpu.cache_hint<cached>,
l2_hint = #xegpu.cache_hint<uncached>,
l3_hint = #xegpu.cache_hint<uncached>}
%2 = xegpu.load %1, %0 <{l1_hint = #xegpu.cache_hint<cached>,
l2_hint = #xegpu.cache_hint<uncached>,
l3_hint = #xegpu.cache_hint<uncached>}>
: !xegpu.tensor_desc<16xf32, #xegpu.scatter_tdesc_attr<memory_space=global>>,
vector<16xi1> -> vector<16xf32>
```

Example 2:
```mlir
%2 = xegpu.load %1, %0 {transpose,
l1_hint = #xegpu.cache_hint<cached>,
l2_hint = #xegpu.cache_hint<uncached>,
l3_hint = #xegpu.cache_hint<uncached>}
%2 = xegpu.load %1, %0 <{l1_hint = #xegpu.cache_hint<cached>,
l2_hint = #xegpu.cache_hint<uncached>,
l3_hint = #xegpu.cache_hint<uncached>}>
: !xegpu.tensor_desc<16x8xf32, #xegpu.scatter_tdesc_attr<memory_space=global, chunk_size=8>>,
vector<16xi1> -> vector<8x16xf32>
vector<16xi1> -> vector<16x8xf32>
```
Example 3 (SIMT mode):
```mlir
%2 = xegpu.load %1, %0 {transpose,
l1_hint = #xegpu.cache_hint<cached>,
l2_hint = #xegpu.cache_hint<uncached>,
l3_hint = #xegpu.cache_hint<uncached>}
: !xegpu.tensor_desc<16x8xf32, #xegpu.scatter_tdesc_attr<memory_space=global, chunk_size=8>,
!xegpu.layout<lane_layout = [16, 1], lane_data = [1, 1]>>
vector<16xi1> -> vector<8x1xf32>
%2 = xegpu.load %1, %0 <{l1_hint = #xegpu.cache_hint<cached>,
l2_hint = #xegpu.cache_hint<uncached>,
l3_hint = #xegpu.cache_hint<uncached>}>
: !xegpu.tensor_desc<16x8xf32, #xegpu.scatter_tdesc_attr<memory_space=global, chunk_size=8>>
vector<16xi1> -> vector<8xf32>
```

}];

let arguments = (ins XeGPU_TensorDesc: $TensorDesc,
XeGPU_MaskType: $mask,
OptionalAttr<UnitAttr>: $transpose,
OptionalAttr<XeGPU_CacheHintAttr>: $l1_hint,
OptionalAttr<XeGPU_CacheHintAttr>: $l2_hint,
OptionalAttr<XeGPU_CacheHintAttr>: $l3_hint);
@@ -699,44 +652,38 @@ def XeGPU_StoreScatterOp : XeGPU_Op<"store", [
has transpose effect, which is similar to `load_gather`. Therefore, a transpose attribute is
introduced on purpose, making sure users are aware of this implicit transformation.

In SIMT mode, StoreScatterOp expects the tensor descriptor to be augmented with `LayoutAttr`
which describes the mapping of the tensor to the work items. In this case, input vector
represents the data to be stored by each work-item. Each work-item receives a `chunk_size`
number of elements.
In SIMT mode, the input vector represents the data to be stored by each work-item.
Each work-item stores a `chunk_size` number of elements.

Example 1:
```mlir
xegpu.store %0, %1, %2 {l1_hint = #xegpu.cache_hint<uncached>,
l2_hint = #xegpu.cache_hint<write_back>,
l3_hint = #xegpu.cache_hint<write_through>}
xegpu.store %0, %1, %2 <{l1_hint = #xegpu.cache_hint<uncached>,
l2_hint = #xegpu.cache_hint<write_back>,
l3_hint = #xegpu.cache_hint<write_through>}>
: vector<16xf32>, !xegpu.tensor_desc<16xf32, #xegpu.scattered_tdesc_attr<>>, vector<16xi1>
```

Example 2:
```mlir
xegpu.store %0, %1, %2 {transpose,
l1_hint = #xegpu.cache_hint<uncached>,
l2_hint = #xegpu.cache_hint<write_back>,
l3_hint = #xegpu.cache_hint<write_through>}
: vector<8x16xf32>, !xegpu.tensor_desc<16x8xf32, #xegpu.scattered_tdesc_attr<chunk_size=8>>, vector<16xi1>
xegpu.store %0, %1, %2 <{l1_hint = #xegpu.cache_hint<uncached>,
l2_hint = #xegpu.cache_hint<write_back>,
l3_hint = #xegpu.cache_hint<write_through>}>
: vector<16x8xf32>, !xegpu.tensor_desc<16x8xf32, #xegpu.scattered_tdesc_attr<chunk_size=8>>, vector<16xi1>
```

Example 3 (SIMT mode):
```mlir
xegpu.store %0, %1, %2 {transpose,
l1_hint = #xegpu.cache_hint<uncached>,
l2_hint = #xegpu.cache_hint<write_back>,
l3_hint = #xegpu.cache_hint<write_through>}
: vector<8x1xf32>, !xegpu.tensor_desc<16x8xf32, #xegpu.scattered_tdesc_attr<chunk_size=8>,
!xegpu.layout<lane_layout = [16, 1], lane_data = [1, 1]>> vector<16xi1>
xegpu.store %0, %1, %2 <{l1_hint = #xegpu.cache_hint<uncached>,
l2_hint = #xegpu.cache_hint<write_back>,
l3_hint = #xegpu.cache_hint<write_through>}>
: vector<8xf32>, !xegpu.tensor_desc<16x8xf32, #xegpu.scattered_tdesc_attr<chunk_size=8>> vector<16xi1>
```

}];

let arguments = (ins
XeGPU_ValueType: $value,
XeGPU_TensorDesc: $TensorDesc,
XeGPU_MaskType: $mask,
OptionalAttr<UnitAttr>: $transpose,
OptionalAttr<XeGPU_CacheHintAttr>: $l1_hint,
OptionalAttr<XeGPU_CacheHintAttr>: $l2_hint,
OptionalAttr<XeGPU_CacheHintAttr>: $l3_hint);
@@ -773,20 +720,13 @@ def XeGPU_UpdateOffsetOp: XeGPU_Op<"update_offset",
update the offset per work-item, so its offsets operand contains values representing
shifts for each work-item.

Example 1:
Example:
```mlir
%off = arith.constant dense<[32, 32, 32, 32]> : vector<4xindex>
%2 = xegpu.update_offset %1, %off :
!xegpu.tensor_desc<4x2xf32, #xegpu.scattered_tdesc_attr<chunk_size=2>>, vector<4xindex>
```

Example 2 (SIMT mode):
```mlir
%off = arith.constant dense<[32, 32, 32, 32]> : vector<4xindex>
%2 = xegpu.update_offset %1, %off :
!xegpu.tensor_desc<4x2xf32, #xegpu.scattered_tdesc_attr<chunk_size=2>,
#xegpu.layout<lane_layout = [4, 1], lane_data = [1, 1]>>, vector<4xindex>
```
}];

let arguments = (ins XeGPU_TensorDesc: $TensorDesc,
2 changes: 2 additions & 0 deletions mlir/include/mlir/Dialect/XeGPU/Utils/XeGPUUtils.h
@@ -35,6 +35,8 @@ constexpr unsigned packedSizeInBitsForDefault =
16; // Minimum packing size per register for DPAS A.
constexpr unsigned packedSizeInBitsForDpasB =
32; // Minimum packing size per register for DPAS B.
constexpr unsigned packedSizeInBitsForGatherScatter =
32; // Minimum packing size per register for Gather and Scatter ops.
} // namespace targetinfo
} // namespace xegpu

35 changes: 20 additions & 15 deletions mlir/lib/Dialect/XeGPU/IR/XeGPUDialect.cpp
@@ -8,6 +8,7 @@

#include "mlir/Dialect/Utils/IndexingUtils.h"
#include "mlir/Dialect/XeGPU/IR/XeGPU.h"
#include "mlir/Dialect/XeGPU/Utils/XeGPUUtils.h"
#include "mlir/IR/Builders.h"
#include "mlir/IR/DialectImplementation.h"
#include "llvm/ADT/TypeSwitch.h"
@@ -309,11 +310,23 @@ LogicalResult TensorDescType::verify(
llvm::ArrayRef<int64_t> shape, mlir::Type elementType,
mlir::Attribute encoding, mlir::Attribute layout) {
size_t rank = shape.size();
// Low-precision types are packed in 32-bit units.
int32_t packingFactor = 32 / elementType.getIntOrFloatBitWidth();
if (rank != 1 && rank != 2)
return emitError() << "expected 1D or 2D tensor";

auto blockAttr = mlir::dyn_cast_if_present<BlockTensorDescAttr>(encoding);
if (blockAttr) {
MemorySpaceAttr memorySpaceAttr = blockAttr.getMemorySpace();
if (rank == 2 && memorySpaceAttr &&
memorySpaceAttr.getValue() == MemorySpace::SLM)
return emitError() << "SLM is not supported for 2D block tensor";
}

// For gather and scatter ops, low-precision types are packed in 32-bit units.
unsigned bitWidth = elementType.getIntOrFloatBitWidth();
int chunkAlignmentFactor =
bitWidth < targetinfo::packedSizeInBitsForGatherScatter
? targetinfo::packedSizeInBitsForGatherScatter / bitWidth
: 1;
auto scatterAttr = mlir::dyn_cast_if_present<ScatterTensorDescAttr>(encoding);
if (scatterAttr) {
// Expected tensor ranks for scattered data:
@@ -329,21 +342,13 @@ LogicalResult TensorDescType::verify(
if (chunkSize > 1) {
if (shape.back() != chunkSize)
return emitError() << "expected tensor shape[1] to match chunk size";
if (shape.back() % packingFactor != 0)
return emitError()
<< "expected tensor shape[1] to be a multiple of packing factor "
<< packingFactor;
if (shape.back() % chunkAlignmentFactor != 0)
return emitError() << "expected tensor shape[1] to be a multiple of "
"chunk alignment factor "
<< chunkAlignmentFactor;
}
}

auto blockAttr = mlir::dyn_cast_if_present<BlockTensorDescAttr>(encoding);
if (blockAttr) {
MemorySpaceAttr memorySpaceAttr = blockAttr.getMemorySpace();
if (rank == 2 && memorySpaceAttr &&
memorySpaceAttr.getValue() == MemorySpace::SLM)
return emitError() << "SLM is not supported for 2D block tensor";
}

auto layoutAttr = llvm::dyn_cast_if_present<LayoutAttr>(layout);
if (layoutAttr) {
if (rank != (size_t)layoutAttr.getRank())
@@ -360,7 +365,7 @@ LogicalResult TensorDescType::verify(
if (rank > 1 && laneData[0] != 1)
return emitError()
<< "cannot map over non-contiguous scattered row elements";
if (laneData[rank - 1] != packingFactor)
if (laneData[rank - 1] != chunkAlignmentFactor)
return emitError() << "work item data mapping must match the number of "
"contiguous elements";
}
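
For illustration (not part of the diff), the new chunk-alignment rule accepts and rejects scattered descriptors roughly as follows; the concrete types are assumptions chosen to exercise the check above:

```mlir
// OK: f16 is 16 bits wide, so the chunk alignment factor is 32 / 16 = 2,
// and a chunk size of 8 is a multiple of 2.
!xegpu.tensor_desc<16x8xf16, #xegpu.scatter_tdesc_attr<chunk_size = 8>>

// Rejected: a chunk size of 3 is not a multiple of the alignment factor 2, so the
// verifier emits "expected tensor shape[1] to be a multiple of chunk alignment factor".
!xegpu.tensor_desc<16x3xf16, #xegpu.scatter_tdesc_attr<chunk_size = 3>>
```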