| Age | Commit message (Collapse) | Author |
|
|
|
This PR adds some tests for covering some useful corner cases.
1. more tests for `vector.shape_cast` distribution.
2. testing for `MoveFuncBodyToWarpOp` pattern that was not possible
before.
|
|
Current subgroup distribution test employ the entire
`xegpu-subgroup-distribute` pass which include multiple steps like
layout propagation, move func body into warp op, and distribute to work
items.
This makes it harder to isolate the testing for xegpu subgroup
distribution logic, because certain corner cases may be not supported
yet by other steps mentioned above.
This PR introduces a test pass for subgroup distribution logic and
isolate the testing for distribution logic. We plan to add more corner
case (that were not possible before) covering non-xegpu ops (like
vector) in next PRs.
This PR also include,
1. minor bug fixes in gather/scatter distribution.
2. bug fix in vector multi reduction lowering where it fails to retain
some layouts.
|
|
|
|
load_nd/store_nd/prefetch_nd (#160323)
Adds support for new syntax in XeGPUUnroll for:
1. `create_nd_desc` without offsets
2. `load_nd` with offsets
3. `store_nd` with offsets
4. `prefetch_nd` with offsets
`create_nd_desc with offsets` + `load_nd with offsets` won't be lowered
correctly. In this case the IR would still have two unrealized
conversions that will fail later in the pipeline.
The offsets computation for the unrolled tile is now moved from
descriptors to load/store/prefetch operations. The resulted IR now has
one single descriptor that is being iterated in load/store/prefetch ops.
<details><summary>old/new behavior examples</summary>
```mlir
// before unroll pass:
gpu.func @load_nd(%src: memref<256x318xf32>) -> vector<24x32xf32> {
%tdesc = xegpu.create_nd_tdesc %src : memref<256x318xf32> -> !xegpu.tensor_desc<24x32xf32, #xegpu.layout<inst_data = [8, 16]>>
%ld = xegpu.load_nd %tdesc[8, 16]: !xegpu.tensor_desc<24x32xf32, #xegpu.layout<inst_data = [8, 16]>> -> vector<24x32xf32>
gpu.return %ld : vector<24x32xf32>
}
// after unroll pass (offsets in create_nd_desc):
gpu.func @create_nd_tdesc2(%arg0: memref<256x318xf32>) -> vector<24x32xf32> {
%cst = arith.constant dense<0.000000e+00> : vector<24x32xf32>
%c24 = arith.constant 24 : index
%c32 = arith.constant 32 : index
%c8 = arith.constant 8 : index
%c16 = arith.constant 16 : index
// create 6 descriptors for each tile
%0 = xegpu.create_nd_tdesc %arg0[%c8, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
%1 = xegpu.create_nd_tdesc %arg0[%c8, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
%2 = xegpu.create_nd_tdesc %arg0[%c16, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
%3 = xegpu.create_nd_tdesc %arg0[%c16, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
%4 = xegpu.create_nd_tdesc %arg0[%c24, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
%5 = xegpu.create_nd_tdesc %arg0[%c24, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
%6 = xegpu.load_nd %0 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
%7 = xegpu.load_nd %1 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
%8 = xegpu.load_nd %2 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
%9 = xegpu.load_nd %3 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
%10 = xegpu.load_nd %4 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
%11 = xegpu.load_nd %5 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
...
}
// after unroll pass (offsets in load_nd):
gpu.func @load_nd(%arg0: memref<256x318xf32>) -> vector<24x32xf32> {
%cst = arith.constant dense<0.000000e+00> : vector<24x32xf32>
%c24 = arith.constant 24 : index
%c32 = arith.constant 32 : index
%c16 = arith.constant 16 : index
%c8 = arith.constant 8 : index
// create only one descriptor with proper tile shape
%0 = xegpu.create_nd_tdesc %arg0 : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
// compute tile offsets at the operation (using only one descriptor)
%1 = xegpu.load_nd %0[%c8, %c16] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
%2 = xegpu.load_nd %0[%c8, %c32] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
%3 = xegpu.load_nd %0[%c16, %c16] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
%4 = xegpu.load_nd %0[%c16, %c32] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
%5 = xegpu.load_nd %0[%c24, %c16] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
%6 = xegpu.load_nd %0[%c24, %c32] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
...
}
```
</details>
---------
Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
|
|
offsets (#159453)
This PR adds unrolling/blocking patterns for load_gather and
store_scatter ops with offsets.
|
|
`vector.shape_cast` SIMT distribution. (#157560)
Add support for distributing the `vector.multi_reduction` operation
across lanes in a warp. Currently only 2D to 1D reductions are
supported. Given layouts for the source and accumulator vectors,
* If the reduction dimension is distributed across lanes, the reduction
is non-lane-local and the reduction is done using warp shuffles. Here we
simply rewrite the `MultiDimReductionOp` to a sequence of `ReductionOp`s
inside the warp op body. Actual distribution will be done by
`WarpOpReduction` pattern.
* If the reduction dimension is not distributed across lanes, the
reduction is lane-local. In this case, we yield the source and
accumulator vectors from the warp op and perform the lane-local
reduction outside the warp op using a sequence of `ReductionOp`s.
PR also adds support for distributing `vector.shape_cast` based on
layouts.
|
|
store_matrix. (#154403)
|
|
Mechanically applied using clang-tidy.
|
|
---------
Co-authored-by: Charitha Saumya <136391709+charithaintc@users.noreply.github.com>
|
|
Change local variants to use new central one.
|
|
|
|
Add blocking support for scatter ops: Create_tdesc, update, prefetch,
load and store. It also enables the load/store with chunk size.
|
|
Add support for load/store with chunk_size, which requires special
consideration for the operand blocking since offests and masks are
n-D and tensor are n+1-D. Support operations including create_tdesc,
update_tdesc, load, store, and prefetch.
---------
Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>
|
|
Add unrolling support for create_tdesc, load, store, prefetch, and update_offset.
---------
Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>
Co-authored-by: Chao Chen <chao.chen@intel.com>
|
|
|
|
Similar to vector ops, XeGPU ops need to be unrolled into smaller shapes
such that they can be dispatched into a hardware instruction. This PR
marks the initial phase of a series dedicated to incorporating unroll
patterns for XeGPU operations. In this installment, we introduce
patterns for the following operations:
1. createNd
2. updateNd
3. prefetchNd
4. loadNd
5. storeNd
6. dpas
|