| Age | Commit message (Collapse) | Author |
|
PyYAML is not an actual use-time/runtime dependency of our bindings. It
is necessary only if someone wants to regenerate
`LinalgNamedStructuredOps.yaml`:
https://github.com/llvm/llvm-project/blob/93097b2d47c87bf5eee0a2612d961c7a01831eab/mlir/tools/mlir-linalg-ods-gen/update_core_linalg_named_ops.sh.in#L29
This PR does the minimal refactor to remove the need during actual run/use time.
|
|
|
|
|
|
This patch drops two instances of REQUIRES: shell from MLIR tests. This
feature does not mean much given the internal shell is the default for
MLIR. It does prevent these tests from running on Windows, but it does
not seem like there is anything inherent to these tests preventing them
from running on Windows (minus maybe the lack of spirv-tools, which is
explicitly required anyways.
|
|
Some targets have a specific calling convention that should be used for
generated calls to runtime functions.
Pass that down and use it.
Signed-off-by: Nick Sarnie <nick.sarnie@intel.com>
|
|
FIR operations that use derived types need to have type descriptor
globals available on device when offloading. Examples of this can be
seen in `CUFDeviceGlobal` which ensures that such type descriptor uses
work on device for CUF.
Similarly, this is needed for OpenACC. This change introduces a new
interface to the OpenACC dialect named
`IndirectGlobalAccessOpInterface` which can be attached to operations
that may result in generation of accesses that use type descriptor
globals. This functionality is needed for the `ACCImplicitDeclare` pass
that is coming in a follow-up change which implicitly ensures that all
referenced globals are available in OpenACC compute contexts.
The interface provides a `getReferencedSymbols` method that collects all
global symbols referenced by an operation. When a symbol table is
provided, the implementation for FIR recursively walks type descriptor
globals to find all transitively referenced symbols.
Note that alternately this could have been implemented in different
ways:
- Codegen could implicitly generate such type globals as needed by
changing the technique that relies on populating them during lowering
(eg generate them directly in gpu.module during codegen).
- This interface could attach to types instead of operations for a
potentially more conservative implementation which maps all type
descriptors even if the underlying implementation using it won't
necessarily need such mapping.
The technique chosen here is consistent with `CUFDeviceGlobal` (which
walks operations inside `prepareImplicitDeviceGlobals`) and avoids
conservative mapping of all type descriptors.
|
|
Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
|
|
Reverts llvm/llvm-project#165873
The revert is triggered by a failing integration test on a couple of
buildbots.
|
|
This PR extends the `gpu.subgroup_mma_*` ops to support fp64 type.
The extension requires special handling during the lowering to `nvvm`
due to the return type for load ops for fragment a and b (they return a
scalar instead of a struct).
|
|
This enables changing the location of a block argument. Follows the
approach for updating type of block arg.
|
|
AffineOps.cpp (NFC)
|
|
AsmPrinter.cpp (NFC)
|
|
ShardOps.cpp (NFC)
|
|
and set_desc_layout (#168929)
`set_op_layout_attr` and `set_desc_layout` transform ops wrap
`xegpu.layout` in an `xegpu.slice` attribute if `slice_dims` argument is
set.
|
|
This commit enables the direct import of debug records by default and
fixes issues with two edge cases:
- Detect early on if the address operand is an argument list (calling
getAddress() for argument lists asserts)
- Use getAddress() to check if the address operand is null, which means
the address operand is an empty metadata node, which currently is not
supported.
- Add support for debug label records.
This is a follow-up to:
https://github.com/llvm/llvm-project/pull/167812
|
|
This PR removes op verifiers that do not implement any custom
verification logic.
|
|
Sg (#168118)
|
|
given tiled loop nest. (#167634)
The existing `scf::tileAndFuseConsumerOfSlices` takes a list of slices
(and loops they are part of), tries to find the consumer of these slices
(all slices are expected to be the same consumer), and then tiles the
consumer into the loop nest using the `TilingInterface`. A more natural
way of doing consumer fusion is to just start from the consumer, look
for operands that are produced by the loop nest passed in as `loops`
(presumably these loops are generated by tiling, but that is not a
requirement for consumer fusion). Using the consumer you can find the
slices of the operands that are accessed within the loop which you can
then use to tile and fuse the consumer (using `TilingInterface`). This
handles more naturally the case where multiple operands of the consumer
come from the loop nest.
The `scf::tileAndFuseConsumerOfSlices` was implemented as a mirror of
`scf::tileAndFuseProducerOfSlice`. For the latter, the slice has a
single producer for the source of the slice, which makes it a natural
way of specifying producer fusion. But for consumers, the result might
have multiple users, resulting in multiple candidates for fusion, as
well as a fusion candidate using multiple results from the tiled loop
nest. This means using slices
(`tensor.insert_slice`/`tensor.parallel_insert_slice`) as a hook for
consumer fusion turns out to be quite hard to navigate. The use of the
consumer directly avoids all those pain points. In time the
`scf::tileAndFuseConsumerOfSlices` should be deprecated in favor of
`scf::tileAndFuseConsumer`. There is a lot of tech-debt that has
accumulated in `scf::tileAndFuseConsumerOfSlices` that needs to be
cleanedup. So while that gets cleaned up, and required functionality is
moved to `scf::tileAndFuseConsumer`, the old path is still maintained.
The test for `scf::tileAndFuseConsumerUsingSlices` is copied to
`tile-and-fuse-consumer.mlir` to
`tile-and-fuse-consumer-using-slices.mlir`. All the tests that were
there in this file are now using the `tileAndFuseConsumer` method. The
test op `test.tile_and_fuse_consumer` is modified to call
`scf::tileAndFuseConsumer`, while a new op
`test.tile_and_fuse_consumer_of_slice` is used to keep the old path
tested while it is deprecated.
---------
Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
|
|
ValueBoundsOpInterface.cpp (NFC)
|
|
There are several places where we use `llvm::OwningArrayRef`. The
interface to this requires us to first construct temporary storage, then
allocate space and set the allocated memory to 0, then copy the values
we actually want into that memory, then move the array into place.
Instead we can just do it all inline in a single pass by using
`std::vector`. In one case we actually allocate a completely separate
container and then allocate + copy the data over because
`llvm::OwningArrayRef` does not (and can't) support `push_back`.
Note that `llvm::SmallVector` is not a suitable replacement here because
we rely on reference stability on move construction: when the outer
container reallocates, we need the the contents of the inner containers
to be fixed in memory, and `llvm::SmallVector` does not give us that
guarantee.
|
|
The dialect implementation mostly copies the one of `cf.switch`, but
aligns naming to the SPIR-V spec.
|
|
This commit addresses a crash in the dialects folder. The currently
folder assumes no broadcasting of the input operand happens and
therefore the folder can complain that the returned value was not the
same
shape as the result.
For now, this commit ensures no folding happens when broadcasting is
involved. In the future, folding with a broadcast could likely be
supported by inserting a `tosa.tile` operation before returning the
operand. This type of transformation is likely better suited for a
canonicalization pass. This commit only aims to avoid the crash.
|
|
(#167959)
This fixes a crash in SCF→GPU when building the per‑dim index for mapped
scf.parallel.
**Change**:
- Map step/lb through cloningMap, then run ensureLaunchIndependent.
- If either is still unavailable at launch scope, emit a match‑failure;
otherwise build the affine.apply.
**Why this is correct:**
- Matches how the pass already handles launch bounds; avoids creating an
op with invalid operands and replaces a segfault with a clear
diagnostic.
**Tests**:
- Added two small regressions that lower to gpu.launch and exercise the
affine.apply path.
Fixes : #167654
Signed-off-by: Shashi Shankar <shashishankar1687@gmail.com>
|
|
The dead alloc elimination pass previously considered only subviews when
checking for dead stores. This change generalizes the logic to support
all view-like operations, ensuring broader coverage.
|
|
Note, scalable reductions dims are left as a TODO.
|
|
SparseBufferRewriting.cpp (NFC)
|
|
LinalgOps.cpp (NFC)
|
|
LLVMToLLVMIRTranslation.cpp (NFC)
|
|
(#167844)
The current `DIScopeForLLVMFuncOp` pass handles debug information for
inlined code by processing `CallSiteLoc` attributes. However, some
compilation scenarios compose code from multiple source files directly
into a single function without generating `CallSiteLoc`.
**Scenario:**
```python
# a.py
def kernel_a(tensor):
print("a: {}", tensor) # a.py:3
jit_func_b(tensor) # Calls b.py code
# b.py
def func_b(tensor):
print("b: {}", tensor) # b.py:7
```
The scenario executes Python at compile-time and directly inserts
operations from `b.py` into the kernel function, resulting in MLIR like:
```mlir
@kernel_a(...) {
print("a: {}", %arg0) loc(#loc_a) // a.py:3
print("b: {}", %arg0) loc(#loc_b) // b.py:7 <- FileLineColLoc, not CallSiteLoc
} loc(#loc_kernel) // a.py:1
#loc1 = loc("a.py":3:.)
#loc2 = loc("b.py":7:.)
#loc_a = loc("print"(#loc1))
#loc_b = loc("print"(#loc2))
```
```llvm
!6 = !DIFile(filename: "a.py", directory: "...")
!9 = distinct !DISubprogram(name: "...", linkageName: "...", scope: !6, file: !6, line: 13, ...)
!10 = !DILocation(line: 7, column: ., scope: !9) // Points to kernel's DISubprogram, not correct
```
|
|
interface" (#168847)
Reverts llvm/llvm-project#168499
|
|
Interfaces can be optional: whether an op implements an interface or not
can depend on the state of the operation.
```
// An optional code block for adding additional "classof" logic. This can
// be used to better enable "optional" interfaces, where an entity only
// implements the interface if some dynamic characteristic holds.
// `$_attr`/`$_op`/`$_type` may be used to refer to an instance of the
// interface instance being checked.
code extraClassOf = "";
```
The current `Pass::canScheduleOn(RegisteredOperationName)` is
insufficient. This commit adds an additional overload to inspect
`Operation *`.
This commit fixes a crash when scheduling an `InterfacePass` for an
optional interface on an operation that does not actually implement the
interface.
|
|
delete ops (#166766)
Reland https://github.com/llvm/llvm-project/pull/165725, fix the Failed
test by removing successor operands before delete operations. Following
the deletion of cond.branch, its successor operands will subsequently be
removed.
|
|
(#168764)
This PR relaxes the validation checks to allow input/output data to have
dynamic batch dimensions.
|
|
This PR adds pattern for unrolling shape_cast given a targetShape. This
PR is a follow up of #164010 which was very general and was using
inserts and extracts on each element (which is also
LowerVectorShapeCast.cpp is doing).
After doing some more research on use cases, we (me and @Jianhui-Li )
realized that the previous version in #164010 is unnecessarily generic
and doesn't fit our performance needs.
Our use case requires that targetShape is contiguous in both source and
result vector.
This pattern only applies when contiguous slices can be extracted from
the source vector and inserted into the result vector such that each
slice remains in vector form with targetShape (and not decompose to
scalars). In these cases, the unrolling proceeds as:
vector.extract_strided_slice -> vector.shape_cast (on the slice
unrolled) -> vector.insert_strided_slice
|
|
memref. (#168610)
adaptor already lowers memref to base address.
Conversion patterns should use it instead of generating code to get base
address from memref.
|
|
|
|
Legalizing following IR to `tosa` using `tf-tosa-opt` from `tensorflow`
repo:
```
func.func @main(%arg0: tensor<?x?x?x?xf32>) -> tensor<?x?x?x5xf32> {
%0 = "tfl.pseudo_const"() <{value = dense<0.000000e+00> : tensor<5xf32>}> : () -> tensor<5xf32>
%1 = tfl.add(%arg0, %0) <{fused_activation_function = "NONE"}> : (tensor<?x?x?x?xf32>, tensor<5xf32>) -> tensor<?x?x?x5xf32>
return %1 : tensor<?x?x?x5xf32>
}
```
fails with
```
error: 'tosa.add' op operands don't have matching ranks
%1 = tfl.add(%arg0, %0) <{fused_activation_function = "NONE"}> : (tensor<?x?x?x?xf32>, tensor<5xf32>) -> tensor<?x?x?x5xf32>
^
tfl.mlir:3:10: note: see current operation: %1 = "tosa.add"(%arg0, %0) : (tensor<?x?x?x?xf32>, tensor<5xf32>) -> tensor<?x?x?x5xf32>
// -----// IR Dump After TosaLegalizeTFLPass Failed (tosa-legalize-tfl) //----- //
"func.func"() <{function_type = (tensor<?x?x?x?xf32>) -> tensor<?x?x?x5xf32>, sym_name = "main"}> ({
^bb0(%arg0: tensor<?x?x?x?xf32>):
%0 = "tosa.const"() <{values = dense<0.000000e+00> : tensor<5xf32>}> : () -> tensor<5xf32>
%1 = "tosa.add"(%arg0, %0) : (tensor<?x?x?x?xf32>, tensor<5xf32>) -> tensor<?x?x?x5xf32>
"func.return"(%1) : (tensor<?x?x?x5xf32>) -> ()
}) : () -> ()
```
This is because of the following check in `computeReshapeOutput` called
from `EqualizeRanks` function:
```
if (lowerRankDim != 1 && higherRankDim != 1 &&
lowerRankDim != higherRankDim)
return failure();
```
Based on the broadcast semantics defined in
https://mlir.llvm.org/docs/Traits/Broadcastable/#dimension-inference I
think it's legal to allow `lowerRankDim != higherRankDim` if one of them
is dynamic. At runtime verifier should enforce that
1. if lowerRankDim is dynamic and higherRankDim is static then the
dynamic dim matches the static dim and vice-versa
2. if both are dynamic, they should match
It's not necessary to error out during the op construction time.
|
|
- Add missing locations, namely 'ErrnoMem', 'TargetMem0', and
'TargetMem1'.
|
|
LLVM_DISTRIBUTION_COMPONENTS (#168407)
Fixes https://github.com/llvm/llvm-project/issues/168393. Also adds
top-level `MLIR_PYTHON_STUBGEN_ENABLED` CMake option.
|
|
memref (#166959)
Vectorization of a 1-d reduction where the output variable is a 1-ranked
memref can generate an invalid `vector.transfer_write` with no indices
for the memref, e.g.:
vector.transfer_write"(%vec, %buff) <{...}> : (vector<f32>,
memref<1xf32>) -> ()
This patch solves the problem by providing the expected amount of
indices (i.e. matching the rank of the memref).
|
|
|
|
code (#168536)
ODS generate code can be included and used outside of the `mlir`
namespace and so references to symbols in the mlir namespace
must be fully qualified.
|
|
|
|
This commit fixes linker failures evident on some failing build bots.
|
|
Reland commit fb829bf11feeb53f815a3abf539e63ec3a23ed3d with additional fixes relating to post-merge CI failure
```
/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp: In function ‘constexpr llvm::nvvm::CTAGroupKind getNVVMCtaGroupKind(mlir::NVVM::CTAGroupKind)’:
/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/llvm/include/llvm/Support/ErrorHandling.h:165:36: error: call to non-constexpr function ‘void llvm::llvm_unreachable_internal(const char*, const char*, unsigned int)’
::llvm::llvm_unreachable_internal(msg, __FILE__, __LINE__)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp:73:3: note: in expansion of macro ‘llvm_unreachable’
llvm_unreachable("unsupported cta_group value");
^
```
|
|
|
|
|
|
This PR relax the create_mem_desc's restriction on source memref,
allowing it to be a 2d memref.
|
|
Add the cl::values to the pass options so an assert is not reached when
trying to generate a reproducer e.g. "unknown data value for option"
|
|
Introduce two new OpenACC operation interfaces for identifying global
variables and their address computations:
- `GlobalVariableOpInterface`: Identifies operations that define global
variables. Provides an `isConstant()` method to query whether the global
is constant.
- `AddressOfGlobalOpInterface`: Identifies operations that compute the
address of a global variable. Provides a `getSymbol()` method to
retrieve the symbol reference.
This is being done in preparation for `ACCImplicitDeclare` pass which
will automatically ensure that `acc declare` is applied to globals when
needed.
The following operations now implement these interfaces:
- `memref::GlobalOp` implements `GlobalVariableOpInterface`
- `memref::GetGlobalOp` implements `AddressOfGlobalOpInterface`
- `fir::GlobalOp` implements `GlobalVariableOpInterface`
- `fir::AddrOfOp` implements `AddressOfGlobalOpInterface`
|