| Age | Commit message (Collapse) | Author |
|
|
|
|
|
FIR operations that use derived types need to have type descriptor
globals available on device when offloading. Examples of this can be
seen in `CUFDeviceGlobal` which ensures that such type descriptor uses
work on device for CUF.
Similarly, this is needed for OpenACC. This change introduces a new
interface to the OpenACC dialect named
`IndirectGlobalAccessOpInterface` which can be attached to operations
that may result in generation of accesses that use type descriptor
globals. This functionality is needed for the `ACCImplicitDeclare` pass
that is coming in a follow-up change which implicitly ensures that all
referenced globals are available in OpenACC compute contexts.
The interface provides a `getReferencedSymbols` method that collects all
global symbols referenced by an operation. When a symbol table is
provided, the implementation for FIR recursively walks type descriptor
globals to find all transitively referenced symbols.
Note that alternately this could have been implemented in different
ways:
- Codegen could implicitly generate such type globals as needed by
changing the technique that relies on populating them during lowering
(eg generate them directly in gpu.module during codegen).
- This interface could attach to types instead of operations for a
potentially more conservative implementation which maps all type
descriptors even if the underlying implementation using it won't
necessarily need such mapping.
The technique chosen here is consistent with `CUFDeviceGlobal` (which
walks operations inside `prepareImplicitDeviceGlobals`) and avoids
conservative mapping of all type descriptors.
|
|
Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
|
|
Reverts llvm/llvm-project#165873
The revert is triggered by a failing integration test on a couple of
buildbots.
|
|
This PR extends the `gpu.subgroup_mma_*` ops to support fp64 type.
The extension requires special handling during the lowering to `nvvm`
due to the return type for load ops for fragment a and b (they return a
scalar instead of a struct).
|
|
This enables changing the location of a block argument. Follows the
approach for updating type of block arg.
|
|
and set_desc_layout (#168929)
`set_op_layout_attr` and `set_desc_layout` transform ops wrap
`xegpu.layout` in an `xegpu.slice` attribute if `slice_dims` argument is
set.
|
|
This commit enables the direct import of debug records by default and
fixes issues with two edge cases:
- Detect early on if the address operand is an argument list (calling
getAddress() for argument lists asserts)
- Use getAddress() to check if the address operand is null, which means
the address operand is an empty metadata node, which currently is not
supported.
- Add support for debug label records.
This is a follow-up to:
https://github.com/llvm/llvm-project/pull/167812
|
|
This PR removes op verifiers that do not implement any custom
verification logic.
|
|
given tiled loop nest. (#167634)
The existing `scf::tileAndFuseConsumerOfSlices` takes a list of slices
(and loops they are part of), tries to find the consumer of these slices
(all slices are expected to be the same consumer), and then tiles the
consumer into the loop nest using the `TilingInterface`. A more natural
way of doing consumer fusion is to just start from the consumer, look
for operands that are produced by the loop nest passed in as `loops`
(presumably these loops are generated by tiling, but that is not a
requirement for consumer fusion). Using the consumer you can find the
slices of the operands that are accessed within the loop which you can
then use to tile and fuse the consumer (using `TilingInterface`). This
handles more naturally the case where multiple operands of the consumer
come from the loop nest.
The `scf::tileAndFuseConsumerOfSlices` was implemented as a mirror of
`scf::tileAndFuseProducerOfSlice`. For the latter, the slice has a
single producer for the source of the slice, which makes it a natural
way of specifying producer fusion. But for consumers, the result might
have multiple users, resulting in multiple candidates for fusion, as
well as a fusion candidate using multiple results from the tiled loop
nest. This means using slices
(`tensor.insert_slice`/`tensor.parallel_insert_slice`) as a hook for
consumer fusion turns out to be quite hard to navigate. The use of the
consumer directly avoids all those pain points. In time the
`scf::tileAndFuseConsumerOfSlices` should be deprecated in favor of
`scf::tileAndFuseConsumer`. There is a lot of tech-debt that has
accumulated in `scf::tileAndFuseConsumerOfSlices` that needs to be
cleanedup. So while that gets cleaned up, and required functionality is
moved to `scf::tileAndFuseConsumer`, the old path is still maintained.
The test for `scf::tileAndFuseConsumerUsingSlices` is copied to
`tile-and-fuse-consumer.mlir` to
`tile-and-fuse-consumer-using-slices.mlir`. All the tests that were
there in this file are now using the `tileAndFuseConsumer` method. The
test op `test.tile_and_fuse_consumer` is modified to call
`scf::tileAndFuseConsumer`, while a new op
`test.tile_and_fuse_consumer_of_slice` is used to keep the old path
tested while it is deprecated.
---------
Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
|
|
There are several places where we use `llvm::OwningArrayRef`. The
interface to this requires us to first construct temporary storage, then
allocate space and set the allocated memory to 0, then copy the values
we actually want into that memory, then move the array into place.
Instead we can just do it all inline in a single pass by using
`std::vector`. In one case we actually allocate a completely separate
container and then allocate + copy the data over because
`llvm::OwningArrayRef` does not (and can't) support `push_back`.
Note that `llvm::SmallVector` is not a suitable replacement here because
we rely on reference stability on move construction: when the outer
container reallocates, we need the the contents of the inner containers
to be fixed in memory, and `llvm::SmallVector` does not give us that
guarantee.
|
|
The dialect implementation mostly copies the one of `cf.switch`, but
aligns naming to the SPIR-V spec.
|
|
interface" (#168847)
Reverts llvm/llvm-project#168499
|
|
Interfaces can be optional: whether an op implements an interface or not
can depend on the state of the operation.
```
// An optional code block for adding additional "classof" logic. This can
// be used to better enable "optional" interfaces, where an entity only
// implements the interface if some dynamic characteristic holds.
// `$_attr`/`$_op`/`$_type` may be used to refer to an instance of the
// interface instance being checked.
code extraClassOf = "";
```
The current `Pass::canScheduleOn(RegisteredOperationName)` is
insufficient. This commit adds an additional overload to inspect
`Operation *`.
This commit fixes a crash when scheduling an `InterfacePass` for an
optional interface on an operation that does not actually implement the
interface.
|
|
This PR adds pattern for unrolling shape_cast given a targetShape. This
PR is a follow up of #164010 which was very general and was using
inserts and extracts on each element (which is also
LowerVectorShapeCast.cpp is doing).
After doing some more research on use cases, we (me and @Jianhui-Li )
realized that the previous version in #164010 is unnecessarily generic
and doesn't fit our performance needs.
Our use case requires that targetShape is contiguous in both source and
result vector.
This pattern only applies when contiguous slices can be extracted from
the source vector and inserted into the result vector such that each
slice remains in vector form with targetShape (and not decompose to
scalars). In these cases, the unrolling proceeds as:
vector.extract_strided_slice -> vector.shape_cast (on the slice
unrolled) -> vector.insert_strided_slice
|
|
|
|
- Add missing locations, namely 'ErrnoMem', 'TargetMem0', and
'TargetMem1'.
|
|
code (#168536)
ODS generate code can be included and used outside of the `mlir`
namespace and so references to symbols in the mlir namespace
must be fully qualified.
|
|
Reland commit fb829bf11feeb53f815a3abf539e63ec3a23ed3d with additional fixes relating to post-merge CI failure
```
/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp: In function ‘constexpr llvm::nvvm::CTAGroupKind getNVVMCtaGroupKind(mlir::NVVM::CTAGroupKind)’:
/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/llvm/include/llvm/Support/ErrorHandling.h:165:36: error: call to non-constexpr function ‘void llvm::llvm_unreachable_internal(const char*, const char*, unsigned int)’
::llvm::llvm_unreachable_internal(msg, __FILE__, __LINE__)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp:73:3: note: in expansion of macro ‘llvm_unreachable’
llvm_unreachable("unsupported cta_group value");
^
```
|
|
|
|
This PR relax the create_mem_desc's restriction on source memref,
allowing it to be a 2d memref.
|
|
Add the cl::values to the pass options so an assert is not reached when
trying to generate a reproducer e.g. "unknown data value for option"
|
|
Introduce two new OpenACC operation interfaces for identifying global
variables and their address computations:
- `GlobalVariableOpInterface`: Identifies operations that define global
variables. Provides an `isConstant()` method to query whether the global
is constant.
- `AddressOfGlobalOpInterface`: Identifies operations that compute the
address of a global variable. Provides a `getSymbol()` method to
retrieve the symbol reference.
This is being done in preparation for `ACCImplicitDeclare` pass which
will automatically ensure that `acc declare` is applied to globals when
needed.
The following operations now implement these interfaces:
- `memref::GlobalOp` implements `GlobalVariableOpInterface`
- `memref::GetGlobalOp` implements `AddressOfGlobalOpInterface`
- `fir::GlobalOp` implements `GlobalVariableOpInterface`
- `fir::AddrOfOp` implements `AddressOfGlobalOpInterface`
|
|
Reverts llvm/llvm-project#164356
The bots are broken.
|
|
This pass aims to narrow i64 types on TOSA operations to i32. It can be
useful for legalizations from various frameworks. It comes with the
following options:
- "aggressive-rewrite" - This option is typically able to narrow more
values, but may impact numerical behaviour if not used carefully.
- "convert-function-boundaries" - If enabled, parameters/ results
to/from a function may be narrowed. Otherwise, casts are inserted to
preserve the I/O of the function.
Currently the non aggressive mode is very limited, targeting an argmax
-> cast sequence that has been observed during legalization as well as
some data layout operations that can always narrow. Support for more
operations will be added in the future.
Co-authored-by: Vitalii Shutov <vitalii.shutov@arm.com>
Co-authored-by: Shubham <shubham@arm.com>
Co-authored-by: Declan Flavin <declan.flavin@arm.com>
Signed-off-by: Luke Hutton <luke.hutton@arm.com>
Co-authored-by: Vitalii Shutov <vitalii.shutov@arm.com>
Co-authored-by: Shubham <shubham@arm.com>
Co-authored-by: Declan Flavin <declan.flavin@arm.com>
|
|
|
|
This commit adds support for tgen05.mma family of instructions in the NVVM MLIR dialect and lowers to LLVM Intrinsics. Please refer [PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions) for information
|
|
Add a pass option to `convert-scf-to-cf` to deactivate pattern rollback
for better performance. The lowering patterns from SCF->CF to benefit a
lot from this feature because `splitBlock` is expensive in the rollback
driver.
|
|
Add pass options to run lowerings to NVVM without pattern rollback. This
makes the dialect conversions easier to debug and improves
performance/memory usage.
|
|
This change adds the ACCImplicitRoutine pass which implements the
OpenACC specification for implicit routine directives (OpenACC 3.4 spec,
section 2.15.1).
According to the specification: "If no explicit routine directive
applies to a procedure whose definition appears in the program unit
being compiled, then the implementation applies an implicit routine
directive to that procedure if any of the following conditions holds:
The procedure is called or its address is accessed in a compute region."
The pass automatically generates `acc.routine` operations for functions
called within OpenACC compute constructs or within existing routine
functions that do not already have explicit routine directives. It
recursively applies implicit routine directives while avoiding infinite
recursion when dependencies form cycles.
Key features:
- Walks through all OpenACC compute constructs (parallel, kernels,
serial) to identify function calls
- Creates implicit `acc.routine` operations for functions without
explicit routine declarations
- Recursively processes existing `acc.routine` operations to handle
transitive dependencies
- Avoids infinite recursion through proper tracking of processed
routines
- Respects device-type specific bind clauses to skip routines bound to
different device types
Requirements:
- Function operations must implement `mlir::FunctionOpInterface` to be
identified and associated with routine directives.
- Call operations must implement `mlir::CallOpInterface` to detect
function calls and traverse the call graph.
- Optionally pre-register `acc::OpenACCSupport` if custom behavior is
needed for determining if a symbol use is valid within GPU regions (such
as functions which are already considerations for offloading even
without `acc routine` markings)
Co-authored-by: delaram-talaashrafi<dtalaashrafi@nvidia.com>
|
|
gather/scatter ops (#167850)
The PR changes the layout attribute type for
`xegpu::LoadGatherOp/StoreScatterOp` from `LayoutAttr` to
`DistributeLayoutAttr` to also support `xegpu.slice` layouts.
Initially we [wanted to restrict slice
layouts](https://github.com/llvm/llvm-project/pull/163414#discussion_r2478978798)
from the attribute, but now it turns out there are actually valid use
cases for that:
```mlir
gpu.func @distribute_load_slice_attr() {
%2 = memref.alloca() {alignment = 1024} : memref<4096xf32>
%offset = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [8], sg_data = [32], inst_data = [16]> } dense<0> : vector<256xindex>
%mask = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [8], sg_data = [32], inst_data = [16]> } dense<1> : vector<256xi1>
%3 = xegpu.load %2[%offset], %mask <{chunk_size = 1, layout = #xegpu.slice<#xegpu.layout<sg_layout = [8, 8], sg_data = [32, 32], inst_data = [8, 16]>, dims = [0]>>} {
layout_result_0 = #xegpu.slice<#xegpu.layout<sg_layout = [8, 8], sg_data = [32, 32], inst_data = [8, 16]>, dims = [0]>
} : memref<4096xf32>, vector<256xindex>, vector<256xi1> -> vector<256xf32>
%4 = vector.broadcast %3 {layout_result_0 =
#xegpu.layout<sg_layout = [8, 8], sg_data = [32, 32], inst_data = [8, 16]>} : vector<256xf32> to vector<256x256xf32>
gpu.return
}
```
Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
|
|
as Pure (#166648)
This MR modifies side effect traits of some integer arithmetic
operations in the LLVM dialect.
Prior to this MR, the LLVM dialect `sdiv` and `udiv` operations were
marked as `Pure` through `tblgen` inheritance of the
`LLVM_ArithmeticOpBase` class. The `Pure` trait allowed incorrect
hoisting of `sdiv`/`udiv` operations by the
`loop-independent-code-motion` pass.
This MR modifies the `sdiv` and `udiv` LLVM operations to have traits
and code motion behavior identical to their counterparts in the `arith`
dialect, which were established by the commit/review below.
https://github.com/llvm/llvm-project/commit/ed39825be48805b174d3177f1d8d41ed84784d18
https://reviews.llvm.org/D137814
|
|
This MR adds support for the `exact` flag to the
`arith.shrui/shrsi/divsi/divui` operations. The semantics are identical
to those of the LLVM dialect and the LLVM language reference.
This MR also modifies the mechanism for converting `arith` dialect
**attributes** to corresponding **properties** in the `LLVM` dialect.
(As a specific example, the integer overflow flags `nsw/nuw` are
**properties** in the `LLVM` dialect, as opposed to attributes.)
Previously, attribute converter classes were required to have a specific
method to support integer overflow flags:
```C++
template <typename SourceOp, typename TargetOp>
class AttrConvertPassThrough {
public:
...
LLVM::IntegerOverflowFlags getOverflowFlags() const {
return LLVM::IntegerOverflowFlags::none;
}
};
```
This method was required, even for `arith` source operations that did
not use integer overflow flags (e.g. `AttrConvertFastMathToLLVM`).
This MR modifies the interface required by `arith` dialect attribute
converters to instead provide a (possibly NULL) properties attribute:
```C++
template <typename SourceOp, typename TargetOp>
class AttrConvertPassThrough {
public:
...
Attribute getPropAttr() const { return {}; }
};
```
For `arith` operations with attributes that map to `LLVM` dialect
**properties**, the attribute converter can create a `DictionaryAttr`
containing target properties and return that attribute from the
attribute converter's `getPropAttr()` method. The `arith` attribute
conversion framework will set the `propertiesAttr` of an
`OperationState`, and the target operation's `setPropertiesFromAttr()`
method will be invoked to set the properties when the target operation
is created. The `AttrConvertOverflowToLLVM` class in this MR uses the
new approach.
|
|
This PR builds upon the infrastructure set up for Sparse Tensor Loop
Ordering Heuristics (#154656) by adding a preference to have dense loops
outer and sparse loops inner.
As always I'd love to get feedback and know if there's any other
direction to go with this work that might be better.
|
|
|
|
The mbarrier Ops also require access to the `mem_scope` and
`shared_space` attributes. Hence, this patch moves their definitions
to the beginning of the file alongside the other attribute definitions.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
|
|
This patch adds verification to the `SymbolOpInterface` to enforce the
design constraint that symbol operations must not produce SSA results,
as documented in [Symbols and
SymbolTables](https://mlir.llvm.org/docs/SymbolsAndSymbolTables/#defining-or-declaring-a-symbol).
This is a follow-up of #168376
|
|
|
|
This patch adds support for shared::cta as destination space in
the TMA non-tensor copy Op (from global to shared::cta).
* Appropriate verifier checks are added.
* Unit tests are added to verify the lowering.
The related intrinsic changes were merged through PR #167508.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
|
|
(#168066)
This patch adds `ub` as a dependent dialect to `memref`, and uses
`ub.poison` as the default value in `AllocaOp::getDefaultValue` for the
mem2reg pass.
This aligns the behavior of `mem2reg` with LLVM, where loading a value
before having a value should be poison.
---------
Signed-off-by: Fabian Mora <fabian.mora-cordero@amd.com>
|
|
This PR adds a new utility function to check whether symbols used in
OpenACC regions are legal for offloading. Functions must be marked with
`acc routine` or be built-in intrinsics. Global symbols must be marked
with `acc declare`.
The utility is designed to be extensible, and the OpenACCSupport
analysis has been updated to allow handling of additional symbols that
do not necessarily use OpenACC attributes but are marked in a way that
still guarantees the symbol will be available when offloading. For
example, in the Flang implementation, CUF attributes can be validated as
legal symbols.
|
|
This PR makes the following improvements to `vector.scatter` and its
lowering pipeline:
- In addition to `memref`, accept a ranked `tensor` as the base operand
of `vector.scatter`, similar to `vector.transfer_write`.
- Implement bufferization support for `vector.scatter`, so that
tensor-based scatter ops can be fully lowered to memref-based forms.
It's worth to complete the functionality of map_scatter decomposition.
Full discussion can be found here:
https://github.com/iree-org/iree/issues/21135
---------
Signed-off-by: Ryutaro Okada <1015ryu88@gmail.com>
|
|
Effectively means we don't need to call into
`llvmModule->convertFromNewDbgValues()` anymore. Added a flag to allow
users to access the old behavior.
|
|
This commit ensures that gather and scatter operations with int64 index
tensors can be created. This aligns with the EXT_INT64 extension.
|
|
-- This commit includes the basic infra/utilities to add matchers for
linalg.*conv*/*pool* ops - such that given a `linalg.generic` op it
identifies which linalg.*conv*/*pool* op it is.
-- It adds a few representative linalg.*conv*/*pool* ops to demo the
matchers' capability and does so as part of
`linalg-specialize-generic-ops`
pass.
-- The goal is directed towards addressing the aim of
[[RFC] Op explosion in
Linalg](https://discourse.llvm.org/t/rfc-op-explosion-in-linalg/82863)
iteratively for `*conv*/*pooling*` ops.
-- This is part-1 of a series of PRs aimed to add matchers for
Convolution ops.
-- For further details, refer to
https://github.com/llvm/llvm-project/pull/163374#pullrequestreview-3341048722
Signed-off-by: Abhishek Varma <abhvarma@amd.com>
|
|
possible (#167041)
The current implementation of the WMMA intrinsic ops as they are defined
in the ROCDL tablegen is incorrect. They represent as operands what
should be attributes such as `clamp`, `opsel`, `signA/signB`. This
change performs a refactoring to bring it in line with what we expect.
---------
Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
|
|
The IR was not able to be roundtrip through mlir-opt. Update the
assembly format and add round trip tests.
```
mlir-opt mlir/test/Target/LLVMIR/nvvm/barrier.mlir | mlir-opt
<stdin>:6:5: error: cannot name an operation with no results
%0 = nvvm.barrier <and> %arg2 -> i32
```
|
|
Added missing cluster.ids op - i.e., along x, y, and z dims. Extended
all rocdl tests
|
|
|