path: root/mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp
Age Commit message Author
2025-11-17[mlir][amdgpu] Add lowerings for ScaledExtPacked816 (#168123)Erick Ochoa Lopez
* Adds lowerings for amdgpu.scaled_ext_packed816 * Updates verifiers
2025-11-17[mlir][amdgpu] Fix documentation and verifiers (#167369)Erick Ochoa Lopez
2025-10-28[mlir][amdgpu][rocdl] Add gfx1250 wmma ops (#165064)Jakub Kuderski
Update `amdgpu.wmma` op definition and implement amdgpu to rocdl conversion for new variants.
2025-10-25[mlir][amdgpu] Update mfma assembly format with intrinsic shape (#165037)Jakub Kuderski
Use the same format as introduced for wmma by https://github.com/llvm/llvm-project/pull/164920. Also make `blocks` default to 1.
2025-10-24[mlir][amdgpu] Add explicit intrinsic shape to wmma (#164920)Jakub Kuderski
This is in preparation for adding support for gfx1250 wmma intrinsics, which include many more possible shapes. Instead of guessing the wave32/wave64 mode based on element types and vector sizes, require the intrinsic shapes to be set explicitly as attributes.
2025-10-17[mlir][amdgpu] Add scaled_ext_packed{8,16} operations (#159830)Erick Ochoa Lopez
2025-10-16[mlir][AMGPU] Replace use of SmallVector with ArrayRef, NFC (#163770)Muzammil
Improving choice of class used, from SmallVector to ArrayRef (https://llvm.org/docs/ProgrammersManual.html#llvm-adt-arrayref-h). Also infer template types when possible. Leftover from https://github.com/llvm/llvm-project/pull/155951. --------- Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
2025-10-10[mlir][amdgpu] Add Inliner interface (#162873)Ivan Butygin
All the `amdgpu` dialect ops can be inlined. --------- Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
2025-09-18[mlir][AMDGPU] Add canonicalization pattern to pack scales for ScaledMFMAOp (#155951)Muzammil
The ScaledMFMAOp accepts scales as a vector of 4 bytes (`vector<4xf8E8M0FNU>`) that can be stored in a single register with a particular scale accessed using the `OpSel` attribute. Currently, we only use one byte in this 4-byte vector, resulting in 3 wasted registers. This is fixed by identifying when single byte extractions are performed and rewriting them into extractions of 4-byte vectors. Example:
```
%unit = vector.extract %ScaleSrc[offsets] : f8E8M0FNU from vector<?x?x?xf8E8M0FNU>
%scale = vector.insert %unit, ... : f8E8M0FNU into vector<4xf8E8M0FNU>
amdgpu.scaled_mfma(%scale[0] * ...
```
to
```
%reshaped = vector.shape_cast %ScaleSrc : vector<?x?x?xf8E8M0FNU> to vector<?x4xf8E8M0FNU>
%scale = vector.extract %reshaped[?] : vector<4xf8E8M0FNU> from vector<?x4xf8E8M0FNU>
amdgpu.scaled_mfma(%scale[0-3] * ...
```
--------- Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
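The idea of keeping four scales in one register and picking one by index can be sketched in plain C++. This is only an illustration of the packing arithmetic: the function names `packScales` and `selectScale` are hypothetical, and the byte order (byte 0 in the least significant bits) is an assumption of the sketch, not something stated in the commit.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Pack four 8-bit scale values into one 32-bit register image.
// Assumption: byte 0 occupies the least significant bits.
inline uint32_t packScales(const std::array<uint8_t, 4> &scales) {
  uint32_t packed = 0;
  for (unsigned i = 0; i < 4; ++i)
    packed |= static_cast<uint32_t>(scales[i]) << (8 * i);
  return packed;
}

// Pick one scale byte out of the packed register, in the spirit of an
// OpSel-style byte index.
inline uint8_t selectScale(uint32_t packed, unsigned opSel) {
  assert(opSel < 4 && "only four bytes fit in a 32-bit register");
  return static_cast<uint8_t>((packed >> (8 * opSel)) & 0xFFu);
}
```

With this layout, all four scales travel in a single 32-bit value and `selectScale(packed, i)` recovers the i-th one, which is the saving the canonicalization is after.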
2025-09-18[MLIR] Apply clang-tidy fixes for readability-identifier-naming in AMDGPUDialect.cpp (NFC)Mehdi Amini
2025-08-21[mlir][AMDGPU] Add PermlaneSwapOp (#154345)Tim Gymnich
- Add PermlaneSwapOp that lowers to `rocdl.permlane16.swap` and `rocdl.permlane32.swap` --------- Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
2025-08-07[mlir][AMDGPU] Allow non-contiguous destination memrefs for gather_to_lds (#152559)Quinn Dawkins
The requirement that the LDS operand is contiguous is overly restrictive because it's perfectly valid to have a subview depend on subgroup IDs that is still subgroup contiguous. We could continue trying to do this verification based on the number of copied elements, but instead this change just opts to clarify the semantics on the op definition.
2025-07-24[mlir][AMDGPU] Add canonicalizer for folding casts into gather_to_lds (#150503)Quinn Dawkins
2025-07-21[mlir][AMDGPU] Infer canonical layouts for fat_raw_buffer_cast resetOffset (#149867)Krzysztof Drewniak
When inferring the return type of amdgpu.fat_raw_buffer_cast with the offset reset, we would sometimes use a strided layout, like strided<[1]>, in cases where, after stripping the offset, the memref had the identity layout. This would cause issues with EmulateNarrowTypes, which does perform this layout canonicalization. Now, the return type inference will put in an identity layout after offset stripping for 1. Statically-shaped memrefs of any rank where the strides match the suffix product of the shape, and 2. Memrefs of rank <= 1 whose strides are [1] (or []) that just had their offset removed by resetOffset.
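The "strides match the suffix product of the shape" condition can be illustrated with a small standalone check. This is a sketch of the arithmetic only, not the dialect's code; `suffixProductStrides` and `hasIdentityStrides` are hypothetical names, and the shape is assumed to be fully static.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Row-major ("suffix product") strides for a static shape:
// strides[i] is the product of shape[i + 1], ..., shape[rank - 1].
inline std::vector<int64_t>
suffixProductStrides(const std::vector<int64_t> &shape) {
  std::vector<int64_t> strides(shape.size(), 1);
  for (int64_t i = static_cast<int64_t>(shape.size()) - 2; i >= 0; --i)
    strides[i] = strides[i + 1] * shape[i + 1];
  return strides;
}

// True when the given strides describe the identity (contiguous row-major)
// layout for the shape, assuming the offset has already been stripped.
inline bool hasIdentityStrides(const std::vector<int64_t> &shape,
                               const std::vector<int64_t> &strides) {
  assert(shape.size() == strides.size() && "one stride per dimension");
  return strides == suffixProductStrides(shape);
}
```

For example, a 2x3x4 memref has suffix-product strides [12, 4, 1], so a zero-offset layout with exactly those strides can be written as the identity layout; strides of [24, 4, 1] cannot.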
2025-07-18[mlir][amdgpu] Properly handle mismatching memref ranks in `amdgpu.gather_to_lds` (#149407)Ivan Butygin
This op doesn't have any rank or indices restrictions on src/dst memrefs, but was using `SameVariadicOperandSize`, which was causing issues. Also fix some other issues while we are at it.
2025-07-09[AMDGPU] [MLIR] Add 96 and 128 bit GatherToLDS for gfx950 (#147496)Daniel Hernandez-Juarez
This PR adds 96-bit and 128-bit gather_to_lds support for gfx950, updating the lowering, verifier, and tests.
2025-06-25[AMDGPU] Adding AMDGPU dialect wrapper for ROCDL transpose loads. (#145395)Alan Li
* 1-to-1 mapping wrapper op. * Direct lowering from AMDGPU wrapper to ROCDL intrinsics.
2025-06-13[mlir][AMDGPU] Add scaled floating point conversion ops (#141554)Tim Gymnich
Implement `ScaledExtPackedOp` and `PackedScaledTruncOp`.
2025-05-19[AMDGPU] Add a new amdgcn.load.to.lds intrinsic (#137425)Krzysztof Drewniak
This PR adds an amdgcn_load_to_lds intrinsic that abstracts over loads to LDS from global (address space 1) pointers and buffer fat pointers (address space 7), since they use the same API and "gather from a pointer to LDS" is something of an abstract operation. This commit adds the intrinsic and its lowerings for addrspaces 1 and 7, and updates the MLIR wrappers to use it (loosening up the restrictions on loads to LDS along the way to match the ground truth from target features). It also plumbs the intrinsic through to clang.
2025-04-08[mlir][bazel] Fix after dae0ef53a0b99c6c2b74143baee5896e8bc5c8e7Christian Sigg
Remove unnecessary include.
2025-04-08[MLIR][AMDGPU] Add a wrapper for global LDS load intrinsics in AMDGPU (#133498)Alan Li
Defines a new `amdgpu.global_load` op, a thin wrapper around the ROCDL `global_load_lds` intrinsic, along with its lowering logic to `rocdl.global.load.lds`.
2025-04-01[mlir][AMDGPU] Add gfx950 MFMAs to the amdgpu.mfma op (#133553)Krzysztof Drewniak
This commit extends the lowering of amdgpu.mfma to handle the new double-rate MFMAs in gfx950 and adds tests for these operations. It also adds support for MFMAs on small floats (f6 and f4), which are implemented using the "scaled" MFMA intrinsic with a scale value of 0 in order to have an unscaled MFMA. This commit does not add an `amdgpu.scaled_mfma` operation, as that is future work. --------- Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
2025-03-03[MLIR][AMDGPU] Add OCP FP8 support for new hardware (#127728)Mirza Halilčević
(Continuing from #106160) This PR addresses remaining review comments from the original PR. Original PR Description --- Upcoming hardware (gfx12 and some future gfx9) will support the OCP 8-bit float formats for their matrix multiplication intrinsics and conversion operations, retaining existing opcodes and compiler builtins. This commit adds support for these types to the MLIR wrappers around such operations, ensuring that the OCP types aren't used to generate those builtins on hardware that doesn't expect that format and, conversely, to ensure that the pre-OCP formats aren't used on new hardware. --------- Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com> Co-authored-by: Paul Fuqua <pf@acm.org> Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>
2025-02-27[mlir][AMDGPU] Add int4 intrinsics, mixed-type fp8 to handle gfx12 (#128963)Krzysztof Drewniak
1. Extend the gfx12 FP8 support to allow mixed-type intrinsics (since they've been added), creating limited mixed-type support that mirrors MFMA
2. Extend the `amdgpu.wmma` intrinsic lowering to correctly handle shorter vectors because gfx12 now has instructions that logically take a 4xi8, or, as far as LLVM's concerned, an i32. Similarly, there are 4xi4 inputs, which are an i16 (that must be zero-extended to i32).
3. Correctly handle the ambiguities in the int4 intrinsics on gfx12, which can either be 16x16x16 or 16x16x32
4. Add tests showing all WMMAs being lowered the way gfx12 expects (mirroring LLVM's tests)
5. Add a verifier to prevent emitting illegal instructions on gfx12.
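The reinterpretation of short integer vectors described in item 2 can be sketched as bit packing in plain C++. This is a hedged illustration, not the lowering itself: `packI8x4` and `packI4x4` are hypothetical helpers, and placing element 0 in the least significant bits is an assumption of the sketch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// View a logical 4xi8 vector as a single 32-bit integer.
// Assumption: element 0 lands in the least significant byte.
inline uint32_t packI8x4(const std::array<int8_t, 4> &v) {
  uint32_t out = 0;
  for (unsigned i = 0; i < 4; ++i)
    out |= static_cast<uint32_t>(static_cast<uint8_t>(v[i])) << (8 * i);
  return out;
}

// View a logical 4xi4 vector as a 16-bit integer, then zero-extend to 32 bits.
// Each input holds one 4-bit value in its low nibble.
inline uint32_t packI4x4(const std::array<uint8_t, 4> &nibbles) {
  uint16_t packed = 0;
  for (unsigned i = 0; i < 4; ++i) {
    assert(nibbles[i] < 16 && "each element must fit in 4 bits");
    packed = static_cast<uint16_t>(packed | (nibbles[i] << (4 * i)));
  }
  // Widening an unsigned 16-bit value zero-extends: the upper 16 bits are 0.
  return static_cast<uint32_t>(packed);
}
```

The zero-extension matters because a sign-extending widening of the 16-bit value would set the upper half of the register whenever the top nibble has its high bit set.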
2025-02-26[mlir][AMDGPU] Plumb address space 7 through MLIR, add address_space attr. (#125594)Krzysztof Drewniak
This commit adds support for casting memrefs into fat raw buffer pointers to the AMDGPU dialect. Fat raw buffer pointers (or, in LLVM terms, ptr addrspace(7)) allow encapsulating a buffer descriptor (as produced by the make.buffer.rsrc intrinsic or provided from some API) into a pointer that supports ordinary pointer operations like load or store. This allows people to take advantage of the additional semantics that buffer_load and similar instructions provide without forcing the use of entirely separate amdgpu.raw_buffer_* operations. Operations on fat raw buffer pointers are translated to the corresponding LLVM intrinsics by the backend. This commit also defines a #amdgpu.address_space<> attribute so that AMDGPU-specific memory spaces can be represented. Only #amdgpu.address_space<fat_raw_buffer> will work correctly with the memref dialect, but the other possible address spaces are included for completeness. --------- Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com> Co-authored-by: Prashant Kumar <pk5561@gmail.com>
2025-01-21[mlir][IR][NFC] Move free-standing functions to `MemRefType` (#123465)Matthias Springer
Turn free-standing `MemRefType`-related helper functions in `BuiltinTypes.h` into member functions.
2025-01-20[mlir][IR] Remove `isF...()` type API for low-precision FP types (#123326)Matthias Springer
Remove `type.isFloat4E2M1FN()` etc. Use `isa<Float4E2M1FNType>(type)` instead. For details, see: https://discourse.llvm.org/t/rethink-on-approach-to-low-precision-fp-types/82361/28
2024-10-18eliminating g++ warnings (#105520)Frank Schlimbach
Eliminating g++ warnings. Mostly declaring "[[maybe_unused]]", adding return statements where missing and fixing casts. @rengolin --------- Co-authored-by: Benjamin Maxwell <macdue@dueutil.tech> Co-authored-by: Renato Golin <rengolin@systemcall.eu>
2024-09-03[MLIR][AMDGPU] Add support for fp8 ops on gfx12 (#106388)Giuseppe Rossini
This PR is adding support for `fp8` and `bfp8` on gfx12
2024-08-26[MLIR][AMDGPU] Introduce fp16 packed arithmetic (#105688)Giuseppe Rossini
This PR is introducing rocdl.cvt.pkrtz in the ROCDL dialect and it is using that instruction when lowering `arith::TruncFOp`.
2024-08-16[mlir][AMDGPU] Implement AMDGPU DPP operation in MLIR. (#89233)stefankoncarevic
Defined AMDGPU DPP operation in mlir to represent semantics. Introduced a new enumeration attribute for different permutations and allowed for different types of arguments. Implemented constant attribute handling for ROCDL::DPPMovOp operation. The operation now correctly accepts constant attributes for dppCtrl, rowMask, bankMask, boundCtrl, and passes them to the corresponding LLVM intrinsic.
2024-04-19Switch member calls to `isa/dyn_cast/cast/...` to free function calls. (#89356)Christian Sigg
This change cleans up call sites. Next step is to mark the member functions deprecated. See https://mlir.llvm.org/deprecation and https://discourse.llvm.org/t/preferred-casting-style-going-forward.
2024-04-11[mlir][amdgpu] Remove shared memory optimization pass (#88225)Jakub Kuderski
This implementation has a number of issues and ultimately does not work on gfx9. * It does not reduce bank conflicts with wide memory accesses. * It does not correctly account for when LDS bank conflicts occur on amdgpu. * The implementation is too fragile to be used on real-world code. For example, the code bails out on any `memref.subview` in the root op, even when the subview is not a user of any of the `memref.alloc` ops. I do not see how these can be easily fixed, therefore I think it's better to delete this code.
2024-01-25[reland][mlir][amdgpu] Shared memory access optimization pass (#79164)erman-gurses
- Reland: https://github.com/llvm/llvm-project/pull/75627 - Reproduced then fixed the build issue
2024-01-19Revert "[mlir][amdgpu] Shared memory access optimization pass" (#78822)Mehdi Amini
Reverts llvm/llvm-project#75627 ; it broke the bot: https://lab.llvm.org/buildbot/#/builders/61/builds/53218
2024-01-19[mlir][amdgpu] Shared memory access optimization pass (#75627)erman-gurses
It implements transformation to optimize accesses to shared memory. Reference: https://reviews.llvm.org/D127457 _This change adds a transformation and pass to the NvGPU dialect that attempts to optimize reads/writes from a memref representing GPU shared memory in order to avoid bank conflicts. Given a value representing a shared memory memref, it traverses all reads/writes within the parent op and, subject to suitable conditions, rewrites all last dimension index values such that element locations in the final (col) dimension are given by newColIdx = col % vecSize + perm[row](col / vecSize, row) where perm is a permutation function indexed by row and vecSize is the vector access size in elements (currently assumes 128bit vectorized accesses, but this can be made a parameter). This specific transformation can help optimize typical distributed & vectorized accesses common to loading matrix multiplication operands to/from shared memory._
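The index rewrite quoted above can be made concrete with a small sketch. The commit does not say which permutation `perm` is, so the XOR-based `permuteVector` below is purely a hypothetical choice, as are the function names; the sketch also assumes that `perm` returns the element offset of the permuted vector and that the number of vectors per row is a power of two.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-row permutation: XOR the vector index with the row
// (reduced modulo numVecs) and return the element offset of that vector.
inline int64_t permuteVector(int64_t vecIdx, int64_t row, int64_t numVecs,
                             int64_t vecSize) {
  return (vecIdx ^ (row % numVecs)) * vecSize;
}

// newColIdx = col % vecSize + perm[row](col / vecSize, row)
inline int64_t swizzleColumn(int64_t col, int64_t row, int64_t vecSize,
                             int64_t rowWidth) {
  assert(vecSize > 0 && rowWidth % vecSize == 0 && "row must hold whole vectors");
  assert(col >= 0 && col < rowWidth && row >= 0 && "indices must be in range");
  const int64_t numVecs = rowWidth / vecSize;
  assert((numVecs & (numVecs - 1)) == 0 && "XOR permutation needs a power of two");
  return col % vecSize + permuteVector(col / vecSize, row, numVecs, vecSize);
}
```

Under these assumptions, row 0 is left unchanged while other rows have their vector-sized chunks shuffled, so the same logical column in consecutive rows lands at different positions within the row.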
2023-09-28[mlir][AMDGPU] Add packed 8-bit float conversion ops and loweringKrzysztof Drewniak
Define operations that wrap the gfx940's new operations for converting between f32 and registers containing packed sets of four 8-bit floats. Define rocdl operations for the intrinsics and an AMDGPU dialect wrapper around them (to account for the fact that MLIR distinguishes the two float formats at the type level but that the LLVM IR does not). Define an ArithToAMDGPU pass, meant to run before conversion to LLVM, that replaces relevant calls to arith.extf and arith.truncf with the packed operations in the AMDGPU dialect. Note that the conversion currently only handles scalars and vectors of rank <= 1, as we do not have a usecase for multi-dimensional vector support right now. Reviewed By: jsjodin Differential Revision: https://reviews.llvm.org/D152457
2023-07-20[mlir][AMDGPU] Define wrappers for WMMA matrix opsGiuseppe Rossini
Wave Matrix Multiply Accumulate (WMMA) is the instruction to accelerate matrix multiplication on RDNA3 architectures. LLVM already provides a set of intrinsics to generate wmma instructions. This change uses those intrinsics to enable the feature in MLIR. Reviewed By: krzysz00 Differential Revision: https://reviews.llvm.org/D152451
2023-05-12[mlir] Update method cast calls to function callsTres Popp
The MLIR classes Type/Attribute/Operation/Op/Value support cast/dyn_cast/isa/dyn_cast_or_null functionality through llvm's doCast functionality in addition to defining methods with the same name. This change begins the migration of uses of the method to the corresponding function call as has been decided as more consistent. Note that there still exist classes that only define methods directly, such as AffineExpr, and this does not include work currently to support a functional cast/isa call. Context: * https://mlir.llvm.org/deprecation/ at "Use the free function variants for dyn_cast/cast/isa/…" * Original discussion at https://discourse.llvm.org/t/preferred-casting-style-going-forward/68443 Implementation: This follows a previous patch that updated calls `op.cast<T>()-> cast<T>(op)`. However some cases could not handle an unprefixed `cast` call due to occurrences of variables named cast, or occurring inside of class definitions which would resolve to the method. All C++ files that did not work automatically with `cast<T>()` are updated here to `llvm::cast` and similar with the intention that they can be easily updated after the methods are removed through a find-replace. See https://github.com/llvm/llvm-project/compare/main...tpopp:llvm-project:tidy-cast-check for the clang-tidy check that is used and then update printed occurrences of the function to include `llvm::` before. One can then run the following:
```
ninja -C $BUILD_DIR clang-tidy
run-clang-tidy -clang-tidy-binary=$BUILD_DIR/bin/clang-tidy -checks='-*,misc-cast-functions'\
  -export-fixes /tmp/cast/casts.yaml mlir/*\
  -header-filter=mlir/ -fix
rm -rf $BUILD_DIR/tools/mlir/**/*.inc
```
Differential Revision: https://reviews.llvm.org/D150348
2023-05-03[mlir][AMDGPU] Add emulation pass for atomics on AMDGPU targetsKrzysztof Drewniak
Not all AMDGPU targets support all atomic operations. For example, there are no atomic floating-point adds on the gfx10 series. Add a pass to emulate these operations using a compare-and-swap loop, by analogy to the generic atomicrmw rewrite in MemrefToLLVM. This pass is named generally, as in the future we may have a memref-to-amdgpu that translates constructs like atomicrmw fmax (which doesn't generally exist in LLVM) to the relevant intrinsics, which may themselves require emulation. Since the AMDGPU dialect now has a pass that operates on it, the dialect's directory structure is reorganized to match other similarly complex dialects. The pass should be run before amdgpu-to-rocdl if desired. This commit also adds f64 support to atomic_fmax. Depends on D148722 Reviewed By: nirvedhmeshram Differential Revision: https://reviews.llvm.org/D148724
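The compare-and-swap emulation strategy is the standard one and can be sketched on the host with `std::atomic`. This is an analogy in ordinary C++, not the pass's output; `atomicFAddEmulated` is a hypothetical name and the relaxed memory ordering is an assumption chosen to keep the sketch small.

```cpp
#include <atomic>
#include <cassert>

// Emulate an atomic floating-point add with a compare-and-swap loop.
// Returns the value observed just before the successful update.
inline float atomicFAddEmulated(std::atomic<float> &target, float operand) {
  float observed = target.load(std::memory_order_relaxed);
  // On failure, compare_exchange_weak reloads `observed` with the current
  // value, so the sum is recomputed from fresh data on every retry.
  while (!target.compare_exchange_weak(observed, observed + operand,
                                       std::memory_order_relaxed,
                                       std::memory_order_relaxed)) {
  }
  return observed;
}
```

The same loop shape works for other read-modify-write operations such as fmax: only the expression that computes the new value from `observed` changes.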
2023-05-03[mlir][AMDGPU] Define atomic compare-and-swap for raw buffersKrzysztof Drewniak
This commit adds the buffer cmpswap intrinsic to the ROCDL dialect and its corresponding AMDGPU dialect wrappers. Reviewed By: nirvedhmeshram Differential Revision: https://reviews.llvm.org/D148722
2023-02-28[MLIR][AMDGPU][ROCDL] Adding raw.buffer.atomic.fmax/smax/umin supportManupa Karunaratne
This commit adds support for atomic fmax/smax/umin support for AMDGPU dialect and the dependent dialects to allow such a lowering. Reviewed By: krzysz00 Differential Revision: https://reviews.llvm.org/D144097
2023-02-15[mlir][AMDGPU] 8-bit float usage in the AMDGPU dialectKrzysztof Drewniak
Upcoming AMD hardware will include functions that accept 8-bit floats. Specifically, there are MFMA instructions that accept 8-bit floats, either using the same or mixed formats. This patch adds MLIR wrappers for these intrinsics and explicitly adds support for 8-bit floats in the gpu-to-rocdl conversion by way of amdgpu-to-rocdl. Since LLVM does not have f8 types, when targeting LLVM for compilation on an AMD GPU, both f8 types used on AMD hardware (f8E5M2FNUZ and f8E4M3FNUZ) are rewritten to i8. This patch also relaxes the restriction that the types of both source operands to an amdgpu.mfma instruction match exactly, as this is not necessarily required for the bf8 (f8E5M2FNUZ) and fp8 (f8E4M3FNUZ) instructions. In addition, since the buffer_{load,store} operations maintain a whitelist of permitted types, we add the relevant f8 types to that list. This patch does not add any implementations of arithmetic operations for f8 types. Reviewed By: jakeh-gc Differential Revision: https://reviews.llvm.org/D143956
2023-02-09Add generic type attribute mapping infrastructure, use it in GpuToXKrzysztof Drewniak
Remapping memory spaces is a function often needed in type conversions, most often when going to LLVM or to/from SPIR-V (a future commit), and it is possible that such remappings may become more common in the future as dialects take advantage of the more generic memory space infrastructure. Currently, memory space remappings are handled by running a special-purpose conversion pass before the main conversion that changes the address space attributes. In this commit, this approach is replaced by adding a notion of type attribute conversions to TypeConverter, which is then used to convert memory space attributes. Then, we use this infrastructure throughout the *ToLLVM conversions. This has the advantage of loosening the requirements on the inputs to those passes from "all address spaces must be integers" to "all memory spaces must be convertible to integer spaces", a looser requirement that reduces the coupling between portions of MLIR. On top of that, this change leads to the removal of most of the calls to getMemorySpaceAsInt(), bringing us closer to removing it. (A rework of the SPIR-V conversions to use this new system will be in a followup commit.) As a note, one long-term motivation for this change is that I would eventually like to add an allocaMemorySpace key to MLIR data layouts and then call getMemRefAddressSpace(allocaMemorySpace) in the relevant *ToLLVM in order to ensure all alloca()s, whether incoming or produced during the LLVM lowering, have the correct address space for a given target. I expect that the type attribute conversion system may be useful in other contexts. Reviewed By: ftynse Differential Revision: https://reviews.llvm.org/D142159
2023-01-14[mlir] Use std::optional instead of llvm::Optional (NFC)Kazu Hirata
This patch replaces (llvm::|)Optional< with std::optional<. I'll post a separate patch to remove #include "llvm/ADT/Optional.h". This is part of an effort to migrate from llvm::Optional to std::optional: https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
2023-01-13[mlir] Add #include <optional> (NFC)Kazu Hirata
This patch adds #include <optional> to those files containing llvm::Optional<...> or Optional<...>. I'll post a separate patch to actually replace llvm::Optional with std::optional. This is part of an effort to migrate from llvm::Optional to std::optional: https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
2022-12-17[mlir] llvm::Optional::value => operator*/operator->Fangrui Song
std::optional::value() has undesired exception checking semantics and is unavailable in older Xcode (see _LIBCPP_AVAILABILITY_BAD_OPTIONAL_ACCESS). The call sites block std::optional migration.
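The difference between the two access styles can be shown with `std::optional` directly. This is a minimal sketch with hypothetical helper names: `operator*` performs no check of its own, so the caller establishes that a value is present, whereas `value()` would throw `std::bad_optional_access` on an empty optional.

```cpp
#include <cassert>
#include <optional>

// Dereference with operator*: no exception machinery is involved, so the
// precondition is stated explicitly instead.
inline int unwrapChecked(const std::optional<int> &maybe) {
  assert(maybe.has_value() && "caller must ensure a value is present");
  return *maybe;
}

// Test-then-dereference, the usual shape at call sites that must handle
// the empty case.
inline int valueOrFallback(const std::optional<int> &maybe, int fallback) {
  return maybe ? *maybe : fallback;
}
```

Both helpers avoid `value()`, which is the kind of call-site change the migration requires.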
2022-12-03[mlir] Use std::nullopt instead of None (NFC)Kazu Hirata
This patch mechanically replaces None with std::nullopt where the compiler would warn if None were deprecated. The intent is to reduce the amount of manual work required in migrating from Optional to std::optional. This is part of an effort to migrate from llvm::Optional to std::optional: https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
2022-11-21[mlir][AMDGPU] Remove buffer ops that are statically out of boundsKrzysztof Drewniak
When the bounds check attribute is true, the raw buffer load, store, and atomic operations have well-defined behavior (returning 0 for loads and ignoring stores) when the buffer access exceeds the bounds of the memory being accessed. Because of how LLVM currently implements these buffer operations (as opaque intrinsics), the backend cannot optimize out this known behavior and eliminate the memory operations. Therefore, use MLIR's canonicalization system to eliminate these operations. Reviewed By: nirvedhmeshram Differential Revision: https://reviews.llvm.org/D138146
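The well-defined out-of-bounds behavior, and the static check that lets a canonicalization drop such operations, can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, the buffer is modeled as a flat array of `int32_t`, and the static check linearizes row-major indices against the total element count, which may differ from the conditions the actual pattern uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bounds-checked load: an out-of-bounds access yields 0.
inline int32_t checkedLoad(const std::vector<int32_t> &buffer, uint64_t index) {
  return index < buffer.size() ? buffer[index] : 0;
}

// Bounds-checked store: an out-of-bounds access is ignored.
inline void checkedStore(std::vector<int32_t> &buffer, uint64_t index,
                         int32_t value) {
  if (index < buffer.size())
    buffer[index] = value;
}

// Static check: with constant indices and a static shape, decide at compile
// time whether the linearized (row-major) access falls outside the buffer.
inline bool isStaticallyOutOfBounds(const std::vector<uint64_t> &indices,
                                    const std::vector<uint64_t> &shape) {
  assert(indices.size() == shape.size() && "one index per dimension");
  uint64_t linear = 0, numElements = 1;
  for (std::size_t i = 0; i < shape.size(); ++i) {
    linear = linear * shape[i] + indices[i];
    numElements *= shape[i];
  }
  return linear >= numElements;
}
```

When `isStaticallyOutOfBounds` holds, a load can be replaced by the constant 0 and a store can be erased, which mirrors what the runtime semantics would have produced anyway.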
2022-08-31[mlir][amdgpu] Fix signed/unsigned comparison for abid/cbsz comparisonRob Suderman
Unsigned/signed comparison failure due to implicit signed value. Reviewed By: stella.stamenova Differential Revision: https://reviews.llvm.org/D133061