path: root/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
Age  Commit message  Author
2025-11-17  [AMDGPU] update LDS block size for gfx1250 (#167614)  Changpeng Fang
LDS block size should be 2048 bytes (512 dwords) based on current spec.
2025-11-17  [AMDGPU] Fix layering violations in AMDGPUMCExpr.cpp. NFC (#168242)  Craig Topper
AMDGPUMCExpr lives in the MC layer, so it should not depend on Function.h or GCNSubtarget.h. Move the function that needed GCNSubtarget to the one file that called it.
2025-09-10  Revert "[AMDGPU][gfx1250] Add `cu-store` subtarget feature (#150588)" (#157639)  Pierre van Houtryve
This reverts commit be17791f2624f22b3ed24a2539406164a379125d. This is not necessary for gfx1250 anymore.
2025-09-02  [AMDGPU] Fix hw stage metadata setting for unsigned values (#154502)  Ana Mihajlovic
2025-08-27  [AMDGPU] Set GRANULATED_WAVEFRONT_SGPR_COUNT of compute_pgm_rsrc1 to 0 for gfx10+ (#154666)  Shoreshen
According to `llvm-project/llvm/docs/AMDGPUUsage.rst::L5212`, `GRANULATED_WAVEFRONT_SGPR_COUNT`, which is `compute_pgm_rsrc1[6:9]`, has to be 0 for gfx10+ architectures.
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
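As a rough illustration of the bit-field described above, the following Python sketch clears bits 6..9 of a 32-bit `compute_pgm_rsrc1` value. The field position comes from the commit text; the helper names are hypothetical and this is not the code used by the backend.

```python
# Sketch only: field position taken from the commit text (bits 6..9).
# The helper names below are hypothetical, not LLVM APIs.
SGPR_COUNT_SHIFT = 6
SGPR_COUNT_WIDTH = 4
SGPR_COUNT_MASK = ((1 << SGPR_COUNT_WIDTH) - 1) << SGPR_COUNT_SHIFT


def get_granulated_sgpr_count(rsrc1: int) -> int:
    """Extract the 4-bit field stored in bits 6..9 of the register word."""
    return (rsrc1 & SGPR_COUNT_MASK) >> SGPR_COUNT_SHIFT


def clear_granulated_sgpr_count(rsrc1: int) -> int:
    """Return the 32-bit register word with bits 6..9 forced to zero."""
    return rsrc1 & ~SGPR_COUNT_MASK & 0xFFFFFFFF
```

All other bits of the word are left untouched, which is the property a descriptor writer would want when zeroing a single field.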
2025-08-26  [AMDGPU] Do not assert on non-zero COMPUTE_PGM_RSRC3 on gfx1250. NFCI (#155498)  Stanislav Mekhanoshin
COMPUTE_PGM_RSRC3 does exist on gfx1250, we are just not using it yet.
2025-08-20  [AMDGPU] report named barrier cnt part2 (#154588)  Gang Chen
2025-08-19  [AMDGPU] Remove an unnecessary cast (NFC) (#154470)  Kazu Hirata
getAddressableLocalMemorySize() already returns unsigned.
2025-08-19  [AMDGPU] upstream barrier count reporting part1 (#154409)  Gang Chen
2025-08-15  AMDGPU gfx12: Add _dvgpr$ symbols for dynamic VGPRs (#148251)  Tim Renouf
For each function with the AMDGPU_CS_Chain calling convention, with dynamic VGPRs enabled, add a _dvgpr$ symbol, with the value of the function symbol, plus an offset encoding one less than the number of VGPR blocks used by the function (16 VGPRs per block, no more than 128) in bits 5..3 of the symbol value. This is used by a front-end to have functions that are chained rather than called, and a dispatcher that dynamically resizes the VGPR count before dispatching to a function.
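The encoding described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the function symbol's own value has bits 5..3 clear (so that adding the offset does not carry into other bits); the function names are hypothetical and do not correspond to LLVM APIs.

```python
# Sketch only: names are hypothetical; block size and limit come from the
# commit text (16 VGPRs per block, no more than 128 VGPRs).
VGPRS_PER_BLOCK = 16
MAX_VGPRS = 128


def dvgpr_symbol_value(func_addr: int, num_vgprs: int) -> int:
    """Function address plus (blocks - 1) placed in bits 5..3."""
    if not 1 <= num_vgprs <= MAX_VGPRS:
        raise ValueError("VGPR count out of range")
    if func_addr & 0x38:
        # Assumption: the function symbol leaves bits 5..3 free.
        raise ValueError("function address must have bits 5..3 clear")
    blocks = -(-num_vgprs // VGPRS_PER_BLOCK)  # ceiling division
    return func_addr + ((blocks - 1) << 3)


def decode_vgpr_blocks(symbol_value: int) -> int:
    """Recover the number of VGPR blocks from bits 5..3."""
    return ((symbol_value >> 3) & 0x7) + 1
```

With at most 128 VGPRs there are at most 8 blocks, so "one less than the block count" fits exactly in the three bits 5..3.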
2025-08-14  [AMDGPU] Don't allow wgp mode on gfx1250 (#153680)  Stanislav Mekhanoshin
- gfx1250 only supports cu mode
2025-08-14  [AMDGPU] Increase LDS to 320K on gfx1250 (#153645)  Stanislav Mekhanoshin
2025-08-08  [AMDGPU] AsmPrinter: Unify arg handling (#151672)  Diana Picus
When computing the number of registers required by entry functions, the `AMDGPUAsmPrinter` needs to take into account both the register usage computed by the `AMDGPUResourceUsageAnalysis` pass, and the number of registers initialized by the hardware. At the moment, the way it computes the latter is different for graphics vs compute, due to differences in the implementation. For kernels, all the information needed is available in the `SIMachineFunctionInfo`, but for graphics shaders we would iterate over the `Function` arguments in the `AMDGPUAsmPrinter`. This pretty much repeats some of the logic from instruction selection. This patch introduces 2 new members to `SIMachineFunctionInfo`, one for SGPRs and one for VGPRs. Both will be computed during instruction selection and then used during `AMDGPUAsmPrinter`, removing the need to refer to the `Function` when printing assembly. This patch is NFC except for the fact that we now add the extra SGPRs (VCC, XNACK etc) to the number of SGPRs computed for graphics entry points. I'm not sure why these weren't included before. It would be nice if someone could confirm if that was just an oversight or if we have some docs somewhere that I haven't managed to find. Only one test is affected (its SGPR usage increases because we now take into account the XNACK registers).
2025-08-06  MC,AMDGPU: Don't pad .text with s_code_end if it would otherwise be empty (#147980)  Tim Renouf
We don't want that padding in a module that only contains data, not code. Also fix MCSection::hasInstructions() so it works with the asm streamer too.
2025-07-29  [AMDGPU][gfx1250] Add `cu-store` subtarget feature (#150588)  Pierre van Houtryve
Determines whether we can use `SCOPE_CU` stores (on by default), or whether all stores must be done at `SCOPE_SE` minimum.
2025-07-21  [AMDGPU] Enable FWD_PROGRESS bit for GFX10+ on PAL (#139895)  Jay Foad
Performance testing shows no significant gains or losses on graphics workloads, so this is mostly to make the behavior consistent across all supported OSes instead of special-casing HSA.
2025-07-10  [AMDGPU][NewPM] Port "AMDGPUResourceUsageAnalysis" to NPM (#130959)  Vikram Hegde
2025-06-24  [AMDGPU] Replace dynamic VGPR feature with attribute (#133444)  Diana Picus
Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget feature, as requested in #130030.
2025-06-21  AMDGPU: Use reportFatalUsageError for unsupported code object version (#145133)  Matt Arsenault
2025-06-17  [llvm] annotate interfaces in llvm/Target for DLL export (#143615)  Andrew Rogers
## Purpose
This patch is one in a series of code-mods that annotate LLVM’s public interface for export. This patch annotates the `llvm/Target` library. These annotations currently have no meaningful impact on the LLVM build; however, they are a prerequisite to support an LLVM Windows DLL (shared library) build.
## Background
This effort is tracked in #109483. Additional context is provided in [this discourse](https://discourse.llvm.org/t/psa-annotating-llvm-public-interface/85307), and documentation for `LLVM_ABI` and related annotations is found in the LLVM repo [here](https://github.com/llvm/llvm-project/blob/main/llvm/docs/InterfaceExportAnnotations.rst).
A subset of these changes were generated automatically using the [Interface Definition Scanner (IDS)](https://github.com/compnerd/ids) tool, followed by formatting with `git clang-format`.
The bulk of this change is manual additions of `LLVM_ABI` to `LLVMInitializeX` functions defined in .cpp files under llvm/lib/Target. Adding `LLVM_ABI` to the function implementation is required here because they do not `#include "llvm/Support/TargetSelect.h"`, which contains the declarations for these functions and was already updated with `LLVM_ABI` in a previous patch. I considered patching these files with `#include "llvm/Support/TargetSelect.h"` instead, but since TargetSelect.h is a large file with a bunch of preprocessor x-macro stuff in it I was concerned it would unnecessarily impact compile times.
In addition, a number of unit tests under llvm/unittests/Target required additional dependencies to make them build correctly against the LLVM DLL on Windows using MSVC.
## Validation
Local builds and tests to validate cross-platform compatibility. This included llvm, clang, and lldb on the following configurations:
- Windows with MSVC
- Windows with Clang
- Linux with GCC
- Linux with Clang
- Darwin with Clang
2025-06-13  Revert "[AMDGPU] Skip register uses in AMDGPUResourceUsageAnalysis (#133242)" (#144039)  Diana Picus
This reverts commit 130080fab11cde5efcb338b77f5c3b31097df6e6 because it causes issues in testcases similar to coalescer_remat.ll [1], i.e. when we use a VGPR tuple but only write to its lower parts. The high VGPRs would then not be included in the vgpr_count, and accessing them would be an out of bounds violation.
[1] https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AMDGPU/coalescer_remat.ll
2025-06-03  [AMDGPU] Skip register uses in AMDGPUResourceUsageAnalysis (#133242)  Diana Picus
Don't count register uses when determining the maximum number of registers used by a function. Count only the defs. This is really an underestimate of the true register usage, but in practice that's not a problem because if a function uses a register, then it has either defined it earlier, or some other function that executed before has defined it. In particular, the register counts are used:
1. When launching an entry function, in which case we're safe because the register counts of the entry function will include the register counts of all callees.
2. At function boundaries in dynamic VGPR mode. In this case it's safe because whenever we set the new VGPR allocation we take into account the outgoing_vgpr_count set by the middle-end.
The main advantage of doing this is that the artificial VGPR arguments used only for preserving the inactive lanes when using the llvm.amdgcn.init.whole.wave intrinsic are no longer counted. This enables us to allocate only the registers we need in dynamic VGPR mode.
Co-authored-by: Thomas Symalla <5754458+tsymalla@users.noreply.github.com>
2025-05-06  Register assembly printer passes (#138348)  Matthias Braun
Register assembly printer passes in the pass registry. This makes it possible to use `llc -start-before=<target>-asm-printer ...` in tests. Adds a `char &ID` parameter to the AssemblyPrinter constructor to allow targets to use the `INITIALIZE_PASS` macros and register the pass in the pass registry. This currently has a default parameter so it won't break any targets that have not been updated.
2025-05-05  [ErrorHandling] Add reportFatalInternalError + reportFatalUsageError (NFC) (#138251)  Nikita Popov
This implements the result of the discussion at: https://discourse.llvm.org/t/rfc-report-fatal-error-and-the-default-value-of-gencrashdialog/73587 There are two different use cases for report_fatal_error, so replace it with two functions reportFatalInternalError() and reportFatalUsageError(). The former indicates a bug in LLVM and generates a crash dialog. The latter does not. The names have been suggested by rnk and people seemed to like them. This replaces a lot of the usages that passed an explicit value for GenCrashDiag. I did not bulk replace remaining report_fatal_error usage -- they probably require case by case review for which function to use.
2025-03-19  [AMDGPU] Allocate scratch space for dVGPRs for CWSR (#130055)  Diana Picus
The CWSR trap handler needs to save and restore the VGPRs. When dynamic VGPRs are in use, the fixed function hardware will only allocate enough space for one VGPR block. The rest will have to be stored in scratch, at offset 0. This patch allocates the necessary space by:
- generating a prologue that checks at runtime if we're on a compute queue (since CWSR only works on compute queues); for this we will have to check the ME_ID bits of the ID_HW_ID2 register
- if that is non-zero, we can assume we're on a compute queue and initialize the SP and FP with enough room for the dynamic VGPRs
- forcing all compute entry functions to use a FP so they can access their locals/spills correctly (this isn't ideal but it's the quickest to implement)
Note that at the moment we allocate enough space for the theoretical maximum number of VGPRs that can be allocated dynamically (for blocks of 16 registers, this will be 128, of which we subtract the first 16, which are already allocated by the fixed function hardware). Future patches may decide to allocate less if they can prove the shader never allocates that many blocks. Also note that this should not affect any reported stack sizes (e.g. PAL backend_stack_size etc).
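The size arithmetic in the note above can be worked through with a short Python sketch. The 128 and 16 figures come from the commit text; the 4 bytes per lane (a 32-bit VGPR) and the wave32 lane count (dynamic VGPR mode is described elsewhere in this log as wave32-only) are assumptions used for illustration. Whether the backend tracks this size per lane or per wave is not stated here, so both are shown.

```python
# Sketch only: illustrative arithmetic, not the backend's actual code.
MAX_DYNAMIC_VGPRS = 128   # theoretical maximum, from the commit text
VGPR_BLOCK_SIZE = 16      # first block is allocated by fixed-function hardware
BYTES_PER_VGPR_LANE = 4   # assumption: one VGPR holds 32 bits per lane
WAVE_SIZE = 32            # assumption: wave32


def vgprs_saved_in_scratch() -> int:
    """VGPRs beyond the first block, which must live in scratch."""
    return MAX_DYNAMIC_VGPRS - VGPR_BLOCK_SIZE


def scratch_bytes_per_lane() -> int:
    return vgprs_saved_in_scratch() * BYTES_PER_VGPR_LANE


def scratch_bytes_per_wave() -> int:
    return scratch_bytes_per_lane() * WAVE_SIZE
```

Under these assumptions, 112 VGPRs need 448 bytes per lane, or 14336 bytes for a full wave32.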
2025-03-18  [AMDGPU] Add SubtargetFeature for dynamic VGPR mode (#130030)  Diana Picus
This represents a hardware mode supported only for wave32 compute shaders. When enabled, we set the `.dynamic_vgpr_en` field of `.compute_registers` to true in the PAL metadata. This will be changed to use an attribute after downstream consumers have been migrated.
2025-03-17  [llvm][AMDGPU] Enable FWD_PROGRESS bit for GFX10+ (#128367)  Alex Voicu
From GFX10 onwards it is possible to employ benevolent scheduling of waves. This patch unconditionally enables, for the `amdhsa` OS, the bit which controls that capability, as it is beneficial for algorithms that rely on more complex concurrent coordination and it is generally performance neutral otherwise.
2025-03-03  [AMDGPU] Set inst_pref_size to maximum (#126981)  Stanislav Mekhanoshin
On gfx11 and gfx12, set the initial instruction prefetch size to the minimum of the kernel size and the maximum allowed value. Fixes: SWDEV-513122
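The clamping rule above amounts to a single `min`. The Python sketch below shows the idea under stated assumptions: the prefetch field is taken to count fixed-size granules, and both the granule size and the field's maximum value are passed in as parameters because the commit text does not give them. The function name is hypothetical.

```python
# Sketch only: granule size and field maximum are caller-supplied
# assumptions, not values taken from the commit or from the hardware docs.
def inst_pref_size(code_size_bytes: int, granule_bytes: int,
                   max_field_value: int) -> int:
    """Kernel size in granules (rounded up), clamped to the field maximum."""
    granules = -(-code_size_bytes // granule_bytes)  # ceiling division
    return min(granules, max_field_value)
```

A small kernel gets a prefetch size that covers just its code, while a large kernel saturates at the largest encodable value.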
2025-03-03  [AMDGPU] Extend ComputePGMRSrc3 to gfx10+. NFCI. (#129289)  Stanislav Mekhanoshin
ComputePGMRSrc3 exists on gfx90a and on gfx10+. Current code only expects gfx90a. This is NFCI since we do not fill it on gfx10+ yet.
2025-02-17  [AMDGPU] Move into SIProgramInfo and cache getFunctionCodeSize. NFCI. (#127111)  Stanislav Mekhanoshin
This moves the function as is; improvements to the estimate go into a subsequent patch.
2025-01-23  [AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748)  Lucas Ramirez
Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies.
`getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither of the two of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes` since it is the range of workgroup sizes that produces the range of achievable occupancies.
Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies, i.e., sometimes workgroup sizes inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; heavy documentation explains the rationale behind the sometimes relatively obscure calculations.
As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024].
- A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling as much as possible the remaining 40-36=4 wave slots without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2 groups made up of 16 waves each can fit fully on a CU at any given time, for a total of 32 waves on the CU, or 8 per EU. However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU.
Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10].
Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary.
Fixes #118220.
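The worked example above can be checked with a brute-force Python sketch. It uses the example's target parameters (10 waves per EU, 4 EUs per CU, 64-wide waves, no LDS usage) and simply tries every workgroup size in the range. This is not the constant-time method the commit implements; it ignores any other per-CU limits, and rounding waves-per-EU up is an assumption made here. All names are hypothetical.

```python
# Sketch only: brute force over every group size, for checking the example.
WAVES_PER_EU = 10
EUS_PER_CU = 4
WAVE_SIZE = 64


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def waves_per_eu(group_size: int) -> int:
    """Occupancy when the CU is filled with whole groups of this size."""
    wave_slots = WAVES_PER_EU * EUS_PER_CU            # 40 slots per CU
    waves_per_group = ceil_div(group_size, WAVE_SIZE)
    groups_on_cu = wave_slots // waves_per_group      # only whole groups fit
    # Assumption: partial per-EU counts are rounded up.
    return ceil_div(groups_on_cu * waves_per_group, EUS_PER_CU)


def occupancy_range(min_size: int, max_size: int) -> tuple:
    """(min, max) occupancy over all group sizes in [min_size, max_size]."""
    occupancies = [waves_per_eu(s) for s in range(min_size, max_size + 1)]
    return min(occupancies), max(occupancies)
```

Calling `occupancy_range(513, 1024)` reproduces the [7,10] range from the text, with the bounds coming from group sizes inside the range rather than from its endpoints.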
2025-01-21  [AMDGPU] Change scope of resource usage info symbols (#114810)  Janek van Oirschot
Change scope of resource usage info MC symbols to align with the function linkage type
2025-01-10  [AMDGPU] Add backward compatibility layer for kernarg preloading (#119167)  Austin Kerbow
Add a prologue to the kernel entry to handle cases where code designed for kernarg preloading is executed on hardware equipped with incompatible firmware. If hardware has compatible firmware the 256 bytes at the start of the kernel entry will be skipped. This skipping is done automatically by hardware that supports the feature. A pass is added which is intended to be run at the very end of the pipeline to avoid any optimizations that would assume the prologue is a real predecessor block to the actual code start. In reality we have two possible entry points for the function:
1. The optimized path that supports kernarg preloading which begins at an offset of 256 bytes.
2. The backwards compatible entry point which starts at offset 0.
2024-11-20  [NFC][AMDGPU] Remove redundant code in `AMDGPUAsmPrinter.cpp`  Shilei Tian
2024-11-18  AMDGPU: Increase the LDS size to support to 160 KB for gfx950 (#116309)  Matt Arsenault
2024-11-08  Reapply "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)"  Shilei Tian
This reverts commit ca33649abe5fad93c57afef54e43ed9b3249cd86.
2024-11-07  [AMDGPU] Fix resource usage information for unnamed functions (#115320)  Janek van Oirschot
Resource usage information would try to overwrite unnamed functions if there are multiple within the same compilation unit. This aims to either use the `MCSymbol` assigned to the unnamed function (i.e., `CurrentFnSym`), or, rematerialize the `MCSymbol` for the unnamed function.
2024-10-03  [AMDGPU] Qualify auto. NFC. (#110878)  Jay Foad
Generated automatically with: $ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)
2024-10-02  Add and call `AMDGPUMCResourceInfo::reset` method (#110818)  Thomas Symalla
When compiling multiple pipelines, the `MCRegisterInfo` instance in `AMDGPUAsmPrinter` gets re-used even after finalization, so it calls `finalize()` multiple times. Add a reset method and call it in `AMDGPUAsmPrinter::doFinalization`. A different approach would be to make it a `unique_ptr`.
Co-authored-by: Thomas Symalla <tsymalla@amd.com>
2024-09-30  [AMDGPU] Convert AMDGPUResourceUsageAnalysis pass from Module to MF pass (#102913)  Janek van Oirschot
Converts AMDGPUResourceUsageAnalysis pass from Module to MachineFunction pass. Moves function resource info propagation to the MC layer (through helpers in AMDGPUMCResourceInfo) by generating MCExprs for every function resource which the emitters have been prepped for. Fixes https://github.com/llvm/llvm-project/issues/64863
2024-09-23  [AMDGPU] Include unused preload kernarg in KD total SGPR count (#104743)  Austin Kerbow
Unlike with implicitly preloaded data UserSGPRs, firmware is unable to handle cases where SGPRs for kernel arguments contain preloaded data but are not explicitly referenced in the kernel. We need to include these preloaded SGPRs in the GRANULATED_WAVEFRONT_SGPR_COUNT calculation to not clobber SGPRs in adjacent waves.
2024-08-15  [AMDGPU] MCExpr printing helper with KnownBits support (#95951)  Janek van Oirschot
Walks over the MCExpr and uses KnownBits to deduce whether an expression is known and if so, prints said known value. Should support the most common MCExpr cases for AMDGPU metadata.
2024-07-22  [AMDGPU] Do not print `kernel-resource-usage` information on non-kernels (#99720)  Joseph Huber
Summary: This pass is used to get helpful information about the kernel resources without needing to inspect the binary. However, it currently prints on every function. These values will always be zero, so it's just spam on the terminal, at best an indication that a function wasn't internalized / optimized out. This patch makes it only print for kernels to make it more useful in practice.
2024-07-17  [AMDGPU] clang-tidy: no else after return etc. NFC. (#99298)  Jay Foad
2024-07-17  [AMDGPU] clang-tidy: use emplace_back instead of push_back. NFC.  Jay Foad
2024-07-17  [AMDGPU] clang-tidy: use std::make_unique. NFC.  Jay Foad
2024-06-28  [IR] Add getDataLayout() helpers to Function and GlobalValue (#96919)  Nikita Popov
Similar to https://github.com/llvm/llvm-project/pull/96902, this adds `getDataLayout()` helpers to Function and GlobalValue, replacing the current `getParent()->getDataLayout()` pattern.
2024-06-26  [AMDGPU] MCExpr-ify AMDGPU HSAMetadata (#94788)  Janek van Oirschot
Enables MCExpr for HSAMetadata, particularly, HSAMetadata's msgpack format.
2024-06-25  [AMDGPU][NFC] Rename AMDGPUVariadicMCExpr to AMDGPUMCExpr. (#96618)  Ivan Kosarev
Some of our custom expressions are not variadic and there seems to be little benefit in mentioning the variadic nature of expression nodes in the name anyway.
2024-06-25  AMDGPU: Add plumbing for private segment size argument (#96445)  Nicolai Hähnle
The actual size of scratch/private is determined at dispatch time, so add more plumbing to request it. Will be used in a subsequent change.