path: root/llvm/lib/Target/AMDGPU
Age Commit message Author
2025-11-22[AMDGPU] Enable serializing of allocated preload kernarg SGPRs info (#168374)tyb0807
- Support serialization of the number of allocated preload kernarg SGPRs
- Support serialization of the first preload kernarg SGPR allocated
Together they enable correctly reconstructing MIR with preload kernarg SGPRs.
2025-11-21AMDGPU: Improve getShuffleCost accuracy for 8- and 16-bit shuffles (#168818)Nicolai Hähnle
These shuffles can always be implemented using v_perm_b32, and so this rewrites the analysis from the perspective of "how many v_perm_b32s does it take to assemble each register of the result?" The test changes in Transforms/SLPVectorizer/reduction.ll are reasonable: VI (gfx8) has native f16 math, but not packed math.
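The "how many v_perm_b32s per result register" view can be sketched as follows. This is a toy model for 8-bit elements only, not the code from the patch: the function name `permsForShuffle` is hypothetical, and the counting rule (one `v_perm_b32` merges bytes from two 32-bit inputs, so a result register fed by k source registers needs k - 1 of them, and an in-place copy needs none) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Mask[i] is the index of the source byte that feeds result byte i,
// or -1 if that result byte is undefined. Bytes are grouped four per
// 32-bit register. Hypothetical helper, not LLVM's getShuffleCost.
unsigned permsForShuffle(const std::vector<int> &Mask) {
  unsigned Total = 0;
  for (std::size_t Base = 0; Base < Mask.size(); Base += 4) {
    std::set<int> SrcRegs; // distinct source registers feeding this result
    bool Identity = true;  // every byte stays in its original position
    for (std::size_t I = Base; I < Base + 4 && I < Mask.size(); ++I) {
      if (Mask[I] < 0)
        continue; // undefined byte, no constraint
      SrcRegs.insert(Mask[I] / 4);
      if (static_cast<std::size_t>(Mask[I] % 4) != I - Base)
        Identity = false;
    }
    if (SrcRegs.empty())
      continue; // fully undefined result register
    if (SrcRegs.size() == 1 && Identity)
      continue; // result register is just the source register
    // Assumed rule: each perm can bring in at most one additional source.
    Total += SrcRegs.size() == 1
                 ? 1u
                 : static_cast<unsigned>(SrcRegs.size() - 1);
  }
  return Total;
}
```

Under these assumptions, a byte reversal within one register costs one permute, while gathering one byte from each of four registers costs three.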
2025-11-21AMDGPU: Handle invariant when lowering global loads (#168914)Matt Arsenault
Global with invariant should be treated identically to constant.
2025-11-21Revert "[AMDGPU] Remove leftover implicit operands from SI_SPILL/SI_RESTORE." (#169068)Nathan Corbyn
PR causes build failures with expensive checks enabled. Reverts llvm/llvm-project#168546
2025-11-21[AMDGPU] Handle AV classes in SIFixSGPRCopies::processPHINode (#169038)Jay Foad
Fix a problem exposed by #166483 using AV classes in more places. `isVectorRegister` only accepts registers of VGPR or AGPR classes. `hasVectorRegisters` additionally accepts the combined AV classes. Fixes: #168761
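The difference between the two queries can be pictured with a toy model. The enum and both function names below are hypothetical stand-ins, not LLVM's real types or signatures; they only mirror the behavior the commit message describes.

```cpp
#include <cassert>

// Toy stand-in for register class kinds; not LLVM's real types.
enum class RegClassKind { SGPR, VGPR, AGPR, AV };

// Models the narrower query: only pure VGPR or AGPR classes qualify.
constexpr bool isPureVectorClass(RegClassKind K) {
  return K == RegClassKind::VGPR || K == RegClassKind::AGPR;
}

// Models the broader query: the combined AV classes are accepted too.
constexpr bool containsVectorRegisters(RegClassKind K) {
  return isPureVectorClass(K) || K == RegClassKind::AV;
}
```

In this model, a PHI whose register is in an AV class is missed by the narrower check and caught by the broader one.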
2025-11-21AMDGPU: Stop implementing shouldCoalesce (#168988)Matt Arsenault
Use the default, which freely coalesces anything it can. This mostly shows improvements, with a handful of regressions. The main concern would be if introducing wider registers is more likely to push the register usage up to the next occupancy tier.
2025-11-21[AMDGPU] Enable multi-group xnack replay in hardware (GFX1250) (#169016)Christudasan Devadasan
This patch enables the multi-group xnack replay mode by configuring the hardware MODE register at kernel entry. This aligns the hardware behavior with the compiler's existing multi-group s_wait_xcnt insertion logic.
2025-11-20AMDGPU: Don't duplicate implicit operands in 3-address conversion (#168426)Nicolai Hähnle
We previously got a duplicate implicit $exec operand. It didn't really hurt anything (other than being a slight drag on compile-time performance). Still, let's keep things clean.
2025-11-20AMDGPU: Handle invariant loads when considering if a load can be scalar (#168787)Matt Arsenault
2025-11-20AMDGPU: Fix treating divergent loads as uniform (#168785)Matt Arsenault
Avoids the regression which caused the revert 6d5f87fc42. This is a hack on a hack. We currently have isUniformMMO, which improperly treats an unknown source value as known uniform. This is a hack from before we had divergence information in the DAG, and should be removed. This is the minimum change to avoid the regression; removing the aggressive handling of the unknown case (or dropping isUniformMMO entirely) are more involved fixes.
2025-11-20Revert "[AMDGPU] Add wave reduce intrinsics for float types - 2 (#161815)" (#168845)Aaditya
This reverts commit dcab4cb49bfb0aa17df3d3fabe582696100e0d35.
2025-11-20[AMDGPU] Add wave reduce intrinsics for float types - 2 (#161815)Aaditya
Supported Ops: `fadd`, `fsub`
2025-11-20[AMDGPU] Add wave reduce intrinsics for float types - 1 (#161814)Aaditya
Supported Ops: `fmin`, `fmax`
2025-11-19[AMDGPU] Fixed crash in getLastMIForRegion when the region is empty. (#168653)Dhruva Chakrabarti
PreRARematStage builds region live-outs if GCN trackers are enabled. If rematerialization leads to empty regions, this can cause a crash because an invalid iterator is dereferenced in getLastMIForRegion. The fix is to skip calling getLastMIForRegion for empty regions.
This patch fixes another bug in the same code region. getLastMIForRegion calls skipDebugInstructionsBackward, which may immediately return RegionEnd if it is not the begin instruction and it is a non-debug instruction. That would mean considering an instruction that is outside the relevant region. The fix is to always pass the predecessor of RegionEnd to skipDebugInstructionsBackward.
This bug was found while using GCN trackers on the existing LIT test machine-scheduler-sink-trivial-remats.mir. Here's the assertion failure.
llvm-project/llvm/include/llvm/ADT/ilist_iterator.h:168: llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::reference llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::operator*() const [with OptionsT = llvm::ilist_detail::node_options<llvm::MachineInstr, true, true, void, false, void>; bool IsReverse = false; bool IsConst = false; llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::reference = llvm::MachineInstr&]: Assertion `!NodePtr->isKnownSentinel()' failed.
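Both fixes in this commit (guard against empty regions, and start the backward scan from the instruction before the region end) can be sketched on a plain container. This is a minimal illustration under assumed types: `Instr`, `Iter`, and `lastNonDebugInRegion` are hypothetical and do not reproduce the LLVM code.

```cpp
#include <cassert>
#include <iterator>
#include <optional>
#include <vector>

// Toy instruction: an id plus a flag marking debug instructions.
struct Instr {
  int Id;
  bool IsDebug;
};
using Iter = std::vector<Instr>::const_iterator;

// Returns the last non-debug instruction in the half-open region
// [Begin, End), or nullopt if there is none.
std::optional<Iter> lastNonDebugInRegion(Iter Begin, Iter End) {
  if (Begin == End)
    return std::nullopt; // empty region: nothing may be dereferenced
  Iter It = std::prev(End); // start inside the region, never at End itself
  while (It->IsDebug) {
    if (It == Begin)
      return std::nullopt; // region holds only debug instructions
    --It;
  }
  return It;
}
```

Returning an empty optional for the empty region makes the caller handle that case explicitly instead of dereferencing a sentinel.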
2025-11-19[AMDGPU] Prioritize allocation of low 256 VGPR classes (#167978)Stanislav Mekhanoshin
If we have 1024 VGPRs available we need to give priority to the allocation of these registers where operands can only use the low 256. These are notably the scale operands of V_WMMA_SCALE instructions. Otherwise large tuples will be allocated first and take all low registers, so we would have to spill to make room for these scale registers. Allocation priority itself does not eliminate spilling completely in large kernels, although it helps to some degree. Increasing the spill weight of a restricted class on top of it helps.
2025-11-19CodeGen: Add subtarget to TargetLoweringBase constructor (#168620)Matt Arsenault
Currently LibcallLoweringInfo is defined inside of TargetLowering, which is owned by the subtarget. Pass in the subtarget so we can construct LibcallLoweringInfo with the subtarget. This is a temporary step that should be revertable in the future, after LibcallLoweringInfo is moved out of TargetLowering.
2025-11-19[llvm] Use llvm::size (NFC) (#168675)Kazu Hirata
Note that llvm::size only works on types that allow std::distance in O(1).
2025-11-19[AMDGPU] Remove leftover implicit operands from SI_SPILL/SI_RESTORE. (#168546)LU-JOHN
Remove leftover implicit operands from SI_SPILL/SI_RESTORE. --------- Signed-off-by: John Lu <John.Lu@amd.com>
2025-11-19[AMDGPU][SDAG] Only fold flat offsets if they are inbounds PTRADDs (#165427)Fabian Ritter
For flat memory instructions where the address is supplied as a base address register with an immediate offset, the memory aperture test ignores the immediate offset. Currently, SDISel does not respect that, which leads to miscompilations where valid input programs crash when the address computation relies on the immediate offset to get the base address in the proper memory aperture. Global or scratch instructions are not affected.
This patch only selects flat instructions with immediate offsets from PTRADD address computations with the inbounds flag: If the PTRADD does not leave the bounds of the allocated object, it cannot leave the bounds of the memory aperture and is therefore safe to handle with an immediate offset.
Affected tests:
- CodeGen/AMDGPU/fold-gep-offset.ll: Offsets are no longer wrongly folded, added new positive tests where we still do fold them.
- CodeGen/AMDGPU/infer-addrspace-flat-atomic.ll: Offset folding doesn't seem integral to this test, so the test is not changed to make offset folding still happen.
- CodeGen/AMDGPU/loop-prefetch-data.ll: loop-reduce transforms inbounds addresses for accesses to be based on potentially OOB addresses used for prefetching.
- I think the remaining ones suffer from the limited preservation of the inbounds flag in PTRADD DAGCombines due to the provenance problems pointed out in PR #165424 and the fact that `AMDGPUTargetLowering::SplitVector{Load|Store}` legalizes too-wide accesses by repeatedly splitting them in half. Legalizing a V32S32 memory access therefore leads to inbounds ptradd chains like (ptradd inbounds (ptradd inbounds (ptradd inbounds P, 64), 32), 16). The DAGCombines fold them into a single ptradd, but the involved transformations generally cannot preserve the inbounds flag (even though it would be valid in this case).
Similar previous PR that relied on `ISD::ADD inbounds` instead of `ISD::PTRADD inbounds` (closed): #132353
Analogous PR for GISel (merged): #153001
Fixes SWDEV-516125.
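The aperture problem this commit describes can be illustrated with a small model. The aperture bounds, the enum, and all function names below are invented for the example; only the behavior "the aperture test looks at the base register and ignores the immediate" comes from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical two-aperture layout, chosen only for illustration.
enum class Aperture { Global, Private };
constexpr uint64_t PrivateLo = 0x1000;
constexpr uint64_t PrivateHi = 0x2000;

constexpr Aperture classify(uint64_t Addr) {
  return (Addr >= PrivateLo && Addr < PrivateHi) ? Aperture::Private
                                                 : Aperture::Global;
}

// What the program means: the full address selects the aperture.
constexpr Aperture intendedAperture(uint64_t Base, uint64_t Imm) {
  return classify(Base + Imm);
}

// What the commit describes for flat instructions with a folded offset:
// the aperture test sees only the base register.
constexpr Aperture testedAperture(uint64_t Base, uint64_t /*Imm*/) {
  return classify(Base);
}
```

With a base just below the hypothetical private aperture and an offset that crosses into it, the two functions disagree, which is the miscompilation scenario. When the base and the full address lie in the same object (the inbounds case), they agree.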
2025-11-19[AMDGPU] Ignore wavefront barrier latency during scheduling DAG mutation (#168500)Carl Ritson
Do not add latency for wavefront and singlethread scope fences during barrier latency DAG mutation. These scopes do not typically introduce any latency and adjusting schedules based on them significantly impacts latency hiding.
2025-11-18[AMDGPU][GlobalISel] Add regbankselect rules for G_FSHR (#159818)Anshil Gandhi
2025-11-19[AMDGPU] Adding instruction specific features (#167809)Shoreshen
2025-11-18[NFC] Check operand type instead of opcode (#168641)Shilei Tian
A follow-up of #168458.
2025-11-18[AMDGPU] Don't fold an i64 immediate value if it can't be replicated from its lower 32-bit (#168458)Shilei Tian
On some targets, a packed f32 instruction can only read 32 bits from a scalar operand (SGPR or literal) and replicates the bits to both channels. In this case, we should not fold an immediate value if it can't be replicated from its lower 32 bits. Fixes SWDEV-567139.
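The replication condition amounts to checking that both 32-bit halves of the 64-bit immediate are equal. A minimal sketch, assuming this is the whole condition (the helper name `isReplicatedLo32` is hypothetical and the actual check in the patch may be structured differently):

```cpp
#include <cassert>
#include <cstdint>

// True if the 64-bit value can be rebuilt by copying its low 32 bits
// into both channels, i.e. the high half equals the low half.
constexpr bool isReplicatedLo32(uint64_t Imm) {
  return static_cast<uint32_t>(Imm) == static_cast<uint32_t>(Imm >> 32);
}
```

For example, a pair of two identical f32 values such as 1.0f (0x3F800000) in both halves passes the check, while 1.0f in the low half and 0 in the high half does not.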
2025-11-18[NFC][AMDGPU] IGLP: Fixes for unsigned int handling (#135090)Robert Imschweiler
Fixes unsigned int underflows in `MFMASmallGemmSingleWaveOpt::applyIGLPStrategy`.
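The commit message does not show the affected code, so the following is only a generic illustration of the bug class: decrementing an unsigned index below zero wraps to a huge value. The function `findLast` is hypothetical and shows one common underflow-safe way to scan backwards.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scans V from the back and returns the index of the last element equal
// to Value, or -1 if there is none. The loop tests before decrementing,
// so the unsigned index never wraps, even for an empty vector.
// A loop such as `for (unsigned I = V.size() - 1; I >= 0; --I)` would
// wrap instead, because an unsigned value is always >= 0.
int findLast(const std::vector<int> &V, int Value) {
  for (std::size_t I = V.size(); I-- > 0;) {
    if (V[I] == Value)
      return static_cast<int>(I);
  }
  return -1;
}
```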
2025-11-18[AMDGPU] Consider FLAT instructions for VMEM hazard detection (#137170)Robert Imschweiler
In general, "Flat instructions look at the per-workitem address and determine for each work item if the target memory address is in global, private or scratch memory." (RDNA2 ISA) That means that FLAT instructions need to be considered for VMEM hazards even without "specific segment". Also, LDS DMA should be considered for LDS hazard detection. See also #137148
2025-11-18[AMDGPU][GlobalISel] Add RegBankLegalize support for G_IS_FPCLASS (#167575)vangthao95
2025-11-18[AMDGPU] Remove const on a return type. (#168490)Kazu Hirata
While I am at it, this patch switches to the constructor that takes a container instead of a pair of begin/end. Identified with readability-const-return-type.
2025-11-18[AMDGPU][SIMemoryLegalizer] Combine GFX10-11 CacheControl Classes (#168058)Pierre van Houtryve
Also breaks the long inheritance chains by making both `SIGfx10CacheControl` and `SIGfx12CacheControl` inherit from `SICacheControl` directly. With this patch, we now just have 3 `SICacheControl` implementations that each do their own thing, and there is no more code hidden 3 superclasses above (which made this code harder to read and maintain than it needed to be).
2025-11-17[AMDGPU] update LDS block size for gfx1250 (#167614)Changpeng Fang
LDS block size should be 2048 bytes (512 dwords) based on current spec.
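The stated figures are consistent: 2048 bytes divided by 4 bytes per dword gives 512 dwords. The sketch below checks that and shows a round-up-to-blocks computation. That LDS sizes are rounded up to whole blocks is an assumption made for illustration, and `ldsBlocksNeeded` is a hypothetical helper.

```cpp
#include <cassert>

constexpr unsigned LdsBlockBytes = 2048; // value stated in the commit
constexpr unsigned DwordBytes = 4;
static_assert(LdsBlockBytes / DwordBytes == 512,
              "2048 bytes correspond to 512 dwords");

// Number of whole blocks needed to cover Bytes (ceiling division).
// Assumes allocation happens in units of one block.
constexpr unsigned ldsBlocksNeeded(unsigned Bytes) {
  return (Bytes + LdsBlockBytes - 1) / LdsBlockBytes;
}
```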
2025-11-17[AMDGPU] Fix layering violations in AMDGPUMCExpr.cpp. NFC (#168242)Craig Topper
AMDGPUMCExpr lives in the MC layer, so it should not depend on Function.h or GCNSubtarget.h. Move the function that needed GCNSubtarget to the one file that called it.
2025-11-17[AMDGPU][GlobalISel] Add RegBankLegalize support for G_FMUL (#167847)vangthao95
2025-11-17[AMDGPU][SIMemoryLegalizer] Combine all GFX6-9 CacheControl Classes (#168052)Pierre van Houtryve
Merge the following classes into `SIGfx6CacheControl`:
- SIGfx7CacheControl
- SIGfx90ACacheControl
- SIGfx940CacheControl
They were all very similar and had a lot of duplicated boilerplate just to implement one or two codegen differences. GFX90A/GFX940 have a bit more differences, but they're still manageable under one class because the general behavior is the same. This removes 500 lines of code and puts everything into a single place which I think makes it a lot easier to maintain, at the cost of a slight increase in complexity for some functions. There is still a lot of room for improvement but I think this patch is already big enough as is and I don't want to bundle too much into one review.
2025-11-17[AMDGPU] Add amdgpu-lower-exec-sync pass to lower named-barrier globals (#165692)Chaitanya
This PR introduces the `amdgpu-lower-exec-sync` pass, which specifically lowers named-barrier LDS globals introduced by #114550. Changes include:
- Moving the logic of lowering named-barrier LDS globals from the `amdgpu-lower-module-lds` pass to this new pass.
- Adding the pass to the pipeline and removing the existing lowering logic for named-barrier LDS in `amdgpu-lower-module-lds`.
See #161827 for discussion on this topic.
2025-11-17[AMDGPU] TableGen-erate SDNode descriptions (#168248)Sergei Barannikov
This allows SDNodes to be validated against their expected type profiles and reduces the number of changes required to add a new node. Autogenerated node names start with "AMDGPUISD::", hence the changes in the tests. The few nodes defined in R600.td are *not* imported because TableGen processes AMDGPU.td, which doesn't include R600.td. Ideally, we would have two sets of nodes, but that would require careful reorganization of td files since some nodes are shared between AMDGPU/R600. Not sure if it is something worth looking into. Some nodes fail validation; those are listed in `AMDGPUSelectionDAGInfo::verifyTargetNode()`. Part of #119709. Pull Request: https://github.com/llvm/llvm-project/pull/168248
2025-11-15[AMDGPU] When shrinking and/or to bitset*, remove implicit scc def (#168128)LU-JOHN
When shrinking and/or to bitset* remove leftover implicit scc def. bitset* instructions do not set scc. Signed-off-by: John Lu <John.Lu@amd.com>
2025-11-14AMDGPU: Select vector reg class for divergent build_vector (#168169)Matt Arsenault
The main improvement is to the mfma tests. There are some mild regressions scattered around, and a few major ones. The worst regressions are in some of the bitcast tests; these are cases where the SGPR argument list runs out and uses VGPRs, and the copies-from-VGPR are misidentified as divergent. Most of the shufflevector tests are also regressions. These end up with cleaner MIR, but then get poor regalloc decisions.
2025-11-14AMDGPU: Consider isVGPRImm when forming constant from build_vector (#168168)Matt Arsenault
This probably should have turned into a regular integer constant earlier. This is to defend against future regressions.
2025-11-15AMDGPU: Use vgpr to implement divergent i32->i64 anyext (#168167)Matt Arsenault
Handle this for consistency with the zext case.
2025-11-14AMDGPU: Use v_mov_b32 to implement divergent zext i32->i64 (#168166)Matt Arsenault
Some cases are relying on SIFixSGPRCopies to force VALU reg_sequence inputs with SGPR inputs to use all VGPR inputs, but this doesn't always happen if the reg_sequence isn't invalid. Make sure we use a vgpr up-front here so we don't rely on something later.
2025-11-15[AMDGPU] Delete some dead code (NFC) (#167891)Sergei Barannikov
`getLanesWithProperty()` is called with virtual registers only.
2025-11-14[AMDGPU] Fix wrong MSB encoding for V_FMAMK instructions (#168107)Shilei Tian
These instructions use `src0`, `imm`, `src1` as operands. Fixes SWDEV-566579.
2025-11-14[AMDGPU][MC] Disallow nogds in ds_gws_* instructions (#166873)Jun Wang
The ds_gws_* instructions require gds as an operand. However, when nogds is given, it is treated the same as gds. This patch fixes this to disallow nogds.
2025-11-14[AMDGPU] Use std::variant in ArgDescriptor. (#167992)Craig Topper
This replaces the 2 bool flags and the anonymous union. This also removes an implicit conversion from Register to unsigned and a call to MCRegister::id(). The ArgDescriptor constructor was always assigning the union through the MCRegister field even for stack offsets. The change to SIMachineFunctionInfo.h fixes a case where getRegister was being called on an unset ArgDescriptor. Since it was only this case, it seemed cleaner to fix it at the caller. The other option would be to make getRegister() return MCRegister() for an unset ArgDescriptor.
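The shape of such a change can be sketched as follows. This is not the LLVM class: `ArgSlot`, `Reg`, and `StackOffset` are hypothetical names, and the sketch only shows how one `std::variant` can replace a pair of bool flags plus an anonymous union while making the "unset" state explicit.

```cpp
#include <cassert>
#include <variant>

struct Reg { unsigned Id; };            // hypothetical register handle
struct StackOffset { unsigned Bytes; }; // hypothetical stack slot offset

// Hypothetical descriptor: unset, a register, or a stack offset.
class ArgSlot {
  std::variant<std::monostate, Reg, StackOffset> Val;

public:
  ArgSlot() = default; // starts out unset (std::monostate)

  static ArgSlot createRegister(Reg R) {
    ArgSlot A;
    A.Val = R;
    return A;
  }
  static ArgSlot createStack(StackOffset S) {
    ArgSlot A;
    A.Val = S;
    return A;
  }

  bool isSet() const { return !std::holds_alternative<std::monostate>(Val); }
  bool isRegister() const { return std::holds_alternative<Reg>(Val); }
  bool isStack() const { return std::holds_alternative<StackOffset>(Val); }

  // Callers must check the kind first; asking an unset slot is a bug.
  Reg getRegister() const {
    assert(isRegister() && "slot does not hold a register");
    return std::get<Reg>(Val);
  }
  StackOffset getStackOffset() const {
    assert(isStack() && "slot does not hold a stack offset");
    return std::get<StackOffset>(Val);
  }
};
```

With this layout, a stack offset can no longer be written through the register field by accident, and reading a register from an unset slot trips an assertion instead of returning stale data.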
2025-11-14AMDGPU: Fix verifier error when waterfall call target is in AV register (#168017)Matt Arsenault
2025-11-14AMDGPU: Constrain readfirstlane operand when writing to m0 (#168004)Matt Arsenault
Fixes another verifier error after introducing AV registers. Also fixes not clearing the subregister index if there was one.
2025-11-14AMDGPU: Constrain readfirstlane operand to vgpr_32 (#168001)Matt Arsenault
2025-11-14[TableGen] Split *GenRegisterInfo.inc. (#167700)Ivan Kosarev
Reduces memory usage when compiling backend sources, most notably for AMDGPU by ~98 MB per source on average. AMDGPUGenRegisterInfo.inc is tens of megabytes in size now, and is even larger downstream. At the same time, it is included in nearly all backend sources, typically just for a small portion of its content, resulting in compilation being unnecessarily memory-hungry, which in turn stresses buildbots and wastes their resources. Splitting .inc files also helps avoid extra ccache misses where changes in .td files don't cause changes in all parts of what previously was a single .inc file. It is thought that rather than building on top of the current single-output-file design of TableGen, e.g., using `split-file`, it would be preferable to recognise the need for multi-file outputs and give it proper first-class support directly in TableGen.
2025-11-14[AMDGPU] Ensure SCC is not live before shrinking to s_bitset* (#167907)LU-JOHN
Ensure SCC is not live before shrinking s_and*/s_or* instructions to s_bitset*. --------- Signed-off-by: John Lu <John.Lu@amd.com>
2025-11-14[AMDGPU][True16][CodeGen] lower flat_d16_saddr_t16 to saddr inst (#166603)Brox Chen
In true16 mode, D16 insts are lowered to a pseudo t16 first, and then lowered to hi/lo insts in MC lowering using the D16T16 table. However, the D16T16 table selects both `flat_load_d16_t16` and `flat_load_d16_t16_saddr` to `flat_load_d16_(hi)_b16`, which is wrong. The saddr pseudo inst `flat_load_d16_t16_saddr` should be selected to the saddr hi/lo inst. The global/scratch cases are correct; flat seems to be the only one with this issue.