path: root/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
Age | Commit message | Author
2025-11-23 | Revert "[RegAlloc] Fix the terminal rule check for interfere with DstReg (#168661)" | Aiden Grossman
This reverts commit 0859ac5866a0228f5607dd329f83f4a9622dedcc. This caused a couple test failures, likely due to a mid-air collision. Reverting for now to get the tree back to green and allow the original author to run UTC/friends and verify the output.
2025-11-23 | [RegAlloc] Fix the terminal rule check for interfere with DstReg (#168661) | hstk30-hw
This may be a bug introduced by commit 6749ae36b4a33769e7a77cf812d7cd0a908ae3b9, and it has been present ever since. In this case, `OtherReg` always overlaps with `DstReg` because both come from the `Copy`.
2025-11-14 | AMDGPU: Select vector reg class for divergent build_vector (#168169) | Matt Arsenault
The main improvement is to the mfma tests. There are some mild regressions scattered around, and a few major ones. The worst regressions are in some of the bitcast tests; these are cases where the SGPR argument list runs out and uses VGPRs, and the copies-from-VGPR are misidentified as divergent. Most of the shufflevector tests are also regressions. These end up with cleaner MIR, but then get poor regalloc decisions.
2025-11-14 | AMDGPU: Use v_mov_b32 to implement divergent zext i32->i64 (#168166) | Matt Arsenault
Some cases are relying on SIFixSGPRCopies to force VALU reg_sequence inputs with SGPR inputs to use all VGPR inputs, but this doesn't always happen if the reg_sequence isn't invalid. Make sure we use a vgpr up-front here so we don't rely on something later.
2025-11-11 | AMDGPU: Relax shouldCoalesce to allow more register tuple widening (#166475) | Matt Arsenault
Allow widening up to 128-bit registers or if the new register class is at least as large as one of the existing register classes. This was artificially limiting. In particular this was doing the wrong thing with sequences involving copies between VGPRs and AV registers. Nearly all test changes are improvements. The coalescer does not just widen registers out of nowhere. If it's trying to "widen" a register, it's generally packing a register into an existing register tuple, or in a situation where the constraints imply the wider class anyway. 067a11015 addressed the allocation failure concern by rejecting coalescing if there are no available registers. The original change in a4e63ead4b didn't include a realistic testcase to judge if this is harmful for pressure. I would expect any issues from this to be of garden variety subreg handling issue. We could use more dynamic state information here if it really is an issue. I get the best results by removing this override completely. This is a smaller step for patch splitting purposes.
2025-10-28 | [AMDGPU] Rework GFX11 VALU Mask Write Hazard (#138663) | Carl Ritson
Apply additional counter waits to address VALU writes to SGPRs. Rework expiry detection and apply wait coalescing to mitigate some of the additional waits.
2025-10-22 | [AMDGPU] Reland "Remove redundant s_cmp_lg_* sX, 0" (#164201) | LU-JOHN
Reland PR https://github.com/llvm/llvm-project/pull/162352. Fix by excluding SI_PC_ADD_REL_OFFSET from instructions that set SCC = DST!=0. Passes check-libc-amdgcn-amd-amdhsa now.

Distribution of instructions that allowed a redundant S_CMP to be deleted in check-libc-amdgcn-amd-amdhsa test:

```
S_AND_B32      485
S_AND_B64      47
S_ANDN2_B32    42
S_ANDN2_B64    277492
S_CSELECT_B64  17631
S_LSHL_B32     6
S_OR_B64       11
```

---------

Signed-off-by: John Lu <John.Lu@amd.com>
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-10-18 | Revert "[AMDGPU] Remove redundant s_cmp_lg_* sX, 0 " (#164116) | Jan Patrick Lehr
Reverts llvm/llvm-project#162352

Broke our buildbot: https://lab.llvm.org/buildbot/#/builders/10/builds/15674

To reproduce

    cd llvm-project
    cmake -S llvm -B thebuild -C offload/cmake/caches/AMDGPULibcBot.cmake -GNinja
    cd thebuild
    ninja
    ninja check-libc-amdgcn-amd-amdhsa
2025-10-18 | [AMDGPU] Remove redundant s_cmp_lg_* sX, 0 (#162352) | LU-JOHN
Remove a redundant s_cmp_lg_* sX, 0 when a SALU instruction has already set SCC to sX!=0.

---------

Signed-off-by: John Lu <John.Lu@amd.com>
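The idea behind this peephole can be illustrated with a toy model. The sketch below is not the LLVM implementation: the `Inst` type, the opcode set, and the adjacency-only matching are simplifying assumptions made for illustration. It drops an `s_cmp_lg sX, 0` only when the instruction immediately before it wrote `sX` and belongs to a hand-picked set of opcodes assumed to set SCC to `dst != 0`.

```python
from typing import List, NamedTuple, Optional, Tuple


class Inst(NamedTuple):
    """Hypothetical, highly simplified machine instruction."""
    opcode: str
    dst: Optional[str]
    srcs: Tuple


# Assumption: a small subset of SALU opcodes that set SCC = (dst != 0).
SETS_SCC_TO_DST_NONZERO = {
    "S_AND_B32", "S_AND_B64", "S_ANDN2_B32", "S_ANDN2_B64",
    "S_OR_B64", "S_LSHL_B32",
}
COMPARES = {"S_CMP_LG_U32", "S_CMP_LG_U64"}


def drop_redundant_cmp(insts: List[Inst]) -> List[Inst]:
    """Remove 's_cmp_lg sX, 0' when the previous kept instruction defined sX
    and already left SCC equal to (sX != 0)."""
    out: List[Inst] = []
    for inst in insts:
        prev = out[-1] if out else None
        if (prev is not None
                and inst.opcode in COMPARES
                and prev.opcode in SETS_SCC_TO_DST_NONZERO
                and inst.srcs == (prev.dst, 0)):
            continue  # SCC already holds the compare result
        out.append(inst)
    return out


program = [
    Inst("S_AND_B32", "s0", ("s1", "s2")),
    Inst("S_CMP_LG_U32", None, ("s0", 0)),
    Inst("S_CBRANCH_SCC1", None, ("bb1",)),
]
optimized = drop_redundant_cmp(program)
```

A real pass also has to prove that neither SCC nor `sX` is clobbered between the two instructions; the model sidesteps that by requiring them to be adjacent.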
2025-09-06 | AMDGPU: Allow folding multiple uses of some immediates into copies (#154757) | Matt Arsenault
In some cases this will require an avoidable re-defining of a register, but it works out better most of the time. Also allow folding 64-bit immediates into subregister extracts, unless it would break an inline constant. We could be more aggressive here, but this set of conditions seems to do a reasonable job without introducing too many regressions.
2025-06-27 | [AMDGPU] Fix bad removal of s_delay_alu (#145728) | Ana Mihajlovic
The instructionWaitsForSGPRWrites function covers all SALU instructions, including ones like s_waitcnt that do not read from an SGPR. As a result, s_delay_alu instructions are removed in cases like VALU->SGPR->VALU, which causes a performance regression. This change modifies the function so that it also checks whether the instruction reads an SGPR.
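The change in the predicate can be sketched as a before/after pair. This is a toy model under stated assumptions: the `Inst` type, the register naming (`sN` for SGPRs), and both function names are hypothetical and only mirror the behavior described in the commit message.

```python
from typing import NamedTuple, Tuple


class Inst(NamedTuple):
    """Hypothetical instruction: a name, a SALU flag, and the registers it reads."""
    name: str
    is_salu: bool
    reads: Tuple[str, ...]


def is_sgpr(reg: str) -> bool:
    # Assumption for this sketch: SGPRs are spelled "s<number>".
    return reg.startswith("s") and reg[1:].isdigit()


def waits_for_sgpr_writes_before(inst: Inst) -> bool:
    # Old behavior as described: every SALU instruction qualifies.
    return inst.is_salu


def waits_for_sgpr_writes_after(inst: Inst) -> bool:
    # New behavior as described: the instruction must also read an SGPR.
    return inst.is_salu and any(is_sgpr(r) for r in inst.reads)


s_waitcnt = Inst("s_waitcnt", is_salu=True, reads=())
s_add = Inst("s_add_u32", is_salu=True, reads=("s0", "s1"))
```

With the old predicate, `s_waitcnt` counts as waiting for SGPR writes even though it reads none; with the new one it does not, so it no longer justifies dropping a delay.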
2025-05-28 | MachineScheduler: Reset next cluster candidate for each node (#139513) | Ruiling, Song
When a node is picked, we should reset its next cluster candidate to null before releasing its successors/predecessors.
2025-05-16 | [MachineCopyPropagation] Make use of lane mask info in basic block liveins (#140248) | Jay Foad
2025-03-28 | [AMDGPU] Unused sdst writing to null (#133229) | Ana Mihajlovic
Make an unused sdst write to null to avoid a false VALU->SALU dependency stall. This requires using the VOP3 encoding.
2025-03-13 | Reland "[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212)" (#131111) | Ana Mihajlovic
We have a VALU->SGPR->SALU (VALU writing to SGPR and SALU reading from it). When VALU is issued, it increments internal counter VA_SDST used to track use of this SGPR. SALU will not issue until VA_SDST is zero, that is when VALU is finished writing. Therefore, delays added by s_delay_alu are not needed in this situation.
2025-03-13 | AMDGPU: Replace undef global initializers in tests with poison (#131051) | Matt Arsenault
2025-03-12 | Revert "[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212)" | Kazu Hirata
This reverts commit 71582c6667a6334c688734cae628e906b3c1ac1d. Multiple buildbot failures have been reported: https://github.com/llvm/llvm-project/pull/127212
2025-03-12 | [AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212) | Ana Mihajlovic
We have a VALU->SGPR->SALU (VALU writing to SGPR and SALU reading from it). When VALU is issued, it increments internal counter VA_SDST used to track use of this SGPR. SALU will not issue until VA_SDST is zero, that is when VALU is finished writing. Therefore, delays added by s_delay_alu are not needed in this situation.
2025-02-01 | [MachineScheduler] Fix physreg dependencies of ExitSU (#123541) | Sergei Barannikov
Providing the correct operand index allows addPhysRegDataDeps to compute the correct latency. Pull Request: https://github.com/llvm/llvm-project/pull/123541
2025-01-30 | PeepholeOpt: Do not add subregister indexes to reg_sequence operands (#124111) | Matt Arsenault
Given the rest of the pass just gives up when it needs to compose subregisters, folding a subregister extract directly into a reg_sequence is counterproductive. Later fold attempts in the function will give up on the subregister operand, preventing looking up through the reg_sequence. It may still be profitable to do these folds if we start handling the composes. There are some test regressions, but this mostly looks better.
2024-11-26 | AMDGPU: Remove some -verify-machineinstrs from tests (#117736) | Matt Arsenault
We should leave these for EXPENSIVE_CHECKS builds. Some of these were near the top of slowest tests.
2024-11-08 | Reapply "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)" | Shilei Tian
This reverts commit ca33649abe5fad93c57afef54e43ed9b3249cd86.
2024-11-08 | Revert "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)" | Shilei Tian
This reverts commit e215a1e27d84adad2635a52393621eb4fa439dc9 as it broke both hip and openmp buildbots.
2024-11-08 | [AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403) | Shilei Tian
2024-11-05 | [AMDGPU] Extend type support for update_dpp intrinsic (#114597) | Stanislav Mekhanoshin
We can split 64-bit DPP as a post-RA pseudo if control values are supported, but cannot handle other types.
2024-10-21 | [AMDGPU] Skip VGPR deallocation for waveslot limited kernels (#112765) | Stanislav Mekhanoshin
MSG_DEALLOC_VGPRS slows down very small waveslot limited kernels. It has been identified that this message is only really needed for VGPR limited kernels. A kernel becomes VGPR limited if the total number of VGPRs per SIMD divided by the number of used VGPRs is less than the number of wave slots.
2024-09-23 | AMDGPU: Fix implicit vcc def to vcc_lo on wave32 targets (#109514) | Matt Arsenault
2024-09-11 | [AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive (#107889) | Jay Foad
Always generate v_cndmask_b32 instead of modifying exec around v_mov_b32. This is expected to be faster because modifying exec generally causes pipeline stalls.
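Why a single select instruction suffices can be seen from a lane-level model. The sketch below is an illustration only: it treats a wavefront as a Python list and the exec mask as an integer bitmask, and the function name is hypothetical. It models set.inactive as a per-lane select, which is the kind of operation v_cndmask_b32 performs.

```python
from typing import List


def set_inactive(value: List[int], inactive_value: List[int], exec_mask: int) -> List[int]:
    """Per-lane select: active lanes keep `value`, inactive lanes get `inactive_value`.

    Bit `lane` of `exec_mask` is 1 when that lane is active.
    """
    return [
        v if (exec_mask >> lane) & 1 else iv
        for lane, (v, iv) in enumerate(zip(value, inactive_value))
    ]


# Lanes 0 and 2 are active; lanes 1 and 3 receive the inactive value.
result = set_inactive([1, 2, 3, 4], [0, 0, 0, 0], 0b0101)
```

Because the whole result is computed by one select over the mask, no write to exec is needed, which is the property the commit message relies on.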
2024-09-05 | [AMDGPU] V_SET_INACTIVE optimizations (#98864) | Carl Ritson
Optimize V_SET_INACTIVE by allowing it to run in WWM. Hence WWM sections are not broken up for inactive lane setting. WWM V_SET_INACTIVE can typically be lowered to V_CNDMASK. Some cases require the use of exec manipulation and V_MOV as in the previous code. GFX9 sees a slight instruction count increase in edge cases due to its smaller constant bus. Additionally, avoid introducing exec manipulation and V_MOVs where a source of V_SET_INACTIVE is the destination. This is a common pattern as WWM register pre-allocation often assigns the same register.
2024-09-04 | [AMDGPU] Improve codegen for GFX10+ DPP reductions and scans (#107108) | Jay Foad
Use poison for an unused input to the permlanex16 intrinsic, to improve register allocation and avoid an unnecessary v_mov instruction.
2024-07-26 | [AMDGPU] Remove -wavefrontsize32 and -wavefrontsize64 from GFX10+ tests (NFC) (#100711) | Changpeng Fang
They are no longer needed after the patch: [AMDGPU] Remove wavefrontsize feature from GFX10: https://github.com/llvm/llvm-project/pull/98400 The exception is when "target-features" is set to "+wavefrontsize32" or "+wavefrontsize64": we still need to remove a wavefrontsize feature before adding a different one, to make sure only one of them is present.
2024-07-23 | [AMDGPU] Codegen support for constrained multi-dword sloads (#96163) | Christudasan Devadasan
For targets that support the xnack replay feature (gfx8+), the multi-dword scalar loads shouldn't clobber any register that holds the src address. The constrained versions of the scalar loads have the early clobber flag attached to the dst operand to restrict RA from re-allocating any of the src regs for its dst operand.
2024-07-15 | [AMDGPU] Enable atomic optimizer for divergent i64 and double values (#96934) | Vikram Hegde
2024-07-15 | Reapply "AMDGPU: Move attributor into optimization pipeline (#83131)" and follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851) | Matt Arsenault
This reverts commit adaff46d087799072438dd744b038e6fd50a2d78. Drop the -O3 checks from default-attributes.hip. I don't know why they are different on some bots but reverting this is far too disruptive.
2024-07-14 | Revert "AMDGPU: Move attributor into optimization pipeline (#83131)" and follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851) | dyung
This reverts commits 677cc15e0ff2e0e6aa30538eb187990a6a8f53c0 and 78bc1b64a6dc3fb6191355a5e1b502be8b3668e7.

The test CodeGenHIP/default-attributes.hip is failing on multiple bots even after the attempted fix including the following:
- https://lab.llvm.org/buildbot/#/builders/3/builds/1473
- https://lab.llvm.org/buildbot/#/builders/65/builds/1380
- https://lab.llvm.org/buildbot/#/builders/161/builds/595
- https://lab.llvm.org/buildbot/#/builders/154/builds/1372
- https://lab.llvm.org/buildbot/#/builders/133/builds/1547
- https://lab.llvm.org/buildbot/#/builders/81/builds/755
- https://lab.llvm.org/buildbot/#/builders/40/builds/570
- https://lab.llvm.org/buildbot/#/builders/13/builds/748
- https://lab.llvm.org/buildbot/#/builders/12/builds/1845
- https://lab.llvm.org/buildbot/#/builders/11/builds/1695
- https://lab.llvm.org/buildbot/#/builders/190/builds/1829
- https://lab.llvm.org/buildbot/#/builders/193/builds/962
- https://lab.llvm.org/buildbot/#/builders/23/builds/991
- https://lab.llvm.org/buildbot/#/builders/144/builds/2256
- https://lab.llvm.org/buildbot/#/builders/46/builds/1614

These bots have been broken for a day, so reverting to get everything back to green.
2024-07-14 | AMDGPU: Move attributor into optimization pipeline (#83131) | Matt Arsenault
Removing it from the codegen pipeline induces a lot of test churn because llc is no longer optimizing out implicit arguments to kernels. Mostly mechanical, but there are some creative test updates. I preferred to take the changes as-is in tests where the ABI isn't relevant. In cases where it's more relevant, or the optimize out logic was too ingrained in the test, I pre-run the optimization. Some cases manually add attributes to disable inputs.
2024-07-08 | [AMDGPU] Cleanup bitcast spam in atomic optimizer (#96933) | Vikram Hegde
2024-04-04 | [AMDGPU] Combine or remove redundant waitcnts at the end of each MBB (#87539) | Jay Foad
Call generateWaitcnt unconditionally at the end of SIInsertWaitcnts::insertWaitcntInBlock. Even if we don't need to generate a new waitcnt instruction it has the effect of combining or removing redundant waitcnts that were already present. Tests show various small improvements in waitcnt placement.
2024-01-16 | [AMDGPU,test] Change llc -march= to -mtriple= (#75982) | Fangrui Song
Similar to 806761a7629df268c8aed49657aeccffa6bca449.

For IR files without a target triple, -mtriple= specifies the full target triple while -march= merely sets the architecture part of the default target triple, leaving a target triple which may not make sense, e.g. amdgpu-apple-darwin. Therefore, -march= is error-prone and not recommended for tests without a target triple. The issue has been benign as we recognize $unknown-apple-darwin as ELF instead of rejecting it outrightly.

This patch changes AMDGPU tests to not rely on the default OS/environment components. Tests that need fixes are not changed:

```
LLVM :: CodeGen/AMDGPU/fabs.f64.ll
LLVM :: CodeGen/AMDGPU/fabs.ll
LLVM :: CodeGen/AMDGPU/floor.ll
LLVM :: CodeGen/AMDGPU/fneg-fabs.f64.ll
LLVM :: CodeGen/AMDGPU/fneg-fabs.ll
LLVM :: CodeGen/AMDGPU/r600-infinite-loop-bug-while-reorganizing-vector.ll
LLVM :: CodeGen/AMDGPU/schedule-if-2.ll
```
2023-12-25 | [LLVM] Make use of s_flbit_i32_b64 and s_ff1_i32_b64 (#75158) | Acim Maravic
Update DAG ISel to support the 64-bit versions S_FF1_I32_B64 and S_FLBIT_I32_B64.

---------

Co-authored-by: Acim Maravic <Acim.Maravic@amd.com>
2023-12-15 | [AMDGPU][SIInsertWaitcnts] Do not add s_waitcnt when the counters are known to be 0 already (#72830) | Pierre van Houtryve
Co-authored-by: Juan Manuel MARTINEZ CAAMAÑO <juamarti@amd.com>
2023-10-30 | [AMDGPU] Select 64-bit imm moves if can be encoded as 32 bit operand (#70395) | Stanislav Mekhanoshin
This allows folding of 64-bit operands if they fit into 32 bits. Fixes https://github.com/llvm/llvm-project/issues/67781
2023-08-30 | [AMDGPU] Support FAdd/FSub global atomics in AMDGPUAtomicOptimizer. | Pravin Jagtap
Reduction and Scan are implemented using the `Iterative` and `DPP` strategies for the `float` type.

Reviewed By: arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D156301
2023-07-19 | [AMDGPU] Insert s_nop before s_sendmsg sendmsg(MSG_DEALLOC_VGPRS) | Jay Foad
Differential Revision: https://reviews.llvm.org/D155681
2023-06-30 | Revert "[AMDGPU] Mark mbcnt as convergent" | Sameer Sahasrabuddhe
This reverts commit 37114036aa57e53217a57afacd7f47b36114edfb.

The output of mbcnt does not depend on other active lanes, and hence it is not convergent. The original change was made as a possible fix for https://github.com/ROCm-Developer-Tools/HIP/issues/3172 But changing mbcnt does not fix that issue.

Reviewed By: ruiling, foad, yaxunl
Differential Revision: https://reviews.llvm.org/D153953
2023-06-22 | [AMDGPU] Switch to the new cl option amdgpu-atomic-optimizer-strategy. | Pravin Jagtap
Atomic optimizer is turned on by default through D152649. This patch removes the usage of the old command line option amdgpu-atomic-optimizations and transfers the responsibility to `amdgpu-atomic-optimizer-strategy`. We can safely remove the old option when LLPC removes all of its usage.

Reviewed By: foad, arsenm, #amdgpu, cdevadas
Differential Revision: https://reviews.llvm.org/D153007
2023-06-09 | [AMDGPU] Iterative scan implementation for atomic optimizer. | Pravin Jagtap
This patch provides an alternative implementation to DPP for Scan Computations. The alternative implementation iterates over all active lanes of the wavefront using llvm.cttz and performs the following steps:
1. Read the value that needs to be atomically incremented using the llvm.amdgcn.readlane intrinsic.
2. Accumulate the result.
3. Update the scan result using the llvm.amdgcn.writelane intrinsic if intermediate scan results are needed later in the kernel.

Reviewed By: arsenm, cdevadas
Differential Revision: https://reviews.llvm.org/D147408
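The loop described in the iterative scan commit can be modeled on the host. The sketch below is a toy under stated assumptions, not the pass itself: the wavefront is a Python list, the exec mask is an integer, the function name is hypothetical, and the scan is assumed to be exclusive (each active lane receives the sum of the active lanes before it). The lowest-set-bit computation stands in for llvm.cttz, the list read for readlane, and the list write for writelane.

```python
from typing import List, Tuple


def iterative_scan(values: List[int], exec_mask: int, identity: int = 0) -> Tuple[int, List[int]]:
    """Return (reduction over active lanes, per-lane exclusive scan)."""
    scan = [identity] * len(values)
    acc = identity
    remaining = exec_mask
    while remaining:
        # cttz: index of the lowest active lane still to be processed.
        lane = (remaining & -remaining).bit_length() - 1
        lane_value = values[lane]      # step 1: readlane
        scan[lane] = acc               # step 3: writelane of the running value
        acc = acc + lane_value         # step 2: accumulate
        remaining &= remaining - 1     # clear the lane just processed
    return acc, scan


# Lanes 0, 1 and 3 are active; lane 2 is skipped.
total, per_lane = iterative_scan([1, 2, 3, 4], 0b1011)
```

The loop runs once per active lane, so its cost scales with the number of set bits in the mask rather than with the wavefront size.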
2022-12-19 | [AMDGPU] Convert some tests to opaque pointers (NFC) | Nikita Popov
2022-11-14 | [MachineCSE] Allow CSE for instructions with ignorable operands | Guozhi Wei
Ignorable operands don't impact an instruction's behavior, so we can safely do CSE on the instruction. It is split from D130919.

It has a big impact on some AMDGPU test cases. For example in atomic_optimizations_raw_buffer.ll, when trying to check if the following instruction can be CSEed

    %37:vgpr_32 = V_MOV_B32_e32 0, implicit $exec

Function isCallerPreservedOrConstPhysReg is called on operand "implicit $exec". This function is implemented as

    -  return TRI.isCallerPreservedPhysReg(Reg, MF) ||
    +  return TRI.isCallerPreservedPhysReg(Reg, MF) || TII.isIgnorableUse(MO) ||
              (MRI.reservedRegsFrozen() && MRI.isConstantPhysReg(Reg));

Both TRI.isCallerPreservedPhysReg and MRI.isConstantPhysReg return false on this operand, so isCallerPreservedOrConstPhysReg is also false, which causes LLVM to fail to CSE this instruction. With this patch TII.isIgnorableUse returns true for the operand $exec, so isCallerPreservedOrConstPhysReg also returns true, which causes this instruction to be CSEed with the previous instruction

    %14:vgpr_32 = V_MOV_B32_e32 0, implicit $exec

So I got a different result from here. AMDGPU's implementation of isIgnorableUse is

    bool SIInstrInfo::isIgnorableUse(const MachineOperand &MO) const {
      // Any implicit use of exec by VALU is not a real register read.
      return MO.getReg() == AMDGPU::EXEC && MO.isImplicit() &&
             isVALU(*MO.getParent()) && !resultDependsOnExec(*MO.getParent());
    }

Since the operand $exec is not a real register read, my understanding is that it's reasonable to do CSE on such instructions. Because more instructions are CSEed, fewer instructions are generated for these tests.

Differential Revision: https://reviews.llvm.org/D137222
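The effect of the extra `isIgnorableUse` term can be reproduced with a small truth-table model. This sketch is hypothetical throughout: the `Operand` record flattens the queries made on the operand, its parent instruction, and the register info into plain booleans, and the Python function names only mirror the C++ ones quoted in the commit message.

```python
from typing import NamedTuple


class Operand(NamedTuple):
    """Hypothetical flattened view of a physical-register use operand."""
    reg: str
    is_implicit: bool
    parent_is_valu: bool
    result_depends_on_exec: bool
    caller_preserved: bool = False
    constant_phys_reg: bool = False


def is_ignorable_use(mo: Operand) -> bool:
    # Mirrors the quoted AMDGPU rule: an implicit exec use on a VALU
    # instruction whose result does not depend on exec.
    return (mo.reg == "EXEC" and mo.is_implicit
            and mo.parent_is_valu and not mo.result_depends_on_exec)


def ok_before_patch(mo: Operand, reserved_regs_frozen: bool = True) -> bool:
    return mo.caller_preserved or (reserved_regs_frozen and mo.constant_phys_reg)


def ok_after_patch(mo: Operand, reserved_regs_frozen: bool = True) -> bool:
    return (mo.caller_preserved or is_ignorable_use(mo)
            or (reserved_regs_frozen and mo.constant_phys_reg))


# The "implicit $exec" operand of a V_MOV_B32_e32 from the example.
exec_use = Operand("EXEC", is_implicit=True, parent_is_valu=True,
                   result_depends_on_exec=False)
```

For `exec_use` the old predicate is false and the new one is true, which is what lets the two identical V_MOV_B32_e32 instructions be merged.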
2022-10-06 | [Sink] Allow sinking of invariant loads across critical edges | Carl Ritson
Invariant loads can always be sunk.

Reviewed By: foad, arsenm
Differential Revision: https://reviews.llvm.org/D135133