llvm-project.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author
2025-11-22	[AMDGPU] Enable serializing of allocated preload kernarg SGPRs info (#168374)	tyb0807
	- Support serialization of the number of allocated preload kernarg SGPRs - Support serialization of the first preload kernarg SGPR allocated Together they enable reconstructing correctly MIR with preload kernarg SGPRs.
2025-09-16	[AMDGPU] Set TGID_EN_X/Y/Z when cluster ID intrinsics are used (#159120)	Shilei Tian
	Hardware initializes a single value in ttmp9 which is either the workgroup ID X or cluster ID X. Most of this patch is a refactoring to use a single `PreloadedValue` enumerator for this value, instead of two enumerators `WORKGROUP_ID_X` and `CLUSTER_ID_X` referring to the same value. This makes it simpler to have a single attribute `amdgpu-no-workgroup-id-x` indicating that this value is not used, which in turns sets the TGID_EN_X bit appropriately to tell the hardware whether to initialize it. All of the above applies to Y and Z similarly. Fixes: LWPSCGFX13-568 Co-authored-by: Jay Foad <jay.foad@amd.com>
2025-09-03	[AMDGPU] Remove most uses of /dev/null in tests (#156630)	Jay Foad
	Using options like -filetype=null instead should allow tools to save some work by not generating any output.
2025-08-08	[AMDGPU] AsmPrinter: Unify arg handling (#151672)	Diana Picus
	When computing the number of registers required by entry functions, the `AMDGPUAsmPrinter` needs to take into account both the register usage computed by the `AMDGPUResourceUsageAnalysis` pass, and the number of registers initialized by the hardware. At the moment, the way it computes the latter is different for graphics vs compute, due to differences in the implementation. For kernels, all the information needed is available in the `SIMachineFunctionInfo`, but for graphics shaders we would iterate over the `Function` arguments in the `AMDGPUAsmPrinter`. This pretty much repeats some of the logic from instruction selection. This patch introduces 2 new members to `SIMachineFunctionInfo`, one for SGPRs and one for VGPRs. Both will be computed during instruction selection and then used during `AMDGPUAsmPrinter`, removing the need to refer to the `Function` when printing assembly. This patch is NFC except for the fact that we now add the extra SGPRs (VCC, XNACK etc) to the number of SGPRs computed for graphics entry points. I'm not sure why these weren't included before. It would be nice if someone could confirm if that was just an oversight or if we have some docs somewhere that I haven't managed to find. Only one test is affected (its SGPR usage increases because we now take into account the XNACK registers).
2025-07-29	[AMDGPU] Add NoaliasAddrSpace to AAMDnodes (#149247)	Shoreshen
	This is the following PR of https://github.com/llvm/llvm-project/pull/136553 which calculate NoaliasAddrSpace. This PR carries the info calculated into MIR by adding it into AAMDnodes
2025-07-21	[AMDGPU] ISel & PEI for whole wave functions (#145858)	Diana Picus
	Whole wave functions are functions that will run with a full EXEC mask. They will not be invoked directly, but instead will be launched by way of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in a future patch). These functions are meant as an alternative to the `llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics. Whole wave functions will set EXEC to -1 in the prologue and restore the original value of EXEC in the epilogue. They must have a special first argument, `i1 %active`, that is going to be mapped to EXEC. They may have either the default calling convention or amdgpu_gfx. The inactive lanes need to be preserved for all registers used, active lanes only for the CSRs. At the IR level, arguments to a whole wave function (other than `%active`) contain poison in their inactive lanes. Likewise, the return value for the inactive lanes is poison. This patch contains the following work: * 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return a SReg_1 representing `%active`, which needs to be passed into SI_WHOLE_WAVE_FUNC_RETURN. * SelectionDAG support for generating these 2 new pseudos and the special handling of %active. Since the return may be in a different basic block, it's difficult to add the virtual reg for %active to SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF which is later replaced via a custom inserter. * Expansion of the 2 pseudos during prolog/epilog insertion. PEI also marks any used VGPRs as WWM registers, which are then spilled and restored with the usual logic. Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic and a lot of optimization work (especially in order to reduce spills around function calls). --------- Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com> Co-authored-by: Shilei Tian <i@tianshilei.me>
2025-07-15	[AMDGPU] gfx1250 64-bit relocations and fixups (#148951)	Stanislav Mekhanoshin

2025-06-24	[AMDGPU] Replace dynamic VGPR feature with attribute (#133444)	Diana Picus
	Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget feature, as requested in #130030.
2025-06-06	[MIRParser] Report register class errors in a deterministic order (#142928)	Jay Foad

2025-05-08	[CodeGen] Parse nusw flag (#138856)	Pierre van Houtryve
	Fixes #127781
2025-04-15	[AMDGPU] Remove the AnnotateKernelFeatures pass (#130198)	Jun Wang
	Previously the AnnotateKernelFeatures pass infers two attributes: amdgpu-calls and amdgpu-stack-objects, which are used to help determine if flat scratch init is allowed. PR #118907 created the amdgpu-no-flat-scratch-init attribute. Continuing with that work, this patch makes use of this attribute to determine flat scratch init, replacing amdgpu-calls and amdgpu-stack-objects. This also leads to the removal of the AnnotateKernelFeatures pass.
2025-03-19	[AMDGPU] Allocate scratch space for dVGPRs for CWSR (#130055)	Diana Picus
	The CWSR trap handler needs to save and restore the VGPRs. When dynamic VGPRs are in use, the fixed function hardware will only allocate enough space for one VGPR block. The rest will have to be stored in scratch, at offset 0. This patch allocates the necessary space by: - generating a prologue that checks at runtime if we're on a compute queue (since CWSR only works on compute queues); for this we will have to check the ME_ID bits of the ID_HW_ID2 register - if that is non-zero, we can assume we're on a compute queue and initialize the SP and FP with enough room for the dynamic VGPRs - forcing all compute entry functions to use a FP so they can access their locals/spills correctly (this isn't ideal but it's the quickest to implement) Note that at the moment we allocate enough space for the theoretical maximum number of VGPRs that can be allocated dynamically (for blocks of 16 registers, this will be 128, of which we subtract the first 16, which are already allocated by the fixed function hardware). Future patches may decide to allocate less if they can prove the shader never allocates that many blocks. Also note that this should not affect any reported stack sizes (e.g. PAL backend_stack_size etc).
2025-03-17	AMDGPU: Migrate more tests away from undef (#131314)	Matt Arsenault
	andorbitset.ll is interesting since it directly depends on the difference between poison and undef. Not sure it's useful to keep the version using poison, I assume none of this code makes it to codegen. si-spill-cf.ll was also a nasty case, which I doubt has been reproducing its original issue for a very long time. I had to reclaim an older version, replace some of the poison uses, and run simplify-cfg. There's a very slight change in the final CFG with this, but final the output is approximately the same as it used to be.
2025-03-08	[AMDGPU] Change SGPR layout to striped caller/callee saved (#127353)	Shilei Tian
	This PR updates the SGPR layout to a striped caller/callee-saved design, similar to the VGPR layout. To ensure that s30-s31 (return address), s32 (stack pointer), s33 (frame pointer), and s34 (base pointer) remain callee-saved, the striped layout starts from s40, with a stripe width of 8. The last stripe is 10 wide instead of 8 to avoid ending with a 2-wide stripe. Fixes #113782.
2025-01-23	[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748)	Lucas Ramirez
	Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies. `getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither of the two of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes` since it is the range of workgroup sizes that produces the range of achievable occupancies. Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies i.e., sometimes workgroup sizes inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; heavy documentation explains the rationale behind the sometimes relatively obscure calculations. As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024]. - A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling as much as possible the remaining 40-36=4 wave slots without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU. - Similarly, a group of 1024 items requires 16 waves per group. Only 2 groups made up of 16 waves each can fit fully on a CU ay any given time, for a total of 32 waves on the CU, or 8 per EU. However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU. Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10]. Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary. Fixes #118220.
2025-01-17	[AMDGPU] Fix printing hasInitWholeWave in mir (#123232)	Stanislav Mekhanoshin

2024-12-18	[AMDGPU] Make max dwords of memory cluster configurable (#119342)	Ruiling, Song
	We find it helpful to increase the value for graphics workload. Make it configurable so we can experiment with a different value.
2024-11-08	Reapply "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 ↵	Shilei Tian
	(#112403)" This reverts commit ca33649abe5fad93c57afef54e43ed9b3249cd86.
2024-11-08	Revert "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 ↵	Shilei Tian
	(#112403)" This reverts commit e215a1e27d84adad2635a52393621eb4fa439dc9 as it broke both hip and openmp buildbots.
2024-11-08	[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)	Shilei Tian

2024-11-07	Revert "[AMDGPU][MIR] Serialize NumPhysicalVGPRSpillLanes" (#115353)	dyung
	Reverts llvm/llvm-project#115291 Reverting due to test failures on many bots including https://lab.llvm.org/buildbot/#/builders/174/builds/8049
2024-11-07	[AMDGPU][MIR] Serialize NumPhysicalVGPRSpillLanes (#115291)	Akshat Oke

2024-11-05	[AMDGPU] Fix 3495d04 MIR test (#114963)	Akshat Oke
	Needed to specify scratchRSrcReg and spreg in order to stop after prologepilog. - Fixes #113129 test failure
2024-11-05	[AMDGPU][MIR] Serialize SpillPhysVGPRs (#113129)	Akshat Oke

2024-10-21	Reland [AMDGPU] Serialize WWM_REG vreg flag (#110229) (#112492)	Akshat Oke
	A reland but not an exact copy as `VRegInfo.Flags` from the parser is now an int8 instead of a vector; so only need to copy over the value.
2024-10-15	Revert "[AMDGPU] Serialize WWM_REG vreg flag (#110229)"	Peter Collingbourne
	This reverts commit bec839d8eed9dd13fa7eaffd50b28f8f913de2e2. Caused buildbot failures, e.g. https://lab.llvm.org/buildbot/#/builders/52/builds/2928
2024-10-14	[AMDGPU] Serialize WWM_REG vreg flag (#110229)	Akshat Oke

2024-09-13	Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108512)	Diana Picus
	This reverts commit https://github.com/llvm/llvm-project/commit/7792b4ae79e5ac9355ee13b01f16e25455f8427f. The problem was a conflict with https://github.com/llvm/llvm-project/commit/e55d6f5ea2656bf842973d8bee86c3ace31bc865 "[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive (https://github.com/llvm/llvm-project/pull/107889)" which changed the syntax of V_SET_INACTIVE (and thus made my MIR test crash). ...if only we had a merge queue.
2024-09-12	Revert "Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" ↵	Diana Picus
	(#108054)"" (#108341) Reverts llvm/llvm-project#108173 si-init-whole-wave.mir crashes on some buildbots (although it passed both locally with sanitizers enabled and in pre-merge tests). Investigating.
2024-09-12	Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)" (#108173)	Diana Picus
	This reverts commit https://github.com/llvm/llvm-project/commit/c7a7767fca736d0447832ea4d4587fb3b9e797c2. The buildbots failed because I removed a MI from its parent before updating LIS. This PR should fix that.
2024-09-10	Revert "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)	Vitaly Buka
	Breaks bots, see #105822. Reverts llvm/llvm-project#105822
2024-09-10	[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic (#105822)	Diana Picus
	This intrinsic is meant to be used in functions that have a "tail" that needs to be run with all the lanes enabled. The "tail" may contain complex control flow that makes it unsuitable for the use of the existing WWM intrinsics. Instead, we will pretend that the function starts with all the lanes enabled, then branches into the actual body of the function for the lanes that were meant to run it, and then finally all the lanes will rejoin and run the tail. As such, the intrinsic will return the EXEC mask for the body of the function, and is meant to be used only as part of a very limited pattern (for now only in amdgpu_cs_chain functions): ``` entry: %func_exec = call i1 @llvm.amdgcn.init.whole.wave() br i1 %func_exec, label %func, label %tail func: ; ... stuff that should run with the actual EXEC mask br label %tail tail: ; ... stuff that runs with all the lanes enabled; ; can contain more than one basic block ``` It's an error to use the result of this intrinsic for anything other than a branch (but unfortunately checking that in the verifier is non-trivial because SIAnnotateControlFlow will introduce an amdgcn.if between the intrinsic and the branch). The intrinsic is lowered to a SI_INIT_WHOLE_WAVE pseudo, which for now is expanded in si-wqm (which is where SI_INIT_EXEC is handled too); however the information that the function was conceptually started in whole wave mode is stored in the machine function info (hasInitWholeWave). This will be useful in prolog epilog insertion, where we can skip saving the inactive lanes for CSRs (since if the function started with all the lanes active, then there are no inactive lanes to preserve).
2024-09-05	[AMDGPU] V_SET_INACTIVE optimizations (#98864)	Carl Ritson
	Optimize V_SET_INACTIVE by allow it to run in WWM. Hence WWM sections are not broken up for inactive lane setting. WWM V_SET_INACTIVE can typically be lower to V_CNDMASK. Some cases require use of exec manipulation V_MOV as previous code. GFX9 sees slight instruction count increase in edge cases due to smaller constant bus. Additionally avoid introducing exec manipulation and V_MOVs where a source of V_SET_INACTIVE is the destination. This is a common pattern as WWM register pre-allocation often assigns the same register.
2024-07-15	Reapply "AMDGPU: Move attributor into optimization pipeline (#83131)" and ↵	Matt Arsenault
	follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851) This reverts commit adaff46d087799072438dd744b038e6fd50a2d78. Drop the -O3 checks from default-attributes.hip. I don't know why they are different on some bots but reverting this is far too disruptive.
2024-07-14	Revert "AMDGPU: Move attributor into optimization pipeline (#83131)" and ↵	dyung
	follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851) This reverts commits 677cc15e0ff2e0e6aa30538eb187990a6a8f53c0 and 78bc1b64a6dc3fb6191355a5e1b502be8b3668e7. The test CodeGenHIP/default-attributes.hip is failing on multiple bots even after the attempted fix including the following: - https://lab.llvm.org/buildbot/#/builders/3/builds/1473 - https://lab.llvm.org/buildbot/#/builders/65/builds/1380 - https://lab.llvm.org/buildbot/#/builders/161/builds/595 - https://lab.llvm.org/buildbot/#/builders/154/builds/1372 - https://lab.llvm.org/buildbot/#/builders/133/builds/1547 - https://lab.llvm.org/buildbot/#/builders/81/builds/755 - https://lab.llvm.org/buildbot/#/builders/40/builds/570 - https://lab.llvm.org/buildbot/#/builders/13/builds/748 - https://lab.llvm.org/buildbot/#/builders/12/builds/1845 - https://lab.llvm.org/buildbot/#/builders/11/builds/1695 - https://lab.llvm.org/buildbot/#/builders/190/builds/1829 - https://lab.llvm.org/buildbot/#/builders/193/builds/962 - https://lab.llvm.org/buildbot/#/builders/23/builds/991 - https://lab.llvm.org/buildbot/#/builders/144/builds/2256 - https://lab.llvm.org/buildbot/#/builders/46/builds/1614 These bots have been broken for a day, so reverting to get everything back to green.
2024-07-14	AMDGPU: Move attributor into optimization pipeline (#83131)	Matt Arsenault
	Removing it from the codegen pipeline induces a lot of test churn because llc is no longer optimizing out implicit arguments to kernels. Mostly mechanical, but there are some creative test updates. I preferred to take the changes as-is in tests where the ABI isn't relevant. In cases where it's more relevant, or the optimize out logic was too ingrained in the test, I pre-run the optimization. Some cases manually add attributes to disable inputs.
2024-03-18	[MachineFrameInfo] Refactoring around computeMaxcallFrameSize() (NFC) (#78001)	Jonas Paulsson
	- Use computeMaxCallFrameSize() in PEI::calculateCallFrameInfo() instead of duplicating the code. - Set AdjustsStack in FinalizeISel instead of in computeMaxCallFrameSize().
2024-02-09	[AMDGPU] Don't fix the scavenge slot at offset 0 (#79136)	Diana Picus
	At the moment, the emergency spill slot is a fixed object for entry functions and chain functions, and a regular stack object otherwise. This patch adopts the latter behaviour for entry/chain functions too. It seems this was always the intention [1] and it will also save us a bit of stack space in cases where the first stack object has a large alignment. [1] https://github.com/llvm/llvm-project/commit/34c8b835b16fb3879f1b9770e91df21883356bb6
2024-02-05	[CodeGen] Convert tests to opaque pointers (NFC)	Nikita Popov

2024-01-18	[AMDGPU] Add mark last scratch load pass (#75512)	Mirko Brkušanin

2024-01-16	[AMDGPU,test] Change llc -march= to -mtriple= (#75982)	Fangrui Song
	Similar to 806761a7629df268c8aed49657aeccffa6bca449. For IR files without a target triple, -mtriple= specifies the full target triple while -march= merely sets the architecture part of the default target triple, leaving a target triple which may not make sense, e.g. amdgpu-apple-darwin. Therefore, -march= is error-prone and not recommended for tests without a target triple. The issue has been benign as we recognize $unknown-apple-darwin as ELF instead of rejecting it outrightly. This patch changes AMDGPU tests to not rely on the default OS/environment components. Tests that need fixes are not changed: ``` LLVM :: CodeGen/AMDGPU/fabs.f64.ll LLVM :: CodeGen/AMDGPU/fabs.ll LLVM :: CodeGen/AMDGPU/floor.ll LLVM :: CodeGen/AMDGPU/fneg-fabs.f64.ll LLVM :: CodeGen/AMDGPU/fneg-fabs.ll LLVM :: CodeGen/AMDGPU/r600-infinite-loop-bug-while-reorganizing-vector.ll LLVM :: CodeGen/AMDGPU/schedule-if-2.ll ```
2023-11-30	MachineVerifier: Reject extra non-register operands on instructions (#73758)	Matt Arsenault
	We were allowing extra immediate arguments, and only bothering to check if registers were implicit or not. Also consolidate extra operand checks in verifier, to make this testable. We had 3 different places checking if you were trying to build an instruction with more operands than allowed by the definition. We had an assertion in addOperand, a direct check in the MIRParser to avoid the assertion, and the machine verifier checks. Remove the assert and parser check so the verifier can provide a consistent verification experience, which will also handle instructions modified in place.
2023-08-21	[AMDGPU] Add IsChainFunction to the MachineFunctionInfo	Diana Picus
	This will represent functions with the amdgpu_cs_chain or amdgpu_cs_chain_preserve calling conventions. Differential Revision: https://reviews.llvm.org/D156410
2023-08-20	[GlobalISel] introduce MIFlag::NoConvergent	Sameer Sahasrabuddhe
	Some opcodes in MIR are defined to be convergent by the target by setting IsConvergent in the corresponding TD file. For example, in AMDGPU, the opcodes G_SI_CALL and G_INTRINSIC* are marked as convergent. But this is too conservative, since calls to functions that do not execute convergent operations should not be marked convergent. This information is available in LLVM IR. The new flag MIFlag::NoConvergent now allows the IR translator to mark an instruction as not performing any convergent operations. It is relevant only on occurrences of opcodes that are marked isConvergent in the target. Differential Revision: https://reviews.llvm.org/D157475
2023-08-04	[AMDGPU] Change syncscopes.mir not to use undefined cpol bits. NFC.	Stanislav Mekhanoshin

2023-07-31	Reapply "[CodeGen]Allow targets to use target specific COPY instructions for ↵	Matt Arsenault
	live range splitting" This reverts commit a496c8be6e638ae58bb45f13113dbe3a4b7b23fd. The workaround in c26dfc81e254c78dc23579cf3d1336f77249e1f6 should work around the underlying problem with SUBREG_TO_REG.
2023-07-26	Revert "[CodeGen]Allow targets to use target specific COPY instructions for ↵	Vitaly Buka
	live range splitting" And dependent commits. Details in D150388. This reverts commit 825b7f0ca5f2211ec3c93139f98d1e24048c225c. This reverts commit 7a98f084c4d121244ef7286bc6503b6a181d446e. This reverts commit b4a62b1fa546312d882fa12dfdcd015177d66826. This reverts commit b7836d856206ec39509d42529f958c920368166b. No conflicts in the code, few tests had conflicts in autogenerated CHECKs: llvm/test/CodeGen/Thumb2/mve-float32regloops.ll llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll Reviewed By: alexfh Differential Revision: https://reviews.llvm.org/D156381
2023-07-21	[Support] Implement LLVM_ENABLE_REVERSE_ITERATION for StringMap	Fangrui Song
	ProgrammersManual.html says > StringMap iteration order, however, is not guaranteed to be deterministic, so any uses which require that should instead use a std::map. This patch makes -DLLVM_REVERSE_ITERATION=on (currently -DLLVM_ENABLE_REVERSE_ITERATION=on works as well) shuffle StringMap iteration order (actually flipping the hash so that elements not in the same bucket are reversed) to catch violations, similar to D35043 for DenseMap. This should help change the hash function (e.g., D142862, D155781). With a lot of fixes, there are still some violations. This patch implements the "reverse_iteration" lit feature to skip such tests. Eventually we should remove this feature. `ninja check-{llvm,clang,clang-tools}` are clean with `#define LLVM_ENABLE_REVERSE_ITERATION 1`. Reviewed By: jhenderson Differential Revision: https://reviews.llvm.org/D155789
2023-07-07	[AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs	Christudasan Devadasan
	Currently, the custom SGPR spill lowering pass spills SGPRs into physical VGPR lanes and the remaining VGPRs are used by regalloc for vector regclass allocation. This imposes many restrictions that we ended up with unsuccessful SGPR spilling when there won't be enough VGPRs and we are forced to spill the leftover into memory during PEI. The custom spill handling during PEI has many edge cases and often breaks the compiler time to time. This patch implements spilling SGPRs into virtual VGPR lanes. Since we now split the register allocation for SGPRs and VGPRs, the virtual registers introduced for the spill lanes would get allocated automatically in the subsequent regalloc invocation for VGPRs. Spill to virtual registers will always be successful, even in the high-pressure situations, and hence it avoids most of the edge cases during PEI. We are now left with only the custom SGPR spills during PEI for special registers like the frame pointer which is an unproblematic case. Differential Revision: https://reviews.llvm.org/D124196
2023-07-07	[AMDGPU] Implement whole wave register spill	Christudasan Devadasan
	To reduce the register pressure during allocation, when the allocator spills a virtual register that corresponds to a whole wave mode operation, the spill loads and restores should be activated for all lanes by temporarily flipping all bits in exec register to one just before the spills. It is not implemented in the compiler as of today and this patch enables the necessary support. This is a pre-patch before the SGPR spill to virtual VGPR lanes that would eventually causes the whole wave register spills during allocation. Reviewed By: arsenm, cdevadas Differential Revision: https://reviews.llvm.org/D143759