llvm-project.git

Age	Commit message	Author
2025-11-22	[AMDGPU] Enable serializing of allocated preload kernarg SGPRs info (#168374)	tyb0807
	- Support serialization of the number of allocated preload kernarg SGPRs. - Support serialization of the first allocated preload kernarg SGPR. Together, they make it possible to correctly reconstruct MIR with preload kernarg SGPRs.
2025-10-24	[GlobalISel] Make scalar G_SHUFFLE_VECTOR illegal. (#140508)	David Green
	I'm not sure whether this is the best way forward, but we repeatedly run into issues from forgetting that shuffle_vectors can be scalar. (There is another example in the recently added known-bits code.) As a scalar-dst shuffle vector is just an extract, and a scalar-source shuffle vector is just a build vector, this patch makes scalar shuffle vector illegal and adjusts the irbuilder to create the correct node as required. Most targets do this already through lowering or combines. Making scalar shuffles illegal simplifies gisel as a whole; it just requires transforms that create shuffles of new sizes to account for the scalar shuffle being illegal (mostly IRBuilder and LessElements).
2025-10-13	[NFC][MIR] Fix extra whitespace in MIR printing (#162928)	Rahul Joshi
	Fix a whitespace regression in MIR printing that was introduced in https://github.com/llvm/llvm-project/pull/137361. The default value for `ListSeparator` is `", "`, so we don't need to print an additional space in front of tokens for optional symbols and other things printed after operands. Note, the modified LIT test will fail at trunk without the fix, demonstrating that the extra space before `, pre-instr-symbol <mcsymbol >` on Line 63 exists currently and is fixed with this change.
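	The separator behaviour described above can be sketched in a few lines of Python. This is only an illustration, assuming a helper that emits nothing before the first item and `", "` before every later one; the names `ListSeparatorSketch` and `render`, and the sample operands, are hypothetical and are not LLVM code.

```python
class ListSeparatorSketch:
    """Hypothetical stand-in for a list-separator helper (not LLVM's class)."""

    def __init__(self, separator=", "):
        self.separator = separator
        self.first = True

    def next(self):
        # Emit nothing before the first item, the separator before later ones.
        if self.first:
            self.first = False
            return ""
        return self.separator


def render(operands, trailers, extra_space=False):
    """Join operands and trailing tokens the way a printer might."""
    sep = ListSeparatorSketch()
    out = ""
    for operand in operands:
        out += sep.next() + operand
    for trailer in trailers:
        out += sep.next()
        if extra_space:
            out += " "  # the redundant space: the separator already ends in one
        out += trailer
    return out


fixed = render(["%0", "%1"], ["pre-instr-symbol <mcsymbol >"])
regressed = render(["%0", "%1"], ["pre-instr-symbol <mcsymbol >"], extra_space=True)
print(repr(fixed))
print(repr(regressed))
```

	Because the separator already carries its own trailing space, adding another one before a trailing token yields two spaces after the comma, which is the kind of regression the commit describes.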
2025-09-23	[MIR][NFC] Build fix after 1132e82 (#160273)	Elizaveta Noskova

2025-09-23	[MIR] Support save/restore points with independent sets of registers (#119358)	Elizaveta Noskova
	This patch adds the MIR parsing and serialization support for save and restore points with subsets of callee saved registers. That is, it syntactically allows a function to contain two or more distinct sub-regions in which distinct subsets of registers are spilled/filled as callee save. This is useful if e.g. one of the CSRs isn't modified in one of the sub-regions, but is in the other(s). Support for actually using this capability in code generation is still forthcoming. This patch is the next logical step for multiple save/restore points support. All points are now stored in DenseMap from MBB to vector of CalleeSavedInfo. Shrink-Wrap points split Part 4. RFC: https://discourse.llvm.org/t/shrink-wrap-save-restore-points-splitting/83581 Part 1: https://github.com/llvm/llvm-project/pull/117862 (landed) Part 2: https://github.com/llvm/llvm-project/pull/119355 (landed) Part 3: https://github.com/llvm/llvm-project/pull/119357 (landed) Part 5: https://github.com/llvm/llvm-project/pull/119359 (likely to be further split)
2025-09-16	[AMDGPU] Set TGID_EN_X/Y/Z when cluster ID intrinsics are used (#159120)	Shilei Tian
	Hardware initializes a single value in ttmp9 which is either the workgroup ID X or cluster ID X. Most of this patch is a refactoring to use a single `PreloadedValue` enumerator for this value, instead of two enumerators `WORKGROUP_ID_X` and `CLUSTER_ID_X` referring to the same value. This makes it simpler to have a single attribute `amdgpu-no-workgroup-id-x` indicating that this value is not used, which in turn sets the TGID_EN_X bit appropriately to tell the hardware whether to initialize it. All of the above applies to Y and Z similarly. Fixes: LWPSCGFX13-568 Co-authored-by: Jay Foad <jay.foad@amd.com>
2025-09-11	[AArch64][MIR] Serialize AArch64MachineFunctionInfo::HasStackFrame to MIR ↵	David Tellenbach
	(#158122) This patch adds serialization of AArch64MachineFunctionInfo::HasStackFrame into MIR.
2025-09-03	[AMDGPU] Remove most uses of /dev/null in tests (#156630)	Jay Foad
	Using options like -filetype=null instead should allow tools to save some work by not generating any output.
2025-08-12	[MIR] Remove std::variant from multiple save/restore point handling [nfc] ↵	Philip Reames
	(#153226) In review of bbde6b, I had originally proposed that we support the legacy text format. As review evolved, it became clear this had been a bad idea (too much complexity), but in order to let that patch finally move forward, I approved the change with the variant. This change undoes the variant, and updates all the tests to just use the array form.
2025-08-12	[llvm] Support multiple save/restore points in mir (#119357)	Elizaveta Noskova
	Currently mir supports only one save and one restore point specification: ``` savePoint: '%bb.1' restorePoint: '%bb.2' ``` This patch makes it possible to have multiple save and multiple restore points in mir: ``` savePoints: - point: '%bb.1' restorePoints: - point: '%bb.2' ``` Shrink-Wrap points split Part 3. RFC: https://discourse.llvm.org/t/shrink-wrap-save-restore-points-splitting/83581 Part 1: https://github.com/llvm/llvm-project/pull/117862 Part 2: https://github.com/llvm/llvm-project/pull/119355 Part 4: https://github.com/llvm/llvm-project/pull/119358 Part 5: https://github.com/llvm/llvm-project/pull/119359
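	As a rough illustration of how the two spellings relate, the following Python sketch rewrites a mapping that uses the single-point keys into the list form shown above. The helper `to_list_form` is hypothetical (it is not part of LLVM or of this patch), and treating an empty legacy value as "no points" is an assumption.

```python
def to_list_form(fields):
    """Return a copy of `fields` using the list-style save/restore keys.

    Hypothetical helper for illustration; not part of LLVM.
    """
    converted = dict(fields)
    for old_key, new_key in (("savePoint", "savePoints"),
                             ("restorePoint", "restorePoints")):
        if old_key in converted:
            block = converted.pop(old_key)
            # Assumption: an empty legacy value means there are no points.
            converted[new_key] = [{"point": block}] if block else []
    return converted


legacy = {"savePoint": "%bb.1", "restorePoint": "%bb.2"}
print(to_list_form(legacy))
```

	Keys other than the two legacy names pass through unchanged, so the sketch can be applied to a larger mapping without disturbing unrelated fields.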
2025-08-08	[AMDGPU] AsmPrinter: Unify arg handling (#151672)	Diana Picus
	When computing the number of registers required by entry functions, the `AMDGPUAsmPrinter` needs to take into account both the register usage computed by the `AMDGPUResourceUsageAnalysis` pass, and the number of registers initialized by the hardware. At the moment, the way it computes the latter is different for graphics vs compute, due to differences in the implementation. For kernels, all the information needed is available in the `SIMachineFunctionInfo`, but for graphics shaders we would iterate over the `Function` arguments in the `AMDGPUAsmPrinter`. This pretty much repeats some of the logic from instruction selection. This patch introduces 2 new members to `SIMachineFunctionInfo`, one for SGPRs and one for VGPRs. Both will be computed during instruction selection and then used during `AMDGPUAsmPrinter`, removing the need to refer to the `Function` when printing assembly. This patch is NFC except for the fact that we now add the extra SGPRs (VCC, XNACK etc) to the number of SGPRs computed for graphics entry points. I'm not sure why these weren't included before. It would be nice if someone could confirm if that was just an oversight or if we have some docs somewhere that I haven't managed to find. Only one test is affected (its SGPR usage increases because we now take into account the XNACK registers).
2025-07-30	[llvm] Extract and propagate callee_type metadata	Prabhu Rajasekaran
	Update MachineFunction::CallSiteInfo to extract numeric CalleeTypeIds from callee_type metadata attached to indirect call instructions. Reviewers: nikic, ilovepi Reviewed By: ilovepi Pull Request: https://github.com/llvm/llvm-project/pull/87575
2025-07-29	[AMDGPU] Add NoaliasAddrSpace to AAMDnodes (#149247)	Shoreshen
	This is the follow-up PR to https://github.com/llvm/llvm-project/pull/136553, which calculates NoaliasAddrSpace. This PR carries the calculated info into MIR by adding it to AAMDnodes.
2025-07-28	Reapply "[llvm] Add CalleeTypeIds field to CallSiteInfo" (#150335) (#150990)	Prabhu Rajasekaran
	This reverts commit 05e08cdb3e576cc0887d1507ebd2f756460c7db7. It adds the missing -mtriple flags to the MIR/X86 test files; their absence caused these tests to fail, which was the reason for reverting the patch.
2025-07-23	Revert "[llvm] Add CalleeTypeIds field to CallSiteInfo" (#150335)	Haowei
	Reverts llvm/llvm-project#87574, which breaks LLVM :: CodeGen/MIR/X86/call-site-info-ambiguous-indirect-call-typeid.mir tests on linux-arm64 builders.
2025-07-23	[llvm] Add CalleeTypeIds field to CallSiteInfo	Prabhu Rajasekaran
	Introduce the `EnableCallGraphSection` target option to add the CalleeTypeIds field to CallSiteInfo. The MIR parser/printer reads and writes the callee type ids. Reviewers: ilovepi Reviewed By: ilovepi Pull Request: https://github.com/llvm/llvm-project/pull/87574
2025-07-21	[AMDGPU] ISel & PEI for whole wave functions (#145858)	Diana Picus
	Whole wave functions are functions that will run with a full EXEC mask. They will not be invoked directly, but instead will be launched by way of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in a future patch). These functions are meant as an alternative to the `llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics. Whole wave functions will set EXEC to -1 in the prologue and restore the original value of EXEC in the epilogue. They must have a special first argument, `i1 %active`, that is going to be mapped to EXEC. They may have either the default calling convention or amdgpu_gfx. The inactive lanes need to be preserved for all registers used, active lanes only for the CSRs. At the IR level, arguments to a whole wave function (other than `%active`) contain poison in their inactive lanes. Likewise, the return value for the inactive lanes is poison. This patch contains the following work: * 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return a SReg_1 representing `%active`, which needs to be passed into SI_WHOLE_WAVE_FUNC_RETURN. * SelectionDAG support for generating these 2 new pseudos and the special handling of %active. Since the return may be in a different basic block, it's difficult to add the virtual reg for %active to SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF which is later replaced via a custom inserter. * Expansion of the 2 pseudos during prolog/epilog insertion. PEI also marks any used VGPRs as WWM registers, which are then spilled and restored with the usual logic. Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic and a lot of optimization work (especially in order to reduce spills around function calls). --------- Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com> Co-authored-by: Shilei Tian <i@tianshilei.me>
2025-07-15	[AMDGPU] gfx1250 64-bit relocations and fixups (#148951)	Stanislav Mekhanoshin

2025-07-09	[NVPTX] Rework and cleanup FTZ ISel (#146410)	Alex MacLean
	This change cleans up DAG-to-DAG instruction selection around FTZ and SETP comparison mode. Largely these changes do not impact functionality though support for `{sin.cos}.approx.ftz.f32` is added.
2025-06-27	[NVPTX] Fixup v2i8 parameter and return lowering (#145585)	Alex MacLean
	This change fixes v2i8 lowering for parameters and returned values. As part of this work, I move the lowering for return values to use generic ISD::STORE nodes as these are more flexible and have existing legalization handling. Note that calling a function with v2i8 arguments or returns is still not working but this is left for a subsequent change as this MR is already fairly large. Partially addresses #128853
2025-06-24	[AMDGPU] Replace dynamic VGPR feature with attribute (#133444)	Diana Picus
	Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget feature, as requested in #130030.
2025-06-23	[NVPTX] Rename register classes after float register removal (NFC) (#145255)	Alex MacLean

2025-06-06	[MIRParser] Report register class errors in a deterministic order (#142928)	Jay Foad

2025-05-29	[NVPTX] Cleanup ISel code after float register removal, use BasicNVPTXInst ↵	Alex MacLean
	(#141711)
2025-05-21	[NVPTX] Remove Float register classes (#140487)	Alex MacLean
	These classes are redundant, as the untyped "Int" classes can be used for all float operations. This change is intended to be as minimal as possible and leaves the many potential simplifications and refactors this exposes as future work.
2025-05-13	[NVPTX] Vectorize and lower 256-bit global loads/stores for sm_100+/ptx88+ ↵	Drew Kersnar
	(#139292) PTX 8.8+ introduces 256-bit-wide vector loads/stores under certain conditions. This change extends the backend to lower these loads/stores. It also overrides getLoadStoreVecRegBitWidth for NVPTX, allowing the LoadStoreVectorizer to create these wider vector operations. See the spec for the three relevant PTX instructions here: - https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld - https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld-global-nc - https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st
2025-05-08	[CodeGen] Parse nusw flag (#138856)	Pierre van Houtryve
	Fixes #127781
2025-05-06	[AArch64] Correct position of CFI Instruction for Pointer Authentication ↵	Daniel Kiss
	(#137795) This partially reverts commit 0b73b5af60f2c544892b9dd68b4fa43eeff52fc1. This is not a clean revert because other changes have already landed. CFI directives like `.cfi_negate_ra_state` must be emitted after the instruction. If execution is stopped before the `paciasp` instruction is executed, the debugger/unwinder would try to authenticate the return address, as the `.cfi_negate_ra_state` already indicates it got signed. fixes: #137802
2025-04-15	[AMDGPU] Remove the AnnotateKernelFeatures pass (#130198)	Jun Wang
	Previously the AnnotateKernelFeatures pass infers two attributes: amdgpu-calls and amdgpu-stack-objects, which are used to help determine if flat scratch init is allowed. PR #118907 created the amdgpu-no-flat-scratch-init attribute. Continuing with that work, this patch makes use of this attribute to determine flat scratch init, replacing amdgpu-calls and amdgpu-stack-objects. This also leads to the removal of the AnnotateKernelFeatures pass.
2025-04-04	[Verifier] Require that dbg.declare variable is a ptr (#134355)	Nikita Popov
	As far as I understand, the first operand of dbg_declare should be a pointer (inside a metadata wrapper). However, using a non-pointer is currently not rejected, and we have some tests that use non-pointer types. As far as I can tell, these tests either meant to use dbg_value or are just incorrect hand-crafted tests. Ran into this while trying to `fix` #134008.
2025-03-27	[YAML] fix output incorrect format for block scalar string (#132897)	Congcong Cai
	After outputting a block scalar string, the indent will be wrong. This patch fixes the padding after a block scalar string to ensure the YAML is formatted correctly. The newly added unit test fails on main. ```diff @@ -3,4 +3,4 @@ Just a block scalar doc -scalar: a + scalar: a ...\n ```
2025-03-19	[AMDGPU] Allocate scratch space for dVGPRs for CWSR (#130055)	Diana Picus
	The CWSR trap handler needs to save and restore the VGPRs. When dynamic VGPRs are in use, the fixed function hardware will only allocate enough space for one VGPR block. The rest will have to be stored in scratch, at offset 0. This patch allocates the necessary space by: - generating a prologue that checks at runtime if we're on a compute queue (since CWSR only works on compute queues); for this we will have to check the ME_ID bits of the ID_HW_ID2 register - if that is non-zero, we can assume we're on a compute queue and initialize the SP and FP with enough room for the dynamic VGPRs - forcing all compute entry functions to use a FP so they can access their locals/spills correctly (this isn't ideal but it's the quickest to implement) Note that at the moment we allocate enough space for the theoretical maximum number of VGPRs that can be allocated dynamically (for blocks of 16 registers, this will be 128, of which we subtract the first 16, which are already allocated by the fixed function hardware). Future patches may decide to allocate less if they can prove the shader never allocates that many blocks. Also note that this should not affect any reported stack sizes (e.g. PAL backend_stack_size etc).
2025-03-17	AMDGPU: Migrate more tests away from undef (#131314)	Matt Arsenault
	andorbitset.ll is interesting since it directly depends on the difference between poison and undef. Not sure it's useful to keep the version using poison; I assume none of this code makes it to codegen. si-spill-cf.ll was also a nasty case, which I doubt has been reproducing its original issue for a very long time. I had to reclaim an older version, replace some of the poison uses, and run simplify-cfg. There's a very slight change in the final CFG with this, but the final output is approximately the same as it used to be.
2025-03-15	MIR: Replace undef with poison in some MIR tests (#131282)	Matt Arsenault
	The IR doesn't matter so much in these.
2025-03-13	[CodeGen][NPM] Port BranchFolder to NPM (#128858)	Akshat Oke
	EnableTailMerge is false by default and is handled by the pass builder. Passes are independent of target pipeline options. This completes the generic `MachineLateOptimization` passes for the NPM pipeline.
2025-03-08	[AMDGPU] Change SGPR layout to striped caller/callee saved (#127353)	Shilei Tian
	This PR updates the SGPR layout to a striped caller/callee-saved design, similar to the VGPR layout. To ensure that s30-s31 (return address), s32 (stack pointer), s33 (frame pointer), and s34 (base pointer) remain callee-saved, the striped layout starts from s40, with a stripe width of 8. The last stripe is 10 wide instead of 8 to avoid ending with a 2-wide stripe. Fixes #113782.
2025-03-06	[win] NFC: Rename `EHCatchret` to `EHCont` to allow for EH Continuation ↵	Daniel Paoliello
	targets that aren't `catchret` instructions (#129953) This change splits out the renaming and comment updates from #129612 as a non-functional change.
2025-03-03	[win] Enable test/CodeGen/MIR/AArch64 on Windows (#122832)	Daniel Paoliello
	Not sure why this was disabled in the first place (dates back to <https://github.com/llvm/llvm-project/commit/fbe9c04c5f72cf3eca39793aafc92071ef13c046>), but it appears to be working for me.
2025-03-03	[RegAlloc][NewPM] Plug Greedy RA in codegen pipeline (#120557)	Akshat Oke
	Use `-passes="regallocgreedy<[all|sgpr|wwm|vgpr]>"` to insert the greedy RA with a filter and `-regalloc-npm=<type>` to control which RA to use in existing pipeline.
2025-02-27	[NVPTX] Combine addressing-mode variants of ld, st, wmma (#129102)	Alex MacLean
	This change folds together the _ari, _ari64, and _asi variants of these instructions into a single instruction capable of holding any address. This allows for the removal of a lot of unnecessary code and moves us towards a standard way of representing an address in NVPTX.
2025-02-20	[NVPTX] Remove redundant addressing mode instrs (#128044)	Alex MacLean
	Remove load and store instructions which do not include an immediate, and just use the immediate variants in all cases. These variants will be emitted exactly the same when the immediate offset is 0. Removing the non-immediate versions allows for the removal of a lot of code and would make any MachineIR passes simpler.
2025-02-10	MachineCopyPropagation: Do not remove copies preserved by regmask (#125868)	Jinsong Ji
	llvm/llvm-project@9e436c2daa44 tries to handle register masks and sub-registers; it avoids clobbering RegUnits preserved by a regmask. But it then introduces invalid pointer issues. We delete the copies without invalidating all the uses in the CopyInfo, so we dereference invalid pointers in the next iteration, causing asserts. Fixes: #126107 --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-01-28	[Clang] Cleanup docs and comments relating to -fextend-variable-liveness ↵	Stephen Tozer
	(#124767) This patch contains a number of changes relating to the above flag; primarily it updates comment references to the old flag names, "-fextend-lifetimes" and "-fextend-this-ptr" to refer to the new names, "-fextend-variable-liveness[={all,this}]". These changes are all NFC. This patch also removes the explicit -fextend-this-ptr-liveness flag alias, and shortens the help-text for the main flag; these are both changes that were meant to be applied in the initial PR (#110000), but due to some user-error on my part they were not included in the merged commit.
2025-01-23	[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748)	Lucas Ramirez
	Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies. `getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither of the two of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes` since it is the range of workgroup sizes that produces the range of achievable occupancies. Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies i.e., sometimes workgroup sizes inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; heavy documentation explains the rationale behind the sometimes relatively obscure calculations. As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024]. - A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling as much as possible the remaining 40-36=4 wave slots without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU. - Similarly, a group of 1024 items requires 16 waves per group. Only 2 groups made up of 16 waves each can fit fully on a CU at any given time, for a total of 32 waves on the CU, or 8 per EU. 
However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU. Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10]. Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary. Fixes #118220.
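	The worked example above can be checked with a brute-force Python sketch. It is not the constant-time method the commit implements: it simply tries every workgroup size in the range, ignores LDS and register limits, and assumes that per-EU occupancy is the number of waves on the CU divided by the EU count, rounded up. The function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)


def occupancy_for(size, wave_size=64, waves_per_eu=10, eus_per_cu=4):
    """Waves per EU achieved when every group on the CU has `size` items."""
    wave_slots = waves_per_eu * eus_per_cu        # 40 slots in the example
    waves_per_group = ceil_div(size, wave_size)   # waves needed by one group
    groups = wave_slots // waves_per_group        # whole groups that fit on a CU
    waves_on_cu = groups * waves_per_group
    # Assumption: round up when the waves do not divide evenly across EUs.
    return ceil_div(waves_on_cu, eus_per_cu)


def occupancy_range(min_size, max_size, **target):
    """Brute-force (min, max) occupancy over an inclusive size range."""
    values = [occupancy_for(s, **target) for s in range(min_size, max_size + 1)]
    return min(values), max(values)


for size in (513, 640, 896, 1024):
    print(size, occupancy_for(size))
print(occupancy_range(513, 1024))
```

	For the target in the example, the sizes 513, 640, 896 and 1024 give 9, 10, 7 and 8 waves per EU respectively, and the scan over [513,1024] gives (7, 10), matching the range stated in the message.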
2025-01-17	[AMDGPU] Fix printing hasInitWholeWave in mir (#123232)	Stanislav Mekhanoshin

2025-01-13	Reapply "[aarch64][win] Add support for import call optimization (equivalent ↵	Daniel Paoliello
	to MSVC /d2ImportCallOptimization) (#121516)" (#122777) This reverts commit 2f7ade4b5e399962e18f5f9a0ab0b7335deece51. Fix is available in #122762
2025-01-13	Revert "[aarch64][win] Add support for import call optimization (equivalent ↵	Kirill Stoimenov
	to MSVC /d2ImportCallOptimization) (#121516)" Breaks sanitizer build: https://lab.llvm.org/buildbot/#/builders/52/builds/5179 This reverts commits: 5ee0a71df919a328c714e25f0935c21e586cc18b d997a722c194feec5f3a94dec5acdce59ac5e55b
2025-01-11	[aarch64][win] Add support for import call optimization (equivalent to MSVC ↵	Daniel Paoliello
	/d2ImportCallOptimization) (#121516) This change implements import call optimization for AArch64 Windows (equivalent to the undocumented MSVC `/d2ImportCallOptimization` flag). Import call optimization adds additional data to the binary which can be used by the Windows kernel loader to rewrite indirect calls to imported functions as direct calls. It uses the same [Dynamic Value Relocation Table mechanism that was leveraged on x64 to implement `/d2GuardRetpoline`](https://techcommunity.microsoft.com/blog/windowsosplatform/mitigating-spectre-variant-2-with-retpoline-on-windows/295618). The change to the obj file is to add a new `.impcall` section with the following layout: ```cpp // Per section that contains calls to imported functions: // uint32_t SectionSize: Size in bytes for information in this section. // uint32_t Section Number // Per call to imported function in section: // uint32_t Kind: the kind of imported function. // uint32_t BranchOffset: the offset of the branch instruction in its // parent section. // uint32_t TargetSymbolId: the symbol id of the called function. ``` NOTE: If the import call optimization feature is enabled, then the `.impcall` section must be emitted, even if there are no calls to imported functions. The implementation is split across a few parts of LLVM: * During AArch64 instruction selection, the `GlobalValue` for each call to a global is recorded into the Extra Information for that node. * During lowering to machine instructions, the called global value for each call is noted in its containing `MachineFunction`. * During AArch64 asm printing, if the import call optimization feature is enabled: - A (new) `.impcall` directive is emitted for each call to an imported function. - The `.impcall` section is emitted with its magic header (but is not filled in). * During COFF object writing, the `.impcall` section is filled in based on each `.impcall` directive that was encountered. 
The `.impcall` section can only be filled in when we are writing the COFF object as it requires the actual section numbers, which are only assigned at that point (i.e., they don't exist during asm printing). I had tried to avoid using the Extra Information during instruction selection and instead implement this either purely during asm printing or in a `MachineFunctionPass` (as suggested [on the forums](https://discourse.llvm.org/t/design-gathering-locations-of-instructions-to-emit-into-a-section/83729/3)) but this was not possible due to how loading and calling an imported function works on AArch64. Specifically, they are emitted as `ADRP` + `LDR` (to load the symbol) then a `BR` (to do the call), so at the point when we have machine instructions, we would have to work backwards through the instructions to discover what is being called. An initial prototype did work by inspecting instructions; however, it didn't correctly handle the case where the same function was called twice in a row, which caused LLVM to elide the `ADRP` + `LDR` and reuse the previously loaded address. Worse than that, sometimes for the double-call case LLVM decided to spill the loaded address to the stack and then reload it before making the second call. So, instead of trying to implement logic to discover where the value in a register came from, I instead recorded the symbol being called at the last place where it was easy to do: instruction selection.
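	The per-section layout listed above can be sketched with Python's `struct` module. This is only an illustration under stated assumptions: fields are packed little-endian, `SectionSize` is taken to cover the two header words plus all call records, the magic header mentioned in the message is omitted, and the `Kind` value 0 and the offsets and symbol ids are placeholders. The helper name `pack_impcall_records` is hypothetical and is not LLVM's COFF writer.

```python
import struct


def pack_impcall_records(section_number, calls):
    """Pack one per-section block of the layout described above.

    `calls` is an iterable of (kind, branch_offset, target_symbol_id).
    Hypothetical helper for illustration only.
    """
    body = b"".join(
        struct.pack("<III", kind, branch_offset, symbol_id)
        for kind, branch_offset, symbol_id in calls
    )
    # Assumption: SectionSize counts the 8-byte header plus every record.
    section_size = 8 + len(body)
    return struct.pack("<II", section_size, section_number) + body


# Placeholder values: kind 0, two calls at offsets 0x10 and 0x24.
blob = pack_impcall_records(1, [(0, 0x10, 7), (0, 0x24, 9)])
print(len(blob), blob.hex())
```

	Each call record is three 32-bit words, so two calls plus the two-word header occupy 32 bytes in this sketch. The dependency on `section_number` also shows why the data can only be filled in once section numbers have been assigned.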
2025-01-06	[AArch64] Correct position of CFI Instruction for Pointer Authentication ↵	Jack Styles
	(#121559) As part of #112171, support for FEAT_PAuthLR's CFI instructions was added. However, the CFI instructions are emitted in the incorrect location. This leads to incorrect code being generated and possible issues when running a program. According to the ABI, the CFI instructions should be emitted before the signing instruction. This is now done properly. ABI information can be found here: https://github.com/ARM-software/abi-aa/blob/bf0e2c8047c70987165f3e05e571d7836370ade9/aadwarf64/aadwarf64.rst#44call-frame-instructions
2024-12-18	[AMDGPU] Make max dwords of memory cluster configurable (#119342)	Ruiling, Song
	We find it helpful to increase the value for graphics workloads. Make it configurable so we can experiment with different values.