llvm-project.git/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp, branch main

[AMDGPU] Fixed crash in getLastMIForRegion when the region is empty. (#168653)

2025-11-20T00:19:20+00:00

PreRARematStage builds region live-outs if GCN trackers are enabled. If
rematerialization leads to empty regions, this can cause a crash because
of dereference of an invalid iterator in getLastMIForRegion. The fix is
to skip calling getLastMIForRegion for empty regions.

This patch fixes another bug in the same code region. getLastMIForRegion
calls skipDebugInstructionsBackward which may immediately return the
RegionEnd if it is not the begin instruction and it is a non-debug
instruction. That would imply considering an instruction that is outside
the relevant region. The fix is to always pass the previous of RegionEnd
to skipDebugInstructionsBackward.

This bug was found while using GCN trackers on the existing LIT test
machine-scheduler-sink-trivial-remats.mir. Here's the assertion failure.

llvm-project/llvm/include/llvm/ADT/ilist_iterator.h:168:
llvm::ilist_iterator::reference
llvm::ilist_iterator::operator*() const
[with OptionsT = llvm::ilist_detail::node_options; bool IsReverse = false; bool IsConst =
false; llvm::ilist_iterator::reference =
llvm::MachineInstr&]: Assertion `!NodePtr->isKnownSentinel()' failed.

[AMDGPU] Avoid changing minOccupancy if unclustered schedule was not run for any region. (#162025)

2025-11-12T18:11:22+00:00

During init of unclustered schedule stage, minOccupancy may be
temporarily increased. But subsequently, if none of the regions are
scheduled because they don't meet the conditions of initGCNRegion,
minOccupancy remains incorrectly set. This patch avoids this
incorrectness by delaying the change of minOccupancy until a region is
about to be scheduled.

CodeGen: Remove TRI argument from reMaterialize (#158229)

2025-11-11T00:23:36+00:00

[AMDGPU] Examine instructions in pending queues during scheduling (#147653)

2025-10-16T15:46:29+00:00

Examine instructions in the pending queue when scheduling. This makes
instructions visible to scheduling heuristics even when they aren't
immediately issuable due to hardware resource constraints.

The scheduler has two hardware resource modeling modes: an in-order mode
where instructions must be ready to issue before scheduling, and
out-of-order models where instructions are always visible to heuristics.
Special handling exists for unbuffered processor resources in
out-of-order models. These resources can cause pipeline stalls when used
back-to-back, so they're typically avoided. However, for AMDGPU targets,
managing register pressure and reducing spilling is critical enough to
justify exceptions to this approach.

This change enables examination of instructions that can't be
immediately issued because they use an already occupied unbuffered
resource. By making these instructions visible to scheduling heuristics
anyway, we gain more flexibility in scheduling decisions, potentially
allowing better register pressure and hardware resource management.

[AMDGPU] Add register usage debug printing the point of maximum register pressure. (#161850)

2025-10-13T14:17:11+00:00

Basically this allows to analyze "why so many VGPRs used?".

It prints all live registers at the point of maximum register pressure
and for each register its defs/uses are dumped.

Currently can be run before and after the scheduler but would be nice if
it can be ran inbetween any passes (not sure this is possible with
legacy pass-manager). Requires debug or built with asserts compiler.

Highly recommended to run with debug info to have debug locations for
instructions.

Example output:

```
*** Register pressure info (VGPRs) for _ZN7ck_tile6ken.... ***
Max pressure is 256 VGPRs at 41780e@BB.18 (LoopHdr BB.16, Depth 1): %9858:vreg_512_align2 = contract V_MFMA_F32_32X32X16_BF16_mac_vgprcd_e64 %10137:vreg_128_align2, %10141:vreg_128_align2, %9858:vreg_512_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec

Live registers with single definition (123 VGPRs):
  %10126:VReg_128_Align2, LiveMask 00000000000000FF (4 VGPRs)
    def 41600r@BB.18 (LoopHdr BB.16, Depth 1): undef %10126.sub0_sub1:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 15232, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1314, !noalias !60, addrspace 3)
    def 41608r@BB.18 (LoopHdr BB.16, Depth 1): %10126.sub2_sub3:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 16320, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1315, !noalias !60, addrspace 3)
    use 41848r@BB.18 (LoopHdr BB.16, Depth 1): %9856:vreg_512_align2 = contract V_MFMA_F32_32X32X16_BF16_mac_vgprcd_e64 %10126:vreg_128_align2, %10138:vreg_128_align2, %9856:vreg_512_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
  %10136:VReg_128_Align2, LiveMask 00000000000000FF (4 VGPRs)
    def 41264r@BB.18 (LoopHdr BB.16, Depth 1): undef %10136.sub0_sub1:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 2176, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1294, !noalias !60, addrspace 3)
    def 41272r@BB.18 (LoopHdr BB.16, Depth 1): %10136.sub2_sub3:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 3264, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1295, !noalias !60, addrspace 3)
    use 41788r@BB.18 (LoopHdr BB.16, Depth 1): %9858:vreg_512_align2 = contract V_MFMA_F32_32X32X16_BF16_mac_vgprcd_e64 %10136:vreg_128_align2, %10140:vreg_128_align2, %9858:vreg_512_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
  %10129:VReg_128_Align2, LiveMask 00000000000000FF (4 VGPRs)
...
Live registers with multiple definitions (133 VGPRs):
  %9856:VReg_512_Align2, LiveMask 00000000FFFFFFFF (16 VGPRs)
    def 16544r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def undef %9856.sub0_sub1:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
    def 16592r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub2_sub3:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
    def 16608r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub4_sub5:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
    def 16656r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub6_sub7:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
    def 16672r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub8_sub9:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
    def 16720r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub10_sub11:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
    def 16736r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub12_sub13:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
    def 16784r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub14_sub15:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
    def use 41828r@BB.18 (LoopHdr BB.16, Depth 1): %9856:vreg_512_align2 = contract V_MFMA_F32_...
...
********** INTERVALS **********
...
********** MACHINEINSTRS **********
# Machine code for function _ZN7ck_tile6kentr...
```

[AMDGPU] Use common allUsesAvailableAt implementation [nfc] (#161418)

2025-10-01T14:22:13+00:00

Replace the target specific copy with a call to the generic routine. I
don't spot any differences by eye, and there's nothing in the original
review discussion (#124327) which makes it clear why this was
duplicated.

[TII] Split isTrivialReMaterializable into two versions [nfc] (#160377)

2025-09-25T01:52:17+00:00

This change builds on https://github.com/llvm/llvm-project/pull/160319
which tries to clarify which *callers* (not backends) assume that the
result is actually trivial.

This change itself should be NFC. Essentially, I'm just renaming the
existing isTrivialRematerializable to the non-trivial version and then
adding a new trivial version (with the same name as the prior function)
and simplifying a few callers which want that semantic.

This change does *not* enable non-trivial remat any more broadly than
was already done for our targets which were lying through the old APIs;
that will come separately. The goal here is simply to make the code
easier to follow in terms of what assumptions are being made where.

---------

Co-authored-by: Luke Lau

[AMDGPU][Scheduler] Consistent occupancy calculation during rematerialization (#149224)

2025-08-08T12:26:04+00:00

The `RPTarget`'s way of determining whether VGPRs are beneficial to save
and whether the target has been reached w.r.t. VGPR usage currently
assumes, if `CombinedVGPRSavings` is true, that free slots in one VGPR
RC can always be used for the other. Implicitly, this makes the
rematerialization stage (only current user of `RPTarget`) follow a
different occupancy calculation than the "regular one" that the
scheduler uses, one that assumes that ArchVGPR/AGPR usage can be
balanced perfectly and at no cost, which is untrue in general. This
ultimately yields suboptimal rematerialization decisions that require
cross-VGPR-RC copies unnecessarily.

This fixes that, making the `RPTarget`'s internal model of occupancy
consistent with the regular one. The `CombinedVGPRSavings` flag is
removed, and a form of cross-VGPR-RC saving implemented only for unified
RFs, which is where it makes the most sense. Only when the amount of
free VGPRs in a given VGPR RC (ArchVPGR or AGPR) is lower than the
excess VGPR usage in the other VGPR RC does the `RPTarget` consider that
a pressure reduction in the former will be beneficial to the latter.

[AMDGPU][Scheduler] Delete RegionsWithMinOcc bitvector from scheduler (NFC) (#142361)

2025-08-01T11:20:05+00:00

The `GCNScheduleDAGMILive`'s `RegionsWithMinOcc` bitvector is only used
by the `UnclusteredHighRPStage`. Its presence in the scheduler's state
forces us to maintain its value throughout scheduling even though it is
of no use to the iterative scheduling process itself. At any point
during scheduling it is possible to cheaply compute the occupancy
induced by a particular register pressure. Furthermore, the field
doesn't appear to be updated correctly throughout scheduling i.e., bits
corresponding to regions at minimum occupancy are not always set in the
vector.

This removes the bitvector from `GCNScheduleDAGMILive`.
`UnclusteredHighRPStage::initGCNRegion` now directly computes the
occupancy of possibly reschedulable regions instead of querying the
vector. Since it is the most expensive check, it is done last in the
list.

[MachineScheduler] Make cluster check more efficient (#150884)

2025-08-01T08:00:42+00:00