llvm-project.git/llvm/test/CodeGen/AMDGPU/loop-prefetch-data.ll, branch main

[AMDGPU] Enable multi-group xnack replay in hardware (GFX1250) (#169016)

2025-11-21T14:12:17+00:00

This patch enables the multi-group xnack replay mode by
configuring the hardware MODE register at kernel entry.
This aligns the hardware behavior with the compiler's
existing multi-group s_wait_xcnt insertion logic.

[AMDGPU][SDAG] Only fold flat offsets if they are inbounds PTRADDs (#165427)

2025-11-19T10:48:12+00:00

For flat memory instructions where the address is supplied as a base address
register with an immediate offset, the memory aperture test ignores the
immediate offset. Currently, SDISel does not respect that, which leads to
miscompilations where valid input programs crash when the address computation
relies on the immediate offset to get the base address in the proper memory
aperture. Global or scratch instructions are not affected.

This patch only selects flat instructions with immediate offsets from PTRADD
address computations with the inbounds flag: If the PTRADD does not leave the
bounds of the allocated object, it cannot leave the bounds of the memory
aperture and is therefore safe to handle with an immediate offset.

Affected tests:

- CodeGen/AMDGPU/fold-gep-offset.ll: Offsets are no longer wrongly folded, added
  new positive tests where we still do fold them.
- CodeGen/AMDGPU/infer-addrspace-flat-atomic.ll: Offset folding doesn't seem
  integral to this test, so the test is not changed to make offset folding still
  happen.
- CodeGen/AMDGPU/loop-prefetch-data.ll: loop-reduce transforms inbounds
  addresses for accesses to be based on potentially OOB addresses used for
  prefetching.
- I think the remaining ones suffer from the limited preservation of the
  inbounds flag in PTRADD DAGCombines due to the provenance problems pointed out
  in PR #165424 and the fact that
  `AMDGPUTargetLowering::SplitVector{Load|Store}` legalizes too-wide accesses by
  repeatedly splitting them in half.  Legalizing a V32S32 memory accesses
  therefore leads to inbounds ptradd chains like (ptradd inbounds (ptradd
  inbounds (ptradd inbounds P, 64), 32), 16). The DAGCombines fold them into a
  single ptradd, but the involved transformations generally cannot preserve the
  inbounds flag (even though it would be valid in this case).

Similar previous PR that relied on `ISD::ADD inbounds` instead of `ISD::PTRADD inbounds` (closed): #132353
Analogous PR for GISel (merged): #153001

Fixes SWDEV-516125.

[AMDGPU] Enable XNACK on gfx1250 (#161457)

2025-10-03T15:04:55+00:00

This should be always on.

Fixes SWDEV-555931.

[AMDGPU][SDAG] Enable ISD::PTRADD for 64-bit AS by default (#146076)

2025-10-02T08:06:25+00:00

Also removes the command line option to control this feature.

There seem to be mainly two kinds of test changes:
- Some operands of addition instructions are swapped; that is to be expected
  since PTRADD is not commutative.
- Improvements in code generation, probably because the legacy lowering enabled
  some transformations that were sometimes harmful.

For SWDEV-516125.

[AMDGPU][gfx1250] Remove SCOPE_SE for scratch stores (#157640)

2025-09-10T09:03:58+00:00

[LICM] Do not reassociate constant offset GEP (#151492)

2025-08-01T07:43:15+00:00

LICM tries to reassociate GEPs in order to hoist an invariant GEP.
Currently, it also does this in the case where the GEP has a constant
offset.

This is usually undesirable. From a back-end perspective, constant GEPs
are usually free because they can be folded into addressing modes, so
this just increases register pressume. From a middle-end perspective,
keeping constant offsets last in the chain makes it easier to analyze
the relationship between multiple GEPs on the same base, especially
after CSE.

The worst that can happen here is if we start with something like

```
loop {
   p + 4*x
   p + 4*x + 1
   p + 4*x + 2
   p + 4*x + 3
}
```

And LICM converts it into:

```
p.1 = p + 1
p.2 = p + 2
p.3 = p + 3
loop {
   p + 4*x
   p.1 + 4*x
   p.2 + 4*x
   p.3 + 4*x
}
```

Which is much worse than leaving it for CSE to convert to:
```
loop {
   p2 = p + 4*x
   p2 + 1
   p2 + 2
   p2 + 3
}
```

[AMDGPU] Regenerate test checks (NFC)

2025-07-31T15:02:48+00:00

[AMDGPU] Select VMEM prefetch for llvm.prefetch on gfx1250 (#150493)

2025-07-24T20:22:50+00:00

We have a choice to use a scalar or vector prefetch for an uniform
pointer. Since we do not have scalar stores our scalar cache is
practically readonly. The rw argument of the prefetch intrinsic is
used to force vector operation even for an uniform case. On GFX12
scalar prefetch will be used anyway, it is still useful but it will
only bring data to L2.

[AMDGPU] Add safe-smem-prefetch SubtargetFeature off by default (#130050)

2025-03-07T10:10:21+00:00

S_PREFETCH_* instructions may cause host to terminate process in case of
the invalid address.

Co-authored-by: Stanislav Mekhanoshin

Reapply "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)"

2024-11-09T01:21:16+00:00

This reverts commit ca33649abe5fad93c57afef54e43ed9b3249cd86.