| Age | Commit message (Collapse) | Author |
|
- Support serialization of the number of allocated preload kernarg SGPRs
- Support serialization of the first preload kernarg SGPR allocated
Together they enable correctly reconstructing MIR with preload kernarg
SGPRs.
|
|
These shuffles can always be implemented using v_perm_b32, and so this
rewrites the analysis from the perspective of "how many v_perm_b32s does
it take to assemble each register of the result?"
The test changes in Transforms/SLPVectorizer/reduction.ll are
reasonable: VI (gfx8) has native f16 math, but not packed math.
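A minimal sketch of that counting perspective, not the cost code from the patch. It assumes byte-sized elements, four bytes per 32-bit result register, and that one v_perm_b32 picks bytes from at most two source registers, so a result register fed by n sources needs n - 1 chained perms. The function name `countPermsForByteShuffle` and the mask encoding are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical model: Mask[I] is the source byte index feeding result byte I,
// or -1 for an undefined byte. Source byte B lives in source register B / 4.
unsigned countPermsForByteShuffle(const std::vector<int> &Mask) {
  const std::size_t BytesPerReg = 4;
  unsigned NumPerms = 0;
  for (std::size_t Base = 0; Base < Mask.size(); Base += BytesPerReg) {
    const std::size_t End = std::min(Base + BytesPerReg, Mask.size());
    std::set<int> SrcRegs; // distinct source registers feeding this result
    bool InPlace = true;   // every defined byte keeps its byte position
    for (std::size_t I = Base; I < End; ++I) {
      if (Mask[I] < 0)
        continue; // undefined bytes cost nothing
      SrcRegs.insert(Mask[I] / 4);
      if (static_cast<std::size_t>(Mask[I] % 4) != I - Base)
        InPlace = false;
    }
    // Fully undefined, or a plain copy of one source register: no perm.
    if (SrcRegs.empty() || (SrcRegs.size() == 1 && InPlace))
      continue;
    // One perm merges two sources; each further source adds one more perm.
    NumPerms += static_cast<unsigned>(
        std::max<std::size_t>(1, SrcRegs.size() - 1));
  }
  return NumPerms;
}
```

Under these assumptions a byte reversal within one register costs one perm, while gathering one byte from each of four registers costs three.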
|
|
Global with invariant should be treated identically to
constant.
|
|
SI_SPILL/SI_RESTORE." (#169068)
The PR causes build failures with expensive checks enabled.
Reverts llvm/llvm-project#168546
|
|
Fix a problem exposed by #166483 using AV classes in more places.
`isVectorRegister` only accepts registers of VGPR or AGPR classes.
`hasVectorRegisters` additionally accepts the combined AV classes.
Fixes: #168761
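The difference between the two queries can be modelled roughly as follows. This is an illustrative sketch only: the `RegClassKind` enum and both function names are hypothetical stand-ins, not the real register-info API.

```cpp
// Hypothetical classification of a register class for illustration.
enum class RegClassKind { SGPR, VGPR, AGPR, AV };

// Models the stricter query: only pure VGPR or AGPR classes qualify.
bool isVectorRegisterKind(RegClassKind K) {
  return K == RegClassKind::VGPR || K == RegClassKind::AGPR;
}

// Models the broader query: the combined AV classes qualify as well.
bool hasVectorRegistersKind(RegClassKind K) {
  return isVectorRegisterKind(K) || K == RegClassKind::AV;
}
```

A check written with the stricter query silently rejects AV-class registers, which is the kind of mismatch that shows up once AV classes are used in more places.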
|
|
Use the default, which freely coalesces anything it can.
This mostly shows improvements, with a handful of regressions.
The main concern would be if introducing wider registers is more
likely to push the register usage up to the next occupancy tier.
|
|
This patch enables the multi-group xnack replay mode by
configuring the hardware MODE register at kernel entry.
This aligns the hardware behavior with the compiler's
existing multi-group s_wait_xcnt insertion logic.
|
|
We previously got a duplicate implicit $exec operand. It didn't really
hurt anything (other than being a slight drag on compile-time
performance). Still, let's keep things clean.
|
|
(#168787)
|
|
Avoids regression which caused the revert 6d5f87fc42.
This is a hack on a hack. We currently have isUniformMMO,
which improperly treats unknown source value as known uniform.
This is a hack from before we had divergence information in the
DAG, and should be removed. This is the minimum change to avoid
the regression; removing the aggressive handling of the unknown
case (or dropping isUniformMMO entirely) are more involved fixes.
|
|
(#168845)
…815)"
This reverts commit dcab4cb49bfb0aa17df3d3fabe582696100e0d35.
|
|
Supported Ops: `fadd`, `fsub`
|
|
Supported Ops: `fmin`, `fmax`
|
|
PreRARematStage builds region live-outs if GCN trackers are enabled. If
rematerialization leads to empty regions, this can cause a crash because
of dereference of an invalid iterator in getLastMIForRegion. The fix is
to skip calling getLastMIForRegion for empty regions.
This patch fixes another bug in the same code region. getLastMIForRegion
calls skipDebugInstructionsBackward which may immediately return the
RegionEnd if it is not the begin instruction and it is a non-debug
instruction. That would imply considering an instruction that is outside
the relevant region. The fix is to always pass the previous of RegionEnd
to skipDebugInstructionsBackward.
This bug was found while using GCN trackers on the existing LIT test
machine-scheduler-sink-trivial-remats.mir. Here's the assertion failure.
llvm-project/llvm/include/llvm/ADT/ilist_iterator.h:168:
llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::reference
llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::operator*() const
[with OptionsT = llvm::ilist_detail::node_options<llvm::MachineInstr,
true, true, void, false, void>; bool IsReverse = false; bool IsConst =
false; llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::reference =
llvm::MachineInstr&]: Assertion `!NodePtr->isKnownSentinel()' failed.
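The two fixes can be sketched on a plain `std::list`, assuming a region is the half-open range [RegionBegin, RegionEnd). The `Instr` type and the function name are hypothetical; this is a model of the described logic, not the code in the scheduler.

```cpp
#include <iterator>
#include <list>
#include <string>

// Hypothetical stand-in for a machine instruction.
struct Instr {
  std::string Name;
  bool IsDebug;
};
using InstrList = std::list<Instr>;
using Iter = InstrList::const_iterator;

// Returns the last non-debug instruction in [RegionBegin, RegionEnd), or
// RegionEnd if the region is empty or holds only debug instructions.
Iter lastNonDebugInRegion(Iter RegionBegin, Iter RegionEnd) {
  // Fix 1: an empty region has no last instruction, so never step back.
  if (RegionBegin == RegionEnd)
    return RegionEnd;
  // Fix 2: start at the element before RegionEnd, so the instruction at
  // RegionEnd itself (outside the region) is never considered.
  Iter It = std::prev(RegionEnd);
  while (It->IsDebug) {
    if (It == RegionBegin)
      return RegionEnd; // only debug instructions in the region
    --It;
  }
  return It;
}
```

Starting the backward walk at RegionEnd would return the non-debug instruction that follows the region, and on an empty region it could step onto the list sentinel, which matches the assertion above.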
|
|
If we have 1024 VGPRs available, we need to prioritize the allocation
of registers for operands that can only use the low 256.
Notably, these are the scale operands of V_WMMA_SCALE instructions.
Otherwise large tuples will be allocated first and take all the low
registers, so we would have to spill to make room for these
scale registers.
Allocation priority by itself does not eliminate spilling completely
in large kernels, although it helps to some degree. Increasing the spill
weight of a restricted class on top of it helps.
|
|
Currently LibcallLoweringInfo is defined inside of TargetLowering,
which is owned by the subtarget. Pass in the subtarget so we can
construct LibcallLoweringInfo with the subtarget. This is a temporary
step that should be revertable in the future, after LibcallLoweringInfo
is moved out of TargetLowering.
|
|
Note that llvm::size only works on types that allow std::distance in
O(1).
|
|
Remove leftover implicit operands from SI_SPILL/SI_RESTORE.
---------
Signed-off-by: John Lu <John.Lu@amd.com>
|
|
For flat memory instructions where the address is supplied as a base address
register with an immediate offset, the memory aperture test ignores the
immediate offset. Currently, SDISel does not respect that, which leads to
miscompilations where valid input programs crash when the address computation
relies on the immediate offset to get the base address in the proper memory
aperture. Global or scratch instructions are not affected.
This patch only selects flat instructions with immediate offsets from PTRADD
address computations with the inbounds flag: If the PTRADD does not leave the
bounds of the allocated object, it cannot leave the bounds of the memory
aperture and is therefore safe to handle with an immediate offset.
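The selection rule described above can be sketched as a predicate. All names here (`AddrKind`, `PtrAdd`, `canFoldImmOffset`) are hypothetical, and the sketch ignores other legality checks such as whether the offset fits the instruction encoding.

```cpp
#include <cstdint>

// Hypothetical instruction families that can carry an immediate offset.
enum class AddrKind { Flat, Global, Scratch };

// Hypothetical model of a pointer-add node feeding a memory access.
struct PtrAdd {
  std::int64_t Offset; // constant offset added to the base pointer
  bool InBounds;       // result stays within the allocated object
};

// Decides whether the constant offset may be moved into the immediate field.
bool canFoldImmOffset(AddrKind Kind, const PtrAdd &Add) {
  // Global and scratch instructions are not affected by the aperture issue.
  if (Kind != AddrKind::Flat)
    return true;
  // For flat, the aperture test sees only the base register. If the add is
  // inbounds, base and base + offset lie in the same object and therefore in
  // the same aperture, so folding is safe.
  return Add.InBounds;
}
```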
Affected tests:
- CodeGen/AMDGPU/fold-gep-offset.ll: Offsets are no longer wrongly folded, added
new positive tests where we still do fold them.
- CodeGen/AMDGPU/infer-addrspace-flat-atomic.ll: Offset folding doesn't seem
integral to this test, so the test is not changed to make offset folding still
happen.
- CodeGen/AMDGPU/loop-prefetch-data.ll: loop-reduce transforms inbounds
addresses for accesses to be based on potentially OOB addresses used for
prefetching.
- I think the remaining ones suffer from the limited preservation of the
inbounds flag in PTRADD DAGCombines due to the provenance problems pointed out
in PR #165424 and the fact that
`AMDGPUTargetLowering::SplitVector{Load|Store}` legalizes too-wide accesses by
repeatedly splitting them in half. Legalizing a V32S32 memory access
therefore leads to inbounds ptradd chains like (ptradd inbounds (ptradd
inbounds (ptradd inbounds P, 64), 32), 16). The DAGCombines fold them into a
single ptradd, but the involved transformations generally cannot preserve the
inbounds flag (even though it would be valid in this case).
Similar previous PR that relied on `ISD::ADD inbounds` instead of `ISD::PTRADD inbounds` (closed): #132353
Analogous PR for GISel (merged): #153001
Fixes SWDEV-516125.
|
|
(#168500)
Do not add latency for wavefront and singlethread scope fences during
barrier latency DAG mutation.
These scopes do not typically introduce any latency and adjusting
schedules based on them significantly impacts latency hiding.
|
|
A follow-up to #168458.
|
|
its lower 32-bit (#168458)
On some targets, a packed f32 instruction can only read 32 bits from a
scalar operand (SGPR or literal) and replicates the bits to both
channels. In this case, we should not fold an immediate value if it
can't be replicated from its lower 32 bits.
Fixes SWDEV-567139.
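A minimal sketch of the replication check, assuming the immediate for both f32 channels is held as one 64-bit value. The function name is hypothetical and this is not the code from the patch.

```cpp
#include <cstdint>

// True if the 64-bit immediate is its low 32 bits repeated in both halves,
// i.e. the value a 32-bit scalar read replicated to both channels would give.
bool isReplicatedFromLo32(std::uint64_t Imm) {
  const std::uint32_t Lo = static_cast<std::uint32_t>(Imm);
  const std::uint32_t Hi = static_cast<std::uint32_t>(Imm >> 32);
  return Lo == Hi;
}
```

For example, a pair of 1.0f values (0x3F800000 in both halves) passes, while 1.0f in the low channel and 2.0f in the high channel does not, so folding it would change the high channel.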
|
|
Fixes unsigned int underflows in
`MFMASmallGemmSingleWaveOpt::applyIGLPStrategy`.
|
|
In general, "Flat instructions look at the per-workitem address and
determine for each work item if the target memory address is in global,
private or scratch memory." (RDNA2 ISA) That means that FLAT
instructions need to be considered for VMEM hazards even without
"specific segment". Also, LDS DMA should be considered for LDS hazard
detection.
See also #137148
|
|
While I am at it, this patch switches to the constructor that takes
a container instead of a pair of begin/end.
Identified with readability-const-return-type.
|
|
Also breaks the long inheritance chains by making both
`SIGfx10CacheControl` and `SIGfx12CacheControl` inherit from
`SICacheControl` directly.
With this patch, we now have just 3 `SICacheControl` implementations
that each do their own thing, and there is no more code hidden 3
superclasses above (which made this code harder to read and maintain
than it needed to be).
|
|
LDS block size should be 2048 bytes (512 dwords) based on current spec.
|
|
AMDGPUMCExpr lives in the MC layer, so it should not depend on Function.h
or GCNSubtarget.h.
Move the function that needed GCNSubtarget to the one file that called
it.
|
|
Merge the following classes into `SIGfx6CacheControl`:
- SIGfx7CacheControl
- SIGfx90ACacheControl
- SIGfx940CacheControl
They were all very similar and had a lot of duplicated boilerplate just
to implement one or two codegen differences. GFX90A/GFX940 have a bit
more differences, but they're still manageable under one class because
the general behavior is the same.
This removes 500 lines of code and puts everything into a single place
which I think makes it a lot easier to maintain, at the cost of a slight
increase in complexity for some functions.
There is still a lot of room for improvement but I think this patch is
already big enough as is and I don't want to bundle too much into one
review.
|
|
(#165692)
This PR introduces `amdgpu-lower-exec-sync` pass which specifically
lowers named-barrier LDS globals introduced by #114550.
Changes include:
- Moving the logic of lowering named-barrier LDS globals from the
`amdgpu-lower-module-lds` pass to this new pass.
- Adding the pass to the pipeline and removing the existing lowering logic for
named-barrier LDS in `amdgpu-lower-module-lds`.
See #161827 for discussion on this topic.
|
|
This allows SDNodes to be validated against their expected type profiles
and reduces the number of changes required to add a new node.
Autogenerated node names start with "AMDGPUISD::", hence the changes in
the tests.
The few nodes defined in R600.td are *not* imported because TableGen
processes AMDGPU.td, which doesn't include R600.td. Ideally, we would have
two sets of nodes, but that would require careful reorganization of td
files since some nodes are shared between AMDGPU/R600. Not sure if it
is something worth looking into.
Some nodes fail validation; those are listed in
`AMDGPUSelectionDAGInfo::verifyTargetNode()`.
Part of #119709.
Pull Request: https://github.com/llvm/llvm-project/pull/168248
|
|
When shrinking and/or to bitset* remove leftover implicit scc def.
bitset* instructions do not set scc.
Signed-off-by: John Lu <John.Lu@amd.com>
|
|
The main improvement is to the mfma tests. There are some
mild regressions scattered around, and a few major ones.
The worst regressions are in some of the bitcast tests;
these are cases where the SGPR argument list runs out
and uses VGPRs, and the copies-from-VGPR are misidentified
as divergent. Most of the shufflevector tests are also
regressions. These end up with cleaner MIR, but then get poor
regalloc decisions.
|
|
This probably should have turned into a regular integer constant
earlier. This is to defend against future regressions.
|
|
Handle this for consistency with the zext case.
|
|
Some cases are relying on SIFixSGPRCopies to force VALU
reg_sequence inputs with SGPR inputs to use all VGPR inputs,
but this doesn't always happen if the reg_sequence isn't
invalid. Make sure we use a vgpr up-front here so we don't
rely on something later.
|
|
`getLanesWithProperty()` is called with virtual registers only.
|
|
These instructions use `src0`, `imm`, `src1` as operands.
Fixes SWDEV-566579.
|
|
The ds_gws_* instructions require gds as an operand. However, when nogds
is given, it is treated the same as gds. This patch fixes this to
disallow nogds.
|
|
This replaces the 2 bool flags and the anonymous union. This also
removes an implicit conversion from Register to unsigned and a call to
MCRegister::id().
The ArgDescriptor constructor was always assigning the union through the
MCRegister field even for stack offsets.
The change to SIMachineFunctionInfo.h fixes a case where getRegister was
being called on an unset ArgDescriptor. Since it was only this case, it
seemed cleaner to fix it at the caller. The other option would be to
make getRegister() return MCRegister() for an unset ArgDescriptor.
|
|
(#168017)
|
|
Fixes another verifier error after introducing AV registers.
Also fixes not clearing the subregister index if there was
one.
|
|
Reduces memory usage compiling backend sources, most notably for
AMDGPU by ~98 MB per source on average.
AMDGPUGenRegisterInfo.inc is tens of megabytes in size now, and
is even larger downstream. At the same time, it is included in
nearly all backend sources, typically just for a small portion of
its content, resulting in compilation being unnecessarily
memory-hungry, which in turn stresses buildbots and wastes their
resources.
Splitting .inc files also helps avoid extra ccache misses
where changes in .td files don't cause changes in all parts of
what previously was a single .inc file.
It is thought that rather than building on top of the current
single-output-file design of TableGen, e.g., using `split-file`,
it would be preferable to recognise the need for multi-file
outputs and give it proper first-class support directly in
TableGen.
|
|
Ensure SCC is not live before shrinking s_and*/s_or* instructions to
s_bitset*.
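The shrinking condition can be sketched as follows, assuming 32-bit operations with an immediate mask. The enum and function names are hypothetical, and this is an illustration of the rule rather than the pass's code.

```cpp
#include <cstdint>

// Hypothetical outcome of trying to shrink a scalar and/or with an immediate.
enum class ShrinkResult { None, BitSet0, BitSet1 };

// True if exactly one bit of V is set.
bool hasSingleBitSet(std::uint32_t V) { return V != 0 && (V & (V - 1)) == 0; }

// IsAnd selects the s_and case, otherwise the s_or case. SCCLiveAfter says
// whether the SCC value defined by the original instruction is still read.
ShrinkResult classifyShrink(bool IsAnd, std::uint32_t Imm, bool SCCLiveAfter) {
  // s_and/s_or define SCC but s_bitset* does not, so a live SCC blocks the
  // rewrite.
  if (SCCLiveAfter)
    return ShrinkResult::None;
  if (IsAnd) // and with a mask that clears exactly one bit
    return hasSingleBitSet(~Imm) ? ShrinkResult::BitSet0 : ShrinkResult::None;
  // or with a mask that sets exactly one bit
  return hasSingleBitSet(Imm) ? ShrinkResult::BitSet1 : ShrinkResult::None;
}
```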
---------
Signed-off-by: John Lu <John.Lu@amd.com>
|
|
In true16 mode, D16 insts are first lowered to a pseudo t16 inst, and then
lowered to the hi/lo inst in MC lowering using the D16T16 table.
However, the D16T16 table maps both `flat_load_d16_t16` and
`flat_load_d16_t16_saddr` to `flat_load_d16_(hi)_b16`, which is wrong.
The saddr pseudo inst `flat_load_d16_t16_saddr` should be selected to the
saddr hi/lo inst.
The global/scratch variants are correct; flat seems to be the only one
with this issue.
|