| Age | Commit message | Author |
|
I think we need to keep the SelectionDAG code for volatile load/store so
we should support 4 byte alignment when possible.
|
|
(#168661)"
This reverts commit 0859ac5866a0228f5607dd329f83f4a9622dedcc.
This caused a couple test failures, likely due to a mid-air collision.
Reverting for now to get the tree back to green and allow the original
author to run UTC/friends and verify the output.
|
|
This may be a bug introduced by commit
6749ae36b4a33769e7a77cf812d7cd0a908ae3b9 that has been present ever
since.
In this case, `OtherReg` always overlaps with `DstReg` because both come
from the `Copy`.
|
|
- Support serialization of the number of allocated preload kernarg SGPRs
- Support serialization of the first preload kernarg SGPR allocated
Together they enable correctly reconstructing MIR with preload kernarg
SGPRs.
|
|
truncation exists (#169022)
Fixes #169017
|
|
This works fine on main, but broke after a future patch.
|
|
SI_SPILL/SI_RESTORE." (#169068)
PR causes build failures with expensive checks enabled
Reverts llvm/llvm-project#168546
|
|
chains (#168660)
Previously, the following:
    %mul0 = mul nsw <8 x i32> %m00, %m01
    %mul1 = mul nsw <8 x i32> %m10, %m11
    %add0 = add <8 x i32> %mul0, splat (i32 32)
    %add1 = add <8 x i32> %add0, %mul1
lowered to:
    vsetivli zero, 8, e32, m2, ta, ma
    vmul.vv v8, v8, v9
    vmacc.vv v8, v11, v10
    li a0, 32
    vadd.vx v8, v8, a0
After this patch, it now lowers to:
    li a0, 32
    vsetivli zero, 8, e32, m2, ta, ma
    vmv.v.x v12, a0
    vmadd.vv v8, v9, v12
    vmacc.vv v8, v11, v10
Modeled on 0cc981e0 from the AArch64 backend.
C-code for the example case (`clang -O3 -S -mcpu=sifive-x280`):
```
int madd_fail(int a, int b, int * restrict src, int * restrict dst, int loop_bound) {
  for (int i = 0; i < loop_bound; i += 2) {
    dst[i] = src[i] * a + src[i + 1] * b + 32;
  }
}
```
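As a rough check that the two instruction sequences above compute the same value, the following Python sketch models each vector register as a list of wrapping 32-bit lanes. The helper names and the mapping of `%m00`, `%m01`, `%m10`, `%m11` onto `v8`, `v9`, `v10`, `v11` are assumptions made for illustration; only the `vmacc.vv` (`vd = vs1 * vs2 + vd`) and `vmadd.vv` (`vd = vs1 * vd + vs2`) semantics are taken from the RISC-V vector specification.

```python
MASK32 = 0xFFFFFFFF  # lanes wrap modulo 2**32


def vmul(vs2, vs1):
    # vmul.vv vd, vs2, vs1: vd = vs2 * vs1
    return [(x * y) & MASK32 for x, y in zip(vs2, vs1)]


def vmacc(vd, vs1, vs2):
    # vmacc.vv vd, vs1, vs2: vd = vs1 * vs2 + vd
    return [(x * y + d) & MASK32 for d, x, y in zip(vd, vs1, vs2)]


def vmadd(vd, vs1, vs2):
    # vmadd.vv vd, vs1, vs2: vd = vs1 * vd + vs2
    return [(x * d + y) & MASK32 for d, x, y in zip(vd, vs1, vs2)]


def vadd_vx(vs2, rs1):
    # vadd.vx vd, vs2, rs1: vd = vs2 + splat(rs1)
    return [(x + rs1) & MASK32 for x in vs2]


def old_lowering(m00, m01, m10, m11):
    v8 = vmul(m00, m01)          # vmul.vv  v8, v8, v9
    v8 = vmacc(v8, m11, m10)     # vmacc.vv v8, v11, v10
    return vadd_vx(v8, 32)       # vadd.vx  v8, v8, a0


def new_lowering(m00, m01, m10, m11):
    v12 = [32] * len(m00)        # vmv.v.x  v12, a0
    v8 = vmadd(m00, m01, v12)    # vmadd.vv v8, v9, v12
    return vmacc(v8, m11, m10)   # vmacc.vv v8, v11, v10
```

Both functions return the same lanes for any inputs, because addition is associative and commutative modulo 2**32; the new form only folds the constant add into the first multiply-add.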
|
|
ARM relies on deprecated TableGen behavior of guessing instruction
properties from patterns (`def ARM : Target` doesn't have
`guessInstructionProperties` set to false).
Before #168209, TableGen conservatively guessed that `t2WhileLoopSetup`
has side effects because the instruction wasn't matched by any pattern.
After the patch, TableGen guesses it has no side effects because the
added pattern uses only `arm_wlssetup` node, which has no side effects.
Add `SDNPSideEffect` to the node so that TableGen guesses the property
right, and also `hasSideEffects = 1` to the instruction in case ARM ever
sets `guessInstructionProperties` to false.
|
|
Fix a problem exposed by #166483 using AV classes in more places.
`isVectorRegister` only accepts registers of VGPR or AGPR classes.
`hasVectorRegisters` additionally accepts the combined AV classes.
Fixes: #168761
|
|
Use the default, which freely coalesces anything it can.
This mostly shows improvements, with a handful of regressions.
The main concern would be if introducing wider registers is more
likely to push the register usage up to the next occupancy tier.
|
|
|
|
This patch enables the multi-group xnack replay mode by
configuring the hardware MODE register at kernel entry.
This aligns the hardware behavior with the compiler's
existing multi-group s_wait_xcnt insertion logic.
|
|
(#163561)
This change adds dense and sparse MMA intrinsics with block scaling. The
implementation is based on [PTX ISA version
9.0](https://docs.nvidia.com/cuda/parallel-thread-execution/). Tests for
new intrinsics are added for PTX 8.7 and SM 120a and are generated by
`llvm/test/CodeGen/NVPTX/wmma-ptx87-sm120a.py`. The tests have been
verified with ptxas from CUDA-13.0 release.
Dense MMA intrinsics with block scaling were supported by
@schwarzschild-radius.
|
|
adding the vector {1, 1, 1, 1} (#160882)
This patch optimizes vector addition operations involving **`all-ones`**
vectors by leveraging the generation of vectors of -1s (using `xxleqv`),
which is cheaper than generating vectors of 1s (`vspltisw`). These are
the respective vector types.
`v2i64`: **`A + vector {1, 1}`**
`v4i32`: **`A + vector {1, 1, 1, 1}`**
`v8i16`: **`A + vector {1, 1, 1, 1, 1, 1, 1, 1}`**
`v16i8`: **`A + vector {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1}`**
The optimized version replaces `vspltisw (4 cycles)` with `xxleqv (2
cycles)` using the following identity:
`A - (-1) = A + 1`.
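The identity holds lane by lane in wrapping arithmetic for every element width listed above. A small Python sketch, with hypothetical helper names, that models the all-ones bit pattern (what `xxleqv` of a register with itself produces) as the lane mask:

```python
def add_splat_one(lanes, bits):
    """Reference: A + splat(1), wrapping at the lane width."""
    mask = (1 << bits) - 1
    return [(x + 1) & mask for x in lanes]


def sub_splat_minus_one(lanes, bits):
    """Rewritten form: A - splat(-1), wrapping at the lane width."""
    mask = (1 << bits) - 1
    all_ones = mask  # the bit pattern of -1 in a `bits`-wide lane
    return [(x - all_ones) & mask for x in lanes]
```

This only checks the arithmetic; the cycle counts quoted above come from the commit message and are not modeled here.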
---------
Co-authored-by: himadhith <himadhith.v@ibm.com>
Co-authored-by: Tony Varghese <tonypalampalliyil@gmail.com>
|
|
This commit implements a complete load/store optimization pass for the
RISC-V Zilsd extension, which combines pairs of 32-bit load/store
instructions into single 64-bit LD/SD instructions when possible.
The default required alignment is 8 bytes; the patch also provides a
`zilsd-4byte-align` feature for a looser condition.
Related work: https://reviews.llvm.org/D144002
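The alignment rule described above can be pictured with a small Python predicate. This is only a sketch under stated assumptions: the `Access` record, the `can_pair` name, and the requirement that the two accesses share a base register and sit 4 bytes apart are illustrative and not taken from the pass itself, which has further constraints that are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    base: str    # base register name (hypothetical representation)
    offset: int  # byte offset from the base
    align: int   # known alignment of the access, in bytes


def can_pair(first, second, four_byte_align=False):
    """Return True if two 32-bit accesses could form one 64-bit LD/SD.

    Assumes `first` is the lower-addressed access. The alignment
    threshold is 8 by default and 4 when the looser feature is enabled.
    """
    required = 4 if four_byte_align else 8
    return (
        first.base == second.base
        and second.offset == first.offset + 4
        and first.align >= required
    )
```

For example, `can_pair(Access("a0", 0, 4), Access("a0", 4, 4))` is rejected by default and accepted with `four_byte_align=True`.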
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
|
|
|
We previously got a duplicate implicit $exec operand. It didn't really
hurt anything (other than being a slight drag on compile-time
performance). Still, let's keep things clean.
|
|
which reassigns scale operand in vgpr_32 register to agpr_32, not
permitted by instruction format. Reduced from ck.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Co-authored-by: theRonShark <ron.lieberman@amd.com>
|
|
(#168787)
|
|
(#168786)
This reverts commit 6d5f87fc4284c4c22512778afaf7f2ba9326ba7b.
Previously this failed due to treating the unknown MachineMemOperand
value as known uniform.
|
|
|
|
Resolves #165694
|
|
Avoids regression which caused the revert 6d5f87fc42.
This is a hack on a hack. We currently have isUniformMMO,
which improperly treats unknown source value as known uniform.
This is a hack from before we had divergence information in the
DAG, and should be removed. This is the minimum change to avoid
the regression; removing the aggressive handling of the unknown
case (or dropping isUniformMMO entirely) are more involved fixes.
|
|
|
|
Adds the fwidth intrinsic for HLSL.
The DXIL path only requires modification to the HLSL headers.
The SPIRV path implements the OpFwidth builtin in Clang and instruction
selection for the OpFwidth instruction in LLVM.
Also adds shader stage tests for the ddx_coarse and ddy_coarse
instructions used by fwidth.
Closes #99120
---------
Co-authored-by: Alexander Johnston <alexander.johnston@amd.com>
|
|
lanes are in use. (#168387)
While SVE support for exception safe floating point code generation is
bare bones, we try to ensure inactive lanes remain inert. I mistakenly
broke this rule when adding support for SVE-B16B16 by lowering some
bfloat operations of unpacked vectors to unpredicated instructions.
|
|
This allows DemandedBits to see that the SVE CNTP intrinsic will only
ever produce small positive integers. The maximum value you could get
here is 256, which is CNTP on an nxv16i1 on a machine with a 2048-bit
vector size (the maximum for SVE).
Using this, various redundant operations (zexts, sexts, ands, ors, etc.)
can be eliminated.
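The bound of 256 follows from the SVE limits: a 2048-bit vector is 16 granules of 128 bits, and an nxv16i1 predicate has 16 lanes per granule. The Python sketch below works through that arithmetic and derives which high bits of the result must be zero. The function names are hypothetical, the 64-bit result width is an assumption, and the range the patch actually reports may be computed differently.

```python
SVE_MAX_VECTOR_BITS = 2048
SVE_GRANULE_BITS = 128


def max_cntp(min_lanes):
    """Largest count CNTP can return for an nxv<min_lanes>i1 predicate."""
    max_vscale = SVE_MAX_VECTOR_BITS // SVE_GRANULE_BITS  # 16
    return min_lanes * max_vscale


def known_zero_mask(min_lanes, result_bits=64):
    """Bits of the result that are zero for every value in [0, max_cntp]."""
    needed = max_cntp(min_lanes).bit_length()
    return ((1 << result_bits) - 1) & ~((1 << needed) - 1)
```

With nxv16i1 the count fits in 9 bits, so every bit from bit 9 upward is known zero, which is what lets a following zext, sext, or mask be dropped.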
|
|
(#168845)
…815)"
This reverts commit dcab4cb49bfb0aa17df3d3fabe582696100e0d35.
|
|
See if we can create a vector load from the src elements in reverse and
then shuffle these back into place.
SLP will (usually) catch this in the middle-end, but there are a few
BUILD_VECTOR scalarizations etc. that appear during DAG legalization.
I did start looking at a more general permute fold, but I haven't found
any good test examples for this yet - happy to take another look if
somebody has examples.
|
|
Treat it in the same manner as zero_extend_vector_inreg and generate an
extend_low_u if possible. This is to try to prevent expensive shuffles
from being generated instead. computeKnownBitsForTargetNode has also
been updated to specify known zeros on extend_low_u.
|
|
Supported Ops: `fadd`, `fsub`
|
|
Supported Ops: `fmin`, `fmax`
|
|
The default DAG combiner combines a BUILD_VECTOR with identical elements
into a SPLAT_VECTOR, so we can just map a constant splat to PLI if possible.
|
|
(#168777)
This prevents it from being optimized out in non-asserts builds.
Update X86 test to remove REQUIRES: asserts and check for LLVM ERROR.
Add FileCheck to RISC-V test and remove UNSUPPORTED.
This is the more complete fix for #168772 and #168525.
|
|
In RISCVVLOptimizer we first compute all the demanded VLs, then we walk
backwards through the function and try to reduce any VLs.
We don't actually need to walk backwards anymore since after #124530 the
order in which we modify the instructions doesn't matter.
This patch changes it to just iterate over the instructions with a
demanded VL computed, which means we don't iterate over scalar
instructions etc.
This also fixes #168665, where we triggered an assert on instructions
with a dead $vxsat implicit-def:
dead %x:vr = PseudoVSADDU_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */,
0 /* tu, mu */, implicit-def dead $vxsat
Because $vxsat is a reserved register, DeadMachineInstructionElim won't
remove it and the instruction makes it to RISCVVLOptimizer.
And because the def of %x is dead, we don't reach this instruction in
the dataflow analysis. This instruction returns true for isCandidate, so
we would try to look up its demanded VL, which doesn't exist, and assert.
But with this patch we don't try to reduce instructions that aren't in
DemandedVLs, which fixes the crash.
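The difference between the two iteration strategies can be sketched in Python with a plain dictionary standing in for the demanded-VL map. All names here are hypothetical and the `KeyError` merely stands in for the assertion; this is a toy model of the control flow described above, not the pass's code.

```python
def reduce_vls_old(instructions, demanded_vls, is_candidate):
    """Walk every instruction backwards and look up each candidate."""
    reduced = []
    for mi in reversed(instructions):
        if is_candidate(mi):
            # Raises KeyError for a candidate the dataflow never reached.
            reduced.append((mi, demanded_vls[mi]))
    return reduced


def reduce_vls_new(demanded_vls, is_candidate):
    """Visit only instructions that have a demanded VL computed."""
    return [(mi, vl) for mi, vl in demanded_vls.items() if is_candidate(mi)]
```

In the new form an instruction with a dead def is simply never visited, and scalar instructions are skipped for free.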
|
|
This is the fixed version of
https://github.com/llvm/llvm-project/pull/163019
|
|
Attempt to only define used subregisters when creating IMPLICIT_DEF fix
ups for live interval subranges. This avoids the appearance at the MIR
level of entire (wide) registers becoming live rather than relying only
on transient LiveIntervals dead definitions for unused subregisters.
|
|
PreRARematStage builds region live-outs if GCN trackers are enabled. If
rematerialization leads to empty regions, this can cause a crash because
of dereference of an invalid iterator in getLastMIForRegion. The fix is
to skip calling getLastMIForRegion for empty regions.
This patch fixes another bug in the same code region. getLastMIForRegion
calls skipDebugInstructionsBackward which may immediately return the
RegionEnd if it is not the begin instruction and it is a non-debug
instruction. That would imply considering an instruction that is outside
the relevant region. The fix is to always pass the instruction preceding
RegionEnd to skipDebugInstructionsBackward.
This bug was found while using GCN trackers on the existing LIT test
machine-scheduler-sink-trivial-remats.mir. Here's the assertion failure.
llvm-project/llvm/include/llvm/ADT/ilist_iterator.h:168:
llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::reference
llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::operator*() const
[with OptionsT = llvm::ilist_detail::node_options<llvm::MachineInstr,
true, true, void, false, void>; bool IsReverse = false; bool IsConst =
false; llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::reference =
llvm::MachineInstr&]: Assertion `!NodePtr->isKnownSentinel()' failed.
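Both problems can be reproduced in a toy Python model where a block is a list of instruction names and a region is the half-open index range `[begin, end)`. The function names and the `DBG_` prefix test are assumptions for illustration; `skip_debug_backward` only mirrors the behavior described above (step back while the current instruction is a debug instruction and the begin has not been reached).

```python
def is_debug(inst):
    return inst.startswith("DBG_")


def skip_debug_backward(block, it, begin):
    """Step backwards from `it` over debug instructions, stopping at `begin`."""
    while it != begin and is_debug(block[it]):
        it -= 1
    return it


def last_mi_buggy(block, begin, end):
    # Starts at RegionEnd itself, so a non-debug instruction at `end`
    # is returned even though it lies outside [begin, end).
    return skip_debug_backward(block, end, begin)


def last_mi_fixed(block, begin, end):
    if begin == end:
        return None  # empty region: there is no last instruction
    # Start from the instruction preceding RegionEnd.
    return skip_debug_backward(block, end - 1, begin)
```

For the block `["A", "B", "DBG_VALUE", "C"]` and the region `[0, 3)`, the buggy variant returns index 3 (`"C"`, outside the region) while the fixed variant returns index 1 (`"B"`).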
|
|
If we have 1024 VGPRs available we need to give priority to the
allocation of registers for operands that can only use the low 256.
That is notably the scale operands of V_WMMA_SCALE instructions.
Otherwise large tuples will be allocated first and take all low
registers, so we would have to spill to make room for these
scale registers.
Allocation priority itself does not eliminate spilling completely
in large kernels, although it helps to some degree. Increasing the spill
weight of a restricted class on top of it helps.
|
|
|
|
|
|
The previous implementation of `combineOp_VLToVWOp_VL` manually replaced
old nodes with newly created widened nodes, but only added the new node
itself to the `DAGCombiner` worklist. Since the users of the new node were
not added, some combine opportunities could be missed when external
`DAGCombiner` passes expected those users to be reconsidered.
This patch replaces the custom replacement logic with a call to
`DCI.CombineTo()`, which performs node replacement in a way consistent
with `DAGCombiner::Run`:
- Replace all uses of the old node.
- Add the new node and its users to the worklist.
- Clean up unused nodes when appropriate.
Using `CombineTo` ensures that `combineOp_VLToVWOp_VL` behaves
consistently with the standard `DAGCombiner` update model, avoiding
discrepancies between the private worklist inside this routine and the
global worklist managed by the combiner.
This resolves missed combine cases involving VL -> VW operator widening.
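The worklist difference described above can be illustrated with a toy graph in Python. Everything here is hypothetical (the `Graph` class, the node names, and both helper functions); it is not the `DAGCombiner` API, only a sketch of why enqueuing the users of the replacement node matters.

```python
class Graph:
    """Minimal use-def bookkeeping: node name -> list of user names."""

    def __init__(self):
        self.users = {}

    def add_use(self, node, user):
        self.users.setdefault(node, []).append(user)

    def replace_all_uses(self, old, new):
        for user in self.users.pop(old, []):
            self.add_use(new, user)


def manual_replace(graph, worklist, old, new):
    # Old behavior: only the new node is queued for another visit.
    graph.replace_all_uses(old, new)
    worklist.append(new)


def combine_to(graph, worklist, old, new):
    # CombineTo-style behavior: queue the new node and its users.
    graph.replace_all_uses(old, new)
    worklist.append(new)
    worklist.extend(graph.users.get(new, []))
```

With a single user `"store"` of the old node, `manual_replace` leaves the worklist as `["vwadd"]`, while `combine_to` yields `["vwadd", "store"]`, so the user gets a chance to be recombined.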
---------
Co-authored-by: Kai Lin <omg_link@qq.com>
|
|
This is to show the spilling of WMMA scale values which are limited
to low 256 VGPRs. We have free registers, but RA just allocates the low 256
first.
|
|
fold (#168726)
This late into lowering we don't have a good way to handle constant build_vector lowering.
Fixes #168594
|
|
This patch adds:
1. Support to recognize bf16 type in the frontend and isel/abi support
for scalar bf16 programs
Limitations: fp_to_bf16 is being generated with a tablegen pattern
instead of lowering via expansion. This is because we do not have
support for the fcanonicalize instruction, which should prevent an SNaN
being converted to an infinity due to truncation.
2. Vector codegen support for bf16
Patch By: Fateme Hosseini
Co-authored-by: Muntasir Mallick <quic_mallick@quicinc.com>
Co-authored-by: Muntasir Mallick <mallick@qti.qualcomm.com>
Co-authored-by: Kaushik Kulkarni <quic_kauskulk@quicinc.com>
|
|
GlobalISel now selects the hadd family of intrinsics, without falling back
to SDAG.
|
|
Remove leftover implicit operands from SI_SPILL/SI_RESTORE.
---------
Signed-off-by: John Lu <John.Lu@amd.com>
|
|
|
|
When the first element of a trn mask is undef, the `isTRNMask` function
assumes `WhichResult = 1`. That has a 50% chance of being wrong, so we
fail to match some valid trn1/trn2.
This patch introduces a more precise test to determine the correct value
of `WhichResult`, based on corresponding code in the `isZIPMask` and
`isUZPMask` functions.
- This change is based on #89578. I'd like to follow it up with a
further change along the lines of #167235.
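One way to picture the more precise test is to derive `WhichResult` from the first defined mask element instead of from element 0. The Python sketch below is an illustration only, not the code from the patch: the function name, the use of `-1` for undef, and returning `None` for a non-match (including an all-undef mask) are assumptions. It relies on the usual trn1/trn2 layout, where lane `i` reads element `(i - i % 2) + WhichResult` from the first input for even `i` and from the second input for odd `i`.

```python
UNDEF = -1


def trn_which_result(mask, num_elts):
    """Return 0 for a trn1 mask, 1 for a trn2 mask, None if neither."""
    if len(mask) != num_elts or num_elts % 2 != 0:
        return None
    which = None
    for i, m in enumerate(mask):
        if m == UNDEF:
            continue  # undef lanes match either variant
        # Expected index for trn1 at this lane; trn2 is one higher.
        base = (i - i % 2) + (i % 2) * num_elts
        if m not in (base, base + 1):
            return None
        lane_which = m - base
        if which is None:
            which = lane_which   # first defined lane decides
        elif which != lane_which:
            return None          # lanes disagree on trn1 vs trn2
    return which
```

For four elements, `[UNDEF, 4, 2, 6]` is recognized as trn1 and `[UNDEF, 5, 3, 7]` as trn2, whereas always assuming `WhichResult = 1` for an undef first element would reject the former.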
|