path: root/llvm/test/CodeGen
AgeCommit message (Collapse)Author
2025-11-22[RISCV] Support zilsd-4byte-align for i64 load/store in SelectionDAG. (#169182)Craig Topper
I think we need to keep the SelectionDAG code for volatile load/store, so we should support 4-byte alignment when possible.
2025-11-23Revert "[RegAlloc] Fix the terminal rule check for interfere with DstReg ↵Aiden Grossman
(#168661)"
This reverts commit 0859ac5866a0228f5607dd329f83f4a9622dedcc.
This caused a couple of test failures, likely due to a mid-air collision. Reverting for now to get the tree back to green and allow the original author to run UTC/friends and verify the output.
2025-11-23[RegAlloc] Fix the terminal rule check for interfere with DstReg (#168661)hstk30-hw
This may be a bug that was introduced by commit 6749ae36b4a33769e7a77cf812d7cd0a908ae3b9 and has been present ever since. In this case, `OtherReg` always overlaps with `DstReg` because both come from the `Copy`.
2025-11-22[AMDGPU] Enable serializing of allocated preload kernarg SGPRs info (#168374)tyb0807
- Support serialization of the number of allocated preload kernarg SGPRs
- Support serialization of the first preload kernarg SGPR allocated

Together they enable correctly reconstructing MIR with preload kernarg SGPRs.
2025-11-22[DAGCombiner] Don't optimize insert_vector_elt into shuffle if implicit ↵Hongyu Chen
truncation exists (#169022)
Fixes #169017
2025-11-21AMDGPU: Add baseline test for split/widen invariant loads (#168913)Matt Arsenault
This works fine on main, but breaks after a later patch.
2025-11-21Revert "[AMDGPU] Remove leftover implicit operands from ↵Nathan Corbyn
SI_SPILL/SI_RESTORE." (#169068)
PR causes build failures with expensive checks enabled.
Reverts llvm/llvm-project#168546
2025-11-21[RISCV] Incorporate scalar addends to extend vector multiply accumulate ↵Ryan Buchner
chains (#168660)
Previously, the following:

    %mul0 = mul nsw <8 x i32> %m00, %m01
    %mul1 = mul nsw <8 x i32> %m10, %m11
    %add0 = add <8 x i32> %mul0, splat (i32 32)
    %add1 = add <8 x i32> %add0, %mul1

lowered to:

    vsetivli zero, 8, e32, m2, ta, ma
    vmul.vv v8, v8, v9
    vmacc.vv v8, v11, v10
    li a0, 32
    vadd.vx v8, v8, a0

After this patch, now lowers to:

    li a0, 32
    vsetivli zero, 8, e32, m2, ta, ma
    vmv.v.x v12, a0
    vmadd.vv v8, v9, v12
    vmacc.vv v8, v11, v10

Modeled on 0cc981e0 from the AArch64 backend.

C-code for the example case (`clang -O3 -S -mcpu=sifive-x280`):

```
int madd_fail(int a, int b, int * restrict src, int * restrict dst, int loop_bound) {
  for (int i = 0; i < loop_bound; i += 2) {
    dst[i] = src[i] * a + src[i + 1] * b + 32;
  }
}
```
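The rewrite above is legal because wrapping integer addition is associative and commutative, so the scalar addend can seed the accumulator instead of being added last. A minimal Python sketch of one i32 lane follows; the function names `before_patch` and `after_patch` are hypothetical and only mirror the two instruction sequences quoted in the commit message.

```python
MASK = 0xFFFFFFFF  # model a single i32 lane with wraparound arithmetic


def before_patch(m00, m01, m10, m11, addend=32):
    """Mirror vmul.vv, vmacc.vv, then a trailing vadd.vx (hypothetical helper)."""
    acc = (m00 * m01) & MASK          # vmul.vv
    acc = (acc + m10 * m11) & MASK    # vmacc.vv
    return (acc + addend) & MASK      # vadd.vx with the scalar addend


def after_patch(m00, m01, m10, m11, addend=32):
    """Mirror vmv.v.x, vmadd.vv, vmacc.vv (hypothetical helper)."""
    acc = addend & MASK               # vmv.v.x: splat the addend
    acc = (m00 * m01 + acc) & MASK    # vmadd.vv: multiply, then add the splat
    return (acc + m10 * m11) & MASK   # vmacc.vv


if __name__ == "__main__":
    print(before_patch(3, 5, 7, 11), after_patch(3, 5, 7, 11))
```

Both helpers return the same lane value for any inputs, including ones that overflow 32 bits, which is the property the combine depends on.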
2025-11-21[ARM] Restore hasSideEffects flag on t2WhileLoopSetup (#168948)Sergei Barannikov
ARM relies on deprecated TableGen behavior of guessing instruction properties from patterns (`def ARM : Target` doesn't have `guessInstructionProperties` set to false). Before #168209, TableGen conservatively guessed that `t2WhileLoopSetup` has side effects because the instruction wasn't matched by any pattern. After the patch, TableGen guesses it has no side effects because the added pattern uses only `arm_wlssetup` node, which has no side effects. Add `SDNPSideEffect` to the node so that TableGen guesses the property right, and also `hasSideEffects = 1` to the instruction in case ARM ever sets `guessInstructionProperties` to false.
2025-11-21[AMDGPU] Handle AV classes in SIFixSGPRCopies::processPHINode (#169038)Jay Foad
Fix a problem exposed by #166483 using AV classes in more places. `isVectorRegister` only accepts registers of VGPR or AGPR classes. `hasVectorRegisters` additionally accepts the combined AV classes. Fixes: #168761
2025-11-21AMDGPU: Stop implementing shouldCoalesce (#168988)Matt Arsenault
Use the default, which freely coalesces anything it can. This mostly shows improvements, with a handful of regressions. The main concern would be if introducing wider registers is more likely to push the register usage up to the next occupancy tier.
2025-11-21Fix test from #168609 (#169041)Walter Lee
2025-11-21[AMDGPU] Enable multi-group xnack replay in hardware (GFX1250) (#169016)Christudasan Devadasan
This patch enables the multi-group xnack replay mode by configuring the hardware MODE register at kernel entry. This aligns the hardware behavior with the compiler's existing multi-group s_wait_xcnt insertion logic.
2025-11-21[NVPTX] Support for dense and sparse MMA intrinsics with block scaling. ↵Kirill Vedernikov
(#163561) This change adds dense and sparse MMA intrinsics with block scaling. The implementation is based on [PTX ISA version 9.0](https://docs.nvidia.com/cuda/parallel-thread-execution/). Tests for new intrinsics are added for PTX 8.7 and SM 120a and are generated by `llvm/test/CodeGen/NVPTX/wmma-ptx87-sm120a.py`. The tests have been verified with ptxas from CUDA-13.0 release. Dense MMA intrinsics with block scaling were supported by @schwarzschild-radius.
2025-11-21[PowerPC] Replace vspltisw+vadduwm instructions with xxleqv+vsubuwm for ↵Himadhith
adding the vector {1, 1, 1, 1} (#160882)
This patch optimizes vector additions where one operand is a vector whose elements are all 1. It generates a vector of -1s (using `xxleqv`), which is cheaper than generating a vector of 1s (using `vspltisw`), and subtracts it instead.

These are the respective vector types:
`v2i64`: **`A + vector {1, 1}`**
`v4i32`: **`A + vector {1, 1, 1, 1}`**
`v8i16`: **`A + vector {1, 1, 1, 1, 1, 1, 1, 1}`**
`v16i8`: **`A + vector {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}`**

The optimized version replaces `vspltisw` (4 cycles) with `xxleqv` (2 cycles) using the following identity: `A - (-1) = A + 1`.

---------
Co-authored-by: himadhith <himadhith.v@ibm.com>
Co-authored-by: Tony Varghese <tonypalampalliyil@gmail.com>
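The identity `A - (-1) = A + 1` holds lane by lane under wraparound arithmetic. A small Python sketch, assuming 32-bit lanes by default; the helper name `add_one_via_sub` is hypothetical and only models the arithmetic, not the instruction selection.

```python
def add_one_via_sub(lanes, bits=32):
    """Add 1 to every lane by subtracting an all-ones (-1) lane value."""
    mask = (1 << bits) - 1
    minus_one = mask  # the bit pattern of -1, i.e. every bit set
    return [(a - minus_one) & mask for a in lanes]


if __name__ == "__main__":
    print(add_one_via_sub([0, 1, 41, 0xFFFFFFFF]))
```

The last lane shows that the wraparound case behaves the same as a plain modular increment.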
2025-11-21[llvm][RISCV] Implement Zilsd load/store pair optimization (#158640)Brandon Wu
This commit implements a complete load/store optimization pass for the RISC-V Zilsd extension, which combines pairs of 32-bit load/store instructions into single 64-bit LD/SD instructions when possible. The default required alignment is 8 bytes; the zilsd-4byte-align feature relaxes this to 4 bytes.
Related work: https://reviews.llvm.org/D144002
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
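As a rough illustration of the address conditions described above, the following Python sketch checks whether two 32-bit accesses off the same base could be merged. The function `can_pair` and its parameters are hypothetical, and the sketch deliberately ignores everything else a real pass has to verify (register constraints, intervening memory operations, volatility).

```python
def can_pair(offset_a, offset_b, known_align, four_byte_align=False):
    """Hypothetical check: are two 32-bit accesses mergeable into one 64-bit access?

    offset_a, offset_b: byte offsets from a common base register.
    known_align: known alignment (in bytes) of the lower-addressed access.
    four_byte_align: models the zilsd-4byte-align feature being enabled.
    """
    lo, hi = sorted((offset_a, offset_b))
    if hi - lo != 4:
        return False  # the two words must be adjacent in memory
    required = 4 if four_byte_align else 8
    return known_align >= required


if __name__ == "__main__":
    print(can_pair(0, 4, known_align=8))
    print(can_pair(0, 4, known_align=4))
    print(can_pair(0, 4, known_align=4, four_byte_align=True))
```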
2025-11-20AMDGPU: Convert constant-address-space-32bit test to generated checks (#168975)Matt Arsenault
2025-11-20AMDGPU: Don't duplicate implicit operands in 3-address conversion (#168426)Nicolai Hähnle
We previously got a duplicate implicit $exec operand. It didn't really hurt anything (other than being a slight drag on compile-time performance). Still, let's keep things clean.
2025-11-20[AMDGPU] Precommit test for issue in amdgpu-rewrite-agpr-copy-mfma, (#168609)hjagasiaAMD
which reassigns the scale operand in a vgpr_32 register to agpr_32, which is not permitted by the instruction format. Reduced from ck.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Co-authored-by: theRonShark <ron.lieberman@amd.com>
2025-11-20AMDGPU: Handle invariant loads when considering if a load can be scalar ↵Matt Arsenault
(#168787)
2025-11-20Reapply "DAG: Allow select ptr combine for non-0 address spaces" (#168292) ↵Matt Arsenault
(#168786)
This reverts commit 6d5f87fc4284c4c22512778afaf7f2ba9326ba7b.
Previously this failed due to treating the unknown MachineMemOperand value as known uniform.
2025-11-20[AMDGPU] Precommit tests for V_CVT_PK_[IU]16_F32 (#168893)Jay Foad
2025-11-20[X86] Lower mathlib call ldexp into scalef when avx512 is enabled (#166839)Kavin Gnanapandithan
Resolves #165694
2025-11-20AMDGPU: Fix treating divergent loads as uniform (#168785)Matt Arsenault
Avoids the regression which caused the revert 6d5f87fc42. This is a hack on a hack. We currently have isUniformMMO, which improperly treats an unknown source value as known uniform. This is a hack from before we had divergence information in the DAG, and should be removed. This is the minimum change to avoid the regression; removing the aggressive handling of the unknown case (or dropping isUniformMMO entirely) are more involved fixes.
2025-11-20[RISCV] Do not write .s file in a test (#168865)Mikhail Gudim
2025-11-20[HLSL] Implement the `fwidth` intrinsic for DXIL and SPIR-V target (#161378)Alexander Johnston
Adds the fwidth intrinsic for HLSL. The DXIL path only requires modification to the hlsl headers. The SPIR-V path implements the OpFwidth builtin in Clang and instruction selection for the OpFwidth instruction in LLVM. Also adds shader stage tests for the ddx_coarse and ddy_coarse instructions used by fwidth.
Closes #99120
---------
Co-authored-by: Alexander Johnston <alexander.johnston@amd.com>
2025-11-20[LLVM][CodeGen][SVE] Only use unpredicated bfloat instructions when all ↵Paul Walker
lanes are in use. (#168387)
While SVE support for exception-safe floating point code generation is bare bones, we try to ensure inactive lanes remain inert. I mistakenly broke this rule when adding support for SVE-B16B16 by lowering some bfloat operations of unpacked vectors to unpredicated instructions.
2025-11-20[AArch64][SVE] Implement demanded bits for @llvm.aarch64.sve.cntp (#168714)Benjamin Maxwell
This allows DemandedBits to see that the SVE CNTP intrinsic will only ever produce small positive integers. The maximum value you could get here is 256, which is CNTP on an nxv16i1 on a machine with a 2048-bit vector size (the maximum for SVE). Using this, various redundant operations (zexts, sexts, ands, ors, etc.) can be eliminated.
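The bound of 256 follows from the SVE layout: an nxv16i1 predicate has 16 lanes per 128-bit granule, and a 2048-bit vector has 16 granules. The Python sketch below reproduces that arithmetic and the number of low bits that can ever be set; the helper names are hypothetical and the 64-bit result width is an assumption made for illustration.

```python
def max_cntp(max_vector_bits=2048, granule_bits=128, lanes_per_granule=16):
    """Largest possible active-lane count for an nxv16i1 predicate."""
    return (max_vector_bits // granule_bits) * lanes_per_granule


def known_zero_high_bits(max_value, result_bits=64):
    """Number of high bits that are always zero for values in [0, max_value]."""
    return result_bits - max_value.bit_length()


if __name__ == "__main__":
    bound = max_cntp()
    print(bound, known_zero_high_bits(bound))
```

With a bound of 256 only the low 9 bits can be non-zero, so masks and extensions that only affect the remaining high bits become redundant.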
2025-11-20Revert "[AMDGPU] Add wave reduce intrinsics for float types - 2 (#161… ↵Aaditya
(#168845) …815)" This reverts commit dcab4cb49bfb0aa17df3d3fabe582696100e0d35.
2025-11-20[X86] EltsFromConsecutiveLoads - recognise reverse load patterns. (#168706)Simon Pilgrim
See if we can create a vector load from the src elements in reverse and then shuffle these back into place. SLP will (usually) catch this in the middle-end, but there are a few BUILD_VECTOR scalarizations etc. that appear during DAG legalization. I did start looking at a more general permute fold, but I haven't found any good test examples for this yet - happy to take another look if somebody has examples.
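To illustrate the pattern being recognised, the Python sketch below models memory as a list and compares building a vector element by element from reversed addresses against one consecutive load followed by a reversing shuffle. Both function names are hypothetical; the sketch only shows the equivalence, not the DAG matching itself.

```python
def build_vector_reversed(mem, base, n):
    """Scalarized form: element i is loaded from mem[base + n - 1 - i]."""
    return [mem[base + n - 1 - i] for i in range(n)]


def load_then_shuffle(mem, base, n):
    """Vector form: one consecutive load, then a reverse shuffle."""
    vec = mem[base:base + n]            # single wide load of n elements
    mask = list(range(n - 1, -1, -1))   # shuffle mask n-1, ..., 1, 0
    return [vec[m] for m in mask]


if __name__ == "__main__":
    memory = [10, 11, 12, 13, 14, 15, 16, 17]
    print(build_vector_reversed(memory, 2, 4))
    print(load_then_shuffle(memory, 2, 4))
```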
2025-11-20[WebAssembly] Lower ANY_EXTEND_VECTOR_INREG (#167529)Sam Parker
Treat it in the same manner as zero_extend_vector_inreg and generate an extend_low_u if possible. This is to try to prevent expensive shuffles from being generated instead. computeKnownBitsForTargetNode has also been updated to specify known zeros on extend_low_u.
2025-11-20[AMDGPU] Add wave reduce intrinsics for float types - 2 (#161815)Aaditya
Supported Ops: `fadd`, `fsub`
2025-11-20[AMDGPU] Add wave reduce intrinsics for float types - 1 (#161814)Aaditya
Supported Ops: `fmin`, `fmax`
2025-11-20[RISCV][llvm] Select splat_vector(constant) with PLI (#168204)Brandon Wu
The default DAG combiner combines a BUILD_VECTOR with identical elements into a SPLAT_VECTOR, so we can map a constant splat directly to PLI where possible.
2025-11-19[CFIInserter] Turn a reachable llvm_unreachable into a report_fatal_error. ↵Craig Topper
(#168777)
This prevents it from being optimized out in non-asserts builds.
Update X86 test to remove REQUIRES: asserts and check for LLVM ERROR. Add FileCheck to RISC-V test and remove UNSUPPORTED.
This is the more complete fix for #168772 and #168525.
2025-11-20[RISCV] Only reduce VLs of instructions with demanded VLs (#168693)Luke Lau
In RISCVVLOptimizer we first compute all the demanded VLs, then we walk backwards through the function and try to reduce any VLs. We don't actually need to walk backwards anymore since after #124530 the order in which we modify the instructions doesn't matter.

This patch changes it to just iterate over the instructions with a demanded VL computed, which means we don't iterate over scalar instructions etc.

This also fixes #168665, where we triggered an assert on instructions with a dead $vxsat implicit-def:

    dead %x:vr = PseudoVSADDU_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */, implicit-def dead $vxsat

Because $vxsat is a reserved register, DeadMachineInstructionElim won't remove it and the instruction makes it to RISCVVLOptimizer. And because the def of %x is dead, we don't reach this instruction in the dataflow analysis. This instruction returns true for isCandidate, so we would try to look up its demanded VL, which doesn't exist, and assert. But with this patch we don't try to reduce instructions that aren't in DemandedVLs, which fixes the crash.
2025-11-19Re-land [Transform][LoadStoreVectorizer] allow redundant in Chain (#168135)Gang Chen
This is the fixed version of https://github.com/llvm/llvm-project/pull/163019
2025-11-20RenameIndependentSubregs: try to only implicit def used subregs (#167486)Carl Ritson
Attempt to only define used subregisters when creating IMPLICIT_DEF fix ups for live interval subranges. This avoids the appearance at the MIR level of entire (wide) registers becoming live rather than relying only on transient LiveIntervals dead definitions for unused subregisters.
2025-11-19[AMDGPU] Fixed crash in getLastMIForRegion when the region is empty. (#168653)Dhruva Chakrabarti
PreRARematStage builds region live-outs if GCN trackers are enabled. If rematerialization leads to empty regions, this can cause a crash because of a dereference of an invalid iterator in getLastMIForRegion. The fix is to skip calling getLastMIForRegion for empty regions.

This patch also fixes another bug in the same code region. getLastMIForRegion calls skipDebugInstructionsBackward, which may immediately return RegionEnd if it is not the begin instruction and it is a non-debug instruction. That would imply considering an instruction that is outside the relevant region. The fix is to always pass the predecessor of RegionEnd to skipDebugInstructionsBackward.

This bug was found while using GCN trackers on the existing LIT test machine-scheduler-sink-trivial-remats.mir. Here's the assertion failure:

    llvm-project/llvm/include/llvm/ADT/ilist_iterator.h:168: llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::reference llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::operator*() const [with OptionsT = llvm::ilist_detail::node_options<llvm::MachineInstr, true, true, void, false, void>; bool IsReverse = false; bool IsConst = false; llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::reference = llvm::MachineInstr&]: Assertion `!NodePtr->isKnownSentinel()' failed.
2025-11-19[AMDGPU] Prioritize allocation of low 256 VGPR classes (#167978)Stanislav Mekhanoshin
If we have 1024 VGPRs available, we need to give priority to the allocation of registers whose operands can only use the low 256. That is notably the scale operands of V_WMMA_SCALE instructions. Otherwise large tuples will be allocated first and take all the low registers, so we would have to spill to make room for these scale registers. Allocation priority itself does not eliminate spilling completely in large kernels, although it helps to some degree. Increasing the spill weight of a restricted class on top of it helps.
2025-11-19DAG: Use poison for some vector result widening (#168290)Matt Arsenault
2025-11-19[RISCV] Fix CFI Multiple Locations Test (#168772)Sam Elliott
2025-11-19[RISCV][DAGCombiner] Fix potential missed combine in VL->VW extension (#168026)Kai Lin
The previous implementation of `combineOp_VLToVWOp_VL` manually replaced old nodes with newly created widened nodes, but only added the new node itself to the `DAGCombiner` worklist. Since the users of the new node were not added, some combine opportunities could be missed when external `DAGCombiner` passes expected those users to be reconsidered.

This patch replaces the custom replacement logic with a call to `DCI.CombineTo()`, which performs node replacement in a way consistent with `DAGCombiner::Run`:
- Replace all uses of the old node.
- Add the new node and its users to the worklist.
- Clean up unused nodes when appropriate.

Using `CombineTo` ensures that `combineOp_VLToVWOp_VL` behaves consistently with the standard `DAGCombiner` update model, avoiding discrepancies between the private worklist inside this routine and the global worklist managed by the combiner.

This resolves missed combine cases involving VL -> VW operator widening.

---------
Co-authored-by: Kai Lin <omg_link@qq.com>
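The three replacement steps listed above can be sketched with a toy graph in Python. This is not LLVM's API: the `Node` class and the `combine_to` function are hypothetical stand-ins that only show why the users of the new node need to land on the worklist too.

```python
from collections import deque


class Node:
    """Toy DAG node that tracks its operands and its users (hypothetical)."""

    def __init__(self, name, operands=()):
        self.name = name
        self.operands = list(operands)
        self.users = []
        for op in self.operands:
            op.users.append(self)


def combine_to(old, new, worklist):
    """Toy model of a replace-and-requeue step; not the real CombineTo."""
    # 1. Replace all uses of the old node with the new node.
    for user in list(old.users):
        user.operands = [new if op is old else op for op in user.operands]
        new.users.append(user)
    old.users.clear()
    # 2. Add the new node and its users to the worklist.
    for node in [new, *new.users]:
        if node not in worklist:
            worklist.append(node)
    # 3. Clean up the old node, which is now unused.
    for op in old.operands:
        op.users.remove(old)
    old.operands.clear()


if __name__ == "__main__":
    x = Node("x")
    old = Node("add_vl", [x])
    user = Node("user", [old])
    new = Node("vwadd_vl", [x])
    worklist = deque()
    combine_to(old, new, worklist)
    print([n.name for n in worklist])
```

If step 2 queued only `new`, the node named `user` would never be revisited, which is the kind of missed combine the commit message describes.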
2025-11-19[AMDGPU] Add baseline test to show spilling of wmma scale. NFC (#168163)Stanislav Mekhanoshin
This is to show the spilling of WMMA scale values, which are limited to the low 256 VGPRs. We have free registers, but RA allocates the low 256 first.
2025-11-19[X86] X86ISelDAGToDAG - don't let ADD/SUB(X,1) -> SUB/ADD(X,-1) constant ↵Simon Pilgrim
fold (#168726)
This late into lowering we don't have a good way to handle constant build_vector lowering.
Fixes #168594
2025-11-19[Hexagon] Enable soft bf16 in hexagon (#167924)Fateme Hosseini
This patch adds:
1. Support to recognize the bf16 type in the frontend, and isel/ABI support for scalar bf16 programs.
   Limitations: fp_to_bf16 is being generated with a TableGen pattern instead of lowering via expansion. This is because we do not have support for the fcanonicalize instruction, which should prevent an SNaN from being converted to an infinity due to truncation.
2. Vector codegen support for bf16.

Patch By: Fateme Hosseini

Co-authored-by: Muntasir Mallick <quic_mallick@quicinc.com>
Co-authored-by: Muntasir Mallick <mallick@qti.qualcomm.com>
Co-authored-by: Kaushik Kulkarni <quic_kauskulk@quicinc.com>
2025-11-19[AArch64][GlobalISel] Added support for hadd family of intrinsics (#163985)Joshua Rodriguez
GlobalISel now selects hadd family of intrinsics, without falling back to SDAG.
2025-11-19[AMDGPU] Remove leftover implicit operands from SI_SPILL/SI_RESTORE. (#168546)LU-JOHN
Remove leftover implicit operands from SI_SPILL/SI_RESTORE.
---------
Signed-off-by: John Lu <John.Lu@amd.com>
2025-11-19[RISCV][test] Add sincos-expansion.ll test caseAlex Bradbury
2025-11-19[AArch64] match TRN starting from undef elements (#167955)Philip Ginsbach-Chen
When the first element of a trn mask is undef, the `isTRNMask` function assumes `WhichResult = 1`. That has a 50% chance of being wrong, so we fail to match some valid trn1/trn2. This patch introduces a more precise test to determine the correct value of `WhichResult`, based on corresponding code in the `isZIPMask` and `isUZPMask` functions.

This change is based on #89578. I'd like to follow it up with a further change along the lines of #167235.
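The problem with guessing from an undef first element can be shown with a small Python sketch. It is not the code from the patch: `is_trn_mask` is a hypothetical helper that simply tries both candidate values of `WhichResult` and ignores undef elements, which are written as -1 here.

```python
def is_trn_mask(mask, num_elts):
    """Return (matched, which_result); -1 marks an undef mask element.

    A trn1 mask (which_result 0) looks like 0, N, 2, N+2, ...
    A trn2 mask (which_result 1) looks like 1, N+1, 3, N+3, ...
    """
    if num_elts % 2 != 0 or len(mask) != num_elts:
        return False, None
    for which in (0, 1):
        ok = True
        for i in range(0, num_elts, 2):
            lo, hi = mask[i], mask[i + 1]
            if lo >= 0 and lo != i + which:
                ok = False
            if hi >= 0 and hi != i + num_elts + which:
                ok = False
        if ok:
            return True, which
    return False, None


if __name__ == "__main__":
    # Undef first element, remaining elements match trn1.
    print(is_trn_mask([-1, 4, 2, 6], 4))
```

Assuming `WhichResult = 1` for the mask in the usage example would reject it, even though the defined elements are consistent with trn1.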