| Age | Commit message (Collapse) | Author |
|
I think we need to keep the SelectionDAG code for volatile load/store so
we should support 4 byte alignment when possible.
|
|
(#168661)"
This reverts commit 0859ac5866a0228f5607dd329f83f4a9622dedcc.
This caused a couple test failures, likely due to a mid-air collision.
Reverting for now to get the tree back to green and allow the original
author to run UTC/friends and verify the output.
|
|
This maybe a bug which is introduced by commit
6749ae36b4a33769e7a77cf812d7cd0a908ae3b9, and has been present ever
since.
In this case, `OtherReg` always overlaps with `DstReg` cause they from
the `Copy` all.
|
|
While I am at it, this patch uses const l-value references for
std::shared_ptr. We don't need to increment the reference count by
passing std::shared_ptr by value.
Identified with llvm-use-ranges.
|
|
Identified with bugprone-unused-local-non-trivial-variable.
|
|
isUniformAcrossVFsAndUFs
Extract the PreservesUniformity logic from isSingleScalar into a shared
static helper function. Update isUniformAcrossVFsAndUFs to use this
logic for VPWidenRecipe and VPInstruction, so that any opcode that
preserves uniformity is considered uniform-across-vf-and-uf if its
operands are.
This unifies the uniformity checking logic and makes it easier to extend
in the future.
This should effectively by NFC currently.
|
|
- Support serialization of the number of allocated preload kernarg SGPRs
- Support serialization of the first preload kernarg SGPR allocated
Together they enable reconstructing correctly MIR with preload kernarg
SGPRs.
|
|
Create phi recipes for scalar resume value up front in addInitialSkeleton during initial construction. This will allow moving the remaining code dealing with resume values to VPlan transforms/construction.
PR: https://github.com/llvm/llvm-project/pull/166099
|
|
If a threading path has cycles within it then the transformation is not
correct. This patch fixes a couple of cases that create such cycles.
Fixes https://github.com/llvm/llvm-project/issues/166868
|
|
(#169163)
Extends the `icmp(trunc(shl))` fold to handle any power of 2 constant as
the shift base, not just 1. This generalizes the following patterns by
adjusting the comparison offsets by `log2(Pow2)`.
```llvm
(trunc (1 << Y) to iN) == 0 --> Y u>= N
(trunc (1 << Y) to iN) != 0 --> Y u< N
(trunc (1 << Y) to iN) == 2**C --> Y == C
(trunc (1 << Y) to iN) != 2**C --> Y != C
; to
(trunc (Pow2 << Y) to iN) == 0 --> Y u>= N - log2(Pow2)
(trunc (Pow2 << Y) to iN) != 0 --> Y u< N - log2(Pow2)
(trunc (Pow2 << Y) to iN) == 2**C --> Y == C - log2(Pow2)
(trunc (Pow2 << Y) to iN) != 2**C --> Y != C - log2(Pow2)
```
Proof: https://alive2.llvm.org/ce/z/2zwTkp
|
|
For swift async code, we need to use a debug intrinsic that behaves like
an llvm.dbg.declare but can take any location type rather than just a
pointer or integer.
To solve this, a new debug instrinsic called llvm.dbg.declare_value has
been created, which behaves exactly like an llvm.dbg.declare but can
take non pointer and integer location types.
More information here:
https://discourse.llvm.org/t/rfc-introduce-new-llvm-dbg-coroframe-entry-intrinsic/88269
This is the first patch as part of a stack of patches, with the one
succeeding it being: https://github.com/llvm/llvm-project/pull/168134
|
|
Function &F is the more standard abbreviation (~4000 uses in llvm versus
~300 uses).
|
|
Remove the SCEV-based check refined in
https://github.com/llvm/llvm-project/pull/156910, as InstSimplifyFolder
manages to simplify the generated code to false directly as well.
|
|
Remove the Constraints class from dependence analysis as it is now unused.
|
|
truncation exists (#169022)
Fixes #169017
|
|
These shuffles can always be implemented using v_perm_b32, and so this
rewrites the analysis from the perspective of "how many v_perm_b32s does
it take to assemble each register of the result?"
The test changes in Transforms/SLPVectorizer/reduction.ll are
reasonable: VI (gfx8) has native f16 math, but not packed math.
|
|
of ExpandVariadic. (#168161)
This PR fixes the issue where profile metadata (`!prof`) is dropped from
the `VariadicWrapper` when `ExpandVariadics` runs in
`--expand-variadics-override=optimize` mode.
In optimize mode, the pass splits the original variadic function into
two parts:
- A **VariadicWrapper** (retaining the original name) that handles the
`va_list` setup.
- A **FixedArityReplacement** (new function) that contains the original
core logic.
During this process, the basic blocks and associated metadata are
spliced into the `FixedArityReplacement`. Consequently, the
`VariadicWrapper`—which serves as the entry point for callers—is left
without function entry count metadata.
This change explicitly copies the `MD_prof` metadata from the
`FixedArityReplacement` back to the `VariadicWrapper` after the split is
defined.
Co-authored-by: Jin Huang <jingold@google.com>
|
|
Global with invariant should be treated identically to
constant.
|
|
only" (#169073)
Reverts llvm/llvm-project#168518
|
|
This PR adds a Load method for resources, which takes an additional
parameter by reference, status. It fills the status parameter with a 1
or 0, depending on whether or not the resource access was mapped.
CheckAccessFullyMapped is also added as an intrinsic, and called in the
production of this status bit.
Only addresses DXIL for the below issue:
https://github.com/llvm/llvm-project/issues/138910
Also only addresses the DXIL variant for the below issue:
https://github.com/llvm/llvm-project/issues/99204
|
|
Identified with modernize-loop-convert.
|
|
SI_SPILL/SI_RESTORE." (#169068)
PR causes build failures with expensive checks enabled
Reverts llvm/llvm-project#168546
|
|
chains (#168660)
Previously, the following:
%mul0 = mul nsw <8 x i32> %m00, %m01
%mul1 = mul nsw <8 x i32> %m10, %m11
%add0 = add <8 x i32> %mul0, splat (i32 32)
%add1 = add <8 x i32> %add0, %mul1
lowered to:
vsetivli zero, 8, e32, m2, ta, ma
vmul.vv v8, v8, v9
vmacc.vv v8, v11, v10
li a0, 32
vadd.vx v8, v8, a0
After this patch, now lowers to:
li a0, 32
vsetivli zero, 8, e32, m2, ta, ma
vmv.v.x v12, a0
vmadd.vv v8, v9, v12
vmacc.vv v8, v11, v10
Modeled on 0cc981e0 from the AArch64 backend.
C-code for the example case (`clang -O3 -S -mcpu=sifive-x280`):
```
int madd_fail(int a, int b, int * restrict src, int * restrict dst, int loop_bound) {
for (int i = 0; i < loop_bound; i += 2) {
dst[i] = src[i] * a + src[i + 1] * b + 32;
}
}
```
|
|
|
|
Identified with modernize-loop-convert.
|
|
Remove getSplitIteration.
A follow-up patch will also remove DVEntry::Splitable and Dependnece::isSplitable.
|
|
ARM relies on deprecated TableGen behavior of guessing instruction
properties from patterns (`def ARM : Target` doesn't have
`guessInstructionProperties` set to false).
Before #168209, TableGen conservatively guessed that `t2WhileLoopSetup`
has side effects because the instruction wasn't matched by any pattern.
After the patch, TableGen guesses it has no side effects because the
added pattern uses only `arm_wlssetup` node, which has no side effects.
Add `SDNPSideEffect` to the node so that TableGen guesses the property
right, and also `hasSideEffects = 1` to the instruction in case ARM ever
sets `guessInstructionProperties` to false.
|
|
Some targets have a specific calling convention that should be used for
generated calls to runtime functions.
Pass that down and use it.
Signed-off-by: Nick Sarnie <nick.sarnie@intel.com>
|
|
Fix a problem exposed by #166483 using AV classes in more places.
`isVectorRegister` only accepts registers of VGPR or AGPR classes.
`hasVectorRegisters` additionally accepts the combined AV classes.
Fixes: #168761
|
|
Use the default, which freely coalesces anything it can.
This mostly shows improvements, with a handful of regressions.
The main concern would be if introducing wider registers is more
likely to push the register usage up to the next occupancy tier.
|
|
(#166170)
…nd update docs
|
|
This allows us to strip an unnecessary TypeSwitch.
|
|
Only apply forced instruction costs to recipes with underlying values to
match the legacy cost model. A VPlan may have a number of additional
VPInstructions without underlying values that are not considered for its
cost, and assigning forced costs to them would incorrectly inflate its
cost.
This fixes a cost divergence between legacy and VPlan-based cost models
with forced instruction costs.
PR: https://github.com/llvm/llvm-project/pull/168372
|
|
This patch enables the multi-group xnack replay mode by
configuring the hardware MODE register at kernel entry.
This aligns the hardware behavior with the compiler's
existing multi-group s_wait_xcnt insertion logic.
|
|
This patch replaces the delinearization function used in
LoopCacheAnalysis, switching from one that depends on type information
in GEPs to one that does not. Once this patch and
https://github.com/llvm/llvm-project/pull/161822 are landed, we can
delete `tryDelinearizeFixedSize` from Delienarization, which is an
optimization heuristic guided by GEP type information. After Polly
eliminates its use of `getIndexExpressionsFromGEP`, we will be able to
completely delete GEP-driven heuristics from Delinearization.
|
|
In 4 years the ELF debugger support plugin wasn't adapted to other
object formats or debugging approaches. After the renaming NFC in
https://github.com/llvm/llvm-project/pull/168343, this patch tailors the
plugin to ELF and section load-address patching. It allows removal of
abstractions and consolidate processing steps with the newly enabled
AllocActions from https://github.com/llvm/llvm-project/pull/168343.
The key change is to process debug sections in one place in a
post-allocation pass. Since we can handle the endianness of the ELF file
the single `visitSectionLoadAddresses()` visitor function now, we don't
need to track debug objects and sections in template classes anymore. We
keep using the `DebugObject` class and drop `DebugObjectSection`,
`ELFDebugObjectSection<ELFT>` and `ELFDebugObject`.
Furthermore, we now use the allocation's working memory for load-address
fixups directly. We can drop the `WritableMemoryBuffer` from the debug
object and most of the `finalizeWorkingMemory()` step, which saves one
copy of the entire debug object buffer. Inlining `finalizeAsync()` into
the pre-fixup pass simplifies quite some logic.
We still track `RegisteredObjs` here, because we want to free memory
once the corresponding code is freed. There will be a follow-up patch
that turns it into a dealloc action.
|
|
This PR adds hardware-measured latencies for all instructions defined in
Section 15 of the RVV specification: "Vector Mask Instructions" to the
SpacemiT-X60 scheduling model.
|
|
OpenMP 6.0 introduced a `fuse` directive, and with it a "loop sequence"
as the associated code. What used to be "loop association" has become
"loop-nest association".
Rename Association::Loop to LoopNest, add Association::LoopSeq to
represent the "loop sequence" association.
Change the association of fuse from "block" to "loop sequence".
|
|
(#168520)
This does not seem to be an issue currently, but when using
VECTOR_COMPRESS as part of another lowering, I found these BITCASTs
would result in "Unexpected illegal type!" errors.
For example, this would convert the legal nxv2f32 type into the illegal
nxv2i32 type. This patch avoids this by using no-op casts for unpacked
types.
|
|
(#163561)
This change adds dense and sparse MMA intrinsics with block scaling. The
implementation is based on [PTX ISA version
9.0](https://docs.nvidia.com/cuda/parallel-thread-execution/). Tests for
new intrinsics are added for PTX 8.7 and SM 120a and are generated by
`llvm/test/CodeGen/NVPTX/wmma-ptx87-sm120a.py`. The tests have been
verified with ptxas from CUDA-13.0 release.
Dense MMA intrinsics with block scaling were supported by
@schwarzschild-radius.
|
|
After truncating an integer-induction, neither nuw nor nsw hold.
Fixes #168902.
Co-authored-by: Florian Hahn <flo@fhahn.com>
|
|
adding the vector {1, 1, 1, 1} (#160882)
This patch optimizes vector addition operations involving **`all-ones`**
vectors by leveraging the generation of vectors of -1s(using `xxleqv`,
which is cheaper than generating vectors of 1s(`vspltisw`). These are
the respective vector types.
`v2i64`: **`A + vector {1, 1}`**
`v4i32`: **`A + vector {1, 1, 1, 1}`**
`v8i16`: **`A + vector {1, 1, 1, 1, 1, 1, 1, 1}`**
`v16i8`: **`A + vector {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1}`**
The optimized version replaces `vspltisw (4 cycles)` with `xxleqv (2
cycles)` using the following identity:
`A - (-1) = A + 1`.
---------
Co-authored-by: himadhith <himadhith.v@ibm.com>
Co-authored-by: Tony Varghese <tonypalampalliyil@gmail.com>
|
|
This change fixes the PTX and SM conditions for narrow FP
conversion intrinsics and adds support for family-conditionals.
|
|
PPCISelLowering.cpp:15567:27: warning: suggest parentheses around '&&' within '||' [-Wparentheses]
15567 | CC == ISD::SETEQ && "CC mus be ISD::SETNE or ISD::SETEQ");
| ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
This commit implements a complete load/store optimization pass for the
RISC-V Zilsd extension, which combines pairs of 32-bit load/store
instructions into single 64-bit LD/SD instructions when possible.
Default alignment is 8, it also provide zilsd-4byte-align feature for
looser condition.
Related work: https://reviews.llvm.org/D144002
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
|
Query RuntimeLibcalls for the support and the name. The check
that the implementation is exactly __guard_local instead of
unsupported feels a bit strange.
|
|
We previously got a duplicate implicit $exec operand. It didn't really
hurt anything (other than being a slight drag on compile-time
performance). Still, let's keep things clean.
|
|
This removes an unnecessary isel pattern for the RV32 HwMode.
|
|
Removes about 200 bytes of unneeded patterns from RISCVGenDAGISel.inc
|
|
Remove all constraint propagation functions in Dependence Analysis.
|