| Age | Commit message (Collapse) | Author |
|
(#149042)"
This reverts commit 62d1a080e69e3c5e98840e000135afa7c688a77b.
This appears to be causing some runtime failures on RISCV
https://lab.llvm.org/buildbot/#/builders/210/builds/5221
|
|
Building on top of https://github.com/llvm/llvm-project/pull/148817,
introduce a new abstract LastActiveLane opcode that gets lowered to
Not(Mask) → FirstActiveLane(NotMask) → Sub(result, 1).
When folding the tail, update all extracts for uses outside the loop the
extract the value of the last actice lane.
See also https://github.com/llvm/llvm-project/issues/148603
PR: https://github.com/llvm/llvm-project/pull/149042
|
|
Add a new Unpack VPInstruction (name to be improved) to explicitly
extract scalars values from vectors.
Test changes are movements of the extracts: they are no generated
together and also directly after the producer.
Depends on https://github.com/llvm/llvm-project/pull/155102 (included in
PR)
PR: https://github.com/llvm/llvm-project/pull/155670
|
|
Multiple places retrieve the region for a recipe. Add a helper to make
the code more compact and clearer.
|
|
When narrowing stores of a single-scalar, we currently use
ExtractLastElement, which extracts the last element across all parts.
This is not correct if the store's address is not uniform across all
parts. If it is only uniform-per-part, the last lane per part must be
extracted. Add a new ExtractLastLanePerPart opcode to handle this
correctly. Most transforms apply to both ExtractLastElement and
ExtractLastLanePerPart, with the only difference being their treatment
during unrolling.
Fixes https://github.com/llvm/llvm-project/issues/162498.
PR: https://github.com/llvm/llvm-project/pull/163056
|
|
(#158603)
Check if the scale-factor of the accumulator is the same as the request
ScaleFactor in tryToCreatePartialReductions.
This prevents creating partial reductions if not all instructions in the
reduction chain form partial reductions. e.g. because we do not form a
partial reduction for the loop exit instruction.
Currently code-gen works fine, because the scale factor of
VPPartialReduction is not used during ::execute, but it means we compute
incorrect cost/register pressure, because the partial reduction won't
reduce to the specified scaling factor.
PR: https://github.com/llvm/llvm-project/pull/158603
|
|
Update calculateRegisterUsageForPlan to track live-ness of VPValues
instead of recipes. This gives slightly more accurate results for
recipes that define multiple values (i.e. VPInterleaveRecipe).
When tracking the live-ness of recipes, all VPValues defined by an
VPInterleaveRecipe are considered alive until the last use of any of
them. When tracking the live-ness of individual VPValues, we can
accurately track the individual values until their last use.
Note the changes in large-loop-rdx.ll and pr47437.ll. This patch
restores the original behavior before introducing VPlan-based liveness
tracking.
PR: https://github.com/llvm/llvm-project/pull/155301
|
|
The InterleavedAccess pass already supports transforming
vector-predicated (vp) load/store intrinsics. With this patch, we start
enabling interleaved access under tail folding by EVL.
This patch introduces a new base class, VPInterleaveBase, and a concrete
class, VPInterleaveEVLRecipe. Both the existing VPInterleaveRecipe and
the new VPInterleaveEVLRecipe inherit from and implement
VPInterleaveBase.
Compared to VPInterleaveRecipe, VPInterleaveEVLRecipe adds an EVL
operand to emit vp.load/vp.store intrinsics.
Currently, tail folding by EVL is only supported for scalable
vectorization. Therefore, VPInterleaveEVLRecipe will only emit
interleave/deinterleave intrinsics. Reverse accesses are not yet
implemented, as masked reverse interleaved access under tail folding is
not yet supported.
Fixed #123201
|
|
|
|
`VPEVLBasedIVPHIRecipe` will lower to VPInstruction scalar phi and
generate scalar phi. This recipe will only occupy a scalar register just
like other phi recipes.
This patch fix the register usage for `VPEVLBasedIVPHIRecipe` from
vector
to scalar which is close to generated vector IR.
https://godbolt.org/z/6Mzd6W6ha shows that no register spills when
choosing `<vscale x 16>`.
Note that this test is basically copied from AArch64.
|
|
|
|
A lot of time getCanonicalIV() is used to get the canonical IV type,
e.g. to instantiate a VPTypeAnalysis or to get the LLVMContext.
However VPTypeAnalysis has a constructor that takes the VPlan directly
and there's a method on VPlan to get the LLVMContext directly, so use
those instead where possible.
This lets us remove a constructor on VPTypeAnalysis.
Also remove an unused LLVMContext argument in UnrollState whilst we're
here.
|
|
Epilogue vectorization currently relies on the resume phi for the
canonical induction being always available, which is why VPPhi are
considered to have side-effects, to prevent their removal.
This patch adds a new ResumeForEpilogue opcode to mark the resume phi as
used for epilogue vectorization. This allows treating VPPhis in general
as not having side-effects, enabling removal of unused VPPhis.
|
|
This is the VPWidenPointerInductionRecipe equivalent of #118638, with
the motivation of allowing us to use the EVL as the induction step.
There is a new VPInstruction added, WidePtrAdd to allow adding the step
vector to the induction phi, since VPInstruction::PtrAdd only handles
scalars or multiple scalar lanes.
Originally this transformation was copied from the original recipe's
execute code, but it's since been simplifed by teaching
`unrollWidenInductionByUF` to unroll the recipe, which brings it inline
with VPWidenIntOrFpInductionRecipe.
|
|
This patch adds a new ExtractLane VPInstruction which extracts across
multiple parts using a wide index, to be used in combination with
FirstActiveLane.
The patch updates early-exit codegen to use it instead ExtractElement,
which is only per-part. With this change, interleaving should work
correctly with early-exit loops.
The patch removes the restrictions added in 6f43754e9 (#145877), but
does not yet automatically select interleave counts > 1 for early-exit
loops.
I'll share a patch as follow-up. The cost of extracting a lane adds
non-trivial overhead in the exit block, so that should be considered
when picking the interleave count.
PR: https://github.com/llvm/llvm-project/pull/148817
|
|
Update LV to vectorize maxnum/minnum reductions without fast-math flags,
by adding an extra check in the loop if any inputs to maxnum/minnum are
NaN, due to maxnum/minnum behavior w.r.t to signaling NaNs. Signed-zeros
are already handled consistently by maxnum/minnum.
If any input is NaN,
*exit the vector loop,
*compute the reduction result up to the vector iteration that contained
NaN inputs and
* resume in the scalar loop
New recurrence kinds are added for reductions using maxnum/minnum
without fast-math flags.
PR: https://github.com/llvm/llvm-project/pull/148239
|
|
via maxBandwidth (#149056)
Currently if MaxBandwidth is enabled, the register pressure is checked
for each VF. This changes that to only perform said check if the VF
would not have otherwise been considered by the LoopVectorizer if
maxBandwidth was not enabled.
Theoretically this allows for higher VFs to be considered than would
otherwise be deemed "safe" (from a regpressure perspective), but more
concretely this reduces the amount of work done at compile-time when
maxBandwidth is enabled.
|
|
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
|
|
This patch adds a new recipe to combine multiple recipes into an
'expression' recipe, which should be considered as single entity for
cost-modeling and transforms. The recipe needs to be 'decomposed', i.e.
replaced by its individual recipes before execute.
This subsumes VPExtendedReductionRecipe and
VPMulAccumulateReductionRecipe and should make it easier to extend to
include more types of bundled patterns, like e.g. extends folded into
loads or various arithmetic instructions, if supported by the target.
It allows avoiding re-creating the original recipes when converting to
concrete recipes, together with removing the need to record various
information. The current version of the patch still retains the original
printing matching VPExtendedReductionRecipe and
VPMulAccumulateReductionRecipe, but this specialized print could be
replaced with printing the bundled recipes directly.
PR: https://github.com/llvm/llvm-project/pull/144281
|
|
Remove another use of the underlying IR phi.
|
|
Similar to FindLastIV, add FindFirstIVSMin to support select (icmp(), x, y)
reductions where one of x or y is a decreasing induction, producing a SMin
reduction. It uses signed max as sentinel value.
PR: https://github.com/llvm/llvm-project/pull/140451
|
|
Explicitly unroll VPReplicateRecipes outside replicate regions by VF,
replacing them by VF single-scalar recipes. Extracts for operands are
added as needed and the scalar results are combined to a vector using a
new BuildVector VPInstruction.
It also adds a few folds to simplify unnecessary extracts/BuildVectors.
It also adds a BuildStructVector opcode for handling of calls that have
struct return types.
VPReplicateRecipe in replicate regions can will be unrolled as follow
up, turing non-single-scalar VPReplicateRecipes into 'abstract', i.e.
not executable.
PR: https://github.com/llvm/llvm-project/pull/142433
|
|
Add a new VPInstruction::ReductionStartVector opcode to create the start
values for wide reductions. This more accurately models the start value
creation in VPlan and simplifies VPReductionPHIRecipe::execute. Down the
line it also allows removing VPReductionPHIRecipe::RdxDesc.
PR: https://github.com/llvm/llvm-project/pull/142290
|
|
Add a dedicated opcode for any-of reduction, similar to
https://github.com/llvm/llvm-project/pull/132689 and
https://github.com/llvm/llvm-project/pull/132690.
The patch also explictly adds the start value to not require
RecurrenceDescriptor during execute. It also allows freezing the start
value to make it poison-safe.
PR: https://github.com/llvm/llvm-project/pull/141932
|
|
Move VPlan-based calculateRegisterUsage from LoopVectorize
to VPlanAnalysis.cpp. It is a VPlan-based analysis and this helps
to reduce the size of LoopVectorize.
PR: https://github.com/llvm/llvm-project/pull/135673
|
|
Use regular VPPhi instead of a separate opcode for resume phis. This
removes an unneeded specialized opcode and unifies the code
(verification, printing, updating when CFG is changed).
Depends on https://github.com/llvm/llvm-project/pull/140132.
PR: https://github.com/llvm/llvm-project/pull/140405
|
|
corresponding vplan transformations. (#137746)
This patch introduce two new recipes.
* VPExtendedReductionRecipe
- cast + reduction.
* VPMulAccumulateReductionRecipe
- (cast) + mul + reduction.
This patch also implements the transformation that match following
patterns via vplan and converts to abstract recipes for better cost
estimation.
* VPExtendedReduction
- reduce(cast(...))
* VPMulAccumulateReductionRecipe
- reduce.add(mul(...))
- reduce.add(mul(ext(...), ext(...))
- reduce.add(ext(mul(ext(...), ext(...))))
The converted abstract recipes will be lower to the concrete recipes
(widen-cast + widen-mul + reduction) just before recipe execution.
Note that this patch still relies on legacy cost model the calculate the
cost for these patters.
Will enable vplan-based cost decision in #113903.
Split from #113903.
|
|
Add constructor that retrieves the scalar type from the trip count
expression, if no canonical IV is available. Used in the verifier, in
preparation for late verification, when the canonical IV has been
dissolved.
|
|
(#137030)
ExtractFromEnd only has 2 uses, extracting the last and penultimate
elements. Replace it with 2 separate opcodes, removing the need to
materialize and handle a constant argument.
PR: https://github.com/llvm/llvm-project/pull/137030
|
|
|
|
(#129706)
There are some opcodes that currently require specialized recipes, due
to their result type not being implied by their operands, including
casts.
This leads to duplication from defining multiple full recipes.
This patch introduces a new VPInstructionWithType subclass that also
stores the result type. The general idea is to have opcodes needing to
specify a result type to use this general recipe. The current patch
replaces VPScalarCastRecipe with VInstructionWithType, a similar patch
for VPWidenCastRecipe will follow soon.
There are a few proposed opcodes that should also benefit, without the
need of workarounds:
* https://github.com/llvm/llvm-project/pull/129508
* https://github.com/llvm/llvm-project/pull/119284
PR: https://github.com/llvm/llvm-project/pull/129706
|
|
Keep the start value as operand of ComputeFindLastIVResult. A follow-up
patch will use this to make sure the start value is frozen if needed.
Depends on https://github.com/llvm/llvm-project/pull/132689
PR: https://github.com/llvm/llvm-project/pull/132690
|
|
This moves the logic for computing the FindLastIV reduction result to
its own opcode. A follow-up patch will update the new opcode to also
take the start value, to fix
https://github.com/llvm/llvm-project/issues/126836.
PR: https://github.com/llvm/llvm-project/pull/132689
|
|
(#131086)
After #128718 lands there will be two ways of performing a reversed
widened memory access, either by performing a consecutive unit-stride
access and a reverse, or a strided access with a negative stride.
Even though both produce a reversed vector, only the former needs
VPReverseVectorPointerRecipe which computes a pointer to the last
element of each part. A strided reverse still needs a pointer to the
first element of each part so it will use VPVectorPointerRecipe.
This renames VPReverseVectorPointerRecipe to VPVectorEndPointerRecipe to
clarify that a reversed access may not necessarily need a pointer to the
last element.
|
|
Refactor the code to extract the first active element of a
vector in the early exit block, in preparation for PR #130766.
I've replaced the VPInstruction::ExtractFirstActive nodes with
a combination of a new VPInstruction::FirstActiveLane node and
a Instruction::ExtractElement node.
|
|
Now that all phi nodes manage their incoming blocks through the
VPlan-predecessors, there should be no need for having a dedicate
recipe, it should be sufficient to allow PHI opcodes in VPInstruction.
Follow-ups will also migrate VPWidenPHIRecipe and possibly others,
building on top of https://github.com/llvm/llvm-project/pull/129388.
PR: https://github.com/llvm/llvm-project/pull/129767
|
|
Add a new VPInstruction::Broadcast opcode and use it to materialize
explicit broadcasts of live-ins. The initial patch only materlizes the
broadcasts if the vector preheader dominates all uses that need it.
Later patches will pick the best valid insert point, thus retiring
implicit hoisting of broadcasts from VPTransformsState::get().
PR: https://github.com/llvm/llvm-project/pull/124644
|
|
This is a copy of #126177, since it was automatically and permanently
closed because I messed up the source branch on my remote
This patch proposes to avoid converting widening recipes to VP
intrinsics during the EVL transform.
IIUC we initially did this to avoid `vl` toggles on RISC-V. However we
now have the RISCVVLOptimizer pass which mostly makes this redundant.
Emitting regular IR instead of VP intrinsics allows more generic
optimisations, both in the middle end and DAGCombiner, and we generally
have better patterns in the RISC-V backend for non-VP nodes. Sticking to
regular IR instructions is likely a lot less work than reimplementing
all of these optimisations for VP intrinsics, and on SPEC CPU 2017 we get
noticeably better code generation.
|
|
This patch adds initial support for vectorizing literal struct return
values. Currently, this is limited to the case where the struct is
homogeneous (all elements have the same type) and not packed. The users
of the call also must all be `extractvalue` instructions.
The intended use case for this is vectorizing intrinsics such as:
```
declare { float, float } @llvm.sincos.f32(float %x)
```
Mapping them to structure-returning library calls such as:
```
declare { <4 x float>, <4 x float> } @Sleef_sincosf4_u10advsimd(<4 x float>)
```
Or their widened form (such as `@llvm.sincos.v4f32` in this case).
Implementing this required two main changes:
1. Supporting widening `extractvalue`
2. Adding support for vectorized struct types in LV
* This is mostly limited to parts of the cost model and scalarization
Since the supported use case is narrow, the required changes are
relatively small.
|
|
(#120567)
This work feeds part of PR
https://github.com/llvm/llvm-project/pull/88385, and adds support for
vectorising
loops with uncountable early exits and outside users of loop-defined
variables. When calculating the final value from an uncountable early
exit we need to calculate the vector lane that triggered the exit,
and hence determine the value at the point we exited.
All code for calculating the last value when exiting the loop early
now lives in a new vector.early.exit block, which sits between the
middle.split block and the original exit block. Doing this required
two fixes:
1. The vplan verifier incorrectly assumed that the block containing
a definition always dominates the block of the user. That's not true
if you can arrive at the use block from multiple incoming blocks.
This is possible for early exit loops where both the early exit and
the latch jump to the same block.
2. We were adding the new vector.early.exit to the wrong parent loop.
It needs to have the same parent as the actual early exit block from
the original loop.
I've added a new ExtractFirstActive VPInstruction that extracts the
first active lane of a vector, i.e. the lane of the vector predicate
that triggered the exit.
NOTE: The IR generated for dealing with live-outs from early exit
loops is unoptimised, as opposed to normal loops. This inevitably
leads to poor quality code, but this can be fixed up later.
|
|
VTypeAnalysis contains some assertions which can be useful for reasoning
that the types of various operands match.
This patch teaches VPlanVerifier to invoke VTypeAnalysis to check them,
and catches some issues with VPInstruction types that are also fixed
here:
* Handles the missing cases for CalculateTripCountMinusVF,
CanonicalIVIncrementForPart and AnyOf
* Fixes ICmp and ActiveLaneMask to return i1 (to align with `icmp` and
`@llvm.get.active.lane.mask` in the LangRef)
The VPlanVerifier unit tests also need to be fleshed out a bit more to
satisfy the stricter assertions
|
|
operand fix. (#121744)
This relands the reverted #120721 with a fix for cases where neither
reduction operand are the reduction phi. Only
63114239cc8d26225a0ef9920baacfc7cc00fc58 and
63114239cc8d26225a0ef9920baacfc7cc00fc58 are new on top of the reverted
PR.
---------
Co-authored-by: Nicholas Guy <nicholas.guy@arm.com>
|
|
This reverts commit c858bf620c3ab2a4db53e84b9365b553c3ad1aa6 as it casuse optimization crash on -O2, see https://github.com/llvm/llvm-project/pull/120721#issuecomment-2563192057
|
|
This re-lands the reverted #92418
When the VF is small enough so that dividing the VF by the scaling
factor results in 1, the reduction phi execution thinks the VF is scalar
and sets the reduction's output as a scalar value, tripping assertions
expecting a vector value. The latest commit in this PR fixes that by
using `State.VF` in the scalar check, rather than the divided VF.
---------
Co-authored-by: Nicholas Guy <nicholas.guy@arm.com>
|
|
Use VPlan-based type analysis to retrieve type of phi node. Also adds
missing type inference for ResumePhi and ComputeReductionResult opcodes.
|
|
This reverts commit 060d62b48aeb5080ffcae1dc56e41a06c6f56701.
It looks like this is triggering an assertion when build llvm-test-suite
on ARM64 macOS.
Reproducer from MultiSource/Benchmarks/Ptrdist/bc/number.c
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "arm64-apple-macosx15.0.0"
define void @test(i64 %idx.neg, i8 %0) #0 {
entry:
br label %while.body
while.body: ; preds = %while.body, %entry
%n1ptr.0.idx131 = phi i64 [ %n1ptr.0.add, %while.body ], [ %idx.neg, %entry ]
%n2ptr.0.idx130 = phi i64 [ %n2ptr.0.add, %while.body ], [ 0, %entry ]
%sum.1129 = phi i64 [ %add99, %while.body ], [ 0, %entry ]
%n1ptr.0.add = add i64 %n1ptr.0.idx131, 1
%conv = sext i8 %0 to i64
%n2ptr.0.add = add i64 %n2ptr.0.idx130, 1
%1 = load i8, ptr null, align 1
%conv97 = sext i8 %1 to i64
%mul = mul i64 %conv97, %conv
%add99 = add i64 %mul, %sum.1129
%cmp94 = icmp ugt i64 %n1ptr.0.idx131, 0
%cmp95 = icmp ne i64 %n2ptr.0.idx130, -1
%2 = and i1 %cmp94, %cmp95
br i1 %2, label %while.body, label %while.end.loopexit
while.end.loopexit: ; preds = %while.body
%add99.lcssa = phi i64 [ %add99, %while.body ]
ret void
}
attributes #0 = { "target-cpu"="apple-m1" }
> opt -p loop-vectorize
Assertion failed: ((VF.isScalar() || V->getType()->isVectorTy()) && "scalar values must be stored as (0, 0)"), function set, file VPlan.h, line 284.
|
|
Following on from https://github.com/llvm/llvm-project/pull/94499, this
patch adds support to the Loop Vectorizer to emit the partial reduction
intrinsics where they may be beneficial for the target.
---------
Co-authored-by: Samuel Tebbs <samuel.tebbs@arm.com>
|
|
|
|
(#114305)
Introduce a general recipe to generate a scalar phi. Lower
VPCanonicalIVPHIRecipe and VPEVLBasedIVRecipe to VPScalarIVPHIrecipe
before plan execution, avoiding the need for duplicated ::execute
implementations. There are other cases that could benefit, including
in-loop reduction phis and pointer induction phis.
Builds on a similar idea as
https://github.com/llvm/llvm-project/pull/82270.
PR: https://github.com/llvm/llvm-project/pull/114305
|
|
|