| Age | Commit message | Author |
|
If a threading path has cycles within it then the transformation is not
correct. This patch fixes a couple of cases that create such cycles.
Fixes https://github.com/llvm/llvm-project/issues/166868
|
|
(#169163)
Extends the `icmp(trunc(shl))` fold to handle any power of 2 constant as
the shift base, not just 1. This generalizes the following patterns by
adjusting the comparison offsets by `log2(Pow2)`.
```llvm
(trunc (1 << Y) to iN) == 0 --> Y u>= N
(trunc (1 << Y) to iN) != 0 --> Y u< N
(trunc (1 << Y) to iN) == 2**C --> Y == C
(trunc (1 << Y) to iN) != 2**C --> Y != C
; to
(trunc (Pow2 << Y) to iN) == 0 --> Y u>= N - log2(Pow2)
(trunc (Pow2 << Y) to iN) != 0 --> Y u< N - log2(Pow2)
(trunc (Pow2 << Y) to iN) == 2**C --> Y == C - log2(Pow2)
(trunc (Pow2 << Y) to iN) != 2**C --> Y != C - log2(Pow2)
```
Proof: https://alive2.llvm.org/ce/z/2zwTkp
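The generalized patterns can be sanity-checked by brute force on small bit widths. This is a minimal sketch, not the InstCombine implementation: the helper names `trunc_shl` and `check_fold` are hypothetical, and it assumes `log2(Pow2) < N` and a shift amount below the source width (larger shift amounts are poison in LLVM IR).

```python
def trunc_shl(pow2, y, src_bits, dst_bits):
    """Model trunc(pow2 << y) from a src_bits-wide integer to dst_bits."""
    shifted = (pow2 << y) & ((1 << src_bits) - 1)
    return shifted & ((1 << dst_bits) - 1)


def check_fold(src_bits=16, dst_bits=8):
    """Exhaustively compare both sides of the ==0 and ==2**C patterns."""
    for k in range(dst_bits):            # k = log2(Pow2), assumed < N
        pow2 = 1 << k
        for y in range(src_bits):        # in-range shift amounts only
            t = trunc_shl(pow2, y, src_bits, dst_bits)
            # (trunc (Pow2 << Y) to iN) == 0  <=>  Y u>= N - log2(Pow2)
            if (t == 0) != (y >= dst_bits - k):
                return False
            # (trunc (Pow2 << Y) to iN) == 2**C  <=>  Y == C - log2(Pow2)
            for c in range(dst_bits):
                if (t == (1 << c)) != (y == c - k):
                    return False
    return True
```

The `!=` variants follow by negating both sides, so they are not checked separately here.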
|
|
(NFC) (#156578)
Add detailed comments explaining why each function should/shouldn't be
unroll-and-jammed based on memory access patterns and dependencies.
Fix loop bounds to ensure array accesses are within array bounds:
* sub_sub_less: j starts from 1 (not 0) to ensure j-1 >= 0
* sub_sub_less_3d: k starts from 1 (not 0) to ensure k-1 >= 0
* sub_sub_outer_scalar: j starts from 1 (not 0) to ensure j-1 >= 0
|
|
These shuffles can always be implemented using v_perm_b32, and so this
rewrites the analysis from the perspective of "how many v_perm_b32s does
it take to assemble each register of the result?"
The test changes in Transforms/SLPVectorizer/reduction.ll are
reasonable: VI (gfx8) has native f16 math, but not packed math.
|
|
of ExpandVariadic. (#168161)
This PR fixes the issue where profile metadata (`!prof`) is dropped from
the `VariadicWrapper` when `ExpandVariadics` runs in
`--expand-variadics-override=optimize` mode.
In optimize mode, the pass splits the original variadic function into
two parts:
- A **VariadicWrapper** (retaining the original name) that handles the
`va_list` setup.
- A **FixedArityReplacement** (new function) that contains the original
core logic.
During this process, the basic blocks and associated metadata are
spliced into the `FixedArityReplacement`. Consequently, the
`VariadicWrapper`—which serves as the entry point for callers—is left
without function entry count metadata.
This change explicitly copies the `MD_prof` metadata from the
`FixedArityReplacement` back to the `VariadicWrapper` after the split is
defined.
Co-authored-by: Jin Huang <jingold@google.com>
|
|
Only apply forced instruction costs to recipes with underlying values to
match the legacy cost model. A VPlan may have a number of additional
VPInstructions without underlying values that are not considered for its
cost, and assigning forced costs to them would incorrectly inflate its
cost.
This fixes a cost divergence between legacy and VPlan-based cost models
with forced instruction costs.
PR: https://github.com/llvm/llvm-project/pull/168372
|
|
This patch replaces the delinearization function used in
LoopCacheAnalysis, switching from one that depends on type information
in GEPs to one that does not. Once this patch and
https://github.com/llvm/llvm-project/pull/161822 are landed, we can
delete `tryDelinearizeFixedSize` from Delinearization, which is an
optimization heuristic guided by GEP type information. After Polly
eliminates its use of `getIndexExpressionsFromGEP`, we will be able to
completely delete GEP-driven heuristics from Delinearization.
|
|
After truncating an integer-induction, neither nuw nor nsw hold.
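A small model of why the flags cannot be kept, as a sketch under stated assumptions: the function names are hypothetical, the step is assumed positive, and the wide induction is assumed not to overflow its own type.

```python
def truncated_iv(start, step, trip_count, narrow_bits):
    """Values of a wide induction variable after truncation to narrow_bits."""
    mask = (1 << narrow_bits) - 1
    return [(start + i * step) & mask for i in range(trip_count)]


def narrow_increment_wraps(values, step, narrow_bits):
    """Return (unsigned_wrap, signed_wrap) for 'cur + step' in the narrow type."""
    umax = (1 << narrow_bits) - 1
    smax = (1 << (narrow_bits - 1)) - 1

    def as_signed(v):
        # Reinterpret the narrow bit pattern as a two's-complement value.
        return v - (1 << narrow_bits) if v > smax else v

    unsigned_wrap = any(v + step > umax for v in values)
    signed_wrap = any(as_signed(v) + step > smax for v in values)
    return unsigned_wrap, signed_wrap
```

For example, an induction running from 0 for 300 iterations never overflows in 32 bits, but once truncated to 8 bits its increment wraps in both the unsigned and the signed sense.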
Fixes #168902.
Co-authored-by: Florian Hahn <flo@fhahn.com>
|
|
Add detailed comments explaining each function's memory access patterns
and why they should/shouldn't be unroll-and-jammed:
- fore_aft_*: Dependencies between fore block and aft block
- fore_sub_*: Dependencies between fore block and sub block
- sub_aft_*: Dependencies between sub block and aft block
- sub_sub_*: Dependencies within sub block
- *_less: Backward dependency (i-1) - safe for fore/aft, fore/sub,
sub/aft; unsafe for sub/sub due to jamming conflicts
- *_eq: Same iteration dependency (i+0) - safe due to preserved
execution order
- *_more: Forward dependency (i+1) - unsafe due to write-after-write
races between unrolled iterations, except sub/sub case creates conflicts
|
|
When comparing additions with the same base where one has `nsw`, the
following simplification can be performed:
```llvm
icmp slt/sgt/sle/sge (x + C1), (x +nsw C2)
=>
icmp slt/sgt/sle/sge C1, C2
```
Previously this was only done for `slt`. This patch extends it to the
`sgt`, `sle`, and `sge` predicates when either of the conditions hold:
- `C1 <= C2 && C1 >= 0`, or
- `C2 <= C1 && C1 <= 0`
This patch also handles the `C1 == C2` case, which was previously
excluded.
Proof: https://alive2.llvm.org/ce/z/LtmY4f
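The two conditions can also be checked exhaustively on a narrow type. This is a sketch only (the names `fold_applies` and `check` are hypothetical, and a 5-bit type is used to keep the search small). Inputs where `x + C2` overflows are skipped because `nsw` makes that result poison.

```python
import operator

PREDICATES = {
    "slt": operator.lt,
    "sgt": operator.gt,
    "sle": operator.le,
    "sge": operator.ge,
}


def wrap_signed(v, bits):
    """Wrap an integer into the signed two's-complement range of 'bits'."""
    v &= (1 << bits) - 1
    return v - (1 << bits) if v >> (bits - 1) else v


def fold_applies(c1, c2):
    """The two conditions listed in the commit message."""
    return (c1 <= c2 and c1 >= 0) or (c2 <= c1 and c1 <= 0)


def check(bits=5):
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    for c1 in range(lo, hi + 1):
        for c2 in range(lo, hi + 1):
            if not fold_applies(c1, c2):
                continue
            for x in range(lo, hi + 1):
                rhs = x + c2
                if not lo <= rhs <= hi:
                    continue  # nsw violated: poison, nothing to preserve
                lhs = wrap_signed(x + c1, bits)  # plain add may wrap
                for pred in PREDICATES.values():
                    if pred(lhs, rhs) != pred(c1, c2):
                        return False
    return True
```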
|
|
Add a low trip count test that is currently vectorized but unprofitable,
for https://github.com/llvm/llvm-project/issues/167858.
|
|
Remove `VPWidenPointerInductionRecipe::IsScalarAfterVectorization` and
replace it with `onlyScalarValuesUsed`. This removes the need to carry
state from the legacy cost model through VPlan, and the VPlan-based
analysis gives more accurate results, avoiding a number of extracts.
PR: https://github.com/llvm/llvm-project/pull/168289
|
|
Need to check if the non-schedulable phi parent node has unique
operands, if the incoming node has copyables, and the node is
commutative. Otherwise, there might be issues with the correct
calculation of the dependencies.
Fixes #168589
|
|
Add extra tests for over-eager tail-folding for tiny trip-count loops.
Reduced from https://github.com/llvm/llvm-project/issues/167858.
|
|
We can't do anything meaningful to such functions: they aren't optimizable, and even if inlined, they would bring no code open to optimization.
|
|
https://github.com/llvm/llvm-project/pull/162822 added another
validation step to check if entries in a partial reduction chain have
the same scale factor. But the validation was still dependent on the
order of entries in PartialReductionChains, and would fail to reject
some cases (e.g. if the first link matched the scale of the second
link, but the second link is invalidated later).
To fix that, group chains by their starting phi nodes, then perform the
validation for each chain, and if it fails, invalidate the whole chain
for the phi.
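The grouping step can be sketched as follows. This is an illustration only, not the LoopVectorize code: the tuple layout `(phi, link, scale)` and the function name `valid_chains` are assumptions made for the example.

```python
from collections import defaultdict


def valid_chains(links):
    """Group links by starting phi; keep a chain only if all scales agree.

    'links' is an iterable of (phi, link, scale) tuples in any order.
    """
    by_phi = defaultdict(list)
    for phi, link, scale in links:
        by_phi[phi].append((link, scale))

    kept = {}
    for phi, chain in by_phi.items():
        scales = {scale for _, scale in chain}
        if len(scales) == 1:
            kept[phi] = [link for link, _ in chain]
        # Otherwise the whole chain for this phi is dropped, regardless of
        # the order in which its links were recorded.
    return kept
```

Because validation happens per group, the outcome no longer depends on the order of the entries.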
Fixes https://github.com/llvm/llvm-project/issues/167243.
Fixes https://github.com/llvm/llvm-project/issues/167867.
PR: https://github.com/llvm/llvm-project/pull/168036
|
|
LoopPeel sometimes proves that, when reached, the original loop always
executes at least two iterations. LoopPeel then unconditionally executes
both the remaining loop's initial iteration and the peeled final
iteration. But that increases the latter's frequency above its frequency
in the original loop. To maintain the total frequency, this patch
compensates by decreasing the remaining loop's latch probability.
This is another step in issue #135812 and was discussed at
<https://github.com/llvm/llvm-project/pull/166858#discussion_r2528968542>.
|
|
The existing, recently added test contains a whole lot of noise in the
form of dead instructions. Also, prefer named values.
The new test isolates a separate issue with concatenating i8 vectors.
|
|
A pattern of the form reduce.add(ext(mul)) is valid for a partial
reduction as long as the mul and its operands fulfill the requirements
of a normal partial reduction. The mul's extend operands will be
optimised to the wider extend, and we already have oneUse checks in
place to make sure the mul and operands can be modified safely.
1. -> https://github.com/llvm/llvm-project/pull/165536
2. https://github.com/llvm/llvm-project/pull/165543
|
|
This is the fixed version of
https://github.com/llvm/llvm-project/pull/163019
|
|
Many of the def-use chain problems in the SLP vectorizer stem from the
fact that some nodes reuse the same instruction as their insertion
point. An insertion point is not an instruction, but the place between
instructions. To set it correctly, it is better to generate a pseudo
instruction immediately after the last instruction and use it as the
insertion point. This resolves the issues in most cases.
Fixes #168512 #168576
|
|
intrinsics. (#168668)
We can constant fold interleave of identical splat vectors to a larger
splat vector.
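A list-based model of the identity being folded, as a sketch (the `interleave` and `splat` helpers are hypothetical stand-ins for the vector operations, not LLVM APIs):

```python
def splat(value, n):
    """A vector of n lanes that all hold 'value'."""
    return [value] * n


def interleave(*vectors):
    """Interleave equally sized vectors lane by lane."""
    assert len({len(v) for v in vectors}) == 1, "operands must match in length"
    return [lane for group in zip(*vectors) for lane in group]
```

Interleaving any number of identical splats of `n` lanes yields a splat of the same value with `n` times the operand count lanes.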
|
|
deinterleave3-8. (#168640)
|
|
Given a set of pointers, check if they can be rearranged as follows (%s is a constant):
%b + 0 * %s + 0
%b + 0 * %s + 1
%b + 0 * %s + 2
...
%b + 0 * %s + w
%b + 1 * %s + 0
%b + 1 * %s + 1
%b + 1 * %s + 2
...
%b + 1 * %s + w
...
If the pointers can be rearranged in the above pattern, it means that the
memory can be accessed with strided loads of width `w` and stride `%s`.
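The rearrangement check can be sketched on plain integer offsets. This is a simplified illustration, not the SLP vectorizer's analysis: it assumes the offsets are already known constants relative to `%b`, measured in elements, and the function name `match_strided` is hypothetical. It reports the length of each contiguous run and the distance between run starts.

```python
def match_strided(offsets):
    """Return (run_length, stride) if the offsets form equal contiguous
    runs whose starts are a constant stride apart, else None."""
    offs = sorted(offsets)
    if not offs or len(set(offs)) != len(offs):
        return None

    # Split the sorted offsets into runs of consecutive values.
    runs = [[offs[0]]]
    for o in offs[1:]:
        if o == runs[-1][-1] + 1:
            runs[-1].append(o)
        else:
            runs.append([o])

    if len(runs) < 2:
        return None  # a single run is an ordinary contiguous access
    run_length = len(runs[0])
    if any(len(r) != run_length for r in runs):
        return None

    starts = [r[0] for r in runs]
    stride = starts[1] - starts[0]
    if any(b - a != stride for a, b in zip(starts, starts[1:])):
        return None
    return run_length, stride
```

The input order does not matter, which mirrors the "can be rearranged" wording above.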
|
|
Clean up some of the existing predicated load/store sink/hoisting tests
and add additional test coverage for more complex cases.
|
|
This matches how IR is printed.
|
|
Do not consider loops with a zero backedge taken count as candidates for
interchange. This seems like a sensible thing because it suggests the loop
doesn't execute and there is no point in interchanging. As a bonus, this
seems to avoid triggering an assert about phis and their uses from source
code, so this is a partial fix for #163954 but it needs more work to properly
fix that.
|
|
This patch replaces the delinearization function used in DA, switching
from one that depends on type information in GEPs to one that does not.
There are three types of changes in regression tests: improvements,
degradations, and degradations but the related features will be
removed. Since there were very few cases that are classified into the
second category, I believe the impact of this change should be
practically insignificant.
|
|
Use the recently refactored VPRecipeBase::print to print debug location
for all recipes.
PR: https://github.com/llvm/llvm-project/pull/168454
|
|
Consider skipping scalable epilogue VFs when they are greater than
RemainingIterations, the same as for fixed VFs.
Exclude scalable RemainingIterations from that comparison, because
SCEV currently can't evaluate non-canonical vscale-based expressions.
|
|
We don't have enough information to infer the probability of a weak function pointer being nullptr or not (it is an open question whether we could propagate this from the linker).
Issue #147390
|
|
FCmp instructions have both a predicate and fast-math flags. Introduce a
new FCmp kind, that combines both to model this correctly in the current
system.
This should be NFC modulo VPlan printing which now includes the correct
fast-math flags.
|
|
Follow up on a cse OpType-mismatch crash reported due to ef023cae388d
(Reland [VPlan] Expand WidenInt inductions with nuw/nsw), setting the
OpType correctly when returning from getFlagsFromIndDesc.
|
|
Identity masks can be treated as free when scalable vectorization is
possible, making the check agnostic of the vectorization policy
(fixed/scalable). This allows for aggressive vector combines for identity
shuffle masks.
|
|
interleave3-8. (#168473)
|
|
https://alive2.llvm.org/ce/z/YGT5SN
https://alive2.llvm.org/ce/z/PVDxCw
https://alive2.llvm.org/ce/z/8buR2N
This is tricky because with positive numbers, we only go up, so we can
in fact always hit the signed_max boundary. This is important because
the intrinsic we use has the behavior of going the OTHER way, aka clamp
to INT_MIN if it goes in that direction.
And the range checking we do only works for positive numbers.
Because of this issue, we can only do this for constants as well.
|
|
Update VPlan to populate VPIRFlags during VPInstruction construction and
use it when creating widened recipes, instead of constructing VPIRFlags
from the underlying IR instruction each time. The VPRecipeWithIRFlags
constructor taking an underlying instruction and setting the flags based
on it has been removed.
This centralizes initial VPIRFlags creation and ensures flags are
consistently available throughout VPlan transformations and makes sure
we don't accidentally re-add flags from the underlying instruction that
already got dropped during transformations.
Follow-up to https://github.com/llvm/llvm-project/pull/167253, which did
the same for VPIRMetadata.
Should be NFC w.r.t. the generated IR.
PR: https://github.com/llvm/llvm-project/pull/168450
|
|
[andv, eorv, orv, s/uaddv, s/umaxv, s/uminv]
sve_reduce_##(none, ?) -> op's neutral value
sve_reduce_##(any, neutral) -> op's neutral value
[andv, orv, s/umaxv, s/uminv]
sve_reduce_##(all, splat(X)) -> X
[eorv]
sve_reduce_##(all, splat(X)) -> 0
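The folds above can be illustrated with a small list-based model. This is a sketch under simplifying assumptions, not the SVE semantics in full: lanes are unsigned 8-bit values, only the unsigned and bitwise reductions are modeled, the lane count is assumed even, and the names `sve_reduce`, `OPS` and `NEUTRAL` are hypothetical.

```python
from functools import reduce
import operator

OPS = {
    "andv": operator.and_,
    "eorv": operator.xor,
    "orv": operator.or_,
    "uaddv": operator.add,
    "umaxv": max,
    "uminv": min,
}

# Neutral element of each operation for unsigned 8-bit lanes.
NEUTRAL = {"andv": 0xFF, "eorv": 0, "orv": 0, "uaddv": 0, "umaxv": 0, "uminv": 0xFF}


def sve_reduce(op, pred, lanes):
    """Reduce the active lanes, starting from the operation's neutral value."""
    active = [v for p, v in zip(pred, lanes) if p]
    return reduce(OPS[op], active, NEUTRAL[op])


LANES = 16
NONE = [False] * LANES
ALL = [True] * LANES
```

With no active lanes every reduction yields its neutral value. With all lanes active, a splat of `X` reduces to `X` for the idempotent operations and to 0 for `eorv`, because an even number of equal values cancel under xor.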
|
|
Exceptions include intrinsics that:
* take or return floating point data
* read or write FFR
* read or write memory
* read or write SME state
|
|
This patch introduces preliminary support for additional memory
locations.
They are: target_mem0 and target_mem1 and they model memory locations
that cannot be represented with existing memory locations.
This solution was suggested in:
https://discourse.llvm.org/t/rfc-improving-fpmr-handling-for-fp8-intrinsics-in-llvm/86868/6
Currently, these locations are not yet target-specific. The goal is to
enable the compiler to express read/write effects on these resources.
|
|
This patch implements a transform to hoist single-scalar replicated
loads with invariant addresses out of the vector loop to the preheader
when scoped noalias metadata proves they cannot alias with any stores in
the loop.
This enables hoisting of loads we can prove do not alias any stores in
the loop due to memory runtime checks added during vectorization.
PR: https://github.com/llvm/llvm-project/pull/166247
|
|
This is a small code size optimization that lets us avoid both shifting
and comparing to a constant if we need the shifted value anyway. On most
architectures the zero comparison is cheaper than a constant comparison
(or free if the shift sets flags).
Although this change appears to remove the optimization entirely, we
continue to do this transform if there is one use because of the code
below the removed code that transforms the shift into an and, followed
by the PR10267 case in InstCombinerImpl::foldICmpAndConstConst that
transforms the and into a ult/ugt. Added a test case to verify this
explicitly.
Per [1] reduces clang .text size by 0.09% and dynamic instruction count
by 0.01%.
[1] https://llvm-compile-time-tracker.com/compare.php?from=1f38d49ebe96417e368a567efa4d650b8a9ac30f&to=0873787a12b8f2eab019d8211ace4bccc1807343&stat=size-text
Reviewers: nikic, dtcxzyw
Reviewed By: dtcxzyw
Pull Request: https://github.com/llvm/llvm-project/pull/168007
|
|
We build the callsite graph by first adding nodes and edges for all
allocation contexts, then match the interior callsite nodes onto actual
calls (IR or summary), which due to inlining may result in the
generation of new nodes representing the inlined context sequence. We
attempt to update edges correctly during this process, but in the case
of recursion this becomes impossible to always get correct.
Specifically, when creating new inlined sequence nodes for stack ids on
recursive cycles we can't always update correctly, because we have lost
the original ordering of the context.
This PR introduces a mechanism, guarded by -memprof-top-n-important=
flag, to keep track of extra information for the largest N cold
contexts. Another flag -memprof-fixup-important (enabled by default)
will perform more expensive fixup of the edges for those largest N cold
contexts, by saving and walking the original ordered list of stack ids
from the context.
|
|
Add test where we have loads with existing noalias metadata and noalias
metadata gets added by loop versioning.
|
|
Changes: The previous patch had to be reverted due to a mismatching-OpType
assert in cse. A reduced test corresponding to an RVV pointer-induction
has now been added, and the pointer-induction case has been updated
to use createOverflowingBinaryOp.
While at it, record VPIRFlags in VPWidenInductionRecipe.
|
|
For a scalar only VPlan with tail folding, if it has a phi live out then
legalizeAndOptimizeInductions will scalarize the widened canonical IV
feeding into the header mask:
<x1> vector loop: {
vector.body:
EMIT vp<%4> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
vp<%5> = SCALAR-STEPS vp<%4>, ir<1>, vp<%0>
EMIT vp<%6> = icmp ule vp<%5>, vp<%3>
EMIT vp<%index.next> = add nuw vp<%4>, vp<%1>
EMIT branch-on-count vp<%index.next>, vp<%2>
No successors
}
Successor(s): middle.block
middle.block:
EMIT vp<%8> = last-active-lane vp<%6>
EMIT vp<%9> = extract-lane vp<%8>, vp<%5>
Successor(s): ir-bb<exit>
The verifier complains about this but this should still generate the
correct last active lane, so this fixes the assert by handling this case
in isHeaderMask. There is a similar pattern already there for
ActiveLaneMask, which also expects a VPScalarIVSteps recipe.
Fixes #167813
|
|
- Introduce the -aarch64-force-unroll-threshold option; when a loop’s
cost is below this value we set UP.Force = true (default 0 keeps current
behaviour)
- Add an AArch64 loop-unroll regression test that runs once at the
default threshold and once with the flag raised, confirming forced
unrolling
|
|
The compiler should not consider split vectorize nodes when checking
for non-schedulable PHI-based parent nodes. Only pure PHI nodes must be
considered: only they can be treated as explicit users, while split
nodes cannot.
Fixes #168268
|