llvm-project.git/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp, branch users/chapuni/cov/single/switch

[AArch64] Fix signed comparison warning. NFC

2025-01-08T08:59:15+00:00

[AArch64] Add a subvector extract cost. (#121472)

2025-01-08T08:13:07+00:00

These can generally be emitted using an ext instruction or mov from the
high half. The half half extracts can be free depending on the users,
but that is not handled here, just the basic costs. It originally
included all subvector extracts, but that was toned-down to just
half-vector extracts to try and help the mid end not breakup high/low
extracts without having the SLP vectorizer create a mess using other
shuffles.

[AArch64] Add BF16 fpext and fptrunc costs. (#119524)

2025-01-07T09:39:08+00:00

This expands the recently added fp16 fpext and fpround costs to bf16.
Some of the costs are taken from the rough number of instructions
needed, some are a little aspirational.
https://godbolt.org/z/bGEEd1vsW

[AArch64] Improve codegen of vectorised early exit loops (#119534)

2025-01-06T13:17:14+00:00

Once PR #112138 lands we are able to start vectorising more loops
that have uncountable early exits. The typical loop structure
looks like this:

vector.body:
  ...
  %pred = icmp eq <2 x ptr> %wide.load, %broadcast.splat
  ...
  %or.reduc = tail call i1 @llvm.vector.reduce.or.v2i1(<2 x i1> %pred)
  %iv.cmp = icmp eq i64 %index.next, 4
  %exit.cond = or i1 %or.reduc, %iv.cmp
  br i1 %exit.cond, label %middle.split, label %vector.body

middle.split:
  br i1 %or.reduc, label %found, label %notfound

found:
  ret i64 1

notfound:
  ret i64 0

The problem with this is that %or.reduc is kept live after the loop,
and since this is a boolean it typically requires making a copy of
the condition code register. For AArch64 this requires an additional
cset instruction, which is quite expensive for a typical find loop
that only contains 6 or 7 instructions.

This patch attempts to improve the codegen by sinking the reduction
out of the loop to the location of it's user. It's a lot cheaper to
keep the predicate alive if the type is legal and has lots of
registers for it. There is a potential downside in that a little
more work is required after the loop, but I believe this is worth
it since we are likely to spend most of our time in the loop.

[AArch64][SME] Disable inlining of callees with new ZT0 state (#121338)

2025-01-06T12:02:28+00:00

Inlining must be disabled for new-ZT0 callees as the callee is required
to save ZT0 and toggle PSTATE.ZA on entry.

[IRBuilder] Refactor FMF interface (#121657)

2025-01-06T06:37:04+00:00

Up to now, the only way to set specified FMF flags in IRBuilder is to
use `FastMathFlagGuard`. It makes the code ugly and hard to maintain.

This patch introduces a helper class `FMFSource` to replace the original
parameter `Instruction *FMFSource` in IRBuilder. To maximize the
compatibility, it accepts an instruction or a specified FMF.
This patch also removes the use of `FastMathFlagGuard` in some simple
cases.

Compile-time impact:
https://llvm-compile-time-tracker.com/compare.php?from=f87a9db8322643ccbc324e317a75b55903129b55&to=9397e712f6010be15ccf62f12740e9b4a67de2f4&stat=instructions%3Au

[AArch64] Add an option for sve-prefer-fixed-over-scalable-if-equal. NFC

2024-12-31T11:07:42+00:00

Add a new option to control preferFixedOverScalableIfEqualCost, useful for
testing.

[AArch64] SME implementation for agnostic-ZA functions (#120150)

2024-12-23T19:10:21+00:00

This implements the lowering of calls from agnostic-ZA functions to
non-agnostic-ZA functions, using the ABI routines
`__arm_sme_state_size`, `__arm_sme_save` and `__arm_sme_restore`.

This implements the proposal described in the following PRs:
* https://github.com/ARM-software/acle/pull/336
* https://github.com/ARM-software/abi-aa/pull/264

[AArch64] Unroll some loops with early-continues on Apple Silicon. (#118499)

2024-12-22T13:10:54+00:00

Try to runtime-unroll loops with early-continues depending on
loop-varying loads; this helps with branch-prediction for the
early-continues and can significantly improve performance
for such loops

Builds on top of https://github.com/llvm/llvm-project/pull/118317.

PR: https://github.com/llvm/llvm-project/pull/118499.

[AArch64] Tweak truncate costs for some scalable vector types (#119542)

2024-12-19T10:07:41+00:00

== We were previously returning an invalid cost when truncating
anything to , which is incorrect since we can
generate perfectly good code for this.

== The costs for truncating legal or unpacked types to predicates
seemed overly optimistic. For example, when truncating
 to  we typically do
something like

  and z0.h, z0.h, #0x1
  cmpne   p0.h, p0/z, z0.h, #0

I guess it might depend upon whether the input value is
generated in the same block or not and if we can avoid the
inreg zero-extend. However, it feels safe to take the more
conservative cost here.

== The costs for some truncates such as

  trunc  %a to 

were 1, whereas in actual fact they are free and no instructions
are required.

== Also, for this

  trunc  %a to 

it's just a single uzp1 instruction so I reduced the cost to 1.

In general, I've added costs for all cases where the destination
type is legal or unpacked. One unfortunate side effect of this
is the costs for some fixed-width truncates when using SVE now
look too optimistic.