| Age | Commit message (Collapse) | Author |
|
- Support serialization of the number of allocated preload kernarg SGPRs
- Support serialization of the first preload kernarg SGPR allocated
Together they enable reconstructing correctly MIR with preload kernarg
SGPRs.
|
|
I'm not sure if this is the best way forward or not, but we have a lot
of issues with forgetting that shuffle_vectors can be scalar again and
again. (There is another example from the recent known-bits code added
recently). As a scalar-dst shuffle vector is just an extract, and a
scalar-source shuffle vector is just a build vector, this patch makes
scalar shuffle vector illegal and adjusts the irbuilder to create the
correct node as required.
Most targets do this already through lowering or combines. Making scalar
shuffles illegal simplifies gisel as a whole, it just requires that
transforms that create shuffles of new sizes to account for the scalar
shuffle being illegal (mostly IRBuilder and LessElements).
|
|
Fix a whitespace regression in MIR printing that was introduced in
https://github.com/llvm/llvm-project/pull/137361.
The default value for `ListSeparator` is `", "`, so we don't need to
print an additional space in front of tokens for optional symbols and
other things printed after operands.
Note, the modified LIT test will fail at trunk without the fix,
demonstrating that the extra space before `, pre-instr-symbol <mcsymbol
>` on Line 63 exists currently and is fixed with this change.
|
|
|
|
This patch adds the MIR parsing and serialization support for save and
restore points with subsets of callee saved registers. That is, it
syntactically allows a function to contain two or more distinct
sub-regions in which distinct subsets of registers are spilled/filled as
callee save. This is useful if e.g. one of the CSRs isn't modified in
one of the sub-regions, but is in the other(s).
Support for actually using this capability in code generation is still
forthcoming. This patch is the next logical step for multiple
save/restore points support.
All points are now stored in DenseMap from MBB to vector of
CalleeSavedInfo.
Shrink-Wrap points split Part 4.
RFC:
https://discourse.llvm.org/t/shrink-wrap-save-restore-points-splitting/83581
Part 1: https://github.com/llvm/llvm-project/pull/117862 (landed)
Part 2: https://github.com/llvm/llvm-project/pull/119355 (landed)
Part 3: https://github.com/llvm/llvm-project/pull/119357 (landed)
Part 5: https://github.com/llvm/llvm-project/pull/119359 (likely to be
further split)
|
|
Hardware initializes a single value in ttmp9 which is either the
workgroup ID X or cluster ID X. Most of this patch is a refactoring to
use a single `PreloadedValue` enumerator for this value, instead of two
enumerators `WORKGROUP_ID_X` and `CLUSTER_ID_X` referring to the same
value.
This makes it simpler to have a single attribute
`amdgpu-no-workgroup-id-x` indicating that this value is not used, which
in turns sets the TGID_EN_X bit appropriately to tell the hardware
whether to initialize it.
All of the above applies to Y and Z similarly.
Fixes: LWPSCGFX13-568
Co-authored-by: Jay Foad <jay.foad@amd.com>
|
|
(#158122)
This patch adds serialization of
AArch64MachineFunctionInfo::HasStackFrame into MIR.
|
|
Using options like -filetype=null instead should allow tools to save
some work by not generating any output.
|
|
(#153226)
In review of bbde6b, I had originally proposed that we support the
legacy text format. As review evolved, it bacame clear this had been a
bad idea (too much complexity), but in order to let that patch finally
move forward, I approved the change with the variant. This change undoes
the variant, and updates all the tests to just use the array form.
|
|
Currently mir supports only one save and one restore point
specification:
```
savePoint: '%bb.1'
restorePoint: '%bb.2'
```
This patch provide possibility to have multiple save and multiple
restore points in mir:
```
savePoints:
- point: '%bb.1'
restorePoints:
- point: '%bb.2'
```
Shrink-Wrap points split Part 3.
RFC:
https://discourse.llvm.org/t/shrink-wrap-save-restore-points-splitting/83581
Part 1: https://github.com/llvm/llvm-project/pull/117862
Part 2: https://github.com/llvm/llvm-project/pull/119355
Part 4: https://github.com/llvm/llvm-project/pull/119358
Part 5: https://github.com/llvm/llvm-project/pull/119359
|
|
When computing the number of registers required by entry functions, the
`AMDGPUAsmPrinter` needs to take into account both the register usage
computed by the `AMDGPUResourceUsageAnalysis` pass, and the number
of registers initialized by the hardware. At the moment, the way it
computes the latter is different for graphics vs compute, due to differences in
the implementation. For kernels, all the information needed is available in
the `SIMachineFunctionInfo`, but for graphics shaders we would iterate over
the `Function` arguments in the `AMDGPUAsmPrinter`. This pretty much
repeats some of the logic from instruction selection.
This patch introduces 2 new members to `SIMachineFunctionInfo`, one
for SGPRs and one for VGPRs. Both will be computed during instruction
selection and then used during `AMDGPUAsmPrinter`, removing the need
to refer to the `Function` when printing assembly.
This patch is NFC except for the fact that we now add the extra SGPRs
(VCC, XNACK etc) to the number of SGPRs computed for graphics entry points.
I'm not sure why these weren't included before. It would be nice if
someone could confirm if that was just an oversight or if we have some docs
somewhere that I haven't managed to find. Only one test is affected (its SGPR
usage increases because we now take into account the XNACK registers).
|
|
Update MachineFunction::CallSiteInfo to extract numeric CalleeTypeIds
from callee_type metadata attached to indirect call instructions.
Reviewers: nikic, ilovepi
Reviewed By: ilovepi
Pull Request: https://github.com/llvm/llvm-project/pull/87575
|
|
This is the following PR of
https://github.com/llvm/llvm-project/pull/136553 which calculate
NoaliasAddrSpace.
This PR carries the info calculated into MIR by adding it into AAMDnodes
|
|
This reverts commit 05e08cdb3e576cc0887d1507ebd2f756460c7db7.
Adding the missing -mtriple flags in MIR/X86 test files which caused
these tests to fail which was the reason for reverting the patch.
|
|
Reverts llvm/llvm-project#87574, which breaks LLVM ::
CodeGen/MIR/X86/call-site-info-ambiguous-indirect-call-typeid.mir tests
on linux-arm64 builders.
|
|
Introducing `EnableCallGraphSection` target option to add
CalleeTypeIds field in CallSiteInfo. Read the callee type ids
in and out by the MIR parser/printer.
Reviewers: ilovepi
Reviewed By: ilovepi
Pull Request: https://github.com/llvm/llvm-project/pull/87574
|
|
Whole wave functions are functions that will run with a full EXEC mask.
They will not be invoked directly, but instead will be launched by way
of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in
a future patch). These functions are meant as an alternative to the
`llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics.
Whole wave functions will set EXEC to -1 in the prologue and restore the
original value of EXEC in the epilogue. They must have a special first
argument, `i1 %active`, that is going to be mapped to EXEC. They may
have either the default calling convention or amdgpu_gfx. The inactive
lanes need to be preserved for all registers used, active lanes only for
the CSRs.
At the IR level, arguments to a whole wave function (other than
`%active`) contain poison in their inactive lanes. Likewise, the return
value for the inactive lanes is poison.
This patch contains the following work:
* 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN
used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return
a SReg_1 representing `%active`, which needs to be passed into
SI_WHOLE_WAVE_FUNC_RETURN.
* SelectionDAG support for generating these 2 new pseudos and the
special handling of %active. Since the return may be in a different
basic block, it's difficult to add the virtual reg for %active to
SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF
which is later replaced via a custom inserter.
* Expansion of the 2 pseudos during prolog/epilog insertion. PEI also
marks any used VGPRs as WWM registers, which are then spilled and
restored with the usual logic.
Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic
and a lot of optimization work (especially in order to reduce spills
around function calls).
---------
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Co-authored-by: Shilei Tian <i@tianshilei.me>
|
|
|
|
This change cleans up DAG-to-DAG instruction selection around FTZ and
SETP comparison mode. Largely these changes do not impact functionality
though support for `{sin.cos}.approx.ftz.f32` is added.
|
|
This change fixes v2i8 lowering for parameters and returned values. As
part of this work, I move the lowering for return values to use generic
ISD::STORE nodes as these are more flexible and have existing
legalization handling.
Note that calling a function with v2i8 arguments or returns is still not
working but this is left for a subsequent change as this MR is already
fairly large.
Partially addresses #128853
|
|
Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget
feature, as requested in #130030.
|
|
|
|
|
|
(#141711)
|
|
These classes are redundant, as the untyped "Int" classes can be used
for all float operations. This change is intended to be as minimal as
possible and leaves the many potential simplifications and refactors
this exposes as future work.
|
|
(#139292)
PTX 8.8+ introduces 256-bit-wide vector loads/stores under certain
conditions. This change extends the backend to lower these loads/stores.
It also overrides getLoadStoreVecRegBitWidth for NVPTX, allowing the
LoadStoreVectorizer to create these wider vector operations.
See the spec for the three relevant PTX instructions here:
- https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld
- https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld-global-nc
- https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st
|
|
Fixes #127781
|
|
(#137795)
This reverts partially this commit
0b73b5af60f2c544892b9dd68b4fa43eeff52fc1.
This is not a clear revert because other changes already landed.
CFI directives like `.cfi_negate_ra_state` must be emitted after the
instruction.
If the execution is stopped before the `paciasp` instruction is executed
the debugger/unwinder would try to authenticated the return address as
the `.cfi_negate_ra_state` already indicates it got signed.
fixes: #137802
|
|
Previously the AnnotateKernelFeatures pass infers two attributes:
amdgpu-calls and amdgpu-stack-objects, which are used to help determine
if flat scratch init is allowed. PR #118907 created the
amdgpu-no-flat-scratch-init attribute. Continuing with that work, this
patch makes use of this attribute to determine flat scratch init,
replacing amdgpu-calls and amdgpu-stack-objects. This also leads to the
removal of the AnnotateKernelFeatures pass.
|
|
As far as I understand, the first operand of dbg_declare should be a
pointer (inside a metadata wrapper). However, using a non-pointer is
currently not rejected, and we have some tests that use non-pointer
types. As far as I can tell, these tests either meant to use dbg_value
or are just incorrect hand-crafted tests.
Ran into this while trying to `fix` #134008.
|
|
After outputting block scalar string, the indent will be wrong.
This patch fixes Padding after block scalar string to ensure the correct
format of yaml.
The new added ut will fail in main.
```diff
@@ -3,4 +3,4 @@
Just a block
scalar doc
-scalar: a
+ scalar: a
...\n
```
|
|
The CWSR trap handler needs to save and restore the VGPRs. When dynamic
VGPRs are in use, the fixed function hardware will only allocate enough
space for one VGPR block. The rest will have to be stored in scratch, at
offset 0.
This patch allocates the necessary space by:
- generating a prologue that checks at runtime if we're on a compute
queue (since CWSR only works on compute queues); for this we will have
to check the ME_ID bits of the ID_HW_ID2 register - if that is non-zero,
we can assume we're on a compute queue and initialize the SP and FP with
enough room for the dynamic VGPRs
- forcing all compute entry functions to use a FP so they can access
their locals/spills correctly (this isn't ideal but it's the quickest to
implement)
Note that at the moment we allocate enough space for the theoretical
maximum number of VGPRs that can be allocated dynamically (for blocks of
16 registers, this will be 128, of which we subtract the first 16, which
are already allocated by the fixed function hardware). Future patches
may decide to allocate less if they can prove the shader never allocates
that many blocks.
Also note that this should not affect any reported stack sizes (e.g. PAL
backend_stack_size etc).
|
|
andorbitset.ll is interesting since it directly depends on the
difference between poison and undef. Not sure it's useful to keep
the version using poison, I assume none of this code makes it to
codegen.
si-spill-cf.ll was also a nasty case, which I doubt has been reproducing
its original issue for a very long time. I had to reclaim an older version,
replace some of the poison uses, and run simplify-cfg. There's a very
slight change in the final CFG with this, but final the output is approximately
the same as it used to be.
|
|
The IR doesn't matter so much in these.
|
|
EnableTailMerge is false by default and is handled by the pass builder.
Passes are independent of target pipeline options.
This completes the generic `MachineLateOptimization` passes for the NPM
pipeline.
|
|
This PR updates the SGPR layout to a striped caller/callee-saved design,
similar
to the VGPR layout.
To ensure that s30-s31 (return address), s32 (stack pointer), s33 (frame
pointer), and s34 (base pointer) remain callee-saved, the striped layout
starts
from s40, with a stripe width of 8. The last stripe is 10 wide instead
of 8 to
avoid ending with a 2-wide stripe.
Fixes #113782.
|
|
targets that aren't `catchret` instructions (#129953)
This change splits out the renaming and comment updates from #129612 as a non-functional change.
|
|
Not sure why this was disabled in the first place (dates back to
<https://github.com/llvm/llvm-project/commit/fbe9c04c5f72cf3eca39793aafc92071ef13c046>),
but it appears to be working for me.
|
|
Use `-passes="regallocgreedy<[all|sgpr|wwm|vgpr]>` to insert the greedy
RA with a filter and `-regalloc-npm=<type>` to control which RA to use
in existing pipeline.
|
|
This change fold together the _ari, _ari64, and _asi variants of these
instructions into a single instruction capable of holding any address.
This allows for the removal of a lot of unnecessary code and moves us
towards a standard way of representing an address in NVPTX.
|
|
Remove load and store instructions which do not include an immediate,
and just use the immediate variants in all cases. These variants will be
emitted exactly the same when the immediate offset is 0. Removing the
non-immediate versions allows for the removal of a lot of code and would
make any MachineIR passes simpler.
|
|
llvm/llvm-project@9e436c2daa44 tries to handle register masks and
sub-registers, it avoids clobbering RegUnit presreved by regmask. But it
then introduces invalid pointer issues.
We delete the copies without invalidate all the use in the CopyInfo, so
we dereferenced invalid pointers in next interation, causing asserts.
Fixes: #126107
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
|
|
(#124767)
This patch contains a number of changes relating to the above flag;
primarily it updates comment references to the old flag names,
"-fextend-lifetimes" and "-fextend-this-ptr" to refer to the new names,
"-fextend-variable-liveness[={all,this}]". These changes are all NFC.
This patch also removes the explicit -fextend-this-ptr-liveness flag
alias, and shortens the help-text for the main flag; these are both
changes that were meant to be applied in the initial PR (#110000), but
due to some user-error on my part they were not included in the merged
commit.
|
|
Occupancy (i.e., the number of waves per EU) depends, in addition to
register usage, on per-workgroup LDS usage as well as on the range of
possible workgroup sizes. Mirroring the latter, occupancy should
therefore be expressed as a range since different group sizes generally
yield different achievable occupancies.
`getOccupancyWithLocalMemSize` currently returns a scalar occupancy
based on the maximum workgroup size and LDS usage. With respect to the
workgroup size range, this scalar can be the minimum, the maximum, or
neither of the two of the range of achievable occupancies. This commit
fixes the function by making it compute and return the range of
achievable occupancies w.r.t. workgroup size and LDS usage; it also
renames it to `getOccupancyWithWorkGroupSizes` since it is the range of
workgroup sizes that produces the range of achievable occupancies.
Computing the achievable occupancy range is surprisingly involved.
Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum
occupancies i.e., sometimes workgroup sizes inside the range yield the
occupancy bounds. The implementation finds these sizes in constant time;
heavy documentation explains the rationale behind the sometimes
relatively obscure calculations.
As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU,
64-wide waves. Also consider a function with no LDS usage and a flat
workgroup size range of [513,1024].
- A group of 513 items requires 9 waves per group. Only 4 groups made up
of 9 waves each can fit fully on a CU at any given time, for a total of
36 waves on the CU, or 9 per EU. However, filling as much as possible
the remaining 40-36=4 wave slots without decreasing the number of groups
reveals that a larger group of 640 items yields 40 waves on the CU, or
10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2
groups made up of 16 waves each can fit fully on a CU ay any given time,
for a total of 32 waves on the CU, or 8 per EU. However, removing as
many waves as possible from the groups without being able to fit another
equal-sized group on the CU reveals that a smaller group of 896 items
yields 28 waves on the CU, or 7 per EU.
Therefore the achievable occupancy range for this function is not [8,9]
as the group size bounds directly yield, but [7,10].
Naturally this change causes a lot of test churn as instruction
scheduling is driven by achievable occupancy estimates. In most unit
tests the flat workgroup size range is the default [1,1024] which,
ignoring potential LDS limitations, would previously produce a scalar
occupancy of 8 (derived from 1024) on a lot of targets, whereas we now
consider the maximum occupancy to be 10 in such cases. Most tests are
updated automatically and checked manually for sanity. I also manually
changed some non-automatically generated assertions when necessary.
Fixes #118220.
|
|
|
|
to MSVC /d2ImportCallOptimization) (#121516)" (#122777)
This reverts commit 2f7ade4b5e399962e18f5f9a0ab0b7335deece51.
Fix is available in #122762
|
|
to MSVC /d2ImportCallOptimization) (#121516)"
Breaks sanitizer build: https://lab.llvm.org/buildbot/#/builders/52/builds/5179
This reverts commits:
5ee0a71df919a328c714e25f0935c21e586cc18b
d997a722c194feec5f3a94dec5acdce59ac5e55b
|
|
/d2ImportCallOptimization) (#121516)
This change implements import call optimization for AArch64 Windows
(equivalent to the undocumented MSVC `/d2ImportCallOptimization` flag).
Import call optimization adds additional data to the binary which can be
used by the Windows kernel loader to rewrite indirect calls to imported
functions as direct calls. It uses the same [Dynamic Value Relocation
Table mechanism that was leveraged on x64 to implement
`/d2GuardRetpoline`](https://techcommunity.microsoft.com/blog/windowsosplatform/mitigating-spectre-variant-2-with-retpoline-on-windows/295618).
The change to the obj file is to add a new `.impcall` section with the
following layout:
```cpp
// Per section that contains calls to imported functions:
// uint32_t SectionSize: Size in bytes for information in this section.
// uint32_t Section Number
// Per call to imported function in section:
// uint32_t Kind: the kind of imported function.
// uint32_t BranchOffset: the offset of the branch instruction in its
// parent section.
// uint32_t TargetSymbolId: the symbol id of the called function.
```
NOTE: If the import call optimization feature is enabled, then the
`.impcall` section must be emitted, even if there are no calls to
imported functions.
The implementation is split across a few parts of LLVM:
* During AArch64 instruction selection, the `GlobalValue` for each call
to a global is recorded into the Extra Information for that node.
* During lowering to machine instructions, the called global value for
each call is noted in its containing `MachineFunction`.
* During AArch64 asm printing, if the import call optimization feature
is enabled:
- A (new) `.impcall` directive is emitted for each call to an imported
function.
- The `.impcall` section is emitted with its magic header (but is not
filled in).
* During COFF object writing, the `.impcall` section is filled in based
on each `.impcall` directive that were encountered.
The `.impcall` section can only be filled in when we are writing the
COFF object as it requires the actual section numbers, which are only
assigned at that point (i.e., they don't exist during asm printing).
I had tried to avoid using the Extra Information during instruction
selection and instead implement this either purely during asm printing
or in a `MachineFunctionPass` (as suggested in [on the
forums](https://discourse.llvm.org/t/design-gathering-locations-of-instructions-to-emit-into-a-section/83729/3))
but this was not possible due to how loading and calling an imported
function works on AArch64. Specifically, they are emitted as `ADRP` +
`LDR` (to load the symbol) then a `BR` (to do the call), so at the point
when we have machine instructions, we would have to work backwards
through the instructions to discover what is being called. An initial
prototype did work by inspecting instructions; however, it didn't
correctly handle the case where the same function was called twice in a
row, which caused LLVM to elide the `ADRP` + `LDR` and reuse the
previously loaded address. Worse than that, sometimes for the
double-call case LLVM decided to spill the loaded address to the stack
and then reload it before making the second call. So, instead of trying
to implement logic to discover where the value in a register came from,
I instead recorded the symbol being called at the last place where it
was easy to do: instruction selection.
|
|
(#121559)
As part #112171, support for FEAT_PAuthLR's CFI instructions was added.
However, the CFI instructions are emitted in the incorrect location. This
leads to incorrect CodeGen being generated and possible issues when
running a program. According to the ABI, the CFI instructions should be
emitted before the signing instruction. This is now done properly.
ABI information can be found here:
https://github.com/ARM-software/abi-aa/blob/bf0e2c8047c70987165f3e05e571d7836370ade9/aadwarf64/aadwarf64.rst#44call-frame-instructions
|
|
We find it helpful to increase the value for graphics workload. Make it
configurable so we can experiment with a different value.
|