path: root/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
Age  Commit message  Author
2025-11-14[AMDGPU] Make use of getFunction and getMF. NFC. (#167872)Jay Foad
2025-09-12[AMDGPUPromoteAlloca][NFC] Avoid unnecessary APInt/int64_t conversions (#157864)Fabian Ritter
Follow-up to #157682
2025-09-10[AMDGPU] Generate canonical additions in AMDGPUPromoteAlloca (#157810)Fabian Ritter
When we know that one operand of an addition is a constant, we might as well put it on the right-hand side and avoid the work of canonicalizing it in a later pass.
2025-09-10[AMDGPU] Treat GEP offsets as signed in AMDGPUPromoteAlloca (#157682)Fabian Ritter
AMDGPUPromoteAlloca can transform i32 GEP offsets that operate on allocas into i64 extractelement indices. Before this patch, negative GEP offsets would be zero-extended, leading to wrong extractelement indices with values around (2**32-1). This fixes failing LlvmLibcCharacterConverterUTF32To8Test tests for AMDGPU.
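The difference between the two widenings can be illustrated with a small standalone sketch. It does not use LLVM; `zeroExtend`, `signExtend`, and `finalIndex` are hypothetical names chosen for illustration, and the 32-bit offset is passed as its raw bit pattern.

```cpp
#include <cassert>
#include <cstdint>

// Old behaviour (modelled): the 32-bit offset is treated as unsigned.
constexpr std::int64_t zeroExtend(std::uint32_t bits) {
  return static_cast<std::int64_t>(bits);
}

// Fixed behaviour (modelled): the 32-bit offset is treated as signed.
// Written arithmetically so the result does not depend on cast rules.
constexpr std::int64_t signExtend(std::uint32_t bits) {
  return bits >= 0x80000000u
             ? static_cast<std::int64_t>(bits) - (std::int64_t{1} << 32)
             : static_cast<std::int64_t>(bits);
}

// Hypothetical helper: element index reached from `base` by a signed offset.
inline std::int64_t finalIndex(std::int64_t base, std::uint32_t offsetBits) {
  std::int64_t index = base + signExtend(offsetBits);
  assert(index >= 0 && "index must stay inside the vector");
  return index;
}

// The bit pattern of -1 widens to 2**32-1 when zero-extended.
static_assert(zeroExtend(0xFFFFFFFFu) == 4294967295LL, "zero-extension");
static_assert(signExtend(0xFFFFFFFFu) == -1, "sign-extension");
```

With sign extension, stepping back one element from index 1 lands on index 0; with zero extension the same bit pattern would point far outside the vector.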
2025-08-26[AMDGPU] AMDGPUPromoteAlloca: increase default max-regs to 32 (#155076)Carl Ritson
Increase promote-alloca-to-vector-max-regs to 32 from 16. This restores default promotion of 16 x double which was disabled by #127973. Fixes SWDEV-525817.
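As a rough check of the numbers above, the following sketch counts how many 32-bit registers an array needs. It assumes the limit is a simple inclusive comparison against the register count; the real pass applies further conditions, and `regsNeeded` and `fitsMaxRegs` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: 32-bit registers needed for numElts elements of
// eltBits bits each, rounded up to whole registers.
inline std::uint64_t regsNeeded(std::uint64_t numElts, std::uint64_t eltBits) {
  assert(eltBits > 0 && "element must have a non-zero size");
  return (numElts * eltBits + 31) / 32;
}

// Assumption: promotion is allowed when the count does not exceed maxRegs.
inline bool fitsMaxRegs(std::uint64_t numElts, std::uint64_t eltBits,
                        std::uint64_t maxRegs) {
  return regsNeeded(numElts, eltBits) <= maxRegs;
}
```

Under this model, 16 x double needs 32 registers, which exceeds a limit of 16 but fits a limit of 32.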
2025-06-24[AMDGPU] Replace dynamic VGPR feature with attribute (#133444)Diana Picus
Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget feature, as requested in #130030.
2025-06-20AMDGPU: Remove legacy PM version of AMDGPUPromoteAllocaToVector (#144986)Matt Arsenault
This is only run in the middle end with the new pass manager now, so garbage collect the old PM version.
2025-06-16Revert "[AMDGPU] Extended vector promotion to aggregate types." (#144366)zGoldthorpe
Reverts llvm/llvm-project#143784 Patch fails some internal tests. Will investigate more thoroughly before attempting to remerge.
2025-06-13[AMDGPU] Extended vector promotion to aggregate types. (#143784)zGoldthorpe
Extends the `amdgpu-promote-alloca-to-vector` pass to also promote aggregate types whose elements are all the same type to vector registers. The motivation for this extension was to account for IR generated by the frontend containing several singleton struct types containing vectors or vector-like elements, though the implementation is strictly more general.
2025-06-02[AMDGPU] Promote nestedGEP allocas to vectors (#141199)Harrison Hao
Supports the `nestedGEP` pattern that appears when an alloca is first indexed as an array element and then shifted with a byte-offset GEP:

```llvm
%SortedFragments = alloca [10 x <2 x i32>], addrspace(5), align 8
%row = getelementptr [10 x <2 x i32>], ptr addrspace(5) %SortedFragments, i32 0, i32 %j
%elt1 = getelementptr i8, ptr addrspace(5) %row, i32 4
%val = load i32, ptr addrspace(5) %elt1
```

The pass folds the two levels of addressing into a single vector lane index and keeps the whole object in VGPRs:

```llvm
%vec = freeze <20 x i32> poison ; alloca promote <20 x i32>
%idx0 = mul i32 %j, 2 ; j * 2
%idx = add i32 %idx0, 1 ; j * 2 + 1
%val = extractelement <20 x i32> %vec, i32 %idx
```

This eliminates the scratch read.
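The index arithmetic in this example can be sketched outside LLVM as follows. `laneIndex` is a hypothetical helper, not the pass's code, and it assumes the row size and the byte offset are both whole multiples of the element size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: fold an array-element GEP (index j over rows of
// rowBytes bytes) and a byte-offset GEP into one lane index of a flat
// vector whose elements are eltBytes bytes wide.
inline std::int64_t laneIndex(std::int64_t j, std::int64_t rowBytes,
                              std::int64_t byteOffset, std::int64_t eltBytes) {
  assert(eltBytes > 0 && rowBytes % eltBytes == 0 &&
         byteOffset % eltBytes == 0 && "offsets must be element-aligned");
  return j * (rowBytes / eltBytes) + byteOffset / eltBytes;
}
```

For the `[10 x <2 x i32>]` alloca, each row is 8 bytes and each lane 4 bytes, so `laneIndex(j, 8, 4, 4)` evaluates to `j * 2 + 1`, matching the `mul` and `add` in the promoted IR.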
2025-05-21[AMDGPU] PromoteAlloca: handle out-of-bounds GEP for shufflevector (#139700)Robert Imschweiler
This LLVM defect was identified via the AMD Fuzzing project. --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-05-01[AMDGPU] Max. WG size-induced occupancy limits max. waves/EU (#137807)Lucas Ramirez
The default maximum waves/EU returned by the family of `AMDGPUSubtarget::getWavesPerEU` is currently the maximum number of waves/EU supported by the subtarget (only a valid occupancy range in "amdgpu-waves-per-eu" may lower that maximum). This ignores maximum achievable occupancy imposed by flat workgroup size and LDS usage, resulting in situations where `AMDGPUSubtarget::getWavesPerEU` produces a maximum higher than the one from `AMDGPUSubtarget::getOccupancyWithWorkGroupSizes`. This limits the waves/EU range's maximum to the maximum achievable occupancy derived from flat workgroup sizes and LDS usage. This only has an impact on functions which restrict flat workgroup size with "amdgpu-flat-work-group-size", since the default range of flat workgroup sizes achieves the maximum number of waves/EU supported by the subtarget. Improvements to the handling of "amdgpu-waves-per-eu" are left for a follow up PR (e.g., I think the attribute should be able to lower the full range of waves/EU produced by these methods).
2025-04-14[AMDGPU] Avoid crashes for non-byte-sized types in PromoteAlloca (#134042)Fabian Ritter
This patch addresses three problems when promoting allocas to vectors:
- Element types with size < 1 byte in allocas with a vector type caused divisions by zero.
- Element types whose size doesn't match their AllocSize hit an assertion.
- Access types whose size doesn't match their AllocSize hit an assertion.
With this patch, we do not attempt to promote affected allocas to vectors. In principle, we could handle these cases in PromoteAlloca, e.g., by truncating and extending elements from/to their allocation size. However, it is unclear whether we ever encounter such cases in practice, so that doesn't seem worth the added complexity. For SWDEV-511252
2025-03-31[IRBuilder] Add new overload for CreateIntrinsic (#131942)Rahul Joshi
Add a new `CreateIntrinsic` overload with no `Types`, useful for creating calls to non-overloaded intrinsics that don't need additional mangling.
2025-03-20[Target] Use *Set::insert_range (NFC) (#132140)Kazu Hirata
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently gained C++23-style insert_range. This patch replaces:
  Dest.insert(Src.begin(), Src.end());
with:
  Dest.insert_range(Src);
This patch does not touch custom begin like succ_begin for now.
2025-03-19[AMDGPU] Fix typing error in multi dimensional promote alloca (#131763)Carl Ritson
Fix type error when GEP uses i64 index introduced in #127973.
2025-03-18AMDGPU: Use freeze poison instead of undef in alloca promotion (#131285)Matt Arsenault
Previously the value created to represent the uninitialized memory of the alloca was undef. Use freeze poison instead. Enables some optimization improvements (which need defeating in the limit tests), but also a few regressions. Seems to leave behind dead code in some cases too.
2025-03-14[NFC][AMDGPU] Replace direct arch comparison with `isAMDGCN()` (#131357)Shilei Tian
2025-03-12[AMDGPU] Fix typing error introduce in promote alloca changeCarl Ritson
Fix type error when GEP uses i64 offset introduced in #127973.
2025-03-12[AMDGPU] Extend promotion of alloca to vectors (#127973)Carl Ritson
* Add multi dimensional array support
* Make maximum vector size tunable
* Make ratio of VGPRs used for vector promotion tunable
* Maximum array size now based on VGPR count (32b) instead of element count
2025-03-03[AMDGPU] Simplify conditional expressions. NFC. (#129228)Jay Foad
Simplify `cond ? val : false` to `cond && val` and similar.
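The equivalence behind this rewrite is easy to check exhaustively for `bool` operands. The sketch below uses hypothetical function names, and the `||` case is only one plausible reading of "and similar".

```cpp
#include <cassert>

// Form before the cleanup, and its simplified equivalent.
constexpr bool ternaryAnd(bool cond, bool val) { return cond ? val : false; }
constexpr bool plainAnd(bool cond, bool val) { return cond && val; }

// A similar pattern (an assumption about what "and similar" covers).
constexpr bool ternaryOr(bool cond, bool val) { return cond ? true : val; }
constexpr bool plainOr(bool cond, bool val) { return cond || val; }

// Check all four input combinations for both rewrites.
inline void checkEquivalence() {
  for (int c = 0; c < 2; ++c) {
    for (int v = 0; v < 2; ++v) {
      assert(ternaryAnd(c != 0, v != 0) == plainAnd(c != 0, v != 0));
      assert(ternaryOr(c != 0, v != 0) == plainOr(c != 0, v != 0));
    }
  }
}
```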
2025-02-24[AMDGPU] Update PromoteAlloca to handle GEPs with variable offset. (#122342)Sumanth Gundapaneni
When a GEP has a variable offset that can be optimized out, promote alloca now uses the refreshed index to avoid an assertion. Issue found by fuzzer.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-02-09[AMDGPU] Avoid repeated hash lookups (NFC) (#126430)Kazu Hirata
2025-01-27[NFC][AMDGPU] Improve code introduced in #124607 (#124672)Shilei Tian
2025-01-27[AMDGPU] Handle invariant marks in `AMDGPUPromoteAllocaPass` (#124607)Shilei Tian
Fixes SWDEV-509327.
2025-01-23[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748)Lucas Ramirez
Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies.

`getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither bound of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes` since it is the range of workgroup sizes that produces the range of achievable occupancies.

Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies, i.e., sometimes workgroup sizes inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; heavy documentation explains the rationale behind the sometimes relatively obscure calculations.

As a justifying example, consider a target with 10 waves/EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024].
- A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling as much as possible the remaining 40-36=4 wave slots without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2 groups made up of 16 waves each can fit fully on a CU at any given time, for a total of 32 waves on the CU, or 8 per EU. However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU.

Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10].

Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary. Fixes #118220.
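The worked example can be reproduced with a brute-force sketch that tries every workgroup size in the range. This is not the constant-time method the commit describes, it ignores LDS, and it assumes waves per EU is the integer quotient of waves on the CU by EUs per CU; `Target`, `occupancyFor`, and `occupancyRange` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical description of a target's wave capacity.
struct Target {
  unsigned wavesPerEU; // maximum waves per execution unit
  unsigned eusPerCU;   // execution units per compute unit
  unsigned waveSize;   // work-items per wave
};

// Waves per EU achieved when every group has `items` work-items.
inline unsigned occupancyFor(const Target &t, unsigned items) {
  assert(items > 0 && "a workgroup has at least one work-item");
  unsigned wavesPerGroup = (items + t.waveSize - 1) / t.waveSize;
  unsigned slotsPerCU = t.wavesPerEU * t.eusPerCU;
  unsigned groupsPerCU = slotsPerCU / wavesPerGroup; // whole groups only
  return groupsPerCU * wavesPerGroup / t.eusPerCU;
}

// Minimum and maximum occupancy over all sizes in [minItems, maxItems].
inline std::pair<unsigned, unsigned>
occupancyRange(const Target &t, unsigned minItems, unsigned maxItems) {
  assert(minItems <= maxItems);
  unsigned lo = occupancyFor(t, minItems);
  unsigned hi = lo;
  for (unsigned n = minItems + 1; n <= maxItems; ++n) {
    unsigned occ = occupancyFor(t, n);
    lo = std::min(lo, occ);
    hi = std::max(hi, occ);
  }
  return {lo, hi};
}
```

For the target in the example (`Target{10, 4, 64}`), the bounds 513 and 1024 give 9 and 8 waves per EU, while the interior sizes 640 and 896 give 10 and 7, so the range over [513,1024] is [7,10].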
2024-11-06AMDGPU: Improve vector of pointer handling in amdgpu-promote-alloca (#114144)Matt Arsenault
2024-10-29AMDGPU: Fix producing invalid IR on vector typed getelementptr (#114113)Matt Arsenault
This did not consider the IR change to allow a scalar base with a vector offset part. Reject any users that are not explicitly handled. In this situation we could handle the vector GEP, but that is a larger change. This just avoids the IR verifier error by rejecting it.
2024-10-17[LLVM] Make more use of IRBuilder::CreateIntrinsic. NFC. (#112706)Jay Foad
Convert many instances of:
  Fn = Intrinsic::getOrInsertDeclaration(...);
  CreateCall(Fn, ...)
to the equivalent CreateIntrinsic call.
2024-10-11[NFC] Rename `Intrinsic::getDeclaration` to `getOrInsertDeclaration` (#111752)Rahul Joshi
Rename the function to reflect its correct behavior and to be consistent with `Module::getOrInsertFunction`. This is also in preparation of adding a new `Intrinsic::getDeclaration` that will have behavior similar to `Module::getFunction` (i.e, just lookup, no creation).
2024-09-30[NFC] Use initial-stack-allocations for more data structures (#110544)Jeremy Morse
This replaces some of the most frequent offenders of using a DenseMap that cause a malloc, where the typical element-count is small enough to fit in an initial stack allocation. Most of these are fairly obvious, one to highlight is the collectOffset method of GEP instructions: if there's a GEP, of course it's going to have at least one offset, but every time we've called collectOffset we end up calling malloc as well for the DenseMap in the MapVector.
2024-09-19[LLVM] Use {} instead of std::nullopt to initialize empty ArrayRef (#109133)Jay Foad
It is almost always simpler to use {} instead of std::nullopt to initialize an empty ArrayRef. This patch changes all occurrences I could find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor could be deprecated or removed.
2024-09-06AMDGPU: Remove unnecessary pointer bitcastMatt Arsenault
2024-08-14AMDGPU: Stop promoting allocas with addrspacecast users (#104051)Matt Arsenault
We cannot promote this case unless we know the value is only observed through flat operations. We cannot analyze this through a call. PointerMayBeCaptured was an imprecise check for this. A callee with a nocapture attribute may still cast to private and observe the address space, so really we need a different notion of nocapture. I doubt this was of any use anyway. The promotable cases should have optimized out addrspacecast to begin earlier. Fixes #66669 Fixes #104035
2024-05-28[AMDGPU][PromoteAlloca] Don't stop when an alloca is too big to promote (#93466)Pierre van Houtryve
When I rewrote this, I made a mistake in the control flow. I thought we could just stop promoting if an alloca is too big to vectorize, but we can't. Other allocas in the list may be promotable and fit within the budget. Fixes SWDEV-455343
2024-04-12[AMDGPU] Fix a potential wrong return value indicating whether a pass modifies a function (#88197)Shilei Tian
When the alloca is too big for vectorization, the function could have already been modified in a previous iteration of the `for` loop.
2024-03-19[AMDGPU][PromoteAlloca] Whole-function alloca promotion to vector (#84735)Pierre van Houtryve
Update PromoteAllocaToVector so it considers the whole function before promoting allocas. Allocas are scored & sorted so the highest value ones are seen first. The budget is now per function instead of per alloca. Passed internal performance testing.
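A minimal sketch of this whole-function scheme, under stated assumptions: each candidate carries a hypothetical `score` and a `cost` in the same unit as the budget, candidates are visited from highest score to lowest, and one that does not fit is skipped rather than ending the loop (the behaviour #93466 in this log restores). None of these names come from the pass itself.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical candidate record for one alloca.
struct Candidate {
  int id;         // identifies the alloca
  unsigned score; // higher means more valuable to promote
  unsigned cost;  // budget units consumed if promoted
};

// Returns the ids promoted under a per-function budget, in visit order.
inline std::vector<int> pickPromotable(std::vector<Candidate> cands,
                                       unsigned budget) {
  // Highest score first; stable so ties keep their original order.
  std::stable_sort(cands.begin(), cands.end(),
                   [](const Candidate &a, const Candidate &b) {
                     return a.score > b.score;
                   });
  std::vector<int> promoted;
  for (const Candidate &c : cands) {
    if (c.cost > budget)
      continue; // too big for what is left: skip it, keep looking
    budget -= c.cost;
    promoted.push_back(c.id);
  }
  return promoted;
}
```

With a budget of 16 and candidates `{0, 5, 8}`, `{1, 9, 40}`, `{2, 7, 6}`, the highest-scoring candidate 1 is skipped as too large, and candidates 2 and 0 are still promoted.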
2024-03-19[AMDGPU][PromoteAlloca] Drop bitcast handling (#85747)Pierre van Houtryve
This is no longer needed with opaque pointers.
2024-03-05[AMDGPU][PromoteAlloca] Correctly handle a variable vector index (#83597)bcahoon
The promote alloca to vector transformation assumes that the vector index is a constant value. If it is not a constant, then either an assert occurs or the transformation generates an incorrect index.
2024-02-05[AMDGPU][PromoteAlloca] Support memsets to ptr allocas (#80678)Pierre van Houtryve
Fixes #80366
2023-12-20[AMDGPU] Handle object size and bail if assume-like intrinsic is used in PromoteAllocaToVector (#68744)Mariusz Sikora
The attached test crashes without this change. We should not remove an isAssumeLikeIntrinsic instruction if it is used by another instruction.
2023-11-28[AMDGPU] PromoteAlloca - bail always if load/store is volatile (#73228)Mariusz Sikora
This change addresses the case where the alloca size is the same as the load/store size.
2023-11-20[AMDGPU] Fix PromoteAlloca size check of alloca for store (#72528)bcahoon
When storing a subvector, too many elements were written when the size of the alloca is smaller than the size of the vector store. This patch checks for the minimum of the alloca vector and the store vector to determine the number of elements to store.
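The clamping described here can be sketched with plain vectors standing in for the promoted alloca and the stored value. `storeSubvector` is a hypothetical helper that writes from lane 0 for simplicity; the pass itself works on IR values, not containers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy `value` into the front of `allocaVec`,
// writing no more lanes than either vector has. Returns the lane count.
inline std::size_t storeSubvector(std::vector<int> &allocaVec,
                                  const std::vector<int> &value) {
  std::size_t count = std::min(allocaVec.size(), value.size());
  for (std::size_t i = 0; i < count; ++i)
    allocaVec[i] = value[i];
  assert(count <= allocaVec.size() && "never write past the alloca vector");
  return count;
}
```

Storing a 4-element value into a 2-element alloca vector writes only 2 lanes; without the `std::min`, the loop would run past the end of the smaller vector.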
2023-11-07[AMDGPU] PromoteAlloca: Handle load/store subvectors using non-constant indexes (#71505)Pierre van Houtryve
I assumed indexes were always ConstantInts, but that's not always the case. They can be other things as well. We can easily handle that by just emitting an add and let InstSimplify do the constant folding for cases where it's really a ConstantInt. Solves SWDEV-429935
2023-09-02AMDGPU: Pass in TargetMachine to AMDGPULowerModuleLDSPassMatt Arsenault
https://reviews.llvm.org/D157660
2023-07-26[AMDGPU] Fix PromoteAlloca Subvector Stores for Single Elementspvanhout
The previous condition was incorrect in some cases, like storing <2 x i32> into a double. If IndexVal was >0, we ended up never storing anything. Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D156308
2023-07-25[AMDGPU] Allow vector access types in PromoteAllocaToVectorpvanhout
Depends on D152706 Solves SWDEV-408279 Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D155699
2023-07-25[AMDGPU] Use SSAUpdater in PromoteAllocapvanhout
This allows PromoteAlloca to not be reliant on a second SROA run to remove the alloca completely. It just does the full transformation directly. Note PromoteAlloca is still reliant on SROA running first to canonicalize the IR. For instance, PromoteAlloca will no longer handle aggregate types because those should be simplified by SROA before reaching the pass. Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D152706
2023-07-18[llvm] Remove uses of getWithSamePointeeType() (NFC)Nikita Popov
2023-06-28[llvm] Replace uses of Type::getPointerTo (NFC)Youngsuk Kim
Partial progress towards removing in-tree uses of `Type::getPointerTo`, before we can deprecate the API. If the API is used solely to support an unnecessary bitcast, get rid of the bitcast as well. Reviewed By: nikic Differential Revision: https://reviews.llvm.org/D153933