summaryrefslogtreecommitdiff
path: root/offload/DeviceRTL
AgeCommit message (Collapse)Author
2025-09-29[OpenMP] Fix 'libc' configuration when building OpenMPJoseph Huber
Summary: Forgot to port this option's old handling from offload. It's not way easier since they're built in the same CMake project. Also delete the leftover directory that's not used anymore, don't know how that was still there.
2025-09-08[OpenMP] Change build of OpenMP device runtime to be a separate runtime ↵Joseph Huber
(#136729) Summary: Currently we build the OpenMP device runtime as part of the `offload/` project. This is problematic because it has several restrictions when compared to the normal offloading runtime. It can only be built with an up-to-date clang and we need to set the target appropriately. Currently we hack around this by creating the compiler invocation manually, but this patch moves it into a separate runtimes build. This follows the same build we use for libc, libc++, compiler-rt, and flang-rt. This also moves it from `offload/` into `openmp/` because it is still the `openmp/` runtime and I feel it is more appropriate. We do want a generic `offload/` library at some point, but it would be trivial to then add that as a separate library now that we have the infrastructure that makes adding these new libraries trivial. This most importantly will require that users update their build configs, mostly adding the following lines at a minimum. I was debating whether or not I should 'auto-upgrade' this, but I just went with a warning. ``` -DLLVM_RUNTIME_TARGETS='default;amdgcn-amd-amdhsa;nvptx64-nvidia-cuda' \ -DRUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES=openmp \ -DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=openmp \ ``` This also changed where the `.bc` version of the library lives, but it's still created.
2025-09-04[OpenMP][Offload] Restore __kmpc_* function signatures (#156104)Robert Imschweiler
Avoid altering existing function signatures of the kmpc interface to fix regressions in the runtime optimization (OpenMPOpt).
2025-08-28[OpenMP][clang] 6.0: num_threads strict (part 2: device runtime) (#146404)Robert Imschweiler
OpenMP 6.0 12.1.2 specifies the behavior of the strict modifier for the num_threads clause on parallel directives, along with the message and severity clauses. This commit implements necessary device runtime changes.
2025-08-28[OpenMP][Offload] Add SPMD-No-Loop mode to OpenMP offload runtime (#154105)Dominik Adamski
Kernels which are marked as SPMD-No-Loop should be launched with sufficient number of teams and threads to cover loop iteration space. No-Loop mode is described in RFC: https://discourse.llvm.org/t/rfc-no-loop-mode-for-openmp-gpu-kernels/87517/
2025-08-21[Offload] Add oneInterationPerThread param to loop device RTL (#151959)Dominik Adamski
Currently, Flang can generate no-loop kernels for all OpenMP target kernels in the program if the flags -fopenmp-assume-teams-oversubscription or -fopenmp-assume-threads-oversubscription are set. If we add an additional parameter, we can choose in the future which OpenMP kernels should be generated as no-loop kernels. This PR doesn't modify current behavior of oversubscription flags. RFC for no-loop kernels: https://discourse.llvm.org/t/rfc-no-loop-mode-for-openmp-gpu-kernels/87517
2025-08-05[OpenMP] Fix weak linkage on malloc declarationJoseph Huber
Summary: This being weak forces the external reference to be weak. Either we define it weak or not by pulling it from `libc`. Doing it here causes it to not be extracted properly.
2025-08-01[OpenMP] Use the `libc` malloc for AMDGPU if available (#151241)Joseph Huber
Summary: This patch enables the OpenMP runtime to use the general-purpose `malloc` interface in `libc` if the user built OpenMP with it enabled. All this requires is keeping `malloc` as an external function so it will be resolved later by the linker.
2025-06-24[Offload] Fix cmake warning (#145488)Ross Brunton
Cmake was unhappy that there was no space between arguments, now it is.
2025-05-20[OpenMP][GPU][FIX] Enable generic barriers in single threaded contexts (#140786)Johannes Doerfert
The generic GPU barrier implementation checked if it was the main thread in generic mode to identify single threaded regions. This doesn't work since inside of a non-active (=sequential) parallel, that thread becomes the main thread of a team, and is not the main thread in generic mode. At least that is the implementation of the APIs today. To identify single threaded regions we now check the team size explicitly. This exposed three other issues; one is, for now, expected and not a bug, the second one is a bug and has a FIXME in the single_threaded_for_barrier_hang_1.c file, and the final one is also benign as described in the end. The non-bug issue comes up if we ever initialize a thread state. Afterwards we will never run any region in parallel. This is a little conservative, but I guess thread states are really bad for performance anyway. The bug comes up if we optimize single_threaded_for_barrier_hang_1 and execute it in Generic-SPMD mode. For some reason we loose all the updates to b. This looks very much like a compiler bug, but could also be another logic issue in the runtime. Needs to be investigated. Issue number 3 comes up if we have nested parallels inside of a target region. The clang SPMD-check logic gets confused, determines SPMD (which is fine) but picks an unreasonable thread count. This is all benign, I think, just weird: ``` #pragma omp target teams #pragma omp parallel num_threads(64) #pragma omp parallel num_threads(10) {} ``` Was launched with 10 threads, not 64.
2025-05-06[Offload] Fix PowerPC builds that pass -mcpu (#138327)Joseph Huber
Summary: Another hacky fix done until https://github.com/llvm/llvm-project/pull/136729 lands. This time for `-mcpu`.
2025-05-05[OpenMP] Add pre sm_70 load hack back in (#138589)Joseph Huber
Summary: Different ordering modes aren't supported for an atomic load, so we just do an add of zero as the same thing. It's less efficient, but it works. Fixes https://github.com/llvm/llvm-project/issues/138560
2025-05-05[Offload] Fix dependency issue #126143 in CMakeYe Luo
2025-04-25[Offload] Override linker for device build (#137246)Joseph Huber
Summary: Override the default linker in case the user is passing it separately. This requires `lld` but it always did. This will be fixed *properly* when https://github.com/llvm/llvm-project/pull/136729 lands. Fixes https://github.com/llvm/llvm-project/issues/136822
2025-04-23[OpenMP] Update the bitcode library install and search path (#136754)Joseph Huber
Summary: This was accidentally kept in the old location when we moved to the new `lib/<triple>/` location for the DeviceRTL. Move this to reduce the delta with https://github.com/llvm/llvm-project/pull/136729.
2025-04-23[Offload] Fix handling of 'bare' mode when environment missing (#136794)Joseph Huber
Summary: We treated the missing kernel environment as a unique mode, but it was kind of this random bool that was doing the same thing and it explicitly expects the kernel environment to be zero. It broke after the previous change since it used to default to SPMD and didn't handle zero in any of the other cases despite being used. This fixes that and queries for it without needing to consume an error.
2025-04-21[OpenMP] Remove dependency on LLVM include directory from DeviceRTL (#136359)Joseph Huber
Summary: Currently we depend on a single LLVM include directory. This is actually only required to define one enum, which is highly unlikely to change. THis patch makes the `Environment.h` include directory more hermetic so we no long depend on other libraries. In exchange, we get a simpler dependency list for the price of hard-coding `1` somewhere. I think it's a valid trade considering that this flag is highly unlikely to change at this point. @ronlieb AMD version https://gist.github.com/jhuber6/3313e6f957be14dc79fe85e5126d2cb3
2025-04-20[offload] Unset `-march` when building GPU libraries (#136442)Michał Górny
Unset `-march` when invoking the compiler and linker to build the GPU libraries. These libraries use GPU targets rather than the CPU targets, and an incidental `-march=native` causes Clang to be able to determine the GPU used — which causes the build to fail when there is no GPU available. Resetting `-march=` should suffice to revert to building generic code for the time being. See the discussion in: https://github.com/llvm/llvm-project/pull/126143#issuecomment-2816718492
2025-04-18[OpenMP] Remove 'libomptarget.devicertl.a' fatbinary and use static library ↵Joseph Huber
(#126143) Summary: Currently, we build a single `libomptarget.devicertl.a` which is a fatbinary. It is a host object file that contains the embedded archive files for both the NVIDIA and AMDGPU targets. This was done primarily as a convenience due to naming conflicts. Now that the clang driver for the GPU targets can appropriate link via the per-target runtime-dir, we can just make two separate static libraries and remove the indirection. This patch creates two new static libraries that get installed into ``` /lib/amdgcn-amd-amdhsa/libomp.a /lib/nvptx64-nvidia-cuda/libomp.a ``` for AMDGPU and NVPTX respectively. The link job created by the linker wrapper now simply needs to do `-lomp` and it will search those directories and link those static libraries. This requires far less special handling. This patch is a precursor to changing the build system entirely to be a runtimes based one. Soon this target will be a standard `add_library` and done through the GPU runtime targets. NOTE that this actually does remove an additional optimization step. Previously we merged all of the files into a single bitcode object and forcibly internalized some definitions. This, instead, just treats them like a normal static library. This may possibly affect performance for some files, but I think it's better overall to use static library semantics because it allows us to have an 'include-what-you-use' relationship with the library. Performance testing will be required. If we really need the merged blob then we can simply pack that into a new static library.
2025-04-10[AMDGPU] Fix code object version not being set to 'none' (#135036)Joseph Huber
Summary: Previously, we removed the special handling for the code object version global. I erroneously thought that this meant we cold get rid of this weird `-Xclang` option. However, this also emits an LLVM IR module flag, which will then cause linking issues.
2025-04-01[OpenMP] Fix num_iters in __kmpc_*_loop DeviceRTL functions (#133435)Sergio Afonso
This patch removes the addition of 1 to the number of iterations when calling the following DeviceRTL functions: - `__kmpc_distribute_for_static_loop*` - `__kmpc_distribute_static_loop*` - `__kmpc_for_static_loop*` Calls to these functions are currently only produced by the OMPIRBuilder from flang, which already passes the correct number of iterations to these functions. By adding 1 to the received `num_iters` variable, worksharing can produce incorrect results. This impacts flang OpenMP offloading of `do`, `distribute` and `distribute parallel do` constructs. Expecting the application to pass `tripcount - 1` as the argument seems unexpected as well, so rather than updating flang I think it makes more sense to update the runtime.
2025-03-28[Clang][AMDGPU] Remove special handling for COV4 libraries (#132870)Joseph Huber
Summary: When we were first porting to COV5, this lead to some ABI issues due to a change in how we looked up the work group size. Bitcode libraries relied on the builtins to emit code, but this was changed between versions. This prevented the bitcode libraries, like OpenMP or libc, from being used for both COV4 and COV5. The solution was to have this 'none' functionality which effectively emitted code that branched off of a global to resolve to either version. This isn't a great solution because it forced every TU to have this variable in it. The patch in https://github.com/llvm/llvm-project/pull/131033 removed support for COV4 from OpenMP, which was the only consumer of this functionality. Other users like HIP and OpenCL did not use this because they linked the ROCm Device Library directly which has its own handling (The name was borrowed from it after all). So, now that we don't need to worry about backward compatibility with COV4, we can remove this special handling. Users can still emit COV4 code, this simply removes the special handling used to make the OpenMP device runtime bitcode version agnostic.
2025-03-28[offload] Remove bad assert in StaticLoopChunker::Distribute (#132705)macurtis-amd
When building with asserts enabled, this can actually cause strange miscompilations because an incorrect llvm.assume is generated at the point of the assertion.
2025-03-19[OpenMP] Replace utilities with 'gpuintrin.h' definitions (#131644)Joseph Huber
Summary: Port more instructions. AMD version is at https://gist.github.com/jhuber6/235d7ee95f747c75f9a3cfd8eedac6aa
2025-03-18[openmp][nfc] Use builtin align in the devicertl (#131918)Jon Chesterfield
Noticed while extracting the smartstack as a test case
2025-03-18Revert "[openmp][nfc] Refactor shared/lds smartstack for spirv (#131905)"Jon Chesterfield
This reverts commit c02b935a9be888bbdf9f8cb0bf980bd411ae5893. Failed a check-offload test under CI
2025-03-18[OpenMP] Use 'gpuintrin.h' definitions for simple block identifiers (#131631)Joseph Huber
Summary: This patch ports the runtime to use `gpuintrin.h` instead of calling the builtins for most things. The `lanemask_gt` stuff was left for now with a fallback. AMD version for Ron https://gist.github.com/jhuber6/42014d635b9a8158727640876bf47226.
2025-03-18[openmp][nfc] Refactor shared/lds smartstack for spirv (#131905)Jon Chesterfield
Spirv doesn't have implicit conversions between address spaces (at least at present, we might need to change that) and address space qualified *this pointers are not handled well by clang. This commit changes the single instance of the smartstack to be explicitly a singleton, for fractionally simpler IR generation (no this pointer) and to sidestep the work in progress spirv64-- openmp target not being able to compile the original version.
2025-02-09[OpenMP] Replace use of target address space with <gpuintrin.h> local (#126119)Joseph Huber
Summary: This definition is more portable since it defines the correct value for the target. I got rid of the helper mostly because I think it's easy enough to use now that it's a type and being explicit about what's `undef` or `poison` is good.
2025-02-05[OpenMP] Port the OpenMP device runtime to direct C++ compilation (#123673)Joseph Huber
Summary: This removes the use of OpenMP offloading to build the device runtime. The main benefit here is that we no longer need to rely on offloading semantics to build a device only runtime. Things like variants are now no longer needed and can just be simple if-defs. In the future, I will remove most of the special handling here and fold it into calls to the `<gpuintrin.h>` functions instead. Additionally I will rework the compilation to make this a separate runtime. The current plan is to have this, but make including OpenMP and offloading either automatically add it, or print a warning if it's missing. This will allow us to use a normal CMake workflow and delete all the weird 'lets pull the clang binary out of the build' business. ``` -DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=offload -DLLVM_RUNTIME_TARGETS=amdgcn-amd-amdhsa ``` After that, linking the OpenMP device runtime will be `-Xoffload-linker -lomp`. I.e. no more fat binary business. Only look at the most recent commit since this includes the two dependencies (fix to AMDGPUEmitPrintfBinding and the PointerToMember bug).
2025-01-31[Offload][NFC] Fix typos discovered by codespell (#125119)Christian Clauss
https://github.com/codespell-project/codespell % `codespell --ignore-words-list=archtype,hsa,identty,inout,iself,nd,te,ths,vertexes --write-changes`
2025-01-27[Clang] Prevent `mlink-builtin-bitcode` from internalizing the RPC client ↵Joseph Huber
(#118661) Summary: Currently, we only use `-mlink-builtin-bitcode` for non-LTO NVIDIA compiliations. This has the problem that it will internalize the RPC client symbol which needs to be visible to the host. To counteract that, I put `retain` on it, but this also prevents optimizations on the global itself, so the passes we have that remove the symbol don't work on OpenMP anymore. This patch does the dumbest solution, adding a special string check for it in clang. Not the best solution, the runner up would be to have a clang attribute for `externally_initialized` because those can't be internalized, but that might have some unfortunate side-effects. Alternatively we could make NVIDIA compilations do LTO all the time, but that would affect some users and it's harder than I thought.
2025-01-21[OpenMP] Remove usage of pointer-to-member in lookup (#123671)Joseph Huber
Summary: This is buggy and is currently being tracked in https://github.com/llvm/llvm-project/issues/123241. For now, replace it with a macro so that we can use address spaces directly.
2025-01-20[OpenMP] Make each atomic helper take an atomic scope argument (#122786)Joseph Huber
Summary: Right now we just default to device for each type, and mix an ad-hoc scope with the one used by the compiler's builtins. Unify this can make each version take the scope optionally. For @ronlieb, this will remove the need for `add_system` in the fork as well as the extra `cas` with system scope, just pass `system`.
2025-01-20[OpenMP] Adjust 'printf' handling in the OpenMP runtime (#123670)Joseph Huber
Summary: We used to avoid a lot of this stuff because we didn't properly handle variadics in device code. That's been solved for now, so we can just make an internal printf handler that forwards to the external `vprintf` function. This is either provided by NVIDIA's SDK or by the GPU libc implementation. The main reason for doing this is because it prevents the stupid AMDGPU printf pass from mangling our beautiful printfs!
2025-01-20[OpenMP] Fix mispelled attribute and warningJoseph Huber
Summary: This is spelled `ompx_aligned_barrier` when used directly, but wasn't included in the list of known assumptions. Fix that so now th test works.
2025-01-20[OpenMP] Remove 'omp assumes' scopes now that we have no inline ASM (#123611)Joseph Huber
Summary: We used this globally scoped `ext_no_call_asm` as a sort of hack around the compiler that allowed the attributor to optimize out inline assembly calls to PTX instructions. Quite some time ago I got rid of every inline assembly call and replaced it with a builitin, so this can just be deleted. Furthermore, I use the `[[omp::assume]]` attribute directly for the aligned barrier usage. This prints an unknown assumption warning (even though it isn't) so I'm just silencing that for now until I fix it later. --------- Co-authored-by: Michael Kruse <github@meinersbur.de>
2025-01-16[OpenMP] Remove hack around missing atomic load (#122781)Joseph Huber
Summary: We used to do a fetch add of zero to approximate a load. This is because the NVPTX backend didn't handle this properly. It's not an issue anymore so simply use the proper atomic builtin.
2025-01-10[OpenMP] Fix missing type getter for SFINAE helperJoseph Huber
Summary: This didn't get the type, which made using this always return false.
2025-01-09[OpenMP] Use __builtin_bit_cast instead of UB type punning (#122325)Joseph Huber
Summary: Use a normal bitcast, remove from the shared utils since it's not available in GCC 7.4
2025-01-09[OpenMP] Update atomic helpers to just use headers (#122185)Joseph Huber
Summary: Previously we had some indirection here, this patch updates these utilities to just be normal template functions. We use SFINAE to manage the special case handling for floats. Also this strips address spaces so it can be used more generally.
2024-12-24[OpenMP] Use generic IR for the OpenMP DeviceRTL (#119091)Joseph Huber
Summary: We previously built this for every single architecture to deal with incompatibility. This patch updates it to use the 'generic' IR that `libc` and other projects use. Who knows if this will have any side-effects, probably worth testing more but it passes the tests I expect to pass on my side.
2024-12-17[OpenMP] Only put `retain` for NVPTX so it can be optimized out for AMDJoseph Huber
Summary: This is a hack that only NVPTX needs.
2024-12-12[OpenMP] Replace AMDGPU fences with generic scoped fences (#119619)Joseph Huber
Summary: This is simpler and more common. I would've replaced the CUDA uses and made this the same but currently it doesn't codegen these fences fully and just emits a full system wide barrier as a fallback.
2024-12-09[Offload][AMDGPU] accept generic target (#118919)hidekisaito
Enables generic ISA, e.g., "--offload-arch=gfx11-generic" device code to run on gfx11-generic ISA capable device. Executable may contain one ELF that has specific target ISA and another ELF that has compatible generic ISA. Under that circumstance, this code should say both ELFs are compatible, leaving the rest to PluginManager to handle. Suggestions on how best to address that is welcome.
2024-12-07[offload] Support LIBOMPTARGET_DEVICE_ARCHITECTURES={amdgpu|nvptx} (#119070)Michał Górny
Add two more special values for LIBOMPTARGET_DEVICE_ARCHITECTURES: `amdgpu` and `nvptx`, to support building for all AMDGPU and NVPTX targets respectively. This can be used in place of `all` when offload is built with one of the GPU plugins only.
2024-12-06[offload] Add gfx1012 (Navi 14) to AMDGPU models list (#118857)Michał Górny
Fixes #118824
2024-12-03[Offload] Find libc relative to DeviceRTL path (#118497)Jan Patrick Lehr
This was discussed as a potential solution in https://github.com/llvm/llvm-project/pull/118173
2024-12-02[OpenMP] Unconditionally provide an RPC client interface for OpenMP (#117933)Joseph Huber
Summary: This patch adds an RPC interface that lives directly in the OpenMP device runtime. This allows OpenMP to implement custom opcodes. Currently this is only providing the host call interface, which is the raw version of reverse offloading. Previously this lived in `libc/` as an extension which is not the correct place. The interface here uses a weak symbol for the RPC client by the same name that the `libc` interface uses. This means that it will defer to the libc one if both are present so we don't need to set up multiple instances. The presense of this symbol is what controls whether or not we set up the RPC server. Because this is an external symbol it normally won't be optimized out, so there's a special pass in OpenMPOpt that deletes this symbol if it is unused during linking. That means at `O0` the RPC server will always be present now, but will be removed trivially if it's not used at O1 and higher.
2024-11-25[OpenMP] Remove use of '__AMDGCN_WAVEFRONT_SIZE' (#113156)Joseph Huber
Summary: This is going to be deprecated in https://github.com/llvm/llvm-project/pull/112849. This patch ports it to use the builtin instead. This isn't a compile constant, so it could slightly negatively affect codegen. There really should be an IR pass to turn it into a constant if the function has known attributes. Using the builtin is correct when we just do it for knowing the size like we do here. Obviously guarding w32/w64 code with this check would be broken.