summaryrefslogtreecommitdiff
path: root/offload/DeviceRTL/src
AgeCommit message (Collapse)Author
2025-09-08[OpenMP] Change build of OpenMP device runtime to be a separate runtime ↵Joseph Huber
(#136729) Summary: Currently we build the OpenMP device runtime as part of the `offload/` project. This is problematic because it has several restrictions when compared to the normal offloading runtime. It can only be built with an up-to-date clang and we need to set the target appropriately. Currently we hack around this by creating the compiler invocation manually, but this patch moves it into a separate runtimes build. This follows the same build we use for libc, libc++, compiler-rt, and flang-rt. This also moves it from `offload/` into `openmp/` because it is still the `openmp/` runtime and I feel it is more appropriate. We do want a generic `offload/` library at some point, but it would be trivial to then add that as a separate library now that we have the infrastructure that makes adding these new libraries trivial. This most importantly will require that users update their build configs, mostly adding the following lines at a minimum. I was debating whether or not I should 'auto-upgrade' this, but I just went with a warning. ``` -DLLVM_RUNTIME_TARGETS='default;amdgcn-amd-amdhsa;nvptx64-nvidia-cuda' \ -DRUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES=openmp \ -DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=openmp \ ``` This also changed where the `.bc` version of the library lives, but it's still created.
2025-09-04[OpenMP][Offload] Restore __kmpc_* function signatures (#156104)Robert Imschweiler
Avoid altering existing function signatures of the kmpc interface to fix regressions in the runtime optimization (OpenMPOpt).
2025-08-28[OpenMP][clang] 6.0: num_threads strict (part 2: device runtime) (#146404)Robert Imschweiler
OpenMP 6.0 12.1.2 specifies the behavior of the strict modifier for the num_threads clause on parallel directives, along with the message and severity clauses. This commit implements necessary device runtime changes.
2025-08-28[OpenMP][Offload] Add SPMD-No-Loop mode to OpenMP offload runtime (#154105)Dominik Adamski
Kernels which are marked as SPMD-No-Loop should be launched with sufficient number of teams and threads to cover loop iteration space. No-Loop mode is described in RFC: https://discourse.llvm.org/t/rfc-no-loop-mode-for-openmp-gpu-kernels/87517/
2025-08-21[Offload] Add oneInterationPerThread param to loop device RTL (#151959)Dominik Adamski
Currently, Flang can generate no-loop kernels for all OpenMP target kernels in the program if the flags -fopenmp-assume-teams-oversubscription or -fopenmp-assume-threads-oversubscription are set. If we add an additional parameter, we can choose in the future which OpenMP kernels should be generated as no-loop kernels. This PR doesn't modify current behavior of oversubscription flags. RFC for no-loop kernels: https://discourse.llvm.org/t/rfc-no-loop-mode-for-openmp-gpu-kernels/87517
2025-08-01[OpenMP] Use the `libc` malloc for AMDGPU if available (#151241)Joseph Huber
Summary: This patch enables the OpenMP runtime to use the general-purpose `malloc` interface in `libc` if the user built OpenMP with it enabled. All this requires is keeping `malloc` as an external function so it will be resolved later by the linker.
2025-05-20[OpenMP][GPU][FIX] Enable generic barriers in single threaded contexts (#140786)Johannes Doerfert
The generic GPU barrier implementation checked if it was the main thread in generic mode to identify single threaded regions. This doesn't work since inside of a non-active (=sequential) parallel, that thread becomes the main thread of a team, and is not the main thread in generic mode. At least that is the implementation of the APIs today. To identify single threaded regions we now check the team size explicitly. This exposed three other issues; one is, for now, expected and not a bug, the second one is a bug and has a FIXME in the single_threaded_for_barrier_hang_1.c file, and the final one is also benign as described in the end. The non-bug issue comes up if we ever initialize a thread state. Afterwards we will never run any region in parallel. This is a little conservative, but I guess thread states are really bad for performance anyway. The bug comes up if we optimize single_threaded_for_barrier_hang_1 and execute it in Generic-SPMD mode. For some reason we loose all the updates to b. This looks very much like a compiler bug, but could also be another logic issue in the runtime. Needs to be investigated. Issue number 3 comes up if we have nested parallels inside of a target region. The clang SPMD-check logic gets confused, determines SPMD (which is fine) but picks an unreasonable thread count. This is all benign, I think, just weird: ``` #pragma omp target teams #pragma omp parallel num_threads(64) #pragma omp parallel num_threads(10) {} ``` Was launched with 10 threads, not 64.
2025-04-23[Offload] Fix handling of 'bare' mode when environment missing (#136794)Joseph Huber
Summary: We treated the missing kernel environment as a unique mode, but it was kind of this random bool that was doing the same thing and it explicitly expects the kernel environment to be zero. It broke after the previous change since it used to default to SPMD and didn't handle zero in any of the other cases despite being used. This fixes that and queries for it without needing to consume an error.
2025-04-21[OpenMP] Remove dependency on LLVM include directory from DeviceRTL (#136359)Joseph Huber
Summary: Currently we depend on a single LLVM include directory. This is actually only required to define one enum, which is highly unlikely to change. THis patch makes the `Environment.h` include directory more hermetic so we no long depend on other libraries. In exchange, we get a simpler dependency list for the price of hard-coding `1` somewhere. I think it's a valid trade considering that this flag is highly unlikely to change at this point. @ronlieb AMD version https://gist.github.com/jhuber6/3313e6f957be14dc79fe85e5126d2cb3
2025-04-18[OpenMP] Remove 'libomptarget.devicertl.a' fatbinary and use static library ↵Joseph Huber
(#126143) Summary: Currently, we build a single `libomptarget.devicertl.a` which is a fatbinary. It is a host object file that contains the embedded archive files for both the NVIDIA and AMDGPU targets. This was done primarily as a convenience due to naming conflicts. Now that the clang driver for the GPU targets can appropriate link via the per-target runtime-dir, we can just make two separate static libraries and remove the indirection. This patch creates two new static libraries that get installed into ``` /lib/amdgcn-amd-amdhsa/libomp.a /lib/nvptx64-nvidia-cuda/libomp.a ``` for AMDGPU and NVPTX respectively. The link job created by the linker wrapper now simply needs to do `-lomp` and it will search those directories and link those static libraries. This requires far less special handling. This patch is a precursor to changing the build system entirely to be a runtimes based one. Soon this target will be a standard `add_library` and done through the GPU runtime targets. NOTE that this actually does remove an additional optimization step. Previously we merged all of the files into a single bitcode object and forcibly internalized some definitions. This, instead, just treats them like a normal static library. This may possibly affect performance for some files, but I think it's better overall to use static library semantics because it allows us to have an 'include-what-you-use' relationship with the library. Performance testing will be required. If we really need the merged blob then we can simply pack that into a new static library.
2025-04-10[AMDGPU] Fix code object version not being set to 'none' (#135036)Joseph Huber
Summary: Previously, we removed the special handling for the code object version global. I erroneously thought that this meant we cold get rid of this weird `-Xclang` option. However, this also emits an LLVM IR module flag, which will then cause linking issues.
2025-04-01[OpenMP] Fix num_iters in __kmpc_*_loop DeviceRTL functions (#133435)Sergio Afonso
This patch removes the addition of 1 to the number of iterations when calling the following DeviceRTL functions: - `__kmpc_distribute_for_static_loop*` - `__kmpc_distribute_static_loop*` - `__kmpc_for_static_loop*` Calls to these functions are currently only produced by the OMPIRBuilder from flang, which already passes the correct number of iterations to these functions. By adding 1 to the received `num_iters` variable, worksharing can produce incorrect results. This impacts flang OpenMP offloading of `do`, `distribute` and `distribute parallel do` constructs. Expecting the application to pass `tripcount - 1` as the argument seems unexpected as well, so rather than updating flang I think it makes more sense to update the runtime.
2025-03-28[offload] Remove bad assert in StaticLoopChunker::Distribute (#132705)macurtis-amd
When building with asserts enabled, this can actually cause strange miscompilations because an incorrect llvm.assume is generated at the point of the assertion.
2025-03-19[OpenMP] Replace utilities with 'gpuintrin.h' definitions (#131644)Joseph Huber
Summary: Port more instructions. AMD version is at https://gist.github.com/jhuber6/235d7ee95f747c75f9a3cfd8eedac6aa
2025-03-18[openmp][nfc] Use builtin align in the devicertl (#131918)Jon Chesterfield
Noticed while extracting the smartstack as a test case
2025-03-18Revert "[openmp][nfc] Refactor shared/lds smartstack for spirv (#131905)"Jon Chesterfield
This reverts commit c02b935a9be888bbdf9f8cb0bf980bd411ae5893. Failed a check-offload test under CI
2025-03-18[OpenMP] Use 'gpuintrin.h' definitions for simple block identifiers (#131631)Joseph Huber
Summary: This patch ports the runtime to use `gpuintrin.h` instead of calling the builtins for most things. The `lanemask_gt` stuff was left for now with a fallback. AMD version for Ron https://gist.github.com/jhuber6/42014d635b9a8158727640876bf47226.
2025-03-18[openmp][nfc] Refactor shared/lds smartstack for spirv (#131905)Jon Chesterfield
Spirv doesn't have implicit conversions between address spaces (at least at present, we might need to change that) and address space qualified *this pointers are not handled well by clang. This commit changes the single instance of the smartstack to be explicitly a singleton, for fractionally simpler IR generation (no this pointer) and to sidestep the work in progress spirv64-- openmp target not being able to compile the original version.
2025-02-09[OpenMP] Replace use of target address space with <gpuintrin.h> local (#126119)Joseph Huber
Summary: This definition is more portable since it defines the correct value for the target. I got rid of the helper mostly because I think it's easy enough to use now that it's a type and being explicit about what's `undef` or `poison` is good.
2025-02-05[OpenMP] Port the OpenMP device runtime to direct C++ compilation (#123673)Joseph Huber
Summary: This removes the use of OpenMP offloading to build the device runtime. The main benefit here is that we no longer need to rely on offloading semantics to build a device only runtime. Things like variants are now no longer needed and can just be simple if-defs. In the future, I will remove most of the special handling here and fold it into calls to the `<gpuintrin.h>` functions instead. Additionally I will rework the compilation to make this a separate runtime. The current plan is to have this, but make including OpenMP and offloading either automatically add it, or print a warning if it's missing. This will allow us to use a normal CMake workflow and delete all the weird 'lets pull the clang binary out of the build' business. ``` -DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=offload -DLLVM_RUNTIME_TARGETS=amdgcn-amd-amdhsa ``` After that, linking the OpenMP device runtime will be `-Xoffload-linker -lomp`. I.e. no more fat binary business. Only look at the most recent commit since this includes the two dependencies (fix to AMDGPUEmitPrintfBinding and the PointerToMember bug).
2025-01-31[Offload][NFC] Fix typos discovered by codespell (#125119)Christian Clauss
https://github.com/codespell-project/codespell % `codespell --ignore-words-list=archtype,hsa,identty,inout,iself,nd,te,ths,vertexes --write-changes`
2025-01-27[Clang] Prevent `mlink-builtin-bitcode` from internalizing the RPC client ↵Joseph Huber
(#118661) Summary: Currently, we only use `-mlink-builtin-bitcode` for non-LTO NVIDIA compiliations. This has the problem that it will internalize the RPC client symbol which needs to be visible to the host. To counteract that, I put `retain` on it, but this also prevents optimizations on the global itself, so the passes we have that remove the symbol don't work on OpenMP anymore. This patch does the dumbest solution, adding a special string check for it in clang. Not the best solution, the runner up would be to have a clang attribute for `externally_initialized` because those can't be internalized, but that might have some unfortunate side-effects. Alternatively we could make NVIDIA compilations do LTO all the time, but that would affect some users and it's harder than I thought.
2025-01-20[OpenMP] Make each atomic helper take an atomic scope argument (#122786)Joseph Huber
Summary: Right now we just default to device for each type, and mix an ad-hoc scope with the one used by the compiler's builtins. Unify this can make each version take the scope optionally. For @ronlieb, this will remove the need for `add_system` in the fork as well as the extra `cas` with system scope, just pass `system`.
2025-01-20[OpenMP] Adjust 'printf' handling in the OpenMP runtime (#123670)Joseph Huber
Summary: We used to avoid a lot of this stuff because we didn't properly handle variadics in device code. That's been solved for now, so we can just make an internal printf handler that forwards to the external `vprintf` function. This is either provided by NVIDIA's SDK or by the GPU libc implementation. The main reason for doing this is because it prevents the stupid AMDGPU printf pass from mangling our beautiful printfs!
2025-01-09[OpenMP] Use __builtin_bit_cast instead of UB type punning (#122325)Joseph Huber
Summary: Use a normal bitcast, remove from the shared utils since it's not available in GCC 7.4
2025-01-09[OpenMP] Update atomic helpers to just use headers (#122185)Joseph Huber
Summary: Previously we had some indirection here, this patch updates these utilities to just be normal template functions. We use SFINAE to manage the special case handling for floats. Also this strips address spaces so it can be used more generally.
2024-12-24[OpenMP] Use generic IR for the OpenMP DeviceRTL (#119091)Joseph Huber
Summary: We previously built this for every single architecture to deal with incompatibility. This patch updates it to use the 'generic' IR that `libc` and other projects use. Who knows if this will have any side-effects, probably worth testing more but it passes the tests I expect to pass on my side.
2024-12-17[OpenMP] Only put `retain` for NVPTX so it can be optimized out for AMDJoseph Huber
Summary: This is a hack that only NVPTX needs.
2024-12-12[OpenMP] Replace AMDGPU fences with generic scoped fences (#119619)Joseph Huber
Summary: This is simpler and more common. I would've replaced the CUDA uses and made this the same but currently it doesn't codegen these fences fully and just emits a full system wide barrier as a fallback.
2024-12-02[OpenMP] Unconditionally provide an RPC client interface for OpenMP (#117933)Joseph Huber
Summary: This patch adds an RPC interface that lives directly in the OpenMP device runtime. This allows OpenMP to implement custom opcodes. Currently this is only providing the host call interface, which is the raw version of reverse offloading. Previously this lived in `libc/` as an extension which is not the correct place. The interface here uses a weak symbol for the RPC client by the same name that the `libc` interface uses. This means that it will defer to the libc one if both are present so we don't need to set up multiple instances. The presense of this symbol is what controls whether or not we set up the RPC server. Because this is an external symbol it normally won't be optimized out, so there's a special pass in OpenMPOpt that deletes this symbol if it is unused during linking. That means at `O0` the RPC server will always be present now, but will be removed trivially if it's not used at O1 and higher.
2024-11-25[OpenMP] Remove use of '__AMDGCN_WAVEFRONT_SIZE' (#113156)Joseph Huber
Summary: This is going to be deprecated in https://github.com/llvm/llvm-project/pull/112849. This patch ports it to use the builtin instead. This isn't a compile constant, so it could slightly negatively affect codegen. There really should be an IR pass to turn it into a constant if the function has known attributes. Using the builtin is correct when we just do it for knowing the size like we do here. Obviously guarding w32/w64 code with this check would be broken.
2024-09-26[OpenMP] Add critical region lock for NVPTX targets (#110148)Joseph Huber
Summary: We define this on AMDGCN but not NVPTX, which leads to some failures dependong on the target.
2024-09-05[Offload][NFC] Reorganize `utils::` and make Device/Host/Shared clearer ↵Johannes Doerfert
(#100280) We had three `utils::` namespaces, all with different "meaning" (host, device, hsa_utils). We should, when we can, keep "include/Shared" accessible from host and device, thus RefCountTy has been moved to a separate header. `hsa_utils` was introduced to make `utils::` less overloaded. And common functionality was de-duplicated, e.g., `utils::advance` and `utils::advanceVoidPtr` -> `utils:advancePtr`. Type punning now checks for the size of the result to make sure it matches the source type. No functional change was intended.
2024-08-22[PGO][OpenMP] Instrumentation for GPU devices (Revision of #76587) (#102691)Ethan Luis McDonough
This pull request is a revised version of #76587. This pull request fixes some build issues that were present in the previous version of this change. > This pull request is the first part of an ongoing effort to extends PGO instrumentation to GPU device code. This PR makes the following changes: > > - Adds blank registration functions to device RTL > - Gives PGO globals protected visibility when targeting a supported GPU > - Handles any addrspace casts for PGO calls > - Implements PGO global extraction in GPU plugins (currently only dumps info) > > These changes can be tested by supplying `-fprofile-instrument=clang` while targeting a GPU.
2024-08-14[OpenMP] Implement 'omp_alloc' on the device (#102526)Joseph Huber
Summary: The 'omp_alloc' function should be callable from a target region. This patch implemets it by simply calling `malloc` for every non-default trait value allocator. All the special access modifiers are unimplemented and return null. The null allocator returns null as the spec states it should not be usable from the target.
2024-07-26Reapply "[OpenMP][libc] Remove special handling for OpenMP printf (#98940)"Joseph Huber
This reverts commit fea5914c926e2f013a8b5e27eaa74c7047fb2c71.
2024-07-26Revert "[OpenMP][libc] Remove special handling for OpenMP printf (#98940)"Joseph Huber
This reverts commit 069e8bcd82c4420239f95c7e6a09e1f756317cfc. Summary: Some tests failing, revert this for now.
2024-07-26[OpenMP][libc] Remove special handling for OpenMP printf (#98940)Joseph Huber
Summary: Currently there are several layers to handle `printf`. Since we now have varargs and an implementation of `printf` this can be heavily simplified. 1. The frontend renames `printf` into `omp_vprintf` and gives it an argument buffer. Removing 1. triggered some code in the AMDGPU backend menat for HIP / OpenCL, so I hadded an exception to it. 2. Forward this to CUDA vprintf or ignore it. We no longer need special handling for it since we have varargs. So now we just forward this to CUDA vprintf if we have libc, otherwise just leave `printf` as an external function and expect that `libc` will be linked in.
2024-07-26[OpenMP] Do not define '__assert_fail' if we have the GPU libc (#100409)Joseph Huber
Summary: The C library is intended to provide `__assert_fail`, so in the cases that we have both we should defer to that. This means that if you build the C library for GPUs you'll get the RPC based asser, and if not you'll get the trap based one.
2024-07-22[NFC][Offload] Move variables to where they are used (#99956)Shilei Tian
2024-07-02[DeviceRTL] Make defined 'libc' functions weak in OpenMP (#97356)Joseph Huber
Summary: These functions provide special-case implementations internal to the OpenMP device runtime. This can potentially conflict with the symbols pulled in from the actual GPU `libc`. This patch makes these weak, so in the case that the GPU libc functions exist they will be overridden. This should not impact performance in the average case because the old `-mlink-builtin-bitcode` version does internalization, deleting weak, and the new LTO path will resolve to the strong reference and then internalize it.
2024-07-01[OpenMP][offload] Fix dynamic schedule tracking (#97065)Gheorghe-Teodor Bercea
This patch fixes the dynamic schedule tracking.
2024-06-28Revert "[PGO][OpenMP] Instrumentation for GPU devices (#76587)"Ethan Luis McDonough
This reverts commit 5fd2af38e461445c583d7ffc2fe23858966eee76. It caused build issues and broke the buildbot.
2024-06-28[PGO][OpenMP] Instrumentation for GPU devices (#76587)Ethan Luis McDonough
This pull request is the first part of an ongoing effort to extends PGO instrumentation to GPU device code. This PR makes the following changes: - Adds blank registration functions to device RTL - Gives PGO globals protected visibility when targeting a supported GPU - Handles any addrspace casts for PGO calls - Implements PGO global extraction in GPU plugins (currently only dumps info) These changes can be tested by supplying `-fprofile-instrument=clang` while targeting a GPU.
2024-06-03Reapply "[OpenMP][OMPX] Add shfl_down_sync (#93311)" (#94139)Shilei Tian
2024-05-26Revert "Reapply "[OpenMP][OMPX] Add shfl_down_sync (#93311)""Shilei Tian
This reverts commit 7b4865582299294455bc816358fd88a9c6e5e0be.
2024-05-26Reapply "[OpenMP][OMPX] Add shfl_down_sync (#93311)"Shilei Tian
This reverts commit 9b31cc71d66064dfaf2afabf4a835211321bb4a0.
2024-05-24Revert "[OpenMP][OMPX] Add shfl_down_sync (#93311)"Joseph Huber
This reverts commit 098c6dfa8157681699a71fce9e3d94515e66311f. This reverts commit 8c718a3a91df4ab68dc3f1ca3887ea730c9aed84. This reverts commit 4fb02de9d490d0773441aa30124bb4d1272230d3.
2024-05-24[OpenMP][OMPX] Add shfl_down_sync (#93311)Shilei Tian
2024-05-24[OpenMP][OMPX] Add ballot_sync (#91297)Shilei Tian
This patch adds the support for `ballot_sync` in ompx.