llvm-project.git/offload/DeviceRTL/src, branch users/nico/python-2

[OpenMP][GPU][FIX] Enable generic barriers in single threaded contexts (#140786)

2025-05-21T02:33:54+00:00

The generic GPU barrier implementation checked if it was the main thread
in generic mode to identify single threaded regions. This doesn't work
since inside of a non-active (=sequential) parallel, that thread becomes
the main thread of a team, and is not the main thread in generic mode.
At least that is the implementation of the APIs today.

To identify single threaded regions we now check the team size
explicitly.

This exposed three other issues; one is, for now, expected and not a
bug, the second one is a bug and has a FIXME in the
single_threaded_for_barrier_hang_1.c file, and the final one is also
benign as described in the end.

The non-bug issue comes up if we ever initialize a thread state.
Afterwards we will never run any region in parallel. This is a little
conservative, but I guess thread states are really bad for performance
anyway.

The bug comes up if we optimize single_threaded_for_barrier_hang_1 and
execute it in Generic-SPMD mode. For some reason we loose all the
updates to b. This looks very much like a compiler bug, but could also
be another logic issue in the runtime. Needs to be investigated.

Issue number 3 comes up if we have nested parallels inside of a target
region. The clang SPMD-check logic gets confused, determines SPMD (which
is fine) but picks an unreasonable thread count. This is all benign, I
think, just weird:

```
  #pragma omp target teams
  #pragma omp parallel num_threads(64)
  #pragma omp parallel num_threads(10)
  {}
```
Was launched with 10 threads, not 64.

[Offload] Fix handling of 'bare' mode when environment missing (#136794)

2025-04-23T13:16:39+00:00

Summary:
We treated the missing kernel environment as a unique mode, but it was
kind of this random bool that was doing the same thing and it explicitly
expects the kernel environment to be zero. It broke after the previous
change since it used to default to SPMD and didn't handle zero in any of
the other cases despite being used. This fixes that and queries for it
without needing to consume an error.

[OpenMP] Remove dependency on LLVM include directory from DeviceRTL (#136359)

2025-04-21T20:21:47+00:00

Summary:
Currently we depend on a single LLVM include directory. This is actually
only required to define one enum, which is highly unlikely to change.
THis patch makes the `Environment.h` include directory more hermetic so
we no long depend on other libraries. In exchange, we get a simpler
dependency list for the price of hard-coding `1` somewhere. I think it's
a valid trade considering that this flag is highly unlikely to change at
this point.

@ronlieb AMD version
https://gist.github.com/jhuber6/3313e6f957be14dc79fe85e5126d2cb3

[OpenMP] Remove 'libomptarget.devicertl.a' fatbinary and use static library (#126143)

2025-04-18T12:43:31+00:00

Summary:
Currently, we build a single `libomptarget.devicertl.a` which is a
fatbinary. It is a host object file that contains the embedded archive
files for both the NVIDIA and AMDGPU targets. This was done primarily as
a convenience due to naming conflicts. Now that the clang driver for the
GPU targets can appropriate link via the per-target runtime-dir, we can
just make two separate static libraries and remove the indirection.

This patch creates two new static libraries that get installed into
```
/lib/amdgcn-amd-amdhsa/libomp.a
/lib/nvptx64-nvidia-cuda/libomp.a
```
for AMDGPU and NVPTX respectively. The link job created by the linker
wrapper now simply needs to do `-lomp` and it will search those
directories and link those static libraries. This requires far less
special handling.

This patch is a precursor to changing the build system entirely to be a
runtimes based one. Soon this target will be a standard `add_library`
and done through the GPU runtime targets.

NOTE that this actually does remove an additional optimization step.
Previously we merged all of the files into a single bitcode object and
forcibly internalized some definitions. This, instead, just treats them
like a normal static library. This may possibly affect performance for
some files, but I think it's better overall to use static library
semantics because it allows us to have an 'include-what-you-use'
relationship with the library.

Performance testing will be required. If we really need the merged blob
then we can simply pack that into a new static library.

[AMDGPU] Fix code object version not being set to 'none' (#135036)

2025-04-10T16:31:21+00:00

Summary:
Previously, we removed the special handling for the code object version
global. I erroneously thought that this meant we cold get rid of this
weird `-Xclang` option. However, this also emits an LLVM IR module flag,
which will then cause linking issues.

[OpenMP] Fix num_iters in __kmpc_*_loop DeviceRTL functions (#133435)

2025-04-01T09:29:08+00:00

This patch removes the addition of 1 to the number of iterations when
calling the following DeviceRTL functions:
- `__kmpc_distribute_for_static_loop*`
- `__kmpc_distribute_static_loop*`
- `__kmpc_for_static_loop*`

Calls to these functions are currently only produced by the OMPIRBuilder
from flang, which already passes the correct number of iterations to
these functions. By adding 1 to the received `num_iters` variable,
worksharing can produce incorrect results. This impacts flang OpenMP
offloading of `do`, `distribute` and `distribute parallel do`
constructs.

Expecting the application to pass `tripcount - 1` as the argument seems
unexpected as well, so rather than updating flang I think it makes more
sense to update the runtime.

[offload] Remove bad assert in StaticLoopChunker::Distribute (#132705)

2025-03-28T09:53:00+00:00

When building with asserts enabled, this can actually cause strange
miscompilations because an incorrect llvm.assume is generated at the
point of the assertion.

[OpenMP] Replace utilities with 'gpuintrin.h' definitions (#131644)

2025-03-19T15:47:21+00:00

Summary:
Port more instructions. AMD version is at
https://gist.github.com/jhuber6/235d7ee95f747c75f9a3cfd8eedac6aa

[openmp][nfc] Use builtin align in the devicertl (#131918)

2025-03-18T21:31:49+00:00

Noticed while extracting the smartstack as a test case

Revert "[openmp][nfc] Refactor shared/lds smartstack for spirv (#131905)"

2025-03-18T20:43:05+00:00

This reverts commit c02b935a9be888bbdf9f8cb0bf980bd411ae5893.
Failed a check-offload test under CI