llvm-project.git/libc/src/__support/GPU/amdgpu/utils.h, branch users/chapuni/cov/single/condop

[libc] Switch to using the generic `<gpuintrin.h>` implementations (#121810)

2025-01-07T19:08:39+00:00

Summary:
This patch switches the GPU utility helpers to C++-flavored wrappers
around the gpuintrin.h ones.

[libc] Migrate to using LIBC_NAMESPACE_DECL for namespace declaration (#98597)

2024-07-12T16:28:41+00:00

This is a part of #97655.

Revert "[libc] Migrate to using LIBC_NAMESPACE_DECL for namespace declaration" (#98593)

2024-07-12T07:12:13+00:00

Reverts llvm/llvm-project#98075

bots are broken

[libc] Migrate to using LIBC_NAMESPACE_DECL for namespace declaration (#98075)

2024-07-11T19:35:22+00:00

This is a part of #97655.

[libc] Add memory fence utility to the GPU utilities (#91756)

2024-05-10T21:38:13+00:00

Summary:
GPUs like to execute instructions in the background until something
explicitly consumes them. We are working on adding some
microbenchmarking code, which requires flushing the pending memory
operations beforehand. This patch simply adds these utility functions
that will be used in the near future.
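
As a rough host-side analogue of why a fence matters before timing, the sketch below brackets a timed region with `std::atomic_thread_fence`. It is not the GPU utility this patch adds. The `timed` helper is a hypothetical name, and the standard C++ fence only stands in for the GPU memory fence.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>

// Hypothetical helper: times `work` while fencing memory operations on both
// sides, so earlier loads and stores are not reordered into the timed region.
// Assumption: std::atomic_thread_fence is only a CPU stand-in for a GPU fence.
template <typename F> std::chrono::nanoseconds timed(F &&work) {
  std::atomic_thread_fence(std::memory_order_seq_cst);
  const auto start = std::chrono::steady_clock::now();
  work();
  std::atomic_thread_fence(std::memory_order_seq_cst);
  const auto stop = std::chrono::steady_clock::now();
  assert(stop >= start); // steady_clock never goes backwards
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
}
```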

[libc] Add utility functions for warp-level scan and reduction (#84866)

2024-03-12T15:40:49+00:00

Summary:
The GPU uses a SIMT execution model. That means that each value actually
belongs to a group of 32 or 64 other lanes executing next to it. These
platforms offer some intrinsic functions to actually take elements from
neighboring lanes. With these we can do parallel scans or reductions.
These functions do not have an immediate user, but will be used in the
allocator interface that is in-progress and are generally good to have.
This patch is a precommit for these new utility functions.
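
To illustrate how neighbor-lane reads give a scan or reduction, here is a minimal host-side model, not the code from this patch. It treats a warp as a plain array and assumes 32 lanes. `Warp`, `shuffle_up`, `inclusive_scan`, and `reduce` are hypothetical names chosen for the sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side model of one warp: lane i holds values[i].
constexpr std::size_t kLanes = 32;
using Warp = std::array<std::uint32_t, kLanes>;

// Models a "shuffle up": each lane reads the value held `delta` lanes below
// it. Lanes without such a neighbor keep their own value.
Warp shuffle_up(const Warp &values, std::size_t delta) {
  assert(delta < kLanes);
  Warp out = values;
  for (std::size_t lane = delta; lane < kLanes; ++lane)
    out[lane] = values[lane - delta];
  return out;
}

// Inclusive prefix sum in log2(kLanes) shuffle steps (Hillis-Steele scan).
Warp inclusive_scan(Warp values) {
  for (std::size_t step = 1; step < kLanes; step *= 2) {
    const Warp neighbor = shuffle_up(values, step);
    for (std::size_t lane = step; lane < kLanes; ++lane)
      values[lane] += neighbor[lane];
  }
  return values;
}

// The last lane of an inclusive scan holds the sum over the whole warp.
std::uint32_t reduce(const Warp &values) {
  return inclusive_scan(values)[kLanes - 1];
}
```

On real hardware every lane runs the loop body at the same time, so the inner `for` over lanes disappears and only the log2 shuffle steps remain.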

[libc] Remove remaining GPU architecture dependent instructions (#81612)

2024-02-13T18:26:45+00:00

Summary:
Recent patches have added solutions to the remaining sources of
divergence. This patch simply removes the last occurrences of things like
`has_builtin`, `ifdef` or builtins with feature requirements. The one
exception here is `nanosleep`, but I made changes in the
`__nvvm_reflect` pass to make usage like this actually work at O0.

Depends on https://github.com/llvm/llvm-project/pull/81331

[libc] Rework the RPC interface to accept runtime wave sizes (#80914)

2024-02-13T16:45:43+00:00

Summary:
The RPC interface needs to handle an entire warp or wavefront at once.
This is currently done by using a compile time constant indicating the
size of the buffer, which right now defaults to some value on the client
(GPU) side. However, there are currently attempts to move the `libc`
library to a single IR build. This is problematic as the size of the
wavefronts changes between ISAs on AMDGPU. The builtin
`__builtin_amdgcn_wavefrontsize()` will return the appropriate value,
but it is only known at runtime now.

In order to support this, this patch restructures the packet. Now
instead of having an array of arrays, we simply have a large array of
buffers and slice it according to the runtime value if we don't know it
ahead of time. This also somewhat has the advantage of making the buffer
contiguous within a page now that the header has been moved out of it.
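
The flat-array layout described above can be sketched as follows. This is a simplified host-side illustration, not the actual RPC packet definition. `Buffer`, `PacketStorage`, and `slice` are hypothetical names, and the 64-byte buffer size is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-lane payload; the 64-byte size is an assumption.
struct Buffer {
  std::uint64_t data[8];
};

// One flat allocation of port_count * lane_size buffers, replacing a
// fixed-size array of arrays. The lane size is only known at runtime.
struct PacketStorage {
  std::vector<Buffer> buffers;
  std::uint32_t ports;
  std::uint32_t lanes;

  PacketStorage(std::uint32_t port_count, std::uint32_t lane_size)
      : buffers(static_cast<std::size_t>(port_count) * lane_size),
        ports(port_count), lanes(lane_size) {}

  // Returns the first buffer of the slice owned by `port`; the slice spans
  // `lanes` consecutive buffers.
  Buffer *slice(std::uint32_t port) {
    assert(port < ports);
    return buffers.data() + static_cast<std::size_t>(port) * lanes;
  }
};
```

The same type serves a 32-wide and a 64-wide wavefront; only the stride passed at construction differs.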

[libc] Remove CPU dependent AMDGPU instructions (#80707)

2024-02-06T13:22:13+00:00

Summary:
Some recent changes allowed us to remove target level divergence on
these instructions. This patch removes the wavefront dependent
divergence for the ballot and thread ID functions, as well as the clock.
The changes to the "Vendor" library simply disable target specific
optimizations in the implementation. This should be removed in its
entirety when the LLVM `libm` is sufficiently implemented.

The remaining areas of divergence are only the RPC packet size and the
fixed frequency counter.

[libc] Change the starting port index to use the SMID (#79200)

2024-01-30T19:06:58+00:00

Summary:
The RPC interface uses several ports to provide parallel access. Right
now we begin the search at the beginning, which heavily contests the
early ports. Using the SMID allows us to stagger the starting index
based off of the cluster identifier that is executing the current warp.
Multiple warps can share an SM, but it will guarantee that the
contention for the low indices is lower.

This also increases the maximum port size to around 4096, because 512
isn't enough to cover the full hardware parallelism needed to guarantee
this doesn't deadlock.
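
The staggered search can be illustrated with a small host-side sketch. It is not the RPC client code. `find_port` is a hypothetical name, the busy flags are a plain `std::vector<bool>` rather than atomic locks, and `sm_id` stands in for the hardware cluster identifier.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical port search: instead of always scanning from index 0, start
// at an index derived from the SM id and wrap around. Callers on different
// SMs then probe different ports first.
std::optional<std::uint32_t> find_port(std::vector<bool> &busy,
                                       std::uint32_t sm_id) {
  assert(!busy.empty());
  const std::uint32_t port_count = static_cast<std::uint32_t>(busy.size());
  const std::uint32_t start = sm_id % port_count;
  for (std::uint32_t i = 0; i < port_count; ++i) {
    const std::uint32_t index = (start + i) % port_count;
    if (!busy[index]) {
      busy[index] = true; // claim the port
      return index;
    }
  }
  return std::nullopt; // every port is in use
}
```

Warps that share an SM still begin at the same index, which matches the note above that contention is reduced rather than eliminated.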