path: root/offload/plugins-nextgen/cuda
Age | Commit message | Author
2025-11-13 | [Offload] Add device info for shared memory (#167817) | Kevin Sala Penades
2025-11-06 | [OpenMP] Fix tests relying on the heap size variable | Joseph Huber
Summary: I made that an unimplemented error, but forgot that it was used for this environment variable.
2025-11-06 | [Offload] Remove handling for device memory pool (#163629) | Joseph Huber
Summary: This was a lot of code that was only used for upstream LLVM builds of AMDGPU offloading. We have a generic and fast `malloc` in `libc` now so just use that. Simplifies code, can be added back if we start providing alternate forms but I don't think there's a single use-case that would justify it yet.
2025-11-04 | [Offload] Add device UID (#164391) | Robert Imschweiler
Introduced in OpenMP 6.0, the device UID shall be a unique identifier of a device on a given system. (Not necessarily a UUID.) Since it is not guaranteed that the (U)UIDs defined by the device vendor libraries, such as HSA, do not overlap with those of other vendors, the device UIDs in offload are always combined with the offload plugin name. In case the vendor library does not specify any device UID for a given device, we fall back to the offload-internal device ID. The device UID can be retrieved using the `llvm-offload-device-info` tool.
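The UID scheme described above can be sketched roughly as follows. This is a minimal illustration, not the Offload implementation: the helper name `makeDeviceUid`, its parameters, and the `-` separator are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helper (not part of Offload): combine the plugin name with the
// vendor-provided UID so that UIDs from different vendors cannot collide.
// When the vendor library reports no UID, fall back to the offload-internal
// device ID. The "-" separator is an assumption made for this sketch.
std::string makeDeviceUid(const std::string &PluginName,
                          const std::optional<std::string> &VendorUid,
                          int32_t DeviceId) {
  const std::string Suffix = VendorUid ? *VendorUid : std::to_string(DeviceId);
  return PluginName + "-" + Suffix;
}
```

Under these assumptions, `makeDeviceUid("amdgpu", std::nullopt, 2)` would yield a UID built from the plugin name and the internal device ID.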
2025-10-09 | [OFFLOAD] Remove unused init_device_info plugin interface (#162650) | Alex Duran
This was used for the old interop code. It's dead code after #143491
2025-10-02 | [OFFLOAD] Restore interop functionality (#161429) | Alex Duran
This implements two pieces to restore the interop functionality (that I broke) when the 6.0 interfaces were added: * A set of wrappers that support the old interfaces on top of the new ones * The same level of interop support for the CUDA and AMD plugins
2025-09-26 | [Offload] Use Error for allocating/deallocating in plugins (#160811) | Kevin Sala Penades
Co-authored-by: Joseph Huber <huberjn@outlook.com>
2025-09-20 | [Offload] Remove non-blocking allocation type (#159851) | Joseph Huber
Summary: This was originally added in as a hack to work around CUDA's limitation on allocation. The `libc` implementation now isn't even used for CUDA so this code is never hit. Even in this case, this code never truly worked. A true solution would be to use CUDA's virtual memory API instead to allocate 2MiB slabs independently from the normal memory management done in the stream.
2025-09-19 | [OpenMP][NFC] Clean up a bunch of warnings and clang-tidy messages (#159831) | Joseph Huber
Summary: I made the GPU flags accept more of the default LLVM warnings, which triggered some new cases. Clean those up and fix some other ones while I'm at it.
2025-09-17 | [LLVM] Fix offload and update CUDA ABI for all SM values (#159354) | Joseph Huber
Summary: Turns out the new CUDA ABI now applies retroactively to all the other SMs if you upgrade to CUDA 13.0. This patch changes the scheme, keeping all the SM flags consistent but using an offset. Fixes: https://github.com/llvm/llvm-project/issues/159088
2025-09-16 | [Offload] Copy loaded images into managed storage (#158748) | Joseph Huber
Summary: Currently we have this `__tgt_device_image` indirection which just takes a reference to some pointers. This was all fine and good when the only usage of this was from a section of GPU code that came from an ELF constant section. However, we have expanded beyond that and now need to worry about managing lifetimes. We have code that references the image even after it was loaded internally. This patch changes the implementation to instead copy the memory buffer and manage it locally. This PR reworks the JIT and other image handling to directly manage its own memory. We now don't need to duplicate this behavior externally at the Offload API level. Also we actually free these if the user unloads them. Upside, less likely to crash and burn. Downside, more latency when loading an image.
2025-08-29 | [Offload] Add `OL_DEVICE_INFO_MAX_WORK_SIZE[_PER_DIMENSION]` (#155823) | Ross Brunton
This is the total number of work items that the device supports (the equivalent work group properties are for only a single work group).
2025-08-28 | [Offload] Add PRODUCT_NAME device info (#155632) | Ross Brunton
On my system, this will be "Radeon RX 7900 GRE" rather than "gfx1100". For Nvidia, the product name and device name are identical.
2025-08-27 | [NFC][offload] Fix error message for cuFuncSetAttribute (#155655) | Kevin Sala Penades
2025-08-22 | [Offload] Implement olMemFill (#154102) | Callum Fare
Implement olMemFill to support filling device memory with arbitrary length patterns. AMDGPU support will be added in a follow-up PR.
2025-08-22 | [Offload] `OL_EVENT_INFO_IS_COMPLETE` (#153194) | Ross Brunton
A simple info query for events that returns whether the event is complete or not.
2025-08-19 | [Offload] Add olCalculateOptimalOccupancy (#142950) | Ross Brunton
This is equivalent to `cuOccupancyMaxPotentialBlockSize`. It is currently only implemented on CUDA; AMDGPU and Host return unsupported. --------- Co-authored-by: Callum Fare <callum@codeplay.com>
2025-08-19 | [Offload] Define additional device info properties (#152533) | Rafal Bielski
Add the following properties in Offload device info: * VENDOR_ID * NUM_COMPUTE_UNITS * [SINGLE|DOUBLE|HALF]_FP_CONFIG * NATIVE_VECTOR_WIDTH_[CHAR|SHORT|INT|LONG|FLOAT|DOUBLE|HALF] * MAX_CLOCK_FREQUENCY * MEMORY_CLOCK_RATE * ADDRESS_BITS * MAX_MEM_ALLOC_SIZE * GLOBAL_MEM_SIZE Add a bitfield option to enumerators, allowing the values to be bit-shifted instead of incremented. Generate the per-type enums using `foreach` to reduce code duplication. Use macros in unit test definitions to reduce code duplication.
2025-08-15 | [Offload] Introduce dataFence plugin interface. (#153793) | Abhinav Gaba
The purpose of this fence is to ensure that any `dataSubmit`s inserted into a queue before a `dataFence` finish before any `dataSubmit`s inserted after it begin. This is a no-op for most queues, since they are in-order, and by design any operations inserted into them occur in order. But the interface is supposed to be functional for out-of-order queues. The addition of the interface means that any operations that rely on such ordering (like ATTACH map-type support in #149036) can invoke it, without worrying about whether the underlying queue is in-order or out-of-order. Once a plugin supports out-of-order queues, the plugin can implement this function, without requiring any change at the libomptarget level. --------- Co-authored-by: Alex Duran <alejandro.duran@intel.com>
2025-08-15 | [Offload] `olLaunchHostFunction` (#152482) | Ross Brunton
Add an `olLaunchHostFunction` method that allows enqueueing host work to the stream.
2025-08-13 | [Offload] Implement hasPendingWork on CUDA (#152728) | Callum Fare
Following on from #152304, implement the new query in the CUDA plugin
2025-08-10 | [Offload] Fix return error with a condition (#152876) | Kevin Sala Penades
Adds a conditional to the error return so that it only returns if there was an error.
2025-08-08 | [Offload] Make olLaunchKernel test thread safe (#149497) | Ross Brunton
This sprinkles a few mutexes around the plugin interface so that the olLaunchKernel CTS test now passes when run on multiple threads. Part of this also involved changing the interface for device synchronise so that it can optionally not free the underlying queue (which introduced a race condition in liboffload).
2025-08-07 | [Offload] Don't create events for empty queues (#152304) | Ross Brunton
Add a device function to check if a device queue is empty. If liboffload tries to create an event for an empty queue, we create an "empty" event that is already complete. This allows `olCreateEvent`, `olSyncEvent` and `olWaitEvent` to run quickly for empty queues.
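The fast path for empty queues might look roughly like the sketch below. `Queue`, `Event`, and `createEvent` are hypothetical stand-ins invented for this example, not the liboffload types, and the completion model (an event is done once its queue has drained) is a deliberate simplification.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a device queue; it only counts pending operations.
struct Queue {
  std::size_t PendingOps = 0;
  bool empty() const { return PendingOps == 0; }
  void enqueue() { ++PendingOps; }
  void drain() { PendingOps = 0; } // pretend the device finished all work
};

// Hypothetical event: a null queue pointer marks an "empty" event that is
// already complete, so querying or syncing it never touches the device.
struct Event {
  Queue *Q = nullptr;
  bool isComplete() const { return Q == nullptr || Q->empty(); }
  void sync() const {
    if (Q)
      Q->drain();
  }
};

// Create an event for the queue, short-circuiting when there is nothing to
// wait for.
Event createEvent(Queue &Q) {
  if (Q.empty())
    return Event{};
  return Event{&Q};
}
```

In this model, an event created on an idle queue reports completion immediately, while one created after `enqueue()` only completes once `sync()` has drained the queue.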
2025-08-04 | [Offload] Rework `MAX_WORK_GROUP_SIZE` (#151926) | Ross Brunton
`MAX_WORK_GROUP_SIZE` now represents the maximum total number of work groups the device can allocate, rather than the maximum per dimension. `MAX_WORK_GROUP_SIZE_PER_DIMENSION` has been added, which has the old behaviour.
2025-07-21 | [LLVM] Update CUDA ELF flags for their new ABI (#149534) | Joseph Huber
Summary: We rely on these flags to do things in the runtime and print the contents of binaries correctly. CUDA updated their ABI encoding recently and we didn't handle that. It's a new ABI entirely so we just select on it when it shows up. Fixes: https://github.com/llvm/llvm-project/issues/148703
2025-07-18 | [Offload] Allow "tagging" device info entries with offload keys (#147317) | Ross Brunton
When generating the device info tree, nodes can be marked with an offload Device Info value. The nodes can also look up children based on this value.
2025-07-14 | [Offload] Return error rather than dropping it (#148609) | Ross Brunton
2025-07-10 | [Offload] Allow querying the size of globals (#147698) | Ross Brunton
The `GlobalTy` helper has been extended to make both the Size and Ptr be optional. Now `getGlobalMetadataFromDevice`/`Image` is able to write the size of the global to the struct, instead of just verifying it.
2025-07-08 | [Offload] Implement 'Vendor Name' device info for CUDA (#147334) | Callum Fare
After #146345 the device info implementation requires a value for every query, rather than silently returning an empty string. This broke the test for `OL_DEVICE_INFO_VENDOR` on CUDA. Add a value to the CUDA plugin. We can quite safely hard code this one.
2025-07-07 | [Offload] Allow CUDA Kernels to use arbitrarily large shared memory (#145963) | Giorgi Gvalia
Previously, the user was not able to use more than 48 KB of shared memory on NVIDIA GPUs. In order to do so, setting the function attribute `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES` is required, which was not present in the code base. With this commit, we add the ability to set this attribute, allowing the user to utilize the full power of their GPU. In order to not have to reset the function attribute for each launch of the same kernel, we keep track of the maximum memory limit (as the variable `MaxDynCGroupMemLimit`) and only set the attribute if our desired amount exceeds the limit. By default, this limit is set to 48 KB. Feedback is greatly appreciated, especially around setting the new variable as mutable. I did this because the `launchImpl` method is const and I am not able to modify my variable otherwise. --------- Co-authored-by: Giorgi Gvalia <ggvalia@login33.chn.perlmutter.nersc.gov> Co-authored-by: Giorgi Gvalia <ggvalia@login07.chn.perlmutter.nersc.gov>
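The caching idea behind `MaxDynCGroupMemLimit` can be illustrated with a small, driver-free sketch. `FakeDriver`, `Kernel`, and `launch` are hypothetical names for this example; in the real plugin the attribute would be set through `cuFuncSetAttribute`, which is replaced here by a counter so the behaviour can be observed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the driver: records how often the function
// attribute is set instead of calling into CUDA.
struct FakeDriver {
  int SetAttributeCalls = 0;
  uint32_t LastValue = 0;
  void setMaxDynamicSharedBytes(uint32_t Bytes) {
    ++SetAttributeCalls;
    LastValue = Bytes;
  }
};

// Hypothetical kernel wrapper. The limit is `mutable` because, as in the
// commit message, the launch method is const.
class Kernel {
public:
  explicit Kernel(FakeDriver &D) : Driver(D) {}

  void launch(uint32_t DynSharedBytes) const {
    // Only touch the attribute when the request exceeds the cached limit.
    if (DynSharedBytes > MaxDynCGroupMemLimit) {
      Driver.setMaxDynamicSharedBytes(DynSharedBytes);
      MaxDynCGroupMemLimit = DynSharedBytes;
    }
    // The actual kernel launch would happen here.
  }

private:
  FakeDriver &Driver;
  mutable uint32_t MaxDynCGroupMemLimit = 48 * 1024; // default limit: 48 KB
};
```

With this shape, repeated launches at or below the largest size seen so far never touch the attribute again, which is the saving the commit describes.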
2025-06-27 | [Offload] Make CUDA Driver Version a string (#146049) | Ross Brunton
AMD treats this value as a string, so for consistency require this in NVIDIA as well. This shouldn't change the output of the `llvm-offload-device-info` tool, but does fix an issue in liboffload when it tries to query the version.
2025-06-25 | [Offload] Add an `unloadBinary` interface to PluginInterface (#143873) | Ross Brunton
This allows removal of a specific Image from a Device, rather than requiring all image data to outlive the device they were created for. This is required for `ol_program_handle_t`s, which now specify the lifetime of the buffer used to create the program.
2025-06-13 | [Offload] Replace device info queue with a tree (#144050) | Ross Brunton
Previously, device info was returned as a queue with each element having a "Level" field indicating its nesting level. This replaces this queue with a more traditional tree-like structure. This should not result in a change to the output of `llvm-offload-device-info`.
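To make the difference between the two representations concrete, here is a minimal sketch that converts a flat, level-annotated list into a tree. `FlatEntry`, `InfoNode`, and `buildTree` are hypothetical names for this example and do not mirror the plugin's actual data structures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flat representation: entries in pre-order, each carrying its
// nesting level (0 = top level).
struct FlatEntry {
  std::string Key;
  std::size_t Level;
};

// Hypothetical tree representation: nesting is expressed by ownership.
struct InfoNode {
  std::string Key;
  std::vector<InfoNode> Children;
};

// Rebuild the tree from the flat list. Stack[i] points at the most recent
// node at depth i (Stack[0] is the synthetic root), i.e. the parent of the
// next entry with Level == i.
InfoNode buildTree(const std::vector<FlatEntry> &Entries) {
  InfoNode Root{"root", {}};
  std::vector<InfoNode *> Stack{&Root};
  for (const FlatEntry &E : Entries) {
    // Clamp levels that skip a depth so the input cannot index past the stack.
    const std::size_t Level = std::min(E.Level, Stack.size() - 1);
    Stack.resize(Level + 1);
    InfoNode *Parent = Stack.back();
    Parent->Children.push_back(InfoNode{E.Key, {}});
    // Only ancestors remain on the stack, so this pointer stays valid until
    // the next sibling is appended, at which point it has been popped.
    Stack.push_back(&Parent->Children.back());
  }
  return Root;
}
```

A printer that walks either structure in pre-order produces the same listing, which matches the commit's note that the tool output should not change.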
2025-05-20 | [Offload] Use new error code handling mechanism and lower-case messages (#139275) | Ross Brunton
[Offload] Use new error code handling mechanism This removes the old ErrorCode-less error method and requires every user to provide a concrete error code. All calls have been updated. In addition, for consistency with error messages elsewhere in LLVM, all messages have been made to start lower case.
2025-01-31 | [Offload][NFC] Fix typos discovered by codespell (#125119) | Christian Clauss
https://github.com/codespell-project/codespell % `codespell --ignore-words-list=archtype,hsa,identty,inout,iself,nd,te,ths,vertexes --write-changes`
2025-01-24 | [Offload] Add cuLaunchHostFunc to dynamic cuda | Joseph Huber
Summary: This was missing, causing non-directly linked builds to fail.
2025-01-24 | [Offload] Move RPC server handling to a dedicated thread (#112988) | Joseph Huber
Summary: Handling the RPC server requires running through a list of jobs that the device has requested to be done. Currently this is handled by the thread that does the waiting for the kernel to finish. However, this is not sound on NVIDIA architectures and only works for async launches in the OpenMP model that uses helper threads. However, we also don't want to have this thread doing work unnecessarily. For this reason we track the execution of kernels and cause the thread to sleep via a condition variable (usually backed by some kind of futex or other intelligent sleeping mechanism) so that the thread will be idle while no kernels are running.
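The sleep-while-idle pattern described above can be sketched with standard C++ threading primitives. The class and method names (`RPCServerThread`, `kernelStarted`, `kernelFinished`, `polls`) are hypothetical, and the "work" is just a counter standing in for servicing RPC requests; the real server logic is not modelled.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical sketch: a dedicated thread that polls only while at least one
// kernel is running and otherwise sleeps on a condition variable.
class RPCServerThread {
public:
  RPCServerThread() : Worker([this] { run(); }) {}

  ~RPCServerThread() {
    {
      std::lock_guard<std::mutex> Lock(Mutex);
      Running = false;
    }
    CV.notify_all();
    Worker.join();
  }

  // Called when a kernel is launched: wake the server thread.
  void kernelStarted() {
    {
      std::lock_guard<std::mutex> Lock(Mutex);
      ++ActiveKernels;
    }
    CV.notify_all();
  }

  // Called when a kernel completes: the thread goes idle once the count is 0.
  void kernelFinished() {
    std::lock_guard<std::mutex> Lock(Mutex);
    --ActiveKernels;
  }

  unsigned long polls() const { return Polls.load(); }

private:
  void run() {
    std::unique_lock<std::mutex> Lock(Mutex);
    while (Running) {
      // Sleep until there is a running kernel or shutdown was requested.
      CV.wait(Lock, [this] { return !Running || ActiveKernels > 0; });
      if (!Running)
        break;
      Lock.unlock();
      ++Polls; // stand-in for handling pending RPC jobs
      std::this_thread::yield();
      Lock.lock();
    }
  }

  std::mutex Mutex;
  std::condition_variable CV;
  int ActiveKernels = 0;
  bool Running = true;
  std::atomic<unsigned long> Polls{0};
  std::thread Worker; // declared last so all other members exist first
};
```

Because the wait predicate is checked under the mutex, the thread cannot start polling before `kernelStarted()` is called, and it returns to sleep once the active-kernel count drops back to zero.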
2024-12-06 | [Offload][OMPX] Add the runtime support for multi-dim grid and block (#118042) | Shilei Tian
2024-12-02 | [OpenMP] Unconditionally provide an RPC client interface for OpenMP (#117933) | Joseph Huber
Summary: This patch adds an RPC interface that lives directly in the OpenMP device runtime. This allows OpenMP to implement custom opcodes. Currently this is only providing the host call interface, which is the raw version of reverse offloading. Previously this lived in `libc/` as an extension which is not the correct place. The interface here uses a weak symbol for the RPC client by the same name that the `libc` interface uses. This means that it will defer to the libc one if both are present so we don't need to set up multiple instances. The presence of this symbol is what controls whether or not we set up the RPC server. Because this is an external symbol it normally won't be optimized out, so there's a special pass in OpenMPOpt that deletes this symbol if it is unused during linking. That means at `O0` the RPC server will always be present now, but will be removed trivially if it's not used at O1 and higher.
2024-09-05 | [Offload][NFC] Reorganize `utils::` and make Device/Host/Shared clearer (#100280) | Johannes Doerfert
We had three `utils::` namespaces, all with different "meaning" (host, device, hsa_utils). We should, when we can, keep "include/Shared" accessible from host and device, thus RefCountTy has been moved to a separate header. `hsa_utils` was introduced to make `utils::` less overloaded. And common functionality was de-duplicated, e.g., `utils::advance` and `utils::advanceVoidPtr` -> `utils::advancePtr`. Type punning now checks for the size of the result to make sure it matches the source type. No functional change was intended.
2024-06-13 | [Offload] Use flat array for cuLaunchKernel (#95116) | Johannes Doerfert
We already used a flat array of kernel launch parameters for the AMD GPU launch but now we also use this scheme for the NVIDIA GPU launch. The only remaining/required use of the indirection is the host plugin (due to ffi). This allows us to simplify the use for non-OpenMP kernel launch.
2024-06-06 | [Libomptarget] Rework device initialization and image registration (#93844) | Joseph Huber
Summary: Currently, we register images into a linear table according to the logical OpenMP device identifier. We then initialize all of these images as one block. This logic requires that images are compatible with *all* devices instead of just the one that it can run on. This prevents us from running on systems with heterogeneous devices (i.e. image 1 runs on device 0, image 0 runs on device 1). This patch reworks the logic by instead making the compatibility check a per-device query. We then scan every device to see if it's compatible and do it as they come.
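A per-device compatibility query of the kind described above might look like the following sketch. The `Image` and `Device` types, the architecture-string comparison, and `registerImages` are all assumptions made for illustration; the real check inspects the binary rather than comparing names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical image description: only the target architecture matters here.
struct Image {
  std::string Arch;
};

// Hypothetical device: records the indices of the images it accepted.
struct Device {
  std::string Arch;
  std::vector<std::size_t> LoadedImages;

  // Per-device query: can this particular device run the image?
  bool isCompatible(const Image &I) const { return I.Arch == Arch; }
};

// Scan every device for every image and register the image only where the
// device reports it as compatible, rather than requiring one image to fit
// all devices.
void registerImages(std::vector<Device> &Devices,
                    const std::vector<Image> &Images) {
  for (Device &D : Devices)
    for (std::size_t I = 0; I < Images.size(); ++I)
      if (D.isCompatible(Images[I]))
        D.LoadedImages.push_back(I);
}
```

This handles the heterogeneous case from the commit message, where image 1 belongs to device 0 and image 0 belongs to device 1.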
2024-05-23 | [Offload] Fix enabling plugins on unsupported platforms (#93186) | Joseph Huber
Summary: Certain plugins can only be built on specific platforms. Previously this didn't cause issues because each one was handled independently. However, now that we link these all directly they need to be in a CMake list. Furthermore we use this list to generate a config file. For this reason these checks are moved to where we normalize the support. Fixes: https://github.com/llvm/llvm-project/issues/93183
2024-05-22 | [Offload] Rework handling for loading vendor runtimes (#93073) | Joseph Huber
Summary: We previously had multiple options for this, this patch replaces them with `LIBOMPTARGET_DLOPEN_PLUGINS=` to be a list of plugins to dynamically use. It defaults to everything right now. This ignores the `host` plugin because the `libffi` dependency is going to be removed soon hopefully in https://github.com/llvm/llvm-project/pull/91264.
2024-05-22 | [Offload] Use newer CUDA API functions when dynamically loaded (#93057) | Joseph Huber
Summary: CUDA does its versioning by putting a redirection in the header so the API functions remain the same while the symbol changes. These weren't being used for some functions that required it in the dynamic cuda version. These functions have newer versions that should be used. These are fairly old as far as I'm aware so we should be able to sweep backward compatibility under the rug.
2024-05-22 | [Offload] libomptarget force dlopen vendor libraries by default. (#92788) | Ye Luo
Since #87009, libomptarget directly links all the plugins statically, so all the dependencies of the plugins are exposed to libomptarget. The CUDA plugin depends on libcuda and the amdgpu plugin depends on libhsa unless dlopen is forced. On a cluster with different compute node architectures, libomptarget can be built and run on different nodes. In the build stage, if cmake finds libcuda and `LIBOMPTARGET_FORCE_DLOPEN_LIBCUDA=OFF`, libomptarget links libcuda.so directly, and the resulting libomptarget may not run on a node without an NVIDIA driver, for example a CPU-only or AMD-GPU-only machine, complaining that libcuda.so was not found. The solution is setting `LIBOMPTARGET_FORCE_DLOPEN_LIBCUDA` and `LIBOMPTARGET_FORCE_DLOPEN_LIBHSA` to `ON`. Preferably this should be the default to maximize the usability of libomptarget. If cmake detects NVIDIA or AMD software on an OS image-building node, the resulting libomptarget may not be able to function on the user side because it requires the vendor runtime libraries to exist.
2024-05-17 | [Offload][NFC] Remove 'libomptarget' message helpers (#92581) | Joseph Huber
Summary: This isn't `libomptarget` anymore, and these messages were always unnecessary because no other project uses these prefixed messages. The effect of this is that no longer will the logs have `LIBOMPTARGET --` in front of everything. We have a message stating when we start building the offload project so it'll still be trivial to find.
2024-05-17 | [Offload][NFC] Remove all trailing whitespace from offload/ (#92578) | Joseph Huber
Summary: This patch cleans up the trailing whitespace in a bunch of tests and CMake files. Most just in preparation for other cleanups.
2024-05-17 | [Offload][NFC] Remove header license in CMake files (#92544) | Joseph Huber
Summary: No other project has these in the CMake itself, and they're wildly inconsistent even within the project. These don't really add anything so I think they should be removed.