| Age | Commit message (Collapse) | Author |
|
|
|
Introduced in OpenMP 6.0, the device UID shall be a unique identifier of
a device on a given system. (Not necessarily a UUID.) Since it is not
guaranteed that the (U)UIDs defined by the device vendor libraries, such
as HSA, do not overlap with those of other vendors, the device UIDs in
offload are always combined with the offload plugin name. In case the
vendor library does not specify any device UID for a given device, we
fall back to the offload-internal device ID.
The device UID can be retrieved using the `llvm-offload-device-info`
tool.
|
|
Summary:
This causes issues with CUDA's teardown order when the init is separated
from the total init scope.
|
|
Summary:
The Offloading library wraps around the underlying plugins. The problem
is that we currently initialize all plugins we find, even if they are
not needed for the program. This is very expensive for trivial uses, as
fully heterogenous usage is quite rare. In practice this means that you
will always pay a 200 ms penalty for having CUDA installed.
This patch changes the behavior to provide accessors into the plugins
and devices that allows them to be initialized lazily. We use a
once_flag, this should properly take a fast-path check while still
blocking on concurrent use.
Making full use of this will require a way to filter platforms more
specifically. I'm thinking of what this would look like as an API.
I'm thinking that we either have an extra iterate function that takes a
callback on the platform, or we just provide a helper to find all the
devices that can run a given image. Maybe both?
Fixes: https://github.com/llvm/llvm-project/issues/159636
|
|
Summary:
Need to verify this actually has a device. We really need to rework this
to point to a real impolementation, or streamline it to handle this
automatically.
|
|
Currently, devices store a raw pointer to back to their owning Platform.
Platforms are stored directly inside of a vector. Modifying this vector
risks invalidating all the platform pointers stored in devices.
This patch allocates platforms individually, and changes devices to
store a reference to its platform instead of a pointer. This is safe,
because platforms are guaranteed to outlive the devices they contain.
|
|
|
|
Summary:
The host plugin is basically OpenMP specific and doesn't work very well.
Previously we were skipping over it in the list instead of just not
adding it at all.
|
|
If olMemAlloc happens to allocate memory that was already allocated
elsewhere (possibly by another device on another platform), it is now
thrown away and a new allocation generated.
A new `AllocBases` vector is now available, which is an ordered list
of allocation start addresses.
|
|
Summary:
This exposes the 'isDeviceCompatible' routine for checking if a binary
*can* be loaded. This is useful if people don't want to consume errors
everywhere when figuring out which image to put to what device.
I don't know if this is a good name, I was thining like `olIsCompatible`
or whatever. Let me know what you think.
Long term I'd like to be able to do something similar to what OpenMP
does where we can conditionally only initialize devices if we need them.
That's going to be support needed if we want this to be more
generic.
|
|
Summary:
Currently we have this `__tgt_device_image` indirection which just takes
a reference to some pointers. This was all find and good when the only
usage of this was from a section of GPU code that came from an ELF
constant section. However, we have expanded beyond that and now need to
worry about managing lifetimes. We have code that references the image
even after it was loaded internally. This patch changes the
implementation to instaed copy the memory buffer and manage it locally.
This PR reworks the JIT and other image handling to directly manage its
own memory. We now don't need to duplicate this behavior externally at
the Offload API level. Also we actually free these if the user unloads
them.
Upside, less likely to crash and burn. Downside, more latency when
loading an image.
|
|
This is the total number of work items that the device supports (the
equivalent work group properties are for only a single work group).
|
|
Previously, `olDestroyQueue` would not actually destroy the queue,
instead leaving it for the device to clean up when it was destroyed.
Now, the queue is either released immediately if it is complete or put
into a list of "pending" queues if it is not. Whenever we create a new
queue, we check this list to see if any are now completed. If there are
any we release their resources and use them instead of pulling from
the pool.
This prevents long running programs that create and drop many queues
without syncing them from leaking memory all over the place.
|
|
On my system, this will be "Radeon RX 7900 GRE" rather than "gfx1100". For Nvidia, the product name and device name are identical.
|
|
Implement olMemFill to support filling device memory with arbitrary
length patterns. AMDGPU support will be added in a follow-up PR.
|
|
A simple info query for events that returns whether the event is
complete or not.
|
|
This wasn't handled with the normal info API, so needs special handling.
|
|
Both these functions update an `AllocInfoMap` structure in the context,
however they did not use any locks, causing random failures in threaded
code. Now they use a mutex.
|
|
This is equivalent to `cuOccupancyMaxPotentialBlockSize`. It is
currently
only implemented on Cuda; AMDGPU and Host return unsupported.
---------
Co-authored-by: Callum Fare <callum@codeplay.com>
|
|
Add the following properties in Offload device info:
* VENDOR_ID
* NUM_COMPUTE_UNITS
* [SINGLE|DOUBLE|HALF]_FP_CONFIG
* NATIVE_VECTOR_WIDTH_[CHAR|SHORT|INT|LONG|FLOAT|DOUBLE|HALF]
* MAX_CLOCK_FREQUENCY
* MEMORY_CLOCK_RATE
* ADDRESS_BITS
* MAX_MEM_ALLOC_SIZE
* GLOBAL_MEM_SIZE
Add a bitfield option to enumerators, allowing the values to be
bit-shifted instead of incremented. Generate the per-type enums using
`foreach` to reduce code duplication.
Use macros in unit test definitions to reduce code duplication.
|
|
Add an `olLaunchHostFunction` method that allows enqueueing host work
to the stream.
|
|
list (#153441)
|
|
This sprinkles a few mutexes around the plugin interface so that the
olLaunchKernel CTS test now passes when ran on multiple threads.
Part of this also involved changing the interface for device synchronise
so that it can optionally not free the underlying queue (which
introduced a race condition in liboffload).
|
|
Add a queue query that (if possible) reports whether the queue is empty
|
|
Add a device function to check if a device queue is empty. If liboffload
tries to create an event for an empty queue, we create an "empty" event
that is already complete.
This allows `olCreateEvent`, `olSyncEvent` and `olWaitEvent` to run
quickly for empty queues.
|
|
`MAX_WORK_GROUP_SIZE` now represents the maximum total number of work
groups the device can allocate, rather than the maximum per dimension.
`MAX_WORK_GROUP_SIZE_PER_DIMENSION` has been added, which has the old
behaviour.
|
|
Instead using strings to look up device information (which is brittle
and slow), use the new tags that the plugins specify when building the
nodes.
|
|
Rather than having every "enqueue"-type function have an output pointer
specifically for an output event, just provide an `olCreateEvent`
entrypoint which pushes an event to the queue.
For example, replace:
```cpp
olMemcpy(Queue, ..., EventOut);
```
with
```cpp
olMemcpy(Queue, ...);
olCreateEvent(Queue, EventOut);
```
|
|
This function causes a queue to wait until all the provided events have
completed before running any future scheduled work.
|
|
This more closely matches the nomenclature used by CUDA, AMDGPU and
the plugin interface.
|
|
When creating a new symbol, check that it already exists. If it does,
return that pointer rather than building a new symbol structure.
|
|
This is a hotfix for #148615 - it fixes the issue for me locally.
I think a broader issue is that in the test environment we're calling
olShutDown from a global destructor in the test binaries. We should do
something more controlled, either calling olInit/olShutDown in every
test, or move those to a GTest global environment. I didn't do that
originally because it looked like it needed changes to LLVM's GTest
wrapper.
|
|
Add two new symbol info types for getting the bounds of a global
variable. As well as a number of tests for reading/writing to it.
|
|
This mirrors the similar functions for other handles. The only
implemented info at the moment is the symbol's kind.
|
|
`olGetKernel` has been replaced by `olGetSymbol` which accepts a
`Kind` parameter. As well as loading information about kernels, it
can now also load information about global variables.
|
|
In the future, we want `ol_symbol_handle_t` to represent both kernels
and global variables The first step in this process is a rename and
promotion to a "typed handle".
|
|
Add info queries for queues and events.
`olGetQueueInfo` only supports getting the associated device. We were
already tracking this so we can implement this for free. We will likely
add other queries to it in the future (whether the queue is empty, what
flags it was created with, etc)
`olGetEventInfo` only supports getting the associated queue. This is
another thing we were already storing in the handle. We'll be able to
add other queries in future (the event type, status, etc)
|
|
This adds a new device info query for the maximum workgroup/block size
for each dimension.
|
|
This makes several small changes to how the platform and device info
queries are handled:
* ReturnHelper has been replaced with InfoWriter which is more explicit
in how it is invoked.
* InfoWriter consumes `llvm::Expected` rather than values directly, and
will early exit if it returns an error.
* As a result of the above, `GetInfoString` now correctly returns errors
rather than empty strings.
* The host device now has its own dedicated "getInfo" function rather
than being checked in multiple places.
|
|
`olShutDown` was not properly calling deinit on the platforms, resulting
in random segfaults on AMD devices.
As part of this, `olInit` and `olShutDown` now alloc and free the
offload context rather than it being static. This
allows `olShutDown` to be called within a destructor of a static object
(like the tests do) without having to worry about destructor ordering.
|
|
Rather than creating a new device info tree for each call to
`olGetDeviceInfo`, we instead do it on device initialisation. As well
as improving performance, this fixes a few lifetime issues with returned
strings.
This does unfortunately mean that device information is immutable,
but hopefully that shouldn't be a problem for any queries we want to
implement.
This also meant allowing offload initialization to fail, which it can
now do.
|
|
This allows removal of a specific Image from a Device, rather than
requiring all image data to outlive the device they were created for.
This is required for `ol_program_handle_t`s, which now specify the
lifetime of the buffer used to create the program.
|
|
Previously, if a binary failed to load due to failures when jit
compiling, the function would return success with nullptr. Now it
returns a new plugin error, `COMPILE_FAILURE`.
|
|
Rather than being "stringly typed", store values as a std::variant that
can hold various types. This means that liboffload doesn't have to do
any string parsing for integer/bool device info keys.
|
|
All entry points (except olInit) now check that offload has been
initialized. If not, a new `OL_ERRC_UNINITIALIZED` error is returned.
|
|
Rather than having a number of static local variables, we now use
a single `OffloadContext` struct to store global state. This is
initialised by `olInit`, but is never deleted (de-initialization of
Offload isn't yet implemented).
The error reporting mechanism has not been moved to the struct, since
that's going to cause issues with teardown (error messages must outlive
liboffload).
|
|
Previously, device info was returned as a queue with each element having
a "Level" field indicating its nesting level. This replaces this queue
with a more traditional tree-like structure.
This should not result in a change to the output of
`llvm-offload-device-info`.
|
|
(#143901)
This is a three element x, y, z size_t vector that can be used any place
where a 3D vector is required. This ensures that all vectors across
liboffload are the same and don't require any resizing/reordering
dances.
|
|
|
|
Previously we decided to check in files that we generate with tablegen.
The justification at the time was that it helped reviewers unfamiliar
with `offload-tblgen` see the actual changes to the headers in PRs.
After trying it for a while, it's ended up causing some headaches and is
also not how tablegen is used elsewhere in LLVM.
This changes our use of tablegen to be more conventional. Where
possible, files are still clang-formatted, but this is no longer a hard
requirement. Because `OffloadErrcodes.inc` is shared with libomptarget
it now gets generated in a more appropriate place.
|