summaryrefslogtreecommitdiff
path: root/mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp
AgeCommit message (Collapse)Author
2025-10-17[MLIR][ROCDL] Add math.clampf -> rocdl.fmed3 conversion (#163520)Keshav Vinayak Jha
Added Pattern for lowering `Math::ClampFOp` to `ROCDL::FMED3`. Also added `chipet` option to `MathToRocdl` pass to check for arch support ISA instructions Solves [#15072](https://github.com/llvm/llvm-project/issues/157052) Reapplies https://github.com/llvm/llvm-project/pull/160100 Un-reverts the merged https://github.com/llvm/llvm-project/pull/163259, and fixes the error. --------- Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>
2025-10-14Revert "[MLIR][ROCDL] Add math.clampf -> rocdl.fmed3 conversion" (#163447)Fabian Mora
Reverts llvm/llvm-project#163259. Reverting due to missing link libraries causing failures in shared build bots.
2025-10-14[MLIR][ROCDL] Add math.clampf -> rocdl.fmed3 conversion (#163259)Keshav Vinayak Jha
Added Pattern for lowering `Math::ClampFOp` to `ROCDL::FMED3`. Also added `chipset` option to `MathToRocdl` pass to check for arch support ISA instructions Solves [#15072](https://github.com/llvm/llvm-project/issues/157052) Reapplies https://github.com/llvm/llvm-project/pull/160100 --------- Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>
2025-09-16[MLIR] Apply clang-tidy fixes for llvm-qualified-auto in ↵Mehdi Amini
LowerGpuOpsToROCDLOps.cpp (NFC)
2025-09-10[mlir][gpu] Refactor GpuOpsToROCDLOps pass interface (NFC) (#157402)Pablo Antonio Martinez
This PR deletes the `createLowerGpuOpsToROCDLOpsPass` constructor from the .td file, making the `createConvertGpuOpsToROCDLOps` pass available to users. This has the following effects: 1. `createLowerGpuOpsToROCDLOpsPass` is not available anymore. Instead, `createConvertGpuOpsToROCDLOps` should be used. This makes the interface consistent with ConvertGpuOpsToNVVMOps. 2. To call `createConvertGpuOpsToROCDLOps`, the options must be passed via ConvertGpuOpsToROCDLOpsOptions. This has the side effect of making the `allowed-dialects` option available, which was not accessible via C++ before.
2025-09-08[mlir][gpu] Revert gpu.subgroup_broadcast with any_lane (#157373)Jakub Kuderski
This partially reverts https://github.com/llvm/llvm-project/pull/152808. Post-commit comments revealed that the `any_lane` variant hasn't been fully agreed upon at the time of landing.
2025-08-30[mlir][gpu] Add `subgroup_broadcast` op (#152808)Ivan Butygin
`subgroup_broadcast` allow to broadcast the value from one lane to all lanes in subgroup. Supported modes: * `first_active_lane` - broadcast value from the first active lane in subgroup. * `specific_lane` - broadcast value from the specified lane, lane index must be within subgroup. * `any_lane` - if `src` value is uniform across all the subgroup lanes return it unchanged, otherwise result is poison. This variant essentially an uniformity hint for the compiler, conveying that specific value is uniform across all subgroup lanes. Dropping `any_lane` broadcast should not change the code semantics.
2025-08-24[mlir][amdgpu] Promote gpu.shuffle to amdgpu.permlane_swap (#154933)Tim Gymnich
- promote `gpu.shuffle %src xor {16,32} 64` to `amdgpu.permlane_swap %src {16,32}`
2025-08-13[mlir][ROCDL] Annotate lane ID functions with noundef, ranges (#151396)Krzysztof Drewniak
Now that we have general support for setting argument and result attributes on LLVM intrinsics, extend the definitions of mbcnt.lo and mbcnt.hi to carry such attributes. With that, update the construction of the mbcnt.lo/mbcnt.hi calls used to get the lane ID to be `noundef` (since the lane ID is always defined) and to be annotated with the correct ranges (so that generic LLVM passes can correctly optimized based on the fact that there are never more than 32/64 lanes). (Also, handle a pattern that wasn't using getLaneId() and get rid of a dead argument)
2025-07-22[mlir][NFC] update `Conversion` create APIs (5/n) (#149887)Maksim Levental
See https://github.com/llvm/llvm-project/pull/147168 for more info.
2025-07-04[mlir] Remove unused includes (NFC) (#147101)Kazu Hirata
These are identified by misc-include-cleaner. I've filtered out those that break builds. Also, I'm staying away from llvm-config.h, config.h, and Compiler.h, which likely cause platform- or compiler-specific build failures.
2025-05-19[AMDGPU] Set AS8 address width to 48 bitsAlexander Richardson
Of the 128-bits of buffer descriptor only 48 bits are address bits, so following the discussion on https://discourse.llvm.org/t/clarifiying-the-semantics-of-ptrtoint/83987/54, the logic conclusion is to set the index width to 48 bits instead of the current value of 128. Most of the test changes are mechanical datalayout updates, but there is one actual change: the ptrmask test now uses .i48 instead of .i128 and I had to update SelectionDAGBuilder to correctly extend the mask. Reviewed By: krzysz00 Pull Request: https://github.com/llvm/llvm-project/pull/139419
2025-05-13[mlir][gpu] Pattern to promote `gpu.shuffle` to specialized AMDGPU ops (#137109)Ivan Butygin
Only swizzle promotion for now, may add DPP ops support later.
2025-05-12[mlir][ROCDL] Remove unneeded bf16 expansion in LowerGPUToROCDL (#139603)Krzysztof Drewniak
The umbrella pass fol lowering GPU ops to ROCDL (aka lowering to LLVM + the AMDGPU-specific setup) would call the arith patterns that manually implemented extf and truncf on bfloat because the LLVM AMDGPU backend used to not suppport those operaitons. Since the backend does now support these operations and has for quite some time, remove these patterns from the default lowering flow.
2025-04-25[mlir][gpu] Fix breaking constructor from GPUSubgroupSizeToROCDL (#137439)Stanley Winata
This PR addressed a bug from llvm/llvm-project#137360. which was using GPUSubgroupSizeToROCDL to patterns function that do not have a valid constructor for it. This is causing compilation error below: error: constructor inherited by 'GPUSubgroupSizeOpToROCDL' from base class 'ConvertOpToLLVMPattern<mlir::gpu::SubgroupSizeOp>' is implicitly deleted Signed-off-by: Stanley Winata <stanley.winata@amd.com>
2025-04-25[mlir] Fix a warningKazu Hirata
This patch fixes: mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp:170:5: error: default label in switch which covers all enumeration values [-Werror,-Wcovered-switch-default]
2025-04-25[MLIR][ROCDL] Lower `gpu.subgroup_size` to `wavefrontsize` (#137360)Alan Li
2025-04-25[mlir] added gpu.shuffle mode UP support (#137300)Gaurav Verma
Added support for `gpu.shuffle` mode `UP` Signed-off-by: xintin <gaurav.verma@amd.com>
2025-04-18[mlir] Fix a warningKazu Hirata
This patch fixes: mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp:140:10: error: unused variable 'shflType' [-Werror,-Wunused-variable]
2025-04-18[mlir] GPUToROCDL: Add support for non-i32/f32 shuffle types (#136320)Ivan Butygin
Use recently added repacking utilities to support other datatypes. Also, tighten `gpu.shuffle` verification to reject scalable vectors
2025-04-13[mlir] GPUToROCDL: Fix crashes with unsupported shuffle datatypes (#135504)Ivan Butygin
Calling `getIntOrFloatBitWidth` on non-int/float types (`gpu.shuffle` also accepts vectors) will crash.
2025-02-13[mlir][gpu] GPUToROCDL/NVVM: use generic llvm conversion interface instead ↵Ivan Butygin
of hardcoded conversions. (#124439) Using `ConvertToLLVMPatternInterface` allows to unhardcode specific dialect conversions from passes and, more importantly, allows downstream projects to inject their ops/types translation here by registering corresponding interface. Add `allowed-dialects` option so user can control which dialects can be used to populate conversions.
2025-01-06[mlir][GPU] Add NVVM-specific `cf.assert` lowering (#120431)Matthias Springer
This commit add an NVIDIA-specific lowering of `cf.assert` to to `__assertfail`. Note: `getUniqueFormatGlobalName`, `getOrCreateFormatStringConstant` and `getOrDefineFunction` are moved to `GPUOpsLowering.h`, so that they can be reused.
2024-12-31[mlir][nfc] GpuToROCDL: Remove some dead code (#121403)Ivan Butygin
2024-12-31Revert "[mlir][nfc] GpuToROCDL: Remove some dead code" (#121402)Ivan Butygin
Reverts llvm/llvm-project#121395
2024-12-31[mlir][nfc] GpuToROCDL: Remove some dead code (#121395)Ivan Butygin
2024-12-20[mlir] Enable decoupling two kinds of greedy behavior. (#104649)Jacques Pienaar
The greedy rewriter is used in many different flows and it has a lot of convenience (work list management, debugging actions, tracing, etc). But it combines two kinds of greedy behavior 1) how ops are matched, 2) folding wherever it can. These are independent forms of greedy and leads to inefficiency. E.g., cases where one need to create different phases in lowering and is required to applying patterns in specific order split across different passes. Using the driver one ends up needlessly retrying folding/having multiple rounds of folding attempts, where one final run would have sufficed. Of course folks can locally avoid this behavior by just building their own, but this is also a common requested feature that folks keep on working around locally in suboptimal ways. For downstream users, there should be no behavioral change. Updating from the deprecated should just be a find and replace (e.g., `find ./ -type f -exec sed -i 's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} \;` variety) as the API arguments hasn't changed between the two.
2024-11-20[MLIR][AMDGPU] Support gpu::ShuffleMode::DOWN lowering in ROCDL (#106237)Dragan Mladjenovic
2024-10-05[mlir][NFC] Mark type converter in `populate...` functions as `const` (#111250)Matthias Springer
This commit marks the type converter in `populate...` functions as `const`. This is useful for debugging. Patterns already take a `const` type converter. However, some `populate...` functions do not only add new patterns, but also add additional type conversion rules. That makes it difficult to find the place where a type conversion was added in the code base. With this change, all `populate...` functions that only populate pattern now have a `const` type converter. Programmers can then conclude from the function signature that these functions do not register any new type conversion rules. Also some minor cleanups around the 1:N dialect conversion infrastructure, which did not always pass the type converter as a `const` object internally.
2024-09-23[mlir][AMDGPU] Add support for AMD f16 math library calls (#108809)Daniel Hernandez-Juarez
In this PR we add support for AMD f16 math library calls (`__ocml_*_f16`) CC: @krzysz00 @manupak
2024-09-12[MLIR][ROCDL] Add dynamically legal ops to LowerGpuOpsToROCDLOpsPass (#108302)Nirvedh Meshram
Similar to https://github.com/llvm/llvm-project/pull/108266 After https://github.com/llvm/llvm-project/pull/102971 It is legal to generate `LLVM::ExpOp` and `LLVM::LogOp` if the type is is a float16 or float32
2024-09-11Update legalizations for LowerGpuOpsToROCDLOps (#108266)Nirvedh Meshram
LLVM::FAbsOp and LLVM::SqrtOp are legal after https://github.com/llvm/llvm-project/pull/102971
2024-08-13[mlir][GPU] Improve `gpu.module` op implementation (#102866)Matthias Springer
- Replace hand-written parser/printer with auto-generated assembly format. - Remove implicit `gpu.module_end` terminator and use the `NoTerminator` trait instead. (Same as `builtin.module`.) - Turn the region into a graph region. (Same as `builtin.module`.)
2024-08-09[MLIR][GPU-LLVM] Convert `gpu.func` to `llvm.func` (#101664)Victor Perez
Add support in `-convert-gpu-to-llvm-spv` to convert `gpu.func` to `llvm.func` operations. - `spir_kernel`/`spir_func` calling conventions used for kernels/functions. - `workgroup` attributions encoded as additional `llvm.ptr<3>` arguments. - No attribute used to annotate kernels - `reqd_work_group_size` attribute using to encode `gpu.known_block_size`. - `llvm.mlir.workgroup_attrib_size` used to encode workgroup attribution sizes. This will be attached to the pointer argument workgroup attributions lower to. **Note**: A notable missing feature that will be addressed in a follow-up PR is a `-use-bare-ptr-memref-call-conv` option to replace MemRef arguments with bare pointers to the MemRef element types instead of the current MemRef descriptor approach. --------- Signed-off-by: Victor Perez <victor.perez@codeplay.com>
2024-07-17[MLIR][ROCDL] Refactor conversion of math operations to ROCDL calls to a ↵Jan Leyonberg
separate pass (#98653) This patch refactors the conversion of math operations to ROCDL library calls. This pass will also be used in flang to lower Fortran intrinsics/math functions for OpenMP target offloading codgen.
2024-06-17[mlir][GPU] Improve handling of GPU bounds (#95166)Krzysztof Drewniak
This change reworks how range information for GPU dispatch IDs (block IDs, thread IDs, and so on) is handled. 1. `known_block_size` and `known_grid_size` become inherent attributes of GPU functions. This makes them less clunky to work with. As a consequence, the `gpu.func` lowering patterns now only look at the inherent attributes when setting target-specific attributes on the `llvm.func` that they lower to. 2. At the same time, `gpu.known_block_size` and `gpu.known_grid_size` are made official dialect-level discardable attributes which can be placed on arbitrary functions. This allows for progressive lowerings (without this, a lowering for `gpu.thread_id` couldn't know about the bounds if it had already been moved from a `gpu.func` to an `llvm.func`) and allows for range information to be provided even when `gpu.*_{id,dim}` are being used outside of a `gpu.func` context. 3. All of these index operations have gained an optional `upper_bound` attribute, allowing for an alternate mode of operation where the bounds are specified locally and not inherited from the operation's context. These also allow handling of cases where the precise launch sizes aren't known, but can be bounded more precisely than the maximum of what any platform's API allows. (I'd like to thank @benvanik for pointing out that this could be useful.) When inferring bounds (either for range inference or for setting `range` during lowering) these sources of information are consulted in order of specificity (`upper_bound` > inherent attribute > discardable attribute, except that dimension sizes check for `known_*_bounds` to see if they can be constant-folded before checking their `upper_bound`). This patch also updates the documentation about the bounds and inference behavior to clarify what these attributes do when set and the consequences of setting them up incorrectly. --------- Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
2024-05-28[mlir][ROCDL] Update the LLVM data layout for ROCDL lowering. (#92127)stefankoncarevic
This change updates the dataLayout string to ensure alignment with the latest LLVM TargetMachine configuration. The aim is to maintain consistency and prevent potential compilation issues related to memory address space handling.
2024-02-27[mlir][ROCDL] Set the LLVM data layout when lowering to ROCDL LLVM (#74501)Krzysztof Drewniak
In order to ensure operations lower correctly (especially memref.addrspacecast, which relies on the data layout benig set correctly then dealing with dynamic memrefs) and to prevent compilation issues later down the line, set the `llvm.data_layout` attribute on GPU modules when lowering their contents to a ROCDL / AMDGPU target. If there's a good way to test the embedded string to prevent it from going out of sync with the LLVM TargetMachine, I'd appreciate hearing about it. (Or, alternatively, if there's a place I could farctor the string out to).
2024-02-19[MLIR] Add ODS support for generating helpers for dialect (discardable) ↵Mehdi Amini
attributes (#77024) This is a new ODS feature that allows dialects to define a list of key/value pair representing an attribute type and a name. This will generate helper classes on the dialect to be able to manage discardable attributes on operations in a type safe way. For example the `test` dialect can define: ``` let discardableAttrs = (ins "mlir::IntegerAttr":$discardable_attr_key, ); ``` And the following will be generated in the TestDialect class: ``` /// Helper to manage the discardable attribute `discardable_attr_key`. class DiscardableAttrKeyAttrHelper { ::mlir::StringAttr name; public: static constexpr ::llvm::StringLiteral getNameStr() { return "test.discardable_attr_key"; } constexpr ::mlir::StringAttr getName() { return name; } DiscardableAttrKeyAttrHelper(::mlir::MLIRContext *ctx) : name(::mlir::StringAttr::get(ctx, getNameStr())) {} mlir::IntegerAttr getAttr(::mlir::Operation *op) { return op->getAttrOfType<mlir::IntegerAttr>(name); } void setAttr(::mlir::Operation *op, mlir::IntegerAttr val) { op->setAttr(name, val); } bool isAttrPresent(::mlir::Operation *op) { return op->hasAttrOfType<mlir::IntegerAttr>(name); } void removeAttr(::mlir::Operation *op) { assert(op->hasAttrOfType<mlir::IntegerAttr>(name)); op->removeAttr(name); } }; DiscardableAttrKeyAttrHelper getDiscardableAttrKeyAttrHelper() { return discardableAttrKeyAttrName; } ``` User code having an instance of the TestDialect can then manipulate this attribute on operation using: ``` auto helper = testDialect.getDiscardableAttrKeyAttrHelper(); helper.setAttr(op, value); helper.isAttrPresent(op); ... ```
2024-01-31[mlir] Use `create` instead of `createOrFold` for ConstantOp as folding has ↵Hugo Trachino
no effect (NFC) (#80129) This aims to clean-up confusing uses of builder.createOrFold<ConstantOp> since folding of constants fails.
2023-12-05[mlir][gpu] Add lowering dynamic_shared_memory op for rocdl (#74473)Guray Ozen
This PR adds lowering of `gpu.dynamic_shared_memory` to rocdl target.
2023-11-01[MLIR][GPUToROCDL] Remove typed pointer support (#70908)Christian Ulmann
This commit removes the support for lowering GPU to ROCDL dialect with typed pointers. Typed pointers have been deprecated for a while now and it's planned to soon remove them from the LLVM dialect. Related PSA: https://discourse.llvm.org/t/psa-removal-of-typed-pointers-from-the-llvm-dialect/74502
2023-09-04[mlir][GPUToROCDL] Lower arith.remf to GPU intrinsic.Adrian Kuegel
Differential Revision: https://reviews.llvm.org/D159423
2023-08-24[mlir][ROCM] Add Wave/Warp shuffle lowering and op for ROCM.Stanley Winata
Reduction is heavily used for many DL workload especially with softmax/Attention layers. Wave/Warp shuffle and reduction is known to be a speedy/efficient way to do these reductions. In this patch we introduce AMD shuffle intrinsic Ops to ROCDL, along with it's corresponding lowering from gpu.shuffle. This should speed up a lot of DL workloads on ROCM backend. Currently, we have support for xor and idx, which are the more common ones. In the future, we plan on adding support for Down and Up, as well as using the ds_swizzle to further enhance it's performance when width and offsets are constant. Reviewed By: antiagainst Differential Revision: https://reviews.llvm.org/D158684
2023-08-17[mlir][AMDGPU] Improve BF16 handling through AMDGPU compilationKrzysztof Drewniak
Many previous sets of AMDGPU dialect code have been incorrect in the presence of the bf16 type (when lowered to LLVM's bfloat) as they were developed in a setting that run a custom bf16-to-i16 pass before LLVM lowering. An overall effect of this patch is that you should run --arith-emulate-unsupported-floats="source-types=bf16 target-type=f32" on your GPU module before calling --convert-gpu-to-rocdl if your code performs bf16 arithmetic. While LLVM now supports software bfloat, initial experiments showed that using this support on AMDGPU inserted a large number of conversions around loads and stores which had substantial performance imparts. Furthermore, all of the native AMDGPU operations on bf16 types (like the WMMA operations) operate on 16-bit integers instead of the bfloat type. First, we make the following changes to preserve compatibility once the LLVM bfloat type is reenabled. 1. The matrix multiplication operations (MFMA and WMMA) will bitcast bfloat vectors to i16 vectors. 2. Buffer loads and stores will operate on the relevant integer datatype and then cast to bfloat if needed. Second, we add type conversions to convert bf16 and vectors of it to equivalent i16 types. Third, we add the bfloat <-> f32 expansion patterns to the set of operations run before the main LLVM conversion so that MLIR's implementation of these conversion routines is used. Finally, we extend the "floats treated as integers" support in the LLVM exporter to handle types other than fp8. We also fix a bug in the unsupported floats emulation where it tried to operate on `arith.bitcast` due to an oversight. Reviewed By: rsuderman Differential Revision: https://reviews.llvm.org/D156361
2023-08-10[mlir][transform] Enable gpu-to-nvvm via conversion patterns driven by TDNicolas Vasilache
This revision untangles a few more conversion pieces and allows rewriting the relatively intricate (and somewhat inconsistent) LowerGpuOpsToNVVMOpsPass in a declarative fashion that provides a much better understanding and control. Differential Revision: https://reviews.llvm.org/D157617
2023-07-26[MLIR][ROCDL] Add conversion for gpu.lane_id to ROCDLSJW
Creates rocdl.lane_id op with llvm conversion to: __device__ static unsigned int __lane_id() { return __builtin_amdgcn_mbcnt_hi( -1, __builtin_amdgcn_mbcnt_lo(-1, 0)); } Reviewed By: krzysz00 Differential Revision: https://reviews.llvm.org/D154666
2023-05-15[MLIR][ROCDL] add gpu to rocdl erf supportManupa Karunaratne
This commit adds lowering of lib func call to support erf in rocdl. Reviewed By: ThomasRaoux Differential Revision: https://reviews.llvm.org/D150355
2023-05-12[mlir] Move casting calls from methods to function callsTres Popp
The MLIR classes Type/Attribute/Operation/Op/Value support cast/dyn_cast/isa/dyn_cast_or_null functionality through llvm's doCast functionality in addition to defining methods with the same name. This change begins the migration of uses of the method to the corresponding function call as has been decided as more consistent. Note that there still exist classes that only define methods directly, such as AffineExpr, and this does not include work currently to support a functional cast/isa call. Caveats include: - This clang-tidy script probably has more problems. - This only touches C++ code, so nothing that is being generated. Context: - https://mlir.llvm.org/deprecation/ at "Use the free function variants for dyn_cast/cast/isa/…" - Original discussion at https://discourse.llvm.org/t/preferred-casting-style-going-forward/68443 Implementation: This first patch was created with the following steps. The intention is to only do automated changes at first, so I waste less time if it's reverted, and so the first mass change is more clear as an example to other teams that will need to follow similar steps. Steps are described per line, as comments are removed by git: 0. Retrieve the change from the following to build clang-tidy with an additional check: https://github.com/llvm/llvm-project/compare/main...tpopp:llvm-project:tidy-cast-check 1. Build clang-tidy 2. Run clang-tidy over your entire codebase while disabling all checks and enabling the one relevant one. Run on all header files also. 3. Delete .inc files that were also modified, so the next build rebuilds them to a pure state. 4. Some changes have been deleted for the following reasons: - Some files had a variable also named cast - Some files had not included a header file that defines the cast functions - Some files are definitions of the classes that have the casting methods, so the code still refers to the method instead of the function without adding a prefix or removing the method declaration at the same time. ``` ninja -C $BUILD_DIR clang-tidy run-clang-tidy -clang-tidy-binary=$BUILD_DIR/bin/clang-tidy -checks='-*,misc-cast-functions'\ -header-filter=mlir/ mlir/* -fix rm -rf $BUILD_DIR/tools/mlir/**/*.inc git restore mlir/lib/IR mlir/lib/Dialect/DLTI/DLTI.cpp\ mlir/lib/Dialect/Complex/IR/ComplexDialect.cpp\ mlir/lib/**/IR/\ mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp\ mlir/lib/Dialect/Vector/Transforms/LowerVectorMultiReduction.cpp\ mlir/test/lib/Dialect/Test/TestTypes.cpp\ mlir/test/lib/Dialect/Transform/TestTransformDialectExtension.cpp\ mlir/test/lib/Dialect/Test/TestAttributes.cpp\ mlir/unittests/TableGen/EnumsGenTest.cpp\ mlir/test/python/lib/PythonTestCAPI.cpp\ mlir/include/mlir/IR/ ``` Differential Revision: https://reviews.llvm.org/D150123
2023-02-21[mlir][GPUToLLVM] Add support for emitting opaque pointersMarkus Böck
Part of https://discourse.llvm.org/t/rfc-switching-the-llvm-dialect-and-dialect-lowerings-to-opaque-pointers/68179 This patch adds the new pass option `use-opaque-pointers` to the GPU to LLVM lowerings (including ROCD and NVVM) and adapts the code to support using opaque pointers in addition to typed pointers. The required changes mostly boil down to avoiding `getElementType` and specifying base types in GEP and Alloca. In the future opaque pointers will be the only supported model, hence tests have been ported to using opaque pointers by default. Additional regression tests for typed-pointers have been added to avoid breaking existing clients. Note: This does not yet port the `GpuToVulkan` passes. Differential Revision: https://reviews.llvm.org/D144448