llvm-project.git/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h, branch users/mingmingl-llvm/samplefdo-profile-format

[AMDGPU] Ensure positive InstOffset for buffer operations (#145504)

2025-09-04T13:37:46+00:00

GFX12+ buffer ops require positive InstOffset per AMD hardware spec.
Modified assembler/disassembler to reject negative buffer offsets.

[LLVM][MC][DecoderEmitter] Add support to specialize decoder per bitwidth (#154865)

2025-09-01T20:44:18+00:00

This change adds an option to specialize decoders per bitwidth, which
can help reduce the (compiled) code size of the decoder code.

**Current state**:
Currently, the code generated by the decoder emitter consists of two key
functions: `decodeInstruction` which is the entry point into the
generated code and `decodeToMCInst` which is invoked when a decode op is
reached while traversing through the decoder table. Both functions are
templated on `InsnType` which is the raw instruction bits that are
supplied to `decodeInstruction`.

Several backends call `decodeInstruction` with different `InsnType`
types, leading to several template instantiations of these functions in
the final code. As an example, AMDGPU instantiates this function with
type `DecoderUInt128` for decoding 96/128-bit instructions,
`uint64_t` for decoding 64-bit instructions, and `uint32_t` for decoding
32-bit instructions. Since there is just one `decodeToMCInst` in the
generated code, it has code that handles decoding for *all* instruction
sizes. However, the decoders emitted for different instruction sizes
rarely have any intersection with each other. That means, in the AMDGPU
case, the instantiation with InsnType == DecoderUInt128 has decoder code
for 32/64-bit instructions that is *never exercised*. Conversely, the
instantiation with InsnType == uint64_t has decoder code for
128/96/32-bit instructions that is never exercised. This leads to
unnecessary dead code in the generated disassembler binary (that the
compiler cannot eliminate by itself).

**New state**:
With this change, we introduce an option
`specialize-decoders-per-bitwidth`. Under this mode, the DecoderEmitter
will generate several versions of the `decodeToMCInst` function, one for
each bitwidth. The code is still templated, but backends must specify,
for each `InsnType` used, the bitwidth of the instructions that type
represents, using a type trait `InsnBitWidth`. This enables the
templated code to choose the right variant of `decodeToMCInst`. Under
this mode, a given instantiation pulls in only a single variant of
`decodeToMCInst`, and that variant includes only the decoders applicable
to a single bitwidth. This eliminates the code duplicated across
instantiations and reduces code size.
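
As a rough illustration of the trait-based selection described above, the
sketch below maps each instruction type to a bitwidth and dispatches to one
variant per width. It is not the code the DecoderEmitter generates:
`Wide128`, `WidthTag`, and `decodeVariant` are hypothetical names, and the
variants only return their own width so the dispatch is observable.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical 128-bit instruction word, standing in for a type such as
// AMDGPU's DecoderUInt128.
struct Wide128 {
  uint64_t Lo = 0;
  uint64_t Hi = 0;
};

// Trait mapping an instruction type to the bitwidth it represents.
// The default of 0 marks a type with no bitwidth specified.
template <typename InsnType> constexpr uint32_t InsnBitWidth = 0;
template <> constexpr uint32_t InsnBitWidth<uint32_t> = 32;
template <> constexpr uint32_t InsnBitWidth<uint64_t> = 64;
template <> constexpr uint32_t InsnBitWidth<Wide128> = 128;

template <uint32_t Width>
using WidthTag = std::integral_constant<uint32_t, Width>;

// One (hypothetical) decoder variant per bitwidth. A real variant would
// contain only the decoders for instructions of that width.
inline uint32_t decodeVariant(WidthTag<32>, unsigned /*DecoderIdx*/) { return 32; }
inline uint32_t decodeVariant(WidthTag<64>, unsigned /*DecoderIdx*/) { return 64; }
inline uint32_t decodeVariant(WidthTag<128>, unsigned /*DecoderIdx*/) { return 128; }

// Templated entry point: the trait picks exactly one variant, so an
// instantiation never references the decoders of the other widths.
template <typename InsnType>
uint32_t decodeToMCInst(unsigned DecoderIdx, const InsnType & /*Insn*/) {
  static_assert(InsnBitWidth<InsnType> != 0,
                "InsnBitWidth must be specified for this InsnType");
  return decodeVariant(WidthTag<InsnBitWidth<InsnType>>{}, DecoderIdx);
}
```

Calling `decodeToMCInst` with a `uint64_t` in this sketch only references
the 64-bit variant, which is the property that lets the compiler drop the
decoders for other widths from that instantiation.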

Additionally, under this mode, decoders are uniqued only within a given
bitwidth (as opposed to across all bitwidths without this option), so
the decoder index values assigned are smaller and consume fewer bytes in
their ULEB128 encoding. As a result, the generated decoder tables can
also shrink.
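
To make the ULEB128 point concrete: each encoded byte carries 7 payload
bits, so an index below 128 takes one byte while an index of 128 or more
takes at least two. The helper below is a minimal sketch written for this
note (`uleb128Size` is a hypothetical name, not a call into LLVM's own
LEB128 support).

```cpp
#include <cstdint>

// Number of bytes needed to encode Value as ULEB128: one byte per 7 bits
// of payload, with at least one byte even for zero.
inline unsigned uleb128Size(uint64_t Value) {
  unsigned Size = 0;
  do {
    Value >>= 7;
    ++Size;
  } while (Value != 0);
  return Size;
}
```

For example, if uniquing per bitwidth moves a decoder index from 300 down
to 90, its encoding shrinks from two bytes to one at every table entry
that references it.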

Adopt this feature for the AMDGPU and RISCV backends. In a release build,
this results in a net 55% reduction in the .text size of
libLLVMAMDGPUDisassembler.so and a 5% reduction in the .rodata size. For
RISCV, which today uses a single `uint64_t` type, this results in a 3.7%
increase in code size (expected as we instantiate the code 3 times now).

Actual measured sizes are as follows:
```
Baseline commit: 72c04bb882ad70230bce309c3013d9cc2c99e9a7
Configuration: Ubuntu clang version 18.1.3, release build with asserts disabled.
 
AMDGPU:       Before       After      Change
======================================================
.text         612327       275607     55% reduction
.rodata       369728       351336      5% reduction

RISCV:        Before       After      Change
======================================================
.text          47407        49187     3.7% increase
.rodata        35768        35839     0.1% increase
```

AMDGPU: Support v_wmma_f32_16x16x128_f8f6f4 on gfx1250 (#149684)

2025-07-21T17:09:42+00:00

Co-authored-by: Stanislav Mekhanoshin

[AMDGPU] MC support for v_fmaak_f64/v_fmamk_f64 gfx1250 instructions (#148282)

2025-07-11T21:17:03+00:00

[AMDGPU] gfx1250: MC support for 64-bit literals (#147861)

2025-07-10T05:25:47+00:00

[NFC][TableGen] Change DecoderEmitter `insertBits` to use integer types only (#147613)

2025-07-09T15:56:07+00:00

The `insertBits` templated function generated by DecoderEmitter is
called with variable `tmp` of type `TmpType` which is:

```
using TmpType = std::conditional_t<std::is_integral<InsnType>::value, InsnType, uint64_t>;
```

That is, `TmpType` is always an integral type. Change the generated
`insertBits` to be valid only for integer types, and eliminate the
unused `insertBits` function from `DecoderUInt128` in
AMDGPUDisassembler.h.

Additionally, drop some of the requirements `InsnType` must support as
they no longer seem to be required.
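
A minimal sketch of what an `insertBits` restricted to integer types could
look like is shown below. This is an illustration under assumptions, not
the exact code DecoderEmitter emits: the parameter names and the masking
of `bits` to `numBits` are choices made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Participates in overload resolution only when IntType is an integral
// type, so it can no longer be instantiated with a class type such as
// DecoderUInt128.
template <typename IntType>
std::enable_if_t<std::is_integral<IntType>::value, void>
insertBits(IntType &field, IntType bits, unsigned startBit, unsigned numBits) {
  constexpr unsigned TotalBits = sizeof(IntType) * 8;
  assert(numBits > 0 && startBit + numBits <= TotalBits);
  using U = std::make_unsigned_t<IntType>;
  // Build a mask of the low numBits bits without shifting by the full width.
  const U mask = numBits == TotalBits ? static_cast<U>(~U(0))
                                      : static_cast<U>((U(1) << numBits) - 1);
  // OR the masked value into the field at the requested position.
  field |= static_cast<IntType>(
      static_cast<U>(static_cast<U>(bits) & mask) << startBit);
}
```

With `uint32_t field = 0`, calling `insertBits<uint32_t>(field, 0x5, 4, 3)`
ORs the three low bits of `0x5` into bits 4 through 6 of `field`.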

[AMDGPU] Rename call instructions from b64 to i64 (#145103)

2025-06-22T04:42:09+00:00

These are renamed from B64 to I64 in gfx1250 and later:

  S_CALL_I64
  S_GET_PC_I64
  S_RFE_I64
  S_SET_PC_I64
  S_SWAP_PC_I64

[AMDGPU][NFC] Remove _DEFERRED operands. (#139123)

2025-05-09T09:10:53+00:00

All immediates are deferred now.

[AMDGPU][NFC] Get rid of OPW constants. (#139074)

2025-05-08T17:42:07+00:00

We can infer the widths from register classes and represent them as
numbers.

[AMDGPU][Disassembler][NFCI] Always defer immediate operands. (#138885)

2025-05-08T10:43:50+00:00

Removes the need to parameterise decoders with OperandSemantics,
ImmWidth and MandatoryLiteral.

Likely allows further simplification of handling _DEFERRED immediates.

Tested to work downstream.