<feed xmlns='http://www.w3.org/2005/Atom'>
<title>llvm-project.git/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h, branch users/mingmingl-llvm/samplefdo-profile-format</title>
<subtitle>Unnamed repository; edit this file 'description' to name the repository.
</subtitle>
<link rel='alternate' type='text/html' href='https://git.belthelziquor.com/llvm-project.git/'/>
<entry>
<title>[AMDGPU] Ensure positive InstOffset for buffer operations (#145504)</title>
<updated>2025-09-04T13:37:46+00:00</updated>
<author>
<name>Aleksandar Spasojevic</name>
<email>aleksandar.spasojevic@amd.com</email>
</author>
<published>2025-09-04T13:37:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.belthelziquor.com/llvm-project.git/commit/?id=1b47135c9da92a8de3ded888f709081ff599ce03'/>
<id>1b47135c9da92a8de3ded888f709081ff599ce03</id>
<content type='text'>
GFX12+ buffer ops require positive InstOffset per AMD hardware spec.
Modified assembler/disassembler to reject negative buffer offsets.</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
GFX12+ buffer ops require positive InstOffset per AMD hardware spec.
Modified assembler/disassembler to reject negative buffer offsets.</pre>
</div>
</content>
</entry>
<entry>
<title>[LLVM][MC][DecoderEmitter] Add support to specialize decoder per bitwidth (#154865)</title>
<updated>2025-09-01T20:44:18+00:00</updated>
<author>
<name>Rahul Joshi</name>
<email>rjoshi@nvidia.com</email>
</author>
<published>2025-09-01T20:44:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.belthelziquor.com/llvm-project.git/commit/?id=dafffe262d6d1114fa83ec155241aad4e7793845'/>
<id>dafffe262d6d1114fa83ec155241aad4e7793845</id>
<content type='text'>
This change adds an option to specialize decoders per bitwidth, which
can help reduce the (compiled) code size of the decoder code.

**Current state**:
Currently, the code generated by the decoder emitter consists of two key
functions: `decodeInstruction` which is the entry point into the
generated code and `decodeToMCInst` which is invoked when a decode op is
reached while traversing through the decoder table. Both functions are
templated on `InsnType` which is the raw instruction bits that are
supplied to `decodeInstruction`.

Several backends call `decodeInstruction` with different `InsnType`
types, leading to several template instantiations of these functions in
the final code. As an example, AMDGPU instantiates this function with
type `DecoderUInt128` type for decoding 96/128-bit instructions,
`uint64_t` for decoding 64-bit instructions, and `uint32_t` for decoding
32-bit instructions. Since there is just one `decodeToMCInst` in the
generated code, it has code that handles decoding for *all* instruction
sizes. However, the decoders emitted for different instructions sizes
rarely have any intersection with each other. That means, in the AMDGPU
case, the instantiation with InsnType == DecoderUInt128 has decoder code
for 32/64-bit instructions that is *never exercised*. Conversely, the
instantiation with InsnType == uint64_t has decoder code for
128/96/32-bit instructions that is never exercised. This leads to
unnecessary dead code in the generated disassembler binary (that the
compiler cannot eliminate by itself).

**New state**:
With this change, we introduce an option
`specialize-decoders-per-bitwidth`. Under this mode, the DecoderEmitter
will generate several versions of `decodeToMCInst` function, one for
each bitwidth. The code is still templated, but will require backends to
specify, for each `InsnType` used, the bitwidth of the instruction that
the type is used to represent using a type-trait `InsnBitWidth`. This
will enable the templated code to choose the right variant of
`decodeToMCInst`. Under this mode, a particular instantiation will only
end up instantiating a single variant of `decodeToMCInst` generated and
that will include only those decoders that are applicable to a single
bitwidth, resulting in elimination of the code duplication through
instantiation and a reduction in code size.

Additionally, under this mode, decoders are uniqued only within a given
bitwidth (as opposed to across all bitwidths without this option), so
the decoder index values assigned are smaller, and consume less bytes in
their ULEB128 encoding. As a result, the generated decoder tables can
also reduce in size.

Adopt this feature for the AMDGPU and RISCV backend. In a release build,
this results in a net 55% reduction in the .text size of
libLLVMAMDGPUDisassembler.so and a 5% reduction in the .rodata size. For
RISCV, which today uses a single `uint64_t` type, this results in a 3.7%
increase in code size (expected as we instantiate the code 3 times now).

Actual measured sizes are as follows:
```
Baseline commit: 72c04bb882ad70230bce309c3013d9cc2c99e9a7
Configuration: Ubuntu clang version 18.1.3, release build with asserts disabled.
 
AMDGPU        Before       After      Change
======================================================
.text         612327       275607     55% reduction
.rodata       369728       351336      5% reduction          

RISCV:
======================================================
.text          47407       49187      3.7% increase   
.rodata        35768       35839      0.1% increase
```</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This change adds an option to specialize decoders per bitwidth, which
can help reduce the (compiled) code size of the decoder code.

**Current state**:
Currently, the code generated by the decoder emitter consists of two key
functions: `decodeInstruction` which is the entry point into the
generated code and `decodeToMCInst` which is invoked when a decode op is
reached while traversing through the decoder table. Both functions are
templated on `InsnType` which is the raw instruction bits that are
supplied to `decodeInstruction`.

Several backends call `decodeInstruction` with different `InsnType`
types, leading to several template instantiations of these functions in
the final code. As an example, AMDGPU instantiates this function with
type `DecoderUInt128` type for decoding 96/128-bit instructions,
`uint64_t` for decoding 64-bit instructions, and `uint32_t` for decoding
32-bit instructions. Since there is just one `decodeToMCInst` in the
generated code, it has code that handles decoding for *all* instruction
sizes. However, the decoders emitted for different instructions sizes
rarely have any intersection with each other. That means, in the AMDGPU
case, the instantiation with InsnType == DecoderUInt128 has decoder code
for 32/64-bit instructions that is *never exercised*. Conversely, the
instantiation with InsnType == uint64_t has decoder code for
128/96/32-bit instructions that is never exercised. This leads to
unnecessary dead code in the generated disassembler binary (that the
compiler cannot eliminate by itself).

**New state**:
With this change, we introduce an option
`specialize-decoders-per-bitwidth`. Under this mode, the DecoderEmitter
will generate several versions of `decodeToMCInst` function, one for
each bitwidth. The code is still templated, but will require backends to
specify, for each `InsnType` used, the bitwidth of the instruction that
the type is used to represent using a type-trait `InsnBitWidth`. This
will enable the templated code to choose the right variant of
`decodeToMCInst`. Under this mode, a particular instantiation will only
end up instantiating a single variant of `decodeToMCInst` generated and
that will include only those decoders that are applicable to a single
bitwidth, resulting in elimination of the code duplication through
instantiation and a reduction in code size.

Additionally, under this mode, decoders are uniqued only within a given
bitwidth (as opposed to across all bitwidths without this option), so
the decoder index values assigned are smaller, and consume less bytes in
their ULEB128 encoding. As a result, the generated decoder tables can
also reduce in size.

Adopt this feature for the AMDGPU and RISCV backend. In a release build,
this results in a net 55% reduction in the .text size of
libLLVMAMDGPUDisassembler.so and a 5% reduction in the .rodata size. For
RISCV, which today uses a single `uint64_t` type, this results in a 3.7%
increase in code size (expected as we instantiate the code 3 times now).

Actual measured sizes are as follows:
```
Baseline commit: 72c04bb882ad70230bce309c3013d9cc2c99e9a7
Configuration: Ubuntu clang version 18.1.3, release build with asserts disabled.
 
AMDGPU        Before       After      Change
======================================================
.text         612327       275607     55% reduction
.rodata       369728       351336      5% reduction          

RISCV:
======================================================
.text          47407       49187      3.7% increase   
.rodata        35768       35839      0.1% increase
```</pre>
</div>
</content>
</entry>
<entry>
<title>AMDGPU: Support v_wmma_f32_16x16x128_f8f6f4 on gfx1250 (#149684)</title>
<updated>2025-07-21T17:09:42+00:00</updated>
<author>
<name>Changpeng Fang</name>
<email>changpeng.fang@amd.com</email>
</author>
<published>2025-07-21T17:09:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.belthelziquor.com/llvm-project.git/commit/?id=d6094370cb3f5ed24249800c42632e453d4ada3f'/>
<id>d6094370cb3f5ed24249800c42632e453d4ada3f</id>
<content type='text'>
Co-authored-by: Stanislav Mekhanoshin &lt;Stanislav.Mekhanoshin@amd.com&gt;</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Co-authored-by: Stanislav Mekhanoshin &lt;Stanislav.Mekhanoshin@amd.com&gt;</pre>
</div>
</content>
</entry>
<entry>
<title>[AMDGPU] MC support for v_fmaak_f64/v_fmamk_f64 gfx1250 intructions (#148282)</title>
<updated>2025-07-11T21:17:03+00:00</updated>
<author>
<name>Stanislav Mekhanoshin</name>
<email>rampitec@users.noreply.github.com</email>
</author>
<published>2025-07-11T21:17:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.belthelziquor.com/llvm-project.git/commit/?id=f09055435965fb084afb51edf09dfc4d0cfcce4f'/>
<id>f09055435965fb084afb51edf09dfc4d0cfcce4f</id>
<content type='text'>
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
</pre>
</div>
</content>
</entry>
<entry>
<title>[AMDGPU] gfx1250: MC support for 64-bit literals (#147861)</title>
<updated>2025-07-10T05:25:47+00:00</updated>
<author>
<name>Stanislav Mekhanoshin</name>
<email>rampitec@users.noreply.github.com</email>
</author>
<published>2025-07-10T05:25:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.belthelziquor.com/llvm-project.git/commit/?id=00a85e57049ee637a6089b2c696d5e37db8cd47b'/>
<id>00a85e57049ee637a6089b2c696d5e37db8cd47b</id>
<content type='text'>
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
</pre>
</div>
</content>
</entry>
<entry>
<title>[NFC][TableGen] Change DecoderEmitter `insertBits` to use integer types only (#147613)</title>
<updated>2025-07-09T15:56:07+00:00</updated>
<author>
<name>Rahul Joshi</name>
<email>rjoshi@nvidia.com</email>
</author>
<published>2025-07-09T15:56:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.belthelziquor.com/llvm-project.git/commit/?id=23b4f4eb9b15e0c3a8d86d9c0857075afcfc7fe3'/>
<id>23b4f4eb9b15e0c3a8d86d9c0857075afcfc7fe3</id>
<content type='text'>
The `insertBits` templated function generated by DecoderEmitter is
called with variable `tmp` of type `TmpType` which is:

```
using TmpType = std::conditional_t&lt;std::is_integral&lt;InsnType&gt;::value, InsnType, uint64_t&gt;;
```

That is, `TmpType` is always an integral type. Change the generated
`insertBits` to be valid only for integer types, and eliminate the
unused `insertBits` function from `DecoderUInt128` in
AMDGPUDisassembler.h

Additionally, drop some of the requirements `InsnType` must support as
they no longer seem to be required.</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The `insertBits` templated function generated by DecoderEmitter is
called with variable `tmp` of type `TmpType` which is:

```
using TmpType = std::conditional_t&lt;std::is_integral&lt;InsnType&gt;::value, InsnType, uint64_t&gt;;
```

That is, `TmpType` is always an integral type. Change the generated
`insertBits` to be valid only for integer types, and eliminate the
unused `insertBits` function from `DecoderUInt128` in
AMDGPUDisassembler.h

Additionally, drop some of the requirements `InsnType` must support as
they no longer seem to be required.</pre>
</div>
</content>
</entry>
<entry>
<title>[AMDGPU] Rename call instructions from b64 to i64 (#145103)</title>
<updated>2025-06-22T04:42:09+00:00</updated>
<author>
<name>Stanislav Mekhanoshin</name>
<email>rampitec@users.noreply.github.com</email>
</author>
<published>2025-06-22T04:42:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.belthelziquor.com/llvm-project.git/commit/?id=fa0b84f23c08994ce4e780f7c406e981d37599b1'/>
<id>fa0b84f23c08994ce4e780f7c406e981d37599b1</id>
<content type='text'>
These get renamed in gfx1250 and on from B64 to I64:

  S_CALL_I64
  S_GET_PC_I64
  S_RFE_I64
  S_SET_PC_I64
  S_SWAP_PC_I64</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
These get renamed in gfx1250 and on from B64 to I64:

  S_CALL_I64
  S_GET_PC_I64
  S_RFE_I64
  S_SET_PC_I64
  S_SWAP_PC_I64</pre>
</div>
</content>
</entry>
<entry>
<title>[AMDGPU][NFC] Remove _DEFERRED operands. (#139123)</title>
<updated>2025-05-09T09:10:53+00:00</updated>
<author>
<name>Ivan Kosarev</name>
<email>ivan.kosarev@amd.com</email>
</author>
<published>2025-05-09T09:10:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.belthelziquor.com/llvm-project.git/commit/?id=66d3980b53086f787d0236814c3cc34fc568e25e'/>
<id>66d3980b53086f787d0236814c3cc34fc568e25e</id>
<content type='text'>
All immediates are deferred now.</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
All immediates are deferred now.</pre>
</div>
</content>
</entry>
<entry>
<title>[AMDGPU][NFC] Get rid of OPW constants. (#139074)</title>
<updated>2025-05-08T17:42:07+00:00</updated>
<author>
<name>Ivan Kosarev</name>
<email>ivan.kosarev@amd.com</email>
</author>
<published>2025-05-08T17:42:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.belthelziquor.com/llvm-project.git/commit/?id=71f8f2b1554b0a34abe4f14bcceadebfbf687739'/>
<id>71f8f2b1554b0a34abe4f14bcceadebfbf687739</id>
<content type='text'>
We can infer the widths from register classes and represent them as
numbers.</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We can infer the widths from register classes and represent them as
numbers.</pre>
</div>
</content>
</entry>
<entry>
<title>[AMDGPU][Disassembler][NFCI] Always defer immediate operands. (#138885)</title>
<updated>2025-05-08T10:43:50+00:00</updated>
<author>
<name>Ivan Kosarev</name>
<email>ivan.kosarev@amd.com</email>
</author>
<published>2025-05-08T10:43:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.belthelziquor.com/llvm-project.git/commit/?id=d9bdc2d6a2d3efcce81ecab151b393f19a81696b'/>
<id>d9bdc2d6a2d3efcce81ecab151b393f19a81696b</id>
<content type='text'>
Removes the need to parameterise decoders with OperandSemantics,
ImmWidth and MandatoryLiteral.

Likely allows further simplification of handling _DEFERRED immediates.

Tested to work downstream.</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Removes the need to parameterise decoders with OperandSemantics,
ImmWidth and MandatoryLiteral.

Likely allows further simplification of handling _DEFERRED immediates.

Tested to work downstream.</pre>
</div>
</content>
</entry>
</feed>
