Ok that was a flashy clickbait-y title! What? You didn’t like it? I’m proud of that one!
So I recently tried to add 16-bit float support to our shaders.
I knew it was untested. What I was not prepared for is… how untested everything is!
We use macros like #define midf min16float. We use midf because half is already in use in Metal.
Our new code has 3 modes of operation:
Full32: midf is just float. Nothing special. The default.
#define midf float
#define midf4 float4
Midf16: midf is just float16_t. It requires shaderFloat16 and storageInputOutput16 to be YES, and in turn the extensions VK_KHR_shader_float16_int8 and VK_KHR_16bit_storage.
Only Vulkan and Metal support this feature. It's excellent for testing and debugging 16-bit floats, because you force 16-bit precision regardless of whether the driver/HW can perform more efficiently with it.
#define midf float16_t
#define midf4 f16vec4
Support is actually scarce and limited mostly to post-Vega AMD GPUs and a few Intel cards. And Metal.
Relaxed: midf is mediump float on Vulkan and min16float on D3D11 (though we disabled it for D3D11). Support is much broader.
But that's because most drivers will default to 32-bit, so you have no way to test it other than via the reference rasterizer or an Android phone.
#define midf mediump float
#define midf4 mediump vec4
// Unfortunately casts and construction, e.g.
// midf myFloat = midf_c( 5.0 ), require
// a different macro because:
// mediump float myFloat = mediump float( 5.0 )
// is not valid syntax.
#define midf_c float
#define midf4_c vec4
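For instance, here is a GLSL-flavoured sketch of how these macros are meant to be used (the sampler and values are made up for illustration; with Full32 this compiles as plain float, with Midf16 as float16_t, with Relaxed as mediump):
uniform sampler2D baseMap; // hypothetical texture
in vec2 uv;
out midf4 fragColour;
void main()
{
    // midf_c / midf4_c handle the construction that "midf( ... )"
    // cannot express when midf expands to "mediump float".
    midf4 base = midf4_c( texture( baseMap, uv ) );
    midf gloss = midf_c( 0.5 );
    fragColour = base * gloss;
}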
Bugs, bugs everywhere
I knew Godot had problems with mediump float because mixing mediump with highp would cause some PSOs to fail to build on older Qualcomm drivers. So we're off to a bad start.
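To illustrate, a made-up minimal case of the kind of precision mixing involved (not the actual Godot shader):
#version 310 es
precision highp float;
uniform mediump vec4 tintColour; // mediump value...
in highp vec4 vertexInterp;      // ...combined with highp in one expression
out mediump vec4 fragColour;
void main()
{
    // Expressions like this one, mixing mediump and highp operands,
    // are what reportedly broke PSO builds on older Qualcomm drivers.
    fragColour = tintColour * vertexInterp;
}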
What I didn’t expect:
- MS FXC with optimizations fails to compile a simple shader with a 2-level nested static branch. It would seem it tries to generate a cmov and fails.
  - Due to this, and fxc no longer being maintained, we prefer to disable min16float support on Direct3D 11.
  - Direct3D 12 is not on our roadmap.
- SPIRV-Reflect would randomly fail if mediump precision is used. This bug has been fixed now.
- SPIRV debugging in RenderDoc will be unreliable if float16_t is used.
- RADV would ignore vertex layouts for float16_t vertex inputs, i.e. in f16vec4 vertexPosition will only work if the vertex data is 16-bit (16_unorm, 16_snorm or 16_half), but it won't work correctly if it is stored as 32-bit float/unorm/snorm. I didn't report this bug because it may be 'working as intended', since something similar happens when the input data is declared as int: autoconversion will no longer work. This is easy to work around: just declare vertexPosition as vec4 and then cast it to f16vec4( vertexPosition ), as sketched below. Fortunately we know which data is natively stored as 16-bit, so those are declared as f16vec4.
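A minimal sketch of that workaround (hypothetical vertex shader):
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require
// Declared as 32-bit so the driver's format autoconversion still applies,
// even when the vertex buffer stores 32-bit float/unorm/snorm data.
layout( location = 0 ) in vec4 vertexPosition;
void main()
{
    // Narrow to FP16 only once the value is already in registers.
    f16vec4 pos = f16vec4( vertexPosition );
    gl_Position = vec4( pos );
}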
On RDNA2, 16-bit half is not a sure win
I was surprised to see (and later confirmed) that RDNA2 does not support converting during an op.
e.g. with the following code:
uniform float k; // Inside an UBO (simplified)
float16_t b = ...;
float16_t a = b + float16_t( k );
First, k is loaded into an SGPR, then converted to half in a VGPR. Then the addition happens.
If we use float all the way, k is loaded into an SGPR and kept there. If b is in a VGPR, the addition will be VGPR a = VGPR b + SGPR k.
The only solution to this is to declare uniform float16_t k so that the data can be natively loaded as 16-bit.
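A sketch of what that looks like in GLSL (hypothetical block; 16-bit UBO members require GL_EXT_shader_16bit_storage and the matching Vulkan feature):
#version 450
#extension GL_EXT_shader_16bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require
layout( set = 0, binding = 0 ) uniform Params
{
    float16_t k; // stored natively as 16-bit, so no per-thread V_CVT on load
};
layout( location = 0 ) out f16vec4 fragColour;
void main()
{
    float16_t b = float16_t( gl_FragCoord.x );
    float16_t a = b + k; // no conversion: k arrives already as FP16
    fragColour = f16vec4( a );
}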
However, with such scarce support and FP16 not being natively supported by C/C++, it is very difficult to support such a path. We can't ditch the FP32 paths, and supporting both is a lot of effort.
It is much easier to send all the data as FP32 and then let the GPU load and convert automatically.
What this means in practice:
The following pixel shader:
#version 450
#extension GL_EXT_shader_16bit_storage: require
#extension GL_EXT_shader_explicit_arithmetic_types_float16: require
layout( ogre_P0 ) uniform Params {
    uniform vec4 myParamA[128];
    uniform vec4 myParamB;
    uniform vec4 myParamC;
};
// #define f16vec4 vec4
layout( location = 0 ) out f16vec4 fragColour;
void main()
{
    f16vec4 tmp0 = f16vec4( float16_t( 0 ) );
    f16vec4 tmp1 = f16vec4( float16_t( 0 ) );
    f16vec4 tmp2 = f16vec4( float16_t( 0 ) );
    f16vec4 tmp3 = f16vec4( float16_t( 0 ) );
    for( int i = 0; i < 128; ++i )
    {
        tmp0 += f16vec4( myParamA[i] ) + f16vec4( myParamB ) * f16vec4( myParamC );
        tmp1 += tmp0 + f16vec4( myParamA[i] ) + f16vec4( myParamB ) * f16vec4( myParamC );
        tmp2 += tmp0 + tmp1 + f16vec4( myParamA[i] ) + f16vec4( myParamB ) * f16vec4( myParamC );
        tmp3 += tmp0 + tmp1 + tmp2 + f16vec4( myParamA[i] ) + f16vec4( myParamB ) * f16vec4( myParamC );
    }
    fragColour = (tmp0 * tmp1 + tmp2) * tmp3;
}
Produces the following ISA with RADV:
BB0:
v_mov_b32_sdwa v0, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e0002f9 00861480
v_mov_b32_sdwa v1, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e0202f9 00861480
v_mov_b32_sdwa v2, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e0402f9 00861480
v_mov_b32_sdwa v3, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e0602f9 00861480
v_mov_b32_sdwa v4, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e0802f9 00861480
v_mov_b32_sdwa v5, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e0a02f9 00861480
v_mov_b32_sdwa v6, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e0c02f9 00861480
v_mov_b32_sdwa v7, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e0e02f9 00861480
v_mov_b32_sdwa v8, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e1002f9 00861480
v_mov_b32_sdwa v9, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e1202f9 00861480
v_mov_b32_sdwa v10, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e1402f9 00861480
v_mov_b32_sdwa v11, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e1602f9 00861480
v_mov_b32_sdwa v12, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e1802f9 00861480
v_mov_b32_sdwa v13, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e1a02f9 00861480
v_mov_b32_sdwa v14, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e1c02f9 00861480
v_mov_b32_sdwa v15, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e1e02f9 00861480
s_mov_b32 s0, 0 ; be800380
BB1:
s_cmp_ge_i32 s0, 0x80 ; bf03ff00 00000080
s_cbranch_scc1 BB5 ; bf850054
BB4:
s_add_i32 s4, 16, s2 ; 81040290
s_movk_i32 s5, 0x8000 ; b0058000
s_load_dwordx4 s[4:7], s[4:5], 0x0 ; f4080102 fa000000
s_lshl_b32 s1, s0, 4 ; 8f018400
s_waitcnt lgkmcnt(0) ; bf8cc07f
s_clause 0x2 ; bfa10002
s_buffer_load_dwordx4 s[8:11], s[4:7], s1 ; f4280202 02000000
s_buffer_load_dwordx4 s[12:15], s[4:7], 0x800 ; f4280302 fa000800
s_buffer_load_dwordx4 s[4:7], s[4:7], 0x810 ; f4280102 fa000810
s_add_u32 s0, s0, 1 ; 80008100
s_waitcnt lgkmcnt(0) ; bf8cc07f
v_cvt_f16_f32_e32 v16, s8 ; 7e201408
v_cvt_f16_f32_e32 v17, s9 ; 7e221409
v_cvt_f16_f32_e32 v18, s10 ; 7e24140a
v_cvt_f16_f32_e32 v19, s11 ; 7e26140b
v_cvt_f16_f32_e32 v20, s12 ; 7e28140c
v_cvt_f16_f32_e32 v21, s13 ; 7e2a140d
v_cvt_f16_f32_e32 v22, s14 ; 7e2c140e
v_cvt_f16_f32_e32 v23, s15 ; 7e2e140f
v_cvt_f16_f32_e32 v24, s4 ; 7e301404
v_cvt_f16_f32_e32 v25, s5 ; 7e321405
v_cvt_f16_f32_e32 v26, s6 ; 7e341406
v_cvt_f16_f32_e32 v27, s7 ; 7e361407
v_fma_f16 v28, v20, v24, v16 ; d74b001c 04423114
v_fma_f16 v29, v21, v25, v17 ; d74b001d 04463315
v_fma_f16 v30, v22, v26, v18 ; d74b001e 044a3516
v_fma_f16 v31, v23, v27, v19 ; d74b001f 044e3717
v_add_f16_e32 v12, v12, v28 ; 6418390c
v_add_f16_e32 v13, v13, v29 ; 641a3b0d
v_add_f16_e32 v14, v14, v30 ; 641c3d0e
v_add_f16_e32 v15, v15, v31 ; 641e3f0f
v_add_f16_e32 v28, v12, v16 ; 6438210c
v_add_f16_e32 v29, v13, v17 ; 643a230d
v_add_f16_e32 v30, v14, v18 ; 643c250e
v_add_f16_e32 v31, v15, v19 ; 643e270f
v_fmac_f16_e32 v28, v20, v24 ; 6c383114
v_fmac_f16_e32 v29, v21, v25 ; 6c3a3315
v_fmac_f16_e32 v30, v22, v26 ; 6c3c3516
v_fmac_f16_e32 v31, v23, v27 ; 6c3e3717
v_add_f16_e32 v8, v8, v28 ; 64103908
v_add_f16_e32 v9, v9, v29 ; 64123b09
v_add_f16_e32 v10, v10, v30 ; 64143d0a
v_add_f16_e32 v11, v11, v31 ; 64163f0b
v_add_f16_e32 v28, v12, v8 ; 6438110c
v_add_f16_e32 v29, v13, v9 ; 643a130d
v_add_f16_e32 v30, v14, v10 ; 643c150e
v_add_f16_e32 v31, v15, v11 ; 643e170f
v_add_f16_e32 v32, v28, v16 ; 6440211c
v_add_f16_e32 v33, v29, v17 ; 6442231d
v_add_f16_e32 v34, v30, v18 ; 6444251e
v_add_f16_e32 v35, v31, v19 ; 6446271f
v_fmac_f16_e32 v32, v20, v24 ; 6c403114
v_fmac_f16_e32 v33, v21, v25 ; 6c423315
v_fmac_f16_e32 v34, v22, v26 ; 6c443516
v_fmac_f16_e32 v35, v23, v27 ; 6c463717
v_add_f16_e32 v4, v4, v32 ; 64084104
v_add_f16_e32 v5, v5, v33 ; 640a4305
v_add_f16_e32 v6, v6, v34 ; 640c4506
v_add_f16_e32 v7, v7, v35 ; 640e4707
v_add_f16_e32 v28, v28, v4 ; 6438091c
v_add_f16_e32 v29, v29, v5 ; 643a0b1d
v_add_f16_e32 v30, v30, v6 ; 643c0d1e
v_add_f16_e32 v31, v31, v7 ; 643e0f1f
v_add_f16_e32 v16, v28, v16 ; 6420211c
v_add_f16_e32 v17, v29, v17 ; 6422231d
v_add_f16_e32 v18, v30, v18 ; 6424251e
v_add_f16_e32 v19, v31, v19 ; 6426271f
v_fmac_f16_e32 v16, v20, v24 ; 6c203114
v_fmac_f16_e32 v17, v21, v25 ; 6c223315
v_fmac_f16_e32 v18, v22, v26 ; 6c243516
v_fmac_f16_e32 v19, v23, v27 ; 6c263717
v_add_f16_e32 v0, v0, v16 ; 64002100
v_add_f16_e32 v1, v1, v17 ; 64022301
v_add_f16_e32 v2, v2, v18 ; 64042502
v_add_f16_e32 v3, v3, v19 ; 64062703
s_branch BB1 ; bf82ffa9
BB5:
v_fmac_f16_e32 v4, v12, v8 ; 6c08110c
v_fmac_f16_e32 v5, v13, v9 ; 6c0a130d
v_fmac_f16_e32 v6, v14, v10 ; 6c0c150e
v_fmac_f16_e32 v7, v15, v11 ; 6c0e170f
v_mul_f16_e32 v0, v4, v0 ; 6a000104
v_mul_f16_sdwa v0, v5, v1 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_0 src1_sel:WORD_0 ; 6a0002f9 04041505
v_mul_f16_e32 v1, v6, v2 ; 6a020506
v_mul_f16_sdwa v1, v7, v3 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_0 src1_sel:WORD_0 ; 6a0206f9 04041507
exp mrt0 v0, v0, v1, v1 done compr vm ; f8001c0f 80800100
s_endpgm ; bf810000
Pixel Shader:
*** SHADER STATS ***
SGPRs: 128
VGPRs: 40
Spilled SGPRs: 0
Spilled VGPRs: 0
PrivMem VGPRs: 0
Code size: 532
LDS size: 0
Scratch size: 0
Subgroups per SIMD: 24
Hash: 3918390333
Instructions: 105
Copies: 18
Branches: 2
Latency: 3032
Inverse Throughput: 1072
VMEM Clause: 0
SMEM Clause: 2
Pre-Sched SGPRs: 10
Pre-Sched VGPRs: 36
********************
But if we compile it as 32-bit (by replacing all f16vec4 with vec4):
BB0:
v_lshrrev_b64 v[0:1], 0, 0 ; d7000000 00010080
v_lshrrev_b64 v[2:3], 0, 0 ; d7000002 00010080
v_lshrrev_b64 v[4:5], 0, 0 ; d7000004 00010080
v_lshrrev_b64 v[6:7], 0, 0 ; d7000006 00010080
v_lshrrev_b64 v[8:9], 0, 0 ; d7000008 00010080
v_lshrrev_b64 v[10:11], 0, 0 ; d700000a 00010080
v_lshrrev_b64 v[12:13], 0, 0 ; d700000c 00010080
v_lshrrev_b64 v[14:15], 0, 0 ; d700000e 00010080
s_mov_b32 s0, 0 ; be800380
BB1:
s_cmp_ge_i32 s0, 0x80 ; bf03ff00 00000080
s_cbranch_scc1 BB5 ; bf85004c
BB4:
s_add_i32 s4, 16, s2 ; 81040290
s_movk_i32 s5, 0x8000 ; b0058000
s_load_dwordx4 s[4:7], s[4:5], 0x0 ; f4080102 fa000000
s_lshl_b32 s1, s0, 4 ; 8f018400
s_waitcnt lgkmcnt(0) ; bf8cc07f
s_clause 0x2 ; bfa10002
s_buffer_load_dwordx4 s[8:11], s[4:7], 0x810 ; f4280202 fa000810
s_buffer_load_dwordx4 s[12:15], s[4:7], 0x800 ; f4280302 fa000800
s_buffer_load_dwordx4 s[4:7], s[4:7], s1 ; f4280102 02000000
s_add_u32 s0, s0, 1 ; 80008100
s_waitcnt lgkmcnt(0) ; bf8cc07f
v_mul_f32_e64 v16, s12, s8 ; d5080010 0000100c
v_mul_f32_e64 v17, s13, s9 ; d5080011 0000120d
v_mul_f32_e64 v18, s14, s10 ; d5080012 0000140e
v_mul_f32_e64 v19, s15, s11 ; d5080013 0000160f
v_add_f32_e32 v20, s4, v16 ; 06282004
v_add_f32_e32 v21, s5, v17 ; 062a2205
v_add_f32_e32 v22, s6, v18 ; 062c2406
v_add_f32_e32 v23, s7, v19 ; 062e2607
v_add_f32_e32 v12, v12, v20 ; 0618290c
v_add_f32_e32 v13, v13, v21 ; 061a2b0d
v_add_f32_e32 v14, v14, v22 ; 061c2d0e
v_add_f32_e32 v15, v15, v23 ; 061e2f0f
v_add_f32_e32 v20, s4, v12 ; 06281804
v_add_f32_e32 v21, s5, v13 ; 062a1a05
v_add_f32_e32 v22, s6, v14 ; 062c1c06
v_add_f32_e32 v23, s7, v15 ; 062e1e07
v_add_f32_e32 v20, v20, v16 ; 06282114
v_add_f32_e32 v21, v21, v17 ; 062a2315
v_add_f32_e32 v22, v22, v18 ; 062c2516
v_add_f32_e32 v23, v23, v19 ; 062e2717
v_add_f32_e32 v8, v8, v20 ; 06102908
v_add_f32_e32 v9, v9, v21 ; 06122b09
v_add_f32_e32 v10, v10, v22 ; 06142d0a
v_add_f32_e32 v11, v11, v23 ; 06162f0b
v_add_f32_e32 v20, v12, v8 ; 0628110c
v_add_f32_e32 v21, v13, v9 ; 062a130d
v_add_f32_e32 v22, v14, v10 ; 062c150e
v_add_f32_e32 v23, v15, v11 ; 062e170f
v_add_f32_e32 v24, s4, v20 ; 06302804
v_add_f32_e32 v25, s5, v21 ; 06322a05
v_add_f32_e32 v26, s6, v22 ; 06342c06
v_add_f32_e32 v27, s7, v23 ; 06362e07
v_add_f32_e32 v24, v24, v16 ; 06302118
v_add_f32_e32 v25, v25, v17 ; 06322319
v_add_f32_e32 v26, v26, v18 ; 0634251a
v_add_f32_e32 v27, v27, v19 ; 0636271b
v_add_f32_e32 v4, v4, v24 ; 06083104
v_add_f32_e32 v5, v5, v25 ; 060a3305
v_add_f32_e32 v6, v6, v26 ; 060c3506
v_add_f32_e32 v7, v7, v27 ; 060e3707
v_add_f32_e32 v20, v20, v4 ; 06280914
v_add_f32_e32 v21, v21, v5 ; 062a0b15
v_add_f32_e32 v22, v22, v6 ; 062c0d16
v_add_f32_e32 v23, v23, v7 ; 062e0f17
v_add_f32_e32 v20, s4, v20 ; 06282804
v_add_f32_e32 v21, s5, v21 ; 062a2a05
v_add_f32_e32 v22, s6, v22 ; 062c2c06
v_add_f32_e32 v23, s7, v23 ; 062e2e07
v_add_f32_e32 v16, v20, v16 ; 06202114
v_add_f32_e32 v17, v21, v17 ; 06222315
v_add_f32_e32 v18, v22, v18 ; 06242516
v_add_f32_e32 v19, v23, v19 ; 06262717
v_add_f32_e32 v0, v0, v16 ; 06002100
v_add_f32_e32 v1, v1, v17 ; 06022301
v_add_f32_e32 v2, v2, v18 ; 06042502
v_add_f32_e32 v3, v3, v19 ; 06062703
s_branch BB1 ; bf82ffb1
BB5:
v_fmac_f32_e32 v4, v12, v8 ; 5608110c
v_fmac_f32_e32 v5, v13, v9 ; 560a130d
v_fmac_f32_e32 v6, v14, v10 ; 560c150e
v_fmac_f32_e32 v7, v15, v11 ; 560e170f
v_mul_f32_e32 v0, v4, v0 ; 10000104
v_mul_f32_e32 v1, v5, v1 ; 10020305
v_mul_f32_e32 v2, v6, v2 ; 10040506
v_mul_f32_e32 v3, v7, v3 ; 10060707
v_cvt_pkrtz_f16_f32_e32 v0, v0, v1 ; 5e000300
v_cvt_pkrtz_f16_f32_e32 v1, v2, v3 ; 5e020702
exp mrt0 v0, v0, v1, v1 done compr vm ; f8001c0f 80800100
s_endpgm ; bf810000
Pixel Shader:
*** SHADER STATS ***
SGPRs: 128
VGPRs: 32
Spilled SGPRs: 0
Spilled VGPRs: 0
PrivMem VGPRs: 0
Code size: 436
LDS size: 0
Scratch size: 0
Subgroups per SIMD: 32
Hash: 1646907196
Instructions: 91
Copies: 10
Branches: 2
Latency: 2925
Inverse Throughput: 948
VMEM Clause: 0
SMEM Clause: 2
Pre-Sched SGPRs: 15
Pre-Sched VGPRs: 28
********************
We can notice a few things:
- Inverse Throughput (estimated busy cycles to execute one wave, i.e. lower is better): 1072 vs 948 (~13% worse). 32-bit wins
- Latency (issue cycles plus stall cycles, i.e. lower is better): 3032 vs 2925 (~4% worse). 32-bit wins
- RADV generated zero V_PK_ADD_F16 and V_PK_FMAC_F16 instructions despite there being plenty of opportunities
- I don't know if V_CVT_F16_F32_SDWA is possible or if it has a cost. It seems SDWA instructions need 2 DWORDs.
- 16-bit needed 36 VGPRs vs 32-bit needing 28 VGPRs
  - Part of this is explained by data being kept in scalar registers in 32-bit, whereas 16-bit needs to move everything to VGPRs
  - Another part is RADV not being able to pack 2 FP16 values into the same VGPR (see the sketch after this list)
    - That only happens twice, at the end, where we see the v_mul_f16_sdwa instructions
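For reference, this is what packed math means at the GLSL level: two FP16 values share one 32-bit VGPR, so an f16vec2 add can in principle map to a single V_PK_ADD_F16 (a conceptual sketch, assuming the compiler packs the lanes):
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require
layout( location = 0 ) in vec4 inData;
layout( location = 0 ) out vec4 fragColour;
void main()
{
    // a and b each hold two FP16 values in a single 32-bit register...
    f16vec2 a = f16vec2( inData.xy );
    f16vec2 b = f16vec2( inData.zw );
    // ...so this add could ideally be one V_PK_ADD_F16 on both lanes at once.
    f16vec2 c = a + b;
    fragColour = vec4( vec2( c ), 0.0, 1.0 );
}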
In other words
- RDNA2 does not seem to double the register count with FP16. It would be cool to see registers s[0:512] and v[0:128] alias to sh[0:1024] and vh[0:512] respectively. This can either be implemented as register aliasing, or via additional instructions that operate on high/low bits of the register. This is an implementation detail, whereas externally it can be seen as register aliasing. It would be easier to mentally track.
- RDNA2 does support packed math, i.e. operating on two FP16 values living in the same 32-bit register at the same time
- RDNA2 also supports the SDWA suffix to target the 2nd FP16 value in a VGPR
- RADV / ACO does not yet seem to take advantage of packed math instructions
- Proprietary drivers on Windows do use packed operations. Our PBS shader was filled with V_CVT_PKRTZ_F16_F32 and PK arithmetic instructions
  - However, no change in performance was noticeable
    - It's quite possible I was not stressing the GPU enough
  - VGPR usage went down, SGPR usage went up. Overall a win (yay!)
  - With AMD_shader_info:
    - "Total Cycles" for FP16 was 6% higher
    - "Total Stalls" for FP16 was 96% higher. Sounds bad to me, but I don't know.
- Loading a float from const buffers and converting it to FP16 needs a V_CVT instruction and takes away 1 VGPR, since the conversion cannot happen in SGPRs
  - This is a big problem:
    - Light data is in FP32 (position x3, direction x3, colour x3, spot params x3, attenuation x4)
      - In advanced games light data typically ends up in VGPRs for Deferred/Forward+ data, BUT
      - In power-constrained games (like the ones targeting mobile! where FP16 is most useful!) light data lives in constant buffers, as they use regular forward
    - Material data is in FP32 (kD x3, kS x3, fresnel x1, roughness x1, transparency x1)
- This could be fixed if, instead of S_LOAD_DWORDX16, a new instruction like S_LOADCVT_DWORDXn_LO could load then batch-convert the FP32 floats in s[0:1] into two FP16 in sh[0:1], leaving s0 with two FP16 floats in it and s1 with the original FP32 value. S_LOADCVT_DWORDXn_HI would do the same but store the results in sh[2:3], leaving the original FP32 in s0. V_CVT_PKRTZ_F16_F32 already does this, but:
  - It operates exclusively on VGPR results
  - It does not respect the current rounding mode (it assumes round to zero)
  - It works on 2 floats at a time, not in bulk; S_LOADCVT_DWORDX16 would convert 16 floats at a time.
- S_LOADCVT_DWORDXn_LO and S_LOADCVT_DWORDXn_HI make sense because, based on my observations, a few operations need the original data from the const buffer as both FP16 and FP32. Thus being able to load-and-convert FP32-to-FP16 data while preserving half of the original FP32 values could be very useful.
- The goal behind an amalgamated instruction S_LOADCVT_DWORDn is to perform the load and conversion without an S_WAITCNT instruction. The rationale is the same as BUFFER_LOAD_FORMAT_XYZW (untyped buffer load and conversion) but targeting scalar registers.
- Besides, no instruction other than the raw load dword operations seems to support writing to SGPRs. This is obviously for simplicity. Thus having the ASIC convert data on the fly during the load would make sense.
- Alternatively, arithmetic instructions that take an SGPR input and convert it on the spot to half could work as well
With the Steam Deck landing in a few weeks with RDNA2, and Samsung soon to release an RDNA2-powered phone, I thought FP16 support would be more advanced (note: I don't know if Samsung uses RADV or a proprietary driver).
Valve should invest in getting packed FP16 math into RADV. If AMD or Valve need a test case, they can run our samples. They run on Wine, so they should run on Proton too.
I modified the samples to add the command line option --force16, which forces the Midf16 mode described above. Look for "RUNNING WITH 16-bit PRECISION AND SUPPORTED! :)" in the Log. And of course, choose Vulkan.
For native Linux, build master and apply this patch to force Midf16:
diff --git a/OgreMain/src/OgreHlms.cpp b/OgreMain/src/OgreHlms.cpp
index 55e00e70bb..3498c8b0df 100644
--- a/OgreMain/src/OgreHlms.cpp
+++ b/OgreMain/src/OgreHlms.cpp
@@ -275,7 +275,7 @@ namespace Ogre
#else
mDebugOutputProperties( false ),
#endif
- mPrecisionMode( PrecisionFull32 ),
+ mPrecisionMode( PrecisionMidf16 ),
mFastShaderBuildHack( false ),
mDefaultDatablock( 0 ),
mType( type ),