So, I saw NVIDIA’s Do’s and Don’ts.
Thanks NV! It’s useful. Go check it out if you haven’t already. But beware of a couple of gotchas: there are a few bits that are NV-specific and that I feel should be talked about:
- Be aware of the fact that there is a cost associated with setup and reset of a command list
- You still need a reasonable number of command lists for efficient parallel work submission
- Fences force the splitting of command lists for various reasons (multiple command queues, picking up the results of queries)
Err… I feel this is a bit NV-specific. I’ve been suspecting the NV driver needs roundtrips between CPU & GPU for certain jobs that are normally thought to be done GPU-side (thus causing unexplainable stalls), and this almost feels like a confirmation. At least on GCN most of the work can be done GPU-side, hence the need for driver-side splitting of commands is very small.
Maybe I’m mistaken? But anyway, NV-specific or not, it is very important to get clarification on “various reasons”. Clearly the folks at MS didn’t think these commands would cause splitting, or otherwise the API would be different or the documentation more explicit about it.
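To make the multiple-command-queues case concrete, here’s a minimal sketch (variable names are mine, not from either document): Signal and Wait live on ID3D12CommandQueue rather than on the command list, so any GPU work that must run after a fence wait has to be recorded into a second command list.

```cpp
// Sketch, assuming cmdListA/cmdListB, both queues and the fence were created
// beforehand. Signal/Wait are queue-level operations, not list-level ones,
// so work that depends on the fence must live in its own command list.
cmdListA->Close();
ID3D12CommandList *const listsA[] = { cmdListA };
graphicsQueue->ExecuteCommandLists( 1u, listsA );
graphicsQueue->Signal( fence, ++fenceValue );   // GPU signals when listsA is done

computeQueue->Wait( fence, fenceValue );        // GPU-side wait on the other queue
cmdListB->Close();
ID3D12CommandList *const listsB[] = { cmdListB };
computeQueue->ExecuteCommandLists( 1u, listsB );
```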
- Check carefully if the use of a separate compute command queues really is advantageous
- Even for compute tasks that can in theory run in parallel with graphics tasks, the actual scheduling details of the parallel work on the GPU may not generate the results you hope for
- Be conscious of which asynchronous compute and graphics workloads can be scheduled together
This is true for all HW, but it is much truer for NV hardware (Intel has only one queue so it doesn’t matter). AMD has an edge here.
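For reference, creating the separate compute queue is the trivial part; whether it buys you anything is the hard part, so profile both paths. A minimal sketch, assuming `device` is a valid ID3D12Device:

```cpp
// A dedicated compute queue. The API makes this easy; the HW scheduling
// details decide whether its work actually overlaps with graphics.
D3D12_COMMAND_QUEUE_DESC queueDesc = {};
queueDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;

ID3D12CommandQueue *computeQueue = nullptr;
device->CreateCommandQueue( &queueDesc, IID_PPV_ARGS( &computeQueue ) );
```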
- Place constants, CBVs, SRVs and UAVs directly into the root signature if possible
  - Start with the entries for the pixel stage
  - Constants that sit directly in root can speed up pixel shaders significantly on NVidia hardware – specifically consider shader constants that toggle parts of uber-shaders
  - CBVs that sit in the root signature can also speed up pixel shaders significantly on NVidia hardware
  - Carry on with decreasing execution frequency of the shader stages
  - Doesn’t require a descriptor heap, versioning entries and extra indirection,
  - Remember root views don’t do bounds checking and have other limitations
This time it is marked with an NV-specific hint. However, MS says “keep most of the stuff away from the Root Signature” while NV here says “put everything in Root”. What’s going on here? And btw, what’s that about the “decreasing execution frequency”?
Let’s start with why MS says “avoid Root”:
When you modify a regular Descriptor Table or Heap, you’ve got to manually keep track of them and ensure they’re not still in use by the GPU. With Root however, you may have noticed that you can change the values without keeping track of the previous contents. DirectX 12 does its own versioning. It’s the DX11 style of keeping track of state changes. Changed a value? The DX API will grow or swap its internal buffer and write the new value to a different location, so that previous commands can read the old value while new commands will read the new value.
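A tiny sketch of what that means in practice (the root parameter layout here is made up for illustration): both draws read the value that was current when they were recorded, with zero tracking on your side.

```cpp
// Root param 0 is assumed to be a 32-bit root constant. No fencing needed:
// each Set* call is recorded into the command stream, so each draw keeps
// seeing the value that was set before it.
cmdList->SetGraphicsRoot32BitConstant( 0, 0xAAAA, 0 );
cmdList->DrawInstanced( 3, 1, 0, 0 );   // reads 0xAAAA
cmdList->SetGraphicsRoot32BitConstant( 0, 0xBBBB, 0 );
cmdList->DrawInstanced( 3, 1, 0, 0 );   // reads 0xBBBB
// Had this been a slot in a descriptor heap, overwriting it while the GPU
// may still be reading the old contents would be your problem to track.
```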
The versioning implementation is done entirely by the driver (see the update below) and may change with Runtime or driver upgrades (although I doubt this will ever happen). Abusing Root can blow up memory that is outside of your control, or cause stalls. It’s DX11-like behavior. That’s why they advise against using it.
The reason for the Root signature was a sort of “why not?”, combined with “it’s extremely useful for debug stuff” and “it’s extremely useful (and faster on certain HW*) for global stuff referenced everywhere that barely changes per frame, or for very small amounts of data that change very frequently”. E.g. sending a single 32-bit integer via Root is faster than changing a 64-bit pointer that points to a CBV containing that 32-bit integer. Setting an array in a CBV and indexing it via shader is also possible, but will incur a slight GPU overhead; clearly the Root path will be the fastest, especially if you’re GPU bound, while indexing via shader should be preferred if you’re CPU bound.
(*) It removes a level of indirection and gives the driver a chance to trigger a particular fast path.
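As a concrete sketch of that fast path (the register and single-parameter layout are illustrative, not prescriptive), this is what a single 32-bit pixel-stage root constant looks like at signature creation:

```cpp
// One 32-bit constant placed directly in the root signature, pixel stage only.
D3D12_ROOT_PARAMETER param = {};
param.ParameterType            = D3D12_ROOT_PARAMETER_TYPE_32BIT_CONSTANTS;
param.Constants.ShaderRegister = 0;   // b0 in HLSL
param.Constants.RegisterSpace  = 0;
param.Constants.Num32BitValues = 1;
param.ShaderVisibility         = D3D12_SHADER_VISIBILITY_PIXEL;

D3D12_ROOT_SIGNATURE_DESC rootSigDesc = {};
rootSigDesc.NumParameters = 1;
rootSigDesc.pParameters   = &param;
```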
Update: Got word that the Root implementation is entirely, or almost entirely, in the driver’s hands, not Microsoft’s.
Why decreasing execution frequency?
This was explained in Advanced Graphics & Performance around minute 10 and in Getting the best out of D3D12 around slide 31.
This is due to certain older GPU archs, especially on mobile. Apparently certain GPU archs have a limited amount of resources that can be re-bound dynamically and individually, while the rest of the descriptors are “static”: if you change one of those static descriptors, then all descriptors must be flushed, which will likely trigger a GPU pipeline “mini” stall (not the kind that waits on the CPU, but rather the kind that must wait for all wavefronts/warps to finish so the new resources can be bound before resuming with more wavefront/warp work).
This is the good old “texture state changes are expensive as hell” with the addition “…except if you only change the first N textures” minus the CPU overhead due to DX12’s different design.
How many descriptors are “dynamic”, what’s the value of N? Well, that’s architecture dependent. GCN certainly doesn’t have this problem at all because it’s pure bindless. As far as I know NVIDIA HW isn’t affected either, but I’m not fully sure.
Update: Got word from AMD. Even though their HW isn’t directly affected by this as I thought, there are still two very important elements to keep in mind, which, once said out loud, sound like common sense:
- If you’ve changed elements 0-3 in the Root table, the driver will quickly upload 4 DWORDs. However, if the dirty elements are 1, 4, 24 and 37… the driver may send 4 DWORDs or a bit more (which means it is keeping track of each individual element… meaning more CPU overhead!), or it may assume you keep everything contiguous and thus send the whole [1; 37] range (37 DWORDs!) because it’s only doing minimal tracking (upper and lower bounds). A driver written to minimize CPU overhead will assume the developer is doing the smart thing™ and keeping all of their changes contiguous. This is very important to keep in mind! (see the sketch after this list)
- There are restrictions on how much of the Root table is kept on-chip, and thus the driver assumes frequently changing elements are put first.
- For the reasons above, AMD recommends that applications generally should only put things in the root if they change on every draw, and that those things should come first in the signature.
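Here’s the promised sketch of that contiguity point (root parameter layout and variable names are hypothetical): one contiguous call lets the driver upload exactly what changed, while scattered single-DWORD updates can make a bounds-tracking driver re-send the whole range.

```cpp
// Contiguous: elements 0-3 of root param 0, the driver uploads exactly 4 DWORDs.
const uint32_t values[4] = { a, b, c, d };
cmdList->SetGraphicsRoot32BitConstants( 0, 4, values, 0 );

// Scattered: dirty elements at offsets 1 and 37. A driver tracking only
// upper & lower bounds may re-send the whole [1; 37] range (37 DWORDs)
// instead of just 2.
cmdList->SetGraphicsRoot32BitConstant( 0, e, 1 );
cmdList->SetGraphicsRoot32BitConstant( 0, f, 37 );
```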
Thanks a lot, AMD, for the feedback!
Why does NV say to bind PS constants to Root?
This is pure speculation on my behalf: unlike GCN hardware, which can reference something like 2GB of constant data in the same shader, NV still uses a 64kb (32kb?) register file for storing constant data. My guess is that when you bind it at Root level, the driver can check whether the data will fit in <64kb and prepare the register file; whereas on non-Root paths the driver has very limited opportunities, and likely has to resort to either flushing the register file aggressively, or avoiding it entirely and emulating with texture fetches instead. This also means no unbounded indexing in the shader on that CBV, or else this hint will likely not work (the PSO will definitely have to use tex fetches, as it cannot predict you’ll be addressing less than 64kb, hence no register file… btw, if I find out the driver swaps the PSO bytecode with a special hidden version that uses the register file, I will be very mad, as that defeats the entire purpose of using PSOs).
Technically speaking this is still within MS’ vision about the Root descriptor: for stuff that doesn’t need much versioning, it may be faster due to one less level of indirection, and the driver gets a chance to grab a fast path.
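For completeness, the root-CBV path NV is hinting at boils down to this call (the root parameter index and variable names are assumptions; the parameter must have been declared as D3D12_ROOT_PARAMETER_TYPE_CBV in the signature):

```cpp
// Bind the CBV's GPU virtual address directly as a root argument: no
// descriptor heap entry, one less level of indirection.
cmdList->SetGraphicsRootConstantBufferView(
    1, constantBuffer->GetGPUVirtualAddress() );
```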
However, “starting with CBVs in the pixel stage, then carrying on with decreasing execution frequency of the shader stages” may conflict with MS’ advice of “arrange by decreasing execution frequency”, depending on what you want to do. So it’s important to know what’s going on behind the scenes and what kind of hardware you’re targeting.
As with anything, but DX12 in particular: DX12 gives you a lot of flexibility and a lot of paths. The industry is still trying to understand which paths are fastest, and this certainly can vary across GPU architectures, so nothing is set in stone!