Q: I’m concerned that dropping D3D9 will cause me to lose significant market share.
Steam stats reveal that D3D9-only hardware sits at around 1% of the market share. That hardware is more than 10 years old.
However, there are a lot of machines with D3D10- or D3D11-level hardware running Windows XP, especially in China.
To support that combo (XP + a D3D10/11-level GPU), Ogre uses the GL3+ RenderSystem (OpenGL 3.3), which should deliver performance equal to or better than even the D3D11 RenderSystem.
In other words, no significant market share is lost at all. Unreal Engine 4 is also following the same approach.
Q: Is D3D11 slower than D3D9?
If we compare two identical machines with D3D11-level hardware (an apples-to-apples comparison), one running Windows XP (D3D9) and the other running Windows 7/8 (D3D11), and assuming both applications are well written, D3D11 should outperform D3D9 by a large margin.
Q: You said D3D11 is faster than D3D9, but my experience says otherwise…
In order for a D3D11 application to perform fast, it needs to be well written. A lot of games and engines, including Ogre 1.x, force the “D3D9 style of doing things” onto D3D11. This only results in D3D11 running slower than D3D9.
Ogre 2.0 Final is being deeply refactored to render graphics the “D3D11 way”.
Just as a simple example, during rendering D3D9 forces us to send almost all parameters to the shaders every frame for every object.
Both GL3+ and D3D11 instead introduce the concept of “constant buffers”, which allows us to split parameters by update frequency. Parameters like the lights’ data are shared across all objects, so we only need to send them once per frame. Parameters like diffuse and specular rarely change, so they can be uploaded once and kept on the GPU until they change. Only data like transform matrices needs to be sent every frame for every object.
Thus the CPU overhead is greatly reduced. But in order to do that, the code first needs to separate the data by update frequency; and the new Hlms (high level material system) does exactly that.
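As a rough illustration of the idea (the struct and function names below are ours, not Ogre’s actual Hlms code), this is what grouping constant buffers by update frequency can look like on the D3D11 side:

```cpp
#include <d3d11.h>
#include <DirectXMath.h>

struct PerFrameData     // shared by every object: uploaded once per frame
{
    DirectX::XMFLOAT4 lightPosition[8];
    DirectX::XMFLOAT4 lightColour[8];
};

struct PerMaterialData  // rarely changes: uploaded only when the material is edited
{
    DirectX::XMFLOAT4 diffuse;
    DirectX::XMFLOAT4 specular;
};

struct PerObjectData    // the only data that truly must be sent per object, per frame
{
    DirectX::XMFLOAT4X4 worldViewProj;
};

// Create a dynamic constant buffer; each of the structs above gets its own buffer,
// updated at its own frequency, instead of everything being resent for every object.
static ID3D11Buffer* createConstantBuffer( ID3D11Device *device, UINT byteWidth )
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth      = byteWidth;                    // must be a multiple of 16
    desc.Usage          = D3D11_USAGE_DYNAMIC;
    desc.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;

    ID3D11Buffer *buffer = nullptr;
    device->CreateBuffer( &desc, nullptr, &buffer );
    return buffer;
}
```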
There are other improvements, like better instancing, texture arrays, and, in GL3+’s case, MultiDrawIndirect and persistent mapping, which is the holy grail of low API overhead.
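For the curious, here is a minimal sketch of what persistent mapping looks like in GL4 terms (it assumes a GL 4.4 context, ARB_buffer_storage and a function loader; the names and pool size are illustrative, not Ogre’s actual code):

```cpp
#include <GL/glcorearb.h>

// Allocate an immutable buffer and keep it mapped forever; the CPU writes through
// the returned pointer every frame, synchronizing with explicit fences instead of
// mapping/unmapping, which is what makes the API overhead so low.
void* createPersistentlyMappedPool( GLuint &outVbo, GLsizeiptr poolSizeBytes )
{
    const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                             GL_MAP_COHERENT_BIT;

    glGenBuffers( 1, &outVbo );
    glBindBuffer( GL_ARRAY_BUFFER, outVbo );

    // Immutable storage that may remain mapped while the GPU is reading from it.
    glBufferStorage( GL_ARRAY_BUFFER, poolSizeBytes, nullptr, flags );

    // Map once at startup and never unmap.
    return glMapBufferRange( GL_ARRAY_BUFFER, 0, poolSizeBytes, flags );
}
```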
Q: You mention GL3+, but isn’t OpenGL incredibly slow?
GL3+ is receiving a huge amount of attention from me (Matías Goldberg aka dark_sylinc). A lot of bugs have been fixed. Profilers used to show more than 80% redundant API calls; right now it sits between 25% and 35% redundant API calls depending on the scene, thanks to the Hlms.
And there’s still AZDO rendering coming, which we hope will lower the number of redundant API calls to less than 5%, while also reducing the overall number of non-redundant API calls.
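As an aside, one common way redundant calls get filtered out is a tiny state cache that skips the API call when the requested state is already set. The class below is only a sketch of that idea, not Ogre’s actual code:

```cpp
#include <GL/glcorearb.h>

// Remember the last VAO we bound and only touch the driver when it actually changes.
class VaoCache
{
    GLuint mLastVaoName = 0;
public:
    void bind( GLuint vaoName )
    {
        if( vaoName != mLastVaoName )   // skip the redundant call entirely
        {
            glBindVertexArray( vaoName );
            mLastVaoName = vaoName;
        }
    }
};
```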
D3D11 is also receiving a lot of attention from another Ogre dev, and as soon as I finish with GL3+, I will also focus my attention on D3D11.
However, I may shift my attention towards D3D12 directly, since the AZDO low-level/explicit memory manager can be reused for D3D12 and is therefore a better fit.
Q: One big buffer for all meshes. Isn’t this wasteful?
Yes but No.
The new system reserves a big chunk of memory that we manually manage using explicit fences and synchronization primitives (see the sketch at the end of this answer). This is pretty much how D3D12 works and how OpenGL 3/4 works (with GL_ARB_sync as a mandatory extension). If the amount of memory requested is bigger than our pool, we request another pool from the API. There are several factors at play here:
- The user controls how big the pool is. By default 128MB are requested. But if the user is making a small program and during production he realizes he never exceeds 16MB, he can set the pool size to that value. If the user needs a maximum of 200MB, he can increase it to avoid having two pools of 128MB each (and wasting 56MB). In other words, not wasteful.
- The user can request a pool entirely for his own use, if for some odd reason he wants full control over an entire buffer delivered “as is” by the API.
Besides, whatever little waste does appear does not outweigh the benefits of this approach (see next question).
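To make the fence idea concrete, here is a hedged sketch of the synchronization pattern in GL terms (ARB_sync); the types and names are ours and only illustrate the pattern, not Ogre’s actual buffer manager:

```cpp
#include <GL/glcorearb.h>

struct PoolRegion
{
    GLsync fence = 0;   // signalled once the GPU has finished reading this region
};

// Call after submitting the draws that read from a region of the big buffer.
void protectRegion( PoolRegion &region )
{
    region.fence = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );
}

// Call before the CPU overwrites that region again.
void waitForRegion( PoolRegion &region )
{
    if( !region.fence )
        return;         // the GPU never used this region; safe to write immediately

    GLenum result = GL_UNSIGNALED;
    while( result != GL_ALREADY_SIGNALED && result != GL_CONDITION_SATISFIED )
    {
        // Flush so the fence actually reaches the GPU, then wait up to ~1 second.
        result = glClientWaitSync( region.fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                                   1000000000 /* nanoseconds */ );
        if( result == GL_WAIT_FAILED )
            break;      // handle the error as appropriate for the application
    }

    glDeleteSync( region.fence );
    region.fence = 0;
}
```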
Q: Why one big buffer?
Having one big buffer benefits all RenderSystems:
- Vertex and index buffers only need to be set/bound once (or at an extremely low frequency). This reduces API overhead tremendously. This affects D3D11, D3D12, GLES 2/3 & GL3+.
- If we also sort by vertex formats, the need to respecify the vertex attributes (GLES 2), rebind a different VAO (GL3+, GLES 3), reset the input layout (D3D11), or change the PSO (D3D12) is reduced to a bare minimum. This reduces API overhead tremendously.
- If we then sort by mesh, D3D11, GL3+ and GLES3 can use instancing to reduce the number of draw calls. Again, API overhead is reduced tremendously.
- In GL4, we can use MDI (MultiDrawIndirect), which means many different meshes can be rendered in a single draw call or be driven from a compute shader. It’s like instancing on steroids (see the sketch after this list).
- Creating and releasing buffers is virtually free (unless we need to create another pool or the user asked for his own exclusive buffer), instead of causing huge stalls.
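Here is a hedged sketch of what MDI looks like in GL4 terms; the command struct layout is mandated by ARB_multi_draw_indirect, while the surrounding names are illustrative rather than Ogre’s actual code:

```cpp
#include <GL/glcorearb.h>
#include <vector>

struct DrawElementsIndirectCommand      // layout mandated by the GL spec
{
    GLuint count;           // number of indices of this mesh
    GLuint instanceCount;   // how many instances of it to draw
    GLuint firstIndex;      // where its indices start inside the shared index buffer
    GLuint baseVertex;      // where its vertices start inside the shared vertex buffer
    GLuint baseInstance;    // e.g. an index into per-instance data (world matrices)
};

// Render many different meshes living in the same big buffer with a single API call.
void drawEverything( GLuint vao, GLuint indirectBuffer,
                     const std::vector<DrawElementsIndirectCommand> &cmds )
{
    glBindVertexArray( vao );   // one VAO for the whole pool, bound once

    // The command list could also live permanently on the GPU and be written by a
    // compute shader; here we simply upload it from the CPU.
    glBindBuffer( GL_DRAW_INDIRECT_BUFFER, indirectBuffer );
    glBufferData( GL_DRAW_INDIRECT_BUFFER,
                  cmds.size() * sizeof( DrawElementsIndirectCommand ),
                  cmds.data(), GL_DYNAMIC_DRAW );

    glMultiDrawElementsIndirect( GL_TRIANGLES, GL_UNSIGNED_SHORT, nullptr,
                                 (GLsizei)cmds.size(),
                                 sizeof( DrawElementsIndirectCommand ) );
}
```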
It’s not just about reducing API overhead, but also about mitigating bad GL implementations. Some GL driver implementations (particularly on mobile) are extremely inefficient and may stall rendering for random reasons at random intervals. The most efficient API calls are the ones we don’t make.
With this approach, we reduce the number of opportunities the driver has to screw us.
It also fits how the GPU actually works: just memory reads and writes, then telling the API to reinterpret that memory in a different way.
Q: Are there disadvantages from this “one big buffer” approach?
Yes, some memory will always be wasted (the concern from the previous question); however, it could be argued that with the former approaches this waste still existed, it was simply hidden inside the driver where we couldn’t see it.
Some old hardware (usually extremely old DX9 hardware, but possibly also a few old mobile devices) emulated vertex shaders in software, and for CPU cache efficiency reasons it would run the vertex shader over all the vertices in the buffer, not just the range being rendered. In this particular scenario, the big buffer approach could bring rendering performance to a crawl. We don’t expect this to be a problem in practice, but it’s worth mentioning.
More importantly, we’re in full control of that chunk of memory, which means we hinder the driver’s ability to swap the data in and out of GPU VRAM. For example, on a 1GB GPU you could usually request 1.2GB of meshes and textures without getting an out-of-video-memory exception, thanks to a technique known as paging.
As long as you don’t use those 1.2GB at once in the same frame, the driver swaps out the textures and vertex buffers that aren’t going to be used. If you were to render everything in the same frame, an out-of-video-memory exception is bound to happen.
However, by having “one big buffer” (or a couple of big buffers), the driver is unable to distinguish or separate unused data from used data, and thus has to keep the whole big buffer resident.
It might be worth mentioning that this paging is sometimes the reason some GL drivers cause slowdowns: they unnecessarily conclude that something needs swapping, and stall rendering for no reason.
This isn’t an unsolvable problem though. Ogre could do the paging explicitly. After all, in D3D12 and GL3+ we’ve got unsynchronized access and explicit fences, and D3D11 has relatively efficient subresource updates. Plus, paging can be done at higher levels (i.e. dividing a terrain into sections and paging according to the camera’s position), where a driver can only guess.
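As a purely hypothetical sketch of that higher-level paging (none of these types or functions exist in Ogre; they only show the idea of deciding residency from the camera’s position):

```cpp
#include <cmath>
#include <vector>

struct TerrainSection
{
    float centreX, centreZ;   // section centre in world units
    bool  resident;           // is its vertex data currently in the GPU pool?
};

// Decide, per terrain section, whether its data should live in the big buffer,
// instead of letting the driver guess which allocations are actually in use.
void pageTerrain( std::vector<TerrainSection> &sections,
                  float camX, float camZ, float residentRadius )
{
    for( TerrainSection &section : sections )
    {
        const float dx = section.centreX - camX;
        const float dz = section.centreZ - camZ;
        const bool shouldBeResident = std::sqrt( dx * dx + dz * dz ) < residentRadius;

        if( shouldBeResident && !section.resident )
        {
            // Upload its vertices into the big buffer (e.g. via an unsynchronized
            // map in GL3+/D3D12, or a subresource update in D3D11).
        }
        else if( !shouldBeResident && section.resident )
        {
            // Release its region of the pool so other sections can reuse it.
        }

        section.resident = shouldBeResident;
    }
}
```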
Explicit paging is probably something that will be implemented in the future. We’re still waiting for the technology to catch up and stabilize. AMD has its AMD_pinned_memory extension, which looks great but hasn’t been picked up by other vendors.
ARB_sparse_buffer looks terrific for implementing paging, but it was released just a couple of months ago.
Most mobile devices have shared GPU & CPU memory, but to our knowledge there is no standardized way to “just swap” its visibility from CPU to GPU and vice versa.
Update:
Q: What is this new “Item” class? When should I use it over Entity?
Item is replacing Entity and InstancedEntity (because Items can automatically instance themselves).
However, due to time and resource constraints, Items and Entities (and InstancedEntity) will coexist for a long time until all the features are ported.
For example, you will probably want to use “Item” to render most of your scene (static buildings, props, foes, players, characters, NPCs, etc).
However during a cinematic you may want to use pose animations (i.e. facial expressions) which are not yet supported by Items. So you will have to use an Entity there.
Or if you want to use software skeleton animation to work on the transformed vertices (i.e. medical and scientific research), you will have to use Entity.