In Ogre 2.1 we've modernized the whole engine. We started by adopting OpenGL 4.4 as our main API, and later added D3D11 as well.
It's been an enlightening experience. From a performance point of view there is no definitive answer "OpenGL is always faster" or "D3D11 is always faster", because the truth is, some of our demos (and real-world code) run faster on GL, others run faster on D3D11, and sometimes the results flip if you switch GPU vendors.
And sometimes, there is no difference at all. Assuming you’re using modern practices, that is.
It’s been a few months now, and this is what I could gather:
- GL gives you lower-level access to synchronization, with fences, memory barriers for asynchronous shaders (they're a bit lame because they're too coarse, but far better than D3D11) and Persistent Mapping. Hitting no stalls is very easy now (see the first sketch after this list). In contrast, D3D11 sucks on this end, and it's easy to hit DISCARD memory limits from the driver, which means stalls. Furthermore, the main problem is that map NO_OVERWRITE wasn't supported on constant & shader resource buffers until D3D11.1 (which means Windows 8.1). This is a huge problem for low level management. One workaround we sometimes do is to use a vertex buffer with NO_OVERWRITE and then a CopySubresourceRegion to the actual buffer. This can sometimes be even faster because the GPU may use a DMA engine to transfer the data from CPU to GPU. But this trick doesn't always work, and it also penalizes integrated GPUs, which don't have DMA engines and have limited bandwidth. Making NO_OVERWRITE on shader & constant buffers a Win 8.1-only feature was a dick move.
- When dealing with legacy code (aka "the v1 objects" in Ogre), which therefore has "less AZDO in it", the "DISCARD" semantics that were getting in the way in the previous point actually give D3D11 an edge. Users whose code relies on a lot of legacy v1 objects will notice that D3D11 may perform significantly faster than GL.
- HLSL shaders are often faster than their GLSL counterparts. No surprises here. The fxc compiler is very aggressive (though sometimes too aggressive) and the translation from D3D asm to the GPU ISA has been optimized to death thanks to a bazillion games relying on D3D.
- HLSL shaders are sometimes optimized too much, introducing precision errors or even blatantly incorrect results (which we've reported), whereas their GLSL counterparts produce the correct output.
- Compilation time for HLSL shaders is measured in seconds. Compilation time for GLSL shaders is measured in milliseconds. This is both good (shader optimizations) and bad (why does it still take 3+ seconds with the skip-optimization flag??? It smells like accidentally quadratic behaviour when parsing the syntax, triggered by arrays in const buffers: the bigger the array, the much, much longer it takes).
- D3D is often affected by driver-tweakable settings in the control panel, while GL generally is not. I was wondering why D3D was looking blurrier (and running faster) compared to GL, until I found out the AMD CCC settings were not set to "maximum quality". Once I did that, the blurriness was gone and the performance was more or less the same; in other words, GL applications were simply not affected by this setting. I've been hearing similar things from our NV testers.
- Texture management on GL is a royal PITA. Implementations have bugs everywhere. Can't blame them though. TextureStorage is reasonable, but it still lacks the well defined rules and the aggressive validation layer of D3D11. This is more of a problem when the end user has old drivers, but every now and then we hit a bug in the newest drivers, particularly with less-used features such as texture arrays, 3D textures, and cubemaps.
- Compatibility-wise, D3D11 is a clear winner for Intel cards. For the very latest cards with the latest drivers, compatibility is pretty much the same across all three major vendors. The biggest problem comes from older cards which rarely get driver updates anymore, and there D3D11 is more compatible: for Intel, it seems "old" means anything more than a year old; for AMD, the Radeon HD 4xxx series went into legacy while the HD 5xxx series still gets updates; for NV, it means anything older than the GeForce 4xx series. On "old" hardware, D3D11 is much more likely to run without glitches. However, remember that as mentioned above, sometimes HLSL is so thoroughly (and incorrectly) optimized that it produces bad results, ironically making GL more likely to run without glitches in those scenarios (vendor-specific).
- That being said, D3D11 can have very strange anomalies where switching to OpenGL can be an alternative that works (not to mention that Ogre itself can have bugs in the D3D11 implementation that may not be present in the GL implementation, or vice-versa).
- Compatibility-wise, GL is the only good choice on Linux. Mesa drivers are as of yet unable to run our samples, though. They've made significant progress, but it seems their GLSL compiler still produces bad output.
- D3D's separation of samplers and textures allows up to 16 samplers and 128 bound textures per shader stage. GL, having them fused together, is often limited to between 16 and 32 depending on the GPU model & driver (and for any practical use… that means 16). With 128 bound textures (which can also be texture arrays), we can almost literally emulate bindless. But I didn't even bother to try how that would perform, since GL is so limited.
- The way D3D treats UAVs like RenderTargets is annoying as hell compared to GL treating them like textures (see the second sketch after this list). They may have good reasons, being a state-tracking API, but it's a constant annoyance.
- With modern GL, performance on Windows is often about the same as on Linux. Sometimes Linux wins, sometimes it loses. There are several factors: GCC is often more aggressive at optimizing (which means it's our own code that runs faster), by default we compile 64-bit on Linux while on Windows we default to 32-bit, and Linux doesn't give a damn about GPU security (we suspect that in very specific scenarios the performance differences were due to Windows zeroing GPU memory before returning control to us, while Linux just handed us a buffer right away with garbage in it; that's a massive security breach). However, things like the Unity compositor in Ubuntu can severely degrade performance. And VSync on Linux is broken. Just broken. It's a miracle when it works, and often when it does, performance is severely degraded. Because of all this, our benchmarks vary wildly. These factors tend to cancel each other out, giving pretty much the same performance as on Windows, while sometimes one of them stands out under specific conditions (making it faster or slower: we're talking about +/- 30-40% differences here, they're not minor). And what do we compare against Windows? An out-of-the-box Ubuntu, which is quite common? Or a Lubuntu box, which is considerably faster at most of our benchmarks? At least they all look exactly the same, though Linux drivers tend to have a few more bugs.
- OS X is just… nevermind. Apple just doesn’t take 3D graphics seriously.
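To make the synchronization point more concrete, here is a minimal sketch of the persistent mapping + fence pattern that modern GL allows. This is illustrative C++ against raw GL 4.4, not Ogre's actual buffer manager; the triple-buffering scheme, REGION_SIZE and the frame-loop structure are assumptions:

    // Assumes an active GL 4.4 context with function pointers already loaded.
    const GLsizeiptr REGION_SIZE = 4 * 1024 * 1024;  // per-frame region, illustrative
    const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;

    GLuint buffer;
    glGenBuffers( 1, &buffer );
    glBindBuffer( GL_ARRAY_BUFFER, buffer );
    glBufferStorage( GL_ARRAY_BUFFER, 3 * REGION_SIZE, nullptr, flags ); // triple buffered
    char *basePtr = static_cast<char*>(
        glMapBufferRange( GL_ARRAY_BUFFER, 0, 3 * REGION_SIZE, flags ) ); // mapped once, kept forever

    GLsync fences[3] = { 0, 0, 0 };
    int slot = 0;

    // Every frame:
    // 1. Wait (only if needed) until the GPU has finished with the region we're about to reuse.
    if( fences[slot] )
    {
        while( glClientWaitSync( fences[slot], GL_SYNC_FLUSH_COMMANDS_BIT,
                                 1000000 /*ns*/ ) == GL_TIMEOUT_EXPIRED )
            ;  // real code would do useful work here, or at least record the stall
        glDeleteSync( fences[slot] );
        fences[slot] = 0;
    }
    // 2. Write this frame's data straight through the mapped pointer. No Map/Unmap, no DISCARD.
    //    memcpy( basePtr + slot * REGION_SIZE, cpuData, bytesWritten );
    // 3. Issue the draws that read from that region, then fence it and move on.
    fences[slot] = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );
    slot = (slot + 1) % 3;

If the GPU is done with a region by the time we cycle back to it three frames later, the wait never blocks; that is what makes "hitting no stalls" so easy.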
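And to illustrate the UAV complaint, this is roughly the difference in binding models (hypothetical helper functions; slot numbers and formats are made up, and the views/textures are assumed to have been created elsewhere):

    // D3D11: pixel shader UAVs are bound through the Output Merger, alongside the
    // render targets, and they share slot space with the RTVs.
    void bindOutputsD3D11( ID3D11DeviceContext *ctx, ID3D11RenderTargetView *rtv,
                           ID3D11DepthStencilView *dsv, ID3D11UnorderedAccessView *uav )
    {
        ctx->OMSetRenderTargetsAndUnorderedAccessViews(
            1, &rtv, dsv,
            1,              // UAV slots start right after the render target(s)
            1, &uav, nullptr );
    }

    // OpenGL: a "UAV" is just an image unit binding, conceptually much closer to binding a texture.
    void bindOutputGL( GLuint texture )
    {
        glBindImageTexture( 0 /*unit*/, texture, 0 /*mip*/, GL_FALSE, 0 /*layer*/,
                            GL_WRITE_ONLY, GL_RGBA8 );
    }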
Well… both APIs have their own quirks. There is no definitive winner. If you stick strictly to modern practices (i.e. our "v2" interface), you will get lower overhead on GL. However, we can't say better performance overall, since the HLSL shaders often perform slightly faster (and, very rarely, significantly faster).
So it ends up being a combination of where your bottleneck lies (CPU vs GPU), how good the driver is at compiling the GLSL shader, and how good it is at translating the D3D asm to the GPU ISA.
I have yet to determine how much of a difference "asynchronous compute shaders / UAVs" can make between GL (with explicit-but-coarse memory barriers) and D3D11 (with implicit memory barriers on every single invocation).
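For reference, the GL side of that comparison looks roughly like this (a minimal sketch; the program and group counts are illustrative):

    // Run a compute pass, then insert one explicit (and coarse) barrier before anything
    // that reads its output. D3D11 effectively inserts the equivalent implicitly around
    // every dispatch/UAV bind, with no way to batch or omit it.
    glUseProgram( computeProgram );          // assumed: a linked compute program
    glDispatchCompute( numGroupsX, 1, 1 );

    // One barrier covers *all* image and SSBO writes so far; you can't be finer-grained
    // than these bit categories, but at least you decide where (and how often) it happens.
    glMemoryBarrier( GL_SHADER_IMAGE_ACCESS_BARRIER_BIT | GL_SHADER_STORAGE_BARRIER_BIT );

    // ...now issue the draws (or further dispatches) that consume those results.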
Since Ogre is able to switch between both at runtime, the best answer is to try to support both APIs and let the user decide.
With D3D12 around the corner and Vulkan in the works, this post may sound very late. But if you care about cross platform compatibility and still supporting legacy hardware, it can still be very useful.
It’s late, so I could be talking out of my bum, but:
>The way D3D treats UAVs like RenderTargets is annoying as hell compared to GL treating them like textures. They may have good reasons, being a state-tracking API, but it's a constant annoyance.
Don't bind them as OM UAVs then? Just use a RWTexture2D or something and you can still do PSSetShaderResources.
Just do this:
RWTexture2D rwt2d : register(t1);
instead of this:
RWTexture2D rwt2d : register(u1);
Create a normal Texture2D (with the SHADER_RESOURCE bind flag), and then use CreateShaderResourceView and then simply do PSSetShaderResources. Should all work.
AFAIK that is invalid on D3D11. But I admit, I haven’t tried to force it.
Would be interesting if it actually works. I should check.
I’ve tried it now, and it’s invalid.
Importantly, the HLSL compiler takes:
RWTexture2D rwt2d : register(t1);
and makes it:
dcl_uav_typed_texture2d (float,float,float,float) u1
without issuing any complaints or notices.
Can you elaborate on your problems with OS X? Feels a bit dismissive and non-informative.
Apple is stuck on GL 4.1, which is an API from around 2010 (and it took them ages to update from their previous iteration).
But the problem isn't so much that they're stuck on a 5-year-old API; the problem is that there is a before and after with GL 4.2.
A lot of very important features were added in 4.2 and 4.3, all of which OS X lacks:
* Lots of new GLSL shader syntax & keywords that are key to achieving low overhead, high performance rendering (see https://www.opengl.org/registry/specs/ARB/shading_language_420pack.txt).
* Lots of "range" binding functions such as glTexBufferRange (see https://www.opengl.org/sdk/docs/man/docbook4/xhtml/glTexBufferRange.xml). This one is *very* important. It lets us map a whole buffer, update it, and then expose only partial chunks to the shader, instead of having many buffers and mapping every one of them every frame, which is a massive performance sink (see the sketch after this list).
* GL_ARB_base_instance (this one is *very* important for efficient rendering; it lets the developer draw multiple meshes without having to change or respecify the VAO, while also sending a draw ID to the shader; VAO changes are performance eaters).
* GL_ARB_buffer_storage (persistent mapping).
* Other extensions (such as MultiDrawIndirect, SSBOs, Image Store) which we could live without, but would be nice to have.
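To illustrate the glTexBufferRange point, the pattern we want (and which GL 4.1 cannot express) looks roughly like this; the chunk size, format and the frame counter are illustrative:

    // One big buffer holding several frames' worth of per-draw data.
    const GLsizeiptr CHUNK_SIZE = 256 * 1024;   // illustrative
    GLuint tbo, tex;
    glGenBuffers( 1, &tbo );
    glBindBuffer( GL_TEXTURE_BUFFER, tbo );
    glBufferData( GL_TEXTURE_BUFFER, 3 * CHUNK_SIZE, nullptr, GL_DYNAMIC_DRAW );

    glGenTextures( 1, &tex );
    glBindTexture( GL_TEXTURE_BUFFER, tex );

    // Per frame: update one chunk, then expose *only* that chunk to the shader.
    // (The offset must respect GL_TEXTURE_BUFFER_OFFSET_ALIGNMENT.)
    glTexBufferRange( GL_TEXTURE_BUFFER, GL_RGBA32F, tbo,
                      frame * CHUNK_SIZE, CHUNK_SIZE );

    // On GL 4.1 the closest you get is glTexBuffer(), which always exposes the whole
    // buffer; in practice that pushes you towards many small buffers mapped every frame.

Here frame is assumed to cycle through 0, 1, 2.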
Any OpenGL renderer that wants to be considered "modern" uses the feature set provided by 4.2+ implementations (and preferably 4.3+). So either you keep two completely different code paths (one for OS X, another for the rest), or you target OS X as the bare minimum and severely penalize performance on all platforms.
Usually you will already have an additional codepath for mobile, targeting GLES2 and/or GLES3. However, that downgrades OS X's OpenGL capabilities to GLES3 levels.
For comparison, it's as if Windows 7 were stuck somewhere between Direct3D 9 and 10, while Windows 8.1 is at D3D 11.1.
The difference is massive.
Even Metal on iOS can do more and better than OGL 4.1 can do on OS X.
A company that is stuck on OpenGL 4.1 and takes more than half a decade between each update (without even considering the implementation bugs) is sending a very clear message: They do not care.