The threading results were a bit disappointing. I was expecting a 4x improvement on my quad-core machine; the actual result was a 35-40% improvement, which is still something.
After a closer look, I ended up concluding that at 250×250 instances we’re memory bandwidth bound (i.e. disabling all the math in UpdateAllTransforms and just doing a raw copy of the positions to the derived positions did nothing for performance).
I tried very hard with other approaches to optimize UpdateAllTransforms & UpdateAllBounds that should theoretically reduce cache misses (like doing an AoS to SoA conversion using shufps instead of simple movss), but they required a few extra bytes per instance and the framerate only decreased.
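For those curious, the kind of conversion I mean looks roughly like this (a minimal illustrative sketch, not the actual code I tried; the xyzw-per-instance layout and the function name are made up for the example):

```cpp
#include <xmmintrin.h>

// Turn 4 instances stored as AoS (x,y,z,w per instance) into SoA
// (all x's together, all y's together, etc.). All pointers are assumed
// to be 16-byte aligned. _MM_TRANSPOSE4_PS expands to shufps/unpck
// instructions under the hood.
void aosToSoa4( const float *aos, float *outX, float *outY, float *outZ, float *outW )
{
    __m128 r0 = _mm_load_ps( aos +  0 ); // x0 y0 z0 w0
    __m128 r1 = _mm_load_ps( aos +  4 ); // x1 y1 z1 w1
    __m128 r2 = _mm_load_ps( aos +  8 ); // x2 y2 z2 w2
    __m128 r3 = _mm_load_ps( aos + 12 ); // x3 y3 z3 w3

    _MM_TRANSPOSE4_PS( r0, r1, r2, r3 );

    _mm_store_ps( outX, r0 ); // x0 x1 x2 x3
    _mm_store_ps( outY, r1 ); // y0 y1 y2 y3
    _mm_store_ps( outZ, r2 ); // z0 z1 z2 z3
    _mm_store_ps( outW, r3 ); // w0 w1 w2 w3
}
```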
The only solution I can think of for that problem is doing what is done with large matrices: right now we update the transforms of all nodes, then all bounds, then cull all of them.
Perhaps if we did this in smaller batches it could be possible to keep everything in the cache (the data of 62,500 instances does NOT fit in my 2x6MB cache, do the math), i.e. update 1000 nodes, then update their 1000 bounds, then cull them; repeat. But this is very hard to pull off correctly.
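To make the idea concrete, here is a rough sketch of that batched approach (purely illustrative: the helper functions and the batch size of 1000 are placeholders, not Ogre’s actual API):

```cpp
#include <algorithm>
#include <cstddef>

// Stand-ins for the real per-node/per-object routines (placeholders only).
void updateTransform( size_t /*nodeIdx*/ ) {}
void updateBounds( size_t /*objIdx*/ )     {}
void cullObject( size_t /*objIdx*/ )       {}

// Walk the scene in cache-sized batches instead of doing three full passes
// (all transforms, then all bounds, then all culling) over 62,500 instances.
void updateAndCullBatched( size_t numNodes )
{
    const size_t batchSize = 1000; // arbitrary; would need tuning to the cache size
    for( size_t i = 0; i < numNodes; i += batchSize )
    {
        const size_t end = std::min( i + batchSize, numNodes );
        for( size_t j = i; j < end; ++j )
            updateTransform( j );   // derived position/orientation/scale
        for( size_t j = i; j < end; ++j )
            updateBounds( j );      // world AABB from the fresh transform
        for( size_t j = i; j < end; ++j )
            cullObject( j );        // frustum test while the data is still hot
    }
}
```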
Anyway I’m happy with the improvements over Ogre 1.x; I can’t refactor over recently refactored code otherwise we would never get anywhere.
The other bottlenecks, I suspect, could be living in the parts I didn’t touch: RenderQueue and AutoParamsData.
One thing is awesome though, and that is I can run 100×100 rotating instances at 60 fps… IN DEBUG MODE 🙂 🙂 🙂
Debug mode is a lot less bandwidth bound (for obvious reasons), so the benefit of threading is usually more visible there.
Test
The tests are similar to the previous ones, and were done with 4 threads. Do not compare the framerates against the previous posts, as I’ve changed the number of instances per batch (which bumped performance) to get rid of the slowdown effects from the API; I was trying to benchmark the threading here.
Not a silver bullet
Before I begin, I don’t even need my usual disclaimers because there’s something I have to tell you:
Threading results can vary a lot. It depends on your scene and how many entities are in the scene. In some cases you may not see a performance increase at all. I’ve seen the CPU usage stay around 50%, or go up to 100% when not looking at anything (since in that case all of the work Ogre has to do is threaded).
I’m cheating?
After launching the profiler, it came to my attention that the test was spending significant CPU time inside SceneNode::roll; that is, animating the cubes.
So what did I do? I let Ogre run that in parallel too. “But that’s cheating!” you may think. Well, the thing is, in Ogre 1.x it was impossible to call roll or setPosition from multiple threads. The SceneManager would crash (more quickly with Octree) or the Node would end up in an invalid state due to race conditions.
In Ogre 2.x it’s perfectly safe to call roll, setOrientation, setPosition or setScale from multiple threads, as long as it is not the same Node in both threads. So I thought it was a valid excuse to do them in parallel.
“But it won’t happen in a real game!” you may say… ummm… yes, it will happen! A real game probably runs Bullet, Havok or PhysX, and positions & orientations are copied every frame from the physics engine to the graphics engine. Now you can do that copy from multiple threads.
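As a concrete example, that per-frame copy could look something like this (a minimal sketch using std::thread for brevity rather than Ogre’s own threading facilities; the function and container names are made up):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

#include <OgreSceneNode.h>
#include <OgreVector3.h>

// Copy positions coming from the physics engine to the scene nodes, split
// across several threads. Each thread touches a disjoint range of nodes,
// which is exactly the condition that makes this safe in Ogre 2.x.
void copyPhysicsPositions( std::vector<Ogre::SceneNode*> &nodes,
                           const std::vector<Ogre::Vector3> &physicsPositions,
                           size_t numThreads )
{
    std::vector<std::thread> workers;
    const size_t perThread = ( nodes.size() + numThreads - 1 ) / numThreads;

    for( size_t t = 0; t < numThreads; ++t )
    {
        const size_t begin = t * perThread;
        const size_t end   = std::min( begin + perThread, nodes.size() );
        workers.emplace_back( [&nodes, &physicsPositions, begin, end]()
        {
            // No locking: different threads never write to the same node.
            for( size_t i = begin; i < end; ++i )
                nodes[i]->setPosition( physicsPositions[i] );
        } );
    }

    for( std::thread &w : workers )
        w.join();
}
```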
If you still think I’m cheating, you can download the source code, disable the aforementioned code, and post your own results. But if you do that, don’t call roll from a single thread, that’s unfair.
Results
“Dynamic Rotating” is the demo with the cubes rotating.
“Dynamic” is the demo with the cubes not rotating, but created with SCENE_DYNAMIC.
“Static” is the demo with the cubes not rotating, and created with SCENE_STATIC flag.
I’ve normalized the CPU usage to the range [0; 400%], where 400% means all cores running at 100%; 100% can mean either one core running at 100% or all four at 25% each.
| Test – Dynamic Rotating | 4 threads | 1 thread | Speedup |
|---|---|---|---|
| All entities | 20.01 ms – 240% CPU | 28.50 ms – 100% CPU | 1.42x |
| Few entities | 11.06 ms – 320% CPU | 18.60 ms – 100% CPU | 1.68x |
| No entities | 10.88 ms – 348% CPU | 18.20 ms – 100% CPU | 1.67x |
| Test – Dynamic | 4 threads | 1 thread | Speedup |
|---|---|---|---|
| All entities | 16.70 ms – 208% CPU | 20.50 ms – 100% CPU | 1.22x |
| Few entities | 7.50 ms – 340% CPU | 10.70 ms – 100% CPU | 1.43x |
| No entities | 7.32 ms – 332% CPU | 9.99 ms – 100% CPU | 1.36x |
| Test – Static | 4 threads | 1 thread | Speedup |
|---|---|---|---|
| All entities | 10.40 ms – 110% CPU | 10.50 ms – 100% CPU | 1.01x |
| Few entities | 1.00 ms – 132% CPU | 1.00 ms – 100% CPU | 1.00x |
| No entities | 0.69 ms – 136% CPU | 0.59 ms – 100% CPU | 0.86x |
As is evident, attaining more than a 1.5x improvement is possible, though hard to achieve in practice. Still, overall it’s something; especially when gamers are so desperate to get their games to reach the 60 fps mark. And if your engine already runs its logic from another thread, getting all cores at full utilization becomes a real possibility.
The static test “few entities” had no speedup at all (1.00x) but was coloured in red because I noticed a significant amount of extra CPU was used in the threaded version, which may affect battery life on mobile.
Maybe more Ogre devs (including the community) can help track down the inefficiencies that are preventing scalability. And take note that not everything is being threaded; particularly the RenderQueue, which should in theory be a candidate for threading.
Furthermore, our threading system is synchronous (for simplicity), but it might be worth researching whether an async approach could gain a bit of extra speedup.
Now I have to focus on providing a cross-platform facility to autodetect the number of cores in a machine.
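For reference, such a facility could look roughly like this (an illustrative sketch, not what will necessarily land in Ogre; the function name is made up):

```cpp
#include <cstddef>

#if defined( _WIN32 )
    #include <windows.h>
#else
    #include <unistd.h>
#endif

// Returns the number of logical cores reported by the OS.
// (With C++11, std::thread::hardware_concurrency() is another option,
//  though it is allowed to return 0 when the value is unknown.)
size_t getNumLogicalCores()
{
#if defined( _WIN32 )
    SYSTEM_INFO sysInfo;
    GetSystemInfo( &sysInfo );
    return static_cast<size_t>( sysInfo.dwNumberOfProcessors );
#else
    long count = sysconf( _SC_NPROCESSORS_ONLN ); // cores currently online
    return count > 0 ? static_cast<size_t>( count ) : 1;
#endif
}
```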
Reproducing results
As always, here is the data you need to reproduce my results. I’m very interested in hearing your results: my quad core has a 2x6MB L2 cache, which means 2 cores share one 6MB L2 cache, while the other 2 share another 6MB L2 cache.
When going multithreaded, my system is able to use the full 12MB of cache for Ogre, which could be another explanation for the speed increase.
AFAIK on newer Intel systems the L2 cache is private to each core, but all cores share the L3 cache. Also, I haven’t tested AMD systems. Results could indeed be very interesting if they happen to vary much from mine.
- Precompiled pack: Mediafire Mirror, Yosoy Games Mirror
- Source Code
Hi Matias
Very interesting results. I have done a very quick benchmark on a couple of machines which you may be interested in.
Standard Desktop machine (Core 2 quad, Q9550, 2x6MB L2)
– This was a little disappointing; I won’t go into detail but the most I really got was a 130% speedup. When I say disappointing, any speedup is good!
Rendering Machine (2*Xeon, 2*X5650, with 6 × 256 KB L2, 12MB L3, on each processor)
Dyn Rot Few – 240% speedup
Dyn Rot many – 201% speedup
Dyn Rot none – 256% speedup
Dynamic Few – 402% speedup 🙂 !! ??
Dynamic many – 260% speed up
Dynamic none – 343 % speedup
static – Almost exactly the same results as yours.
Now although this machine has 12 physical cores, I am using your pre-compiled version which appears to be locked to 4 threads. So when I got over 400% speedup I was a little dubious! Anyway, I ran a few more tests and there can be a 25% difference between runs of the same test. For example, if I ran Dyn rotation on 1 core it might start at 15 ms frames; if I closed it and started again it would then be 20 ms. This happens with both the multi-threaded and single-threaded versions and is always around a 25% difference.
The good news is that if you take this into account we are still seeing over 300% speedup, which is great :). However, this is an interesting phenomenon and would be good to understand. It could be a bug in your code relating to how data is aligned in memory, although I think it’s more likely to do with the fact that the computer has 2 CPUs on one motherboard. Hopefully some people can run their own tests. I will also try and compile the test for a 12 core machine.
All in all, these are some very promising results. I guess it’s animations next 🙂 It would also be interesting to do this scene comparison with Ogre 1.x.
Keep up the awesome work!
Gus
Wow thanks! Indeed your “Standard Desktop” is very similar to my machine (Intel Core 2 Quad QX9650 @3GHz, 2x6MB L2; pretentious name, but sounds awesome)
As for your rendering machine… great! Another user with an L3 cache also posted performance improvements. This could mean we’re doing a lot of synchronization or colliding on some shared data (perhaps a hidden variable with write access from all cores?)
Also, we would need to see if RAM bandwidth is affecting results. Newer machines tend to have more memory bandwidth.
Your 25% difference can be explained in many different ways, depending on how the scheduler splits the threads (all of them on the same processor, one of them on another processor, etc.); or due to memory alignment, as you said.
Multithreading is a complex issue, and I’m sure my initial implementation can be improved further. The big win is that despite how we do it, the new code is multithreaded-ready; as opposed to Ogre 1.x where running in parallel was unthinkable (there were SO many race conditions that could happen!)
Thanks for posting the results.
I’m looking forward to testing on my laptop at home, which has a 3rd gen Intel Core CPU.
I did some more tests on the 25% difference issue, and it turns out that one of the CPUs is 25% faster than the other! If I leave the test running and change the CPU affinity it is very obvious the second processor is slower. No idea why, but it proves it’s an issue with that machine! I can say that I definitely get around 200-300% speedup, maybe even a bit more, when using either CPU.
Now, I am just compiling your demo to see how much faster it may go if I use all the cores on the rendering machine, or at least see when memory bandwidth becomes an issue. Firstly, it did not compile, as OgreBarrierWin.cpp includes “OgreBarrier.h” when I think it should include “Threading/OgreBarrier.h”. Secondly, what culling method were you using for the above results, i.e. one of the input variables to the scene manager?
Cheers
Gus
Yes, it should be “Threading/OgreBarrier.h”. I have no idea why it is working on my machine. Thanks for the report.
Fixed locally. Will later commit.
For the 4 threads version, this is what I did:
const size_t numThreads = 4;
InstancingThreadedCullingMethod threadedCullingMethod = INSTANCING_CULLING_THREADED;
// Create the SceneManager, in this case a generic one
mSceneMgr = mRoot->createSceneManager(ST_GENERIC, numThreads, threadedCullingMethod, "ExampleSMInstance");
Here’s mine.
(i7 3930k hex core with 6 x 256 KB L2, 12MB L3, geforce gtx 680)
All objects visible:
dynamic 1 thread rotating: 13.02ms
dynamic 4 thread rotating: 5.33ms
dynamic 1 thread: 6.88ms
dynamic 4 thread: 3.64ms
static 1 thread: 2.88ms
static 4 thread: 2.42ms
No objects visible:
dynamic 1 thread rotating: 10.6ms
dynamic 4 thread rotating: 3.31ms
dynamic 1 thread: 4.38ms
dynamic 4 thread: 1.60ms
static 1 thread: 0.25ms
static 4 thread: 0.25ms