There’s a lot of mysticism regarding inline & __forceinline, even though it is documented. What Microsoft tells us is that __forceinline turns off the cost/benefit analysis of inlining and tries to honour the developer’s request (note that it *tries*; it can’t always do so).
Often developers are told the compiler knows what’s best and that they should not force anything. But this isn’t always true, especially when the compiler doesn’t have information that we do.
And here’s a real-world example. Beware: a lengthy explanation is needed first.
Instance HW VTF
In Ogre, we have an Instance Manager to manage instanced entities. The Instance Manager has many implementations; one of them is ‘HW VTF’, which in plain English means we use hardware instancing frequency to repeat drawing of the same object, and a VTF (vertex texture fetch) to store the skeletal animation matrices (or the world matrices if the instances aren’t animated).
But it doesn’t end here. HW VTF has a lot of functionality that has been incorporated through contributions:
- No animations (1 matrix per instance)
- Animations (N matrices per instance)
- LUT Animations: LUT stands for Look Up Table. Basically we have N instances, but M animations are calculated (where N > M); and then reuse/repeat animations for the remaining instances. This is great for achieving high framerate with very large crowds.
- Dual Quaternion Skinning
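The LUT idea above can be sketched in a few lines. This is a minimal illustration of the concept, not Ogre’s actual implementation; all names here are invented:

```cpp
#include <cstddef>
#include <vector>

//Hypothetical sketch of the LUT technique: only M animation poses are
//actually computed, and each of the N instances indexes into that table,
//so many instances share (repeat) the same pose.
struct Pose { float matrix[12]; }; //One 3x4 skeletal matrix, simplified

std::vector<Pose> computePoses( std::size_t M )
{
    return std::vector<Pose>( M ); //Pretend these are M animated poses
}

//Map instance i to one of the M precomputed poses.
std::size_t lutIndex( std::size_t i, std::size_t M )
{
    return i % M; //Simplest reuse scheme; Ogre may use a different mapping
}
```

With N = 1000 instances and M = 16 animations, only 16 poses get calculated but all 1000 instances receive one, which is where the crowd-rendering speedup comes from.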
While this variety gives +10 points for flexibility, it’s a pain in terms of performance & maintenance. I wrote the original HW VTF technique (which only supported no animations & animations), and my original straight loop to copy matrices from RAM to the GPU’s VTF ended up a nasty loop full of branches: if animated then … else if LUT_enabled … else if dualquat …
I had to clean that up. The problem is: how do I move so many evaluations out of the inner loop? So I decided to take another approach. I could have written 4 separate loops, but that was cumbersome (and definitely bad practice; I’d be repeating myself 4 times) because of the problem I’ll describe next.
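To make the problem concrete, the branchy inner loop looked roughly like this. This is a reconstruction with invented helper names, not the actual Ogre code; the key point is that flags which are constant for the whole loop get re-evaluated on every single instance:

```cpp
#include <cstddef>
#include <string>
#include <vector>

//Reconstruction of the "nasty loop full of branches": animated, lutEnabled
//and dualQuat never change during the loop, yet they are tested per instance.
//The strings are a stand-in for actually writing matrices to the VTF.
std::vector<std::string> copyMatrices( std::size_t numInstances,
                                       bool animated, bool lutEnabled,
                                       bool dualQuat )
{
    std::vector<std::string> writes;
    for( std::size_t i = 0; i < numInstances; ++i )
    {
        if( animated )
            writes.push_back( "animated" );
        else if( lutEnabled )
            writes.push_back( "lut" );
        else if( dualQuat )
            writes.push_back( "dualquat" );
        else
            writes.push_back( "single" );
    }
    return writes;
}
```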
Threading readiness
Ogre 2.0 is written with threading in mind. Although the threading itself is not ready (nor working), the foundation is being laid. This means we could cull N entities in M threads, so each core would only have to cull N / M entities.
The approach we use is very simple: because all our entities are contiguous (in virtual address space), we just divide the entities to process into M sections. Imagine M = 2, for example a dual-core machine. The pseudocode would look like this:
vector<MovableObject*> entitiesToCull;
vector<MovableObject*> culledEntities[2]; //One per thread

//Thread 0
for( size_t i=0; i < entitiesToCull.size() / 2; ++i )
{
    if( isVisible( entitiesToCull[i] ) )
        culledEntities[0].push_back( entitiesToCull[i] );
}

//Thread 1
for( size_t i=entitiesToCull.size() / 2; i < entitiesToCull.size(); ++i )
{
    if( isVisible( entitiesToCull[i] ) )
        culledEntities[1].push_back( entitiesToCull[i] );
}
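A runnable sketch of that split, generalized to M threads, could look like the following. This uses std::thread for illustration; the real Ogre code differs, and both MovableObject and isVisible here are simplified stand-ins:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

//Simplified stand-ins; the real types are Ogre's.
struct MovableObject { bool visible; };
static bool isVisible( const MovableObject *mo ) { return mo->visible; }

//Split the contiguous entity list into numThreads sections. Each thread
//appends only to its own output vector, so no locking is needed.
void cullInParallel( const std::vector<MovableObject*> &entitiesToCull,
                     std::vector<std::vector<MovableObject*> > &culledEntities,
                     std::size_t numThreads )
{
    culledEntities.assign( numThreads, std::vector<MovableObject*>() );
    std::vector<std::thread> threads;
    const std::size_t perThread = entitiesToCull.size() / numThreads;

    for( std::size_t t = 0; t < numThreads; ++t )
    {
        const std::size_t begin = t * perThread;
        const std::size_t end = ( t == numThreads - 1 ) ? entitiesToCull.size()
                                                        : begin + perThread;
        threads.emplace_back( [&, t, begin, end]()
        {
            for( std::size_t i = begin; i < end; ++i )
            {
                if( isVisible( entitiesToCull[i] ) )
                    culledEntities[t].push_back( entitiesToCull[i] );
            }
        } );
    }

    for( std::thread &th : threads )
        th.join();
}
```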
Sounds nice, right? So far, so good.
The problem comes once we’re done: we have to collect the data from all threads. Updating the VTF means our loop will end up more or less like this:
//First loop: threads
for( int i=0; i<numThreads; ++i )
{
    //Second loop: Culled entities per thread
    for( int j=0; j<culledEntities[i].size(); ++j )
    {
        //No animations for simplicity
        *vtfPtr++ = culledEntities[i][j]->getMatrix();
    }
}
If we’re going to write this double loop once per technique (no animation, with animation, LUT animation, dual quaternion), it’s going to be a nasty piece of work to maintain (or even to understand what’s going on).
Templates to the rescue
For those familiar with C#, this will sound very similar to LINQ. The idea is to let the compiler write the loops for us and execute a custom function every iteration (one function for each technique).
We use it in combination with std::for_each()
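Before showing the real code, here is a minimal, standalone illustration of the pattern (all names invented, not the Ogre code): an outer functor iterates the per-thread containers, applying an inner “technique” functor to every element via std::for_each. Note the inner functor holds a pointer to its accumulator, since std::for_each takes its function object by value:

```cpp
#include <algorithm>
#include <vector>

//Outer functor: for each per-thread container, run the inner functor
//over every element via std::for_each.
template <typename T>
struct PerThreadOperator
{
    T &mInner;
    PerThreadOperator( T &inner ) : mInner( inner ) {}
    void operator () ( const std::vector<int> &perThread )
    {
        std::for_each( perThread.begin(), perThread.end(), mInner );
    }
};

//Inner functor: the "technique". Here it just sums the values; it stores
//a pointer so that the copies std::for_each makes all share the total.
struct Sum
{
    int *mTotal;
    explicit Sum( int *total ) : mTotal( total ) {}
    void operator () ( int v ) { *mTotal += v; }
};
```

Usage: with per-thread data {1, 2} and {3, 4}, running `std::for_each( data.begin(), data.end(), PerThreadOperator<Sum>( sum ) )` visits every element once and the compiler generates both loops for us.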
The template works like this (this is simplified actual code from Ogre 2.0):
//This template does the magic of creating the code for us,
//and we use it on every technique
template <typename T>
struct VisibleObjsPerThreadOperator
{
    T &mSecondOperator;

    VisibleObjsPerThreadOperator( T &secondOperator ) :
        mSecondOperator( secondOperator ) {}

    void operator () ( const MovableObject::MovableObjectArray &visibleObjects )
    {
        std::for_each( visibleObjects.begin(), visibleObjects.end(),
                       mSecondOperator );
    }
};

//This is the "no animation" technique
struct SendAllSingleTransformsToTexture
{
    float *mDest; //Pointer to VTF texture

    SendAllSingleTransformsToTexture( float *dstPtr ) : mDest( dstPtr ) {}

    inline void operator () ( const MovableObject *mo )
    {
        //Write transform matrix
        mo->writeSingleTransform3x4( mDest );
        mDest += 12;
    }
};

if( notAnimated )
{
    SendAllSingleTransformsToTexture sender( pDest );
    std::for_each( &culledEntities[0], &culledEntities[2],
                   VisibleObjsPerThreadOperator<SendAllSingleTransformsToTexture>( sender ) );
}
This works great! Now I just need to define another struct for the animated version (just like SendAllSingleTransformsToTexture) and the compiler writes the code for us. It’s a lot more maintainable (especially since the per-thread data iteration code is now automated) and a lot easier to read when you’re looking at the function that updates the VTF (there’s a lot more going on there).
Inline: The right and wrong sections of code
First of all, let me clarify that all the functions involved are either implicitly inlined, explicitly inlined, or templates. When we look at the generated assembly, some of the functions look fine, like this (pseudo-assembly):
NoAnimations:
    call std@for_each@VisibleObjsPerThreadOperator@NoAnimations

std@for_each@VisibleObjsPerThreadOperator@NoAnimations:
    dec threadnum
    cmp threadnum, 0
    call SendAllSingleTransformsToTexture@operator()
    jne std@for_each@VisibleObjsPerThreadOperator@NoAnimations
    ret

SendAllSingleTransformsToTexture@operator():
    dec instance_count
    cmp instance_count, 0
    //writeSingleTransform3x4() code gets inlined properly
    movaps xmm0, [eax]
    movntps [esi], xmm0
    movaps xmm0, [eax+16]
    movntps [esi+16], xmm0
    movaps xmm0, [eax+32]
    movntps [esi+32], xmm0
    movaps xmm0, [eax+48]
    movntps [esi+48], xmm0
    //Here ends code from writeSingleTransform3x4()
    add eax, 64
    add esi, 64
    jne SendAllSingleTransformsToTexture@operator()
    ret
This is really great assembly. The compiler evaluated that inlining all loops would thrash the L1 cache, so it decided it’s better to call once per thread, and then inline writeSingleTransform3x4. Really nice.
Definitely this turned out to be even better than hand-writing the loop for every technique.
Except… for one problem. When we go back to the other techniques, which are a bit more complex (for example, animations), the generated assembly looks like this (pseudocode):
WithAnimations:
    call std@for_each@VisibleObjsPerThreadOperator@WithAnimations

std@for_each@VisibleObjsPerThreadOperator@WithAnimations:
    dec threadnum
    cmp threadnum, 0
    //Code from SendAllAnimatedTransformsToTexture@operator() got inlined this time
SendAllAnimatedTransformsToTexture@operator():
    dec instance_count
    cmp instance_count, 0
    call writeAnimationTransform3x4 //Oops
    jne SendAllAnimatedTransformsToTexture@operator()
    jne std@for_each@VisibleObjsPerThreadOperator@WithAnimations
    ret
This is really bad. The compiler again decided the code was too large, but this time inlined everything except the biggest hotspot! We get one call for each instance there is!
But of course the compiler doesn’t know where the hotspot is! No matter how much “more clever than you” people tell you the compiler is, it has no way to know there are far more instances than threads! For all it knows, there could be a million threads and nearly zero instances, in which case this generated assembly would work great.
But the compiler doesn’t know that. All it sees is that there is data to iterate.
Note that the call overhead is more than just one instruction; I was showing pseudocode back there. An actual call involves multiple push & pop instructions and other similar overhead that disappears when the code gets inlined.
__forceinline saves the day
Adding __forceinline to both writeAnimationTransform3x4 and the operator fixed the problem. The compiler still uses its cost/benefit analysis on the rest of the inlined functions:
//This is the "with animation" technique
struct SendAllAnimatedTransformsToTexture
{
    float *mDest; //Pointer to VTF texture

    SendAllAnimatedTransformsToTexture( float *dstPtr ) : mDest( dstPtr ) {}

    __forceinline void operator () ( const MovableObject *mo )
    {
        //Write transform matrix
        //writeAnimationTransform3x4 also had to be declared with __forceinline
        mo->writeAnimationTransform3x4( mDest );
        mDest += 12;
    }
};
Just to clarify what’s wrong with the compiler’s analysis: the compiler is right that the generated code is too long to be fully inlined and is better treated as a call. However, it misses when choosing which parts to inline and which not: it inlines sections of code that rarely get called, and prefers calls in areas that are iterated very frequently. We’re just aiding the compiler, not overruling its decisions.
Even the biggest CPUs out there in the consumer market have 32 logical cores (counting hyperthreading). 32 calls per frame is completely acceptable (assuming we have 1 thread per core or hyperthread). However, we could easily have 6,000 objects or more. 6,000 calls is not OK.
I also have yet to analyze what happens with our other two major compilers, GCC & Clang.
I’m guessing you’ll have a similar issue in GCC (though without your exact code, it is difficult to know!), in which case I’ve found __attribute__((always_inline)) to be the useful equivalent of Microsoft’s __forceinline.
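A common way to paper over the two spellings is a small portability macro. This is just a sketch (Ogre’s platform headers define their own equivalent, whose exact name may differ); note that GCC/Clang’s always_inline attribute must accompany the inline keyword:

```cpp
//Portable force-inline macro: MSVC spells it __forceinline, while GCC and
//Clang use the always_inline function attribute alongside `inline`.
#if defined( _MSC_VER )
    #define FORCE_INLINE __forceinline
#elif defined( __GNUC__ ) || defined( __clang__ )
    #define FORCE_INLINE inline __attribute__(( always_inline ))
#else
    #define FORCE_INLINE inline //Fallback: a plain inline hint
#endif

//Example use: a tiny function we want unconditionally inlined.
FORCE_INLINE int addTwelve( int x ) { return x + 12; }
```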
You probably already knew this, but I’m watching this project with interest. Glad software engineering is dragging itself out of the lengthy lull where we relied on Moore’s law to get us out of this kind of performance problem.
The code is already checked in at bitbucket.org/dark_sylinc/ogre2-gsoc. You can already take a peek.
If it doesn’t compile, I forgot to commit a file (also, I haven’t tested GCC in a while, so there could be some minor compilation issues, e.g. a case-sensitive included header, non-strictly-standard-compliant C++ code, etc.).