There’s a lot of memory waste in a MovableObject. To name a few offenders:
    bool mVisible;
    bool mCastShadows;
    //Calls visitRenderables on itself every time to ask the material if it receives shadows. Nasty.
    bool MovableObject::getReceivesShadows();
The first two take 2 bytes per instance and cause lots of branches and cache misses (if visible && getCastShadows…). The third one doesn’t take extra RAM, but it’s a nasty call stack of virtual calls.
All three have a lot in common:
- They can be represented with a bit
- They all manage visibility (don’t render if visible is off, only render casters during shadow pass, only include receivers when processing receiver aabb)
- They can be processed in batches using old school bitwise arithmetic
- They’re great for SIMD
The only natural solution would be to go old school. And it happens… there’s already a 32-bit visibility flag variable! It’s probably the most useful, most underused feature in Ogre (a mentality I wish to change for Ogre 2.0; everybody renders using layers!).
Combined with render queue ids, one can do very powerful compositing: Keep translucent objects in one layer, simple alpha blended objects in another, opaque in another, special particle effects in another layer, etc.
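To make the layer idea concrete, here is a minimal sketch of how a pass could pick its layers with a single mask test. The layer names and the isRenderedBy helper are hypothetical, invented for illustration; they are not part of Ogre’s API.

```cpp
#include <cstdint>

// Hypothetical layer assignment, one bit per layer (names are illustrative).
const uint32_t LAYER_OPAQUE        = 1u << 0;
const uint32_t LAYER_ALPHA_BLENDED = 1u << 1;
const uint32_t LAYER_TRANSLUCENT   = 1u << 2;
const uint32_t LAYER_PARTICLES     = 1u << 3;

// An object is drawn by a pass when it shares at least one layer bit
// with the mask that pass renders.
inline bool isRenderedBy( uint32_t objectFlags, uint32_t passMask )
{
    return (objectFlags & passMask) != 0;
}
```

A pass that only composites translucent objects and particles would use LAYER_TRANSLUCENT | LAYER_PARTICLES as its mask, and every opaque object is rejected with one ‘and’.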
Ok, back on topic. Since this is one of the most underused features, I have some liberty to reserve a few bits for myself. For instance, Blender allows up to 20 different layers, and for now I’m planning on leaving 29 for the user.
There’s another reason why I’m doing this: because we’re using SoA (structure of arrays), I can save a few pointers, load instructions and indirections by storing data in mVisibilityFlags (which is already a SoA pointer now).
Bits 31, 30 & 29 are now reserved for shadow receiving, shadow casting and visibility, respectively.
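As a rough sketch, the reserved bits could be kept apart from the 29 user layers like this. The bit positions follow the code samples in this post (31 for receivers, 30 for casters, 29 for visibility); every constant and helper name below is hypothetical, not Ogre’s actual API.

```cpp
#include <cstdint>

// Hypothetical names; bit positions follow the code samples in this post.
const uint32_t FLAG_RECEIVE_SHADOWS = 1u << 31;
const uint32_t FLAG_CAST_SHADOWS    = 1u << 30;
const uint32_t FLAG_VISIBLE         = 1u << 29;
const uint32_t RESERVED_MASK = FLAG_RECEIVE_SHADOWS | FLAG_CAST_SHADOWS | FLAG_VISIBLE;
const uint32_t USER_MASK     = ~RESERVED_MASK; // the lower 29 bits

// Sets or clears one reserved bit without touching any other bit.
inline uint32_t setReservedBit( uint32_t flags, uint32_t bit, bool enabled )
{
    return enabled ? (flags | bit) : (flags & ~bit);
}

// Replaces the user layers while preserving the three reserved bits.
inline uint32_t setUserLayers( uint32_t flags, uint32_t layers )
{
    return (flags & RESERVED_MASK) | (layers & USER_MASK);
}
```

Masking the user’s value with USER_MASK means a careless setUserLayers( flags, 0xffffffff ) cannot accidentally turn an object into a shadow caster.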
Shadow casting comes naturally: shadow cameras should only have bit 30 set, so any non-caster will be skipped. Bit 31 can be tested using SIMD instructions to conditionally move the updated receiverAabb while processing 4 objects at a time.
Bit 29 can also be tested using SIMD instructions and then ‘and’-ed into the final mask to decide whether to add the object to the render queue.
This may sound complex but it’s not:
Shadow casting:
    uint32 viewportFlags = 1 << 30;
    if( obj[i].mVisibilityFlags & viewportFlags ) //Non-casters will be skipped
        addRenderable( this );
Shadow receivers:
This code expands the aabb of receivers:
    Vector3 vMin, vMax; //Assumed to be initialized
    //All ones if the object receives shadows, zero otherwise
    uint32 cmovFlag = (obj[i].mVisibilityFlags & (1u << 31)) != 0 ? 0xffffffff : 0;
    vMin = CMov( vMin, std::min( vMin, this->getMinimum() ), cmovFlag );
    vMax = CMov( vMax, std::max( vMax, this->getMaximum() ), cmovFlag );
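For readers unfamiliar with conditional moves, here is a minimal scalar sketch of what a CMov like the one above could look like for a single float. The cmov and testFlag helpers are hypothetical stand-ins that mirror the argument order of the pseudocode (old value, new value, mask); the real thing operates on SIMD registers.

```cpp
#include <cstdint>
#include <cstring>

static_assert( sizeof(float) == sizeof(uint32_t), "sketch assumes 32-bit floats" );

// Turns "is this bit set?" into an all-ones or all-zeros mask.
inline uint32_t testFlag( uint32_t flags, uint32_t bit )
{
    return (flags & bit) != 0 ? 0xffffffffu : 0u;
}

// Hypothetical scalar CMov: returns newValue when mask is all ones and
// oldValue when mask is zero, selecting with bitwise ops instead of a branch.
inline float cmov( float oldValue, float newValue, uint32_t mask )
{
    uint32_t oldBits, newBits;
    std::memcpy( &oldBits, &oldValue, sizeof(oldBits) );
    std::memcpy( &newBits, &newValue, sizeof(newBits) );
    const uint32_t resultBits = (newBits & mask) | (oldBits & ~mask);
    float result;
    std::memcpy( &result, &resultBits, sizeof(result) );
    return result;
}
```

The select is the same ‘and’/‘or’ pattern on every lane, which is why it maps so well onto SIMD: four objects can be merged into the aabb without a single branch.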
Checking visibility:
    uint32 isVisible = (obj[i].mVisibilityFlags & (1 << 29)) != 0 ? 0xffffffff : 0;
    if( isVisible )
        addRenderable( this );
Everything combined:
    Vector3 vMin, vMax; //Assumed to be initialized
    //Shadow receivers: expand the receiver aabb
    uint32 cmovFlag = (obj[i].mVisibilityFlags & (1u << 31)) != 0 ? 0xffffffff : 0;
    vMin = CMov( vMin, std::min( vMin, this->getMinimum() ), cmovFlag );
    vMax = CMov( vMax, std::max( vMax, this->getMaximum() ), cmovFlag );
    //Visibility and shadow casting share a single test
    uint32 isVisible = (obj[i].mVisibilityFlags & (1 << 29)) != 0 ? 0xffffffff : 0;
    uint32 viewportFlags = 1 << 30;
    if( obj[i].mVisibilityFlags & viewportFlags & isVisible )
        addRenderable( this );
Unfortunately the actual code looks uglier because it’s in SIMD style to process four at a time; but this is basically what we’re doing. (Replace the ‘&’ with Mathlib::And, the ‘|’ with Mathlib::Or, and the ‘!=0 ? a : b’ with Mathlib::TestFlags4 and you’ll get the idea)
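To give a feel for the four-at-a-time shape without any intrinsics, here is a portable sketch that uses plain arrays of four lanes. testFlags4 and cullFour below are hypothetical stand-ins written for this illustration; they do not reproduce the signatures of the Mathlib functions, which work on SIMD registers.

```cpp
#include <cstddef>
#include <cstdint>

const uint32_t VISIBLE_BIT = 1u << 29; // bit position used in this post

// Hypothetical 4-lane flag test: each lane becomes all ones if it shares
// any bit with 'bits', and zero otherwise.
inline void testFlags4( const uint32_t flags[4], uint32_t bits, uint32_t outMask[4] )
{
    for( size_t i = 0; i < 4; ++i )
        outMask[i] = (flags[i] & bits) != 0 ? 0xffffffffu : 0u;
}

// Builds the "add to render queue" mask for four objects at once: an object
// passes when it is visible and matches the viewport's flags.
inline void cullFour( const uint32_t flags[4], uint32_t viewportFlags, uint32_t outMask[4] )
{
    uint32_t visible[4], inViewport[4];
    testFlags4( flags, VISIBLE_BIT, visible );
    testFlags4( flags, viewportFlags, inViewport );
    for( size_t i = 0; i < 4; ++i )
        outMask[i] = visible[i] & inViewport[i];
}
```

With a shadow camera (viewportFlags = 1 << 30), only lanes that are both visible and casters end up with a non-zero mask.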
The result is that we’ve saved 2 bytes per instance, we no longer have to load 3 values from multiple places every time we need them, and we’ve dropped a conditional check (shadow caster checking became free, as we already had to test visibility flags); not to mention the whole thing is a lot more SIMD friendly.
Update: Fixed a typo; the code said (30 << 1) instead of (1 << 30). Thanks Xavier Verguin.