Why does vkMergePipelineCaches exist?


You may have noticed Vulkan offers vkMergePipelineCaches.

The reason behind this (or at least, one of the reasons) is extreme concurrency (Amdahl’s Law). However, most people create just one VkPipelineCache for the entire app and then forget about it, especially given how simple it is to set up just one VkPipelineCache for everyone.

When you call vkCreateGraphicsPipelines() the driver must internally do something along these lines (pseudo code):

// One possibility, single-level cache.

mutex.lock_read();
entry = check_cache();
mutex.unlock();

if( !entry )
{
   entry = compile();

   mutex.lock_write();
   // Note: insert_cache() may do nothing if two threads compiled the exact same PSO
   // at the same time and both returned null in check_cache()
   insert_cache( entry );
   mutex.unlock();
}


// Or an alternate strategy, a 2-level cache:

mutex.lock_read();
entry = check_read_only_cache();
mutex.unlock();

if( !entry )
{
   mutex.lock_write();
   entry = check_read_write_cache();
   if( !entry )
   {
       entry = compile();
       insert_cache( entry );
   }
   mutex.unlock();
}

// A worker thread must periodically lock everything to migrate
// entries from the read_write cache into the read_only cache.

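The 2-level strategy can likewise be sketched in ordinary C++ (again with hypothetical PsoKey/PsoEntry stand-ins; note that, like the pseudocode, this sketch compiles while holding the write lock, whereas a real driver may well drop the lock around compilation):

```cpp
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for a PSO hash key and a compiled entry.
using PsoKey   = uint64_t;
using PsoEntry = std::string;

class TwoLevelPsoCache
{
    // Read-mostly level: lookups only take a shared lock.
    std::unordered_map<PsoKey, PsoEntry> readOnlyCache;
    std::shared_mutex                    readOnlyMutex;

    // Write level: misses fall through here under an exclusive lock.
    std::unordered_map<PsoKey, PsoEntry> readWriteCache;
    std::mutex                           readWriteMutex;

public:
    template <typename CompileFn>
    PsoEntry getOrCompile( PsoKey key, CompileFn compile )
    {
        {
            std::shared_lock lock( readOnlyMutex );
            auto it = readOnlyCache.find( key );
            if( it != readOnlyCache.end() )
                return it->second;
        }

        std::lock_guard lock( readWriteMutex );
        auto it = readWriteCache.find( key );
        if( it == readWriteCache.end() )
            it = readWriteCache.emplace( key, compile() ).first;
        return it->second;
    }

    // The "worker thread" step: takes both locks exclusively to move
    // entries into the read-only level, so it should run sparingly.
    void migrate()
    {
        std::scoped_lock lock( readOnlyMutex, readWriteMutex );
        // merge() moves entries whose keys aren't already present;
        // any leftovers (duplicates) simply stay in the write level.
        readOnlyCache.merge( readWriteCache );
    }
};
```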
The exact locking strategy may vary across driver implementations, but a few things stand out:

  1. Contention can happen. Some strategies may minimize it, but won’t be able to completely get rid of it.
  2. Some strategies allow duplicates; others don’t, at the expense of potentially higher contention.
  3. Some form of locking is unavoidable (even if “lockless” algorithms are used, atomic operations still have overhead at the HW level at some point).
  4. Very stupid drivers could lock the cache throughout the entire compilation phase. I don’t know if there’s any driver that does this, but unrelated to VkPipelineCache entirely, a colleague of mine did see a mobile driver vendor serialize all PSO compilation threads blocked on just one driver thread doing a big chunk of the work 🙁

So the reason vkMergePipelineCaches exists is that you can use the following strategy to minimize contention:

parallel_for
{
  // One VkPipelineCache PER THREAD
  vkCreateGraphicsPipelines( psoCache[thread_id], ... );
}


// After we know for certain all threads are done:
// Note we start from i = 1, not i = 0.
for i = 1; i < num_threads; i++
{
  // Merge all caches into the first one.
  // (The real vkMergePipelineCaches takes an array of source caches,
  //  so this could also be done in a single call.)
  vkMergePipelineCaches( psoCache[0], psoCache[i] );
}

// Now psoCache[0] contains the entries from all threads and can be used
// on the next run for mostly read-only access, maximizing concurrency.

Notice that by the time we’re done, psoCache[0] will have all the compiled PSOs together, which can be reused on the next run.

This means our initial parallel_for will run with no contention at all.
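The pattern above can be demonstrated with ordinary containers and threads — a toy analogue, assuming a hypothetical PsoCache (a plain map) standing in for VkPipelineCache:

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for VkPipelineCache: each thread owns one,
// so no lock is needed while the workers run.
using PsoKey   = uint64_t;
using PsoCache = std::unordered_map<PsoKey, int>;

// Analogue of the parallel_for + vkMergePipelineCaches pattern.
PsoCache compileAndMerge( const std::vector<PsoKey> &keys, unsigned numThreads )
{
    std::vector<PsoCache>    psoCache( numThreads );
    std::vector<std::thread> workers;

    for( unsigned t = 0; t < numThreads; ++t )
    {
        workers.emplace_back( [&, t]
        {
            // Each thread "compiles" its slice into its own cache,
            // touching no shared state at all.
            for( size_t i = t; i < keys.size(); i += numThreads )
                psoCache[t].emplace( keys[i], /*fakeCompile*/ int( keys[i] * 2 ) );
        } );
    }
    for( auto &w : workers )
        w.join();

    // After we know for certain all threads are done, merge everything
    // into cache 0 (the moral equivalent of vkMergePipelineCaches).
    for( unsigned t = 1; t < numThreads; ++t )
        psoCache[0].merge( psoCache[t] );

    return psoCache[0];
}
```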

The disadvantage of this approach is that if your engine sends a lot of duplicate PSOs (i.e. generates lots of PSOs that are identical) to different threads, those threads won’t see each other’s results: each one compiles its own copy, making compilation times longer.

But locking is not avoided!

We fixed contention (yay!), but we didn’t fix the need for locking!

The driver doesn’t know every thread is getting its own VkPipelineCache, so it will still lock the cache. Even if the lock() doesn’t block, it still hits syscalls (at worst) or at best performs atomic instructions (which are cheaper but not free).

Someone noticed this problem and added VK_PIPELINE_CACHE_CREATE_EXTERNALLY_SYNCHRONIZED_BIT through the VK_EXT_pipeline_creation_cache_control extension.

If you create a VkPipelineCache with the VK_PIPELINE_CACHE_CREATE_EXTERNALLY_SYNCHRONIZED_BIT set you’re basically telling the driver “bro trust me, you don’t need to lock the cache. I guarantee no other thread will be accessing it while you use it”. Of course if you break that trust, crashes or corruption will happen. If you have one VkPipelineCache per thread + created with EXTERNALLY_SYNCHRONIZED_BIT, you’ll get maximum concurrency.
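Creating such a cache looks like this (a minimal sketch with error handling omitted; note the feature must be enabled at device creation via the extension, or via Vulkan 1.3 where it was promoted to core):

```c
#include <vulkan/vulkan.h>

VkPipelineCache createExternallySyncedCache( VkDevice device )
{
    VkPipelineCacheCreateInfo createInfo = { 0 };
    createInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
    /* "Bro trust me": only one thread at a time will touch this cache. */
    createInfo.flags = VK_PIPELINE_CACHE_CREATE_EXTERNALLY_SYNCHRONIZED_BIT;

    VkPipelineCache cache = VK_NULL_HANDLE;
    vkCreatePipelineCache( device, &createInfo, NULL, &cache );
    return cache;
}
```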

But your engine code must be designed so that at some point you call vkMergePipelineCaches() to merge the caches together, ideally before the app’s second launch. And don’t miss the fine print: EXTERNALLY_SYNCHRONIZED_BIT also covers vkMergePipelineCaches(). You can’t call vkMergePipelineCaches() while another thread is inside vkCreateGraphicsPipelines() if you created the caches with EXTERNALLY_SYNCHRONIZED_BIT. How you solve that efficiently is up to you.