Descriptors are hard

Over the weekend, I asked on twitter if people would be interested in a rant about descriptor sets. As of the writing of this post, it has 46 likes so I’ll count that as a yes.

I kind-of hate descriptor sets…

Well, not descriptor sets per se. More descriptor set layouts. The fundamental problem, I think, was that we too closely tied memory layout to the shader interface. The Vulkan model works ok if your objective is to implement GL on top of Vulkan. You want 32 textures, 16 images, 24 UBOs, etc. and everything in your engine fits into those limits. As long as they’re always separate bindings in the shader, it works fine. It also works fine if you attempt to implement HLSL SM6.6 bindless on top of it. Have one giant descriptor set with all resources ever in giant arrays and pass indices into the shader somehow as part of the material.
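
If you've never seen it spelled out, that second approach looks roughly like this through the Vulkan API. This is only a sketch: the binding numbers, array sizes, and which descriptor types get their own arrays are placeholder choices on my part, not a recommendation.

```cpp
// Sketch of the "one giant descriptor set" approach: a single layout with huge
// update-after-bind arrays that the app treats as its global resource table.
#include <vulkan/vulkan.h>

VkDescriptorSetLayout create_bindless_layout(VkDevice device)
{
    const VkDescriptorSetLayoutBinding bindings[] = {
        // binding, descriptorType, descriptorCount, stageFlags, pImmutableSamplers
        { 0, VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE, 100 * 1024, VK_SHADER_STAGE_ALL, nullptr },
        { 1, VK_DESCRIPTOR_TYPE_STORAGE_IMAGE, 100 * 1024, VK_SHADER_STAGE_ALL, nullptr },
        { 2, VK_DESCRIPTOR_TYPE_SAMPLER,              512, VK_SHADER_STAGE_ALL, nullptr },
    };

    // Every array is update-after-bind and partially bound so slots can be
    // filled and recycled whenever the app feels like it.
    const VkDescriptorBindingFlags flags[3] = {
        VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT | VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT,
        VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT | VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT,
        VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT | VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT,
    };
    const VkDescriptorSetLayoutBindingFlagsCreateInfo binding_flags = {
        VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO, nullptr,
        3, flags,
    };

    const VkDescriptorSetLayoutCreateInfo info = {
        VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO, &binding_flags,
        VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT,
        3, bindings,
    };

    VkDescriptorSetLayout layout = VK_NULL_HANDLE;
    vkCreateDescriptorSetLayout(device, &info, nullptr, &layout);
    return layout;
}
```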

The moment you want to use different binding interfaces in different shaders (pretty common if artists author shaders), things start to get painful. If you want to avoid excess descriptor set switching, you need multiple pipelines with different interfaces to use the same set. This makes the already painful situation with pipelines worse. Now you need to know the binding interfaces of all pipelines that are going to be used together so you can build the combined descriptor set layout and you need to know that before you can compile ANY pipelines. We tried to solve this a bit with multiple descriptor sets and pipeline layout compatibility which is supposed to let you mix-and-match a bit. It’s probably good enough for VS/FS mixing but not for mixing whole materials.
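
For what it's worth, here's the shape of the compatibility trick in API terms: every material's pipeline layout starts with the same set 0 layout, so whatever is bound at set 0 survives pipeline switches. The split between a shared set 0 and a per-material set 1 is my example, not anything the spec mandates.

```cpp
// Sketch of pipeline layout compatibility: two pipeline layouts that share the
// same set 0 layout are "compatible for set 0", so a descriptor set bound at
// set 0 can stay bound while pipelines using either layout are swapped in.
#include <vulkan/vulkan.h>

VkPipelineLayout make_material_pipeline_layout(VkDevice device,
                                               VkDescriptorSetLayout shared_set0,
                                               VkDescriptorSetLayout material_set1)
{
    const VkDescriptorSetLayout sets[2] = { shared_set0, material_set1 };
    const VkPipelineLayoutCreateInfo info = {
        VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO, nullptr, 0,
        2, sets,      // set 0 is common to every material, set 1 is per-material
        0, nullptr,   // no push constants in this sketch
    };
    VkPipelineLayout layout = VK_NULL_HANDLE;
    vkCreatePipelineLayout(device, &info, nullptr, &layout);
    return layout;
}
```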

The problem space

So, how did we get here? As with most things in Vulkan, a big part of the problem is that Vulkan targets a very diverse spread of hardware and everyone does descriptor binding a bit differently. In order to understand the problem space a bit, we need to look at the hardware…

DISCLAIMER:

I’m about to spill a truckload of hardware beans. Let me reassure you all that I am not violating any NDAs here. Everything I’m about to tell you is either publicly documented (AMD and Intel) or can be gleaned from reading public Mesa source code.

Descriptor binding methods in hardware can be roughly broken down into 4 broad categories, each with its own advantages and disadvantages (there's a rough sketch after the list of what each model asks of the shader):

  1. Direct access (D): This is where the shader passes the entire descriptor to the access instruction directly. The descriptor may have been loaded from a buffer somewhere but the shader instructions do not reference that buffer in any way; they just take what they’re given. The classic example here is implementing SSBOs as “raw” pointer access. Direct access is extremely flexible because the descriptors can live literally anywhere but it comes at the cost of having to pass the full descriptor through the shader every time.

  2. Descriptor buffers (B): Instead of passing the entire descriptor through the shader, descriptors live in a buffer. The buffers themselves are bound to fixed binding points or have their base addresses pushed into the shader somehow. The shader instruction takes either a fixed descriptor buffer binding index or a base address (as appropriate) along with some form of offset to the descriptor in the buffer. The difference between this and the direct access model is that the descriptor data lives in some other bit of memory that the hardware must first read before it can do the actual access. Changing buffer bindings, while definitely not free, is typically not incredibly expensive.

  3. Descriptor heaps (H): Descriptors of a particular type all live in a single global table or heap. Because the table is global, changing it typically involves a full GPU stall and maybe dumping a bunch of caches. This makes changing out the table fairly expensive. Shader instructions which access these descriptors are passed an index into the global table. Because everything is fixed and global, this requires the least amount of data to pass through the shader of the three bindless mechanisms.

  4. Fixed HW bindings (F): In this model, resources are bound to fixed HW slots, often by setting registers from the command streamer or filling out small tables in memory. With the push towards bindless, fixed HW bindings are typically only used for fixed-function things on modern hardware such as render targets and vertex, index, and streamout buffers. However, we still need to consider them because Vulkan 1.0 was designed to support pre-bindless hardware which might not be quite as nice.
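
Here's the back-of-the-envelope sketch I promised: roughly what the shader has to have in hand to access a resource under each model. None of these structs are real hardware formats; the sizes are invented purely for illustration.

```cpp
#include <cstdint>

// (D) Direct access: the whole descriptor travels through the shader.
struct FullDescriptor { uint32_t words[8]; };                     // size is made up

// (B) Descriptor buffers: which bound buffer, plus an offset into it.
struct BufferRef { uint32_t buffer_binding; uint32_t offset; };

// (H) Descriptor heaps: just an index into the one global table.
struct HeapRef { uint32_t heap_index; };

// (F) Fixed HW bindings: nothing at all flows through the shader; the slot
// number is baked into the instruction and the CPU programs the slot.
```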

Here’s a quick run-down on where things sit with most of the hardware shipping today:

Hardware              Textures  Images  Samplers  Border Colors  UBOs    SSBOs
NVIDIA (Kepler+)      H         H       H         H              D/F     D
AMD                   D         D       D         H              D       D
Intel (Skylake+)      H         H       H         H              H/D/F   H/D
Intel (pre-Skylake)   F         F       F         F              D/F     F
Arm (Valhall+)        B         B       B         B              B/D/F   B/D
Arm (pre-Valhall)     F         F       F         F              D/F     D
Qualcomm (a5xx+)      B         B       B         B              B       B
Broadcom (vc5)        D         D       D         D              D       D

The line above for “Intel (pre-Skylake)” is a bit misleading. I’m labeling everything as fixed HW bindings but it’s actually a bit more flexible than most fixed HW binding mechanisms. It’s sort of a heap model but where, instead of indexing into heaps directly from the shader, everything goes through a second layer of indirection called a binding table, which is restricted to 240 entries. On Skylake and later hardware, the binding table hardware still exists and uses a different set of heaps, which provides a nice back-door for drivers. More on that when we talk about D3D12.

The Vulkan 1.0 descriptor set model

As you can see from above, the hardware landscape is quite diverse when it comes to descriptor binding. Everyone has made slightly different choices depending on the type of descriptor and picking a single model for everyone isn’t easy. The Vulkan answer to this was, of course, descriptor sets and their dreaded layouts.

Ignoring UBOs for the moment, the mapping from the Vulkan API to these hardware descriptors is conceptually fairly simple. The descriptor set layout describes a set of bindings, each with a binding type and a number of descriptors in that binding. The driver maps each binding type to the type of HW binding it uses and computes how much GPU or CPU memory is needed to store all the bindings. Fixed HW bindings are typically stored CPU-side and the actual bindings get set as part of vkCmdBindDescriptorSets() or vkCmdDraw/Dispatch(). For everything in one of the three bindless categories, the driver allocates GPU memory. For heap descriptors, the heap entries may be allocated as part of the descriptor set or, to save memory, as part of the image or buffer view object. Given that descriptor heaps are often limited in size, allocating them as part of the view object is often preferred.
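
As a concrete example, here's the kind of layout that paragraph is describing, with comments on where each binding's descriptors might end up based on the table above. The specific bindings are made up.

```cpp
// A plain Vulkan 1.0 descriptor set layout: each binding is just a type and a
// count, and that's all the driver gets when it sizes the set's storage.
#include <vulkan/vulkan.h>

VkDescriptorSetLayout create_material_layout(VkDevice device)
{
    const VkDescriptorSetLayoutBinding bindings[] = {
        // A UBO: fixed HW binding, push-like path, or descriptor memory, depending on the driver.
        { 0, VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,         1, VK_SHADER_STAGE_ALL_GRAPHICS, nullptr },
        // Sampled images: heap entries on NVIDIA/Intel, raw descriptors on AMD, etc.
        { 1, VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, 4, VK_SHADER_STAGE_FRAGMENT_BIT, nullptr },
        // An SSBO: often just a raw pointer stored in the set's GPU memory.
        { 2, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER,         1, VK_SHADER_STAGE_FRAGMENT_BIT, nullptr },
    };
    const VkDescriptorSetLayoutCreateInfo info = {
        VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO, nullptr, 0,
        3, bindings,
    };
    VkDescriptorSetLayout layout = VK_NULL_HANDLE;
    vkCreateDescriptorSetLayout(device, &info, nullptr, &layout);
    return layout;
}
```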

UBOs get weird. I’m not going to try and go into all of the details because there are often heuristics involved and it gets complicated fast. However, as you can see from the above table, most hardware has some sort of fixed HW binding for UBOs, even on bindless hardware. This is because UBOs are the hottest of hot paths and even small differences in UBO fetch speed turn into real FPS differences in games. This is why, even with descriptor indexing, UBOs aren’t required to support update-after-bind. The Intel Linux driver has three or four different paths a UBO may take based on how often it’s used relative to other UBOs, whether it’s update-after-bind, and which shader stage it’s being accessed from.
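
This is also why, if you want update-after-bind UBOs, you have to ask for them explicitly: the corresponding descriptor indexing feature bit is allowed to be unsupported even on otherwise fully bindless implementations. A minimal check looks something like this.

```cpp
// Query whether update-after-bind is supported for UBOs specifically. This one
// feature bit may be absent even when the rest of descriptor indexing is there.
#include <vulkan/vulkan.h>
#include <cstdio>

void check_ubo_update_after_bind(VkPhysicalDevice pdev)
{
    VkPhysicalDeviceDescriptorIndexingFeatures indexing = {
        VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DESCRIPTOR_INDEXING_FEATURES,
    };
    VkPhysicalDeviceFeatures2 features = {
        VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2, &indexing,
    };
    vkGetPhysicalDeviceFeatures2(pdev, &features);

    std::printf("UBO update-after-bind: %s\n",
                indexing.descriptorBindingUniformBufferUpdateAfterBind ? "yes" : "no");
}
```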

The other thing I have yet to mention is dynamic buffers. How they’re implemented varies by hardware and driver, but they typically look like a fixed HW binding: either the buffer is bound to an actual fixed HW slot or the descriptor is loaded into the shader as push constants. Even if the buffer pointer comes from descriptor set memory, the dynamic offset has to get loaded in via some push-like mechanism.
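
For reference, this is the bit of API I mean: the dynamic offsets ride along with the bind call itself rather than living in descriptor memory. The 256-byte stride is just an example.

```cpp
// Dynamic UBO offsets are passed at bind time, so the driver has to get them
// into the shader via some push-like path even if the buffer address itself
// lives in descriptor memory.
#include <vulkan/vulkan.h>

void bind_per_draw_uniforms(VkCommandBuffer cmd, VkPipelineLayout layout,
                            VkDescriptorSet set, uint32_t draw_index)
{
    // Assume binding 0 of this set is VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC
    // and each draw's constants are packed 256 bytes apart in one big buffer.
    const uint32_t dynamic_offset = draw_index * 256;
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layout,
                            0 /* firstSet */, 1, &set,
                            1, &dynamic_offset);
}
```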

The D3D12 descriptor heap

DISCLAIMER:

Again, I’m going to talk details here. Again, in spite of the fact that there are exactly zero open-source D3D12 drivers, I can safely say that I’m not violating any NDAs. I’ve literally never seen the inside of a D3D12 driver. I’ve just read public documentation and am familiar with how hardware works and is driven. This is all based on D3D12 drivers I’ve written in my head, not the real deal. I may get a few things wrong.

For D3D12, Microsoft took a very different approach. They embraced heaps. D3D12 has these heavy-weight descriptor heap objects which have to be bound before you can execute any 3D or compute commands. Shaders have the usual HLSL register notation for describing the descriptor interface. When shaders are compiled into pipelines, descriptor tables are used to map the bindings in the shader to ranges in the relevant descriptor heap. While the size of a descriptor heap range remains fixed, each such range has a dynamic offset which allows the application to move it around at will.
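
In API terms it looks roughly like the sketch below. Per the disclaimer above, this is written from public documentation; the root parameter index and heap size are arbitrary choices for the example, and error handling is omitted.

```cpp
// Sketch: create a shader-visible descriptor heap, bind it, and point a root
// descriptor table at a range inside it.
#include <windows.h>
#include <d3d12.h>

void bind_heap_range(ID3D12Device *device, ID3D12GraphicsCommandList *cmd,
                     ID3D12DescriptorHeap **out_heap)
{
    D3D12_DESCRIPTOR_HEAP_DESC desc = {};
    desc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
    desc.NumDescriptors = 1000000;   // roughly the 10^6 the API exposes (see below)
    desc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_SHADER_VISIBLE;
    device->CreateDescriptorHeap(&desc, IID_PPV_ARGS(out_heap));

    // Only one CBV/SRV/UAV heap (plus one sampler heap) is bound at a time.
    ID3D12DescriptorHeap *heaps[] = { *out_heap };
    cmd->SetDescriptorHeaps(1, heaps);

    // The root signature fixes the size and shape of each table; the base
    // offset is the part the app gets to move around at record time.
    D3D12_GPU_DESCRIPTOR_HANDLE base = (*out_heap)->GetGPUDescriptorHandleForHeapStart();
    cmd->SetGraphicsRootDescriptorTable(0 /* root parameter index */, base);
}
```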

With SM6.6, Microsoft added significant flexibility and further embraced heaps. Now, instead of having to use descriptor tables in the root signature, applications can use heap indices directly. This provides a full bindless experience. All the application developer has to do is manage heap allocations with resource lifetimes and figure out how to get indices into their shader. Gone are the days of fiddling with fixed interface layouts through side-band pipeline create APIs. From what I’ve heard, most developers love it.
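
The "get indices into their shader" part can be as simple as a root constant, which the HLSL side then uses to index ResourceDescriptorHeap[] directly. The root parameter layout here is my own invention for the sketch.

```cpp
// Sketch: with SM6.6 dynamic resources, the app can skip descriptor tables and
// just hand the shader a raw heap index, e.g. as a 32-bit root constant.
#include <windows.h>
#include <d3d12.h>

void set_material_texture(ID3D12GraphicsCommandList *cmd, UINT heap_index)
{
    // Assumes root parameter 1 was declared as 32-bit constants and that the
    // root signature was created with the *_HEAP_DIRECTLY_INDEXED flags.
    cmd->SetGraphicsRoot32BitConstant(1 /* root parameter */, heap_index,
                                      0 /* offset in 32-bit values */);
}
```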

If D3D12 has embraced heaps, how does it work on AMD? They use descriptor buffers, don’t they? Yup. But, fortunately for Microsoft, a descriptor heap is just a very restrictive descriptor buffer. The AMD driver just uses two of their descriptor buffer bindings (resource and sampler heaps are separate in D3D12) and implements the heap as a descriptor buffer.

One downside to the descriptor heap approach is that it forces some amount of extra indirection, especially with the SM6.6 bindless model. If your application is using bindless, you first have to load a heap index from a constant buffer somewhere and then pass that to the load/store op. The load/store turns into a sequence of instructions that fetches the descriptor from the heap, does the offset calculation, and then does the actual load or store from the corresponding pointer. Depending on how often the shader does this, how many unique descriptors are involved, and the compiler’s ability to optimize away redundant descriptor fetches, this can add up to real shader time in a hurry.

The other major downside to the D3D12 model is that handing control of the hardware heaps to the application really ties driver writers’ hands. Any time the client does a copy or blit operation which isn’t implemented directly in the DMA hardware, the driver has to spin up the 3D hardware, set up a pipeline, and do a few draws. In order to do a blit, the pixel shader needs to be able to read from the blit source image. This means it needs a texture or UAV descriptor, which has to live in the heap that is now owned by the client. On AMD, this isn’t a problem because they can re-bind descriptor buffers relatively cheaply or just use one of the high descriptor buffer bindings which they’re not using for heaps. On Intel, they have the very convenient back-door I mentioned above where the old binding table hardware still exists for fragment shaders.

Where this gets especially bad is on NVIDIA, which is a bit ironic given that the D3D12 model is basically exactly NVIDIA hardware. NVIDIA hardware only has one texture/image heap and switching it is expensive. How do they implement these DMA operations, then? First off, as far as I can tell, the only DMA operation in D3D12 that isn’t directly supported by NVIDIA’s DMA engine is MSAA resolves. D3D12 doesn’t have an equivalent of vkCmdBlitImage(). Applications are told to implement that themselves if they really want it. What saves them, I think (I can’t confirm), is that D3D12 exposes 10^6 (1,000,000) descriptors to the application but NVIDIA hardware supports 2^20 (1,048,576) descriptors. That leaves about 48K descriptors (1,048,576 - 1,000,000 = 48,576) for internal usage. Some of those are reserved by Microsoft for tools such as PIX but I’m guessing a few of them are reserved for the driver as well. As long as the hardware is able to copy descriptors around a bit (NVIDIA is very good at doing tiny DMA ops), they can manage their internal descriptors inside this range. It’s not ideal, but it does work.

Towards a better future?

I have nothing to announce, but others and I have been thinking about descriptors in Vulkan and how to make them better. I think we should be able to do something that’s better than the descriptor sets we have today. What is that? I’m personally not sure yet.

The good news is that, if we’re willing to ignore non-bindless hardware (I think we are for forward-looking things), there are really only two models: heaps and buffers. (Anything direct access can be stored in the heap or buffer and it won’t hurt anything.) I too can hear the siren call of D3D12 heaps but I’d really like to avoid tying drivers’ hands like that. Even if NVIDIA were to rework their hardware to support two heaps today to get around the internal descriptors problem and make it part of the next generation of GPUs, we wouldn’t be able to rely on users having that for 5-10 years, or longer depending on application targets.

If we keep letting drivers manage their own heaps, layering D3D12 on top of Vulkan becomes difficult. D3D12 doesn’t have image or buffer view objects in the same sense that Vulkan does. You just create descriptors and stick them in the heap somewhere. This means we either need to come up with a way to get rid of view objects in Vulkan or a D3D12 layer needs a giant cache of view objects, the lifetimes of which are difficult to manage, to say the least. It’s quite the pickle.

As with many of my rant posts, I don’t really have a solution. I’m not even really asking for feedback and ideas. My primary goal is to educate people and help them understand the problem space. Graphics is insanely complicated and hardware vendors are notoriously cagey about the details. I’m hoping that, by demystifying things a bit, I can at the very least garner a bit of sympathy for what we at Khronos are trying to do and help people understand that it’s a near miracle that we’ve gotten where we are. 😅