Intro to GPU Occlusion
Leon Brands
LEON BRANDS
Graphics Programmer, Behaviour Rotterdam
Email: lbrands@bhvr.com
Website: https://loen.tech
LinkedIn: /leonbrands/
AGENDA
Why
Concept
In Practice
What’s in a Frame?
Stability
Compute
Results
WHY
The Challenge
Co-dev
• Port to PS5, Xbox, Switch
• Doesn’t hit performance targets
Game
• Player-created levels
• Pre-authored chunks
Nintendo Switch
• Target: 30fps -> 31ms GPU time
• GPU, CPU, and Memory above budget
We needed a miracle
CONCEPT
Re-using Data
• Depth Buffer
• Discard anything behind
• Also used for lighting
• But we can do more
Occlusion
Project AABB
Compare to depth
Further? Occluded
That’s a lot of samples
Hi-Z Buffer
• Depth Buffer with mips
• Bilinearly pick the max value
• At mip 4, each pixel represents a 16x16 pixel area
[Visualization: how the Hi-Z buffer mip chain is generated]
IN PRACTICE
WHAT’S IN A FRAME?
Collect Objects
• Chicken or the egg
• We want the data now
• Buffer of uint8
• Wait + Copy from GPU
• Frustum objects
• Each object has an ID
• Index for OC in/output
• Read results
• That’s it?
• Prepare for next frame
• Store AABB (world-space)
• Fully encapsulates obj
• ID for verification
Depth Pre-Pass
• We've collected our visible objects
• Depth-only pre-pass
Generate Hi-Z
• Nearest power of 2 downward
• More than enough detail
• Iterate every mip, gather max
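As a rough sketch of the pass described above (illustrative HLSL, not the project's actual shader; the resource names and the thread-group size are assumptions):

    // One Hi-Z downsample step: reads mip N-1, writes mip N.
    // Run once per mip; keeps the max (furthest) depth of each 2x2 block.
    Texture2D<float>   PrevMip    : register(t0);
    RWTexture2D<float> CurrentMip : register(u0);
    SamplerState       PointClamp : register(s0);

    cbuffer HiZConstants : register(b0)
    {
        float2 InvPrevMipSize; // 1.0 / resolution of the previous mip
    };

    [numthreads(8, 8, 1)]
    void GenerateHiZ(uint3 id : SV_DispatchThreadID)
    {
        // Place the sample point between the four source texels so a single
        // Gather() fetches the whole 2x2 footprint.
        float2 uv = (id.xy * 2.0 + 1.0) * InvPrevMipSize;
        float4 depths = PrevMip.GatherRed(PointClamp, uv);

        // Keep the furthest value (max, with the usual 0 = near convention).
        CurrentMip[id.xy] = max(max(depths.x, depths.y), max(depths.z, depths.w));
    }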
Compute Occlusion
• Prepare data
• Bind resources
• Compute
STABILITY
Visual Stability
• Efforts to keep stable
• Inherited broken implementation
• Things I’ve done to improve stability
Skipping Objects
• Clear ID if object isn’t considered for culling
Nothing. Goes. Wrong.
• Fence timed out?
• No index?
• Invalid result?
IN PRACTICE / COMPUTE
Setup
• Simple compute setup
• Skip if OOB
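A minimal HLSL sketch of this setup, based on the resources listed in the notes (items buffer, Hi-Z texture, read-write result buffer, constants); every name here is illustrative:

    struct OcclusionItem
    {
        float3 aabbMin; // world-space AABB that fully encapsulates the object
        float3 aabbMax;
        uint   id;      // object ID, used for verification
    };

    StructuredBuffer<OcclusionItem> Items      : register(t0);
    Texture2D<float>                HiZ        : register(t1);
    RWBuffer<uint>                  Results    : register(u0);
    SamplerState                    PointClamp : register(s0);

    cbuffer OcclusionConstants : register(b0)
    {
        float4x4 ViewProjection;
        float2   HiZSize;   // mip 0 resolution, e.g. 1024 x 512
        uint     ItemCount;
    };

    [numthreads(64, 1, 1)]
    void CullOcclusion(uint3 id : SV_DispatchThreadID)
    {
        // One thread per item; skip threads that are out of bounds.
        if (id.x >= ItemCount)
            return;

        OcclusionItem item = Items[id.x];
        // ...project the AABB, pick a mip, sample depth (next slides)...
        Results[id.x] = 1; // 1 = visible, 0 = occluded, other values = error codes
    }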
AABB
• Fetch item
• Project each corner
• Keep min and max xy, closest z
• UV-space square
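A hedged sketch of this projection step, with invented helper names: each of the eight corners is transformed into 0-1 UV space while we track the min/max xy and the closest z.

    // Transforms one world-space corner into 0-1 UV space plus its depth.
    // Assumes column-vector convention; corners behind the camera would need
    // special handling, omitted here.
    float3 ProjectCorner(float3 worldPos, float4x4 viewProj)
    {
        float4 clip = mul(viewProj, float4(worldPos, 1.0));
        clip.xyz /= clip.w;                             // clip -> NDC
        float2 uv = clip.xy * float2(0.5, -0.5) + 0.5;  // NDC -> UV (y flipped)
        return float3(uv, clip.z);
    }

    // Builds the UV-space square: min/max xy over all 8 corners, closest z.
    void ProjectAABB(float3 aabbMin, float3 aabbMax, float4x4 viewProj,
                     out float2 uvMin, out float2 uvMax, out float closestZ)
    {
        uvMin = float2(1.0, 1.0);
        uvMax = float2(0.0, 0.0);
        closestZ = 1.0;

        [unroll]
        for (uint i = 0; i < 8; ++i)
        {
            // Each bit of i selects min or max on one axis, visiting all corners.
            float3 corner = float3(
                (i & 1) ? aabbMax.x : aabbMin.x,
                (i & 2) ? aabbMax.y : aabbMin.y,
                (i & 4) ? aabbMax.z : aabbMin.z);

            float3 p = ProjectCorner(corner, viewProj);
            uvMin = min(uvMin, p.xy);
            uvMax = max(uvMax, p.xy);
            closestZ = min(closestZ, p.z); // standard depth: smaller = closer
        }

        // Clamp to the screen; see the Flaws slide for the false positives
        // this clamping can cause when the AABB exceeds NDC space.
        uvMin = saturate(uvMin);
        uvMax = saturate(uvMax);
        closestZ = saturate(closestZ);
    }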
Compute Mip
• Smallest encompassing square
• Compute Hi-Z buffer mip level
• Point sample, each corner
• Log2 of the largest dimension
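In HLSL this boils down to a few lines; a sketch assuming the UV square from the previous step and the Hi-Z mip 0 resolution as inputs:

    // Picks the Hi-Z mip where one point sample per corner covers the square:
    // ceil(log2) of the square's largest dimension in mip-0 pixels.
    float ComputeHiZMip(float2 uvMin, float2 uvMax, float2 hiZSize)
    {
        float2 sizePx = (uvMax - uvMin) * hiZSize;            // size in pixels
        float  largest = max(max(sizePx.x, sizePx.y), 1.0);   // guard log2(0)
        return ceil(log2(largest));
    }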
Compute Mip
• For example: a 0.01 x 0.005 UV square on a 1024 x 512 Hi-Z buffer spans about 10 pixels at its largest (0.01 x 1024 = 10.24); log2(10.24) = 3.356, which rounds up to mip 4
Compute Mip
• Is mip 4 correct?
• Hi-Z Buffer mip 0: 1024 x 512
• Square in front of plant, behind chair
Compute Mip
• Is mip 4 correct?
• Hi-Z Buffer: 1024 x 512
• Mip 3: 128 x 64 (each pixel covers an 8x8 area)
• Mip 4: 64 x 32 (each pixel covers a 16x16 area)
• Mip 5: 32 x 16 (each pixel covers a 32x32 area)
Sample Depth
• Sample 2x2 square
• Largest Z value in AABB’s area
• Is it entirely concealed?
• Branchless compare (no if), small bias
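Putting it together, a sketch of the final test, reusing the HiZ texture and PointClamp sampler from the setup sketch; the bias value is illustrative, not the project's:

    // Point-sample the four corners of the square at the chosen mip and keep
    // the furthest of the four values: the furthest pixel in the AABB's area.
    uint TestVisibility(float2 uvMin, float2 uvMax, float closestZ, float mip)
    {
        float4 d;
        d.x = HiZ.SampleLevel(PointClamp, uvMin,                    mip);
        d.y = HiZ.SampleLevel(PointClamp, float2(uvMax.x, uvMin.y), mip);
        d.z = HiZ.SampleLevel(PointClamp, float2(uvMin.x, uvMax.y), mip);
        d.w = HiZ.SampleLevel(PointClamp, uvMax,                    mip);
        float furthest = max(max(d.x, d.y), max(d.z, d.w));

        // Branchless: visible unless the AABB's closest point lies behind
        // everything on screen; the small bias favors very thin AABBs.
        const float bias = 0.0001; // illustrative value
        return uint(closestZ <= furthest + bias);
    }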
RESULTS
Flaws
Artifacts
• Previous frame’s data
• Moving through objects
• Objects “occluded” for one frame
False positives
• AABB exceeds NDC space
• For sampling it’s clamped
• Z value doesn’t represent what’s on screen
• False positives…
Performance Impact
• Cost: 0.034114ms
• Benefit: 4.7833ms
• Applicable to more than just meshes
• Foliage Tiles
• Particle Systems
• Fog Volumes
• Water Planes
Q&A
© 2024 BEHAVIOUR INTERACTIVE INC. CONFIDENTIAL. DO NOT REDISTRIBUTE.

Leon Brands - Intro to GPU Occlusion (Graphics Programming Conference 2024)

Editor's Notes

  • #1 Hi, today I’ll be taking you through an intro to GPU occlusion. In this talk, I will break down our adaptation of GPU-based occlusion culling using a hierarchical z-buffer, and the results it has led to in our client’s project.
  • #3 But first, a quick intro: Leon Brands, Graphics Programmer. At Behaviour for 1 year, in the industry for 3. Optimizations are my bread and butter (I spent years working on a custom multi-GPU raytracing engine for VR, and currently at BHVR I’m squeezing out every bit of performance to get our client’s game ready for the Nintendo Switch release).
  • #4 So, today we’ll talk about why we integrated GPU occlusion, the concept or idea behind how it works, and the practical application with a more in-depth breakdown of the math involved. Finally, I’ll show you the impact it’s had on the project and we’ll talk about some of the flaws of our current implementation.
  • #5 Why bother with GPU-based occlusion culling? Why did we go through all this effort and why should you?
  • #6 Co-dev: the project with our client is co-dev. We work side-by-side with them on the game, and in that we’re primarily responsible for porting to PS5, Xbox, and Switch. The game runs on all platforms but doesn’t hit performance targets, so a big part of my workload is optimizations. We have limited control over the code-base and no direct control over assets. Game: heavily features user-generated content. Players can create and design their own levels using pre-authored assets. As a result, we can’t easily do any form of baking. On top of that, the engine isn’t designed with baking in mind either, meaning we don’t have any reliable way to recognize objects as non-moving/static. When I joined this project, I was presented with a challenge [CLICK]: we had a functioning port for the Nintendo Switch, but far above target. We need 31ms frames; we had GPU frames of 42ms with post-FX disabled. The CPU can’t maintain 30ms and regularly goes much higher, and the memory budget was exceeded (understatement of the year).
  • #7 We needed a miracle: something that improves GPU performance without impacting CPU or memory usage, and without any form of baking or fundamental engine restructuring.
  • #8 And now for the concept, which is largely built on the idea of “reusing data is free”
  • #9 Re-using data is free (or at least cheaper). What data can we re-use? In this case we’ll be looking at the depth buffer, which gives us a lot of information: it tells us exactly what is going to get rendered, and anything behind it can get discarded. Using a depth pre-pass we reduce overdraw, and with the depth buffer we also determine which lights affect our geometry, etc. But we can do more: using the depth buffer of the previous frame, we can avoid trying to draw occluded geometry in the first place.
  • #10 How do we do this? The idea is to project objects as AABBs onto the screen; if the AABB is further away in depth than all the pixels in its area, it can be considered occluded. A concern with this approach is the number of samples required to make sure the AABB is always visible when it should be: we can’t just check some random points or corners, because if we don’t check every pixel in its area, we might not be properly accounting for gaps in the occluding geometry.
  • #11 To solve that, we can use a Hi-Z buffer. This is a depth buffer with runtime-generated mips, where each mip bilinearly picks the largest (furthest) value instead of the average. A single pixel at a higher mip level thus represents the furthest value in a larger area. Example: in a 1024x512 HZB, at mip level 4, each pixel represents a 16x16 pixel area. This means that with few samples we can get the furthest value in an area. Perfect for cheap and quick depth comparisons.
  • #12 Here's a quick visualization of the Hi-Z buffer which should show how it's generated more clearly
  • #13 How does this work out in practice?
  • #14 Here’s a basic breakdown of our frames, of course ignoring countless other render passes.
  • #15 Chicken or the egg: occlusion culling uses the depth buffer of the early depth pass, but the early depth pass is also where we can gain the most by doing occlusion culling. Thus, we instead use the previous frame’s results. The occlusion culling pass of the previous frame has produced a buffer of uint8s, indicating with 0 or 1 whether the object is visible (the remaining values are error codes). Before we can use this data, we need to make sure it’s ready, so we wait on a fence that was signaled after the previous frame’s occlusion culling pass. Frustum objects: when we’re collecting objects for rendering, we first skip any objects that are out of view-distance or out of frustum. Each object has an index stored; this index is used to read the culling results and to register data for the next frame. We read out the results to determine the object’s visibility. That’s it? Well, we still have to prepare the object for the next frame. We generate a new index using an atomic counter; a lot of our object visibility collection happens in jobs, so this ensures that each object has a unique index. Using the given index, we store our AABB. Described by a min and max, the AABB is world-space and fully encapsulates the object. We also store the index for some basic verification, since the AABB can’t tell us by itself whether it’s valid data.
  • #16 Depth Pre-Pass: after we’re done collecting all our visible objects, we can render them, assured that we’ve minimized the number of objects that need to be drawn. The depth-only pre-pass draws all the visible objects and generates a depth buffer. Generate Hi-Z: for the purposes of mipmap generation, the Hi-Z buffer’s mip 0 is aligned to the nearest power of 2 (downward), so a 1080p texture would be represented by a 1024x512 texture. This is more than enough detail for our use case; worst case it is a tad more lenient for tiny objects. Using the depth buffer we can then generate the Hi-Z buffer in a very simple shader pass, run once for each mip: it takes the previous mip, performs a bilinear Gather(), and stores the maximum value it finds in the current mip.
  • #17 Finally, we prepare the occlusion culling pass. We upload all the AABBs that were registered previously, fill the result vector with invalid codes, and bind the shader’s resources. Then we dispatch our compute shader.
  • #18 Before we get into the intricacies of the compute shader, I want to take a moment to cover some of our efforts to keep this technique as stable as possible. The implementation of this technique was inherited from previous attempts at integrating it, at which time it was incredibly broken and unstable. On the next slide are some of the things I’ve done to create a stable result, and I think they’re worth mentioning.
  • #19 Clear IDs reliably: with data running across frames, objects going in/out of frustum, new objects getting added/removed, etc., we found that it’s important to be vigilant about clearing the object’s ID. Thus, we clear the ID whenever the object isn’t considered for occlusion culling. Nothing goes wrong. And if it does, we pretend it doesn’t. Visual stability is key: we’d much rather give up a little bit of performance on an uncertainty than have an object noticeably appear/disappear on-screen. On some occasions the fence we wait on can take too long, in which case we time out; if this happens, we just run the frame as normal and pretend everything is visible. If our object doesn’t have an ID, or if the shader returned an error code, the object is also treated as visible. Of course, while developing and debugging, we can breakpoint at these points, or even error out. But in release, we’d rather have some false positives.
  • #20 Now, for the GPU implementation.
  • #21 On the compute side of things, it’s a pretty simple setup: we have a structured buffer containing our items, a Texture2D for the Hi-Z buffer that we generated, a read-write buffer of uints, and some constants to help us along the way. Each thread handles a single item and is expected to write a result value into results. If there’s no result, we write an error code.
  • #22 Each thread fetches an item from the items buffer using the thread index. We compute the AABB’s size and use it to calculate each corner; each corner is then passed into a function which transforms the corner’s position into 0-1 screen space. The function outputs the closest z value and the minimum and maximum xy values. Using this data, we can compute a UV-space square, which we can use to sample the depth buffer.
  • #23 In the previous step we computed the smallest shape-encompassing screen-space square in 0-1 UV space. Our next goal is to figure out a mip level of our Hi-Z buffer at which our square can be depth-tested accurately with the fewest samples necessary. We use a point sampler because we care about the precise values of the depth buffer, which means that if our square covers multiple pixels, all of them need to be sampled. To match those requirements, we compute a mip level at which our square can be represented by one sample per corner. This means that each sample needs to cover at least half the square’s size. We can compute a matching mip level by converting the square’s size to pixels and taking the log base 2 of its largest dimension. This tells us how many times we need to divide the square’s size by 2 before it fits in a 2x2 area, which is exactly what our mips also do.
  • #24 EXAMPLE: in this example our square is 0.01 by 0.005, which means its largest dimension is about 10 pixels (0.01 x 1024 = 10.24). The log base 2 of that is 3.356, which after rounding up results in 4: this mip on the right…
  • #25 Is mip 4 correct? Our Hi-Z buffer for 1920x1080 starts at 1024x512. Let’s say our square is an imaginary model we want to draw here, in front of the plant, but behind the chair
  • #26 Why can’t we go to the more detailed mip 3? If we were to use mip 3, each sample would cover an 8x8 area, meaning that our 10 px wide square could have pixels left unsampled in its center. For example, in this case the corners of the square have all sampled a part of the chair, so the square is considered occluded, which is incorrect because we should be able to see its center. In general, sampling at too low a mip level may mean we don’t properly account for gaps in geometry, in this case the chair, which causes objects to be occluded when they shouldn’t be.
  • #27 At mip 4, the HZB has a resolution of 64x32, meaning that each pixel represents an area of 16x16. As you can see, the HZB at this mip level has started ignoring the chair’s parts because a sample in the area may still be visible through the gaps. The pot’s depth remains, which we’ve established our square is in front of, meaning our AABB will be correctly recognized as visible.
  • #28 If we were to go to mip 5, we’d cover a larger area than necessary. With our square here, this doesn’t affect us negatively; this area is still covered by the plant pot. But if our square would’ve been behind the pot but near the edge, it might’ve been considered visible where it didn’t need to be, causing a false positive. Taking too high of a mip level isn’t the worst thing ever, but the more accurate we can be the fewer objects we’ll render.
  • #29 Using our computed mip level we can sample the square’s corners, using the previously computed min/max to determine the exact sample points. From the 4 depths, we take the largest value, representing the furthest pixel in that area. Then, if our AABB’s closest value is further than the furthest pixel in the depth buffer, we can guarantee that it’s concealed, in which case we output false. Of course, in our actual code that last part looks more like this (branchless), but that’s less nice to explain. We also add a small bias in favor of very thin AABBs (including decals).
  • #30 Let’s talk about what all of this work resulted in
  • #31 Flaws: I’ll quickly cover some of the flaws we noticed. As discussed, since the occlusion culling results must be available before the early depth pass, we use the previous frame’s data. This sometimes means that objects pop into existence; in practice the occlusion culling system is lenient, but this can still happen, especially if the camera moves through a wall. We find that it isn’t very noticeable at higher frame-rates, and it’s more than acceptable for our use case. Other than that, the occlusion culling approach I’ve described can result in false positives, particularly with large AABBs, which might exceed NDC space. We clamp their corners to a 0-1 range, but it’s very possible that a 0 or 1 z value doesn’t represent what’s actually seen on screen. Luckily this only ever results in false positives, so it’s not a big deal.
  • #32 Now, the actual results: on Nintendo Switch, the occlusion pass takes just 0.03ms of GPU time and saves us around 4.8ms. The exact numbers of course differ depending on the scene, the area you’re looking at, etc., but the impact is always positive. All our other platforms also benefit from this technique, to different extents; for example, on PS5 we gained 1.9ms, and on PC 1.8ms. Thanks to the flexible AABB design of our occlusion culling code-base, I’ve also been able to extend it to occlude foliage, particle systems, fog volumes, and water planes, which has further increased the pass's impact.