S3032 Advanced Scenegraph Rendering Pipeline PDF

Advanced Scenegraph Rendering Pipeline
Markus Tavenrath NVIDIA matavenrath@nvidia.com

Christoph Kubisch NVIDIA ckubisch@nvidia.com
Introduction
SceneGraph
SceneTree
ShapeList
Renderer
SceneGraph Rendering
Traditional approach is to render while traversing a SceneGraph
Scene complexity increases
Deep hierarchies, traversal expensive
Large objects split up into a lot of little pieces, increased draw call count
Unsorted rendering, lots of state changes
CPU becomes the bottleneck when rendering those scenes
models courtesy of PTC
Overview
[Figure: pipeline overview with the stages SceneGraph, SceneTree, ShapeList, and Renderer with its RendererCache. Legend: Gi = Group, Ti = Transform, Si = Shape.]
SceneGraph
SceneGraph is a DAG
No unique path to a node
Cannot efficiently cache path-dependent data per node
Traversal runs over 14 nodes for rendering
Processes 6 Transform nodes
6 matrix/matrix multiplications and inversions
Nodes are usually large and not linear in memory
Each node access most likely generates at least one cache miss
[Figure: example SceneGraph with the nodes G0, T0, T1, S0, G1, T2, T3, S1, S2. Legend: Gi = Group, Ti = Transform, Si = Shape.]
SceneTree construction
Observer based synchronization
[Figure: the example SceneGraph next to the SceneTree built from it; nodes reachable over more than one path appear once per path in the SceneTree. Legend: Gi = Group, Ti = Transform, Si = Shape.]
SceneGraph node -> SceneTree node instance(s)
G0 -> (G0)
T0 -> (T0)
T1 -> (T1)
S0 -> (S0)
G1 -> (G1,G1)
T2 -> (T2,T2)
S1 -> (S1,S1)
T3 -> (T3,T3)
S2 -> (S2,S2)
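A minimal sketch of the SceneGraph to SceneTree expansion, assuming a plain recursive copy instead of the observer-based synchronization named on the slide. All type and function names are hypothetical, and the exact parent/child topology of the example scene is an assumption (the slide text only fixes which nodes exist and that G1 is reached twice). It reproduces the counts quoted for the example: 9 graph nodes become 14 tree nodes, 6 of them transforms.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical node kinds, mirroring the legend (Gi, Ti, Si).
enum class Kind { Group, Transform, Shape };

// SceneGraph node: children may be shared, so the graph is a DAG.
struct GraphNode {
    Kind kind;
    std::vector<const GraphNode*> children;
};

// SceneTree node: one instance per unique path through the SceneGraph.
struct TreeNode {
    const GraphNode* source;  // SceneGraph node this instance mirrors
    int parent;               // index of the parent instance, -1 for the root
};

// Depth-first expansion; every visit of a shared node creates a new instance.
inline void expand(const GraphNode& node, int parent, std::vector<TreeNode>& tree) {
    const int self = static_cast<int>(tree.size());
    tree.push_back({&node, parent});
    for (const GraphNode* child : node.children) expand(*child, self, tree);
}

// Builds the example scene and returns its SceneTree.
// Assumed topology: G0 -> {T0, T1}, T0 -> {S0, G1}, T1 -> {G1},
// G1 -> {T2, T3}, T2 -> {S1}, T3 -> {S2}. G1 is the shared node.
inline std::vector<TreeNode> expandExampleScene() {
    static const GraphNode s0{Kind::Shape, {}}, s1{Kind::Shape, {}}, s2{Kind::Shape, {}};
    static const GraphNode t2{Kind::Transform, {&s1}}, t3{Kind::Transform, {&s2}};
    static const GraphNode g1{Kind::Group, {&t2, &t3}};
    static const GraphNode t0{Kind::Transform, {&s0, &g1}}, t1{Kind::Transform, {&g1}};
    static const GraphNode g0{Kind::Group, {&t0, &t1}};
    std::vector<TreeNode> tree;
    expand(g0, -1, tree);
    assert(!tree.empty());
    return tree;
}

// Counts the SceneTree instances of one kind.
inline std::size_t countKind(const std::vector<TreeNode>& tree, Kind kind) {
    std::size_t n = 0;
    for (const TreeNode& t : tree)
        if (t.source->kind == kind) ++n;
    return n;
}
```

Storing the parent index per instance is what later allows accumulated transforms or visibility to be cached per path.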
SceneTree
SceneTree has a unique path to each node
Store accumulated attributes like transforms or visibility in each node
Trade memory for performance
64 bytes per node, 100k nodes ~6 MB
Transforms stored in a separate vector
Traversal still processes 14 nodes
[Figure: SceneTree of the example scene with 14 nodes. Legend: Gi = Group, Ti = Transform, Si = Shape.]
SceneTree invalidate attributes cache

Keep dirty flags per node
SceneGraph change notifications invalidate nodes
If not dirty, mark dirty and add to dirty vector
O(1) operation, no sorting required upon changes
Before rendering a frame, process the dirty vector
Keep one dirty vector per flag
[Figure: SceneTree with invalidated T1 and T3 nodes highlighted, next to the dirty vector that collects them.]
SceneTree validate attribute cache

Walk through dirty vector
Node marked dirty -> search top dirty
Validate subtree from top dirty
Validation example
T3 dirty, traverse up to root node
T3 is top dirty node, validate T3 subtree
T3 dirty, traverse up to root node
T1 is top dirty node, validate T1 subtree
T1 not dirty
No work to do
[Figure: SceneTree during validation. Dirty vector: T3, T3, T1.]
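The invalidate and validate steps can be sketched as follows. This is an illustrative sketch, not the presenters' implementation: the index-based tree layout, the names, and the `validations` counter (which stands in for recomputing the cached transform or visibility) are assumptions.

```cpp
#include <cassert>
#include <vector>

// Hypothetical SceneTree node with one dirty flag.
struct Node {
    int parent = -1;            // index of the parent, -1 for the root
    std::vector<int> children;
    bool dirty = false;
    int validations = 0;        // how often the cached attributes were recomputed
};

struct SceneTree {
    std::vector<Node> nodes;
    std::vector<int> dirtyVector;

    int addNode(int parent) {
        const int id = static_cast<int>(nodes.size());
        nodes.push_back(Node{});
        nodes[id].parent = parent;
        if (parent >= 0) nodes[parent].children.push_back(id);
        return id;
    }

    // O(1): set the flag and append once, no sorting.
    void markDirty(int id) {
        if (!nodes[id].dirty) {
            nodes[id].dirty = true;
            dirtyVector.push_back(id);
        }
    }

    // Recompute the cached attributes of a subtree and clear its dirty flags.
    void validateSubtree(int id) {
        nodes[id].dirty = false;
        ++nodes[id].validations;
        for (int child : nodes[id].children) validateSubtree(child);
    }

    // Before rendering a frame: process the dirty vector.
    void validate() {
        for (int id : dirtyVector) {
            if (!nodes[id].dirty) continue;  // already handled through an ancestor
            int top = id;
            for (int p = nodes[id].parent; p >= 0; p = nodes[p].parent)
                if (nodes[p].dirty) top = p;  // remember the topmost dirty ancestor
            validateSubtree(top);
        }
        dirtyVector.clear();
    }
};
```

Validating from the topmost dirty ancestor means each node is recomputed at most once per frame, no matter in which order the change notifications arrived.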
SceneTree to ShapeList
Add events for ShapeList generation
addShape(Shape)
removeShape(Shape)
[Figure: the shape instances of the SceneTree are collected into the ShapeList: S0, S1, S2, S1, S2. Legend: Gi = Group, Ti = Transform, Si = Shape.]
Summary
SceneGraph to SceneTree synchronization
Store accumulated data per node instance
SceneTree to ShapeList synchronization
Avoid SceneTree traversal
Next: Efficient data structure for renderer based on ShapeList
Renderer Data Structures

[Figure: a Shape references a Program ("colored"), a Geometry ("Shape1") and ParameterData ("Camera1", "Lightset 1", "Transform 1", "red"). A Program consists of a vertex Shader, a fragment Shader and ParameterDescriptions (Camera, Light, Matrices, Material).]
ParameterDescription
Name      Type     Arraysize
ambient   vec3     2
diffuse   vec3     2
specular  vec3     2
texture   Sampler  0
Example Parameter Grouping

Parameters                                          Frequency
Shader independent globals, e.g. camera             constant
Shader dependent globals, e.g. environment map      rare
Light, e.g. light sources and shadow maps           rare
Material raw values, e.g. float, int and bool       frequent
Material handles, e.g. textures and buffers         frequent
Object parameters, e.g. position/rotation/scaling   always
Rendering structures
Group by Program
Sort by ParameterData
[Figure: a ShapeList of mixed "colored" and "textured" shapes is grouped by Program into a list of colored Shapes and a list of textured Shapes.]
ParameterData Cache
[Figure: ParameterData per program: "colored" with red and blue, "textured" with wood and marble.]
Cache is a big char[] with all ParameterData.
ParameterData are sorted by first usage.
Parameters are converted to target-API datatypes, e.g. int8 to int32, TextureHandle to bindless texture...
Updating parameters is only playback of data in memory, no conditionals.
Filter for used parameters to reduce cache size
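A minimal sketch of such a cache, assuming a byte vector as the "big char[]" and an int8 to int32 widening as the only conversion. The type and method names are hypothetical. In a renderer the playback step would hand the byte range to the graphics API; here it is a plain memory copy so the sketch runs without OpenGL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical cache: one contiguous byte array holding all ParameterData,
// already converted to the layout the target API expects.
struct ParameterCache {
    std::vector<unsigned char> bytes;

    // Converts int8 source values to int32, appends them, returns the byte offset.
    std::size_t appendInt8AsInt32(const std::vector<std::int8_t>& values) {
        const std::size_t offset = bytes.size();
        for (std::int8_t v : values) {
            const std::int32_t converted = v;  // widen once, when the cache is built
            const unsigned char* p = reinterpret_cast<const unsigned char*>(&converted);
            bytes.insert(bytes.end(), p, p + sizeof(converted));
        }
        return offset;
    }

    // "Playback": copy a ready-made byte range, no per-parameter branching.
    void playback(std::size_t offset, std::size_t size, void* destination) const {
        assert(offset + size <= bytes.size());
        std::memcpy(destination, bytes.data() + offset, size);
    }
};
```

Because the conversion happens when an entry is appended, the per-frame path only needs an offset and a size.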
Vertex Attribute Cache

[Figure: attribute sets (pos, normal) referenced by "colored" shapes.]
Big char[] with vertex attribute pointers
Bindless pointers, VBOs or VAB streams
Each set of attributes stored only once
Ordered by first usage
Attributes required by program are known
Store only used attributes in cache
Useful for special passes like depth pass where only pos is required
Renderer Cache complete

[Figure: complete renderer cache for the "colored" program: Shapes referencing Parameters ("Phong red", "Phong blue") and Attributes (pos, normal).]
foreach (shape) {
    if (visible(shape)) {
        if (changed(parameters)) render(parameters);
        if (changed(attributes)) render(attributes);
        render(shape);
    }
}
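The effect of grouping by Program and sorting by ParameterData on the loop above can be illustrated with a small sketch. The integer ids, the `Stats` counters and both function names are assumptions made for illustration; the counters stand in for the actual API calls.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical flattened shape record; the ids stand in for cache entries.
struct Shape {
    int program;     // e.g. 0 = colored, 1 = textured
    int parameters;  // ParameterData set, e.g. red, blue, wood
    int attributes;  // vertex attribute set
    bool visible;
};

// Group by program, then sort by ParameterData and attribute set.
inline void sortForRendering(std::vector<Shape>& shapes) {
    std::stable_sort(shapes.begin(), shapes.end(), [](const Shape& a, const Shape& b) {
        return std::tie(a.program, a.parameters, a.attributes) <
               std::tie(b.program, b.parameters, b.attributes);
    });
}

struct Stats {
    int programBinds = 0;
    int parameterUpdates = 0;
    int attributeBinds = 0;
    int draws = 0;
};

// Render loop with the API calls replaced by counters.
inline Stats render(const std::vector<Shape>& shapes) {
    Stats s;
    int program = -1, parameters = -1, attributes = -1;
    for (const Shape& shape : shapes) {
        if (!shape.visible) continue;
        if (shape.program != program) {
            program = shape.program;
            parameters = attributes = -1;  // a new program invalidates dependent state
            ++s.programBinds;
        }
        if (shape.parameters != parameters) {
            parameters = shape.parameters;
            ++s.parameterUpdates;
        }
        if (shape.attributes != attributes) {
            attributes = shape.attributes;
            ++s.attributeBinds;
        }
        ++s.draws;
    }
    return s;
}
```

For an alternating list of colored and textured shapes, sorting turns one program switch per shape into one switch per program.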
Achievements
CPU boundedness improved (application)
Recomputation of attributes (transforms)
Deep hierarchies: traversal expensive
Unsorted rendering, lots of state changes
CPU boundedness remaining (OpenGL usage)
Large objects split up into a lot of little pieces, increased draw call count
[Figure: pipeline stages SceneTree, ShapeList, Renderer with RendererCache, and OpenGL implementation.]
Enabling Hardware Scalability

Avoid data redundancy
Data stored once, referenced multiple times
Update only once (fewer host-to-GPU transfers)
Increase batching potential
Further cuts API calls
Less driver CPU work
Minimize CPU/GPU interaction
Allow GPU to update its own data
Lower API usage when the scene changes little
E.g. GPU-based culling
OpenGL Research Framework

model courtesy of PTC
Avoids classic SceneGraph design

Geometry
Vertex/IndexBuffer
BoundingBox
Divided into parts (CAD features)
Material
Node Hierarchy
Object
Node and Geometry reference
For each Geometry part
Material reference
Enabled state
99000 total parts, 3.8 Mtris, 5.1 Mverts

700 geometries, 128 materials
2000 objects
Performance baseline
Kepler Quadro K5000, i7
VBO bind and drawcall per part, i.e. 99 000 drawcalls
scene draw time > 38 ms (CPU bound)
VBO bind per geometry, drawcall per part
scene draw time > 14 ms (CPU bound)
All subsequent techniques raise performance significantly
scene draw time < 6 ms
1.8 ms with occlusion culling
Drawcall Reduction
MultiDraw (1.x)
Render ranges from current VBO/IBO
Single drawcall for many distinct objects
Reduces overhead for low complexity objects
ARB_draw_indirect (4.x)
ARB_multi_draw_indirect
Store drawcall information on GPU or HOST
Let GPU create/modify GPU buffers
DrawElementsIndirect
{
    GLuint count;
    GLuint instanceCount;
    GLuint firstIndex;
    GLint  baseVertex;
    GLuint baseInstance;
}
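A sketch of filling such commands on the host, one per part. The `Part` record and `buildCommands` are hypothetical; fixed-width integer types stand in for GLuint/GLint so the block compiles without OpenGL headers. Using the part index as baseInstance and 0/1 as instanceCount are choices made here for illustration. The resulting array would then be uploaded to a buffer bound to GL_DRAW_INDIRECT_BUFFER and consumed by glMultiDrawElementsIndirect.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same field order as the DrawElementsIndirect command.
struct DrawElementsIndirectCommand {
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t baseInstance;
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20, "five tightly packed 32-bit fields");

// Hypothetical description of one part inside a shared vertex/index buffer.
struct Part {
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    bool enabled;
};

// One command per part; disabled parts keep their slot but draw zero instances.
inline std::vector<DrawElementsIndirectCommand> buildCommands(const std::vector<Part>& parts) {
    std::vector<DrawElementsIndirectCommand> cmds;
    cmds.reserve(parts.size());
    for (std::size_t i = 0; i < parts.size(); ++i) {
        const Part& p = parts[i];
        cmds.push_back({p.indexCount, p.enabled ? 1u : 0u, p.firstIndex, p.baseVertex,
                        static_cast<std::uint32_t>(i)});  // baseInstance = part index
    }
    return cmds;
}
```

Keeping a slot for every part means enabling or disabling one only touches a single instanceCount field.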
Drawing Techniques
All use multidraw capabilities to render across gaps
BATCHED: use CPU-generated list of combined parts with same state
Object's part cache must be rebuilt based on material/enabled state
INDIVIDUAL: stay on per-part level
No caches, can update assignment or cmd buffers directly
[Figure: parts with different materials in a geometry. BATCHED: adjacent parts a and b become one grouped and grown drawcall (a+b). INDIVIDUAL: single call, material/matrix assignment encoded via vertex attribute.]
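The BATCHED idea of grouping and growing drawcalls can be sketched on the CPU as below. The record layout and function name are assumptions, and the parts are assumed to be sorted by firstIndex within one geometry. Each resulting batch holds the count and first-index arrays that a multidraw call would consume (with first indices converted to byte offsets for glMultiDrawElements).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical part of a geometry: a range in the shared index buffer.
struct PartRange {
    std::uint32_t firstIndex;
    std::uint32_t indexCount;
    int material;
    bool enabled;
};

// One multidraw batch per material.
struct Batch {
    int material;
    std::vector<std::int32_t> counts;         // index count per range
    std::vector<std::uint32_t> firstIndices;  // first index per range
};

// Rebuilds the part cache of one object from material/enabled state.
inline std::vector<Batch> buildBatches(const std::vector<PartRange>& parts) {
    std::map<int, Batch> byMaterial;
    for (const PartRange& p : parts) {
        if (!p.enabled) continue;
        Batch& b = byMaterial[p.material];
        b.material = p.material;
        const bool adjacent =
            !b.counts.empty() &&
            b.firstIndices.back() + static_cast<std::uint32_t>(b.counts.back()) == p.firstIndex;
        if (adjacent) {
            // Grow the previous range: the part directly follows it in the index buffer.
            b.counts.back() += static_cast<std::int32_t>(p.indexCount);
        } else {
            // Gap in between: start a new range inside the same multidraw.
            b.firstIndices.push_back(p.firstIndex);
            b.counts.push_back(static_cast<std::int32_t>(p.indexCount));
        }
    }
    std::vector<Batch> out;
    for (auto& entry : byMaterial) out.push_back(std::move(entry.second));
    return out;
}
```

This cache has to be rebuilt whenever a part's material or enabled state changes, which is the cost the INDIVIDUAL technique avoids.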
Parameters
Group parameters by frequency of change
Generating shader strings allows different storage backends for uniforms
Effect "Phong" {
    Group "material" (many) {
        vec4 "ambient"
        vec4 "diffuse"
        vec4 "specular"
    }
    Group "view" (few) {
        vec4 viewProjTM
    }
    Group "object" (many) {
        mat4 worldTM
    }
    ... Code ...
}
OpenGL 2 uniforms
OpenGL 3,4 buffers
NVIDIA bindless technology...
Parameters
GL2 approach:
Avoid many small uniforms
Arrays of uniforms, grouped by frequency of update, tightly packed

uniform mat4 worldMatrices[2];
uniform vec4 materialData[8];

#define matrix_world      worldMatrices[0]
#define matrix_worldIT    worldMatrices[1]
#define material_diffuse  materialData[0]
#define material_emissive materialData[1]
#define material_gloss    materialData[2].x
// GL3 can use floatBitsToInt and friends
// for free reinterpret casts within macros
...
// in vertex shader
wPos = matrix_world * oPos;
...
// in fragment shader
color = material_diffuse + material_emissive;
...
Parameters
GL4 approach:
TextureBufferObject (TBO) for matrices
UniformBufferObject (UBO) with array data to save costly binds
Assignment indices passed as vertex attribute

in vec4 oPos;
uniform samplerBuffer matrixBuffer;
uniform materialBuffer {
    Material materials[512];
};
in ivec2 vAssigns;
flat out ivec2 fAssigns;
// in vertex shader
fAssigns = vAssigns;
worldTM = getMatrix (matrixBuffer, vAssigns.x);
wPos = worldTM * oPos;
...
// in fragment shader
color = materials[fAssigns.y].color;
...
OpenGL 4.x approach

setupSceneMatrixAndMaterialBuffer (scene);
foreach (obj in scene) {
    if ( isVisible(obj) ) {
        setupDrawGeometryVertexBuffer (obj);
        // iterate over different materials used
        foreach ( batch in obj.materialCaches ) {
            // x = matrix index, y = material index, matching vAssigns.x / fAssigns.y in the shader
            glVertexAttribI2i (indexAttr, matrixIndex, batch.materialIndex);
            glMultiDrawElements (GL_TRIANGLES, batch.counts, GL_UNSIGNED_INT,
                                 batch.offsets, batch.numUsed);
        }
    }
}
Per drawcall vertex attribute

glVertexAttribDivisor == 0 : VArray[ gl_VertexID + baseVertex ]
glVertexAttribDivisor != 0 : VArray[ gl_InstanceID / VDivisor + baseInstance ]
[Figure: a MultiDrawIndirect buffer with two commands (instanceCount = 1, baseInstance = 0 and instanceCount = 1, baseInstance = 1). The Material & Matrix Index VertexBuffer (divisor: 1) is read at VArray[ 0 / 1 + baseInstance ]; the Position & Normal VertexBuffer uses divisor 0. Highlighted: the vertex attributes fetched for the last vertex in the second drawcall, where baseInstance = 1.]
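The two fetch rules can be written out as a small helper. This is a sketch for illustration only: the function name is hypothetical, and the slide's gl_VertexID is taken here to mean the index read from the element buffer, which is an assumption about the slide's shorthand.

```cpp
#include <cassert>
#include <cstdint>

// Index into a vertex attribute array for an indexed, instanced draw.
// elementIndex: value read from the index buffer for the current vertex.
// instance:     instance counter of the current draw, starting at 0.
inline std::uint32_t fetchIndex(std::uint32_t divisor, std::uint32_t elementIndex,
                                std::int32_t baseVertex, std::uint32_t instance,
                                std::uint32_t baseInstance) {
    if (divisor == 0) {
        // Per-vertex attribute: advances with every vertex.
        return static_cast<std::uint32_t>(static_cast<std::int64_t>(elementIndex) + baseVertex);
    }
    // Per-instance attribute: constant within a draw with instanceCount = 1,
    // so baseInstance alone selects the entry.
    return instance / divisor + baseInstance;
}
```

With instanceCount = 1 and divisor 1, every vertex of a drawcall reads the same entry, which makes baseInstance usable as a per-drawcall index.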
OpenGL 4.2+ indirect approach

...
foreach ( obj in scene.objects ) {
    ...
    // instead of glVertexAttribI2i calls and a loop
    // we use the baseInstance for the attribute
    // bind special assignment buffer as vertex attribute
    glBindBuffer ( GL_ARRAY_BUFFER, obj->assignBuffer );
    glVertexAttribIPointer (indexAttr, 2, GL_INT, ... );
    // draw everything in one go
    glMultiDrawElementsIndirect ( GL_TRIANGLES, GL_UNSIGNED_INT,
                                  obj->indirectOffset, obj->numIndirects, 0 );
}
Vertex Setup
ARB_vertex_attrib_binding (VAB)
Separates format from data
Bind multiple vertex attributes to one buffer
Avoids many buffer changes
NV_vertex_buffer_unified_memory (VBUM)
Allows very fast switching through GPU pointers

/* setup once, similar to glVertexAttribPointer
   but with relative offset last */
glVertexAttribFormat (ATTR_NORMAL, 3, GL_FLOAT, GL_TRUE, offsetof(Vertex,normal));
glVertexAttribFormat (ATTR_POS, 3, GL_FLOAT, GL_FALSE, offsetof(Vertex,pos));
// bind to stream
glVertexAttribBinding (ATTR_NORMAL, 0);
glVertexAttribBinding (ATTR_POS, 0);
// switch single stream buffer
glBindVertexBuffer (0, bufID, 0, sizeof(Vertex));

// NV_vertex_buffer_unified_memory
// enable once and set stride
glEnableClientState (GL_VERTEX...NV);...
glBindVertexBuffer (0, 0, 0, sizeof(Vertex));
// switch single buffer via pointer
glBufferAddressRangeNV (GL_VERTEX...,0,bufADDR, bufSize);
NV_bindless_multidraw_indirect
Vertex/Index setup inside MultiDrawIndirect command
~2400 drawcalls, GL4 BATCHED style
One GL call to draw entire scene
GPU benefit depends on triangles per drawcall (> ~500)
[Chart: Effect on CPU time. Y axis: time in microseconds [us], 0 to 1000, lower is better. Bars: VBO; VAB; VAB+VBUM; BINDLESS INDIRECT HOST (VAB+VBUM); BINDLESS INDIRECT GPU (VAB+VBUM).]
Bindless (green) always reduces CPU, and may help framerate/GPU a bit
K = Kepler 5000, regular VBO
KB = Kepler 5000, VBUM + VAB
[Chart: CPU and GPU time in microseconds [us], 0 to 6000, lower is better. Bars: K and KB for GL4 INDIRECTHOST INDIVIDUAL, KB for GL4 INDIRECTGPU INDIVIDUAL, KB for GL4 BATCHED, KB for GL2 BATCHED. Annotations: 99 000; 2 000; 2 400 hw drawcalls; 2 400 sw drawcalls.]
MultiDrawIndirect achieves almost 20 million drawcalls per second (2000 VBO changes, only 1/3 perf lost).
GPU-buffered commands save lots of CPU time
Scene-dependent!
INDIVIDUAL could be as fast if enough work per drawcall
[Chart: CPU and GPU time in microseconds, 0 to 6000, lower is better, for K and KB with GL4 INDIRECTHOST INDIVIDUAL, GL4 INDIRECTGPU INDIVIDUAL, GL4 BATCHED and GL2 BATCHED. Annotations: 99 000; 2 000; 2 400 hw drawcalls; 2 400 sw drawcalls.]
GL2 uniforms beat paletted UBO a bit on the GPU, but are slower on the CPU side (1 glUniform call with 8x vec4, vs. indexed UBO).
Scene-dependent!
GL4 better when more materials are changed per object
[Chart: CPU and GPU time in microseconds, 0 to 6000, lower is better, for K and KB with GL4 INDIRECTHOST INDIVIDUAL, GL4 INDIRECTGPU INDIVIDUAL, GL4 BATCHED and GL2 BATCHED. Annotations: 99 000; 2 000; 2 400 hw drawcalls; 2 400 sw drawcalls.]
Recap
Share geometry buffers for batching
Group parameters for fast updating
MultiDraw/Indirect for keeping objects independent or removing additional loops
baseInstance to provide unique index/assignments for drawcall
Bindless to reduce validation overhead/add flexibility
GPU Culling Basics

GPU friendly processing
Matrix and bbox buffer, object buffer
XFB/Compute or invisible rendering
Vs. old techniques: Single GPU job for ALL objects!
Results
Readback GPU to Host
0,1,0,1,1,1,0,0,0
Can use GPU to pack into bit stream
Indirect GPU to GPU
Set DrawIndirect's instanceCount to 0 or 1
buffer cmdBuffer {
    Command cmds[];
};
...
cmds[obj].instanceCount = visible;
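For the readback variant, packing the per-object results into a bit stream shrinks the data transferred to the host. The sketch below is a CPU-side stand-in for what a shader would do on the GPU. The bit layout (object i in bit i % 32 of word i / 32) and both function names are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs one visibility flag per object into 32-bit words.
// Assumed layout: object i is stored in bit (i % 32) of word (i / 32).
inline std::vector<std::uint32_t> packVisibility(const std::vector<int>& visible) {
    std::vector<std::uint32_t> words((visible.size() + 31) / 32, 0u);
    for (std::size_t i = 0; i < visible.size(); ++i) {
        if (visible[i] != 0) words[i / 32] |= (std::uint32_t{1} << (i % 32));
    }
    return words;
}

// Host-side lookup after readback.
inline bool isVisible(const std::vector<std::uint32_t>& words, std::size_t object) {
    return ((words[object / 32] >> (object % 32)) & 1u) != 0;
}
```

With this layout the nine example results 0,1,0,1,1,1,0,0,0 fit into a single word instead of nine integers.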
Occlusion Culling
Algorithm by Evgeny Makarov, NVIDIA
OpenGL 4.2+
Depth-Pass
Raster "invisible" bounding boxes
Disable Color/Depth writes
Geometry Shader to create the three visible box sides
Depth buffer discards occluded fragments (earlyZ...)
Fragment Shader writes output: visible[objindex] = 1
[Figure: bounding boxes tested against the depth buffer; passing bbox fragments enable the object.]

// GLSL fragment shader
// from ARB_shader_image_load_store
layout(early_fragment_tests) in;
buffer visibilityBuffer {
    int visibility[];
};
flat in int objID;
void main() {
    visibility[objID] = 1;
}
// buffer would have been cleared
// to 0 before
Temporal Coherence
Exploit that the majority of objects don't change much relative to the camera
Draw each object only once (vertex/drawcall bound)
Render last visible, fully shaded: (last)
Test all against current depth: (visible)
Render newly added visible: (~last) & (visible); none, if no spatial changes made
(last) = (visible)
[Figure: frame f-1 shows the camera with visible and invisible objects. Frame f shows the moved camera, the last visible objects, bboxes that pass the depth test, bboxes that are occluded, and the new visible objects.]
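The per-frame set logic can be sketched on the CPU as follows. The occlusion test itself is not modelled; its result is passed in as `visibleNow`. `FrameResult` and `renderFrame` are hypothetical names, and the returned index lists stand in for the two draw passes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct FrameResult {
    std::vector<std::size_t> drawnFirst;   // pass 1: objects in (last)
    std::vector<std::size_t> drawnSecond;  // pass 2: objects in (~last) & (visible)
};

// last:       visibility of the previous frame, updated in place.
// visibleNow: outcome of testing all bounding boxes against the depth buffer
//             produced by pass 1 (supplied by the caller in this sketch).
inline FrameResult renderFrame(std::vector<bool>& last, const std::vector<bool>& visibleNow) {
    assert(last.size() == visibleNow.size());
    FrameResult r;
    // 1. Render last visible, fully shaded.
    for (std::size_t i = 0; i < last.size(); ++i)
        if (last[i]) r.drawnFirst.push_back(i);
    // 2. Render newly visible objects: (~last) & (visible).
    for (std::size_t i = 0; i < last.size(); ++i)
        if (!last[i] && visibleNow[i]) r.drawnSecond.push_back(i);
    // 3. (last) = (visible)
    last = visibleNow;
    return r;
}
```

An object is in at most one of the two lists, so it is drawn at most once per frame, and the second list is empty while nothing changes.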
Culling Readback vs Indirect

For readback results, CPU has to wait for GPU idle
In the "draw new visible" phase, indirect cannot benefit from having nothing to set up or draw in advance; it still processes empty lists
NV_bindless_multidraw_indirect saves CPU and a bit of GPU time
Scene-dependent, i.e. triangles per drawcall and number of invisible objects
[Chart: CPU and GPU time, 0 to 2500, lower is better, GL4 BATCHED style, for the variants readback, indirect and NVindirect. Annotations: 37% faster with culling, 33% faster with culling, 37% faster with culling.]
Culling Results
Temporal culling very useful for object/vertex-boundedness
Can also apply for Z-pass...
Readback vs Indirect
Readback variant easier to be faster (no setups...), but syncs!
NV_bindless_multidraw benefit depends on scene (VBO changes and primitives per drawcall)
Working towards GPU autonomous system
(NV_bindless)/ARB_multidraw_indirect as mechanism for GPU creating its own work, research and feature work in progress
glFinish();
Thank you!
Contact
ckubisch@nvidia.com
matavenrath@nvidia.com
NVIDIA Bindless Technology

Family of extensions to use native handles/addresses
NV_vertex_buffer_unified_memory
NV_shader_buffer_load/store
Pointers in GLSL
NV_bindless_texture
No more unit restrictions
References inside buffers

// GLSL with true pointers
uniform MyStruct* mystructs;
// API
glUniformui64NV (bufferLocation, bufferADDR);

texHDL = glGetTextureHandleNV (tex);
// later instead of glBindTexture
glUniformHandleui64NV (texLocation, texHDL);
// GLSL
// can also store textures in resources
uniform materialBuffer {
    sampler2D manyTextures [LARGE];
};
Culling Readback vs Indirect
For readback results, CPU has to wait for GPU idle
Nothing new to draw, but CPU doesn't know, still setting things up, GPU runs through empty cmd buffer
Special bindless indirect version can save lots of CPU and a bit of GPU costs for drawing the scene with a single big cmd buffer
[Chart: time in microseconds, 0 to 2500, CPU and GPU bars for Readback, Indirect and NVIndirect, split into the phases 1. Frustum Cull, 2. Update Internals, 3. Draw Last Visible, 4. Occlusion Cull, 5. Update Internals, 6. Draw New Visible. Annotations: 432 fps with culling, 315 without; 289 without; 313 without.]
