Optimizing DirectX* Graphic Applications using Software Vertex Processing
Ronen Zohar / Kim Pallister, Intel Corporation
Meltdown 2001
Copyright 2001 Intel Corporation. *Other names and brands may be claimed as the property of others.
Agenda
- Do I need SW vertex processing?
- The PSGP
- Using SW vertex processing for maximum performance: memory, batching and render-states
- SW vertex processing and the new features of DirectX* 8.0
Do I need SW vertex processing?

Your publisher wants:
- Eye-candy graphics, using all the latest 3D features
- Lower minimum system requirements
- and many more
Problem: older systems do not support all the eye-candy features
- Solution 1: Disable features for low-end systems
- Solution 2: Use SW vertex processing (at least for the features that you can) and keep some features
Inside DirectX* Graphics

[Diagram] Application -> DirectX run-time API -> Front-end -> either the HW vertex processing path or SW vertex processing (PSGP) -> Communication to the driver (DDI) -> Driver
PSGP: Processor Specific Geometry Pipeline

- The part of DirectX graphics responsible for the SW vertex processing algorithms, optimized for the client's processor
- The DirectX* 8.0 PSGP is optimized for:
  - Intel Pentium III processor
  - Intel Pentium 4 processor
The PSGP
[Diagram] Inputs: VB and IB
- Fixed function path: Transformation -> Lighting -> Tex Gen -> Format data to output FVF
- Vertex shader path: Map stream to registers -> Execute vertex shader code
- Internal temporary VBs
- Clipper
- To driver
PSGP Principles
- Use SIMD to process multiple vertices in each iteration
  - Vertical processing
  - Data is swizzled on the fly
- Prefetch input streams to hide memory latency
- Write output to temporary VBs based on the XYZRHW FVF code
  - In system memory if the transformed vertices need to be read back
  - In driver memory if no read-back is required
  - More on this later
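As a rough illustration of "vertical processing", the sketch below swizzles four array-of-structures vertices into one array per component and then transforms all four lanes together. The types `Vertex` and `Vertex4` and the functions `swizzle4` and `transform4` are hypothetical names for this sketch, not part of the PSGP or DirectX. Plain loops stand in for SIMD instructions so the code stays portable.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical input layout (array-of-structures), as an application might store it.
struct Vertex { float x, y, z; };

// Four vertices in "vertical" (structure-of-arrays) form: one array per
// component, so each array would map onto one 4-wide SIMD register.
struct Vertex4 {
    std::array<float, 4> x, y, z;
};

// Swizzle four consecutive AoS vertices into SoA form.
Vertex4 swizzle4(const Vertex* v) {
    assert(v != nullptr);
    Vertex4 out{};
    for (std::size_t lane = 0; lane < 4; ++lane) {
        out.x[lane] = v[lane].x;
        out.y[lane] = v[lane].y;
        out.z[lane] = v[lane].z;
    }
    return out;
}

// Transform four vertices by a row-major 3x4 matrix (rotation plus translation).
// Each statement in the loop body corresponds to a few 4-wide multiplies and
// adds when written with real SIMD instructions.
Vertex4 transform4(const Vertex4& in, const float m[3][4]) {
    Vertex4 out{};
    for (std::size_t lane = 0; lane < 4; ++lane) {
        out.x[lane] = m[0][0] * in.x[lane] + m[0][1] * in.y[lane] + m[0][2] * in.z[lane] + m[0][3];
        out.y[lane] = m[1][0] * in.x[lane] + m[1][1] * in.y[lane] + m[1][2] * in.z[lane] + m[1][3];
        out.z[lane] = m[2][0] * in.x[lane] + m[2][1] * in.y[lane] + m[2][2] * in.z[lane] + m[2][3];
    }
    return out;
}
```

The point of the layout is that no lane ever needs data from another lane, which is what lets one instruction advance several vertices at once.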
Input Stream Memory Allocation

- Create SW-processed primitives in system memory (using the D3DUSAGE_SOFTWAREPROCESSING usage flag at creation)
- If the same VB is processed both in SW and HW
  - Try to avoid it
  - If you must, create multiple copies, one in system memory and one in driver memory
- If the primitive is never clipped, use the D3DUSAGE_DONOTCLIP usage flag
Primitive Batching
- Batch all the SW-processed primitives together
- SW processes the entire VB range that you submit; if multiple primitives use the same VB, squeeze the vertex range
- As with HW, bigger primitives are always better (the PSGP has a long setup)
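One way to "squeeze" the range is to scan the indices a draw call will use and submit only the span of vertices they touch. The sketch below assumes 16-bit indices; `VertexRange` and `squeezeRange` are hypothetical names. The two results are the kind of values an indexed draw call takes as its minimum index and vertex count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Smallest contiguous vertex range referenced by a run of indices.
struct VertexRange {
    std::uint32_t minIndex;     // first vertex referenced
    std::uint32_t numVertices;  // vertices spanning [minIndex, maxIndex]
};

// Scan `count` indices starting at `start` and return the range they cover.
VertexRange squeezeRange(const std::vector<std::uint16_t>& indices,
                         std::size_t start, std::size_t count) {
    assert(count > 0 && start + count <= indices.size());
    auto first = indices.begin() + static_cast<std::ptrdiff_t>(start);
    auto last = first + static_cast<std::ptrdiff_t>(count);
    auto mm = std::minmax_element(first, last);
    const std::uint32_t lo = *mm.first;
    const std::uint32_t hi = *mm.second;
    return VertexRange{lo, hi - lo + 1};
}
```

The scan costs one pass over the indices, so it is best done once at load time rather than per frame.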
Primitive Batching (Cont)

- The PSGP batches the processed vertices before sending them to HW (to reduce the HW's VB changes)
- Primitives are batched as long as their output FVF is equal:
  - XYZ | NORMAL | TEX1 and XYZ | DIFFUSE | TEX1 have the same output FVF (XYZRHW | DIFFUSE | TEX1)
- In SW mode, changing the VB FVF does not mean a slowdown (unlike HW)
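To see why two different input formats can land in the same batch, the sketch below derives an output format from an input format. The mapping is an assumption inferred from the single example on the slide, not the PSGP's actual rule: the position becomes transformed, the normal is consumed by lighting, and a diffuse color is always written. The bit values are made up for this sketch and are not the real D3DFVF_* constants.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical format bits for this sketch (NOT the real D3DFVF_* values).
enum FormatBits : std::uint32_t {
    kXYZ      = 1u << 0,  // untransformed position
    kXYZRHW   = 1u << 1,  // transformed position
    kNormal   = 1u << 2,
    kDiffuse  = 1u << 3,
    kSpecular = 1u << 4,
    kTex1     = 1u << 5
};

// Assumed mapping from input format to output format.
// `needsSpecular` stands for any state that adds a specular output,
// such as specular lighting or fog written to specular alpha.
std::uint32_t outputFormat(std::uint32_t input, bool needsSpecular) {
    assert((input & kXYZ) != 0);  // the sketch only handles untransformed input
    std::uint32_t out = input & ~(kXYZ | kNormal);  // drop raw position and normal
    out |= kXYZRHW | kDiffuse;                      // assumed: always transformed + diffuse
    if (needsSpecular) {
        out |= kSpecular;
    }
    return out;
}
```

Under these assumptions, sorting draw calls by `outputFormat(...)` rather than by input format would keep batches together.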
Clipping Render-state
When clipping is enabled, the PSGP:
- Stores its output to a system memory buffer
  - As it needs to read vertices in order to clip
  - The driver needs to copy it across the AGP
  - When clipping is disabled, it writes to a driver-allocated buffer: no copy here!
- Calculates clip flags (out-codes) for each vertex
  - More execution cycles per vertex
- Clips
Minimize the amount of clipping
- Use bounding boxes/spheres on your objects
- Don't forget to take the guard-band into account
Clipping Render-state (Cont)

Pseudo-code to minimize clipping:
    if (BB is outside screen)
        Don't render primitive
    else if (BB is inside guard-band)
        Render with clipping off
    else
        Render with clipping on
Typical game scene should have <10% of primitives clipped
- Biggest problem is front plane clipping
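The decision above could be sketched in C++ as follows, assuming the bounding box has already been projected to a screen-space rectangle. `Rect`, `ClipDecision` and `classify` are hypothetical names. The `crossesFrontPlane` parameter is an added assumption: a box that straddles the front plane has no reliable projected rectangle, so the sketch always clips it.

```cpp
#include <cassert>

// Axis-aligned rectangle in screen space (hypothetical helper type).
struct Rect { float minX, minY, maxX, maxY; };

enum class ClipDecision { Skip, DrawNoClip, DrawWithClip };

// True when the two rectangles share at least one point.
bool overlaps(const Rect& a, const Rect& b) {
    return a.minX <= b.maxX && a.maxX >= b.minX &&
           a.minY <= b.maxY && a.maxY >= b.minY;
}

// True when `inner` lies entirely within `outer`.
bool contains(const Rect& outer, const Rect& inner) {
    return inner.minX >= outer.minX && inner.maxX <= outer.maxX &&
           inner.minY >= outer.minY && inner.maxY <= outer.maxY;
}

ClipDecision classify(const Rect& bounds, const Rect& screen,
                      const Rect& guardBand, bool crossesFrontPlane) {
    assert(contains(guardBand, screen));  // guard band is expected to enclose the screen
    if (crossesFrontPlane) {
        return ClipDecision::DrawWithClip;  // projected bounds are not trustworthy
    }
    if (!overlaps(bounds, screen)) {
        return ClipDecision::Skip;          // nothing visible, do not render
    }
    if (contains(guardBand, bounds)) {
        return ClipDecision::DrawNoClip;    // hardware guard band absorbs the overshoot
    }
    return ClipDecision::DrawWithClip;
}
```

An application would switch the clipping render-state only when the decision changes between consecutive draws, so that batches stay as large as possible.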
Performance Render-states
- Specular: very expensive
- LocalViewer: smaller performance impact than in HW, but still costs more
- NormalizeNormals: extra work for the PSGP, use only when needed
- Fog: written as specular alpha, can change the PSGP's output FVF
DirectX* 8.0 Graphics New Features

- Point sprites
- Tweening
- Indexed vertex blending / indexed palette skinning
- Vertex shaders
Point Sprites
- PSGP writes in the native FVF format
- If the HW does not support point sprites
  - Each point is expanded to a quad, using the calculated point size
  - The quad list is submitted to the driver
- Very slow solution if there is no HW support for point sprites; try to avoid it
Tweening
- Tween the position and normal before transformation (in SIMD)
- After tweening, the standard PSGP flow continues
- Costs very few cycles
  - But for tweening and transformation only, a vertex shader would run faster
  - Try to compare your exact scenario to a vertex shader
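For reference, tweening is a per-component linear blend of two key frames. The sketch below is a scalar version under that assumption; `Vec3` and `tween` are hypothetical names, and a real implementation would process several vertices per SIMD iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Blend two key frames: result = a * w + b * (1 - w).
// The same formula would apply to normals, which may need renormalizing afterwards.
std::vector<Vec3> tween(const std::vector<Vec3>& a,
                        const std::vector<Vec3>& b, float w) {
    assert(a.size() == b.size());
    std::vector<Vec3> out(a.size());
    const float w1 = 1.0f - w;
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i].x = a[i].x * w + b[i].x * w1;
        out[i].y = a[i].y * w + b[i].y * w1;
        out[i].z = a[i].z * w + b[i].z * w1;
    }
    return out;
}
```

Each component costs one multiply and one multiply-add, which is consistent with the slide's note that tweening itself is cheap.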
Indexed Skinning
- Transforms all vertices to matrix0 space
  - Using scalar code, with a lookup for the needed matrix
- Then continues the normal PSGP flow
- DirectX* 7 style skinning is supported by some HW and may run faster, but requires multiple models and DrawPrimitive calls
Vertex shaders
- At vertex shader creation
  - The shader code is compiled to equivalent IA32 code
  - Using all possible assembly optimizations and instructions available on the client's CPU to achieve the fastest code
- At vertex shader execution
  - Calling the generated code
- SW vertex shaders have excellent performance
SW Vs. HW Vertex Shaders

- Calculates more than one vertex in a single iteration
  - Based on the processor SIMD width
- Not every shader instruction is 1 clock
  - But the CPU runs at a much higher frequency than today's 3D graphics chips

SW Vs. HW Vertex Shaders (Cont)
Simple compilation sample:

    mul r0.xyz, v0, c0

compiles to:

    movaps xmm0, [v0.x]
    mulps  xmm0, [c0.x]
    movaps xmm1, [v0.y]
    mulps  xmm1, [c0.y]
    movaps xmm2, [v0.z]
    mulps  xmm2, [c0.z]
    movaps [r0.x], xmm0
    movaps [r0.y], xmm1
    movaps [r0.z], xmm2

SW Vs. HW Vertex Shaders (Cont)
- Data that you write is data that the CPU has to calculate
  - Write only the needed data (using the vertex shader write mask)
  - Use the swizzle modifiers, and don't duplicate written data
- Vertex shader instructions are blended to achieve maximum performance
  - But keeping dependency chains tight will help the compiler with physical register assignment
Performance Tips for SW Vertex Shaders

- The m?x? macros (such as m4x4) perform better than their expanded instruction sequences
- Try to minimize the use of the address register
  - Due to the parallelism of the SW vertex shader
  - Sort the VB by the values used in the address register
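The benefit of sorting can be estimated offline. The sketch below counts how many SIMD-width groups of consecutive vertices mix different address values, before and after sorting. `SkinnedVertex`, `mixedGroups` and `sortByAddressValue` are hypothetical names, and the width of 4 is an assumption matching a 4-wide SIMD unit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vertex: only the field that feeds the address register matters here.
struct SkinnedVertex { int matrixIndex; float x, y, z; };

// Count groups of `width` consecutive vertices that do not share one index.
// Each such group would need the slower per-lane constant lookup.
std::size_t mixedGroups(const std::vector<SkinnedVertex>& vb, std::size_t width = 4) {
    assert(width > 0);
    std::size_t mixed = 0;
    for (std::size_t base = 0; base < vb.size(); base += width) {
        const std::size_t end = std::min(base + width, vb.size());
        for (std::size_t i = base + 1; i < end; ++i) {
            if (vb[i].matrixIndex != vb[base].matrixIndex) {
                ++mixed;
                break;
            }
        }
    }
    return mixed;
}

// Reorder vertices so equal address values are adjacent.
// Note: any index buffer referring to these vertices must be remapped to
// match the new order; that step is omitted from this sketch.
void sortByAddressValue(std::vector<SkinnedVertex>& vb) {
    std::stable_sort(vb.begin(), vb.end(),
                     [](const SkinnedVertex& a, const SkinnedVertex& b) {
                         return a.matrixIndex < b.matrixIndex;
                     });
}
```

Comparing `mixedGroups` before and after the sort gives a quick measure of how many iterations would avoid the re-arrangement penalty.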
Performance Tips for SW Vertex Shaders

- lit, expp and logp are big cycle consumers
  - Use the lowest accuracy (i.e. expp.x) when possible
  - Use either .x or .z (but not both)
  - exp and log are worse than expp, logp
- Don't explicitly saturate color values
  - It is done automatically
Optimized Vertex Shader

Original:

    dp4 oPos.x, v0, c2
    dp4 oPos.y, v0, c3
    dp4 oPos.z, v0, c4
    dp4 oPos.w, v0, c5
    add r1, c6, -v0
    dp3 r2, r1, r1
    rsq r2, r2
    mov oT0, v2
    mul r1, r1, r2
    dp3 r3, v1, r1
    max r3, r3, c8
    add r3, r3, c7
    min oD0, r3, c9

Optimized:

    m4x4 oPos, v0, c[2]
    add r1.xyz, c6, -v0
    dp3 r2.w, r1, r1
    rsq r2.w, r2.w
    mul r1.xyz, r1, r2.w
    dp3 r2.w, v1, r1
    max r2.w, r2.w, c8
    add oD0.xyz, r2.w, c7
    mov oT0, v2
Questions?
Ronen.Zohar@intel.com
Kim.Pallister@intel.com
Intel, Pentium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Copyright 2001 Intel Corp.
Backup
Tweening + transformation vertex shader

    mul  r0.xyz, v0, c0.x        // c0.x = tween weight
    mad  r0.xyz, v1, c0.y, r0    // c0.y = (1 - tween weight)
    m4x4 oPos, r0, c1
    mov  oD[0].xyz, v2
    mov  oT[0].xy, v3
Not Equal Address Value

Address register (x4): 1 2 1 2
Const register file (x4):
    1.0f 1.0f 1.0f 1.0f
    2.0f 2.0f 2.0f 2.0f
    3.0f 3.0f 3.0f 3.0f
Instruction argument: 1.0f 2.0f 1.0f 2.0f
Need to re-arrange a combination register for the SIMD instruction to use (costs ~20 cycles)
Equal Address Value

Address register (x4): 2 2 2 2
Const register file (x4):
    1.0f 1.0f 1.0f 1.0f
    2.0f 2.0f 2.0f 2.0f
    3.0f 3.0f 3.0f 3.0f
Instruction argument: read directly from the x4 constant register file. No penalty for re-arranging vertices
Address accessing mode is selected when storing the address value
