
At Least We Aren't Doing That
Real Life Performance Pitfalls
Allan Murphy
Fallible Engineer
XNA Developer Connection
Microsoft

Welcome
Coding at your desk
When you don't spot the ; on the end of the for()
When you dereference that NULL
When you're tempted to alter the assert condition instead of fixing the root cause
It's tempting to think everyone else is ahead of you


How Do We Know?
XDC performance studies
Major titles
real game situations
MAJOR titles
But we can't name names
The top 15 CPU performance pitfalls
Some had plenty of the 15
No title had none of the pitfalls


Whose Top 15?
Top 15 pitfalls presented by
Frequency of occurrence
Not by
How easy they are to fix
Summary of quick fixes at end
Some are indisputably hard to fix
Some are unavoidable to some degree
#1 Shader Patching
What is it?
D3D patches shaders on the fly to match:
Vertex shader to vertex declaration
Vertex shader output to pixel shader inputs
Every frame it costs
CPU time & command buffer space
What is the performance impact?
Could be several % of a frame on CPU
Depending on size of shader library
#1 Shader Patching
How do you detect it?
PIX GPU capture
In warnings tab
PIX system monitor
Dr PIX warning
XbPerfView sampling capture
Harsh but fair
Flag on D3D device creation
D3DCREATE_NO_SHADER_PATCHING

#1 Shader Patching
What can I do about it?
Avoid patching
But my shaders take a lot of space?
Look at stripping shaders
via XGMicrocodeDelete*() calls
Can save MB on larger shader libraries
Use explicit Bind() interface
Binds vertex & pixel shader without patching
DirectX 10 has strict shader linkage rules

#2 Render Setup Time
What is it?
Most engines have separate render
thread
Often the major bound on CPU side
State setting
Draw calls
Resource locks
Patching
#2 Render Setup Time
What is the performance impact?
Wasted CPU time
Possibly with redundant work
How do you detect it?
PIX system timing capture
Pick out render thread
Compare to other hardware threads
#2 Render Setup Time
What can I do about it?
Optimize the render thread
Reduce redundant state setting
This is good prep for DirectX 10 state buffer
handling
Reduce constant setting
And structuring those by update frequency
This is good prep for DirectX 10 constant buffers
Avoid shader patching (see #1)
Keep away from GPU state readback
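Redundant state setting can be filtered on the CPU side before it ever reaches the device. Below is a minimal sketch of that idea; `StateCache` and `SetRenderState` are illustrative names, not the D3D interface, and a real engine would issue the device call where the counter is incremented.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical state cache: remembers the last value written for each
// render state and skips the device call when the value is unchanged.
struct StateCache {
    std::unordered_map<uint32_t, uint32_t> last;  // state id -> last value set
    int deviceCalls = 0;                          // stands in for real D3D calls

    void SetRenderState(uint32_t state, uint32_t value) {
        auto it = last.find(state);
        if (it != last.end() && it->second == value)
            return;            // redundant set: filtered, no device call made
        last[state] = value;
        ++deviceCalls;         // here a real engine would call the device
    }
};
```

Structuring states (and constants) by update frequency makes this cache cheap: states that change once per frame live in one group, per-draw states in another.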


#3 The Dread LHS
What is it?
Everyone said "load-hit-store" all the time
Didn't make sense to me
'til I said "load hits store"
Oh right!
A load hits an in-flight store to a single address

#3 The Dread LHS
What is the performance impact?
Pipeline flush
~40-80 cycles per LHS
You can throw away 100Ks of cycles easily
How do you detect it?
PIX CPU trace recording, top issues tab
XbPerfView sampling capture
/QXSTALLS pipeline simulation
Libpmcpb counters

#3 The Dread LHS
What can I do about it?
Understand why data is stored at all
Definitely, maybe:
Avoid passing data across register sets
Work with VMX
Be careful with casts
Use __restrict
Sometimes, nothing
No store forwarding hardware
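One common source of store/load round trips is the compiler being unable to prove two pointers don't alias, forcing it to spill and re-load after every store. A sketch of the `__restrict` fix (behaviour is identical with or without the keyword; only the generated code changes):

```cpp
#include <cstddef>

// Without __restrict the compiler must assume 'out' may alias 'a' or 'b',
// so it re-loads a[i]/b[i] after each store -- a store-then-load round trip
// that can surface as an LHS on an in-order core. With __restrict the
// loaded values can stay in registers across the store.
void AddArrays(float* __restrict out,
               const float* __restrict a,
               const float* __restrict b,
               size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```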

#4 Bytes Used Per Cache Line
What is it?
Any data you pick up in the CPU comes
via the cache
A cache line is 128 bytes
A cache miss causes a whole line read
If you only access 1 byte of the line,
you are wasting memory bandwidth
To the tune of 127 bytes per 128 loaded


#4 Bytes Used Per Cache Line
What is the performance impact?
You name it
Dependent on
Data structure choice
Memory layout
Of structures
Of class/struct members
Other simultaneous cache access
From other hardware threads

#4 Bytes Used Per Cache Line
Particle system pathological example
List of particles with an active flag
Traverse list
Pick up active flag
Process if active
Implies a cache line load per particle
or a line load per particle even in an array,
if sizeof(particle) > 128

#4 Bytes Used Per Cache Line
How do you detect it?
PIX CPU instruction trace, summary tab
108.82 bytes referenced per L2 cache line, 85.01%
Typical title 30-40% usage
Below this is worrying
Above this is perfectly possible
Requires your whole team to be thinking
#4 Bytes Used Per Cache Line
PIX CPU instruction trace, top issues tab
L2 miss
% of data used
e.g. 0.78% -> byte flag check per line
Inspection of structure alignment
Think about data structure layout

#4 Bytes Used Per Cache Line
What can I do about it?
Hot cold data split
Particle active array
Ideal for cache friendly block allocator
Use the smallest memory data you can
shorts, bytes, half floats, bit fields
Load and convert to more convenient types

#4 Bytes Used Per Cache Line
Noncontiguous data structures
Traverse once only if you can
Consider more contiguous structures
Fixed array versus list
Cache optimized structures exist
e.g. van Emde Boas trees
But sometimes, you cant avoid it
Other dictates override cache efficiency

#5 Heavy Duty Maths
What is it?
Maths takes up disproportionate CPU %
In league with our old friend LHS
And maybe L2 miss as well
Calculations with:
Float by float calculation
Mixed float and int
Mixed vector and int

#5 Heavy Duty Maths
What is the performance impact?
You name it, depends on game
Especially shows up in:
Physics & collision
Particles
Animation playback
Unpacking

#5 Heavy Duty Maths
How do you detect it?
Any timing mechanism you have
PIX CPU instruction trace
Function view, exclusive %
Top issues tab
fcmp, fdiv, fsqrt, int mul, int div, sraw
XDC sekrit measure
% of VMX instructions in summary tab
XbPerfView sampling capture
#5 Heavy Duty Maths
What can I do about it?
Restructure:
For VMX
Careful to use native storage, i.e. __vector4
Give VMX room to beat straight float
To remove LHS
In particular register set swaps
Use /QXSTALLS, XbPerfView and PIX

#6 Floating-Point Compares
What is it?
Part of heavy maths signature
fcmp instruction
Nothing wrong with the instruction
Compiler very often uses fcmp to
control a branch immediately after the compare
Pipeline problem:
Branch pipeline much shorter than scalar float
#6 Floating-Point Compares
Our beloved CPU is in-order
With no result to branch with, CPU must:
Flush instruction pipelines
Reissue instructions
Rinse and repeat until branch results ready
Can result in several flushes
#6 Floating-Point Compares
What can I do about it?
Refactor code to remove branch
Calculate both results, choose with __fsel
float __fsel(float a, float b, float c);

float result = (a >= 0.0f) ? b : c;
float result = __fsel(a, b, c);
Compiler only sometimes does it for you
Be aware
Sometimes hefty calc cheaper than fcmp
Similarly for vcmp.
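A portable stand-in for the `__fsel` rewrite, just to pin down the semantics (on Xbox 360 the intrinsic compiles to a single branch-free fsel instruction; the helper name here is illustrative). Note `__fsel` tests a >= 0, so it selects b when a is exactly zero:

```cpp
// Portable model of the __fsel intrinsic: select b when the predicate
// value is non-negative, c otherwise. No branch is required on hardware
// that has fsel; this C++ version only models the result.
inline float fsel(float a, float b, float c) {
    return (a >= 0.0f) ? b : c;
}

// Branchy original and fsel-style rewrite compute the same value:
//   was: return (x >= 0.0f) ? x : 0.0f;
inline float ClampToZero(float x) {
    return fsel(x, x, 0.0f);
}
```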

#7 Page Table Thrash
What is it?
Very common problem
but well documented
Each Xbox 360 core has a 1024 entry TLB
Translates virtual to physical addresses
Each entry references a memory page
TLB referenced on every read and write
A TLB miss is very expensive
#7 Page Table Thrash
What is the performance impact?
Only 1024 pages
Accessing more than 1024 causes TLB thrash
Can cost 5% or more of your CPU time
How do you detect it?
PIX CPU instruction trace, summary tab
TLB statistics section will actually say
Use large pages to avoid excess TLB misses.


#7 Page Table Thrash
What can I do about it?
Easy fix
4 KB pages used by default
1024 * 4 KB = 4 MB, 1024 * 64 KB = 64 MB
Core unlikely to reference > 64 MB in a frame
Use 64 KB pages
Pass MEM_LARGE_PAGES flag to
VirtualAlloc()
XPhysicalAlloc()
Make stack size a multiple of 64 KB
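The coverage arithmetic above, spelled out as code:

```cpp
// TLB reach = number of entries * page size. With 1024 entries, 4 KB pages
// cover only 4 MB of address space before thrashing; 64 KB pages cover 64 MB.
constexpr unsigned long long TlbCoverage(unsigned entries,
                                         unsigned long long pageBytes) {
    return entries * pageBytes;
}

constexpr unsigned long long kCoverage4K  = TlbCoverage(1024, 4ull * 1024);   // 4 MB
constexpr unsigned long long kCoverage64K = TlbCoverage(1024, 64ull * 1024);  // 64 MB
```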

#8 Failed Inlining
What is it?
Small functions are not inlined
You pay the cost of the function call
branch
Maybe some GPR save work
Maybe an LHS on very short functions
Marked as function prolog/epilog in PIX
What is the performance impact?
Depends on your code base
Can be surprisingly nasty on stuff like accessors
#8 Failed Inlining
How do you detect it?
XbPerfView sampling capture
PIX CPU instruction trace, functions tab
Sort by function size
Spot functions
With 1-5 instructions
That are called often
#8 Failed Inlining
What can I do about it?
Checks for inlining
Put small functions in headers, not .cpps
The compiler can't inline a virtual call
Unless it is certain of the exact class
Ensure functions called are also inlineable
Use LTCG to inline across modules
Use /Ob2 compiler option
Use __forceinline (with care)
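The "headers, not .cpps" point in code form: a body defined in-class (or in a header) is visible at every call site, so the compiler can inline it; the same accessor defined only in a .cpp cannot be inlined into other translation units without LTCG. Class and member names here are just an example.

```cpp
// Accessors defined in the class body are implicitly inline and visible
// wherever the header is included -- the compiler can fold X() into a
// single register load at the call site instead of emitting a call.
class Transform {
public:
    float X() const { return x_; }      // defined in-class: inlineable
    void  SetX(float x) { x_ = x; }
private:
    float x_ = 0.0f;
};
```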

#9 Block On Thread Sync
What is it?
Threads spending a lot of time syncing
Perhaps on:
Shared data access
Waiting on other tasks completing
Blocking API calls
What is the performance impact?
You name it, the sky is the limit
Down to how you handle multiple cores

#9 Block On Thread Sync
How do you detect it?
PIX system timing capture
Look for large areas of white
This is your hardware thread waiting
#9 Block On Thread Sync
What can I do about it?
Love to tell you there's a flag to set
XFILL_IN_SYNC_GAPS_WITH_WORK
But there isn't - requires careful design
Generally, you can't fix this issue quickly
if (thisProject.NearEndOfProject(cEpsilon))
{
    thisProject.OptimiseElsewhere();
    thisTeam.CrossFingers();
    engineTeam.QueueTask(RedesignMulticoreSystem);
}

#10 memcpy
What is it?
Your title calls memcpy()
Perhaps for very small blocks
Perhaps very often
Perhaps for unsuitably aligned memory
Memcpy
copies byte by byte on < 4-byte aligned blocks
Penalty is severe
Memset and memcpy are general purpose


#10 memcpy
What is the performance impact?
Depends how much you call them
How do you detect it?
PIX CPU instruction trace, function tab
Sort by name
Sometimes the XDK calls memcpy

#10 memcpy
What can I do about it?
Align to 8 bytes for best performance from memcpy
And then
Don't use memcpy, use XMemCpy
Align as large as you can, 16 if possible
XMemCpy uses dcbz128
Clears lines we know we're going to completely fill
Cuts memory bandwidth for the copy by 1/3
Since we don't need to read the destination cache line

#10 memcpy
Consider XMemCpy128
Faster again
Size is multiple of 128
Source 16 byte, dest 128 byte aligned
Consider XMemCpyStreaming
Warning
Memcpy faster in a few scenarios
But not typical game situations
Check XDK docs for exact information
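The XMemCpy128 preconditions (size a multiple of 128, destination 128-byte aligned, source 16-byte aligned) can be checked with a couple of lines of pointer arithmetic. This wrapper is a hedged sketch, not the real XDK routine: on 360 the real thing exploits dcbz128, while here we only verify the contract and fall back to plain memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// True when pointer p is aligned to 'a' bytes (a must be a power of two).
inline bool IsAligned(const void* p, size_t a) {
    return (reinterpret_cast<uintptr_t>(p) & (a - 1)) == 0;
}

// Hypothetical wrapper enforcing XMemCpy128-style preconditions before
// copying. A real implementation would use cache-line-sized transfers.
inline void CopyBlocks128(void* dst, const void* src, size_t bytes) {
    assert(bytes % 128 == 0);     // whole cache lines only
    assert(IsAligned(dst, 128));  // dest on a cache line boundary
    assert(IsAligned(src, 16));   // source at least 16-byte aligned
    std::memcpy(dst, src, bytes);
}
```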
#11 High Instruction Count
What is it?
Plain and simple
Trying to do too much on one hardware thread
Hardware can't get it all done in time
What is the performance impact?
You don't run at 30 fps


#11 High Instruction Count
How do you detect it?
Game runs outside your planned interval
Duh, thanks!
Scientific evidence shows:
Games get ~0.2 instructions per cycle average
(3.2 GHz * 0.2 IPC) / 30 fps = ~21M instructions
PIX CPU instruction trace for whole frame
>21M total instructions executed?
You're up Dawson's Creek
#11 High Instruction Count
What can I do about it?
You can optimize like hell
But maybe, this isn't going to hack it
Especially if average case is taking too long
Or, you can consider the unpalatable
Do less
Good luck with your publisher
Reconsider threading strategy
Especially with very static approach
#12 D3D Resource Locks
What is it?
CPU stalls on D3D resource access
Waiting to lock a buffer
Either in use by GPU
or another thread
Reading back from GPU
What is the performance impact?
Again, depends on your game

#12 D3D Resource Locks
How do you detect it?
Simplest way is via PIX
Dr PIX warning in System Monitor
% Frame D3D Blocked
What can I do about it?
Double buffer anything you will need to lock
Ensure multiple threads not trying to lock
the same thing
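The double-buffering advice can be sketched as a write buffer and a read buffer swapped each frame, so the CPU never locks a resource the GPU (or another thread) is still consuming. Types and method names below are illustrative, not the D3D interface.

```cpp
#include <array>
#include <vector>

// Double buffer: the CPU fills buffers[frame & 1] while the consumer reads
// the buffer filled last frame. Flip() each frame; neither side ever waits
// on a lock held by the other.
struct DoubleBuffer {
    std::array<std::vector<float>, 2> buffers;
    unsigned frame = 0;

    std::vector<float>& Write() { return buffers[frame & 1]; }
    const std::vector<float>& Read() const { return buffers[(frame + 1) & 1]; }
    void Flip() { ++frame; }
};
```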

#13 Threading Model
What is it?
This is really the big brother of #9
We have seen titles that don't thread at all
But much less common now
All threading models are equal
Well yes and no
Some are much more equal than others
Various feasible tasking models
Task pool systems scale best, we have found
Good for Windows too
#13 Threading Model
What can I do about it?
Plan for the future
Multicore design ever more crucial
Many-core hardware
Parallel rendering setup support
Rendering and DirectX futures
CPU side Gamefest, Intel, etc.
See some of the multicore presentations

#14 Thread ID Queries
What is it?
Engine code checks what thread it's on
Perhaps via GetCurrentThreadId()
Some engines do stuff like:
if (GetCurrentThreadId() == MASTER_THREAD_ID)
    DoSomeDirectHardwareOperation();
else
    QueueUpHardwareOperationForMasterThread();
Gets expensive
Better to specialize the threads
#14 Thread ID Queries
What is the performance impact?
Depends how often you do it
As an aside, TlsGetValue() and others are slow
How do you detect it?
PIX CPU instruction trace
Functions tab
GetCurrentThreadId() with high call count

#14 Thread ID Queries
What can I do about it?
Avoid GetCurrentThreadId()
Specialize master/slave threads
Implicitly know which thread they run on
Or
Write hardware thread-agnostic tasks
See also task pool / multicore materials
For thread local storage
Use __declspec(thread)
Not TlsGetValue()
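"Specialize the threads" means each thread is told its role when it is created, instead of re-discovering it via GetCurrentThreadId() on every operation. A minimal sketch (all types and names here are hypothetical):

```cpp
#include <vector>

// Each thread is constructed knowing its role, so the per-operation
// thread-ID query disappears: the master talks to hardware directly,
// workers queue work for it.
enum class ThreadRole { Master, Worker };

struct HardwareQueue {
    std::vector<int> pending;   // ops queued for the master thread
    int executed = 0;           // ops the master ran directly
};

void SubmitOp(ThreadRole role, int op, HardwareQueue& q) {
    if (role == ThreadRole::Master)
        ++q.executed;             // direct hardware path
    else
        q.pending.push_back(op);  // deferred to the master
}
```

In practice the role would be baked into each thread's entry point, so hot code never branches on identity at all.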

#15 C++ Exceptions
What is it?
Using standard C++ exceptions in Xbox 360 compiler
And then
Using them in code running every frame

What is the performance impact?
try/catch block
Inserts a set of branches around the catch block
Which are executed every time, regardless of error
Bloats function containing catch (slightly)
As a regularly used error handling mechanism
...this isn't performance friendly

#15 C++ Exceptions
How do you detect it?
Turn off /EHsc compiler option, build and
let the compiler point out exception usage

What can I do about it?
SEH works OK
But don't use standard exceptions
They don't work in all situations
They will not be fully supported on Xbox 360
Many console developers are not in the habit of using exceptions
Compiler team focusing on more in-demand features


1 Hour to a Pay Rise
Checklist
TLB Thrash
Specify large pages
Stack size 64-KB multiple
PgoLite
1 or more PIX CPU instruction trace(s)
PgoLite command-line tool builds a link order file
Replace memset and memcpy
With XMemSet and XMemCpy, obviously

1 Hour to a Pay Rise
Compiler options

/Ox - full optimization
Includes many of the below

/Ob2 - inline any suitable function
/Oi - enable intrinsics
/Os - favor small code
/fp:fast - fast float
/Oc - disable traps around integer divide
/Ou - prescheduling optimization on
1 Hour to a Pay Rise
_SECURE_SCL=0
Ditches STL iterator checking
De-bloats calls to iterators
Whole Program Optimization LTCG
/GL on all modules
Use ltcg builds of libs
D3d9, xaudio, xact, xmcore, xui
Some middleware providers, too
Too slow?
Keep it for build machine / final builds

1 Hour to a Pay Rise
Linker options
/opt:ref - remove unused code
/opt:icf - collapse identical functions
Can cause profiling confusion
/ltcg
@linkorderfile.txt
As generated by PgoLite

Useful Tools
Useful tools
PIX system monitor
PIX system timing capture
PIX CPU instruction trace
PIX retroactive system timing capture
Via continuous capture
great for CPU spikes


Useful Tools
Instrumented libraries
d3d9i.lib, xapilibi.lib
Ditch d3d9i.lib for CPU profiling
PgoLite
Pipeline animator
/QXSTALLS
LTCG
Check for middleware LTCG libs
If none, ask them why not

Useful References
Valuable references
Gamefest 2007
Performance Updates and Optimization Case
Studies
Gamasutra LHS article
GDC 2003 Memory Optimization deck

Useful References
White papers
A Detailed Examination of the Xbox 360 CPU
Pipelines
Xbox 360 CPU: Best Practices
Xbox 360 CPU Caches
Xbox 360 Memory Page Sizes

The End
With thanks to
Dave Cook
Bruce Dawson
The PIX team

and the shadowy Dr PIX

Any questions?

www.xnagamefest.com
© 2008 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only.
Microsoft makes no warranties, express or implied, in this summary.
Dawson's Creek Figures
Clock rate = 3.2 GHz = 3,200,000,000 cycles per second
60 fps = 53,333,333 cycles per frame
30 fps = 106,666,666 cycles per frame

Dawson's Law: average 0.2 IPC in a game title
Therefore
at 60 fps, you can do 10,666,666 instructions ~= 10M
at 30 fps, you can do 21,333,333 instructions ~= 21M

Or put another way, how bad is a 1M-cycle penalty?
It's approx 200K instructions of quality execution going missing.
1M cycles is 1/50th (2%) of a frame at 60 fps, or 1/100th (1%) of a frame at 30 fps.
1M cycles is ~0.31 ms.
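The appendix arithmetic, checked in code:

```cpp
// Frame budgets under "Dawson's Law" (~0.2 IPC average in game titles)
// on a 3.2 GHz core.
constexpr double kClockHz = 3.2e9;
constexpr double kIPC = 0.2;

constexpr double CyclesPerFrame(double fps)    { return kClockHz / fps; }
constexpr double InstructionBudget(double fps) { return CyclesPerFrame(fps) * kIPC; }
```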
