
Thinking Parallel, Part II: Tree Traversal on the GPU

By Tero Karras, posted Nov 26 2012 at 09:09 PM, NVIDIA Developer Zone (CUDA Zone > Parallel Forall)
https://developer.nvidia.com/content/thinking-parallel-part-ii-tree-traversal-gpu

Tags: Algorithms, Parallel Programming

In the first part of this series, we looked at collision detection on the GPU and discussed two commonly used algorithms that find potentially colliding pairs in a set of 3D objects using their axis-aligned bounding boxes (AABBs). Each of the two algorithms has its weaknesses: sort and sweep suffers from high execution divergence, while uniform grid relies on too many simplifying assumptions that limit its applicability in practice. In this part we will turn our attention to a more sophisticated approach, hierarchical tree traversal, that avoids these issues to a large extent. In the process, we will further explore the role of divergence in parallel programming, and show a couple of practical examples of how to improve it.

Bounding Volume Hierarchy


We will build our approach around a bounding volume hierarchy (BVH), which is a commonly used acceleration structure in ray tracing (for example). A bounding volume hierarchy is essentially a hierarchical grouping of 3D objects, where each group is associated with a conservative bounding box.

Suppose we have eight objects, O1-O8, the green triangles in the figure above. In a BVH, individual objects are represented by leaf nodes (green spheres in the figure), groups of objects by internal nodes (N1-N7, orange spheres), and the entire scene by the root node (N1). Each internal node (e.g. N2) has two children (N4 and N5), and is associated with a bounding volume (orange rectangle) that fully contains all the underlying objects (O1-O4). The bounding volumes can basically be any 3D shapes, but we will use axis-aligned bounding boxes (AABBs) for simplicity. Our overall approach is to first construct a BVH over the given set of 3D objects, and then use it to accelerate the search for potentially colliding pairs. We will postpone the discussion of efficient hierarchy construction to the third part of this series. For now, let's just assume that we already have the BVH in place.

Independent Traversal
Given the bounding box of a particular object, it is straightforward to formulate a recursive algorithm to query all the objects whose bounding boxes it overlaps. The following function takes a BVH in the parameter bvh and an AABB to query against it in the parameter queryAABB. It tests the AABB against the BVH recursively and returns a list of potential collisions.

    void traverseRecursive(CollisionList& list,
                           const BVH&     bvh,
                           const AABB&    queryAABB,
                           int            queryObjectIdx,
                           NodePtr        node)
    {
        // Bounding box overlaps the query => process node.
        if (checkOverlap(bvh.getAABB(node), queryAABB))
        {
            // Leaf node => report collision.
            if (bvh.isLeaf(node))
                list.add(queryObjectIdx, bvh.getObjectIdx(node));

            // Internal node => recurse to children.
            else
            {
                NodePtr childL = bvh.getLeftChild(node);
                NodePtr childR = bvh.getRightChild(node);
                traverseRecursive(list, bvh, queryAABB,
                                  queryObjectIdx, childL);
                traverseRecursive(list, bvh, queryAABB,
                                  queryObjectIdx, childR);
            }
        }
    }

The idea is to traverse the hierarchy in a top-down manner, starting from the root. For each node, we first check whether its bounding box overlaps with the query. If not, we know that none of the underlying leaf nodes will overlap it either, so we can skip the entire subtree. Otherwise, we check whether the node is a leaf or an internal node. If it is a leaf, we report a potential collision with the corresponding object. If it is an internal node, we proceed to test each of its children in a recursive fashion. To find collisions between all objects, we can simply execute one such query for each object in parallel. Let's turn the above code into CUDA C++ and see what happens.

    __device__ void traverseRecursive(CollisionList& list,
                                      const BVH&     bvh,
                                      const AABB&    queryAABB,
                                      int            queryObjectIdx,
                                      NodePtr        node)
    {
        // same as before...
    }

    __global__ void findPotentialCollisions(CollisionList list,
                                            BVH           bvh,
                                            AABB*         objectAABBs,
                                            int           numObjects)
    {
        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        if (idx < numObjects)
            traverseRecursive(list, bvh, objectAABBs[idx],
                              idx, bvh.getRoot());
    }

Here, we have added the __device__ keyword to the declaration of traverseRecursive(), to indicate that the code is to be executed on the GPU. We have also added a __global__ kernel function that we can launch from the CPU side. The BVH and CollisionList objects are convenience wrappers that store the GPU memory pointers needed to access BVH nodes and report collisions. We set them up on the CPU side, and pass them to the kernel by value.

The first line of the kernel computes a linear 1D index for the current thread. We do not make any assumptions about the block and grid sizes; it is enough to launch at least numObjects threads in one way or another, and any excess threads will get terminated by the second line. The third line fetches the bounding box of the corresponding object, and calls our function to perform recursive traversal, passing the object's index and the pointer to the root node of the BVH in the last two arguments. To test our implementation, we will run a dataset taken from APEX Destruction using a GeForce GTX 690 GPU. The dataset contains 12,486 objects representing debris falling from the walls of a corridor, and 73,704 pairs of potentially colliding objects, as shown in the following screenshot.

The total execution time of our kernel for this dataset is 3.8 milliseconds. Not very good, considering that this kernel is just one part of collision detection, which is only one part of a simulation that we would ideally like to run at 60 FPS (16 ms). We should be able to do better.

Minimizing Divergence
The most obvious problem with our recursive implementation is high execution divergence. The decision of whether to skip a given node or recurse to its children is made independently by each thread, and there is nothing to guarantee that nearby threads will remain in sync once they have made different decisions. We can fix this by performing the traversal in an iterative fashion, and managing the recursion stack explicitly, as in the following function.

    __device__ void traverseIterative(CollisionList& list,
                                      BVH&           bvh,
                                      AABB&          queryAABB,
                                      int            queryObjectIdx)
    {
        // Allocate traversal stack from thread-local memory,
        // and push NULL to indicate that there are no postponed nodes.
        NodePtr stack[64];
        NodePtr* stackPtr = stack;
        *stackPtr++ = NULL; // push

        // Traverse nodes starting from the root.
        NodePtr node = bvh.getRoot();
        do
        {
            // Check each child node for overlap.
            NodePtr childL = bvh.getLeftChild(node);
            NodePtr childR = bvh.getRightChild(node);
            bool overlapL = (checkOverlap(queryAABB, bvh.getAABB(childL)));
            bool overlapR = (checkOverlap(queryAABB, bvh.getAABB(childR)));

            // Query overlaps a leaf node => report collision.
            if (overlapL && bvh.isLeaf(childL))
                list.add(queryObjectIdx, bvh.getObjectIdx(childL));
            if (overlapR && bvh.isLeaf(childR))
                list.add(queryObjectIdx, bvh.getObjectIdx(childR));

            // Query overlaps an internal node => traverse.
            bool traverseL = (overlapL && !bvh.isLeaf(childL));
            bool traverseR = (overlapR && !bvh.isLeaf(childR));

            if (!traverseL && !traverseR)
                node = *--stackPtr; // pop
            else
            {
                node = (traverseL) ? childL : childR;
                if (traverseL && traverseR)
                    *stackPtr++ = childR; // push
            }
        }
        while (node != NULL);
    }

The loop is executed once for every internal node that overlaps the query box. We begin by checking the children of the current node for overlap, and report an intersection if one of them is a leaf. We then check whether the overlapped children are internal nodes that need to be processed in a subsequent iteration. If there is only one such child, we simply set it as the current node and start over. If there are two, we set the left child as the current node and push the right child onto the stack. If there are no children to be traversed, we pop a node that was previously pushed to the stack. The traversal ends when we pop NULL, which indicates that there are no more nodes to process.

The total execution time of this kernel is 0.91 milliseconds, a rather substantial improvement over 3.8 ms for the recursive kernel! The reason for the improvement is that each thread is now simply executing the same loop over and over, regardless of which traversal decisions it ends up
making. This means that nearby threads execute every iteration in sync with each other, even if they are traversing completely different parts of the tree. But what if threads are indeed traversing completely different parts of the tree? That means that they are accessing different nodes (data divergence) and executing a different number of iterations (execution divergence). In our current algorithm, there is nothing to guarantee that nearby threads will actually process objects that are nearby in 3D space. The amount of divergence is therefore very sensitive to the order in which the objects are specified. Fortunately, we can exploit the fact that the objects we want to query are the same objects from which we constructed the BVH. Due to the hierarchical nature of the BVH, objects close to each other in 3D are also likely to be located in nearby leaf nodes. So let's order our queries the same way, as shown in the following kernel code.

    __global__ void findPotentialCollisions(CollisionList list,
                                            BVH           bvh)
    {
        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        if (idx < bvh.getNumLeaves())
        {
            NodePtr leaf = bvh.getLeaf(idx);
            traverseIterative(list, bvh,
                              bvh.getAABB(leaf),
                              bvh.getObjectIdx(leaf));
        }
    }

Instead of launching one thread per object, as we did previously, we are now launching one thread per leaf node. This does not affect the behavior of the kernel, since each object will still get processed exactly once. However, it changes the ordering of the threads to minimize both execution and data divergence. The total execution time is now 0.43 milliseconds; this trivial change improved the performance of our algorithm by another 2x!

There is still one minor problem with our algorithm: every potential collision will be reported twice, once by each participating object, and objects will also report collisions with themselves. Reporting twice as many collisions also means that we have to perform twice as much work. Fortunately, this can be avoided through a simple modification to the algorithm. In order for object A to report a collision with object B, we require that A must appear before B in the tree. To avoid traversing the hierarchy all the way to the leaves in order to find out whether this is the case, we can store two additional pointers for every internal node, to indicate the rightmost leaf that can be reached through each of its children. During the traversal, we can then skip a node whenever we notice that it cannot be used to reach any leaves that would be located after our query node in the tree.

    __device__ void traverseIterative(CollisionList& list,
                                      BVH&           bvh,
                                      AABB&          queryAABB,
                                      int            queryObjectIdx,
                                      NodePtr        queryLeaf)
    {
        ...

        // Ignore overlap if the subtree is fully on the
        // left-hand side of the query.
        if (bvh.getRightmostLeafInLeftSubtree(node) <= queryLeaf)
            overlapL = false;
        if (bvh.getRightmostLeafInRightSubtree(node) <= queryLeaf)
            overlapR = false;

        ...
    }

After this modification, the algorithm runs in 0.25 milliseconds. That is a 15x improvement over our starting point, and most of our optimizations were only aimed at minimizing divergence.

Simultaneous Traversal

In independent traversal, we are traversing the BVH for each object independently, which means that no work we perform for a given object is ever utilized by the others. Can we improve upon this? If many small objects happen to be located nearby in 3D, each one of them will essentially end up performing almost the same traversal steps. What if we grouped the nearby objects together and performed a single query for the entire group?

This line of thought leads to an algorithm called simultaneous traversal. Instead of looking at individual nodes, the idea is to consider pairs of nodes. If the bounding boxes of the nodes do not overlap, we know that there will be no overlap anywhere in their respective subtrees, either. If, on the other hand, the nodes do overlap, we can proceed to test all possible pairings between their children. Continuing this in a recursive fashion, we will eventually reach pairs of overlapping leaf nodes, which correspond to potential collisions.

On a single-core processor, simultaneous traversal works really well. We can start from the root, paired with itself, and perform one big traversal to find all the potential collisions in one go. The algorithm performs significantly less work than independent traversal, and there really is no downside to it: the implementation of one traversal step looks roughly the same in both algorithms, but there are simply fewer steps to execute in simultaneous traversal (60% fewer in our example). It's a better algorithm, right?

To parallelize simultaneous traversal, we must find enough independent work to fill the entire GPU. One easy way to accomplish this is to start the traversal a few levels deeper in the hierarchy. We could, for example, identify an appropriate cut of 256 nodes near the root, and launch one thread for each pairing of the nodes (32,896 in total). This would result in sufficient parallelism without increasing the total amount of work too much. The only source of extra work is that we need to perform at least one overlap test for each initial pair, whereas the single-core implementation would avoid some of the pairs altogether.
So, the parallel implementation of simultaneous traversal does less work than independent traversal, and it does not lack in parallelism, either. Sounds good, right?

Wrong. It actually performs a lot worse than independent traversal. How is that possible? The answer is, you guessed it, divergence. In simultaneous traversal, each thread is working on a completely different portion of the tree, so the data divergence is high. There is no correlation between the traversal decisions made by nearby threads, so the execution divergence is also high. To make matters even worse, the execution times of the individual threads vary wildly: threads that are given a non-overlapping initial pair will exit immediately, whereas the ones given a node paired with itself are likely to execute the longest.

Maybe there is a way to organize the computation differently so that simultaneous traversal would yield better results, similar to what we did with independent traversal? There have been many attempts to accomplish something like this in other contexts, using clever work assignment, packet traversal, warp-synchronous programming, dynamic load balancing, and so on. Long story short, you can get pretty close to the performance of independent traversal, but it is extremely difficult to actually beat it.

Discussion
We have looked at two ways of performing broad-phase collision detection by traversing a hierarchical data structure in parallel, and we have seen that minimizing divergence through relatively simple algorithmic modifications can lead to substantial performance improvements. Comparing independent traversal and simultaneous traversal is interesting because it highlights an important lesson about parallel programming. Independent traversal is a simple algorithm, but it performs more work than necessary overall. Simultaneous traversal, on the other hand, is more intelligent about the work it performs, but this comes at the price of increased complexity. Complex algorithms tend to be harder to parallelize, are more susceptible to divergence, and offer less flexibility when it comes to optimization.

In our example, these effects end up completely nullifying the benefits of reduced overall computation. Parallel programming is often less about how much work the program performs than about whether that work is divergent or not. Algorithmic complexity often leads to divergence, so it is important to try the simplest algorithm first. Chances are that after a few rounds of optimization, the algorithm runs so well that more complex alternatives have a hard time competing with it.

In my next post, I will focus on parallel BVH construction, talk about the problem of occupancy, and present a recently published algorithm that explicitly aims to maximize it.

About the author: Tero Karras is a graphics research scientist at NVIDIA Research. Parallel Forall is the NVIDIA Parallel Programming blog. If you enjoyed this post, subscribe to the Parallel Forall RSS feed!