
Massively Parallel Computing and GPU Hardware

Types of Parallel Computing

Parallel on device (supercomputer / co-processors)
Cray built the first supercomputers. Economies of scale eventually priced out nearly all custom supercomputing solutions. Modern supercomputers are most often many thousands of commodity processors in an extremely highly tuned cluster with custom-built interconnects. Famous supercomputers / co-processors include:
- Cray
- GRAPE board (GRAvity PipE)
- IBM Blue Gene

Parallel not on device (cluster)
Types of clusters:
- Beowulf clusters: (usually) identical computers networked over TCP/IP (Ethernet or InfiniBand), built in a cluster designed specifically for supercomputing
- Grid computing: distributed computing systems that are more loosely coupled, heterogeneous in nature, and geographically dispersed

Famous Distributed Projects include: - Folding@home - SETI@home

Distributed / Parallel Computing

OpenMP (Open Multi-Processing) OpenMP is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming in C/C++ and Fortran on many architectures. Supercomputers / Co-processors are typically Shared Memory / Distributed Shared Memory parallel multiprocessing platforms. An application built with the hybrid model of parallel programming can run on a computer cluster using both OpenMP and MPI.

MPI (Message Passing Interface) MPI is a specification for an application programming interface (API) that allows many computers to communicate with one another. Clusters are Distributed Memory parallel multiprocessing platforms i.e. individual computers in the cluster are not aware of what resides in another's memory without communication. Both types of parallel computing offer unique programming and optimization challenges.

Pros and Cons of clusters and supercomputers


PRO
+ $0 to write a proposal for time on a cluster / supercomputer.
+ If properly programmed, can do amazing amounts of PC wall-clock work in a very short amount of time.
+ Clusters are well suited to running the same operation on many discrete data sets.
+ Supercomputers can be extremely effective at running massive simulations very quickly.

CON
- May need to write a proposal for cluster / supercomputer time. Writing a new proposal for each new set of simulations lacks flexibility for those who run code frequently.
- Extremely difficult to get maximum potential out of clusters with MPI.
- Supercomputer time may be very difficult to come by.
- Supercomputers / large clusters are massively expensive to build.
- Clusters are not necessarily well suited to larger data sets that cannot be broken apart and processed separately.

On-device Parallel computing


PRO
+ Keeping calculations on the device rather than transferring data over some type of network saves massive amounts of compute time.
+ The amount of programming required to take advantage of an on-board device is significantly less than with MPI/OpenMP.
+ Much of the architectural nuance of a parallel device can be hidden in the compiler. Or, conversely, more optimization can occur at significantly less cost in terms of development time.

CON
- On-device memory limitations will keep simulations necessarily small compared to what is possible on a cluster or supercomputer.
- Cannot compete with the scale of clusters and supercomputers. The sheer amount of computing power provided by a massive cluster or supercomputer is unrivaled.

Why the GPU and why now?


There are two primary answers to this question:
1. Economies of scale
2. Unified shader model / stream processing
Economies of Scale
Where Cray and many other supercomputing solutions failed is in the realm of profit. GPUs are graphics cards built for computer video-game players and graphics professionals first, and for HPC second. Technological advances have come at a rapid rate due to the ever-increasing graphical demands of modern computer gaming. Because of the constricted budget of the average PC gamer, as compared to a professional or scientist seeking an expensive solution to an HPC problem, graphics cards were necessarily produced at much higher volume with lower margins.

Classical Graphics Pipeline


Each stage of the graphics pipeline traditionally had dedicated shader processors for each operation. For example, a hypothetical traditional-pipeline video card could have:
- 8 vertex shader pipelines
- 24 pixel shader pipelines
- 24 texture filtering units
- 8 vertex texture addressing units
- 8 rendering output units

Unified Shader Model


By unifying the pixel/vertex shaders into a large programmable shader core, there are no longer discrete/separate shader core types. Since the data no longer flows down the pipeline in sequential fashion, bottlenecks were removed for gaming/video processing. However, the advent of programmable shaders is important simply because each shader is no longer forced to perform only graphics operations. The shaders can be given a kernel of our choice and can process information as the data is streamed through the device.

Sample Device Architecture

NVIDIA G80 architecture

NVIDIA / ATI / Intel GPU Solutions

NVIDIA - GeForce (gaming/movie playback), Quadro (professional graphics), Tesla (HPC)
ATI - Radeon (gaming/movie playback), FireStream (HPC)
Intel - Larrabee (HPC/gaming)

ATI
Pros
+ Open-source driver
+ CAL (Compute Abstraction Layer) and CTM (Close To Metal) allow for a high degree of optimization
+ Slightly better price for non-FireStream GPUs (Radeon)
Cons
- SDK still in 1.0 beta
- SDK only supports Windows XP (ATI says Linux support will come within the next calendar year)
- Brook/CTM is more general than CUDA ---> more code development is required to achieve similar functionality
- CUDA is considered to be more fully featured and 1+ years ahead

To run the CAL/Brook+ SDK, you need a platform based on the AMD R600 GPU or later. R600 and newer GPUs are found on ATI Radeon HD2400, HD2600, HD2900, and HD3800 graphics boards.

Intel
Tera-Scale
+ 1st teraflops research chip, named Polaris
+ 80-core prototype developed
+ Achieved 1.01 teraflops of computing performance at 3.16 GHz and 65 W
+ Later achieved 2 teraflops at 5.7 GHz and 265 W
+ To take advantage of the numerous cores of planned processors, Intel is developing Ct, a programming model that eases SIMD and multithreaded programming
- No plans for mass production; currently, Tera-Scale is just a research initiative

Larrabee
+ GPU to compete with NVIDIA/ATI
+ Will run at 1.7 - 2.5 GHz
+ 16 to 24 in-order cores supporting 4 simultaneous threads of execution
+ TDP 150 - 300 W
+- Different from NVIDIA/ATI solutions in that it will not use a custom instruction set designed for graphics, but rather an extension of the x86 instruction set

- Public release late 09 / 10

NVIDIA
+ Full-featured CUDA SDK v1.0 has been out for months
+ CUDA SDK v2.0 is already at beta 2
+ CUDA SDK supports 32-bit/64-bit Windows, Linux, OS X
+ CUDA driver supports 32-bit/64-bit Windows, Linux, OS X
+ Large selection of CUDA-enabled GPUs
+ Double precision support
+ Greater performance than ATI cards (single precision)
+ CUDA adds functionality and ease of use compared to Brook and CAL
+ Ready for deployment in the here and now
+ Rocks 4.0/5.0 operating system has CUDA rolls ready
+ Sun Grid Engine (SGE) already deployed
+ Decent CUDA programming documentation
- Slightly lower double precision performance than ATI
- Lacks good hardware architecture documentation / compiler information

The Problem with CPU Architecture

The Central Processing Unit contains one to several cores which run serial operations extremely quickly in order to make tasks appear to run simultaneously. (Newer dual- and quad-core processors obviously operate with true simultaneity.) As can be seen in the block diagram, a significant portion of the transistors on the die are designated for flow control and data caching. These CPUs are smart but narrow: of the ~820 million transistors on a Core 2, only a small fraction are dedicated to ALUs.

An arithmetic logic unit (ALU) is a digital circuit that performs arithmetic and logical operations. It is the basic building block of modern microprocessors.

Graphics Processing Units

A large portion of the processor die is spent on ALUs: ~80% of 1.4 billion transistors. Because the same function is executed on each element of data with high arithmetic intensity, there is a much reduced requirement for flow control in the hardware. Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets such as arrays can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads.

Hardware Architecture

Each multiprocessor has a Single Instruction, Multiple Thread (SIMT) architecture: at any given clock cycle, each stream processor of the multiprocessor executes the same instruction, but operates on different data. On the GTX 280 there are 30 multiprocessors with 8 stream processors each (240 SPs total). On the 8800 GTX there are 16 multiprocessors with 8 stream processors each (128 SPs total).

Memory Management Model


Access to registers is usually instantaneous and read-write -> typically costs 0 cycles
Access to shared memory is very fast and read-write
Access to texture memory is very fast and read-only
Access to constant memory is very fast and read-only
Access to global memory is slow and read-write -> typical global memory latency is 400 - 600 cycles per the CUDA programming manual

The number of registers per multiprocessor is 8192 (G80) or 16,384 (GT200). The amount of shared memory available per multiprocessor is 16 KB, organized into 16 banks. The total amount of constant memory is 64 KB. The cache working set for constant memory is 8 KB per multiprocessor. The cache working set for texture memory is 8 KB per multiprocessor. The total amount of global memory varies by card from 256 MB to 4 GB.

A function that is compiled to the instruction set of the device results in a program, called a kernel, which is downloaded to the device. The actual number of threads used per block is user defined. This value will depend on the complexity of the calculation. The more sophisticated the calculation, the more registers that are required. The more registers required, the fewer instances of the thread that can be run simultaneously.

Multiprocessor Architecture
SFU = Special Function Unit, SP = Stream Processor, RF = Register File
- Each SFU provides 1 floating point operation per cycle to each of the (4) stream processors it is attached to. It handles transcendentals, square roots, and divides.
- Each stream processor is capable of a dual-issue add/multiply, resulting in 2 floating point operations per cycle.
- Accessing a register connected to the stream processor generally takes 0 cycles. It may take a few cycles to access registers attached to other stream processors on the multiprocessor.

Device Architecture Revisited

NVIDIA G80 architecture

G80 vs GT200 Architecture


G80 GT200

GPU vs CPU Memory Performance


model      core clock (MHz)   shader clock (GHz)   effective (real) memory clock (MHz)   total memory   memory bus width   max memory bandwidth **
8800 GTX   575                1.350                1800 (900)                            768 MB         384 bit            86.4 GB/s
9800 GTX   675                1.688                2200 (1100)                           512 MB         256 bit            70.4 GB/s
9800 GX2   600                1.500                2000 (1000)                           2 x 512 MB     2 x 256 bit        2 x 64 GB/s
280 GTX    602                1.296                2214 (1107)                           1 GB           512 bit            141.7 GB/s

* GDDR3 RAM is DDR2 RAM designed for the GPU. It runs cooler so that it can be clocked higher. It is similar to DDR2 RAM in many other ways, so the actual data transfer rate is effectively 2x the operational rate, because DDR2 / GDDR3 RAM is capable of 2 data transfers per cycle.

** The actual throughput for a single channel of GDDR3 memory would be significantly lower than this, e.g.:

model      memory clock (MHz)   single channel throughput   # of channels
8800 GTX   1800 (900)           14.4 GB/s                   6
9800 GTX   2200 (1100)          17.6 GB/s                   4
9800 GX2   2000 (1000)          16.0 GB/s                   2x4
280 GTX    2214 (1107)          17.7 GB/s                   8

bandwidth per channel = 64-bit data word x 2 transfers/cycle x real memory clock = 8 bytes x effective clock

However, it appears as though each GPU has memory modules installed in chunks of 128 MB, i.e. there is one channel for each 128 MB of RAM. For example, the 512 MB 9800 GTX graphics card has 4 x 128 MB modules, resulting in a maximum throughput of 4 x 17.6 GB/s = 70.4 GB/s.

CPU/GPU Memory Continued

DDR2 800 (PC2 6400) has a bandwidth of 6.4 GB/s; DDR2 1066 (PC2 8500) has a bandwidth of 8.5 GB/s; etc. Main system memory is 64 bits (8 bytes) wide and capable of 2 transfers per clock (DDR -> Double Data Rate), i.e. DDR2 800 actually has a 400 MHz clock but effectively runs at 800 MHz:
8 bytes x 800 MHz = 6.4 GB/s

Latency

Main memory latency is in the neighborhood of 50 ns, and is as low as ~10 ns for DDR2 800+ RAM; this equates to a latency of ~5 cycles. The NVIDIA CUDA programming guide suggests that global memory latency is between 400 and 600 cycles, but that global memory has far superior bandwidth. This indicates that the GPU has a latency of roughly 350 to 650 ns.
** It is hard to be more specific than this because NVIDIA does not say whether the cycles of latency are memory, shader, or core cycles, whose frequencies range from 600 MHz to 1100 MHz.

CPU vs GPU Performance per Watt

Intel Xeon/Core 2 processors (Core microarchitecture, x86 ISA)
Operating speed = 2.0 - 3.0 GHz
x86 Core 2/Xeon arch = 4 single precision executions per cycle, 2 double precision (only true if optimized with SSE) **; 1 flop/cycle single/double precision non-SSE
TDP = 50 - 150 W
Number of cores = 2 - 4

For a 2.83 GHz Intel quad-core (95 W) processor running SSE-optimized code:
4 cores x 2.83 GHz x 8 flops/cycle = 90.4 GFLOPS theoretical peak (single precision)
                                   = 45.2 GFLOPS theoretical peak (double precision)
90.4 GFLOPS / 95 W = 952 MFLOPS/W (sp) = 476 MFLOPS/W (dp)

For a 3.0 GHz Intel Core 2 (65 W) processor with no multithreading / SSE:
1 core x 3.00 GHz x 1 flop/cycle = 3.0 Gflops (single/double precision)
3.0 Gflops / 65 W = 46 Mflops/W (single/double precision)

** When calculating single precision operations per cycle, a single operation consists of a fused multiply/add, resulting in 2 flops per execution.

CPU vs GPU Performance per Watt

NVIDIA 8800 GTX / 9800 GTX (G80, G92 architecture respectively)
Shader clock freq.   = 1.35 / 1.688 GHz
G80/G92 architecture = dual-issue add/mul + SFU* --> 3 flops/cycle single precision
TDP                  = 145 / 156 W (maximum 225 W)
Number of stream processors = 128
8800 GTX = 518.4 Gflops (single precision) = 3.575 Gflops/W
9800 GTX = 648.2 Gflops (single precision) = 4.155 Gflops/W

* The G80 architecture had a hardware flaw that prevented it from actually achieving the peak 3 flops performance.

NVIDIA 280 GTX (GT200 architecture)
Shader clock freq. = 1.296 GHz
GT200 architecture = dual-issue add/mul + SFU** --> 3 flops/cycle single precision
                   = 1 add/mul unit per multiprocessor -> 2 flops/cycle double precision
TDP                = 236 W (maximum 300 W)
Number of stream processors = 240
Number of double precision units = 30
GTX 280 = 933.1 Gflops (single precision) = 3.95 Gflops/W
        = 77.8 Gflops (double precision) = 328 Mflops/W

** The GT200 architecture removes this error. The SFU, or special function unit, is responsible for a series of operations on the graphics card and is capable of a single floating point operation per cycle.

CPU vs GPU Performance per Watt

Typical CPU programming would result in ~1-3 Gflops of single/double precision performance. Multithreaded and Streaming SIMD Extension enhanced code could perform at ~100 and ~25 Gflops for single and double precision respectively. Fully enhanced GPU code could perform at ~1 Tflops and ~75 Gflops for single and double precision respectively. Typically, simply mapping CPU code to GPU programming with little or no memory management would result in at least a 10x performance boost. This corresponds to a:
- minimum 20x speed-up in single precision over fully enhanced SSE code
- maximum 350x speed-up in single precision over regular code
- minimum 3x speed-up in double precision over fully enhanced SSE code
- maximum 50x speed-up in double precision over regular code

CPU vs GPU Price vs Performance

GeForce GTX 280 (236 W): 933 Gflops single precision, 78 Gflops double precision, $649.99 (newegg)
 -> 1.44 Gflops/$ (sp), 0.12 Gflops/$ (dp)
GeForce 9800 GX2 (197 W): 1.15 Tflops single precision, $449.99 (newegg)
 -> 2.56 Gflops/$ (sp)
Intel Core 2 Quad Yorkfield 2.83 GHz (95 W): 90.4 Gflops single precision, 45.2 Gflops double precision, $559.99 (newegg)
 -> 0.16 Gflops/$ (sp), 0.08 Gflops/$ (dp)
Intel Core 2 Duo E7200 Wolfdale 2.53 GHz (65 W): 40.4 Gflops single precision, 20.2 Gflops double precision, $129.99 (newegg)
 -> 0.32 Gflops/$ (sp), 0.16 Gflops/$ (dp)

Even when we consider the ultimate metric, performance per watt per dollar, the GPU comes out ahead. The GPUs draw roughly double or triple the wattage of the CPUs, but they offer orders of magnitude better performance per dollar.

Drawbacks

Accuracy
- single precision
- double precision available, but at a significant performance penalty
Portability of code
- CUDA vs Brook: CUDA and Brook share a common heritage in that CUDA is partly based on Brook. Code is not directly portable; however, programming concepts between the two implementations are similar.

Future proofing code : - Increasing register space could result in different kernels being useful. - Increasing amounts of shared memory would allow different sized/more data sets to be handled.

Brook/CUDA

At this point, you basically have three options: CUDA, BrookGPU, or PeakStream.

Brook has backends for CTM, D3D9, and a generic CPU backend, so it will run on pretty much anything. However, there is not a huge amount of documentation (compared to the other two options).

CUDA is G8x/G9x/G2xx-only and is similar to Brook: it was primarily designed by Ian Buck, who was also one of the principal authors of Brook. It is something of a superset of Brook, as it exposes more functionality than can be exposed in Brook in a cross-platform way, and it has very similar syntax. It includes some libraries, like an FFT (Fast Fourier Transform) and a BLAS (Basic Linear Algebra Subprograms) implementation. CUDA tries to hide GPU implementation details except for fundamentals (warps, blocks, grids, etc.); however, it retains three different memory regions, which can initially be confusing.

PeakStream is relatively new and is commercial, but it is the only way (besides writing D3D9 HLSL and compiling it) to write CTM code in a high-level language. Its syntax model is somewhat like OpenMP, where particular code regions are marked to be executed on the GPU. Dispatching programs to the GPU is usually handled automatically, and the memory management model is considered somewhat more straightforward than CUDA's. Like CUDA, it has FFT and BLAS libraries included. Right now, it only compiles to CTM or to a CPU backend, but that will most likely not be the case in the future. Even though it is commercial, PeakStream is a good cross-platform API that seems to be pretty sensibly designed and does not require programmers to learn totally new programming paradigms. It also has good documentation.

How to build your own cluster


Useful builder tips Troubleshooting

Case
Motherboards are typically ATX, but other form factors exist; your case only truly needs to match the motherboard type, whatever that may be. Practically speaking, though, you will most likely end up with an ATX motherboard and will need an ATX case. Check to make sure your case comes with at least one fan to help evacuate heat.

Power Supply Unit

Most high-end NVIDIA GPUs suggest 550 or 600 W power supplies ---> We recommend 700 - 850 W PSUs. (More) reliable brands: Seasonic, Cooler Master, BFG, Antec. YOU GET WHAT YOU PAY FOR ---> Do not skimp on your PSU. Cheap or underpowered units WILL burn out under heavy use.

Motherboards

- PCI-Express 2.0 16x slots are required for 2xx- and 9xxx-series NVIDIA GPUs (2xx-series cards support double precision). You need at least one PCI-Express 16x (PCI-e 16x) slot for the GPU.
- Intel socket types ---> LGA 775 (mostly Core 2 Duo/Quad), LGA 771 (Xeon)
- AMD socket types ---> AM2+, AM2 (mostly Athlon/Phenom), 940, F (Opteron)
- The socket type must match the motherboard, otherwise the pins on the bottom of the processor will not line up.
- Make sure the board has dual integrated networking ports.
- Most boards come with 4 DIMM slots, and most motherboards can support between 8 and 32 GB of system memory.

Hard Drives
Make sure your hard drive is SATA 1.5 Gb/s or SATA II 3.0 Gb/s. Many newer motherboards have limited or nonexistent IDE ATA 100/133 support. This drive does not need to be very large at all if you plan on using NAS to provide additional hard disk storage space.

Processor
AMD Athlon/Phenom and Intel Core 2 Duo/Quad processors are all excellent choices. Intel offers faster processors at the top end, but the middle-range processors are all very competent. Server processors do offer superior reliability; however, they do not (typically) increase performance. They are also more expensive and harder to find (Xeons in particular).

Memory

Performance of DDR3 memory may not significantly improve upon that of DDR2 system memory, although it may be easier to find a motherboard that supports DDR3 memory with high FSB speeds. This warning is only to say that DDR2 memory performance is not poor. Buy a minimum of 4 GB in 2 x 2 GB sticks; this will allow for upgrades later on (if required). The major key here is to make sure that you buy RAM that performs at the maximum allowable speed your motherboard supports. CPU front-side bus speeds often exceed that of your RAM, and buying RAM that is faster than your motherboard can support will yield no practical benefits. **
** Ideally you want your system memory to be operating at roughly a 2:1 frequency ratio with the front-side bus. The FSB has the same data width (64 bit) but is capable of 4 transfers per clock instead of the 2 that DDR2 is capable of; doubling the operating frequency allows you to match the CPU's L2 bandwidth. Realistically, achieving a 2:1 ratio is impossible, but a 1:1 ratio is realistic.

CUDA enabled GPUs

http://www.nvidia.com/object/cuda_learn_products.html
- 8xxx, 9xxx, and 2xx series GeForce GPUs
- Tesla C/S/D 870 (old); C/S/D 1060/1070 (new, fall '08)
- NVIDIA Quadro FX/NVS series (nearly all mobile versions supported as well)
- 2xx series GPUs add double precision support
The card is considerably better if the second digit is higher, i.e. 8800 GT >> 8600 GT.
GTX Ultra > GTX > GTS > GT >> GS > G (an O on anything means overclocked, I think)

GX2 is literally two boards fused together; performance is ~2x that of a single card (slightly less). Some 8xxx and all 9xxx / 2xx series cards are PCI-e 2.0 16x compatible. Some 8xxx cards only support PCI-e 1.0 16x; however, these cards will still work on a motherboard that supports PCI-Express 2.0.

Troubleshooting / Testing your Build

- Run MemTest86 for 24 hrs.
- Be sure that there is good contact between your CPU and heatsink. If you were required to apply thermal paste, do not use too much; only enough to cover thinly. Most suggest a drop about the size of a pea. Do not allow large amounts of dust particulate to touch the thermal paste.
- Make sure there is enough space in your PC for good ventilation and air flow.
- Make sure you tie down your cables so that they do not get jammed in active fans.
- Make sure additional 6-pin and 8-pin power connectors are attached to your GPU (or else it will run at very low efficiency).
- If your system memory is dual channel, make sure that the pairs are plugged into the slots corresponding to the appropriate channels on your motherboard (they are usually color coordinated or slightly different heights).
- The motherboard manual will provide all necessary information on where fans can be plugged in, where the power supply connects to the motherboard, and how, if at all, jumpers need to be attached. LED connections are also labeled.

Info on Cluster OS

ROCKS 5.0

<---- based on CentOS

Sun Grid Engine (SGE) May need to roll your own installation depending on what features you desire. However, it is as simple as selecting the CUDA roll during a ROCKS install to get GPU job scheduling support.

Future

Open Computing Language / OpenCL (Apple)
AMD
Intel Ct
IBM / HP
PeakStream
---> a unified language?

Sources

Nvidia CUDA Programming Guide 2.0
Nvidia Tesla Technical Brief
Nvidia GeForce 8800 GPU Architecture Overview
Nvidia CUDA SDK 2.0
Nvidia Compiler Documentation
GPGPU.org
Intel whitepaper(s)
www.nvidia.com
www.ati.com
www.intel.com
www.arstechnica.com
www.cnet.com
