4 CHIPS
The SCC (Intel Single-chip Cloud Computer)
48 cores, each 32-bit, 533 MHz clock; on-chip mesh network; no cache coherence
The Cell
64-bit PowerPC core plus 8 vector processors, each 128-bit with private RAM; high-speed ring communications; 3.2 GHz
Sandy Bridge
4 cores with 256-bit AVX vector instructions, 3.1 GHz, ring bus, coherent caches
Nehalem
8 cores on 2 chips; each core supports 128-bit vector operations; 2.4 GHz, SSE2, coherent caches
3 LANGUAGES
LINO
A scripting language targeted at multi-cores. A topological process algebra: it allows two-dimensional composition of communicating processes. The underlying processes can be any executable shell command. Targets any shared memory Linux platform, plus the SCC.
Vector Pascal
Invented by Turner* for vector supercomputers. Extends Pascal's support for array operations to general parallel map operations over arrays. Designed to exploit SIMD instruction sets and multiple cores. The Glasgow Pascal Compiler supports the Vector Pascal extensions.
* T.R. Turner. Vector Pascal: a computer programming language for the FPS-164 array processor. 1987.
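As a minimal sketch of the array extensions (illustrative only: the program and variable names are invented, and the use of pure functions mapped over arrays follows the radvance example later in the deck):

   program arraydemo;
   const
      n = 1000;
   var
      a, b: array[1..n] of real;

   pure function double(x: real): real;
   begin
      double := 2.0 * x;
   end;

   begin
      b := 0.5;            { scalar broadcast: every element of b becomes 0.5 }
      a := b + b * 3.0;    { whole-array expression, compiled to SIMD code    }
      a := double(b);      { a pure scalar function maps in parallel over b   }
   end.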
Fortran
Invented by Backus. Updated to include array programming features with High Performance Fortran and Fortran 90. A clean subset, F, exists* in which the antiquated features are deleted but the array features are kept. The E# compiler (a pun on F: E sharp is enharmonically F) developed at Glasgow by Keir implements this subset.
THE PROBLEM
The w cells run the nbody programme, a slightly modified version of the C nbody code. Each timestep: compute the new velocities and positions of the planets, then broadcast the updates round the loop.
Note that this version has side effects, so successive iterations of the outer loop cannot run in parallel: the inner loop updates the velocities.
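For concreteness, here is a hypothetical Pascal sketch of that side-effecting loop structure (velocitychange is an invented pairwise helper; updating both planets' velocities in the inner loop follows the standard C nbody benchmark):

   for i := 1 to N do                    { outer loop over planets           }
      for j := i + 1 to N do             { inner loop over remaining planets }
      begin
         dv := velocitychange(i, j, dt); { hypothetical pairwise helper      }
         for k := 1 to 3 do
         begin
            v^[k, i] := v^[k, i] - dv.pos[k];  { the velocities of both planets }
            v^[k, j] := v^[k, j] + dv.pos[k];  { are updated in place, so two   }
         end;                                  { outer iterations can write the }
      end;                                     { same element of v^ and cannot  }
                                               { safely run in parallel         }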
The Vector Pascal version removes the side effects: all velocity changes are computed by a pure function before any velocity is updated.

   procedure radvance(dt: real);
   var
      dv: array[1..n, 1..1] of coord;
      i, j: integer;

      pure function computevelocitychange(i: integer; dt: real): coord;
      begin
         { --- do the computation on the last slide }
         computevelocitychange := changes.pos;
      end;

   begin
      dv := computevelocitychange(iota[0], dt);
      { iota[0] is the 0th index vector; because the left hand side is an
        array, the pure function is mapped over it in parallel }
      for i := 1 to N do                 { iterate over planets    }
         for j := 1 to 3 do              { iterate over dimensions }
            v^[j, i] := v^[j, i] + dv[i, 1].pos[j];   { update velocities }
      x^ := x^ + v^ * dt;                { finally update positions }
   end;
The Fortran version of the pairwise kernel (fragment; the opening of calc_accel_p is not shown):

      invDistCubed = 1.0_ki / sqrt(distSixth)
      s = pchunk%jvec4(j)%w * invDistCubed
      accel%avec3(i)%x = accel%avec3(i)%x - dx * s
      accel%avec3(i)%y = accel%avec3(i)%y - dy * s
      accel%avec3(i)%z = accel%avec3(i)%z - dz * s
         end do
      end do
   end function calc_accel_p
Invoked by writing
accels = calc_accel_p(pchunks)
Where
type(accel_chunk), dimension(size(pchunks)) :: accels
And
   type, public :: accel_chunk
      type(vec3), dimension(CHUNK_SIZE) :: avec3
   end type accel_chunk
E# compilation strategy
RESULTS
A least squares fit of a timing equation to the performance data of both machines yields two parameters, each with dimension nanoseconds: t_c, the time to compute the partial velocity change for a pair of bodies, and t_x, the time to transmit the data about one body between two cores.
machine          SCC       Nehalem
t_c (ns)         677       27
t_x (ns)         94000     223
Ratio t_c/t_x    1/138     1/8
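The Ratio row is simply t_c/t_x; as an arithmetic check:

   t_c/t_x (SCC)     = 677/94000 ≈ 1/138.8
   t_c/t_x (Nehalem) = 27/223    ≈ 1/8.3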
The SCC has a slower clock than the Xeon and uses an older core design, but the worst factor is its much slower inter-core communication cost.
Vector Pascal
machine          SCC       Nehalem   Cell
t_c (ns)         677       27        82
t_x (ns)         94000     223       42200
Ratio t_c/t_x    1/138     1/8       1/500
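The added Cell column follows the same pattern: t_c/t_x = 82/42200 ≈ 1/515, quoted in the table as roughly 1/500.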
CONCLUSIONS
Hardware
The machines rank as follows:
1. Sandy Bridge
2. Nehalem
3. Cell
4. SCC

General conclusions:
- Shared memory designs are much faster and easier to programme.
- Ring communications with hardware logic are much better than the SCC message passing or the Cell DMA architecture.
- A single instruction set is much easier to target.
Software
Array Languages

Pro
- These seem to outperform C whilst allowing code to remain at a high level.
- They allow machine independence.

Con
- Cannot make use of existing legacy code that uses loops.

Lino

Pro
- Gives good performance on standard Linux.
- Competitive with Eden, C#, Go.
- Allows use of legacy code.

Con
- Not as fast as the array languages.
- Performance on the SCC disappointing.