
6 Multi-core Experiments

P. Cockshott, Y. Gdura, P. Keir, A. Koliousis

4 Targets, 3 Translation Systems, 1 Problem


The Targets
Intel SCC 48-core chip
Intel Nehalem 8-core chip
Intel Sandy Bridge 4-core chip
IBM Cell 7-core chip

The Languages
Lino tiling language
Glasgow Pascal Compiler (Vector Pascal)
Glasgow Fortran Compiler (E#)

The Problem
The N-body gravitational problem

4 CHIPS

The SCC

48 cores, each 32-bit, 533 MHz clock, crossbar network, no coherent cache

The Cell
A 64-bit PowerPC core (PPE) plus eight 128-bit vector processors (SPEs) with private RAM, high-speed ring interconnect, 3.2 GHz

Sandy Bridge
4 cores with 256-bit AVX vector instructions, 3.1 GHz, ring bus, coherent caches

Nehalem
8 cores in 2 chips, each core supports 128-bit vector operations (SSE2), 2.4 GHz, coherent caches

3 LANGUAGES

LINO
Scripting language targeted at multi-cores*
Topological process algebra
Allows two-dimensional composition of communicating processes
Underlying processes can be any executable shell command
Targets any shared-memory Linux platform plus the SCC

* developed by Cockshott, Michaelson and Koliousis

Vector Pascal
Invented by Turner* for vector super-computers
Extends Pascal's support for array operations to general parallel map operations over arrays
Designed to make use of SIMD instruction sets and multiple cores
The Glasgow Pascal Compiler supports the Vector Pascal extensions

* T. R. Turner. Vector Pascal: a computer programming language for the FPS-164 array processor. 1987.

Fortran
Invented by Backus
Updated to include array programming features with High Performance Fortran and Fortran 90
A clean subset, F, exists* in which the antiquated features are deleted but the array features are kept
The E# compiler (a pun on F), developed at Glasgow by Keir, implements this subset

* M. Metcalf and J. Reid. The F Programming Language. Oxford University Press, 1996.

THE PROBLEM

The N-body Problem


For 1024 bodies, each time step:

  For each body B in the 1024:
    Compute the force on B from each other body
    From these derive the partial accelerations
    Sum the partial accelerations
    Compute the new velocity of B

  For each body B in the 1024:
    Compute the new position of B
Complexity

Force calculation: of order N²/p for N planets on p cores
Inter-core communication: of order N data items exchanged per core per time step

The Reference C Version


for (i = 0; i < nbodies; i++) {
  struct planet * b = &(bodies[i]);
  for (j = i + 1; j < nbodies; j++) {
    struct planet * b2 = &(bodies[j]);
    double dx = b->x - b2->x;
    double dy = b->y - b2->y;
    double dz = b->z - b2->z;
    double distance = sqrt(dx * dx + dy * dy + dz * dz);
    double mag = dt / (distance * distance * distance);
    b->vx -= dx * b2->mass * mag;
    b->vy -= dy * b2->mass * mag;
    b->vz -= dz * b2->mass * mag;
    b2->vx += dx * b->mass * mag;
    b2->vy += dy * b->mass * mag;
    b2->vz += dz * b->mass * mag;
  }
}

Note that this version has side effects, so successive iterations of the outer loop cannot run in parallel: the inner loop updates the velocities of both bodies in each pair.
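The parallel versions below all restructure the calculation so that iteration i writes only to body i. A minimal C sketch of that restructuring is given here; it is illustrative only, not the authors' code, and the struct planet layout and the name advance_bodies are assumptions. Each pair is now computed twice, trading extra arithmetic for independence of the outer loop iterations; restricting the range [first, first+count) lets the same routine update just one core's share of the bodies.

#include <math.h>

struct planet { double x, y, z, vx, vy, vz, mass; };

/* Side-effect-free variant: body i's velocity change is accumulated
   locally and written only to bodies[i], so the iterations over i are
   independent and may be run in parallel. */
void advance_bodies(struct planet *bodies, int nbodies,
                    int first, int count, double dt)
{
    for (int i = first; i < first + count; i++) {
        double ax = 0.0, ay = 0.0, az = 0.0;
        for (int j = 0; j < nbodies; j++) {
            if (j == i) continue;
            double dx = bodies[i].x - bodies[j].x;
            double dy = bodies[i].y - bodies[j].y;
            double dz = bodies[i].z - bodies[j].z;
            double distance = sqrt(dx * dx + dy * dy + dz * dz);
            double mag = dt / (distance * distance * distance);
            ax -= dx * bodies[j].mass * mag;
            ay -= dy * bodies[j].mass * mag;
            az -= dz * bodies[j].mass * mag;
        }
        bodies[i].vx += ax;
        bodies[i].vy += ay;
        bodies[i].vz += az;
    }
}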

Lino version for 4 cores


nwcorner  = [./nbody >East <South];
swcorner  = [./nbody >North <East];
scorner   = [./starter4.sh >South <West];
corner    = [cat >West <North];
passright = [./nbody >East <West];
passleft  = [./nbody >West <East];
top       = [nwcorner | passright | scorner];
bottom    = [swcorner | passleft | corner];
main      = top _ bottom;

The cells that run ./nbody execute a slightly modified version of the C n-body code. In each timestep a cell computes the new velocities and positions of the planets it owns and then broadcasts the updates round the loop (see the sketch below).
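A schematic sketch of one worker's timestep follows, purely to illustrate the compute-then-broadcast pattern. It assumes the cell's Lino input port is delivered on stdin and its output port on stdout, that each of four cells owns a contiguous quarter of a replicated body array, and it reuses struct planet and advance_bodies from the earlier sketch; none of these details are taken from the actual modified nbody worker.

#include <stdio.h>

#define NBODIES 1024
#define NCELLS  4                        /* one cell per core in the 2x2 layout */
#define CHUNK   (NBODIES / NCELLS)

/* One timestep for the cell with index me (0..NCELLS-1). */
void timestep(struct planet bodies[NBODIES], int me, double dt)
{
    /* update velocities, then positions, of this cell's own chunk,
       using the replicated copies of the other cells' bodies */
    advance_bodies(bodies, NBODIES, me * CHUNK, CHUNK, dt);
    for (int i = me * CHUNK; i < (me + 1) * CHUNK; i++) {
        bodies[i].x += bodies[i].vx * dt;
        bodies[i].y += bodies[i].vy * dt;
        bodies[i].z += bodies[i].vz * dt;
    }

    /* broadcast the updated chunk round the loop: send own chunk,
       then keep and forward each chunk arriving from upstream */
    fwrite(&bodies[me * CHUNK], sizeof(struct planet), CHUNK, stdout);
    fflush(stdout);
    for (int hop = 1; hop < NCELLS; hop++) {
        int owner = (me - hop + NCELLS) % NCELLS;    /* whose chunk arrives now */
        if (fread(&bodies[owner * CHUNK], sizeof(struct planet), CHUNK, stdin) != CHUNK)
            return;                                  /* upstream cell has shut down */
        if (hop < NCELLS - 1) {
            fwrite(&bodies[owner * CHUNK], sizeof(struct planet), CHUNK, stdout);
            fflush(stdout);
        }
    }
}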

Larger Lino versions

Pascal version: no explicit inner loop


pure function computevelocitychange(start: integer): coord;
{ Declarations: M is a pointer to the mass vector, x a pointer to the
  position matrix, di the displacement matrix, distance a vector of
  distances. }
begin
  row := x^[iota[0], i];
  { Compute the displacement vector between each planet and planet i }
  di := row[iota[0]] - x^;
  { Next compute the Euclidean distances }
  xp := @di[1,1]; yp := @di[2,1]; zp := @di[3,1];   { point at the rows }
  distance := sqrt(xp^*xp^ + yp^*yp^ + zp^*zp^) + epsilon;
  mag := dt / (distance * distance * distance);
  { The row summation operator \+ builds the x, y, z components of dv }
  changes.pos := \+ (M^ * mag * di);
end

Pack this up in Pure Function Applied in Parallel


procedure radvance(dt: real);
var
  dv: array[1..n, 1..1] of coord;     { a column vector }
  i, j: integer;

  pure function computevelocitychange(i: integer; dt: real): coord;
  begin
    { --- the computation on the last slide --- }
    computevelocitychange := changes.pos;
  end;

begin
  { iota[0] is the 0th index vector; because the left hand side is an
    array, the calls on the right hand side can be evaluated in parallel }
  dv := computevelocitychange(iota[0], dt);
  for i := 1 to N do                  { iterate on planets }
    for j := 1 to 3 do                { iterate on dimensions }
      v^[j,i] := v^[j,i] + dv[i,1].pos[j];   { update velocities }
  x^ := x^ + v^ * dt;                 { finally update positions }
end;

Equivalent Fortran kernel function


In Fortran, elemental is the equivalent of the pure function in the Pascal version.
elemental function calc_accel_p(pchunk) result(accel)
  type(pchunk2d), intent(in) :: pchunk
  type(accel_chunk)          :: accel
  integer                    :: i, j
  real(kind=ki)              :: dx, dy, dz, distSqr, distSixth, invDistCubed, s

  accel%avec3 = vec3(0.0_ki, 0.0_ki, 0.0_ki)
  do i = 1, size(pchunk%ivec4)
    do j = 1, size(pchunk%jvec4)
      dx = pchunk%ivec4(i)%x - pchunk%jvec4(j)%x
      dy = pchunk%ivec4(i)%y - pchunk%jvec4(j)%y
      dz = pchunk%ivec4(i)%z - pchunk%jvec4(j)%z
      distSqr      = dx*dx + dy*dy + dz*dz + EPS
      distSixth    = distSqr * distSqr * distSqr
      invDistCubed = 1.0_ki / sqrt(distSixth)
      s = pchunk%jvec4(j)%w * invDistCubed
      accel%avec3(i)%x = accel%avec3(i)%x - dx * s
      accel%avec3(i)%y = accel%avec3(i)%y - dy * s
      accel%avec3(i)%z = accel%avec3(i)%z - dz * s
    end do
  end do
end function calc_accel_p

Invoked by writing
accels = calc_accel_p(pchunks)

Where
type(accel_chunk), dimension(size(pchunks)) :: accels

And
type, public :: accel_chunk
  type(vec3), dimension(CHUNK_SIZE) :: avec3
end type accel_chunk

Pascal compilation strategy

Virtual SIMD Machine (VSM) model

VSM instructions:

Register-to-register instructions operate on virtual SIMD registers (1KB - 16KB)

Support basic operations (+, -, /, *, sqrt, \+, rep, ... etc); a rough sketch of the idea follows
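As a rough illustration of the virtual SIMD register idea: the register size range comes from the slide, but the element type, the fixed worker count and the names vreg, vsm_add and worker_add are assumptions for this sketch, not the Glasgow VSM runtime. A register-to-register add over a 4KB virtual register is decomposed into equal slices, one per worker core.

#include <stddef.h>

#define VREG_BYTES 4096
#define VREG_LEN   (VREG_BYTES / sizeof(double))   /* 512 doubles */
#define NWORKERS   4

typedef struct { double d[VREG_LEN]; } vreg;       /* one virtual SIMD register */

/* per-worker part of a VSM register-to-register add: one slice of dst = s1 + s2 */
static void worker_add(vreg *dst, const vreg *s1, const vreg *s2, int w)
{
    size_t lo = w * (VREG_LEN / NWORKERS);
    size_t hi = lo + VREG_LEN / NWORKERS;
    for (size_t i = lo; i < hi; i++)
        dst->d[i] = s1->d[i] + s2->d[i];
}

/* master side of the instruction: a real runtime would despatch the
   slices to the worker cores (SPEs on the Cell, cores on the SCC) and
   wait for them; here they run in sequence to show the decomposition */
void vsm_add(vreg *dst, const vreg *s1, const vreg *s2)
{
    for (int w = 0; w < NWORKERS; w++)
        worker_add(dst, s1, s2, w);
}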

E# compilation strategy

RESULTS

Lino on SCC versus Nehalem (Xeon)

Why is the SCC so much slower?

A least squares fit of the cost equation to the measured performance of both machines gives two parameters: t_c, the time to compute the partial velocity change for a pair of bodies, and t_x, the time to transmit the data about one body between two cores. Both parameters have dimension nanoseconds.

machine     SCC      Nehalem
t_c (ns)    677      27
t_x (ns)    94000    223
ratio       1/138    1/8

The SCC has a slower clock than the Xeon and uses an older core design, but the worst factor is the much higher inter-core communication cost on the SCC.
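The slides do not give the functional form of the fitted equation, so the back-of-envelope snippet below assumes an illustrative model, T = t_c N²/p + t_x N (pairwise arithmetic spread over p cores plus a ring broadcast of all N bodies), and uses the full core count of each chip, simply to show how the 94 µs transmit time comes to dominate on the SCC.

#include <stdio.h>

/* Assumed, illustrative cost model: T = tc*N*N/p + tx*N, in seconds. */
static double step_time(double tc_ns, double tx_ns, int n, int p)
{
    double compute = tc_ns * 1e-9 * (double)n * n / p;  /* pairwise arithmetic */
    double comms   = tx_ns * 1e-9 * (double)n;          /* ring broadcast      */
    return compute + comms;
}

int main(void)
{
    /* fitted parameters (ns) from the table above; the core counts are
       the full chip counts and are themselves illustrative */
    printf("SCC,     48 cores: %.3f s per step\n", step_time(677.0, 94000.0, 1024, 48));
    printf("Nehalem,  8 cores: %.3f s per step\n", step_time(27.0,    223.0, 1024,  8));
    return 0;
}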

Compare all systems

Pascal Performance on Cell for Large Problems


Performance (seconds) per iteration

N-body problem size      1K       4K       8K        16K

Vector Pascal
  PPE                    0.381    4.852    20.355    100.250
  1 SPE                  0.105    1.387     5.715     22.278
  2 SPEs                 0.065    0.782     3.334     13.248
  4 SPEs                 0.048    0.470     2.056      8.086

PPE                      0.045    0.771     3.232     16.524

The fitted cost parameters, extended to include the Cell:

machine     SCC      Nehalem   Cell
t_c (ns)    677      27        82
t_x (ns)    94000    223       42200
ratio       1/138    1/8       1/500

CONCLUSIONS

Hardware
The machines order as follows:
1. Sandy Bridge
2. Nehalem
3. Cell
4. SCC

General conclusions
Shared-memory designs are much faster and easier to programme
Ring communications with hardware logic are much better than the SCC message passing or the Cell DMA architecture
A single instruction set is much easier to target

Software

Array Languages
Pro
These seem to outperform C whilst allowing code to remain at a high level. They allow machine independence.

Con
Cannot use existing legacy code that uses loops.

Lino
Pro
Gives good performance on standard Linux. Competitive with Eden, C#, Go. Allows use of legacy code.

Con
Not as fast as the array languages. Performance on the SCC disappointing.
