Matrix Multiplication
We will show how to implement the matrix multiplication

C = C + A*B

on several different kinds of architectures (shared memory or different kinds of networks).
Let A, B, and C be dense matrices of size n x n.
(A dense matrix is a matrix in which most of the entries are nonzero.)

The algorithm.
We will use the standard algorithm, which requires 2*n^3 arithmetic operations: for each element
C_{ij} of C, we must compute

C_{ij} = C_{ij} + \sum_{k=1}^{n} A_{ik} * B_{kj}

Hence the optimal parallel time on p processors will be 2*n^3/p time steps (arithmetic
operations).
Scheduling these operations will be the interesting part of the algorithm design.
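For reference, the following is a minimal serial sketch of this 2*n^3-operation algorithm in
Python/NumPy (an illustration of mine, not part of the original notes; in practice one would call a
tuned BLAS routine rather than writing explicit loops):

    import numpy as np

    # Reference (serial) version of C = C + A*B: for each element C[i, j] we perform
    # n multiply-add pairs, i.e. 2*n^3 arithmetic operations in total.
    def matmul_accumulate(A, B, C):
        n = C.shape[0]
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    C[i, j] += A[i, k] * B[k, j]
        return C

    n = 4
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    C = np.zeros((n, n))
    assert np.allclose(matmul_accumulate(A, B, C), A @ B)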

- The data layout, i.e., where A, B, and C are stored on the processors, is another key part of the
design. The two most basic layouts are 1D blocked and 2D blocked; see the figure captions and
the sketch below.

[Figure: Matrix multiplication C = A*B with matrices A, B, and C decomposed in one dimension
(1D blocked layout).]
[Figure: Matrix multiplication C = A*B with matrices A, B, and C decomposed in two dimensions
(2D blocked layout).]
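A small Python sketch of the two layouts (the function names, and the assumptions that p divides n
and that p is a perfect square, are mine for illustration): each function returns the index ranges of
the part of an n x n matrix owned by a given process.

    import math

    def blocks_1d(n, p, rank):
        """1D blocked layout: process `rank` owns n/p consecutive rows (all columns)."""
        rows = n // p                              # assume p divides n
        return (rank * rows, (rank + 1) * rows), (0, n)

    def blocks_2d(n, p, rank):
        """2D blocked layout: a sqrt(p) x sqrt(p) process grid; each process owns
        an (n/sqrt(p)) x (n/sqrt(p)) block."""
        q = math.isqrt(p)                          # assume p is a perfect square
        b = n // q                                 # assume sqrt(p) divides n
        i, j = divmod(rank, q)                     # grid coordinates of this process
        return (i * b, (i + 1) * b), (j * b, (j + 1) * b)

    # Example with n = 8, p = 4:
    print(blocks_1d(8, 4, rank=1))   # ((2, 4), (0, 8))  -> rows 2..3, all columns
    print(blocks_2d(8, 4, rank=1))   # ((0, 4), (4, 8))  -> rows 0..3, columns 4..7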




Cannon's algorithm.

Cannon's algorithm is a memory-efficient version of the simple parallel algorithm.

We partition matrices A and B into p square blocks. The processes are labeled from P_00 to
P_{sqrt(p)-1,sqrt(p)-1}, and submatrices A_ij and B_ij are initially assigned to process P_ij.

Although every process in the i-th row requires sqrt(p) submatrices A_ik (0 <= k < sqrt(p)), it is
possible to schedule the computations of the sqrt(p) processes in the i-th row such that, at any
given time, each process is using a different A_ik.

These blocks can be systematically rotated among the processes after every submatrix
multiplication so that every process gets a fresh A_ik after each rotation.

If an identical schedule is applied to the columns, then no process holds more than one block of
each matrix at any time, and the total memory requirement of the algorithm over all the
processes is O(n^2).
Cannon's algorithm is based on this idea.
The first communication step of the algorithm aligns the blocks of A and B in such a way that
each process can multiply its local submatrices. This alignment is achieved for matrix A by
shifting all submatrices A_ij to the left (with wraparound) by i steps; all submatrices B_ij are
shifted up (with wraparound) by j steps.
These are circular shift operations.
After a submatrix multiplication step, each block of A moves one step left and each block of B
moves one step up (with wraparound).
A sequence of sqrt(p) such submatrix-multiplication and single-step-shift pairs brings together
each A_ik and B_kj (0 <= k < sqrt(p)) at P_ij and completes the multiplication of matrices A and B.
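The following is a minimal serial sketch of Cannon's algorithm in Python/NumPy (my own
illustration, not from the original notes): the sqrt(p) x sqrt(p) process grid is simulated by a grid
of blocks, and the circular shifts are done with index arithmetic. It assumes p is a perfect square
and that sqrt(p) divides n.

    import numpy as np

    def cannon(A, B, q):
        """Simulate Cannon's algorithm on a q x q grid of blocks (q = sqrt(p))."""
        n = A.shape[0]
        b = n // q
        # Partition A and B into q x q grids of b x b blocks; C blocks start at zero.
        Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(q)] for i in range(q)]
        Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(q)] for i in range(q)]
        Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]

        # Initial alignment: shift row i of A left by i, column j of B up by j (wraparound).
        Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]

        # q multiply-and-shift steps: local block multiply, then single-step circular shifts.
        for _ in range(q):
            for i in range(q):
                for j in range(q):
                    Cb[i][j] += Ab[i][j] @ Bb[i][j]
            Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]   # shift A left
            Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]   # shift B up

        return np.block(Cb)

    n, q = 8, 4   # 16 simulated processes, as in the figure below
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(cannon(A, B, q), A @ B)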



[Figure: The communication steps in Cannon's algorithm on 16 processes.
(a) Initial alignment of A. (b) Initial alignment of B. (c) A and B after the initial alignment.
(d) Submatrix locations after the first shift. (e) Submatrix locations after the second shift.
(f) Submatrix locations after the third shift.]

STRASSEN'S ALGORITHM FOR MATRIX MULTIPLICATION

References
+ Higham, N.J., Exploiting fast matrix multiplication within the Level 3 BLAS, ACM Trans.
Math. Software, vol. 16, pp. 352-368.
+ Strassen, V., 1969, Gaussian elimination is not optimal, Numer. Math., vol. 13, pp. 354-356.


Consider the calculation of the matrix-matrix product

C = A*B

where A, B, C are 2n*2n matrices. Divide A, B, C into 2*2 block matrices

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},
B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix},
C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix},

where the block matrices A_ij, B_ij, C_ij are n*n.

Given this block form of the matrices, the matrix-matrix product can be written as

\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} =
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} *
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}

The conventional way of calculating the blocks of C would be as follows:

C_{11} = A_{11} B_{11} + A_{12} B_{21}
C_{12} = A_{11} B_{12} + A_{12} B_{22}
C_{21} = A_{21} B_{11} + A_{22} B_{21}
C_{22} = A_{21} B_{12} + A_{22} B_{22}
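As a concrete check of this conventional block computation, a small Python/NumPy sketch (an
illustration of mine; it assumes the matrix dimension is even so the four blocks have equal size):

    import numpy as np

    # Conventional block form: 8 block multiplications and 4 block additions.
    def block_multiply(A, B):
        m = A.shape[0] // 2
        A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
        B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
        C11 = A11 @ B11 + A12 @ B21
        C12 = A11 @ B12 + A12 @ B22
        C21 = A21 @ B11 + A22 @ B21
        C22 = A21 @ B12 + A22 @ B22
        return np.block([[C11, C12], [C21, C22]])

    A, B = np.random.rand(6, 6), np.random.rand(6, 6)
    assert np.allclose(block_multiply(A, B), A @ B)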








- This involves 8 matrix multiplications (and 4 additions). Matrix multiplication requires O(n^3)
floating point operations (flops), whereas matrix addition requires only O(n^2) flops.
- This is a divide-and-conquer algorithm: a 2n*2n matrix multiplication has been divided into
8 n*n matrix multiplications (and 4 n*n additions).
- Clearly, each of these divided problems (n*n matrix multiplications) could itself be further
subdivided into 8 n/2*n/2 matrix multiplications, and so on.

For this particular approach to matrix-matrix multiplication there may be nothing to gain by
posing the algorithm as a divide-and-conquer algorithm.

However, if we define the intermediate matrices P_i as

P_1 = (A_{11} + A_{22})(B_{11} + B_{22})
P_2 = (A_{21} + A_{22}) B_{11}
P_3 = A_{11} (B_{12} - B_{22})
P_4 = A_{22} (B_{21} - B_{11})
P_5 = (A_{11} + A_{12}) B_{22}
P_6 = (A_{21} - A_{11})(B_{11} + B_{12})
P_7 = (A_{12} - A_{22})(B_{21} + B_{22})

then the blocks of C are given by:

C_{11} = P_1 + P_4 - P_5 + P_7
C_{12} = P_3 + P_5
C_{21} = P_2 + P_4
C_{22} = P_1 - P_2 + P_3 + P_6

This involves 7 matrix multiplications (and 18 matrix additions).
- This is again a divide-and-conquer algorithm, but now the 2n*2n matrix multiplication has
been replaced by 7 n*n matrix multiplications (and 18 matrix additions), which is a worthwhile
saving for large matrices.
- The algorithm is applied recursively, so that the 7 n*n matrix multiplications are replaced by
49 n/2*n/2 matrix multiplications, and so on. The recursion is continued until the divided
matrices are sufficiently small that the standard multiplication algorithm is more efficient than
this recursive approach (see the sketch below).
- The algorithm was suggested by Strassen (1969).

- The algorithm is not as strongly stable (in the numerical sense) as the conventional
algorithm, but it is sufficiently stable for many applications.
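A minimal recursive sketch of Strassen's method in Python/NumPy (an illustration of mine; it
assumes the matrix dimension is a power of two, and the `cutoff` value below which the standard
algorithm takes over is an arbitrary choice):

    import numpy as np

    def strassen(A, B, cutoff=64):
        n = A.shape[0]
        if n <= cutoff:
            return A @ B                     # standard multiplication for small blocks
        m = n // 2
        A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
        B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]

        # The 7 recursive block multiplications P_1..P_7.
        P1 = strassen(A11 + A22, B11 + B22, cutoff)
        P2 = strassen(A21 + A22, B11,       cutoff)
        P3 = strassen(A11,       B12 - B22, cutoff)
        P4 = strassen(A22,       B21 - B11, cutoff)
        P5 = strassen(A11 + A12, B22,       cutoff)
        P6 = strassen(A21 - A11, B11 + B12, cutoff)
        P7 = strassen(A12 - A22, B21 + B22, cutoff)

        # Combine the P_i into the blocks of C.
        C11 = P1 + P4 - P5 + P7
        C12 = P3 + P5
        C21 = P2 + P4
        C22 = P1 - P2 + P3 + P6
        return np.block([[C11, C12], [C21, C22]])

    n = 256
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(strassen(A, B), A @ B)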




DNS Algorithm (Dekel, Nassimi, Sahni)

The DNS algorithm, which is based on partitioning intermediate data, can use up to n^3 processes
and performs matrix multiplication in time O(log n) by using O(n^3 / log n) processes.

Assume that n^3 processes are available for multiplying two n x n matrices. These processes are
arranged in a three-dimensional n x n x n logical array.
Since the matrix multiplication algorithm performs n^3 scalar multiplications, each of the n^3
processes is assigned a single scalar multiplication.
The processes are labeled according to their location in the array, and the multiplication
A[i,k]*B[k,j] is assigned to process P[i,j,k] (0 <= i, j, k < n).
After each process performs its single multiplication, the contents of P[i,j,0], P[i,j,1], ..., P[i,j,n-1]
are added to obtain C[i,j].
The additions for all C[i,j] can be carried out simultaneously in log n steps.
Thus, it takes one step to multiply and log n steps to add, so multiplying the n x n matrices takes
O(log n) time.


[Figure: The communication steps in the DNS algorithm for two 4 x 4 matrices on 64 processes.
(a) Initial distribution of A and B. (b) After moving A[i,j] from P[i,j,0] to P[i,j,j], and after moving
B[i,j] from P[i,j,0] to P[i,j,i]. (c) After broadcasting A[i,j] along the j axis. (d) Corresponding
distribution of B.]


The vertical column of processes P[i,j,*] computes the dot product of row A[i,*] and column
B[*,j].

The DNS algorithm has three main communication steps:
(1) moving the columns of A and the rows of B to their respective places,
(2) performing one-to-all broadcasts along the j axis for A and along the i axis for B,
(3) performing an all-to-one reduction along the k axis.
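A serial sketch of the DNS data movement in Python/NumPy (my own illustration; each process
P[i,j,k] is simulated by one cell of an n x n x n array rather than by real broadcasts and
reductions):

    import numpy as np

    def dns(A, B):
        n = A.shape[0]
        prodA = np.empty((n, n, n))
        prodB = np.empty((n, n, n))

        # Steps (1)-(2): after the initial moves and the one-to-all broadcasts,
        # process P[i,j,k] holds A[i,k] (broadcast along the j axis)
        # and B[k,j] (broadcast along the i axis).
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    prodA[i, j, k] = A[i, k]
                    prodB[i, j, k] = B[k, j]

        # Each process performs its single scalar multiplication.
        partial = prodA * prodB             # partial[i, j, k] = A[i, k] * B[k, j]

        # Step (3): all-to-one reduction along the k axis gives C[i, j]
        # (done in log n steps in the parallel algorithm).
        return partial.sum(axis=2)

    n = 4
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(dns(A, B), A @ B)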
