LAPACK and ScaLAPACK Examples

LAPACK and ScaLAPACK Examples
From Documentation
LAPACK and ScaLAPACK Examples

Nick Chepurniy, Ph.D.
e-mail: nickc@sharcnet.ca
HPC Consultant - SHARCNEThttp://www.sharcnet.ca
Compute Canadahttp://www.computecanada.org
We start by defining a special matrix called STRIDWAD (Sparse TRIDiagonal With Anti-Diagonal). This matrix will be
stored using the dense format (i.e. all elements are defined and are stored in memory). The matrix has the following properties:
It can be of any order N (where N is an even integer)
When STRIDWAD is multiplied by the vector X whose components are X(i)=i it produces a vector C
whose elements are C(i)=N+1 except for the last element which is C(N)=2N+2.
These characteristics make it very easy to work with STRIDWAD while developing computer programs for testing purposes.
Contents
[hide]
• 1 LAPACK and ScaLAPACK Examples

◦ 1.1 Defining the STRIDWAD matrix
◦ 1.2 Program to calculate the RHS vector of the STRIDWAD system
◦ 1.3 Important Notes on the program and Output Results
• 2 LAPACK EXAMPLE
◦ 2.1 Compiling and Submitting Jobs
◦ 2.2 The Fortran and Other Files
• 3 ScaLAPACK EXAMPLE
◦ 3.1 Example of a block cyclic data distribution
◦ 3.2 Set Up prior to invoking the Parallel Solvers
◦ 3.3 Timing Routines and Scripts used in the makefile
◦ 3.4 Using make/makefile to compile and submit the program
◦ 3.5 Full fortran program listing using ScaLAPACK routines to solve a STRIDWAD system
◦ 3.6 Results for solutions of the STRIDWAD system for different orders of matrix A
■ 3.6.1 LAPACK RESULTS
■ 3.6.2 ScaLAPACK RESULTS
■ 3.6.3 CONCLUSIONS
Defining the STRIDWAD matrix
In practice we will use large values of N but for illustration purposes the STRIDWAD system for N=6 would be:
[ 3. -1. 0. 0. 0. 1. ] [ 1. ] [ 7. ]
[ -1. 3. -1. 0. 1. 0. ] [ 2. ] [ 7. ]
[ 0. -1. 3. 0. 0. 0. ] [ 3. ] [ 7. ]
[ 0. 0. 0. 3. -1. 0. ] [ 4. ] = [ 7. ]
[ 0. 1. 0. -1. 3. -1. ] [ 5. ] [ 7. ]
[ 1. 0. 0. 0. -1. 3. ] [ 6. ] [ 14. ]
For N=18 it takes this form:
[ 3. -1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. ] [ 1. ] [ 19. ]
[ -1. 3. -1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. ] [ 2. ] [ 19. ]
[ 0. -1. 3. -1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. ] [ 3. ] [ 19. ]
[ 0. 0. -1. 3. -1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. ] [ 4. ] [ 19. ]
[ 0. 1. 0. -1. 3. -1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. ] [ 5. ] [ 19. ]
[ 1. 0. 0. 0. -1. 3. -1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. ] [ 6. ] [ 19. ]
[ 0. 0. 0. 0. 0. -1. 3. -1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. ] [ 7. ] [ 19. ]
[ 0. 0. 0. 0. 0. 0. -1. 3. -1. 0. 1. 0. 0. 0. 0. 0. 0. 0. ] [ 8. ] [ 19. ]
[ 0. 0. 0. 0. 0. 0. 0. -1. 3. 0. 0. 0. 0. 0. 0. 0. 0. 0. ] [ 9. ] = [ 19. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 3. -1. 0. 0. 0. 0. 0. 0. 0. ] [ 10. ] [ 19. ]
[ 0. 0. 0. 0. 0. 0. 0. 1. 0. -1. 3. -1. 0. 0. 0. 0. 0. 0. ] [ 11. ] [ 19. ]
[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. -1. 3. -1. 0. 0. 0. 0. 0. ] [ 12. ] [ 19. ]
[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. -1. 3. -1. 0. 0. 0. 0. ] [ 13. ] [ 19. ]
[ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. -1. 3. -1. 0. 0. 0. ] [ 14. ] [ 19. ]
[ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. -1. 3. -1. 0. 0. ] [ 15. ] [ 19. ]
[ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. -1. 3. -1. 0. ] [ 16. ] [ 19. ]
[ 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. -1. 3. -1. ] [ 17. ] [ 19. ]
[ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. -1. 3. ] [ 18. ] [ 38. ]
Observations on the structure of the STRIDWAD systems:

All elements along the diagonal are equal to 3.
The elements in the two subdiagonals (lower and upper diagonals) are -1.
The diagonal going from the lower left corner to the upper right corner is called anti-diagonal.
All elements on the anti-diagonal are 1.
When the anti-diagonal crosses the subdiagonals the 1's cancel out with the -1's.
Order N is an EVEN number.
The components of the X vector are x(i)=i and all the components of C are N+1 except for the last one.
Last component of C is 2*(N+1) where N is the order of the STRIDWAD system.
So in fact what we have established here is an IDENTITY which for computational purposes can be used in the following two
ways:
If we assume X is known and defined by x(i)=i we can use a matrix multiply program
to calculate the product of the STRIDWAD matrix and X and compare the results to C.
If we assume X is unknown and we use the C as defined above we can

use a matrix inversion program and calculate the unknown vector X.
Program to calculate the RHS vector of the STRIDWAD system
Here is the listing of the program used to calculate the Right Hand Side (RHS) vector assuming that X is known and defined
by x(i)=i:
program TMA
implicit none
integer :: istat,i,N_PRT
real(kind=8),dimension(:,:),allocatable :: A
real(kind=8),dimension(:,:),allocatable :: C
real(kind=8),dimension(:,:),allocatable :: D
real(kind=8),dimension(:,:),allocatable :: X
integer :: NH, NH1, nprocs, KOUNT
integer*8 :: N, mem_allocated
real(kind=8) :: mem_gb, EPS
parameter (N = 55388)
! parameter (N = 18)
parameter (N_PRT = 40)
! --------------------------------------------------------------------------
interface
subroutine init_my_matrix (n,a)

implicit none
integer*8 :: N
real(kind=8) :: a(:,:)
end subroutine init_my_matrix
subroutine init_my_vector (N,X)

implicit none
integer*8 :: N
real(kind=8) :: X(:,:)
end subroutine init_my_vector
subroutine init_my_rhs (n,c)

implicit none
integer*8 :: N
real(kind=8) :: c(:,:)
end subroutine init_my_rhs
end interface
! ---------------------------------------------------------------------------
!
! Trace intermediate steps only for small values of N
EPS = 0.00000001
NPROCS = 1
! N must be EVEN
NH = N/2
NH1 = NH + 1
IF (2*NH .NE. N) THEN

WRITE(6,2001) N
2001 FORMAT("N must be EVEN, got N = ",i9)
STOP
ENDIF
!
! ----- Print out nprocs -----
WRITE(6,101) nprocs,N
101 FORMAT(" nprocs = ",i3," N = ",i12)
call flush(6_4)
!
!
! ----- Allocate LHS, RHS, and pivot -----
!
allocate (a(N, N ), stat=istat)
if (istat/=0) then
print *,"ERROR: ALLOCATE FAILS for A"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for A"
else
mem_allocated = 8*N*N
mem_gb = mem_allocated / 1000000000.0
WRITE(6,103) mem_allocated,mem_gb
103 FORMAT(" memory alloc (A) = ",i12," bytes i.e. ", f8.5," Gb")
endif
allocate (c(N,1), stat=istat)

if (istat/=0) then
print *,"ERROR: ALLOCATE FAILS for C"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for C"
else
mem_allocated = 8*N
104 FORMAT(" memory alloc (C) = ",i12," bytes i.e. ", f8.5," Gb")
endif
allocate (D(N,1), stat=istat)
if (istat/=0) then
print *,"ERROR: ALLOCATE FAILS for D"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for D"
else
mem_allocated = 8*N
105 FORMAT(" memory alloc (D) = ",i12," bytes i.e. ", f8.5," Gb")
endif
allocate (X (n,1), stat=istat)
if (istat/=0) then
print *,"ERROR: ALLOCATE FAILS for X"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for X"
else
mem_allocated = 8*N
106 FORMAT(" memory alloc (X) = ",i12," bytes i.e. ", f8.5," Gb")
endif
mem_allocated = 8*N*N + 8*N + 8*N

102 FORMAT(" memory allocated = ",i12," bytes i.e. ", f8.5," Gb")
call flush(6_4)
!
! ----- Initialize: A X = C
!
call init_my_matrix (N,A)
call init_my_vector (N,X)
call init_my_rhs (N,C)
D = MATMUL(A,X)
!
write (6,500)
500 FORMAT(/" C(I): ")
IF (N .le. N_PRT) THEN
DO I=1,N
WRITE (6,501) I,C(I,1)
501 FORMAT(" I = ",I3," ",f6.2)
ENDDO
ELSE
! For large N print only last N_PRT elements of vector
DO I=N-N_PRT,N
WRITE (6,502) I,A(I,I),D(I,1)
502 FORMAT(" I=",I10," A(I,I) = ",f13.2," D(I,1) = ",f13.2)
ENDDO
ENDIF
KOUNT = 0
DO I=1,N
if (abs(C(I,1)-D(I,1)) .gt. EPS) then
KOUNT = KOUNT + 1
write(6,107) I
107 format("For I = ",i12," C .neq. D")
endif
ENDDO
write(6,108) KOUNT
108 format(/"KOUNT = ",i10)
!
! ----- Cleanup arrays -----
!
deallocate (A,C,D,X)
!
end program TMA
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_MATRIX (N,A)
IMPLICIT NONE
INTEGER :: NH,NH1
INTEGER*8 :: N
REAL(kind=8) :: A(:,:)
INTEGER :: I
NH = N/2
NH1 = NH + 1
! Diagonal elements
DO I = 1,N
A(I,I) = 3.0
ENDDO
! Upper diagonal
DO 180 I = 1,N-1
if (I .EQ. NH) go to 180
A(I,I+1) = -1.0d0
180 CONTINUE
! Lower diagonal
DO 160 I = 1,N-1
A(I+1,I) = -1.0d0
160 CONTINUE
!
! ANTI-DIAGONAL
!
DO 190 I = 1,N
if (I .EQ. NH .OR. I .EQ. NH1) go to 190
A(I,N-I+1) = 1.0d0
190 CONTINUE
return
end
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_RHS (N,C)
IMPLICIT NONE
INTEGER*8 :: N
REAL(kind=8) :: C(:,:)
INTEGER :: I
DO I= 1, N-1
C(I,1) = N + 1.0
ENDDO
C(N,1) = 2*N + 2
return
end
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_VECTOR (N,X)
IMPLICIT NONE
INTEGER*8 :: N
REAL(kind=8) :: X(:,:)
INTEGER :: I
DO I= 1, N
X(I,1) = I
ENDDO
return
end
Important Notes on the program and Output Results
There are several things which should be pointed out:

(1) The order of the STRIDWAD system (i.e. parameter N) has been declared as INTEGER*8
in order that there be no overflow when memory is calculated using mem_allocated.
(2) To submit the job the command:

sqsub -r 60m -q threaded -n 4 -o ${CLU}_TEST_MEM_8_%J ./test_mem_alloc8
was used so that all the memory of the node would be available to this program.
Output for N=18 was as follows:

nprocs = 1 N = 18
memory alloc (A) = 2592 bytes i.e. 0.00000 Gb
memory alloc (C) = 144 bytes i.e. 0.00000 Gb
memory alloc (D) = 144 bytes i.e. 0.00000 Gb
memory alloc (X) = 144 bytes i.e. 0.00000 Gb
memory allocated = 2880 bytes i.e. 0.00000 Gb
C(I):
I = 1 19.00
I = 2 19.00
I = 3 19.00
I = 4 19.00
I = 5 19.00
I = 6 19.00
I = 7 19.00
I = 8 19.00
I = 9 19.00
I = 10 19.00
I = 11 19.00
I = 12 19.00
I = 13 19.00
I = 14 19.00
I = 15 19.00
I = 16 19.00
I = 17 19.00
I = 18 38.00
KOUNT = 0
Output for N=55388 (the highest possible value that fits in memory) was:
nprocs = 1 N = 55388
memory alloc (A) = 24542644352 bytes i.e. 24.54264 Gb
memory alloc (C) = 443104 bytes i.e. 0.00044 Gb
memory alloc (D) = 443104 bytes i.e. 0.00044 Gb
memory alloc (X) = 443104 bytes i.e. 0.00044 Gb
memory allocated = 24543530560 bytes i.e. 24.54353 Gb
C(I):
I= 55348 A(I,I) = 3.00 D(I,1) = 55389.00
I= 55349 A(I,I) = 3.00 D(I,1) = 55389.00
I= 55350 A(I,I) = 3.00 D(I,1) = 55389.00
...
I= 55386 A(I,I) = 3.00 D(I,1) = 55389.00
I= 55387 A(I,I) = 3.00 D(I,1) = 55389.00
I= 55388 A(I,I) = 3.00 D(I,1) = 110778.00
KOUNT = 0
LAPACK EXAMPLE
In the previous section the STRIDWAD matrices of oder N were defined. We now want to use those matrices as input to the
LAPACK's routines dgetrf/dgetrs to find the inverses of different orders N. The programs should also report the memory used,
cpu time and elapsed time.
Compiling and Submitting Jobs
Following makefile is used to compile and submit the jobs to a compute node:
# Determine the platform that is being used for this compilation
OS:= $(shell uname)

PLATFORM:= $(shell uname)_$(shell uname -m)
# Set the dafault parameters N (order of matrix times 1000)

# and P (number of threads)
N:= 1
P:= 4
# Specify executable, source and required libraries :
EXE = DP_LAPACK_SOLVE
EXT:= f90
SRC:= ${EXE}.$(EXT)
LIB:= -L/opt/sharcnet/acml/current/pathscale64/lib -lacml
$(EXE): $(SRC)
@echo Check if N is in the range [1,40]
./check_N $N
# @echo COMPILING LAPACK DP program with N = $(N),000
# @echo PLATFORM used = $(PLATFORM)
# @echo Edit the parameter_N.inc file
cp parameter_N.inc_basic parameter_N.inc
./ed_parm_N parameter_N.inc $N
# @echo "Using following parameter_N.inc file:"
# cat parameter_N.inc
# @echo "LIB = " $(LIB)
pathf90 -o $(EXE) $(SRC) $(LIB)
cp sub_batch_DP_LAPACK_SOLVE_basic sub_batch_DP_LAPACK_SOLVE
./ed_procs_P sub_batch_DP_LAPACK_SOLVE $P
./sub_batch_DP_LAPACK_SOLVE
superclean: clean
@echo ' '
@echo Removing Output files generated by $(EXE)
@echo '--------------------------------------------------'
@echo ' '
rm -rf ${CLU}_SERIAL_DP_SOLVE_*
@echo ' '
@echo Output files have been removed.
@echo '-------------------------------'
@echo ' '
clean:
@echo ' '
@echo Cleaning executable $(EXE)
@echo '-----------------------------------'
@echo ' '
rm -rf $(EXE)
rm -rf ED_ERROR
@echo ' '
@echo Executable $(EXE) has been removed.
@echo '--------------------------------------------'
@echo ' '
help:
@echo "+-----------------------------------------------------------------------------------+"
@echo "| |"
@echo "| Makefile for running LAPACK example using dgetrf/dgetrs |"
@echo "| |"
@echo "| Usage: make compile program for N=1 (i.e. order=1,000) |"
@echo "| |"
@echo "| make N=n compile program for N=n (i.e. order=n*1000) |"
@echo "| where 1 <= n <= 40 |"
@echo "| |"
@echo "| make P=p submit job with -q threaded -n p |"
@echo "| where 1 <= p <= 4 or 6 (depending on cluster) |"
@echo "| |"
@echo "| make superclean remove output files and executable |"
@echo "| |"
@echo "| make clean remove executable |"
@echo "| |"
@echo "| make help display makefile usage information |"
@echo "| |"
@echo "+-----------------------------------------------------------------------------------+"
@echo " "
@echo PLATFORM $(PLATFORM)
The make utility will use above makefile and we can override any parameters appearing in the the makefile. Thus, if we type
the command:
make N=2 P=6
make will use parameter N=2 and P=6 instead of those appearing in the makefile.
Then make proceeds to specify the executable, source and required library and executes the commands required to generate
the target $(EXE). The first command for generating this target is to invoke the script check_N which takes as argument the
parameter $N.
#!/bin/bash
# script file = check_N
if [ $# -ne 1 ]; then
echo "Must have 1 argument: value of N"
echo "Usage: $0 <value_of_N> "
exit -1
fi
N=$1
if [ "$N" -lt 1 ] || [ "$N" -gt 40 ]

then
echo "Value of N is out of range !"
echo " 1 <= N <= 40 "
exit -1
fi
exit
If the value of $N is in the specified range the check_N script returns 0 and make proceeds to the next command, but if $N is
out of the specified range, then the check_N script returns -1 and the make command terminates with an error message.
If the check_N script returned 0 then make proceeds to the next command which copies the parameter_N.inc_basic file into
the parameter_N.inc file. The next scripted_parm_N takes two arguments: parameter_N.inc and $N to modify the file
parameter_N.inc which is used by the fortran program that we will compile in the next command. Here is a listing of the
second script:
#!/bin/bash
# script file = ed_parm_N
if [ $# -ne 2 ]; then
echo "Must have 2 argument: Filename to change and value to substitute x for"
echo "Usage: $0 <filename> <value_for_x> "
exit
fi
Filename=$1
ed $Filename <<EOF 2> ED_ERROR
1,2s/x/$2/
w
q
EOF
Now, the make command is ready to compile the source program using the updated version of the include file
parameter_N.inc.
If errors are detected in the compilation then make will terminate, otherwise the next command is executed in the makefile
script, which is the script:
#!/bin/bash
# script file = ed_procs_P
if [ $# -ne 2 ]; then
echo "Must have 2 argument: Filename to change and number of procs to use"
echo "Usage: $0 <filename> <num_of_processors> "
exit -1
fi
if [ `printenv CLU` = "nar" ] ; then

let MXP=4
fi
if [ `printenv CLU` = "saw" ] ; then

let MXP=6
fi
if [ $2 -lt 1 ] || [ $2 -gt $MXP ] ; then

echo "Parameter p is out of range for cluster ${CLUSTER}"
exit -1
fi
Filename=$1
ed $Filename <<EOF 2> ED_ERROR

1,6s/nprocs/$2/
w
q
EOF
If p is out of range the make command exits, otherwise the next command is executed which is a script submitting the job to a
queue (in which the parameter nprocs will be replaced by what is specified in the makefile):
#!/bin/bash
# script file = sub_batch_DP_LAPACK_SOLVE_basic
sqsub -t -r 60m -q threaded -n nprocs -o ${CLU}_SERIAL_DP_SOLVE_%J ./DP_LAPACK_SOLVE
The Fortran and Other Files
For completnes we list next the whole fortran source file and working files parameter_N.inc_basic and
sub_batch_DP_LAPACK_SOLVE_basic:
program DP_LAPACK_SOLVE
!
! file name = DP_LAPACK_SOLVE.f90
!
implicit none
integer :: istat,info,i,j,N_PRT,milestone
real(kind=8),dimension(:,:),allocatable :: a
real(kind=8),dimension(:,:),allocatable :: c
integer,dimension(:),allocatable :: ipiv
integer :: root, ierr

integer*8 :: N, mem_allocated
real(kind=8) :: mem_gb, EPS
include "parameter_N.inc"
integer :: context, iam, pnum

integer :: mycol, myrow, nb
integer :: npcol, nprocs, nprow
integer :: ITEM, IAorC, NH, NH1, DEBUG

! --------------------------------------------------------------------------
integer :: day(10),hour(10),minute(10),second(10),millisec(10)
character*30 :: descr_milestone(10)
integer :: day_beg,hour_beg,minute_beg,second_beg,millisec_beg
integer :: day_end,hour_end,minute_end,second_end,millisec_end
character*23 :: date_time
real(kind=8) :: T1, T2
real(kind=8) :: TOT_TIME
real(kind=8) :: tm
! --------------------------------------------------------------------------
interface
subroutine init_my_matrix (n,a,nprow,npcol,myrow,mycol)

implicit none
integer*8 :: N
integer :: nprow,npcol,myrow,mycol
subroutine init_my_rhs (n,c,nprow,npcol,myrow,mycol)

implicit none
integer*8 :: N
subroutine elapsed_time(day_beg,hour_beg, minute_beg,second_beg,&

& millisec_beg, day_end, hour_end, minute_end, second_end, &
& millisec_end,tid,ITEM )
implicit none
integer :: day,hour, minute, second, millisec, tid, ITEM
end subroutine elapsed_time
subroutine timestamp(date_time,day,hour,minute,second,millisec)
implicit none
integer :: day, hour, minute, second, millisec
integer :: elements(8)
character*3 :: months(12)
end subroutine timestamp
end interface
! ---------------------------------------------------------------------------
!
DEBUG = 1
root = 0
IAM = 0
NPROCS = 1
! N must be EVEN
NH = N/2
NH1 = NH + 1

WRITE(6,2001) N
STOP
ENDIF
IF ( N .gt. N_PRT ) DEBUG = 0

!
! ---------------------------------------------------------------------------
ITEM = 1000 + iam
call timestamp(date_time,day_beg,hour_beg,minute_beg,second_beg, &

& millisec_beg )
milestone = 1
IF (IAM .eq. 0) write(6,1006) ITEM,iam,date_time,milestone
1006 format(I4," iam ",i3," BEGIN ",a23," milestone = ",i2)
1008 format(I4," iam ",i3," ",a23," milestone = ",i2)
call flush(6_4)
call cpu_time(T1)
call timestamp( date_time,day(milestone), hour(milestone), &

& minute(milestone),second(milestone),millisec(milestone) )
descr_milestone(milestone) = "Init blacs_gridinit, allocate"

!
IF (IAM .EQ. 0) THEN

WRITE(6,101) nprocs,N
101 FORMAT(" nprocs = ",i3," N = ",i12)
call flush(6_4)
ENDIF
!
!
! ----- Allocate LHS, RHS, and pivot -----
!
allocate (a(N, N ), stat=istat)
if (istat/=0) then
print *,"ERROR: ALLOCATE FAILS for A"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for A"
else
mem_allocated = 8*N*N
103 FORMAT(" memory alloc (A) = ",i12," bytes i.e. ", f9.6," Gb")
endif
allocate (c(N,1), stat=istat)

if (istat/=0) then
print *,"ERROR: ALLOCATE FAILS for C"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for C"
else
mem_allocated = 8*N
104 FORMAT(" memory alloc (C) = ",i12," bytes i.e. ", f9.6," Gb")
endif
allocate (ipiv (n), stat=istat)

if (istat/=0) then
print *,"ERROR: ALLOCATE FAILS for IPIV"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for IPIV"
else
mem_allocated = 4*N
105 FORMAT(" memory alloc (ipiv) = ",i12," bytes i.e. ", f9.6," Gb")
endif
mem_allocated = 8*N*N + 8*N + 4*N


102 FORMAT(" memory allocated = ",i12," bytes i.e. ", f9.6," Gb")
call flush(6_4)
ENDIF
!
! ----- Initialize LHS and RHS
!
call init_my_matrix (n,a,nprow,npcol,myrow,mycol)
call init_my_rhs (n,c,nprow,npcol,myrow,mycol)
!
milestone = milestone + 1
call flush(6_4)
descr_milestone(milestone) = "call dgetrf"

!
! ----- Do the work -----
!
call dgetrf( N, N, A, N, ipiv, info )
if (iam.eq.0 .AND. DEBUG .eq. 1) &

& write (6,*) "info (from dgetrf) = ", info
call flush(6_4)
descr_milestone(milestone) = "call dgetrs"
call dgetrs( 'N', N, 1, A, N, ipiv, c, N, info )

& write (6,*) "info (from dgetrs) = ", info
!
call flush(6_4)
descr_milestone(milestone) = "get_solution --> root"
IF (IAM.EQ.0) THEN
ITEM = 6000
write (6,500) ITEM
500 FORMAT(I4," SOLUTION: ")
DO I=1,N
ITEM = 6000 + I
WRITE (6,501) ITEM,I,C(I,1)
501 FORMAT(I4," I = ",I3," ",f6.2)
ENDDO
ELSE
DO I=N-N_PRT,N
WRITE (6,502) I,C(I,1)
502 FORMAT(" I=",I10," ",f13.2)
ENDDO
ENDIF
ENDIF
call flush(6_4)
& minute(milestone),second(milestone), millisec(milestone) )
descr_milestone(milestone) = "print and clean up"
! Now report the milestone time differences
ITEM = 7000
IF (IAM .eq. 0) WRITE(6,1007) ITEM,iam,milestone

1007 FORMAT(I4," iam = ",i3," FINAL value of milestone = ",i2)
do i=1,milestone-1
ITEM = 7000 + I
tm = 86400.0*(day(i+1)-day(i)) + 3600.0*(hour(i+1)-hour(i)) + &
& 60.0*(minute(i+1)-minute(i)) + (second(i+1)-second(i)) + &
& 0.001 * (millisec(i+1)-millisec(i))
IF (IAM .EQ. 0) write(6,1005) ITEM,iam, i, i+1, tm, &
& descr_milestone(i)
1005 format(I4," iam ",i3," Time from milestone ",i2," to milestone "&
& ,I2," equals ",f10.2,3x,a30)
enddo
call timestamp( date_time,day_end,hour_end,minute_end,second_end, &

& millisec_end )
ITEM = ITEM + 1
IF (IAM .EQ. 0) write(6,1011) ITEM,iam,date_time
1011 format(I4," iam ",i3," END ",a23)
call cpu_time(T2)
TOT_TIME = T2-T1
call elapsed_time(day_beg,hour_beg, minute_beg, second_beg, &

& millisec_end,iam,ITEM )
ITEM = ITEM + 1
IF (IAM .EQ. 0) write(6,1012) ITEM,iam,TOT_TIME
1012 format(I4," IAM ",i3," Total CPU Time/processor = ",f11.4, &
& " seconds")
! ---------------------------------------------------------------------------
!
!
deallocate (a,c,ipiv)
!
end program DP_LAPACK_SOLVE
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_MATRIX (N,A,NPROW,NPCOL,MYROW,MYCOL)
IMPLICIT NONE
integer*8 :: N
INTEGER :: NH,NH1,NPROW,NPCOL,MYROW,MYCOL
INTEGER :: I, J
NH = N/2
NH1 = NH + 1
! Diagonal elements
DO I = 1,N
A(I,I) = 3.0
ENDDO
! Upper diagonal
DO 180 I = 1,N-1
A(I,I+1) = -1.0d0
180 CONTINUE
! Lower diagonal
DO 160 I = 1,N-1
A(I+1,I) = -1.0d0
160 CONTINUE
!
! ANTI-DIAGONAL
!
DO 190 I = 1,N
A(I,N-I+1) = 1.0d0
190 CONTINUE
return
end
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_RHS (N,C,NPROW,NPCOL,MYROW,MYCOL)
IMPLICIT NONE
integer*8 :: N
INTEGER :: NPROW,NPCOL,MYROW,MYCOL
INTEGER :: I
DO I= 1, N-1
C(I,1) = N + 1.0
ENDDO
C(N,1) = 2*N + 2
return
end
!
!-----------------------------------------------------------------------
!
subroutine elapsed_time(day_beg, hour_beg, minute_beg, second_beg,&
& millisec_beg, day_end, hour_end, minute_end, second_end,&
& millisec_end,tid, ITEM )
!date_and_time
! VALUES
! must be of type default integer and of rank one. It is an INTENT(OUT)
! argument. Its size must be at least eight. The values returned in VALUES
! are as follows:
! VALUES(1)
! is the year (for example, 1998), or -HUGE (0) if no date is available.
! VALUES(2)
! is the month of the year, or -HUGE (0) if no date is available.
! VALUES(3)
! is the day of the month, or -HUGE (0) if no date is available.
! VALUES(4)
! is the time difference with respect to Coordinated Universal Time (UTC)
! in minutes, or -HUGE (0) if this information is not available.
!
! VALUES(5)
! is the hour of the day, in the range 0 to 23, or -HUGE (0) if there is
! no clock.
!
! VALUES(6)
! is the minutes of the hour, in the range 0 to 59, or -HUGE (0) if there
! is no clock.
! !
! VALUES(7)
! is the seconds of the minute, in the range 0 to 60, or -HUGE (0) if
! there is no clock.
!
! VALUES (8)
! is the milliseconds of the second, in the range 0 to 999, or -HUGE (0)
! if there is no clock.
implicit none
integer :: day_beg, hour_beg, minute_beg, second_beg, millisec_beg
integer :: day_end, hour_end, minute_end, second_end, millisec_end
integer :: day, hour, minute, second, millisec, tid, ITEM
if ( millisec_end .lt. millisec_beg ) then

millisec_end = millisec_end + 1000
second_end = second_end - 1
endif
if ( second_end .lt. second_beg ) then

second_end = second_end + 60
minute_end = minute_end - 1
endif
if ( minute_end .lt. minute_beg ) then

minute_end = minute_end + 60
hour_end = hour_end - 1
endif
if ( hour_end .lt. hour_beg ) then

hour_end = hour_end + 24
day_end = day_end - 1
endif
millisec = millisec_end - millisec_beg

second = second_end - second_beg
minute = minute_end - minute_beg
hour = hour_end - hour_beg
day = day_end - day_beg
if ( .not. ( day .eq. 0 .and. hour .eq. 0 .and. minute .eq. 0 &
& .and. second .eq. 0 .and. millisec .eq. 0 ) ) then
ITEM = ITEM + 1
IF (tid .eq. 0 ) THEN
write(6,8000) ITEM,tid,day
8000 format(I4," IAM = ",I3," ELAPSED0 days = ",i3)
write(6,8001) ITEM,tid,hour
8001 format(I4," IAM = ",I3," ELAPSED1 hours = ",i3)
write(6,8002) ITEM,tid,minute
8002 format(I4," IAM = ",I3," ELAPSED2 minute = ",i3)
write(6,8003) ITEM,tid,second
8003 format(I4," IAM = ",I3," ELAPSED3 second = ",i3)
write(6,8004) ITEM,tid,millisec
8004 format(I4," IAM = ",I3," ELAPSED4 millisec = ",i3)
ENDIF
endif
return
!
!-----------------------------------------------------------------------
!
subroutine timestamp( date_time, day,hour,minute,second,millisec)

implicit none
data months /'Jan','Feb','Mar','Apr','May','Jun', &

& 'Juk','Aug','Sep','Oct','Nov','Dec' /
call date_and_time ( VALUES=elements )
if ( elements(1) .ne. -HUGE(0) ) then

century: if ( elements(1) .lt. 2000 ) then
elements(1) = elements(1) - 1900
else
endif century
write(date_time,1010) elements(3), &
& months ( elements(2) ), &
& elements(1) , elements(5), &
& elements(6) , elements(7) , elements(8)
1010 format(i2.2,1x,a3,1x,i2.2,1x, &

& i2.2,':',i2.2,':',i2.2,'.',i4.4 )
day = elements(3)
hour = elements(5)
minute = elements(6)
second = elements(7)
millisec = elements(8)
else
date_time=' '
day = 0
hour = 0
minute = 0
second = 0
millisec = 0
endif
!
!-----------------------------------------------------------------------
!
This file is used by script ed_parm_N which substitutes the x by $N:

! filename = parameter_N.inc_basic
parameter (N = x000)
This file is used by script ed_procs_P which substitutes nprocs by $P:

#!/bin/bash
# script file = sub_batch_DP_LAPACK_SOLVE_basic

sqsub -t -r 60m -q threaded -n nprocs -o ${CLU}_SERIAL_DP_SOLVE_%J ./DP_LAPACK_SOLVE
ScaLAPACK EXAMPLE
ScaLAPACK is a distributed memory version of LAPACK. It uses the Parallel Basic Linear Algebra Subprograms (PBLAS)
as computational building blocks. ScaLAPACK uses the block-cyclic decomposition scheme to distribute block-partitioned
matrices which should make the computation balanced and scalable.
Example of a block cyclic data distribution
According to the two-dimensional block cyclic data distribution scheme, an M_ by N_ dense matrix is first decomposed into
MB_ by NB_ blocks starting at its upper left corner.
These blocks are then uniformly distributed in each dimension of the Process Grid.
Thus, every process owns a collection of blocks, which are locally and contiguously stored in a two-dimensional ``column
major array. The partitioning of a matrix into blocksand the mapping of these blocks onto a Process Grid is illustrated with a
global 9x9 matrix A. The first step in this process is to partition the matrix A into block. Let us use 2x2 blocks and assume that
the 2-D Process Grid is 2x3.
a11 a12 | a13 a14 | a15 a16 | a17 a18 | a19

a21 a22 | a23 a24 | a25 a26 | a27 a28 | a29 B11 B12 B13 B14 B15
--------.---------.---------.---------.----
a31 a32 | a33 a34 | a35 a36 | a37 a38 | a39
a41 a42 | a43 a44 | a45 a46 | a47 a48 | a49 B21 B22 B23 B24 B25
--------.---------.---------.---------.----
a51 a52 | a53 a54 | a55 a56 | a57 a58 | a59 ==
a61 a62 | a63 a64 | a65 a66 | a67 a68 | a69 B31 B32 B33 B34 B35
--------.---------.---------.---------.----
a71 a72 | a73 a74 | a75 a76 | a77 a78 | a79
a81 a82 | a83 a84 | a85 a86 | a87 a88 | a89 B41 B42 B43 B44 B45
--------.---------.---------.---------.----
a91 a92 | a93 a94 | a95 a96 | a97 a98 | a99 B51 B52 B53 B54 B55
matrix A is M_ x N_ (9 x 9)
A is partitioned into 2x2 blocks
In the above diagram the Bij are the 2x2 blocks, e.g.
B11 == a11 a12 B12 == a13 a14 ... B15 == a19

a21 a22 a23 a24 a29
Initially, the 2x3 Process Grid is empty and looks like this:
0 1 2
. . .
0 . . .
. . .
1 . . .
. . .
We identify each process in the Process Grid by two coordinates (row,col). Thus, for the 2x3 Process Grid the processes would
be:
0 1 2
0 (0,0) (0,1) (0,2)
1 (1,0) (1,1) (1,2)
The distribution process starts by taking the Global Bij in first row and distribute them to the first row of the Processor Grid:
0 1 2
B11 B14 B12 B15 B13

0 . . .
. . .
1 . . .
. . .
Take Global Bij in next row and distribute them to the next row of the Process Grid: (if previous distribution was on last row
of Process Grid then restart with row 0 of Process Grid ).
0 1 2
B11 B14 B12 B15 B13

0 . . .
. . .
1 B21 B24 B22 B25 B23

. . .
Take Global Bij in next row and distribute them to the first row of the Process Grid: (restart with row 0 of Process Grid).
0 1 2
B11 B14 B12 B15 B13

0 B31 B34 B32 B35 B33
. . .
1 B21 B24 B22 B25 B23

. . .
Take Global Bij in next row and distribute them to the next row of the Process Grid:
0 1 2
B11 B14 B12 B15 B13

0 B31 B34 B32 B35 B33
. . .
1 B21 B24 B22 B25 B23
B41 B44 B42 B45 B43
Take Global Bij in next row and distribute them to the first row of the Process Grid: (restart with row 0 of Process Grid).
0 1 2
B11 B14 B12 B15 B13

0 B31 B34 B32 B35 B33
B51 B54 B52 B55 B53
1 B21 B24 B22 B25 B23

B41 B44 B42 B45 B43
Set Up prior to invoking the Parallel Solvers
Before calling solvers and other functions from the ScaLAPACK library the user needs to configure the processes into a
BLACS 2D Process Grid. ScaLAPACK requires that the data arrays be block-cyclically distributed across the 2D Processor
Grid.
Here is the procedure to accomplish this:
(1) Start by invoking the following routine:

call blacs_pinfo( iam, nprocs )
OUT OUT
which identifies the process in the process grid and how many processes there are in total.
(2) Next, you should set up a rectangular grid (2D) as close as possible to a square. Following algorithm was developed by
Carlo Cavazzoni of CINECA (this is not part of BLACS), and you must include it in your source:
call gridsetup (nprocs,nprow,npcol)
INPUT OUT OUT
This computes the number of rows and columns we'll have in the process grid, such that NPROW x NPCOL == NPROCS.
(3) Inform BLACS of this grid extent by calling following routines:

call blacs_get( -1, 0, context )
IN IN OUT
call blacs_gridinit( context, 'r', nprow, npcol )

INPUT IN INPUT INPUT
Note: 'r' ==> ROW-MAJOR ordering i.e. fill in first row left to right then next row.
'c' ==> COLUMN-MAJOR ordering.
MPI has communicators, BLACS has contexts. For this example, the default context, which includes all processors, is all we
need. The BLACS_GET call simply returns a handle to the default context for use in subsequent BLACS calls.
BLACS_GRIDINIT informs BLACS of the grid extent which we computed in step 2. Behind the scenes, BLACS then assigns
processors to points in this virtual grid.
This now completes the definition of the BLACS 2D process grid.
We can now use subroutine BLACS_GRIDINFO to query BLACS on the information we supplied plus additional information
BLACS has computed such as its coordinate, e.g., COL and ROW, in the grid by specifying "context":
call blacs_gridinfo( context, nprow, npcol, myrow, mycol )
INPUT OUT OUT OUT OUT
Subroutine report_back prints the information we got from the call to blacs_gridinfo.
(4) Distribute the data arrays across all processors:
For each array, we need to take four basic steps:

(a) determine a block size for the global array
(b) call DESCINIT to create a standard ScaLAPACK array descriptor
(c) allocate local memory for each process' portion of the array
(d) distribute the actual data values into the allocated memory
(a) determine a block size for the global arrays and block-cyclically distribute these blocks across the 2D processor grid
defined in steps (1), (2) and (3):
Here "Block size" refers to the dimensions of the subdivisions into which the global array is decomposed.
The ScaLAPACK User Guide recommends a block size of 64x64 for large arrays.
Carlo Cavazzoni of CINECA has written a subroutine to compute the block size for a given array: "Blockset" chooses a good
block size nb based on the size of the global array (i.e. dimension of matrix A), n, and number of rows and columns in the
processor grid (nprow,npcol). It also honors a maximum block size value, which, following the User Guide recommendation,
can simply be set at 64. We will then simply call blockset to determine the block size, nb:
call blockset( nb, 64, n, nprow, npcol)
OUT INPUT INPUT INPUT INPUT
(b) call DESCINIT to create a standard ScaLAPACK array descriptor
The ScaLAPACK function NUMROC is used to determine the number of rows of matrix A:
l_nrowsa = numroc(n,nb,myrow,0,nprow)
and again to determine the number of columns of matrix A:

l_ncolsa = numroc(n,nb,mycol,0,npcol)
Now we are ready to call DESCINIT:

call descinit( desca, n, n, nb, nb, 0, 0, context, l_nrowsa, info )
Similar calls are required for matrix C (the right hand side).
(c) allocate local memory for each process' portion of the array
Following fortran statement will allocate the required local memory for the segment of the matrix A used by this processor:
allocate (a(l_nrowsa, l_ncolsa ), stat=istat)
Similar statements are required for the arrays C and IPIV, etc ...
(d) distribute the actual data values into the allocated memory
The ScaLAPACK subroutine PDELSET is used to store the global values of matrices A and C to the appropriate locations in
each processor.
When we are done with a BLACS context, it should be released by calling subroutine BLACS_GRIDEXITR, and when we are
done with BLACS altogether, we should call BLACS_EXIT. (These calls are similar to the MPI functions, MPI_Comm_free
and MPI_Finalize).
Timing Routines and Scripts used in the makefile
In the next section a full listing of the program designed to solve a system of STRIDWAD equations of an even order N will
appear. This program uses ScaLAPACK routines by distributing the coefficient matrix A and right hand side vector on the
processors of the 2D Process Grid.
The timing routines are used to measure the cpu and elapsed times in different segments of the program. Several milestones are
set up throughout the program and the timings between this milestones is reported at the end of the program.
Following scripts and basic files are used in the makefile:

./check_N $N
parameter_N.inc_basic
sub_batch_DP_LAPACK_SOLVE_basic
./ed_procs_P sub_batch_DP_LAPACK_SOLVE $P
./sub_batch_DP_LAPACK_SOLVE
listings follow:
Using make/makefile to compile and submit the program
The make command is used to execute the commands in the makefile. The makefile has several macros with default values but
the user can override these values. Several scripts are invoked to verify some parameters and if these parameters are out of
range then the script returns a non-zero return code which stop make from further executing the commands on the makefile.
For example if the default parameters N, P and FPP_OPT were set to:
N:= 1
P:= 64
FPP_OPT:= -DPRINT_TIMES -DNO_PRINT_MEMORY_DISTR
in the makefile and we type the command:

make
the utility program make will proceed to read the makefile and carry out the following tasks:
Check if N is in the range [1,40] (N times 1000 is the order of the matrix to invert)
Create the include file parameter_N.inc by making a copy of it from
file parameter_N.inc_basic and changing the parameter N to 4
Invoking fpp with option -DPRINT_TIMES and generating a new source
file ScaLAPACK_SOLVE2.f90
Compiling the latter with command compile using the libraries defined in the makefile by
macro $(LIB)
Generating a script to submit the job based on file sub_batch_ScaLAPACK_SOLVE_basic
and setting the parameter for number or processors to use
Submitting the job to the test queue
Since we are submitting to the test queue we must wait until the job finishes running, also we need to type the command:
make clean
to remove the last executable.
If we want to run the case for a system of order 4000 using 128 processors and do not want to time the routines but want to see
how the matrix A is distributed on the processors we could type:
make N=4 P=128 "FPP_OPT=-DNO_PRINT_TIMES -DPRINT_MEMORY_DISTR"
or one could also type:

make N=4 P=128 FPP_OPT=-DNO_PRINT_TIMES FPP_OPT+=-DPRINT_MEMORY_DISTR
The macro USE_QUEUE is used to decide if the job to be submitted should be run in the test queue or not. This macro must
be set up manually by editing the makefile.
Thus, for jobs to be submitted to the test queue the macro should look like this:
USE_QUEUE:= \-t
# USE_QUEUE:=\-v
also note that the script sub_batch_ScaLAPACK_SOLVE_basic uses -r 60m.
For longer production jobs (not to be submitted to the test queue) the macro should look like this:
# USE_QUEUE:= \-t
USE_QUEUE:=\-v
and in the script sub_batch_ScaLAPACK_SOLVE_basic could use larger than 60m values for parameter -r instead of -r
60m.
As you become familiar with the macros in the makefile, you can adjust them to fit your needs. Here is the makefile used in
this online tutorial:
# Determine the platform that is being used for this compilation
OS:= $(shell uname)

PLATFORM:= $(shell uname)_$(shell uname -m)
# Set the dafault parameters N (order of matrx times 1000)

# and P (number of processors
N:= 1
P:= 64
FPP_OPT:= -DPRINT_TIMES -DNO_PRINT_MEMORY_DISTR
USE_QUEUE:= \-t
# USE_QUEUE:=\-v
# Specify executable, source and required libraries :
EXE = ScaLAPACK_SOLVE
EXT:= f90
SRC:= ${EXE}.$(EXT)
EXE2:=$(EXE)2
SRC2:=${EXE2}.$(EXT)
LIB:= -psc -lscalapack -lblacs -lblacsF77init -lblacs -lmpi -L$(BLACSPATH)/pathscale/lib \

-L$(SCALAPACKPATH)/pathscale/lib -L$(ACMLPATH) -lacml
$(EXE): $(SRC)
@echo Check if N is in the range [1,40]
./check_N $N
# @echo COMPILING LAPACK DP program with N = $(N),000
# @echo PLATFORM used = $(PLATFORM)
# @echo Edit the parameter_N.inc file
cp parameter_N.inc_basic parameter_N.inc
# @echo "Using following parameter_N.inc file:"
# cat parameter_N.inc
# @echo "LIB = " $(LIB)
@echo "Invoking fpp with option $(FPP_OPT)"

fpp $(FPP_OPT) $(SRC) $(SRC2)
compile -o $(EXE) $(SRC2) $(LIB)
cp sub_batch_ScaLAPACK_SOLVE_basic sub_batch_ScaLAPACK_SOLVE
./ed_procs_P sub_batch_ScaLAPACK_SOLVE $(USE_QUEUE) $P
./sub_batch_ScaLAPACK_SOLVE
superclean: clean
@echo ' '
@echo Removing Output files generated by $(EXE)
@echo '--------------------------------------------------'
@echo ' '
rm -rf ${CLU}_SERIAL_DP_SOLVE_*
@echo ' '
@echo Output files have been removed.
@echo '-------------------------------'
@echo ' '
clean:
@echo ' '
@echo Cleaning executable $(EXE)
@echo '-----------------------------------'
@echo ' '
rm -rf $(EXE)
rm -rf ED_ERROR
@echo ' '
@echo Executable $(EXE) has been removed.
@echo '--------------------------------------------'
@echo ' '
help:
@echo "+-----------------------------------------------------------------------------------+"
@echo "| |"
@echo "| Makefile for running ScaLAPACK example using pdgetrf/pdgetrs |"
@echo "| |"
@echo "| Usage: make compile program for N=1 (i.e. order=1,000) |"
@echo "| |"
@echo "| make N=n compile program for N=n (i.e. order=n*1000) |"
@echo "| where 1 <= n <= 40 |"
@echo "| |"
@echo "| make P=p submit job with -q threaded -n p |"
@echo "| where 1 <= p <= 4 or 6 (depending on cluster) |"
@echo "| |"
@echo "| make superclean remove output files and executable |"
@echo "| |"
@echo "| make clean remove executable |"
@echo "| |"
@echo "| make help display makefile usage information |"
@echo "| |"
@echo "+-----------------------------------------------------------------------------------+"
@echo " "
@echo PLATFORM $(PLATFORM)
Full fortran program listing using ScaLAPACK routines to solve a STRIDWAD system
The full fortran 90 program (which includes fpp directives) is listed next:
program ScaLAPACK_SOLVE
!
! file name = ScaLAPACK_SOLVE.f90
!
implicit none
include 'mpif.h'
integer :: istat,info,i,j,N_PRT,milestone
real(kind=8),dimension(:,:),allocatable :: a
real(kind=8),dimension(:,:),allocatable :: c
real(kind=8),dimension(:),allocatable :: solution
integer,dimension(:),allocatable :: ipiv
integer*8 :: N
real(kind=8) :: memory_local, memory_sum
integer :: root, ierr
real(kind=8),dimension(:),allocatable :: memory_a
logical :: PRT_MEM
include "parameter_N.inc"
integer :: context, iam, pnum

integer :: mycol, myrow, nb
integer :: npcol, nprocs, nprow
integer :: l_nrowsa,l_ncolsa
integer :: l_nrowsc,l_ncolsc
integer,parameter :: descriptor_len=9
integer :: desca( descriptor_len )
integer :: descc( descriptor_len )
integer :: numroc, ITEM, IAorC, NH, NH1, DEBUG

! --------------------------------------------------------------------------
#ifdef PRINT_TIMES
integer :: day(10),hour(10),minute(10),second(10),millisec(10)
character*30 :: descr_milestone(10)
real(kind=8) :: T1, T2
real(kind=8) :: TOT_TIME
real(kind=8) :: tm
#endif
! --------------------------------------------------------------------------
interface
subroutine report_back ( context, &
iam,nprocs,myrow,nprow,mycol,npcol)
implicit none
integer :: context
integer :: iam,nprocs,myrow,nprow,mycol,npcol
integer :: i,j,pnum
end subroutine report_back
subroutine printlocals( context, &

a,iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols,IAorC)
implicit none
integer :: context
integer :: iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols,IAorC
integer :: i,j,pnum
end subroutine printlocals
subroutine init_my_matrix (n,a,nprow,npcol,myrow,mycol,desca)

implicit none
integer*8 :: N
integer :: desca(:)
subroutine init_my_rhs (n,c,nprow,npcol,myrow,mycol,descc)

implicit none
integer*8 :: N
integer :: descc(:)
subroutine get_solution (n,c,descc,solution)

implicit none
integer*8 :: N
integer :: descc(:)
real(kind=8) :: c(:,:),solution(:)
end subroutine get_solution
#ifdef PRINT_TIMES
subroutine elapsed_time(day_beg,hour_beg, minute_beg,second_beg,&
& millisec_end,tid,ITEM )
implicit none
integer :: day,hour, minute, second, millisec, tid, ITEM
subroutine timestamp(date_time,day,hour,minute,second,millisec)
implicit none
#endif
end interface
! ---------------------------------------------------------------------------
#ifdef PRINT_MEMORY_DISTR
PRT_MEM = .TRUE.
#else
PRT_MEM = .FALSE.
#endif
!
DEBUG = 1
root = 0
! N must be EVEN
NH = N/2
NH1 = NH + 1

WRITE(6,2001) N
STOP
ENDIF
IF ( N .gt. N_PRT ) DEBUG = 0

!
! ----- Initialize the blacs. Note: processors are counted starting at 0.
!
call blacs_pinfo( iam, nprocs )
! ---------------------------------------------------------------------------
ITEM = 1000 + iam
#ifdef PRINT_TIMES
call timestamp(date_time,day_beg,hour_beg,minute_beg,second_beg, &
& millisec_beg )
IF (IAM .eq. 0) write(6,1006) ITEM,iam,date_time

1006 format(I4," iam ",i3," BEGIN ",a23)
call cpu_time(T1)
milestone = 1
descr_milestone(milestone) = "Init blacs_gridinit, allocate"

#endif
!

WRITE(6,101) nprocs
101 FORMAT(" nprocs = ",i3)
call flush(6_4)
ENDIF
! ----- Set the dimension of the 2d processors grid.
!
call gridsetup(nprocs,nprow,npcol)
!
! ----- Initialize a single blacs context. Determine which processor I
! am in the 2D process or grid.
!
call blacs_get( -1, 0, context )
call blacs_gridinit( context, 'r', nprow, npcol )
call blacs_gridinfo( context, nprow, npcol, myrow, mycol )
IF (DEBUG .eq. 1) call report_back (context, &

& iam,nprocs,myrow,nprow,mycol,npcol)
!
! ----- Calculate the blocking factor for the matrix.
!
call blockset( nb, 64, n, nprow, npcol)
!
! ----- Distributed matrices: get num. local rows/cols. Create description.
!
l_nrowsa = numroc(n,nb,myrow,0,nprow)
l_ncolsa = numroc(n,nb,mycol,0,npcol)
call descinit( desca, n, n, nb, nb, 0, 0, context, l_nrowsa, info )
l_nrowsc = numroc(n,nb,myrow,0,nprow)
l_ncolsc = numroc(1,nb,mycol,0,npcol)
call descinit( descc, n, 1, nb, nb, 0, 0, context, l_nrowsc, info )
!
! ----- Allocate LHS, RHS, pivot, and solution -----
!
allocate (a(l_nrowsa, l_ncolsa ), stat=istat)
if (istat/=0) stop "ERR:ALLOCATE FAILS for A"
allocate (c(l_nrowsc,l_ncolsc), stat=istat)

if (istat/=0) stop "ERR:ALLOCATE FAILS for C"
allocate (ipiv (n), stat=istat)

if (istat/=0) stop "ERR:ALLOCATE FAILS for IPIV"
allocate (solution (n), stat=istat)

if (istat/=0) stop "ERR:ALLOCATE FAILS for solution"
allocate (memory_a (nprocs), stat=istat)

if (istat/=0) stop "ERR:ALLOCATE FAILS for memory_a"
!
! ----- Reassemble memory_a on all processors. Add and Print on 0 -----
!
memory_local = l_nrowsa*l_ncolsa*8
call MPI_Gather(memory_local, 1, MPI_REAL8, &

& memory_a, 1, MPI_REAL8, root, &
& MPI_COMM_WORLD, ierr)
IF (IAM.EQ.0) THEN
IF (N .le. N_PRT .or. PRT_MEM ) THEN
ITEM = 6500
memory_sum = 0
write (6,550) ITEM,N
550 FORMAT(I4," N = ",i12 /" Distributed memory for A: ")
DO I=0,nprocs-1
ITEM = 6500 + I
WRITE (6,551) ITEM,I,memory_a(I+1)
memory_sum = memory_sum + memory_a(I+1)
551 FORMAT(I4," processor = ",I3," ",f14.1)
ENDDO
ITEM = ITEM + 1
WRITE (6,552) ITEM,memory_sum,float(N)*float(N)*8
552 FORMAT(I4," TOTAL MEMORY for A = ",f14.1," == N*N*8 = ",f14.1)
ENDIF
call flush(6_4)
ENDIF
!
! ----- Initialize LHS and RHS
!
call init_my_matrix (n,a,nprow,npcol,myrow,mycol,desca)
call init_my_rhs (n,c,nprow,npcol,myrow,mycol,descc)
!
! ----- Show how arrays distributed
!
IF (DEBUG .eq. 1) THEN
ITEM = 2000 + 100*MYROW + 10*MYCOL
IF (IAM.EQ.0) THEN
write(6,300) ITEM,n,n
300 format(i4," DISTRIBUTION OF ARRAY: A - Global dimension:", &
& i3,":",i3)
ENDIF
IAorC = 0
call printlocals ( context, &
& a,iam,nb,nprocs,myrow,mycol,l_nrowsa,l_ncolsa,IAorC)
IF (IAM.EQ.0) THEN
write(6,400) ITEM,N,1
400 FORMAT(i4," DISTRIBUTION OF ARRAY: C - Global dimension:", &
& i3,":",i3)
ENDIF
IAorC = 1000
call printlocals ( context, &
& c,iam,nb,nprocs,myrow,mycol,l_nrowsc,l_ncolsc,IAorC)
ENDIF
#ifdef PRINT_TIMES
descr_milestone(milestone) = "call pdgetrf"

#endif
!
! ----- Do the work -----
!
call pdgetrf( n, n, a, 1, 1, desca, ipiv, info )
& write (6,*) "info (from pdgetrf) = ", info
#ifdef PRINT_TIMES
descr_milestone(milestone) = "call pdgetrs"

#endif
call pdgetrs( 'n', n, 1, a, 1, 1, desca, ipiv, &

& c, 1, 1, descc, info )

& write (6,*) "info (from pdgetrs) = ", info
!
! ----- Reassemble solution on all processors. Print on 0 -----
!
#ifdef PRINT_TIMES
descr_milestone(milestone) = "get_solution --> root"

#endif
call get_solution (n,c,descc,solution)
IF (IAM.EQ.0) THEN
ITEM = 6000
write (6,500) ITEM
500 FORMAT(I4," SOLUTION: ")
DO I=1,N
ITEM = 6000 + I
WRITE (6,501) ITEM,I,SOLUTION(I)
501 FORMAT(I4," I = ",I3," ",f6.2)
ENDDO
ELSE
DO I=N-N_PRT,N
WRITE (6,502) I,SOLUTION(I)
502 FORMAT(" I=",I10," ",f13.2)
ENDDO
ENDIF
ENDIF
#ifdef PRINT_TIMES
& minute(milestone),second(milestone), millisec(milestone) )
descr_milestone(milestone) = "print and clean up"

#endif
! Now report the milestone time differences
#ifdef PRINT_TIMES
ITEM = 7000
IF (IAM .eq. 0) WRITE(6,1007) ITEM,iam,milestone
1007 FORMAT(I4," iam = ",i3," FINAL value of milestone = ",i2)
do i=1,milestone-1
ITEM = 7000 + I
tm = 86400.0*(day(i+1)-day(i)) + 3600.0*(hour(i+1)-hour(i)) + &
& 60.0*(minute(i+1)-minute(i)) + (second(i+1)-second(i)) + &
& 0.001 * (millisec(i+1)-millisec(i))
IF (IAM .EQ. 0) write(6,1005) ITEM,iam, i, i+1, tm, &
& descr_milestone(i)
1005 format(I4," iam ",i3," Time from milestone ",i2," to milestone "&
& ,I2," equals ",f10.2,3x,a30)
enddo
call timestamp( date_time,day_end,hour_end,minute_end,second_end, &

& millisec_end )
ITEM = ITEM + 1
IF (IAM .EQ. 0) write(6,1011) ITEM,iam,date_time
1011 format(I4," iam ",i3," END ",a23)
call cpu_time(T2)
TOT_TIME = T2-T1
call elapsed_time(day_beg,hour_beg, minute_beg, second_beg, &

& millisec_end,iam,ITEM )
ITEM = ITEM + 1
IF (IAM .EQ. 0) write(6,1012) ITEM,iam,TOT_TIME
1012 format(I4," IAM ",i3," Total CPU Time/processor = ",f11.4, &
& " seconds")
#endif
! ---------------------------------------------------------------------------
!
!
deallocate (a,c,ipiv,solution,memory_a)
!
! ----- Exit BLACS cleanly -----
!
call blacs_gridexit( context )
call blacs_exit( 0 )
end program ScaLAPACK_SOLVE
subroutine gridsetup(nproc,nprow,npcol)
!
! This subroutine factorizes the number of processors (nproc)
! into nprow and npcol, that are the sizes of the 2d processors mesh.
!
! Written by Carlo Cavazzoni
!
implicit none
integer nproc,nprow,npcol
integer sqrtnp,i
sqrtnp = int( sqrt( dble(nproc) ) + 1 )

do i=1,sqrtnp
if(mod(nproc,i).eq.0) nprow = i
end do
npcol = nproc/nprow
return
end
!
!-----------------------------------------------------------------------
!
subroutine blockset( nb, nbuser, n, nprow, npcol)
!
! This subroutine try to choose an optimal block size
! for the distributd matrix.
!
! Written by Carlo Cavazzoni, CINECA
!
implicit none
integer*8 :: N
integer nb, nprow, npcol, nbuser
nb = min ( n/nprow, n/npcol )

if(nbuser.gt.0) then
nb = min ( nb, nbuser )
endif
nb = max(nb,1)
return
end subroutine blockset
!
!-----------------------------------------------------------------------
!
subroutine report_back ( context, &
iam,nprocs,myrow,nprow,mycol,npcol)
!
! Each processor identifies itself and its place in the processor grid
!
implicit none
integer :: context
integer :: iam,nprocs,myrow,nprow,mycol,npcol
integer :: i,j,pnum,ITEM
do i=0,nprocs-1
call blacs_barrier (context, 'a')
if (iam.eq.i) then
ITEM = 3000 + 100*MYROW + 10*MYCOL
write(6,100) ITEM,iam,nprocs,myrow,mycol,myrow,nprow,mycol, &
& npcol
100 format(i4," PE=",i2,":",i2," Grid-coord(",i2,",",i2, &

& ") PROW=",i2,":",i2," PCOL=",i2,":",i2)
call flush(6_4)
endif
enddo

end subroutine report_back
!
!-----------------------------------------------------------------------
!
subroutine printlocals( context, &
a,iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols,IAorC)
!
! Prints the local array "a", processor by processor. Only use this
! for very small arrays or sub-arrays...
!
implicit none
integer :: context
integer :: iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols,IAorC
integer :: i,j,pnum,ITEM
do pnum=0,nprocs-1
if (iam .eq. pnum) then
ITEM = IAorC + 4000 + 100*MYROW + 10*MYCOL
write (6,100) ITEM,iam,myrow,mycol,nb,l_nrows,l_ncols, &
& l_nrows*l_ncols*8
100 format (i4," proc:",i3," grid position:",i3,",",i3, &
& " blksz:",i3," numroc:",i3,":",i3," Memory Allocated "&
& ,i6," bytes")
do i=1,l_nrows
ITEM = IAorC + 4000 + 100*MYROW + 10*MYCOL + I
write (6,200) ITEM,(a(i,j),j=1,l_ncols)
200 format (I4," ",40(" ",f5.1))
call flush(6_4)
enddo
endif
enddo
call flush(6_4)
end subroutine printlocals
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_MATRIX (N,A,NPROW,NPCOL,MYROW,MYCOL,DESCA)
IMPLICIT NONE
integer*8 :: N
INTEGER :: NH,NH1,NPROW,NPCOL,MYROW,MYCOL
INTEGER :: DESCA(:)
REAL(kind=8) :: AII,AII1,AI1I,AINI1
INTEGER :: I, J
! Compute values for all elements of the global array, but using
! pdelset, only set those elements which occur in local portion.
NH = N/2
NH1 = NH + 1
! Diagonal elements
DO I = 1,N
! A(I,I) = 3.0
AII = 3.0
J = I
CALL PDELSET(A,I,J,DESCA,AII)
ENDDO
! Upper diagonal
DO 180 I = 1,N-1
! A(I,I+1) = -1.0d0
AII1 = -1.0d0
J = I+1
CALL PDELSET(A,I,J,DESCA,AII1)
180 CONTINUE
! Lower diagonal
DO 160 I = 1,N-1
! A(I+1,I) = -1.0d0
AI1I = -1.0d0
J = I
CALL PDELSET(A,I+1,J,DESCA,AI1I)
160 CONTINUE
!
! ANTI-DIAGONAL
!
DO 190 I = 1,N
! A(I,N-I+1) = 1.0d0
AINI1 = 1.0d0
J = N-I+1
CALL PDELSET(A,I,J,DESCA,AINI1)
190 CONTINUE
return
end
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_RHS (N,C,NPROW,NPCOL,MYROW,MYCOL,DESCC)
IMPLICIT NONE
integer*8 :: N
INTEGER :: NPROW,NPCOL,MYROW,MYCOL
INTEGER :: DESCC(:)
REAL(kind=8) :: CI, CN
INTEGER :: I
DO I= 1, N-1
! C(I) = N + 1.0
CI = N + 1.0
CALL PDELSET(C,I,1,DESCC,CI)
ENDDO
! C(N) = 2*N + 2
CN = 2*N + 2
CALL PDELSET(C,I,1,DESCC,CN)
return
end
!
!-----------------------------------------------------------------------
!
subroutine get_solution (n,c,descc,solution)
implicit none
integer*8 :: n
integer :: descc(:)
real(kind=8) :: c(:,:),solution(:)
integer :: i
do i= 1, n
call pdelget('A',' ',solution(i),c,i,1,descc)
enddo
return
end
!
!-----------------------------------------------------------------------
!
#ifdef PRINT_TIMES
subroutine elapsed_time(day_beg, hour_beg, minute_beg, second_beg,&
& millisec_beg, day_end, hour_end, minute_end, second_end,&
& millisec_end,tid, ITEM )
!date_and_time
! VALUES
! must be of type default integer and of rank one. It is an INTENT(OUT)
! argument. Its size must be at least eight. The values returned in VALUES
! are as follows:
! VALUES(1)
! is the year (for example, 1998), or -HUGE (0) if no date is available.
! VALUES(2)
! is the month of the year, or -HUGE (0) if no date is available.
! VALUES(3)
! is the day of the month, or -HUGE (0) if no date is available.
! VALUES(4)
! is the time difference with respect to Coordinated Universal Time (UTC)
! in minutes, or -HUGE (0) if this information is not available.
!
! VALUES(5)
! is the hour of the day, in the range 0 to 23, or -HUGE (0) if there is
! no clock.
!
! VALUES(6)
! is the minutes of the hour, in the range 0 to 59, or -HUGE (0) if there
! is no clock.
! !
! VALUES(7)
! is the seconds of the minute, in the range 0 to 60, or -HUGE (0) if
! there is no clock.
!
! VALUES (8)
! is the milliseconds of the second, in the range 0 to 999, or -HUGE (0)
! if there is no clock.
implicit none
integer :: day_beg, hour_beg, minute_beg, second_beg, millisec_beg
integer :: day_end, hour_end, minute_end, second_end, millisec_end
integer :: day, hour, minute, second, millisec, tid, ITEM
if ( millisec_end .lt. millisec_beg ) then

millisec_end = millisec_end + 1000
second_end = second_end - 1
endif
if ( second_end .lt. second_beg ) then

second_end = second_end + 60
minute_end = minute_end - 1
endif
if ( minute_end .lt. minute_beg ) then

minute_end = minute_end + 60
hour_end = hour_end - 1
endif
if ( hour_end .lt. hour_beg ) then

hour_end = hour_end + 24
day_end = day_end - 1
endif
millisec = millisec_end - millisec_beg

second = second_end - second_beg
minute = minute_end - minute_beg
hour = hour_end - hour_beg
day = day_end - day_beg
if ( .not. ( day .eq. 0 .and. hour .eq. 0 .and. minute .eq. 0 &
& .and. second .eq. 0 .and. millisec .eq. 0 ) ) then
ITEM = ITEM + 1
IF (tid .eq. 0 ) THEN
write(6,8000) ITEM,tid,day
8000 format(I4," IAM = ",I3," ELAPSED0 days = ",i3)
write(6,8001) ITEM,tid,hour
8001 format(I4," IAM = ",I3," ELAPSED1 hours = ",i3)
write(6,8002) ITEM,tid,minute
8002 format(I4," IAM = ",I3," ELAPSED2 minute = ",i3)
write(6,8003) ITEM,tid,second
8003 format(I4," IAM = ",I3," ELAPSED3 second = ",i3)
write(6,8004) ITEM,tid,millisec
8004 format(I4," IAM = ",I3," ELAPSED4 millisec = ",i3)
ENDIF
endif
return

#endif
!
!-----------------------------------------------------------------------
!
#ifdef PRINT_TIMES
subroutine timestamp( date_time, day,hour,minute,second,millisec)
implicit none
data months /'Jan','Feb','Mar','Apr','May','Jun', &

& 'Juk','Aug','Sep','Oct','Nov','Dec' /
call date_and_time ( VALUES=elements )
if ( elements(1) .ne. -HUGE(0) ) then

century: if ( elements(1) .lt. 2000 ) then
else
endif century
write(date_time,1010) elements(3), &
& months ( elements(2) ), &
& elements(1) , elements(5), &
& elements(6) , elements(7) , elements(8)
1010 format(i2.2,1x,a3,1x,i2.2,1x, &

& i2.2,':',i2.2,':',i2.2,'.',i4.4 )
day = elements(3)
hour = elements(5)
minute = elements(6)
second = elements(7)
millisec = elements(8)
else
date_time=' '
day = 0
hour = 0
minute = 0
second = 0
millisec = 0
endif

#endif
!
!-----------------------------------------------------------------------
!
Results for solutions of the STRIDWAD system for different orders of matrix A
LAPACK RESULTS
These are the results for serial LAPACK runs (nprocs=1). Note that on one processor the highest matrix size is 30,000.
N CPU Time
order of per
matrix processor
[ sec ]
10000 312.33
20000 1808.32
30000 5487.23
32000 N/A
ScaLAPACK RESULTS
N CPU Time CPU Time CPU Time CPU Time

order of nprocs per nprocs per nprocs per per
matrix processor processor processor processor
[ sec ] [ sec ] [ sec ] [ sec ]
10000 64 8.17 128 5.15 256 6.31 512 .

20000 40.40 24.37 24.44
30000 118.24 64.40 59.70
40000 254.73 135.43 110.18
50000 462.04 244.30 219.82
60000 754.16 398.18 288.39
70000 1164.53 608.61 562.66
80000 1666.45 881.23 587.94
90000 2330.41 1227.83 1146.27
100000 3134.40 1645.89 1028.17 3134.39
110000 2131.83 1328.90
120000 2726.03 1672.00
130000 3412.37 2071.03
140000 N/A 2558.06
150000 3097.58
160000 3661.49
170000 4318.20
180000 5092.44
190000 5873.94
200000 6794.57
210000 7843.55
220000 8798.14
230000 10190.61
240000 11230.16
250000 12874.02
N/A
CONCLUSIONS
With ScaLAPACK on 256 processors matrices up to order 250,000 can be inverted, compared to order of 30,000 for
LAPACK. When the matrix is distributed over many processors it takes less time and larger order matrices can be inverted.
For a list of the available ScaLAPACK routines see:

http://www.netlib.org/scalapack/double/
All ScaLAPACK routines assume that the data has been distributed on the process grid prior to the invocation of the routine.
Detailed descriptions of the appropriate calling sequences for each of the ScaLAPACK routines can be found in the leading
comments of the source code or the ScaLAPACK Users' Guide:
http://netlib2.cs.utk.edu/scalapack/slug/scalapack_slug.html
Retrieved from "https://www.sharcnet.ca/help/index.php/LAPACK_and_ScaLAPACK_Examples"

Category: Tutorials

LAPACK and ScaLAPACK Examples

Caricato da

Informazioni sul documento

Descrizione originale:

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

LAPACK and ScaLAPACK Examples

Caricato da

Copyright:

Formati disponibili

LAPACK and ScaLAPACK Examples

LAPACK and ScaLAPACK Examples

• 1 LAPACK and ScaLAPACK Examples

Defining the STRIDWAD matrix

Observations on the structure of the STRIDWAD systems:

All elements on the anti-diagonal are 1.

Order N is an EVEN number.

If we assume X is unknown and we use the C as defined above we can

Program to calculate the RHS vector of the STRIDWAD system

parameter (N_PRT = 40)

subroutine init_my_matrix (n,a)

subroutine init_my_vector (N,X)

subroutine init_my_rhs (n,c)

IF (2*NH .NE. N) THEN

allocate (c(N,1), stat=istat)

allocate (D(N,1), stat=istat)

allocate (X (n,1), stat=istat)

mem_allocated = 8*N*N + 8*N + 8*N

! For large N print only last N_PRT elements of vector

Important Notes on the program and Output Results

There are several things which should be pointed out:

(2) To submit the job the command:

Output for N=18 was as follows:

Compiling and Submitting Jobs

OS:= $(shell uname)

# Set the dafault parameters N (order of matrix times 1000)

# Specify executable, source and required libraries :

# script file = check_N

if [ "$N" -lt 1 ] || [ "$N" -gt 40 ]

# script file = ed_parm_N

# script file = ed_procs_P

if [ `printenv CLU` = "nar" ] ; then

if [ `printenv CLU` = "saw" ] ; then

if [ $2 -lt 1 ] || [ $2 -gt $MXP ] ; then

ed $Filename <<EOF 2> ED_ERROR

# script file = sub_batch_DP_LAPACK_SOLVE_basic

sqsub -t -r 60m -q threaded -n nprocs -o ${CLU}_SERIAL_DP_SOLVE_%J ./DP_LAPACK_SOLVE

The Fortran and Other Files

integer :: root, ierr

parameter (N_PRT = 40)

integer :: context, iam, pnum

integer :: ITEM, IAorC, NH, NH1, DEBUG

subroutine init_my_matrix (n,a,nprow,npcol,myrow,mycol)

subroutine init_my_rhs (n,c,nprow,npcol,myrow,mycol)

subroutine elapsed_time(day_beg,hour_beg, minute_beg,second_beg,&

IF (2*NH .NE. N) THEN

IF ( N .gt. N_PRT ) DEBUG = 0

ITEM = 1000 + iam

call timestamp(date_time,day_beg,hour_beg,minute_beg,second_beg, &

call timestamp( date_time,day(milestone), hour(milestone), &

descr_milestone(milestone) = "Init blacs_gridinit, allocate"

IF (IAM .EQ. 0) THEN

allocate (c(N,1), stat=istat)

allocate (ipiv (n), stat=istat)

mem_allocated = 8*N*N + 8*N + 4*N

IF (IAM .EQ. 0) THEN

descr_milestone(milestone) = "call dgetrf"

if (iam.eq.0 .AND. DEBUG .eq. 1) &

descr_milestone(milestone) = "call dgetrs"

call dgetrs( 'N', N, 1, A, N, ipiv, c, N, info )

if (iam.eq.0 .AND. DEBUG .eq. 1) &

descr_milestone(milestone) = "get_solution --> root"

! For large N print only last N_PRT elements of vector

mem_allocated = 8NN + 8N + 8N

mem_allocated = 8NN + 8N + 4N