All content following this page was uploaded by Daniel Walter Schmid on 26 August 2016.
[1] The finite element method (FEM) combined with unstructured meshes forms an elegant and versatile
approach capable of dealing with the complexities of problems in Earth science. Practical applications
often require high-resolution models that necessitate advanced computational strategies. We therefore
developed "Million a Minute" (MILAMIN), an efficient MATLAB implementation of FEM that is
capable of setting up, solving, and postprocessing two-dimensional problems with one million unknowns
in one minute on a modern desktop computer. MILAMIN allows the user to achieve numerical resolutions
that are necessary to resolve the heterogeneous nature of geological materials. In this paper we provide the
technical knowledge required to develop such models without the need to buy a commercial FEM package,
write code in a compiled language, or hire a computer specialist. It has been our special aim that all
the components of MILAMIN perform efficiently individually and as a package. While some of the
components rely on readily available routines, we develop others from scratch and make sure that all of
them work together efficiently. One of the main technical focuses of this paper is the optimization of the
global matrix computations. The performance bottlenecks of the standard FEM algorithm are analyzed. An
alternative approach is developed that sustains high performance for any system size. Applied
optimizations eliminate Basic Linear Algebra Subprograms (BLAS) drawbacks when multiplying small
matrices, reduce operation count and memory requirements when dealing with symmetric matrices, and
increase data transfer efficiency by maximizing cache reuse. Applying loop interchange allows us to use
BLAS on large matrices. In order to avoid unnecessary data transfers between RAM and CPU cache we
introduce loop blocking. The optimization techniques are useful in many areas as demonstrated with our
MILAMIN applications for thermal and incompressible flow (Stokes) problems. We use these to provide
performance comparisons to other open source as well as commercial packages and find that MILAMIN is
among the best performing solutions, in terms of both speed and memory usage. The corresponding
MATLAB source code for the entire MILAMIN, including input generation, FEM solver, and
postprocessing, is available from the authors (http://www.milamin.org) and can be downloaded as
auxiliary material.
Dabrowski, M., M. Krotkiewski, and D. W. Schmid (2008), MILAMIN: MATLAB-based finite element method solver for
large problems, Geochem. Geophys. Geosyst., 9, Q04030, doi:10.1029/2007GC001719.
2 of 24
Geochemistry 3
Geophysics
Geosystems G dabrowski et al.: milamin matlab-based fem solver 10.1029/2007GC001719
lined goals. The mesh generator chosen is Triangle developed by J. R. Shewchuk (version 1.6, http://www.cs.cmu.edu/~quake/triangle.html; Shewchuk, 2007). Triangle is extremely versatile and stable, and consists of one single file that can be compiled into an executable on all platforms with a standard C compiler. We choose the executable-based file I/O approach, which has the advantage that we can always reuse a saved mesh. The disadvantage is that the ASCII file I/O provided by Triangle is rather slow, which can be overcome by adding binary file I/O as described in the instructions provided in the MILAMIN code repository.

The basic two-dimensional element is a triangle. In the thermal problem discrete temperature values are defined for the nodal points, which can be associated with element vertices, located on its edges, or even reside inside the elements. Introducing shape functions N_i that interpolate temperatures from the nodes T_i to the domains of neighboring elements, an approximation to the temperature field T̃ in Ω is defined as

\[
\tilde{T}(x,y) = \sum_{i=1}^{nnod} N_i(x,y)\,T_i \tag{2}
\]

where nnod is the number of nodes in the discretized domain.
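The interpolation of equation (2) can be illustrated on a single element. The sketch below assumes a 3-node linear triangle with barycentric shape functions, which is simpler than the higher-order elements used in the paper. The names `xy`, `T_nodes` and `shape_lin` are hypothetical and introduced only for this illustration.

```matlab
% Nodal coordinates of one triangle (columns are nodes) and nodal temperatures.
% Hypothetical data, chosen only for illustration.
xy      = [0 1 0;      % x coordinates of the three nodes
           0 0 1];     % y coordinates of the three nodes
T_nodes = [10; 20; 40];

% Linear shape functions N_i(x,y): they satisfy
%   sum_i N_i = 1,  sum_i N_i*x_i = x,  sum_i N_i*y_i = y,
% which is the 3x3 linear system solved below.
shape_lin = @(x, y) [ones(1,3); xy] \ [1; x; y];

% Evaluate the interpolated temperature at the element centroid.
N       = shape_lin(1/3, 1/3);
T_tilde = N' * T_nodes;        % sum_i N_i(x,y) * T_i, as in equation (2)
```

Each shape function equals one at its own node and zero at the other two, so evaluating `shape_lin` at a node returns the corresponding unit vector and the interpolant reproduces the nodal value there.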
conditions are given as constrained velocity or vanishing traction components. In equation (4) we use the divergence rather than Laplace form (in the latter different velocity components are only coupled through the incompressibility constraint) as we expect to deal with strongly varying viscosity. It is also worth noting that even for homogeneous models the computationally advantageous Laplace form may lead to serious defects if the boundary terms are not treated adequately [Limache et al., 2007]. Additionally our formulation, equation (4), and its numerical implementation are also applicable to compressible and incompressible elastic problems due to the correspondence principle.

[14] In analogy to the thermal problem we introduce the discrete spaces to approximate the velocity components and pressure:

\[
\begin{aligned}
\tilde{u}_x(x,y) &= \sum_{i=1}^{nnod} N_i(x,y)\,u_x^i \\
\tilde{u}_y(x,y) &= \sum_{i=1}^{nnod} N_i(x,y)\,u_y^i \\
\tilde{p}(x,y) &= \sum_{i=1}^{np} \Pi_i(x,y)\,p^i
\end{aligned}
\tag{5}
\]

where np denotes the number of pressure degrees of freedom and Π_i are the pressure shape functions, which may not coincide with the velocity ones. To ensure the solvability of the resulting system of equations (inf-sup condition [see Elman et al., 2005]), special care must be taken when constructing the approximation spaces. A wrong choice of the pressure and velocity discretization results in spurious pressure modes that may seriously pollute the numerical solution. Our particular element choice is the seven-node Crouzeix-Raviart triangle with quadratic velocity shape functions enhanced by a cubic bubble function and discontinuous linear interpolation for the pressure field [e.g., Cuvelier et al., 1986]. This element is stable and no additional stabilization techniques are required [Elman et al., 2005]. The fact that in our case the velocity and pressure approximations are autonomous leads to the so-called mixed formulation of the finite element method [Brezzi and Fortin, 1991].

[15] With the convention that velocity degrees of freedom are followed by pressure ones in the local element numbering, the stiffness matrix for the Stokes problem is given by [e.g., Bathe, 1996]

\[
K^e =
\begin{pmatrix} A & Q^T \\ Q & -\kappa^{-1} M \end{pmatrix}
= \iint_{\Omega^e}
\begin{pmatrix}
\mu^e B^T D B & -B_{vol}^T \Pi^T \\
-\Pi B_{vol} & -\kappa^{-1} \Pi \Pi^T
\end{pmatrix}
dx\,dy \tag{6}
\]

where B is the so-called kinematic matrix transforming velocity into strain rate ε̇ (we use here the engineering convention for the shear strain rate)

\[
\begin{pmatrix}
\dot{\varepsilon}_{xx}(x,y) \\ \dot{\varepsilon}_{yy}(x,y) \\ \dot{\gamma}_{xy}(x,y)
\end{pmatrix}
= B(x,y)\,u^e =
\begin{pmatrix}
\dfrac{\partial N_1}{\partial x}(x,y) & 0 & \cdots \\
0 & \dfrac{\partial N_1}{\partial y}(x,y) & \cdots \\
\dfrac{\partial N_1}{\partial y}(x,y) & \dfrac{\partial N_1}{\partial x}(x,y) & \cdots
\end{pmatrix}
\begin{pmatrix} u_x^1 \\ u_y^1 \\ \vdots \end{pmatrix} \tag{7}
\]

The matrix D extracts the deviatoric part of the strain rate, converts from engineering convention to standard shear strain rate, and includes a conventional factor 2. The bulk strain rate is computed according to the equation ε̇_vol = B_vol u^e and pressure is the projection of this field onto the pressure approximation space

\[
p(x,y) = \kappa\,\Pi^T(x,y)\,M^{-1} Q\,u^e \tag{8}
\]

With the chosen approximation spaces, the linear pressure shape functions Π are spanned by the corner nodal values that are defined independently for neighboring elements. Thus it is possible to invert M on element level (the so-called static condensation) and consequently avoid the pressure unknowns in the global system. Since the pressure part of the right-hand-side vector is set to zero, this results in the following velocity Schur complement:

\[
\left( A + \kappa\,Q^T M^{-1} Q \right) u^e = f^e \tag{9}
\]

Once the solution to the global counterpart of (9) is obtained, the pressure can be restored afterward according to (8). The resulting global system of equations is not only symmetric, but also positive-definite as opposed to the original system (6). Unfortunately, the global matrix becomes ill-conditioned for penalty parameter values corresponding to a satisfactorily low level of the flow divergence. It is possible to circumvent this by
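introducing Powell and Hestenes iterations [Cuvelier et al., 1986] and keeping the penalty parameter κ moderate compared to the viscosity μ:

\[
\begin{aligned}
& p^0 = 0 \\
& \textbf{while } \max \Delta p^i > tol \\
& \qquad u^i = \left( A + \kappa\,Q^T M^{-1} Q \right)^{-1} \left( f - Q^T p^i \right) \\
& \qquad \Delta p^i = \kappa\,M^{-1} Q\,u^i \\
& \qquad p^{i+1} = p^i + \Delta p^i \\
& \qquad \text{increment } i \\
& \textbf{end}
\end{aligned}
\tag{10}
\]

In the above iteration scheme the matrices A, Q, M

Thus the element matrix from equation (3) is now given by

\[
K_{ij}^e = \iint_{\Omega_{ref}} k^e \left( \frac{\partial N_i}{\partial x}\frac{\partial N_j}{\partial x} + \frac{\partial N_i}{\partial y}\frac{\partial N_j}{\partial y} \right) |J| \, d\xi\, d\eta \tag{15}
\]

where |J| is the determinant of the Jacobian, taking care of the area change introduced by the mapping, and Ω_ref is the domain of the reference element. To avoid symbolic integration, equation (15) can be integrated numerically:

\[
K_{ij}^e = \sum_{k=1}^{nip} W_k\, k^e \left. \left( \frac{\partial N_i}{\partial x}\frac{\partial N_j}{\partial x} + \frac{\partial N_i}{\partial y}\frac{\partial N_j}{\partial y} \right) |J| \, \right|_{(\xi_k,\eta_k)} \tag{16}
\]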
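The quadrature sum of equation (16) can be sketched for one element as follows. This is a minimal illustration, assuming a 3-node linear triangle and a 3-point rule on the reference triangle rather than the 6-node element and 6 integration points used in the paper. The variable names follow the paper's code description (ECOORD_X, dNdui, ED, K_elem), while the mesh data and the quadrature rule are assumptions made for this example.

```matlab
% One linear triangle with nodes (0,0), (1,0), (0,1); hypothetical data.
ECOORD_X = [0 1 0;                     % [ndim, nnodel] nodal coordinates
            0 0 1];
ED       = 2;                          % element conductivity k^e
nnodel   = 3;

% 3-point rule on the reference triangle: points (xi, eta) and weights W_k.
% The weights sum to the reference element area 1/2.
IP_X = [1/6 1/6; 2/3 1/6; 1/6 2/3];
IP_w = [1/6 1/6 1/6];
nip  = numel(IP_w);

K_elem = zeros(nnodel);
for ip = 1:nip
    % Derivatives of N1 = 1-xi-eta, N2 = xi, N3 = eta with respect to
    % (xi, eta). They are constant for a linear triangle, so IP_X(ip,:)
    % is not needed here; a higher-order element would evaluate them there.
    dNdui = [-1 -1; 1 0; 0 1];         % [nnodel, ndim]

    J    = ECOORD_X*dNdui;             % Jacobian of the mapping
    detJ = det(J);                     % |J|, area change of the mapping
    dNdX = dNdui/J;                    % same as dNdui*inv(J)

    % Accumulate W_k * k^e * (dNi/dx*dNj/dx + dNi/dy*dNj/dy) * |J|
    K_elem = K_elem + IP_w(ip)*ED*detJ*(dNdX*dNdX');
end
```

For this element the mapping is the identity, so detJ is 1 and the result is ED/2 times dNdui*dNdui'. Every row of K_elem sums to zero, which reflects that a constant temperature field produces no heat flux.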
arrays used during the matrix computation procedure are allocated in advance, e.g., K_all.

[20] ii.) Inside the loop over all elements the code begins with reading element-specific information, such as indices of the nodes belonging to the current element, coordinates of the nodes, and element conductivity, viscosity and density.

[21] iii.) For each element the following loop over integration points performs numerical integration of the underlying equations, which results in the element stiffness matrix K_elem[nnodel,nnodel].
In the case of mechanical code additional matrices A_elem[nedof,nedof], Q_elem[nedof,np] and M_elem[np,np] are required. All of the above arrays must be cleared before the integration point loop together with the right-hand-side vector Rhs_elem.

[22] iv.) Inside the integration point loop the precomputed shape function derivatives dNdui are extracted for the current integration point. b) In the chosen element type the pressure is interpolated linearly in the global coordinates. Pressure shape functions Pi at an integration point are obtained as a solution of the system P*Pi = Pb, where the first equation enforces that the shape functions Pi sum to unity.

[23] v.) The Jacobian J[ndim,ndim] is calculated for each integration point by multiplying the element's nodal coordinates matrix ECOORD_X[ndim,nnodel] by dNdui[nnodel,ndim]. Furthermore its determinant, detJ, and inverse, invJ[ndim,ndim], are obtained with the corresponding MATLAB functions.

[24] vi.) The derivatives with respect to global coordinates, dNdx[nnodel,ndim], are obtained by dNdx = dNdui*invJ according to equation (14).

[25] vii.) a) The element thermal stiffness matrix contribution is obtained according to equation (16) and implemented as K_elem = K_elem + weight*ED*(dNdX*dNdX'). b) The kinematic matrix B needs to be formed, equation (7), and A_elem, Q_elem and M_elem are computed according to equation (6).

[26] viii.) The pressure degrees of freedom are eliminated at this stage. It is possible to invert M_elem locally because the pressure degrees of freedom are not coupled across elements, thus there is no need to assemble them into the global system of equations. For large viscosity variations it is beneficial to relate the penalty factor PF to the element's viscosity to improve the condition number of the global matrix.

[27] ix.) The lower (incl. diagonal) part of the element stiffness matrix is written into the global storage relying on the symmetry of the system. b) Q_elem and invM_elem matrices are stored for each element in order to avoid recomputing them during Powell and Hestenes iterations.

[28] MATLAB provides a framework for scientific computing that is freed from the burden of conventional high-level programming languages, which require detailed variable declarations and do not provide native access to solvers, visualization, file I/O etc. However, the ease of code development in MATLAB comes with a loss of some performance, especially when certain recommended strategies are not followed: http://mathworks.com/support/solutions/data/1-15NM7.html. The more obvious performance considerations have already gone into the above standard implementation and we would like to point these out:

[29] 1. Memory allocation and explicit variable declaration have been performed. Although not formally required, it is advisable to explicitly declare variables, including their size and type. If variables are not declared with their final size, but are instead successively extended (filled in) during loop evaluation, a large penalty has to be paid for the continuous and unnecessary memory management. Hence, all variables that could potentially grow in size during loop execution are preallocated, e.g., K_all. Variables such as ELEM2NODE that only have to store integer numbers should be declared accordingly, int32 in the case of ELEM2NODE instead of MATLAB's default variable type double. This reduces both the amount of memory required to store this large array and the time required to access it since less data must be transferred.

[30] 2. Data layout has been optimized to facilitate memory access by the CPU. For example, the indices of the nodes of each element must be stored in neighboring memory locations, and similarly the x-y-z coordinates of every node. The actual numbering of nodes and elements also has a visible effect on cache reuse inside the element loop, similar to the sparse matrix-vector multiplication problem [Toledo, 1997].

[31] 3. Multiple data transfers and computations have been avoided. Generally, statements should appear in the outermost possible loop to avoid multiple transfer and computation of identical data. This is why the shape function derivatives with respect to local coordinates, evaluated at the integration points, are precomputed outside the element loop (as opposed to inside the integration loop) and the nodal coordinates are extracted before the integration loop.

4.2.2. Performance Analysis

[32] In order to analyze the performance of the standard matrix computation algorithm we run corresponding tests on an AMD Opteron system with 64 bit Red Hat Enterprise Linux 4 and
MATLAB 2007a using GoTo BLAS (http://www.tacc.utexas.edu/resources/software). This system has a peak performance of 4.4 gigaflops per core, i.e., it is theoretically capable of performing 4.4 billion double precision floating point operations per second (flops). The specific element types used are 6-node triangles (quadratic shape functions) with 6 integration points for the thermal problem and 7-node triangles with 12 integration points for the mechanical problem.

[33] In the thermal problem, results are obtained for an unstructured mesh consisting of approximately 1 million nodes and 0.5 million elements. For this model the previously described matrix computation took 65 s, during which 324 floating point operations per integration point per element were calculated. This corresponds to 15 Megaflops (Mflops) or approximately 0.4% of the peak performance. Analysis of the code with MATLAB's built-in profiler revealed that a significant amount of time was spent on the calculation of the determinant and inverse of the Jacobian. Therefore, in further tests these calls were replaced by explicit calculations of detJ and invJ. The final performance achieved by this algorithm was 30 Mflops, which is still less than one percent of the peak performance and equivalent to a peak CPU performance that was reached by commodity computers more than a decade ago.

[34] Profiling the improved standard algorithm revealed that most of the computational time was spent on matrix multiplications. This means that the efficiency of the analyzed implementation depends mainly on the efficiency of dense matrix by matrix multiplications inside the integration point loop. In order to perform these calculations MATLAB uses hardware-tuned, high-performance BLAS libraries (Basic Linear Algebra Subprograms; see http://www.netlib.org/blas/faq.html and Dongarra et al. [1990]), which reach up to 90% of the CPU peak performance; a value from which the analyzed code is far away.

[35] The cause of this poor performance is that the matrix by matrix multiplications inside the integration point loop operate on very small matrices, for which BLAS libraries are known not to work well due to the introduced overhead (e.g., http://math-atlas.sourceforge.net/timing/36v34/OptPerf.html). Therefore, the same observation can be made when writing the standard algorithm in a compiler language such as C and relying on BLAS for the matrix multiplications, although the actual performance in this case is higher than in MATLAB. In C a possible solution is to explicitly write out the small matrix by matrix multiplications, which results in a more efficient code. In MATLAB, however, this is not a practical alternative as explicitly writing out matrix multiplications leads to unreadable code without substantial performance gains. The above performance considerations apply equally to the mechanical code.

[36] In conclusion, the standard algorithm is a viable option when writing compiler code. However, the achievable performance in MATLAB is unsatisfactory so we developed a more efficient approach, which is presented in the following section.

[37] Remark 1: Measuring code performance

[38] Since no flops measure exists in MATLAB, the number of operations must be manually calculated on the basis of code inspection and divided by the computational time. To provide more meaningful performance measures, one could count only the necessary floating point operations; for example, the redundant computations of the upper triangular entries in the standard algorithm contribute to the flop count and artificially increase the measured performance. However, it is not necessarily the case that the algorithm with the lowest operation count is the fastest in terms of execution time. We refrain from adjusting the actual flop counts in this paper.

4.3. Matrix Computation: Optimized Algorithm

[39] In this section we explain how to efficiently compute the local stiffness matrices. This optimization strategy is common to both (thermal and mechanical) problems considered. For simplicity, we present it on the example of the thermal problem. Overall performance benchmarks and application examples are provided for both types of problems in subsequent sections.

[40] The small matrix by small matrix multiplications in the integration loop nested inside the loop over elements are the bottleneck of the standard algorithm. Written out in terms of loops, these matrix multiplications represent another three loops, totaling five. Since the element loop exhibits no data dependency, it can be moved into the innermost three (out of five), effectively becoming part of a small matrix by large matrix multiplication.

[41] This loop reordering does not change the total amount of operations. However, the number of
BLAS calls is greatly reduced (ndim*nip versus nel*nip in the standard approach), and the amount of computation done per function call is drastically increased. Consequently, the overhead problem vanishes, leading to a substantial performance improvement. Unfortunately, the performance decreases once a certain number of elements is exceeded. The reason for this is that the data required for the operation no longer fits into the CPU's cache. This inhibits cache reuse within the integration point loop. The remedy is to operate on blocks of elements of the size for which the observed performance is best. Once a block is processed, the results are written to the main memory and the data required by the next block is copied into the cache. Data required for every block should fit (reside) in the cache at all times. The ideal block size depends on the cache structure of a CPU and must be determined specifically for each system and problem. This computing strategy is called "blocking" and is implemented as a part of the optimized algorithm. Coincidentally, this entire approach to optimize the FEM matrix computation is similar to vector computer implementations [e.g., Ferencz and Hughes, 1998; Hughes et al., 1987; Silvester, 1988].

4.3.1. Algorithm Description

[42] Code Fragment 2 shows the implementation of the optimized matrix computation algorithm (see Figure 2). The key operations are explained and compared to the standard algorithm in the following.

[43] i.) The outermost loop of the optimized matrix computation is the block loop. Before this loop is entered, required arrays (IP_X, IP_w, dNdu) are assigned and necessary variables are allocated.

[44] ii.) Inside the block loop the code begins with reading element specific information. Since we simultaneously operate on nelblo elements, all the corresponding global data blocks are copied into local arrays ECOORD_x, ECOORD_y, and ED, and are used repeatedly inside the integration loop.

[45] iii.) For the entire block of elements, the loop over integration points performs numerical integration of the element matrices K_block[nelblo,nnodel*(nnodel+1)/2].

[46] iv.) As in the standard algorithm, every iteration of the integration point loop begins by reading precomputed dNdu arrays.

[47] v.) The Jacobian of the standard algorithm, J[ndim,ndim], is replaced by ndim matrices, Jx[nelblo,ndim] and Jy[nelblo,ndim], containing the individual rows of the Jacobian evaluated at the actual integration point for all elements of the current block. Jx and Jy are calculated by multiplying the nodal coordinates by the shape function derivatives, e.g. Jx[nelblo,ndim] = ECOORD_x[nnodel,nelblo]'*dNdui[nnodel,ndim]. Thus, instead of nelblo*nip matrix multiplications of dNdu[ndim,nnodel] and ECOORD_X[ndim,nnodel], ndim*nip multiplications involving the larger matrices ECOORD_x, ECOORD_y are performed, i.e. the same work is done with fewer multiplications of larger matrices. Once the Jacobian is obtained, its determinant, detJ, and inverse, split into invJx and invJy, are explicitly computed using simple operations on vectors.

[48] vi.) The derivatives with respect to the global coordinates (x, y), dNdx[nelblo,nnodel] and dNdy[nelblo,nnodel], are obtained by multiplying invJx and invJy by the transpose of dNdui. Again, fewer multiplication calls involving larger matrices are performed.

[49] vii.) The local stiffness matrix contribution for all the elements in the block, K_block[nelblo,nnodel*(nnodel+1)/2], is computed according to equation (16). Note that exploiting symmetry allows for calculation of only the lower triangle of stiffness matrices, which substantially reduces the operation count.

[50] viii.) After the numerical integration of K_block is completed, the results are written into the global storage K_all, again exploiting symmetry by storing only the lower triangular part.

[51] ix.) The number of elements remaining in the final block might be smaller than nelblo. Consequently, nelblo and several arrays must be adjusted.

4.3.2. Performance Analysis

[52] To illustrate the performance of the optimized matrix computation, systematic tests were run with the same 1 million node problem that was used for the performance analysis of the standard algorithm. Since larger matrices resulting from larger block sizes should yield better BLAS efficiency, the performance in Mflops is plotted versus the number of elements in a block; see Figure 3. This plot confirms the arguments for the introduction of the blocking algorithm. Starting from approximately the performance of the standard algorithm, a steady increase can be observed up to 350 Mflops, which on the test system is reached for a block with 1000 elements for the thermal problem. Further
Figure 2. Code Fragment 2 shows the optimized finite element global matrix computation.
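The listing of Code Fragment 2 itself is not reproduced in this text. The following is a minimal sketch of the blocked matrix computation described in steps i.) to ix.) of section 4.3.1, not the original fragment. It assumes 3-node linear triangles with a single integration point (so the integration point loop collapses to one pass) and a hypothetical three-element mesh. The names GCOORD, ELEM2NODE, ECOORD_x, ECOORD_y, ED, Jx, Jy, invJx, invJy, K_block, K_all and nelblo follow the paper's description; the array D and the loop bookkeeping (il, iu) are assumptions.

```matlab
% Hypothetical mesh of three linear triangles (illustration only).
GCOORD    = [0 1 1 0 2;                   % [ndim, nnod] node coordinates
             0 0 1 1 0];
ELEM2NODE = int32([1 1 2; 2 3 5; 3 4 3]); % [nnodel, nel] connectivity
D         = [1; 1; 1];                    % conductivity of each element
nel       = size(ELEM2NODE, 2);
nnodel    = 3;
nelblo    = 2;                            % block size, to be tuned per machine

dNdui = [-1 -1; 1 0; 0 1];  % [nnodel, ndim], constant for linear triangles
IP_w  = 0.5;                % weight of the single integration point
nloc  = nnodel*(nnodel+1)/2;
K_all = zeros(nloc, nel);   % lower-triangle entries, one column per element

il = 1;
while il <= nel
    iu    = min(il + nelblo - 1, nel);    % the last block may be shorter
    nodes = ELEM2NODE(:, il:iu);

    % Copy the data of the whole block into local arrays.
    ECOORD_x = reshape(GCOORD(1, nodes), nnodel, []);
    ECOORD_y = reshape(GCOORD(2, nodes), nnodel, []);
    ED       = D(il:iu);

    % Rows of the Jacobian for all elements of the block at once.
    Jx = ECOORD_x'*dNdui;                 % [nelblo, ndim]: dx/dxi, dx/deta
    Jy = ECOORD_y'*dNdui;                 % [nelblo, ndim]: dy/dxi, dy/deta

    % Determinant and inverse by explicit operations on vectors.
    detJ    = Jx(:,1).*Jy(:,2) - Jx(:,2).*Jy(:,1);
    invdetJ = 1./detJ;
    invJx   = [ Jy(:,2).*invdetJ, -Jy(:,1).*invdetJ];
    invJy   = [-Jx(:,2).*invdetJ,  Jx(:,1).*invdetJ];

    % Global derivatives for the whole block: [nelblo, nnodel].
    dNdx = invJx*dNdui';
    dNdy = invJy*dNdui';

    % Lower triangle of all element matrices, column by column.
    w       = IP_w*ED.*detJ;
    K_block = zeros(iu - il + 1, nloc);
    indx    = 1;
    for j = 1:nnodel
        for i = j:nnodel
            K_block(:,indx) = w.*(dNdx(:,i).*dNdx(:,j) + dNdy(:,i).*dNdy(:,j));
            indx = indx + 1;
        end
    end

    K_all(:, il:iu) = K_block';           % write block results to global storage
    il = iu + 1;
end
```

With nelblo = 2 and three elements the loop runs over one full block and one shorter final block, which exercises the adjustment mentioned in step ix.). With more integration points, the lines from the Jacobian down to the K_block update would sit inside a loop over ip and accumulate into K_block.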
increase of the block size leads to a performance decrease toward a stable level of 120 Mflops due to lack of cache reuse in the integration point loop. Compared to the standard version, the optimized matrix computation achieves a 20-fold speedup in terms of flops performance. Since the optimized algorithm performs fewer operations (computation of only lower triangular part of symmetric element matrix), its execution time is actually more than 30 times faster.

[53] The achieved 350 Mflops efficiency corresponds to only 8% of the peak CPU performance. Profiling the code revealed that for the test problem approximately half of the time was spent on reading and writing variables from and to RAM (e.g., nodal coordinates and element matrices). This value is constrained by the memory bandwidth of the hardware, which on current computer architectures is often a bigger bottleneck than the CPU performance. Compared to C implementations, the optimized matrix computation performance is better than the straightforward standard algorithm using BLAS, but more than a factor 3
slower than what can be achieved by explicitly writing out the matrix multiplications.

[54] In the mechanical code, the peak flops performance is similar. Note that in this case the optimal block size is smaller due to the larger workspace of the method; see Figure 3.

4.4. Matrix Assembly: Triplet to Sparse Format Conversion

[55] The element stiffness matrices stored in K_all must be assembled into the global stiffness matrix K. The row and column indices (K_i and K_j) that specify where the individual entries of K_all have to be stored in the global system are commonly known as the triplet sparse matrix format [e.g., Davis, 2006]. Since we only use lower triangular entries, special care must be taken so that the indices referring to the upper triangle are not created; see Code Fragment 3. Note that K_i and K_j hold duplicate entries, and the purpose of the MATLAB sparse function is to sum and eliminate them.

[56] While creation of the triplet format is fast, the call to sparse raises some concerns. MATLAB's sparse implementation requires that K_i and K_j are of type double, which is inefficient in terms of both memory and performance. In addition, sparse itself is rather slow, especially if compared to the time spent on the entire matrix computation. The equivalent function sparse2, provided by T. A. Davis within the CHOLMOD package (http://www.cise.ufl.edu/research/sparse/SuiteSparse), is substantially faster and does not require a conversion of the coefficients to double precision. Code Fragment 3 presents in detail how to create a global system matrix.

[57] Code Fragment 3 shows the global sparse matrix assembly.

    % CREATE TRIPLET FORMAT INDICES
    indx_j = repmat(1:nnodel,nnodel,1);
    indx_i = indx_j';
    indx_i = tril(indx_i);
    indx_i = indx_i(:);
    indx_i = indx_i(indx_i>0);
    indx_j = tril(indx_j);
    indx_j = indx_j(:);
    indx_j = indx_j(indx_j>0);

    K_i = ELEM2NODE(indx_i,:);
    K_j = ELEM2NODE(indx_j,:);

    K_i = K_i(:);
    K_j = K_j(:);

    % SWAP INDICES REFERRING TO UPPER TRIANGLE
    indx = K_i < K_j;
    tmp = K_j(indx);
Figure 4. Performance analysis of the different steps of the Cholesky algorithm with different reorderings for our
one million degrees of freedom thermal test problem.
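The steps compared in Figure 4 (reordering, factorization, forward and back substitution) can be sketched with MATLAB's built-in functions. This is only an illustration under assumptions: a small 1-D Laplacian stands in for the global matrix, and the built-in amd, chol and backslash operators stand in for the SuiteSparse routines discussed in the text.

```matlab
% Small symmetric positive-definite test matrix (assumption: 1-D Laplacian).
n   = 50;
e   = ones(n, 1);
K   = spdiags([-e 2*e -e], -1:1, n, n);
Rhs = ones(n, 1);

% Step 1: fill-reducing reordering (approximate minimum degree).
perm = amd(K);

% Step 2: Cholesky factorization of the permuted matrix, K(perm,perm) = L*L'.
L = chol(K(perm, perm), 'lower');

% Step 3: forward and back substitution with the permuted right-hand side.
T       = zeros(n, 1);
T(perm) = L' \ (L \ Rhs(perm));

% The permutation and the factor can be reused for another right-hand side.
Rhs2     = (1:n)';
T2       = zeros(n, 1);
T2(perm) = L' \ (L \ Rhs2(perm));
```

Only step 3 is repeated for each new right-hand side, which is why reusing perm and L pays off when many solves share the same matrix.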
[66] However, it is best to use CHOLMOD and the related parts by installing the entire package from the developer's SuiteSparse Web site (http://www.cise.ufl.edu/research/sparse/SuiteSparse). This provides access to cholmod2, which can work with only the upper triangular part of the input matrix and with precomputed permutation (reordering) vectors. SuiteSparse also contains lchol, a Cholesky factorization operating only on lower triangular matrices, which is faster and more memory efficient than MATLAB's chol equivalent. Reusing the Cholesky factor L during the Powell and Hestenes iterations in the mechanical problem greatly reduces the computational cost of achieving a divergence-free flow solution.

[67] The mentioned reuse of reordering data is possible as long as the mesh topology remains identical, which even in our large strain flow calculations is the case for many time steps. The reordering step decreases factorization fill-in and consequently improves memory and CPU efficiency [Davis, 2006], but is a rather costly operation compared to the rest of the Cholesky algorithm. Different reordering schemes can be used, and we compare two of them in Figure 4: AMD (Approximate Minimum Degree) and METIS (http://glaros.dtc.umn.edu/gkhome/views/metis). While AMD is faster during the reordering steps, it results in slower Cholesky factorization and forward and back substitution. If the reordering can be reused for a large number of steps, it is recommended to rely on METIS, which is accessible in MATLAB through the SuiteSparse package.

4.7. Powell and Hestenes Iterations

[68] In the thermal code, the solution vector is obtained by calling forward and back substitution routines with the Cholesky factor and the adequately permuted right-hand-side vector. During the second substitution phase the upper Cholesky factor is required. However, instead of explicitly forming it through the transposition of the stored lower factor, it is advantageous to call cs_ltsolve, which operates on the lower factor and performs the needed back substitution.

[69] In the MILAMIN flow solver the incompressibility constraint is achieved through an iterative penalty method, i.e., the bulk part of the deformation is suppressed with a large bulk modulus (penalty parameter) κ. In a single step penalty method there is a trade-off between the incompressibility of the flow solution and the condition number of the global equation system. This can
be avoided by using a relatively small k, which ensures a good condition number, and then
iteratively improving incompressibility of the flow. Note that for the chosen Crouzeix-Raviart
element, pressure is discontinuous between elements and the corresponding degrees of freedom can be
eliminated element-wise (no global system solution required). Pressure increments can be computed
with the velocity solution vector and stored Q and M^-1 matrices. These pressure increments are
sent to the right-hand side of the system and accumulated in the total pressure. The code fragment
for these so-called Powell and Hestenes iterations is given in Code Fragment 5.

[70] Code Fragment 5 shows the Powell and Hestenes iterations.

while (div_max>div_max_uz && uz_iter<uz_iter_max)
    uz_iter = uz_iter + 1;
    %FORWARD AND BACK SUBSTITUTION
    Vel(Free(perm)) = cs_ltsolve(L,cs_lsolve(L,Rhs(Free(perm))));
    %COMPUTE QUASI-DIVERGENCE
    Div = invM*(Q*Vel);

new arrangement, where physical nodes are listed separately for every element that accesses them.
The same can also be done for meshes other than triangular ones by creating the corresponding
connectivity (ELEM2NODE) and calling:

[73] Code Fragment 6 shows the postprocessor.

patch('faces', ELEM2NODE,'vertices', GCOORD','facevertexcdata',T);
shading interp;

6. MILAMIN Performance Analysis

6.1. Overall Performance

[74] The overall performance of MILAMIN versus the number of nodes is analyzed in Figure 5. The
goal of MILAMIN to perform a complete FEM analysis for one million unknowns in one minute is
reached for the thermal as well as the mechanical problem. All components of MILAMIN scale linearly
with the number of nodes; the only exception is the direct solver, which shows super-linear
scaling. The performance details are discussed in the following sections.
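The Powell and Hestenes iterations of Code Fragment 5 can be tried out on a small synthetic system.
The following is a minimal sketch under stated assumptions, not the MILAMIN source: the matrices A,
Q, and invM, the penalty factor PF, and the load vector f are made-up stand-ins; the right-hand side
and pressure updates follow the description in the text (pressure increments are sent to the
right-hand side and accumulated in the total pressure); and MATLAB's built-in chol and backslash
operator replace the cs_lsolve and cs_ltsolve calls.

```matlab
% Sketch of Powell and Hestenes iterations on a tiny synthetic system.
% A, Q, invM, PF and f are illustrative assumptions, not MILAMIN data.
n    = 6;                                   % number of velocity unknowns
A    = 2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);  % SPD block
Q    = [1 1 0 0 0 0; 0 0 0 1 -1 0];         % two discrete divergence rows
invM = eye(2);                              % inverse pressure mass matrix
PF   = 1e3;                                 % moderate penalty factor
f    = ones(n,1);                           % load vector

% Penalized system matrix; it stays symmetric positive definite.
K = A + PF*(Q'*invM*Q);
L = chol(K, 'lower');                       % factorize once, reuse below

Rhs         = f;
Pressure    = zeros(2,1);
div_max     = Inf;
div_max_uz  = 1e-8;                         % incompressibility tolerance
uz_iter     = 0;
uz_iter_max = 20;
while (div_max > div_max_uz && uz_iter < uz_iter_max)
    uz_iter  = uz_iter + 1;
    Vel      = L' \ (L \ Rhs);              % forward and back substitution
    Div      = invM*(Q*Vel);                % quasi-divergence
    Rhs      = Rhs - PF*(Q'*Div);           % send increment to right-hand side
    Pressure = Pressure + PF*Div;           % accumulate total pressure
    div_max  = max(abs(Div));
end
fprintf('iterations: %d, max divergence: %.2e\n', uz_iter, div_max);
```

Because the factor L is computed once, every additional iteration costs only one forward and back
substitution, which is the reason a moderate penalty factor combined with a few iterations can be
attractive.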
Figure 5. Overall performance results for MILAMIN given for total time spent on problem, and the direct solver
contribution.
[77] The second component of the Cholesky solver is the factorization. This step takes most of the
total MILAMIN execution time. However, the efficiency achieved by CHOLMOD is close to the optimal
CPU performance. For further optimization one could consider other types of solvers such as
iterative ones. Yet, preconditioned iterative methods or algebraic multigrid are less robust
(especially for large material contrasts as targeted here) and perform better only for large
systems; see section 6.4. These methods are the only option in the case of most three-dimensional
problems, because the scaling of factorization time and memory requirements for direct solvers is
much worse than in two dimensions. However, for two-dimensional problems direct solvers are the
best choice for resolutions on the order of one million degrees of freedom, especially for positive
definite systems that can be solved with Cholesky factorizations. Moreover, it is in problems of
this size where our optimizations greatly reduce the total solution time. Such numerical
resolutions are often sufficient in two dimensions to solve challenging problems and the achieved
performance allows for studies with a large number of time steps.

[78] The third part of the Cholesky solver is the forward and backward substitution and does not
contribute substantially in the case of thermal problems. For mechanical problems several Powell
and Hestenes iterations are required to enforce incompressibility, each issuing a forward and back
substitution call plus other computations. The time spent on the Powell and Hestenes iteration is not
Table 1. Performance Results for Different Software Packages for the Thermal Problem
Software | Matrix Computation and Assembly | Solve | Solver Type

Table 2. Performance Results for Different Software Packages for the Mechanical Problem
Software | Matrix Computation and Assembly | Solve | Solver Type

conditions representing a linearly varying temperature field.

[82] The software packages that entered the test are commercial finite element packages, ABAQUS
(SIMULIA, 6.6-1, http://www.simulia.com/products/abaqus_fea.html) and FEMLAB (COMSOL 3.3,
http://www.femlab.com), and open source packages FEAPpv (O. C. Zienkiewicz and R. L. Taylor, 2.0,
http://www.ce.berkeley.edu/rlt/feappv), OOFEM (B. Patzak, OOFEM 1.7, http://www.oofem.org), and
TOCHNOG (D. Roddeman, 11 February 2001, http://sourceforge.net/projects/tochnog) for compiler
languages, and AFEM@matlab (L. Chen and C. Zhang, http://www.mathworks.com/matlabcentral/fileexchange),
and IFISS (D. J. Silvester et al., 2.2, http://www.maths.manchester.ac.uk/djs/ifiss) for MATLAB.
For the solution stage we used a wide range of direct solvers, including UMFPACK (T. A. Davis,
http://www.cise.ufl.edu/research/sparse/umfpack), TAUCS (S. Toledo et al.,
http://www.tau.ac.il/stoledo/taucs), PARDISO (O. Schenk and K. Gärtner,
http://www.pardiso-project.org), SPOOLES (C. Ashcraft et al.,
http://www.netlib.org/linalg/spooles/spooles.2.2.html), CHOLMOD (T. A. Davis,
http://www.cise.ufl.edu/research/sparse/cholmod), and the MATLAB backslash operator (\). We also
compared different implementations of iterative solvers such as Conjugate Gradients preconditioned
with Jacobi (PCG), Symmetric Successive Over-Relaxation (SSOR-CG), Incomplete Cholesky (ICCG), and
Algebraic Multigrid (AMG-CG), and a Biconjugate Gradients solver preconditioned with Jacobi (BiCG).

[83] A number of other MATLAB-based packages are available, which, however, could not enter our
table because they are simply incapable of solving the test problem in a reasonable amount of time
and the amount of RAM available. Of the MATLAB packages that entered the performance
comparison AFEM excels with high performance. However, AFEM is specifically developed to operate
with linear triangles solving the Poisson problem. This allows AFEM to employ only one integration
point and the amount of work performed is substantially less than for isoparametric quadratic
elements, although the actual number of elements is higher for the test problem with a fixed number
of nodes. IFISS is another MATLAB-based package capable of solving Poisson and incompressible
Navier-Stokes problems on the basis of linear and quadratic quadrilateral meshes. Despite its aim
of being a vectorized code, the performance of IFISS is not optimal. This is partly due to a badly
performing boundary condition implementation. The matrix computation and assembly performance of
the compiler language and commercial codes is quite reasonable, with FEAP being the clear leader.
However, none of the tested packages is as fast for the matrix computation and assembly as the
optimized version of MILAMIN and even the standard version of MILAMIN is performing quite
reasonably in comparison.

[84] The analysis of the solver times confirms our previous statement that for the studied 2-D
problems direct solvers (CHOLMOD, UMFPACK, TAUCS, PARDISO, SPOOLES) are the best choice with
CHOLMOD being the best in the group. Iterative solvers, even if equipped with good preconditioners,
like incomplete Cholesky or AMG, are not competitive with respect to the direct solvers for the
targeted problem size.

[85] A performance comparison of MILAMIN for a mechanical test problem is given in Table 2. The
domain is again a box containing a circular hole (free surface) and a circular inclusion with a ten
times higher viscosity than the matrix. The outer boundaries are set to Dirichlet conditions
representing pure shear deformation. The number of available packages to solve incompressible
Stokes problems with heterogeneous material is greatly reduced compared to the thermal problem. In
fact the IFISS package is not capable of dealing with heterogeneous materials and we used here an
isoviscous model. In the case of FEMLAB we had to employ the special MEMS module, which provides an
incompressible Stokes application mode. However, even with this specialized module we were unable
to fit the test problem into the 2 GB of RAM and therefore the results are provided for a five times
smaller problem size. MILAMIN outperforms IFISS as well as FEMLAB both in terms of matrix
computation and assembly, and the solution time. The latter demonstrates that the iterative penalty
approach chosen in MILAMIN and the resulting possibility to use a Cholesky solver (symmetric and
positive definite system) are superior to other approaches.
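The direct solver stages discussed in this section (fill-reducing reordering, Cholesky
factorization, forward and back substitution) can be sketched with built-in MATLAB functions. This
is only an illustration under stated assumptions: the matrix is a finite-difference Laplacian on a
made-up 30 by 30 grid standing in for a FEM stiffness matrix, and amd, chol, and the backslash
operator are used here rather than the CHOLMOD and CSparse routines benchmarked above.

```matlab
% Sketch of a direct sparse Cholesky solve with fill-reducing reordering.
% The model matrix and grid size are assumptions for illustration only.
m = 30;                                       % grid points per direction
e = ones(m,1);
T = spdiags([-e 2*e -e], -1:1, m, m);         % 1-D second-difference matrix
A = kron(speye(m), T) + kron(T, speye(m));    % 2-D Laplacian, sparse SPD
b = ones(m*m, 1);

perm = amd(A);                                % 1) reordering
L    = chol(A(perm,perm), 'lower');           % 2) factorization, A(p,p) = L*L'
x       = zeros(m*m, 1);
x(perm) = L' \ (L \ b(perm));                 % 3) forward and back substitution

% Fill-in of the factor with and without reordering, for comparison.
nnz_amd     = nnz(L);
nnz_natural = nnz(chol(A, 'lower'));
relres      = norm(A*x - b) / norm(b);
fprintf('nnz(L): %d (AMD) vs %d (natural), residual %.1e\n', ...
        nnz_amd, nnz_natural, relres);
```

The number of nonzeros in L is a rough proxy for factorization work and memory, which is why the
reordering stage is worth its cost.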
Figure 9. Illustration of a one million node application problem modeled with MILAMIN. Steady state diffusion is
solved in a heterogeneous rock with channels of high conductivity. Heat flow is imposed by a horizontal thermal
gradient; i.e., T(left boundary) = 0, T(right boundary) = 1. Top and bottom boundary conditions are zero flux.
(a) Conductivity distribution. (b) Flux visualized by cones and colored by magnitude. Normalization versus flux in
homogeneous medium with conductivity of the channels. Background color represents the conductivity. Triangular
grid is the finite element mesh used for computation. Note that this picture only corresponds to a small subdomain of
Figure 9a (see square outline).
Figure 10. Mechanical application example. Circular inclusions in box subjected to vertical gravity field. Black
(heavy) and white (light) inclusions have the same density contrast with respect to the matrix. They are a hundred times
more viscous than the matrix. Figures 10a and 10b show (unsmoothed) pressure perturbations, Figures 10c and 10d
show maximum shear strain rate, and Figures 10e and 10f show the magnitude of the velocity field with superposed
velocity arrows (random positions). All values are normalized by the corresponding maximum value generated by a
single inclusion of the same size centered in the same box. Figures 10a, 10c, and 10e show the entire domain;
Figures 10b, 10d, and 10f show a zoom-in with superposed finite element mesh according to the white square.
employing the best available direct solver and reordering packages. MATLAB-specific optimizations
include proper memory management (preallocation of arrays) and data structures, explicit type
declaration for integer arrays, and efficient implementation of boundary conditions. In the case of
the mechanical application the chosen penalty method together with the particular element type
allows us to use the efficient Cholesky factorization to solve the incompressible flow problem. The
clear structure of the code serves the educational purposes well. The results of our software
comparison show that our standard version performs surprisingly efficiently even compared to
packages implemented in compiler languages.

[91] Furthermore, in our optimized version we have improved the efficiency of the stiffness matrix
calculations, which resulted in an overall execution speedup of approximately 4 times with respect
to the standard version. This has been done by minimizing the ratio of overhead (BLAS and MATLAB)
to computation. Another priority was to avoid unnecessary data transfers and promote cache reuse,
as memory speed is a major bottleneck on current computer architectures. Particular optimizations
to the matrix computation algorithm include (1) increased performance of the BLAS operations by
interchanging loops and operating on large matrices, (2) reducing the total operation count by
exploiting the symmetry of the system, and (3) facilitating cache reuse through the introduction of
blocking.

[92] Our implementation of the matrix computation achieves a sustained performance of 350 Mflops
for any system size. Any further performance improvements to this part of the code are irrelevant,
since even for the smallest systems the matrix computation now takes only a fraction of the total
solution time, with the solver being the bottleneck.

[93] By paying attention to the strategies outlined in this article, MATLAB-based MILAMIN can not
only be used as a development and prototype tool, but also as a production tool for the analysis of
two-dimensional problems with millions of unknowns within minutes. The complete MILAMIN source code
is available from the authors and can be downloaded as auxiliary material (see Software S1).

Appendix A

[94] Table A1 lists the variables used throughout the paper and in the code to facilitate its
understanding. Variable names, their sizes, and short descriptions are given.

Acknowledgments

[95] This work was supported by the Norwegian Research Council through a Centre of Excellence grant
to PGP. We would like to thank Tim Davis, the author of the SuiteSparse package, for making this
large suite of tools available and giving us helpful comments. We would also like to thank J. R.
Shewchuk for making the mesh generator Triangle freely available. We are grateful to Antje Keller
for her help regarding code benchmarking. We thank Galen Gisler for improving the English. The
manuscript benefited from the reviews by Boris Kaus and Eh. Tan and the editorial work of Peter van
Keken. Finally, we would like to thank Yuri Podladchikov for his never-ending enthusiasm and
stimulation.

References

Alberty, J., et al. (1999), Remarks around 50 lines of Matlab: Short finite element implementation, Numer. Algorithms, 20, 117–137.
Bathe, K.-J. (1996), Finite Element Procedures, vol. XIV, 1037 pp., Prentice-Hall, London.
Brezzi, F., and M. Fortin (1991), Mixed and Hybrid Finite Elements Methods, vol. ix, 350 pp., Springer, New York.
Cuvelier, C., et al. (1986), Finite Element Methods and Navier-Stokes Equations, vol. XVI, 483 pp., D. Reidel, Dordrecht, Netherlands.
Davies, D. R., et al. (2007), Investigations into the applicability of adaptive finite element methods to two-dimensional infinite Prandtl number thermal and thermochemical convection, Geochem. Geophys. Geosyst., 8, Q05010, doi:10.1029/2006GC001470.
Davis, T. A. (2006), Direct Methods for Sparse Linear Systems, Soc. for Ind. and Appl. Math., Philadelphia, Pa.
Davis, T. A., and W. W. Hager (2005), Row modifications of a sparse Cholesky factorization, SIAM J. Matrix Anal. Appl., 26, 621–639.
Dongarra, J. J., et al. (1990), A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Software, 16, 1–17.
Dunavant, D. A. (1985), High degree efficient symmetrical Gaussian quadrature rules for the triangle, Int. J. Numer. Methods Eng., 21, 1129–1148.
Elman, H. C., et al. (2005), Finite Elements and Fast Iterative Solvers With Applications in Incompressible Fluid Dynamics, 400 pp., Oxford Univ. Press, New York.
Ferencz, R. M., and T. J. R. Hughes (1998), Implementation of element operations, in Handbook of Numerical Analysis, edited by P. G. Ciarlet and J. L. Lions, pp. 39–52, Elsevier, New York.
Fletcher, C. A. J. (1997), Computational Techniques for Fluid Dynamics, 3rd ed., Springer, Berlin.
Gould, N. I. M., et al. (2007), A numerical evaluation of sparse direct solvers for the solution of large sparse symmetric linear systems of equations, ACM Trans. Math. Software, 33(2), article 10, doi:10.1145/1206040.1206043.
Hughes, T. J. R. (2000), The Finite Element Method: Linear Static and Dynamic Finite Element Analysis, vol. XXII, 682 pp., Dover, Mineola, N. Y.
Hughes, T. J. R., et al. (1987), Large-scale vectorized implicit calculations in solid mechanics on a Cray X-MP/48 utilizing EBE preconditioned conjugate gradients, Comput. Methods Appl. Mech. Eng., 61, 215–248.
Kwon, Y. W., and H. Bang (2000), The Finite Element Method Using MATLAB, 2nd ed., 607 pp., CRC Press, Boca Raton, Fla.
Limache, A., et al. (2007), The violation of objectivity in Laplace formulations of the Navier-Stokes equations, Int. J. Numer. Methods Fluids, 54, 639–664.
Pelletier, D., et al. (1989), Are FEM solutions of incompressible flows really incompressible? (or how simple flows can cause headaches!), Int. J. Numer. Methods Fluids, 9, 99–112.
Persson, P. O., and G. Strang (2004), A simple mesh generator in MATLAB, SIAM Rev., 46, 329–345.
Pozrikidis, C. (2005), Introduction to Finite and Spectral Element Methods Using MATLAB, 653 pp., CRC Press, Boca Raton, Fla.
Sigmund, O. (2001), A 99 line topology optimization code written in Matlab, Struct. Multidisciplinary Optim., 21, 120–127.
Silvester, D. J. (1988), Optimizing finite-element matrix calculations using the general technique of element vectorization, Parallel Comput., 6, 157–164.
Toledo, S. (1997), Improving the memory-system performance of sparse-matrix vector multiplication, IBM J. Res. Dev., 41, 711–725.
Wesseling, P. (1992), An Introduction to Multigrid Methods, 284 pp., John Wiley, Chichester, N. Y.
Zienkiewicz, O. C., and R. L. Taylor (2000), The Finite Element Method, 5th ed., Butterworth-Heinemann, Oxford, U. K.