

Technical Brief

Geochemistry Geophysics Geosystems: An Electronic Journal of the Earth Sciences
Volume 9, Number 4, 23 April 2008, Q04030, doi:10.1029/2007GC001719
ISSN: 1525-2027
Published by AGU and the Geochemical Society

MILAMIN: MATLAB-based finite element method solver for large problems
M. Dabrowski, M. Krotkiewski, and D. W. Schmid
Physics of Geological Processes, University of Oslo, Pb 1048 Blindern, N-0316 Oslo, Norway (marcind@fys.uio.no)

[1] The finite element method (FEM) combined with unstructured meshes forms an elegant and versatile
approach capable of dealing with the complexities of problems in Earth science. Practical applications
often require high-resolution models that necessitate advanced computational strategies. We therefore
developed ‘‘Million a Minute’’ (MILAMIN), an efficient MATLAB implementation of FEM that is
capable of setting up, solving, and postprocessing two-dimensional problems with one million unknowns
in one minute on a modern desktop computer. MILAMIN allows the user to achieve numerical resolutions
that are necessary to resolve the heterogeneous nature of geological materials. In this paper we provide the
technical knowledge required to develop such models without the need to buy a commercial FEM package, program
compiler-language code, or hire a computer specialist. It has been our special aim that all
the components of MILAMIN perform efficiently individually and as a package. While some of the
components rely on readily available routines, we develop others from scratch and make sure that all of
them work together efficiently. One of the main technical focuses of this paper is the optimization of the
global matrix computations. The performance bottlenecks of the standard FEM algorithm are analyzed. An
alternative approach is developed that sustains high performance for any system size. Applied
optimizations eliminate Basic Linear Algebra Subprograms (BLAS) drawbacks when multiplying small
matrices, reduce operation count and memory requirements when dealing with symmetric matrices, and
increase data transfer efficiency by maximizing cache reuse. Applying loop interchange allows us to use
BLAS on large matrices. In order to avoid unnecessary data transfers between RAM and CPU cache we
introduce loop blocking. The optimization techniques are useful in many areas as demonstrated with our
MILAMIN applications for thermal and incompressible flow (Stokes) problems. We use these to provide
performance comparisons to other open source as well as commercial packages and find that MILAMIN is
among the best performing solutions, in terms of both speed and memory usage. The corresponding
MATLAB source code for the entire MILAMIN, including input generation, FEM solver, and
postprocessing, is available from the authors (http://www.milamin.org) and can be downloaded as
auxiliary material.

Components: 11,344 words, 10 figures, 3 tables, 1 animation.


Keywords: numerical models; FEM; earth science; diffusion; incompressible Stokes; MATLAB.
Index Terms: 0545 Computational Geophysics: Modeling (4255); 0560 Computational Geophysics: Numerical solutions
(4255); 0850 Education: Geoscience education research.
Received 12 June 2007; Revised 26 October 2007; Accepted 11 December 2007; Published 23 April 2008.

Dabrowski, M., M. Krotkiewski, and D. W. Schmid (2008), MILAMIN: MATLAB-based finite element method solver for
large problems, Geochem. Geophys. Geosyst., 9, Q04030, doi:10.1029/2007GC001719.

Copyright 2008 by the American Geophysical Union.



1. Introduction

[2] Geological systems are often formed by multiphysics processes interacting on many temporal and spatial scales. Moreover, they are heterogeneous and exhibit large material property contrasts. In order to understand and decipher these systems, numerical models are frequently employed. Appropriate resolution of the behavior of these heterogeneous systems, without the (over)simplifications of a priori applied homogenization techniques, requires numerical models capable of efficiently and accurately dealing with high-resolution, geometry-adapted meshes. These criteria are usually used to justify the need for special purpose software (commercial finite element method (FEM) packages) or special code development in high-performance compiler languages such as C or FORTRAN. General purpose packages like MATLAB are usually considered not efficient enough for this task. This is reflected in the current literature. MATLAB is treated as an educational tool that allows for fast learning when trying to master numerical methods, e.g., the books by Kwon and Bang [2000], Elman et al. [2005], and Pozrikidis [2005]. MATLAB also facilitates very short implementations of numerical methods that give overview and insight, which is impossible to obtain when dealing with closed black-box routines, e.g., finite elements on 50 lines [Alberty et al., 1999], topology optimization on 99 lines [Sigmund, 2001], and mesh generation on one page [Persson and Strang, 2004]. However, while advantageous from an educational standpoint, these implementations are usually rather slow and run at a speed that is a fraction of the peak performance of modern computers. Therefore the usual approach is to use MATLAB for prototyping, development, and testing only. This is followed by an additional step where the code is manually translated to a compiler language to achieve the memory and CPU efficiency required for high-resolution models.

[3] This paper presents the outcome of a project called "MILAMIN - MILlion A MINute" aimed at developing a MATLAB-based FEM package capable of preprocessing, processing, and postprocessing an unstructured mesh problem with one million degrees of freedom in two dimensions within one minute on a commodity personal computer. Choosing a native MATLAB implementation allows simultaneously for educational insight, easy access to computational libraries and visualization tools, rapid prototyping and development, as well as actual two-dimensional production runs. Our standard implementation serves to provide educational insight into subjects such as implementation of the numerical method, efficient use of the computer architecture and computational libraries, code structuring, proper data layout, and solution techniques. We also provide an optimized FEM version that increases the performance of production runs even further, but at the cost of code clarity.

[4] The MATLAB code implementing the different approaches discussed here is available from the authors (http://www.milamin.org) and can be downloaded as auxiliary material (see Software S1; auxiliary materials are available in the HTML, doi:10.1029/2007GC001719).

2. Code Overview

[5] A typical finite element code consists of three basic components: preprocessor, processor, and postprocessor. The main component is the processor, which is the actual numerical model that implements a discretized version of the governing conservation equations. The preprocessor provides all the input data for the processor; in the present case the main work is to generate an unstructured mesh for a given geometry. The task of the postprocessor is to analyze and visualize the results obtained by the processor. These three components of MILAMIN are documented in the following sections.

3. Preprocessor

[6] Geometrically complex problems promote the use of interface adapted meshes, which accurately resolve the input geometry and are typically created by a mesh generator that automatically produces a quality mesh. The drawback of this approach is that one cannot exploit the advantages of solution strategies for structured meshes, such as operator splitting methods (e.g., ADI [Fletcher, 1997]) or geometric multigrid [Wesseling, 1992] for efficient computation.

[7] A number of mesh generators are freely available. Yet, none of these are written in native MATLAB and fulfill the requirement of automated quality mesh generation for multiple domains. DistMesh by Persson and Strang [2004] is an interesting option as it is simple, elegant, and written entirely in MATLAB. However, lack of speed and proper multidomain support renders it unsuitable for a production code with the outlined goals. The mesh generator chosen is Triangle, developed by J. R. Shewchuk (version 1.6, http://www.cs.cmu.edu/~quake/triangle.html; Shewchuk, 2007). Triangle is extremely versatile and stable, and consists of one single file that can be compiled into an executable on all platforms with a standard C compiler. We choose the executable-based file I/O approach, which has the advantage that we can always reuse a saved mesh. The disadvantage is that the ASCII file I/O provided by Triangle is rather slow, which can be overcome by adding binary file I/O as described in the instructions provided in the MILAMIN code repository.
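As an illustration of the file-based workflow, the sketch below drives Triangle from MATLAB for a simple box geometry and reads the resulting mesh back in. It is only a sketch: the geometry, file names, and quality switches are invented for the example, and it assumes a compiled triangle executable is available on the system path.

% Minimal sketch: generate and read a quality mesh with the Triangle executable.
node = [0 0; 1 0; 1 1; 0 1];                     % polygon vertices (unit box)
segm = [1 2; 2 3; 3 4; 4 1];                     % closed boundary segments

fid = fopen('box.poly','w');                     % write Triangle's .poly input
fprintf(fid,'%d 2 0 0\n', size(node,1));
fprintf(fid,'%d %f %f\n', [(1:size(node,1))' node]');
fprintf(fid,'%d 0\n', size(segm,1));
fprintf(fid,'%d %d %d\n', [(1:size(segm,1))' segm]');
fprintf(fid,'0\n');                              % no holes
fclose(fid);

% -p: planar straight line graph, -q30: minimum angle 30 degrees,
% -a0.001: maximum triangle area, -o2: quadratic (6-node) triangles
system('triangle -pq30a0.001o2 box.poly');

fid  = fopen('box.1.node','r');                  % read nodes (ASCII I/O)
hdr  = fscanf(fid,'%d',4);                       % nnod, dim, #attributes, #markers
data = fscanf(fid,'%f',[3+hdr(3)+hdr(4) hdr(1)]);
fclose(fid);
GCOORD = data(2:3,:);                            % 2 x nnod coordinate array

fid  = fopen('box.1.ele','r');                   % read connectivity
hdr  = fscanf(fid,'%d',3);                       % nel, nodes per element, #attributes
data = fscanf(fid,'%d',[1+hdr(2)+hdr(3) hdr(1)]);
fclose(fid);
ELEM2NODE = int32(data(2:1+hdr(2),:));           % nnodel x nel connectivity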

4. Processor

4.1. FEM Outline

[8] In this paper we show two different physical applications of MILAMIN: steady state thermal problems and incompressible Stokes flow (referred to as the mechanical problem). This section provides an outline of the governing equations and their corresponding FEM formulation. The numerical implementation and performance discussions follow in subsequent sections.

4.1.1. Thermal Problem

[9] The strong form of the steady state thermal diffusion in the two-dimensional domain \Omega is

\[ \frac{\partial}{\partial x}\left(k\,\frac{\partial T}{\partial x}\right) + \frac{\partial}{\partial y}\left(k\,\frac{\partial T}{\partial y}\right) = 0 \quad \text{in } \Omega \tag{1} \]

where T is temperature, k is the conductivity, and x and y are Cartesian coordinates. The boundary \Gamma of \Omega is divided into two nonintersecting parts: \Gamma = \Gamma_N \cup \Gamma_D. Zero heat flux is specified on \Gamma_N (Neumann boundary condition) and temperature T is prescribed on \Gamma_D (Dirichlet boundary condition).

[10] The FEM is based on the weak (variational) formulation of partial differential equations, taking an integral form. For the purpose of this paper we only introduce the basic concepts of this method that are important from an implementation viewpoint. A detailed derivation of the finite element method and a description of the weak formulation of PDEs can be found in textbooks [e.g., Bathe, 1996; Hughes, 2000; Zienkiewicz and Taylor, 2000].

[11] In FEM, the domain \Omega is partitioned into nonoverlapping element subdomains \Omega_e, i.e., \Omega = \bigcup_{e=1}^{n_{el}} \Omega_e, where n_{el} denotes the number of elements. The basic two-dimensional element is a triangle. In the thermal problem discrete temperature values are defined for the nodal points, which can be associated with element vertices, located on its edges, or even reside inside the elements. Introducing shape functions N_i that interpolate temperatures from the nodes T_i to the domains of neighboring elements, an approximation to the temperature field \tilde{T} in \Omega is defined as

\[ \tilde{T}(x,y) = \sum_{i=1}^{n_{nod}} N_i(x,y)\,T_i \tag{2} \]

where n_{nod} is the number of nodes in the discretized domain.

[12] On the basis of the weak formulation that takes the form of an integral over \Omega, the problem can now be stated in terms of a system of linear equations. From a computational point of view it is beneficial to evaluate this integral as a sum of integrals over each element \Omega_e. A single element contribution, the so-called "element stiffness matrix," to the global system matrix in the Galerkin approach for the thermal problem is given by

\[ K^e_{ij} = \iint_{\Omega_e} k^e \left(\frac{\partial N_i}{\partial x}\frac{\partial N_j}{\partial x} + \frac{\partial N_i}{\partial y}\frac{\partial N_j}{\partial y}\right) dx\,dy \tag{3} \]

where k^e is the element specific conductivity. Note that the shape function index in equation (3) corresponds to local numbering of element nodes and must be converted to global node numbers before the element matrix K^e is assembled into the global matrix K.

4.1.2. Mechanical Problem

[13] The strong form of the plane strain Stokes flow in \Omega is

\[ \begin{aligned} \frac{\partial}{\partial x}\left[\mu\left(\frac{4}{3}\frac{\partial u_x}{\partial x} - \frac{2}{3}\frac{\partial u_y}{\partial y}\right)\right] + \frac{\partial}{\partial y}\left[\mu\left(\frac{\partial u_x}{\partial y} + \frac{\partial u_y}{\partial x}\right)\right] - \frac{\partial p}{\partial x} &= f_x \\ \frac{\partial}{\partial y}\left[\mu\left(\frac{4}{3}\frac{\partial u_y}{\partial y} - \frac{2}{3}\frac{\partial u_x}{\partial x}\right)\right] + \frac{\partial}{\partial x}\left[\mu\left(\frac{\partial u_x}{\partial y} + \frac{\partial u_y}{\partial x}\right)\right] - \frac{\partial p}{\partial y} &= f_y \quad \text{in } \Omega \\ \frac{\partial u_x}{\partial x} + \frac{\partial u_y}{\partial y} + \frac{p}{\kappa} &= 0 \end{aligned} \tag{4} \]

where u_x and u_y are components of velocity, f_x and f_y are components of the body force vector field, p is pressure, and \mu denotes viscosity. In our numerical code the incompressibility constraint is achieved by penalizing the bulk deformation with a large bulk modulus \kappa.
The boundary conditions are given as constrained velocity or vanishing traction components. In equation (4) we use the divergence rather than Laplace form (in the latter different velocity components are only coupled through the incompressibility constraint) as we expect to deal with strongly varying viscosity. It is also worth noting that even for homogeneous models the computationally advantageous Laplace form may lead to serious defects if the boundary terms are not treated adequately [Limache et al., 2007]. Additionally our formulation, equation (4), and its numerical implementation are also applicable to compressible and incompressible elastic problems due to the correspondence principle.

[14] In analogy to the thermal problem we introduce the discrete spaces to approximate the velocity components and pressure:

\[ \tilde{u}_x(x,y) = \sum_{i=1}^{n_{nod}} N_i(x,y)\,u_{ix}, \qquad \tilde{u}_y(x,y) = \sum_{i=1}^{n_{nod}} N_i(x,y)\,u_{iy}, \qquad \tilde{p}(x,y) = \sum_{i=1}^{n_p} P_i(x,y)\,p_i \tag{5} \]

where n_p denotes the number of pressure degrees of freedom and P_i are the pressure shape functions, which may not coincide with the velocity ones. To ensure the solvability of the resulting system of equations (inf-sup condition [see Elman et al., 2005]), special care must be taken when constructing the approximation spaces. A wrong choice of the pressure and velocity discretization results in spurious pressure modes that may seriously pollute the numerical solution. Our particular element choice is the seven-node Crouzeix-Raviart triangle with quadratic velocity shape functions enhanced by a cubic bubble function and discontinuous linear interpolation for the pressure field [e.g., Cuvelier et al., 1986]. This element is stable and no additional stabilization techniques are required [Elman et al., 2005]. The fact that in our case the velocity and pressure approximations are autonomous leads to the so-called mixed formulation of the finite element method [Brezzi and Fortin, 1991].

[15] With the convention that velocity degrees of freedom are followed by pressure ones in the local element numbering, the stiffness matrix for the Stokes problem is given by [e.g., Bathe, 1996]

\[ K^e = \begin{pmatrix} A & Q^T \\ Q & -\kappa^{-1} M \end{pmatrix} = \iint_{\Omega_e} \begin{pmatrix} \mu^e B^T D B & B_{vol}^T P^T \\ P B_{vol} & -\kappa^{-1} P P^T \end{pmatrix} dx\,dy \tag{6} \]

where B is the so-called kinematic matrix transforming velocity into strain rate \dot{\varepsilon} (we use here the engineering convention for the shear strain rate)

\[ \begin{pmatrix} \dot{\varepsilon}_{xx}(x,y) \\ \dot{\varepsilon}_{yy}(x,y) \\ \dot{\gamma}_{xy}(x,y) \end{pmatrix} = B(x,y)\,u^e = \begin{pmatrix} \frac{\partial N_1}{\partial x}(x,y) & 0 & \dots \\ 0 & \frac{\partial N_1}{\partial y}(x,y) & \dots \\ \frac{\partial N_1}{\partial y}(x,y) & \frac{\partial N_1}{\partial x}(x,y) & \dots \end{pmatrix} \begin{pmatrix} u_{1x} \\ u_{1y} \\ \vdots \end{pmatrix} \tag{7} \]

The matrix D extracts the deviatoric part of the strain rate, converts from the engineering convention to the standard shear strain rate, and includes a conventional factor 2. The bulk strain rate is computed according to the equation \dot{\varepsilon}_{vol} = B_{vol} u^e and pressure is the projection of this field onto the pressure approximation space

\[ p(x,y) = -\kappa\,P^T(x,y)\,M^{-1} Q\,u^e \tag{8} \]

With the chosen approximation spaces, the linear pressure shape functions P are spanned by the corner nodal values that are defined independently for neighboring elements. Thus it is possible to invert M on the element level (the so-called static condensation) and consequently avoid the pressure unknowns in the global system. Since the pressure part of the right-hand-side vector is set to zero, this results in the following velocity Schur complement:

\[ \left(A + \kappa\,Q^T M^{-1} Q\right) u^e = f^e \tag{9} \]

Once the solution to the global counterpart of (9) is obtained, the pressure can be restored afterward according to (8).
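To make the element-level elimination concrete, the following sketch evaluates equations (6), (8), and (9) for a single element once A_elem, Q_elem, and M_elem have been numerically integrated. The variable names follow the conventions used in the matrix computation code of section 4.2 (in particular Q_elem stored as [nedof x np], i.e., the transpose of Q in equation (6)); the fragment is an illustration under these assumptions, not a verbatim excerpt of the MILAMIN source.

% Minimal sketch of element-level static condensation (equations (6)-(9)).
% Assumes A_elem [nedof x nedof], Q_elem [nedof x np], M_elem [np x np] and
% the penalty factor PF (kappa, possibly scaled by the element viscosity).
invM_elem = inv(M_elem);                          % cheap: pressure dofs are element-local
K_elem    = A_elem + PF*Q_elem*invM_elem*Q_elem'; % velocity Schur complement, equation (9)
% K_elem (and Rhs_elem) are assembled into the global system; after the global
% velocity solution u_elem is extracted, the element pressure follows from (8):
% p_elem = -PF*invM_elem*(Q_elem'*u_elem);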
The resulting global system of equations is not only symmetric, but also positive-definite as opposed to the original system (6). Unfortunately, the global matrix becomes ill-conditioned for penalty parameter values corresponding to a satisfactorily low level of the flow divergence. It is possible to circumvent this by introducing Powell and Hestenes iterations [Cuvelier et al., 1986] and keeping the penalty parameter \kappa moderate compared to the viscosity \mu:

\[ \begin{aligned} & p^0 = 0 \\ & \text{while } \max\left|\Delta p^i\right| > \text{tol} \\ & \qquad u^i = \left(A + \kappa\,Q^T M^{-1} Q\right)^{-1}\left(f - Q^T p^i\right) \\ & \qquad \Delta p^i = \kappa\,M^{-1} Q\,u^i \\ & \qquad p^{i+1} = p^i + \Delta p^i \\ & \qquad \text{increment } i \\ & \text{end} \end{aligned} \tag{10} \]

In the above iteration scheme the matrices A, Q, and M represent global assembled versions rather than single element contributions.

4.1.3. Isoparametric Elements

[16] To exploit the full flexibility of FEM, we employ isoparametric elements. Each element in physical space is mapped onto the reference element with fixed shape, size, and orientation. This geometrical mapping between local (\xi, \eta) and global (x, y) coordinates of an element is realized using the same shape functions N_i that interpolate physical fields:

\[ x(\xi,\eta) = \sum_{i=1}^{n_{nodel}} N_i(\xi,\eta)\,x_i \tag{11} \]

\[ y(\xi,\eta) = \sum_{i=1}^{n_{nodel}} N_i(\xi,\eta)\,y_i \tag{12} \]

where n_{nodel} is the number of nodes in the element. The local linear approximation to this mapping is given by the Jacobian matrix J:

\[ J = \begin{pmatrix} \dfrac{\partial x}{\partial \xi} & \dfrac{\partial x}{\partial \eta} \\[4pt] \dfrac{\partial y}{\partial \xi} & \dfrac{\partial y}{\partial \eta} \end{pmatrix} \tag{13} \]

The shape function derivatives with respect to the global coordinates (x, y) are calculated using the inverse of the Jacobian and the shape function derivatives with respect to the local coordinates (\xi, \eta):

\[ \left(\frac{\partial N_i}{\partial x} \;\; \frac{\partial N_i}{\partial y}\right) = \left(\frac{\partial N_i}{\partial \xi} \;\; \frac{\partial N_i}{\partial \eta}\right) \begin{pmatrix} \dfrac{\partial x}{\partial \xi} & \dfrac{\partial x}{\partial \eta} \\[4pt] \dfrac{\partial y}{\partial \xi} & \dfrac{\partial y}{\partial \eta} \end{pmatrix}^{-1} \tag{14} \]

Thus the element matrix from equation (3) is now given by

\[ K^e_{ij} = \iint_{\Omega_{ref}} k^e \left(\frac{\partial N_i}{\partial x}\frac{\partial N_j}{\partial x} + \frac{\partial N_i}{\partial y}\frac{\partial N_j}{\partial y}\right) |J|\; d\xi\,d\eta \tag{15} \]

where |J| is the determinant of the Jacobian, taking care of the area change introduced by the mapping, and \Omega_{ref} is the domain of the reference element. To avoid symbolic integration, equation (15) can be integrated numerically:

\[ K^e_{ij} = \sum_{k=1}^{n_{ip}} W_k \left[ k^e \left(\frac{\partial N_i}{\partial x}\frac{\partial N_j}{\partial x} + \frac{\partial N_i}{\partial y}\frac{\partial N_j}{\partial y}\right) |J| \right]_{(\xi_k,\eta_k)} \tag{16} \]

Here the integral is transformed into a sum over n_{ip} integration points located at (\xi_k, \eta_k), where the individual summands are evaluated and weighted by the point specific W_k. For numerical integration rules for triangular elements, see, e.g., Dunavant [1985]. The numerical integration of the element matrix arising in the mechanical case is analogous.

[17] In the following we first show the straightforward implementation of the global matrix computation and investigate its efficiency. It proves to be unsuited for high-performance computing in the MATLAB environment. We then introduce a different approach, which solves the identified problems. Finally, we discuss how to build sparse matrix data structures, apply boundary conditions, solve the system of linear equations, and perform the Powell and Hestenes iterations.

4.2. Matrix Computation: Standard Algorithm

4.2.1. Algorithm Description

[18] The algorithm outlined in Code Fragment 1 (see Figure 1) represents the straightforward implementation of section 4.1. We tried to use intuitive variable and index names; they are explained in Table A1. The details of the algorithm are described in the following (Roman numbers correspond to the comments in Code Fragment 1).

[19] i.) The outermost loop of the standard algorithm is the element loop. Before the actual matrix computation, general element-type specific data such as integration points IP_X and weights IP_w are assigned. The derivatives of the shape functions dNdu with respect to the local (\xi, \eta) coordinates are evaluated in the integration points IP_X. All arrays used during the matrix computation procedure are allocated in advance, e.g., K_all.
Figure 1. Code Fragment 1 shows the standard matrix computation.

[20] ii.) Inside the loop over all elements the code begins with reading element-specific information, such as indices of the nodes belonging to the current element, coordinates of the nodes, and element conductivity, viscosity, and density.

[21] iii.) For each element the following loop over integration points performs numerical integration of the underlying equations, which results in the element stiffness matrix K_elem[nnodel,nnodel].
In the case of the mechanical code additional matrices A_elem[nedof,nedof], Q_elem[nedof,np], and M_elem[np,np] are required. All of the above arrays must be cleared before the integration point loop, together with the right-hand-side vector Rhs_elem.

[22] iv.) Inside the integration point loop the precomputed shape function derivatives dNdui are extracted for the current integration point. b) In the chosen element type the pressure is interpolated linearly in the global coordinates. Pressure shape functions Pi at an integration point are obtained as a solution of the system P*Pi = Pb, where the first equation enforces that the shape functions Pi sum to unity.

[23] v.) The Jacobian J[ndim,ndim] is calculated for each integration point by multiplying the element's nodal coordinates matrix ECOORD_X[ndim,nnodel] by dNdui[nnodel,ndim]. Furthermore its determinant, detJ, and inverse, invJ[ndim,ndim], are obtained with the corresponding MATLAB functions.

[24] vi.) The derivatives versus global coordinates, dNdX[nnodel,ndim], are obtained by dNdX = dNdui*invJ according to equation (14).

[25] vii.) a) The element thermal stiffness matrix contribution is obtained according to equation (16) and implemented as K_elem = K_elem + weight*ED*(dNdX*dNdX'). b) The kinematic matrix B needs to be formed, equation (7), and A_elem, Q_elem, and M_elem are computed according to equation (6).

[26] viii.) The pressure degrees of freedom are eliminated at this stage. It is possible to invert M_elem locally because the pressure degrees of freedom are not coupled across elements; thus there is no need to assemble them into the global system of equations. For large viscosity variations it is beneficial to relate the penalty factor PF to the element's viscosity to improve the condition number of the global matrix.

[27] ix.) The lower (including diagonal) part of the element stiffness matrix is written into the global storage, relying on the symmetry of the system. b) Q_elem and invM_elem matrices are stored for each element in order to avoid recomputing them during the Powell and Hestenes iterations.

[28] MATLAB provides a framework for scientific computing that is freed from the burden of conventional high-level programming languages, which require detailed variable declarations and do not provide native access to solvers, visualization, file I/O, etc. However, the ease of code development in MATLAB comes with a loss of some performance, especially when certain recommended strategies are not followed: http://mathworks.com/support/solutions/data/1-15NM7.html. The more obvious performance considerations have already gone into the above standard implementation and we would like to point these out:

[29] 1. Memory allocation and explicit variable declaration have been performed. Although not formally required, it is advisable to explicitly declare variables, including their size and type. If variables are not declared with their final size, but are instead successively extended (filled in) during loop evaluation, a large penalty has to be paid for the continuous and unnecessary memory management. Hence, all variables that could potentially grow in size during loop execution are preallocated, e.g., K_all. Variables such as ELEM2NODE that only have to store integer numbers should be declared accordingly, int32 in the case of ELEM2NODE instead of MATLAB's default variable type double. This reduces both the amount of memory required to store this large array and the time required to access it since less data must be transferred.

[30] 2. Data layout has been optimized to facilitate memory access by the CPU. For example, the indices of the nodes of each element must be stored in neighboring memory locations, and similarly the x-y-z coordinates of every node. The actual numbering of nodes and elements also has a visible effect on cache reuse inside the element loop, similarly to the sparse matrix-vector multiplication problem [Toledo, 1997].

[31] 3. Multiple data transfers and computations have been avoided. Generally, statements should appear in the outermost possible loop to avoid multiple transfer and computation of identical data. This is why the integration point evaluated shape function derivatives with respect to local coordinates are precomputed outside the element loop (as opposed to inside the integration loop) and the nodal coordinates are extracted before the integration loop.
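Code Fragment 1 itself appears only as Figure 1, so the following condensed sketch illustrates the structure described in points i.)-ix.) for the thermal problem, including the preallocation and int32 considerations above. The helpers ip_triangle and shp_deriv_triangle are assumed to return the integration rule and the local shape function derivatives; their names, and the conductivity array, are placeholders for this sketch.

% Minimal sketch of the standard element-by-element matrix computation (thermal).
[IP_X, IP_w] = ip_triangle(nip);              % assumed helper: integration points and weights
dNdu  = shp_deriv_triangle(IP_X, nnodel);     % assumed helper: cell array of [nnodel x ndim]
K_all = zeros(nnodel*(nnodel+1)/2, nel);      % preallocated global storage (lower triangles)

for iel = 1:nel                               % i.) element loop
    ECOORD_X = GCOORD(:, ELEM2NODE(:,iel));   % ii.) nodal coordinates [ndim x nnodel]
    ED       = Conductivity(iel);             %      element conductivity (placeholder array)
    K_elem   = zeros(nnodel);                 % iii.) clear the element matrix

    for ip = 1:nip                            % integration point loop
        dNdui = dNdu{ip};                     % iv.) local derivatives [nnodel x ndim]
        J     = ECOORD_X*dNdui;               % v.) Jacobian [ndim x ndim]
        detJ  = J(1,1)*J(2,2) - J(1,2)*J(2,1);            % explicit determinant
        invJ  = [J(2,2) -J(1,2); -J(2,1) J(1,1)]/detJ;    % explicit inverse
        dNdX  = dNdui*invJ;                   % vi.) global derivatives, equation (14)
        K_elem = K_elem + IP_w(ip)*detJ*ED*(dNdX*dNdX');  % vii.) equation (16)
    end
    % ix.) store the lower triangle of K_elem into K_all(:,iel) (indexing omitted)
end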
4.2.2. Performance Analysis

[32] In order to analyze the performance of the standard matrix computation algorithm we run corresponding tests on an AMD Opteron system with 64 bit Red Hat Enterprise Linux 4 and MATLAB 2007a using GoTo BLAS (http://www.tacc.utexas.edu/resources/software). This system has a peak performance of 4.4 gigaflops per core, i.e., it is theoretically capable of performing 4.4 billion double precision floating point operations per second (flops). The specific element types used are 6-node triangles (quadratic shape functions) with 6 integration points for the thermal problem and 7-node triangles with 12 integration points for the mechanical problem.

[33] In the thermal problem, results are obtained for an unstructured mesh consisting of approximately 1 million nodes and 0.5 million elements. For this model the previously described matrix computation took 65 s, during which 324 floating point operations per integration point per element were calculated. This corresponds to 15 Megaflops (Mflops) or approximately 0.4% of the peak performance. Analysis of the code with MATLAB's built-in profiler revealed that a significant amount of time was spent on the calculation of the determinant and inverse of the Jacobian. Therefore, in further tests these calls were replaced by explicit calculations of detJ and invJ. The final performance achieved by this algorithm was 30 Mflops, which is still less than one percent of the peak performance and equivalent to a peak CPU performance that was reached by commodity computers more than a decade ago.

[34] Profiling the improved standard algorithm revealed that most of the computational time was spent on matrix multiplications. This means that the efficiency of the analyzed implementation depends mainly on the efficiency of dense matrix by matrix multiplications inside the integration point loop. In order to perform these calculations MATLAB uses hardware-tuned, high-performance BLAS libraries (Basic Linear Algebra Subprograms; see http://www.netlib.org/blas/faq.html and Dongarra et al. [1990]), which reach up to 90% of the CPU peak performance; a value from which the analyzed code is far away.

[35] The cause for this bad performance is that the matrix by matrix multiplications inside the integration point loop operate on very small matrices, for which BLAS libraries are known not to work well due to the introduced overhead (e.g., http://math-atlas.sourceforge.net/timing/36v34/OptPerf.html). Therefore, the same observation can be made when writing the standard algorithm in a compiler language such as C and relying on BLAS for the matrix multiplications, although the actual performance in this case is higher than in MATLAB. In C a possible solution is to explicitly write out the small matrix by matrix multiplications, which results in a more efficient code. In MATLAB, however, this is not a practical alternative as explicitly writing out matrix multiplications leads to unreadable code without substantial performance gains. The above performance considerations apply equally to the mechanical code.

[36] In conclusion, the standard algorithm is a viable option when writing compiler code. However, the achievable performance in MATLAB is unsatisfactory, so we developed a more efficient approach, which is presented in the following section.

[37] Remark 1: Measuring code performance

[38] Since no flops measure exists in MATLAB, the number of operations must be manually calculated on the basis of code inspection and divided by the computational time. To provide more meaningful performance measures, only the number of necessary floating point operations may be considered, e.g., the redundant computations of the upper triangular entries in the standard matrix computation contribute to the flop count, which artificially increases the measured performance. However, it is not necessarily the case that the algorithm with the lowest operation count is the fastest in terms of execution time. We refrain from adjusting the actual flop counts in this paper.
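In practice this amounts to timing the code section of interest and dividing a hand-counted operation number by the elapsed time, for example (the 324 operations per integration point are the figure quoted above for the thermal 6-node element; the count depends on the element type and implementation):

% Minimal sketch of the manual Mflops estimate described in Remark 1.
ops_per_ip = 324;                  % hand-counted operations per integration point per element
t = tic;
% ... matrix computation to be measured ...
elapsed = toc(t);
Mflops  = ops_per_ip*nip*nel/elapsed/1e6;
fprintf('matrix computation: %.1f s, %.0f Mflops\n', elapsed, Mflops);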
4.3. Matrix Computation: Optimized Algorithm

[39] In this section we explain how to efficiently compute the local stiffness matrices. This optimization strategy is common to both (thermal and mechanical) problems considered. For simplicity, we present it on the example of the thermal problem. Overall performance benchmarks and application examples are provided for both types of problems in subsequent sections.

[40] The small matrix by small matrix multiplications in the integration loop nested inside the loop over elements are the bottleneck of the standard algorithm. Written out in terms of loops, these matrix multiplications represent another three loops, totaling five. Since the element loop exhibits no data dependency, it can be moved into the innermost three (out of five), effectively becoming part of a small matrix by large matrix multiplication.
[41] This loop reordering does not change the total amount of operations. However, the number of BLAS calls is greatly reduced (ndim*nip versus nel*nip in the standard approach), and the amount of computation done per function call is drastically increased. Consequently, the overhead problem vanishes, leading to a substantial performance improvement. Unfortunately, the performance decreases once a certain number of elements is exceeded. The reason for this is that the data required for the operation no longer fits into the CPU's cache, which inhibits cache reuse within the integration point loop. The remedy is to operate on blocks of elements of the size for which the observed performance is best. Once a block is processed, the results are written to the main memory and the data required by the next block is copied into the cache. Data required for every block should fit (reside) in the cache at all times. The ideal block size depends on the cache structure of a CPU and must be determined system and problem specifically. This computing strategy is called "blocking" and is implemented as a part of the optimized algorithm. Coincidentally, this entire approach to optimizing the FEM matrix computation is similar to vector computer implementations [e.g., Ferencz and Hughes, 1998; Hughes et al., 1987; Silvester, 1988].

4.3.1. Algorithm Description

[42] Code Fragment 2 shows the implementation of the optimized matrix computation algorithm (see Figure 2). The key operations are explained and compared to the standard algorithm in the following.

Figure 2. Code Fragment 2 shows the optimized finite element global matrix computation.

[43] i.) The outermost loop of the optimized matrix computation is the block loop. Before this loop is entered, required arrays (IP_X, IP_w, dNdu) are assigned and necessary variables are allocated.

[44] ii.) Inside the block loop the code begins with reading element specific information. Since we simultaneously operate on nelblo elements, all the corresponding global data blocks are copied into local arrays ECOORD_x, ECOORD_y, and ED, and are used repeatedly inside the integration loop.

[45] iii.) For the entire block of elements, the loop over integration points performs numerical integration of the element matrices K_block[nelblo,nnodel*(nnodel+1)/2].

[46] iv.) As in the standard algorithm, every iteration of the integration point loop begins by reading the precomputed dNdu arrays.

[47] v.) The Jacobian of the standard algorithm, J[ndim,ndim], is replaced by ndim matrices, Jx[nelblo,ndim] and Jy[nelblo,ndim], containing the individual rows of the Jacobian evaluated at the actual integration point for all elements of the current block. Jx and Jy are calculated by multiplying the nodal coordinates by the shape function derivatives, e.g., Jx[nelblo,ndim] = ECOORD_x[nnodel,nelblo]'*dNdui[nnodel,ndim]. Thus, instead of nelblo*nip matrix multiplications of dNdu[ndim,nnodel] and ECOORD_X[ndim,nnodel], ndim*nip multiplications involving the larger matrices ECOORD_x and ECOORD_y are performed, i.e., the same work is done with fewer multiplications of larger matrices. Once the Jacobian is obtained, its determinant, detJ, and inverse, split into invJx and invJy, are explicitly computed using simple operations on vectors.

[48] vi.) The derivatives with respect to the global coordinates (x, y), dNdx[nelblo,nnodel] and dNdy[nelblo,nnodel], are obtained by multiplying invJx and invJy by the transpose of dNdui. Again, fewer multiplication calls involving larger matrices are performed.

[49] vii.) The local stiffness matrix contribution for all the elements in the block, K_block[nelblo,nnodel*(nnodel+1)/2], is computed according to equation (16). Note that exploiting symmetry allows for calculation of only the lower triangle of the stiffness matrices, which substantially reduces the operation count.

[50] viii.) After the numerical integration of K_block is completed, the results are written into the global storage K_all, again exploiting symmetry by storing only the lower triangular part.

[51] ix.) The number of elements remaining in the final block might be smaller than nelblo. Consequently, nelblo and several arrays must be adjusted.
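Analogous to Figure 2, the following abbreviated sketch shows the core of the blocked computation for the thermal problem. It follows the description in points i.)-ix.) but leaves out the write-back of K_block to K_all; the per-element conductivity array and the dNdu cell array are the same placeholders as in the standard sketch above.

% Minimal sketch of the blocked (optimized) matrix computation (thermal).
nelblo = 1000;                             % block size (system dependent, see Figure 3)
il     = 1;                                % first element of the current block
while il <= nel                            % i.) block loop
    iu    = min(il+nelblo-1, nel);         % ix.) the last block may be shorter
    elems = il:iu;  nbl = numel(elems);

    % ii.) gather block data: coordinates [nnodel x nbl] and conductivities [nbl x 1]
    ECOORD_x = reshape(GCOORD(1,ELEM2NODE(:,elems)), nnodel, nbl);
    ECOORD_y = reshape(GCOORD(2,ELEM2NODE(:,elems)), nnodel, nbl);
    ED       = Conductivity(elems); ED = ED(:);
    K_block  = zeros(nbl, nnodel*(nnodel+1)/2);

    for ip = 1:nip                         % iii.) integration loop over the whole block
        dNdui = dNdu{ip};                  % iv.) local derivatives [nnodel x ndim]
        Jx    = ECOORD_x'*dNdui;           % v.) Jacobian rows for all elements [nbl x ndim]
        Jy    = ECOORD_y'*dNdui;
        detJ  = Jx(:,1).*Jy(:,2) - Jx(:,2).*Jy(:,1);
        invJx = [ Jy(:,2) -Jy(:,1)]./detJ(:,[1 1]);   % inverse split into two matrices
        invJy = [-Jx(:,2)  Jx(:,1)]./detJ(:,[1 1]);
        dNdx  = invJx*dNdui';              % vi.) global derivatives [nbl x nnodel]
        dNdy  = invJy*dNdui';

        weight = IP_w(ip)*detJ.*ED;        % vii.) symmetric entries K(i,j), j >= i
        indx = 0;
        for i = 1:nnodel
            for j = i:nnodel
                indx = indx + 1;
                K_block(:,indx) = K_block(:,indx) + ...
                    weight.*(dNdx(:,i).*dNdx(:,j) + dNdy(:,i).*dNdy(:,j));
            end
        end
    end
    % viii.) write K_block into the global storage K_all (indexing omitted)
    il = iu + 1;
end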
4.3.2. Performance Analysis

[52] To illustrate the performance of the optimized matrix computation, systematic tests were run with the same 1 million node problem that was used for the performance analysis of the standard algorithm. Since larger matrices resulting from larger block sizes should yield better BLAS efficiency, the performance in Mflops is plotted versus the number of elements in a block; see Figure 3. This plot confirms the arguments for the introduction of the blocking algorithm. Starting from approximately the performance of the standard algorithm, a steady increase can be observed up to 350 Mflops, which on the test system is reached for a block with 1000 elements for the thermal problem. Further increase of the block size leads to a performance decrease toward a stable level of 120 Mflops due to lack of cache reuse in the integration point loop. Compared to the standard version, the optimized matrix computation achieves a 20-fold speedup in terms of flops performance. Since the optimized algorithm performs fewer operations (computation of only the lower triangular part of the symmetric element matrix), its execution time is actually more than 30 times faster.

[53] The achieved 350 Mflops efficiency corresponds to only 8% of the peak CPU performance. Profiling the code revealed that for the test problem approximately half of the time was spent on reading and writing variables from and to RAM (e.g., nodal coordinates and element matrices). This value is constrained by the memory bandwidth of the hardware, which on current computer architectures is often a bigger bottleneck than the CPU performance. Compared to C implementations, the optimized matrix computation performance is better than the straightforward standard algorithm using BLAS, but more than a factor 3 slower than what can be achieved by explicitly writing out the matrix multiplications.
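Because the optimal block size is hardware dependent, a plot like Figure 3 can be reproduced on any machine by simply timing the blocked computation for a range of block sizes; the wrapper function name below is a placeholder for whatever routine contains the block loop.

% Minimal sketch: scan block sizes to find the sweet spot on a given CPU.
block_sizes = [100 250 500 1000 2000 4000 8000];
t_run = zeros(size(block_sizes));
for k = 1:numel(block_sizes)
    t = tic;
    compute_matrix_blocked(block_sizes(k));   % placeholder wrapper around the block loop
    t_run(k) = toc(t);
end
[t_best, best] = min(t_run);
fprintf('best block size: %d elements (%.1f s)\n', block_sizes(best), t_best);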
Figure 3. Performance of optimized matrix computation versus block size.

[54] In the mechanical code, the peak flops performance is similar. Note that in this case the optimal block size is smaller due to the larger workspace of the method; see Figure 3.

4.4. Matrix Assembly: Triplet to Sparse Format Conversion

[55] The element stiffness matrices stored in K_all must be assembled into the global stiffness matrix K. The row and column indices (K_i and K_j) that specify where the individual entries of K_all have to be stored in the global system are commonly known as the triplet sparse matrix format [e.g., Davis, 2006]. Since we only use lower triangular entries, special care must be taken so that the indices referring to the upper triangle are not created; see Code Fragment 3. Note that K_i and K_j hold duplicate entries, and the purpose of the MATLAB sparse function is to sum and eliminate them.

[56] While creation of the triplet format is fast, the call to sparse gives some concerns. MATLAB's sparse implementation requires that K_i and K_j are of type double, which is memory- and performance-wise inefficient. In addition, sparse itself is rather slow, especially if compared to the time spent on the entire matrix computation. The equivalent function sparse2, provided by T. A. Davis within the CHOLMOD package (http://www.cise.ufl.edu/research/sparse/SuiteSparse), is substantially faster and does not require a conversion of the coefficients to double precision. Code Fragment 3 presents in detail how to create a global system matrix.
[57] Code Fragment 3 shows the global sparse matrix assembly.

% CREATE TRIPLET FORMAT INDICES
indx_j = repmat(1:nnodel,nnodel,1);
indx_i = indx_j';
indx_i = tril(indx_i);
indx_i = indx_i(:);
indx_i = indx_i(indx_i>0);
indx_j = tril(indx_j);
indx_j = indx_j(:);
indx_j = indx_j(indx_j>0);

K_i = ELEM2NODE(indx_i,:);
K_j = ELEM2NODE(indx_j,:);

K_i = K_i(:);
K_j = K_j(:);

% SWAP INDICES REFERRING TO UPPER TRIANGLE
indx = K_i < K_j;
tmp = K_j(indx);
K_j(indx) = K_i(indx);
K_i(indx) = tmp;

K_all = K_all(:);

% CONVERT TRIPLET DATA TO SPARSE MATRIX
K = sparse2(K_i, K_j, K_all);
clear K_i K_j K_all;

[58] The triplet format is converted into the sparse matrix K with one single call to sparse2. Assembling smaller sparse matrices for blocks of elements and calling sparse consecutively would reduce the workspace for the auxiliary arrays; however, it would also slow down the code. Therefore, as long as the K_i, K_j, and K_all arrays are not the memory bottleneck, it is beneficial to perform the global conversion. Once K is created, the triplet data is cleared in order to free as much memory as possible for the solution stage. In the mechanical code the Q and M^{-1} matrices are stored in sparse format for later reuse in the Powell and Hestenes iterations.

[59] Remark 2: Symbolic approach to sparse matrix assembly

[60] In general the auxiliary arrays can be altogether avoided with a symbolic approach to sparse matrices. While the idea of sparse storage is the elimination of zero entries, in a symbolic approach all possible nonzero entries are stored and initialized to zero. During the computation of element stiffness matrices, global locations of their entries can be found at a small computational cost, and the corresponding values are incrementally updated. Also, this symbolic storage pattern can be reused between subsequent time steps, as long as the mesh topology is not changed. Unfortunately, this improvement cannot be implemented in MATLAB as zero entries are automatically deleted.

4.5. Boundary Conditions

[61] The implemented models have two types of boundary conditions: vanishing fluxes and Dirichlet. While the former automatically results from the FEM discretization, the latter must be specified separately, which usually leads to a modification of the global stiffness matrix. These modifications may, depending on the implementation, cause loss of symmetry, changes in the sparsity pattern, and row addressing of K, all of which can lead to a badly performing code.

[62] An elegant and sufficiently fast approach is to separate the degrees of freedom of the model into Free (indices of unconstrained degrees of freedom) and Bc_ind, where Dirichlet boundary conditions with corresponding values Bc_val are applied. Since the solution values in Bc_ind are known, the corresponding equations can be eliminated from the system of equations by modifying the right-hand side of the remaining degrees of freedom accordingly. This is implemented as shown in Code Fragment 4.

[63] Code Fragment 4 shows the boundary condition implementation for the thermal problem.

Free = 1:nnod;
Free(Bc_ind) = [];
TMP = K(:,Bc_ind) + cs_transpose(K(Bc_ind,:));
Rhs = Rhs - TMP*Bc_val';
K = K(Free,Free);
T = zeros(nnod,1);
T(Bc_ind) = Bc_val;

[64] Since only the lower part of the global matrix is stored, we need to restore the remaining parts of the columns by transposing the adequate rows.

4.6. System Solution

[65] We have ensured that the global system of linear equations under consideration is symmetric, positive-definite, and sparse. It has the form

\[ K\,T = \text{Rhs} \tag{17} \]

where K is the stiffness matrix, T the unknown temperature vector, and Rhs is the right-hand side. One of the fastest and most memory efficient direct solvers for this type of system is CHOLMOD, a sparse supernodal Cholesky factorization package developed by T. Davis [Davis and Hager, 2005; Y. Chen et al., Algorithm 8xx: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate, submitted to ACM Transactions on Mathematical Software, 2007; T. A. Davis and W. W. Hager, Dynamic supernodes in sparse Cholesky update/downdate and triangular solves, submitted to ACM Transactions on Mathematical Software, 2007]; see the report by Gould et al. [2007]. Newer versions of MATLAB (2006a and later) use this solver, which is substantially faster than the previous implementation. When symmetric storage is not exploited, CHOLMOD can be invoked through the backslash operator: T = K\Rhs (make sure that the matrix K is numerically symmetric; otherwise MATLAB will invoke a different, slower solver).
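Putting Code Fragment 4 and equation (17) together, the reduced thermal system can be solved in a single backslash call once the lower-triangular storage is expanded to a numerically symmetric matrix; a minimal sketch, not part of the code fragments above:

% Complete the thermal solve after Code Fragment 4.
% Here K is the reduced matrix K(Free,Free) with only its lower triangle stored.
K_full  = K + tril(K,-1)';      % restore the upper triangle (numerical symmetry)
T(Free) = K_full\Rhs(Free);     % backslash invokes CHOLMOD for sparse SPD matrices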
Figure 4. Performance analysis of the different steps of the Cholesky algorithm with different reorderings for our one million degrees of freedom thermal test problem.

[66] However, it is best to use CHOLMOD and the related parts by installing the entire package from the developer's SuiteSparse Web site (http://www.cise.ufl.edu/research/sparse/SuiteSparse). This provides access to cholmod2, which is capable of dealing with only upper triangular input data and precomputed permutation (reordering) vectors. SuiteSparse also contains lchol, a Cholesky factorization operating only on lower triangular matrices, which is faster and more memory efficient than MATLAB's chol equivalent. Reusing the Cholesky factor L during the Powell and Hestenes iterations in the mechanical problem greatly reduces the computational cost of achieving a divergence free flow solution.

[67] The mentioned reuse of reordering data is possible as long as the mesh topology remains identical, which even in our large strain flow calculations is the case for many time steps. The reordering step decreases factorization fill-in and consequently improves memory and CPU efficiency [Davis, 2006], but is a rather costly operation compared to the rest of the Cholesky algorithm. Different reordering schemes can be used, and we compare two of them in Figure 4: AMD (Approximate Minimum Degree) and METIS (http://glaros.dtc.umn.edu/gkhome/views/metis). While AMD is faster during the reordering step, it results in slower Cholesky factorization and forward and back substitution. If the reordering can be reused for a large number of steps, it is recommended to rely on METIS, which is accessible in MATLAB through the SuiteSparse package.

4.7. Powell and Hestenes Iterations

[68] In the thermal code, the solution vector is obtained by calling forward and back substitution routines with the Cholesky factor and the adequately permuted right-hand-side vector. During the second substitution phase the upper Cholesky factor is required. However, instead of explicitly forming it through the transposition of the stored lower factor, it is advantageous to call cs_ltsolve, which can operate on the lower factor and performs the needed task of the back substitution.
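The corresponding solution pattern, written here with MATLAB built-ins only, factors the system once and then reuses both the permutation and the factor for any number of right-hand sides (the SuiteSparse route with lchol, cs_lsolve, and cs_ltsolve, as used in Code Fragment 5, follows the same structure with a lower factor instead of chol's upper factor):

% Minimal sketch: reorder, factor once, and reuse the factor for repeated solves.
K_full = K + tril(K,-1)';            % expand lower-triangular storage to full symmetry
perm   = symamd(K_full);             % fill-reducing reordering (AMD-type, built-in)
R      = chol(K_full(perm,perm));    % upper Cholesky factor: K_full(perm,perm) = R'*R
x        = zeros(size(Rhs));
x(perm)  = R\(R'\Rhs(perm));         % forward and back substitution
% R and perm can be reused as long as the matrix (mesh topology) does not change.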
[69] In the MILAMIN flow solver the incompressibility constraint is achieved through an iterative penalty method, i.e., the bulk part of the deformation is suppressed with a large bulk modulus (penalty parameter) \kappa. In a single step penalty method there is a trade-off between the incompressibility of the flow solution and the condition number of the global equation system. This can be avoided by using a relatively small \kappa, which ensures a good condition number, and then iteratively improving the incompressibility of the flow. Note that for the chosen Crouzeix-Raviart element, pressure is discontinuous between elements and the corresponding degrees of freedom can be eliminated element-wise (no global system solution required). Pressure increments can be computed with the velocity solution vector and the stored Q and M^{-1} matrices. These pressure increments are sent to the right-hand side of the system and accumulated in the total pressure. The code fragment for these so-called Powell and Hestenes iterations is given in Code Fragment 5.

[70] Code Fragment 5 shows the Powell and Hestenes iterations.

while (div_max>div_max_uz && uz_iter<uz_iter_max)
    uz_iter = uz_iter + 1;
    %FORWARD AND BACK SUBSTITUTION
    Vel(Free(perm)) = cs_ltsolve(L,cs_lsolve(L,Rhs(Free(perm))));
    %COMPUTE QUASI-DIVERGENCE
    Div = invM*(Q*Vel);
    %UPDATE RHS
    Rhs = Rhs - PF*(Q'*Div);
    %UPDATE TOTAL PRESSURE
    Pressure = Pressure + PF*Div;
    %CHECK INCOMPRESSIBILITY
    div_max = max(abs(Div(:)));
end

5. Postprocessor

[71] The results of a numerical model are only useful if fast and precise analysis and visualization are possible. One of the main aspects to achieve this is to avoid loops. For triangular meshes trisurf is the natural choice for two- and three-dimensional data visualization as it employs the usual FEM structures: connectivity (ELEM2NODE), coordinates (GCOORD), and data (T). This allows for visualization of FEM models with more than one million elements in less than one second.

[72] A problem that often arises is the visualization of discontinuous data, such as pressure in mixed formulations of deformation problems. The remedy is to abandon the nodal connectivity and to create a new arrangement, where physical nodes are listed separately for every element that accesses them. The same can also be done for meshes other than triangular ones by creating the corresponding connectivity (ELEM2NODE) and calling:

[73] Code Fragment 6 shows the postprocessor.

patch('faces', ELEM2NODE, 'vertices', GCOORD', 'facevertexcdata', T);
shading interp;
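For a continuous nodal field such as the temperature solution, the corresponding one-line visualization uses trisurf directly on the FEM arrays; a minimal sketch, assuming the first three entries of each connectivity column are the element corner nodes:

% Visualize the nodal temperature field with trisurf (corner nodes only).
trisurf(double(ELEM2NODE(1:3,:))', GCOORD(1,:)', GCOORD(2,:)', T, 'EdgeColor', 'none');
view(2); axis image; colorbar;
shading interp;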

6. MILAMIN Performance Analysis

6.1. Overall Performance

[74] The overall performance of MILAMIN versus the number of nodes is analyzed in Figure 5. The goal of MILAMIN to perform a complete FEM analysis for one million unknowns in one minute is reached for the thermal as well as the mechanical problem. All components of MILAMIN scale linearly with the number of nodes; the only exception is the direct solver, which shows super-linear scaling. The performance details are discussed in the following sections.

6.2. Component Performance

[75] Figure 6 shows the total amount of time for the one million degrees of freedom (DOFs) test problems split into the individual components of MILAMIN. The contributions of the boundary conditions and postprocessor are minor. The time taken by the preprocessor is also not relevant, especially if the same (Lagrangian) mesh is used for many time steps. A major achievement of MILAMIN is the performance of the optimized matrix computation, which is more than 15- to 30-fold better than the standard algorithm. The matrix assembly done by sparse2 is one of the major contributors to the total time, but cannot be optimized without a major change in the way MATLAB operates on sparse matrices; see Remark 2. Finally, the three components of the Cholesky solver take substantial time.

Figure 6. Overall performance of MILAMIN split up into the individual components for thermal and mechanical test problems with one million degrees of freedom. The timing for the matrix computation is given for the standard (S) and the optimized (O) algorithm. Note that the forward and backward (F&B) substitution timing also contains three Powell and Hestenes iterations in the case of the mechanical problem.

[76] The time taken by the first part of the Cholesky solver, the reordering, can often be neglected for practical applications. During nonlinear material and time step iterations the mesh topology remains the same as long as no remeshing is performed, and the permutation vector can be reused if the SuiteSparse package is employed.
Figure 5. Overall performance results for MILAMIN given for total time spent on problem, and the direct solver contribution.

[77] The second component of the Cholesky solver is the factorization. This step takes most of the total MILAMIN execution time. However, the efficiency achieved by CHOLMOD is close to the optimal CPU performance. For further optimization one could consider other types of solvers, such as iterative ones. Yet, preconditioned iterative methods or algebraic multigrid are less robust (especially for large material contrasts as targeted here) and perform better only for large systems; see section 6.4. These methods are the only option in the case of most three-dimensional problems, because the scaling of factorization time and memory requirements for direct solvers is much worse than in two dimensions. However, for two-dimensional problems direct solvers are the best choice for resolutions on the order of one million degrees of freedom, especially for positive definite systems that can be solved with Cholesky factorizations. Moreover, it is in problems of this size where our optimizations greatly reduce the total solution time. Such numerical resolutions are often sufficient in two dimensions to solve challenging problems, and the achieved performance allows for studies with a large number of time steps.

[78] The third part of the Cholesky solver is the forward and backward substitution, which does not contribute substantially in the case of thermal problems. For mechanical problems several Powell and Hestenes iterations are required to enforce incompressibility, each issuing a forward and back substitution call plus other computations. The time spent on the Powell and Hestenes iterations is not negligible, but the strategy chosen to deal with incompressibility is clearly advantageous compared to other strategies that would not allow the use of Cholesky solvers; see, for example, the results for FEMLAB using UMFPACK in section 6.4.
[79] A final analysis of the overall speedup achieved by MILAMIN is shown in Figure 7, where we depict the ratio of the total times t_standard/t_optimized for the thermal and mechanical codes. In this speedup analysis we define the total time as the sum of the time needed to compute and assemble the global matrix, apply boundary conditions, factorize and solve the system of equations, and perform the Powell and Hestenes iterations (incompressible Stokes flow). Thus mesh generation, postprocessing, and reordering, which do not need to be performed for every time step, do not enter this analysis. For our target system sizes the achieved speedups reach approximately 3 and 4 for the mechanical and thermal codes, respectively. Hence the performance gains due to the developed MILAMIN package are substantial. The scaling with respect to system size shows that the speedup decreases with increasing number of nodes. This is due to the super-linear scaling of the direct solver, which starts to dominate the total execution time for very large systems.

Figure 7. Achieved MILAMIN speedup for all operations that need to be performed for every time step; see text for details.

6.3. Memory Requirements

[80] Besides CPU performance, the available memory (RAM) is the other parameter that determines the problem size that can be solved on a specific machine. The memory requirements of MILAMIN are presented in Figure 8. Within the studied range of system sizes, all data allocated during the matrix computation and assembly requires substantially less memory than the solution stage. Thus the auxiliary arrays such as K_i, K_j, and K_val are not a memory bottleneck and it is indeed beneficial to perform the conversion to sparse format globally. Note that the amount of memory required during the factorization stage depends strongly on the reordering used. This analysis is only approximate as the workspace of the external routines (lchol, sparse2, etc.) is not taken into account. On 2 GB RAM computers we are able to solve systems consisting of 1.65 and 0.65 million nodes for the thermal and mechanical problems, respectively.

Figure 8. Memory requirements of the thermal and mechanical versions of MILAMIN.

6.4. Comparison to Other Software

[81] In this section we compare MILAMIN to different available commercial and free software packages solving similar test problems. Table 1 presents run times for a thermal problem with 1 million degrees of freedom. The model setup consists of a box with a circular hole (zero flux) and a circular inclusion of ten times higher conductivity than the matrix. The outer boundaries are set to Dirichlet conditions representing a linearly varying temperature field.
Geochemistry 3
Geophysics
Geosystems G dabrowski et al.: milamin matlab-based fem solver 10.1029/2007GC001719

Table 1. Performance Results for Different Software Packages for the Thermal Problema
Matrix Computation
Software and Assembly Solve Solver Type

ABAQUS, T2 80 260 proprietary


FEMLAB, T2 18 40 UMFPACK
45 TAUCS
52 PARDISO
58 SPOOLES
240 ICCG
500 AMG-CG
1000 SSOR-CG
2500 PCG
FEAPpv, Fortran, T2 7 712 PCG
OOFEM, C++, T1 36 400 ICCG
TOCHNOG, C\C++, T2 15 1711 BiCG
AFEM@matlab, T1 25 19 MATLAB \
IFISS, Q2 999 57 MATLAB \
IFISS, Q1 464 30 MATLAB \
MILAMIN std, T2 65 24 CHOLMOD2 (AMD)
MILAMIN opt, T2 5 24 CHOLMOD2 (AMD)
a
T1 and T2 stand for linear and quadratic triangles, and Q1 and Q2 stand for linear and quadratic quadrilateral
elements, respectively.

[82] The software that entered the test are commercial finite element packages, ABAQUS (SIMULIA, 6.6-1, http://www.simulia.com/products/abaqus_fea.html) and FEMLAB (COMSOL 3.3, http://www.femlab.com), open source packages FEAPpv (O. C. Zienkiewicz and R. L. Taylor, 2.0, http://www.ce.berkeley.edu/rlt/feappv), OOFEM (B. Patzak, OOFEM 1.7, http://www.oofem.org), and TOCHNOG (D. Roddeman, 11 February 2001, http://sourceforge.net/projects/tochnog) for compiler languages, and AFEM@matlab (L. Chen and C. Zhang, http://www.mathworks.com/matlabcentral/fileexchange) and IFISS (D. J. Silvester et al., 2.2, http://www.maths.manchester.ac.uk/djs/ifiss) for MATLAB. For the solution stage we used a wide range of direct solvers, including UMFPACK (T. A. Davis, http://www.cise.ufl.edu/research/sparse/umfpack), TAUCS (S. Toledo et al., http://www.tau.ac.il/stoledo/taucs), PARDISO (O. Schenk and K. Gärtner, http://www.pardiso-project.org), SPOOLES (C. Ashcraft et al., http://www.netlib.org/linalg/spooles/spooles.2.2.html), CHOLMOD (T. A. Davis, http://www.cise.ufl.edu/research/sparse/cholmod), and the MATLAB backslash operator (\). We also compared different implementations of iterative solvers such as Conjugate Gradients preconditioned with Jacobi (PCG), Symmetric Successive Over-Relaxation (SSOR-CG), Incomplete Cholesky (ICCG), and Algebraic Multigrid (AMG-CG), and a Biconjugate Gradients solver preconditioned with Jacobi (BiCG).

[83] A number of other MATLAB-based packages are available, which, however, could not enter our table because they are simply incapable of solving the test problem within a reasonable amount of time and the available RAM.
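As a hedged illustration of the two solver families just listed, the MATLAB fragment below applies the backslash operator and a Jacobi-preconditioned conjugate gradient solver to a symmetric positive definite thermal system K*T = Rhs; it assumes that K holds the full symmetric matrix (not only its lower triangle), and the names T_direct, T_pcg, and Mjac are illustrative only.

    T_direct = K \ Rhs;                      % sparse direct solve (MATLAB backslash)
    n     = size(K, 1);
    Mjac  = spdiags(diag(K), 0, n, n);       % Jacobi (diagonal) preconditioner
    T_pcg = pcg(K, Rhs, 1e-8, 500, Mjac);    % Jacobi-preconditioned conjugate gradients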
Table 2. Performance Results for Different Software Packages for the Mechanical Problem (a)

Software                           Matrix Computation and Assembly   Solve   Solver Type
IFISS, Q2-P1 (5e5 DOFs)                         340                    298    MATLAB \
FEMLAB 3.3, T2+P-1 (2e5 DOFs)                     7                     66    UMFPACK
                                                                       186    ILU-GMRES
MILAMIN (opt), T2+P-1 (1e6 DOFs)                 15                     34    CHOLMOD (AMD)

(a) Note the different system sizes for this test.
Figure 8. Memory requirements of the thermal and mechanical versions of MILAMIN.
From the MATLAB packages that entered the performance comparison, AFEM excels with high performance. However, AFEM is specifically developed to operate with linear triangles solving the Poisson problem. This allows AFEM to employ only one integration point, and the amount of work performed is substantially less than for isoparametric quadratic elements, although the actual number of elements is higher for the test problem with a fixed number of nodes. IFISS is another MATLAB-based package capable of solving Poisson and incompressible Navier-Stokes problems on the basis of linear and quadratic quadrilateral meshes. Despite its aim of being a vectorized code, the performance of IFISS is not optimal. This is partly due to a badly performing boundary condition implementation. The matrix computation and assembly performance of the compiler-language and commercial codes is quite reasonable, with FEAP being the clear leader. However, none of the tested packages is as fast for the matrix computation and assembly as the optimized version of MILAMIN, and even the standard version of MILAMIN performs quite reasonably in comparison.

[84] The analysis of the solver times confirms our previous statement that for the studied 2-D problems direct solvers (CHOLMOD, UMFPACK, TAUCS, PARDISO, SPOOLES) are the best choice, with CHOLMOD being the best in the group. Iterative solvers, even if equipped with good preconditioners, like incomplete Cholesky or AMG, are not competitive with the direct solvers for the targeted problem size.

[85] A performance comparison of MILAMIN for a mechanical test problem is given in Table 2. The domain is again a box containing a circular hole (free surface) and a circular inclusion with a ten times higher viscosity than the matrix. The outer boundaries are set to Dirichlet conditions representing pure shear deformation. The number of available packages to solve incompressible Stokes problems with heterogeneous material is greatly reduced compared to the thermal problem. In fact the IFISS package is not capable of dealing with heterogeneous materials, and we used here an iso-viscous model. In the case of FEMLAB we had to employ the special MEMS module, which provides an incompressible Stokes application mode. However, even with this specialized module we were unable to fit the test problem into the 2 GB of RAM, and therefore the results are provided for a five times smaller problem size. MILAMIN outperforms IFISS as well as FEMLAB both in terms of matrix computation and assembly and in terms of solution time. The latter demonstrates that the iterative penalty approach chosen in MILAMIN and the resulting possibility to use a Cholesky solver (symmetric and positive definite system) is superior to other approaches.
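To illustrate the iterated penalty strategy referred to above, the following sketch shows a Powell and Hestenes loop written with the Table A1 variables Q, invM, PF, Rhs, Vel, Pressure, and the Cholesky factor L of the penalized velocity operator. It is a simplified illustration (full-size vectors, no boundary condition bookkeeping), not an excerpt of the MILAMIN sources.

    Pressure = zeros(np*nel, 1);
    for it = 1:3                                 % a few iterations; three are timed in Figure 6
        rhs_it   = Rhs - Q'*Pressure;            % velocity system with the current pressure
        Vel      = L'\(L\rhs_it);                % reuse the Cholesky factor of the penalized operator
        Div      = invM*(Q*Vel);                 % quasi-divergence of the velocity field
        Pressure = Pressure + PF*Div;            % pressure update
    end

Because the factor L is computed once and reused, each additional iteration costs only a forward and backward substitution, which is why this route stays cheap compared to solvers that must handle the indefinite saddle point system directly.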
Figure 9. Illustration of a one million node application problem modeled with MILAMIN. Steady state diffusion is solved in a heterogeneous rock with channels of high conductivity. Heat flow is imposed by a horizontal thermal gradient; i.e., T(left boundary) = 0, T(right boundary) = 1. Top and bottom boundary conditions are zero flux. (a) Conductivity distribution. (b) Flux visualized by cones and colored by magnitude, normalized by the flux in a homogeneous medium with the conductivity of the channels. Background color represents the conductivity. The triangular grid is the finite element mesh used for computation. Note that this picture corresponds only to a small subdomain of Figure 9a (see square outline).
6.5. Applications

[86] The power of MILAMIN to perform high-resolution calculations for heterogeneous problems is illustrated with a thermal and a mechanical application example. Figure 9 shows the heat flux through a heterogeneous rock requiring approximately one million nodes to resolve it. Figure 10 shows a mechanical application of MILAMIN. Gravity-driven incompressible Stokes flow is used to study the interaction of circular inclusions with different densities, leading to a stratification of the material; see Animation S1 (animations are available in HTML).

[87] MILAMIN not only allows us to study the overall response of the system, but also resolves the details of the flow pattern around the heterogeneities. Note that we see none of the pressure oscillation problems that are often caused by the incompressibility constraint [e.g., Pelletier et al., 1989].

[88] The MILAMIN strategies and package are applicable to a much broader class of problems than illustrated here. For example, transient thermal problems require only minor modifications to the thermal solver. As already mentioned, the mechanical solver is devised in a way that compressible and incompressible elastic problems can be easily treated, simply by variable substitution. Coupled thermomechanical problems, arising for example in mantle convection, only require that the developed thermal and mechanical models are combined in the same time loop. This results in an unstructured, Lagrangian mantle convection solver capable of efficiently dealing with hundreds of thousands of nodes [cf. Davies et al., 2007].

7. Conclusions

[89] We have demonstrated that it is possible to write an efficient native MATLAB implementation of the finite element method and achieved the goal to set up, process, and postprocess thermal and mechanical problems with one million degrees of freedom in one minute on a desktop computer.

[90] In our standard implementation we have combined all the state-of-the-art components required in a finite element implementation. These include efficient preprocessing, fast matrix assembly, exploiting matrix symmetry for storage, and employing the best available direct solver and reordering packages.
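A hedged sketch of the symmetric storage mentioned above: only the lower triangle of the global matrix is assembled from the triplet arrays of Table A1 (sparse2 is the SuiteSparse routine quoted earlier; MATLAB's own sparse has the same calling form), and the full operator is recreated only if a particular routine requires it.

    K      = sparse2(double(K_i(:)), double(K_j(:)), K_all(:));   % global matrix, lower triangle only
    K_full = K + tril(K, -1)';                                     % full symmetric matrix, formed only when needed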
Figure 10. Mechanical application example. Circular inclusions in a box subjected to a vertical gravity field. Black (heavy) and white (light) inclusions have the same density contrast with respect to the matrix. They are a hundred times more viscous than the matrix. Figures 10a and 10b show (unsmoothed) pressure perturbations, Figures 10c and 10d show maximum shear strain rate, and Figures 10e and 10f show the magnitude of the velocity field with superposed velocity arrows (random positions). All values are normalized by the corresponding maximum value generated by a single inclusion of the same size centered in the same box. Figures 10a, 10c, and 10e show the entire domain; Figures 10b, 10d, and 10f show a zoom-in with the superposed finite element mesh according to the white square.
MATLAB-specific optimizations include proper memory management (preallocation of arrays) and data structures, explicit type declaration for integer arrays, and efficient implementation of boundary conditions. In the case of the mechanical application the chosen penalty method together with the particular element type allows us to use the efficient Cholesky factorization to solve the incompressible flow problem. The clear structure of the code serves the educational purposes well. The results of our software comparison show that our standard version performs surprisingly efficiently even compared to packages implemented in compiler languages.
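The points about preallocation and explicit integer typing can be illustrated as follows; this is a sketch under the assumption of the array sizes listed in Table A1 (the helper name nlow is ours), not a verbatim excerpt.

    nlow      = nnodel*(nnodel+1)/2;               % entries in the lower triangle of an element matrix
    K_i       = zeros(nlow, nel, 'int32');         % row indices, preallocated once, explicitly typed
    K_j       = zeros(nlow, nel, 'int32');         % column indices, preallocated once, explicitly typed
    K_all     = zeros(nlow, nel);                  % element matrix entries, preallocated once
    ELEM2NODE = int32(ELEM2NODE);                  % explicit integer type for the connectivity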
Table A1. MILAMIN Variables (a)

Variable Group   Variable   Size   Description

Variable size ndim 1 number of dimensions


nel 1 number of elements
nnod 1 number of nodes
nnodel 1 number of nodes per element
nedof 1 number of thermal or velocity
degrees of freedom per element
np 1 number of pressure degrees of
freedom per element
nip 1 number of integration points per element
nelblo 1 number of elements per block
nblo 1 number of blocks
npha 1 number of material phases
nbc 1 number of constraint degrees of freedom
nfree 1 number of unconstraint degrees of freedom
Mesh ELEM2NODE [nnodel, nel] connectivity
Phases [1, nel] phase of elements
GCOORD [ndim, nnod] global coordinates of nodes
Integration points, IP_X [ndim, nip] local coordinates of integration points
shape functions
and their derivatives
IP_w [1, nip] weights of integration points
N {nip*[ nnodel, 1 ]} cell array of nip entries of shape
functions Ni evaluated at integration
points
dNdu {nip*[ nnodel, ndim]} cell array of nip entries of shape
functions derivatives wrt local
coordinates dNdui evaluated at
integration points
Geometry ECOORD_X [ndim, nnodel] global coordinates of nodes in element
J [ndim, ndim] Jacobian in integration point
invJ [ndim, ndim] inverse of Jacobian
detJ 1 or [nelblo,1] determinant of Jacobian (or aeib)
dNdX [nnodel, ndim] shape function derivatives wrt global
coordinates in integration point
ECOORD_x, ECOORD_y [nnodel, nelblo] global x and y coordinates for nodes (aeib)
Jx, Jy [nelblo, ndim] first (x) and second (y) row of Jacobian
in integration point (aeib)
invJx, invJy [nelblo, ndim] first (x) and second (y) column of
inverse of Jacobian (aeib)
dNdx, dNdy [nelblo, nnodel] shape function derivatives wrt global
x and y coordinate (aeib)
Auxiliary arrays indx_l [nedof*(nedof+1)/2,1] indices extracting lower part of
element matrix
Boundary conditions Free [1, nfree] unconstraint degrees of freedom
Bc_ind [1, nbc] constraint degrees of freedom
Bc_val [1, nbc] constraint boundary values
Solution perm [1,nfree] permutation vector reducing
factorization fill-in
L [nfree, nfree] sparse lower Cholesky factor of
global stiffness matrix
Rhs [nfree, 1] global right-hand-side vector
THERMAL
Materials D [npha,1] conductivities for different phases
ED 1 or [nelblo,1] conductivity of element (or aeib)
Matrix calculations K_elem [nnodel, nnodel] element stiffness matrix
K_block [nelblo, nnodel*(nnodel+1)/2] flattened element stiffness matrices
(aeib)
Triplet storage K_i [nnodel*(nnodel+1)/2, nel] row indices of triplet sparse format
for K_all
K_j column indices of triplet sparse format
for K_all
K_all flattened element stiffness matrices


for all elements
Solution stage K [nfree, nfree] sparse global stiffness matrix
(only lower part)
T [nnod, 1] unknown temperature vector
MECHANICAL
Materials Mu, Rho [npha,1] viscosity and density for different phases
EMu, ERho 1 or [nelblo,1] viscosity and density of element (or aeib)
Matrix calculations Pi [np,1] pressure shape functions in integration
point
P [np, np] auxiliary matrix containing global
coordinates of the corner nodes
Pb [np,1] auxiliary vector containing global
coordinates of integration point
B [nedof, ndim*(ndim+1)/2] kinematic matrix
A_elem [nedof, nedof] element stiffness matrix (velocity part)
Q_elem [np, nedof] element divergence matrix
M_elem [np, np] element pressure mass matrix
invM_elem [np, np] inverse of element pressure mass matrix
Rhs_elem [ndim, nedof] element right-hand-side vector
PF 1 penalty factor
GIP_x, GIP_y [1,nelblo] global x and y coordinates of integration point
(aeib)
Pi_block [nelblo, np] pressure shape functions in integration point
(aeib)
A_block [nelblo, nedof*(nedof+1)/2] flattened element stiffness matrices (aeib)
Q_block [nelblo, nedof*np] flattened element divergence matrices (aeib)
M_block [nelblo, np*(np+1)/2] flattened element pressure mass matrices (aeib)
invM_block [nelblo, np*np] flattened inverses of element pressure mass
matrices (aeib)
Rhs_block [nelblo, nedof] element right-hand-side vectors (aeib)
Triplet storage Rhs_all [nedof, nel] element right-hand-side vectors for all elements
A_i [nedof*(nedof+1)/2, nel] row indices of triplet sparse format
for A_all
A_j column indices of triplet sparse format
for A_all
A_all flattened element stiffness matrices
for all elements
Q_i [nedof*np, nel] row indices of triplet sparse format
for Q_all
Q_j column indices of triplet sparse format
for Q_all
Q_all flattened element divergence matrices
for all elements
invM_i [np*np, nel] row indices of triplet sparse format
for invM_all
invM_j column indices of triplet sparse format
for invM_all
invM_all flattened inverses of element pressure mass
matrices for all elements
Solution stage A [nfree, nfree] sparse global stiffness matrix
(only lower part)
Q [np*nel, ndim*nnod] sparse divergence matrix
invM [np*nel, np*nel] sparse pressure mass matrix
Div [nel*np, 1] quasi-divergence vector
Vel [ndim*nnod, 1] unknown velocity vector
Pressure [nel*np, 1] unknown pressure vector
(a) Note: "aeib" stands for "all elements in block."
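To show how the solution-stage variables of Table A1 fit together, the following hedged sketch reorders, factorizes, and solves the thermal system; it assumes that K and Rhs have already been reduced to the unconstrained (Free) degrees of freedom and that K holds the full symmetric matrix, and it uses MATLAB's built-in amd and chol in place of the CHOLMOD routines (lchol) quoted in the text.

    perm          = amd(K);                          % fill-reducing reordering
    L             = chol(K(perm, perm), 'lower');    % sparse lower Cholesky factor
    T             = zeros(nnod, 1);
    T(Bc_ind)     = Bc_val;                          % prescribed Dirichlet values
    T(Free(perm)) = L'\(L\Rhs(perm));                % forward and backward substitution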
[91] Furthermore, in our optimized version we have improved the efficiency of the stiffness matrix calculations, which resulted in an overall execution speedup of approximately 4 times with respect to the standard version. This has been done by minimizing the ratio of overhead (BLAS and MATLAB) to computation. Another priority was to avoid unnecessary data transfers and promote cache reuse, as memory speed is a major bottleneck on current computer architectures. Particular optimizations to the matrix computation algorithm include (1) increased performance of the BLAS operations by interchanging loops and operating on large matrices, (2) reducing the total operation count by exploiting the symmetry of the system, and (3) facilitating cache reuse through the introduction of blocking.
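A hedged sketch of how points (1)-(3) translate into code for the thermal element matrices is given below; it uses the variable names of Table A1, assumes for brevity that nel is a multiple of nelblo, and is an illustration of the technique rather than a verbatim excerpt of MILAMIN.

    for ib = 1:nblo                                        % loop over blocks of elements
        els      = (ib-1)*nelblo + (1:nelblo);             % element indices of this block
        ECOORD_x = reshape(GCOORD(1, ELEM2NODE(:, els)), nnodel, nelblo);
        ECOORD_y = reshape(GCOORD(2, ELEM2NODE(:, els)), nnodel, nelblo);
        ED       = D(Phases(els));                         % conductivities, all elements in block
        K_block  = zeros(nelblo, nnodel*(nnodel+1)/2);
        for ip = 1:nip                                     % integration loop outside the element loop (1)
            Jx   = ECOORD_x'*dNdu{ip};                     % Jacobian rows for the whole block (BLAS)
            Jy   = ECOORD_y'*dNdu{ip};
            detJ = Jx(:,1).*Jy(:,2) - Jx(:,2).*Jy(:,1);
            invd = 1./detJ;
            dNdx = [ Jy(:,2).*invd, -Jy(:,1).*invd]*dNdu{ip}';   % global derivatives,
            dNdy = [-Jx(:,2).*invd,  Jx(:,1).*invd]*dNdu{ip}';   % all elements at once
            w    = IP_w(ip)*detJ.*ED;                      % quadrature weight times conductivity
            k    = 0;
            for i = 1:nnodel                               % accumulate the lower triangle only (2)
                for j = i:nnodel
                    k = k + 1;
                    K_block(:,k) = K_block(:,k) + w.*(dNdx(:,i).*dNdx(:,j) + dNdy(:,i).*dNdy(:,j));
                end
            end
        end
        K_all(:, els) = K_block';                          % store flattened element matrices
    end

The products ECOORD_x'*dNdu{ip} and [...]*dNdu{ip}' are the loop-interchanged BLAS calls: instead of many small per-element operations, each call works on an nelblo-sized block, and nelblo can be tuned so that the block data stay in cache, which is the blocking idea of point (3).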
[92] Our implementation of the matrix computation achieves a sustained performance of 350 Mflops for any system size. Any further performance improvements to this part of the code are irrelevant, since even for the smallest systems the matrix computation now takes only a fraction of the total solution time, with the solver being the bottleneck.

[93] By paying attention to the strategies outlined in this article, MATLAB-based MILAMIN can not only be used as a development and prototype tool, but also as a production tool for the analysis of two-dimensional problems with millions of unknowns within minutes. The complete MILAMIN source code is available from the authors and can be downloaded as auxiliary material (see Software S1).

Appendix A

[94] Table A1 lists the variables used throughout the paper and in the code to facilitate its understanding. Variable names, their sizes, and short descriptions are given.

Acknowledgments

[95] This work was supported by the Norwegian Research Council through a Centre of Excellence grant to PGP. We would like to thank Tim Davis, the author of the SuiteSparse package, for making this large suite of tools available and giving us helpful comments. We would also like to thank J. R. Shewchuk for making the mesh generator Triangle freely available. We are grateful to Antje Keller for her help regarding code benchmarking. We thank Galen Gisler for improving the English. The manuscript benefited from the reviews by Boris Kaus and Eh. Tan and the editorial work of Peter van Keken. Finally, we would like to thank Yuri Podladchikov for his never-ending enthusiasm and stimulation.

References

Alberty, J., et al. (1999), Remarks around 50 lines of Matlab: Short finite element implementation, Numer. Algorithms, 20, 117–137.
Bathe, K.-J. (1996), Finite Element Procedures, vol. XIV, 1037 pp., Prentice-Hall, London.
Brezzi, F., and M. Fortin (1991), Mixed and Hybrid Finite Elements Methods, vol. ix, 350 pp., Springer, New York.
Cuvelier, C., et al. (1986), Finite Element Methods and Navier-Stokes Equations, vol. XVI, 483 pp., D. Reidel, Dordrecht, Netherlands.
Davies, D. R., et al. (2007), Investigations into the applicability of adaptive finite element methods to two-dimensional infinite Prandtl number thermal and thermochemical convection, Geochem. Geophys. Geosyst., 8, Q05010, doi:10.1029/2006GC001470.
Davis, T. A. (2006), Direct Methods for Sparse Linear Systems, Soc. for Ind. and Appl. Math., Philadelphia, Pa.
Davis, T. A., and W. W. Hager (2005), Row modifications of a sparse Cholesky factorization, SIAM J. Matrix Anal. Appl., 26, 621–639.
Dongarra, J. J., et al. (1990), A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Software, 16, 1–17.
Dunavant, D. A. (1985), High degree efficient symmetrical Gaussian quadrature rules for the triangle, Int. J. Numer. Methods Eng., 21, 1129–1148.
Elman, H. C., et al. (2005), Finite Elements and Fast Iterative Solvers With Applications in Incompressible Fluid Dynamics, 400 pp., Oxford Univ. Press, New York.
Ferencz, R. M., and T. J. R. Hughes (1998), Implementation of element operations, in Handbook of Numerical Analysis, edited by P. G. Ciarlet and J. L. Lions, pp. 39–52, Elsevier, New York.
Fletcher, C. A. J. (1997), Computational Techniques for Fluid Dynamics, 3rd ed., Springer, Berlin.
Gould, N. I. M., et al. (2007), A numerical evaluation of sparse direct solvers for the solution of large sparse symmetric linear systems of equations, ACM Trans. Math. Software, 33(2), article 10, doi:10.1145/1206040.1206043.
Hughes, T. J. R. (2000), The Finite Element Method: Linear Static and Dynamic Finite Element Analysis, vol. XXII, 682 pp., Dover, Mineola, N. Y.
Hughes, T. J. R., et al. (1987), Large-scale vectorized implicit calculations in solid mechanics on a Cray X-MP/48 utilizing EBE preconditioned conjugate gradients, Comput. Methods Appl. Mech. Eng., 61, 215–248.
Kwon, Y. W., and H. Bang (2000), The Finite Element Method Using MATLAB, 2nd ed., 607 pp., CRC Press, Boca Raton, Fla.
Limache, A., et al. (2007), The violation of objectivity in Laplace formulations of the Navier-Stokes equations, Int. J. Numer. Methods Fluids, 54, 639–664.
Pelletier, D., et al. (1989), Are FEM solutions of incompressible flows really incompressible? (or how simple flows can cause headaches!), Int. J. Numer. Methods Fluids, 9, 99–112.
Persson, P. O., and G. Strang (2004), A simple mesh generator in MATLAB, SIAM Rev., 46, 329–345.
Pozrikidis, C. (2005), Introduction to Finite and Spectral Element Methods Using MATLAB, 653 pp., CRC Press, Boca Raton, Fla.
Sigmund, O. (2001), A 99 line topology optimization code written in Matlab, Struct. Multidisciplinary Optim., 21, 120–127.
Silvester, D. J. (1988), Optimizing finite-element matrix calculations using the general technique of element vectorization, Parallel Comput., 6, 157–164.
Toledo, S. (1997), Improving the memory-system performance of sparse-matrix vector multiplication, IBM J. Res. Dev., 41, 711–725.
Wesseling, P. (1992), An Introduction to Multigrid Methods, 284 pp., John Wiley, Chichester, N. Y.
Zienkiewicz, O. C., and R. L. Taylor (2000), The Finite Element Method, 5th ed., Butterworth-Heinemann, Oxford, U. K.