
Implementing a Linux Cluster

A project report submitted in partial fulfilment of the requirements for the award of the degree of

Bachelor of Technology

in Computer Science and Engineering

by

Deepak Lukose
Roll No: 16, Group No: 8, S8 CSE


Department of Computer Engineering

National Institute of Technology, Calicut

Kerala - 673601

2006

Certificate

This is to certify that the project entitled “Implementing a Linux Cluster” is a bonafide record of the project presented by Deepak Lukose (Y2.025) under our supervision and guidance. The project report has been submitted to the Department of Computer Engineering of National Institute of Technology, Calicut in partial fulfilment of the requirements for the award of the Degree of Bachelor of Technology in Computer Science and Engineering.

Dr. M.P. Sebastian
Professor and Head
Dept. of Computer Engineering
NIT Calicut

Mr. Vinod Pathari
Lecturer
Dept. of Computer Engineering
NIT Calicut


Abstract

A computer cluster is a group of loosely coupled computers that work together closely, so that in many respects the group can be viewed as a single computer. Clusters are commonly connected through fast local area networks and are usually deployed to improve speed and/or reliability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed or reliability.

Clusters built from open source software, particularly those based on the GNU/Linux operating system, are increasingly popular. Their success is not hard to explain, because they can cheaply solve an ever-widening range of number-crunching applications. A wealth of open source or free software has emerged to make it easy to set up, administer, and program these clusters. This work aims at an implementation of a free and open source cluster for performing scientific computations at a faster pace.


Acknowledgements

I express my sincere thanks to Mr. Vinod Pathari for his constant backing and support. I would like to extend my gratitude to the entire faculty and staff of the CSE Department of NITC, who stood by me in all the difficulties I had to face during the completion of this project. Last but not least, I thank God Almighty for being the guiding light all throughout.

Deepak Lukose


Contents

1 Introduction                                             1
  1.1 Problem Definition                                   1
  1.2 Cluster history                                      1
  1.3 High-performance (HPC) clusters                      2
  1.4 Cluster technologies                                 2

2 Motivation                                               4

3 Design                                                   5
  3.1 Design decisions                                     6
      3.1.1 Mission                                        6
      3.1.2 Operating System                               6
      3.1.3 Cluster Software                               6
      3.1.4 Hardware                                       7

4 Cluster Installation                                     8
  4.1 OSCAR                                                8
  4.2 Packages                                             9
  4.3 Installation Strategy                               10

5 Parallel Programming                                    11
  5.1 Hello World                                         11
      5.1.1 MPI_Init                                      12
      5.1.2 MPI_Finalize                                  12
      5.1.3 MPI_Comm_size                                 13
      5.1.4 MPI_Comm_rank                                 13
      5.1.5 MPI_Get_processor_name                        14
  5.2 Numerical Integration                               15

6 Molecular Dynamics Modeling of Thermal Conductivity of Engineering
  Fluids and its Enhancement due to Nanoparticle Inclusion 19
  6.1 Algorithm                                           20
  6.2 Profiling                                           21

7 Testing and Benchmarking                                26

8 Conclusion                                              29

Bibliography                                              30

Chapter 1

Introduction

1.1 Problem Definition

To set up a cluster which can increase the performance of CPU-intensive tasks like scientific computations, graphics rendering, etc. In short, to create a high-performance cluster for general purpose computing, with special regard to modeling and simulation of molecular systems.

1.2 Cluster history

The first commodity clustering product was ARCnet, developed by Datapoint in 1977. ARCnet was not a commercial success and clustering did not really take off until DEC released their VAXcluster product in the 1980s for the VAX/VMS operating system. The ARCnet and VAXcluster products not only supported parallel computing, but also shared file systems and peripheral devices. They were supposed to give us the advantage of parallel processing while maintaining data reliability and uniqueness. VAXcluster, now VMScluster, is still available on OpenVMS systems from HP running on Alpha and Itanium systems. The history of cluster computing is intimately tied up with the evolution of networking technology. As networking technology has become cheaper and faster, cluster computers have become significantly more attractive.


1.3 High-performance (HPC) clusters

High-performance clusters are implemented primarily to provide increased performance by splitting a computational task across many different nodes in the cluster, and are most commonly used in scientific computing. One of the more popular HPC implementations is a cluster with nodes running Linux as the OS and free software to implement the parallelism. This configuration is often referred to as a Beowulf cluster. Such clusters commonly run custom programs which have been designed to exploit the parallelism available on HPC clusters. Many such programs use libraries such as the Message Passing Interface (MPI), which are specially designed for writing scientific applications for HPC computers.

1.4 Cluster technologies

The Message Passing Interface (MPI) is a computer communications protocol. It is a de facto standard for communication among the nodes running a parallel program on a distributed memory system. MPI is a library of routines that can be called from Fortran, C, C++ and Ada programs. MPI's advantage over older message passing libraries is that it is both portable (because MPI has been implemented for almost every distributed memory architecture) and fast (because each implementation is optimized for the hardware it runs on). The GNU/Linux world sports various cluster software, such as:

Specialized application clusters - Clusters which are built using specialized applications fall in this category, e.g. Beowulf, distcc, MPICH and others. distcc provides parallel compilation when using GCC.

Director-based clusters - These clusters allow incoming requests for services to be distributed across multiple cluster nodes, e.g. Linux Virtual Server, Linux-HA.

Full-blown clusters - This category includes those clusters which are integrated into the kernel mechanism for automatic process migration among homogeneous nodes, e.g. Mosix, openMosix, Kerrighed, OpenSSI etc. OpenSSI and Kerrighed are single-system image implementations.

DragonFly BSD, a recent fork of FreeBSD 4.8, is being redesigned at its core to enable native clustering capabilities. It also aims to achieve single-system image capabilities.

MSCS is Microsoft's high-availability cluster service for Windows, based on technology developed by Digital Equipment Corporation. The current version supports up to eight nodes in a single cluster, typically connected to a SAN. A set of APIs supports cluster-aware applications, while generic templates provide support for non-cluster-aware applications.

Grid computing is a technology closely related to cluster computing. The key differences between grids and traditional clusters are that grids connect collections of computers which do not fully trust each other, and hence operate more like a computing utility than like a single computer. In addition, grids typically support more heterogeneous collections than are commonly supported in clusters.

Chapter 2

Motivation

Most of the time, a computer is idle. Start a program like xload or top that monitors the system use, and one will probably find that the processor's load is not even hitting the 1.0 mark. If one has two or more computers, chances are that at any given time, at least one of them is doing nothing. Unfortunately, when we really do need CPU power - during a C++ compile, or while encoding Ogg Vorbis music files - we need a lot of it at once. The idea behind clustering is to spread these loads among all available computers, using the resources that are free on other machines.

The basic unit of a cluster is a single computer, also called a “node”. Clusters can grow in size - they “scale” - by adding more machines. The power of the cluster as a whole depends on the speed of the individual computers and the speed of their interconnections. In addition, the operating system of the cluster must make the best use of the available hardware in response to changing conditions. This becomes more of a challenge if the cluster is composed of different hardware types (a “heterogeneous” cluster), if the configuration of the cluster changes unpredictably (machines joining and leaving the cluster), and if the loads cannot be predicted ahead of time.

Chapter 3

Design

Designing a cluster entails four sets of design decisions:

(1) Determine the overall mission of the cluster.

(2) Select a general architecture for the cluster.

(3) Select the operating system, cluster software, and other system software that will be used.

(4) Select the hardware for the cluster.

While each of these tasks, in part, depends on the others, the first step is crucial. If at all possible, the cluster's mission should drive all other design decisions. At the very least, the other design decisions must be made in the context of the cluster's mission and be consistent with it. Selecting the hardware should be the final step in the design, but often we won't have as much choice as we would like. A number of constraints may force us to select the hardware early in the design process; the most obvious is the budget. Defining what we want to do with the cluster is really the first step in designing it. For many clusters, the mission will be clearly understood in advance. This is particularly true if the cluster has a single use or a few clearly defined uses.


But it should be noted that clusters have a way of evolving. What may be a reasonable assessment of needs today may not be tomorrow. Good design is often the art of balancing today's resources with tomorrow's needs.

3.1 Design decisions

3.1.1 Mission

To set up a cluster which can increase the performance of CPU-intensive tasks like scientific computations, graphics rendering, etc. In short, to create a high-performance cluster for general purpose computing, with special regard to modeling and simulation of molecular systems.

3.1.2 Operating System

RedHat Enterprise Linux 4 (RHEL4) was selected because it is one of the platforms most widely supported by cluster software.

3.1.3 Cluster Software

OSCAR (Open Source Cluster Application Resources) is a high performance Linux cluster kit that is available for parallel/distributed computing needs. OSCAR scales well across Linux distributions. For greater control over how the cluster is configured, one will be happier with OSCAR in the long run. Typically, OSCAR provides better documentation than other cluster kits like Rocks. OSCAR was chosen over a traditional Beowulf setup due to its ease of installation as well as its comprehensive package set, which includes many compilers and application softwares. One of the main packages being used on OSCAR is LAM/MPI, a popular implementation of the MPI parallel programming paradigm. LAM/MPI can be used for developing parallel programs and it is one of the leading implementations available.


3.1.4 Hardware

The preferred hardware is P4 3.0 GHz machines with 2 GB DDR RAM on each node and a 1 Gbps/100 Mbps NIC which supports PXE (Preboot Execution Environment) for network booting. The head node must have at least 7 GB of hard disk capacity and all other client nodes need a minimum of 5 GB of hard disk capacity.

Chapter 4

Cluster Installation

One of the more important developments in the short life of high performance clusters has been the creation of cluster installation kits such as OSCAR (Open Source Cluster Application Resources) and Rocks. With software packages like these, it is possible to install everything one needs and very quickly have a fully functional cluster. A fully functional cluster will have a number of software packages, each addressing a different need, such as programming, management, and scheduling.

4.1 OSCAR

OSCAR is a software package that is designed to simplify cluster installation. A collection of open source cluster software, OSCAR includes everything that one is likely to need for a dedicated, high-performance cluster. OSCAR uses a best-in-category approach, selecting the best available software for each type of cluster-related task. One will often have several products to choose from for any given need.

The design goals for OSCAR include using the best-of-class software, eliminating the downloading, installation, and configuration of individual components, and moving toward the standardization of clusters. OSCAR, it is said, reduces the need for expertise in setting up a cluster because OSCAR takes us completely through the installation of a cluster. In practice, it might be more fitting to say that OSCAR delays the need for expertise and allows us to create a fully functional cluster before mastering all the skills one will eventually need. In the long run, one will want to master those packages in OSCAR. OSCAR makes it very easy to experiment with packages and dramatically lowers the barrier to getting started.

OSCAR is designed with high-performance computing in mind. So, unless one customizes the installation, the compute nodes are meant to be dedicated to the cluster.

4.2 Packages

OSCAR brings together a number of software packages for clustering. Most of these are also available as standalone packages. The main packages in OSCAR are:

Core: This is the core OSCAR package.

C3: The Cluster, Command, and Control tool suite provides a command-line administration interface. For example, “cexec mkdir /opt/c3-4” will create the directory on all clients. Similarly we can use commands like cget, ckill, cpush, crm and cshutdown.

Environmental Switcher: This is based on Modules, a Perl script that allows the user to make changes to the environment of future shells. For example, Switcher allows a user to change between MPICH and LAM/MPI. “switcher mpi = mpich-ch-p4-gcc-1.2.5.10” can be used to set the execution environment needed for the MPICH library.

SIS: The System Installation Suite is used to install the operating systems on the clients.

Monitoring systems: Clumon, a web-based performance-monitoring system, and Ganglia, a real-time monitoring system and execution environment, are the softwares included in this category.

MAUI: This job scheduler is used with openPBS.

openPBS: The portable batch system is a workload management system.

PVFS: Parallel Virtual File System is a high-performance, scalable, parallel virtual file system.

Any high-performance cluster would be incomplete without programming tools. The OSCAR distribution includes:

LAM/MPI: This is one implementation of the Message Passing Interface (MPI) libraries.

MPICH: This is another implementation of the Message Passing Interface (MPI) libraries.

PVM: This package provides the Parallel Virtual Machine system, another message passing library.

4.3 Installation Strategy

With OSCAR, one first installs Linux (but only on the head node) and then installs OSCAR. The installations of the two are separate. This makes the installation more involved, but it gives us more control over the configuration of the system, and it is somewhat easier to recover when we encounter installation problems. And because the OSCAR installation is separate from the Linux installation, we are not tied to a single Linux distribution. OSCAR uses a system image cloning strategy to distribute the disk image to the compute nodes. With OSCAR it is best to use the same hardware throughout the cluster. OSCAR's thin client model is designed for diskless systems.

Chapter 5

Parallel Programming

5.1 Hello World

It is customary to start learning a new programming language with a “Hello World” program, so here also we will start with a “Hello World” program. The parallel version of the program (using LAM/MPI) is as follows:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int  processId;                             /* rank of process */
    int  noProcesses;                           /* number of processes */
    int  nameSize;                              /* length of name */
    char computerName[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
    MPI_Comm_rank(MPI_COMM_WORLD, &processId);
    MPI_Get_processor_name(computerName, &nameSize);

    fprintf(stderr, "Hello from process %d on %s\n", processId, computerName);

    MPI_Finalize();
    return 0;
}

This example introduces five MPI functions, defined through the inclusion of the header file for the MPI library, mpi.h, and included when the MPI library is linked to the program. While this example uses C, similar libraries are available for C++ and FORTRAN. Four of these functions, MPI_Init, MPI_Comm_size, MPI_Comm_rank, and MPI_Finalize, are seen in virtually every MPI program.

5.1.1 MPI_Init

MPI_Init is used to initialize an MPI session. All MPI programs must have a call to MPI_Init. MPI_Init is called once, typically at the start of a program. One can have lots of other code before this call, or one can even call MPI_Init from a subroutine, but one should call it before any other MPI functions are called. (There is an exception: the function MPI_Initialized can be called before MPI_Init. MPI_Initialized is used to see if MPI_Init has been previously called.) In C, MPI_Init can be called with the addresses for argc and argv as shown in the example. This allows the program to take advantage of command-line arguments. Alternatively, these addresses can be replaced with NULL.
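As a minimal sketch of these two points (this example is not part of the original report), the following program checks MPI_Initialized before calling MPI_Init with NULL arguments instead of the addresses of argc and argv:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int alreadyStarted;

    /* MPI_Initialized may be called before MPI_Init */
    MPI_Initialized(&alreadyStarted);
    if (!alreadyStarted)
        MPI_Init(NULL, NULL);   /* command-line arguments are not passed to MPI */

    /* ... normal MPI code would go here ... */

    MPI_Finalize();
    return 0;
}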

5.1.2 MPI_Finalize

MPI_Finalize is called to shut down MPI. MPI_Finalize should be the last MPI call made in a program. It is used to free memory, etc. It is the user's responsibility to ensure that all pending communications are complete before a process calls MPI_Finalize. Every process must call MPI_Finalize.

5.1.3 MPI_Comm_size

This routine is used to determine the total number of processes running in a communicator (the communications group for the processes being used). It takes the communicator as the first argument and the address of an integer variable used to return the number of processes. For example, if one is executing a program using five processes and the default communicator, the value returned by MPI_Comm_size will be five, the total number of processes being used. This is the number of processes, but not necessarily the number of machines being used.

In the hello world program, both MPI_Comm_size and MPI_Comm_rank used the default communicator, MPI_COMM_WORLD. This communicator includes all the processes available at initialization and is created automatically. Communicators are used to distinguish and group messages. As such, communicators provide a powerful encapsulation mechanism. While it is possible to create and manipulate one's own communicators, the default communicator will probably satisfy most needs.
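As an illustration of a custom communicator (a hedged sketch, not used in the report's own programs; the even/odd split is chosen only for the example), MPI_Comm_split can divide MPI_COMM_WORLD into smaller groups:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int worldRank, groupRank;
    MPI_Comm halfComm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);

    /* processes with the same "color" (even or odd rank) join the same new communicator */
    MPI_Comm_split(MPI_COMM_WORLD, worldRank % 2, worldRank, &halfComm);
    MPI_Comm_rank(halfComm, &groupRank);

    printf("World rank %d has rank %d in its sub-communicator\n",
           worldRank, groupRank);

    MPI_Comm_free(&halfComm);
    MPI_Finalize();
    return 0;
}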

5.1.4 MPI_Comm_rank

MPI_Comm_rank is used to determine the rank of the current process within the communicator. MPI_Comm_rank takes a communicator as its first argument and the address of an integer variable used to return the value of the rank.

Basically, each process is assigned a different process number or rank within a communicator. Ranks range from 0 to one less than the size returned by MPI_Comm_size. For example, if one is running a set of five processes, the individual processes will be numbered 0, 1, 2, 3, and 4. By examining its rank, a process can distinguish itself from other processes. The values returned by MPI_Comm_size and MPI_Comm_rank are often used to divide up a problem among processes. Each individual process can examine its rank to determine its role in the calculation. For example, the process with rank 0 might work on the first part of the problem, the process with rank 1 will work on the second part of the problem, etc. One can also divide up the problem differently. For example, the process with rank 0 might collect all the results from the other processes for the final report rather than participate in the actual calculation. It is really up to the programmer to determine how to use this information.
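As a minimal sketch of this rank-based division of labour (not part of the original report; the problem size of 100 items is assumed purely for illustration), each process below computes the slice of an index range it would be responsible for:

#include "mpi.h"
#include <stdio.h>

#define TOTAL_ITEMS 100   /* assumed problem size for the illustration */

int main(int argc, char *argv[])
{
    int rank, size, start, end;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* divide TOTAL_ITEMS as evenly as possible among the processes */
    start = rank * TOTAL_ITEMS / size;
    end   = (rank + 1) * TOTAL_ITEMS / size;

    /* each process can now loop over [start, end) independently;
       rank 0 could instead be reserved for collecting results */
    printf("Process %d of %d handles items %d to %d\n",
           rank, size, start, end - 1);

    MPI_Finalize();
    return 0;
}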

5.1.5 MPI_Get_processor_name

MPI_Get_processor_name is used to retrieve the host name of the node on which the individual process is running. In the sample program, we used it to display host names. The first argument is an array to store the name and the second is used to return the actual length of the name.

Each of the C versions of these five functions returns an integer error code. With a few exceptions, the actual code is left up to the implementers. Error codes can be translated into meaningful messages using the MPI_Error_string function.
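For instance (a hedged sketch, not taken from the report), the integer returned by an MPI call can be translated like this; MPI_Barrier is used here only as a convenient call whose return code we can capture:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    char message[MPI_MAX_ERROR_STRING];
    int  messageLength, errorCode;

    MPI_Init(&argc, &argv);

    /* capture the error code returned by an MPI call */
    errorCode = MPI_Barrier(MPI_COMM_WORLD);

    /* translate the code into a human-readable string */
    MPI_Error_string(errorCode, message, &messageLength);
    fprintf(stderr, "MPI_Barrier returned: %s\n", message);

    MPI_Finalize();
    return 0;
}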

Here is an example of compiling and running the code:

[deepak@ssl1 ~]$ mpicc hello.c -o hello
[deepak@ssl1 ~]$ mpirun -np 5 hello
Messages received by process 0 on ssl1.
Greetings from process 1 on ssl3.deltaforce!
Greetings from process 2 on ssl1!
Greetings from process 3 on ssl3.deltaforce!
Greetings from process 4 on ssl1!

There is no apparent order in the output. This will depend on the speed of the individual machines, the loads on the machines, and the speeds of the communications links. Unless one takes explicit measures to control the order of execution among processors, one should not make assumptions about the order of execution.

When running the program, the user specifies the number of processes on the command line. MPI_Comm_size provides a way to get that information back into the program. Next time, if one wants to use a different number of processes, just change the command line and the code will take care of the rest.

5.2 Numerical Integration

This is a fairly standard problem for introducing parallel calculations because it can be easily decomposed into parts that can be shared among the computers in a cluster. Although in most cases it can be solved quickly on a single processor, the parallel solution illustrates all the basics one needs to get started writing MPI code. The reason this area problem is both interesting and commonly used is that it is very straightforward to subdivide: we can let different computers calculate the areas for different rectangles. Basically, MPI_Comm_size and MPI_Comm_rank are used to divide the problem among processors. MPI_Send is used to send the intermediate results back to the process with rank 0, which collects the results with MPI_Recv and prints the final answer. Here is the program:

#include "mpi.h"
#include <stdio.h>

/* problem parameters */
#define f(x)          ((x) * (x))
#define numberRects   50
#define lowerLimit    2.0
#define upperLimit    5.0

int main(int argc, char *argv[])
{
    /* MPI variables */
    int dest, noProcesses, processId, src, tag;
    MPI_Status status;

    /* problem variables */
    int i;
    double area, at, height, lower, width, total, range;

    /* MPI setup */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
    MPI_Comm_rank(MPI_COMM_WORLD, &processId);

    /* adjust problem size for subproblem */
    range = (upperLimit - lowerLimit) / noProcesses;
    width = range / numberRects;
    lower = lowerLimit + range * processId;

    /* calculate area for subproblem */
    area = 0.0;
    for (i = 0; i < numberRects; i++)
    {
        at = lower + i * width + width / 2.0;
        height = f(at);
        area = area + width * height;
    }

    /* collect information and print results */
    tag = 0;
    if (processId == 0)          /* if rank is 0, collect results */
    {
        total = area;
        for (src = 1; src < noProcesses; src++)
        {
            MPI_Recv(&area, 1, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &status);
            total = total + area;
        }
        fprintf(stderr, "The area from %f to %f is: %f\n",
                lowerLimit, upperLimit, total);
    }
    else                         /* all other processes only send */
    {
        dest = 0;
        MPI_Send(&area, 1, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
    }

    /* finish */
    MPI_Finalize();
    return 0;
}


In this example, we are calculating the area under the curve y = f(x) = x² between x = 2 and x = 5. Since each process only needs to do part of this calculation, we need to divide the problem among the processes so that each process gets a different part and all the parts are accounted for. MPI_Comm_size is used to determine the number of parts the problem will be broken into, noProcesses. That is, we divide the total range (2 to 5) equally among the processes and adjust the start of the range for an individual process based on its rank. In the next section of code, each process calculates the area for its part of the problem. Then we need to collect and combine all the individual results. One process will act as a collector to which the remaining processes will send their results. Using the process with rank 0 as the receiver is the logical choice. The remaining processes act as senders. A fair amount of MPI code development can be done on a single processor system and then moved to a multiprocessor environment.
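As an aside (a hedged sketch, not part of the original program), the same "collect on rank 0" pattern can also be expressed with a single collective call, MPI_Reduce, which sums every process's partial value into one variable on the root process:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int    processId, noProcesses;
    double area, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
    MPI_Comm_rank(MPI_COMM_WORLD, &processId);

    /* stand-in for the per-process partial area computed in the program above */
    area = 1.0 * processId;

    /* sum every process's partial result into 'total' on rank 0 */
    MPI_Reduce(&area, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (processId == 0)
        printf("Combined result: %f\n", total);

    MPI_Finalize();
    return 0;
}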

Chapter 6

Molecular Dynamics Modeling of Thermal Conductivity of Engineering Fluids and its Enhancement due to Nanoparticle Inclusion

In the present study, an attempt is made to estimate the enhancement of the thermal conductivity of water by suspension of nanoparticles, using the method of Molecular Dynamics simulation. This involves the process of generating the atomic trajectories of a system of a finite number of particles by direct integration of the classical Newton's equations of motion, with appropriate specification of interatomic potentials and application of suitable initial and boundary conditions. Initially a general simulation procedure is developed for the thermal conductivity of a liquid and the procedure is validated against standard values. A few among the plausible theoretical models for the thermal conductivity of nanofluids have been selected and studied. Algorithms are made for simulating them, abiding by the procedural steps of the Molecular Dynamics method. The thermal conductivity enhancement in the base fluid due to suspension of nanoparticles, estimated using the simulations of the models considered, is compared among the models and with the existing experimental results, and further investigated to select the most appropriate model which matches best with a practical case of interest (metal oxide - water system). Parametric studies are conducted to study the variation of thermal conductivity enhancement with temperature, and the optimal dosing levels of nanofluids are also investigated. Further, an optimization of the simulation procedure and algorithms is attempted to bring out an efficient computation strategy.

6.1 Algorithm

main()
{
    initializePositions();
    initializeVelocities();
    for (time = 0; time < totalTimeSteps; time++)
    {
        velocityVerlet(dt);
        instantaneousTemperature();
        if (time % 200 == 0)
            rescaleVelocities();
        if (time > equilibrationtime)
            jt();
    }
    thermalConductivity();
}

velocityVerlet()
{
    computeAccelerations();    // accelerations a[t] at time 't'
    updatePositions;           // function of a[t]
    computeAccelerations();    // accelerations a[t+dt] at time 't+dt'
    updateVelocity;            // function of a[t] and a[t+dt]
}
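To make the velocityVerlet step concrete, here is a minimal, self-contained sketch of the same update for a single particle in one dimension (it is not the report's simulation code; the harmonic force a(x) = -x is assumed only so the example runs):

#include <stdio.h>

/* assumed toy force: harmonic oscillator with unit mass and spring constant */
static double acceleration(double x)
{
    return -x;
}

int main(void)
{
    double x = 1.0, v = 0.0;        /* initial position and velocity */
    double dt = 0.01;               /* time step */
    double a = acceleration(x);     /* a[t] */
    double aNew;
    int step;

    for (step = 0; step < 1000; step++)
    {
        /* position update uses a[t]; velocity update uses both a[t] and a[t+dt] */
        x = x + v * dt + 0.5 * a * dt * dt;
        aNew = acceleration(x);
        v = v + 0.5 * (a + aNew) * dt;
        a = aNew;
    }

    printf("x = %f, v = %f after 1000 steps\n", x, v);
    return 0;
}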


6.2 Profiling

It is generally said that a typical program will spend over 90% of its execution time in less than 10% of the actual code. This is just a rule of thumb or heuristic and, as such, will be wildly inaccurate or totally irrelevant for some programs. But for many, if not most, programs it is a reasonable observation. The actual numbers don't matter, since they will change from program to program; the idea is what is important: for most programs, most of the execution time is spent in a very small portion of the code.

If the application spends 95% of its time in 5% of the code, there is little to be gained by optimizing the other 95% of the code. Even if one could completely eliminate it, one would only see a 5% improvement. But if one can manage a 10% improvement in the critical 5% of the code, for example, we will see a 9.5% overall improvement in the program. Thus, the key to improving the code's performance is to identify that crucial 5%. That is the region where one should spend one's time optimizing code. There is a point of diminishing returns when optimizing code: we will need to balance the amount of time we spend optimizing with the amount of improvement we actually get, and there is a point where the code is good enough. The goals of profiling are two-fold: to decide how much optimization is worth doing, and to identify which parts of the code should be optimized.

For serial algorithms, one can often make reasonable estimates of how time is being spent by simply examining and analyzing the algorithm. The standard approach characterizes performance using some measurement of the problem size. Since the problem size often provides a bound for algorithmic performance, this approach is sometimes called asymptotic analysis.


Asymptotic analysis can be problematic with parallel programs for several reasons. First, it may be difficult to estimate the cost of communications required by a parallel solution. This can be further complicated by the need for additional code to coordinate communications among the processors. Second, there is often a less than perfect overlap among the communicating processes. A processor may be idle while it waits for its next task. In particular, it may be difficult to predict when a processor will be idle and what effect this will have on overall performance. For these and other reasons, an empirical approach to estimating performance is often the preferred approach for parallel programs. That is, we directly measure the performance of existing programs. Thus, with parallel programs, the most appropriate strategy is to select the best algorithm one can and then empirically verify its actual performance.
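One simple way to do such an empirical measurement (a hedged sketch, not from the report) is to bracket the region of interest with MPI_Wtime, which returns wall-clock time in seconds:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    double start, elapsed, sum = 0.0;
    long i;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    start = MPI_Wtime();                 /* wall-clock time before the region */

    for (i = 0; i < 10000000L; i++)      /* stand-in for the code being measured */
        sum += (double) i;

    elapsed = MPI_Wtime() - start;       /* wall-clock time spent in the region */
    printf("Process %d: loop took %f seconds (sum = %f)\n", rank, elapsed, sum);

    MPI_Finalize();
    return 0;
}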

The profile of the unoptimized code is as follows:

[delta16@athena project]$ time -p ./green_without_prof
Thermal Conductivity : 0.000922671
real 378.63
user 378.61
sys 0.00

  %   cumulative     self               self    total
 time    seconds   seconds     calls  ms/call  ms/call  name
49.31     398.03    398.03     40000     9.95     9.95  Accelerations()
34.33     675.12    277.09   1620000     0.17     0.17  lj(int)
15.88     803.34    128.21     15000     8.55    27.09  jt()
 0.47     807.09      3.75     20000     0.19    20.09  Verlet(double)
 0.13     808.10      1.01   1620000     0.00     0.17  ei(int)
 0.00     808.14      0.04       201     0.20     0.20  rescaleVelocities()
<some output removed for brevity>

From this profile we can find that there is scope for improvement in functions like Accelerations, lj and jt, because they account for more than 99% of the execution time. It can be noticed that a single call to the lj function takes very little time to execute, but since it is executed a large number of times, even a slight improvement in that code will heavily affect the execution time. It can be found from the code that the functions jt and Accelerations both have a complexity of O(N²). So the work done by this part of the code can be split among the individual nodes, as sketched below.
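A minimal sketch of that splitting (this is illustrative only and is not the report's actual MD code; the particle count, positions and the toy pair force are assumed): each process takes every size-th value of the outer loop index and the partial force arrays are then summed across all processes.

#include "mpi.h"
#include <stdio.h>

#define N 64                      /* assumed number of particles for the sketch */

int main(int argc, char *argv[])
{
    double x[N], localF[N], f[N];
    int i, j, rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* toy positions on a line, just so the sketch is runnable */
    for (i = 0; i < N; i++) {
        x[i] = (double) i;
        localF[i] = 0.0;
    }

    /* O(N^2) pair loop: each rank takes every size-th value of i */
    for (i = rank; i < N; i += size) {
        for (j = 0; j < N; j++) {
            if (j == i) continue;
            /* toy pair interaction; a real MD code would evaluate the
               Lennard-Jones force here */
            localF[i] += 1.0 / (x[i] - x[j]);
        }
    }

    /* combine the partial force arrays so every rank has the full set */
    MPI_Allreduce(localF, f, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("f[0] = %f, f[N-1] = %f\n", f[0], f[N - 1]);

    MPI_Finalize();
    return 0;
}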

The profile of the optimized code is as follows:

[delta16@athena project]$ mpirun -np 4 time -p ./mpi_green_pg
Thermal Conductivity : 0.000922359
real 155.74
user 134.28
sys 1.70
<some output removed for brevity>

  %   cumulative     self               self    total
 time    seconds   seconds     calls  ms/call  ms/call  name
54.35     182.83    182.83     40000     4.57     4.57  Accelerations()
22.37     258.11     75.27    405000     0.19     0.19  lj(int)
19.61     324.08     65.97     15000     4.40     9.45  jt()
 2.39     332.12      8.04     20000     0.40     9.54  Verlet(double)
<some output removed for brevity>

The comparison between the unoptimized and optimized code is shown below:

[Figure: bar charts comparing self ms/call, number of calls, and total time for the profiled functions, before and after optimization.]

With this optimized code we get a speedup of 2.43 on a single-node (4 CPU) cluster. The error in the computation, of the order of 10⁻⁷, is due to the presence of floating point arithmetic. Maximum speedup was achieved on dual processor machines by using shared memory instead of message passing. When the program was run on a cluster with two nodes (4 CPUs), the speedup achieved was in the range of 1.7 to 1.9.

Chapter 7

Testing and Benchmarking

Once the cluster is running, one needs to run a benchmark or two just to see how well it performs. There are three main reasons for running benchmarks. First, a benchmark will provide us with a baseline: if we make changes to the cluster or if we suspect problems with the cluster, we can rerun the benchmark to see if performance is really any different. Second, benchmarks are useful when comparing systems or cluster configurations; they can provide a reasonable basis for selecting between alternatives. Finally, benchmarks can be helpful with planning: if we can run several of them with differently sized clusters, we should be able to make better estimates of the impact of scaling the cluster.

High Performance Linpack, Hierarchical Integration (HINT) and the NAS Parallel Benchmarks are some common benchmarks available for clusters. As a yardstick of performance we are using the ‘best’ performance as measured by the LINPACK Benchmark. LINPACK was chosen because it is widely used and performance numbers are available for almost all relevant systems.

LINPACK is a software library for performing numerical linear algebra on digital computers. LINPACK makes use of the BLAS (Basic Linear Algebra Subprograms) libraries for performing basic vector and matrix operations. The LINPACK Benchmark is based on LINPACK, representing a measure of a system's floating point computing power.


It measures how fast a computer solves a dense n by n system of linear equations Ax = b, a common task in engineering. The solution is obtained by Gaussian elimination with partial pivoting, which takes roughly 2/3 n³ + 2n² floating point operations. The result is reported in millions of floating point operations per second (Mflop/s).
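As a rough worked example (not taken from the report), for a problem size of n = 1000 this operation count is 2/3 × 10⁹ + 2 × 10⁶ ≈ 6.69 × 10⁸ floating point operations, so a machine sustaining 1 Gflop/s would finish such a solve in roughly 0.67 seconds.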

For large scale distributed memory systems, the performance of a portable implementation of the High-Performance Linpack Benchmark is used as a performance measure for ranking supercomputers in the TOP500 list of the world's fastest computers. This performance does not reflect the overall performance of a given system, as no single number ever can. It does, however, reflect the performance of a dedicated system for solving a dense system of linear equations. Since the problem is very regular, the performance achieved is quite high, and the performance numbers give a good correction of peak performance. When the High Performance Linpack benchmark was run on the cluster it gave a peak performance of 1.380e-01 GFLOPS.

The following figure explains the relationship between the execution time and the number of available processors.

Chapter 8

Conclusion

The basic objective of this project was to set up a high performance computational cluster with special concern for molecular modelling. By using a cluster kit such as OSCAR, the first phase of the project, setting up a high performance cluster, could be completed. With the help of message passing libraries like LAM/MPI, the second phase of the project, improving the performance of a molecular modelling problem, was completed. The molecular modelling problem, which is considered to be a standard benchmark, when tested on a two node cluster gave a remarkable speedup of the order of 2.0. This increased efficiency came at the expense of higher code complexity. Finally, the testing phase was completed using the High Performance Linpack benchmark, which showed a peak performance of 1.380e-01 GFLOPS. Future work includes the formal analysis of existing code for automated conversion to a parallel version for cluster implementation.


Bibliography

[1] Joseph D. Sloan, High Performance Linux Clusters with OSCAR, Rocks, openMosix, and MPI, O'Reilly & Associates, 2004.

[2] Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, Tata McGraw-Hill, 2003.

[3] Zoltán Juhász, Péter Kacsuk and Dieter Kranzlmüller, Distributed and Parallel Systems: Cluster and Grid Computing, Springer, 2002.

[4] Stefan Böhringer, Building a diskless Linux Cluster for high performance computations from a standard Linux distribution, Technical Report, Institut für Humangenetik, Universitätsklinikum Essen, 2003.

[5] Linux clusters information centre, http://lcic.org/

[6] Linux Documentation Project, http://www.tldp.org/HOWTO/openMosix-HOWTO/index.html

[7] openMosix development site, http://openmosix.sourceforge.net/

[8] openMosix website, http://openmosix.org

[9] OSCAR website, http://oscar.openclustergroup.org

[10] Rocks cluster website, http://www.rocksclusters.org

[11] OpenSSI website, http://www.openssi.org/