
Proceedings of the First EELA-2 Conference
B. Marechal et al. (Eds.)
CIEMAT 2008
© 2008 The authors. All rights reserved

Implementation of two BEOWULF Clusters and Performance Evaluation using LAPACK, NAS and Skyvase Benchmarks
J. Mendoza1, M. Strobbe1, R. Zamudio1, W. Hoppe1, D. Aragón1, G. Roig1, J. Carrasco1, V.
Espinoza1, R. Delgado1, A. Montero1, C. Corzo1.
1 MIRG-UPCH Research Group, Universidad Peruana Cayetano Heredia
javiermp@upch.edu.pe

Abstract

Beowulf clusters consist of a server and many slave nodes connected through SSH and
NFS services, using message-passing and parallelization libraries such as MPI and PVM.
Benchmarks are sets of procedures that fall into four general categories of comparison tests:
application-based, playback, synthetic and inspection tests. The LAPACK benchmark performs
complex algebraic calculations on matrices of linear equations to compare the computing
times of the clusters, while the NAS Embarrassingly Parallel benchmark accumulates
two-dimensional statistics from a large number of pseudorandom Gaussian numbers. The Skyvase
benchmark, using PVMPOV, measures the performance of each slave through the time it takes
to return the rendered image to the server. OpenPBS/Torque allows the clusters to be
managed efficiently, because it is capable of queueing and delaying jobs
according to the available resources, avoiding memory overloading. Two Beowulf clusters
were implemented: TURI-03, with 11 Pentium Dual Core slave nodes with 1 GB of RAM, and
PANI-13, with 7 Pentium III nodes with 192 MB of RAM, using the 32-bit Scientific Linux 5.2
distribution and the HTTP and PXE protocols. The PVM and MPI (Open
MPI) libraries were installed, as well as the OpenPBS/Torque queueing system. The LAPACK, NAS and
Skyvase benchmarks were used to evaluate parameters such as the performance,
speed, capacity and response time of both clusters, TURI-03 and PANI-13.

1. Introduction

Due to the technological advances of recent decades, an increasing number of
companies and educational centres have recognized the need for high-performance
computing systems. As a result, the use of clusters is becoming more widespread. The
Universidad Peruana Cayetano Heredia acknowledges the importance of keeping up with
recent advances and decided to implement two clusters in order to analyze the performance,
speed, response time and efficiency of each system.
The project consisted of two clusters, Pani-13 and Turi-03. The first was made up of 7
DELL GX16 motherboards, with Pentium III processors and 192 megabytes of RAM, while
Turi-03 was built with 11 Intel DG31PR motherboards, Pentium Dual Core 1.8 GHz
processors and 1 gigabyte of RAM, although only 10 of its 11 nodes were used.
Additionally, both clusters run the 32-bit Scientific Linux 5.2 operating system
and work as a parallel network. The Parallel Virtual Machine (PVM) software transforms the system
into a single large parallel computer. Because of that, problems can be solved more cost
effectively by using the aggregate power and memory of many computers. PVM is used
around the world in scientific, educational, industrial and medical fields.
Furthermore, two additional tools were installed: MPI and TORQUE. MPI
(Message Passing Interface) is middleware that allows a running task to be distributed
across the cluster nodes. MPI works with several programming languages
such as C, C++, Fortran 77 and Fortran 90. We used Open MPI 1.2.8, since it is documented
as a stable version and has additional features compared with other MPI implementations; for
instance, it has a unique Modular Component Architecture. Open MPI is open source,
can be downloaded freely from the Open MPI website and is compatible with
Scientific Linux.
TORQUE is designed to manage a cluster's resources: users submit jobs, which are then
scheduled according to the memory and processors currently available. Therefore certain
jobs are executed immediately while others are delayed.
Moreover, benchmarks were installed in the system with the purpose of evaluating the
performance of each cluster. There is a variety of options among the current benchmarks
which assess different parameters of system efficiency.
One of the benchmarks selected was LAPACK, which is a library of Fortran 77 subroutines
for solving the most commonly occurring problems in numerical linear algebra. It can also be
used as a benchmark for testing processing times on a single node. Version 3.0 contains
two distinct test programs: one evaluates the routines for solving linear equations and linear
least squares problems, and the other assesses the routines for the matrix eigenvalue problem.
The NAS Parallel Benchmarks (NPB) consist of a small set of programs designed to evaluate
the performance of highly parallel supercomputers. They are written in Fortran 90 (including
Fortran 77) or C. NPB 2.3 is based on MPI source code and includes the original eight
benchmark problems: FT, MG, LU, SP, BT, EP, CG and IS. Nowadays, the NPB are used
by the Numerical Aerodynamic Simulation (NAS) program to evaluate the performance of
parallel computers.
PVMPOV is an unofficial, patched version of the POV-Ray source code that enables POV-Ray to run on a Linux
cluster; it is essentially the combination of PVM (Parallel Virtual Machine) and POV-Ray. PVM is a
message-passing system that allows a network of computers to be used as a single parallel,
distributed-memory computer. POV-Ray (the Persistence of Vision Ray Tracer) is a 3D
ray-tracing software package which takes an input description and simulates the interaction of
light with objects in order to create 3D pictures and animations.
When rendering an animation, each computer can render a subset of the total
number of frames. When rendering a single image, the master divides the image into small
blocks, which are assigned to the slaves. When the slaves have finished rendering the blocks,
they are sent back to the master, which combines them to form the final image.

2. Materials and methods

In the following section the processes of installation and configuration will be described as
well as the materials required.

2.1. Cluster implementation

The configuration of both clusters involves a similar process. First, the /etc/hosts file must be
modified on every node; the IP address of each node and its name must be specified as follows:

122.168.1.1 mserver03 (head node)
122.168.1.11 pnode01
122.168.1.12 pnode02
122.168.1.13 pnode03 (slave node)
......

The same procedure is followed for Turi-03, replacing the head node name with mserver03
and the node names with tnode**. The following steps create a user, turi or pani
respectively, for which the following commands were used:

# useradd turi
# passwd turi
[enter the password, e.g. turi03, when prompted]
# groupadd Beowulf
# usermod -g Beowulf turi
# mkdir /mnt/nfs_serv02
# chmod 770 /mnt/nfs_serv02
# chown turi:Beowulf /mnt/nfs_serv02 -R

Once this was done on every node, the /etc/fstab file had to be modified by adding the following entry:

mserver02:/mnt/turi_03 /mnt/nfs_serv02 nfs rw,hard,intr 0 0
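
The mount above assumes that the shared directory has already been exported over NFS on the head node. A minimal sketch of that server-side configuration (the export options and the 122.168.1.0/24 network are given only as an example) is:

# echo "/mnt/turi_03 122.168.1.0/255.255.255.0(rw,sync)" >> /etc/exports
# exportfs -ra
# service portmap restart
# service nfs restart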

2.2. PVM

To unpack PVM, the distribution being used must contain the source code, in this
case the PVM 3.4 source code. The files in the source distribution unpack into a "pvm3"
directory.
Building and installing the full PVM software requires a make utility as well as a C
or C++ compiler and a Fortran compiler on the system.
Our system had long used the GCC 4.1.6 compiler; however, PVM required an older
version of GCC, so we installed GCC 3.4.6, which worked correctly.
Once the correct compilers were working on our system, we started to build the PVM
software. Since "ssh" must be used instead of "rsh" on our system, we modified
$PVM_ROOT/conf/LINUX.def to change the absolute path specified for RSHCOMMAND
in the ARCHCFLAGS define, replacing the path to "rsh" with the absolute path to "ssh" on
our system.
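
For illustration, the relevant fragment of $PVM_ROOT/conf/LINUX.def before and after the change looks roughly like this (the surrounding flags are omitted and may differ between PVM releases):

-DRSHCOMMAND=\"/usr/bin/rsh\"     [original]
-DRSHCOMMAND=\"/usr/bin/ssh\"     [modified]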
Finally, the PVM source code "pvm3.4.6.tar" has to be copied to /home/pani/usr/local (or
/home/turi/usr/local) and unpacked there. Inside the "pvm3" directory we typed make and the
PVM software was installed in that directory. Additionally, to add all of the hosts we typed:

$ ssh-agent bash
$ ssh-add
$ pvm
pvm> add pnode01 [or tnode01 for the Turi-03 cluster]
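
Once the hosts have been added, the virtual machine can be checked from the same console; "conf" lists the configured hosts, and "quit" leaves the console while the PVM daemons keep running:

pvm> conf
pvm> quit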

2.3. MPI

The Open MPI installation file can be found with the following extensions: .tar.bz2, .tar.gz
and .src.rpm. We used .tar.bz2 and .tar.gz.
The build procedure we used consisted of saving the file on the server and then in a
shared folder; each node picked up the file from the shared folder and saved it in its home
folder. The file then had to be unpacked, configured and installed on every machine involved.
At this point, the only precaution to take was performing these steps as the correct user. The
commands were the following:

$ bunzip2 openmpi-1.2.8.tar.bz2
$ tar -xvf openmpi-1.2.8.tar
$ cd openmpi-1.2.8
$ ./configure --prefix=/usr/local/
$ make all install

After the installation, a series of jobs was run to check that the installation was effective. For
this purpose, we chose two files from the examples folder: hello_c.c and ring_c.c. These files
were chosen on the basis of previous experience.
After compiling, we found difficulties in running these files. In the case of hello_c, an error
associated with the library path occurred; it was solved by adding
LD_LIBRARY_PATH=/usr/local/lib to the .bashrc file. In the case of ring_c, it would not
run because the firewall was active; this was solved by deactivating the firewall on every single
machine.
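
For reference, a minimal sketch of this verification, assuming Open MPI was installed under /usr/local and the example sources are in the examples directory of the unpacked source tree, is:

$ echo 'export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
$ source ~/.bashrc
$ mpicc examples/hello_c.c -o hello_c
$ mpirun -np 2 ./hello_c
[add --hostfile <file> to spread the processes over several nodes]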

2.4. TORQUE

TORQUE is an open source queueing system that can be obtained from
http://www.clusterresources.com/downloads/torque/. Several versions are available, but
TORQUE 2.3.7 was chosen because it was the most recent stable production release at the
time.
The setup is fairly simple; the first step involves the extraction and installation on the
machine that will act as the server:

# tar -xzvf torqueXXX.tar.gz
# cd torqueXXX
# ./configure
# make
# make install

Afterwards, the server must be configured:

set server operators = root@headnode
set server operators += username@headnode
create queue batch
set queue batch queue_type = Execution
set queue batch started = True
set queue batch enabled = True
set server default_queue = batch
set server resources_default.nodes = 1
set server scheduling = True
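
These directives are interpreted by TORQUE's qmgr utility. One possible way to apply them, shown here only as an illustrative sketch after initializing the server database with pbs_server -t create, is:

# pbs_server -t create
# qmgr -c "set server scheduling = True"
# qmgr -c "create queue batch queue_type = execution"
# qmgr -c "set queue batch started = True"
# qmgr -c "set queue batch enabled = True"
# qmgr -c "set server default_queue = batch"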

Additionally, the server must be configured to recognize and communicate with its nodes,
each of which runs the pbs_mom daemon. Therefore, a file titled "nodes" must be created in
the directory /var/spool/torque/server_priv, specifying each node in the cluster and its
number of processors, as shown in the following example:

node001 np=2
node002 np=4

Given the structure of the cluster, it is possible to make self-extracting packages of
TORQUE, which can later be distributed to the nodes, with the following command:

# make packages

This will create 5 packages that can be found in /var/spool/torque and must be transferred
to each node.
The next step involves the extraction of the individual files on each node:

#./{file_name} --install

Finally, the pbs_mom command must be run on each node.

To check that TORQUE is functioning correctly, the server is restarted:

# qterm -t quick
# pbs_server

After a few moments, pbsnodes -a should list all nodes in the free state.
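
A short test job can then be submitted to confirm that the queue accepts and runs work. A minimal sketch of such a script, saved for example as test.pbs (the job name and resource request are illustrative), is:

#PBS -N testjob
#PBS -q batch
#PBS -l nodes=1:ppn=1
cd $PBS_O_WORKDIR
hostname

$ qsub test.pbs
$ qstat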

2.5. Benchmarks

2.5.1. LAPACK

The tests were made on two Turi nodes, two Pani nodes and the Pani server. The test
routines applied were nep and sgv for the eigenvalue problems and the linear equation test
routine for the linear equation problems. Each routine was run using the four different
executable files provided both for linear equation problems and for eigenvalue problems.
The files for the linear equation problems are xlintsts, xlintstc, xlintstd and xlintstz; and for
the eigenvalue problems, xeigtsts, xeigtstc, xeigtstd and xeigtstz.
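
Each executable reads its test parameters from an input file distributed with LAPACK and writes a report. One possible way to time a single run, assuming the LAPACK source tree was unpacked in the home directory and the single-precision linear equation input file stest.in is used, is:

$ cd ~/LAPACK/TESTING
$ time ./xlintsts < stest.in > stest.out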

2.5.2. NAS

NPB 2.3 is downloaded from the National Aeronautics and Space Administration's
website: <URL: http://www.nas.nasa.gov/>.
The file is then compiled on each node through the following instructions (the word
"cluster" represents each of the clusters, turi-03 and pani-13):

# cp Desktop/NPB2.3.tar /home/cluster
# cd /home/cluster
# chown cluster:Beowulf NPB2.3.tar
# su cluster
# scp NPB2.3.tar cnode02:~/. [This instruction is done for each node in both clusters]
# cd Desktop
# tar -xvf NPB2.3.tar
# cp -r Desktop/NPB2.3 /home/cluster
# cd /home/cluster
# chown cluster:Beowulf NPB2.3 -R
# su cluster
# cd
# cd NPB2.3
# cd NPB2.3-MPI
# cd EP
# make config CLASS=A NPROCS=8
# make CLASS=A NPROCS=8

Then, to run the benchmark on all the nodes of a cluster at the same time:

# mpirun -np 8 --hostfile nodos ep.A.8

The number of processes was also varied, from 1 to 16.
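
The file nodos passed to --hostfile is an ordinary Open MPI host file listing the machines that may receive processes; a possible layout for the Pani cluster (the slot counts are an assumption) is:

pnode01 slots=1
pnode02 slots=1
pnode03 slots=1
......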

2.5.3. Skyvase and PVMPOV

In order to build and run PVMPOV, pvmpov-3.1g2.tgz (the PVMPOV patch) is required,
as well as povuni_s.tgz (the POV-Ray UNIX source code) and povuni_d.tgz (a data
collection that is part of POV-Ray). Because the PVMPOV patch (pvmpov-3.1g2.tgz) is the
latest version (2001) and no new versions have been released, it was necessary to install
Scientific Linux version 3.
SVGALib (svgalib-1.4.3.tar.gz) is also needed in order to run the PVMPOV binaries; it is
an open-source low-level graphics library that allows programs to change video mode and
display full-screen graphics.
After downloading the first three files mentioned above, the PVMPOV patch file needs to
be untarred on the server and the slaves, with the command:

$ tar xfz pvmpov-3.1g2.tgz

This will create a pvmpov3_1g_2 directory. Change into this directory and extract the
POV-Ray source files. Once the source files have been extracted, apply the PVMPOV patch
by executing ”./inst-pvm”.
After the patch has been applied successfully you can build the PVMPOV binaries. Change
into the povray31/source/pvm directory and type "aimk newunix". When the compilation
finishes, compile the display-capable versions of PVMPOV by executing "aimk newsvga"
and "aimk newxwin". After this, switch to the root user and execute the command "aimk install",
which will install the necessary binaries.
Once the binaries have been successfully installed, the PVM daemon must be launched on
each host that will participate in the rendering (see the PVM materials and methods). Once
the PVM daemons are up, rendering can start. POV-Ray needs object script files (.pov) to
raytrace; in this case we used the skyvase.pov file, which was copied onto the server and the
slaves. Finally, to run the skyvase benchmark, execute the following command:

#pvmpov +Iskyvase.pov +Oskyvase.tga pvm_hosts=pnode02,pnode03,pnode05,pnode06,pnode07 \
  +NT24 +NW64 +NH64 +v -w1024 -h768

Note that the "+NT24" option specifies the number of tasks into which the rendering
will be divided among the slaves (for further details see the Appendix).
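
The runs reported in Section 3.3 repeat this command while varying the hosts listed in pvm_hosts and the +NT value. A simple way to sweep the task counts used there, with an illustrative subset of hosts (PVMPOV prints its own timing statistics at the end of each render), could be:

for NT in 3 6 12 24 48; do
    pvmpov +Iskyvase.pov +Oskyvase.tga pvm_hosts=pnode02,pnode03,pnode05 \
        +NT$NT +NW64 +NH64 +v -w1024 -h768
done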

3. Results

The following results were obtained from the applied benchmarks.

3.1. LAPACK

These results compared the time required for certain nodes and servers to process test
routines.

Table 1. Linear equation routine test: processing times in seconds

Test routine   Pani server   Pani node 01   Pani node 02   Turi node 04   Turi node 05
xlintsts            8.89         17.64          17.67           4.37           5.76
xlintstc           26.67         59.33          59.42           4.41           5.83
xlintstd            9.20         18.25          18.26          13.93          20.16
xlintstz           27.35         61.85          61.99          14.19          20.58

Table 2. Eigenvalue matrix problems routine test: processing times in seconds

Nep routine    Pani server   Pani node 01   Pani node 02   Turi node 04   Turi node 05
xeigtsts            0.17          0.37           0.37           0.09           0.13
xeigtstc            0.40          0.99           0.99           0.23           0.33
xeigtstd            0.18          0.41           0.41           0.11           0.14
xeigtstz            0.46          1.10           1.09           0.25           0.37

The data obtained give an approximate idea of the efficiency of each node and server. As
shown in the tables, the time needed to execute the required task is much greater on the Pani
nodes and server than on the Turi nodes. It is also important to note that the Pani server was
faster than its nodes. These results are due to the type of processor selected for each
cluster, since the Pani server is a Pentium IV and its nodes are Pentium III, while the Turi
nodes are Pentium Dual Core.

Figure 1. "NEP" Eigenvalue Matrix Problems Routine Test Processing Times


3.2. NAS

In both clusters, the operation time decreases as the number of processes increases. This
is because, as the number of processes increases, the number of processors used also increases,
and thus the time taken to complete the operations successfully decreases.
The time it took the Pani cluster to complete the process was less than the time it took a single
Turi node to complete the same process successfully. This shows that even though Pani's
nodes individually may process the operations more slowly than Turi's, when several
nodes are combined into a cluster the performance is much better than that of a single Turi node.
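For instance, from Table 3, Turi's run time falls from 42.28 s with one process to 3.06 s with sixteen, a speed-up of roughly 13.8, while Pani's falls from 141.68 s to 27.06 s, roughly 5.2.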

Table 3. Time (in seconds) of the EP Class A NAS problem changing the number of processes/processors

Cluster     1/1       2/2      4/4     6/6      8/8     16/16
Pani       141.68    70.19    35.1    46.8     35       27.06
Turi        42.28    21.56    10.7    15.14     6.04     3.06

Figure 2. Time of the EP Class A NAS problem changing the number of Processes/Processors

3.3. Skyvase

Because the current version of Linux had to be replaced by Scientific Linux
3.09, we could install it on only 5 nodes (pnode02, pnode03, pnode05, pnode06 and
pnode07) that were part of the PANI-13 cluster. Because of time constraints we did not do the
same for the TURI-03 cluster, so the following results are based on PANI-13 nodes only.
The tests were made according to the number of nodes used and the number of tasks for
the Skyvase image rendering. When a single node was used, the response time for each given
number of tasks was very high compared with the results of the tests using 2 to 5 nodes. The
first number of tasks given was 48, which gave response times between 35 and 38 seconds
with 5 and 3 nodes, respectively. With 24 tasks, the best time was 28 seconds, obtained with 5
and with 3 nodes, while with 4 nodes the response time was 29 seconds, a slight difference.
Likewise, when 12 tasks were given, using 3 and 5 nodes gave the best times, 25 and
26 seconds, respectively. For the first three task counts tested, using 3 nodes gave better
results than using 4 nodes. These results may be due to the different combinations of nodes used,
because not all the nodes had the same specifications, except for the processor.
The best time for 3 tasks was between 25 and 26 seconds, using 3, 4 and 5 nodes. For 6
tasks the best response times were also between 25 and 26 seconds, with 3 and 4 nodes. In the
last two tests, using 4 and 5 nodes gave quite similar results, in contrast with the first three tests.
On the other hand, the nodes with the highest working rate were pnode07 and pnode05,
which had a working percentage between 25 and 30% for each task. Note that the
response times obtained depend on the different combinations of nodes used for each test.

Table 4. Time of image rendering in seconds with the Skyvase benchmark

Number of nodes          Number of tasks (s)
                     3      6     12     24     48
1                   61     62     64     70     78
2                   33     33     33     36     45
3                   26     25     25     28     38
4                   25     26     27     29     41
5                   25     28     26     28     35

Figure 3. Time of image rendering in seconds using Skyvase Benchmark


4. Conclusions

Following the implementation of the clusters, the benchmarks proved to be a useful method
of comparison between the nodes and between the clusters. The LAPACK test routines confirmed
that the time taken to perform a task is directly related to the type of processor used: Turi's
nodes (Pentium Dual Core) took less time than Pani's nodes (Pentium III), and even the Pani
server (Pentium IV), in spite of being a server, was unable to process the routines as fast as
the Turi nodes.
By using the NAS benchmark, a program created by NASA, we concluded
that the time one processor takes to perform a process is greater than the time two
processors take to perform the same work split into two processes, and the same applies as
more processes and processors are added.
Finally, from the results obtained with PVMPOV we can conclude that a single Pani
node is less efficient than a single Turi node. However, if Pani's nodes are combined, the time
taken to process the tasks is lower. Additionally, the processing time decreases as the number
of nodes used increases.
The authors would like to thank Professor Liubov Flores for her work, which helped us to
carry out ours. Thanks also to the Centro de Investigaciones Energéticas, Medioambientales
y Tecnológicas (CIEMAT) and to the Agencia Española de Cooperación Internacional (AECI),
which helped us directly or indirectly in our research. We also want to thank Maria Gamero,
Ana-Lucía García, Felipe de la Torre, José Aarón Zapata and the first-year medicine students
of the Universidad Peruana Cayetano Heredia, because without them we would not have been
able to carry out this research.

References

[1] A. Jain, "Beowulf Cluster Design and Setup". Retrieved June 6, 2008, from
http://cs.boisestate.edu/~amit/research/beowulf/beowulf_setup.pdf
[2] K. Swendson, "The Beowulf HOWTO". Retrieved June 6, 2008, from
http://tldp.org/HOWTO/Beowulf-HOWTO/
[3] C. Yang, Y. Chang, "An Introduction to a PC Cluster with Diskless Slave Nodes",
Tunghai Science, Taiwan, Vol. 4 (2002).
[4] A. Wachsmann, "How to Install Red Hat Linux via PXE and Kickstart". Retrieved
December 5, 2008, from http://www.stanford.edu/~alfw/PXE-Kickstart/PXE-Kickstart.html
[5] Open MPI: Open Source High Performance Computing. Retrieved January 10, 2009,
from http://www.open-mpi.org/
[6] Cyberinfrastructure Tutor: Introduction to MPI. Retrieved January 10, 2009, from
http://ci-tutor.ncsa.uiuc.edu/index.php
[7] LAPACK: Linear Algebra PACKage. Retrieved January 3, 2009, from
http://www.netlib.org/lapack/
[8] Universidad EAFIT: Una introducción al uso de LAPACK. Retrieved January 3,
2009, from http://www.eafit.edu.co/
[9] University of Tennessee: LAPACK Working Note 41, Installation Guide for
LAPACK. Retrieved January 4, 2009, from http://www.mirrorservice.org/
[10] NPB: The NAS Parallel Benchmarks. Retrieved December 10, 2008, from
http://www.nas.nasa.gov/Resources/Software/npb.html
