
Co-operative Report

Integrating Hardware Architecture
Knowledge into the
Distributed PLASMA Project

by
Krerkchai Kusolchu
Student Visitor
Innovative Computing Laboratory
Department of Electrical Engineering and
Computer Science
University of Tennessee
2010
Table of Contents

Acknowledgement
Abstract
Objective
Introduction
About Organization
Background
Hardware Locality
What is Hardware Locality?
Design and Interface of Hardware Locality
A. Abstracting the Hardware Topology
B. Exporting the Hardware Topology
Application and Performance Example
Affinity-aware Thread Scheduling
Implementation
Conclusion
References
Appendix

Acknowledgement

I would like to express my gratitude to all those
who gave me the opportunity to come to the
University of Tennessee for my internship and gain
valuable experience. I want to thank the Department of
Computer Engineering for giving me permission to
work and broaden my knowledge in the area of
Computer Science. I would furthermore like to thank the
Institute of Engineering, Suranaree University of
Technology, for encouraging and supporting me
throughout my time here. I am deeply indebted to my
supervisor, Dr. George Bosilca, whose help, concern,
and suggestions assisted me throughout my internship.
I also thank my colleagues from the Department of
Computer Engineering, who gave me beneficial
guidance in my project work, and all of the people who
helped me along the way. Last but not least, I would
like to give my special thanks to Dr. Thara Angskun,
who contributed greatly to my visit to the University of
Tennessee for this internship.

Abstract

This document introduces the hwloc software,
explains why affinities are important for modern HPC
hardware and applications, gives several use cases with
MPI and OPENMP libraries, and shows how hwloc helps
them achieve better performance. It then compares the
performance of Distributed PLASMA using hardware
architecture knowledge against other methods.

Objective

Apply and integrate hardware architecture
knowledge into the Distributed PLASMA project using
hwloc in order to improve the performance of
Distributed PLASMA, and compare the performance of
Distributed PLASMA with hardware architecture
knowledge against other methods.
Chapter 1: Introduction
About Organization
Name: Innovative Computing Laboratory
Alias: ICL
Contact: University of Tennessee
Department of Electrical Engineering and Computer Science
Suite 413 Claxton
1122 Volunteer Blvd
Knoxville TN 37996-3450

Located at the heart of the University of Tennessee


campus in Knoxville, ICL continues to lead the way as
one of the most respected academic, enabling
technology research laboratories in the world. Our
many contributions to technological discovery in the
HPC community, as well as at UT, underscore our
commitment to remain at the forefront of enabling
technology research.

Before his recent departure as Chancellor of the


Knoxville campus, Dr. Loren Crabtree remarked about
ICL’s prominent role at the University of Tennessee:
On behalf of the entire university, it is a privilege to
recognize the importance of the Innovative Computing
Laboratory to the university’s research mission. Led by
Distinguished Professor Jack Dongarra, ICL continues
to set the standard for academic research centers in
the 21st century. As one of the university’s most
respected centers, the students and staff of ICL
continue to demonstrate the dedication, leadership,
and accomplishments that embody the university’s
ongoing efforts to remain one of the top publicly
funded academic research institutions in the United
States. Going forward, I also expect ICL to continue
to play a major role in helping the university establish
and foster national and international collaborations,
including our ongoing partnerships with Oak Ridge
National Laboratory and the construction in Tennessee
of the NSF’s new petascale supercomputing center.
The future of research demands that academic
institutions raise the bar for instruction and exploration.
The University of Tennessee is proud to be the home
of world-class centers such as ICL and we look
forward to its continued contributions to our nation’s
research agenda.

Background

At the Innovative Computing Laboratory (ICL), our


mission is simple. We intend to be a world leader in
enabling technologies and software for scientific
computing. Our vision is to provide leading edge tools
to tackle science’s most challenging high performance
computing problems and to play a major role in the
development of standards for scientific computing in
general.

ICL was founded in 1989 by Dr. Jack Dongarra who


came to the University of Tennessee from Argonne
National Laboratory upon receiving a dual appointment
as Distinguished Professor in the Computer Science
Department and as Distinguished Scientist at nearby
Oak Ridge National Laboratory (ORNL), two positions
he holds today. What began with Dr. Dongarra and a
single graduate assistant has evolved into a fully
functional center, with a staff of more than 40
researchers, students, and administrators.
Throughout the past 18 years, ICL has
attracted many post-doctoral researchers and
professors from multiple disciplines such as mathematics,
chemistry, etc. Many of these scientists came to UT
specifically to work with Dr. Dongarra, which began a
long list of top research talent to pass through ICL
and move on to make exciting contributions at other
institutions and organizations. Below we recognize just
a few who have helped make ICL the respected
center it has become.

• Zhaojun Bai - University of California, Davis


• Richard Barrett - Oak Ridge National Laboratory
• Adam Beguelin - formerly of AOL, now retired
• Susan Blackford - Myricom
• Henri Casanova - University of Hawaii, Manoa
• Jaeyoung Choi - Soongsil University, Korea
• Andy Cleary - Lawrence Livermore National
Laboratory
• Frederic Desprez - ENS-Lyon, France
• Victor Eijkhout - University of Texas, Austin
• Graham Fagg - Microsoft
• Edgar Gabriel - University of Houston
• Robert van de Geijn - University of Texas, Austin
• Julien Langou - University of Colorado at Denver
• Antoine Petitet - ESI Group, France
• Roldan Pozo - NIST
• Erich Strohmaier - Lawrence Berkeley National
Laboratory
• Francoise Tisseur - Manchester University,
England
• Bernard Tourancheau - University of Lyon, France
• Sathish Vadhiyar - Indian Institute of Science
(IISC), India
• Clint Whaley - University of Texas, San Antonio
• Felix Wolf - Forschungszentrum Julich, Germany

Over the past 18 years, ICL has produced numerous


high value tools and applications that now compose
the basic fabric of high performance, scientific
computing. Some of the technologies that our research
has produced include:

Active Netlib, ATLAS, BLAS, FT-MPI, HARNESS,
LAPACK, LAPACK for Clusters, LINPACK Benchmark,
MPI, NetBuild, Netlib, NetSolve, PAPI, PVM, RIB,
ScaLAPACK, and the Top500.

Our successes continue along with current ICL efforts


such as Fault Tolerant Linear Algebra, Generic Code
Optimization (GCO), HPC Challenge benchmark suite
(HPCC), KOJAK, Multi-core and Cell effort (PLASMA),
NetSolve/GridSolve, Open MPI, PAPI, SALSA,
SCALASCA, and vGrADS. Many of our efforts have
been recognized nationally and internationally, including
four R&D 100 awards: PVM in 1994, ATLAS and
NetSolve in 1999, and PAPI in 2001.

Chapter 2: Hardware Locality

What is Hardware Locality?

Hardware Locality, or hwloc, is a software package that
provides command line tools and a C API to gather
hardware information about processors, caches,
memory nodes and more, and exposes it to
applications and runtime systems in an abstracted and
portable hierarchical manner. hwloc's primary goal is to
help high-performance computing (HPC) applications,
but it is also applicable to any project seeking to exploit
code and/or data locality on modern computing
platforms. hwloc may significantly help performance by
letting runtime systems place their tasks or adapt
their communication strategies depending on hardware
affinities.

Design and Interface

We now introduce the design and interface of

hwloc. It aims at abstracting topology information in

a portable manner so as to export it to applications

and runtime systems in a convenient way.

A. Abstracting the Hardware Topology

Hardware Locality was designed around the idea that
today's and next-generation architectures are highly
hierarchical. Indeed, current machines consist of
several processor sockets containing multiple cores,
each composed of one or several hardware threads.
This led to representing the hardware architecture as a
tree of resources. hwloc also includes NUMA memory
nodes in its resource tree, as depicted in Figure 4.
In the case of NUMA machines with dozens of memory
nodes, such as SGI ALTIX systems [2], hwloc can also
parse the matrix of distances between nodes (reported
by the operating system) so as to exhibit the hierarchical
organization of these memory nodes. hwloc was also
designed with the idea that future architectures may be
asymmetric (fewer cores in some sockets) or even
heterogeneous (different processor types). Thus, the
hierarchical tree is composed of generic objects
containing a type (among Node, Socket, Cache, Core,
and more) and various attributes such as the cache type
and size, or the socket number. This design enables easy
porting to future architectures, since no assumption is
made about the presence of currently-existing object
types (such as sockets or cores) or their relative depth
in the tree.
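
As a brief illustration of this tree of generic objects, the
following minimal sketch (written for this report, not taken
from hwloc's documentation or the Distributed PLASMA
sources) walks the detected topology level by level and
prints the type and the number of objects found at each
depth.

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topology;
    unsigned depth, nb_levels;

    hwloc_topology_init(&topology);   /* allocate a topology context */
    hwloc_topology_load(topology);    /* detect the current machine  */

    nb_levels = hwloc_topology_get_depth(topology);
    for (depth = 0; depth < nb_levels; depth++) {
        hwloc_obj_t obj = hwloc_get_obj_by_depth(topology, depth, 0);
        printf("depth %u: %u x %s\n", depth,
               hwloc_get_nbobjs_by_depth(topology, depth),
               hwloc_obj_type_string(obj->type));
    }

    hwloc_topology_destroy(topology); /* free the topology context */
    return 0;
}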

B. Exporting the Hardware Topology

hwloc gathers information about the underlying
hardware at startup. It uses operating-system-specific
strategies to do so: reading the sysfs
pseudo-filesystem on LINUX, or calling specific
low-level libraries on AIX, DARWIN,
OSF, SOLARIS or WINDOWS. It can then
display a graphical or textual output to the user.
It can also save the topology to an XML file so as to
reload it later instead of re-gathering it from
scratch, for instance if both a launcher and the
actual process use it.
The most interesting way to use hwloc is
through its C programming interface. The hwloc
interface not only abstracts OS-specific interfaces
into a portable API; it also tries to leverage all
their advantages through both a low-level detailed
interface and a high-level conceptual interface. The
former lets an advanced programmer directly
traverse the object tree, following pointers to
parents, children, siblings, etc., so as to find the
relevant resource information using topology
attributes such as their depth or index. The latter
API provides generic and higher-level helpers to find
resources matching some properties. Once the
application or runtime system has found the
interesting objects in the topology tree, it can then
retrieve information from their attributes to adapt its
behavior to the underlying hardware
characteristics.
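
To give a small taste of the two interface levels, the sketch
below (an illustration only, using the hwloc 1.0-era API
found elsewhere in this report, where the parent pointer is
named father) locates a core with the high-level helper
hwloc_get_obj_by_type and then walks up the tree through
the low-level pointers to find the first cache above it.

#include <hwloc.h>

/* Return the size in KB of the closest cache above core `core_index`,
   or 0 if no cache is found above it. */
static unsigned long cache_above_core(hwloc_topology_t topology,
                                      unsigned core_index)
{
    /* high-level helper: find the core object directly by type and index */
    hwloc_obj_t obj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE,
                                            core_index);

    /* low-level traversal: follow parent pointers toward the root */
    while (obj) {
        if (obj->type == HWLOC_OBJ_CACHE)
            return obj->attr->cache.memory_kB;
        obj = obj->father;
    }
    return 0;
}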

Application and Performance Example


This section describes how hwloc can be used
by some existing OPENMP and MPI runtime
systems. We first look at scheduling OPENMP
threads and placing MPI processes depending on
their software affinities and on the hardware
hierarchy. Then, we show how a predefined process
placement can benefit from topology information by
adapting its communication strategy to the hardware
affinities between processes.

Affinity-aware Thread Scheduling

The OPENMP language consists of a set


of compiler directives, library routines and
environment variables that help the programmer
with designing parallel applications. It has been
originally designed for SMP architectures, and
OPENMP runtime systems now have to evolve to
deal with affinities on hierarchical NUMA
machines.
FORESTGOMP is an extension of the
GCC GNU OPENMP runtime system (GOMP)
that benefits from hwloc to be efficient on
any kind of shared-memory architecture. It relies
on the BUBBLESCHED scheduling framework to
group related threads together into recursive
Bubble Structures every time the application
enters a parallel section, thus generating a tree
of threads out of OPENMP applications.
BUBBLESCHED also decorates the topology
provided by hwloc with thread queues called
Runqueues. Each runqueue is thus attached to a
different object of the architecture topology. This
way, the computer architecture is modeled by a
tree of runqueues on which a tree of threads can
be scheduled. For instance, scheduling a thread on
a socket-level runqueue means that this thread
can only be executed by the corresponding cores.
And each core can run any thread that is placed
on the runqueue of an object containing this core.
So the problem of scheduling is only a
matter of mapping a dynamic tree of threads onto
a tree of runqueues. FORESTGOMP provides
several scheduling policies to fit different
situations. One of them, called Cache, takes the
topology into account to perform a thread
distribution accounting for cache memory affinities.
Its main goal is to schedule related threads
together in a portable way, consulting the topology
to determine which processing units share cache
memory. It also keeps track of the last runqueue
a thread was scheduled on to be able to move it
back there during a new thread distribution, to
benefit from cache memory reuse. When a
processor idles, the Cache scheduler browses the
topology to steal work from the most local cores
to benefit from shared cache memory.
We experimented with the Cache policy on an implicit surface
reconstruction application called MPU on a quad-
socket quad-core OPTERON host. The
parallelism of this application is highly irregular
and leads to the creation of a tree of more than
100,000 threads. Table I shows the results
obtained by both the GOMP and the
FORESTGOMP runtime systems.
We also slightly modified FORESTGOMP to
ignore the architecture topology for comparison. It
behaves better than the GOMP runtime system
thanks to the cheap user-level thread
management in BUBBLESCHED. As re-using
cache memory is crucial for this kind of divide-
and-conquer application, the topology-aware
Cache scheduling policy behaves much better
here. The OPENMP parallelization on this 16-
core host achieves a speedup of 14 over the
sequential code thanks to proper hardware affinity
knowledge, while GOMP and the non-topology
aware FORESTGOMP only reach 4.18 and 8.52
speedups.

4. Implementation

Here is the code that is used to detect the
hardware architecture using Hardware Locality (hwloc).

int conf_topology(int set);

This function allocates, detects and builds, or
terminates and frees a topology context. If the input
is 1 it allocates and builds the topology context;
if the input is 0 it terminates and frees it; for any
other input it simply returns 0.
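
A minimal usage sketch of these helpers is given below (an
assumed calling sequence written for this report, not taken
from the actual Distributed PLASMA sources): the topology
context is built once at startup, queried through the
functions described in the rest of this chapter, and freed
at shutdown.

#include <stdio.h>

void topology_usage_example(void)
{
    conf_topology(1);                             /* allocate and build the topology context */

    int levels = dplasma_hwlock_nb_levels();      /* depth at which the cores appear */
    int master = dplasma_hwlock_master_id(1, 0);  /* master of core 0 at level 1 */
    size_t kb  = dplasma_hwlock_cache_size(1, master); /* cache shared at that level */

    printf("core depth %d, level-1 master %d, cache %zu KB\n",
           levels, master, kb);

    conf_topology(0);                             /* terminate and free the topology context */
}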

int dplasma_hwlock_nb_levels();

Returns the number of levels of the hardware
architecture above the cores, i.e., the depth at which
the cores appear in the topology tree. For example, if
the architecture consists of the System level (numbered
0), the L3 cache level (numbered 1), the L2 cache level
(numbered 2), and the L1 cache level (numbered 3),
then the cores appear at depth 4 and the function
returns 4.

int dplasma_hwlock_master_id(int level, int processor_id);

This function returns the processor id of the
"master" of the processor defined by processor_id at
level level, that is, the first processor that appears
under the object containing processor_id at that level.
Example:
System(126GB)
  L3(5118KB)
    L2(512KB) + L1(64KB) + Core#0
    L2(512KB) + L1(64KB) + Core#1
    L2(512KB) + L1(64KB) + Core#2
    L2(512KB) + L1(64KB) + Core#3
  L3(5118KB)
    L2(512KB) + L1(64KB) + Core#4
    L2(512KB) + L1(64KB) + Core#5
    L2(512KB) + L1(64KB) + Core#6
    L2(512KB) + L1(64KB) + Core#7

If level 0 is the system, then
dplasma_hwlock_master_id(0, 0) = 0,
dplasma_hwlock_master_id(0, 3) = 0, and so on.
Basically, the master is the first processor that appears
under the object at this level that contains the given
processor.
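
For the example topology above, level 1 is the L3 cache
level, so the expected results would be (assuming the
behavior described above and the implementation in the
appendix):

dplasma_hwlock_master_id(1, 2) = 0 (core 2 sits under the first L3, whose first core is 0)
dplasma_hwlock_master_id(1, 5) = 4 (core 5 sits under the second L3, whose first core is 4)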

unsigned int dplasma_hwlock_nb_cores(int level, int master_id);

This function returns the number of processors
that have the same master master_id at level level.

Using the same example as above:

dplasma_hwlock_nb_cores(0, 0) = 8
dplasma_hwlock_nb_cores(0, 4) = 8
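
At level 1 (the L3 level) of the same example topology,
each L3 cache covers four cores, so one would expect:

dplasma_hwlock_nb_cores(1, 0) = 4
dplasma_hwlock_nb_cores(1, 4) = 4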

size_t dplasma_hwlock_cache_size(int level, int master_id);

This function returns the size of the cache at
level level for the processors whose master is master_id
at that level.

Using the same example as above:

dplasma_hwlock_cache_size(1, 4) = 5118 KB
dplasma_hwlock_cache_size(2, 4) = 512 KB

int dplasma_hwloc_distance(int id1, int id2);

This function returns the distance between id1
and id2, that is, how many jumps must be made in the
topology tree to go from the core with id id1 to the
core with id id2. Since the hierarchy is a tree, this
number is always even.

Using the same example as above:

dplasma_hwloc_distance(0, 1) = 6 jumps
dplasma_hwloc_distance(0, 4) = 8 jumps
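
As an example of how such a distance could be used, the
hypothetical helper below (written for this report, not part
of the Distributed PLASMA code) picks the core closest to a
reference core; a scheduler could use it to place a new
thread near the data of an existing one.

#include <limits.h>

/* Return the id of the core (other than `ref`) with the smallest
   distance to `ref`, among cores 0 .. ncores-1, or -1 if none. */
static int closest_core(int ref, int ncores)
{
    int best = -1, best_dist = INT_MAX, c;

    for (c = 0; c < ncores; c++) {
        if (c == ref) continue;
        int d = dplasma_hwloc_distance(ref, c);
        if (d >= 0 && d < best_dist) {
            best_dist = d;
            best = c;
        }
    }
    return best;
}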

5. Conclusion
By applying and integrating hardware architecture
knowledge into the Distributed PLASMA project using
hwloc, we can take advantage of the hardware
architecture so that related threads can be scheduled
in a portable way, consulting the topology to determine
which processing units share cache memory. This allows
the scheduler to benefit from cache memory reuse and
to steal work from the most local cores, which share
cache memory, and thereby improve the performance
of Distributed PLASMA.
References

François Broquedis, Jérôme Clet-Ortega, Stéphanie
Moreaud, Nathalie Furmento, Brice Goglin,
Guillaume Mercier, Samuel Thibault, Raymond
Namyst (2010). hwloc: a Generic Framework for
Managing Hardware Affinities in HPC Applications.
Retrieved from
http://hal.inria.fr/docs/00/42/98/89/PDF/main.pdf
Appendix
Build and Destroy Topology
#include <hwloc.h>

/* Topology context shared by all the helpers below. */
static hwloc_topology_t topology;

int conf_topology(int set)
{
    if (set == 1) {
        /* allocate and build the topology context */
        hwloc_topology_init(&topology);
        /* collapse NUMA node and socket levels that do not add structure */
        hwloc_topology_ignore_type_keep_structure(topology, HWLOC_OBJ_NODE);
        hwloc_topology_ignore_type_keep_structure(topology, HWLOC_OBJ_SOCKET);
        hwloc_topology_load(topology);
    } else if (set == 0) {
        /* terminate and free the topology context */
        hwloc_topology_destroy(topology);
    }
    return 0;
}

Find the number of cores for master_id

unsigned int dplasma_hwlock_nb_cores(int level, int master_id)
{
    int i;

    for (i = 0; i < hwloc_get_nbobjs_by_depth(topology, level); i++) {
        hwloc_obj_t obj = hwloc_get_obj_by_depth(topology, level, i);

        /* the object at this level whose cpuset contains master_id
           covers all the cores sharing that master */
        if (hwloc_cpuset_isset(obj->cpuset, master_id)) {
            return hwloc_cpuset_weight(obj->cpuset);
        }
    }
    return 0;
}

Find the master id from the processor id

int dplasma_hwlock_master_id(int level, int processor_id)
{
    int count = 0, i, div = 0, real_cores, cores;

    real_cores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
    cores = real_cores;
    div = cores;

    /* handle processor ids larger than the number of real cores */
    if (processor_id / cores > 0) {
        while (processor_id) {
            if (processor_id % div == 0) {
                processor_id = count;
                break;
            }
            count++;
            div++;
            if (real_cores == count) count = 0;
        }
    }

    for (i = 0; i < hwloc_get_nbobjs_by_depth(topology, level); i++) {
        hwloc_obj_t obj = hwloc_get_obj_by_depth(topology, level, i);

        /* the master is the first processor of the object at this
           level that contains processor_id */
        if (hwloc_cpuset_isset(obj->cpuset, processor_id)) {
            return hwloc_cpuset_first(obj->cpuset);
        }
    }

    return -1;
}


Find the cache size

size_t dplasma_hwlock_cache_size(int level, int master_id)
{
    hwloc_obj_t obj = hwloc_get_obj_by_type(topology,
                                            HWLOC_OBJ_PROC, master_id);

    /* walk up from the processor until the requested level is reached */
    while (obj) {
        if (obj->depth == level) {
            if (obj->type == HWLOC_OBJ_CACHE) {
                return obj->attr->cache.memory_kB;
            } else {
                return 0;
            }
        }
        obj = obj->father;
    }
    return 0;
}

Find the distance between two cores

int dplasma_hwloc_distance(int id1, int id2)
{
    int count = 0;

    hwloc_obj_t obj  = hwloc_get_obj_by_type(topology,
                                             HWLOC_OBJ_CORE, id1);
    hwloc_obj_t obj2 = hwloc_get_obj_by_type(topology,
                                             HWLOC_OBJ_CORE, id2);

    /* climb both branches of the tree until they meet at a common
       ancestor; the distance is the number of jumps up plus the
       number of jumps back down */
    while (obj && obj2) {
        if (obj == obj2)
            return count + count;

        obj = obj->father;
        obj2 = obj2->father;
        count++;
    }
    return -1;
}

Find the number of levels of the hardware architecture


int dplasma_hwlock_nb_levels(void)
{
return hwloc_get_type_depth(topology,
HWLOC_OBJ_CORE);
}
