E-mail: panda@cse.ohio-state.edu
http://nowlab.cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Presentation Outline
Network-Based Computing
[Figure: network-based computing arises from the convergence of powerful, cost-effective, commodity computing; high-bandwidth, cost-effective, commodity networking; and emerging collaborative/interactive applications with a wide range of computation and communication characteristics and the ability to integrate computation and communication.]
Presentation Outline
Trends for Computing Clusters
in the Top 500 List
• Top 500 list of Supercomputers (www.top500.org)
– June 2001: 33/500 (6.6%)
– Nov 2001: 43/500 (8.6%)
– June 2002: 80/500 (16%)
– Nov 2002: 93/500 (18.6%)
– June 2003: 149/500 (29.8%)
– Nov 2003: 208/500 (41.6%)
– June 2004: 291/500 (58.2%)
– Nov 2004: 294/500 (58.8%)
– June 2005: 304/500 (60.8%)
– Nov 2005: 360/500 (72.0%)
– June 2006: 364/500 (72.8%)
Cluster with Interactive Clients
[Figure: interactive clients connected to a cluster over a LAN/WAN.]
Globally-Interconnected Systems
[Figure: SMP systems interconnected over a WAN.]
Interconnected Clusters over WAN
[Figure: workstation clusters interconnected over a WAN.]
Generic Three-Tier Model for
Data Centers
[Figure: Tier 1 (routers/edge servers), Tier 2 (application servers), and Tier 3 (database servers with storage), linked by LANs inside the data center and reached by clients over a LAN/WAN.]
[Figure: layered view - applications (scientific, commercial, servers, data centers) sit on programming models (message passing, sockets, shared memory w/o coherency), which rely on system software/middleware support and on networking and communication support.]
Trends in Networking/Computing
Technologies
Major Research Directions
• System Software/Middleware
– High Performance MPI on InfiniBand Cluster
– Clustered Storage and File Systems
– Solaris NFS over RDMA
– iWARP and its Benefits to High Performance Computing
– Efficient Shared Memory on High-Speed Interconnects
– High Performance Computing with Virtual Machines (Xen-IB)
– Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
– High Performance Networking for TCP-based Applications
– NIC-level Support for Collective Communication and Synchronization
– NIC-level Support for Quality of Service (QoS)
– Micro-Benchmarks and Performance Comparison of High-Speed
Interconnects
• More details on http://nowlab.cse.ohio-state.edu/ -> Projects
Why InfiniBand?
[Figure: two nodes, each with memory, processors P0/P1, a PCI/PCI-Express bus, and an InfiniBand adapter (IBA), communicating directly across the fabric; and InfiniBand hardware multicast setup, in which nodes send a multicast join to the subnet manager, which then activates the required links.]
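The RDMA capability sketched above is what the MPI and data-center designs in the rest of the talk build on: one node writes directly into another node's registered memory without involving the remote CPU. Below is a minimal sketch using the OpenFabrics verbs API; it assumes connection setup has already been done (a connected queue pair qp, a completion queue cq, a registered local buffer mr/buf, and the remote buffer's address and rkey exchanged out of band), and all names are illustrative.

/* Sketch: one-sided RDMA write with the OpenFabrics verbs API.
 * Assumes a connected queue pair (qp), a completion queue (cq),
 * a locally registered buffer (mr covering buf), and the remote
 * buffer's address/rkey obtained out of band during setup. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write_sync(struct ibv_qp *qp, struct ibv_cq *cq,
                    struct ibv_mr *mr, void *buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;
    struct ibv_wc wc;
    int n;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)buf;   /* local source buffer */
    sge.length = len;
    sge.lkey   = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* no remote CPU involvement */
    wr.send_flags          = IBV_SEND_SIGNALED;   /* generate a completion */
    wr.wr.rdma.remote_addr = remote_addr;         /* target virtual address */
    wr.wr.rdma.rkey        = rkey;                /* remote memory key */

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Busy-poll the completion queue until the write completes locally. */
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}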
Designing MPI Using InfiniBand
Features
MPI Design Components
Protocol Flow Communication Multirail
Mapping Control Progress Support
Substrate
InfiniBand Features
High Performance MPI over IBA –
Research Agenda at OSU
• Point-to-point communication
– RDMA-based design for both small and large messages
• Collective communication
– Taking advantage of IBA hardware multicast for Broadcast
– RDMA-based designs for barrier, all-to-all, all-gather
• Flow control
– Static vs. dynamic
• Connection Management
– Static vs. dynamic
– On-demand
• Multi-rail designs
– Multiple ports/HCAs
– Different schemes (striping, binding, adaptive)
• MPI Datatype Communication
– Taking advantage of the scatter/gather semantics of IBA (see the datatype sketch after this list)
• Fault-tolerance Support
– End-to-end reliability to tolerate I/O errors
– Network fault-tolerance with Automatic Path Migration (APM)
– Application-transparent checkpoint and restart
• Currently extending our designs for iWARP adapters
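As an illustration of the datatype item above, the sketch below sends one column of a matrix using a strided MPI derived datatype; a datatype-aware MPI over InfiniBand can map such non-contiguous transfers onto the adapter's scatter/gather entries instead of packing them. It uses only standard MPI calls, so it is not specific to MVAPICH; the matrix dimensions are illustrative.

/* Sketch: sending a strided (non-contiguous) column of a matrix with an
 * MPI derived datatype. A datatype-aware MPI over InfiniBand can map such
 * transfers onto the NIC's scatter/gather entries instead of packing.
 * Standard MPI calls only; matrix size is illustrative. Run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define ROWS 4
#define COLS 8

int main(int argc, char **argv)
{
    double a[ROWS][COLS];
    MPI_Datatype column;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    /* One element per row, rows separated by a stride of COLS doubles. */
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                a[i][j] = i * COLS + j;
        MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD);  /* column 2 */
    } else if (rank == 1) {
        MPI_Recv(&a[0][2], 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received column: ");
        for (int i = 0; i < ROWS; i++)
            printf("%.0f ", a[i][2]);
        printf("\n");
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}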
Overview of MVAPICH and MVAPICH2
Projects (OSU MPI for InfiniBand)
• Focusing on
– MPI-1 (MVAPICH)
– MPI-2 (MVAPICH2)
• Open Source (BSD licensing) with anonymous SVN
access
• Directly downloaded and used by more than
440 organizations worldwide (in 30 countries)
• Available in the software stacks of
– Many IBA and server vendors
– OFED (OpenFabrics Enterprise Distribution)
– Linux Distributors
• Empowers multiple InfiniBand clusters in the TOP
500 list
• URL: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/
Larger IBA Clusters using MVAPICH
and Top500 Rankings (June ’06)
• 6th : 4000-node dual Intel Xeon 3.6 GHz cluster at Sandia
• 28th: 1100-node dual Apple Xserve 2.3 GHz cluster at Virginia Tech
• 66th: 576-node dual Intel Xeon EM64T 3.6 GHz cluster at Univ. of
Sherbrooke (Canada)
• 367th : 356-node dual Opteron 2.4 GHz cluster at Trinity Center
for High Performance Computing (TCHPC), Trinity College Dublin
(Ireland)
• 436th: 272-node dual Intel Xeon EM64T 3.4 GHz cluster at SARA
(The Netherlands)
• 460th: 200-node dual Intel Xeon EM64T 3.2 GHz cluster at Texas
Advanced Computing Center/Univ. of Texas
• 465th: 315-node dual Opteron 2.2 GHz cluster at NERSC/LBNL
• More are getting installed ….
MPI-level Latency (One-way):
IBA (Mellanox and PathScale) vs. Myrinet vs.
Quadrics
[Figure: one-way MPI latency vs. message size (up to 256 KB), shown as small-message and large-message panels, for MVAPICH-X, MVAPICH-Ex-1p, MVAPICH-Ex-2p, MPICH/MX, MPICH/QsNet-X, and MVAPICH-Gen2-Ex-DDR-1p; labeled small-message latencies range from 2.0 to 4.9 microseconds.]
• SC '03
• Hot Interconnect '04
• IEEE Micro (Jan-Feb) '05, one of the best papers from HotI '04
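One-way latency numbers such as these are normally obtained with a ping-pong microbenchmark along the lines of the sketch below. It uses only standard MPI calls, so it runs unchanged over MVAPICH or any other MPI library; the message size and iteration count are illustrative, and it must be run with at least two processes.

/* Sketch: ping-pong microbenchmark of the kind used for one-way latency.
 * Standard MPI only; message size and iteration count are illustrative. */
#include <mpi.h>
#include <stdio.h>

#define MSG_SIZE 4          /* bytes */
#define ITERS    10000

int main(int argc, char **argv)
{
    char buf[MSG_SIZE];
    int rank, size;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    /* One-way latency = half the average round-trip time. */
    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}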
MPI-level Bandwidth (Uni-directional):
IBA (Mellanox and PathScale) vs. Myrinet vs.
Quadrics
[Figure: uni-directional MPI bandwidth (MillionBytes/s) vs. message size (4 bytes to 1 MB) for the same MPI/interconnect combinations; labeled peak values include 1492, 1472, and 494 MB/s.]
MPI-level Bandwidth (Bi-directional):
IBA (Mellanox and PathScale) vs. Myrinet vs.
Quadrics
[Figure: bi-directional MPI bandwidth (MillionBytes/s) vs. message size (4 bytes to 1 MB) for MVAPICH-X, MVAPICH-Ex-1p, MVAPICH-Ex-2p, MPICH/MX-X, MPICH/QsNet-X, and MVAPICH-Gen2-DDR-Ex-1p; labeled peak values include 2724, 2628, 1841, 943, 908, and 901 MB/s.]
[Figure: four processes per node communicating concurrently over dual-rail InfiniBand DDR (Mellanox, HCA 0 and HCA 1) reach an aggregate bandwidth of 2808 MB/sec; the corresponding message-rate plot shows a peak of 3.16 million messages per second for 16-byte messages.]
M. J. Koop, W. Huang, A. Vishnu and D. K. Panda, Memory Scalability Evaluation of Next Generation Intel
Bensley Platform with InfiniBand, to be presented at Hot Interconnect Symposium (Aug. 2006).
High Performance and Scalable
Collectives
• Reliable MPI Broadcast using IB hardware
multicast
– Can broadcast a 1 KB message to 1024 nodes in
less than 40 microseconds
• RDMA-based designs for
– MPI_Barrier
– MPI_All_to_All
J. Liu, A. Mamidala and D. K. Panda, Fast and Scalable MPI-Level Broadcast
using InfiniBand’s Hardware Multicast Support, Int’l Parallel and Distributed
Processing Symposium (IPDPS ’04), April 2004
[Figure: MPI broadcast latency (microseconds) vs. message size.]
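A broadcast test in the same spirit can be written at the MPI level as in the sketch below, which simply averages the time of repeated MPI_Bcast calls after a barrier. Real collective benchmarks synchronize more carefully between iterations; the 1 KB message size and iteration count here are illustrative.

/* Sketch: timing MPI_Bcast at the MPI level. Production broadcast
 * benchmarks synchronize more carefully; this simply averages over many
 * iterations after a barrier. Message size and iteration count are
 * illustrative. */
#include <mpi.h>
#include <stdio.h>

#define MSG_SIZE 1024       /* 1 KB, as in the result quoted above */
#define ITERS    1000

int main(int argc, char **argv)
{
    char buf[MSG_SIZE];
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Bcast(buf, MSG_SIZE, MPI_CHAR, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg broadcast time: %.2f us\n", (t1 - t0) * 1e6 / ITERS);

    MPI_Finalize();
    return 0;
}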
[Figure: NAS benchmark execution time and HPL performance (GFLOPS) under checkpointing - left: intervals of 1, 2, and 4 minutes vs. no checkpointing; right: checkpointing interval and number of checkpoints of 2 min (6), 4 min (2), and 8 min (1) vs. none.]
• The execution time of NAS benchmarks increased slightly with frequent
checkpointing
• Frequent checkpointing also has some impact on HPL performance metrics
[Figure: bandwidth (MB/s) vs. number of threads (1-12).]
System: Sun x2100s (dual 2.2 GHz Opteron CPUs with x8
PCI-Express adapters)
File size: 128 MB for tmpfs and 1 GB for sinkfs
I/O size: 1 MB
Major Research Directions
• System Software/Middleware
– High Performance MPI on InfiniBand Cluster
– Clustered Storage and File Systems
– Solaris NFS over RDMA
– iWARP and its Benefits to High Performance Computing
– Efficient Shared Memory on High-Speed Interconnects
– High Performance Computing with Virtual Machines (Xen-IB)
– Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
– High Performance Networking for TCP-based Applications
– NIC-level Support for Collective Communication and Synchronization
– NIC-level Support for Quality of Service (QoS)
– Micro-Benchmarks and Performance Comparison of High-Speed
Interconnects
• More details on http://nowlab.cse.ohio-state.edu/ -> Projects
Why Targeting Virtualization?
• Ease of management
– Virtualized clusters
– VM migration – deal with system upgrade/failures
• Customized OS
– Light-weight OSes: not widely adopted due to
management difficulties
– VMs make such customized OSes practical
• System security & productivity
– Users can do ‘anything’ inside a VM; in the worst case
they crash the VM, not the whole system
Challenges
• Performance overhead of I/O virtualization
• Management framework to take advantage of VM
technology for HPC
• Migration of modern OS-bypass network devices
• File system support for VM-based clusters
Our Initial Studies
• J. Liu*, W. Huang, B. Abali*, D. K. Panda. High Performance VMM-
Bypass I/O in Virtual Machines, USENIX’06
[Figure: applications access InfiniBand either through privileged access (via the OS) or through VMM-bypass access (directly from user space).]
• Backend and privileged modules can also reside in a special VM
Xen-IB: an InfiniBand Virtualization Driver for Xen
• Follows Xen split driver model
• Presents virtual HCAs to guest domains
– Para-virtualization
• Two modes of access:
– Privileged access
• OS involved
• Setup, resource management and memory
management
– OS/VMM-bypass access
• Directly done in user space/guest VM
• Maintains high performance of InfiniBand
hardware
MPI Latency and Bandwidth
(MVAPICH)
[Figure: MVAPICH latency (microseconds) and bandwidth (MillionBytes/s) vs. message size in a Xen guest domain ("xen") compared with native execution ("native").]
[Figure: NAS Parallel Benchmarks (BT, CG, EP, FT, IS, LU, MG, SP) - normalized performance, VM vs. native.]
[Table: per-benchmark time split into three components - BT 0.4% / 0.2% / 99.4%, CG 0.6% / 0.3% / 99.0%, EP 0.6% / 0.3% / 99.3%, FT 1.6% / 0.5% / 97.9%, IS 3.6% / 1.9% / 94.5%, LU 0.6% / 0.3% / 99.0%, MG 1.8% / 1.0% / 97.3%, SP 0.3% / 0.1% / 99.6%; in every case the dominant component exceeds 94%.]
[Figure: the three-tier data-center model again - routers/edge servers, application servers, and database servers - accessed by clients over the WAN.]
[Figure: the protocol stack on each host - user-level sockets applications run over either the kernel TCP/IP sockets provider or the Sockets Direct Protocol, both on top of an InfiniBand channel adapter. Source: InfiniBand Trade Association, 2002.]
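The attraction of SDP is that it sits underneath the unmodified sockets API, so an ordinary TCP client such as the sketch below can be carried over InfiniBand (for example via an SDP interposition/preload library, or over IPoIB) without any source changes. The host name and port here are placeholders.

/* Sketch: an ordinary TCP client using the sockets API. Because SDP sits
 * below the sockets interface, code like this can be redirected over
 * InfiniBand without source changes. Host name and port are placeholders. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints, *res;
    char reply[256];
    int fd;
    ssize_t n;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_INET;
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo("server.example.com", "8080", &hints, &res) != 0)
        return 1;

    fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
        return 1;

    send(fd, "GET / HTTP/1.0\r\n\r\n", 18, 0);   /* simple request */
    n = recv(fd, reply, sizeof(reply) - 1, 0);
    if (n > 0) {
        reply[n] = '\0';
        printf("%s\n", reply);
    }

    close(fd);
    freeaddrinfo(res);
    return 0;
}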
Our Objectives
• To study the importance of the communication
layer in the context of a multi-tier data center
– Sockets Direct Protocol (SDP) vs. IPoIB
• Explore whether InfiniBand mechanisms can help in
designing various components of a data center
efficiently
– Web Caching/Coherency
– Re-configurability and QoS
– Load Balancing, I/O and File systems, etc.
• Studying workload characteristics
• In-memory databases
3-Tier Datacenter Testbed at OSU
[Figure: 3-tier testbed - clients generate requests for both the web servers and the database servers; Tier 1 proxy nodes perform TCP termination, load balancing, and caching; Tier 2 web/application servers (Apache) are used to evaluate caching schemes; Tier 3 database servers (MySQL/DB2) are backed by a file system.]
[Figure: latency (microseconds) and bandwidth (Mbps) vs. message size (1 byte to 16 KB).]
[Figure: active caching - user requests are served from cached data at the proxy nodes, while updates at the back-end nodes have to be reflected in those proxy caches.]
Active Caching Performance
[Figure: active caching throughput (transactions per second) under a ZipF request distribution and under the WorldCup trace.]
“Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-tier Data-centers over InfiniBand”, S. Narravula,
P. Balaji, K. Vaidyanathan and D. K. Panda. IEEE International Conference on Cluster Computing and the Grid (CCGrid) ’05.
“Supporting Strong Coherency for Active Caches in Multi-tier Data-Centers over InfiniBand”, S. Narravula, P. Balaji, K.
Vaidyanathan, K. Savitha, J. Wu and D. K. Panda. Workshop on System Area Networks (SAN); with HPCA ’03.
Dynamic Re-configurability
and QoS
• More data centers are serving dynamic data
• How to decide the number of proxy nodes vs. application
servers?
• Current approach
– Use a fixed distribution
– Incorporate over-provisioning to handle dynamic data
• Can we design a dynamic re-configurability scheme with shared
state using RDMA operations? (see the sketch after this list)
– Allows a proxy node to work as an application server (and vice versa) as needed
– Allocates resources on demand
– Over-provisioning is no longer required
• The scheme can also be used for servers hosting multiple web sites
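As a rough illustration of the shared-state idea referenced above, the sketch below fetches a load counter from another node with a one-sided RDMA read, so a reconfiguration decision can be made without interrupting a loaded server. As with the RDMA write sketch earlier, the queue pair, completion queue, memory registration, and exchange of the remote address and rkey are assumed to have been set up beforehand; the function and its parameters are illustrative.

/* Sketch: reading a shared load counter on another node with a one-sided
 * RDMA read, so reconfiguration decisions need not interrupt busy servers.
 * Setup (connected QP, CQ, registered local buffer, remote address/rkey)
 * is assumed, as in the earlier RDMA write sketch. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_read_counter(struct ibv_qp *qp, struct ibv_cq *cq,
                      struct ibv_mr *mr, uint64_t *local_copy,
                      uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_copy,  /* where the value lands */
        .length = sizeof(*local_copy),
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    struct ibv_wc wc;
    int n;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 2;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_READ;  /* remote CPU stays idle */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Wait for the read to complete before using *local_copy. */
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}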
Dynamic Reconfigurability in
Shared Multi-tier Data-Centers
[Figure: clients connect over the WAN to load-balancing clusters at Site A, Site B, and Site C, which host Website A, Website B, and Website C respectively.]
[Figure: left - performance vs. request burst length (512 to 16384) for the rigid, reconfiguration, and over-provisioning schemes; right - node utilization over time (iterations) for the same three schemes.]
The performance of the dynamic reconfiguration scheme depends largely on the burst length of the requests. For large bursts of requests, the dynamic reconfiguration scheme utilizes all idle nodes in the system.
P. Balaji, S. Narravula, K. Vaidyanathan, H.-W. Jin, K. Savitha and D. K.
Panda, Exploiting Remote Memory Operations to Design Efficient Reconfigurations for Shared
Data-Centers over InfiniBand, RAIT ’04
QoS meeting capabilities
[Figure: percentage of times the hard QoS requirement is met for high-priority requests (left) and low-priority requests (right), for Case 1, Case 2, and Case 3.]
PI: D. K. Panda
Co-PIs: G. Agrawal, P. Sadayappan, J. Saltz
and H.-W. Shen
Total funding: $3.01M ($1.53M from NSF + $1.48M from the Ohio State
Board of Regents and various units at OSU)
Experimental Testbed Installed
[Figure: experimental testbed linking BMI and OSC - a 70-node memory cluster (512 GBytes of memory, 24 TB of disk, GigE and InfiniBand SDR), 10.0 GigE switches, 2x10 = 20 GigE links (to be upgraded to 40 GigE in Year 4), and an existing mass storage system of 500 TBytes.]
• Since 1995
– 11 PhD Dissertations
– 15 MS Theses
– 180+ papers in prestigious international conferences and
journals
– 5 journal and 24 conference/workshop papers in 2005
– 2 journal and 21 conference/workshop papers in 2006 (so far)
– Electronically available from the NBC (Network-Based Computing) web page
• http://nowlab.cse.ohio-state.edu -> Publications
Student Quality and Accomplishments
http://www.cse.ohio-state.edu/~panda/
http://nowlab.cse.ohio-state.edu/
E-mail: panda@cse.ohio-state.edu
Acknowledgements