
Research @ Northeastern University

- I/O storage modeling and performance: David Kaeli
- Soft error modeling and mitigation: Mehdi B. Tahoori

EMC Presentation

April 2005

I/O Storage Research at Northeastern University


David Kaeli, Yijian Wang
Department of Electrical and Computer Engineering, Northeastern University, Boston, MA
kaeli@ece.neu.edu

Outline
- Motivation to study file-based I/O
- Profile-driven partitioning for parallel file I/O
- I/O Qualification Laboratory @ NU
- Areas for future work

Important File-based I/O Workloads


Many subsurface sensing and imaging workloads involve file-based I/O:
- Cellular biology: in-vitro fertilization, with NU biologists
- Medical imaging: cancer therapy, with MGH
- Underwater mapping: multi-sensor fusion, with the Woods Hole Oceanographic Institution
- Ground-penetrating radar: toxic waste tracking, with Idaho National Labs

The Impact of Profile-guided Parallelization on SSI Applications

- Reduced the runtime of a single-body Steepest Descent Fast Multipole Method (SDFMM) application by 74% on a 32-node Beowulf cluster (hot-path parallelization, data restructuring)
  [Figure: ground-penetrating radar scene with air, mine, and soil layers]
- Reduced the runtime of a Monte Carlo scattered light simulation by 98% on a 16-node Silicon Graphics Origin 2000 (Matlab-to-C compilation, hot-path parallelization)
  [Chart: Scattered Light Simulation Speedup; runtime in seconds, log scale from 1 to 100,000, for the original, Matlab-to-C, and hot-path parallelized versions]
- Obtained superlinear speedup of the Ellipsoid Algorithm run on a 16-node IBM SP2 (Matlab-to-C compilation, hot-path parallelization)
  [Chart: Ellipsoid Algorithm Speedup versus the serial C version; speedup (0-20) against number of nodes (1-16) for 64-, 256-, and 1024-vector inputs, compared to linear speedup]

Limits of Parallelization
- For compute-bound workloads, Beowulf clusters can be used effectively to overcome computational barriers
- Middleware (e.g., MPI and MPI-IO) can significantly reduce the programming effort on parallel systems
- Multiple clusters can be combined using Grid middleware (Globus Toolkit)
- For file-based I/O-bound workloads, Beowulf clusters and Grid systems are presently ill-suited to exploit the potential parallelism present on these systems

Outline
- Motivation to study file-based I/O
- Profile-driven partitioning for parallel file I/O
- I/O Qualification Laboratory @ NU
- Areas for future work

Parallel I/O Acceleration

The I/O bottleneck:
- The growing gap between the speed of processors, networks, and the underlying I/O devices
- Many imaging and scientific applications access disks very frequently

I/O-intensive applications:
- Out-of-core applications: work on large datasets that cannot fit in main memory
- File-intensive applications: access file-based datasets frequently, with a large number of file operations

Introduction
Storage architectures:
- Direct Attached Storage (DAS): the storage device is directly attached to the computer
- Network Attached Storage (NAS): the storage subsystem is attached to a network of servers, and file requests are passed through a parallel filesystem to the centralized storage device
- Storage Area Network (SAN): a dedicated network that provides an any-to-any connection between processors and disks

I/O Partitioning
[Diagram: an I/O-intensive application parallelized across multiple processes (i.e., MPI-IO) and multiple disks (i.e., RAID), contrasting data striping with data partitioning]

I/O Partitioning
- I/O is parallelized at both the application level (using MPI and MPI-IO) and the disk level (using file partitioning)
- Ideally, every process will access only files on its local disk (though this is typically not possible due to data sharing)
- How do we recognize the access patterns? We use a profile-guided approach
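The difference between plain disk-level striping and the profile-guided placement described here can be sketched in Python. This is an illustrative reconstruction, not the talk's implementation; the mapping of process p to local disk p mod n_disks is an assumption.

```python
from collections import Counter

def stripe(chunks, n_disks):
    """RAID-style data striping: chunk i is placed on disk i mod n_disks,
    ignoring which process actually uses it."""
    layout = {d: [] for d in range(n_disks)}
    for i, chunk in enumerate(chunks):
        layout[i % n_disks].append(chunk)
    return layout

def partition_by_profile(accesses, n_disks):
    """Profile-guided placement: each chunk goes to the disk local to the
    process that accesses it most often (assumed: process p's local disk
    is disk p mod n_disks)."""
    counts = {}  # chunk -> Counter of accessing process IDs
    for pid, chunk in accesses:
        counts.setdefault(chunk, Counter())[pid] += 1
    layout = {d: [] for d in range(n_disks)}
    for chunk, c in counts.items():
        owner = c.most_common(1)[0][0]  # most frequent accessor
        layout[owner % n_disks].append(chunk)
    return layout
```

Given a profile showing chunk "a" touched mostly by process 0, `partition_by_profile` places it on process 0's local disk, while `stripe` scatters chunks without regard to the access pattern.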


Profile Generation
1. Run the application
2. Capture I/O execution profiles
3. Apply our partitioning algorithm
4. Rerun the tuned application


I/O traces and partitioning
- For every process, for every contiguous file access, we capture the following I/O profile information: process ID, file ID, address, chunk size, I/O operation (read/write), and timestamp
- Generate a partition for every process
- Optimal partitioning is NP-complete, so we developed a greedy algorithm
- We have found we can use partial profiles to guide partitioning
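As an illustration, the six profile fields above could be captured in a record such as the following sketch; the field names and the whitespace-separated trace format are assumptions, not the actual tracer's format.

```python
from dataclasses import dataclass

@dataclass
class IOTraceRecord:
    """One contiguous file access, as described in the profile fields above."""
    pid: int          # process ID issuing the access
    file_id: int      # identifier of the file touched
    address: int      # byte offset of the contiguous access
    chunk_size: int   # length of the access in bytes
    op: str           # "read" or "write"
    timestamp: float  # time at which the access occurred

def parse_trace_line(line: str) -> IOTraceRecord:
    """Parse one whitespace-separated trace line (hypothetical format)."""
    pid, fid, addr, size, op, ts = line.split()
    return IOTraceRecord(int(pid), int(fid), int(addr), int(size), op, float(ts))
```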

Greedy File Partitioning Algorithm

for each I/O process, create a partition;
for each contiguous data chunk {
    total up the # of read/write accesses on a process-ID basis;
    if the chunk is accessed by only one process
        assign the chunk to the associated partition;
    if the chunk is read (but never written) by multiple processes
        duplicate the chunk in all partitions where read;
    if the chunk is written by one process, but later read by multiple
        assign the chunk to all partitions where read, and broadcast the updates on writes;
    else
        assign the chunk to a shared partition;
}
for each partition, sort the chunks based on the earliest timestamp of each chunk;
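The rules above can be sketched in Python roughly as follows. This is an illustrative reconstruction, not the authors' implementation: the broadcast-on-write case is simplified to replication, and trace records are assumed to be (pid, chunk, op) tuples in timestamp order.

```python
from collections import defaultdict

def greedy_partition(trace, n_procs):
    """Greedy file partitioning over a trace of (pid, chunk, op) records."""
    readers = defaultdict(set)   # chunk -> set of reading process IDs
    writers = defaultdict(set)   # chunk -> set of writing process IDs
    first_seen = {}              # chunk -> index of its earliest access
    for t, (pid, chunk, op) in enumerate(trace):
        first_seen.setdefault(chunk, t)
        (readers if op == "read" else writers)[chunk].add(pid)

    partitions = {p: [] for p in range(n_procs)}
    shared = []
    for chunk in first_seen:
        procs = readers[chunk] | writers[chunk]
        if len(procs) == 1:
            partitions[next(iter(procs))].append(chunk)   # private chunk
        elif not writers[chunk]:
            for p in readers[chunk]:                      # read-only: replicate
                partitions[p].append(chunk)
        elif len(writers[chunk]) == 1:
            for p in procs:                               # replicate; the writer
                partitions[p].append(chunk)               # broadcasts its updates
        else:
            shared.append(chunk)                          # multiple writers: shared
    for p in partitions:                                  # order by earliest access
        partitions[p].sort(key=first_seen.get)
    return partitions, shared
```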

Parallel I/O Workloads

NASA Parallel Benchmark (NPB2.4)/BT
- Computational fluid dynamics
- Generates a file (~1.6 GB) dynamically and then reads it back
- Writes/reads sequentially in chunk sizes of 2040 bytes

SPEChpc96/seismic
- Seismic processing
- Generates a file (~1.5 GB) dynamically and then reads it back
- Writes sequential chunks of 96 KB and reads sequential chunks of 2 KB

Tile-IO
- Parallel Benchmarking Consortium
- Tiled access to a two-dimensional matrix (~1 GB) with overlap
- Writes/reads sequential chunks of 32 KB, with 2 KB of overlap

Perf
- Parallel I/O test program within MPICH
- Writes a 1 MB chunk at a location determined by rank, no overlap

Mandelbrot
- An image processing application that includes visualization
- Chunk size is dependent on the number of processes

Beowulf Cluster
[Diagram: six P2-350MHz nodes, each with a local PCI-IDE disk, and two RAID nodes, connected by a 10/100Mb Ethernet switch]

Hardware Specifics
- DAS configuration: Linux box, Western Digital WD800BB (IDE), 80 GB, 7200 RPM
- Beowulf cluster (base configuration):
  - Fast Ethernet, 100 Mbits/sec
  - Network-attached RAID: Morstor TF200 with 6 x 9 GB Seagate SCSI disks, 7200 RPM, RAID-5
  - Local attached IDE disks: IBM UltraATA-350840, 5400 RPM
- Fibre Channel disks: Seagate Cheetah X15 ST-336752FC, 15000 RPM

Write/Read Bandwidth
[Chart: NPB2.4/BT; bandwidth (MB/sec, 0-200) for Unix, MPI-IO, and P-IO writes and reads at 4, 9, 16, and 25 processes]
[Chart: SPEChpc/seis; bandwidth (MB/sec, 0-200) for Unix, MPI-IO, and P-IO writes and reads at 4, 8, 16, and 24 processes]

Write/Read Bandwidth
[Chart: MPI-Tile; bandwidth (MB/sec, 0-125) for MPI and PIO writes and reads]
[Chart: Perf; bandwidth (MB/sec, 0-250) for MPI and PIO writes and reads]
[Chart: Mandelbrot; bandwidth (MB/sec, 0-250) for MPI and PIO writes and reads at 4, 8, 16, and 24 processes]

Total Execution Time
[Chart: execution time (seconds, 0-4000) of MPI-IO versus PIO for NPB2.4/BT, SPEChpc, Tile-IO, Perf, and Mandelbrot]

Profile training sensitivity analysis
- We have found that I/O access patterns are independent of file-based data values
- When we increase the problem size or reduce the number of processes, either:
  - the number of I/Os increases, but the access patterns and chunk size remain the same (SPEChpc96, Mandelbrot), or
  - the number of I/Os and the I/O access patterns remain the same, but the chunk size increases (NBT, Tile-IO, Perf)
- Re-profiling can therefore be avoided

Execution-driven Parallel I/O Modeling
- There is a growing need to process large, complex datasets in high-performance parallel computing applications
- Efficient implementation of storage architectures can significantly improve system performance
- We need an accurate simulation environment in which users can test and evaluate different storage architectures and applications

Execution-driven I/O Modeling
- Target applications: parallel scientific programs (MPI)
- Target machine/host machine: Beowulf clusters
- Uses DiskSim as the underlying disk drive simulator
- Uses direct execution to model CPU and network communication
- We execute the real parallel I/O accesses and, at the same time, calculate the simulated I/O response time
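The direct-execution idea of replaying real I/O accesses while substituting a modeled response time can be sketched as follows. The linear cost model below is only a stand-in for DiskSim, with made-up seek and transfer constants.

```python
class ToyDiskModel:
    """Placeholder disk model (NOT DiskSim): fixed positioning cost plus a
    linear transfer cost. Constants are illustrative assumptions."""
    SEEK_MS = 8.0      # assumed average positioning time (ms)
    MB_PER_S = 40.0    # assumed sustained transfer rate (MB/s)

    def service_time(self, nbytes):
        """Simulated service time, in seconds, for one access of nbytes."""
        return self.SEEK_MS / 1000.0 + nbytes / (self.MB_PER_S * 1e6)

def simulated_run(accesses, model):
    """Replay (op, nbytes) accesses, accumulating modeled I/O time.

    In a real direct-execution harness, the application's actual read()
    or write() would run here; only the timing is taken from the model.
    """
    total = 0.0
    for op, nbytes in accesses:
        # ... real I/O call executes here in direct execution ...
        total += model.service_time(nbytes)
    return total
```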

Validation: Synthetic I/O Workload on DAS
[Charts: modeled versus real response times (seconds) for sequential reads and writes over access sizes of 1-16 blocks (1000 accesses), and for non-contiguous reads and writes over seek distances of 1-32 blocks (access size = 1 block, 1000 accesses)]

Simulation Framework - NAS
[Diagram: each node produces local I/O traces; logical file access addresses pass through the LAN/WAN network file system and filesystem metadata to a RAID controller, whose I/O requests and traces drive DiskSim]

Execution Time of NPB2.4/BT on NAS - base configuration
[Chart: modeled versus real execution time (seconds, 0-4000) at 4, 9, 16, and 25 processors]

Simulation Framework - SAN-direct
- A variant of SAN in which disks are distributed across the network and each server is directly connected to a single device
- File partitioning: utilize I/O profiling and data-partitioning heuristics to distribute portions of files to disks close to the processing nodes
[Diagram: per-node filesystems connected over the LAN/WAN, each feeding I/O traces to its own DiskSim instance]

Execution Time of NPB2.4/BT on SAN-direct - base configuration
[Chart: modeled versus real execution time (seconds, 0-3000) at 4, 9, 16, and 25 processors]

Hardware Specifications
[Table of hardware specifications: not preserved in the extraction]

I/O Bandwidth of SPEChpc/seis
[Chart: bandwidth (MB/s, 0-250) at 4, 8, and 16 processors across storage architectures: SAN-direct-ATA, SAN-direct-SCSI, SAN-direct-FC, NAS-joulian, SAN-joulian, NAS-ATA, NAS-SCSI, NAS-FC]

I/O Bandwidth of Mandelbrot
[Chart: bandwidth (MB/s, 0-400) at 4, 8, and 16 processors across storage architectures: SAN-direct-ATA, SAN-direct-SCSI, SAN-direct-FC, NAS-joulian, SAN-joulian, NAS-ATA, NAS-SCSI, NAS-FC]

Publications
1. Profile-guided File Partitioning on Beowulf Clusters, Journal of Cluster Computing, Special Issue on Parallel I/O, to appear 2005.
2. Execution-Driven Simulation of Network Storage Systems, Proceedings of the 12th ACM/IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), October 2004, pp. 604-611.
3. Profile-Guided I/O Partitioning, Proceedings of the 17th ACM International Symposium on Supercomputing, June 2003, pp. 252-260.
4. Source Level Transformations to Apply I/O Data Partitioning, Proceedings of the IEEE Workshop on Storage Network Architecture and Parallel I/O, Oct. 2003, pp. 12-21.
5. Profile-Based Characterization and Tuning for Subsurface Sensing and Imaging Applications, International Journal of Systems, Science and Technology, September 2002, pp. 40-55.

Summary of Cluster-based Work
- Many imaging applications are dominated by file-based I/O
- Parallel systems can only be effectively utilized if I/O is also parallelized
- Developed a profile-guided approach to I/O data partitioning
  - Impacting clinical trials at MGH
  - Reduced overall execution time by 27-82% over MPI-IO
- The execution-driven I/O model is highly accurate and provides significant modeling flexibility

Outline
- Motivation to study file-based I/O
- Profile-driven partitioning for parallel file I/O
- I/O Qualification Laboratory @ NU
- Areas for future work

I/O Qualification Laboratory
- Working with the Enterprise Strategy Group
- Develop a state-of-the-art facility to provide independent performance qualification of Enterprise Storage (ES) systems
- Provide a quarterly report to the ES customer base on the status of current ES offerings
- Work with leading ES vendors to provide them with custom early performance evaluation of their beta products

I/O Qualification Laboratory
- Contacted by IOIntegrity and SANGATE for product qualification
- Developed potential partners that are leaders in the ES field
- Initial proposals already reviewed by IBM, Hitachi, and other ES vendors
- Looking for initial endorsement from industry

I/O Qualification Laboratory
Why @ NU:
- Track record with industry (EMC, IBM, Sun)
- Experience with benchmarking and I/O characterization
- Interesting set of applications (medical, environmental, etc.)
- Great opportunity to work within the cooperative education model

Outline
- Motivation to study file-based I/O
- Profile-driven partitioning for parallel file I/O
- I/O Qualification Laboratory @ NU
- Areas for future work

Areas for Future Work
- Designing a peer-to-peer storage system on a Grid system by partitioning datasets across geographically distributed storage devices
[Diagram: two clusters connected via the Internet: joulian.hpcl.neu.edu (head node, RAID, 31 sub-nodes, 100 Mbit/s) and keys.ece.neu.edu (head node, 8 sub-nodes, 1 Gbit/s)]

NPB2.4/BT read performance
[Chart: read bandwidth (MB/s, 0-180) for single-server, dual-server, and P2P configurations at 4, 9, 16, and 25 processes]

Areas for Future Work
- Reduce simulation time by identifying characteristic phases in I/O workloads
- Apply machine learning algorithms to identify clusters of representative I/O behavior
- Utilize K-Means and multinomial clustering to obtain high fidelity in simulation runs utilizing sampled I/O behavior
  - A Multinomial Clustering Model for Fast Simulation of Architecture Designs, submitted to the 2005 ACM KDD Conference.
