
Scalable Parallel File I/O with IBM GPFS

Klaus Gottschalk
HPC Architect Germany

Best Practices for Scalable I/O are not always straightforward

• Scaling of parallel applications sometimes reveals bad file access patterns
• Assumptions that hold for local disk file systems do not carry over to a parallel file system
• A straightforward approach often works fine until the application is scaled to large node counts
• A collaboration with LRZ to develop best-practice recommendations has started
  – Co-workers needed

Agenda
• GPFS Features and new GPFS 3.5 Update
– GPFS Storage Server (GSS) – Friday Karsten Kutzer, IBM
– Active File Management (AFM)
– File Placement Optimizer (FPO)
• File Access Problems and Best Practice Examples

The IBM General Parallel File System (GPFS)
Shipping since 1998

Extreme Scalability
 • 2^63 files per file system
 • Maximum file system size: 2^99 bytes
 • Production 19 PB file system
 • Number of nodes: 1 to 8192 (16384 maximum)

Proven Reliability
 • No special nodes
 • Add/remove nodes and storage on the fly
 • Rolling upgrades
 • Administer from any node
 • Data replication
 • Snapshots
 • File system journaling

Manageability
 • Integrated tiered storage – Storage pools
 • Quotas
 • Policy-driven automation
 • Clustered NFS
 • SNMP monitoring
 • TSM / HPSS (DMAPI)
IBM General Parallel File System (GPFS™) – History and evolution

1998 – First called GPFS
 • GPFS Clusters, General File Serving
 • Standards: portable operating system interface (POSIX) semantics, large block, directory and small file performance, data management
 • Early workloads: HPC, Virtual Tape Server (VTS), Research, Visualization, Digital Media, Seismic, Weather, Life sciences

2002 – GPFS 2.1-2.3
 • Linux® exploration (multiple architectures), 32 bit / 64 bit
 • Inter-op (IBM AIX® & Linux), Loose Clusters, GPFS Multicluster
 • GPFS over wide area networks (WAN)
 • Large scale clusters: thousands of nodes

2005/2006 – GPFS 3.1-3.2
 • Information lifecycle management (ILM): Storage Pools, File sets, Policy Engine
 • Ease of administration
 • Distributed Token Management
 • Multiple networks / RDMA
 • NFS v4 support, Windows 2008
 • Small file performance

2009 – GPFS 3.3
 • Restricted Admin Functions
 • Improved installation, new license model
 • Improved snapshot and backup
 • Improved ILM policy engine
 • Multiple NSD servers

2010 – GPFS 3.4
 • Enhanced Windows cluster support – homogenous Windows Server clusters
 • Performance and scaling improvements
 • Enhanced migration and diagnostics

2012 – GPFS 3.5
 • Caching via Active File Management (AFM)
 • GPFS Storage Server
 • GPFS File Placement Optimizer (FPO)

Evolution of the global namespace:
GPFS Active File Management (AFM)

 • 1993 – GPFS introduced concurrent file system access from multiple nodes.
 • 2005 – Multi-cluster expands the global namespace by connecting multiple sites.
 • 2011 – AFM takes the global namespace truly global by automatically managing asynchronous replication of data.

AFM Use Cases

HPC
 • Grid computing: allowing data to move transparently during grid workflows
 • Facilitates content distribution for global enterprises and "follow-the-sun" engineering teams

Distributed NAS
 • WAN Caching: caching across the WAN between SoNAS clusters, or between SoNAS and another NAS vendor
 • Data Migration: online cross-vendor data migration
 • Disaster Recovery: multi-site fileset-level replication/failover
 • Shared Namespace: across SoNAS clusters

Storage Cloud
 • NAS as a key building block of cloud storage architecture
 • Enables edge caching in the cloud
 • DR support within cloud data repositories
 • Peer-to-peer data access among cloud edge sites
 • Global wide-area file system spanning multiple sites in the cloud
AFM Architecture

• A fileset on the home cluster is associated with a fileset on one or more cache clusters
• If data is in cache …
  – Cache hit at local disk speeds
  – Client sees local GPFS performance if the file or directory is in cache
• If data is not in cache …
  – Data and metadata (files and directories) are pulled on demand at network line speed and written to GPFS
• Uses NFS/pNFS for WAN data transfer
• If data is modified at home …
  – Revalidation is done at a configurable timeout
  – Close to NFS-style close-to-open consistency across sites
  – POSIX strong consistency within the cache site
• If data is modified at cache …
  – Writes see no WAN latency
  – Writes are done to the cache (i.e. local GPFS), then asynchronously pushed home
• If the network is disconnected …
  – Cached data can still be read, and writes to the cache are written back after reconnection
  – There can be conflicts …

[Diagram: cache cluster sites 1 and 2 (GPFS + Panache) connect through gateway (GW) nodes to the home cluster site (any NAS box or SOFS) via pNFS/NFS over the WAN – pull on cache miss, push on write.]
AFM Example: Global Namespace
[Diagram: three GPFS clusters with file systems store1, store2, and store3. Each file system holds some of the filesets /data1 … /data6 locally and defines cache filesets for the ones held remotely, so clients on every cluster see the full namespace /global/data1 … /global/data6.]

 • See all data from any cluster
 • Cache as much data as required, or fetch data on demand

Policy-based Pre-fetching and Expiration of Data

• Policy-based pre-population
• Periodically runs a parallel inode scan at home
  – Selects files/directories based on policy criteria
  – Includes any user-defined metadata in extended attributes (xattrs) or other file attributes
  – SQL-like construct to select, e.g.:
    RULE LIST 'prefetchlist' WHERE FILESIZE > 1GB AND MODIFICATION_TIME > CURRENT_TIME - 3600 AND USER_ATTR1 = "sat-photo" OR USER_ATTR2 = "classified"
• The cache then pre-fetches the selected objects
  – Runs asynchronously in the background
  – Parallel multi-node prefetch
  – Can call out when completed

• Staleness control
  – Defined based on the time since disconnection
  – Once a cache is expired, no access to the cache is allowed
  – Manual expire/unexpire option for administrators
    • mmafmctl expire / unexpire (ctlcache in SoNAS)
  – Allowed only for read-only (RO) mode caches
  – Disabled for single-writer (SW) and local-update (LU) modes, as they are sources of data themselves

File Placement Optimizer (FPO)


Architecture
 Use disk local to each server
 All nodes are NSD servers and NSD clients
 Designed for MapReduce workloads

MapReduce Environment Using GPFS-FPO (File Placement Optimizer)

[Diagram: filers export data over NFS to a MapReduce cluster; GPFS-FPO spans the cluster nodes' local disks and serves the MapReduce jobs and users.]

 Uses disk local to each server


 Aggregates the local disk space into a single redundant shared file system
 Designed for MapReduce workloads
 Unlike HDFS, GPFS-FPO is POSIX compliant – so data maintenance is easy
 Intended as a drop-in replacement for open-source HDFS (the IBM BigInsights product may be required)
Another GPFS-FPO Use Case – Typical In HPC
[Diagram: filers export project data over NFS to an HPC cluster; each compute node uses its own local file system as scratch space, and a scheduler (LSF) runs the user jobs.]

 Local file systems used for high speed scratch space


 Usually scratch space is RAID 0 (for speed) so reliability can be an issue
 Disk capacity for scratch space can be a limiting factor
 An inaccessible scratch disk renders the compute node useless
 File systems used: ext2, ext3, XFS
Another GPFS-FPO Use Case
[Diagram: filers export project data over NFS to an HPC cluster; GPFS-FPO aggregates the nodes' local disks into one shared scratch file system, and a scheduler (LSF) runs the user jobs.]

 GPFS-FPO creates a single, shared scratch disk pool


 Reliability is higher because of GPFS-FPO redundancy design
 Scratch disk capacity is much larger
 Likelihood of filling all the scratch disk is much lower
 Performance is preserved because GPFS-FPO exploits locality by design
GPFS Features and Internal Structures

• Wide striping across all NSDs in a GPFS pool
• Token-based locking
• Byte-range locking
• Fine-granular directory locking

GPFS Concept – Wide Striping

[Diagram: a file is split into blocks 1–6 and striped across all disks. The client node reaches the GPFS storage (NSD server) nodes over the network, and each server node drives its own disk storage. At 10 MB/s per disk, three disks striped together deliver 30 MB/s to a single job.]

GPFS Concepts - Locking and Tokens

[Diagram: four tasks A–D lock byte ranges of one 8 GB file – A: 0–2 GB, B: 2–4 GB, C: 4–6 GB, D: 5–7 GB. D's range overlaps C's, so node D has to wait for node C to release its token.]

GPFS uses a token-based locking mechanism
• On a lock request, the token system manager grants access for the token validity time
• No further connection to the manager is required – reduced locking overhead
• After token expiration, no client involvement is required for recovery

GPFS provides byte-range tokens for synchronizing parallel access to a file
• With a valid token, each task can independently write to its own file region
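A minimal POSIX sketch of this idea (not from the slides): each task writes only its own non-overlapping byte range, so the byte-range tokens never conflict. The file path, region size, and the way the task id is passed in are illustrative assumptions.

/* Each task writes its own disjoint region; ranges never overlap,
 * so GPFS can grant each node its byte-range token without revocations. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define REGION_SIZE (4 * 1024 * 1024)   /* 4 MiB per task, chosen arbitrarily */

int main(int argc, char **argv)
{
    int task_id = (argc > 1) ? atoi(argv[1]) : 0;   /* e.g. MPI rank or array index */
    int fd = open("/gpfs/scratch/shared.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(REGION_SIZE);
    memset(buf, 'A' + (task_id % 26), REGION_SIZE);

    /* The offset is a function of the task id only, so regions are disjoint. */
    off_t offset = (off_t)task_id * REGION_SIZE;
    if (pwrite(fd, buf, REGION_SIZE, offset) != REGION_SIZE) {
        perror("pwrite");
        return 1;
    }

    free(buf);
    close(fd);
    return 0;
}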

Parallel File Access

• Parallel writes require byte-range locking of independent regions

• Overlapping regions cause serialization of write access
  – A task has to wait until the token expires or is freed by the holding task
  – Potential race conditions for the overlapping region

• Parallel I/O libraries ease use and hide the complexity of the locking mechanism
  – They additionally provide block aggregation and caching
  – Examples: MPI-IO, MIO

• The data region per task should be large enough
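A minimal MPI-IO sketch along these lines (not from the slides): every rank writes one large, non-overlapping region with a collective call, letting the MPI-IO layer aggregate blocks instead of issuing many small writes. The file path and region size are illustrative assumptions.

/* Collective write of disjoint regions; MPI-IO handles aggregation. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define REGION_SIZE (8 * 1024 * 1024)   /* 8 MiB per rank, chosen arbitrarily */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(REGION_SIZE);
    memset(buf, 'A' + (rank % 26), REGION_SIZE);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/gpfs/scratch/checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank's offset depends only on its rank, so regions never overlap. */
    MPI_Offset offset = (MPI_Offset)rank * REGION_SIZE;
    MPI_File_write_at_all(fh, offset, buf, REGION_SIZE, MPI_CHAR,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}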

Fine Granular Directory Locking (FGDL)
Traditional file systems implemented directories as a linear file; a file create locked the entire directory
• Fine if files are in different directories
• Not the best if they are in the same directory
• Unfortunately, this is a natural way to organize files, and programmers are hard to re-educate
  – e.g. checkpoints, output files for a time step

GPFS optimization to improve write sharing
• GPFS directories use extendible hashing to map file names to directory blocks
  – The last n bits of the hash value determine the directory block number
  – Example: hash("NewFile") = 10111101 means the directory entry goes in block 5 (last three bits 101)
• A file create now only locks the hash value of the file name
  – The lock ensures against multiple nodes simultaneously creating the same file
  – The actual update is shipped to the metanode
  – If the create requires a directory block split, the lock is upgraded to cover both the old and the new block
  – Parallel file create performance no longer depends on whether or not the files are in the same directory
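To illustrate only the block-selection step, here is a small sketch; the hash function and block count are made up (GPFS's real directory hash is internal), only the "last n bits pick the block" idea is taken from the slide.

/* Toy illustration of extendible-hashing block selection. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy string hash (FNV-1a style), a stand-in for GPFS's internal hash. */
static uint32_t toy_hash(const char *name)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < strlen(name); i++) {
        h ^= (unsigned char)name[i];
        h *= 16777619u;
    }
    return h;
}

int main(void)
{
    const char *name = "NewFile";
    unsigned n_bits = 3;                         /* directory currently has 2^3 blocks */
    uint32_t h = toy_hash(name);
    uint32_t block = h & ((1u << n_bits) - 1);   /* last n bits -> block number */
    /* On the slide, a hash value ending in ...101 would land in block 5. */
    printf("hash(\"%s\") = 0x%08x -> directory block %u\n", name, h, block);
    return 0;
}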
Many Files Created in a Single Directory

Examples
• A parallel application writes a checkpoint file from each task
• A serial application used to process a large number of data files is parallelized in trivial SPMD mode (e.g. rendering a film, compressing many files)

Problem
• Creation of files is serialized in traditional file systems
• With the GPFS FGDL feature, parallel creation of files in a single directory scales nicely, but only up to a certain number of tasks
  – The maximum number of parallel tasks depends on the file system configuration (>1000)
    • How small can a byte range of a directory object be?
    • How many tokens can be managed for a single object?

Solution
• Each parallel task creates a sub-directory per node (or task) and stores its files there (see the sketch below)
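A minimal MPI sketch of the per-task sub-directory approach (the paths and names are illustrative, not from the slides): each rank creates its own sub-directory and places its checkpoint file there, so file creates from different ranks do not contend for the same directory object. The next slide refines where the mkdir calls should happen.

/* Each rank creates a private sub-directory and writes its file there. */
#include <mpi.h>
#include <stdio.h>
#include <sys/stat.h>
#include <errno.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char dir[256], path[320];
    snprintf(dir, sizeof(dir), "/gpfs/scratch/run42/rank%06d", rank);
    if (mkdir(dir, 0755) != 0 && errno != EEXIST) {
        perror("mkdir");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    snprintf(path, sizeof(path), "%s/checkpoint.dat", dir);
    FILE *fp = fopen(path, "w");            /* the create happens in a private directory */
    if (fp) {
        fprintf(fp, "checkpoint data of rank %d\n", rank);
        fclose(fp);
    }

    MPI_Finalize();
    return 0;
}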

Many Directories Created from a Parallel Task

A parallel application creates a directory per task to place its files in
• Follow-on problem to "many files in the same directory"

Problem
• The parent directory becomes a bottleneck
• Application startup time does not scale linearly
  – Startup time with 1000 tasks: ~10 min
• The directory write token "bounces" between nodes

Solution
• Directories are created by a single task (task 0, or within the job start script); see the sketch below
• Only one write token is required, held by task 0
  – Startup time with 1000 tasks: ~20 s
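A minimal MPI sketch of this refinement (paths are illustrative assumptions): rank 0 creates all per-task directories up front, so only one node needs the parent directory's write token; after a barrier every rank writes into its own directory.

/* Rank 0 creates all directories, then every rank writes into its own. */
#include <mpi.h>
#include <stdio.h>
#include <sys/stat.h>
#include <errno.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char dir[256], path[320];
    if (rank == 0) {                         /* a single task holds the write token */
        for (int r = 0; r < size; r++) {
            snprintf(dir, sizeof(dir), "/gpfs/scratch/run42/rank%06d", r);
            if (mkdir(dir, 0755) != 0 && errno != EEXIST)
                perror("mkdir");
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);             /* directories now exist for everyone */

    snprintf(path, sizeof(path), "/gpfs/scratch/run42/rank%06d/out.dat", rank);
    FILE *fp = fopen(path, "w");
    if (fp) {
        fprintf(fp, "output of rank %d\n", rank);
        fclose(fp);
    }

    MPI_Finalize();
    return 0;
}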

