
Scalable Parallel File I/O with IBM GPFS

Klaus Gottschalk
HPC Architect Germany

Best Practices for Scalable I/O are not always straightforward

• Scaling of parallel applications sometimes reveals bad file access patterns
• Assumptions that hold for local disk file systems do not carry over to a parallel file system
• A straightforward approach often works fine until the application is scaled to large node counts
• A collaboration with LRZ to develop best-practice recommendations has started
  – Co-workers needed

Agenda
• GPFS Features and new GPFS 3.5 Update
– GPFS Storage Server (GSS) – Friday Karsten Kutzer, IBM
– Active File Management (AFM)
– File Placement Optimizer (FPO)
• File Access Problems and Best Practice Examples

The IBM General Parallel File System (GPFS)
Shipping since 1998

Extreme Scalability
 • 2^63 files per file system
 • Maximum file system size: 2^99 bytes
 • Production 19 PB file system
 • Number of nodes: 1 to 8192 (16384 maximum)

Proven Reliability
 • No special nodes
 • Add/remove nodes and storage on the fly
 • Rolling upgrades
 • Administer from any node
 • Data replication
 • Snapshots
 • File system journaling

Manageability
 • Integrated tiered storage – Storage pools
 • Quotas
 • Policy-driven automation
 • Clustered NFS
 • SNMP monitoring
 • TSM / HPSS (DMAPI)
IBM General Parallel File System (GPFS™) – History and evolution

1998 – First called GPFS
 • GPFS Clusters, General File Serving
 • Standards: portable operating system interface (POSIX) semantics, large block, directory and small file performance, data management
 • Early workloads: HPC, Virtual Tape Server (VTS), Research, Visualization, Digital Media, Seismic, Weather, Life sciences

2002 – GPFS 2.1-2.3
 • Linux® exploration (multiple architectures), 32 bit / 64 bit
 • Inter-op (IBM AIX® & Linux), Loose Clusters, GPFS Multicluster
 • GPFS over wide area networks (WAN)
 • Large scale clusters: thousands of nodes

2005/2006 – GPFS 3.1-3.2
 • Information lifecycle management (ILM): Storage Pools, File sets, Policy Engine
 • Ease of administration
 • Distributed Token Management
 • Multiple networks / RDMA
 • NFS v4 support, Windows 2008
 • Small file performance

2009 – GPFS 3.3
 • Restricted Admin Functions
 • Improved installation, new license model
 • Improved snapshot and backup
 • Improved ILM policy engine
 • Multiple NSD servers

2010 – GPFS 3.4
 • Enhanced Windows cluster support – homogenous Windows Server clusters
 • Performance and scaling improvements
 • Enhanced migration and diagnostics

2012 – GPFS 3.5
 • Caching via Active File Management (AFM)
 • GPFS Storage Server
 • GPFS File Placement Optimizer (FPO)

Evolution of the global namespace:
GPFS Active File Management (AFM)

 • 1993 – GPFS introduced concurrent file system access from multiple nodes.
 • 2005 – Multi-cluster expands the global namespace by connecting multiple sites.
 • 2011 – AFM takes the global namespace truly global by automatically managing asynchronous replication of data.

AFM Use Cases

HPC
 • Grid computing: allowing data to move transparently during grid workflows
 • Facilitates content distribution for global enterprises and "follow-the-sun" engineering teams

Distributed NAS
 • WAN Caching: caching across the WAN between SoNAS clusters, or between SoNAS and another NAS vendor
 • Data Migration: online cross-vendor data migration
 • Disaster Recovery: multi-site fileset-level replication/failover
 • Shared Namespace: across SoNAS clusters

Storage Cloud
 • NAS as a key building block of cloud storage architecture
 • Enables edge caching in the cloud
 • DR support within cloud data repositories
 • Peer-to-peer data access among cloud edge sites
 • Global wide-area file system spanning multiple sites in the cloud
AFM Architecture

• A fileset on the home cluster is associated with a fileset on one or more cache clusters
• If data is in cache …
  – Cache hit at local disk speeds
  – Client sees local GPFS performance if the file or directory is in cache
• If data is not in cache …
  – Data and metadata (files and directories) are pulled on demand at network line speed and written to GPFS
• Uses NFS/pNFS for WAN data transfer
• If data is modified at home …
  – Revalidation is done at a configurable timeout
  – Close to NFS-style close-to-open consistency across sites
  – POSIX strong consistency within the cache site
• If data is modified at cache …
  – Writes see no WAN latency
  – Writes are done to the cache (i.e. local GPFS), then asynchronously pushed home
• If the network is disconnected …
  – Cached data can still be read, and writes to the cache are written back after reconnection
  – There can be conflicts …

[Diagram: cache cluster sites 1 and 2 (GPFS + Panache) connect through gateway (GW) nodes to the home cluster site (any NAS box or SOFS) via pNFS/NFS over the WAN – pull on cache miss, push on write.]
AFM Example: Global Namespace
[Diagram: three GPFS clusters with file systems store1, store2, and store3. Each file system holds some of the filesets /data1 … /data6 locally and defines cache filesets for the ones held remotely, so clients on every cluster see the full namespace /global/data1 … /global/data6.]

 • See all data from any cluster
 • Cache as much data as required, or fetch data on demand

Policy-based Pre-fetching and Expiration of Data

• Policy-based pre-population
• Periodically runs a parallel inode scan at home
  – Selects files/directories based on policy criteria
  – Includes any user-defined metadata in extended attributes (xattrs) or other file attributes
  – SQL-like construct to select, e.g.:
    RULE LIST 'prefetchlist' WHERE FILESIZE > 1GB AND MODIFICATION_TIME > CURRENT_TIME - 3600 AND USER_ATTR1 = "sat-photo" OR USER_ATTR2 = "classified"
• The cache then pre-fetches the selected objects
  – Runs asynchronously in the background
  – Parallel multi-node prefetch
  – Can call out when completed

• Staleness control
  – Defined based on the time since disconnection
  – Once a cache is expired, no access to the cache is allowed
  – Manual expire/unexpire option for administrators
    • mmafmctl expire / unexpire (ctlcache in SoNAS)
  – Allowed only for read-only (RO) mode caches
  – Disabled for single-writer (SW) and local-update (LU) modes, as they are sources of data themselves

File Placement Optimizer (FPO)


Architecture
 Use disk local to each server
 All nodes are NSD servers and NSD clients
 Designed for MapReduce workloads

MapReduce Environment Using GPFS-FPO (File Placement Optimizer)

[Diagram: filers export data over NFS to a MapReduce cluster; GPFS-FPO spans the cluster nodes' local disks and serves the MapReduce jobs and users.]

 Uses disk local to each server


 Aggregates the local disk space into a single redundant shared file system
 Designed for MapReduce workloads
 Unlike HDFS, GPFS-FPO is POSIX compliant – so data maintenance is easy
 Intended as a drop-in replacement for open-source HDFS (the IBM BigInsights product may be required)
Another GPFS-FPO Use Case – Typical In HPC
[Diagram: filers export project data over NFS to an HPC cluster; each compute node uses its own local file system as scratch space, and a scheduler (LSF) runs the user jobs.]

 Local file systems used for high speed scratch space


 Usually scratch space is RAID 0 (for speed) so reliability can be an issue
 Disk capacity for scratch space can be a limiting factor
 An inaccessible scratch disk renders the compute node useless
 File systems used: ext2, ext3, XFS
Another GPFS-FPO Use Case
[Diagram: filers export project data over NFS to an HPC cluster; GPFS-FPO aggregates the nodes' local disks into one shared scratch file system, and a scheduler (LSF) runs the user jobs.]

 GPFS-FPO creates a single, shared scratch disk pool


 Reliability is higher because of GPFS-FPO redundancy design
 Scratch disk capacity is much larger
 Likelihood of filling all the scratch disk is much lower
 Performance is preserved because GPFS-FPO exploits locality by design
GPFS Features and Internal Structures

• Wide striping across all NSDs in a GPFS pool
• Token-based locking
• Byte-range locking
• Fine-granular directory locking

GPFS Concept – Wide Striping

[Diagram: a file is split into blocks 1–6 and striped across all disks. The client node reaches the GPFS storage (NSD server) nodes over the network, and each server node drives its own disk storage. At 10 MB/s per disk, three disks striped together deliver 30 MB/s to a single job.]

GPFS Concepts - Locking and Tokens

[Diagram: four tasks A–D lock byte ranges of one 8 GB file – A: 0–2 GB, B: 2–4 GB, C: 4–6 GB, D: 5–7 GB. D's range overlaps C's, so node D has to wait for node C to release its token.]

GPFS uses a token-based locking mechanism
• On a lock request, the token system manager grants access for the token validity time
• No further connection to the manager is required – reduced locking overhead
• After token expiration, no client involvement is required for recovery

GPFS provides byte-range tokens for synchronizing parallel access to a file
• With a valid token, each task can independently write to its own file region
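A minimal POSIX sketch of this idea (not from the slides): each task writes only its own non-overlapping byte range, so the byte-range tokens never conflict. The file path, region size, and the way the task id is passed in are illustrative assumptions.

/* Each task writes its own disjoint region; ranges never overlap,
 * so GPFS can grant each node its byte-range token without revocations. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define REGION_SIZE (4 * 1024 * 1024)   /* 4 MiB per task, chosen arbitrarily */

int main(int argc, char **argv)
{
    int task_id = (argc > 1) ? atoi(argv[1]) : 0;   /* e.g. MPI rank or array index */
    int fd = open("/gpfs/scratch/shared.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(REGION_SIZE);
    memset(buf, 'A' + (task_id % 26), REGION_SIZE);

    /* The offset is a function of the task id only, so regions are disjoint. */
    off_t offset = (off_t)task_id * REGION_SIZE;
    if (pwrite(fd, buf, REGION_SIZE, offset) != REGION_SIZE) {
        perror("pwrite");
        return 1;
    }

    free(buf);
    close(fd);
    return 0;
}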

Parallel File Access

• Parallel writes require byte-range locking of independent regions

• Overlapping regions cause serialization of write access
  – A task has to wait until the token expires or is freed by the holding task
  – Potential race conditions for the overlapping region

• Parallel I/O libraries ease use and hide the complexity of the locking mechanism
  – They additionally provide block aggregation and caching
  – Examples: MPI-IO, MIO

• The data region per task should be large enough
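A minimal MPI-IO sketch along these lines (not from the slides): every rank writes one large, non-overlapping region with a collective call, letting the MPI-IO layer aggregate blocks instead of issuing many small writes. The file path and region size are illustrative assumptions.

/* Collective write of disjoint regions; MPI-IO handles aggregation. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define REGION_SIZE (8 * 1024 * 1024)   /* 8 MiB per rank, chosen arbitrarily */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(REGION_SIZE);
    memset(buf, 'A' + (rank % 26), REGION_SIZE);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/gpfs/scratch/checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank's offset depends only on its rank, so regions never overlap. */
    MPI_Offset offset = (MPI_Offset)rank * REGION_SIZE;
    MPI_File_write_at_all(fh, offset, buf, REGION_SIZE, MPI_CHAR,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}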

Fine Granular Directory Locking (FGDL)
Traditional file systems implemented directories as a linear file; a file create locked the entire directory
• Fine if files are in different directories
• Not the best if they are in the same directory
• Unfortunately, this is a natural way to organize files, and programmers are hard to re-educate
  – e.g. checkpoints, output files for a time step

GPFS optimization to improve write sharing
• GPFS directories use extendible hashing to map file names to directory blocks
  – The last n bits of the hash value determine the directory block number
  – Example: hash("NewFile") = 10111101 means the directory entry goes in block 5 (last three bits 101)
• A file create now only locks the hash value of the file name
  – The lock ensures against multiple nodes simultaneously creating the same file
  – The actual update is shipped to the metanode
  – If the create requires a directory block split, the lock is upgraded to cover both the old and the new block
  – Parallel file create performance no longer depends on whether or not the files are in the same directory
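To illustrate only the block-selection step, here is a small sketch; the hash function and block count are made up (GPFS's real directory hash is internal), only the "last n bits pick the block" idea is taken from the slide.

/* Toy illustration of extendible-hashing block selection. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy string hash (FNV-1a style), a stand-in for GPFS's internal hash. */
static uint32_t toy_hash(const char *name)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < strlen(name); i++) {
        h ^= (unsigned char)name[i];
        h *= 16777619u;
    }
    return h;
}

int main(void)
{
    const char *name = "NewFile";
    unsigned n_bits = 3;                         /* directory currently has 2^3 blocks */
    uint32_t h = toy_hash(name);
    uint32_t block = h & ((1u << n_bits) - 1);   /* last n bits -> block number */
    /* On the slide, a hash value ending in ...101 would land in block 5. */
    printf("hash(\"%s\") = 0x%08x -> directory block %u\n", name, h, block);
    return 0;
}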
Many Files Created in a Single Directory

Examples
• A parallel application writes a checkpoint file from each task
• A serial application used to process a large number of data files is parallelized in trivial SPMD mode (e.g. rendering a film, compressing many files)

Problem
• Creation of files is serialized in traditional file systems
• With the GPFS FGDL feature, parallel creation of files in a single directory scales nicely, but only up to a certain number of tasks
  – The maximum number of parallel tasks depends on the file system configuration (>1000)
    • How small can a byte range of a directory object be?
    • How many tokens can be managed for a single object?

Solution
• Each parallel task creates a sub-directory per node (or task) and stores its files there (see the sketch below)
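A minimal MPI sketch of the per-task sub-directory approach (the paths and names are illustrative, not from the slides): each rank creates its own sub-directory and places its checkpoint file there, so file creates from different ranks do not contend for the same directory object. The next slide refines where the mkdir calls should happen.

/* Each rank creates a private sub-directory and writes its file there. */
#include <mpi.h>
#include <stdio.h>
#include <sys/stat.h>
#include <errno.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char dir[256], path[320];
    snprintf(dir, sizeof(dir), "/gpfs/scratch/run42/rank%06d", rank);
    if (mkdir(dir, 0755) != 0 && errno != EEXIST) {
        perror("mkdir");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    snprintf(path, sizeof(path), "%s/checkpoint.dat", dir);
    FILE *fp = fopen(path, "w");            /* the create happens in a private directory */
    if (fp) {
        fprintf(fp, "checkpoint data of rank %d\n", rank);
        fclose(fp);
    }

    MPI_Finalize();
    return 0;
}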

Many Directories Created from a Parallel Task

A parallel application creates a directory per task to place its files in
• Follow-on problem to "many files in the same directory"

Problem
• The parent directory becomes a bottleneck
• Application startup time does not scale linearly
  – Startup time with 1000 tasks: ~10 min
• The directory write token "bounces" between nodes

Solution
• Directories are created by a single task (task 0, or within the job start script); see the sketch below
• Only one write token is required, held by task 0
  – Startup time with 1000 tasks: ~20 s
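A minimal MPI sketch of this refinement (paths are illustrative assumptions): rank 0 creates all per-task directories up front, so only one node needs the parent directory's write token; after a barrier every rank writes into its own directory.

/* Rank 0 creates all directories, then every rank writes into its own. */
#include <mpi.h>
#include <stdio.h>
#include <sys/stat.h>
#include <errno.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char dir[256], path[320];
    if (rank == 0) {                         /* a single task holds the write token */
        for (int r = 0; r < size; r++) {
            snprintf(dir, sizeof(dir), "/gpfs/scratch/run42/rank%06d", r);
            if (mkdir(dir, 0755) != 0 && errno != EEXIST)
                perror("mkdir");
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);             /* directories now exist for everyone */

    snprintf(path, sizeof(path), "/gpfs/scratch/run42/rank%06d/out.dat", rank);
    FILE *fp = fopen(path, "w");
    if (fp) {
        fprintf(fp, "output of rank %d\n", rank);
        fclose(fp);
    }

    MPI_Finalize();
    return 0;
}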

