Klaus Gottschalk
HPC Architect, IBM Germany
Best Practices for Scalable I/O are not always straightforward
Agenda
• GPFS Features and new GPFS 3.5 Update
– GPFS Storage Server (GSS) – covered Friday by Karsten Kutzer, IBM
– Active File Management (AFM)
– File Placement Optimizer (FPO)
• File Access Problems and Best Practice Examples
The IBM General Parallel File System (GPFS)
Shipping since 1998
First release (1998)
– First called GPFS
– Virtual Tape Server (VTS)
– HPC
– General File Serving
– Standards: Portable operating system interface (POSIX) semantics
– Large block; directory and small file performance
– Data management

GPFS 2.1-2.3
– HPC: research, visualization, digital media, seismic, weather exploration, life sciences
– GPFS Clusters
– Inter-op (IBM AIX® & Linux®), 32-bit / 64-bit
– Loose clusters: GPFS Multicluster, GPFS over wide area networks (WAN)
– Large scale clusters with thousands of nodes

GPFS 3.1-3.2
– Information lifecycle management (ILM): storage pools, file sets, policy engine
– Ease of administration
– Distributed token management
– Multiple networks / RDMA
– Multiple NSD servers
– NFS v4 support
– Small file performance
– Windows 2008 support

GPFS 3.3
– Restricted admin functions
– Improved installation
– New license model (multiple architectures)
– Improved snapshot and backup
– Improved ILM policy engine

GPFS 3.4
– Enhanced Windows cluster support: homogenous Windows Server clusters
– Performance and scaling improvements
– Enhanced migration and diagnostics

GPFS 3.5
– Caching via Active File Management (AFM)
– GPFS Storage Server
– GPFS File Placement Optimizer (FPO)
Evolution of the global namespace:
GPFS Active File Management (AFM)
[Diagram: several GPFS clusters at different sites linked by AFM into one global namespace]
AFM Use Cases
Grid computing: allowing data to move transparently across the WAN during grid workflows
– Facilitates content distribution for global enterprises and "follow-the-sun" engineering teams
WAN Caching: caching across the WAN between SoNAS clusters, or between SoNAS and another NAS vendor
Data Migration: online cross-vendor data migration
Disaster Recovery: multi-site fileset-level replication/failover
Shared Namespace: across SoNAS clusters
NAS as a key building block of cloud storage architecture
– Enables edge caching in the cloud
– DR support within cloud data repositories
– Peer-to-peer data access among cloud edge sites
– Global wide-area filesystem spanning multiple sites in the cloud
AFM Architecture
• Fileset on the home cluster is associated with a fileset on one or more cache clusters
• If data is in cache …
– Cache hit at local disk speeds
– Client sees local GPFS performance if the file or directory is in cache
• If data is not in cache …
– Data and metadata (files and directories) are pulled on demand at network line speed and written to GPFS
– Uses NFS/pNFS for WAN data transfer
• If data is modified at home …
– Revalidation is done at a configurable timeout
– Close to NFS-style close-to-open consistency across sites
– POSIX strong consistency within the cache site
• If data is modified at cache …
– Writes see no WAN latency: they are done to the cache (i.e. local GPFS), then asynchronously pushed home
• If the network is disconnected …
– Cached data can still be read, and writes to the cache are written back after reconnection
– There can be conflicts …
Policy-based Pre-fetching and Expiration of Data
• Policy-based pre-population
• Periodically runs a parallel inode scan at home
– Selects files/directories based on policy criteria
• Includes any user-defined metadata in xattrs or other file attributes
• SQL-like construct to select, e.g.:
• RULE LIST 'prefetchlist' WHERE FILESIZE > 1GB AND MODIFICATION_TIME > CURRENT_TIME - 3600 AND USER_ATTR1 = 'sat-photo' OR USER_ATTR2 = 'classified'
• Cache then pre-fetches the selected objects
– Runs asynchronously in the background
– Parallel multi-node prefetch
– Can call out when completed
• Staleness control
– Defined based on time since disconnection
– Once a cache is expired, no access to the cache is allowed
– Manual expire/unexpire option for admins
• mmafmctl --expire/--unexpire; ctlcache in SoNAS
– Allowed only for read-only (RO) mode caches
– Disabled for single-writer (SW) and local-update (LU) modes, as these are sources of data themselves
File Placement Optimizer (FPO)
[Diagram: GPFS-FPO cluster built from servers with internal disks]
Architecture
• Uses disks local to each server
• All nodes are NSD servers and NSD clients
• Designed for MapReduce workloads
MapReduce Environment Using GPFS-FPO (File Placement Optimizer)
[Diagram: MapReduce jobs run on a GPFS-FPO cluster alongside LSF-scheduled work; Project 1 and Project 2 data are reachable over NFS, hosted either on local file systems or on GPFS-FPO]
GPFS Concept – Wide Striping
[Diagram: several client nodes accessing a GPFS storage server node; each file's blocks are striped wide across all disks]
GPFS Concepts - Locking and Tokens
[Diagram: four nodes locking byte ranges of one 8 GB file — node A locks 0–2 GB, node B 2–4 GB, node C 4–6 GB, and node D requests 5–7 GB; D's range overlaps C's, so node D has to wait for node C to release its token.]
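The cure for such token waits is to keep each task's byte range disjoint. A minimal sketch follows (not from the slides; the file path and range size are illustrative): every MPI task writes its own non-overlapping range of one shared file, so no task ever waits on another task's token.

/* Each task writes a disjoint byte range of one shared file on GPFS.
 * Build: mpicc -o ranges ranges.c
 * /gpfs/work/shared.dat and RANGE_SZ are placeholder values. */
#include <fcntl.h>
#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define RANGE_SZ (8L * 1024 * 1024)  /* 8 MB here; the figure above uses 2 GB ranges */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* All tasks open the same file; the offsets keep their writes apart */
    int fd = open("/gpfs/work/shared.dat", O_WRONLY | O_CREAT, 0644);

    char *buf = malloc(RANGE_SZ);
    memset(buf, rank & 0xff, RANGE_SZ);

    /* Task i owns bytes [i*RANGE_SZ, (i+1)*RANGE_SZ): the ranges never
     * overlap, so the byte-range tokens never have to be revoked */
    pwrite(fd, buf, RANGE_SZ, (off_t)rank * RANGE_SZ);

    close(fd);
    free(buf);
    MPI_Finalize();
    return 0;
}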
Parallel File Access
• Parallel I/O libraries ease use and hide the complexity of the locking mechanism
• Additionally, they provide block aggregation and caching
– Examples: MPI-IO, MIO (see the sketch below)
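To make the aggregation concrete, here is a minimal MPI-IO sketch (the file name and sizes are invented for the example): the collective MPI_File_write_at_all call lets the MPI-IO layer coalesce the tasks' buffers into a few large, well-formed requests before they reach GPFS.

/* Collective MPI-IO write: the library aggregates per-task buffers. */
#include <mpi.h>

#define COUNT (1 << 20)  /* 1 Mi ints (4 MB) per task; illustrative size */

static int buf[COUNT];

int main(int argc, char **argv)
{
    int rank;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < COUNT; i++)
        buf[i] = rank;

    MPI_File_open(MPI_COMM_WORLD, "/gpfs/work/result.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective call: every task participates, so the MPI-IO layer can
     * merge the per-task offsets into a few large I/O operations */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * COUNT * sizeof(int),
                          buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}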
Fine Granular Directory Locking (FGDL)
Traditional file systems implemented directories as a linear file; a file create locked the entire directory.
• Fine if files are in different directories
• Not the best if they are in the same directory
• Unfortunately, this is a natural way to organize files, and programmers are hard to re-educate
– Checkpoints, output for a time step
Examples
• A parallel application writes a checkpoint file from each task
• A serial application used to process a large number of data files is parallelized in trivial SPMD mode (e.g. rendering a film, compressing multiple files)
Problem
• Creation of files is serialized in traditional file systems
• With the GPFS FGDL feature, parallel creation of files in a single directory scales nicely, but only up to a certain number of tasks
– The maximum number of parallel tasks depends on the file system configuration (>1000)
– How small can a byte range of a directory object be?
– How many tokens can be managed for a single object?
Solution
• Each parallel task creates a sub-directory per node (or task) and stores its files there (see the sketch below)
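A minimal sketch of that solution (paths and names are illustrative, not from the slides): each task creates its own sub-directory once and writes its checkpoint there, so file creates from different tasks never contend for the same directory object. The next slide refines who should create the directories.

/* Per-task sub-directory: no two tasks share a directory, no contention
 * on file create. The job directory /gpfs/work/job42 is a placeholder. */
#include <fcntl.h>
#include <mpi.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    char dir[256], file[512];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One sub-directory per task under a pre-existing job directory */
    snprintf(dir, sizeof dir, "/gpfs/work/job42/task%05d", rank);
    mkdir(dir, 0755);  /* EEXIST on a restart is harmless */

    snprintf(file, sizeof file, "%s/checkpoint.dat", dir);
    int fd = open(file, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    /* ... write the task's checkpoint data here ... */
    close(fd);

    MPI_Finalize();
    return 0;
}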
Many Directories Created by a Parallel Task
A parallel application creates a directory per task to place its files in
• Follow-on problem to "many files in the same directory"
Problem
• The parent directory becomes a bottleneck
• Application startup time does not scale linearly
– Startup time with 1000 tasks: ~10 min
• The directory write token "bounces" around between nodes
Solution
• Directories are created by a single task (task 0, or within the job start script), as sketched below
• Only one write token is required, held by task 0
– Startup time with 1000 tasks: ~20 sec
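A sketch of that single-creator pattern (the directory layout is illustrative): task 0 creates every per-task directory up front, and a barrier makes them visible before any file creates start, so the parent directory's write token stays on one node.

/* Task 0 pre-creates all per-task directories; others wait at the barrier. */
#include <mpi.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    int rank, size;
    char dir[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)  /* single writer: one token, no bouncing between nodes */
        for (int i = 0; i < size; i++) {
            snprintf(dir, sizeof dir, "/gpfs/work/job42/task%05d", i);
            mkdir(dir, 0755);
        }

    MPI_Barrier(MPI_COMM_WORLD);  /* all directories exist from here on */

    /* Every task now creates its files inside its own pre-made directory */
    MPI_Finalize();
    return 0;
}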