Using AWS ParallelCluster To Simplify HPC Cluster Management CMP372-P

Nathan Stornetta
Senior Product Manager, HPC
Amazon Web Services
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
HPC on AWS
Introduction to AWS ParallelCluster
Demo: Running CFD using AWS ParallelCluster
Scaling up fast with AWS ParallelCluster
AWS ParallelCluster best practices

Related breakouts
CMP402-R: Setting up and optimizing your HPC cluster on AWS
CMP408-R: Using Elastic Fabric Adapter to scale HPC workloads on AWS
CMP409-R: Selecting the right instance for your HPC workloads
CMP418-R: Using AWS ParallelCluster to simplify cluster management
We think the metric for success for any business should be time-to-results
“For every $1 spent on HPC, businesses see $463 in incremental revenues and $44 in incremental profit.”
[Figure: two panels with a Cores axis; one panel marks a fixed datacenter capacity limit]
Finite capacity, usually with long queues to wait in
Massive capacity when needed to speed up time to results, and agile environment when additional hardware and software experimentation is needed
Because a TCO analysis never tells the whole story
Lost productivity & longer time to results
72.8% of organizations that use HPC reported delayed or cancelled HPC jobs*
• Lost innovation: Questions are left unasked, experiments are left undone, and potential revenue left on the table.
• Outdated technology: Almost 20% of the useful life of new technology/hardware lost in the procurement process.
• Technical debt: Adapting newer algorithms to meet the requirements of an existing infrastructure = delays, and below-par performance.
AWS services to get started with HPC on AWS
• Data management & data transfer: AWS DataSync, AWS Snowball, AWS Snowmobile, AWS Direct Connect
• Compute & networking: Amazon EC2 instances (CPU, GPU, FPGA), Amazon EC2 Spot, AWS Auto Scaling, Placement groups, Enhanced networking, Elastic Fabric Adapter
• Storage: Amazon EBS, Amazon FSx for Lustre, Amazon EFS, Amazon S3
• Automation & orchestration: AWS Batch, AWS ParallelCluster, NICE EnginFrame
• Visualization: NICE DCV, Amazon AppStream 2.0
Amazon CloudWatch, AWS Identity and Access Management (IAM), AWS Budgets
Running HPC applications at extreme scale
Accelerating time to innovation
Single HPC cluster of 1 million vCPUs
“Storage technology is amazingly complex and we’re constantly pushing the limits of physics and engineering to deliver next-generation capacities and technical innovation. This successful collaboration with AWS shows the extreme scale, power and agility of cloud-based HPC to help us run complex simulations for future storage architecture analysis and materials science explorations. Using AWS to easily shrink simulation time from 20 days to 8 hours allows Western Digital R&D teams to explore new designs and innovations at a pace unimaginable just a short time ago.” – Steve Phillpott, CIO, Western Digital
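As a quick check of the figures in the quote, the sketch below converts both durations to hours and computes the implied speedup. The function name is illustrative only and is not part of any AWS tool.

```python
def speedup(before_hours: float, after_hours: float) -> float:
    """Return how many times shorter the second duration is than the first."""
    if after_hours <= 0:
        raise ValueError("after_hours must be positive")
    return before_hours / after_hours


# Figures taken from the quote: 20 days before, 8 hours after.
before = 20 * 24  # 20 days expressed in hours (480)
after = 8

print(f"{speedup(before, after):.0f}x faster")  # prints "60x faster"
```

Going from 20 days to 8 hours therefore corresponds to roughly a 60-fold reduction in simulation time.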
Helping financial institutions model investment risks
Run risk models 4,000 times faster: in hours, instead of months
Manage 50X the number of securities

Why use AWS ParallelCluster?
• Easy cluster management
• Automatic resource scaling
• Seamless migration to the cloud
Easy cluster management
• `pcluster configure` to set up a cluster in minutes
• Use config files to define details of replicable clusters
• Launch, stop, and restart clusters on demand
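To make the "config files" point concrete, here is a minimal sketch that builds an INI-style cluster definition with Python's standard `configparser`. The section and key names follow the ParallelCluster 2.x configuration format as I understand it; the region, key pair, instance types, queue sizes, and VPC/subnet IDs are placeholders chosen for illustration, not values from the talk.

```python
import configparser
import io


def build_cluster_config(region, key_name, vpc_id, subnet_id):
    """Assemble a minimal ParallelCluster 2.x-style config (assumed key names)."""
    config = configparser.ConfigParser()
    config["global"] = {"cluster_template": "default"}
    config["aws"] = {"aws_region_name": region}
    config["cluster default"] = {
        "key_name": key_name,                     # existing EC2 key pair (placeholder)
        "base_os": "alinux",
        "scheduler": "slurm",
        "master_instance_type": "c5.xlarge",      # illustrative choice
        "compute_instance_type": "c5n.18xlarge",  # illustrative choice
        "initial_queue_size": "0",                # start with no compute nodes
        "max_queue_size": "10",                   # upper bound on compute nodes
        "vpc_settings": "default",
    }
    config["vpc default"] = {"vpc_id": vpc_id, "master_subnet_id": subnet_id}
    return config


def render(config):
    """Return the config as INI text, ready to be written to a file."""
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    cfg = build_cluster_config(
        "us-east-1", "my-key", "vpc-PLACEHOLDER", "subnet-PLACEHOLDER"
    )
    print(render(cfg))
```

Because the whole cluster is described in one text file, the same definition can be kept in version control and reused to recreate an identical cluster later. With the 2.x CLI, a file like this would typically be passed via `pcluster create -c <file> <cluster-name>`.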
Automatic resource scaling
• Scale up when jobs are waiting
• Scale down when the cluster is idle
• Your data storage and file system scale to match your compute
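The scale-up and scale-down behavior described here can be pictured with a small sketch. This is a hypothetical simplification written for illustration, not the actual ParallelCluster scaling code: the function name, its parameters, and the "one slot per core" model are all assumptions.

```python
import math


def desired_compute_nodes(pending_slots, busy_nodes, slots_per_node,
                          min_nodes, max_nodes):
    """Return a target node count for a hypothetical elastic cluster.

    Keeps every node that is currently running jobs, adds enough nodes to
    serve the slots requested by waiting jobs, and clamps the result to the
    configured minimum and maximum cluster size.
    """
    nodes_for_queue = math.ceil(pending_slots / slots_per_node)
    target = busy_nodes + nodes_for_queue
    return max(min_nodes, min(max_nodes, target))


# Jobs waiting: 100 slots pending, 2 nodes busy, 36 slots per node.
print(desired_compute_nodes(100, 2, 36, min_nodes=0, max_nodes=10))   # 5

# Cluster idle: nothing pending, nothing running, so scale to the minimum.
print(desired_compute_nodes(0, 0, 36, min_nodes=0, max_nodes=10))     # 0

# Demand above the cap: the target is limited by max_nodes.
print(desired_compute_nodes(1000, 0, 36, min_nodes=0, max_nodes=10))  # 10
```

The clamp is the important design choice: the minimum keeps a warm baseline if one is wanted, and the maximum bounds cost regardless of queue depth.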
Seamless migration to the cloud
• Making HPC workloads cloud-native can take time and planning
• AWS ParallelCluster simplifies first steps to migrate HPC workloads
• Integrations simplify the transition to cloud-native HPC at your own pace
AWS ParallelCluster
ALINUX CENTOS 6/7 UBUNTU 16/18 DCV EFA OPENMPI INTELMPI NCCL
SLURM SGE TORQUE AWS BATCH
FSX EFS S3 EBS RAID
ON-DEMAND SPOT VPC & SUBNETS

AWS ParallelCluster works for a variety of use cases
• Optimizing production workloads
• Fast prototyping
• Bypassing queues

AWS ParallelCluster runs workloads across all industries
• Drug discovery, Genomics
• Risk, Quantitative Research
• Reservoir Modeling, Seismic Imaging
• CAE, CFD
• Weather modeling

AWS ParallelCluster runs workloads with varying compute and throughput characteristics
• Tightly coupled workloads
• Loosely coupled workloads
• Accelerated computing
• Visualization
• AI/ML
• High-volume data analytics

AWS ParallelCluster Architecture
[Architecture diagram. Labeled components: AWS Region, Availability Zone, VPC subnet; client/user connecting over SSH; AWS CloudFormation template; master server with scheduler and queue; compute nodes (Amazon EC2 C5n instances + EFA) with AWS Auto Scaling; NFS share; Amazon S3 bucket (case data); Amazon Elastic Block Store (EBS); compute suite.]
A story of an HPC research group
AWS ParallelCluster can scale in minutes to thousands of cores
AWS ParallelCluster can switch between compute nodes for rapid prototyping
AWS ParallelCluster makes elastic HPC easy
Use cluster placement groups to place instances even closer together
[Figure: a cluster placement group inside an Availability Zone]

Use Elastic Fabric Adapter (EFA) to scale tightly coupled workloads even further
Scale tightly coupled HPC applications on AWS
• P3dn, G4dn: NVIDIA V100 and T4 Tensor Core GPUs
• C5n, M5n/M5dn, R5n/R5dn, i3en: Custom Intel Xeon Scalable processor
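As a small illustration, the sketch below checks whether an instance type belongs to one of the instance families named on this slide. The helper names are hypothetical, and the set is just a transcription of the slide. In practice EFA is only available on specific sizes within each family, so a match here is a hint to check the EC2 documentation, not a guarantee.

```python
# Instance families named on the slide (transcribed, not an official list).
EFA_FAMILIES_FROM_SLIDE = {
    "p3dn", "g4dn", "c5n", "m5n", "m5dn", "r5n", "r5dn", "i3en",
}


def instance_family(instance_type: str) -> str:
    """Return the family part of an instance type, e.g. 'c5n' for 'c5n.18xlarge'."""
    return instance_type.split(".", 1)[0].lower()


def family_listed_for_efa(instance_type: str) -> bool:
    """True if the instance family appears in the slide's EFA list."""
    return instance_family(instance_type) in EFA_FAMILIES_FROM_SLIDE


print(family_listed_for_efa("c5n.18xlarge"))  # True
print(family_listed_for_efa("c5.18xlarge"))   # False
```

In a ParallelCluster 2.x-style config file, to my understanding, EFA and a cluster placement group are requested with `enable_efa = compute` and `placement_group = DYNAMIC` in the cluster section.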
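Choose compute instances suited for each workload
[Figure: instance categories and capabilities. Surviving capability annotations: processor vendors (AWS, Intel, AMD), clock speed (up to 4.0 GHz), memory (up to 12 TiB), instance storage (HDD and NVMe), accelerators (GPUs and FPGA), network bandwidth (up to 100 Gbps), sizes (Nano to 32xlarge); two entries are marked NEW.]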
Choose a master node to match your cluster
• The master node orchestrates cluster scaling logic
• Bigger clusters can require bigger master nodes
• Small master nodes have more limited network throughput
• DCV is managed through the master node
• Consider GPU-based instances for graphics-intensive visualization
AWS ParallelCluster supports On-Demand, Reserved Instances, and EC2 Spot pricing
• On-Demand for workloads that have to be done now
• Spot for fault-tolerant and flexible workloads
• Reserved Instances for predictable levels of workloads
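The three bullets above amount to a simple decision rule, sketched below. The function and its precedence order (predictable load first, then Spot, then On-Demand) are assumptions made for illustration; real purchasing decisions usually mix all three options.

```python
def suggest_purchase_option(must_run_now: bool,
                            fault_tolerant: bool,
                            predictable_load: bool) -> str:
    """Map workload traits to the purchasing option named on the slide.

    Hypothetical helper: the precedence between the rules is an assumption.
    """
    if predictable_load:
        # Steady, known levels of work are the case for Reserved Instances.
        return "reserved"
    if fault_tolerant and not must_run_now:
        # Interruptible, flexible work can tolerate Spot reclamation.
        return "spot"
    # Anything urgent or not interruption-tolerant falls back to On-Demand.
    return "on-demand"


print(suggest_purchase_option(must_run_now=True, fault_tolerant=False,
                              predictable_load=False))  # on-demand
print(suggest_purchase_option(must_run_now=False, fault_tolerant=True,
                              predictable_load=False))  # spot
```

In a ParallelCluster 2.x-style config file, the choice between On-Demand and Spot capacity for compute nodes is, to my understanding, expressed with the `cluster_type` key (`ondemand` or `spot`). Reserved Instances are a billing construct that applies to matching On-Demand usage rather than a separate launch mode.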
FSx for Lustre offers massively scalable file system performance
• Parallel file system
• SSD-based
• 100+ GiB/s throughput
• Millions of IOPS
• Supports hundreds of thousands of cores
• Consistent sub-millisecond latencies
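For readers who want to see how a Lustre file system might be attached to a cluster definition, here is a minimal sketch using `configparser`. The section name `fsx <label>` and the keys `shared_dir`, `storage_capacity`, and `fsx_settings` follow the ParallelCluster 2.x configuration format as I understand it. The label, mount point, and capacity are placeholders, and valid capacity values depend on the FSx deployment type.

```python
import configparser


def add_fsx_settings(config, label="myfsx", shared_dir="/fsx",
                     storage_capacity_gib=3600):
    """Add an FSx for Lustre section and reference it from each cluster section.

    Key names are assumed from the ParallelCluster 2.x config format; the
    default values are illustrative placeholders.
    """
    config[f"fsx {label}"] = {
        "shared_dir": shared_dir,                      # mount point on the nodes
        "storage_capacity": str(storage_capacity_gib), # size in GiB
    }
    # Point every existing "cluster ..." section at the new file system.
    for section in config.sections():
        if section.startswith("cluster "):
            config[section]["fsx_settings"] = label
    return config


if __name__ == "__main__":
    cfg = configparser.ConfigParser()
    cfg["cluster default"] = {"scheduler": "slurm"}
    add_fsx_settings(cfg)
    for section in cfg.sections():
        print(section, dict(cfg[section]))
```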
Disable hyperthreading for improved performance
• Easily turn hyperthreading on or off with a single parameter
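To show what turning hyperthreading off means for core counts, the sketch below converts vCPUs to physical cores, assuming two hardware threads per core (the usual case on Intel-based EC2 instances). The function name is illustrative.

```python
def physical_cores(vcpus: int, threads_per_core: int = 2) -> int:
    """Return the number of physical cores behind a given vCPU count."""
    if threads_per_core <= 0 or vcpus % threads_per_core != 0:
        raise ValueError("vcpus must be a positive multiple of threads_per_core")
    return vcpus // threads_per_core


# A c5n.18xlarge exposes 72 vCPUs, so one thread per core leaves 36 cores.
print(physical_cores(72))  # 36
```

With hyperthreading disabled, a scheduler sized for this instance would offer 36 slots per node instead of 72. In the ParallelCluster 2.x config format the single parameter is, to my understanding, `disable_hyperthreading = true` in the cluster section.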
Custom AMIs provide additional flexibility and choice
• Install your own software on top of an existing AWS ParallelCluster AMI
• Bring your own AMI and add AWS ParallelCluster on top
HPC on AWS
Flexible configuration and virtually unlimited scalability to grow and shrink your infrastructure as your HPC workloads dictate, not the other way around
Thank you!
Nathan Stornetta
stornetn@amazon.com
https://github.com/aws/aws-parallelcluster
