
CMP372-P

Using AWS ParallelCluster to simplify HPC cluster management
Nathan Stornetta
Senior Product Manager, HPC
Amazon Web Services

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
HPC on AWS

Introduction to AWS ParallelCluster

Demo: Running CFD using AWS ParallelCluster

Scaling up fast with AWS ParallelCluster

AWS ParallelCluster best practices


Related breakouts
CMP402-R: Setting up and optimizing your HPC cluster on AWS
CMP408-R: Using Elastic Fabric Adapter to scale HPC workloads on AWS
CMP409-R: Selecting the right instance for your HPC workloads
CMP418-R: Using AWS ParallelCluster to simplify cluster management
We think the metric for success for any business should be time-to-results

“For every $1 spent on HPC, businesses see $463 in incremental revenues and $44 in incremental profit.”

[Figure: two panels with "Cores" as the axis label; one panel is marked with a fixed datacenter capacity limit.]

• Finite capacity, usually with long queues to wait in
• Massive capacity when needed to speed up time to results, and an agile environment when additional hardware and software experimentation is needed
Because a TCO analysis never tells the whole story
Lost productivity & longer time to results

72.8% of organizations that use HPC reported delayed or cancelled HPC jobs*

• Lost innovation: questions are left unasked, experiments are left undone, and potential revenue is left on the table.
• Outdated technology: almost 20% of the useful life of new technology/hardware is lost in the procurement process.
• Technical debt: adapting newer algorithms to meet the requirements of an existing infrastructure means delays and below-par performance.
AWS services to get started with HPC on AWS
• Data management & data transfer: AWS DataSync, AWS Snowball, AWS Snowmobile, AWS Direct Connect
• Compute & networking: Amazon EC2 instances (CPU, GPU, FPGA), Amazon EC2 Spot, AWS Auto Scaling, placement groups, enhanced networking, Elastic Fabric Adapter
• Storage: Amazon EBS, Amazon FSx for Lustre, Amazon EFS, Amazon S3
• Automation & orchestration: AWS Batch, AWS ParallelCluster, NICE EnginFrame
• Visualization: NICE DCV, Amazon AppStream 2.0
• Supporting services: Amazon CloudWatch, AWS Identity and Access Management (IAM), AWS Budgets
Running HPC applications at extreme scale
Accelerating time to innovation

Single HPC cluster of 1 million vCPUs

“Storage technology is amazingly complex and we’re constantly pushing the limits of physics and engineering to deliver next-generation capacities and technical innovation. This successful collaboration with AWS shows the extreme scale, power and agility of cloud-based HPC to help us run complex simulations for future storage architecture analysis and materials science explorations. Using AWS to easily shrink simulation time from 20 days to 8 hours allows Western Digital R&D teams to explore new designs and innovations at a pace unimaginable just a short time ago.” – Steve Phillpott, CIO, Western Digital
Helping financial institutions
model investment risks
Run risk models
4,000 times faster
In hours, instead of months

Manage 50X the number of securities


Why use AWS ParallelCluster?

• Easy cluster management
• Automatic resource scaling
• Seamless migration to the cloud
Easy cluster management

• `pcluster configure` to set up a cluster in minutes
• Use config files to define details of replicable clusters
• Launch, stop, and restart clusters on demand
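
As a rough illustration of the config-file point above, the sketch below uses Python's standard `configparser` to generate a small cluster definition. The section and key names are assumed from the version 2 INI-style configuration format and should be checked against the ParallelCluster documentation for your release. Every value (region, key pair name, instance types, VPC and subnet IDs) is a hypothetical placeholder.

```python
import configparser
import io


def build_cluster_config(region, key_name, vpc_id, subnet_id):
    """Return an INI-style cluster definition as a string.

    Key names are assumed from the ParallelCluster 2.x config format;
    all values are placeholders supplied by the caller.
    """
    config = configparser.ConfigParser()
    config["global"] = {"cluster_template": "default"}
    config["aws"] = {"aws_region_name": region}
    config["cluster default"] = {
        "key_name": key_name,
        "scheduler": "slurm",
        "master_instance_type": "c5.xlarge",
        "compute_instance_type": "c5n.18xlarge",
        "initial_queue_size": "0",
        "max_queue_size": "10",
        "vpc_settings": "public",
    }
    config["vpc public"] = {"vpc_id": vpc_id, "master_subnet_id": subnet_id}

    # Serialize to text so the result can be saved or reviewed.
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_cluster_config("us-east-1", "my-key", "vpc-EXAMPLE", "subnet-EXAMPLE"))
```

Keeping the generated file in version control is one way to make a cluster replicable: the same file can be handed to the `pcluster` CLI whenever the cluster needs to be recreated.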
Automatic resource scaling

• Scale up when jobs are waiting
• Scale down when the cluster is idle
• Your data storage and file system scale to match your compute
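
The scale-up and scale-down behavior described above can be pictured with a toy model. This is only an illustrative sketch, not ParallelCluster's actual scaling logic: the function name, its parameters, and the one-node-per-job assumption are all hypothetical.

```python
def desired_node_count(running_nodes, busy_nodes, pending_jobs, min_nodes, max_nodes):
    """Toy scaling rule: one node per job, clamped to [min_nodes, max_nodes].

    Not ParallelCluster's real algorithm; all names are illustrative.
    """
    if pending_jobs > 0:
        # Jobs are waiting: add one node per waiting job.
        target = running_nodes + pending_jobs
    else:
        # Queue is empty: keep only the nodes still doing work.
        target = busy_nodes
    return max(min_nodes, min(max_nodes, target))


if __name__ == "__main__":
    print(desired_node_count(2, 2, 5, 0, 10))  # jobs waiting
    print(desired_node_count(7, 1, 0, 0, 10))  # mostly idle cluster
```

The clamp at the end reflects the idea that a cluster has a configured floor and ceiling, so a burst of submissions cannot grow it without bound.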
Seamless migration to the cloud

• Making HPC workloads cloud-native can take time and planning
• AWS ParallelCluster simplifies first steps to migrate HPC workloads
• Integrations simplify the transition to cloud-native HPC at your own pace
AWS ParallelCluster
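• Operating systems: Amazon Linux (ALINUX), CentOS 6/7, Ubuntu 16/18
• Visualization and networking: DCV, EFA
• MPI and communication libraries: Open MPI, Intel MPI, NCCL
• Schedulers: Slurm, SGE, Torque, AWS Batch
• Storage: FSx, EFS, S3, EBS, RAID
• Capacity and networking: On-Demand, Spot, VPC & subnets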


AWS ParallelCluster works for a variety of use cases

• Optimizing production workloads
• Fast prototyping
• Bypassing queues


AWS ParallelCluster runs workloads across all industries

• Drug discovery, genomics
• Risk, quantitative research
• Reservoir modeling, seismic imaging
• CAE, CFD
• Weather modeling


AWS ParallelCluster runs workloads with varying compute and throughput characteristics

• Tightly coupled workloads
• Loosely coupled workloads
• Accelerated computing
• Visualization
• AI/ML
• High-volume data analytics


AWS ParallelCluster Architecture
[Figure: architecture diagram. Labels: AWS Region, Availability Zone, VPC subnet, master server (scheduler, queue, NFS share), compute nodes, client, user, SSH, AWS CloudFormation template, Amazon EC2, AWS Auto Scaling, Amazon EC2 C5n instances + EFA, compute suite, case data, Amazon S3 bucket, Amazon Elastic Block Store (EBS).]
A story of an HPC research group
AWS ParallelCluster can scale in minutes to thousands of cores
AWS ParallelCluster can switch between compute nodes for rapid prototyping
AWS ParallelCluster makes elastic HPC easy
Use cluster placement groups to place instances even closer together

[Figure: a cluster placement group inside an Availability Zone.]

Use Elastic Fabric Adapter (EFA) to scale tightly coupled workloads even further
Scale tightly coupled HPC applications on AWS

Instance families shown: P3dn, G4dn, C5n, M5n/M5dn, R5n/R5dn, i3en
Processors: NVIDIA V100 and T4 Tensor Core GPUs; custom Intel Xeon Scalable processor
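
One way to act on this list is to check a candidate instance type against the families named on the slide before planning a tightly coupled run. The sketch below is illustrative only: the helper name is hypothetical, and the set contains just the families listed above, not an authoritative or current list of EFA-capable instances.

```python
# Families named on the slide above; not an authoritative EFA support list.
SLIDE_FAMILIES = {"p3dn", "g4dn", "c5n", "m5n", "m5dn", "r5n", "r5dn", "i3en"}


def in_slide_families(instance_type):
    """Return True if the type's family (the text before the dot) is on the slide."""
    family = instance_type.strip().lower().split(".", 1)[0]
    return family in SLIDE_FAMILIES


if __name__ == "__main__":
    for candidate in ("c5n.18xlarge", "c5.18xlarge"):
        print(candidate, in_slide_families(candidate))
```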
Choose compute instances suited for each workload
Categories Capabilities
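• Choice of processor (AWS, Intel, AMD) [NEW]
• Fast processors (up to 4.0 GHz)
• High memory footprint (up to 12 TiB)
• Instance storage (HDD and NVMe)
• Accelerated computing (GPUs and FPGA)
• Networking (up to 100 Gbps) [NEW]
• Instance sizes (Nano to 32xlarge)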
Choose a master node to match your cluster

• The master node orchestrates cluster scaling logic
• Bigger clusters can require bigger master nodes
• Small master nodes have more limited network throughput
• DCV is managed through the master node
• Consider GPU-based instances for graphics-intensive visualization
AWS ParallelCluster supports On-Demand Instances, Reserved Instances, and EC2 Spot pricing
• On-Demand for workloads that have to be done now
• Spot for fault-tolerant and flexible workloads
• Reserved Instances for predictable levels of workload
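
The guidance above amounts to a small decision rule. The sketch below encodes it as a hypothetical helper. The function name, its boolean inputs, and the order of the checks are assumptions made for illustration, not an AWS recommendation engine.

```python
def suggest_purchase_option(must_run_now, fault_tolerant, predictable_baseline):
    """Map the workload traits from the slide to a purchasing option (illustrative)."""
    if predictable_baseline:
        # Steady, predictable usage levels: Reserved Instances.
        return "reserved"
    if fault_tolerant and not must_run_now:
        # Interruptible and flexible on timing: Spot.
        return "spot"
    # Everything else, including work that has to be done now: On-Demand.
    return "on-demand"


if __name__ == "__main__":
    print(suggest_purchase_option(must_run_now=False, fault_tolerant=True,
                                  predictable_baseline=False))
```

Real clusters often mix options, for example a reserved baseline with Spot for bursts, so a single label per workload is a simplification.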
FSx for Lustre offers massively scalable file system performance
• Parallel file system
• SSD-based
• 100+ GiB/s throughput
• Millions of IOPS
• Supports hundreds of thousands of cores
• Consistent sub-millisecond latencies
Disable hyperthreading for improved performance

• Easily turn hyperthreading on or off with a single parameter
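
To see what turning hyperthreading off means for core counts: with two hardware threads per physical core, disabling it halves the number of CPUs the scheduler can place work on. The sketch below assumes two threads per core, which holds for many but not all instance types, and the function name is hypothetical.

```python
def usable_cpus(vcpus, hyperthreading_enabled, threads_per_core=2):
    """Return schedulable CPUs for an instance (assumes uniform threads per core)."""
    if vcpus % threads_per_core != 0:
        raise ValueError("vcpus must be a multiple of threads_per_core")
    if hyperthreading_enabled:
        return vcpus
    # One thread per physical core remains when hyperthreading is off.
    return vcpus // threads_per_core


if __name__ == "__main__":
    # A 72-vCPU instance exposes 36 cores with hyperthreading disabled.
    print(usable_cpus(72, hyperthreading_enabled=False))
```

The single parameter is assumed here to be `disable_hyperthreading` in the cluster section of the config file. Confirm the name for your ParallelCluster version.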
Custom AMIs provide additional flexibility and choice

• Install your own software on top of an existing AWS ParallelCluster AMI
• Bring your own AMI and add AWS ParallelCluster on top
HPC on AWS

Flexible configuration and virtually unlimited scalability to grow and shrink your infrastructure as your HPC workloads dictate, not the other way around
Thank you!
Nathan Stornetta
stornetn@amazon.com

https://github.com/aws/aws-parallelcluster
