Build data lakes with Amazon S3

John Mallory,
Principal BDM, Storage, AWS

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
Defining the data lake

Building the data lake foundation

Optimizing the data lake

Getting to results

Making it easier

Finding value in data is a journey

Business transformation

New business opportunities

Business optimization

Business insights

Business monitoring

Evolving tools and methods: SQL, Query, AI/ML

Why choose AWS for data lakes and analytics?

• Most comprehensive
• Most secure
• Most cost-effective
• Easiest to build
• Most customers & partners

Defining the AWS data lake
Data lakes are designed to provide:
• Relational and non-relational data
• Scale-out to exabytes
• Diverse set of analytics and machine learning tools
• Work on data without any data movement
• Low-cost storage and analytics

[Figure: Data Warehouse and Data Lake with a Catalog. Workloads: Business Intelligence, Machine Learning, DW Queries, Big Data Processing, Interactive, Real-time. Data sources: OLTP, ERP, CRM, LOB, Devices, Web, Sensors, Social.]

Data lake on AWS components

• Central Storage (scalable, secure, cost-effective): Amazon S3
• Catalog & Search: AWS Glue, Amazon DynamoDB, Amazon Elasticsearch Service
• Access & User Interfaces: AWS AppSync, Amazon API Gateway, Amazon Cognito
• Data Ingestion: AWS Snowball, Amazon Kinesis Data Firehose, AWS Direct Connect, AWS Database Migration Service, AWS Storage Gateway
• Manage & Secure: AWS KMS, AWS IAM, AWS CloudTrail, Amazon CloudWatch
• Analytics & Serving: Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon DynamoDB, Amazon QuickSight, Amazon Kinesis, Amazon Elasticsearch Service, Amazon Neptune, Amazon RDS

Fortnite

• 125+ million players
• Data provides a constant feedback loop for game designers
• Up-to-the-minute analysis of gamer satisfaction to drive gamer engagement
• Resulting in the most popular game played in the world

Epic Games uses data lakes and analytics

• Entire analytics platform running on AWS
• Amazon S3 leveraged as a data lake
• All telemetry data is collected with Amazon Kinesis
• Real-time analytics done through Spark on Amazon EMR, with Amazon DynamoDB used to create scoreboards and real-time queries
• Use Amazon EMR for large batch data processing
• Game designers use data to inform their decisions

[Figure: Sources (game clients, game servers, launcher, game services, other sources) feed Kinesis and APIs. Near-real-time pipelines: Spark on EMR, DynamoDB, Scoreboards API, Grafana, limited raw data (real-time ad hoc SQL), user ETL (metric definition). Batch pipelines: S3 (data lake), ETL using EMR, databases, Tableau/BI, ad hoc SQL.]

Why do we use Amazon S3 for data lakes?

• Unmatched durability, availability, and scalability
• Best security, compliance, and audit capabilities
• Object-level controls
• Business insights into your data
• Most ways to bring data in

Rapidly ingest all data sources
Ingest methods cover:
• IoT, sensor data, clickstream data, social media feeds, streaming logs
• Oracle, MySQL, MongoDB, DB2, SQL Server, Amazon RDS
• On-premises ERP, mainframes, lab equipment, NAS storage
• Offline sensor data, NAS, on-premises Hadoop
• On-premises data lakes, EDW, large-scale data collection

A data lake needs to accommodate a wide variety of concurrent data sources

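For the streaming sources in this list, one possible ingestion path is Amazon Kinesis Data Firehose, which appears among the ingestion components earlier in the deck. The sketch below only illustrates batching events for the `PutRecordBatch` call, which accepts at most 500 records per request. The stream name, the event shape, and the `send` helper are assumptions made for illustration, and the per-call size limit is not handled here.

```python
import json
from typing import Dict, Iterable, Iterator, List

# PutRecordBatch accepts at most 500 records per call.
MAX_RECORDS_PER_BATCH = 500


def to_firehose_records(events: Iterable[dict]) -> Iterator[Dict[str, bytes]]:
    """Serialize each event as one newline-terminated JSON document."""
    for event in events:
        line = json.dumps(event, sort_keys=True) + "\n"
        yield {"Data": line.encode("utf-8")}


def batches(records: Iterable[Dict[str, bytes]],
            size: int = MAX_RECORDS_PER_BATCH) -> Iterator[List[Dict[str, bytes]]]:
    """Group records into lists of at most `size` items."""
    batch: List[Dict[str, bytes]] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def send(events: Iterable[dict], stream_name: str, client) -> int:
    """Send events through a Firehose client supplied by the caller.

    `client` is expected to behave like boto3.client("firehose").
    Returns the number of records the service reported as accepted.
    """
    accepted = 0
    for batch in batches(to_firehose_records(events)):
        response = client.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        accepted += len(batch) - response.get("FailedPutCount", 0)
    return accepted
```

A production sender would also retry the records reported as failed in the response rather than only counting them.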
A data lake is not one big bucket
Highly decoupled configurations scale better, are more fault tolerant, and are cost optimized

Pipeline stages:
Raw Data (Amazon S3) -> ETL on Hadoop (Amazon EMR) -> Staged Data, the data lake (Amazon S3) -> Data Warehouse (Amazon Redshift)

Supporting services:
• Triggered Code: AWS Lambda
• ETL & Catalog Management: AWS Glue

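One way to picture the decoupling is a separate bucket per stage with a shared key layout. The sketch below is illustrative only: the bucket names are hypothetical, and the Hive-style `year=/month=/day=` layout is an assumed convention, not something the slide prescribes.

```python
from datetime import date

# Hypothetical bucket names, one per pipeline stage.
ZONE_BUCKETS = {
    "raw": "example-lake-raw",
    "staged": "example-lake-staged",
}


def object_key(dataset: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned key for one object."""
    return (
        f"{dataset}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


def s3_uri(zone: str, dataset: str, day: date, filename: str) -> str:
    """Return the full S3 URI for an object in the given zone."""
    if zone not in ZONE_BUCKETS:
        raise ValueError(f"unknown zone: {zone}")
    return f"s3://{ZONE_BUCKETS[zone]}/{object_key(dataset, day, filename)}"
```

Keeping the same key layout in every zone means an ETL step can derive its output location from its input location by swapping only the bucket.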
Set up a catalog, ETL, and data prep with AWS Glue
• Serverless provisioning, configuration, and scaling to run your ETL jobs on Apache Spark
• Pay only for the resources used for jobs
• Crawl your data sources, identify data formats, and suggest schemas and transformations
• Automates the effort in building, maintaining, and running ETL jobs

Event-driven AWS Glue ETL pipeline
Let Amazon CloudWatch Events and AWS Lambda drive the pipeline

Timeline:
New raw data arrives (< 22:00 UTC) -> Crawl raw dataset -> Run “optimize” job -> Crawl optimized dataset -> Ready for reporting (SLA deadline 02:00 UTC)

Events and actions:
• Data arrives in Amazon S3: start crawler
• Crawler succeeds: start job or trigger
• Job succeeds: start crawler
• Reporting dataset ready

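The event-to-action mapping on this slide could be sketched as a single AWS Lambda handler. This is a minimal sketch, not the presenter's implementation: the crawler and job names are hypothetical, and the event field names (`source`, `detail-type`, `detail.state`, `detail.crawlerName`, `detail.jobName`) reflect the CloudWatch Events format for AWS Glue as I understand it, so check them against the current documentation before relying on them.

```python
# Hypothetical resource names for illustration.
RAW_CRAWLER = "raw-crawler"
OPTIMIZE_JOB = "optimize-job"
OPTIMIZED_CRAWLER = "optimized-crawler"


def next_action(event: dict):
    """Map an incoming event to the next pipeline step, or None."""
    records = event.get("Records") or []
    if any(r.get("eventSource") == "aws:s3" for r in records):
        # New raw data arrived in Amazon S3: crawl the raw dataset.
        return ("start_crawler", RAW_CRAWLER)
    if event.get("source") != "aws.glue":
        return None
    detail_type = event.get("detail-type")
    detail = event.get("detail", {})
    if detail_type == "Glue Crawler State Change":
        if detail.get("state") == "Succeeded" and detail.get("crawlerName") == RAW_CRAWLER:
            # Raw crawl finished: run the "optimize" job.
            return ("start_job_run", OPTIMIZE_JOB)
        return None
    if detail_type == "Glue Job State Change":
        if detail.get("state") == "SUCCEEDED" and detail.get("jobName") == OPTIMIZE_JOB:
            # Job finished: crawl the optimized dataset.
            return ("start_crawler", OPTIMIZED_CRAWLER)
    return None


def handler(event, context, glue=None):
    """Lambda entry point; `glue` can be injected for local testing."""
    action = next_action(event)
    if action is None:
        return {"action": None}
    if glue is None:
        import boto3  # imported lazily so the routing logic runs without AWS
        glue = boto3.client("glue")
    kind, name = action
    if kind == "start_crawler":
        glue.start_crawler(Name=name)
    else:
        glue.start_job_run(JobName=name)
    return {"action": kind, "name": name}
```

Separating `next_action` from the AWS calls keeps the routing rules testable without credentials.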
Preferred way to control data lake access
“Fine-grained” data and resource ownership
• Teams share S3 buckets and clusters
• Access control complex to set up and maintain
• Common in a “shared services” architecture

[Figure: “Fine-grained” ownership. Team X, Team Y, and Team Z share an Amazon Redshift cluster (databases and schemas), an Amazon EMR cluster (Presto, Hive, Zeppelin, …), and Amazon S3 buckets. EMR storage paths: Local FS (/foo/bar, /abc/xyz, /local), HDFS (hdfs:///data/1st, hdfs:///data2), EMRFS (s3://bucket/prfx, s3://group/data).]

Access controls applied at pipeline stages
Pipeline: Data Origin -> Ingested Data (Amazon S3) -> Staged or Sandboxed Data (Amazon S3) -> Data Lake (Amazon S3) -> Amazon Redshift, with data cleaning/prep on Amazon EMR

Accessible by:
• Ingested Data: service role Data mgmt. service; human role N/A
• Staged or Sandboxed Data: service role Data mgmt. service; human role Data scientist
• Data Lake: service role Data mgmt. service; human role Business analyst
• Amazon Redshift: service role Reporting service; human role Report builders

Control access to data with AWS Identity and Access Management (IAM)
Configure Amazon S3 permissions
• Use S3 bucket policies for easy cross-account data sharing
• Use S3 prefixes in IAM policies if feasible
• Use S3 object tags for granular access policies
• Use S3 Batch Operations to simplify management of object tags at scale
• Authorize access from other tools such as Amazon Redshift and Amazon Athena using IAM roles

[Figure: IAM principals, Amazon EMR, Amazon Redshift]

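The prefix-based and tag-based recommendations can be expressed as IAM policy documents. The sketch below builds two such documents as plain Python dictionaries; the bucket name, prefix, tag key, and tag value are placeholders, and the statements are a starting point under those assumptions rather than a reviewed policy.

```python
import json


def prefix_read_policy(bucket: str, prefix: str) -> dict:
    """Allow listing and reading only under one prefix of a bucket."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


def tag_read_policy(bucket: str, tag_key: str, tag_value: str) -> dict:
    """Allow reading only objects that carry a specific object tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadTagged",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "StringEquals": {f"s3:ExistingObjectTag/{tag_key}": tag_value}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder names for illustration only.
    print(json.dumps(prefix_read_policy("example-lake-staged", "team-x/"), indent=2))
```

Prefix policies stay readable when each team maps to a path; object tags are the finer-grained option when the same prefix holds data with different sensitivity.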
Optimizing data lake performance & costs

Aggregate small files
• Amazon EMR: S3DistCp
• Amazon Kinesis Data Firehose
• AWS Glue ETL

Amazon S3 Select
• Big data cheaper, faster
• Up to 400% faster
• Up to 80% cheaper

Data Management
• HDFS/S3 tiering
• Columnar formats
• Partitioning
• EMRFS consistent view

Services shown: Amazon S3, Amazon DynamoDB

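To make the "aggregate small files" idea concrete, the sketch below plans which small objects would be merged together so that each output lands near a target size. It only produces the grouping; the actual merge would be done by one of the tools named on the slide. The 128 MiB target and the greedy strategy are assumptions chosen for illustration.

```python
from typing import Iterable, List, Tuple

# Assumed target size for each merged output file.
TARGET_BYTES = 128 * 1024 * 1024


def plan_compaction(objects: Iterable[Tuple[str, int]],
                    target_bytes: int = TARGET_BYTES) -> List[List[str]]:
    """Greedily group (key, size) pairs, in key order, into merge batches.

    A batch is closed as soon as adding the next object would exceed
    `target_bytes`. An object larger than the target gets its own batch.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for key, size in sorted(objects):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

Processing keys in sorted order keeps objects from the same partition prefix together, so merged outputs do not mix partitions as long as each partition is planned separately.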
Optimize costs with data tiering
• Use Amazon EMR/Hadoop with local HDFS for hottest datasets
• Store cooler data in Amazon S3 Standard and cold data in S3 Glacier Deep Archive to reduce costs
• Use Amazon S3 Analytics to optimize tiering strategy

Tiers, hot to cold: HDFS, Amazon S3, Amazon S3 (infrequent access), S3 Glacier Deep Archive

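Tiering within Amazon S3 is typically expressed as a lifecycle configuration. The sketch below builds one in the shape accepted by boto3's `put_bucket_lifecycle_configuration`. The prefixes and the 30-day and 180-day thresholds are assumptions; in practice the thresholds would come from observed access patterns.

```python
from typing import Iterable


def tiering_rule(prefix: str, ia_after_days: int = 30,
                 archive_after_days: int = 180) -> dict:
    """Build one lifecycle rule: Standard -> Standard-IA -> Deep Archive."""
    if ia_after_days < 30:
        # S3 does not transition objects to Standard-IA before 30 days.
        raise ValueError("ia_after_days must be at least 30")
    if archive_after_days <= ia_after_days:
        raise ValueError("archive_after_days must exceed ia_after_days")
    rule_id = "tier-" + prefix.strip("/").replace("/", "-")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": archive_after_days, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }


def lifecycle_configuration(prefixes: Iterable[str]) -> dict:
    """Combine one rule per prefix into a lifecycle configuration."""
    return {"Rules": [tiering_rule(p) for p in prefixes]}


# Applying it (requires boto3 and credentials; bucket name is a placeholder):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-lake-raw",
#       LifecycleConfiguration=lifecycle_configuration(["raw/logs/"]),
#   )
```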
S3 Intelligent-Tiering automates cost savings (NEW!)
• Automatically optimizes storage costs for data with changing access patterns
• Moves objects between two storage tiers: Frequent Access Tier and Infrequent Access Tier
• Monitors access patterns and auto-tiers on granular object level
• Millisecond access, ≥3 AZs, monitoring fee per object, minimum storage duration

Use in primary data lake buckets with objects >1MB

Process data in place…

• Amazon Athena
• Amazon Redshift Spectrum
• Amazon SageMaker
• AWS Glue

All operate on data stored in Amazon S3

S3 Select

Select a subset of your object’s data using a SQL expression
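
As an illustration, the sketch below builds the arguments for boto3's `select_object_content` call and reads the streamed result. It assumes a gzip-compressed CSV object with a header row; the bucket, key, and column names in any real call would be your own, and the helper names here are hypothetical.

```python
from typing import Iterable, Iterator, List, Optional


def select_request(bucket: str, key: str, columns: List[str],
                   where: Optional[str] = None,
                   compression: str = "GZIP") -> dict:
    """Build keyword arguments for s3.select_object_content."""
    column_list = ", ".join("s." + c for c in columns)
    expression = "SELECT " + column_list + " FROM S3Object s"
    if where:
        expression += " WHERE " + where
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Assumes CSV input whose first line names the columns.
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": compression,
        },
        "OutputSerialization": {"JSON": {}},
    }


def read_chunks(response: dict) -> Iterator[str]:
    """Yield decoded text chunks from the response event stream."""
    for event in response["Payload"]:
        if "Records" in event:
            yield event["Records"]["Payload"].decode("utf-8")


# Usage (requires boto3 and credentials):
#   import boto3
#   s3 = boto3.client("s3")
#   resp = s3.select_object_content(**select_request(
#       "example-bucket", "logs/2019/03/07.csv.gz",
#       ["user_id", "status"], "s.status = 'error'"))
#   text = "".join(read_chunks(resp))
```

Only the selected columns and matching rows leave Amazon S3, which is where the reduction in transferred data comes from.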

Seamless integration with Amazon S3
Link your Amazon S3 dataset to your Amazon FSx for Lustre file system, then:
• Data stored in Amazon S3 is loaded to Amazon FSx for processing
• Output of processing is returned to Amazon S3 for retention

When your workload finishes, simply delete your file system.

Sysco—Analytics on the data lake

• Sysco is the leader in selling, marketing, and distributing food
• Challenge: Large volumes of data stored in multiple systems
• Consolidated data into a single S3 data lake
• Amazon Redshift Spectrum used by business users for reporting
• Amazon EMR & Amazon Athena used by data scientists

[Figure: Marketing data source and other source systems -> ingest raw data from multiple sources (Amazon S3) -> data preparation (ETL process) -> transformed data (Amazon S3) -> Amazon Redshift, Amazon Redshift Spectrum, Amazon Athena, Amazon EMR, ML.]

AWS Lake Formation
Build a secure data lake in days

Most partners to complement AWS offerings

Featured Data & Analytics APN Consulting Partners
AWS Data & Analytics Competency Partners have demonstrated success helping customers evaluate
and use the tools and best practices for collecting, storing, governing, and analyzing data, at any scale.
They help customers use data and analytics as a competitive differentiator and a primary source of
value generation. This includes designing and deploying data lakes and analytic solutions; defining and
enforcing data policies; security and management of personal information; creating data catalogs and
glossaries; data integration, data warehousing, reporting, dashboarding, data visualization; and more.

Partner regions: North America, APAC, EMEA, LATAM, Japan

Learn from AWS experts. Advance your skills and knowledge. Build your future in the AWS Cloud.

• Digital Training: Free, self-paced online courses built by AWS experts
• Classroom Training: Classes taught by accredited AWS instructors
• AWS Certification: Exams to validate expertise with an industry-recognized credential

Ready to begin building your cloud skills?
Get started at: https://www.aws.training/

Why work with an APN Partner?
APN Partners are uniquely positioned to help your organization at any stage of your cloud adoption journey, and they:
• Share your goals, focused on your success
• Help you take full advantage of all the business benefits that AWS has to offer
• Provide services and solutions to support any AWS use case across your full customer life cycle

APN Partners with deep expertise in AWS services:
• AWS Managed Service Provider (MSP) Partners: APN Partners with cloud infrastructure and application migration expertise
• AWS Competency Partners: APN Partners with verified, vetted, and validated specialized offerings
• AWS Service Delivery Partners: APN Partners with a track record of delivering specific AWS services to customers

Find the right APN Partner for your needs: https://aws.amazon.com/partners/find/

Thank You for Attending AWS Innovate
We hope you found it interesting! A kind reminder to complete the survey.
Let us know what you thought of today’s event and how we can improve
the event experience for you in the future.

aws-apac-marketing@amazon.com
twitter.com/AWSCloud

facebook.com/AmazonWebServices
youtube.com/user/AmazonWebServices

slideshare.net/AmazonWebServices
twitch.tv/aws
