
These materials are © 2020 John Wiley & Sons, Inc.

Any dissemination, distribution, or unauthorized use is strictly prohibited.


Cloud Data Lakes
Snowflake Special Edition

by David Baum

Cloud Data Lakes For Dummies®, Snowflake Special Edition

Published by
John Wiley & Sons, Inc.
111 River St.
Hoboken, NJ 07030-5774
www.wiley.com
Copyright © 2020 by John Wiley & Sons, Inc., Hoboken, New Jersey

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as
permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written
permission of the Publisher. Requests to the Publisher for permission should be addressed to the
Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011,
fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, and related trade dress are
trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and
other countries, and may not be used without written permission. Snowflake and the Snowflake logo are
trademarks or registered trademarks of Snowflake Inc. All other trademarks are the property of their respective
owners. John Wiley & Sons, Inc., is not associated with any product or vendor mentioned in this book.

LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO
REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE
CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT
LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED
OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS.  THE ADVICE AND STRATEGIES CONTAINED
HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING
THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL
SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL
PERSON SHOULD BE SOUGHT.  NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR
DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN
THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN
THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE
MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT
INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN
THIS WORK WAS WRITTEN AND WHEN IT IS READ.

For general information on our other products and services, or how to create a custom For Dummies book
for your business or organization, please contact our Business Development Department in the U.S. at
877-409-4177, contact info@dummies.biz, or visit www.wiley.com/go/custompub. For information about
licensing the For Dummies brand for products or services, contact BrandedRights&Licenses@Wiley.com.

ISBN 978-1-119-66624-0 (pbk); ISBN 978-1-119-66648-6 (ebk)

Manufactured in the United States of America

10 9 8 7 6 5 4 3 2 1

Publisher’s Acknowledgments
We’re proud of this book and of the people who worked on it. For details on how to create a
custom For Dummies book for your business or organization, contact info@dummies.biz or visit
www.wiley.com/go/custompub. For details on licensing the For Dummies brand for products or
services, contact BrandedRights&Licenses@Wiley.com.
Some of the people who helped bring this book to market include the following:

Development Editor: Steve Kaelble
Project Editor: Martin V. Minner
Executive Editor: Steve Hayes
Editorial Manager: Rev Mengle
Business Development Representative: Karen Hattan
Production Editor: Tamilmani Varadharaj
Snowflake Contributors Team: Vincent Morello, Michael Nixon,
Clarke Patterson, Leslie Steere
Table of Contents
INTRODUCTION
    About This Book
    Foolish Assumptions
    Icons Used in This Book
    Beyond the Book

CHAPTER 1: Diving into Cloud Data Lakes
    Flowing Data into the Lake
    Understanding the Problems with Data Lakes
    Reviewing the Requirements
    Introducing the Cloud Data Lake
    Considering the Rise of the Modern Data Lake
    Explaining Why You Need a Modern Cloud Data Lake
    Looking at Who Uses Modern Data Lakes, and Why

CHAPTER 2: Explaining Why the Modern Data Lake Emerged
    Differentiating the Data Warehouse from the Data Lake
    Staying Afloat in the Data Deluge
        Harnessing data from many enterprise applications
        Unifying device-generated data
    Keeping Your Data in the Cloud
    Democratizing Your Analytics

CHAPTER 3: Reducing Risk, Protecting Data
    Implementing Compliance and Governance
    Ensuring Data Quality
    Incorporating Protection, Availability, and Data Retention
    Protecting Your Data with End-to-End Security
        Encrypting everywhere
        Managing the key
        Automating updates and logging
        Controlling access
        Certifying compliance and attestations
        Isolating your data
    Facing Facts about Data Security
CHAPTER 4: Strategies for Modernizing a Data Lake
    Beginning with the Right Architecture
    Collecting and Integrating a Range of Data Types
    Continuously Loading Data
    Enabling Secure Data Sharing
    Customizing Workloads for Optimal Performance
        Enabling a high-performance SQL layer
        Maintaining workload isolation
        Interacting with data in an object store
        Resizing compute clusters
    Creating a User-Friendly Environment with Metadata

CHAPTER 5: Assessing the Benefits of a Modern Cloud Data Lake
    Getting Here from There
    Increasing Scalability Options
    Reducing Deployment and Management Costs
    Gaining Insights from All Types of Data
    Boosting Productivity for Business and IT
    Simplifying the Environment
        Examining the benefits of object storage
        Offering more advice

CHAPTER 6: Six Steps for Planning Your Cloud Data Lake

Introduction
Data lakes emerged more than a decade ago to solve a growing
problem: the need for a scalable, low-cost data repository that
allowed organizations to easily store all data types from a diverse
set of sources, and then analyze that data to make evidence-based
business decisions.

But what about the data warehouse, which was the de facto solution
for storing and analyzing structured data and preceded the data lake
by 30 years? It couldn’t accommodate these new, big data projects and
their fast-paced acquisition models, many of which envisioned easily
storing petabytes of data in structured and semi-structured forms.
With the future of big data looming large, the data lake seemed like
the answer: an ideal way to gather, store, and analyze enormous
amounts of data in one location.

Interest in data lakes skyrocketed for one simple reason: Most
organizations consider data a very important asset, and the systems
of the time couldn’t handle the variety. For decades, organizations
have collected structured data from enterprise applications, and now
they’re supplementing it with these newer forms of semi-structured
data from web pages, social media sites, mobile phones, Internet of
Things (IoT) devices, and many other sources, including shared data
sets. Data scientists, business analysts, and line-of-business
professionals still need a way to easily capture, store, access, and
analyze that data.

The initial data lakes were deployed on premises, mostly using open
source tools from the Apache Hadoop ecosystem. In the decade since
data lakes were first introduced, cloud computing has evolved and
data storage technologies have matured. Better and easier ways to
create data lakes have emerged that leverage the power, flexibility,
and near-infinite scalability of the cloud.

About This Book


Cloud Data Lakes For Dummies is your guide to modern data lakes
that combine the power of analytics with the flexibility of big data
models and the agility and limitless resources of the cloud. Whether
you’re considering your first data lake or wish to update an existing
one, this book offers ideas to help you achieve your business and
technology goals.
Foolish Assumptions
In creating this book, I’ve made a few assumptions:

»» You’re a business user, data scientist, data platform architect,
data warehouse manager, or perhaps a company executive.
»» You want to store, analyze, visualize, or share data from a
variety of sources, and your existing solution is coming up
short. Or perhaps you need access to massive data sets to
train a machine learning model.
»» You want to understand how a data lake can create new
business opportunities and help your organization make
better, more complete, and more timely decisions.

Icons Used in This Book


Throughout this book you’ll find the following icons that highlight
tips, important points to remember, and more:

This icon guides you to faster ways to perform essential tasks, such
as better ways to put a cloud data lake to work.

Here you’ll find ideas worth remembering as you immerse yourself in
the exciting world of data lake concepts.

Throughout this book, case studies provide best practices from
organizations that have successfully applied cloud data lakes.

Beyond the Book


Visit www.snowflake.com to find loads of additional content
about cloud data lakes and related topics. Read other ebooks,
view webinars, and get the scoop on upcoming events. You’ll also
find contact information in case you want to get in touch with
Snowflake or try Snowflake for free as your cloud data lake or for
another business need.

IN THIS CHAPTER
»» Flowing data into lakes

»» Limitations of traditional data lakes

»» Introducing modern data lakes

»» Looking at who uses modern data lakes, and why

Chapter 1
Diving into Cloud Data Lakes

Back in 2010, James Dixon, who founded and served as chief
technology officer for Pentaho, coined the term data lake to
describe a new type of data repository for storing massive
amounts of raw data in its native form, in a single location.

This chapter digs into the history of the data lake, why the idea
emerged, and how it has evolved. It explores what data lakes can
do, and where traditional data lakes have fallen short of ever-
expanding expectations. It spells out data lake strategies and
explains where the cloud data lake fits in.

Flowing Data into the Lake


What’s behind the metaphoric name data lake? Dixon’s image was a
large body of water, into which new water streams from many channels,
and from which samples are taken and analyzed.

It was a revolutionary concept. Prior to the data lake, most analytic
systems stored specific types of data, using a predefined database
structure. For example, data warehouses were built primarily for
analytics, using relational databases that included a schema to
define tables of structured data in orderly columns and rows. By
contrast, the hope for data lakes was to store many data types in
their native formats and make that data available to the business
community for reporting and analytics (see Figure 1-1). The goal was
to enable organizations to explore, refine, and analyze petabytes of
information without a predetermined notion of structure.
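The contrast between a predefined schema and data kept in its native form can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how any particular warehouse or data lake product works: the `orders` table, the sample records, and the `lake_total` helper are all hypothetical, and an in-memory SQLite database stands in for a relational warehouse.

```python
import json
import sqlite3

# Warehouse style: the table structure is fixed before any data is loaded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (1, "acme", 19.99))
warehouse_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Lake style: records are stored as raw text in whatever shape they arrive.
raw_lake = [
    '{"order_id": 2, "customer": "globex", "amount": 5.0}',
    '{"device": "sensor-7", "temp_c": 21.4}',
    '{"order_id": 3, "amount": 10.0, "coupon": "SPRING"}',
]


def lake_total(lines):
    """Apply structure only at read time: sum 'amount' where it exists."""
    total = 0.0
    for line in lines:
        record = json.loads(line)
        if "amount" in record:
            total += record["amount"]
    return total


print(warehouse_total)
print(lake_total(raw_lake))
```

In this sketch, `warehouse_total` comes from data that had to fit the table definition before it was loaded, while `lake_total(raw_lake)` skips the sensor record and tolerates the extra `coupon` field because structure is decided only when the data is read.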

The most important thing to understand about a data lake is not how
it is constructed, but what it enables. It’s a comprehensive way to
explore, refine, and analyze petabytes of information constantly
arriving from multiple data sources.

One petabyte of data is equivalent to 1 million gigabytes: about 500
billion pages of standard, printed text or 58,333 high-definition,
two-hour movies. Data lakes were conceived for business users to
explore and analyze petabytes of data.
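As a quick sanity check on those equivalences, the arithmetic can be worked backward to see what each comparison implies. The sketch below assumes decimal units (1 gigabyte = 10^9 bytes), which is what "1 million gigabytes" suggests; the per-page and per-movie sizes are derived from the quoted figures and are not stated in the text.

```python
PETABYTE_BYTES = 10**15   # assumption: decimal petabyte
GIGABYTE_BYTES = 10**9    # assumption: decimal gigabyte

PAGES = 500 * 10**9       # "500 billion pages", as quoted in the text
MOVIES = 58_333           # "58,333 ... movies", as quoted in the text

gigabytes_per_petabyte = PETABYTE_BYTES // GIGABYTE_BYTES
bytes_per_page = PETABYTE_BYTES / PAGES
gigabytes_per_movie = gigabytes_per_petabyte / MOVIES

print(gigabytes_per_petabyte)         # 1000000
print(bytes_per_page)                 # 2000.0
print(round(gigabytes_per_movie, 2))  # 17.14
```

Under these assumptions, the quoted figures imply roughly 2,000 bytes per printed page and a little over 17 GB per two-hour high-definition movie.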

FIGURE 1-1: The original goal of the data lake, which failed to deliver the
desired rapid insights.

Understanding the Problems with Data Lakes
The initial data lake concept was compelling, and many organizations
rushed to build on-premises data lakes. The core technology was based
on the Apache Hadoop ecosystem, an open source
