by David Baum
These materials are © 2020 John Wiley & Sons, Inc. Any dissemination, distribution, or unauthorized use is strictly prohibited.
Cloud Data Lakes For Dummies®, Snowflake Special Edition
Published by
John Wiley & Sons, Inc.
111 River St.
Hoboken, NJ 07030-5774
www.wiley.com
Copyright © 2020 by John Wiley & Sons, Inc., Hoboken, New Jersey
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as
permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written
permission of the Publisher. Requests to the Publisher for permission should be addressed to the
Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011,
fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, and related trade dress are
trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and
other countries, and may not be used without written permission. Snowflake and the Snowflake logo are
trademarks or registered trademarks of Snowflake Inc. All other trademarks are the property of their respective
owners. John Wiley & Sons, Inc., is not associated with any product or vendor mentioned in this book.
For general information on our other products and services, or how to create a custom For Dummies book
for your business or organization, please contact our Business Development Department in the U.S. at
877-409-4177, contact info@dummies.biz, or visit www.wiley.com/go/custompub. For information about
licensing the For Dummies brand for products or services, contact BrandedRights&Licenses@Wiley.com.
Publisher’s Acknowledgments
We’re proud of this book and of the people who worked on it. For details on how to create a
custom For Dummies book for your business or organization, contact info@dummies.biz or visit
www.wiley.com/go/custompub. For details on licensing the For Dummies brand for products or
services, contact BrandedRights&Licenses@Wiley.com.
Some of the people who helped bring this book to market include the following:
CHAPTER 4: Strategies for Modernizing a Data Lake
Beginning with the Right Architecture
Collecting and Integrating a Range of Data Types
Continuously Loading Data
Enabling Secure Data Sharing
Customizing Workloads for Optimal Performance
Enabling a high-performance SQL layer
Maintaining workload isolation
Interacting with data in an object store
Resizing compute clusters
Creating a User-Friendly Environment with Metadata
Introduction
Data lakes emerged more than a decade ago to solve a growing
problem: the need for a scalable, low-cost data repository
that allowed organizations to easily store all data
types from a diverse set of sources, and then analyze that data to
make evidence-based business decisions.
But what about the data warehouse, which was the de facto solution
for storing and analyzing structured data and preceded the
data lake by 30 years? It couldn’t accommodate these new, big
data projects and their fast-paced acquisition models, many of
which envisioned easily storing petabytes of data in structured
and semi-structured forms. With the future of big data looming
large, the data lake seemed like the answer: an ideal way to gather,
store, and analyze enormous amounts of data in one location.
Foolish Assumptions
In creating this book, I’ve made a few assumptions:
Chapter 1
Diving into Cloud Data Lakes
IN THIS CHAPTER
» Flowing data into lakes
Back in 2010, James Dixon, who founded and served as chief
technology officer for Pentaho, coined the term data lake to
describe a new type of data repository for storing massive
amounts of raw data in its native form, in a single location.
This chapter digs into the history of the data lake, why the idea
emerged, and how it has evolved. It explores what data lakes can
do, and where traditional data lakes have fallen short of ever-expanding
expectations. It spells out data lake strategies and
explains where the cloud data lake fits in.
schema to define tables of structured data in orderly columns and
rows. By contrast, the hope for data lakes was to store many data
types in their native formats and make that data available to the
business community for reporting and analytics (see Figure 1-1).
The goal was to enable organizations to explore, refine, and analyze
petabytes of information without a predetermined notion of
structure.
FIGURE 1-1: The original goal of the data lake, which failed to deliver the
desired rapid insights.