Kent Graziano
Senior Technical Evangelist
Agenda
Bio
Data Warehousing: Historical Theory
Data Warehousing: The Reality
Data Warehousing: The Future
Closing Thoughts
My Bio
Co-Author of
Theoretical Architectures
Data Warehouse
What is it
Centralized location for data
Single source of truth or
Single source of Facts
Source of data for reporting,
analytics, and offline operational
processes
Who is it
Capital EDW:
Primary: Teradata, Oracle Exadata,
IBM Pure Systems,
Secondary: HP Vertica, Pivotal
Greenplum
Datamarts
What are they
Databases used to provide fast,
independent access to a subset of data
Often created for departments,
projects, users,
Similar technology
Subset of data
Relieves pressure on EDW
Provides sandbox for analysis /
analysts
Data sources
Traditional
OLTP databases
Non-traditional
Web applications
Enterprise applications
Other
Sensors, devices,
Transformation (ETL)
What is it
Getting data from source form
into a standard, clean,
normalized form
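A minimal sketch of the idea above, assuming three made-up source rows with inconsistent formatting. The field names (`cust`, `amount`, `sold_on`) and the rejection rule are illustrative assumptions, not taken from any particular ETL tool.

```python
from datetime import date

# Hypothetical raw rows as they might arrive from a source system.
SOURCE_ROWS = [
    {"cust": " Alice ", "amount": "101.12", "sold_on": "2015-03-01"},
    {"cust": "BOB", "amount": "56.22", "sold_on": "2015-03-02"},
    {"cust": "", "amount": "n/a", "sold_on": "2015-03-03"},  # dirty row
]


def transform(row):
    """Return a cleaned, typed row, or None if it cannot be standardized."""
    name = row["cust"].strip().title()
    if not name:
        return None
    try:
        amount = float(row["amount"])
    except ValueError:
        return None
    return {
        "customer": name,
        "amount": amount,
        "sale_date": date.fromisoformat(row["sold_on"]),
    }


def run_etl(rows):
    """Apply transform to every row; count the rows that were rejected."""
    cleaned, rejected = [], 0
    for row in rows:
        out = transform(row)
        if out is None:
            rejected += 1
        else:
            cleaned.append(out)
    return cleaned, rejected


cleaned, rejected = run_etl(SOURCE_ROWS)
print(cleaned, rejected)
```

In practice the cleaned rows would be loaded into a warehouse or data mart table rather than printed.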
[Diagram: Source 1, Source 2, and Source 3 feed Transformation Routines (ETL), which load the Sales, Financial, and Customer Service data marts.]
[Diagram: Source 1, Source 2, and Source 3 feed ETL Routines that load the Enterprise Data Warehouse, which supplies the Sales, Financial, and Customer Service data marts.]
[Diagram: CIF architecture. Components shown: operational systems (ERP, Internet, Legacy, Other, External) with API connections; Data Acquisition; CIF Data Management; Data Warehouse; Operational Data Store; Data Delivery; Exploration Warehouse, Data Mining Warehouse, OLAP Data Mart, and Oper Mart; TrI and DSI interfaces; Workbench; Information Feedback; Systems Management; Operation & Administration; Service Management; Change Management.]
DW 2.0™
Next Generation data warehouse architecture from Bill Inmon
Superseded CIF (for some)
Includes more accommodation and integration of meta data
Includes integration of unstructured data
DW 2.0™
Data Vault
Invented and Developed by Daniel Linstedt
New, hybrid modeling approach for enterprise data
warehousing
Introduced with TDAN articles in 2002
Truly introduces an approach for agile, incremental
DW model development
Called "hyper normalized" by some
Methodology adapted from Scott Ambler's
Disciplined Agile Delivery (DAD)
LearnDataVault.com
[Diagram: Data Vault example. Elements shown: Hubs keyed on Email ID, Bank ID, and Passenger ID; Links for Bank Transactions and Airline Reservations; Satellites (Sat) attached to the hubs and links; an F(x) element. Callout: a satellite records a history of the interaction. The dashed line is a possible new relationship.]
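A minimal sketch of the three Data Vault building blocks named in the diagram, assuming the usual roles: a hub holds a business key, a link relates hubs, and a satellite keeps descriptive attributes as insert-only history. The class layout, the `PassengerEmail` link, and the sample keys are hypothetical and only meant to illustrate the shape of the model.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class Hub:
    """One business key for one business concept."""
    name: str
    business_key: str


@dataclass(frozen=True)
class Link:
    """A relationship between two or more hubs."""
    name: str
    hubs: tuple


@dataclass
class Satellite:
    """Descriptive attributes of a hub or link, kept as insert-only history."""
    parent: object
    rows: list = field(default_factory=list)

    def record(self, load_ts, **attributes):
        # Never update in place: each change is appended as a new row.
        self.rows.append((load_ts, dict(attributes)))

    def current(self):
        # The latest row by load timestamp is the current view.
        return max(self.rows, key=lambda r: r[0])[1]


# Hypothetical sample data.
passenger = Hub("Passenger", "P-1001")
email = Hub("Email", "jsmith@example.com")
link = Link("PassengerEmail", (passenger, email))

passenger_sat = Satellite(parent=passenger)
passenger_sat.record(datetime(2015, 1, 1), name="John Smith")
passenger_sat.record(datetime(2015, 6, 1), name="John Q. Smith")

print(passenger_sat.current())
print(len(passenger_sat.rows))
```

Because new relationships become new links and new attributes become new satellites, the model can grow incrementally without reworking existing structures.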
(C) LearnDataVault.com
[Diagram: data sources (OLTP databases, enterprise applications, third-party, web applications, other); ETL; EDW, Datamarts, and Hadoop; BI / Analytics.]
Proprietary and Confidential
Lots of Hybrids
Most organizations mix Inmon & Kimball
ODS feeding Data Marts
Data Marts backed into an EDW
Off the Shelf models customized to work!
Canned BI apps
Oracle BI Apps
etc
[Diagram: hybrid architecture example. Layers shown: Source(s) of Record, COMN Stage, COMN Integration, COMN Presentation, Reporting. Labeled systems and tools: HI, FDW / P MS, G2, EDW V1, KDW, KDW Lite, Lynx, MU, SFDC, CI SAS Routines, TBLU, BOBJ, Web. Annotations: "Insert 1X only"; "Enterprise business key model with key mapping pointers to COMN_STG data"; "Full copies of source data structures with additional plumbing fields to facilitate CDC, capturing subsequent data changes over time"; "JIT Transformation (Virtual v. Physical)"; "Star Schema(s)"; "Data Marts".]
[Diagram: MSH EDW. Layers shown: HI Stage, HI Presentation, FIN Stage, FIN Presentation, COMN Stage, COMN Integration, COMN Presentation, COMN Validation (DQ), and BOBJ / BI / Reporting. Labeled systems and subject areas: HI, G2, FIN, EDW V1, KDW, KDW Lite, Lynx, SFDC, CLIN, MKTG, MU, CI SAS Routines.]
The Future
Today's realities
Datamarts
EDW
Hadoop
Data diversity
Complexity
Barriers to analytics
Data Warehousing
Complex: manage hardware, data
distribution, indexes,
Limited elasticity: forklift upgrades, data
redistribution, downtime
Costly: overprovisioning, significant care &
feeding
Hadoop
Complex: specialized skills, new tools
Limited elasticity: data redistribution,
resource contention
Not a data warehouse: batch-oriented,
limited optimization, incomplete security
[Diagram: components shown: Investigative computing platform; Traditional EDW environment; Data integration platform; Operational systems; RT BI services; Data refinery; RT analysis platform; Operational real-time environment.]
Copyright Intelligent Solutions, Inc. 2015. All Rights Reserved. Used by Permission
Elasticity
On demand resources
True grid utility computing
Security
What is Snowflake?
Delivered as a service
Infrastructure, resiliency,
optimization built in
Our vision:
Reinvent the Data Warehouse
Data Warehousing
for Everyone
Data
scientists
SQL
users &
tools
Structured data
(e.g. CSV)
Apple, 101.12, 250, FIH-2316
Pear, 56.22, 202, IHO-6912
Orange, 98.21, 600, WHQ-6090
{
  "firstName": "John",
  "lastName": "Smith",
  "height_cm": 167.64,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "office", "number": "646 555-4567" }
  ]
}
Semi-structured data
(e.g. JSON, Avro, XML)
Optimized storage
Flexible schema
Relational processing
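To make "flexible schema" plus "relational processing" concrete, here is a minimal sketch that turns the JSON document from this slide into flat, table-like rows. It only illustrates the general idea and is not a description of how Snowflake implements it; the `flatten` helper and the dotted column names are assumptions made for this example.

```python
import json

# The sample document from the slide.
DOC = """
{
  "firstName": "John",
  "lastName": "Smith",
  "height_cm": 167.64,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "office", "number": "646 555-4567" }
  ]
}
"""


def flatten(obj, prefix=""):
    """Flatten nested objects into dotted column names.

    Arrays are skipped here; they become rows of a child table instead.
    """
    row = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "."))
        elif not isinstance(value, list):
            row[name] = value
    return row


person = json.loads(DOC)

# One row for the person, one row per phone number (keyed by last name).
person_row = flatten(person)
phone_rows = [{"lastName": person["lastName"], **p} for p in person["phoneNumbers"]]

print(person_row)
print(phone_rows)
```

No schema was declared up front: the columns are discovered from the document itself, and the result can then be filtered or joined like any other relational data.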
Low-cost, scalable
cloud storage
Elastic compute, on
demand
Optimized for
diverse data
Software as a
service
Exact amount of
compute needed, exactly
when needed
No knobs, tuning, or
infrastructure
management
A new architecture:
multi-cluster, shared data
Standard interfaces
Cloud services layer
coordinates across service
Independent compute
clusters access data
Data centralized in enterprise-
class cloud storage
Marketing
Loading /
ETL
Finance
Sales
Operations
Test / Dev
Delivered as a service:
no infrastructure, knobs, or tuning
Infrastructure
management
Data storage
management
Metadata
management
Manual query
optimization
Automatic statistics
collection, scaling, and
redundancy
Dynamic optimization,
parallelization, and
concurrency management
Datamarts
Hadoop
Barriers to Analysis
Conclusions?
An Option to Consider
Snowflake is:
a team of accomplished data experts
Funded by top-tier VCs including Altimeter Capital, Redpoint Ventures,
Sutter Hill Ventures, Wing VC
SHAMELESS PLUG:
Available on
Amazon.com
http://www.amazon.com/Better-Data-Modeling-Introduction-Engineering-ebook/dp/B018BREV1C/
Contact Information
Kent Graziano
Snowflake Computing
Kent.graziano@snowflake.net
On Twitter @KentGraziano
More info at
http://snowflake.net
Visit my blog at
http://kentgraziano.com