
Data Warehousing 2016

Kent Graziano
Senior Technical Evangelist

Agenda
Bio
Data Warehousing: Historical Theory
Data Warehousing: The Reality
Data Warehousing: The Future
Closing Thoughts

My Bio

Senior Technical Evangelist, Snowflake Computing


Oracle ACE Director (DW/BI)
Certified Data Vault Master and DV 2.0 Practitioner
Former Member: Boulder BI Brain Trust (#BBBT)
Member: DAMA International
Data Architecture and Data Warehouse Specialist
30+ years in IT
25+ years of Oracle-related work
20+ years of data warehousing experience

Co-Author of
The Business of Data Vault Modeling
The Data Model Resource Book (1st Edition)

Blogger: The Data Warrior

Past-President of ODTUG and Rocky Mountain Oracle User Group

What about you?


Survey says

Theoretical Architectures

What Is a Data Warehouse?


"A subject-oriented, integrated, time-variant,
non-volatile collection of data in support of
management's decision-making process."
W.H. Inmon
"The data warehouse is where we publish
used data."
Ralph Kimball

Data Warehouse
What is it
Centralized location for data
Single source of truth or
Single source of Facts
Source of data for reporting,
analytics, and offline operational
processes

Who is it
Capital EDW:
Primary: Teradata, Oracle Exadata, IBM Pure Systems, ...
Secondary: HP Vertica, Pivotal Greenplum

Data warehouse: SQL Server, MySQL, Oracle, ...

Proprietary and Confidential

Datamarts
What are they
Databases used to provide fast,
independent access to a subset of data
Often created for departments, projects, users, ...

Comparison to data warehouse

Similar technology
Subset of data
Relieves pressure on EDW
Provides sandbox for analysis /
analysts


Data sources
Traditional
OLTP databases: Oracle, Sybase, DB2, SQL Server, MySQL, Postgres, ...
Enterprise applications: ERP, CRM, HR, ...
Traditional third-party data: consumer databases, stock trade data, ...

Non-traditional
Web applications: website applications, mobile applications, ...
New third-party data: API data, Twitter, Facebook, Segment, weather, ...
Other: sensors, devices, ...


Transformation (ETL)
What is it
Getting data from source form
into a standard, clean,
normalized form

How it gets done


Third-party tools
Custom home-grown scripts
Hadoop
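
As a rough illustration of what a "custom home-grown script" does, the Python sketch below maps records from two source layouts into one standard, clean form. The source names (crm, web), field names, and date formats are all hypothetical and chosen only for the example; a real ETL job would also handle bad records, logging, and loading.

```python
from datetime import datetime

def normalize_customer(record: dict, source: str) -> dict:
    """Map one raw source record to a standard, clean target form."""
    if source == "crm":  # hypothetical source layout
        name = record["FullName"]
        signup = datetime.strptime(record["SignupDate"], "%m/%d/%Y")
    elif source == "web":  # hypothetical source layout
        name = f'{record["first"]} {record["last"]}'
        signup = datetime.strptime(record["created"], "%Y-%m-%d")
    else:
        raise ValueError(f"unknown source: {source}")
    return {
        # collapse stray whitespace and standardize capitalization
        "customer_name": " ".join(name.split()).title(),
        # store every date in ISO format regardless of source format
        "signup_date": signup.date().isoformat(),
        "record_source": source,
    }

rows = [
    normalize_customer({"FullName": "  john   SMITH ", "SignupDate": "03/15/2016"}, "crm"),
    normalize_customer({"first": "jane", "last": "doe", "created": "2016-04-01"}, "web"),
]
```

Both records come out with the same column names and formats, which is the point of the transformation step whichever tool performs it.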


Direct Data Mart

[Diagram: Source 1, Source 2, and Source 3 pass through Transformation Routines (ETL) directly into the Sales Data Mart, Financial Data Mart, and Customer Service Data Mart.]

Basic Inmon Architected Data Warehouse


[Diagram: Source 1, Source 2, and Source 3 pass through ETL Routines into the Enterprise Data Warehouse; a second set of ETL Routines loads the Sales Data Mart, Financial Data Mart, and Customer Service Data Mart from the warehouse.]
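
The two-stage flow of this architecture can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the rows, the "area" column, and the function names load_edw and build_mart are invented for illustration and do not come from the slides.

```python
# Hypothetical rows from three source systems (illustrative data only).
source_1 = [{"area": "sales", "customer": "C1", "amount": 120.0}]
source_2 = [{"area": "finance", "customer": "C1", "amount": 80.0}]
source_3 = [{"area": "service", "customer": "C2", "amount": 0.0},
            {"area": "sales", "customer": "C2", "amount": 45.5}]

def load_edw(*sources):
    """First ETL pass: integrate every source into one warehouse table."""
    edw = []
    for system_id, rows in enumerate(sources, start=1):
        for row in rows:
            # tag each row with the system it came from
            edw.append({**row, "record_source": f"source_{system_id}"})
    return edw

def build_mart(edw, area):
    """Second ETL pass: a data mart is a subject-area subset of the EDW."""
    return [row for row in edw if row["area"] == area]

edw = load_edw(source_1, source_2, source_3)
sales_mart = build_mart(edw, "sales")
financial_mart = build_mart(edw, "finance")
customer_service_mart = build_mart(edw, "service")
```

Because every mart is derived from the same integrated table, the sales mart here sees sales rows from both Source 1 and Source 3, which a direct source-to-mart load would have to reconcile separately for each mart.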

Corporate Information Factory


[Diagram: Corporate Information Factory. Operational systems (External, ERP, Internet, Legacy, Other Operational Systems) connect through APIs to Data Acquisition, which loads the Data Warehouse and the Operational Data Store (TrI). Data Delivery supplies the Exploration Warehouse, Data Mining Warehouse, OLAP Data Mart, and Oper Mart (each with DSI). Also shown: Information Workshop (Library & Toolbox, Workbench), Information Feedback, Systems Management, CIF Data Management, Meta Data Management, Data Acquisition Management, Service Management, Change Management, and Operation & Administration.]

2002, Intelligent Solutions, Inc.
Courtesy of Intelligent Solutions, Inc.

DW 2.0™
Next Generation data warehouse architecture from Bill Inmon
Superseded CIF (for some)
Includes more accommodation and integration of meta data
Includes integration of unstructured data


DW 2.0™


Data Vault
Invented and Developed by Daniel Linstedt
New, hybrid modeling for enterprise data
warehousing
Introduced with TDAN articles in 2002
Truly introduces an approach for agile, incremental
DW model development
Called "hyper-normalized" by some
Methodology adapted from Scott Ambler's
Disciplined Agile Delivery (DAD)

Data Vault Definition


The Data Vault is a detail oriented, historical tracking and uniquely
linked set of normalized tables that support one or more functional
areas of business.
It is a hybrid approach encompassing the best of breed between 3rd
normal form (3NF) and star schema. The design is flexible, scalable,
consistent and adaptable to the needs of the enterprise.

Architected specifically to meet the needs
of today's enterprise data warehouses
Dan Linstedt: Defining the Data Vault
TDAN.com Article

Where does a Data Vault Fit?

LearnDataVault.com


Data Vault: 3 Simple Structures

LearnDataVault.com


Standard Data Vault Model


Hub: List of UNIQUE business keys.
Link: List of UNIQUE relationships
Satellite: Historical descriptive data.
[Diagram: example model covering Email Information, Bank Transactions, and Airline Reservations, with hubs keyed on Email ID, Bank ID, and Passenger ID. Hubs are joined by links, and satellites hang off both hubs and links. An annotation reads "Records a history of the interaction".]
** Dashed Line is a possible New Relationship
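
The three structures can be sketched with plain Python collections. This is a simplified illustration, not Data Vault loading code: real hub, link, and satellite tables also carry load dates and record sources on every row and usually a surrogate or hash key, all omitted here. The key values ("P-100", "B-7") and helper names are invented for the example.

```python
from datetime import date

hub_passenger = set()        # Hub: list of unique business keys
hub_bank = set()
link_passenger_bank = set()  # Link: list of unique relationships
sat_passenger = []           # Satellite: historical descriptive data

def load_hub(hub, business_key):
    hub.add(business_key)  # loading a key that already exists is a no-op

def load_link(link, *business_keys):
    link.add(tuple(business_keys))  # one row per unique combination of keys

def load_satellite(sat, business_key, load_date, attributes):
    """Append a new row only when the descriptive data has changed."""
    history = [row for row in sat if row["key"] == business_key]
    if not history or history[-1]["attributes"] != attributes:
        sat.append({"key": business_key, "load_date": load_date,
                    "attributes": attributes})

load_hub(hub_passenger, "P-100")
load_hub(hub_passenger, "P-100")  # duplicate key, hub is unchanged
load_hub(hub_bank, "B-7")
load_link(link_passenger_bank, "P-100", "B-7")
load_satellite(sat_passenger, "P-100", date(2016, 1, 1), {"city": "Denver"})
load_satellite(sat_passenger, "P-100", date(2016, 2, 1), {"city": "Denver"})   # no change, no row
load_satellite(sat_passenger, "P-100", date(2016, 3, 1), {"city": "Boulder"})  # change, new row
```

Adding a new relationship, such as the dashed line in the diagram, would mean creating one more link set; none of the existing hubs or satellites would need to change.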


Data Vault Extensibility


Adding new components to
the EDW has NEAR ZERO
impact to:
Existing Loading
Processes
Existing Data Model
Existing Reporting & BI
Functions
Existing Source Systems
Existing Star Schemas
and Data Marts

(C) LearnDataVault.com


Back in the Real World

What a Data Warehouse Isn't


A panacea
An IT department endeavor alone
Time to avoid user and IT communications
The sure-fire way to reduce overhead and increase company /
department profits
The answer to all decision support and reporting needs
Just a reporting database


Typical DW/BI environment


[Diagram: data sources (OLTP databases, enterprise applications, third-party, web applications, other) feed the EDW through ETL, with Hadoop alongside; datamarts and BI / analytics sit downstream of the EDW.]

Lots of Hybrids
Most organizations mix Inmon & Kimball
ODS feeding Data Marts
Data Marts backed into an EDW
Off the Shelf models customized to work!
Canned BI apps
Oracle BI Apps

Data Vaults inside a CIF


Some using Hadoop for Staging

etc

Example: Hybrid - Original Schema Architecture

[Diagram: MSH EDW. Source(s) of record: HI, FDW / P MS, G2, EDW V1, KDW, Lynx, MU, CI SAS Routines, SFDC, KDW Lite. Layers: COMN Stage, COMN Integration, COMN Presentation. Other labels: Star Schema(s), Data Marts, Reporting, TBLU, BOBJ, Web.
Annotations: "Insert 1X only"; "Enterprise business key model with key mapping pointers to COMN_STG data"; "Full copies of source data structures with additional plumbing fields to facilitate CDC capturing subsequent data changes over time"; "JIT Transformation (Virtual v. Physical)".]

Hoped for Schema Architecture (Parallel Loads)


[Diagram: MSH EDW. Source(s) of record: HI, FDW / P MS, G2, EDW V1, KDW, Lynx, SFDC, KDW Lite, MU, CI SAS Routines. Inside the EDW: HI Stage, HI Presentation, FIN Stage, FIN Presentation, COMN Stage, COMN Integration, COMN Validation, COMN Presentation. Other labels: HI, FIN, CLIN, MKTG. Output: BOBJ / BI / Reporting.]

Actual Schema Architecture


[Diagram: MSH EDW. Source(s) of record: HI, FDW / P MS, G2, EDW V1, KDW, Lynx, SFDC, KDW Lite, MU, CI SAS Routines. Inside the EDW: HI Stage, HI Presentation, FIN Stage, FIN Presentation, COMN Stage, COMN Integration, COMN Validation (DQ), COMN Presentation. Other labels: HI, FIN, CLIN, MKTG. Output: BOBJ / BI / Reporting.]

The Future

Today's realities
Datamarts, EDW, Hadoop

Data diversity: external data, machine-generated data, streaming data
Complexity: complex systems, data pipelines, data silos
Barriers to analytics: incomplete data, slow time to access, performance and concurrency barriers


Current architectures can't keep up

Data Warehousing
Complex: manage hardware, data distribution, indexes, ...
Limited elasticity: forklift upgrades, data redistribution, downtime
Costly: overprovisioning, significant care & feeding

Hadoop
Complex: specialized skills, new tools
Limited elasticity: data redistribution,
resource contention
Not a data warehouse: batch-oriented,
limited optimization, incomplete security


Next Generation Extended Data Warehouse Architecture (XDW)

[Diagram: components shown are analytic tools & applications; an investigative computing platform; the traditional EDW environment; an operational real-time environment with an RT analysis platform and RT BI services; a data integration platform; and a data refinery. Inputs shown are operational systems, other internal & external structured & multi-structured data, and real-time streaming data.]

Slide created by Colin White, BI Research, Inc.
Copyright Intelligent Solutions, Inc. 2015. All Rights Reserved. Used by Permission

What we need to solve for


Cost Containment!

More data all the time & more complexity


Hard to keep up infrastructure & skills

Quicker time to delivery


See the data sooner!

Elasticity

On demand resources
True grid utility computing

Security


New possibilities with the cloud


More & more data born in the cloud
Natural integration point for data
Low-cost, scalable storage
Capacity on demand


What is Snowflake?

All-new SQL data warehouse
No legacy code or constraints

Delivered as a service
Infrastructure, resiliency, optimization built in

Designed for the cloud
Running in Amazon Web Services

Our vision:
Reinvent the Data Warehouse
Data Warehousing for Everyone
Data scientists, SQL users & tools

SQL relational database
Optimized storage & processing
Standard connectivity: BI, ETL, ...

Existing SQL skills and tools
"Load and go" ease of use
Cloud-based elasticity to fit any scale

Brings together diverse data


Structured data (e.g. CSV)

Apple    101.12   250   FIH-2316
Pear     56.22    202   IHO-6912
Orange   98.21    600   WHQ-6090

Semi-structured data (e.g. JSON, Avro, XML)

{
  "firstName": "John",
  "lastName": "Smith",
  "height_cm": 167.64,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "office", "number": "646 555-4567" }
  ]
}

Optimized storage
Flexible schema
Relational processing
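
To make "flexible schema" plus "relational processing" concrete, the sketch below takes the JSON record from this slide and flattens its nested phone numbers into rows that can be filtered like a table. It uses only the Python standard library and is a generic illustration of the idea, not a description of how Snowflake stores or queries semi-structured data; the output column names are invented for the example.

```python
import json

raw = """
{
  "firstName": "John",
  "lastName": "Smith",
  "height_cm": 167.64,
  "address": {"streetAddress": "21 2nd Street", "city": "New York",
              "state": "NY", "postalCode": "10021-3100"},
  "phoneNumbers": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "office", "number": "646 555-4567"}
  ]
}
"""

doc = json.loads(raw)

# Flatten the nested array: one relational-style row per phone number,
# repeating the parent attributes on every row.
phone_rows = [
    {
        "first_name": doc["firstName"],
        "last_name": doc["lastName"],
        "city": doc["address"]["city"],
        "phone_type": phone["type"],
        "number": phone["number"],
    }
    for phone in doc["phoneNumbers"]
]

# Once flattened, ordinary filtering applies, as a WHERE clause would.
office_numbers = [row["number"] for row in phone_rows
                  if row["phone_type"] == "office"]
```

No schema was declared before loading the record; the structure was discovered from the data itself and only then projected into rows.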

Designed for the cloud

Low-cost, scalable cloud storage: never worry about sizing for storage again
Elastic compute, on demand: exact amount of compute needed, exactly when needed
Optimized for diverse data: load and optimize semi-structured + structured data without transformation
Software as a service: no knobs, tuning, or infrastructure management


A new architecture:
multi-cluster, shared data
Standard interfaces
Cloud services layer
coordinates across service
Independent compute
clusters access data
Data centralized in enterprise-
class cloud storage

Enabling multi-dimensional scaling


Elastic scaling for storage
Low-cost cloud storage, fully
replicated and resilient
Elastic scaling for compute
Virtual warehouses scale up &
down on the fly to support
workload needs
Elastic scaling for concurrency
Scale concurrency using
independent virtual warehouses

[Diagram: independent virtual warehouses labeled Marketing, Loading / ETL, Finance, Sales, Operations, and Test / Dev.]

Delivered as a service:
no infrastructure, knobs, or tuning
Infrastructure management: virtual hardware and software managed by Snowflake
Data storage management: adaptive data distribution, automatic compression, automatic optimization
Metadata management: automatic statistics collection, scaling, and redundancy
Manual query optimization: dynamic optimization, parallelization, and concurrency management

Fits with existing tools & processes


EDW, Datamarts, Hadoop

Data Diversity Challenges: external data, machine-generated data, streaming data
Barriers to Analysis: analysis limited by incomplete data, delays in access, performance limitations
Complex Data Infrastructure: complex systems, data pipelines, data silos


Conclusions?

What Have We Learned Over The Years?


Need results soon
Multi-year projects are no longer acceptable

Executive buy-in ($$$)


Build incrementally, test, refactor
Get user feedback RIGHT AWAY!
Avoid over analysis
You will learn as you go


Critical Success Factors


A data warehouse will be considered a success if it:
Can be loaded in a timely manner
Regardless of the data type or source

Can be accessed in an easy fashion


By both data scientists and business users

Can be understood by the business community


Is recognized as bringing value to the decision making
process
For an acceptable TCO

An Option to Consider
Snowflake is:
a team of accomplished data experts
Funded by top-tier VCs including Altimeter Capital, Redpoint Ventures,
Sutter Hill Ventures, Wing VC

who have developed a completely new data warehouse designed for the cloud

Data warehouse as a service


Multidimensional elasticity
Support for all business data including semi-structured
Compelling price:performance

SHAMELESS PLUG:
Available on Amazon.com
http://www.amazon.com/Better-Data-Modeling-Introduction-Engineering-ebook/dp/B018BREV1C/


Contact Information
Kent Graziano
Snowflake Computing

Kent.graziano@snowflake.net
On Twitter @KentGraziano

More info at
http://snowflake.net
Visit my blog at
http://kentgraziano.com
