Session Objectives
- Overview of Data Warehousing
- Data Warehouse Architectures
- How to create a data warehouse
- How to design a data warehouse
- Understand the ETL process
- What is metadata
- How to administer a data warehouse
These are the backbone systems of any enterprise, such as order entry, inventory, and so on.
Classic examples are airline reservations, credit-card authorizations, and ATM withdrawals.
Data: Informational data is distinctly different from operational data in its structure and content.
Processing: Informational processing is distinctly different from operational processing in its characteristics and use of data.
Management receives information, but... What took so long? And how do I know it's right?
Tactical Information
[Diagram: an OLTP server linking order quantity, production quantity, and transported quantity to an inventory control system]

- Supports day-to-day control operations
- Transaction processing
- High-performance operational systems
- Fast response time
- Initiates immediate action
Copyright 2004, Cognizant Academy, All Rights Reserved
Strategic Information
[Diagram: operational areas (Production & Inventory, Marketing, Finance, Payroll) feeding strategic information]
- Understand business issues
- Analyze trends and relationships
- Analyze problems
- Discover business opportunities
- Plan for the future
[Diagram: operational data supplies tactical information and, via periodic refresh into the warehouse, strategic information]
Operational data helps the organization meet operational and tactical requirements for data, while data warehouse data helps the organization meet strategic requirements for information.
Operational vs. Analytical

Operational:
- Constantly updated
- Minimal redundancy
- Highly detailed data
- Referential integrity
- Supports day-to-day business functions
- Normalized design

Analytical:
- Less frequently updated
- Managed redundancy
- Summarized data
- Historical integrity
- Supports long-term informational requirements
- De-normalized design
Data Warehouse

[Diagram: subject areas such as Customer, Usage, Revenue]

Warehoused data is organized by subject area and is populated from many operational systems.
Application Specific vs. Integrated

- Operational: application specific; applications and their databases were designed and built separately over long periods of time, and evolved
- Warehouse: integrated; designed (or architected) at one time, implemented iteratively over short periods of time
Data warehouse: Load/Update, not constant change

- The warehouse is refreshed through periodic loads and updates, rather than being updated constantly the way operational systems are
- Does NOT mean the data warehouse is never updated or never changes!!
- Separate DSS database
- Storage of data only; no data is created
- Integrated and scrubbed data
- Historical data
- Read only (no recasting of history)
- Various levels of summarization
- Metadata
- Subject oriented
- Easily accessible
Benefits to Business
- Understand business trends
- Make better forecasting decisions
- Bring better products to market in a timely manner
- Analyze daily sales information and make quick decisions
- A solution for maintaining your company's competitive edge
Data Mart
Different Approaches for Implementing Data Marts
Q: When is a data warehouse not a data warehouse? A: When it's an unarchitected collection of data marts.
The downsides of non-architected data marts are:
1. Multiple extraction processes
2. Multiple business rules
3. Multiple semantics
4. Extremely challenging to integrate
Source systems
Enterprise Data Warehouse
Easy to do, but not architected:
? Are the extracts, transformations, integrations, and loads consistent?
? Is the redundancy managed?
? What is the impact on the sources?
- Architected
- Data and results consistent
- Redundancy is managed
- Detailed history available for drilldown
- Metadata is consistent!
ODS Definition
The ODS is defined to be a structure that is:
- Integrated
- Subject oriented
- Volatile, where updates can be done
- Current valued, containing data that is a day or perhaps a month old
- Detailed data only
Need: to obtain a system of record that contains the best data that exists in a legacy environment, as a source of information.
- ODS data resolves data integration issues
- Data is physically separated from the production environment to insulate it from the processing demands of reporting and analysis
- Access to current data is facilitated
ODS
Tactical Analysis
(e.g. order capture)
- Data from heterogeneous sources
- Does not store summary data
- Contains current data
ODS Benefits
- Integrates the data
- Synchronizes the structural differences in data
- High transaction performance
- Serves both the operational and DSS environments
- Transaction-level reporting on current data
[Diagram: flat files (e.g. "60,5.2,JOHN", "72,6.2,DAVID") loaded into the operational data store, a relational database]
ODS Data
ODS:
- Update schedule: daily or more frequently
- Detail data mostly covering the last 30 to 90 days
- Addresses operational needs

Data warehouse:
- Update schedule: weekly or less frequently
- Potentially infinite history
- Addresses strategic needs
OLAP
What is OLAP?
- OLAP tools are used for analyzing data
- They help users gain insight into the organization's data
- They help users carry out multi-dimensional analysis on the available data
- Using OLAP techniques, users can view the data from different perspectives
- Helps in decision making and business planning
- Converts OLTP data into information
- A solution for maintaining your company's competitive edge
OLAP Terminology
- Drill down and drill up
- Slice and dice
- Multi-dimensional analysis
- What-if analysis
[Architecture diagram: ODS, data warehouse, and data marts accessed by reporting tools, OLAP, and mining through information servers and web browsers, with administration spanning the environment]
- Stores data used for informational analysis
- Presents summarized data to the end user for analysis
- Its design depends on the nature of the operational data, the end-user requirements, and the business
Data Warehouse Layer
Different Approaches for Implementing an Enterprise Data Warehouse
Used for short- and long-term business planning and decision making covering multiple business units.
[Diagram: staging area feeding the enterprise data warehouse]
In a top-down scenario, the entire EDW is architected, and then a small slice of a subject area is chosen for construction.
The downsides to a top-down approach are:
1. The cross-everything nature of an enterprise project
2. Analysis paralysis
3. Scope control
4. Time to market
5. Risk and exposure
Once the EDMA is complete, an initial subject area is selected for the first incremental Architected Data Mart (ADM).
The EDMA is expanded in this area to include the full range of detail required for the design and development of the incremental ADM.
[Diagram: warehouse design and build process: map requirements to the OLTP system, map data sources (including external data), reverse engineering, logical modeling, model refinement, ETL into storage, and delivery to OLAP, mining, and web browsers]
ER Modeling
Entity
Relationship
Example: a Factory entity identified by Factory ID
[Diagram: measures such as Sales Revenue, Cost, Gross Margin, Net Profit, Profitability]

Facts, or measures, are the key performance indicators of an enterprise:
- Factual data about the subject area
- Numeric, summarized
Dimension

[Diagram: Sales Revenue (measure) with its qualifying dimensions]

What was sold? Whom was it sold to? When was it sold? Where was it sold?

- Dimensions put measures in perspective
- The what, when, and where qualifiers to the measures
- Dimensions could be products, customers, time, geography, etc.
Dimension Elements

[Diagram: Geography, Product, Time]

- Components of a dimension
- Represent the natural elements in the business dimension
- Directly related to the dimension
- Facilitate analysis from different perspectives of a dimension
- Often referred to as levels of a dimension
Dimension Hierarchy

[Diagram: time dimension with drill-down and drill-up directions]

- Represents the natural business hierarchy within dimension elements
- Clarifies the drill-up and drill-down directions
- Each element represents a different level of aggregation
- End users may need custom hierarchies
Multi-Dimensional Analysis

[3-D chart: sales plotted by Product (A, B), Time (quarters), and Geography (East, West, North)]

            1st Qtr  2nd Qtr  3rd Qtr  4th Qtr
  East A    20.4     27.4     90.0     20.4
  East B    19.8     26.6     87.3     19.8
  West A    30.6     38.6     34.6     31.6
  West B    29.7     37.4     33.6     30.7
  North A   45.9     46.9     45.0     43.9
  North B   44.5     45.5     43.7     42.6
[Diagram: drill up and drill down across the East, West, and North levels]

- Drill down is the process of requesting more detailed information
- Drill up is the process of summarizing the existing information
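The drill operations above can be sketched in a few lines of Python; the region and quarter figures here are illustrative, not taken from the chart.

```python
# Hypothetical quarter-level sales figures keyed by (region, quarter).
sales = {
    ("East", "Q1"): 20.4, ("East", "Q2"): 27.4,
    ("West", "Q1"): 30.6, ("West", "Q2"): 38.6,
    ("North", "Q1"): 45.9, ("North", "Q2"): 46.9,
}

def drill_up(measures):
    """Summarize quarter-level figures up to region level (drill up)."""
    totals = {}
    for (region, _quarter), value in measures.items():
        totals[region] = round(totals.get(region, 0.0) + value, 1)
    return totals

def drill_down(measures, region):
    """Return the detailed quarter-level rows behind one region (drill down)."""
    return {q: v for (r, q), v in measures.items() if r == region}
```

The same pattern generalizes to any dimension hierarchy: drilling up discards a level of detail, drilling down restores it.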
Dimensional Modeling
- Subject area: what do you want to know about?
- Atomic detail: what level of detail do you need? How fresh do you need it? How far back do you need to know it?
- Facts: key performance indicators, measures
- Dimensions: the perspectives from which to analyze them
Logical Modeling
[Star schema diagram: a central fact table surrounded by Employee, Product, Time, and Customer dimensions]
A star schema is a highly denormalized, query-centric model where the basic premise is that information can be broken into two groups: facts and dimensions.
In a star schema, facts are in a single place (the fact table) and the descriptions (or elements) that lead to those facts are in dimension tables. The star schema is built for simplicity and speed. The assumption behind it is that the database is static, with no updates being performed online.
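A minimal star schema can be sketched with Python's built-in sqlite3 module; the table and column names here are invented for illustration, not taken from the slides.

```python
import sqlite3

# Star schema sketch: one fact table joined to dimension tables by surrogate keys.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_time    (time_key    INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE fact_sales  (product_key INTEGER, time_key INTEGER, revenue REAL);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_time    VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 150.0), (2, 1, 80.0);
""")

# A typical star join: facts live in one place, descriptions in the dimensions.
rows = con.execute("""
    SELECT p.name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.name
""").fetchall()
```

Because every dimension joins straight to the fact table, queries stay simple: one join per perspective, then group and aggregate.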
Snowflake Schema

[Diagram: a fact table with Customer and Time dimensions; the Customer dimension is normalized into City, Region, and Country tables, and the Product dimension into Brand and Color tables]
[Normalized dimension tables, e.g. Employee (emp_code, city_code); City (city_code, cityname, state_code); State (state_code, statename, region_code); Region (region_code, regionname, country_code); Country (country_code, countryname); Color (color_code, color_name)]
Since these changes are smaller in magnitude compared to changes in fact tables, these dimensions are known as slowly growing or slowly changing dimensions.
Overwrite (SCD Type 1): the changed value simply replaces the old one.

Before the change:

  Source                           Target
  Emp id  Name   Email            Emp id  Name   Email
  1001    Shane  Shane@xyz.com    1001    Shane  Shane@xyz.com

After the email changes, the target row is overwritten and the old value (Shane@xyz.com) is lost:

  Source                             Target
  Emp id  Name   Email              Emp id  Name   Email
  1001    Shane  Shane@abc.co.in    1001    Shane  Shane@abc.co.in
Versioning (SCD Type 2 with a version number): each change adds a new row with a new surrogate key and a higher version number.

Initial load; the source row (Emp id 10, Shane, Shane@xyz.com) becomes:

  Target
  PM_PRIMARYKEY  Emp id  Name   Email          PM_VERSION_NUMBER
  1000           10      Shane  Shane@xyz.com  0

After the email changes to Shane@abc.co.in:

  Target
  PM_PRIMARYKEY  Emp id  Name   Email            PM_VERSION_NUMBER
  1000           10      Shane  Shane@xyz.com    0
  1001           10      Shane  Shane@abc.co.in  1

After a second change to Shane@abc.com:

  Target
  PM_PRIMARYKEY  Emp id  Name   Email            PM_VERSION_NUMBER
  1000           10      Shane  Shane@xyz.com    0
  1001           10      Shane  Shane@abc.co.in  1
  1003           10      Shane  Shane@abc.com    2
Flagging (SCD Type 2 with a current flag): the newest row carries the current flag; superseded rows have it cleared.

Initial load; the source row (Emp id 10, Shane, Shane@xyz.com) becomes:

  Target
  PM_PRIMARYKEY  Emp id  Name   PM_CURRENT_FLAG
  1000           10      Shane  1

After the first change, the old row's flag is cleared and a new current row is added:

  Target
  PM_PRIMARYKEY  Emp id  Name   PM_CURRENT_FLAG
  1000           10      Shane  0
  1001           10      Shane  1

After the second change:

  Target
  PM_PRIMARYKEY  Emp id  Name   PM_CURRENT_FLAG
  1000           10      Shane  0
  1001           10      Shane  0
  1003           10      Shane  1
Date ranges (SCD Type 2 with effective dates): each change end-dates the old row and begins a new one.

Initial load:

  Target
  PM_PRIMARYKEY  Emp id  Name   PM_BEGIN_DATE  PM_END_DATE
  1000           10      Shane  01/01/00

After the email changes to Shane@abc.co.in on 03/01/00:

  Target
  PM_PRIMARYKEY  Emp id  Name   PM_BEGIN_DATE  PM_END_DATE
  1000           10      Shane  01/01/00       03/01/00
  1001           10      Shane  03/01/00

After a second change to Shane@abc.com on 05/02/00:

  Target
  PM_PRIMARYKEY  Emp id  Name   PM_BEGIN_DATE  PM_END_DATE
  1000           10      Shane  01/01/00       03/01/00
  1001           10      Shane  03/01/00       05/02/00
  1003           10      Shane  05/02/00
Previous-value column (SCD Type 3): the row is updated in place, and the prior value is moved to a previous-value column; only the most recent history is retained.

Initial load; the source row (Emp id 10, Shane, Shane@xyz.com) becomes:

  Target
  Emp id  Name   Email          PM_EFFECT_DATE
  10      Shane  Shane@xyz.com  01/01/00

After the email changes to Shane@abc.co.in:

  Target
  PM_PRIMARYKEY  Emp id  Name   Email            PM_Prev_ColumnName  PM_EFFECT_DATE
                 10      Shane  Shane@abc.co.in  Shane@xyz.com       01/02/00

After a second change to Shane@abc.com, the previous value is replaced:

  Target
  PM_PRIMARYKEY  Emp id  Name   Email          PM_Prev_ColumnName  PM_EFFECT_DATE
                 10      Shane  Shane@abc.com  Shane@abc.co.in     01/03/00
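The Type 2 versioning behaviour shown above can be sketched in Python; the dictionary-based row layout and the surrogate key handling are simplified assumptions, not an actual ETL tool's implementation.

```python
# SCD Type 2 (versioning) sketch: instead of overwriting, each change appends
# a new row with a new surrogate key and the next version number, so history
# is never lost.
def apply_type2_change(target_rows, emp_id, name, email, next_key):
    """Append a new version row for emp_id; existing history is untouched."""
    versions = [r["version"] for r in target_rows if r["emp_id"] == emp_id]
    target_rows.append({
        "pk": next_key,          # new surrogate primary key for this version
        "emp_id": emp_id,        # natural key, shared by all versions
        "name": name,
        "email": email,
        "version": max(versions, default=-1) + 1,
    })
    return target_rows

rows = []
apply_type2_change(rows, 10, "Shane", "Shane@xyz.com", 1000)
apply_type2_change(rows, 10, "Shane", "Shane@abc.co.in", 1001)
```

Swapping the version field for a current flag or a begin/end date pair gives the other two Type 2 variants described above.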
Conformed Dimensions
Conformed dimensions are those that are consistent across data marts.
They are essential for integrating the data marts into an enterprise data warehouse.
Causal Dimensions

- Causal dimensions can be used to explain why a record exists in a fact table
- Causal dimensions should not change the grain of the fact table
Helper Tables
Helper tables are used when there are multi-valued dimensions, that is, when there is a many-to-many relationship between a fact table and a dimension table. A helper table can be placed between two dimension tables, or between a dimension table and a fact table.
Surrogate Keys
- Joins between fact and dimension tables should be based on surrogate keys
- Surrogate keys should not be composed of natural keys glued together
- Users should not be able to obtain any information by looking at these keys
- These keys should be simple integers
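A surrogate-key generator following these rules might look like this sketch; the KeyGenerator class and the natural-key format are invented for illustration.

```python
import itertools

# Surrogate-key sketch: keys are plain integers that carry no business
# meaning, so natural-key changes never ripple into the warehouse.
class KeyGenerator:
    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._assigned = {}            # natural key -> surrogate key

    def key_for(self, natural_key):
        """Return the existing surrogate key, or assign the next integer."""
        if natural_key not in self._assigned:
            self._assigned[natural_key] = next(self._counter)
        return self._assigned[natural_key]

gen = KeyGenerator()
k1 = gen.key_for(("CUST", "A-1001"))   # hypothetical natural key
k2 = gen.key_for(("CUST", "A-1002"))
```

In practice the natural-to-surrogate mapping would live in a lookup table in the warehouse, but the principle is the same: the warehouse joins only on the meaningless integer.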
- Keys may be reused after they have been purged, even though they are used in the warehouse
- A product description or a customer description could be changed without changing the key
- Key formats may be generalized to handle some new situation
- A mistake could be made and a key could be reused
ETL: Extraction, Transformation & Loading
What is ETL?
ETL (Extraction, Transformation and Loading) is the process by which data is integrated and transformed from the operational systems into the data warehouse environment.
[Diagram: operational systems feed filters and extractors; data then passes through a cleanser (applying cleaning rules), a transformation engine, an integrator, and a loader into the warehouse]
Extraction
[Diagram: extraction from heterogeneous sources, e.g. data from Oracle tables (30 of 80), Sybase tables (50), and flat files]
Transformation
Source:

  Emp id  Last Name  First Name
  10001   Jones      Indiana
  10002   Holmes     Sherlock

Staging area: Name = Concat(First Name, Last Name)

  Indiana Jones
  Sherlock Holmes
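The staging-area transformation above can be sketched in Python; the dictionary-based row format is an assumption for illustration.

```python
def concat_name(row):
    """Staging-area transformation: Name = Concat(First Name, Last Name)."""
    return {"emp_id": row["emp_id"],
            "name": f'{row["first_name"]} {row["last_name"]}'}

# Source rows as extracted from the operational system.
source = [
    {"emp_id": 10001, "last_name": "Jones", "first_name": "Indiana"},
    {"emp_id": 10002, "last_name": "Holmes", "first_name": "Sherlock"},
]

# Apply the transformation row by row, as a staging step would.
staged = [concat_name(r) for r in source]
```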
Loading
[Diagram: source data is loaded either directly or via the staging area, where clean, transformed and integrated data is prepared for the data warehouse]
Data Marts
[Diagram: 60 to 80% of the work is here; access & analysis; resource scheduling & distribution]
Extraction Types
Full Extract

[Diagram: the entire data set is extracted from the source each time; the newly extracted data replaces the existing data in the data mart]
Incremental Extract

[Diagram: only the incremental data (new and changed rows in the source system) is extracted and applied to the existing data in the data mart]
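An incremental extract can be sketched as a filter on a last-modified date; the last_modified column and the dates below are assumptions, since real systems may rely on timestamps, log scraping, or change-data-capture instead.

```python
from datetime import date

def incremental_extract(source_rows, last_extract_date):
    """Keep only rows added or changed since the last extract."""
    return [r for r in source_rows if r["last_modified"] > last_extract_date]

# Illustrative source rows.
source = [
    {"id": 1, "last_modified": date(2024, 1, 10)},   # unchanged since last run
    {"id": 2, "last_modified": date(2024, 2, 1)},    # changed
    {"id": 3, "last_modified": date(2024, 2, 3)},    # new
]

delta = incremental_extract(source, date(2024, 1, 31))
```

Only the delta (rows 2 and 3) would be shipped to the data mart, which is what makes incremental extracts far cheaper than full extracts on large sources.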
Transformation
Data Transformation
Conversions
- Data type (e.g. Char to Date)
- Bringing data to common units (currency, measuring units)

Classifications
- Changing continuous values to discrete ranges (e.g. temperatures to temperature ranges)

- Splitting of fields
- Merging of fields
- Aggregations (e.g. Sum, Avg, Count)
- Derivations (percentages, ratios, indicators)
Structural Transformations
[Diagram: additive orders arriving every two minutes in the OLTP system are aggregated, or averaged, before loading]
Format transformation

[Example: age stored as the string "32" in the source schema is converted to the number 32 in the target schema]

Splitting

[Example: the date 15-10-1992 in the source schema is split into Day, Month, and Year fields]
Simple Conversions

  Source      Transformation      Target
  Rs. 10000   multiply by 1/43    $232.56   (revenue in rupees to revenue in dollars)
  1000 lbs.   multiply by 0.4536  453.6 kg
Classification (Grouping)

  Name              Age
  John Black        27
  Richard Wayne     53
  Jennifer Goldman  45
  Helmut Koch       37
  Anna Ludwig       32
  Shito Maketha     28
  Tracy Withman     39
  Ada Zhesky        25
  David Rosenberg   33
  Pankaj Sharma     29
  Zhu Ling          44
  George Kurtz      27
  Rita Hartman      34

  Age Group  Frequency
  20-25      1
  26-30      4
  31-35      3
  36-40      2
  41-45      2
  46-50      1
  51-55      1
  56-60      0
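The classification step can be sketched in Python; the bin edges follow the slide's age groups, and the counts below are computed from the listed ages rather than copied from the slide.

```python
from collections import Counter

# Age groups from the slide; continuous ages become discrete ranges.
BINS = [(20, 25), (26, 30), (31, 35), (36, 40), (41, 45), (46, 50), (51, 55)]

def classify(age):
    """Map a continuous age onto a discrete range label such as '26-30'."""
    for low, high in BINS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "other"

ages = [27, 53, 45, 37, 32, 28, 39, 25, 33, 29, 44, 27, 34]

# Frequency per range: the grouping step of the transformation.
freq = Counter(classify(a) for a in ages)
```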
- Aggregates must be stored in their own fact tables, and each level of aggregation should have its own fact table
- Dimension tables attached to the aggregate fact tables should, wherever possible, be shrunken versions of the dimension tables attached to the base fact table
- The base fact table and all of its related aggregate fact tables must be associated together as a family of schemas
Loading
[Diagram: source data passes through data staging into the data warehouse]

Load types:
- Insert
- Full replace
- Selective replace
- Update
- Update plus retain history
Source data: new data, or a point-in-time snapshot (e.g. monthly); the new data is added to the existing data.
Changed data
- A history of changes needs to be maintained
- Changed data alone needs to be identified
- Changed data should be easy to access
- Reconstruction of the dimension table at any point in time should be easy
What is Metadata?
- Data about the data and the processes
- Metadata is stored in a data dictionary and repository
- Insulates the data warehouse from changes in the schema of operational systems
- Identifies the contents and location of data in the data warehouse
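A metadata repository can be sketched as a simple mapping from warehouse tables to their sources and load rules; every table, column, and schedule name here is invented for illustration.

```python
# Minimal metadata repository sketch: data about the data. Each entry records
# where a warehouse table comes from, how often it refreshes, and how its
# columns were derived.
metadata = {
    "fact_sales": {
        "source": "orders_db.order_lines",
        "refresh": "daily",
        "columns": {"revenue": "order_lines.amount * exchange_rate"},
    },
}

def lineage(table, column):
    """Answer 'where did this column come from?' from the repository."""
    return metadata[table]["columns"][column]
```

Real repositories add much more (owners, security, archive schedules), but even this shape shows how metadata insulates users from the operational schemas behind the warehouse.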
Document system
Know what data you have, and you can trust it!
Source metadata: stores information about the source data and the mapping of source data to data warehouse data.
Processing information: stores information about the activities involved in the processing of data, such as scheduling and archiving.
End-user information: records information about user profiles and security.
Dormant Data
- Data that is hardly ever used in a data warehouse is called dormant data
- The faster a data warehouse grows, the more data becomes dormant
- Over time, the amount of dormant data in a data warehouse increases