Content
1 An Overview of Data Warehouse
2 Data Warehouse Architecture
3 Data Modeling for Data Warehouse
4 Overview of Data Cleansing
6 Metadata Management
7 OLAP
8 Data Warehouse Testing
An Overview
Understanding What is a Data Warehouse
Components of Warehouse
Source Tables: Real-time, volatile data held in relational databases for transaction processing (OLTP). Sources can be any relational database or flat file.
ETL Tools: Extract, cleanse, transform (aggregate, join) and load the data from the sources to the target.
Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
Modeling Tools: Used to design the data warehouse for high performance with the dimensional data modeling technique, and to map the source files to the target files.
Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting.
End-user tools for analysis and reporting: Produce reports and analyze the data in the target tables. Various querying, data mining and OLAP tools are used for this purpose.
The warehouse has a staging area, where the data is loaded and tested after cleansing and transformation. From there it is loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis.
Data Modeling
Effective way of using a Data Warehouse
Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.
E-R (Entity-Relationship) Data Model
Entity: An object that can be observed and classified by its properties and characteristics, such as an employee, a book or a student.
Relationship: An association that relates entities to other entities.
Star Schema

Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
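The star schema above can be tried out directly. The following is a minimal sketch, assuming an in-memory SQLite database; the column types and the helper name `sales_by_store_city` are illustrative choices, not part of the original slides. It loads the four tables and answers a typical warehouse question by joining the fact table to one dimension.

```python
import sqlite3

# Build the star schema in an in-memory SQLite database (types are assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale (
    orderId TEXT,
    date    TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT    REFERENCES product(prodId),
    storeId TEXT    REFERENCES store(storeId),
    qty     INTEGER,
    amt     INTEGER
);
""")

# Rows copied from the slide tables.
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [(53, "joe", "10 main", "sfo"),
                  (81, "fred", "12 main", "sfo"),
                  (111, "sally", "80 willow", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                  ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                  ("105", "3/8/97", 111, "p1", "c3", 5, 50)])


def sales_by_store_city(connection):
    """Total quantity and amount per store city: one join from fact to dimension."""
    return connection.execute("""
        SELECT st.city, SUM(s.qty), SUM(s.amt)
        FROM sale AS s
        JOIN store AS st ON st.storeId = s.storeId
        GROUP BY st.city
        ORDER BY st.city
    """).fetchall()


totals = sales_by_store_city(conn)
```

Every dimension is one join away from the fact table, which is what keeps queries of this kind short.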
Snowflake Schema

Dimension Table: store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

Dimension Table: sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Dimension Table: city
cityId  pop  regId
sfo     1M   north
la      5M   south
Star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval matters more than the efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized and are frequently designed at a level of normalization short of third normal form.
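To make the difference from a star schema concrete, here is a small sketch, again assuming SQLite. It uses the store, sType and city rows of the snowflake example; the column types and the function name `store_profile` are assumptions. Because the store dimension is split into sub-dimensions, describing a store takes two extra joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sType (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
CREATE TABLE city  (cityId TEXT PRIMARY KEY, pop TEXT, regId TEXT);
CREATE TABLE store (
    storeId TEXT PRIMARY KEY,
    cityId  TEXT REFERENCES city(cityId),
    tId     TEXT REFERENCES sType(tId),
    mgr     TEXT
);
""")
conn.executemany("INSERT INTO sType VALUES (?, ?, ?)",
                 [("t1", "small", "downtown"), ("t2", "large", "suburbs")])
conn.executemany("INSERT INTO city VALUES (?, ?, ?)",
                 [("sfo", "1M", "north"), ("la", "5M", "south")])
conn.executemany("INSERT INTO store VALUES (?, ?, ?, ?)",
                 [("s5", "sfo", "t1", "joe"),
                  ("s7", "sfo", "t2", "fred"),
                  ("s9", "la", "t1", "nancy")])


def store_profile(connection):
    """Resolve each store's size and region through the normalized sub-dimensions."""
    return connection.execute("""
        SELECT s.storeId, t.size, c.regId
        FROM store AS s
        JOIN sType AS t ON t.tId = s.tId
        JOIN city  AS c ON c.cityId = s.cityId
        ORDER BY s.storeId
    """).fetchall()


profiles = store_profile(conn)
```

In a star schema the size and region columns would sit directly on the store table, trading some redundancy for fewer joins.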
Continuous Monitoring
- Identify & correct cause of defects
- Refine data capture mechanisms at source
- Educate users on importance of DQ
2009 Wipro Ltd - Confidential
ETL Architecture
[Figure: ETL data flow. Source labels: Visitors, Web Browsers, The Internet, Web Server Logs & E-comm Transaction Data, Flat Files. Flow labels: Scheduled Extraction, Staging Area, Scheduled Loading, RDBMS. Stages: Data Collection, Data Extraction, Data Transformation, Data Loading.]
ETL Architecture
Data extraction
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data to another file or database
Data transformation
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data
Data loading
- Initial and incremental loading
- Updating metadata
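The three stages listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the source rows, the field names, the `STATUS_CODES` mapping and the dictionary that stands in for the target table are all hypothetical, and a real ETL tool would add logging, scheduling and error handling.

```python
from datetime import date

# Hypothetical source rows; field names and status codes are assumptions.
SOURCE_ROWS = [
    {"order_id": "o100", "status": "S", "qty": 1, "unit_price": 12.0},
    {"order_id": "o101", "status": "X", "qty": 4, "unit_price": 3.0},
    {"order_id": "o102", "status": "S", "qty": 2, "unit_price": 5.5},
]
STATUS_CODES = {"S": "SHIPPED", "X": "CANCELLED"}


def extract(rows, wanted_status="S"):
    """Extraction: apply a selection criterion and keep the qualifying rows."""
    return [row for row in rows if row["status"] == wanted_status]


def transform(rows, load_date):
    """Transformation: change codes, add a time attribute, compute a derived value."""
    return [
        {
            "order_id": row["order_id"],
            "status": STATUS_CODES[row["status"]],     # changing codes
            "load_date": load_date.isoformat(),        # adding a time attribute
            "amount": row["qty"] * row["unit_price"],  # calculating a derived value
        }
        for row in rows
    ]


def load(target, rows):
    """Loading: insert new keys and overwrite existing ones (incremental load)."""
    for row in rows:
        target[row["order_id"]] = row
    return len(rows)


warehouse = {}  # stands in for the target table
staged = transform(extract(SOURCE_ROWS), date(2013, 5, 16))
loaded = load(warehouse, staged)
```

Running `load` on an empty target corresponds to the initial load; running it again with later extracts corresponds to incremental loading.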
Why ETL?
Companies have valuable data spread throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file or an Excel spreadsheet.
ETL Tools
- Provide a facility to specify a large number of transformation rules through a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
ETL Tools - Second-Generation
- PowerCenter/PowerMart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential
Metadata Management
What Is Metadata?
Metadata is information...
- That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools, that improves both business and technical understanding of data and data-related processes
Importance Of Metadata
Locating Information
- Time spent looking for information
- How often is the information found?
- What poor decisions were made based on incomplete information?
Consumers of Metadata
Technical Users
- Warehouse administrator
- Application developer
Business Users (business metadata)
- Meanings
- Definitions
- Business rules
Software Tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
OLAP
Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools
5/16/2013
OLAP System
- Consolidation data; OLAP data comes from the various OLTP databases
- Purpose of data: decision support
- Multi-dimensional views of various kinds of business activities
- Periodic long-running batch jobs refresh the data
MDDB Concepts
A multidimensional database is a software system designed for efficient and convenient storage and retrieval of data that is closely related and is stored, viewed and analyzed from different perspectives (dimensions).
- A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions.
- Individual items within each dimension are called members.
MDDB
[Figure: Sales Volumes cube with axes MODEL (Mini Van, Sedan), DEALERSHIP and COLOR]
27 x 4 = 108 cells
3 x 3 x 3 = 27 cells
Sparsity
- Input data in applications is typically sparse
- Increases with the number of dimensions
Data Explosion
- Due to sparsity
- Due to summarization
Performance
- Does not perform better than an RDBMS at high data volumes (>20-30 GB)
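The cell counts and the sparsity point can be checked with a little arithmetic. The sketch below is illustrative only: it assumes that "27 x 4 = 108 cells" refers to storing the same cube as a relational table with one row per combination and four columns (three dimension columns plus the measure), and the populated-cell counts used for sparsity are hypothetical.

```python
from math import prod


def cell_count(dimension_sizes):
    """Cells in a hypercube: the product of the member counts of its dimensions."""
    return prod(dimension_sizes)


def relational_cells(dimension_sizes):
    """Cells in an equivalent flat table: one row per combination,
    one column per dimension plus one for the measure (an assumed layout)."""
    return cell_count(dimension_sizes) * (len(dimension_sizes) + 1)


def sparsity(populated_cells, dimension_sizes):
    """Fraction of cube cells that hold no data."""
    return 1 - populated_cells / cell_count(dimension_sizes)


cube_cells = cell_count([3, 3, 3])         # 3 x 3 x 3
table_cells = relational_cells([3, 3, 3])  # 27 x 4

# Hypothetical: the same 9 facts spread over a cube with one more dimension.
sparsity_3d = sparsity(9, [3, 3, 3])
sparsity_4d = sparsity(9, [3, 3, 3, 4])
```

With the number of facts held constant, adding a dimension multiplies the number of cells, so the empty fraction grows. That is the sense in which sparsity increases with dimensions.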
Employee table
LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

Sales Volumes (MODEL by COLOR)
          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

[Figure: the employee data drawn as an array with LAST NAME (Smith, Regan, Fox, Weld, Kelly, Link, Kranz, Lucas, Weiss) on one axis and EMPLOYEE # on the other]
OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data
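Slicing and dicing are easy to picture on a small cube. The following sketch uses the Sales Volumes figures for MODEL by COLOR that appear in this deck; the dictionary layout and the function names `slice_by_color` and `dice` are illustrative assumptions, not the API of any OLAP tool.

```python
# Sales Volumes keyed by (model, color), taken from the deck's MODEL by COLOR grid.
SALES = {
    ("Mini Van", "Blue"): 6, ("Mini Van", "Red"): 5, ("Mini Van", "White"): 4,
    ("Coupe", "Blue"): 3,    ("Coupe", "Red"): 5,    ("Coupe", "White"): 5,
    ("Sedan", "Blue"): 4,    ("Sedan", "Red"): 3,    ("Sedan", "White"): 2,
}


def slice_by_color(cube, color):
    """Slice: fix one member of one dimension and keep the remaining dimension."""
    return {model: value for (model, c), value in cube.items() if c == color}


def dice(cube, models, colors):
    """Dice: keep the sub-cube restricted to chosen members on each dimension."""
    return {
        (model, color): value
        for (model, color), value in cube.items()
        if model in models and color in colors
    }


blue_slice = slice_by_color(SALES, "Blue")
sub_cube = dice(SALES, {"Mini Van", "Coupe"}, {"Red", "White"})
```

A slice reduces the number of dimensions by one; a dice keeps every dimension but narrows the members on each.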
View #1 (MODEL down, COLOR across)
          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

( ROTATE 90° )

View #2 (COLOR down, MODEL across)
       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2
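For a two-dimensional view, rotation amounts to swapping the two axes. The sketch below treats View #1 as a list of rows (one row per model, one column per color) and produces View #2 by transposing it; the variable and function names are illustrative.

```python
MODELS = ["Mini Van", "Coupe", "Sedan"]
COLORS = ["Blue", "Red", "White"]

# View #1: rows follow MODELS, columns follow COLORS.
VIEW_1 = [
    [6, 5, 4],  # Mini Van
    [3, 5, 5],  # Coupe
    [4, 3, 2],  # Sedan
]


def rotate(grid):
    """Swap the row and column dimensions of a 2-D view (a transpose)."""
    return [list(column) for column in zip(*grid)]


# View #2: rows follow COLORS, columns follow MODELS.
VIEW_2 = rotate(VIEW_1)
```

No data changes during a rotation; only the arrangement on screen does, which is why rotating twice returns the original view.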
[Figure: six rotated views (View #1 to View #6) of the Sales Volumes cube. Each 90-degree rotation changes which of MODEL, COLOR and DEALERSHIP lies on each axis. Members visible in the figure: Sedan, Mini Van, Blue, Clyde.]
[Figure: Sales Volumes for a subset of the cube. MODEL: Coupe. DEALERSHIP: Carr, Clyde. COLOR: Normal Blue, Metal Blue.]
ORGANIZATION DIMENSION
REGION: Midwest
DISTRICT: Chicago, St. Louis, Gary
DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton
Moving up a hierarchy is referred to as drill-up (or roll-up); moving down is referred to as drill-down.
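Roll-up and drill-down over the organization dimension can be sketched with two lookup tables. Two things here are assumptions made purely for illustration: the assignment of dealerships to districts (the source names the members of each level but not the parent links) and the sales figures, which are hypothetical.

```python
# Assumed parent links; the pairing of dealerships with districts is illustrative.
DISTRICT_OF = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales volume per dealership.
SALES = {"Clyde": 10, "Gleason": 7, "Carr": 4, "Levi": 6, "Lucas": 3, "Bolton": 5}


def roll_up(values, parent_of):
    """Aggregate each member's value into its parent one level up."""
    totals = {}
    for member, value in values.items():
        parent = parent_of[member]
        totals[parent] = totals.get(parent, 0) + value
    return totals


def drill_down(values, parent_of, parent):
    """Show the child-level values that make up one parent member."""
    return {m: v for m, v in values.items() if parent_of[m] == parent}


by_district = roll_up(SALES, DISTRICT_OF)
by_region = roll_up(by_district, REGION_OF)
chicago_detail = drill_down(SALES, DISTRICT_OF, "Chicago")
```

The same `roll_up` function serves every level because each level only needs to know its parent.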
[Figure: OLAP architecture. Components: OLAP Cube, OLAP Calculation Engine, Web Browser, OLAP Tools.]
MOLAP - Features
- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for
  - Maximum query performance
  - Optimum space utilization
[Figure: OLAP architecture on a relational store. Components: Relational DW, SQL, OLAP Calculation Engine, Web Browser, OLAP Tools, OLAP Applications.]
HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user
Architecture Comparison

MOLAP
- Definition: MDDB OLAP = transaction-level data + summary in MDDB
- Data explosion due to sparsity: with good design, 3-10 times
- Data explosion due to summarization: high (may go beyond control; estimation is very important)
- Query execution speed: fast (depends on the size of the MDDB)
- Where to apply: small transactional data + complex model + frequent summary analysis

ROLAP
- Definition: Relational OLAP = transaction-level data + summary in RDBMS
- Data explosion due to sparsity: no sparsity
- Data explosion due to summarization: to the necessary extent
- Query execution speed: slow
- Where to apply: very large transactional data that needs to be viewed / sorted

HOLAP
- Definition: Hybrid OLAP = ROLAP + summary in MDDB
- Data explosion due to sparsity: sparsity exists only in the MDDB part
- Data explosion due to summarization: to the necessary extent
- Query execution speed: optimum; if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP
- Cost: high (RDBMS + disk space + MDDB server cost)
- Where to apply: large transactional data + frequent summary analysis
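The HOLAP behaviour described in the comparison (summary queries served from the MDDB, detail queries from the RDBMS) can be sketched as a simple routing rule. Everything here is an illustrative assumption: SQLite stands in for the relational store, a dictionary of precomputed totals stands in for the MDDB, and the function names are invented for the example.

```python
import sqlite3

# Detail data lives in the relational store (SQLite stands in for the RDBMS).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (model TEXT, color TEXT, qty INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?)", [
    ("Mini Van", "Blue", 6), ("Mini Van", "Red", 5), ("Mini Van", "White", 4),
    ("Coupe", "Blue", 3),    ("Coupe", "Red", 5),    ("Coupe", "White", 5),
    ("Sedan", "Blue", 4),    ("Sedan", "Red", 3),    ("Sedan", "White", 2),
])

# Precomputed summary per model; this dictionary stands in for the MDDB part.
summary_by_model = dict(
    conn.execute("SELECT model, SUM(qty) FROM sale GROUP BY model")
)


def total_for_model(model):
    """Summary question: answered from the precomputed cube (MOLAP-like path)."""
    return summary_by_model[model]


def detail_for(model, color):
    """Detail question: answered with SQL on the relational store (ROLAP-like path)."""
    return conn.execute(
        "SELECT qty FROM sale WHERE model = ? AND color = ?", (model, color)
    ).fetchall()


coupe_total = total_for_model("Coupe")
coupe_red = detail_for("Coupe", "Red")
```

The caller never needs to know which store answered, which matches the "source of data transparent to end user" feature.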
- Oracle Express Products
- Hyperion Essbase
- Cognos PowerPlay
- Seagate Holos
- SAS
- MicroStrategy DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence
- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)
The methodology required for testing a Data Warehouse is different from testing a typical transaction system
In a data warehouse, most of the testing is system-triggered. Most production/source system testing covers the processing of individual transactions, which are driven by input from users (application form, servicing request, etc.). Very few test cycles cover system-triggered scenarios (such as billing or valuation).
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?
Unit Testing
Unit testing for data warehouses is white-box. It should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:
- Whether the ETLs access and pick up the right data from the right source.
- Whether all data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data.
- Testing the rejected records that do not fulfil the transformation rules.
Unit Testing
Unit testing the report data:
- Verify report data with source: Data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
- Field-level data verification: The QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
- Derivation formulae/calculation rules should be verified.
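Verifying report data against the source usually comes down to recomputing the aggregate from the granular rows and comparing. The sketch below is a minimal illustration: the source rows reuse the sale rows of the star schema example, while the warehouse summary, the tuple layout and the function names are hypothetical.

```python
# Granular source rows: (storeId, prodId, qty, amt).
SOURCE_ROWS = [
    ("c1", "p1", 1, 12),
    ("c1", "p2", 2, 11),
    ("c3", "p1", 5, 50),
]

# Hypothetical aggregate as reported from the warehouse: total amt per store.
WAREHOUSE_SUMMARY = {"c1": 23, "c3": 50}


def aggregate_source(rows):
    """Recompute the report's aggregate (total amt per store) from the source rows."""
    totals = {}
    for store_id, _prod_id, _qty, amt in rows:
        totals[store_id] = totals.get(store_id, 0) + amt
    return totals


def reconcile(rows, summary):
    """Return {store: (expected, reported)} for every store whose totals disagree."""
    expected = aggregate_source(rows)
    return {
        key: (expected.get(key), summary.get(key))
        for key in set(expected) | set(summary)
        if expected.get(key) != summary.get(key)
    }


mismatches = reconcile(SOURCE_ROWS, WAREHOUSE_SUMMARY)
```

An empty result means the report agrees with the source; a missing or wrong total shows up with both the expected and the reported value, which helps when tracing a field back to its source.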
Integration Testing
Integration testing involves the following:
- Sequence of ETL jobs in the batch.
- Initial loading of records into the data warehouse.
- Incremental loading of records at a later date, to verify the newly inserted or updated data.
- Testing the rejected records that do not fulfil the transformation rules.
- Error log generation.
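Several of these checks can be exercised together in one small test. The sketch below is an assumption-laden illustration: the rule "qty must be positive" stands in for a transformation rule, the dictionary stands in for the warehouse table, and `run_batch` is a hypothetical job, not a real ETL tool's interface.

```python
def run_batch(rows, target, error_log):
    """Load rows into the target; reject rows that break the (assumed) rule
    that qty must be a positive number, and record each rejection."""
    loaded = 0
    for row in rows:
        qty = row.get("qty")
        if qty is None or qty <= 0:
            error_log.append(f"rejected {row['order_id']}: qty must be positive")
            continue
        target[row["order_id"]] = row  # insert new key or update existing one
        loaded += 1
    return loaded


warehouse, errors = {}, []

# Initial load.
initial = [{"order_id": "o100", "qty": 1}, {"order_id": "o102", "qty": 2}]
initial_count = run_batch(initial, warehouse, errors)

# Incremental load at a later date: one update, one insert, one bad record.
increment = [
    {"order_id": "o102", "qty": 3},  # updated
    {"order_id": "o103", "qty": 5},  # newly inserted
    {"order_id": "o104", "qty": 0},  # violates the rule
]
increment_count = run_batch(increment, warehouse, errors)
```

After both runs the test can assert on three things at once: the updated and inserted rows, the absence of the rejected row, and the content of the error log.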
Performance Testing
Performance testing should check that the ETL processes complete within the time window.
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.
Questions
Thank You
Components of Warehouse
Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.
170
This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.
171
Data Modeling
Effective way of using a Data Warehouse
172
Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.
Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5
Dimension Table
store storeId c1 c2 c3 city nyc sfo la
Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50
Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la
175
Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy
sType tId t1 t2 city size small large location downtown suburbs regId north south
Dimension Table
cityId pop sfo 1M la 5M
The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
176
177
178
Continuous Monitoring
Identify & Correct Cause of Defects Refine data capture mechanisms at source Educate users on importance of DQ
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
179
180
182
183
ETL Architecture
Visitors
Web Browsers
The Internet
Staging Area
Web Server Logs & E-comm Transaction Data Flat Files
Scheduled Extraction
RDBMS
Scheduled Loading
Data Collection
Data Extraction
Data Transformation
Data Loading
184
ETL Architecture
Data extraction:
- Scans a file or database
- Uses some criteria for selection
- Identifies qualifying data
- Transports the data to another file or database
Data transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data
Data loading:
- Initial and incremental loading
- Updating metadata
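The three stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the flat-file layout (order_id, status_code, qty, unit_price), the status code mapping, the sale_fact table, and the function names extract, transform, and load are all hypothetical and do not come from any particular ETL tool.

```python
import csv
import io
import sqlite3
from datetime import date

# Hypothetical flat-file source; column names are assumptions.
SOURCE_CSV = """order_id,status_code,qty,unit_price
o100,A,1,12.0
o101,X,3,4.0
o102,A,2,5.5
"""

STATUS_NAMES = {"A": "active", "X": "cancelled"}  # used for "changing codes"


def extract(text, wanted_status="A"):
    """Scan the source and keep only rows meeting the selection criterion."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if row["status_code"] == wanted_status]


def transform(rows, load_date):
    """Convert types, change codes, derive amt, and add a time attribute."""
    out = []
    for row in rows:
        qty = int(row["qty"])
        price = float(row["unit_price"])
        out.append((row["order_id"], STATUS_NAMES[row["status_code"]],
                    qty, qty * price, load_date.isoformat()))
    return out


def load(conn, rows):
    """Initial or incremental load; reloading an order replaces its row."""
    conn.execute("""CREATE TABLE IF NOT EXISTS sale_fact (
                        order_id TEXT PRIMARY KEY, status TEXT,
                        qty INTEGER, amt REAL, load_date TEXT)""")
    conn.executemany("INSERT OR REPLACE INTO sale_fact VALUES (?, ?, ?, ?, ?)",
                     rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV), date(2013, 5, 16)))
loaded = conn.execute(
    "SELECT order_id, status, qty, amt, load_date "
    "FROM sale_fact ORDER BY order_id").fetchall()
print(loaded)
```

Keying the target table on order_id and using INSERT OR REPLACE is one simple way to make an incremental load safe to rerun; real tools typically offer richer change-detection options.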
Why ETL?
Companies have valuable data spread throughout their networks that needs to be moved from one place to another. The data resides in all sorts of heterogeneous systems, and therefore in all sorts of formats.
To solve this problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.
ETL Tools
- Provide a facility to specify a large number of transformation rules through a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
ETL Tools - Second-Generation
- PowerCenter/PowerMart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential
Metadata Management
What Is Metadata?
Metadata is information:
- That describes the WHAT, WHEN, WHO, WHERE, and HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools, that improves both business and technical understanding of data and data-related processes
Importance Of Metadata
Locating information:
- How much time is spent looking for information?
- How often is the information found?
- What poor decisions were made based on incomplete information?
Consumers of Metadata
Technical users
- Warehouse administrator
- Application developer
Business users (business metadata)
- Meanings
- Definitions
- Business rules
Software tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
OLAP
Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools
5/16/2013
OLAP System
- Consolidation data; OLAP data comes from the various OLTP databases
- Purpose of data: decision support
- Multi-dimensional views of various kinds of business activities
- Periodic long-running batch jobs refresh the data
MDDB Concepts
A multidimensional database is a software system designed for efficient and convenient storage and retrieval of closely related data that is stored, viewed, and analyzed from different perspectives (dimensions).
- A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions.
- Individual items within each dimension are called members.
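As a rough sketch of these terms, a hypercube can be modelled in Python as a dictionary keyed by one member per dimension. The dimension names and members below follow the car sales example used in this deck, while the stored value and the helper names cell_count, all_cells, and lookup are assumptions for illustration.

```python
from itertools import product

# Each dimension (edge of the cube) maps to its list of members.
dimensions = {
    "model": ["Mini Van", "Coupe", "Sedan"],
    "color": ["Blue", "Red", "White"],
    "dealership": ["Clyde", "Gleason", "Carr"],
}


def cell_count(dims):
    """Number of addressable cells: product of the member counts."""
    count = 1
    for members in dims.values():
        count *= len(members)
    return count


def all_cells(dims):
    """Every coordinate of the cube, one member per dimension."""
    return list(product(*dims.values()))


# The cube stores one measure per coordinate; the value is hypothetical.
cube = {("Sedan", "Blue", "Clyde"): 4}


def lookup(cube_data, model, color, dealership):
    """Return the measure at a coordinate, or None if the cell is empty."""
    return cube_data.get((model, color, dealership))


print(cell_count(dimensions))
print(lookup(cube, "Sedan", "Blue", "Clyde"))
```

Storing only the populated coordinates, as the dictionary does here, is one way to avoid allocating space for empty cells.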
MDDB
[Figure: Sales Volumes cube with axes MODEL (Mini Van, Sedan), DEALERSHIP, and COLOR. Captions: 3 x 3 x 3 = 27 cells; 27 x 4 = 108 cells.]
Sparsity
- Input data in applications is typically sparse
- Increases with increased dimensions
Data Explosion
- Due to sparsity
- Due to summarization
Performance
- Does not perform better than an RDBMS at high data volumes (>20-30 GB)
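The effect of adding dimensions on sparsity can be shown with simple arithmetic. In this sketch the count of 20 populated cells is a hypothetical figure, and sparsity is taken to mean the fraction of cube cells that hold no data, which is an assumed definition for illustration.

```python
def sparsity(filled_cells, dimension_sizes):
    """Fraction of cube cells that are empty."""
    total = 1
    for size in dimension_sizes:
        total *= size
    return 1 - filled_cells / total


# The same 20 populated cells in a 3 x 3 x 3 cube, and again after a
# fourth dimension with 4 members is added (27 x 4 = 108 cells).
three_dims = sparsity(20, [3, 3, 3])
four_dims = sparsity(20, [3, 3, 3, 4])
print(round(three_dims, 3), round(four_dims, 3))
```

With the number of facts unchanged, the empty fraction grows as soon as the cell count is multiplied by the new dimension's member count.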
LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19
Sales Volumes (rows: MODEL, columns: COLOR)
          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2
[Figure: the employee table drawn as a two-dimensional array with LAST NAME (Smith, Regan, Fox, Weld, Kelly, Link, Kranz, Lucas, Weiss) and EMPLOYEE # as axes.]
OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data
View #1 (rows: MODEL, columns: COLOR)
          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2
View #2 (rotated 90 degrees; rows: COLOR, columns: MODEL)
       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2
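For a two-dimensional view, a 90-degree rotation amounts to swapping the row and column dimensions. The sketch below applies that idea to the Sales Volumes figures for MODEL and COLOR; the function name rotate and the list-of-lists representation are assumptions for illustration.

```python
models = ["Mini Van", "Coupe", "Sedan"]
colors = ["Blue", "Red", "White"]

# View #1: one row per model, one column per color.
view1 = [
    [6, 5, 4],  # Mini Van
    [3, 5, 5],  # Coupe
    [4, 3, 2],  # Sedan
]


def rotate(grid):
    """Swap the row and column dimensions (a transpose)."""
    return [list(column) for column in zip(*grid)]


# View #2: one row per color, one column per model.
view2 = rotate(view1)
for color, row in zip(colors, view2):
    print(color, row)
```

Rotating twice returns the original view, which is a quick sanity check that no cell was lost or duplicated.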
[Figure: six views (View #1 to View #6) of the Sales Volumes cube, each obtained by a 90-degree rotation that rearranges the MODEL, COLOR, and DEALERSHIP axes. Visible members include Sedan, Mini Van, Blue, and Clyde.]
[Figure: Sales Volumes for MODEL Coupe, by DEALERSHIP (Carr, Clyde) and COLOR (Normal Blue, Metal Blue).]
ORGANIZATION DIMENSION
REGION: Midwest
DISTRICT: Chicago, St. Louis, Gary
DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton
Moving up a hierarchy is referred to as drill-up or roll-up; moving down is referred to as drill-down.
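Roll-up and drill-down along the organization hierarchy can be sketched as follows. The assignment of dealerships to districts and all sales figures are assumptions made for this example, since the source lists only the member names at each level; the function names roll_up and drill_down are likewise hypothetical.

```python
from collections import defaultdict

# Assumed parent of each member (not given in the source).
DISTRICT_OF = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales volumes at the lowest level of the hierarchy.
sales_by_dealership = {
    "Clyde": 6, "Gleason": 5, "Carr": 4, "Levi": 3, "Lucas": 5, "Bolton": 2,
}


def roll_up(values, parent_of):
    """Aggregate each member's value into its parent (drill-up)."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[parent_of[member]] += value
    return dict(totals)


def drill_down(values, parent_of, parent):
    """Show the child members that make up one parent's total."""
    return {m: v for m, v in values.items() if parent_of[m] == parent}


by_district = roll_up(sales_by_dealership, DISTRICT_OF)
by_region = roll_up(by_district, REGION_OF)
print(by_district)
print(by_region)
print(drill_down(sales_by_dealership, DISTRICT_OF, "Gary"))
```

The same roll_up function works at every level because each level only needs a mapping from a member to its parent.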
[Figure: architecture diagram with components OLAP Cube, OLAP Calculation Engine, Web Browser, and OLAP Tools.]
MOLAP - Features
- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for:
  - Maximum query performance
  - Optimum space utilization
[Figure: architecture diagram with components Relational DW, SQL, OLAP Calculation Engine, Web Browser, OLAP Tools, and OLAP Applications.]
HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user
Architecture Comparison
MOLAP
- Definition: MDDB OLAP = transaction-level data + summary in MDDB
- Data explosion due to sparsity: with good design, 3 to 10 times
- Data explosion due to summarization: high (may go beyond control; estimation is very important)
- Query execution speed: fast (depends upon the size of the MDDB)
- Where to apply: small transactional data + complex model + frequent summary analysis
ROLAP
- Definition: Relational OLAP = transaction-level data + summary in RDBMS
- Data explosion due to sparsity: no sparsity
- Data explosion due to summarization: to the necessary extent
- Query execution speed: slow
- Where to apply: very large transactional data that needs to be viewed / sorted
HOLAP
- Definition: Hybrid OLAP = ROLAP + summary in MDDB
- Data explosion due to sparsity: sparsity exists only in the MDDB part
- Data explosion due to summarization: to the necessary extent
- Query execution speed: optimum; if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP
- Cost: high (RDBMS + disk space + MDDB server cost)
- Where to apply: large transactional data + frequent summary analysis
- Oracle Express Products
- Hyperion Essbase
- Cognos PowerPlay
- Seagate Holos
- SAS

- MicroStrategy DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence
- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)
The methodology required for testing a data warehouse is different from that for testing a typical transaction system.
In a data warehouse, most of the testing is system-triggered. Most production/source system testing covers the processing of individual transactions, which are driven by some input from users (application form, servicing request). Very few test cycles there cover system-triggered scenarios (such as billing or valuation).
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements ambiguous?
- Are the requirements developable?
- Are the requirements testable?
Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:
- Check whether the ETLs are accessing and picking up the right data from the right source.
- Check that all data transformations are correct according to the business rules and that the data warehouse is correctly populated with the transformed data.
- Test the rejected records that do not fulfil the transformation rules.
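A unit test for rejected records might look like the sketch below. The business rule (qty must be a positive integer and amt is derived as qty times unit_price), the field names, and the function run_transformation are hypothetical; a real project would test its own mappings against its own rules.

```python
def run_transformation(records):
    """Apply a hypothetical rule and split input into loaded and rejected."""
    loaded, rejected = [], []
    for record in records:
        try:
            qty = int(record["qty"])
            price = float(record["unit_price"])
        except (KeyError, ValueError, TypeError):
            rejected.append((record, "unparseable qty or unit_price"))
            continue
        if qty <= 0:
            rejected.append((record, "qty must be positive"))
            continue
        loaded.append({"order_id": record.get("order_id"),
                       "qty": qty, "amt": qty * price})
    return loaded, rejected


# Test input: one valid record and two that should be rejected.
records = [
    {"order_id": "o1", "qty": "2", "unit_price": "5.5"},
    {"order_id": "o2", "qty": "0", "unit_price": "3"},
    {"order_id": "o3", "qty": "abc", "unit_price": "1"},
]
loaded, rejected = run_transformation(records)
reasons = [reason for _, reason in rejected]
print(loaded)
print(reasons)
```

Checking the rejection reason as well as the count helps confirm that each record was rejected by the rule the test intended to exercise.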
Unit Testing
Unit testing the report data:
- Verify report data against the source: data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
- Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
- Derivation formulae and calculation rules should be verified.
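One way to verify aggregated warehouse or report data against the source is to recompute the aggregate from the granular source rows and compare. The sketch below assumes a hypothetical layout in which source rows carry prod_id and amt and the warehouse exposes a total per prod_id; the function names and the tolerance parameter are illustrative only.

```python
from collections import defaultdict


def aggregate_source(source_rows):
    """Recompute the expected total per product from granular rows."""
    totals = defaultdict(float)
    for row in source_rows:
        totals[row["prod_id"]] += row["amt"]
    return dict(totals)


def reconcile(source_rows, warehouse_totals, tolerance=1e-9):
    """Return {key: (expected, actual)} for every key that disagrees."""
    expected = aggregate_source(source_rows)
    mismatches = {}
    for key in sorted(set(expected) | set(warehouse_totals)):
        exp = expected.get(key)
        act = warehouse_totals.get(key)
        if exp is None or act is None or abs(exp - act) > tolerance:
            mismatches[key] = (exp, act)
    return mismatches


source = [
    {"prod_id": "p1", "amt": 12.0},
    {"prod_id": "p2", "amt": 11.0},
    {"prod_id": "p1", "amt": 50.0},
]
print(reconcile(source, {"p1": 62.0, "p2": 11.0}))
print(reconcile(source, {"p1": 60.0}))
```

An empty result means the aggregates agree; a key with None on either side points to data that is missing from the warehouse or absent from the source.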
Integration Testing
Integration testing involves the following:
- Sequence of ETL jobs in a batch
- Initial loading of records into the data warehouse
- Incremental loading of records at a later date, to verify the newly inserted or updated data
- Testing the rejected records that do not fulfil the transformation rules
- Error log generation
Performance Testing
Performance testing should check that ETL processes complete within the time window.
Acceptance testing
Here the system is tested with full functionality and is expected to function as it would in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity, business functionality, and reporting.
Questions
Thank You
275
Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing
276
An Overview
Understanding What is a Data Warehouse
277
278
279
280
Components of Warehouse
Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.
281
This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.
282
Data Modeling
Effective way of using a Data Warehouse
283
Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.
Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5
Dimension Table
store storeId c1 c2 c3 city nyc sfo la
Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50
Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la
286
Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy
sType tId t1 t2 city size small large location downtown suburbs regId north south
Dimension Table
cityId pop sfo 1M la 5M
The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
287
288
289
Continuous Monitoring
Identify & Correct Cause of Defects Refine data capture mechanisms at source Educate users on importance of DQ
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
290
291
293
294
ETL Architecture
Visitors
Web Browsers
The Internet
Staging Area
Web Server Logs & E-comm Transaction Data Flat Files
Scheduled Extraction
RDBMS
Scheduled Loading
Data Collection
Data Extraction
Data Transformation
Data Loading
295
ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database
Data transformation
Integrating dissimilar data types Changing codes Adding a time attribute Summarizing data Calculating derived values Renormalizing data
Data loading
Initial and incremental loading Updation of metadata
296
Why ETL ?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats.
To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.
297
298
299
ETL Tools
Provides facility to specify a large number of transformation rules with a GUI Generate programs to transform data Handle multiple data sources Handle data redundancy Generate metadata as output Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment ETL Tools - Second-Generation PowerCentre/Mart from Informatica Data Mart Solution from Sagent Technology DataStage from Ascential
300
Metadata Management
301
What Is Metadata?
Metadata is Information...
That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes
302
Importance Of Metadata
Locating Information Time spent in looking for information. How often information is found? What poor decisions were made based on the incomplete information?
303
Consumers of Metadata
Technical Users Warehouse administrator Application developer Business Users -Business metadata Meanings Definitions Business Rules Software Tools Used in DW life-cycle development Metadata requirements for each tool must be identified The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
305
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
307
OLAP
309
Agenda
OLAP Definition Distinction between OLTP and OLAP
MDDB Concepts
Implementation Techniques Architectures
Features
Representative Tools
5/16/2013
310
310
311
OLAP System Consolidation data; OLAP data comes from the various OLTP databases Decision support
Purpose of data
Multi-dimensional views of various kinds of business activities Periodic long-running batch jobs refresh the 312 data
312
MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions). A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions Individual items within each dimensions are called members
313
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
MDDB
Sales Volumes
M O D E L
Mini Van
Sedan
DEALERSHIP
COLOR
27 x 4 = 108 cells
314
3 x 3 x 3 = 27 cells
315
Sparsity
- Input data in applications are typically sparse -Increases with increased dimensions
Data Explosion
-Due to Sparsity -Due to Summarization
Performance
-Doesnt perform better than RDBMS at high data volumes (>20-30 GB)
5/16/2013
316
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
316
LAST NAME EMP# AGE SMITH 01 21 REGAN 12 Sales Volumes 19 FOX 31 63 Miini Van WELD 14 6 5 31 4 M O KELLY 54 3 5 27 Coupe D 5 E L LINK 03 56 4 3 2 Sedan KRANZ 41 45 Blue Red White LUCUS 33 COLOR 41 WEISS 23 19
Smith
Regan
Fox
L A S T N A M E
Weld
Kelly
Link
Kranz
Lucas
Weiss
EMPLOYEE #
5/16/2013
317
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
317
OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members Trend analysis over sequential time periods, What-if scenarios. Slicing / Dicing subsets for on-screen viewing Rotation to new dimensional comparisons in the viewing area Drill-down/up along the hierarchy Reach-through / Drill-through to underlying detail data
5/16/2013
318
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
318
M O D E L
Mini Van
6 3 4
Blue
5 5 3
Red
4 5 2
White
Coupe
C O L O R ( ROTATE 90 )
o
Blue
6 5 4
3 5 5
MODEL
4 3 2
Sedan
Red
Sedan
White
COLOR
View #1
View #2
319
M O D E L
Sedan
C O L O R
Blue
C O L O R
Blue
COLOR
( ROTATE 90 )
MODEL
( ROTATE 90 )
DEALERSHIP
( ROTATE 90 )
DEALERSHIP
DEALERSHIP
MODEL
View #1
D E A L E R S H I P D E A L E R S H I P
View #2
View #3
Mini Van
Clyde
M O D E L
Sedan
COLOR
( ROTATE 90 )
MODEL
( ROTATE 90 )
DEALERSHIP
MODEL
COLOR
COLOR
View #4
View #5
View #6
320
Sales Volumes
M O D E L
Coupe
Carr Clyde
Carr Clyde
Normal Blue
Metal Blue
DEALERSHIP
COLOR
5/16/2013
321
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
321
ORGANIZATION DIMENSION
REGION Midwest
DISTRICT
Chicago
St. Louis
Gary
DEALERSHIP
Clyde
Gleason
Carr
Levi
Lucas
Bolton
Moving Up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down
5/16/2013
322
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
322
5/16/2013
323
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
323
1st Qtr
4th Qtr
324
325
5/16/2013
326
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
326
OLAP
Cube
OLAP Calculation Engine
Web Browser
OLAP Tools
327
MOLAP - Features
Powerful analytical capabilities (e.g., financial, forecasting, statistical) Aggregation and calculation capabilities Read/write analytic applications Specialized data structures for
Maximum query performance. Optimum space utilization.
5/16/2013
328
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
328
Relational DW
Web Browser
OLAP Calculation Engine
SQL
OLAP Tools
OLAP Applications
5/16/2013
329
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
329
5/16/2013
330
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
330
Relational DW
Web Browser
OLAP Calculation Engine
SQL
OLAP Tools
OLAP Applications
5/16/2013
331
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
331
HOLAP - Features
RDBMS used for detailed data stored in large databases MDDB used for fast, read/write OLAP analysis and calculations Scalability of RDBMS and MDDB performance Calculation engine provides full analysis features Source of data transparent to end user
5/16/2013
332
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
332
Architecture Comparison
MOLAP
Definition
ROLAP
HOLAP
Hybrid OLAP = ROLAP + summary in MDDB Sparsity exists only in MDDB part To the necessary extent
MDDB OLAP = Relational OLAP = Transaction level data + Transaction level data + summary in MDDB summary in RDBMS Good Design 3 10 times High (May go beyond control. Estimation is very important) Fast - (Depends upon the size of the MDDB) No Sparsity To the necessary extent
Data explosion due to Sparsity Data explosion due to Summarization Query Execution Speed
Slow
Optimum - If the data is fetched from RDBMS then its like ROLAP otherwise like MOLAP. High: RDBMS + disk space + MDDB Server cost Large transactional data + frequent summary analysis
Cost
Where to apply?
Small transactional Very large transactional data + complex model + data & it needs to be frequent summary viewed / sorted analysis
5/16/2013
333
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
333
Oracle Express Products Hyperion Essbase Cognos -PowerPlay Seagate - Holos SAS
Micro Strategy - DSS Agent Informix MetaCube Brio Query Business Objects / Web Intelligence
5/16/2013
334
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
334
Sales Analysis Financial Analysis Profitability Analysis Performance Analysis Risk Management Profiling & Segmentation Scorecard Application NPA Management Strategic Planning Customer Relationship Management (CRM)
5/16/2013
335
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
335
336
The methodology required for testing a Data Warehouse is different from testing a typical transaction system
337
338
In data Warehouse, most of the testing is system triggered. Most of the production/Source system testing is the processing of individual transactions, which are driven by some input from the users (Application Form, Servicing Request.). There are very few test cycles, which cover the system-triggered scenarios (Like billing, Valuation.)
339
340
341
342
Requirements testing
The main aim for doing Requirements testing is to check stated requirements for completeness. Requirements can be tested on following factors. Are the requirements Complete? Are the requirements Singular? Are the requirements Ambiguous? Are the requirements Developable? Are the requirements Testable?
343
Unit Testing
Unit testing for data warehouses is WHITEBOX. It should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures: Whether ETLs are accessing and picking up right data from right source.
All the data transformations are correct according to the business rules and data warehouse is correctly populated with the transformed data.
Testing the rejected records that dont fulfil transformation rules.
344
Unit Testing
Unit Testing the Report data:
Verify Report data with source: Data present in a data warehouse will be stored at an aggregate level compare to source systems. QA team should verify the granular data stored in data warehouse against the source data available Field level data verification: QA team must understand the linkages for the fields displayed in the report and should trace back and compare that with the source systems Derivation formulae/calculation rules should be verified
345
Integration Testing
Integration testing will involve following:
Sequence of ETLs jobs in batch. Initial loading of records on data warehouse. Incremental loading of records at a later date to verify the newly inserted or updated data. Testing the rejected records that dont fulfil transformation rules. Error log generation
346
Performance Testing
Performance Testing should check for : ETL processes completing within time window.
347
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.
348
Questions
349
Thank You
350
OLAP
352
Agenda
OLAP Definition Distinction between OLTP and OLAP
MDDB Concepts
Implementation Techniques Architectures
Features
Representative Tools
5/16/2013
353
353
354
OLAP System Consolidation data; OLAP data comes from the various OLTP databases Decision support
Purpose of data
Multi-dimensional views of various kinds of business activities Periodic long-running batch jobs refresh the 355 data
355
MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions). A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions Individual items within each dimensions are called members
356
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
MDDB
Sales Volumes
M O D E L
Mini Van
Sedan
DEALERSHIP
COLOR
27 x 4 = 108 cells
357
3 x 3 x 3 = 27 cells
358
Sparsity
- Input data in applications are typically sparse -Increases with increased dimensions
Data Explosion
-Due to Sparsity -Due to Summarization
Performance
-Doesnt perform better than RDBMS at high data volumes (>20-30 GB)
5/16/2013
359
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
359
LAST NAME EMP# AGE SMITH 01 21 REGAN 12 Sales Volumes 19 FOX 31 63 Miini Van WELD 14 6 5 31 4 M O KELLY 54 3 5 27 Coupe D 5 E L LINK 03 56 4 3 2 Sedan KRANZ 41 45 Blue Red White LUCUS 33 COLOR 41 WEISS 23 19
Smith
Regan
Fox
L A S T N A M E
Weld
Kelly
Link
Kranz
Lucas
Weiss
EMPLOYEE #
5/16/2013
360
2009 2009 Wipro Wipro Ltd Ltd - Confidential Confidential
360
OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members Trend analysis over sequential time periods, What-if scenarios. Slicing / Dicing subsets for on-screen viewing Rotation to new dimensional comparisons in the viewing area Drill-down/up along the hierarchy Reach-through / Drill-through to underlying detail data
Rotation (rotate 90 degrees)

View #1 (MODEL by COLOR):

            Blue   Red   White
Mini Van    6      5     4
Coupe       3      5     5
Sedan       4      3     2

View #2 (COLOR by MODEL):

            Mini Van   Coupe   Sedan
Blue        6          3       4
Red         5          5       3
White       4          5       2
[Figure: rotating a three-dimensional cube (MODEL, COLOR, DEALERSHIP) by 90 degrees at a time gives six views, View #1 to View #6, each with a different arrangement of the three dimensions.]
[Figure: Sales Volumes for a subset of members. MODEL: Coupe; DEALERSHIP: Carr, Clyde; COLOR: Normal Blue, Metal Blue.]
ORGANIZATION DIMENSION
REGION: Midwest
DISTRICT: Chicago, St. Louis, Gary
DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton
Moving up in a hierarchy is referred to as drill-up (roll-up); moving down is referred to as drill-down.
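Drill-up and drill-down along the organization hierarchy can be sketched as aggregation over parent-child mappings. Which dealership belongs to which district is an assumption made for this example, and the sales volumes and function names are hypothetical.

```python
from collections import defaultdict

# Hierarchy: DEALERSHIP -> DISTRICT -> REGION.
# The dealership-to-district assignment is assumed for illustration.
DISTRICT_OF = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

def roll_up(values, parent_of):
    """Drill-up: aggregate child-level totals to the parent level."""
    totals = defaultdict(int)
    for child, value in values.items():
        totals[parent_of[child]] += value
    return dict(totals)

def drill_down(parent, values, parent_of):
    """Drill-down: return the child-level values under one parent."""
    return {c: v for c, v in values.items() if parent_of[c] == parent}

# Hypothetical sales volumes per dealership.
by_dealership = {"Clyde": 10, "Gleason": 37, "Carr": 12,
                 "Levi": 8, "Lucas": 5, "Bolton": 9}

by_district = roll_up(by_dealership, DISTRICT_OF)
by_region = roll_up(by_district, REGION_OF)
chicago_detail = drill_down("Chicago", by_dealership, DISTRICT_OF)
```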
[Figure: OLAP cube served through an OLAP calculation engine to a web browser and OLAP tools]
MOLAP - Features
- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for
  - Maximum query performance
  - Optimum space utilization
[Figure: relational DW queried through SQL by an OLAP calculation engine, which serves a web browser, OLAP tools, and OLAP applications]
HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user
Architecture Comparison

Definition
- MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
- ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
- MOLAP: Good design: 3-10 times
- ROLAP: No sparsity
- HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
- MOLAP: High (may go beyond control; estimation is very important)
- ROLAP: To the necessary extent
- HOLAP: To the necessary extent

Query execution speed
- MOLAP: Fast (depends upon the size of the MDDB)
- ROLAP: Slow
- HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP.

Cost
- HOLAP: High (RDBMS + disk space + MDDB server cost)

Where to apply?
- MOLAP: Small transactional data + complex model + frequent summary analysis
- ROLAP: Very large transactional data that needs to be viewed / sorted
- HOLAP: Large transactional data + frequent summary analysis
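The ROLAP definition above (transaction-level data plus summary in the RDBMS) can be sketched with Python's built-in `sqlite3` module. The table names, the sample rows, and the `total_for_model` helper are hypothetical. A real ROLAP engine generates such SQL itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Transaction-level data kept in an ordinary relational table.
conn.execute("CREATE TABLE sales (model TEXT, color TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Sedan", "Blue", 4), ("Sedan", "Red", 3), ("Coupe", "Blue", 3)],
)

# ROLAP-style summary: the aggregate is also stored in the RDBMS.
conn.execute(
    "CREATE TABLE sales_by_model AS "
    "SELECT model, SUM(qty) AS qty FROM sales GROUP BY model"
)

def total_for_model(model):
    """Answer a summary query from the relational summary table via SQL."""
    row = conn.execute(
        "SELECT qty FROM sales_by_model WHERE model = ?", (model,)
    ).fetchone()
    return row[0] if row else 0
```

In a MOLAP or HOLAP setup, the summary would live in the MDDB instead of the `sales_by_model` table.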
- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS

- MicroStrategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence
- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)
The methodology required for testing a data warehouse is different from that for testing a typical transaction system.
In a data warehouse, most of the testing is system-triggered. Most production/source system testing covers the processing of individual transactions, which are driven by input from users (application form, servicing request). Very few test cycles there cover system-triggered scenarios (such as billing or valuation).
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?
Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:
- Whether the ETLs are accessing and picking up the right data from the right source.
- Whether all data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data.
- Testing the rejected records that do not fulfil the transformation rules.
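The ETL unit checks above can be sketched as a small test over a transformation function. The business rule (a non-negative amount, currency stored in upper case), the record layout, and the function names are hypothetical; they stand in for whatever rules a project's mappings define.

```python
def transform(record):
    """Hypothetical business rule: amount must be non-negative."""
    if record["amount"] < 0:
        raise ValueError("amount must be non-negative")
    return {
        "id": record["id"],
        "amount": record["amount"],
        "currency": record["currency"].upper(),
    }

def run_etl(source_rows):
    """Transform each row; collect rows that break the rules as rejects."""
    loaded, rejected = [], []
    for row in source_rows:
        try:
            loaded.append(transform(row))
        except (ValueError, KeyError) as exc:
            rejected.append((row, str(exc)))
    return loaded, rejected

source_rows = [
    {"id": 1, "amount": 10, "currency": "usd"},
    {"id": 2, "amount": -5, "currency": "usd"},  # should be rejected
    {"id": 3, "amount": 7, "currency": "eur"},
]
loaded, rejected = run_etl(source_rows)

# Every source row must end up either loaded or rejected.
assert len(loaded) + len(rejected) == len(source_rows)
```

The final assertion is the kind of record-count check that catches rows silently dropped between source and target.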
Unit Testing
Unit testing the report data:
- Verify report data with source: Data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
- Field-level data verification: The QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
- Derivation formulae/calculation rules should be verified.
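Verifying aggregated report data against granular source data can be sketched as a reconciliation: recompute the aggregate from the source rows and compare it with the report. The row layout, the report figures, and the function names are hypothetical.

```python
from collections import defaultdict

def aggregate(rows, key, measure):
    """Recompute the aggregate from granular source rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)

def reconcile(source_rows, report_totals, key, measure):
    """Return {key: (expected, reported)} for every mismatch."""
    expected = aggregate(source_rows, key, measure)
    mismatches = {}
    for k in sorted(expected.keys() | report_totals.keys()):
        if expected.get(k) != report_totals.get(k):
            mismatches[k] = (expected.get(k), report_totals.get(k))
    return mismatches

source_rows = [
    {"store": "c1", "amt": 12},
    {"store": "c1", "amt": 11},
    {"store": "c3", "amt": 50},
]
# Hypothetical report figures; the c3 total is deliberately wrong.
report_totals = {"c1": 23, "c3": 55}

mismatches = reconcile(source_rows, report_totals, "store", "amt")
```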
Integration Testing
Integration testing will involve the following:
- Sequence of ETL jobs in a batch.
- Initial loading of records into the data warehouse.
- Incremental loading of records at a later date, to verify the newly inserted or updated data.
- Testing the rejected records that do not fulfil the transformation rules.
- Error log generation.
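The initial and incremental load checks can be sketched as follows. The warehouse is modelled as a dictionary keyed by record id, and `incremental_load` with its insert/update counters is a hypothetical helper, not part of any ETL tool.

```python
def incremental_load(target, changes, key="id"):
    """Apply new and changed rows to the target; count inserts and updates."""
    inserted = updated = 0
    for row in changes:
        if row[key] in target:
            updated += 1
        else:
            inserted += 1
        target[row[key]] = row
    return inserted, updated

warehouse = {}
initial = [{"id": 1, "city": "nyc"}, {"id": 2, "city": "sfo"}]
delta = [{"id": 2, "city": "la"}, {"id": 3, "city": "sfo"}]

first = incremental_load(warehouse, initial)   # initial load
second = incremental_load(warehouse, delta)    # later incremental load
```

An integration test would then compare the insert and update counts, and the resulting rows, with what the source system reports for the same period.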
Performance Testing
Performance testing should check for:
- ETL processes completing within the time window.
Acceptance testing
Here the system is tested with full functionality and is expected to function as it would in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity, business functionality, and reporting.
Questions
Thank You
Components of Warehouse
- Source tables: Real-time, volatile data in relational databases for transaction processing (OLTP). The sources can be any relational databases or flat files.
- ETL tools: Extract, cleanse, transform (aggregates, joins), and load the data from the sources to the target.
- Maintenance and administration tools: Authorize and monitor access to the data, set up users, and schedule jobs to run in off-peak periods.
- Modeling tools: Used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
- Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
- End-user tools for analysis and reporting: Get reports and analyze the data from the target tables. Different types of querying, data mining, and OLAP tools are used for this purpose.
The warehouse has a staging area, where data is loaded and tested after cleansing and transformation. The data is later loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.
Data Modeling
Effective way of using a Data Warehouse
Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.
E-R (Entity-Relationship) Data Model
- Entity: An object that can be observed and classified based on its properties and characteristics, such as an employee, a book, or a student.
- Relationship: Relates entities to other entities.
Star Schema

Dimension Table: product

prodId   name   price
p1       bolt   10
p2       nut    5

Dimension Table: store

storeId   city
c1        nyc
c2        sfo
c3        la

Fact Table: sale

orderId   date     custId   prodId   storeId   qty   amt
o100      1/7/97   53       p1       c1        1     12
o102      2/7/97   53       p2       c1        2     11
105       3/8/97   111      p1       c3        5     50

Dimension Table: customer

custId   name    address     city
53       joe     10 main     sfo
81       fred    12 main     sfo
111      sally   80 willow   la
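A typical star-schema query joins the fact table to a dimension table and aggregates a measure. The sketch below loads the product, store, and sale tables into an in-memory SQLite database and totals the sale amount by store city. The column types and the choice of query are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store   (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE sale    (orderId TEXT, date TEXT, custId INTEGER,
                      prodId TEXT REFERENCES product(prodId),
                      storeId TEXT REFERENCES store(storeId),
                      qty INTEGER, amt INTEGER);
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO sale VALUES
  ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
  ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
  ('105',  '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# Fact table joined to one dimension, measure aggregated per city.
amt_by_city = dict(conn.execute("""
    SELECT st.city, SUM(s.amt)
    FROM sale AS s
    JOIN store AS st ON st.storeId = s.storeId
    GROUP BY st.city
"""))

# The same pattern with the product dimension.
amt_by_product = dict(conn.execute("""
    SELECT p.name, SUM(s.amt)
    FROM sale AS s
    JOIN product AS p ON p.prodId = s.prodId
    GROUP BY p.name
"""))
```

Each query needs only one join from the fact table to reach a dimension attribute, which is the property that makes star schemas fast to query.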
Snowflake Schema

store

storeId   cityId   tId   mgr
s5        sfo      t1    joe
s7        sfo      t2    fred
s9        la       t1    nancy

sType

tId   size    location
t1    small   downtown
t2    large   suburbs

city

cityId   pop   regId
sfo      1M    north
la       5M    south

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized, and are frequently designed at a level of normalization short of third normal form.
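In a snowflake schema, a dimension is itself split into further tables, so a query follows a chain of joins. The sketch below loads the store, sType, and city tables into an in-memory SQLite database and resolves each store's region and store size. The column types and the query are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city  (cityId TEXT PRIMARY KEY, pop TEXT, regId TEXT);
CREATE TABLE sType (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
CREATE TABLE store (storeId TEXT PRIMARY KEY,
                    cityId TEXT REFERENCES city(cityId),
                    tId TEXT REFERENCES sType(tId),
                    mgr TEXT);
INSERT INTO city  VALUES ('sfo', '1M', 'north'), ('la', '5M', 'south');
INSERT INTO sType VALUES ('t1', 'small', 'downtown'), ('t2', 'large', 'suburbs');
INSERT INTO store VALUES ('s5', 'sfo', 't1', 'joe'),
                         ('s7', 'sfo', 't2', 'fred'),
                         ('s9', 'la', 't1', 'nancy');
""")

# The store dimension is normalized, so its attributes need extra joins.
rows = conn.execute("""
    SELECT s.storeId, c.regId, t.size
    FROM store AS s
    JOIN city  AS c ON c.cityId = s.cityId
    JOIN sType AS t ON t.tId = s.tId
    ORDER BY s.storeId
""").fetchall()
```

Compared with a star schema, the extra joins trade some retrieval speed for less redundancy in the dimension data.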
Continuous Monitoring
- Identify and correct the cause of defects
- Refine data capture mechanisms at source
- Educate users on the importance of data quality (DQ)
ETL Architecture
[Figure: visitors using web browsers over the Internet produce web server logs and e-commerce transaction data (flat files). Scheduled extraction feeds the staging area, and scheduled loading feeds the RDBMS. Stages: data collection, data extraction, data transformation, data loading.]
ETL Architecture
Data extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database
Data transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data
Data loading:
- Initial and incremental loading
- Updating of metadata
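The three ETL stages above can be sketched as three small functions. The record layout, the status code mapping, and the function names are hypothetical. The sketch covers only selection by criteria, code changes, a derived value, a time attribute, and loading.

```python
from datetime import date

STATUS_CODES = {"A": "active", "I": "inactive"}  # hypothetical code mapping

def extract(rows, criterion):
    """Extraction: select the rows that satisfy the selection criterion."""
    return [row for row in rows if criterion(row)]

def transform(row, load_date):
    """Transformation: change codes, derive a value, add a time attribute."""
    return {
        "id": row["id"],
        "status": STATUS_CODES.get(row["status"], "unknown"),  # changing codes
        "total": row["qty"] * row["unit_price"],               # derived value
        "load_date": load_date.isoformat(),                    # time attribute
    }

def load(target, rows):
    """Loading: write rows to the target, keyed by id."""
    for row in rows:
        target[row["id"]] = row
    return len(rows)

source = [
    {"id": 1, "status": "A", "qty": 2, "unit_price": 5},
    {"id": 2, "status": "I", "qty": 1, "unit_price": 10},
    {"id": 3, "status": "A", "qty": 0, "unit_price": 7},
]
warehouse = {}
selected = extract(source, lambda r: r["qty"] > 0)
loaded = load(warehouse, [transform(r, date(2013, 5, 16)) for r in selected])
```

Keeping the stages as separate functions mirrors the way ETL tools let each stage be scheduled and tested on its own.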
Why ETL?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
To solve the problem, companies use extract, transform, and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.
ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
ETL Tools - Second-Generation
- PowerCenter/Mart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential
Metadata Management
What Is Metadata?
Metadata is information:
- That describes the WHAT, WHEN, WHO, WHERE, and HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools, that improves both business and technical understanding of data and data-related processes
Importance Of Metadata
Locating information:
- Time spent looking for information
- How often is information found?
- What poor decisions were made based on incomplete information?
Consumers of Metadata
Technical users
- Warehouse administrator
- Application developer
Business users (business metadata)
- Meanings
- Definitions
- Business rules
Software tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
Reischmann-Informatik-Toolbus
- Features include facilitation of selective bridging of metadata