Course Overview
What is a Data Warehouse
OLTP vs. Data Warehousing
Data Warehousing Architecture
Data Warehousing Schemas & Objects
Physical Design in Data Warehouse
Definition of Data Warehousing
Data Warehousing Basic Design Approaches
Data Warehousing Operational Processes
Technical Problems in Data Warehousing
Representative DSS Tools
Business Intelligence
Facts!
A data warehouse is NOT a specific technology.
It is NOT possible to purchase a data warehouse, but it is possible to build one.
What are the projected sales? What if you sell a larger quantity of a particular product?
A data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management's decisions. (William Inmon)
Subject Oriented
The data in a data warehouse is organized around the major subjects of the enterprise (i.e. the high-level entities). This orientation around the major subject areas causes the data warehouse design to be data driven. Operational systems, by contrast, are designed around applications and functions, e.g. loans, savings, and credit cards in the case of a bank, whereas a data warehouse is designed around subjects such as Customer, Product, and Vendor.
[Figure: operational systems compared with a data warehouse organized by subject (Customer, Supplier, Product).]
Time Variant
Data is stored as a series of snapshots or views that record it as collected at successive points in time.
This helps in business trend analysis.
In contrast to the OLTP environment, data warehouses focus on change over time; that is what we mean by time variant.
Integrated
Data is stored once in a single integrated location.
[Figure: customer data stored in several databases (Auto Policy Processing System; Fire Policy Processing System; FACTS, LIFE; Commercial, Accounting Applications) is brought together in the data warehouse database under Subject = Customer.]
Integration is closely related to subject orientation. Data from disparate sources needs to be put into a consistent format, which means resolving problems such as naming conflicts and inconsistencies.
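A minimal Python sketch of what resolving naming conflicts can look like. The source-system labels, field names, and records are invented for illustration and do not come from any real system.

```python
# Hypothetical records for the same customer held in two operational systems.
auto_policy_record = {"cust_nm": "J. Smith", "sex": "M", "dob": "1970-03-01"}
fire_policy_record = {"customer_name": "J. Smith", "gender": "male", "birth_date": "1970-03-01"}

# Per-source mapping from local field names to the warehouse's field names (assumed names).
FIELD_MAPS = {
    "auto": {"cust_nm": "name", "sex": "gender", "dob": "birth_date"},
    "fire": {"customer_name": "name", "gender": "gender", "birth_date": "birth_date"},
}

# One agreed encoding for gender, whatever the source used.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def to_warehouse_format(source, record):
    """Rename fields and normalize encodings so every source looks alike."""
    mapping = FIELD_MAPS[source]
    row = {mapping[field]: value for field, value in record.items()}
    row["gender"] = GENDER_CODES[row["gender"].lower()]
    return row


auto_row = to_warehouse_format("auto", auto_policy_record)
fire_row = to_warehouse_format("fire", fire_policy_record)
print(auto_row == fire_row)
```

Once both records are in the same format they compare equal, which is what allows the warehouse to store the customer once.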
Non-Volatile
Existing data in the warehouse is not overwritten or updated.
[Figure: data is loaded into the warehouse and is then read-only.]
This is logical because the purpose of a data warehouse is to enable you to analyze
To summarize ...
OLTP Systems are used to run a business
Central Architecture
In a centralized architecture, there exists only one data warehouse, which stores all data necessary for business analysis. As already shown in the previous section, the disadvantage is a loss of performance compared with distributed approaches.
Federated Architecture
Advantages:
Faster response time, because the data is located closer to the client applications.
Reduced volume of data to be searched.
Tiered Architecture
[Figure: data warehouse architecture. Labels: Legacy Data, Operational Data, External Data Sources (The Post, VISA); Extract, Transform, Load; Enterprise Data Warehouse (organizationally structured); Data Marts (departmentally structured): Sales, Inventory, Purchase; Metadata; Data Management; Access; Information; Knowledge; Asset Exploitation.]
Data Management: Metadata. At all levels of the data warehouse, information is required to support the maintenance and use of the data warehouse.
Data Mart: A data mart is a subject-oriented data warehouse.
[Figure: Data Warehouse and Data Marts (DM Sales, DM Marketing, DM HR, DM Finance); labels include Human Resources.]
Data marts satisfy 80% of the local end users' requests.
[Figure: Operational Data Store, shown with sources A and B and the Data Warehouse.]
Benefits Of ODS
Supports the operational reporting needs of the organization.
Provides a complete view of customer relationships, the data for which might be stored in several operational databases. This can include data from an organization's internal systems as well as external data from third-party vendors.
Operates as a store for detailed data, updated frequently and used for drill-downs from the data warehouse, which contains summary data.
Reduces the burden placed on other operational or data warehouse platforms by providing an additional data store for reporting.
Provides more current data than a data warehouse and more integrated data than an OLTP system.
Feeds other operational systems in addition to the data warehouse.
A schema is a collection of database objects, including tables, views, indexes, and synonyms. There are a variety of ways of arranging schema objects in the schema models designed for data warehousing. They are:
Star Schema
Snowflake Schema
Galaxy Schema
Star Schema: Consists of a fact table connected to a set of dimension tables. Data in the dimension tables is denormalized.
Snowflake Schema: A refinement of the star schema in which some dimensional hierarchies are normalized into a set of smaller dimension tables.
Galaxy Schema: Multiple fact tables share dimension tables. It can be viewed as a collection of stars, hence the name galaxy schema.
Star Schema
A star schema is a highly denormalized, query-centric model where information is broken into two groups: facts and dimensions.
[Figure: star schema with Sales_Fact at the center, surrounded by dimension tables.]
Sales_Fact: TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey, required data (business metrics or measures), ...
Employee_Dim: EmployeeKey, EmployeeID, ...
Time_Dim: TimeKey, TheDate, ...
Shipper_Dim: ShipperKey, ShipperID, ...
Branch_Dim: BranchID, Branchno, ...
Customer_Dim: CustomerKey, CustomerID, ...
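A minimal sketch of a star schema query, using Python's built-in sqlite3 module. The table and key names follow the star schema slide; the SalesAmount measure and all row values are assumptions made up for the example, and only two of the dimensions are included to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time_Dim     (TimeKey INTEGER PRIMARY KEY, TheDate TEXT);
CREATE TABLE Customer_Dim (CustomerKey INTEGER PRIMARY KEY, CustomerID TEXT);
CREATE TABLE Sales_Fact (
    TimeKey     INTEGER REFERENCES Time_Dim(TimeKey),
    CustomerKey INTEGER REFERENCES Customer_Dim(CustomerKey),
    SalesAmount REAL  -- hypothetical measure
);
INSERT INTO Time_Dim VALUES (1, '2003-01-15'), (2, '2003-01-16');
INSERT INTO Customer_Dim VALUES (10, 'C-001'), (11, 'C-002');
INSERT INTO Sales_Fact VALUES (1, 10, 100.0), (1, 11, 50.0), (2, 10, 25.0);
""")

# A typical star query: join the fact table to one dimension and aggregate.
sales_by_customer = conn.execute("""
    SELECT c.CustomerID, SUM(f.SalesAmount)
    FROM Sales_Fact AS f
    JOIN Customer_Dim AS c ON c.CustomerKey = f.CustomerKey
    GROUP BY c.CustomerID
    ORDER BY c.CustomerID
""").fetchall()
print(sales_by_customer)
```

Every dimension is one join away from the fact table, which is what makes the model query-centric.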
Snowflake Schema
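[Figure: snowflake schema in which the branch dimension is normalized into City and Region tables.]
Sales_fact: timeID {FK}, propertyID {FK}, branchID {FK}, clientID {FK}, promotionID {FK}
Branch_Dim: branchID {PK}, branchNo, branchType, city {FK}
City: city {PK}, region {FK}
Region: region {PK}, country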
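A minimal sketch of querying a snowflaked dimension with Python's built-in sqlite3 module. The Branch_Dim, City, and Region columns follow the snowflake schema slide; the amount measure and all row values are assumptions made up for the example, and the other foreign keys of Sales_fact are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Region (region TEXT PRIMARY KEY, country TEXT);
CREATE TABLE City   (city TEXT PRIMARY KEY, region TEXT REFERENCES Region(region));
CREATE TABLE Branch_Dim (
    branchID   INTEGER PRIMARY KEY,
    branchNo   TEXT,
    branchType TEXT,
    city       TEXT REFERENCES City(city)
);
CREATE TABLE Sales_fact (
    branchID INTEGER REFERENCES Branch_Dim(branchID),
    amount   REAL  -- hypothetical measure
);
INSERT INTO Region VALUES ('Midwest', 'USA'), ('West', 'USA');
INSERT INTO City VALUES ('Chicago', 'Midwest'), ('Los Angeles', 'West');
INSERT INTO Branch_Dim VALUES (1, 'B001', 'Main', 'Chicago'), (2, 'B002', 'Sub', 'Los Angeles');
INSERT INTO Sales_fact VALUES (1, 200.0), (2, 300.0), (2, 100.0);
""")

# Reaching the region level takes a chain of joins through the normalized hierarchy.
sales_by_region = conn.execute("""
    SELECT r.region, SUM(f.amount)
    FROM Sales_fact AS f
    JOIN Branch_Dim AS b ON b.branchID = f.branchID
    JOIN City       AS c ON c.city = b.city
    JOIN Region     AS r ON r.region = c.region
    GROUP BY r.region
    ORDER BY r.region
""").fetchall()
print(sales_by_region)
```

Compared with a star schema, the normalized hierarchy stores each region once but costs extra joins at query time.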
Galaxy Schema
[Figure: galaxy schema with three fact tables (Fact1, Fact2, Fact3) sharing dimension tables Dimension1 through Dimension7.]
Fact Types:
Additive facts:
Additive facts are facts that can be summed up through all of the dimensions in the fact table.
Semi-additive facts:
Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Non-additive facts:
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Fact table:
The purpose of this table is to record the Sales_Amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because we can sum it along any of the three dimensions present in the fact table: date, store, and product.
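A minimal Python sketch of what additive means for Sales_Amount. The rows are made-up sample data; the helper name total_by is hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (date, store, product, sales_amount)
sales = [
    ("2003-01-15", "Store A", "Pen", 10.0),
    ("2003-01-15", "Store B", "Pen", 20.0),
    ("2003-01-16", "Store A", "Ink", 5.0),
]


def total_by(rows, column):
    """Sum Sales_Amount grouped by one dimension column (0=date, 1=store, 2=product)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[column]] += row[3]
    return dict(totals)


by_date = total_by(sales, 0)
by_store = total_by(sales, 1)
by_product = total_by(sales, 2)
print(by_date, by_store, by_product)
```

Whichever dimension is used for grouping, the subtotals are meaningful and add up to the same grand total.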
Fact table:
The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts.
Current_Balance is a semi-additive fact: it makes sense to add it up across all accounts (what is the total current balance for all accounts in the bank?), but it does not make sense to add it up through time.
Profit_Margin is a non-additive fact: it does not make sense to add it up at either the account level or the day level.
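A minimal Python sketch of the semi-additive case. The account numbers and values are made-up sample data.

```python
# Hypothetical rows: (date, account, current_balance, profit_margin)
snapshots = [
    ("2003-01-15", "ACC-1", 100.0, 0.10),
    ("2003-01-15", "ACC-2", 300.0, 0.20),
    ("2003-01-16", "ACC-1", 150.0, 0.10),
    ("2003-01-16", "ACC-2", 300.0, 0.30),
]

# Meaningful: total balance across all accounts on one day.
bank_total_jan_16 = sum(bal for day, _, bal, _ in snapshots if day == "2003-01-16")

# Not meaningful: adding one account's balances across days counts the same money twice.
acc1_summed_over_time = sum(bal for _, acc, bal, _ in snapshots if acc == "ACC-1")

# Across time, a balance is usually reported as an average or as the latest value instead.
acc1_balances = [bal for _, acc, bal, _ in snapshots if acc == "ACC-1"]
acc1_average = sum(acc1_balances) / len(acc1_balances)

# Profit_Margin is a ratio, so summing it (for example 0.10 + 0.20) has no
# business meaning on either the account or the day dimension.
print(bank_total_jan_16, acc1_summed_over_time, acc1_average)
```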
Based on the above classifications, there are two types of fact tables: cumulative and snapshot.
Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive. The first example is a cumulative fact table.
Snapshot: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts. The second example presented is a snapshot fact table.
Dimension Table Types
Slowly Changing Dimensions
Junk Dimensions
Conformed Dimensions
Degenerate Dimensions
Various data elements in the dimension undergo changes (e.g. changes in attributes, hierarchical structures) which need to be captured for analysis.
In a nutshell, this applies to cases where the attribute for a record varies over time. For example:
Customer Key    Name         State
1001            Christina    Illinois
Christina is a customer who first lived in Chicago, Illinois. At a later date, she moved to Los Angeles, California. How should the table be modified to reflect this change? This is a Slowly Changing Dimension problem.
Types of SCD
There are in general three ways to solve this type of problem, categorized as follows:
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added to the customer dimension table.
Type 3: The original record is modified to reflect the change.
TYPE 1:
The new record replaces the original record. No trace of the old record exists. For example:
Customer Key    Name         State
1001            Christina    Illinois
After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:
Customer Key    Name         State
1001            Christina    California
Advantages: This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages: All the history is lost. By applying this methodology, it is not possible to trace back in history. For example, in the above case, the company would not be able to know that Christina lived in Illinois before.
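A minimal Python sketch of a Type 1 change, using the Christina example. The dictionary layout and the function name apply_type1 are assumptions for illustration.

```python
# Customer dimension keyed by customer key.
customer_dim = {1001: {"name": "Christina", "state": "Illinois"}}


def apply_type1(dim, customer_key, new_state):
    """Type 1: overwrite the attribute in place; the old value is not kept."""
    dim[customer_key]["state"] = new_state


apply_type1(customer_dim, 1001, "California")
print(customer_dim)
```

After the update nothing in the table indicates that the state was ever Illinois.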
TYPE 2:
In a Type 2 SCD, a new record is added to the table to represent the new information. Therefore both the original and the new record will be present. For example:
Customer Key    Name         State
1001            Christina    Illinois
1005            Christina    California
After Christina moved from Illinois to California, we add the new information as a new row in the table.
Advantages: This allows us to accurately keep all historical information.
Disadvantages: This will cause the size of the table to grow fast. In cases where the number of rows in the table is very high to start with, storage and performance can become a concern.
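A minimal Python sketch of a Type 2 change, using the Christina example. The list-of-rows layout and the function name apply_type2 are assumptions for illustration; the new key 1005 is passed in by the caller rather than generated.

```python
customer_dim = [{"customer_key": 1001, "name": "Christina", "state": "Illinois"}]


def apply_type2(dim, old_key, new_key, new_state):
    """Type 2: keep the original row and append a new row under a new key."""
    original = next(row for row in dim if row["customer_key"] == old_key)
    new_row = dict(original, customer_key=new_key, state=new_state)
    dim.append(new_row)
    return new_row


apply_type2(customer_dim, 1001, 1005, "California")
for row in customer_dim:
    print(row)
```

Real implementations commonly add effective-date or current-row flag columns to tell the versions apart; the example table does not show them, so they are omitted here.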
TYPE 3:
In a Type 3 SCD, there will be two columns for the particular attribute of interest, one indicating the original value and one indicating the current value. There will also be a column that indicates when the current value became active. For example:
Customer Key    Name         Original State    Current State    Effective Date
1001            Christina    Illinois          California       15-Jan-03
After Christina moved from Illinois to California, the original information gets updated, and we have the above table (assuming the effective date of change is January 15, 2003).
Advantages: This does not increase the size of the table, since the new information is updated in place. This allows us to keep some part of the history.
Disadvantages: Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information is lost.
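A minimal Python sketch of Type 3 changes, using the Christina example including the later move to Texas. The dictionary layout and the function name apply_type3 are assumptions for illustration.

```python
from datetime import date

customer_dim = {
    1001: {
        "name": "Christina",
        "original_state": "Illinois",
        "current_state": "Illinois",
        "effective_date": None,
    }
}


def apply_type3(dim, customer_key, new_state, effective):
    """Type 3: overwrite only the current-value column; the original column stays."""
    row = dim[customer_key]
    row["current_state"] = new_state
    row["effective_date"] = effective


apply_type3(customer_dim, 1001, "California", date(2003, 1, 15))
apply_type3(customer_dim, 1001, "Texas", date(2003, 12, 15))
print(customer_dim[1001])
```

After the second change the row holds Illinois and Texas only, which shows why Type 3 cannot keep the full history.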
Degenerate Dimension:
A degenerate dimension is a dimension that is derived from the fact table and does not have its own dimension table. Degenerate dimensions are often used when a fact table's grain represents transaction-level data and one wishes to maintain system-specific identifiers such as order numbers, invoice numbers, and the like without forcing their inclusion in their own dimension.
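A minimal Python sketch of a degenerate dimension. The column names and rows are made-up sample data.

```python
# Hypothetical transaction-grain fact rows. order_number has no dimension table
# of its own: it lives directly in the fact table as a degenerate dimension.
sales_fact = [
    {"order_number": "ORD-1", "product_key": 1, "amount": 10.0},
    {"order_number": "ORD-1", "product_key": 2, "amount": 15.0},
    {"order_number": "ORD-2", "product_key": 1, "amount": 10.0},
]


def order_totals(rows):
    """Group fact rows by the degenerate dimension and total the measure."""
    totals = {}
    for row in rows:
        totals[row["order_number"]] = totals.get(row["order_number"], 0.0) + row["amount"]
    return totals


totals = order_totals(sales_fact)
print(totals)
```

The order number still works as a grouping key even though there is nothing to join it to.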
Conformed Dimensions (also seen written as confirmed dimensions):
A dimension that is fixed and reusable: it keeps one agreed meaning and set of values wherever it is used, so the same dimension can be shared across fact tables and data marts. Ex: if the name of a city is changed from Bombay to Mumbai, the change is made once, and the new name then applies everywhere the dimension is used.
Junk dimensions:
A junk dimension is a dimension where one can store miscellaneous transactional codes, flags, and text attributes that are not related to other dimensions. It provides a simple way for users to find those unrelated attributes. Ex: Marital Status (Yes or No), Gender (M or F), etc.
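A minimal Python sketch of a junk dimension built from the two example flags. The key numbering and the fact row are assumptions for illustration.

```python
from itertools import product

MARITAL_STATUS = ("Yes", "No")
GENDER = ("M", "F")

# One junk-dimension row per combination of the low-cardinality attributes.
junk_dim = {
    key: {"marital_status": m, "gender": g}
    for key, (m, g) in enumerate(product(MARITAL_STATUS, GENDER), start=1)
}

# Reverse lookup used while loading facts.
junk_key = {(row["marital_status"], row["gender"]): key for key, row in junk_dim.items()}

# The fact row carries a single key instead of two separate flag columns.
fact_row = {"amount": 40.0, "junk_key": junk_key[("No", "F")]}
print(junk_dim[fact_row["junk_key"]])
```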
The Dependent Data Mart Structure, or Hub & Spoke: The Top-Down Approach
Inmon advocated a dependent data mart structure.
The data flow in the top-down OLAP environment begins with data extraction from the operational data sources. This data is loaded into the staging area, validated and consolidated to ensure a level of accuracy, and then transferred to the Operational Data Store (ODS). Detailed data is regularly extracted from the ODS and temporarily hosted in the staging area for aggregation and summarization, and is then extracted and loaded into the data warehouse.
Once the data warehouse aggregation and summarization processes are complete, the data mart refresh cycles extract the data from the data warehouse into the staging area and perform a new set of transformations on it. This organizes the data into the particular structures required by the data marts. The data marts can then be loaded with the data, and the OLAP environment becomes available to the users.
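A minimal Python sketch of the top-down flow in miniature. The function names, the validation rule, and the sample rows are assumptions for illustration; a real pipeline would do far more at each stage.

```python
def validate(rows):
    """Staging: drop rows that fail a basic check (here: a missing amount)."""
    return [r for r in rows if r.get("amount") is not None]


def summarize(rows):
    """Staging: aggregate detailed ODS rows to one total per department."""
    totals = {}
    for r in rows:
        totals[r["dept"]] = totals.get(r["dept"], 0) + r["amount"]
    return totals


def build_mart(warehouse, dept):
    """Data mart refresh: take only the slice one department needs."""
    return {dept: warehouse[dept]}


operational = [
    {"dept": "Sales", "amount": 100},
    {"dept": "Sales", "amount": None},  # rejected in staging
    {"dept": "HR", "amount": 40},
    {"dept": "Sales", "amount": 60},
]

ods = validate(operational)                  # sources -> staging -> ODS
warehouse = summarize(ods)                   # ODS -> staging -> data warehouse
sales_mart = build_mart(warehouse, "Sales")  # warehouse -> staging -> data mart
print(warehouse, sales_mart)
```

The data mart is derived from the warehouse, which is why it is called dependent.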
Inmon Approach
The data marts are treated as subsets of the data warehouse. Each data mart is built for an individual department and is optimized for the analysis needs of that department.
In this context even the cubes constructed by using OLAP tools could be considered as data marts.
Kimball Approach
The bottom-up approach reverses the positions of the data warehouse and the data marts. Data marts are directly loaded with the data from the operational systems through the staging area. The data flow in the bottom-up approach starts with extraction of data from operational databases into the staging area, where it is processed and consolidated and then loaded into the ODS.
The data in the ODS is appended to or replaced by the fresh data being loaded. After the ODS is refreshed, the current data is once again extracted into the staging area and processed to fit into the data mart structure. The data from the data mart is then extracted to the staging area, aggregated, summarized, and so on, loaded into the data warehouse, and made available to the end user for analysis.
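A minimal Python sketch of the bottom-up flow in miniature. The function name load_mart and the sample rows are assumptions for illustration.

```python
# Hypothetical detailed rows already consolidated in the ODS.
ods = [
    {"dept": "Sales", "amount": 100},
    {"dept": "HR", "amount": 40},
    {"dept": "Sales", "amount": 60},
]


def load_mart(rows, dept):
    """ODS -> staging -> data mart: keep the detailed rows for one department."""
    return [r for r in rows if r["dept"] == dept]


# The data marts are loaded first.
marts = {dept: load_mart(ods, dept) for dept in ("Sales", "HR")}

# The warehouse is then built by aggregating what the marts hold.
warehouse = {dept: sum(r["amount"] for r in rows) for dept, rows in marts.items()}
print(warehouse)
```

Here the warehouse is downstream of the data marts, the reverse of the top-down arrangement.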
[Figure: Data Warehouse, OLAP, and Data Mining.]
[Figure: layers from top to bottom.]
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: OLAP, DSS, EIS, Querying and Reporting
Data Warehouses / Data Marts
Roles shown alongside the layers: End User, Business Analyst, Data Analyst, DB Admin.