DWH Concepts

Course Overview
  What is a Data Warehouse
  OLTP vs. Data Warehousing
  Data Warehousing Architecture
  Data Warehousing Schemas & Objects
  Physical Design in Data Warehouse
  Definition of Data Warehousing
  Data Warehousing Basic Design Approaches
  Data Warehousing Operational Processes
  Technical Problems in Data Warehousing
  Representative DSS Tools
  Business Intelligence
What is a Data Warehouse?

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data. A data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, online analytical processing (OLAP) and data mining capabilities, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users. It is a series of processes, procedures and tools (hardware and software) that help the enterprise understand more about itself, its products, its customers and the market it serves.
Facts!
  A data warehouse is NOT a specific technology.
  It is NOT possible to purchase a data warehouse, but it is possible to build one.
Why Data Warehousing?

Need for intelligent information in a competitive market:
  Who are the potential customers?
  Which products are sold the most?
  What will be the impact on revenue?
  What are the results of the promotion schemes introduced?
  What are the region-wise preferences?
  What are the competitor products?
  What are the projected sales?
  What if you sell a larger quantity of a particular product?
Defining Data warehouse
"A data warehouse is a subject-oriented, integrated, non-volatile and time-variant collection of data in support of management's decisions." (William Inmon)
Subject Oriented
The data in a data warehouse is organized around the major subjects of the enterprise (i.e. the high-level entities). This orientation around the major subject areas causes the data warehouse design to be data driven. Operational systems, in contrast, are designed around applications and functions, e.g. loans, savings and credit cards in the case of a bank, whereas a data warehouse is designed around subjects such as Customer, Product and Vendor.
[Figure: Operational systems are organized by processes or tasks; the data warehouse is organized by subject (Customer, Supplier, Product).]
Time Variant
Data is stored as a series of snapshots or views which record how it was collected across time.
[Figure: In the data warehouse, time is part of the key of the data.]
This helps in business trend analysis. In contrast to the OLTP environment, data warehouses focus on change over time; that is what we mean by time variant.
Integrated
Data is stored once, in a single integrated location.
[Figure: Customer data stored in several databases (the auto policy processing system, the fire policy processing system, and the FACTS, LIFE, commercial and accounting applications) is brought together in the data warehouse database under the subject Customer.]
Integration is closely related to subject orientation. Data from disparate sources needs to be put into a consistent format, which means resolving problems such as naming conflicts and inconsistencies.
Non-Volatile
Existing data in the warehouse is not overwritten or updated.
[Figure: Production applications update, insert and delete data in the production databases. The data warehouse environment is loaded from the production databases and external sources, and the data warehouse database is then read-only.]
This is logical because the purpose of a data warehouse is to enable you to analyze what has occurred.
OLTP vs. Data Warehouse

OLTP systems are tuned for known transactions and workloads, while the workload is not known in advance in a data warehouse. Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries), e.g. the average amount spent on phone calls between 9 AM and 5 PM in Pune during the month of December.
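The phone-call example can be sketched as a query that filters on several dimensions at once (city, month, hour of day) and aggregates a measure. This is only an illustration: the `calls` table, its column names and the sample rows are assumptions, not part of the course material.

```python
import sqlite3

# Hypothetical call-detail table; the schema and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (city TEXT, call_month INTEGER, call_hour INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?, ?)",
    [
        ("Pune", 12, 10, 20.0),    # Pune, December, 10:00 -> counted
        ("Pune", 12, 16, 40.0),    # Pune, December, 16:00 -> counted
        ("Pune", 12, 20, 90.0),    # outside the 9 AM to 5 PM window
        ("Pune", 11, 11, 70.0),    # November, not December
        ("Mumbai", 12, 12, 55.0),  # different city
    ],
)


def average_spend(conn, city, month, start_hour, end_hour):
    """Average amount for one city and month, for calls starting in [start_hour, end_hour)."""
    row = conn.execute(
        "SELECT AVG(amount) FROM calls "
        "WHERE city = ? AND call_month = ? AND call_hour >= ? AND call_hour < ?",
        (city, month, start_hour, end_hour),
    ).fetchone()
    return row[0]


pune_december_avg = average_spend(conn, "Pune", 12, 9, 17)
print(pune_december_avg)
```

Changing any one of the filters (another city, another month, another time window) gives a different slice of the same data, which is why such workloads are hard to predict in advance.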
OLTP | Data Warehouse (DSS)
Application oriented | Subject oriented
Used to run the business | Used to analyze the business
Detailed data | Summarized and refined data
Current, up-to-date data | Snapshot data
Isolated data | Integrated data
Repetitive access | Ad hoc access
Clerical user | Knowledge user (manager)
Performance sensitive | Performance relaxed
Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
Read/update access | Mostly read (batch update)
No data redundancy | Redundancy present
Database size 100 MB to 100 GB | Database size 100 GB to a few terabytes
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users | Hundreds of users
Managed in entirety | Managed by subsets
To summarize ...
OLTP Systems are used to run a business
The Data Warehouse helps to optimize the business
Data Warehouse Architectures

Centralized
In a centralized architecture, there is only one data warehouse, which stores all the data necessary for business analysis. The disadvantage is lower performance compared with distributed approaches.
[Figure: Central Architecture]
Data Warehouse Architectures Contd

Federated
In a federated architecture, the data is logically consolidated but stored in separate physical databases, at the same or at different physical sites. The local data marts store only the information relevant to a department. The amount of data is smaller than in a central data warehouse, and the level of detail is higher.
[Figure: Federated Architecture]
Data Warehouse Architectures Contd

Tiered
A tiered architecture is a distributed data approach. Loading cannot be done in one step, because many sources have to be integrated into the warehouse. At the first level, the data of all branches in one region is collected; at the second level, the data from the regions is integrated into one data warehouse.
Advantages: faster response time, because the data is located closer to the client applications, and a reduced volume of data to be searched.
[Figure: Tiered Architecture]
Complete Warehouse Solution Architecture
[Figure: Data flows from the data sources (legacy data, operational data, external data sources) through extract, transform and load into data management, which holds the metadata, the organizationally structured enterprise data warehouse and the departmentally structured data marts, and from there through access to information and knowledge. The first part of the flow is labelled "Asset Assembly (and Management)" and the last part "Asset Exploitation". Other labels in the figure: Sales, Inventory, Purchase, The Post, VISA.]
Data Warehouse Architecture Components

Data Sources:
  Legacy data
  Operational data
  External data resources
  Disparate data sources
Data Management:
  Metadata: At all levels of the data warehouse, information is required to support the maintenance and use of the data warehouse.
  Data Mart: A data mart is a subject-oriented data warehouse.
Introduction To Data Marts

What is a Data Mart?
From the data warehouse, atomic data flows to various departments for their customized needs. If this data is periodically extracted from the data warehouse and loaded into a local database, it becomes a data mart. The data in a data mart has a different level of granularity from that of the data warehouse. Since the data in data marts is highly customized and lightly summarized, the departments can do whatever they want with it without worrying about resource utilization. The departments can also use the analytical software they find convenient, and the cost of processing becomes very low.
Data Mart Overview
[Figure: The data warehouse feeds the data marts DM Sales, DM Marketing, DM HR and DM Finance. The users shown are sales representatives and analysts; human resources; and financial analysts, strategic planners, and executives.]
Data marts satisfy 80% of the local end users' requests.
From The Data Warehouse To Data Marts
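[Figure: A scale running from the data warehouse (organizationally structured data) through departmentally structured data to individually structured information. The amount of history and the degree to which the data is normalized and detailed run from more at the data warehouse end to less at the other end.]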
Operational Data Store (ODS)

What is an ODS?
An Operational Data Store (ODS) integrates data from multiple business operation sources to address operational problems that span one or more business functions. An ODS has the following features:
  Subject-oriented: Organized around major subjects of an organization (customer, product, etc.), not specific applications (order entry, accounts receivable, etc.).
  Integrated: Presents an integrated image of subject-oriented data which is pulled from fragmented operational source systems.
  Current: Contains a snapshot of the current content of legacy source systems. History is not kept, and might be moved to the data warehouse for analysis.
  Volatile: Since ODS content is kept current, it changes frequently. Identical queries run at different times may yield different results.
  Detailed: ODS data is generally more detailed than data warehouse data. Summary data is usually not stored in an ODS; the exact granularity depends on the subject that is being supported.
Operational Data Store (ODS) Contd

The ODS provides an integrated view of data in operational systems. As the figure indicates, there is a clear separation between the ODS and the data warehouse.
[Figure: Operational systems (A, B) feed the operational data store, which feeds the data warehouse. Other labels in the figure: EIS, DSS, Apps, PC.]
Operational Data Store | Data Warehouse
Current or near-current data | Historical data
Detailed data | Summary and detail
Updates allowed | Non-volatile snapshots only
Benefits Of ODS
  Supports the operational reporting needs of the organization.
  Provides a complete view of customer relationships, the data for which might be stored in several operational databases. This can include data from an organization's internal systems as well as external data from third-party vendors.
  Operates as a store for detailed data, updated frequently and used for drill-downs from the data warehouse, which contains summary data.
  Reduces the burden placed on other operational or data warehouse platforms by providing an additional data store for reporting.
  Provides more current data than a data warehouse and more integrated data than an OLTP system.
  Feeds other operational systems in addition to the data warehouse.
Data Warehousing SCHEMAS & OBJECTS
A schema is a collection of database objects, including tables, views, indexes, and synonyms. There are a variety of ways of arranging schema objects in the schema models designed for data warehousing. They are:
  Star Schema: Consists of a fact table connected to a set of dimension tables. The data in the dimension tables is denormalized.
  Snowflake Schema: A refinement of the star schema in which some dimensional hierarchies are normalized into a set of smaller dimension tables.
  Galaxy Schema: Multiple fact tables share dimension tables. It can be viewed as a collection of stars, and is therefore called a galaxy schema.
Star Schema
A star schema is a highly denormalized, query-centric model where information is broken into two groups: facts and dimensions.
[Figure: Star schema with Sales_Fact at the center, surrounded by the dimension tables listed below.]
Table | Columns
Sales_Fact | TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey, required data (business metrics or measures), ...
Employee_Dim | EmployeeKey, EmployeeID, ...
Time_Dim | TimeKey, TheDate, ...
Shipper_Dim | ShipperKey, ShipperID, ...
Customer_Dim | CustomerKey, CustomerID, ...
Branch_Dim | BranchID, Branchno, ...
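As a rough sketch of how a star schema is queried, the example below builds a cut-down version of the tables named above and joins the fact table to one dimension. Only two dimensions are included, and the `SalesAmount` measure and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: one row per member, identified by a key.
    CREATE TABLE Time_Dim (TimeKey INTEGER PRIMARY KEY, TheDate TEXT);
    CREATE TABLE Customer_Dim (CustomerKey INTEGER PRIMARY KEY, CustomerID TEXT);

    -- Fact table: foreign keys to the dimensions plus a measure.
    -- SalesAmount is an assumed measure; the slide only says "business metrics".
    CREATE TABLE Sales_Fact (
        TimeKey INTEGER REFERENCES Time_Dim(TimeKey),
        CustomerKey INTEGER REFERENCES Customer_Dim(CustomerKey),
        SalesAmount REAL
    );

    INSERT INTO Time_Dim VALUES (1, '2003-01-15'), (2, '2003-01-16');
    INSERT INTO Customer_Dim VALUES (10, 'C-10'), (11, 'C-11');
    INSERT INTO Sales_Fact VALUES (1, 10, 100.0), (1, 11, 50.0), (2, 10, 25.0);
    """
)

# A typical star join: one hop from the fact table to the dimension.
sales_by_customer = conn.execute(
    """
    SELECT c.CustomerID, SUM(f.SalesAmount)
    FROM Sales_Fact AS f
    JOIN Customer_Dim AS c ON c.CustomerKey = f.CustomerKey
    GROUP BY c.CustomerID
    ORDER BY c.CustomerID
    """
).fetchall()
print(sales_by_customer)
```

Every dimension is reached from the fact table with a single join, which is what makes the model query-centric.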
Snowflake Schema
[Figure 32.2: Snowflake schema showing the fact table and its dimension tables.]
Table | Columns
Sales_fact (fact table) | timeID {FK}, propertyID {FK}, branchID {FK}, clientID {FK}, promotionID {FK}, staffID {FK}, ownerID {FK}, offerPrice, sellingPrice, saleCommission, saleRevenue
Branch_Dim (dimension table) | branchID {PK}, branchNo, branchType, city {FK}
City (dimension table) | city {PK}, region {FK}
Region (dimension table) | region {PK}, country
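The normalized Branch_Dim, City and Region hierarchy means that a roll-up to region level has to walk several joins. A minimal sketch follows, keeping only the columns needed for the roll-up; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized dimension hierarchy: Branch_Dim -> City -> Region.
    CREATE TABLE Region (region TEXT PRIMARY KEY, country TEXT);
    CREATE TABLE City (city TEXT PRIMARY KEY, region TEXT REFERENCES Region(region));
    CREATE TABLE Branch_Dim (
        branchID INTEGER PRIMARY KEY,
        branchNo TEXT,
        branchType TEXT,
        city TEXT REFERENCES City(city)
    );
    -- Cut-down fact table: only the branch key and one measure are kept here.
    CREATE TABLE Sales_fact (
        branchID INTEGER REFERENCES Branch_Dim(branchID),
        saleRevenue REAL
    );

    INSERT INTO Region VALUES ('West', 'India'), ('North', 'India');
    INSERT INTO City VALUES ('Pune', 'West'), ('Mumbai', 'West'), ('Delhi', 'North');
    INSERT INTO Branch_Dim VALUES
        (1, 'B001', 'Main', 'Pune'),
        (2, 'B002', 'Sub', 'Mumbai'),
        (3, 'B003', 'Main', 'Delhi');
    INSERT INTO Sales_fact VALUES (1, 100.0), (2, 200.0), (3, 50.0), (1, 25.0);
    """
)

# Rolling revenue up to region level walks the whole normalized hierarchy.
revenue_by_region = conn.execute(
    """
    SELECT r.region, SUM(f.saleRevenue)
    FROM Sales_fact AS f
    JOIN Branch_Dim AS b ON b.branchID = f.branchID
    JOIN City AS c ON c.city = b.city
    JOIN Region AS r ON r.region = c.region
    GROUP BY r.region
    ORDER BY r.region
    """
).fetchall()
print(revenue_by_region)
```

In a star schema the region and country would simply be columns of Branch_Dim, trading some redundancy for fewer joins.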
Galaxy Schema
Multiple groups of facts are linked by a few common dimensions.
[Figure: Three fact tables (Fact1, Fact2, Fact3) connected to seven dimension tables (Dimension1 to Dimension7), some of which are shared between fact tables.]
Data Warehousing Objects

All three types of schema are described in the Data Modeling section. The various objects used in data warehousing are:
  Fact Tables
  Dimension Tables
  Hierarchies
  Unique Identifiers
  Relationships
Fact Tables:
  Represent a business process, i.e. they model the business process as an artifact in the data model.
  Contain the measurements, metrics or facts of business processes, e.g. the "monthly sales number" in the Sales business process.
  Most facts are additive (sales this month), some are semi-additive (balance as of a date), and some are not additive (unit price).
  The level of detail is called the grain of the table.
  Contain foreign keys for the dimension tables.
Fact Types :
Additive facts:
Additive facts are facts that can be summed up through all of the dimensions in the fact table
Semi-Additive facts:
Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table
Non-additive facts:
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table
Examples to illustrate Additive, Semi-Additive & Non-Additive facts:
Fact table:
Date | Store | Product | Sales_Amount
The purpose of this table is to record the Sales_Amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because we can sum it up along any of the three dimensions present in the fact table: date, store, and product.
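A small sketch of what "additive" means for the table above: summing Sales_Amount along date, store or product always yields meaningful totals, and each breakdown adds up to the same grand total. The sample rows are invented.

```python
from collections import defaultdict

# Rows of the fact table: (date, store, product, sales_amount). Values are invented.
sales_fact = [
    ("2003-01-15", "S1", "P1", 100.0),
    ("2003-01-15", "S1", "P2", 50.0),
    ("2003-01-15", "S2", "P1", 70.0),
    ("2003-01-16", "S1", "P1", 30.0),
]


def total_by(rows, dimension_index):
    """Sum Sales_Amount grouped by the dimension in the given column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension_index]] += row[3]
    return dict(totals)


by_date = total_by(sales_fact, 0)
by_store = total_by(sales_fact, 1)
by_product = total_by(sales_fact, 2)

print(by_date)
print(by_store)
print(by_product)
```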
Example for semi-additive and non-additive facts:
Fact table:
Date | Account | Current_Balance | Profit_Margin
The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts.
Current_Balance is a semi-additive fact: it makes sense to add it up across all accounts (what is the total current balance for all accounts in the bank?), but it does not make sense to add it up through time.
Profit_Margin is a non-additive fact: it does not make sense to add it up at either the account level or the day level.
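The semi-additive behaviour of Current_Balance can be sketched as follows: summing across accounts for one day is meaningful, while across days the usual choice is to take the latest snapshot rather than a sum. Taking the latest value is one common convention assumed here (an average over the period is another); the sample rows are invented.

```python
# Rows of the snapshot fact table: (date, account, current_balance). Values are invented.
balance_fact = [
    ("2003-01-15", "A1", 100.0),
    ("2003-01-15", "A2", 300.0),
    ("2003-01-16", "A1", 120.0),
    ("2003-01-16", "A2", 280.0),
]


def bank_total_on(rows, date):
    """Valid: add balances across the account dimension for a single day."""
    return sum(balance for d, _, balance in rows if d == date)


def closing_balance(rows, account):
    """Across time, take the latest snapshot instead of summing the days."""
    latest_date, balance = max((d, b) for d, acct, b in rows if acct == account)
    return balance


total_on_15th = bank_total_on(balance_fact, "2003-01-15")
a1_closing = closing_balance(balance_fact, "A1")

# Summing A1 over both days would give 220.0, a number with no business meaning.
print(total_on_15th, a1_closing)
```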
Types of fact tables:
Based on the above classifications, there are two types of fact tables:
  Cumulative: This type of fact table describes what has happened over a period of time. For example, it may describe the total sales by product by store by day. The facts in this type of fact table are mostly additive. The first example is a cumulative fact table.
  Snapshot: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts. The second example is a snapshot fact table.
Dimension Tables:
  Define the business in terms already familiar to users.
  Have wide rows with lots of descriptive text.
  Are small tables (about a million rows).
  Are joined to the fact table by a foreign key.
  Are heavily indexed.
  Typical dimensions: time periods, geographic region (markets, cities), products, customers, salesperson, etc.
Dimension Table Types
  Slowly Changing Dimensions
  Junk Dimensions
  Conformed Dimensions
  Degenerate Dimensions
Slowly Changing Dimensions (SCD):
Various data elements in a dimension undergo changes (e.g. changes in attributes or hierarchical structures) which need to be captured for analysis.
The SCD problem is a common one, particular to data warehousing.
In a nutshell, it applies to cases where the attribute for a record varies over time. For example:
Customer key | Name | State
1001 | Christina | Illinois
Christina is a customer who first lived in Chicago, Illinois. At a later date, she moved to Los Angeles, California. How should the table be modified to reflect this change? This is the Slowly Changing Dimension problem.
Types of SCD
There are in general three ways to solve this type of problem, categorized as follows:
  Type 1: The new record replaces the original record. No trace of the old record exists.
  Type 2: A new record is added to the customer dimension table.
  Type 3: The original record is modified to reflect the change.
TYPE 1:
The new record replaces the original record. No trace of the old record exists. Example:
Customer key | Name | State
1001 | Christina | Illinois
After Christina moved from Illinois to California, the new information replaces the original record and we have the following table:
Customer key | Name | State
1001 | Christina | California
Advantages: This is the easiest way to handle the slowly changing dimension problem, since there is no need to keep track of the old information.
Disadvantages: All history is lost. With this methodology it is not possible to trace back in history. In the above case, for example, the company would not be able to know that Christina lived in Illinois before.
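A minimal sketch of Type 1 handling, using an in-memory dictionary to stand in for the customer dimension table (the data structure and function name are illustrative assumptions):

```python
# Customer dimension keyed by customer key, standing in for the table above.
customer_dim = {1001: {"name": "Christina", "state": "Illinois"}}


def apply_type1(dim, customer_key, new_state):
    """Type 1: overwrite the attribute in place; the old value is not kept anywhere."""
    dim[customer_key]["state"] = new_state


apply_type1(customer_dim, 1001, "California")
print(customer_dim)
```

After the update, nothing in the dimension records that Christina ever lived in Illinois.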
TYPE 2:
In a Type 2 SCD, a new record is added to the table to represent the new information. Therefore both the original and the new record will be present. Example:
Customer key | Name | State
1001 | Christina | Illinois
1005 | Christina | California
After Christina moved from Illinois to California, we add the new information as a new row in the table.
Advantages: This allows us to keep all historical information accurately.
Disadvantages: This causes the size of the table to grow fast. Where the number of rows in the table is very high to start with, storage and performance can become a concern.
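A minimal sketch of Type 2 handling. To connect the two rows for the same person, the sketch assumes a natural identifier column, `customer_id`, which does not appear in the table above; the value "CHR-1" is likewise made up. The customer key acts as a surrogate key that differs per version.

```python
# Customer dimension as a list of rows. "customer_id" is a hypothetical natural key
# added for illustration; it is not shown in the course table.
customer_dim = [
    {"customer_key": 1001, "customer_id": "CHR-1", "name": "Christina", "state": "Illinois"},
]


def apply_type2(dim, customer_id, new_state, new_key):
    """Type 2: keep the old row and append a new row with a new surrogate key."""
    current = [row for row in dim if row["customer_id"] == customer_id][-1]
    new_row = dict(current, customer_key=new_key, state=new_state)
    dim.append(new_row)
    return new_row


apply_type2(customer_dim, "CHR-1", "California", new_key=1005)
history = [(row["customer_key"], row["state"]) for row in customer_dim]
print(history)
```

Facts loaded before the move keep pointing at key 1001 and facts loaded afterwards use key 1005, so both states remain available for analysis.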
TYPE 3:
In a Type 3 SCD, there are two columns for the particular attribute of interest, one holding the original value and one holding the current value. There is also a column that indicates when the current value became active. Example:
Customer key | Name | Original State | Current State | Effective Date
1001 | Christina | Illinois | California | 15-Jan-03
After Christina moved from Illinois to California, the original record is updated and we have the above table (assuming the effective date of the change is January 15, 2003).
Advantages: This does not increase the size of the table, since the new information is updated in place. It allows us to keep some part of the history.
Disadvantages: Type 3 cannot keep all history when an attribute changes more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information is lost.
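A minimal sketch of Type 3 handling, including the second move that loses the intermediate value. The starting row (current state equal to the original state, no effective date) is an assumption about how the record looked before the first move.

```python
# One dimension row with separate "original" and "current" columns.
# The initial values before any move are assumed for illustration.
customer = {
    "customer_key": 1001,
    "name": "Christina",
    "original_state": "Illinois",
    "current_state": "Illinois",
    "effective_date": None,
}


def apply_type3(row, new_state, effective_date):
    """Type 3: overwrite only the current-value columns; the original value stays."""
    row["current_state"] = new_state
    row["effective_date"] = effective_date


apply_type3(customer, "California", "2003-01-15")
after_first_move = dict(customer)

apply_type3(customer, "Texas", "2003-12-15")
print(after_first_move)
print(customer)
```

After the second call the row holds Illinois and Texas only, which is the limitation described above.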
Degenerate Dimension:
A degenerate dimension is a dimension which is derived from the fact table and doesn't have its own dimension table. Degenerate dimensions are often used when a fact table's grain represents transaction-level data and one wishes to keep system-specific identifiers such as order numbers, invoice numbers and the like without forcing them into a dimension table of their own.
Conformed Dimensions:
A conformed dimension (sometimes written "confirmed" and also called a fixed dimension) is a dimension which is fixed and reusable: it is defined once and used with the same meaning wherever it is needed. It is not affected by time. Example: if the name of a city is changed from Bombay to Mumbai, the name does not keep changing from time to time; once the change is made, it is permanent. Dimensions of this type are called conformed or fixed dimensions.
Junk dimensions:
A junk dimension is a dimension in which one can store miscellaneous transactional codes, flags and text attributes that are not related to other dimensions. It provides a simple way for users to find those unrelated attributes. Examples: Marital Status (Yes or No), Gender (M or F), etc.
Definition Of Data Warehouse

Ralph Kimball's paradigm: The data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.
Bill Inmon's paradigm: The data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.
Basic Design Approaches of Data Warehouse

There are two major types of approaches to building or designing the Data Warehouse.
  The Top-Down Approach
  The Bottom-Up Approach
The Top Down Approach
The Dependent Data Mart Structure or Hub & Spoke: The Top-Down Approach
Inmon advocated a dependent data mart structure. The data flow in the top-down OLAP environment begins with data extraction from the operational data sources. This data is loaded into the staging area, validated and consolidated to ensure a level of accuracy, and then transferred to the Operational Data Store (ODS). Detailed data is regularly extracted from the ODS and temporarily hosted in the staging area for aggregation and summarization, and is then extracted and loaded into the data warehouse. Once the data warehouse aggregation and summarization processes are complete, the data mart refresh cycles extract the data from the data warehouse into the staging area and perform a new set of transformations on it. This organizes the data in the particular structures required by the data marts. The data marts can then be loaded with the data, and the OLAP environment becomes available to the users.
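The top-down flow described above can be sketched as a chain of small functions. This is a toy illustration only: the row layout, the validation rule and the function names are all assumptions, and each staging pass is reduced to a single in-memory step.

```python
from collections import defaultdict

# Hypothetical operational rows: (department, product, amount).
# An amount of None stands for a record that fails validation.
operational_rows = [
    ("sales", "P1", 100.0),
    ("sales", "P2", None),
    ("sales", "P1", 50.0),
    ("marketing", "P3", 20.0),
]


def stage(rows):
    """Staging area: validate and consolidate the extracted rows."""
    return [row for row in rows if row[2] is not None]


def load_warehouse(ods_rows):
    """Aggregate detailed ODS rows into warehouse totals per (department, product)."""
    totals = defaultdict(float)
    for department, product, amount in ods_rows:
        totals[(department, product)] += amount
    return dict(totals)


def refresh_mart(warehouse, department):
    """Data mart refresh: reshape warehouse data into what one department needs."""
    return {
        product: total
        for (dept, product), total in warehouse.items()
        if dept == department
    }


ods = stage(operational_rows)                   # sources -> staging -> ODS
warehouse = load_warehouse(ods)                 # ODS -> staging -> data warehouse
sales_mart = refresh_mart(warehouse, "sales")   # data warehouse -> staging -> data mart
print(sales_mart)
```

The point of the sketch is the direction of dependency: the data mart is derived from the warehouse, never loaded directly from the operational rows.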
The Top Down Approach Contd
Inmon Approach
The data marts are treated as sub sets of the data warehouse. Each data mart is built for an individual department and is optimized for analysis needs of the particular department for which it is created.
The Bottom-Up Approach

The Data Warehouse Bus Structure: The Bottom-Up Approach
Ralph Kimball designed the data warehouse with the data marts connected to it by a bus structure. The bus structure contains all the common elements that are used by the data marts, such as conformed dimensions and measures defined for the enterprise as a whole. This architecture makes the data warehouse more of a virtual reality than a physical reality. All data marts could be located on one server or on different servers across the enterprise, while the data warehouse would be a virtual entity, nothing more than the sum total of all the data marts.
In this context, even the cubes constructed using OLAP tools could be considered data marts.
The Bottom-Up Approach Contd
Kimball Approach
The bottom-up approach reverses the positions of the data warehouse and the data marts. Data marts are loaded directly with data from the operational systems through the staging area. The data flow in the bottom-up approach starts with the extraction of data from operational databases into the staging area, where it is processed and consolidated and then loaded into the ODS.
The data in the ODS is appended to or replaced by the fresh data being loaded. After the ODS is refreshed, the current data is once again extracted into the staging area and processed to fit the data mart structure. The data from the data mart is then extracted to the staging area, aggregated and summarized, loaded into the data warehouse, and made available to the end user for analysis.
Business Intelligence Components
[Figure: Operational data is extracted, transformed and loaded into the data warehouse, which supports OLAP and data mining.]
Business Intelligence Architecture
Business Intelligence Technologies

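[Figure: Layers of business intelligence technologies, with increasing potential to support business decisions from bottom to top.]
From top to bottom, the layers are:
  Decision Making
  Data Presentation: Visualization techniques
  Data Mining: Information discovery
  Data Exploration: OLAP, DSS, EIS, querying and reporting
  Data Warehouses / Data Marts
  Data Sources: Paper, files, information providers, database systems, OLTP
The roles shown alongside the layers, from top to bottom, are End User, Business Analyst, Data Analyst and DB Admin.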
