
Course Overview

What is a Data Warehouse
OLTP vs. Data Warehousing
Data Warehousing Architecture
Data Warehousing Schemas & Objects
Physical Design in Data Warehouse
Definition of Data Warehousing
Data Warehousing Basic Design Approaches
Data Warehousing Operational Processes
Technical Problems in Data Warehousing
Representative DSS Tools
Business Intelligence

What is a Data Warehouse?


A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data. A data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, online analytical processing (OLAP) and data mining capabilities, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users. It is a series of processes, procedures, and tools (hardware and software) that help the enterprise understand more about itself, its products, its customers, and the market it serves.

Facts!
A data warehouse is NOT a specific technology.
It is NOT possible to purchase a data warehouse, but it is possible to build one.

Why Data Warehousing?


Need for Intelligent Information in a Competitive Market
Who are the potential customers?
Which products are sold the most?
What will be the impact on revenue?
What are the results of the promotion schemes introduced?
What are the region-wise preferences?
What are the competitor products?
What are the projected sales?
What if you sell a larger quantity of a particular product?

Defining Data warehouse

"A data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management's decisions." (William Inmon)

Subject Oriented
The data in a data warehouse is organized around the major subjects of the enterprise (i.e., the high-level entities). The orientation around the major subject areas causes the data warehouse design to be data driven. Operational systems are designed around applications and functions, e.g., loans, savings, and credit cards in the case of a bank, whereas a data warehouse is designed around subjects such as Customer, Product, and Vendor.

[Figure: operational systems are organized by processes or tasks; the data warehouse is organized by subject (Customer, Supplier, Product).]

Time Variant
Data is stored as a series of snapshots or views which record how it changes across time.

[Figure: data warehouse data; labels: Time, Data, Key]

It helps in business trend analysis. In contrast to an OLTP environment, data warehouses focus on change over time; that is what we mean by time variant.

Integrated
Data is stored once in a single integrated location.

[Figure: customer data stored in several databases (the Auto Policy Processing System, the Fire Policy Processing System, and the FACTS, LIFE, Commercial, and Accounting applications) is brought into the data warehouse database under Subject = Customer.]

Integration is closely related to subject orientation. Data from disparate sources needs to be put in a consistent format, which means resolving problems such as naming conflicts and inconsistencies.
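Resolving naming conflicts between sources can be sketched as follows. This is a minimal illustration, not a production ETL step: the field names, the mapping tables, and the sample rows are all assumptions made for the example.

```python
# Hypothetical extracts from two policy systems that describe the same
# customer with different field names and encodings.
auto_policy_rows = [{"cust_no": "C-17", "cust_name": "A. Rao", "sex": "M"}]
fire_policy_rows = [{"customer_id": "C-17", "name": "A. Rao", "gender": "male"}]

# Assumed mappings from each source's field names to the warehouse's names.
FIELD_MAPS = {
    "auto": {"cust_no": "customer_id", "cust_name": "name", "sex": "gender"},
    "fire": {"customer_id": "customer_id", "name": "name", "gender": "gender"},
}
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def to_warehouse_format(row, source):
    """Rename fields and normalize encodings to one consistent format."""
    renamed = {FIELD_MAPS[source][key]: value for key, value in row.items()}
    renamed["gender"] = GENDER_CODES[renamed["gender"].lower()]
    return renamed


def integrate(sources):
    """Merge rows from all sources into one record per customer."""
    customers = {}
    for source, rows in sources.items():
        for row in rows:
            record = to_warehouse_format(row, source)
            customers.setdefault(record["customer_id"], {}).update(record)
    return customers


customers = integrate({"auto": auto_policy_rows, "fire": fire_policy_rows})
print(customers)
```

Both source rows end up as a single customer record under one set of field names, which is the sense in which the data is "stored once".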

Non-Volatile
Existing data in the warehouse is not overwritten or updated. This is logical because the purpose of a data warehouse is to enable you to analyze what has occurred.

[Figure: production applications update, insert, and delete data in the production databases; the data warehouse environment loads data from the production databases and external sources into the data warehouse database, which is read-only.]

OLTP vs. Data Warehouse


OLTP systems are tuned for known transactions and workloads, while the workload is not known in advance in a data warehouse. Special data organization, access methods, and implementation methods are needed to support data warehouse queries (typically multidimensional queries), e.g., the average amount spent on phone calls between 9AM and 5PM in Pune during the month of December.
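The phone-call query above can be sketched in Python as a filter over several dimensions (location, month, time of day) followed by an aggregate. The record layout, the function name `average_amount`, and the sample values are assumptions made for illustration.

```python
from datetime import datetime
from statistics import mean

# Hypothetical call records: (city, start time, amount charged).
calls = [
    ("Pune", datetime(2023, 12, 4, 10, 30), 12.0),
    ("Pune", datetime(2023, 12, 9, 16, 45), 8.0),
    ("Pune", datetime(2023, 12, 9, 21, 15), 30.0),   # outside 9AM-5PM
    ("Mumbai", datetime(2023, 12, 4, 11, 0), 20.0),  # other city
    ("Pune", datetime(2023, 11, 20, 9, 5), 50.0),    # other month
]


def average_amount(records, city, month, start_hour, end_hour):
    """Average amount over calls matching the location and time dimensions."""
    selected = [
        amount
        for rec_city, started, amount in records
        if rec_city == city
        and started.month == month
        and start_hour <= started.hour < end_hour
    ]
    return mean(selected) if selected else None


pune_december_daytime = average_amount(calls, "Pune", 12, 9, 17)
print(pune_december_daytime)
```

Each condition restricts one dimension, which is why such queries are called multidimensional; a warehouse has to answer them efficiently without knowing the combination of conditions in advance.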

OLTP                                    Data Warehouse (DSS)
Application oriented                    Subject oriented
Used to run the business                Used to analyze the business
Detailed data                           Summarized and refined data
Current, up-to-date data                Snapshot data
Isolated data                           Integrated data
Repetitive access                       Ad hoc access
Clerical user                           Knowledge user (manager)
Performance sensitive                   Performance relaxed
Few records accessed at a time (tens)   Large volumes accessed at a time (millions)
Read/update access                      Mostly read (batch update)
No data redundancy                      Redundancy present
Database size 100 MB to 100 GB          Database size 100 GB to a few terabytes
Metric: transaction throughput          Metric: query throughput
Thousands of users                      Hundreds of users
Managed in entirety                     Managed by subsets

To summarize ...
OLTP Systems are used to run a business

The Data Warehouse helps to optimize the business


Data Warehouse Architectures


Centralized

In a centralized architecture, there is only one data warehouse, which stores all the data necessary for business analysis. As already shown in the previous section, the disadvantage is lower performance compared with distributed approaches.

[Figure: Central Architecture]

Federated

In a federated architecture, the data is logically consolidated but stored in separate physical databases, at the same or at different physical sites. The local data marts store only the information relevant to a department. The amount of data is reduced in contrast to a central data warehouse, and the level of detail is enhanced.

[Figure: Federated Architecture]

Tiered

A tiered architecture is a distributed data approach. Integration cannot be done in one step because many sources have to be brought into the warehouse. On the first level, the data of all branches in one region is collected; on the second level, the data from the regions is integrated into one data warehouse.

Advantages: faster response time, because the data is located closer to the client applications, and a reduced volume of data to be searched.

[Figure: Tiered Architecture]

Complete Warehouse Solution Architecture

[Figure: complete warehouse solution architecture. Progression labels: Data, Information, Knowledge.]
Data Sources: legacy data, operational data, and external data sources (the figure shows The Post and VISA)
Extract, Transform, Load
Data Management: metadata, the enterprise data warehouse (organizationally structured), and data marts (departmentally structured) for Sales, Inventory, and Purchase
Access
Phases labeled in the figure: Asset Assembly (and Management), Asset Exploitation

Data Warehouse Architecture Components


Data Sources (disparate data sources):
Legacy data
Operational data
External data resources

Data Management:
Metadata: At all levels of the data warehouse, information is required to support the maintenance and use of the data warehouse.
Data Mart: A data mart is a subject-oriented data warehouse.

Introduction To Data Marts


What is a Data Mart
From the data warehouse, atomic data flows to various departments for their customized needs. If this data is periodically extracted from the data warehouse and loaded into a local database, it becomes a data mart. The data in a data mart has a different level of granularity than that of the data warehouse. Since the data in data marts is highly customized and lightly summarized, the departments can do whatever they want without worrying about resource utilization. The departments can also use the analytical software they find convenient. The cost of processing becomes very low.

Data Mart Overview

[Figure: the data warehouse feeds the data marts DM Sales, DM Marketing, DM HR, and DM Finance. User groups shown: Sales Representatives and Analysts; Human Resources; Financial Analysts, Strategic Planners, and Executives.]

Data marts satisfy 80% of the local end users' requests.

From The Data Warehouse To Data Marts

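[Figure: from the data warehouse to data marts. The organizationally structured data in the data warehouse has more history and is more normalized and detailed; the departmentally structured and individually structured information derived from it has less.]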

Operational Data Store (ODS)


What is an ODS
An Operational Data Store (ODS) integrates data from multiple business operation sources to address operational problems that span one or more business functions. An ODS has the following features:
Subject-oriented: Organized around major subjects of an organization (customer, product, etc.), not specific applications (order entry, accounts receivable, etc.).
Integrated: Presents an integrated image of subject-oriented data which is pulled from fragmented operational source systems.
Current: Contains a snapshot of the current content of legacy source systems. History is not kept, and might be moved to the data warehouse for analysis.
Volatile: Since ODS content is kept current, it changes frequently. Identical queries run at different times may yield different results.
Detailed: ODS data is generally more detailed than data warehouse data. Summary data is usually not stored in an ODS; the exact granularity depends on the subject that is being supported.

The ODS provides an integrated view of data in operational systems. As the figure below indicates, there is a clear separation between the ODS and the data warehouse.

[Figure: operational systems (A, B) feed the operational data store, which feeds the data warehouse; EIS, DSS, applications, and PCs access the data.]
Operational Data Store: current or near-current data; detailed data; updates allowed
Data Warehouse: historical data; summary and detail; non-volatile snapshots only

Benefits Of ODS
Supports the operational reporting needs of the organization.
Provides a complete view of customer relationships, the data for which might be stored in several operational databases. This data can include data from an organization's internal systems as well as external data from third-party vendors.
Operates as a store for detailed data, updated frequently and used for drill-downs from the data warehouse, which contains summary data.
Reduces the burden placed on other operational or data warehouse platforms by providing an additional data store for reporting.
Provides data that is more current than in a data warehouse and more integrated than in an OLTP system.
Feeds other operational systems in addition to the data warehouse.

Data Warehousing SCHEMAS & OBJECTS

A schema is a collection of database objects, including tables, views, indexes, and synonyms. There are a variety of ways of arranging schema objects in the schema models designed for data warehousing. They are:
Star Schema
Snowflake Schema
Galaxy Schema

Star Schema: Consists of a fact table connected to a set of dimension tables. The data in the dimension tables is denormalized.
Snowflake Schema: A refinement of the star schema in which some dimensional hierarchies are normalized into a set of dimension tables.
Galaxy Schema: Multiple fact tables share dimension tables. The result can be viewed as a collection of stars and is therefore called a galaxy schema.

Star Schema
A star schema is a highly denormalized, query-centric model in which information is broken into two groups: facts and dimensions.

[Figure: star schema with Sales_Fact at the center, linked to its dimension tables]
Sales_Fact: TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey, required data (business metrics, or measures), ...
Employee_Dim: EmployeeKey, EmployeeID, ...
Time_Dim: TimeKey, TheDate, ...
Shipper_Dim: ShipperKey, ShipperID, ...
Customer_Dim: CustomerKey, CustomerID, ...
Branch_Dim: BranchID, Branchno, ...
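A star schema query joins the fact table to whichever dimension tables it needs and aggregates a measure. The sketch below uses Python's built-in sqlite3 module with two of the dimension tables from the figure. The `SalesAmount` column and all row values are assumptions, since the slide only says "business metrics".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Time_Dim     (TimeKey INTEGER PRIMARY KEY, TheDate TEXT);
    CREATE TABLE Customer_Dim (CustomerKey INTEGER PRIMARY KEY, CustomerID TEXT);
    CREATE TABLE Sales_Fact (
        TimeKey     INTEGER REFERENCES Time_Dim(TimeKey),
        CustomerKey INTEGER REFERENCES Customer_Dim(CustomerKey),
        SalesAmount REAL  -- assumed measure, not named in the figure
    );
    INSERT INTO Time_Dim VALUES (1, '2003-01-15'), (2, '2003-01-16');
    INSERT INTO Customer_Dim VALUES (10, 'CUST-A'), (11, 'CUST-B');
    INSERT INTO Sales_Fact VALUES (1, 10, 100.0), (1, 11, 50.0), (2, 10, 25.0);
""")

# One join per dimension used by the query; the fact table supplies the measure.
sales_by_customer = conn.execute("""
    SELECT c.CustomerID, SUM(f.SalesAmount)
    FROM Sales_Fact AS f
    JOIN Customer_Dim AS c ON c.CustomerKey = f.CustomerKey
    GROUP BY c.CustomerID
    ORDER BY c.CustomerID
""").fetchall()
print(sales_by_customer)
```

Because every dimension is a single denormalized table, no query needs more than one join per dimension.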

Snowflake Schema
[Figure 32.2: snowflake schema with a fact table and normalized dimension tables]
Sales_fact (fact table): timeID {FK}, propertyID {FK}, branchID {FK}, clientID {FK}, promotionID {FK}, staffID {FK}, ownerID {FK}, offerPrice, sellingPrice, saleCommission, saleRevenue
Branch_Dim (dimension table): branchID {PK}, branchNo, branchType, city {FK}
City (dimension table): city {PK}, region {FK}
Region (dimension table): region {PK}, country
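In a snowflake schema, reaching a higher level of a hierarchy means following the chain of normalized dimension tables. The sketch below, again using sqlite3, walks from Sales_fact through Branch_Dim and City to Region. Only two of the fact table's columns are kept, and all row values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Region (region TEXT PRIMARY KEY, country TEXT);
    CREATE TABLE City   (city TEXT PRIMARY KEY,
                         region TEXT REFERENCES Region(region));
    CREATE TABLE Branch_Dim (
        branchID INTEGER PRIMARY KEY, branchNo TEXT, branchType TEXT,
        city TEXT REFERENCES City(city)
    );
    -- Reduced fact table: one foreign key and one measure.
    CREATE TABLE Sales_fact (
        branchID INTEGER REFERENCES Branch_Dim(branchID),
        saleRevenue REAL
    );
    INSERT INTO Region VALUES ('West', 'India'), ('South', 'India');
    INSERT INTO City VALUES ('Pune', 'West'), ('Mumbai', 'West'),
                            ('Chennai', 'South');
    INSERT INTO Branch_Dim VALUES (1, 'B001', 'Main', 'Pune'),
                                  (2, 'B002', 'Sub', 'Mumbai'),
                                  (3, 'B003', 'Main', 'Chennai');
    INSERT INTO Sales_fact VALUES (1, 100.0), (2, 200.0), (3, 50.0);
""")

# Three joins are needed to reach the region level of the branch hierarchy.
revenue_by_region = conn.execute("""
    SELECT r.region, SUM(f.saleRevenue)
    FROM Sales_fact AS f
    JOIN Branch_Dim AS b ON b.branchID = f.branchID
    JOIN City       AS c ON c.city = b.city
    JOIN Region     AS r ON r.region = c.region
    GROUP BY r.region
    ORDER BY r.region
""").fetchall()
print(revenue_by_region)
```

The extra joins are the price of normalization: region and country are stored once per city or region rather than repeated on every branch row.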

Galaxy Schema

Multiple groups of facts are linked by a few common dimensions.

[Figure: galaxy schema with three fact tables (Fact1, Fact2, Fact3) and seven dimension tables (Dimension1 to Dimension7), some of which are shared between fact tables]

Data Warehousing Objects


All three types of schemas are described in the Data Modeling section. The various objects used in data warehousing are:
Fact Tables
Dimension Tables
Hierarchies
Unique Identifiers
Relationships

Fact Tables:
Represent a business process, i.e., model the business process as an artifact in the data model.
Contain the measurements, metrics, or facts of business processes, e.g., the "monthly sales number" in the Sales business process.
Most facts are additive (sales this month), some are semi-additive (balance as of a date), and some are non-additive (unit price).
The level of detail is called the grain of the table.
Contain foreign keys for the dimension tables.

Fact Types:

Additive facts:
Additive facts are facts that can be summed up through all of the dimensions in the fact table.

Semi-additive facts:
Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not for others.

Non-additive facts:
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Examples to illustrate additive, semi-additive, and non-additive facts:

Fact table:

Date
Store
Product
Sales_Amount

The purpose of this table is to record the Sales_Amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because we can sum up this fact along any of the three dimensions present in the fact table: date, store, and product.
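The additive behavior of Sales_Amount can be checked with a few lines of Python. The rows below are invented for the example, and `total_by` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# Hypothetical rows of the fact table: (date, store, product, sales_amount).
sales_fact = [
    ("2003-01-15", "Store A", "Pen", 10.0),
    ("2003-01-15", "Store B", "Pen", 5.0),
    ("2003-01-16", "Store A", "Ink", 20.0),
]
DIMENSIONS = {"date": 0, "store": 1, "product": 2}


def total_by(dimension):
    """Sum Sales_Amount grouped by one dimension of the fact table."""
    totals = defaultdict(float)
    for row in sales_fact:
        totals[row[DIMENSIONS[dimension]]] += row[3]
    return dict(totals)


print(total_by("date"))
print(total_by("store"))
print(total_by("product"))
```

Every grouping yields a meaningful total, and each set of totals adds up to the same grand total, which is what makes the fact additive.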

Example of semi-additive and non-additive facts:

Fact table:

Date
Account
Current_Balance
Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts.
Current_Balance is a semi-additive fact: it makes sense to add balances up across all accounts (what is the total current balance for all accounts in the bank?), but it does not make sense to add them up through time.
Profit_Margin is a non-additive fact: it does not make sense to add margins up at the account level or at the day level.
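The difference between the two directions of aggregation for Current_Balance can be sketched as below. The sample balances and both helper functions are assumptions for illustration. Taking the latest snapshot is one common way to roll a balance up over time, not the only one.

```python
# Hypothetical end-of-day rows: (date, account, current_balance).
balance_fact = [
    ("2003-01-15", "ACC-1", 100.0),
    ("2003-01-15", "ACC-2", 300.0),
    ("2003-01-16", "ACC-1", 120.0),
    ("2003-01-16", "ACC-2", 300.0),
]


def total_balance_on(date):
    """Summing across accounts for a single day is meaningful."""
    return sum(balance for day, _, balance in balance_fact if day == date)


def closing_balance(account):
    """Across time, take the latest snapshot instead of summing."""
    rows = [(day, bal) for day, acct, bal in balance_fact if acct == account]
    return max(rows)[1]  # ISO-formatted dates sort chronologically


bank_total = total_balance_on("2003-01-16")
acc1_closing = closing_balance("ACC-1")
print(bank_total, acc1_closing)
```

Summing ACC-1 over both days would give 220.0, an amount the account never held, which is why the time dimension is excluded from addition.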

Types of Fact Tables:

Based on the above classifications, there are two types of fact tables:
Cumulative
Snapshot

Cumulative: This type of fact table describes what has happened over a period of time. For example, it may describe the total sales by product by store by day. The facts in this type of fact table are mostly additive. The first example is a cumulative fact table.
Snapshot: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts. The second example is a snapshot fact table.

Dimension Tables:
Define the business in terms already familiar to users
Wide rows with lots of descriptive text
Small tables (about a million rows)
Joined to the fact table by a foreign key
Heavily indexed
Typical dimensions: time periods, geographic regions (markets, cities), products, customers, salespersons, etc.

Dimension Table Types

Slowly Changing Dimensions
Junk Dimensions
Confirmed Dimensions
Degenerate Dimensions

Slowly Changing Dimensions (SCD):

Various data elements in a dimension undergo changes (e.g., changes in attributes or hierarchical structures) which need to be captured for analysis.

The SCD problem is a common one particular to data warehousing.

In a nutshell, it applies to cases where the attribute for a record varies over time. For example:

Customer Key   Name        State
1001           Christina   Illinois

Christina is a customer who first lived in Chicago, Illinois. At a later date, she moved to Los Angeles, California. How should the table be modified to reflect this change? This is a Slowly Changing Dimension problem.

Types of SCD
There are in general three ways to solve this type of problem, categorized as follows:
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added to the customer dimension table.
Type 3: The original record is modified to reflect the change.

TYPE 1:

The new record replaces the original record. No trace of the old record exists. For example:

Customer Key   Name        State
1001           Christina   Illinois

After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:

Customer Key   Name        State
1001           Christina   California

Advantages: This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages: All history is lost. With this methodology, it is not possible to trace back in history. In the above case, for example, the company would not be able to know that Christina lived in Illinois before.
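A Type 1 change is an in-place overwrite. The sketch below models the dimension as a Python dictionary keyed by customer key; the structure and the function name `apply_type1_change` are illustrative assumptions.

```python
# Customer dimension keyed by customer key; values mirror the table above.
customer_dim = {1001: {"name": "Christina", "state": "Illinois"}}


def apply_type1_change(dimension, customer_key, new_state):
    """Overwrite the attribute in place; the previous value is not kept."""
    dimension[customer_key]["state"] = new_state


apply_type1_change(customer_dim, 1001, "California")
print(customer_dim)
```

After the call, nothing in the dimension records that the state was ever Illinois.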

TYPE 2:

In a Type 2 SCD, a new record is added to the table to represent the new information. Therefore both the original and the new record are present. For example, after Christina moved from Illinois to California, we add the new information as a new row in the table:

Customer Key   Name        State
1001           Christina   Illinois
1005           Christina   California

Advantages: This allows us to accurately keep all historical information.
Disadvantages: This causes the size of the table to grow fast. Where the number of rows in the table is very high to start with, storage and performance can become a concern.
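A Type 2 change appends a row under a new customer key and leaves earlier rows untouched. The list-of-dictionaries layout and the function name are assumptions for this sketch. Production designs often add effective-date or current-row columns as well; the table above does not show them, so they are omitted here.

```python
# Customer dimension as a list of rows, one row per version of the customer.
customer_dim = [
    {"customer_key": 1001, "name": "Christina", "state": "Illinois"},
]


def apply_type2_change(dimension, name, new_state, new_key):
    """Append a new row with a new key; earlier rows are left untouched."""
    dimension.append(
        {"customer_key": new_key, "name": name, "state": new_state}
    )


apply_type2_change(customer_dim, "Christina", "California", new_key=1005)
states = [row["state"] for row in customer_dim if row["name"] == "Christina"]
print(states)
```

Fact rows recorded before the move keep pointing at key 1001, and later ones use 1005, so the history stays attached to the right version of the customer.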

TYPE 3:

In a Type 3 SCD, there are two columns for the particular attribute of interest, one holding the original value and one holding the current value. There is also a column that indicates when the current value became active. For example:

Customer Key   Name        Original State   Current State   Effective Date
1001           Christina   Illinois         California      15-Jan-03

After Christina moved from Illinois to California, the original record is updated, and we have the above table (assuming the effective date of the change is January 15, 2003).
Advantages: This does not increase the size of the table, since the new information is updated in place. It allows us to keep some part of the history.
Disadvantages: Type 3 cannot keep all history where an attribute changes more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information is lost.
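The Type 3 limitation shows up as soon as a second change is applied. In the sketch below, the column names follow the table above in lower case, and the function name is a hypothetical helper.

```python
customer_dim = {
    1001: {
        "name": "Christina",
        "original_state": "Illinois",
        "current_state": "Illinois",
        "effective_date": None,
    }
}


def apply_type3_change(dimension, customer_key, new_state, effective_date):
    """Update the current-value column; the original-value column stays."""
    row = dimension[customer_key]
    row["current_state"] = new_state
    row["effective_date"] = effective_date


apply_type3_change(customer_dim, 1001, "California", "2003-01-15")
apply_type3_change(customer_dim, 1001, "Texas", "2003-12-15")
print(customer_dim[1001])
```

After the second call the row holds Illinois and Texas only; the intermediate California value has been overwritten.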

Degenerate Dimension:
A degenerate dimension is a dimension which is derived from the fact table and doesn't have its own dimension table. Degenerate dimensions are often used when a fact table's grain represents transaction-level data and one wishes to keep system-specific identifiers, such as order numbers and invoice numbers, without forcing them into a dimension table of their own.

Confirmed Dimensions:
A dimension which is fixed and reusable. It is also called a fixed dimension. It is a dimension which does not change over time. Example: if the name of a city is changed from Bombay to Mumbai, the name will not keep changing from time to time; once the change is made, it is permanent. Dimensions of this type are called confirmed or fixed dimensions.

Junk Dimensions:
A dimension in which one can store miscellaneous transactional codes, flags, and text attributes that are not related to other dimensions, and which provides a simple way for users to find those unrelated attributes. Examples: Marital Status (Yes or No), Gender (M or F), etc.
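One way to build a junk dimension is to generate a row for every combination of the unrelated flags and give each combination a key. The attribute values follow the examples above; the key numbering and the helper `junk_key` are assumptions for this sketch.

```python
from itertools import product

# Low-cardinality attributes, following the examples above.
MARITAL_STATUS = ("Yes", "No")
GENDER = ("M", "F")

# One junk-dimension row per combination, each with its own key.
junk_dim = {
    key: {"marital_status": married, "gender": gender}
    for key, (married, gender) in enumerate(
        product(MARITAL_STATUS, GENDER), start=1
    )
}


def junk_key(married, gender):
    """Look up the key that a fact row would store for this combination."""
    for key, row in junk_dim.items():
        if row == {"marital_status": married, "gender": gender}:
            return key
    raise KeyError((married, gender))


print(len(junk_dim), junk_key("No", "F"))
```

A fact row then carries a single key instead of one column per flag.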

Definition Of Data Warehouse


Ralph Kimball's paradigm: The data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.

Bill Inmon's paradigm: The data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in third normal form.

Basic Design Approaches of Data Warehouse


There are two major approaches to building or designing a data warehouse:

The Top-Down Approach
The Bottom-Up Approach

The Top Down Approach

The Dependent Data Mart Structure, or Hub & Spoke

Inmon advocated a dependent data mart structure. The data flow in the top-down OLAP environment begins with data extraction from the operational data sources. This data is loaded into the staging area, validated and consolidated to ensure a level of accuracy, and then transferred to the Operational Data Store (ODS). Detailed data is regularly extracted from the ODS and temporarily hosted in the staging area for aggregation and summarization, and is then extracted and loaded into the data warehouse. Once the data warehouse aggregation and summarization processes are complete, the data mart refresh cycles extract the data from the data warehouse into the staging area and perform a new set of transformations on it. This organizes the data in the particular structures required by the data marts. The data marts can then be loaded with the data, and the OLAP environment becomes available to the users.

Inmon Approach

The data marts are treated as subsets of the data warehouse. Each data mart is built for an individual department and is optimized for the analysis needs of that department.
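The last step of the top-down flow, refreshing a dependent data mart from the warehouse, can be sketched as a filter plus a summarization. The warehouse rows, the choice of region as the summary level, and the function name are assumptions for illustration.

```python
# Hypothetical warehouse rows, already extracted, validated, and loaded.
warehouse = [
    {"department": "Sales", "region": "West", "amount": 100.0},
    {"department": "Sales", "region": "South", "amount": 40.0},
    {"department": "Sales", "region": "West", "amount": 10.0},
    {"department": "HR", "region": "West", "amount": 15.0},
]


def refresh_data_mart(warehouse_rows, department):
    """Build a dependent data mart: one department's rows, summed by region."""
    mart = {}
    for row in warehouse_rows:
        if row["department"] == department:
            mart[row["region"]] = mart.get(row["region"], 0.0) + row["amount"]
    return mart


sales_mart = refresh_data_mart(warehouse, "Sales")
print(sales_mart)
```

The mart never reads the operational sources directly; everything it holds is derived from the warehouse, which is what makes it dependent.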

The Bottom-Up Approach


The Data Warehouse Bus Structure

Ralph Kimball designed the data warehouse with the data marts connected to it through a bus structure. The bus structure contains all the common elements used by the data marts, such as conformed dimensions and measures defined for the enterprise as a whole. This architecture makes the data warehouse more of a virtual reality than a physical reality. All data marts could be located on one server or on different servers across the enterprise, while the data warehouse would be a virtual entity, nothing more than the sum total of all the data marts.

In this context, even the cubes constructed using OLAP tools could be considered data marts.
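The idea of a warehouse that is "the sum total of all the data marts" can be sketched by combining marts on a shared dimension key. The two marts, their measures, and the product keys are invented for the example; `virtual_warehouse` is a hypothetical helper.

```python
# Hypothetical data marts that share a common product key.
sales_mart = {"P1": {"units_sold": 30}, "P2": {"units_sold": 5}}
inventory_mart = {"P1": {"units_in_stock": 70}, "P3": {"units_in_stock": 12}}


def virtual_warehouse(*marts):
    """Combine the marts on their shared key into one enterprise-wide view."""
    combined = {}
    for mart in marts:
        for key, measures in mart.items():
            combined.setdefault(key, {}).update(measures)
    return combined


enterprise_view = virtual_warehouse(sales_mart, inventory_mart)
print(enterprise_view)
```

The combination only works because both marts identify products the same way, which is the role the shared elements on the bus play.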

Kimball Approach

The bottom-up approach reverses the positions of the data warehouse and the data marts. Data marts are loaded directly with data from the operational systems through the staging area. The data flow in the bottom-up approach starts with the extraction of data from operational databases into the staging area, where it is processed and consolidated and then loaded into the ODS.

The data in the ODS is appended to or replaced by the fresh data being loaded. After the ODS is refreshed, the current data is once again extracted into the staging area and processed to fit into the data mart structure. The data from the data marts is then extracted to the staging area, aggregated, summarized, and so on, loaded into the data warehouse, and made available to the end user for analysis.

Business Intelligence Components

[Figure: business intelligence components. Operational data is extracted, transformed, and loaded into the data warehouse, which supports OLAP and data mining.]

Business Intelligence Architecture


Business Intelligence Technologies


[Figure: layers of business intelligence technologies, with increasing potential to support business decisions toward the top]
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: OLAP, DSS, EIS, Querying and Reporting
Data Warehouses / Data Marts
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DB Admin
