
Data Warehouse

Data Warehouse
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse there will be only a single way of identifying a product.

Data Warehouse Contd


Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.

Ralph Kimball provided a more concise definition of a data warehouse: A data warehouse is a copy of transaction data specifically structured for query and analysis.

ETL
ETL stands for Extract, Transform and Load. It is the method or technology used to implement a data warehouse.

Extract: Extract data from the source systems.
Transform: Transform the data as per the business requirements.
Load: Load the data into the data warehouse.
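The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the `sales` table, the dates and the transformation rule (title-case names, numeric amounts) are all hypothetical and do not come from any particular ETL tool.

```python
import sqlite3

# Hypothetical source rows: (product name, amount as text, sale date)
source_rows = [
    ("soap", "15.00", "2023-12-02"),
    ("powder", "55.00", "2023-12-02"),
]

def extract():
    """Extract: read rows from the source (here, an in-memory list)."""
    return list(source_rows)

def transform(rows):
    """Transform: apply assumed business rules (title-case names, numeric amounts)."""
    return [(name.title(), float(amount), day) for name, amount, day in rows]

def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (item TEXT, amount REAL, sale_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(conn, transform(extract()))
loaded = conn.execute("SELECT item, amount FROM sales ORDER BY item").fetchall()
print(loaded)
```

Keeping each step as its own function mirrors the Extract / Transform / Load split described above.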

OLTP and OLAP systems


Difference between OLTP systems and a data warehouse

OLTP: Stands for online transaction processing and is used mainly for transactions. An OLTP system is characterized by a large number of short online transactions (INSERT, UPDATE, DELETE). The main emphasis of OLTP systems is on very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second. An OLTP database holds detailed and current data, and the schema used to store transactional data is the entity model (usually 3NF).

OLAP
Online analytical processing. These systems are implemented where fast data retrieval matters more than fast data entry. An OLAP system is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used in data mining. An OLAP database holds aggregated, historical data stored in multi-dimensional schemas (usually a star schema).

Difference between OLTP and OLAP

Source of data
OLTP: Operational data; OLTPs are the original source of the data.
OLAP: Consolidation data; OLAP data comes from the various OLTP databases.

Purpose of data
OLTP: To control and run fundamental business tasks.
OLAP: To help with planning, problem solving, and decision support.

What the data reveals
OLTP: A snapshot of ongoing business processes.
OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and Updates
OLTP: Short and fast inserts and updates initiated by end users.
OLAP: Periodic long-running batch jobs refresh the data.

Queries
OLTP: Relatively standardized and simple queries returning relatively few records.
OLAP: Often complex queries involving aggregations.

Processing Speed
OLTP: Typically very fast.
OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

Space Requirements
OLTP: Can be relatively small if historical data is archived.
OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP.

Database Design
OLTP: Highly normalized with many tables.
OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas.

Backup and Recovery
OLTP: Backup religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.
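The contrast between the two workloads can be shown with a small sketch. It assumes a hypothetical `orders` table in an in-memory SQLite database; a real OLAP system would normally run on a separate, differently designed database rather than on the OLTP tables themselves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# OLTP-style work: short transactions that each touch a single row.
with conn:
    conn.execute("INSERT INTO orders (region, amount) VALUES ('North', 100.0)")
    conn.execute("INSERT INTO orders (region, amount) VALUES ('North', 50.0)")
    conn.execute("INSERT INTO orders (region, amount) VALUES ('South', 70.0)")
with conn:
    conn.execute("UPDATE orders SET amount = 80.0 WHERE order_id = 3")

# OLAP-style work: one query that scans many rows and aggregates them.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```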

Data Warehousing concepts


Dimensional Modelling
Fact and Dimension tables
Star schema
Snowflake schema
Fact constellation schema
Types of Dimension: Junk dimension, Conformed dimension
Slowly Changing Dimension (SCD)

Dimensional Modelling

Dimensional modelling (DM) is a design technique for databases intended to support end-user queries in a data warehouse. It is oriented around understandability and performance.

Dimensional modeling always uses the concepts of facts (measures), and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts.

Fact and Dimension tables

Fact table: A fact table is the table that contains the measures of interest, i.e. business facts.

Example: sales amount for a product by store and day.

A fact table mostly contains additive values, which can be aggregated to provide figures that help in taking business decisions.
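As a rough illustration of why additive values matter, the sketch below sums a sales amount along different dimensions. The fact rows, store names and the helper `total_by` are hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, store, day, sales_amount)
sales_fact = [
    ("Soap", "Store A", "02-Dec", 15.0),
    ("Powder", "Store A", "02-Dec", 55.0),
    ("Soap", "Store B", "02-Dec", 15.0),
    ("Soap", "Store A", "03-Dec", 30.0),
]

def total_by(rows, column):
    """Sum the additive measure (index 3) grouped by the given column index."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[column]] += row[3]
    return dict(totals)

sales_by_product = total_by(sales_fact, 0)
sales_by_store = total_by(sales_fact, 1)
print(sales_by_product, sales_by_store)
```

The same measure can be rolled up by product, store or day because adding the amounts remains meaningful along every dimension.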

Dimension Table
A dimension is a category of information, for example the Product dimension, Store dimension or Time dimension.

Attribute: A unique level within a dimension, for example product category in the Product dimension or month in the Time dimension.

Types of Dimensional Modelling

Star schema
Snowflake schema
Fact constellation schema

Star Schema
In a star schema, a single fact table is surrounded by multiple dimension tables.
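A minimal star schema might look like the sketch below, using SQLite from Python. The table names (`fact_sales`, `dim_product`, `dim_store`), columns and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per product and per store.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, item TEXT, category TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT);

-- Fact table at the centre, holding keys to each dimension plus the measure.
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    sales_amount REAL
);

INSERT INTO dim_product VALUES (1, 'Soap', 'Toiletries'), (2, 'Powder', 'Laundry');
INSERT INTO dim_store   VALUES (1, 'Pune'), (2, 'Delhi');
INSERT INTO fact_sales  VALUES (1, 1, 15.0), (2, 1, 55.0), (1, 2, 15.0);
""")

# Each dimension is reached from the fact table with a single join.
sales_by_city = conn.execute("""
    SELECT s.city, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_store s ON s.store_key = f.store_key
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
print(sales_by_city)
```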

Snow Flake Schema


A snowflake schema applies normalization over a star schema, in which very large dimension tables are normalized into multiple tables. Dimensions with hierarchies can be decomposed into a snowflake structure when you want to avoid joins to big dimension tables when you are using an aggregate of the fact table.
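A sketch of the same idea in SQLite: the product dimension is normalized so that the category lives in its own sub-dimension table. All table names, columns and rows here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The category is split out of the product dimension into a sub-dimension.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    item         TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    sales_amount REAL
);

INSERT INTO dim_category VALUES (1, 'Toiletries'), (2, 'Laundry');
INSERT INTO dim_product  VALUES (1, 'Soap', 1), (2, 'Shampoo', 1), (3, 'Powder', 2);
INSERT INTO fact_sales   VALUES (1, 15.0), (2, 40.0), (3, 55.0);
""")

# Reaching the category now takes two joins instead of one.
sales_by_category = conn.execute("""
    SELECT c.category, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_key  = f.product_key
    JOIN dim_category c ON c.category_key = p.category_key
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()
print(sales_by_category)
```

The category name is stored once per category rather than once per product, at the cost of an extra join in the query.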

Advantages and Disadvantages


Normalizing dimension tables tends to increase the number of dimension and sub-dimension tables, which requires more foreign key joins when querying the data and therefore reduces query performance.

A snowflake schema helps in saving space by normalizing dimension tables.

Queries against a snowflake schema are more complex than queries against a star schema because of the multiple joins from dimension tables to sub-dimension tables.

A snowflake schema is more difficult for business users, because they have to work with more tables than in a star schema.

Creating aggregate tables and joining them to the required dimension tables improves performance by reducing execution time.

Fact Constellation schema


A fact constellation schema contains multiple fact tables that share the same dimension tables.

A fact constellation schema can be used between aggregated fact tables, or when a complex fact table is decomposed into independent, simpler fact tables.

A conformed dimension is a dimension with a common structure that is shared across the various fact tables in a data warehouse. Conformed dimensions are used to avoid redundant data in the data warehouse.

Slowly changing Dimensions


SCD Type 1: The new record replaces the old record. No trace of the old record exists.

Before the change:

Product ID   Item     Price   Load_date   Update_date
1            Soap     15      02-Dec      02-Dec
2            Powder   55      02-Dec      02-Dec

After the change:

Product ID   Item     Price   Load_date   Update_date
1            Soap     20      02-Dec      05-Dec
2            Powder   55      02-Dec      02-Dec
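A minimal sketch of a Type 1 change in Python, using the Soap price change from the slide. The function name `scd_type1_update` and the list-of-dicts representation are assumptions made for illustration.

```python
def scd_type1_update(dimension, product_id, new_price, update_date):
    """Overwrite the price in place; the previous value is not kept anywhere."""
    for row in dimension:
        if row["product_id"] == product_id:
            row["price"] = new_price
            row["update_date"] = update_date
            return
    raise KeyError(product_id)

product_dim = [
    {"product_id": 1, "item": "Soap", "price": 15,
     "load_date": "02-Dec", "update_date": "02-Dec"},
    {"product_id": 2, "item": "Powder", "price": 55,
     "load_date": "02-Dec", "update_date": "02-Dec"},
]

# The price of Soap changes from 15 to 20 on 05-Dec.
scd_type1_update(product_dim, 1, 20, "05-Dec")
print(product_dim[0])
```

The row count stays the same and the old price of 15 can no longer be recovered from the dimension.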

Slowly Changing Dimension


SCD Type 2: A new record is added to the table, so both the old and the new record exist. The latest record can be tracked in various ways.

1) Effective and end date concept.

Before the change:

Product ID   Item     Price   Eff_dt   End_Dt
1            Soap     15      02-Dec   NULL
2            Powder   55      02-Dec   NULL

After the change:

Product ID   Item     Price   Eff_dt   End_Dt
1            Soap     15      02-Dec   05-Dec
3            Soap     20      05-Dec   NULL
2            Powder   55      02-Dec   NULL
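The effective/end date approach can be sketched as follows. The function name `scd_type2_update` is hypothetical, and the sketch assumes that the item name acts as the natural key while Product ID is generated by the warehouse.

```python
def scd_type2_update(dimension, item, new_price, change_date):
    """Close the current row for the item and append a new current row."""
    current = next(
        row for row in dimension if row["item"] == item and row["end_dt"] is None
    )
    current["end_dt"] = change_date  # the old record stays, but is closed
    new_id = max(row["product_id"] for row in dimension) + 1
    dimension.append(
        {"product_id": new_id, "item": item, "price": new_price,
         "eff_dt": change_date, "end_dt": None}
    )

product_dim = [
    {"product_id": 1, "item": "Soap", "price": 15, "eff_dt": "02-Dec", "end_dt": None},
    {"product_id": 2, "item": "Powder", "price": 55, "eff_dt": "02-Dec", "end_dt": None},
]

scd_type2_update(product_dim, "Soap", 20, "05-Dec")

# Current records are the ones whose end date is still empty (NULL).
current_rows = [row for row in product_dim if row["end_dt"] is None]
print(current_rows)
```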

Slowly Changing Dimension Cont


SCD Type 2: Versioning

Before the change:

Product ID   Item     Price   Version
1            Soap     15      1
2            Powder   55      1

After the change:

Product ID   Item     Price   Version
1            Soap     15      1
3            Soap     20      2
2            Powder   55      1
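Versioning can be sketched in the same style. The function names are hypothetical, and the item name is again assumed to be the natural key.

```python
def scd_type2_new_version(dimension, item, new_price):
    """Append a new row for the item with the next version number."""
    latest = max(
        (row for row in dimension if row["item"] == item),
        key=lambda row: row["version"],
    )
    new_id = max(row["product_id"] for row in dimension) + 1
    dimension.append(
        {"product_id": new_id, "item": item, "price": new_price,
         "version": latest["version"] + 1}
    )

def latest_version(dimension, item):
    """Return the row with the highest version number for the item."""
    return max(
        (row for row in dimension if row["item"] == item),
        key=lambda row: row["version"],
    )

product_dim = [
    {"product_id": 1, "item": "Soap", "price": 15, "version": 1},
    {"product_id": 2, "item": "Powder", "price": 55, "version": 1},
]

scd_type2_new_version(product_dim, "Soap", 20)
print(latest_version(product_dim, "Soap"))
```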

Slowly Changing Dimension Cont


SCD Type 3

Junk Dimension
In data warehouse design, we frequently run into a situation where there are yes/no indicator fields in the source system. If we keep all those indicator fields in the fact table, not only do we need to build many small dimension tables, but the amount of information stored in the fact table also increases tremendously, leading to possible performance issues. A junk dimension combines these indicator fields into a single dimension.

Fact table columns: Customer_id, prd_id, prepay_ind, coupon_ind

Junk dimension:

Junk_ind   prepay_ind   coupon_ind
1          Y            Y
2          Y            N
3          N            N
4          N            Y
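One way to picture the junk dimension in code: each combination of indicator values gets a single key, and the fact rows store that key instead of the individual flags. The key values follow the table above; the source rows and variable names are hypothetical.

```python
# Junk dimension from the slide: junk_ind -> (prepay_ind, coupon_ind)
junk_dim = {
    1: ("Y", "Y"),
    2: ("Y", "N"),
    3: ("N", "N"),
    4: ("N", "Y"),
}

# Reverse lookup: indicator combination -> junk_ind
junk_lookup = {flags: key for key, flags in junk_dim.items()}

# Hypothetical source rows that still carry the individual indicator fields.
source_rows = [
    {"customer_id": 101, "prd_id": 1, "prepay_ind": "Y", "coupon_ind": "N"},
    {"customer_id": 102, "prd_id": 2, "prepay_ind": "N", "coupon_ind": "Y"},
]

# Fact rows keep a single junk_ind key instead of one column per flag.
fact_rows = [
    {
        "customer_id": row["customer_id"],
        "prd_id": row["prd_id"],
        "junk_ind": junk_lookup[(row["prepay_ind"], row["coupon_ind"])],
    }
    for row in source_rows
]
print(fact_rows)
```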

Types of Fact Tables


1) Additive fact
2) Semi-additive fact
3) Non-additive fact

Surrogate key
A surrogate key is a primary key generated by the data warehouse (DWH) that uniquely identifies a record in the DWH.

Why a surrogate key is implemented:
1) When data is loaded from multiple sources.
2) When history needs to be maintained and the primary key from the source would violate the primary key constraint in the DWH.
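The first reason can be illustrated with a small sketch: two sources may reuse the same primary key value, so the warehouse assigns its own key per (source, natural key) pair. The class `SurrogateKeyGenerator` and the source names are hypothetical.

```python
import itertools

class SurrogateKeyGenerator:
    """Hand out one warehouse-generated key per (source, natural key) pair."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._assigned = {}

    def key_for(self, source, natural_key):
        pair = (source, natural_key)
        if pair not in self._assigned:
            self._assigned[pair] = next(self._counter)
        return self._assigned[pair]

keys = SurrogateKeyGenerator()

# Both sources use product id 1 for different records.
key_a = keys.key_for("source_a", 1)
key_b = keys.key_for("source_b", 1)

# Asking again for the same source record returns the same surrogate key.
key_a_again = keys.key_for("source_a", 1)
print(key_a, key_b, key_a_again)
```

For the second reason (history), a warehouse would instead issue a fresh key for every new version of a record, as in the SCD Type 2 tables where Product ID 3 is a second row for Soap.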

Data Mart
A data warehouse incorporates information about many subject areas, often the entire enterprise, while a data mart focuses on one or more subject areas. A data mart represents only a portion of an enterprise's data, perhaps data related to a business unit or work group.

Typically, a data mart's data is targeted to a smaller audience of end users or used to present information on a smaller scope.

Thank You
