
Data warehouse- introduction

A data warehouse is simply a single, complete, and consistent store of data obtained from a variety
of sources and made available to end users in a way they can understand and use it in a business
context. -- Barry Devlin, IBM Consultant
A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in
support of an organization's decision-making process.
A data warehouse is a collection of corporate information derived directly from operational systems
and some external data sources. It is a relational database designed for query and analysis rather than
for transaction processing.
Extraction, Transformation and Loading
To serve its purpose of facilitating business analysis, a data warehouse system must be loaded regularly.
To do this, data from one or more operational systems must be extracted and copied into the
warehouse. The process of extracting data from source systems and bringing it into the data
warehouse is commonly called ETL, which stands for extraction, transformation, and loading.

Extraction: select data using different methods.
Transformation: validate, clean, integrate, and time-stamp data.
Loading: move data into the warehouse.
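The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source lists, the `sales` table, and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3
from datetime import date

# Hypothetical rows as they might arrive from two operational systems.
SOURCE_A = [{"cust": " alice ", "amount": "120.50"}, {"cust": "Bob", "amount": "n/a"}]
SOURCE_B = [{"cust": "carol", "amount": "75"}]


def extract(*sources):
    """Extraction: select rows from every source system."""
    for source in sources:
        yield from source


def transform(rows, load_date):
    """Transformation: validate, clean, integrate, and time-stamp rows."""
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # validation: skip rows whose amount is not numeric
        # cleaning: trim and normalize the name; time stamp: attach the load date
        yield (row["cust"].strip().title(), amount, load_date.isoformat())


def load(conn, rows):
    """Loading: move the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, load_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_A, SOURCE_B), date(2024, 1, 31)))
loaded = warehouse.execute(
    "SELECT customer, amount, load_date FROM sales ORDER BY customer"
).fetchall()
```

Here `loaded` holds the two valid rows (Alice and Carol), each stamped with the load date; Bob's row is rejected during validation.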

Data warehouse- an Environment, not a product


A data warehouse is not a single software or hardware product you purchase to provide strategic
information. Rather, it is a computing environment where users can find strategic information, an
environment where users are put directly in touch with the data they need to make better decisions.
It is a user-centric environment.

An ideal environment for data analysis and decision support


Flexible and interactive.
100% user-driven.
Very responsive and conducive to the ask-answer-ask-again pattern.
Provides the ability to discover answers to complex, unpredictable questions.

Data warehouse- a Blend of many technologies


A data warehouse is a blend of many technologies, all of which work together.
The end result is the creation of a new computing environment for the purpose
of providing the strategic information every organization needs.

Take all the data from the operational systems.
Integrate all the data from the various sources.
Remove inconsistencies and transform the data.
Store the data in formats suitable for easy access for decision making.

Features of data warehouse

Subject Oriented:
Data is categorized and stored by business subject rather than by application.
Integrated:
When data resides in many separate applications in the operational environment, the encoding of data is
often inconsistent. For instance, in one application gender might be coded as male and female, and in
another as 0 and 1. The warehouse stores such data in a single consistent encoding.
Time variant:
Data is stored as a series of snapshots, each representing a period of time. A data warehouse contains a
place for storing data that are 8 to 10 years old, or older, to be used for comparisons, trends, and forecasting.

Non-Volatile:

Typically, data in the data warehouse is not updated or deleted. Once data enter the warehouse they are
not changed in any way; they are only loaded and accessed.
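A small sketch may make three of these features concrete: integration (one encoding for gender), time variance (dated snapshots), and non-volatility (append only). The source names `crm` and `billing`, the mapping of 0 and 1 to a gender, and the `Warehouse` class are all assumptions made for illustration.

```python
from datetime import date

# Assumed per-application encodings, mapped to one warehouse-wide code.
GENDER_MAPS = {
    "crm": {"male": "M", "female": "F"},
    "billing": {0: "M", 1: "F"},  # assumption: 0 means male, 1 means female
}


def integrate(source, record):
    """Integrated: translate a source-specific gender code to the common code."""
    return {
        "customer": record["customer"],
        "gender": GENDER_MAPS[source][record["gender"]],
    }


class Warehouse:
    """Append-only store of dated snapshots (time variant and non-volatile)."""

    def __init__(self):
        self._snapshots = []

    def load(self, snapshot_date, rows):
        # Snapshots are only ever appended; no update or delete is offered.
        self._snapshots.append((snapshot_date, tuple(rows)))

    def as_of(self, snapshot_date):
        """Return the most recent snapshot taken on or before the given date."""
        eligible = [s for s in self._snapshots if s[0] <= snapshot_date]
        return max(eligible, key=lambda s: s[0])[1] if eligible else ()


alice = integrate("crm", {"customer": "Alice", "gender": "female"})
bob = integrate("billing", {"customer": "Bob", "gender": 0})

wh = Warehouse()
wh.load(date(2023, 12, 31), [alice])
wh.load(date(2024, 12, 31), [alice, bob])
mid_2024 = wh.as_of(date(2024, 6, 1))  # still the 2023 snapshot
```

Because older snapshots are never overwritten, a query "as of" an earlier date keeps returning the same answer, which is what makes comparisons over time possible.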

Advantages of data warehouse

High query performance
    But not necessarily most current information
Doesn't interfere with local processing at sources
    Complex queries at warehouse
    OLTP at information sources
Information copied at warehouse
    Can modify, annotate, summarize, restructure, etc.
    Can store historical information
    Security, no auditing

Disadvantages of data warehouse

May rely too heavily on data generated only from TPS

May complicate business processes by institutionalising reports, data for data's sake

Learning curve too long - technical and business aspects

Culture of developing quick and dirty strategic applications

End-users may not have skills for building queries

Availability of data warehousing skills

Data warehouses require high maintenance

Cost of information may outweigh its benefit

Need for data warehouse


Real-time issues: your current systems aren't able to integrate
disparate sources of data and keep historical records of those integrations in
near real time.
Scalability issues: you have large volumes of historical data that you need to gather into
an easily accessible place with common formats, common keys, and common
access methods, and you need to ensure that the system is scalable over the
next 3 to 5 years.
Avoidance of siloed solution sets: you have many different or
disparate solutions already in existence, yet your corporation is unable to answer
common questions requiring consistency across your enterprise.
Enterprise-class system of record: if you need a system of record across historical and integrated data
sets, you probably need an enterprise data
warehouse.
Disparate source systems along with internal and external data sets:
if you need to integrate all of these for a single enterprise vision with history,
then you need a data warehouse.
Self-service BI: if you eventually need to reach this goal, where
users can visualize and construct their own reports, then you probably need an
enterprise data warehouse, along with its highly integrated historical facts from
all the different sources in your organization.

Consolidation of information resources


Improved query performance
Separate research and decision support functions from the operational systems
Foundation for data mining, data visualization, advanced reporting and OLAP tools
The operational DBMS designs are inadequate for decision support.
Data warehousing separates analytical processing from operational processing by providing a
separate architecture for decision-support implementation.
A data warehouse enables analysis of current business trends and helps in decision making.
Ref:-http://danlinstedt.com/datavaultcat/why-when-datawarehousing-is-it-relevant/

2. Architecture of Data warehouse


The structure that brings all the components of a data warehouse together is known as the
architecture.
Data Warehouse Architectures: Conceptual View

Single-layer
    Every data element is stored once only
    Virtual warehouse
Two-layer
    Real-time + derived data
    Most commonly used approach in industry today
Three-layer
    Transformation of real-time data to derived data really requires two steps

Four views regarding the design of a data warehouse


Top-down view: allows selection of the relevant information necessary for the data warehouse.
Data source view: exposes the information being captured, stored, and managed by operational systems.
Data warehouse view: consists of fact tables and dimension tables.
Business query view: presents the data in the warehouse from the end user's perspective.

1) Data sources:
The data source layer comprises all the places where data is stored; the data is spread across many
different systems. Data from operational databases and external sources is extracted using application
program interfaces known as gateways.
2) Data storage:
The data storage layer of the architecture is the data warehouse database server, the bottom tier. It is a
relational database system. Back-end tools and utilities feed data into
the bottom tier. These back-end tools and utilities perform the extract, clean, transform, load,
and refresh functions.

Data extraction: get data from multiple, heterogeneous, and external sources
Data cleaning: detect errors in the data and rectify them when possible
Data transformation: convert data from legacy or host format to warehouse format
Load: sort, summarize, consolidate, compute views, check integrity, and build indices and
partitions
Refresh: propagate the updates from the data sources to the warehouse
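The refresh function is the least self-explanatory of these, so here is a minimal sketch of one common way to do it: an incremental refresh driven by a "last modified" watermark. The `modified` field, the `refresh` helper, and the choice to append changed rows rather than overwrite them are assumptions for illustration; real back-end tools may detect changes differently.

```python
from datetime import datetime


def refresh(warehouse_rows, source_rows, last_refresh):
    """Append source rows modified after last_refresh; return the new watermark."""
    changed = [r for r in source_rows if r["modified"] > last_refresh]
    warehouse_rows.extend(changed)
    # If nothing changed, the watermark stays where it was.
    return max((r["modified"] for r in changed), default=last_refresh)


# Hypothetical operational rows, each carrying a modification timestamp.
source = [
    {"id": 1, "amount": 100, "modified": datetime(2024, 1, 1)},
    {"id": 2, "amount": 250, "modified": datetime(2024, 1, 5)},
]
warehouse_rows = []

# First refresh picks up everything newer than the initial watermark.
watermark = refresh(warehouse_rows, source, datetime(2023, 12, 31))

# A new row appears at the source; the next refresh copies only that row.
source.append({"id": 3, "amount": 75, "modified": datetime(2024, 1, 9)})
watermark = refresh(warehouse_rows, source, watermark)
```

Tracking a watermark keeps each refresh proportional to the amount of change at the source rather than to the full size of the source tables.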
Metadata is the data defining warehouse objects. It stores:
Description of the structure of the data warehouse
    schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
Operational metadata
    data lineage (history of migrated data and transformation path), currency of data (active,
    archived, or purged), monitoring information (warehouse usage statistics, error reports, audit
    trails)
The algorithms used for summarization
The mapping from operational environment to the data warehouse
Data related to system performance
    warehouse schema, view and derived data definitions
Business data
    business terms and definitions, ownership of data, charging policies
Data Mart
A subset of corporate-wide data that is of value to a specific group of users. Its scope is
confined to specific, selected groups, such as a marketing data mart.
A data mart is either independent (sourced from operational systems or external providers) or
dependent (sourced directly from the warehouse).

3) OLAP server
Relational OLAP (ROLAP)
    Uses a relational or extended-relational DBMS to store and manage warehouse data, plus OLAP
    middleware
    Includes optimization of the DBMS back end, implementation of aggregation navigation logic, and
    additional tools and services
    Greater scalability
Multidimensional OLAP (MOLAP)
    Sparse array-based multidimensional storage engine
    Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
    Flexibility, e.g., low level: relational, high level: array
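To illustrate the two MOLAP points (sparse array-based storage and fast access to pre-computed summaries), the sketch below stores only non-empty cells in a dictionary and pre-computes every rolled-up total. The dimension order `(item, city, quarter)`, the `ALL` marker, and the sample cells are hypothetical; production MOLAP engines use far more compact structures.

```python
from collections import defaultdict

ALL = "*"  # marker meaning "this dimension has been aggregated away"

# Sparse cube: only non-empty cells are stored, keyed by (item, city, quarter).
cells = {
    ("phone", "Vancouver", "Q1"): 10,
    ("phone", "Toronto", "Q1"): 4,
    ("laptop", "Vancouver", "Q2"): 7,
}


def precompute(cells):
    """Pre-compute the total for every combination of rolled-up dimensions."""
    summary = defaultdict(int)
    for key, value in cells.items():
        n = len(key)
        # Each bit of mask decides whether one dimension is replaced by ALL.
        for mask in range(2 ** n):
            rolled = tuple(ALL if (mask >> i) & 1 else key[i] for i in range(n))
            summary[rolled] += value
    return dict(summary)


summary = precompute(cells)

# Answering a summary query is now a single dictionary lookup.
phone_total = summary[("phone", ALL, ALL)]
vancouver_total = summary[(ALL, "Vancouver", ALL)]
grand_total = summary[(ALL, ALL, ALL)]
```

The trade-off is the usual MOLAP one: queries become lookups, but the number of stored aggregates grows quickly with the number of dimensions.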

4) Front-end tools: This tier is the front-end client layer. It holds the query
tools, reporting tools, analysis tools, and data mining tools.

Data Warehouse Models

From the perspective of data warehouse architecture, we have the following data
warehouse models:
Virtual Warehouse
Data mart
Enterprise Warehouse
Virtual Warehouse
A set of views over operational databases is known as a virtual warehouse. It is
easy to build a virtual warehouse, but doing so requires excess
capacity on operational database servers.
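A virtual warehouse can be sketched with an ordinary SQL view. In the example below an in-memory SQLite database plays the operational system; the `orders` table and the `sales_by_region` view are hypothetical names.

```python
import sqlite3

ops = sqlite3.connect(":memory:")  # stands in for the operational database
ops.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders (region, amount) VALUES ('east', 100.0), ('east', 50.0), ('west', 70.0);

-- The "virtual warehouse": a view, so no data is copied or stored separately.
CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
""")

QUERY = "SELECT region, total FROM sales_by_region ORDER BY region"
before = ops.execute(QUERY).fetchall()

# A new operational row is visible through the view immediately.
ops.execute("INSERT INTO orders (region, amount) VALUES ('west', 30.0)")
after = ops.execute(QUERY).fetchall()
```

Every query against the view is executed on the operational server itself, which is why spare capacity there is a prerequisite.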
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is
valuable to specific groups of an organization.
In other words, we can claim that data marts contain data specific to a particular
group. For example, the marketing data mart may contain data related to items,
customers, and sales. Data marts are confined to subjects.
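As a rough illustration, a dependent data mart can be built by copying one subject's data out of the warehouse. The `warehouse_facts` table, its `subject` column, and the sample rows are assumptions made only for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical enterprise-wide table covering several subjects.
CREATE TABLE warehouse_facts (subject TEXT, item TEXT, customer TEXT, amount REAL);
INSERT INTO warehouse_facts VALUES
    ('sales', 'laptop', 'Alice', 1200.0),
    ('sales', 'phone', 'Bob', 650.0),
    ('payroll', NULL, NULL, 5000.0);

-- Dependent data mart: populated directly from the warehouse and
-- confined to the items, customers, and sales that marketing needs.
CREATE TABLE marketing_mart AS
    SELECT item, customer, amount FROM warehouse_facts WHERE subject = 'sales';
""")

mart_rows = conn.execute(
    "SELECT item, customer, amount FROM marketing_mart ORDER BY item"
).fetchall()
```

The payroll row never reaches the mart, so marketing users see only the subject area relevant to them.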

Enterprise Warehouse
An enterprise warehouse collects all the information and the subjects spanning an
entire organization.
It provides enterprise-wide data integration.
The data is integrated from operational systems and external information
providers.
The volume of this information can vary from a few gigabytes to hundreds of gigabytes, terabytes,
or beyond.
Ref:-http://www.tutorialspoint.com/dwh/dwh_architecture.htm

3. Multidimensional Data Model (MDDM)

INTRODUCTION OF MDDM
The multidimensional data model is an integral part of On-Line Analytical
Processing, or OLAP. Because OLAP is on-line, it must provide answers quickly;
analysts pose iterative queries during interactive sessions, not in batch jobs that run
overnight. And because OLAP is also analytic, the queries are complex. The
multidimensional data model is designed to solve complex queries in real time. The
multidimensional data model is important because it enforces simplicity.
Multi-dimensional Data Models


Classical relations:
One-dimensional (not in the mathematical sense)
Relation maps key onto attributes
However, in many cases in data warehousing one is interested in multiple
perspectives (dimensions)
Example: Sales based on product, time, region, customer, store, manager/employee
Cannot be represented with normal relations
Multi-dimensional data models
Multi-dimensional database systems

Components of MDDM
The two primary components of a dimensional model are dimensions and facts.
Dimensions: textual attributes used to analyze data.
Facts: numeric values used to analyze the business.

Types of MDDM

Star Schema
Snowflake Schema
Fact Constellation Schema

I. Star Schema

Each dimension in a star schema is represented by only one dimension table.

This dimension table contains the set of attributes.

The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.

There is a fact table at the center. It contains the keys to each of the four dimensions.

The fact table also contains the attributes, namely dollars sold and units sold.

Note: Each dimension has only one dimension table and each table holds a set of attributes. For
example, the location dimension table contains the attribute set {location_key, street, city,
province_or_state, country}. This constraint may cause data redundancy. For example,
the cities "Vancouver" and "Victoria" are both in the Canadian province of British Columbia,
so the entries for these cities repeat the same values in the attributes province_or_state
and country.
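The star schema described above can be sketched with SQLite from Python. The table names (`sales_fact`, `time_dim`, and so on), the column types, and the sample rows are assumptions chosen for illustration; only the attribute names come from the text.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                       quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                         branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);

-- The fact table holds one key per dimension plus the two measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    branch_key INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold INTEGER);

INSERT INTO time_dim VALUES (1, 15, 1, 'Q1', 2024);
INSERT INTO item_dim VALUES (1, 'phone', 'Acme', 'electronics', 'wholesale');
INSERT INTO branch_dim VALUES (1, 'Downtown', 'retail');
-- Both rows repeat province_or_state and country: the redundancy noted above.
INSERT INTO location_dim VALUES
    (1, '1 Main St', 'Vancouver', 'British Columbia', 'Canada'),
    (2, '9 Bay St', 'Victoria', 'British Columbia', 'Canada');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 1, 500.0, 2), (1, 1, 1, 2, 250.0, 1), (1, 1, 1, 1, 100.0, 1);
""")

# A typical star query: one join from the fact table to a dimension table.
by_city = db.execute("""
    SELECT l.city, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON l.location_key = f.location_key
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
```

Because every dimension is a single table, any analysis needs at most one join per dimension, at the cost of the repeated province and country values.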

II. Snowflake Schema

Some dimension tables in the Snowflake schema are normalized.

The normalization splits up the data into additional tables.

Unlike the star schema, the dimension tables in a snowflake schema are normalized. For
example, the item dimension table of the star schema is normalized and split into two
dimension tables, namely the item and supplier tables.

Now the item dimension table contains the attributes item_key, item_name, type, brand, and
supplier_key.

The supplier key is linked to the supplier dimension table. The supplier dimension table
contains the attributes supplier_key and supplier_type.

Note: Due to normalization in the snowflake schema, redundancy is reduced; the schema therefore
becomes easier to maintain and saves storage space.
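The item and supplier split can be sketched the same way. The table names and sample rows below are hypothetical, and the fact table is reduced to the item key and the measures to keep the example short.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Normalized: supplier attributes live in their own table.
CREATE TABLE supplier_dim (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT, brand TEXT,
    supplier_key INTEGER REFERENCES supplier_dim(supplier_key));
CREATE TABLE sales_fact (
    item_key INTEGER REFERENCES item_dim(item_key),
    dollars_sold REAL, units_sold INTEGER);

INSERT INTO supplier_dim VALUES (1, 'wholesale'), (2, 'direct');
INSERT INTO item_dim VALUES
    (1, 'phone', 'electronics', 'Acme', 1),
    (2, 'laptop', 'electronics', 'Acme', 1),
    (3, 'desk', 'furniture', 'Oakline', 2);
INSERT INTO sales_fact VALUES (1, 500.0, 2), (2, 1200.0, 1), (3, 300.0, 1);
""")

# Reaching supplier_type now takes two joins instead of one.
by_supplier_type = db.execute("""
    SELECT s.supplier_type, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN item_dim AS i ON i.item_key = f.item_key
    JOIN supplier_dim AS s ON s.supplier_key = i.supplier_key
    GROUP BY s.supplier_type
    ORDER BY s.supplier_type
""").fetchall()
```

Each supplier type is stored once instead of once per item, which is the storage saving; the extra join is the price paid at query time.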

III. Fact Constellation Schema

A fact constellation has multiple fact tables. It is also known as galaxy schema.

The following diagram shows two fact tables, namely sales and shipping.

The sales fact table is the same as that in the star schema.

The shipping fact table has five dimensions, namely item_key, time_key,
shipper_key, from_location, and to_location.

The shipping fact table also contains two measures, namely dollars cost and units shipped.

It is also possible to share dimension tables between fact tables. For example, time, item,
and location dimension tables are shared between the sales and shipping fact table.
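A reduced sketch of a fact constellation follows: two fact tables that share one item dimension. The table names and rows are hypothetical, and the shipping measures are named `dollars_cost` and `units_shipped`, following the DMQL definition of the shipping cube given later in this section.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- One dimension table shared by both fact tables.
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE sales_fact (
    item_key INTEGER REFERENCES item_dim(item_key),
    dollars_sold REAL, units_sold INTEGER);
CREATE TABLE shipping_fact (
    item_key INTEGER REFERENCES item_dim(item_key),
    dollars_cost REAL, units_shipped INTEGER);

INSERT INTO item_dim VALUES (1, 'phone'), (2, 'laptop');
INSERT INTO sales_fact VALUES (1, 500.0, 2), (1, 250.0, 1), (2, 1200.0, 1);
INSERT INTO shipping_fact VALUES (1, 20.0, 3), (2, 35.0, 1);
""")

# Aggregate each fact table separately, then combine the results through the
# shared dimension; joining the raw fact rows directly would double count.
per_item = db.execute("""
    SELECT i.item_name, s.dollars_sold, h.dollars_cost
    FROM item_dim AS i
    JOIN (SELECT item_key, SUM(dollars_sold) AS dollars_sold
          FROM sales_fact GROUP BY item_key) AS s ON s.item_key = i.item_key
    JOIN (SELECT item_key, SUM(dollars_cost) AS dollars_cost
          FROM shipping_fact GROUP BY item_key) AS h ON h.item_key = i.item_key
    ORDER BY i.item_name
""").fetchall()
```

Sharing the dimension is what makes the two fact tables comparable: both are keyed by the same `item_key` values.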

Schema Definition
A multidimensional schema can be defined using the Data Mining Query Language (DMQL). Its two
primitives, cube definition and dimension definition, can be used for defining data
warehouses and data marts.

Syntax for Cube Definition

define cube <cube_name> [<dimension_list>]: <measure_list>

Syntax for Dimension Definition


define dimension <dimension_name> as (<attribute_or_dimension_list>)

Star Schema Definition


The star schema that we have discussed can be defined using Data Mining Query
Language (DMQL) as follows:
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

Snowflake Schema Definition


Snowflake schema can be defined using DMQL as follows:
define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type,
    supplier (supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street,
    city (city_key, city, province_or_state, country))

Fact Constellation Schema Definition


Fact constellation schema can be defined using DMQL as follows:
define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
    dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name,
    location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

Ref:- http://www.tutorialspoint.com/dwh/dwh_schemas.htm

Typical OLAP Operations


Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up; go from a higher-level summary to a lower-level
summary or detailed data, or introduce new dimensions
Slice and dice: project and select
Pivot (rotate): reorient the cube for visualization, e.g. turn a 3D cube into a series of 2D planes
Other operations:
    drill across: queries involving (across) more than one fact table
    drill through: go through the bottom level of the cube to its back-end relational tables (using SQL)
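The core operations can be tried out on a handful of fact rows in plain Python. This is a toy sketch: the rows, the dimension names in `DIMS`, and the helper functions `aggregate`, `select`, and `pivot` are all hypothetical, and the city-to-country hierarchy is simply stored as two columns.

```python
from collections import defaultdict

# Hypothetical fact rows: (quarter, city, country, item, units_sold)
FACTS = [
    ("Q1", "Vancouver", "Canada", "phone", 10),
    ("Q1", "Toronto", "Canada", "phone", 4),
    ("Q1", "Chicago", "USA", "laptop", 6),
    ("Q2", "Vancouver", "Canada", "laptop", 7),
]
DIMS = ("quarter", "city", "country", "item")


def aggregate(facts, group_by):
    """Sum units_sold for each combination of the chosen dimensions."""
    idx = [DIMS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def select(facts, **conditions):
    """Slice (one condition) or dice (several): keep rows matching all of them."""
    return [
        row for row in facts
        if all(row[DIMS.index(d)] in allowed for d, allowed in conditions.items())
    ]


def pivot(table):
    """Rotate a two-dimensional aggregate by swapping its axes."""
    return {(b, a): v for (a, b), v in table.items()}


by_city = aggregate(FACTS, ("quarter", "city"))        # detailed level
by_country = aggregate(FACTS, ("quarter", "country"))  # roll-up: city -> country
by_quarter = aggregate(FACTS, ("quarter",))            # roll-up: dimension reduction
q1_slice = select(FACTS, quarter={"Q1"})               # slice on one dimension
q1_canada = select(FACTS, quarter={"Q1"}, country={"Canada"})  # dice on two
rotated = pivot(by_country)                            # (country, quarter) layout
```

Drill-down is simply the reverse direction: moving from `by_country` back to the more detailed `by_city`.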
Ref:-http://www.nyu.edu/classes/jcf/g22.3033-002/slides/session4/DataWarehousingAndOLAP.pdf