
Fundamentals of Data Warehousing

28/02/2009

Kambali Chandrakanth Gowd


Chandrakanthgowd.k@tcs.com

Introduction:
A data warehouse is basically a storage area where all an organization's information
or data is stored and managed in a manner that will allow all users in the organization
to use that data in their decision-making process.
Before and during the early 1980s, data was evaluated in the form of Management
Information System (MIS) reports, a method that posed many difficulties.
Fortunately, in the late 1980s, the data warehousing concept was introduced to provide
an architectural model for the flow of data from operational systems to decision
support environments. Further, data warehouses have been designed and built as
separate technology entities from operational and transactional systems and have
become the primary repositories for performing business intelligence.
There are four advances in data warehouse technology that have allowed it to evolve.
These advances are offline operational databases, offline data warehouses, real time
data warehouses and the integrated data warehouses.
Offline Operational Databases - Data warehouses in this initial stage are
developed by simply copying the database of an operational system to an
offline server, where the processing load of reporting does not impact the
operational system's performance.

Offline Data Warehouse - Data warehouses in this stage of evolution are
updated on a regular time cycle from the operational systems and the data is
stored in an integrated reporting-oriented data structure.

Real Time Data Warehouse - Data warehouses at this stage are updated on
a transaction or event basis, every time an operational system performs a
transaction.

Integrated Data Warehouse - Data warehouses at this stage are used to
generate activity or transactions that are passed back into the operational
systems for use in the daily activity of the organization.

History of Data warehousing:


The executives and managers who are responsible for keeping the enterprise
competitive need information to make proper decisions. They need information to
formulate the business strategies, establish goals, set objectives and monitor results.
In spite of the large amounts of data accumulated by enterprises over the past
decades, every enterprise is caught in the middle of an information crisis. The
information needed for strategic decision making is not readily available. Companies
are desperate for strategic information to extend market share and improve
profitability.
Strategic information is not for running the day-to-day operations of the business. It is
more important for continued health and survival of the corporation. Critical business
decisions depend on the availability of proper strategic information in an enterprise.
Analysts, executives and managers use strategic information interactively to analyze
and spot business trends.
Information needed for strategic decision making has to be available in an interactive
manner. Users must be able to query online, get results and query some more. The
information must be in a format suitable for analysis.
All the past attempts by Information Technology (IT) to provide strategic information
have been failures. This was mainly because IT has been trying to provide strategic
information from operational systems. Operational systems could not provide
strategic information. The operational computer systems provide information to run
day-to-day operations.
Only specially designed decision support systems can provide strategic information.
Specially designed decision support systems are not meant to run the core business
processes. They are used to watch how the business runs and then make strategic
decisions to improve the business. Decision support systems are developed to get
strategic information out of the database, but operational systems are designed to
put the data into the database.
Data warehousing is the only viable solution for providing strategic information. The
goal is not to generate fresh data, but to make use of large volumes of existing data
and to transform it into forms suitable for providing strategic information. The concept
of data warehousing is to take all the data that already exists in the organization,
clean and transform it, and then use it to provide useful strategic information.

Data Warehouse:
A Data Warehouse is a subject oriented, integrated, time variant and nonvolatile
collection of data in support of management's decisions.
Subject oriented Data:
In operational systems data is stored by individual applications. Data sets have to
provide data for the specific applications to perform the specific functions efficiently.
Therefore data sets for each application need to be organized around that specific
application.
In Data warehouses data is not stored by operational applications, but by Business
subjects. Business subjects differ from enterprise to enterprise and they are critical
for the enterprise.
Integrated Data:
All the relevant data from various applications must be pulled together for proper
decision making. The data in the data warehouse comes from several operational
systems. Source data resides in different databases, files and data segments.
Data inconsistencies are removed, and a process of transformation, consolidation
and integration of the source data is followed before the data is stored in a data
warehouse.
Nonvolatile Data:
The data in the data warehouse is primarily for query and analysis and not intended
to run the day-to-day business. The data in a data warehouse is not as volatile as the
data in an operational database is.
Time-variant Data:
All data in the data warehouse is identified with a particular time period.
The time-variant nature of the data in a data warehouse
Allows for analysis of the past
Relates information to the present
Enables forecasts for the future

Benefits of Data warehousing:

The primary focus of data warehousing environments is optimal analysis
and fast retrieval of data rather than efficient creation and modification of
data.

Implementations of data warehouses have been found to provide substantial
cost savings for organizations and have positive effects on an
organization's financial bottom line.

Consistent data exists in the data warehouse.

Business users will be able to query data directly with less information
technology support.

Data warehouses enhance the value of operational business applications.

Decision makers will be able to retrieve highly organized information.

Operational systems Versus Data warehousing systems:

Operational systems:
Operational systems are generally concerned with current data.
Data is updated regularly according to need.
Operational systems are generally process-oriented (focused on specific
business processes or tasks).
Operational systems are generally designed to support high-volume
transaction processing with minimal back-end reporting.
Operational systems are generally optimized to perform fast inserts and
updates of relatively small volumes of data.
Operational systems generally require a non-trivial level of computing skills
amongst the end-user community.

Data warehousing systems:
Data warehousing systems are generally concerned with historical data.
Data is generally read-only.
Data warehousing systems are generally subject-oriented.
Data warehousing systems are generally designed to support high-volume
analytical processing and subsequent, often elaborate report generation.
Data warehousing systems are generally optimized to perform fast retrievals
of relatively large volumes of data.
Data warehousing systems generally appeal to an end-user community with a
wide range of computing skills, from novice to expert users.

OLTP and OLAP:


OLTP stands for on-line transaction processing.
OLTP is a class of program that facilitates and manages transaction-oriented
applications.
OLTPs are designed for optimal transaction speed. The main purpose of OLTP is to
control and run fundamental business tasks.
OLAP stands for On-Line Analytical Processing.
OLAP has been growing in popularity due to the increase in data volumes and the
recognition of the business value of analytics. Until the mid-nineties, performing
OLAP analysis was an extremely costly process mainly restricted to larger
organizations. OLAP allows business users to slice and dice data at will. Normally
data in an organization is distributed across multiple data sources that are
incompatible with each other. Part of the OLAP implementation process involves extracting data
from the various data repositories and making them compatible. Making data
compatible involves ensuring that the meaning of the data in one repository matches
all other repositories. OLAPs are designed to give an overview analysis of what
happened.

OLAP Characteristics:

OLAP facilitates interactive query and complex analysis for users.
Allows users to drill down for greater detail or roll up for aggregations of
metrics along a single business dimension or across multiple dimensions.
Provides the ability to perform intricate calculations and comparisons.
Presents results in a number of meaningful ways, including charts and graphs.
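The drill-down and roll-up operations above can be sketched in plain Python as aggregation along a chosen set of dimension keys. The sales records, dimension names and measure below are invented purely for illustration:

```python
from collections import defaultdict

# Hypothetical sales facts with two dimensions: time (year, quarter) and region.
sales = [
    {"year": 2008, "quarter": "Q1", "region": "East", "amount": 100},
    {"year": 2008, "quarter": "Q2", "region": "East", "amount": 150},
    {"year": 2008, "quarter": "Q1", "region": "West", "amount": 200},
    {"year": 2008, "quarter": "Q2", "region": "West", "amount": 250},
]

def roll_up(facts, keys):
    """Aggregate the 'amount' measure along the given dimension keys."""
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[k] for k in keys)] += row["amount"]
    return dict(totals)

# Drill-down view: totals per (year, quarter).
by_quarter = roll_up(sales, ["year", "quarter"])
# Roll-up view: totals per year only.
by_year = roll_up(sales, ["year"])
```

Rolling up drops a dimension key and re-aggregates; drilling down adds one back, which is exactly how OLAP tools move between levels of detail.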

Types of OLAP models:


There are different types of OLAP models:
Multidimensional OLAP (MOLAP)
Relational OLAP (ROLAP)
Hybrid OLAP (HOLAP)
Multidimensional OLAP (MOLAP):
In the MOLAP model, data for analysis is stored in specialized multidimensional
databases. MOLAP is the fastest option for data retrieval (cubes are built for fast
data retrieval), but it requires the most disk space. Disk space is less of a concern
these days with falling storage and processing costs. MOLAP can handle only
moderate volumes of data: because all calculations are performed when the cube is
built, it is not possible to include a large amount of data in the cube itself. Data
analysis is easy irrespective of the number of dimensions.

Relational OLAP (ROLAP):


In the ROLAP model, data is stored as rows and columns in relational form and is
presented to users in the form of business dimensions. ROLAP can handle very
large amounts of data, but data retrieval is slower, and there are limitations on
complex data analysis functions. Because it leverages the existing relational
database, ROLAP is well suited for larger data warehousing implementations.

Hybrid OLAP (HOLAP):


HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP.
HOLAP can "drill through" from the cube into the underlying relational data and
leverages cube technology for faster performance.

OLTP versus OLAP:


Source of data:
OLTP: OLTPs are the original source of the data (operational data).
OLAP: OLAP data comes from the various OLTP databases (consolidated data).

Purpose of data:
OLTP: To control and run fundamental business tasks.
OLAP: To help with planning, problem solving, and decision support.

Inserts and Updates:
OLTP: Short and fast inserts and updates initiated by end users.
OLAP: Periodic long-running batch jobs refresh the data.

Queries:
OLTP: Relatively standardized and simple queries returning relatively few records.
OLAP: Often complex queries involving aggregations.

Processing Speed:
OLTP: Typically very fast.
OLAP: Depends on the amount of data involved; batch data refreshes and complex
queries may take many hours. Query speed can be improved by creating indexes.

Database Design:
OLTP: Highly normalized with many tables.
OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake
schemas.

Backup and Recovery:
OLTP: Backup religiously; operational data is critical to run the business, and data
loss is likely to entail significant monetary loss and legal liability.
OLAP: Instead of regular backups, some environments may consider simply
reloading the OLTP data as a recovery method.

Data warehouse Architecture:


Data warehouses and their architectures vary depending upon the specifics of an
organization's situation. The common architecture is:

Data Warehouse Architecture

Operational systems, ERP, CRM and Flat files are the different types of data
sources. ETL (Extraction, Transformation and Loading) is a process of pulling data
out from the source systems and placing it into a data warehouse.

Extraction:
Data from different source systems is converted into one consolidated data
warehouse format which is ready for transformation processing.

Transformation:
In transforming the data, the following tasks may be involved:

Applying business rules (for example, calculating new measures and
dimensions)
Cleaning (for example, mapping NULL to 0 or "Male" to "M" and "Female"
to "F")
Filtering (for example, selecting only certain columns to load)
Splitting a column into multiple columns and vice versa
Joining together data from multiple sources (for example, lookup, merge)
Transposing rows and columns
Applying any kind of simple or complex data validation (for example, if the
first three columns in a row are empty then reject the row from processing)
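A minimal sketch of such transformation rules in Python, combining cleaning, filtering and validation in one pass. The column names (customer_id, orders, gender) and the representation of source rows as dictionaries are assumptions made for the example:

```python
def transform(rows):
    """Apply illustrative cleaning, filtering and validation rules to source rows."""
    gender_map = {"Male": "M", "Female": "F"}  # cleaning: standardize coded values
    out = []
    for row in rows:
        # Validation: reject the row if the key column is empty.
        if not row.get("customer_id"):
            continue
        out.append({
            "customer_id": row["customer_id"],
            # Cleaning: map NULL (None) to 0.
            "orders": row.get("orders") or 0,
            # Cleaning: "Male" -> "M", "Female" -> "F"; pass other values through.
            "gender": gender_map.get(row.get("gender"), row.get("gender")),
            # Filtering: only these columns are selected for loading.
        })
    return out

source = [
    {"customer_id": "C1", "orders": None, "gender": "Male", "notes": "x"},
    {"customer_id": "", "orders": 5, "gender": "Female"},  # rejected: empty key
]
clean = transform(source)
```

In a real project these rules would live in an ETL tool or script, but the shape is the same: each rule is applied row by row between extraction and loading.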

Loading:
Loading data into the data warehouse.
The data is loaded into the data warehouse database. The metadata and raw data of
a traditional OLTP (on-line transaction processing) system is present, as is an
additional type of data: summary data. Summaries are very valuable in data
warehouses because they pre-compute long operations in advance. End users
directly access data derived from several source systems through the data
warehouse.
OLAP (Online Analytical Processing) is being used aggressively by organizations to
discover valuable business trends from data marts and data warehouses. OLAP
provides a historical view of data. Although useful when used by itself, OLAP
analysis becomes truly powerful when combined with predictive analysis from data
mining.
Data Mining:
Data mining, the extraction of hidden predictive information from large databases, is
the process of analyzing data from different perspectives and summarizing it into
useful information. Data mining tools predict future trends and behaviors, allowing
businesses to make proactive, knowledge-driven decisions. Data mining allows users
to analyze data from many different dimensions or angles, categorize it, and
summarize the relationships identified. Technically, data mining is the process of
finding correlations or patterns among dozens of fields in large relational databases.
The main difference between the database architecture and the data warehouse
architecture is that the system's relational model is usually de-normalized into
dimension and fact tables, which are typical of a data warehouse database design.

ER and Dimensional Modeling:


Entity-relationship modeling is a logical design technique that seeks to remove the
redundancy in data. Entity-relationship (ER) modeling is a powerful technique for
designing transaction processing systems in relational environments. By helping to
automate the normalization of physical data structures, ER has greatly contributed to
the phenomenal success of getting large amounts of data into relational databases.
However, ER models do not contribute to the user's ability to query the data. ER is
very useful for the transaction capture and the data administration phases of
constructing a data warehouse, but it should be avoided for end-user delivery.
To understand dimensional modeling, let's define some of the terms commonly used:
Attribute:
A unique level within a dimension.
For example, Month is an attribute in the Time Dimension.
Fact Table:
A fact table is a table that stores facts that measure the business, such as sales, cost
of goods, or profit. Fact tables also contain foreign keys to the dimension tables.
These foreign keys relate each row of data in the fact table to its corresponding
dimensions and levels.
Dimension Table:
A dimension table is a table that stores attributes that describe aspects of a
dimension. For example, a time table stores the various aspects of time such as
year, quarter, month, and day. A foreign key of a fact table references the primary
key in a dimension table in a many-to-one relationship.
Primary Key:
Each row in a dimensional table is identified by a unique value of an attribute
designated as the primary key of the dimension
Surrogate Key:
A surrogate key is a system-generated sequence number which does not have any
built-in meaning.
Dimensional modeling (DM) is the name of a logical design technique often used for
data warehouses. DM is the only viable technique for databases that are designed to
support end-user queries in a data warehouse. Data warehouses are typically
developed using dimensional models rather than the traditional Entity-relationship
models associated with conventional relational databases.

The Strengths of Dimensional modeling:


The dimensional model has a number of important data warehouse advantages that
the ER model lacks.

The dimensional model is a predictable, standard framework. Report writers,
query tools, and user interfaces can all make strong assumptions about the
dimensional model to make the user interfaces more understandable and to
make processing more efficient.
The dimensional model withstands unexpected changes in user behavior.
The dimensional model is gracefully extensible to accommodate unexpected
new data elements and new design decisions.
The dimensional model offers a body of standard approaches for handling
Slowly Changing Dimensions.
The dimensional model benefits from a growing body of administrative utilities
and software processes that manage and use aggregates.

ER modeling Versus Dimensional modeling:

An ER model focuses on individual events, whereas a dimensional model
focuses on how managers view the business.
An ER model is split as per the entities. A dimensional model is split as per
the dimensions and facts.
An ER model has a complex group of entities linked with each other,
whereas a dimensional model has logically grouped sets of star schemas.
In an ER model all attributes for an entity, textual as well as numeric,
belong to the entity table, whereas a 'dimension' entity in a dimensional
model has mostly textual attributes and the 'fact' entity has mostly
numeric attributes.
An ER model is highly normalized, whereas a dimensional model
aggregates most of the attributes and hierarchies of a dimension into a single
entity.

Slowly Changing Dimensions (SCD):


In dimensional modeling, most dimensions are generally constant, but they do
change over time. The product key of the source record does not change, but the
description and other attributes change slowly over time. For example, a customer is
constant but the demographic details of a customer might change several times
during the year. In dimensional modeling, Slowly Changing Dimensions record these
types of changes. A changing dimension means variation in dimensional attributes
over time.
The Slowly Changing Dimensions can be categorized into three types:

Type 1 SCD (Overwriting History)
Type 2 SCD (Preserving History)
Type 3 SCD (Preserving a Version of History)

Type 1 SCD:
A Type 1 change overwrites an existing dimensional attribute with new information.
This updates only the attribute and doesn't insert any new record.
For example, if the customer's address changes, the new address overwrites the old
address. Therefore the old address is lost forever.
Type 2 SCD:
A Type 2 change writes a record with the new attribute information and preserves a
record of the old dimensional data. The new record is inserted with a new surrogate
key.
For example, if the customer's address changes, the new address is added.
Therefore, both old and new addresses will be present. The new record is inserted
using a new surrogate key.
Type 3 SCD:
A Type 3 change places a value for the change in the original dimensional record,
instead of creating a new dimensional record to hold the attribute change.
For example, if the customer's address changes then the old address, new address
and effective date of change are captured. Therefore the old address, new address
and effective date of change will be present.
Type 3 cannot keep all history where an attribute is changed more than once.
Type 3 is rarely used in actual practice.
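Types 1 and 2 can be illustrated with a small Python sketch. The dimension rows, attribute names and surrogate-key scheme are assumptions made for the example: Type 1 overwrites in place (history lost), while Type 2 expires the current row and inserts a new one under a new surrogate key (history preserved):

```python
# Customer dimension rows keyed by surrogate key ("sk"); "customer_id" is the
# natural key from the source system. "current" flags the active row.
dim_customer = [
    {"sk": 1, "customer_id": "C1", "address": "Old St", "current": True},
]

def scd_type1(dim, customer_id, new_address):
    """Type 1: overwrite the attribute in place; the old value is lost forever."""
    for row in dim:
        if row["customer_id"] == customer_id:
            row["address"] = new_address

def scd_type2(dim, customer_id, new_address):
    """Type 2: expire the current row and insert a new row with a new surrogate key."""
    for row in dim:
        if row["customer_id"] == customer_id and row["current"]:
            row["current"] = False
    dim.append({
        "sk": max(r["sk"] for r in dim) + 1,  # system-generated surrogate key
        "customer_id": customer_id,
        "address": new_address,
        "current": True,
    })

scd_type2(dim_customer, "C1", "New St")
```

After the Type 2 change, both addresses exist in the dimension, and fact rows loaded earlier still point at the old surrogate key, which is what preserves history. A Type 3 change (not shown) would instead add an "old_address" column to the single row.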

Dimensional Model Schemas:


Data Warehouse environment usually transforms the relational data model into some
special architecture. There are many schema models designed for data warehousing
but the most commonly used are:

Star schema
Snowflake schema
Fact constellation schema

The determination of which schema model should be used for a data warehouse
should be based upon the analysis of project requirements, accessible tools and
project team preferences.
Star schema:
The star schema arranges the collection of fact and dimension tables in the
dimensional data model, resembling a star formation, with the fact table placed in
the middle surrounded by the dimension tables. Usually the fact tables in a star
schema are in third normal form (3NF) whereas dimensional tables are
de-normalized.
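As an illustration, a minimal star schema and a typical fact-to-dimension query can be built with SQLite from Python's standard library. The table names, columns and sample values are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A minimal star: one fact table surrounded by two dimension tables, with the
# fact holding foreign keys to each dimension.
cur.executescript("""
CREATE TABLE dim_time    (time_key INTEGER PRIMARY KEY, year INTEGER, month TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales  (
    time_key    INTEGER REFERENCES dim_time(time_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    amount      REAL
);
""")
cur.executemany("INSERT INTO dim_time VALUES (?,?,?)",
                [(1, 2008, "Jan"), (2, 2008, "Feb")])
cur.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
cur.executemany("INSERT INTO fact_sales VALUES (?,?,?)",
                [(1, 1, 100.0), (2, 1, 150.0), (1, 2, 75.0)])

# A typical star-schema query: join the fact table to its dimensions and
# aggregate the measure by dimension attributes.
cur.execute("""
SELECT t.year, p.category, SUM(f.amount)
FROM fact_sales f
JOIN dim_time t    ON f.time_key = t.time_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY t.year, p.category
""")
rows = cur.fetchall()
```

The query shape (fact joined to each dimension, grouped by dimension attributes) is the workload a star schema is optimized for.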

Snowflake schema:
Snowflaking is a method of normalizing the dimension tables in a star schema.
Snowflake schemas normalize dimensions to eliminate redundancy. The dimension
data is grouped into multiple tables instead of one large table, so the
snowflake schema is a more complex schema than the star schema.
The following figure shows a snowflake schema with two dimensions, each having
three levels. A snowflake schema can have any number of dimensions and each
dimension can have any number of levels.

Fact constellation schema:


For each star schema or snowflake schema it is possible to construct a fact
constellation schema. This schema is more complex than star or snowflake schema
because it contains multiple fact tables. This allows dimension tables to be shared
amongst many fact tables. That solution is very flexible but it may be hard to manage
and support.
The main disadvantage of the fact constellation schema is a more complicated
design because many variants of aggregation must be considered.
In a fact constellation schema, different fact tables are explicitly assigned to the
dimensions that are relevant for the given facts. This may be useful in cases where
some facts are associated with a given dimension level and other facts with a deeper
dimension level.

Data Mart:
A collection of related data from internal and external sources, transformed,
integrated and stored for the purpose of providing strategic information for a specific
set of users in an enterprise.
The data mart contains only a small amount of historical information and is granular
only to the point that it suits the needs of the department. The data mart is typically
housed in multidimensional technology, which is great for flexibility of analysis but is
not optimal for large amounts of data. Data found in data marts is highly indexed.
There are two kinds of data marts:

Dependent data mart
Independent data mart

All dependent data marts have a data warehouse as their source. Dependent data
marts are architecturally and structurally sound.
An independent data mart is one whose source is the legacy applications
environment. Each independent data mart is fed uniquely and separately by the
legacy applications environment. Independent data marts are unstable and
architecturally unsound. The problem with independent data marts is that their
deficiencies do not make themselves manifest until the organization has built multiple
independent data marts.

Operational Data Store (ODS):


An Operational Data Store (ODS) is an integrated database of operational data. Its
sources include legacy systems and it contains current or near term data.
An operational data store is basically a database that is used as an interim
staging area for a data warehouse. It works with a data warehouse but, unlike a data
warehouse, an operational data store does not contain static data. Instead, an
operational data store contains data that is constantly updated through the course
of the business operations.

Data Warehouse Methodologies:


The two major design methodologies of data warehousing are from Ralph Kimball
and Bill Inmon. Both Inmon and Kimball view data warehousing as separate from
OLTP and legacy applications.
Inmon believes in creating a data warehouse on a subject-by-subject area basis.
Hence the development of the data warehouse can start with data from the online
store. Other subject areas can be added to the data warehouse as their needs arise.
Point-of-sale (POS) data can be added later if management decides it is necessary.
The data mart is the creation of a data warehouse's subject area.

Inmon's Data Warehouse Design Methodology

Kimball views data warehousing as a constituency of data marts. Data marts are
focused on delivering business objectives for departments in the organization, and
the data warehouse is a conformed dimension of the data marts. Hence a unified
view of the enterprise can be obtained from the dimensional modeling at the local
departmental level.

Kimball's Data Warehousing Design Methodology

The Life cycle of a Data Warehouse project:


There are different phases involved in a data warehousing project life cycle:

Requirement Gathering
Physical Environment Setup
Data Modeling
ETL
OLAP Cube Design
Front End Development
Performance Tuning
Quality Assurance
Rollout To Production
Production Maintenance
Incremental Enhancements

Requirement Gathering:
The main objective of this phase is to identify the objects necessary for the reporting
and analysis requirements. During this phase, business managers play a vital role
and there will be direct discussion with the end users. The various data sources are
identified (operational systems, ERP, CRM, flat files etc.).
The deliverables in this phase are:

A list of reports/cubes to be delivered to the end users by the end of the
current phase.
An updated project plan that clearly identifies resource loads and milestone
delivery dates.

Physical Environment Setup:


After the requirements gathering phase is completed, the physical environment has
to be set up by installing the databases and configuring the physical servers.
The usual set of servers includes three instances:

Development Instance

Test Instance

Production Instance

Development Instance: In this instance developers work on the database, develop
objects and then promote that code for testing.
Test Instance: In this instance testers test the objects developed by the developers.
Production Instance: After testing, the objects are moved into the production
instance.
Along with the above instances, there will be separate database servers for the ETL,
OLAP and reporting tools. The network administrators and database administrators
play the key role in the setup of the servers, and they submit a detailed document
about the servers to the project managers.
The deliverables in this phase are:

Hardware/software setup document for all of the environments, including
hardware specifications and scripts/settings for the software.

Data Modeling:
A Data model is a conceptual representation of data structures (tables) required for a
database and is very powerful in expressing and communicating the business
requirements. A data model represents the nature of data, business rules governing
the data and how it will be organized in the database. There are three levels of data
modeling. They are

Conceptual Data Model


Logical Data Model
Physical Data Model

Conceptual Data Model: At this level, the data modeler attempts to identify the
highest-level relationships among the different entities.
Logical Data Model: At this level, the data modeler attempts to describe the data in
detail, without regard to how they will be physically implemented in the database.
In data warehousing, it is common for the conceptual data model and the logical
data model to be combined into a single step.
The steps for designing the logical data model are as follows:

Identify all entities.
Specify primary keys for all entities.
Find the relationships between different entities.
Find all attributes for each entity.
Resolve many-to-many relationships.
Normalization.

Physical Data Model: At this level, the data modeler will specify how the logical data
model will be realized in the database schema.

The steps for physical data model design are as follows:

Convert entities into tables.
Convert relationships into foreign keys.
Convert attributes into columns.
Modify the physical data model based on physical constraints / requirements.
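The steps above can be sketched with SQLite: a hypothetical Customer entity and its one-to-many "places orders" relationship become tables, and the relationship becomes a foreign key column. All names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Entities become tables, attributes become columns, and the one-to-many
# relationship "a customer places orders" becomes a foreign key.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,   -- primary key specified for the entity
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT
);
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme')")
conn.execute("INSERT INTO orders VALUES (10, 1, '2009-02-28')")

# The foreign key lets us navigate the relationship back at query time.
row = conn.execute(
    "SELECT c.name, o.order_date FROM orders o "
    "JOIN customer c ON o.customer_id = c.customer_id").fetchone()
```

A many-to-many relationship would instead be resolved (per the logical modeling steps) into an associative table holding two foreign keys.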

The deliverables in this phase are:

Identification of data sources.
Logical data model.
Physical data model.

ETL (Extraction, Transformation and Loading):


The ETL phase typically takes a long time to develop, and the reason for this is that
it takes time to get the source data, understand the necessary columns, understand
the business rules, and understand the logical and physical data models.
The deliverables in this phase are:

Data mapping document
ETL script/ETL package in the ETL tool

OLAP Cube Design:


OLAP databases provide aggregated summary information quickly using a schema
that is easily understood by end users. The cube consists of two primary concepts:
measures and dimensions. The measures are the numeric values that provide
summaries at various different levels of aggregation. The dimensions are the way in
which the numeric values are summarized. Within the cube, measures are organized
within measure groups. A measure group is associated with a single fact or event
that is tracked by the OLAP database. Also, the measures can be summarized by
various dimensions, some of which are common across the various measure groups.
Data warehousing is an iterative process. It's difficult to get all the requirements at
once.
The deliverables in this phase are:

Documentation specifying the OLAP cube dimensions and measures.
Actual OLAP cube/report.

Front End Development:


Front end development is an important part of a data warehousing initiative. If the
reports do not bring any value to the end user, then the effort to build the OLAP
cube is wasted. The trend is to have reports viewed through a standard web browser
such as Internet Explorer; it is not a good idea to install report-viewing software on
each and every end-user machine. So it's very important to think about the end
reports and the timely delivery of reports to the end user.

The deliverables in this phase are:

Front end deployment documentation

Performance Tuning:
There are three major areas where a data warehousing system can use a little
performance tuning.

ETL
Query Processing
Report Delivery

ETL: Since loading data is very time consuming, it's best to put that activity in a
nightly load job. The ETL process needs the most tuning, because the jobs often do
not get started on time due to factors that are beyond the control of the data
warehousing team.
Query Processing: Query performance is a big issue in cases where reports are run
directly against the relational database (especially in a ROLAP environment). Hence
it is ideal for the data warehousing team to invest some time in tuning the queries.
Report Delivery: End users can experience delays in receiving their reports due to
factors other than query performance, for example network traffic, server setup and
the reporting tool used. It is important for the data warehouse team to look into
these areas for performance tuning.
The deliverables in this phase are:

Performance tuning document - goals and results

Quality Assurance (QA):


After the data warehouse team completes the development work, the QA team,
which is from the client side, starts testing.

The deliverables in this phase are:

QA Test Plan
QA verification that the data warehousing system is ready to go to production

Rollout to Production:
Once the QA team gives the thumbs up (signoff document), it is time for the data
warehouse system to go live.
The deliverables in this phase are:

Delivery of the data warehousing system to the end users.

Production Maintenance:
Once the data warehouse goes into production, it needs to be maintained. Tasks like
taking backups at regular intervals and crisis management become very important
and need to be planned well in advance.
The deliverables in this phase are:

Consistent availability of the data warehousing system to the end users.

Incremental Enhancements:
Once the data warehousing system goes live, there are often needs for incremental
enhancements. A task may be as simple as making changes in the production
environment, but it is highly risky to make changes directly on live (production)
systems. Make the changes in the development environment and then roll them out
to the production systems.
The deliverables in this phase are:

Change management documentation
Actual change to the data warehousing system
