28/02/2009
Introduction:
A data warehouse is basically a storage area where all an organization's information
or data is stored and managed in a manner that will allow all users in the organization
to use that data in their decision-making process.
Before and during the early 1980s, data was evaluated in the form of Management
Information System reports, and this method had many difficulties.
Fortunately, in the late 1980s, data warehousing concepts were intended to provide
an architectural model for the flow of data from operational systems to decision
support environments. Further, data warehouses have been designed and built as
separate technology entities from operational and transactional systems and have
become the primary repositories for performing business intelligence.
Four advances in data warehouse technology have allowed it to evolve. These
advances are offline operational databases, offline data warehouses, real-time
data warehouses and integrated data warehouses.
Offline Operational Databases - Data warehouses in this initial stage are
developed by simply copying the database of an operational system to an offline
server, where the processing load of reporting does not affect the operational
system's performance.
Offline Data Warehouse - Data warehouses in this stage of evolution are
updated on a regular time cycle from the operational systems and the data is
stored in an integrated reporting-oriented data structure.
Real Time Data Warehouse - Data warehouses at this stage are updated on
a transaction or event basis, every time an operational system performs a
transaction.
Data Warehouse:
A Data Warehouse is a subject oriented, integrated, time variant and nonvolatile
collection of data in support of management's decisions.
Subject oriented Data:
In operational systems, data is stored by individual applications. Each data set has
to provide the data its application needs to perform specific functions efficiently.
Therefore the data sets for each application are organized around that specific
application.
In data warehouses, data is stored not by operational application but by business
subject. Business subjects differ from enterprise to enterprise and are critical
for the enterprise.
Integrated Data:
All the relevant data from the various applications must be pulled together for proper
decision making. The data in the data warehouse comes from several operational systems.
Source data resides in different databases, files and data segments.
Data inconsistencies are removed, and the source data is transformed, consolidated and
integrated before it is stored in the data warehouse.
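The integration step described above can be sketched as follows. This is a minimal illustration, not a prescribed method: the two source layouts, the field names and the `GENDER_CODES` mapping are all hypothetical, chosen only to show how inconsistent encodings might be brought into one format.

```python
# Hypothetical records from two source systems that encode the same
# customer differently (field names and units are assumptions).
billing_rows = [{"cust_id": "17", "gender": "M", "balance_cents": 12550}]
crm_rows = [{"customer": 17, "sex": "male", "balance": "125.50"}]

# One agreed encoding for the warehouse.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def from_billing(row):
    """Convert a billing-system row to the common warehouse format."""
    return {
        "customer_id": int(row["cust_id"]),
        "gender": GENDER_CODES[row["gender"].lower()],
        "balance": row["balance_cents"] / 100,
    }


def from_crm(row):
    """Convert a CRM row to the common warehouse format."""
    return {
        "customer_id": int(row["customer"]),
        "gender": GENDER_CODES[row["sex"].lower()],
        "balance": float(row["balance"]),
    }


integrated = [from_billing(r) for r in billing_rows]
integrated += [from_crm(r) for r in crm_rows]
print(integrated)
```

Once both sources are expressed in the same keys, codes and units, the two records describe the customer identically and can be consolidated.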
Nonvolatile Data:
The data in the data warehouse is primarily for query and analysis and not intended
to run the day-to-day business. The data in a data warehouse is not as volatile as the
data in an operational database is.
Time-variant Data:
All data in the data warehouse is identified with a particular time period.
The time-variant nature of the data in a data warehouse:
Allows for analysis of the past
Relates information to the present
Enables forecasts for the future
Business users will be able to query data directly with less information
technology support.
Operational systems
Operational systems are generally concerned with current data.
Data is updated regularly according to need.
Operational systems are generally process-oriented (focused on specific business processes or tasks).
Operational systems are generally designed to support high-volume transaction processing with minimal back-end reporting.
Operational systems are generally optimized to perform fast inserts and updates of relatively small volumes of data.
Operational systems generally require a non-trivial level of computing skills amongst the end-user community.
OLAP Characteristics:
OLAP facilitates interactive query and complex analysis for the users.
Allows users to drill down for greater detail or roll up for aggregations of
metrics along a single business dimension or across multiple dimensions.
Provides the ability to perform intricate calculations and comparisons.
Presents results in a number of meaningful ways, including charts and graphs.
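Roll-up and drill-down, as listed above, can be illustrated with a small sketch. The sales figures, the dimension names and the `aggregate` helper are hypothetical; a real OLAP tool would do this against a cube or a warehouse, not an in-memory list.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, region, amount).
sales = [
    (2008, "Q1", "North", 100),
    (2008, "Q1", "South", 150),
    (2008, "Q2", "North", 120),
    (2008, "Q2", "South", 130),
]

# Positions of each dimension inside a fact tuple.
YEAR, QUARTER, REGION = 0, 1, 2


def aggregate(facts, *dims):
    """Sum the amount (last field) grouped by the chosen dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dims)
        totals[key] += fact[-1]
    return dict(totals)


by_year = aggregate(sales, YEAR)                       # rolled up
by_year_quarter = aggregate(sales, YEAR, QUARTER)      # drilled down in time
by_quarter_region = aggregate(sales, QUARTER, REGION)  # two dimensions

print(by_year)
print(by_year_quarter)
print(by_quarter_region)
```

Grouping by fewer dimensions rolls the metric up; adding a dimension to the grouping drills down into more detail.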
OLTP System versus OLAP System
Source of data
OLTP: OLTPs are the original source of the data (operational data).
OLAP: OLAP data comes from the various OLTP databases (consolidation data).
Purpose of data
OLAP: To help with planning, problem solving, and decision support.
Inserts and Updates
OLAP: Periodic long-running batch jobs refresh the data.
Queries
OLAP: Often complex queries involving aggregations.
Processing Speed
OLAP: Depends on the amount of data involved. Batch data refreshes and complex queries may take many hours. Query speed can be improved by creating indexes.
Database Design
OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas.
Backup and Recovery
OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.
Operational systems, ERP, CRM and Flat files are the different types of data
sources. ETL (Extraction, Transformation and Loading) is a process of pulling data
out from the source systems and placing it into a data warehouse.
Extraction:
Data from different source systems is converted into one consolidated data
warehouse format which is ready for transformation processing.
Transformation:
Transforming the data may involve a number of tasks.
Loading:
The data is loaded into the data warehouse database. The metadata and raw data of
a traditional OLTP (on-line transaction processing) system are present, as is an
additional type of data, summary data. Summaries are very valuable in data
warehouses because they pre-compute long operations in advance. End users
directly access data derived from several source systems through the data
warehouse.
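The three ETL steps above, including a pre-computed summary, can be sketched with Python's standard `sqlite3` module. This is only an illustration under stated assumptions: the two sources, the table names `sales` and `sales_by_region`, and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source extracts (layouts are assumptions).
orders_system = [
    ("2009-01-05", "north", "100.00"),
    ("2009-01-06", "NORTH", "50.50"),
]
flat_file = ["2009-01-05;South;75.25"]


def extract():
    """Bring every source into one row format: (date, region, amount)."""
    rows = list(orders_system)
    rows += [tuple(line.split(";")) for line in flat_file]
    return rows


def transform(rows):
    """Standardize region names and convert amounts to numbers."""
    return [
        (day, region.strip().title(), float(amount))
        for day, region, amount in rows
    ]


def load(conn, rows):
    """Load the detail rows, then pre-compute a summary table."""
    conn.execute(
        "CREATE TABLE sales (sale_date TEXT, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute(
        "CREATE TABLE sales_by_region AS "
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
summary = dict(conn.execute("SELECT region, total FROM sales_by_region"))
print(summary)
```

Reports that only need regional totals can read the small summary table instead of re-aggregating the detail rows on every request.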
OLAP (Online Analytical Processing) is being used aggressively by organizations to
discover valuable business trends from data marts and data warehouses. OLAP
provides a historical view of data. Although useful by itself, OLAP analysis
becomes truly powerful when combined with predictive analysis from data mining.
Data Mining:
Data mining, the extraction of hidden predictive information from large databases, is
the process of analyzing data from different perspectives and summarizing it into
useful information. Data mining tools predict future trends and behaviors, allowing
businesses to make proactive, knowledge-driven decisions. It allows users to analyze
data from many different dimensions or angles, categorize it, and summarize the
relationships identified. Technically, data mining is the process of finding correlations
or patterns among dozens of fields in large relational databases.
The main difference between operational database architecture and data warehouse
architecture is that the system's relational model is usually de-normalized into
dimension and fact tables, which are typical of a data warehouse database design.
Type 1 SCD:
A Type 1 change overwrites an existing dimensional attribute with new information.
It updates only the attribute and does not insert a new record.
For example, if the customer's address changes, the new address overwrites the old
address. The old address is therefore lost.
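A minimal sketch of a Type 1 change, assuming a hypothetical customer dimension held as a Python dictionary keyed by surrogate key (the keys, field names and addresses are invented for illustration):

```python
# Hypothetical customer dimension: surrogate key -> attributes.
customer_dim = {
    1: {"customer_id": "C100", "address": "12 Old Street"},
}


def apply_type1(dim, customer_id, new_address):
    """Overwrite the address in place; no new row is created."""
    for row in dim.values():
        if row["customer_id"] == customer_id:
            row["address"] = new_address


apply_type1(customer_dim, "C100", "98 New Avenue")
print(customer_dim)
```

After the call the dimension still holds one row for the customer, and the previous address can no longer be recovered from it.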
Type 2 SCD:
A Type 2 change writes a record with the new attribute information and preserves the
record of the old dimensional data. The new record is inserted with a new surrogate
key.
For example, if the customer's address changes, a record with the new address is
added under a new surrogate key. Both the old and the new address are therefore
present.
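A minimal sketch of a Type 2 change follows. The dimension layout is hypothetical, and the `is_current` flag is an assumption added here so the latest row can be told apart from older ones; the text above only requires the new surrogate key.

```python
# Hypothetical customer dimension; "is_current" is an assumed helper column.
customer_dim = [
    {"customer_key": 1, "customer_id": "C100",
     "address": "12 Old Street", "is_current": True},
]


def apply_type2(dim, customer_id, new_address):
    """Keep the old row and append a new row with a new surrogate key."""
    next_key = max(row["customer_key"] for row in dim) + 1
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["is_current"] = False  # the old version becomes history
    dim.append({"customer_key": next_key, "customer_id": customer_id,
                "address": new_address, "is_current": True})


apply_type2(customer_dim, "C100", "98 New Avenue")
for row in customer_dim:
    print(row)
```

The business identifier `customer_id` stays the same on both rows; only the surrogate key distinguishes the two versions, which is why fact rows should reference the surrogate key.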
Type 3 SCD:
A Type 3 change places a value for the change in the original dimensional record,
instead of creating a new dimensional record to hold the attribute change.
For example, if the customer's address changes, the old address, the new address
and the effective date of the change are all captured in the same record.
Type 3 cannot keep the full history when an attribute changes more than
once. Type 3 is rarely used in actual practice.
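A minimal sketch of a Type 3 change, again on a hypothetical in-memory dimension. The column names `previous_address` and `address_effective_date` are assumptions; the second call shows the limitation noted above.

```python
# Hypothetical customer dimension with extra columns for the old value.
customer_dim = {
    1: {"customer_id": "C100", "address": "12 Old Street",
        "previous_address": None, "address_effective_date": None},
}


def apply_type3(dim, customer_id, new_address, effective_date):
    """Store old and new values side by side in the same row."""
    for row in dim.values():
        if row["customer_id"] == customer_id:
            row["previous_address"] = row["address"]
            row["address"] = new_address
            row["address_effective_date"] = effective_date


apply_type3(customer_dim, "C100", "98 New Avenue", "2009-02-28")
apply_type3(customer_dim, "C100", "5 Third Road", "2009-06-01")
print(customer_dim[1])
```

After the second change the row holds only the two most recent addresses; the first address has been pushed out, because the row has room for a single previous value.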
Star schema
Snowflake schema
Fact constellation schema
The determination of which schema model should be used for a data warehouse
should be based upon the analysis of project requirements, accessible tools and
project team preferences.
Star schema:
A star schema is an arrangement of the fact and dimension tables in the dimensional
data model that resembles a star, with the fact table placed in the middle and
surrounded by the dimension tables. Usually the fact tables in a star schema are in
third normal form (3NF), whereas the dimension tables are de-normalized.
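A star schema can be sketched with Python's standard `sqlite3` module. The table and column names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- De-normalized dimensions: category is kept inside dim_product.
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- Fact table in the middle, pointing at each dimension.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    amount REAL);
INSERT INTO dim_product VALUES
    (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Furniture');
INSERT INTO dim_date VALUES (20090101, 2009, 1), (20090201, 2009, 2);
INSERT INTO fact_sales VALUES
    (1, 20090101, 10.0), (1, 20090201, 15.0), (2, 20090201, 40.0);
""")

# A typical star join: one hop from the fact table to each dimension.
rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
print(rows)
```

Every dimension is reachable from the fact table with a single join, which keeps reporting queries short.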
Snowflake schema:
Snowflaking is a method of normalizing the dimension tables in a star schema.
Snowflake schemas normalize dimensions to eliminate redundancy. The dimension
data is grouped into multiple tables instead of one large table, so the
snowflake schema is more complex than the star schema.
The following figure shows a snowflake schema with two dimensions, each having
three levels. A snowflake schema can have any number of dimensions and each
dimension can have any number of levels.
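Snowflaking can be sketched on a much smaller scale than the figure describes: one dimension with two levels. The names `dim_category`, `dim_product` and `fact_sales` and the sample rows are hypothetical, and SQLite via the standard `sqlite3` module is used only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The category level is split out of the product dimension.
CREATE TABLE dim_category (
    category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY, name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key));
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    amount REAL);
INSERT INTO dim_category VALUES (1, 'Stationery');
INSERT INTO dim_product VALUES (1, 'Pen', 1), (2, 'Pencil', 1);
INSERT INTO fact_sales VALUES (1, 10.0), (2, 5.0);
""")

# Reaching the category now takes two joins instead of one.
totals = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_category c ON c.category_key = p.category_key
    GROUP BY c.category_name
""").fetchall()
print(totals)
```

The category name is stored once rather than on every product row, which removes the redundancy at the cost of an extra join per level.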
Data Mart:
A collection of related data from internal and external sources, transformed,
integrated and stored for the purpose of providing strategic information for a specific
set of users in an enterprise.
The data mart contains only a small amount of historical information and is granular
only to the point that it suits the needs of the department. The data mart is typically
All dependent data marts have the data warehouse as their source. Dependent data marts
are architecturally and structurally sound.
An independent data mart is one whose source is the legacy applications
environment. Each independent data mart is fed uniquely and separately by the
legacy applications environment. Independent data marts are unstable and
architecturally unsound. The problem with independent data marts is that their
deficiencies do not make themselves manifest until the organization has built multiple
independent data marts.
Kimball views data warehousing as a constituency of data marts. Data marts are
focused on delivering business objectives for departments in the organization, and
the data warehouse is tied together by the conformed dimensions of the data marts.
Hence a unified view of the enterprise can be obtained from dimensional modeling at
the local departmental level.
Requirement Gathering
Physical Environment Setup
Data Modeling
ETL
OLAP Cube Design
Front End Development
Performance Tuning
Quality Assurance
Rollout To Production
Production Maintenance
Incremental Enhancements
Requirement Gathering:
The main objective of this phase is to identify the objects necessary for the reporting
and analysis requirements. During this phase, business managers play a vital
role and there is direct discussion with the end users. The various data
sources are identified (operational systems, ERP, CRM, flat files, etc.).
The deliverables in this phase are
Development Instance
Test Instance
Production Instance
Production Instance: After testing, the objects are moved into the production
instance.
Along with the above instances, there will be separate database servers for the ETL,
OLAP and reporting tools. The network administrators and database administrators play
the key role in setting up the servers, and they submit a detailed document about the
servers to the project managers.
The deliverables in this phase are
Data Modeling:
A Data model is a conceptual representation of data structures (tables) required for a
database and is very powerful in expressing and communicating the business
requirements. A data model represents the nature of data, business rules governing
the data and how it will be organized in the database. There are three levels of data
modeling. They are
Conceptual Data Model: At this level, the data modeler attempts to identify the
highest-level relationships among the different entities.
Logical Data Model: At this level, the data modeler attempts to describe the data in
detail, without regard to how they will be physically implemented in the database.
In data warehousing, it is common for the conceptual data model and the logical data
model to be combined into a single step.
The steps for designing the logical data model are as follows:
Physical Data Model: At this level, the data modeler will specify how the logical data
model will be realized in the database schema.
Performance Tuning:
There are three major areas where a data warehousing system can use a little
performance tuning.
ETL
Query Processing
Report Delivery
ETL: Since loading data is very time consuming, it is best to run that activity as a
nightly load job. The ETL process needs further tuning because the jobs often do not
start on time, due to factors that are beyond the control of the data warehousing
team.
Query Processing: Query performance is a big issue in cases where the reports are
run directly against a relational database (especially in a ROLAP environment).
Hence it is worthwhile for the data warehousing team to invest some time in tuning
the queries.
Report Delivery: End users can experience delays in receiving their reports due to
factors other than query performance, for example network traffic, server setup
and the reporting tool used. It is important for the data warehouse team to look into
these areas for performance tuning.
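For the query-processing area above, one common tuning step is adding an index on a column that reports filter by. The sketch below uses Python's standard `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN`; the `sales` table, the `idx_sales_region` index and the data are hypothetical, and other database engines expose their query plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North" if i % 2 else "South", float(i)) for i in range(1000)],
)

QUERY = "SELECT SUM(amount) FROM sales WHERE region = ?"


def plan(connection):
    """Return SQLite's plan description for the report query."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN " + QUERY, ("North",)
    ).fetchall()
    # The last column of each plan row is the human-readable detail.
    return " ".join(row[-1] for row in rows)


plan_before = plan(conn)
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
plan_after = plan(conn)

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing the plan before and after the change is a quick way to confirm that the engine actually uses the new index, rather than assuming it does.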
The deliverables in this phase are
QA Test Plan
QA verification that the data warehousing system is ready to go to production
Rollout to Production:
Once the QA team gives the thumbs up (a signoff document), it is time for the data
warehouse system to go live.
The deliverables in this phase are
Production Maintenance:
Once the data warehouse goes into production, it needs to be maintained. Tasks such as
taking backups at regular intervals and crisis management become very important
and need to be planned well in advance.
The deliverables in this phase are
Incremental Enhancements:
Once the data warehousing system goes live, there is often a need for incremental
enhancements. A change may look simple enough to make directly in the production
environment, but doing so on a live system is highly risky. Make the changes
in the development instance first, then roll them out to the production systems.
The deliverables in this phase are