
Data Warehousing

Data sources often store only current data, not historical data
Corporate decision making requires a unified view of all organizational
data, including historical data
A data warehouse is a repository (archive) of information gathered
from multiple sources, stored under a unified schema, at a single site
Greatly simplifies querying, permits study of historical trends
Shifts decision support query load away from transaction
processing systems

Database System Concepts - 5th Edition, Aug 26, 2005 18.1 Silberschatz, Korth and Sudarshan
OLTP vs. OLAP

OLTP System: Online Transaction Processing (Operational System)
OLAP System: Online Analytical Processing (Data Warehouse)

Source of data
OLTP: Operational data; OLTPs are the original source of the data.
OLAP: Consolidation data; OLAP data comes from the various OLTP databases.

Purpose of data
OLTP: To control and run fundamental business tasks.
OLAP: To help with planning, problem solving, and decision support.

What the data reveals
OLTP: A snapshot of ongoing business processes.
OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and Updates
OLTP: Short and fast inserts and updates initiated by end users.
OLAP: Periodic long-running batch jobs refresh the data.

Queries
OLTP: Relatively standardized and simple queries returning relatively few records.
OLAP: Often complex queries involving aggregations.

Processing Speed
OLTP: Typically very fast.
OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

Space Requirements
OLTP: Can be relatively small if historical data is archived.
OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP.

Database Design
OLTP: Highly normalized with many tables.
OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas.

Backup and Recovery
OLTP: Backup religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.
Design Issues
When and how to gather data
Source driven architecture: data sources transmit new information
to warehouse, either continuously or periodically (e.g. at night)
Destination driven architecture: warehouse periodically requests
new information from data sources
Keeping warehouse exactly synchronized with data sources (e.g.
using two-phase commit) is too expensive
Usually OK to have slightly out-of-date data at warehouse
Data/updates are periodically downloaded from online
transaction processing (OLTP) systems.
What schema to use
Schema integration
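
The destination-driven approach described above can be sketched as follows. This is a minimal illustration, assuming each source numbers its writes so the warehouse can ask for "everything since the last pull". The names `Source`, `Warehouse`, `changes_since`, and `pull` are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """A hypothetical OLTP source that keeps (version, row) pairs."""
    rows: list = field(default_factory=list)
    version: int = 0

    def write(self, row):
        self.version += 1
        self.rows.append((self.version, row))

    def changes_since(self, version):
        return [(v, r) for v, r in self.rows if v > version]


@dataclass
class Warehouse:
    """Destination-driven: the warehouse asks each source for new rows."""
    rows: list = field(default_factory=list)
    last_seen: dict = field(default_factory=dict)

    def pull(self, name, source):
        since = self.last_seen.get(name, 0)
        for version, row in source.changes_since(since):
            self.rows.append(row)
            self.last_seen[name] = version


orders = Source()
warehouse = Warehouse()
orders.write({"order": 1, "amount": 20})
warehouse.pull("orders", orders)          # periodic refresh, e.g. nightly
orders.write({"order": 2, "amount": 35})  # not yet visible in the warehouse
stale_count = len(warehouse.rows)
warehouse.pull("orders", orders)
fresh_count = len(warehouse.rows)
```

Between the two pulls the warehouse is slightly out of date (`stale_count` is 1 while the source holds 2 rows), which is the trade-off the slide accepts in exchange for avoiding exact synchronization.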

More Warehouse Design Issues
Data cleansing: the task of correcting and preprocessing data
E.g. correct mistakes in addresses (misspellings, zip code errors)
Merge address lists from different sources and purge duplicates
How to propagate updates
Warehouse schema may be a (materialized) view of schema from
data sources
What data to summarize
Raw data may be too large to store on-line
Aggregate values (totals/subtotals) often suffice
Queries on raw data can often be transformed by query optimizer
to use aggregate values
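
A small sketch of the summarization idea: precomputed subtotals answer a total-by-month query without touching the raw rows. The data and the names `raw_sales`, `monthly_totals`, and `units_sold` are made up for illustration. A real warehouse would typically rely on materialized views or the query optimizer rather than hand-written code.

```python
from collections import defaultdict

# Hypothetical raw sales rows: (day, item, units_sold)
raw_sales = [
    ("2005-08-01", "pen", 3),
    ("2005-08-01", "ink", 1),
    ("2005-08-02", "pen", 5),
    ("2005-09-01", "pen", 2),
]

# Precompute one subtotal per (month, item) instead of keeping every row online.
monthly_totals = defaultdict(int)
for day, item, units in raw_sales:
    month = day[:7]
    monthly_totals[(month, item)] += units


def units_sold(item, month):
    """Answer a total-by-month query from the summary, not the raw rows."""
    return monthly_totals.get((month, item), 0)


august_pens = units_sold("pen", "2005-08")
```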

Warehouse Schemas
Dimension values are usually encoded using small integers and
mapped to full values via dimension tables
Resultant schema is called a star schema
More complicated schema structures
Snowflake schema: multiple levels of dimension tables
Constellation: multiple fact tables
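
The encoding of dimension values as small integers can be sketched in plain Python. The helper `encode_dimension` and the sample records are assumptions made for illustration only.

```python
def encode_dimension(values):
    """Assign a small integer key to each distinct dimension value."""
    keys = {}
    for value in values:
        keys.setdefault(value, len(keys) + 1)
    return keys


# Hypothetical sales records with full dimension values spelled out.
records = [("Toronto", "pen", 3), ("Vancouver", "pen", 5), ("Toronto", "ink", 1)]

city_keys = encode_dimension(city for city, _, _ in records)
item_keys = encode_dimension(item for _, item, _ in records)

# Fact rows hold only integer keys plus the measure.
fact_rows = [(city_keys[c], item_keys[i], units) for c, i, units in records]

# Dimension tables map keys back to full values.
city_dim = {key: name for name, key in city_keys.items()}
```

The fact rows stay compact because each repeated city or item name is stored once, in its dimension table.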

Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making

Commercial Systems for Data Warehouses

Oracle 9i (Oracle Warehouse Builder):
Enterprise Edition includes features that improve performance and
manageability for the data warehouse. It is one of the
leading relational DBMSs for data warehousing.

Microsoft SQL Server 2000

Multidimensional Data Model

Data warehouse and OLAP tools are based on a
multidimensional data model. This model views data
in the form of a data cube.
The model is composed of one fact table and a set of dimension
tables.
Fact table: has a composite primary key.
Dimension table: each dimension table has a
simple (non-composite) primary key that
corresponds exactly to one of the components of the
composite key in the fact table.
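
The key structure described above can be sketched with Python's built-in `sqlite3` module. The two-dimension `sales` table below is a deliberately reduced, hypothetical schema; the point is only that each component of the fact table's composite key matches the simple key of one dimension table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item   (item_key   INTEGER PRIMARY KEY, item_name   TEXT);
    CREATE TABLE branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
    CREATE TABLE sales (
        item_key   INTEGER REFERENCES item(item_key),
        branch_key INTEGER REFERENCES branch(branch_key),
        units_sold INTEGER,
        PRIMARY KEY (item_key, branch_key)  -- composite key of the fact table
    );
    INSERT INTO item   VALUES (1, 'pen');
    INSERT INTO branch VALUES (1, 'Downtown');
    INSERT INTO sales  VALUES (1, 1, 10);
""")

# A second fact row for the same (item, branch) cell violates the composite key.
try:
    conn.execute("INSERT INTO sales VALUES (1, 1, 99)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Each combination of dimension keys identifies at most one cell of the cube, which is why the duplicate insert is rejected.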

Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures


Star schema: A fact table in the middle connected to a
set of dimension tables
Snowflake schema: A refinement of the star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
a snowflake
Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation

Example of Star Schema
Sales Fact Table
Keys: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Dimension table time: time_key, day, day_of_the_week, month, quarter, year
Dimension table item: item_key, item_name, brand, type, supplier_type
Dimension table branch: branch_key, branch_name, branch_type
Dimension table location: location_key, street, city, province_or_state, country
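
A query against a star schema joins the fact table directly to each dimension it needs. The sketch below uses `sqlite3` with a reduced subset of the example's tables and columns; the sample rows and the brand name are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
    CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
    CREATE TABLE sales (
        time_key INTEGER REFERENCES time(time_key),
        item_key INTEGER REFERENCES item(item_key),
        units_sold INTEGER,
        dollars_sold REAL
    );
    INSERT INTO time VALUES (1, 'Q1', 2005), (2, 'Q2', 2005);
    INSERT INTO item VALUES (1, 'pen', 'Acme'), (2, 'ink', 'Acme');
    INSERT INTO sales VALUES (1, 1, 10, 15.0), (1, 2, 4, 20.0), (2, 1, 6, 9.0);
""")

# Star join: one hop from the fact table to each dimension table.
rows = conn.execute("""
    SELECT t.quarter, i.brand, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN time AS t ON t.time_key = s.time_key
    JOIN item AS i ON i.item_key = s.item_key
    GROUP BY t.quarter, i.brand
    ORDER BY t.quarter
""").fetchall()
```

The result holds one total per (quarter, brand) pair, computed from the `dollars_sold` measure.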
Example of Snowflake Schema
Sales Fact Table
Keys: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Dimension table time: time_key, day, day_of_the_week, month, quarter, year
Dimension table item: item_key, item_name, brand, type, supplier_key
Dimension table supplier: supplier_key, supplier_type
Dimension table branch: branch_key, branch_name, branch_type
Dimension table location: location_key, street, city_key
Dimension table city: city_key, city, province_or_state, country
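
In a snowflake schema a query may need an extra hop through the normalized dimension tables. The `sqlite3` sketch below assumes a reduced version of the example (only the location and city tables) and invented sample rows. It totals units sold per country by joining from the fact table to location and then to city.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
    CREATE TABLE location (
        location_key INTEGER PRIMARY KEY,
        street TEXT,
        city_key INTEGER REFERENCES city(city_key)
    );
    CREATE TABLE sales (
        location_key INTEGER REFERENCES location(location_key),
        units_sold INTEGER
    );
    INSERT INTO city VALUES (1, 'Toronto', 'Canada'), (2, 'Chicago', 'USA');
    INSERT INTO location VALUES
        (1, '1 Main St', 1), (2, '9 Lake Rd', 1), (3, '5 Elm St', 2);
    INSERT INTO sales VALUES (1, 7), (2, 3), (3, 5);
""")

# Two hops: fact table -> location -> city.
rows = conn.execute("""
    SELECT c.country, SUM(s.units_sold)
    FROM sales AS s
    JOIN location AS l ON l.location_key = s.location_key
    JOIN city AS c ON c.city_key = l.city_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
```

Compared with the star layout, the city attributes are stored once per city rather than once per location, at the cost of one more join.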

Data Warehouse Usage

Three kinds of data warehouse applications


Information processing
supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
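
The basic OLAP operations named above can be illustrated on a tiny in-memory cube. This is a sketch under simplifying assumptions: the cube is a plain dictionary keyed by (quarter, city, item), and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical cube cells: (quarter, city, item) -> units_sold
cube = {
    ("Q1", "Toronto", "pen"): 10,
    ("Q1", "Chicago", "pen"): 4,
    ("Q2", "Toronto", "pen"): 6,
    ("Q2", "Toronto", "ink"): 2,
}


def slice_cube(cells, quarter):
    """Slice: fix one dimension to a single value."""
    return {(c, i): v for (q, c, i), v in cells.items() if q == quarter}


def dice_cube(cells, quarters, cities):
    """Dice: keep a sub-cube by restricting several dimensions."""
    return {k: v for k, v in cells.items() if k[0] in quarters and k[1] in cities}


def roll_up_to_city(cells):
    """Roll-up: aggregate away the quarter and item dimensions."""
    totals = defaultdict(int)
    for (_, city, _), units in cells.items():
        totals[city] += units
    return dict(totals)


q1 = slice_cube(cube, "Q1")
toronto_q2 = dice_cube(cube, {"Q2"}, {"Toronto"})
by_city = roll_up_to_city(cube)
```

Drilling down is the inverse of the roll-up: going from `by_city` back to the finer-grained cells in `cube`.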

Differences among the three tasks


Data Warehouse Schema

Data Mining
Data mining is the process of semi-automatically analyzing large
databases to find useful patterns
Prediction based on past history
Predict if a credit card applicant poses a good credit risk, based on
some attributes (income, job type, age, ...) and past history
Predict if a pattern of phone calling card usage is likely to be
fraudulent
Some examples of prediction mechanisms:
Classification
Given a new item whose class is unknown, predict to which class
it belongs
Regression formulae
Given a set of mappings for an unknown function, predict the
function result for a new parameter value
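
As one possible instance of a regression formula, the sketch below fits a straight line by ordinary least squares to a few known mappings and predicts the result for a new parameter value. The linear form and the sample points are assumptions for illustration; the slides do not prescribe a particular regression method.

```python
def fit_line(points):
    """Least-squares fit of y = a*x + b to a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical known mappings of the unknown function.
known = [(1, 3), (2, 5), (3, 7), (4, 9)]
a, b = fit_line(known)
predicted = a * 5 + b
```

Here the points lie exactly on y = 2x + 1, so the prediction for x = 5 is 11; with noisy data the fit would only approximate the underlying function.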

Data Mining (Cont.)
Descriptive Patterns
Associations
Find books that are often bought by similar customers. If a
new such customer buys one such book, suggest the others
too.
Associations may be used as a first step in detecting causation
E.g. association between exposure to chemical X and cancer
Clusters
E.g. typhoid cases were clustered in an area surrounding a
contaminated well
Detection of clusters remains important in detecting epidemics
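
The book-recommendation association can be made concrete by counting how often pairs of items occur together. The sketch below computes two commonly used measures, support and confidence, on invented purchase baskets; the measure names are standard in association-rule mining but are not defined in these slides.

```python
from itertools import combinations
from collections import Counter

# Hypothetical purchase baskets, one set of book topics per customer.
baskets = [
    {"databases", "sql", "python"},
    {"databases", "sql"},
    {"databases", "networks"},
    {"sql", "python"},
]

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def support(a, b):
    """Fraction of all baskets that contain both items."""
    return pair_counts[tuple(sorted((a, b)))] / len(baskets)


def confidence(antecedent, consequent):
    """Fraction of baskets with `antecedent` that also contain `consequent`."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]
```

A high `confidence("python", "sql")` would suggest recommending the SQL book to a customer who buys the Python book. As the slide notes, an association like this is only a first step and does not establish causation.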

Classification Rules

Classification rules help assign new objects to classes.


E.g., given a new automobile insurance applicant, should he or she
be classified as low risk, medium risk or high risk?
Classification rules for above example could use a variety of data, such
as educational level, salary, age, etc.
∀ person P, P.degree = masters and P.income > 75,000
⇒ P.credit = excellent
∀ person P, P.degree = bachelors and
(P.income ≥ 25,000 and P.income ≤ 75,000)
⇒ P.credit = good
Rules are not necessarily exact: there may be some misclassifications
Classification rules can be shown compactly as a decision tree.
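
The two example rules can be written as a small function. This sketch reads the bachelor's rule as an inclusive income range from 25,000 to 75,000, which is an assumption about the intended bounds. The function name `classify_credit` and the `None` result for uncovered cases are also illustrative choices.

```python
def classify_credit(degree, income):
    """Apply the two example rules; return None when no rule fires."""
    if degree == "masters" and income > 75_000:
        return "excellent"
    if degree == "bachelors" and 25_000 <= income <= 75_000:
        return "good"
    return None  # the example rules do not cover this applicant
```

For instance, `classify_credit("bachelors", 40_000)` returns `"good"`, while a masters holder earning 50,000 is not covered by either rule.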

Decision Tree

Construction of Decision Trees
Training set: a data sample in which the classification is already
known.
Greedy top-down generation of decision trees.
Each internal node of the tree partitions the data into groups
based on a partitioning attribute, and a partitioning condition
for the node
Leaf node:
all (or most) of the items at the node belong to the same class,
or
all attributes have been considered, and no further partitioning
is possible.
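
The greedy top-down procedure can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the textbook's algorithm verbatim: attributes are categorical, the partitioning attribute is chosen by lowest weighted Gini impurity (one of several possible purity measures), and the training set is invented.

```python
from collections import Counter


def gini(rows):
    """Gini impurity of the class labels (last element of each row)."""
    counts = Counter(row[-1] for row in rows)
    return 1.0 - sum((n / len(rows)) ** 2 for n in counts.values())


def partition(rows, attr):
    """Group rows by their value for the attribute at index `attr`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    return groups


def build_tree(rows, attrs):
    """Greedy top-down construction: pick the locally best attribute, recurse."""
    labels = Counter(row[-1] for row in rows)
    majority = labels.most_common(1)[0][0]
    # Leaf: the node is pure, or no attributes remain to partition on.
    if len(labels) == 1 or not attrs:
        return majority

    def weighted_impurity(attr):
        groups = partition(rows, attr).values()
        return sum(len(g) / len(rows) * gini(g) for g in groups)

    best = min(attrs, key=weighted_impurity)
    remaining = [a for a in attrs if a != best]
    children = {
        value: build_tree(group, remaining)
        for value, group in partition(rows, best).items()
    }
    return (best, children, majority)


def predict(tree, row):
    """Follow partitioning conditions from the root down to a leaf."""
    while isinstance(tree, tuple):
        attr, children, majority = tree
        if row[attr] not in children:
            return majority  # unseen value: fall back to the node's majority
        tree = children[row[attr]]
    return tree


# Hypothetical training set: (degree, income_band, credit_class)
training = [
    ("masters", "high", "excellent"),
    ("masters", "mid", "good"),
    ("bachelors", "high", "good"),
    ("bachelors", "mid", "good"),
    ("bachelors", "low", "average"),
    ("none", "low", "average"),
]
tree = build_tree(training, attrs=[0, 1])
```

The choice at each node is never revisited, which is what makes the construction greedy. On this training set the root partitions on the income band, because that split leaves the purest groups.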

Other Types of Mining
Text mining: application of data mining to textual documents
cluster Web pages to find related pages
cluster pages a user has visited to organize their visit history
classify Web pages automatically into a Web directory
Data visualization systems help users examine large volumes of data
and detect patterns visually
Can visually encode large amounts of information on a single
screen
Humans are very good at detecting visual patterns
