Sei sulla pagina 1di 45

Designing the data warehouse / data mart

Methodologies and Techniques

Basic principles

Life cycle of the DW


First time load Operational Databases Refresh

Warehouse Database

Refresh

Purge or Archive Refresh

Data transfers into a database


First time system implementation
From a manual system

Data warehousing projects Database version upgrade ERP projects Migration


From old to new system

Data transfers between systems


Dynamic data (eg. sales orders)
Interface required?

Static data (eg. customers)


Conversion required?

What can go wrong


Data not available
feature activated from implementation onwards Massive data entry Eg: different account structure

Data incomplete Data inconsistent (eg: engineering vs accounts) Wrong level of granularity Data not clean New system requires changes new product codes

Data cleaning must address


Different department record same info under different codes Multiple records of same company (under different names) Fields missing in input tables (eg: c/o) Different depts. Record different addresses for same customer Use of different units for time periods

Labour intensive tasks


Data entry Data checks Working on solving conflicts Allocating new codes

Solution = introduce as much automation as possible


SQL / SQL loader (Oracle) Custom conversion programmes to extract, modify and upload data Filtering Parsing (eg: excel) Staging areas for conversion in progress

Data utilities
ORACLE is king of data handling Export: to transfer data between DBs
Extract both table structure and data content into dump file

Import: corresponding facility SQL*loader automatic import from a variety of file formats into DB files
Needs a control file

Control files: using SQLloader


Data tranfers in and out of DB can be automated using the loader
Create a data file with the data(!) Create a control file to guide the operation

Load creates two files


Log file bad transactions file

Also a discard file if control file has selection criteria in it

Example 1 the supplier file


New supplier code to include city where firm is based Assignation of category based on amounts purchased OLD Sup code 4 digits Sup name Sup address City Phone

Example 1 the supplier file


New supplier code to include city where firm is based Assignation of category based on amounts purchased OLD Sup code 4 digits Sup name Sup address City Phone

NEW

Sup code 3 letters + 4 digits

Sup name

Sup address Phone Cat 1,2,3 depending on total purchases last year

Example 2 New Cost Accounting Structure


Maintenance department expenditure: 1 account => separate accounts for different production activities OLD Intervention code Desc. Date Labour Parts Total

Example 2 New Cost Accounting Structure


Maintenance department expenditure: 1 account => separate accounts for different production activities OLD Intervention code NEW Intervention code Desc. Date labour Parts Total Account Desc. Date Labour Parts Total

Example 3: merging files


Complete customer file based on Accounts and Sales and Shipping
OLD (finance)
CustID name address city account number credit limit balance

OLD (sales)
CustID* name address city discount rates sales_to_date rep_name

OLD (Shipping)
CustID** name address city Preferred haulier

Example 4: change of business practices


Payment by bank draft for international customers Automatic payment into account for national customers Payment direct into account for all customers

Data Staging Area


The construction site for the warehouse Required by most scenarios Connected to wide variety of sources Clean / aggregate / compute / validate data

Operational system

Extract

Data staging area

Transform

Transport (Load)

Warehouse

Remote Staging Model


Data staging area within the warehouse environment
Oper. envt.
Operational system

Warehouse environment Extract, transform, transport


Data staging area Warehouse

Transport Transform (Load)

Data staging area in its own environment, avoiding negative impact on the warehouse environment
Oper. envt.
Operational system

Staging envt.
Data staging area

Warehouse envt.
Warehouse

Extract, transform, transport

Transport (Load) Transform

Onsite Staging Model


Data staging area within the operational environment, possibly affecting the operational system

Operational environment
Operational system Data staging area

WH envt.

Extract

Transform

Transport (Load)

Warehouse

Data Mart
A subset of a data warehouse that supports the requirements of a particular department or business function. Characteristics include:
Do not normally contain detailed operational data unlike data warehouses. May contain certain levels of aggregation

Dependent Data Mart


Flat Files Operational Systems

Marketing

Marketing Sales Finance Human Resources Data Warehouse

Sales

Finance Data Marts

External Data

Independent Data Mart


Operational Systems Flat Files

Sales or Marketing

External Data

Reasons for Creating a Data Mart


To give users more flexible access to the data they need to analyse most often. To provide data in a form that matches the specific needs of a group of users To improve end-user response time. Potential users of a data mart are clearly defined and can be targeted for support

Reasons for Creating a Data Mart


To provide appropriately structured data as dictated by the requirements of the end-user access tools. Building a data mart is simpler (and much quicker) compared with establishing a corporate data warehouse. The cost of implementing data marts is far less than that required to establish a data warehouse.

Exploiting the DW data


DW is a platform for creating a wide array of reports It solves data feed problems, but does not lead to specific decision support Need a model for organising data into meaningful reports Need specific interfaces for users

Exploiting the DW data


Data Staging Area Static Reporting

Source Systems Scrutinising

Data Warehouse
Relational Database on a dedicated Server

Multidimensional Data Cubes OLAP tools

Extraction Cleaning Transformation Loading

De normalised, data

Discovering Data Mining .

Multidimensional Models
Customer Market Time

Product
Time FINANCE

SALES

Product

P/L_Line

The data is found at the intersection of dimensions.

Representing multidimensional data

MOLAP Server
The application layer stores data in a multidimensional structure DSS client The presentation layer provides the MOLAP multidimensional view Engine Efficient storage and processing Application layer Complexity hidden from the user (but NOT from developer) Analysis using preaggregated summaries and precalculated Warehouse measures

ROLAP Server
The warehouse stores DSS client atomic data. The application layer ROLAP generates SQL for the engine three- dimensional view. Application The presentation layer Multiple layer SQL provides the multidimensional view.
Warehouse server

MOLAP
MDDB
Query Periodic load Warehouse Server

Data user

ROLAP
Cache
Live fetch Query

Data cache Warehouse Server

Data

user

Also Hybrid (HOLAP)

Choosing a Reporting Architecture


Business needs Good Potential for growth Query interface Performance enterprise architecture OK Network architecture Speed of access Openness
MOLAP

ROLAP

Simple

Complex Analysis

Modeling
Warehouses differ from operational structures:
Analytical requirements Subject orientation

Data must map to subject oriented information:


Identify business subjects Define relationships between subjects Name the attributes of each subject

Modeling is iterative Modeling tools are available

Modeling the Data Warehouse


1. Defining the business model 2. Creating the dimensional model 2, 3 3. Modeling summaries 4. Creating the physical model
4 Physical model 1 Select a business process

Identifying Business Rules


Location Geographic proximity 0 - 1 miles 1 - 5 miles > 5 miles Type PC Server Product Monitor 15 inch 17 inch 19 inch None Status New Rebuilt Custom

Time Month > Quarter > Year

Store Store > District > Region

Creating the Dimensional Model


Identify fact tables
Translate business measures into fact tables Analyze source system information for additional measures Identify base and derived measures Document additivity of measures

Identify dimension tables Link fact tables to the dimension tables Create views for users

Dimension Tables
Dimension tables have the following characteristics: Contain textual information that represents the attributes of the business Contain relatively static data Are joined to a fact table through a foreign key reference
Product

Channel

Facts (units, price)


Customer Time

Fact Tables
Fact tables have the following characteristics:
Contain numeric measures (metrics) of the business May contain summarized (aggregated) data May contain date-stamped data Are typically additive Have key value that is typically a concatenated key composed of the primary keys of the dimensions Joined to dimension tables through foreign keys that reference primary keys in the dimension tables

Dimensional Model (Star Schema)


Fact table Product Facts (units, price) Customer Time Channel

Dimension tables

Star Schema Model


Product Table Product_id Product_desc Store Table Store_id District_id ...

Central fact table Radiating dimensions Denormalized model


Time Table Day_id Month_id Period_id Year_id

Sales Fact Table Product_id Store_id Item_id Day_id Sales_dollars Sales_units ...
Item Table Item_id Item_desc ...

Star Schema Model


Easy for users to understand Fast response to simple queries Simple metadata Supported by many front end tools Less robust to change Does not support history

Using Summary Data


Phase 3: Modeling summaries

Provides fast access to precomputed data Reduces use of I/O, CPU, and memory Is distilled from source systems and precalculated summaries Usually exists in summary fact tables

Designing Summary Tables


Average Maximum
Units
Product A Total Product B Total Product C Total

Total Percentage
Sales() Store

Summary Tables Example


SALES FACTS Sales Region Month 10,000 North Jan 99 12,000 South Feb 99 11,000 North Jan 99 15,000 West Mar 99 18,000 South Feb 99 20,000 North Jan 99 10,000 East Jan 99 2,000 West Mar 99 SALES BY MONTH/REGION Month Region Tot_Sales$ Jan 99 North 41,000 Jan 99 East 10,000 Feb 99 South 40,000 Mar 99 West 17,000

SALES BY MONTH Month Tot_Sales Jan 99 51,000 Feb 99 40,000 Mar 99 17,000

Potrebbero piacerti anche