Sei sulla pagina 1di 36

1

Data Warehousing
OLAP
Data Mining
Further Reading

Enroll Now
https://goo.gl/QbTVal
2

Data Warehousing
OLTP (online transaction processing) systems
range in size from megabytes to terabytes
high transaction throughput
Decision makers require access to all data
Historical and current
'A data warehouse is a subject-oriented, integrated, timevariant and non-volatile collection of data in support of
managements decision-making process' (Inmon 1993)

Enroll Now
https://goo.gl/QbTVal
3

Benefits
Potential high returns on investment
90% of companies in 1996 reported return of investment
(over 3 years) of > 40%
Competitive advantage
Data can reveal previously unknown, unavailable and
untapped information
Increased productivity of corporate decision-makers
Integration allows more substantive, accurate and consistent
analysis
4

Typical Architecture
Mainframe operational
n/w,h/w data
Warehouse mgr
Meta-data

Departmental
RDBMS data
Private data

Load
mgr

Highly
summarized Query
data
manager
Lightly summarized
data
Detailed data
DBMS
Warehouse mgr

External data

Reporting query, app


development,EIS tools

OLAP tools

Data-mining tools
Archive/backup

Source: Connolly and Begg p1157


5

Data Warehouses
Types of Data
Detailed
Summarised
Meta-data
Archive/Back-up

Enroll Now
https://goo.gl/QbTVal
6

Information Flows
Operational data
source 1

Inflow
Load
mgr

Warehouse Mgr
Meta-flow
Metadata
Highly
summ.
data
Lightly
Upflow
summ.
Detailed data
DBMS
Warehouse mgr

Downflow

Operational data
source n
Source Connolly and Begg p1162

Reporting query, app


development,EIS tools
Outflow
Query
manager

OLAP tools

Data-mining tools
Archive/backup

Information Flow
Processes
Five primary information flows
Inflow - extraction, cleansing and loading of data from
source systems into warehouse
Upflow - adding value to data in warehouse through
summarizing, packaging and distributing data
Downflow - archiving and backing up data in warehouse
Outflow - making data available to end users
Metaflow - managing the metadata
8

1.
2.
3.
4.
5.
6.
7.
8.
9.
10.

Problems of Data
Warehousing
Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long duration projects
Complexity of integration

Data Warehouse Design


Data must be designed to allow ad-hoc queries to
be answered with acceptable performance
constraints
Queries usually require access to factual data
generated by business transactions
e.g. find the average number of properties rented out
with a monthly rent greater than 700 at each branch
office over the last six months

Uses Dimensionality Modelling


10

Dimensionality Modelling
Similar to E-R modelling but with constraints

composed of one fact table with a composite


primary key
dimension tables have a simple primary key which
corresponds exactly to one foreign key in the fact
table
uses surrogate keys based on integer values
Enroll Now
Can efficiently and easily support ad-hoc end-user
https://goo.gl/QbTVal
queries
11

Star Schemas
The most common dimensional model
A fact table surrounded by dimension tables
Fact tables
contains FK for each dimension table
large relative to dimension tables
read-only

Dimension tables

reference data
query performance speeded up by denormalising
into a single dimension table
12

E-R Model Example

Enroll Now
https://goo.gl/QbTVal
13

Star Schema Example

Enroll Now
https://goo.gl/QbTVal
14

Other Schemas
Snowflake schemas
variant of star schema
each dimension can have its own dimensions

Starflake schemas
hybrid structure
contains mixture of (denormalised) star and
(normalised) snowflake schemas
15

OLAP
Online Analytical Processing
dynamic synthesis, analysis and consolidation of large
volumes of multi-dimensional data
normally implemented using specialized multidimensional DBMS
a method of visualising and manipulating data with
many inter-relationships

16

Codds OLAP Rules


1. Multi-dimensional conceptual view
2. Transparency
3. Accessibility
4. Consistent reporting performance
5. Client-server architecture
6. Generic dimensionality
7. Dynamic sparse matrix handling
8. Multi-user support
9. Unrestricted cross-dimensional operations
10. Intuitive data manipulation

17

OLAP Tools
Categorised according to architecture of underlying database
Multi-dimensional OLAP
data typically aggregated and stored according to predicted usage
use array technology
Relational OLAP
use of relational meta-data layer with enhanced SQL
Managed Query Environment
deliver data direct from DBMS or MOLAP server to desktop in form
of a datacube

18

MOLAP
RDB
Server

MOLAP
server
Load

Database/Application
Enroll Now Logic Layer

https://goo.gl/QbTVal

Request
Result

Presentation
Layer
19

ROLAP
RDB
Server

Database
Layer
Enroll Now

SQL
Result

https://goo.gl/QbTVal

ROLAP
server

Request
Result

Application
Logic Layer

Presentation
Layer

20

MQE
RDB
Server

End-user
tools

SQL
Result
MOLAP
server
Load

Request
Result

Enroll Now
https://goo.gl/QbTVal
21

Data Mining
The process of extracting valid, previously
unknown, comprehensible and actionable
information from large databases and using it to
make crucial business decisions
focus is to reveal information which is hidden or
unexpected
patterns and relationships are identified by examining
the underlying rules and features of the data
work from data up
require large volumes of data

22

Example Data Mining


Applications
Retail/Marketing
Identifying buying patterns of customers
Finding associations among customer demographic
characteristics
Predicting response to mailing campaigns
Market basket analysis

Enroll Now
https://goo.gl/QbTVal
23

Example Data Mining Applications


Banking
Detecting patterns of fraudulent credit card use
Identifying loyal customers
Predicting customers likely to change their credit card
affiliation
Determining credit card spending by customer groups

Enroll Now
https://goo.gl/QbTVal

24

Data Mining Techniques


Four main techniques
Predictive Modeling
Database Segmentation
Link Analysis
Deviation Direction

Enroll Now
https://goo.gl/QbTVal
25

Data Mining Techniques


Predictive Modelling
using observations to form a model of the important
characteristics of some phenomenon
Techniques:
Classification
Value Prediction

Enroll Now
https://goo.gl/QbTVal

26

Classification Example- Tree


Induction
Customer
renting property
No

> 2 years

Yes
Customer age
> 25 years?

Rent property
No
Rent property

Yes
Buy property

Enroll Now
https://goo.gl/QbTVal
27

Data Mining Techniques


Database Segmentation:
to partition a database into an unknown number of
segments (or clusters) of records which share a number of
properties
Techniques:
Demographic clustering
Neural clustering
Enroll Now

https://goo.gl/QbTVal

28

Segmentation: Scatterplot
Example

Enroll Now
https://goo.gl/QbTVal
29

Data Mining Techniques


Link Analysis
establish associations between individual records (or sets of records) in a
database
e.g. when a customer rents property for more than two years and is
more than 25 years old, then in 40% of cases, the customer will buy the
property
Techniques
Association discovery
Sequential pattern discovery
Similar time sequence discovery

30

Data Mining Techniques


Deviation Detection
identify outliers, something which deviates from some
known expectation or norm
Statistics
Visualisation

Enroll Now
https://goo.gl/QbTVal

31

Deviation Detection: Visualisation


Example

32

Mining and Warehousing

Data mining needs single, separate, clean, integrated, selfconsistent data source
Data warehouse well equipped:
populated with clean, consistent data
contains multiple sources
utilises query capabilities
capability to go back to data source
33

Further Reading
Connolly and Begg, chapters 31 to 34.
W H Inmon, Building the Data Warehouse, New York, Wiley
and Sons, 1993.
Benyon-Davies P, Database Systems (2nd ed), Macmillan
Press, 2000, ch 34, 35 & 36.

Enroll Now
https://goo.gl/QbTVal
34

35

36

Potrebbero piacerti anche