
DATA WAREHOUSING AND DATA MINING
By:
Raghav Agrawal
B.Tech (E&T)
III yr – A
A1607107107
Overview

❚ Introduction
❚ Data Warehousing
❚ Data Warehousing vs. OLTP
❚ Data Mining

Motivation: “Necessity is the Mother of Invention”
❚ Data explosion problem
❚ Solution: Data warehousing and data mining
❙ Data warehousing and on-line analytical processing
❙ Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases

What is a Data Warehouse?

A single, complete and consistent store of data
obtained from a variety of different sources,
made available to end users in a way they
can understand and use in a business context.

Warehouses are Very Large Databases

[Figure: bar chart of the percentage of respondents (0% to 35%) by warehouse size, comparing initial size with size projected for 2Q96. Size bins: 5GB, 5-9GB, 10-19GB, 20-49GB, 50-99GB, 100-249GB, 250-499GB, 500GB-1TB. Source: META Group, Inc.]
Very Large Data Bases
❚ Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
❚ Petabytes -- 10^15 bytes: Geographic Information Systems
❚ Exabytes -- 10^18 bytes: National Medical Records
❚ Zettabytes -- 10^21 bytes: Weather images
❚ Yottabytes -- 10^24 bytes: Intelligence Agency Videos

Data Warehousing --
It is a process
❚ Technique for assembling and
managing data from various
sources for the purpose of
answering business
questions, thus enabling
decisions that were not
previously possible
❚ A decision support database
maintained separately from
the organization’s operational
database
Characteristics of Data Warehouse

❚ A data warehouse is a
❙ subject-oriented
❙ integrated
❙ time-varying
❙ non-volatile
collection of data that is used primarily in
organizational decision making.

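Data Warehouse Architecture

[Figure: architecture diagram. Relational databases and ERP systems feed extraction, cleansing, and an optimized loader into the data warehouse engine, which serves analysis and query tools; a metadata repository accompanies the warehouse.]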
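
The flow named on the architecture slide above (extract from source systems, cleanse, load into the warehouse) can be sketched in a few lines of Python. This is only an illustration, not a description of any particular product: the two source lists, the field names, the country lookup table, and the `extract`, `cleanse`, and `load` helpers are all hypothetical.

```python
# Hypothetical source systems: a relational order table and an ERP export.
relational_rows = [
    {"cust": "C1", "amount": "120.50", "country": "in"},
    {"cust": "C2", "amount": "80", "country": "IN "},
]
erp_rows = [
    {"customer_id": "C1", "total": 40.0, "country": "India"},
    {"customer_id": "C3", "total": None, "country": "US"},
]

# Assumed lookup table used to make country values consistent.
COUNTRY_CODES = {"in": "IN", "india": "IN", "us": "US"}


def extract():
    """Map both sources onto one common record layout."""
    for row in relational_rows:
        yield {"customer": row["cust"], "amount": row["amount"],
               "country": row["country"], "source": "relational"}
    for row in erp_rows:
        yield {"customer": row["customer_id"], "amount": row["total"],
               "country": row["country"], "source": "erp"}


def cleanse(records):
    """Resolve inconsistencies before loading; drop rows with no amount."""
    for rec in records:
        if rec["amount"] is None:
            continue
        rec["amount"] = float(rec["amount"])
        rec["country"] = COUNTRY_CODES[rec["country"].strip().lower()]
        yield rec


def load(records, warehouse):
    """Append cleansed records to the warehouse table (a plain list here)."""
    warehouse.extend(records)
    return warehouse


warehouse = load(cleanse(extract()), [])
for rec in warehouse:
    print(rec)
```

Keeping the three stages as separate functions mirrors the boxes in the diagram; a real loader would also record what it did in the metadata repository, which this sketch leaves out.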
Application Areas

Industry                  Application
Finance                   Credit card analysis
Insurance                 Claims, fraud analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data service providers    Value added data
Utilities                 Power usage analysis

What makes data mining possible?

❚ Advances in the following areas are
making data mining deployable:
❙ data warehousing
❙ better and more data (i.e., operational,
behavioral, and demographic)
❙ the emergence of easily deployed data
mining tools and
❙ the advent of new data mining
techniques.
Benefits of Data warehouse
❙ A data warehouse provides a common data
model for all data of interest regardless of the
data's source.
❙ Prior to loading data into the data warehouse,
inconsistencies are identified and resolved.
❙ Because they are separate from operational
systems, data warehouses provide retrieval of
data without slowing down operational systems.
❙ Data warehouses can work in conjunction with
and, hence, enhance the value of operational
business applications, notably customer
relationship management (CRM) systems.

DISADVANTAGES OF DATA
WAREHOUSES
❙ Data warehouses are not the optimal
environment for unstructured data.
❙ There is an element of latency in data
warehouse data.
❙ Data warehouses can have high costs.
Maintenance costs are high.
❙ Data warehouses can get outdated relatively
quickly.

So, what’s different between OLTP and DW?

OLTP vs Data Warehouse

❚ OLTP
❙ Application oriented
❙ Used to run business
❙ Detailed data
❙ Current, up to date
❙ Isolated data
❚ Warehouse
❙ Subject oriented
❙ Used to analyze business
❙ Summarized and refined
❙ Snapshot data
❙ Integrated data

OLTP vs Data Warehouse

❚ OLTP
❙ Performance sensitive
❙ Few records accessed at a time (tens)
❙ Read/update access
❙ No data redundancy
❙ Database size 100 MB - 100 GB
❚ Data Warehouse
❙ Performance relaxed
❙ Large volumes accessed at a time (millions)
❙ Mostly read (batch update)
❙ Redundancy present
❙ Database size 100 GB - few terabytes
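
The contrast in access patterns (a few records read and updated at a time versus large volumes read in bulk) can be illustrated with Python's built-in `sqlite3` module. The `orders` table and its columns are made up for this sketch, and a single in-memory database stands in for both systems, which in practice would be separate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("north", 10.0), ("south", 25.0), ("north", 15.0), ("south", 5.0)],
)

# OLTP style: read and update one record, identified by its key.
conn.execute("UPDATE orders SET amount = amount + 1 WHERE id = ?", (1,))
one_order = conn.execute(
    "SELECT id, region, amount FROM orders WHERE id = ?", (1,)
).fetchone()

# Warehouse style: read-only scan that summarizes many records at once.
totals = dict(
    conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"
    ).fetchall()
)

print(one_order)
print(totals)
conn.close()
```

The first query touches a single row by primary key; the second reads every row to produce a summary, which is the kind of workload a warehouse is kept separate to absorb.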

To summarize DW & OLTP...
❚ OLTP Systems are
used to “run” a
business

❚ The Data
Warehouse helps
to “optimize” the
business
What Is Data Mining?
❚ The objective of data mining is to extract
valuable information from your data, to discover
the “hidden gold.”
❚ Note that you do not need a data warehouse to
successfully use data mining; all you need is
data.
❚ On-Line Analytical Processing (OLAP) -- a DM tool.

DATA MINING MODELS

According to IBM

❚ Verification Model

❚ Discovery Model

Steps for Data Mining
❚ Identify
❙ Find sales relationships between specific
products or services
❙ Identify specific purchasing patterns over
time
❙ Identify potential types of customers
❙ Find product sales trends.
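
One common way to look for sales relationships between products is to count how often items are bought together. The sketch below computes support and confidence for item pairs. The baskets are invented sample data, and simple pair counting (as used in market basket analysis) is a technique chosen here for illustration; the slides do not prescribe a method.

```python
from collections import Counter
from itertools import combinations

# Invented transactions: each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    # sorted() gives each pair one canonical ordering.
    pair_counts.update(combinations(sorted(basket), 2))


def support(a, b):
    """Fraction of all baskets that contain both items."""
    return pair_counts[tuple(sorted((a, b)))] / len(baskets)


def confidence(a, b):
    """Of the baskets containing a, the fraction that also contain b."""
    return pair_counts[tuple(sorted((a, b)))] / item_counts[a]


print(support("bread", "milk"))
print(confidence("bread", "milk"))
```

Support says how common a pairing is overall, while confidence says how strongly one item predicts the other; both would normally be checked against a minimum threshold before a pairing is treated as a finding.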

❚ Select
❙ Are the data adequate to describe the
phenomena the data mining analysis is
attempting to model?
❙ Can you enhance internal customer records with
external lifestyle and demographic data?
❙ Are the data stable—will the mined attributes be
the same after the analysis?
❙ If you are merging databases can you find a
common field for linking them?
❙ How current and relevant are the data to the
business goal?
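
The question about a common linking field can be made concrete with a small join. In this sketch the `customer_id` key, the record fields, and both datasets are hypothetical; it enriches internal customer records with external demographic data and reports which customers could not be linked.

```python
# Hypothetical internal records and external demographic data.
internal = [
    {"customer_id": 1, "total_spend": 250.0},
    {"customer_id": 2, "total_spend": 90.0},
    {"customer_id": 3, "total_spend": 410.0},
]
external = [
    {"customer_id": 1, "age_band": "25-34"},
    {"customer_id": 3, "age_band": "45-54"},
]

# Index the external data by the shared field for constant-time lookup.
demographics = {row["customer_id"]: row for row in external}

enriched = []
unmatched = []
for record in internal:
    extra = demographics.get(record["customer_id"])
    if extra is None:
        unmatched.append(record["customer_id"])
    else:
        enriched.append({**record, **extra})

print(enriched)
print("no external match for:", unmatched)
```

The size of the unmatched list is one rough indicator of whether the external source is adequate for the analysis.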

❚ Prepare
❙ Establish strategies for handling missing data,
extraneous noise, and outliers
❙ Identify redundant variables in the dataset and
decide which fields to exclude
❙ Decide on a log or square transformation, if
necessary
❙ Visually inspect the dataset to get a feel for
the database
❙ Determine the distribution frequencies of the
data
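
A minimal sketch of these preparation steps, using only the Python standard library (3.8 or later for `statistics.quantiles`). The sample values, the choice to fill missing entries with the median, and the 1.5 × IQR outlier rule are assumptions made for illustration; the slides do not prescribe a particular strategy.

```python
import math
import statistics
from collections import Counter

# Invented sample: purchase amounts with a missing value and an outlier.
amounts = [12.0, 15.0, None, 14.0, 13.0, 400.0, 16.0]
segments = ["a", "b", "a", "a", "c", "b", "a"]

# Missing data: fill with the median of the observed values.
observed = [x for x in amounts if x is not None]
median = statistics.median(observed)
filled = [median if x is None else x for x in amounts]

# Outliers: flag values more than 1.5 * IQR beyond the quartiles.
q1, _, q3 = statistics.quantiles(filled, n=4)
iqr = q3 - q1
outliers = [x for x in filled if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

# Log transformation to compress the long right tail.
logged = [math.log(x) for x in filled]

# Frequency distribution of a categorical field.
frequencies = Counter(segments)

print(median, outliers, frequencies.most_common())
```

Whether a flagged outlier is dropped, capped, or kept is a judgment that depends on the business question; the code only identifies candidates.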

❚ Audit the data
❙ What is the ratio of categorical/binary
attributes in the database?
❙ What is the nature and structure of the
database?
❙ What is the overall condition of the
dataset?

❚ Select the Tool
❙ Is the data set heavily categorical?
❙ What platforms do your candidate tools
support?
❙ Are the candidate tools ODBC-compliant?
❙ What data format can the tools import?
❚ Format the Solution
❙ What is the optimum format of the solution—
decision tree, rules, C code, SQL syntax?
❙ What are the available format options?
❙ What is the goal of the solution?

❚ Construct the Model


❙ Are error rates at acceptable levels? Can you
improve them?
❙ What extraneous attributes did you find? Can
you purge them?
❙ Is additional data or a different methodology
necessary?
❙ Will you have to train and test a new data
set?
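
The questions about error rates and about training and testing can be illustrated with the simplest possible decision tree, a one-level "decision stump". Everything here is an assumption for the sake of the example: the data are invented, the split into training and test sets is fixed by hand, and `fit_stump` and `error_rate` are hypothetical helpers, not library functions.

```python
# Invented data: (monthly_spend, responded_to_offer).
train = [(10, 0), (20, 0), (30, 0), (40, 1), (50, 1), (60, 1), (35, 1)]
test = [(15, 0), (25, 0), (45, 1), (55, 1), (32, 1)]


def error_rate(threshold, rows):
    """Fraction of rows misclassified by the rule: predict 1 if x > threshold."""
    wrong = sum(1 for x, label in rows if (x > threshold) != bool(label))
    return wrong / len(rows)


def fit_stump(rows):
    """Try midpoints between sorted values; keep the lowest-error threshold."""
    values = sorted({x for x, _ in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: error_rate(t, rows))


threshold = fit_stump(train)
print("threshold:", threshold)
print("training error:", error_rate(threshold, train))
print("test error:", error_rate(threshold, test))
```

Comparing the two error rates is the point of holding data back: a model that fits its training rows well but does worse on unseen rows is a prompt to revisit the attributes, the data, or the method.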
❚ Validate the Findings
❙ Do the findings make sense?
❙ Do you have to return to any prior steps to
improve results?
❙ Can you use other data mining tools to replicate
the findings?

❚ Deliver The Findings


❙ Will additional data improve the analysis?
❙ What strategic insight did you discover and
how is it applicable?
❙ What proposals can result from the data
mining analysis?
❙ Do the findings meet the business objective?
❚ Integrate The Solution
❙ SQL syntax for distribution to end-users
❙ C code incorporated into a production system
❙ Rules integrated into a decision support
system.

Data Mining Algorithms

Some of the DM algorithms are

❚ Neural Networks
❚ Decision Trees

Neural Network

Decision Trees

Thank you!
