UNIT 1 Introduction of Data Mining

Data Mining: Introduction
Chapter 1
Introduction to Data Mining
6/30/2019 Introduction to Data Mining 1

Large-scale Data is Everywhere!
 There has been enormous data growth in both commercial and
scientific databases due to advances in data generation and
collection technologies.
 New mantra
 Gather whatever data you can whenever and wherever possible.
 Expectations
 Gathered data will have value either for the purpose collected
or for a purpose not envisioned.
Fig: Example panels: E-Commerce; Cyber Security; Social Networking: Twitter;
Traffic Patterns; Sensor Networks; Computational Simulations.

Data Mining
Data Mining: A technology that blends traditional analysis methods with
sophisticated algorithms for processing large volumes of data.
DM = (traditional analysis methods + sophisticated algorithms) applied to
large volumes of data

Why Data Mining? Commercial Viewpoint
Business:
 Bar code scanners, RFID, and smart card technology collect up-to-the-minute
data about customer purchases.
 Retailers can use this information, together with data from e-commerce
websites, to make better business decisions.
 Data mining techniques can be applied for
– customer profiling
– Marketing
– Store layout
– Fraud detection
 This helps retailers to answer important questions like
– “Who are the most profitable customers?”
– “What products can be cross-sold or up-sold?”
– “What is the revenue outlook of the company for next year?”

Why Data Mining? Scientific Viewpoint
Medicine, Science and Engineering
 Researchers accumulate data for new discoveries.
Examples:
– Understanding Earth’s climate system
 NASA EOSDIS archives petabytes of earth science data per year
– Telescopes scanning the skies
 Sky survey data
– High-throughput biological data
– Scientific simulations
 Terabytes of data generated in a few hours
Fig: Example panels: fMRI Data from Brain; Sky Survey Data; Gene Expression Data;
Surface Temperature of Earth.

Why Data Mining? Scientific Viewpoint (cont.)
Traditional methods are often not suitable for analyzing these huge amounts
of data, so data mining techniques can aid in answering questions like
• “What is the relationship of the frequency and intensity of ecosystem
disturbances such as droughts and hurricanes to global warming?”
• “How are land surface precipitation and temperature affected by ocean
surface temperature?”
• “How well can we predict the beginning and end of the growing season
for a region?”

Great Opportunities to Solve Society’s Major Problems
– Improving health care and reducing costs
– Predicting the impact of climate change
– Reducing hunger and poverty by increasing agriculture production
– Finding alternative/green energy sources

What is Data Mining?
 Many Definitions
– Data mining is a technology that blends traditional data
analysis methods with sophisticated algorithms for
processing large volumes of data.
– Data mining is the process of automatically
discovering useful information in large data
repositories.
– Data mining is an integral part of knowledge discovery
in databases (KDD), which is the overall process of
converting raw data into useful information.

Data Mining: A KDD Process
– Data mining is the core of the knowledge discovery process.
Fig: The KDD process: Databases → Data Cleaning and Data Integration →
Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation.

7 steps of KDD
 Data Integration: data is collected and integrated from different sources.
 Data Cleaning: data may contain errors, missing values, or inconsistencies.
Cleaning removes these anomalies.
 Data Selection: select only the data considered useful for the mining task.
 Data Transformation: transform the cleaned data into forms appropriate for
mining, using techniques such as smoothing, aggregation, and normalization.
 Data Mining: apply data mining techniques to the data in order to discover
interesting patterns.
 Pattern Evaluation: includes visualization, transformation, and removal of
redundant patterns from the generated patterns.
 Decisions / Use of Discovered Knowledge: use the acquired knowledge to make
better decisions.
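
The steps above can be sketched as a chain of small functions. This is only an illustrative toy, not a reference implementation: the records, the field names, the "average per region" mining step, and the 0.4 threshold are all made-up assumptions chosen to keep the example short.

```python
from statistics import mean

# Hypothetical records from two sources (all names and values are made up).
store_sales = [
    {"customer": "a", "amount": 120.0, "region": "north"},
    {"customer": "b", "amount": None, "region": "north"},   # missing value
    {"customer": "c", "amount": 80.0, "region": "south"},
]
web_sales = [
    {"customer": "d", "amount": 200.0, "region": "south"},
    {"customer": "c", "amount": 80.0, "region": "south"},   # duplicate record
]

def integrate(*sources):
    """Data integration: combine records from several sources."""
    return [row for source in sources for row in source]

def clean(rows):
    """Data cleaning: drop rows with missing values and exact duplicates."""
    seen, out = set(), []
    for row in rows:
        if None in row.values():
            continue
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        out.append(row)
    return out

def select(rows, fields):
    """Data selection: keep only the attributes relevant to the task."""
    return [{f: row[f] for f in fields} for row in rows]

def transform(rows):
    """Data transformation: min-max normalize 'amount' to [0, 1].

    Assumes at least two distinct amounts, otherwise hi - lo would be zero.
    """
    amounts = [r["amount"] for r in rows]
    lo, hi = min(amounts), max(amounts)
    return [{**r, "amount_norm": (r["amount"] - lo) / (hi - lo)} for r in rows]

def mine(rows):
    """Data mining (toy stand-in): average normalized amount per region."""
    by_region = {}
    for r in rows:
        by_region.setdefault(r["region"], []).append(r["amount_norm"])
    return {region: mean(vals) for region, vals in by_region.items()}

def evaluate(patterns, threshold=0.4):
    """Pattern evaluation: keep only patterns above an assumed threshold."""
    return {k: v for k, v in patterns.items() if v > threshold}

rows = integrate(store_sales, web_sales)
rows = clean(rows)
rows = select(rows, ["amount", "region"])
rows = transform(rows)
patterns = mine(rows)
interesting = evaluate(patterns)
print(interesting)
```

The final step, using the discovered knowledge, is left to the reader: `interesting` is what would be handed to a decision maker.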
Fig: The process of knowledge discovery in databases.
• Input data
• Pre-processing:
– Fusing data from multiple sources
– Cleaning data to remove noise and duplicates
– Selecting features or records that are relevant to the data mining task
– Transforming the raw input data into an appropriate format for analysis
• Post-processing:
– “Closing the loop” refers to the process of integrating data mining results
into a decision support system. Ex: in a business application, data mining
results can be integrated with campaign management for effective marketing
promotions. This requires a post-processing step to ensure that only valid and
useful results are incorporated into the decision support system.
What is (not) Data Mining?
 What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
 What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien,
O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to
their context (e.g., Amazon rainforest, Amazon.com)
Motivating Challenges
 Scalability:
– Novel data structures, out-of-the-core algorithms,
parallel and distributed algorithms.
 High Dimensionality:
– Ex: data sets with temporal and spatial components often have a very
large number of dimensions (attributes).
 Heterogeneous and Complex Data:
– Ex: collections of web pages containing semi-structured data and
hyperlinks; DNA data with three-dimensional structure; climate data
with time series measurements.
Motivating Challenges (cont.)
 Data Ownership and Distribution:
Key challenges faced by distributed data mining algorithms:
– How to reduce the amount of communication needed
to perform the distributed computation
– How to effectively consolidate the data mining results
obtained from multiple sources
– How to address data security issues.

Motivating Challenges (cont.)
 Non-traditional Analysis:
– Traditional statistical approach is based on a
hypothesize-and-test paradigm. A hypothesis is
proposed, an experiment is designed to gather the
data, and then data is analyzed w.r.t. the hypothesis.
– Current data analysis requires the evaluation of thousands of
hypotheses, hence there is a need for automating the process of
hypothesis generation and evaluation.

Origins of Data Mining
 Traditional techniques may be unsuitable due to data that is
– Large-scale
– High dimensional
– Heterogeneous
– Complex
– Distributed
 To meet these challenges, data mining researchers began to focus on
developing more efficient and scalable tools that could handle diverse
types of data.
 Draws ideas from
– Sampling, estimation, and hypothesis testing from statistics
– Search algorithms, modeling techniques, and learning theories from
artificial intelligence, pattern recognition, and machine learning
– Other areas such as optimization and visualization, whose ideas data
mining has been quick to adopt

Origins of Data Mining (cont.)
 The figure shows the relationship of data mining to other areas.
 Database systems provide support for efficient storage, indexing, and
query processing.
 High performance (parallel) computing helps to address massive datasets.
 Distributed techniques help to address the issue of size when the data
cannot be gathered in one location.

Data Mining Tasks
Data mining tasks are generally divided into 2 major categories.
1. Predictive Tasks
2. Descriptive Tasks
 Predictive Tasks:
– Objective is to predict the value of a particular attribute based on
the values of other attributes.
– The attribute to be predicted is called the target or dependent variable.
– The attributes used for making the prediction are called explanatory or
independent variables.

Data Mining Tasks (cont.)
 Descriptive Tasks:
– Objective here is to derive patterns like correlations,
trends, clusters, anomalies that summarize the
relationships in data.
– These are exploratory in nature and frequently require
post processing techniques to validate and explain the
results
– Find human-interpretable patterns that describe the
data.

Fig: Illustrates 4 of the core data mining tasks.
Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
12   Yes     Divorced        220K            No

Predictive Modeling: Classification
 Task of building a model for the target variable as a function of the
explanatory variables.
 Are of two types
– Classification
– Regression

Tid  Employed  Level of Education  # years at present address  Credit Worthy (Class)
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Fig: Model for predicting credit worthiness (decision tree).
Employed: No → No; Yes → Education.
Education: Graduate → Number of years (> 3 yr → Yes, < 3 yr → No);
{High school, Undergrad} → Number of years (> 7 yrs → Yes, < 7 yrs → No).
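
The credit-worthiness tree in the figure can be written out as nested conditions. This is a sketch based on one reading of the figure: the pairing of the "Graduate" branch with the 3-year test and the "{High school, Undergrad}" branch with the 7-year test is an assumption, chosen because it is the pairing that agrees with all four training records shown. The figure does not say what happens at exactly 3 or 7 years, so the sketch arbitrarily answers "No" there.

```python
def credit_worthy(employed, education, years_at_address):
    """Hand-coded reading of the decision tree (branch pairing is assumed)."""
    if employed == "No":
        return "No"
    if education == "Graduate":
        # Boundary case (exactly 3 years) is not specified; assumed "No".
        return "Yes" if years_at_address > 3 else "No"
    # Remaining education levels: High School or Undergrad.
    # Boundary case (exactly 7 years) is not specified; assumed "No".
    return "Yes" if years_at_address > 7 else "No"

# Training records copied from the table: (Employed, Education, Years, Class).
training = [
    ("Yes", "Graduate", 5, "Yes"),
    ("Yes", "High School", 2, "No"),
    ("No", "Undergrad", 1, "No"),
    ("Yes", "High School", 10, "Yes"),
]

predictions = [credit_worthy(e, edu, y) for e, edu, y, _ in training]
errors = sum(p != row[3] for p, row in zip(predictions, training))
print(predictions, "misclassified:", errors)
```

A tree that reproduces the training labels is not necessarily a good model; its quality is judged on records it has not seen.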

Classification Example
Used for discrete target variables, e.g., a binary-valued target.

Training Set
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Test Set
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

Fig: Training Set → Learn Classifier → Model; the Model is then applied to the Test Set.

Examples of Classification Task
 Classifying credit card transactions as legitimate or fraudulent
 Classifying land covers (water bodies, urban areas, forests, etc.)
using satellite data
 Categorizing news stories as finance, weather, entertainment, sports, etc.
 Identifying intruders in the cyberspace
 Predicting tumor cells as benign or malignant
 Classifying secondary structures of protein as alpha-helix, beta-sheet,
or random coil

Classification: Application 1
 Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
 Use credit card transactions and the information
on its account-holder as attributes.
– When does a customer buy, what does he buy, how
often he pays on time, etc
 Label past transactions as fraud or fair
transactions. This forms the class attribute.
 Learn a model for the class of the transactions.
 Use this model to detect fraud by observing credit
card transactions on an account.
Classification: Application 2
 Churn prediction for telephone customers
– Goal: To predict whether a customer is likely to be lost to a competitor.
– Approach:
 Use detailed record of transactions with each of the
past and present customers, to find attributes.
– How often the customer calls, where he calls, what time-
of-the day he calls most, his financial status, marital
status, etc.
 Label the customers as loyal or disloyal.
 Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997

Regression (for continuous target variables)
 Predict the value of a continuous-valued variable based on the values
of other variables.
 Extensively studied in the statistics and neural network fields.
 Examples:
– Predicting sales amounts of new product based on
advertising expenditure.
– Forecasting the future price of a stock.
– Predicting the price of a house based on the values of other
variables.
NOTE: The goal of both predictive tasks is to find a model that minimizes
the error between the predicted and actual values of the target variable.
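
As an illustration of "minimizing the error", the sketch below fits a straight line by ordinary least squares, which is one common choice of error criterion (mean squared error) and not the only one. The advertising and sales figures are invented for the example.

```python
# Hypothetical data: advertising spend (x) and sales (y), arbitrary units.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between predicted and actual target values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

slope, intercept = fit_line(spend, sales)
mse_fit = mean_squared_error(spend, sales, slope, intercept)
# Any other line, e.g. a guessed y = 2x + 1, cannot have a smaller error.
mse_guess = mean_squared_error(spend, sales, 2.0, 1.0)
print(f"slope={slope:.2f} intercept={intercept:.2f} mse={mse_fit:.4f}")
```

For classification the error is usually counted as misclassified records instead of squared differences, but the idea of picking the model with the smallest error is the same.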
Cluster Analysis
 Finds groups of objects such that the objects in a group will be similar
(or related) to one another and different from (or unrelated to) the
objects in other groups.
Fig: Intra-cluster distances are minimized; inter-cluster distances are maximized.
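
The two quantities named in the figure can be computed directly. The sketch below uses made-up 2-D points that are already assigned to two clusters, and Euclidean distance as an assumed similarity measure; it only measures a given grouping and does not perform the clustering itself.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points, already grouped into two clusters.
clusters = {
    "A": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)],
}

def mean_intra(points):
    """Average distance between pairs of points inside one cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter(points_a, points_b):
    """Average distance between points taken from two different clusters."""
    total = sum(dist(p, q) for p in points_a for q in points_b)
    return total / (len(points_a) * len(points_b))

intra_a = mean_intra(clusters["A"])
intra_b = mean_intra(clusters["B"])
inter_ab = mean_inter(clusters["A"], clusters["B"])
print(f"intra A={intra_a:.2f} intra B={intra_b:.2f} inter={inter_ab:.2f}")
```

A good grouping shows small intra-cluster values relative to the inter-cluster value, as these toy points do.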

Cluster Analysis - Examples
Examples:
 Group sets of related customers
 Find areas of ocean which have significant
impact on Earth’s climate.

Applications of Cluster Analysis
 Understanding
– Customer profiling for targeted marketing
– Group related documents for browsing
– Group genes and proteins that have similar functionality
– Group stocks with similar price fluctuations
 Summarization
– Reduce the size of large data sets
Fig: Courtesy: Michael Eisen

Clustering: Application 1
 Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
 Collect different attributes of customers based on
their geographical and lifestyle related information.
 Find clusters of similar customers.
 Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those
from different clusters.

Clustering: Application 2
 Document Clustering:
– Goal: To find groups of documents that are similar to each other based
on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document. Form
a similarity measure based on the frequencies of different terms. Use it
to cluster.

Document Clustering
Consider the collection of news articles in the table. The articles can be
grouped based on their respective topics.
Document Clustering (cont.)
 Each article is represented as a set of word-frequency pairs (w, c), where
w = word
c = number of times the word appears in the article.
There are two natural clusters in the data set:
1. The first 4 articles correspond to news about the economy.
2. The second 4 articles correspond to news about health care.
A good clustering algorithm should be able to identify these two clusters
based on the similarity between the words that appear in the articles.
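
One possible similarity measure over word-frequency pairs is cosine similarity, sketched below. The four articles and their counts are invented stand-ins for the table, and cosine similarity is an assumed choice; the slides do not name a specific measure.

```python
from math import sqrt

# Hypothetical articles, each a dict of word -> count, i.e. (w, c) pairs.
articles = {
    "econ1": {"dollar": 1, "industry": 4, "country": 2, "loan": 3},
    "econ2": {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "loan": 1},
    "health1": {"patient": 4, "symptom": 2, "drug": 3, "health": 2},
    "health2": {"pharmaceutical": 2, "company": 3, "drug": 2, "health": 3, "patient": 1},
}

def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

# For every article, find the most similar other article.
nearest = {
    name: max((other for other in articles if other != name),
              key=lambda other: cosine(articles[name], articles[other]))
    for name in articles
}
print(nearest)
```

Articles that share many frequent words score close to 1, and articles with no words in common score 0, so a clustering algorithm driven by this measure would tend to keep the economy and health care articles apart.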
Association Analysis
 Used to discover patterns that describe strongly associated features in
the data.
 The discovered patterns are represented in the form of implication rules
or feature subsets.
 Because the search space is exponential in size, the goal of association
analysis is to extract the most interesting patterns in an efficient manner.
 Examples:
– Identifying web pages that are accessed together
– Understanding the relationships between different
elements of Earth’s climate system

Association Rule Discovery: Definition
 Given a set of records, each of which contains some number of items from
a given collection:
– Produce dependency rules that predict the occurrence of an item based
on the occurrences of other items.

Association Analysis: Applications
 Market-basket analysis
– Rules are used for sales promotion, shelf
management, and inventory management
 Medical Informatics
– Rules are used to find combination of patient
symptoms and test results associated with certain
diseases

Market Basket Analysis - Association Analysis
Consider the transactions from sales data collected at a grocery store
check-out counter.
 Association rules can be applied to find items that are frequently bought
together by customers.
 For ex: the rule {Diapers} → {Milk}
– Suggests that customers who buy diapers also tend to buy milk.
 This type of rule can be used to identify potential cross-selling
opportunities among related items.
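
How strongly a rule such as {Diapers} → {Milk} holds is commonly quantified with support and confidence. These two measures are standard in association analysis but are not defined in the slides above, and the five transactions below are hypothetical, so treat this as a small worked illustration only.

```python
# Hypothetical check-out transactions (each is a set of purchased items).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

rule_support = support({"Diapers", "Milk"}, transactions)
rule_confidence = confidence({"Diapers"}, {"Milk"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this toy data, diapers and milk appear together in 3 of 5 transactions, and 3 of the 4 transactions with diapers also contain milk.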

Deviation/Anomaly/Change Detection
 Identifies observations/objects whose characteristics are significantly
different from the rest of the data.
 Such observations are called anomalies or outliers.
 Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection
– Identifying malicious behavior in network devices like sensors
– Unusual patterns of disease
– Ecosystem disturbances

Credit card fraud detection
 The credit card company records the transactions made by the card holder,
along with personal information such as credit limit, age, annual income,
and address.
 Anomaly detection technique can be applied to
build a profile of legitimate transactions for the
users.
 When a new transaction arrives, it is compared
against profile of the user.
 If characteristics of the transaction are very
different from the previously created profile, then
the transaction is flagged as potentially fraudulent.
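
A very simple version of this profile-and-compare idea is sketched below. It assumes the profile is just the mean and standard deviation of past transaction amounts and that "very different" means more than three standard deviations away; both are illustrative assumptions, and real systems use many more attributes. The amounts are made up.

```python
from statistics import mean, stdev

# Hypothetical past transaction amounts for one card holder.
history = [25.0, 40.0, 32.0, 28.0, 35.0, 30.0, 45.0, 38.0]

def build_profile(amounts):
    """Profile of legitimate behaviour: mean and sample standard deviation."""
    return {"mean": mean(amounts), "std": stdev(amounts)}

def is_suspicious(amount, profile, threshold=3.0):
    """Flag a transaction whose amount lies more than `threshold`
    standard deviations from the profile mean (threshold is assumed)."""
    z_score = abs(amount - profile["mean"]) / profile["std"]
    return z_score > threshold

profile = build_profile(history)
flag_typical = is_suspicious(36.0, profile)
flag_unusual = is_suspicious(900.0, profile)
print(flag_typical, flag_unusual)
```

A flagged transaction is only potentially fraudulent, so in practice it would be passed on for further checking and not rejected outright.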
