
Data Mining: Introduction

Chapter 1

Introduction to Data Mining

6/30/2019 Introduction to Data Mining 1


Large-scale Data is Everywhere!
 There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies.
 New mantra
– Gather whatever data you can whenever and wherever possible.
 Expectations
– Gathered data will have value either for the purpose collected or for a purpose not envisioned.

[Figure panels: E-Commerce, Cyber Security, Social Networking: Twitter, Traffic Patterns, Sensor Networks, Computational Simulations]



Data Mining

Data Mining: A technology that blends traditional analysis methods with sophisticated algorithms for processing large volumes of data.

DM = (Traditional analysis methods + Sophisticated algorithms) to process large volumes of data



Why Data Mining? Commercial Viewpoint
Business:
 Bar code scanners, RFID, and smart card technology collect up-to-the-minute data about customer purchases.
 Retailers can use this information, together with data from e-commerce websites, to make better business decisions.
 Data mining techniques can be applied for
– customer profiling
– Marketing
– Store layout
– Fraud detection
 This helps retailers to answer important questions like
– “Who are the most profitable customers?”
– “What products can be cross-sold or up-sold?”
– “What is the revenue outlook of the company for next year?”



Why Data Mining? Scientific Viewpoint
Medicine, Science and Engineering
 Researchers accumulate data for new discoveries.
Examples:
– Understanding Earth’s climate system
 NASA EOSDIS archives over petabytes of earth science data / year
– Telescopes scanning the skies
 Sky survey data
– High-throughput biological data
– Scientific simulations
 Terabytes of data generated in a few hours

[Figure panels: fMRI Data from Brain, Sky Survey Data, Gene Expression Data, Surface Temperature of Earth]


Why Data Mining? Scientific Viewpoint(cont.)

Traditional methods are often not suitable for analyzing these huge amounts of data, so data mining techniques can aid in answering questions like

• “What is the relationship of the frequency and intensity of ecosystem disturbances such as droughts and hurricanes to global warming?”

• “How are land surface precipitation and temperature affected by ocean surface temperature?”

• “How well can we predict the beginning and end of the growing season for a region?”
Great Opportunities to Solve Society’s Major Problems

– Improving health care and reducing costs
– Predicting the impact of climate change
– Reducing hunger and poverty by increasing agriculture production
– Finding alternative/green energy sources
What is Data Mining?
 Many Definitions
– Data mining is a technology that blends traditional data
analysis methods with sophisticated algorithms for
processing large volumes of data.
– Data mining is the process of automatically
discovering useful information in large data
repositories.
– Data mining is an integral part of knowledge discovery
in databases (KDD), which is the overall process of
converting raw data into useful information.



Data Mining: A KDD Process

– Data mining: core of the knowledge discovery process

[Figure: stages of the KDD process, from bottom to top: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]


7 steps of KDD

 Data Integration: data is collected and integrated from different sources.
 Data Cleaning: data may contain errors, missing values, or inconsistencies. Cleaning removes these anomalies.
 Data Selection: select only the data considered useful for data mining.
 Data Transformation: transform the cleaned data into forms appropriate for mining, using techniques like smoothing, aggregation, and normalization.
 Data Mining: apply data mining techniques to the data in order to discover interesting patterns.
 Pattern Evaluation: includes visualization, transformation, and removing redundant patterns from the patterns generated.
 Decisions / Use of Discovered Knowledge: use the knowledge acquired to make better decisions.
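
The pre-mining steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two data sources, the field names (`customer`, `age`, `spend`), and the helper functions are hypothetical, and min-max scaling stands in for the normalization technique, which the slide does not specify.

```python
# Hypothetical records from two sources; field names and values are invented.
store_sales = [
    {"customer": "a", "age": 34, "spend": 120.0},
    {"customer": "b", "age": None, "spend": 80.0},   # missing value
]
web_sales = [
    {"customer": "c", "age": 51, "spend": 200.0},
    {"customer": "a", "age": 34, "spend": 120.0},    # duplicate of a store record
]

def integrate(*sources):
    """Data integration: combine records from several sources."""
    return [record for source in sources for record in source]

def clean(records):
    """Data cleaning: drop records with missing values and exact duplicates."""
    seen, cleaned = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if None in record.values() or key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned

def select(records, fields):
    """Data selection: keep only the attributes relevant to the mining task."""
    return [{f: r[f] for f in fields} for r in records]

def min_max_normalize(records, field):
    """Data transformation: rescale one numeric attribute to [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]

records = integrate(store_sales, web_sales)
records = clean(records)
records = select(records, ["age", "spend"])
records = min_max_normalize(records, "spend")
print(records)
```

Each function corresponds to one step, so the order of the calls mirrors the order of the steps. The mining, evaluation, and decision steps would consume `records` afterwards.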
The process of knowledge discovery in databases

Fig: The process of knowledge discovery in databases.

• Input data
• Pre-processing:
  • Fusing data from multiple sources
  • Cleaning data to remove noise and duplicates
  • Selecting features or records that are relevant to the data mining task
  • Transforming the raw input data into an appropriate format for analysis
• Post-processing
• “Closing the loop” refers to the process of integrating data mining results into a decision support system. Ex: in a business application, data mining results can be integrated with campaign management tools for effective marketing promotions. This requires a post-processing step to ensure that only valid and useful results are incorporated into the decision support system.
What is (not) Data Mining?

What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”

What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
Motivating Challenges

 Scalability:
– Handling massive data sets calls for novel data structures, out-of-core algorithms, and parallel and distributed algorithms.
 High Dimensionality:
– Ex: data sets with temporal and spatial components tend to have high dimensionality.
 Heterogeneous and Complex Data:
– Ex: collections of web pages containing semi-structured data and hyperlinks; DNA data with three-dimensional structure; climate data with time series measurements.

Motivating Challenges (cont.)

 Data Ownership and Distribution:
Key challenges faced by distributed data mining algorithms:
– How to reduce the amount of communication needed to perform the distributed computation
– How to effectively consolidate the data mining results obtained from multiple sources
– How to address data security issues



Motivating Challenges (cont.)

 Non-traditional Analysis:
– The traditional statistical approach is based on a hypothesize-and-test paradigm: a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis.
– Current data analysis requires the evaluation of thousands of hypotheses, hence there is a need for automating the process of hypothesis generation and evaluation.



Origins of Data Mining
 Traditional techniques may be unsuitable due to data that is
– Large-scale
– High dimensional
– Heterogeneous
– Complex
– Distributed
 To meet these challenges, data mining researchers began to focus on developing more efficient and scalable tools that could handle diverse types of data.
 Draws ideas from
– Sampling, estimation, and hypothesis testing from statistics
– Search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning
– Data mining has also been quick to adopt ideas from areas like optimization and visualization.



Origins of Data Mining
 The figure shows the relationship of data mining to other areas.
 Database systems provide for efficient storage, indexing and query
processing.
 Support from high performance (parallel) computing to address massive
datasets.
 Distributed techniques help in addressing the issue of size when data cannot be gathered in one location.



Data Mining Tasks

Data mining tasks are generally divided into 2 major categories.
1. Predictive Tasks
2. Descriptive Tasks
 Predictive Tasks:
– Objective is to predict the value of a particular attribute based on the values of other attributes.
– Attribute to be predicted: target or dependent variable
– Attributes used for making the prediction: explanatory or independent variables



Data Mining Tasks

 Descriptive Tasks:
– Objective here is to derive patterns like correlations,
trends, clusters, anomalies that summarize the
relationships in data.
– These are exploratory in nature and frequently require post-processing techniques to validate and explain the results.
– Find human-interpretable patterns that describe the
data.



Fig: Four of the core data mining tasks.

Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes



Predictive Modeling: Classification
 Task of building a model for the target variable as a function of the explanatory variables.
 Predictive modeling tasks are of two types
– Classification
– Regression

Model for predicting credit worthiness

Tid  Employed  Level of Education  # years at present address  Credit Worthy (Class)
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

[Figure: decision tree. Root node Employed: No leads to class No; Yes leads to an Education node. Education = Graduate leads to Number of years (> 3 yr: Yes, < 3 yr: No). Education in { High school, Undergrad } leads to Number of years (> 7 yrs: Yes, < 7 yrs: No).]
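
The decision tree on this slide can be read as a set of nested rules. The sketch below is one reading of the figure, not a definitive reconstruction: it assumes the Graduate branch splits at 3 years and the { High school, Undergrad } branch at 7 years (the reading that agrees with the four training records shown), and it assumes that a value exactly equal to the threshold falls on the "No" side, which the slide leaves open. The function name `credit_worthy` is hypothetical.

```python
def credit_worthy(employed, education, years_at_address):
    """Walk the decision tree from the slide and return "Yes" or "No"."""
    if employed == "No":
        return "No"
    # Assumption: Graduate applicants split at 3 years, all others at 7 years.
    threshold = 3 if education == "Graduate" else 7
    # Assumption: a value equal to the threshold is treated as "No".
    return "Yes" if years_at_address > threshold else "No"

# The four labelled records shown on the slide:
# (Employed, Level of Education, # years at present address, Credit Worthy)
training_set = [
    ("Yes", "Graduate", 5, "Yes"),
    ("Yes", "High School", 2, "No"),
    ("No", "Undergrad", 1, "No"),
    ("Yes", "High School", 10, "Yes"),
]

predictions = [credit_worthy(e, ed, y) for e, ed, y, _ in training_set]
correct = sum(p == row[3] for p, row in zip(predictions, training_set))
print(predictions, f"{correct}/{len(training_set)} correct")
```

Under these assumptions the rules reproduce the class label of every record shown in the table.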



Classification Example
Used for discrete target variables, e.g., a binary-valued target.

Training Set
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Test Set
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

[Figure: the Training Set is used to learn a classifier (Model), which is then applied to the Test Set.]



Examples of Classification Task

 Classifying credit card transactions as legitimate or fraudulent
 Classifying land covers (water bodies, urban areas, forests, etc.) using satellite data
 Categorizing news stories as finance, weather, entertainment, sports, etc.
 Identifying intruders in cyberspace
 Predicting tumor cells as benign or malignant
 Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil



Classification: Application 1

 Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
 Use credit card transactions and the information
on its account-holder as attributes.
– When the customer buys, what the customer buys, how often the customer pays on time, etc.
 Label past transactions as fraud or fair
transactions. This forms the class attribute.
 Learn a model for the class of the transactions.
 Use this model to detect fraud by observing credit
card transactions on an account.
Classification: Application 2

 Churn prediction for telephone customers
– Goal: To predict whether a customer is likely to be lost to a competitor.
– Approach:
 Use detailed record of transactions with each of the
past and present customers, to find attributes.
– How often the customer calls, where he calls, what time-
of-the day he calls most, his financial status, marital
status, etc.
 Label the customers as loyal or disloyal.
 Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997



Regression (for continuous target variables)

 Predict the value of a given continuous-valued variable based on the values of other variables.
 Extensively studied in statistics and neural network fields.
 Examples:
– Predicting sales amounts of new product based on
advertising expenditure.
– Forecasting the future price of a stock.
– Predicting the price of house based on values of other
variables.
NOTE: The goal of both predictive tasks is to find a
model that minimizes the error between predicted and
actual value of the target variable.
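
To make the NOTE concrete for regression, here is a minimal sketch of fitting a straight line by ordinary least squares, one common way to minimize the squared error between predicted and actual values. The slide does not name a specific method, so the choice of least squares is an assumption, and the advertising/sales numbers are hypothetical.

```python
# Hypothetical data: advertising expenditure (x) and sales amount (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_squared_error(x, y, slope, intercept):
    """Average squared difference between predicted and actual values."""
    return sum((slope * xi + intercept - yi) ** 2 for xi, yi in zip(x, y)) / len(x)

slope, intercept = fit_line(x, y)
fitted_error = mean_squared_error(x, y, slope, intercept)
guess_error = mean_squared_error(x, y, 2.0, 0.0)  # an arbitrary hand-picked line
print(slope, intercept, fitted_error, guess_error)
```

The fitted line's error is never larger than that of any other straight line on the same data, which is the sense in which the model "minimizes the error".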
Cluster Analysis

 Finds groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized.]
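
The two quantities in the figure can be computed directly. The sketch below uses hypothetical 2-D points that are already assigned to two clusters, and it measures average pairwise Euclidean distance; both the data and the choice of distance are assumptions for illustration only.

```python
from itertools import combinations, product
from math import dist

# Hypothetical 2-D points, already assigned to two clusters.
cluster_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

def mean_intra_distance(cluster):
    """Average distance between pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_distance(first, second):
    """Average distance between points taken from two different clusters."""
    pairs = list(product(first, second))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

intra_a = mean_intra_distance(cluster_a)
intra_b = mean_intra_distance(cluster_b)
inter = mean_inter_distance(cluster_a, cluster_b)
print(intra_a, intra_b, inter)
```

A good clustering keeps the intra-cluster values small relative to the inter-cluster value, as is the case for these two well-separated groups.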



Cluster Analysis - Examples

Examples:
 Group sets of related customers
 Find areas of ocean which have significant
impact on Earth’s climate.



Applications of Cluster Analysis
 Understanding
– Customer profiling for targeted marketing
– Group related documents for browsing
– Group genes and proteins that have similar functionality
– Group stocks with similar price fluctuations
 Summarization
– Reduce the size of large data sets

[Figure courtesy: Michael Eisen]



Clustering: Application 1

 Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
 Collect different attributes of customers based on
their geographical and lifestyle related information.
 Find clusters of similar customers.
 Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those
from different clusters.



Clustering: Application 2

 Document Clustering:
– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
– Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.



Document Clustering

Consider the collection of news articles in the table. The articles can be grouped based on their respective topics.
Document Clustering (cont.)

 Each article is represented as a set of word-frequency pairs (w, c), where
w = word
c = number of times the word appears in the article.
There are two natural clusters in the data set:
1. The first 4 articles correspond to news about the economy.
2. The second 4 articles correspond to news about health care.
A good clustering algorithm should be able to identify these two clusters based on the similarity between the words that appear in the articles.
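
One way to turn word-frequency pairs into a similarity score is cosine similarity, sketched below. The slides do not say which similarity measure is used, so cosine similarity is an assumption, and the four articles and their word counts are hypothetical rather than taken from the table referred to above.

```python
from math import sqrt

# Hypothetical articles, each stored as word -> count pairs (w, c).
articles = {
    "economy_1": {"industry": 4, "loan": 3, "market": 2, "trade": 1},
    "economy_2": {"industry": 2, "market": 4, "jobs": 3, "labor": 2},
    "health_1": {"patient": 4, "drug": 3, "health": 2, "symptom": 2},
    "health_2": {"patient": 1, "drug": 1, "health": 3, "clinic": 2},
}

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 = no shared words)."""
    shared = a.keys() & b.keys()
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

same_topic = cosine_similarity(articles["economy_1"], articles["economy_2"])
other_topic = cosine_similarity(articles["economy_1"], articles["health_1"])
print(same_topic, other_topic)
```

Articles on the same topic share words and therefore score higher than articles on different topics, which is the signal a clustering algorithm would use to separate the two groups.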
Association Analysis

 Used to discover patterns that describe strongly associated features in the data.
 The discovered patterns are represented in the form of implication rules or feature subsets.
 Because of the exponential size of its search space, the
goal of association analysis is to extract the most
interesting patterns in an efficient manner.
 Examples:
– Identifying web pages that are accessed together
– Understanding the relationships between different
elements of Earth’s climate system



Association Rule Discovery: Definition

 Given a set of records, each of which contains some number of items from a given collection
– Produce dependency rules that predict the occurrence of an item based on the occurrences of other items.



Association Analysis: Applications

 Market-basket analysis
– Rules are used for sales promotion, shelf
management, and inventory management

 Medical Informatics
– Rules are used to find combination of patient
symptoms and test results associated with certain
diseases



Market Basket Analysis - Association Analysis

Consider the transactions from sales data collected at a grocery store check-out counter.



Market Basket Analysis - Association Analysis

 Association rules can be applied to find items that are frequently bought together by customers.
 For ex: the rule {Diapers} → {Milk}
– Suggests that customers who buy diapers also tend to buy milk.
 This type of rule can be used to identify potential
cross-selling opportunities among related items.
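
How strongly a rule such as {Diapers} → {Milk} holds is usually quantified with support and confidence. These two measures are standard in association analysis but are not named on the slides, and the five transactions below are hypothetical, so treat this as an illustrative sketch only.

```python
# Hypothetical check-out transactions (each one is a set of items).
transactions = [
    {"Diapers", "Milk", "Bread"},
    {"Diapers", "Milk"},
    {"Diapers", "Beer"},
    {"Milk", "Bread"},
    {"Diapers", "Milk", "Eggs"},
]

def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(itemset <= transaction for transaction in transactions)

def support(antecedent, consequent):
    """Fraction of all transactions that contain both sides of the rule."""
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions with the antecedent, fraction that also have the consequent."""
    return count(antecedent | consequent) / count(antecedent)

rule_support = support({"Diapers"}, {"Milk"})
rule_confidence = confidence({"Diapers"}, {"Milk"})
print(rule_support, rule_confidence)
```

In this toy data, diapers and milk appear together in 3 of 5 transactions, and 3 of the 4 transactions containing diapers also contain milk.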



Deviation/Anomaly/Change Detection
 Identifies observations/objects whose characteristics are significantly different from the rest of the data.
 Such observations are called anomalies or outliers.
 Applications:
– Credit card fraud detection
– Network intrusion detection
– Identifying malicious behavior in network devices like sensors
– Unusual patterns of disease
– Ecosystem disturbances



Credit card fraud detection

 The credit card company records the transactions made by the card holder, along with personal information like credit limit, age, annual income, and address.
 Anomaly detection technique can be applied to
build a profile of legitimate transactions for the
users.
 When a new transaction arrives, it is compared
against profile of the user.
 If characteristics of the transaction are very
different from the previously created profile, then
the transaction is flagged as potentially fraudulent.
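
The profile-and-compare idea can be sketched with a very simple profile: the mean and standard deviation of a user's past transaction amounts. Real systems use richer profiles; the amounts, the z-score rule, and the threshold of 3 standard deviations are all assumptions made for this illustration.

```python
from statistics import mean, stdev

# Hypothetical past transaction amounts for one card holder.
history = [42.0, 38.5, 55.0, 47.25, 51.0, 44.0, 39.75, 49.5]

def build_profile(amounts):
    """Summarize legitimate transactions by their mean and standard deviation."""
    return {"mean": mean(amounts), "stdev": stdev(amounts)}

def is_potentially_fraudulent(amount, profile, threshold=3.0):
    """Flag a transaction whose amount is far from the user's profile.

    Assumption: "very different" means more than `threshold` standard
    deviations away from the mean of past transactions.
    """
    z_score = abs(amount - profile["mean"]) / profile["stdev"]
    return z_score > threshold

profile = build_profile(history)
print(is_potentially_fraudulent(46.0, profile))
print(is_potentially_fraudulent(900.0, profile))
```

A typical purchase close to the historical mean is not flagged, while an amount far outside the usual range is flagged for review.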