
Introduction to Data Mining

Rattapoom Tuchinda
*Some of the slides are from
Jaideep Srivastava @ http://www.cs.umn.edu/faculty/srivasta.html
Mike Kassoff @ http://logic.stanford.edu/classes/cs246/lectures2001/mkassoff_lecture.ppt
So far…

Information Integration techniques


Extraction: wrapper building
Integration: record linkage,
Semantic web
Execution: streaming data flow

DATA
Data overload

• Gene data
• Customer/Sales data
• Astrophysics data
• Pricing
• …

And no one wants to stare at 100k tuples


What is data mining?

• A process that uses various techniques to discover “patterns” or knowledge from data
– Visualization..
– Machine learning algorithms..
Examples…

• Link analysis
• Fraud detection
• New medicines
• Revenue management / discriminatory pricing
• Marketing
• Stocks
• …
Outline

• Introduction
• Data cleaning
• Data mining techniques
– Classification
– Clustering
– Association Rules
– Sequential Patterns
– Regression
– Deviation detection
– Meta-learning
• Case study: Biddingfortravel
Traditional Data Mining Process
Data is often of low quality

• Why?
– You didn’t collect it yourself!
– It probably was created for some other use, and then you came along wanting to integrate it
– People make mistakes (typos)
– People are busy (“this is good enough”)


Problems with data

• Some data have problems on their own
• Other data are problematic only when you want to integrate them
Data with problems on their own

• Problems due to lack of structure
• Problems not due to lack of structure (it’s in a database)
Government agency data

What we want:

id  name                         city      state
1   Dept. of Transportation      New York  NY
2   Dept. of Finance             New York  NY
3   Office of Veteran's Affairs  New York  NY


First problem

What’s wrong here?

1'Dept. of Transportation'New York'NY


2'Dept. of Finance'New York'NY
3'Office of Veteran's Affairs'New York'NY

• The separator (') is also used inside the data (Veteran's).
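A minimal sketch of one common fix, assuming we control how the file is written: quote any field that contains the separator. It uses Python's standard csv module; choosing " as the quote character is an assumption made for this example, not something the slides specify.

```python
import csv
import io

rows = [
    [1, "Dept. of Transportation", "New York", "NY"],
    [2, "Dept. of Finance", "New York", "NY"],
    [3, "Office of Veteran's Affairs", "New York", "NY"],
]

# Write with ' as the delimiter and " as the quote character, so a field
# that contains the delimiter is wrapped in quotes instead of being split.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="'", quotechar='"',
                    quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)

# Reading the text back recovers four fields per record.
buffer.seek(0)
parsed = list(csv.reader(buffer, delimiter="'", quotechar='"'))

# Naive splitting on the separator breaks record 3 into five pieces.
naive = "3'Office of Veteran's Affairs'New York'NY".split("'")
```

Picking a separator that cannot occur in the data (or escaping it) works equally well; the point is that the parser needs some way to tell data from structure.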


Second problem

What’s wrong here?

1,Dept. of Transportation,New York City,NY


2,Dept. of Finance,City of New York,NY
3,Office of Veteran's Affairs,New York,NY

• We need standardization / naming conventions
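One simple way to enforce a naming convention is a lookup table of known variants. The sketch below is illustrative only: CITY_ALIASES and standardize_city are hypothetical names, and a real alias table would have to be built from the data at hand.

```python
# Hypothetical alias table: maps known variants to one canonical city name.
CITY_ALIASES = {
    "new york city": "New York",
    "city of new york": "New York",
    "new york": "New York",
}

def standardize_city(raw: str) -> str:
    """Return the canonical name for a known variant, else the trimmed input."""
    key = " ".join(raw.lower().split())  # lowercase and collapse whitespace
    return CITY_ALIASES.get(key, raw.strip())

cities = ["New York City", "City of New York", "New York"]
standardized = [standardize_city(c) for c in cities]
```

Unknown values pass through unchanged, so they can still be reviewed by hand later.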
Third problem

What’s wrong here?

1,Dept. of Transportation,New York,NY


,Dept. of Finance,New York,NY
3,Office of Veteran's Affairs,New York,NY

• A missing required field
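A required-field check can be sketched as follows. Treating all four columns as required is an assumption for this example; the slide only shows the id missing.

```python
import csv
import io

REQUIRED = ("id", "name", "city", "state")  # assumed required columns

raw = """1,Dept. of Transportation,New York,NY
,Dept. of Finance,New York,NY
3,Office of Veteran's Affairs,New York,NY
"""

def missing_fields(record: dict) -> list:
    """Return the names of required fields that are empty or absent."""
    return [f for f in REQUIRED if not (record.get(f) or "").strip()]

# Map each offending line number to the fields it is missing.
problems = {}
reader = csv.DictReader(io.StringIO(raw), fieldnames=REQUIRED)
for line_no, record in enumerate(reader, start=1):
    missing = missing_fields(record)
    if missing:
        problems[line_no] = missing
```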


Fourth problem

What’s wrong here?

1,Dept. of Transportation,New York,NY


Two,Dept. of Finance,New York,NY
Office of Veteran's Affairs,3,New York,NY

• No data type constraints
• Ordering
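Without a schema, type and ordering mistakes have to be caught by explicit checks. The rules below (integer id, non-numeric name, two-letter uppercase state) are assumptions chosen to fit this example.

```python
def check_types(fields):
    """Return a list of problems for one [id, name, city, state] record."""
    if len(fields) != 4:
        return ["expected 4 fields, got %d" % len(fields)]
    record_id, name, city, state = fields
    problems = []
    if not record_id.isdigit():
        problems.append("id is not an integer: %r" % record_id)
    if name.isdigit():
        # A numeric name suggests the columns are in the wrong order.
        problems.append("name looks numeric: %r" % name)
    if not (len(state) == 2 and state.isalpha() and state.isupper()):
        problems.append("state is not a two-letter code: %r" % state)
    return problems

lines = [
    "1,Dept. of Transportation,New York,NY",
    "Two,Dept. of Finance,New York,NY",
    "Office of Veteran's Affairs,3,New York,NY",
]
report = [check_types(line.split(",")) for line in lines]
```

In a database, the same rules would normally be declared as column types and constraints rather than checked in application code.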
Fifth Problem

What’s wrong here?

1,Dept. of Transportation,New York,NY


2,Dept. of Finance,New York,NY
3,Dept. of Finance,New York,NY

• Redundancy!
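Redundant rows like these can be found by grouping records on everything except the id. This is a sketch; the choice of key (name, city, state, compared case-insensitively) is an assumption.

```python
from collections import defaultdict

records = [
    (1, "Dept. of Transportation", "New York", "NY"),
    (2, "Dept. of Finance", "New York", "NY"),
    (3, "Dept. of Finance", "New York", "NY"),
]

# Group ids by a normalized (name, city, state) key.
groups = defaultdict(list)
for record_id, name, city, state in records:
    key = (name.strip().lower(), city.strip().lower(), state.strip().upper())
    groups[key].append(record_id)

# Any key with more than one id is a candidate duplicate.
duplicates = {key: ids for key, ids in groups.items() if len(ids) > 1}
```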
Problems not due to lack of structure
(it’s in a database)

• Flags: 0, 9, null, x, “no data”
• Typos:
– Can use constraints to catch corrupt data (e.g., weight can’t be negative)
– Or use statistical techniques to catch corrupt data
• Hidden semantics: white space can be important.
• Misleading data:
building name    stories
Guildford Plaza  9
Hartford Apts.   35
Braun Hotel      6
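The flag, constraint, and statistical checks above can be sketched for a column of weights. The flag set is an assumption (values such as 0 or 9 can only be treated as flags with knowledge of the column), and the outlier test uses the modified z-score based on the median absolute deviation, which is one of several possible statistical techniques rather than the one the slides have in mind.

```python
import statistics

def clean_weights(raw_values, missing_flags):
    """Replace flag values, unparsable values, and negatives with None."""
    cleaned = []
    for value in raw_values:
        text = str(value).strip().lower()
        if text in missing_flags:
            cleaned.append(None)
            continue
        try:
            number = float(text)
        except ValueError:
            cleaned.append(None)
            continue
        # Constraint: a weight cannot be negative.
        cleaned.append(number if number >= 0 else None)
    return cleaned

def outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    mad = statistics.median([abs(v - median) for v in present])
    if mad == 0:
        return []
    return [v for v in present if 0.6745 * abs(v - median) / mad > threshold]

raw = ["70", "82", "null", "-5", "75", "no data", "7500", "68", "x"]
cleaned = clean_weights(raw, missing_flags={"", "null", "x", "no data"})
flagged = outliers(cleaned)
```

Here the constraint catches -5 and the statistical test flags 7500, which is most likely a typo for 75.00 or similar.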
Data that is fine on its own, but
becomes problematic when you want
to integrate it
• Format
• Dynamic data
• Different granularity
• Conflicting data
Formats

• Not everyone uses the same format as you
• Dates are especially problematic:
– 12/19/77
– 12/19/1977
– 12-19-77
– 19/12/77
– Dec 19, 1977
– 19 December 1977
– 9 in Tevet, 5738
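A common workaround is to try a list of known formats in turn. The sketch below uses datetime.strptime; the format list and its order are assumptions, and %b and %B assume an English locale. Hebrew-calendar dates are outside what the standard library handles, so that entry is returned as None.

```python
from datetime import date, datetime

# Candidate formats, tried in order. Ambiguous inputs such as 01/02/77
# are silently read with the first format that matches (month first here).
FORMATS = ["%m/%d/%Y", "%m/%d/%y", "%m-%d-%y", "%d/%m/%y",
           "%b %d, %Y", "%d %B %Y"]

def parse_date(text):
    """Return a date for the first matching format, or None if none match."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

samples = ["12/19/77", "12/19/1977", "12-19-77", "19/12/77",
           "Dec 19, 1977", "19 December 1977", "9 in Tevet, 5738"]
parsed = [parse_date(s) for s in samples]
```

The ambiguity noted in the comment is the real hazard: no amount of parsing can tell 01/02/77 in US order from the same string in European order without knowing the source's convention.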
Data that Moves

• You can’t store it all in the same currency (say, US$) because the exchange rate changes
• Price in foreign currency stays the same
• Must keep the data in foreign currency and use the current exchange rate to convert
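In code, this means storing the amount together with its currency and converting only when a value is needed. The Price class, the to_usd helper, and the exchange rates below are all hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class Price:
    amount: Decimal   # stored in the original currency
    currency: str     # e.g. "EUR"

def to_usd(price, rates):
    """Convert using rates given as US$ per one unit of each currency."""
    return (price.amount * rates[price.currency]).quantize(Decimal("0.01"))

hotel = Price(Decimal("100.00"), "EUR")

# Made-up exchange rates on two different days.
monday = {"EUR": Decimal("1.10"), "USD": Decimal("1")}
friday = {"EUR": Decimal("1.25"), "USD": Decimal("1")}

usd_monday = to_usd(hotel, monday)
usd_friday = to_usd(hotel, friday)
```

The stored record never changes; only the converted view does.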
Data at a different level of detail than
you need

• If it is at a finer level of detail, you can sometimes bin it
• Example
– I need age ranges of 20-30, 30-40, 40-50, etc.
– Imported data contains birth date
– No problem! Divide data into appropriate categories
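Binning birth dates into age ranges can be sketched as below. The ranges on the slide share their endpoints, so this example assumes each range includes its lower bound and excludes its upper bound (age 30 falls in 30-40).

```python
from datetime import date

def age_on(birth: date, today: date) -> int:
    """Age in whole years on the given day."""
    years = today.year - birth.year
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1  # birthday has not happened yet this year
    return years

def age_range(age: int, width: int = 10) -> str:
    """Label of the bin [low, low + width) that contains the age."""
    low = (age // width) * width
    return f"{low}-{low + width}"

age = age_on(date(1977, 12, 19), date(2005, 6, 1))
label = age_range(age)
```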
Data at a different level of detail than
you need (cont’d)

• Sometimes you cannot bin it
• Example
– I need age ranges 20-30, 30-40, 40-50, etc.
– Data is of age ranges 25-35, 35-45, etc.
– What to do?
  • Ignore age ranges because you aren’t sure
  • Make an educated guess based on imported data (e.g., assume that the number of people aged 25-35 is the average of the numbers of people aged 20-30 and 30-40)
Conflicting Data

• Information source #1 says that George lives in Texas
• Information source #2 says that George lives in Washington, DC
• What to do?
– Use both (He lives in both places)
– Use the most recently updated piece of info
– Use the “most trusted” info
– Flag row to be investigated further by hand
– Use neither (We’d rather be incomplete than wrong)
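Several of these policies are easy to express as small functions. The update dates and trust scores below are hypothetical; in practice they would come from source metadata.

```python
from datetime import date

claims = [
    {"source": "source1", "state": "Texas", "updated": date(2004, 1, 10)},
    {"source": "source2", "state": "Washington, DC", "updated": date(2005, 3, 2)},
]
TRUST = {"source1": 0.9, "source2": 0.6}  # hypothetical trust scores

def use_both(claims):
    """Keep every distinct value."""
    return sorted({c["state"] for c in claims})

def most_recent(claims):
    """Keep the value with the latest update date."""
    return max(claims, key=lambda c: c["updated"])["state"]

def most_trusted(claims, trust):
    """Keep the value from the source with the highest trust score."""
    return max(claims, key=lambda c: trust[c["source"]])["state"]

def needs_review(claims):
    """True when sources disagree and a person should look at the row."""
    return len({c["state"] for c in claims}) > 1
```

Note that the policies can disagree with each other: here the most recent source and the most trusted source give different answers.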
Outline

• Introduction
• Data cleaning
• Data mining techniques
– Classification
– Clustering
– Association Rules
– Sequential Patterns
– Regression
– Deviation detection
– Meta-learning
• Case study: Biddingfortravel
Classification: Definition

• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
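The train/test procedure can be sketched with a toy data set and a 1-nearest-neighbor classifier, a simple form of memory-based reasoning. The records, the split (every fourth record held out), and the choice of classifier are all assumptions for illustration.

```python
import math

# Each record: (attributes, class). Made-up, well-separated data.
records = [
    ((1.0, 1.2), "low"), ((0.8, 0.9), "low"),
    ((1.5, 1.1), "low"), ((1.2, 0.7), "low"),
    ((8.0, 8.3), "high"), ((7.6, 8.1), "high"),
    ((8.4, 7.9), "high"), ((7.9, 7.5), "high"),
]

# Hold out every fourth record as the test set; train on the rest.
test = records[::4]
train = [r for i, r in enumerate(records) if i % 4 != 0]

def nearest_neighbor(train, point):
    """Predict the class of the closest training record."""
    _, label = min(train, key=lambda rec: math.dist(rec[0], point))
    return label

# Accuracy: fraction of test records whose predicted class is correct.
correct = sum(nearest_neighbor(train, x) == y for x, y in test)
accuracy = correct / len(test)
```

The test records are never used to build the model, which is what makes the measured accuracy an estimate for unseen records.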
Classification Example
Classification Techniques

• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Genetic Algorithms
• Naïve Bayes and Bayesian Networks
• Support Vector Machines
What is Cluster Analysis?

• Finding groups of objects such that the objects in a group will be similar to one another and different from the objects in other groups.
– Based on information found in the data that describes the objects and their relationships
– Also known as unsupervised classification
• Many applications
– Understanding: group related documents for browsing (similar websites) or to find genes or proteins that have similar functionality
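As an illustration, here is a minimal k-means sketch, one widely used partitional clustering algorithm. The surviving slide text does not name a specific algorithm, so treat the choice as an assumption; the points and the starting centroids are made up.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                mean = tuple(sum(c) / len(members) for c in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(points, centroids=[points[0], points[3]])
```

No class labels are given to the algorithm, which is why clustering is called unsupervised; the result also depends on the number of clusters and the starting centroids.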
Notion of a Cluster is Ambiguous
Partitional Clustering
Hierarchical Clustering
Mining Associations

• Given a set of records, find rules that will predict the occurrence of an item based on the occurrences of other items in the record
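Rules of this kind are usually scored by support (how often the items occur together) and confidence (how often the right-hand side occurs when the left-hand side does). The sketch below computes both on made-up market-basket data.

```python
# Made-up transactions; each one is the set of items in a record.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)

# Rule: {milk, diapers} -> {beer}
rule_support = support({"milk", "diapers", "beer"})
rule_confidence = confidence({"milk", "diapers"}, {"beer"})
```

Two of the five transactions contain all three items, and two of the three transactions with milk and diapers also contain beer.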
Definition of Association Rule
Association Rule Mining
Meta-learning

• Learning about … “learning”
• Combine multiple classifiers together to yield a better result.
• Simple voting, boosting, stacking
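Simple voting is the easiest of the three to sketch: every classifier predicts, and the most common answer wins. The three threshold classifiers below are toy stand-ins for real trained models.

```python
from collections import Counter

def majority_vote(classifiers, record):
    """Return the class predicted by the largest number of classifiers."""
    votes = [clf(record) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three toy classifiers that disagree near their thresholds.
classifiers = [
    lambda x: "high" if x > 40 else "low",
    lambda x: "high" if x > 50 else "low",
    lambda x: "high" if x > 60 else "low",
]

vote_55 = majority_vote(classifiers, 55)   # votes: high, high, low
vote_45 = majority_vote(classifiers, 45)   # votes: high, low, low
```

Boosting and stacking go further by weighting the classifiers or by training another model on their outputs.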
Stacking
Algorithm selection

• Given that we have a wide range of algorithms, which one should we choose?
– Meta-learning approach [Brazdi 1995]
– Still an open-ended question
Outline

• Introduction
• Data cleaning
• Data mining techniques
– Classification
– Clustering
– Association Rules
– Sequential Patterns
– Regression
– Deviation detection
– Meta-learning
• Case study: Biddingfortravel
Case study: Bidding for travel

Can we predict the winning hotel (or price)?


How does it work (I think..)?
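[Figure: Priceline example. Hotels A, B, and C have retail prices 120, 200, and 180 and map to $60, $65, and $68 (A: 120 → 60, B: 200 → 65, C: 180 → 68). The figure also shows the amount $63 and the annotation "120 < 200 < 180". Priceline winning: A]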
Biddingfortravel cleaning

[Figure: data-flow diagram. Sources: Hotel 1, Hotel 2, Hotel 3, ..., Hotel N and Biddingfortravel post data (area, stars, hotels). Stages shown: union, join, cleaning, mining]


Prediction

• Given area (San Diego Coastal), stars (4*), check-in date, check-out date, and the retail price of each hotel in the area → predict which hotel I will get from Priceline
Ending remarks

• Data mining will always be in demand
• What makes data mining from the web so special?
– Access to real time data
– Pricing data
– Consumer aspect
