Data Mining

Introduction to Data Mining
Rattapoom Tuchinda
*Some of the slides are from
Jaideep Srivastava @
http://www.cs.umn.edu/faculty/srivasta.html
Mike Kassoff @
http://logic.stanford.edu/classes/cs246/lectures2001/mkassoff_lecture.ppt
So far…
Information Integration techniques

Extraction: wrapper building
Integration: record linkage, Semantic web
Execution: streaming data flow
DATA
Data overload
z Gene data
z Customer/Sales data
z Astrophysics data
z Pricing
z ….
And no one wants to stare at 100k tuples

What is data mining?
z A process that uses various techniques to discover “patterns” or knowledge from data
– Visualization..
– Machine learning algorithms..
Examples…
z Link analysis
z Fraud detection
z New medicines
z Revenue Management/Discriminatory pricing
z Marketing
z Stocks
z ….
Outline
z Introduction
z Data cleaning
z Data mining techniques
– Classification
– Clustering
– Association Rules
– Sequential Patterns
– Regression
– Deviation detection
– Meta-learning
z Case study: Biddingfortravel
Traditional Data Mining Process
Data is often of low quality
z Why?
– You didn’t collect it yourself!
– It probably was created for some other use, and then you came along wanting to integrate it
– People make mistakes (typos)
– People are busy (“this is good enough”)

Problems with data
z Some data have problems on their own
z Other data are problematic only when you want to integrate them
Data with problems on their own
z Problems due to lack of structure
z Problems not due to lack of structure (it’s in a database)
Government agency data
What we want:
id  name                         city      state
1   Dept. of Transportation      New York  NY
2   Dept. of Finance             New York  NY
3   Office of Veteran's Affairs  New York  NY

First problem
What’s wrong here?
1'Dept. of Transportation'New York'NY
2'Dept. of Finance'New York'NY
3'Office of Veteran's Affairs'New York'NY
z The separator (') is also used inside the data (“Veteran's”).
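A minimal sketch of the first problem, using Python's standard csv module. Keeping ' as the delimiter and choosing " as the quote character are assumptions made for illustration; the slides do not prescribe a fix.

```python
import csv
import io

row = ["3", "Office of Veteran's Affairs", "New York", "NY"]

# Naive approach: join on ' and split on ' again.
naive_line = "'".join(row)
naive_fields = naive_line.split("'")  # the agency name is cut in two

# csv quotes any field that contains the delimiter, so the row survives.
buffer = io.StringIO()
csv.writer(buffer, delimiter="'", quotechar='"').writerow(row)
quoted_line = buffer.getvalue().strip()

reader = csv.reader(io.StringIO(quoted_line), delimiter="'", quotechar='"')
parsed = next(reader)

print(len(naive_fields), naive_fields)
print(quoted_line)
print(parsed == row)
```

Picking a delimiter that never occurs in the data would also work, but quoting does not depend on knowing the data in advance.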

Second problem
1,Dept. of Transportation,New York City,NY
2,Dept. of Finance,City of New York,NY
3,Office of Veteran's Affairs,New York,NY
z We need standardization / naming conventions
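One possible way to enforce a naming convention is a lookup table of known variants. The table `CITY_ALIASES` and the helper `standardize_city` below are hypothetical names for this sketch; a real table would have to be curated from the data.

```python
# Hypothetical alias table: every known variant maps to one canonical name.
CITY_ALIASES = {
    "new york city": "New York",
    "city of new york": "New York",
    "new york": "New York",
}


def standardize_city(raw: str) -> str:
    """Return the canonical city name, or the trimmed input if unknown."""
    key = " ".join(raw.lower().split())  # lowercase, collapse whitespace
    return CITY_ALIASES.get(key, raw.strip())


cities = ["New York City", "City of New York", "New York"]
standardized = [standardize_city(c) for c in cities]
print(standardized)
```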
Third problem
1,Dept. of Transportation,New York,NY
,Dept. of Finance,New York,NY
3,Office of Veteran's Affairs,New York,NY
z A missing required field (the id of the second row)

Fourth problem

Two,Dept. of Finance,New York,NY
Office of Veteran's Affairs,3,New York,NY
z No data type constraints
z No fixed column ordering
Fifth problem
2,Dept. of Finance,New York,NY
3,Dept. of Finance,New York,NY
z Redundancy!
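The third, fourth, and fifth problems are the kind of thing a database schema would reject. A rough sketch of the same checks in plain Python follows; the function name `validate` and the rule that two rows are redundant when name, city, and state all match are assumptions for illustration.

```python
rows = [
    ["1", "Dept. of Transportation", "New York", "NY"],
    ["", "Dept. of Finance", "New York", "NY"],              # missing id
    ["Two", "Dept. of Finance", "New York", "NY"],           # id is not a number
    ["Office of Veteran's Affairs", "3", "New York", "NY"],  # columns swapped
    ["2", "Dept. of Finance", "New York", "NY"],
    ["3", "Dept. of Finance", "New York", "NY"],             # same agency again
]


def validate(rows):
    """Return (line number, problem) pairs for rows that break a rule."""
    problems = []
    seen = {}  # (name, city, state) -> first line number with a valid id
    for line_no, (id_, name, city, state) in enumerate(rows, start=1):
        if not id_:
            problems.append((line_no, "missing id"))
            continue
        if not id_.isdigit():
            problems.append((line_no, "id is not an integer"))
            continue
        key = (name, city, state)
        if key in seen:
            problems.append((line_no, f"duplicate of line {seen[key]}"))
        else:
            seen[key] = line_no
    return problems


problems = validate(rows)
for line_no, problem in problems:
    print(line_no, problem)
```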
Problems not due to lack of structure (it’s in a database)
z Flags: 0, 9, null, x, “no data”

z Typos:
– Can use constraints to catch corrupt data (e.g., weight can’t be negative)
– Or use statistical techniques to catch corrupt data
z Hidden semantics: white spaces can be important.
z Misleading data
building name    stories
Guildford Plaza  9
Hartford Apts.   35
Braun Hotel      6
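Both ways of catching corrupt values mentioned above (constraints and statistical techniques) can be sketched in a few lines. The weights are made up, and the statistical rule used here (distance from the median measured in units of the median absolute deviation, with a cutoff of 5) is only one of many possible choices.

```python
import statistics

# Made-up weights; 700 looks like a typo for 70, and -5 is impossible.
weights = [70, 72, 68, 71, 69, 700, -5]

# Constraint check: a weight can't be negative.
constraint_violations = [w for w in weights if w < 0]

# Statistical check: flag values far from the median.
median = statistics.median(weights)
mad = statistics.median([abs(w - median) for w in weights])
CUTOFF = 5  # arbitrary threshold, an assumption of this sketch
outliers = [w for w in weights if abs(w - median) > CUTOFF * mad]

print(constraint_violations)
print(outliers)
```

The constraint only catches the negative value; the statistical check catches both suspicious values but needs a threshold that someone has to pick.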
Data that is fine on its own, but becomes problematic when you want to integrate it
z Format
z Dynamic data
z Different granularity
z Conflicting data
Formats
z Not everyone uses the same format as you
z Dates are especially problematic:

– 12/19/77
– 12/19/1977
– 12-19-77
– 19/12/77
– Dec 19, 1977
– 19 December 1977
– 9 in Tevet, 5738
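A minimal sketch of normalizing most of the formats above with the standard library. The list `FORMATS` and its order are assumptions: the order decides how an ambiguous input such as 01/02/77 is read (month first here). Month names are matched in the current locale, assumed to be English, and the Hebrew-calendar date is not covered by `strptime` at all.

```python
from datetime import date, datetime

# Tried in order; earlier formats win for ambiguous inputs.
FORMATS = [
    "%m/%d/%y",
    "%m/%d/%Y",
    "%m-%d-%y",
    "%d/%m/%y",
    "%b %d, %Y",
    "%d %B %Y",
]


def parse_date(text: str) -> date:
    """Return the first successful parse, or raise ValueError."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


samples = [
    "12/19/77",
    "12/19/1977",
    "12-19-77",
    "19/12/77",
    "Dec 19, 1977",
    "19 December 1977",
]
parsed = [parse_date(s) for s in samples]
print(parsed)
```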
Data that Moves
z You can’t store it all in the same currency (say, US$) because the exchange rate changes
z Price in foreign currency stays the same
z Must keep the data in foreign currency and use the current exchange rate to convert
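As a sketch of this idea: store the price as quoted, and convert only when a US$ figure is needed. The class `Price`, the function `to_usd`, and the exchange rates are all hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Price:
    amount: Decimal  # stored exactly as quoted by the source
    currency: str    # e.g. "EUR"


def to_usd(price: Price, usd_per_unit: dict) -> Decimal:
    """Convert using whatever rate table is current at call time."""
    return price.amount * usd_per_unit[price.currency]


hotel = Price(Decimal("100"), "EUR")

# Made-up rates on two different days: the stored price never changes,
# only the converted value does.
monday = to_usd(hotel, {"EUR": Decimal("1.10")})
friday = to_usd(hotel, {"EUR": Decimal("1.25")})
print(monday, friday)
```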
Data at a different level of detail than you need
z If it is at a finer level of detail, you can sometimes bin it
z Example
– I need age ranges of 20-30, 30-40, 40-50, etc.
– Imported data contains birth date
– No problem! Divide data into appropriate categories
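A small sketch of binning birth dates into age ranges. The ranges on the slide share their endpoints, so this sketch assumes they are half-open (a 30-year-old falls in 30-40). The reference date is fixed only to make the example reproducible.

```python
from datetime import date


def age_on(birth: date, today: date) -> int:
    """Age in whole years on the given day."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


def age_range(age: int, width: int = 10) -> str:
    """Label of the half-open bin [low, low + width) containing age."""
    low = (age // width) * width
    return f"{low}-{low + width}"


reference = date(2024, 1, 1)  # assumption: any fixed "today" works
births = [date(1977, 12, 19), date(1990, 6, 1), date(2000, 1, 1)]
ranges = [age_range(age_on(b, reference)) for b in births]
print(ranges)
```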
Data at a different level of detail than you need (cont’d)
z Sometimes you cannot bin it
z Example
– I need age ranges 20-30, 30-40, 40-50, etc.
– Data is of age ranges 25-35, 35-45, etc.
– What to do?
z Ignore age ranges because you aren’t sure
z Make an educated guess based on the imported data (e.g., assume that the number of people aged 25-35 is the average of the numbers aged 20-30 and 30-40)
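One way to make the educated guess concrete is to assume that people are spread evenly within each imported range, and then give each target range its share of every overlapping source range. The even-spread assumption, the function name `rebin`, and the counts below are all illustrative, not from the slides.

```python
def rebin(source, targets):
    """Redistribute counts from source ranges to target ranges.

    Assumes counts are spread uniformly inside each source range.
    """
    result = {}
    for t_lo, t_hi in targets:
        total = 0.0
        for (s_lo, s_hi), count in source.items():
            overlap = max(0, min(t_hi, s_hi) - max(t_lo, s_lo))
            total += count * overlap / (s_hi - s_lo)
        result[(t_lo, t_hi)] = total
    return result


# Made-up counts for the imported ranges.
imported = {(25, 35): 100, (35, 45): 60}
estimated = rebin(imported, [(20, 30), (30, 40), (40, 50)])
print(estimated)
```

Note that the 20-30 and 40-50 estimates only cover the part of each range that the imported data overlaps, so they are lower bounds at best.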
Conflicting Data
z Information source #1 says that George lives in Texas
z Information source #2 says that George lives in Washington, DC
z What to do?
– Use both (He lives in both places)
– Use the most recently updated piece of info
– Use the “most trusted” info
– Flag row to be investigated further by hand
– Use neither (We’d rather be incomplete than wrong)
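Three of the strategies above can be written as tiny functions over the conflicting claims. The update dates and trust scores below are invented for the sketch; in practice they would have to come from the sources themselves or from someone's judgment.

```python
from datetime import date

# Hypothetical claims; dates and trust scores are made up.
claims = [
    {"source": 1, "state": "Texas",
     "updated": date(2000, 5, 1), "trust": 0.6},
    {"source": 2, "state": "Washington, DC",
     "updated": date(2001, 2, 1), "trust": 0.9},
]


def most_recent(claims):
    """Use the most recently updated piece of info."""
    return max(claims, key=lambda c: c["updated"])["state"]


def most_trusted(claims):
    """Use the claim from the source with the highest trust score."""
    return max(claims, key=lambda c: c["trust"])["state"]


def needs_review(claims):
    """Flag the row for manual investigation if the sources disagree."""
    return len({c["state"] for c in claims}) > 1


print(most_recent(claims), most_trusted(claims), needs_review(claims))
```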
Outline
z Introduction
z Data cleaning
z Data mining techniques
– Classification
– Clustering
– Regression
– Meta-learning
Classification: Definition
z Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
z Find a model for the class attribute as a function of the values of the other attributes.
z Goal: previously unseen records should be assigned a class as accurately as possible
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
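The train/test procedure can be sketched without any library. The classifier here is 1-nearest-neighbor, chosen only because it fits in a few lines; the records are synthetic and deliberately easy to separate, and the 70/30 split is an arbitrary choice.

```python
import random


def nearest_neighbor(train, point):
    """Predict the class of the closest training record (squared distance)."""
    _, label = min(
        train,
        key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], point)),
    )
    return label


# Synthetic records: (attributes, class). Two well-separated groups.
records = [((i, i % 3), "low") for i in range(10)]
records += [((100 + i, i % 3), "high") for i in range(10)]

random.Random(0).shuffle(records)  # fixed seed for reproducibility
split = len(records) * 7 // 10     # 70% training, 30% test
train, test = records[:split], records[split:]

# Accuracy is measured only on records the model has not seen.
correct = sum(nearest_neighbor(train, x) == y for x, y in test)
accuracy = correct / len(test)
print(len(train), len(test), accuracy)
```

On real data the accuracy would rarely be perfect; the toy data here is separable by construction.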
Classification Example
Classification Techniques
z Decision Tree based Methods

z Rule-based Methods
z Memory based reasoning
z Neural Networks
z Genetic Algorithms
z Naïve Bayes and Bayesian Network
z Support Vector Machine
What is Cluster Analysis
z Finding groups of objects such that the objects in a group will be similar to one another and different from the objects in other groups.
– Based on information found in the data that describes the objects and their relationships
– Also known as unsupervised classification
z Many applications
– Understanding: group related documents for browsing (similar websites) or find genes or proteins that have similar functionality
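As an illustration of grouping similar objects, here is a bare-bones k-means, one common partitional method. The slides do not name a specific algorithm, so this is an assumed example; the points and the choice of the first two points as starting centroids are arbitrary.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until labels stop changing."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, labels


points = [(1, 1), (2, 1), (1, 2), (10, 10), (11, 10), (10, 11)]
centroids, labels = kmeans(points, points[:2])
print(labels)
```

No class labels are given to the algorithm, which is why clustering is called unsupervised.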
Notion of a Cluster is Ambiguous
Partitional Clustering
Hierarchical Clustering
Mining Associations
z Given a set of records, find rules that will predict the occurrence of an item based on the occurrences of other items in the record
Definition of Association Rule
Association Rule Mining
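Association rules are usually scored with support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). A minimal sketch of both measures follows; the market-basket transactions are made up for illustration.

```python
# Made-up market-basket records.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset):
    """Fraction of transactions that contain the itemset."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions with the antecedent, the share that also
    contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


# Rule: {Milk, Diaper} -> {Beer}
rule_support = support({"Milk", "Diaper", "Beer"})
rule_confidence = confidence({"Milk", "Diaper"}, {"Beer"})
print(rule_support, rule_confidence)
```

Rule mining then amounts to searching for rules whose support and confidence exceed chosen thresholds.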
Meta-learning
z Learning about … “learning”
z Combine multiple classifiers together to yield a better result.
z Simple voting, boosting, stacking
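Simple voting is the easiest of these to sketch: each base classifier predicts a class and the most common answer wins. The three threshold classifiers below are hypothetical stand-ins for real trained models.

```python
from collections import Counter


def majority_vote(classifiers, record):
    """Return the class predicted by the most base classifiers."""
    votes = Counter(clf(record) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Three hypothetical base classifiers with different cutoffs.
def clf_a(x):
    return "high" if x > 10 else "low"


def clf_b(x):
    return "high" if x > 20 else "low"


def clf_c(x):
    return "high" if x > 30 else "low"


ensemble = [clf_a, clf_b, clf_c]
predictions = [majority_vote(ensemble, x) for x in (5, 15, 25, 35)]
print(predictions)
```

Boosting and stacking go further by weighting the base classifiers or by training another model on their outputs.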
Stacking
Algorithm selection
z Given that we have a wide range of algorithms, which algorithm should we choose?
– Meta-learning approach [Brazdi 1995]
– Still an open-ended question
Outline
z Introduction
z Data cleaning
z Data mining techniques
– Classification
– Clustering
– Regression
– Meta-learning
Case study: Bidding for travel
Can we predict the winning hotel (or price)?

How does it work (I think..)?
[Figure: hotels A, B, and C with retail prices 120, 200, and 180; price points $60, $63, $65, $68; Priceline winning hotel: A]
A: 120 → 60
B: 200 → 65
C: 180 → 68
120 < 200 < 180
Biddingfortravel cleaning
[Figure: post data for Hotel 1, Hotel 2, Hotel 3, ..., Hotel N → union → join with Biddingfortravel (area, stars, hotels) → cleaning → mining]

Prediction
z Given area (San Diego Coastal), stars (4*), check-in date, check-out date, and the retail price of each hotel in the area → predict which hotel I will get from Priceline
Ending remarks
z Data mining will always be in demand

z What makes data mining from the web so special?
– Access to real time data
– Pricing data
– Consumer aspect
