
DATA MINING

LECTURE
Classification
Basic Concepts
Naïve Bayes
Catching tax-evasion

Tax-return data for year 2011:

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

A new tax return arrives for 2012. Is this a cheating tax return?

  Refund  Marital Status  Taxable Income  Cheat
  No      Married         80K             ?

This is an instance of the classification problem: learn a method for discriminating between records of different classes (cheaters vs. non-cheaters).
What is classification?

• Classification is the task of learning a target function f that maps an attribute set x to one of the predefined class labels y.

• In the tax-return data above, one of the attributes is the class attribute: Cheat.
• There are two class labels (or classes): Yes (1) and No (0).
Why classification?

• The target function f is known as a classification model.

• Descriptive modeling: an explanatory tool to distinguish between objects of different classes (e.g., to understand why people cheat on their taxes).

• Predictive modeling: predict the class of a previously unseen record.
Examples of Classification Tasks

• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Identifying spam email, spam web pages, adult content
• Determining whether a web query has commercial intent or not
General approach to classification

• The training set consists of records with known class labels.
• The training set is used to build a classification model.
• A labeled test set of previously unseen records is used to evaluate the quality of the model.
• The classification model is applied to new records with unknown class labels.
Illustrating Classification Task

Induction: a learning algorithm learns a model from the training set.

  Training Set
  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

Deduction: the model is applied to the test set to predict the unknown labels (a code sketch follows).

  Test Set
  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?
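The induction/deduction workflow can be sketched in a few lines. This is only an illustration, assuming pandas and scikit-learn are available; the decision tree stands in for whichever learning algorithm is used, and the data frames mirror part of the toy tables above.

```python
# A minimal sketch of the induction/deduction workflow above,
# assuming pandas and scikit-learn are installed.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "Attrib1": ["Yes", "No", "No", "Yes", "No"],
    "Attrib2": ["Large", "Medium", "Small", "Medium", "Large"],
    "Attrib3": [125, 100, 70, 120, 95],
    "Class":   ["No", "No", "No", "No", "Yes"],
})
test = pd.DataFrame({
    "Attrib1": ["No", "Yes"],
    "Attrib2": ["Small", "Medium"],
    "Attrib3": [55, 80],
})

# Induction: learn a model from records with known labels.
X_train = pd.get_dummies(train.drop(columns="Class"), dtype=int)
model = DecisionTreeClassifier().fit(X_train, train["Class"])

# Deduction: apply the model to records with unknown labels,
# aligning the one-hot columns with those seen during training.
X_test = pd.get_dummies(test, dtype=int).reindex(columns=X_train.columns,
                                                 fill_value=0)
print(model.predict(X_test))
```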
Evaluation of classification models

• Count the test records that are correctly (or incorrectly) predicted by the classification model.
• Confusion matrix:

                              Predicted Class
                              Class = 1   Class = 0
  Actual Class   Class = 1    f11         f10
                 Class = 0    f01         f00

$$\text{Accuracy} = \frac{\#\,\text{correct predictions}}{\text{total}\ \#\,\text{of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$

$$\text{Error rate} = \frac{\#\,\text{wrong predictions}}{\text{total}\ \#\,\text{of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
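As a quick illustration, the four counts and both formulas can be computed from a pair of label lists in plain Python (the labels below are made up):

```python
# A small sketch of the confusion-matrix counts and the accuracy /
# error-rate formulas above (pure Python, illustrative labels).
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

f11 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
f10 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
f01 = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
f00 = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

total = f11 + f10 + f01 + f00
accuracy   = (f11 + f00) / total   # fraction predicted correctly
error_rate = (f10 + f01) / total   # = 1 - accuracy
print(accuracy, error_rate)        # 0.75 0.25
```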
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory-based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
NAÏVE BAYES CLASSIFIER
Classification
Basic Concepts
Bayes Classifier

• A probabilistic framework for solving classification problems.
• A, C: random variables
• Joint probability: Pr(A = a, C = c)
• Conditional probability: Pr(C = c | A = a)
• Relationship between joint and conditional probability distributions:

$$\Pr(C, A) = \Pr(C \mid A) \cdot \Pr(A) = \Pr(A \mid C) \cdot \Pr(C)$$

• Bayes' theorem:

$$P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)}$$
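A toy numeric check of Bayes' theorem; the probability values below are illustrative placeholders, not taken from the slides:

```python
# Bayes' theorem on made-up numbers: posterior = likelihood * prior / evidence.
p_c = 0.3            # P(C = cheat), an assumed prior
p_a_given_c = 0.8    # P(A = audit-flag | C = cheat), assumed likelihood
p_a = 0.35           # P(A = audit-flag), the evidence

p_c_given_a = p_a_given_c * p_c / p_a   # Bayes' theorem
print(round(p_c_given_a, 3))            # 0.686
```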
Bayesian Classifiers

• Consider each attribute and the class label as random variables. In the tax-return table above, Refund and Marital Status are categorical attributes, Taxable Income is continuous, and Evade is the class.

Evade (class C):
  Event space: {Yes, No}
  P(C) = (0.3, 0.7)

Refund (attribute A1):
  Event space: {Yes, No}
  P(A1) = (0.3, 0.7)

Marital Status (attribute A2):
  Event space: {Single, Married, Divorced}
  P(A2) = (0.4, 0.4, 0.2)

Taxable Income (attribute A3):
  Event space: R
  P(A3) ~ Normal(μ, σ)  (a Gaussian distribution)
Bayesian Classifiers

• Given a record X over attributes (A1, A2, …, An)
  • e.g., X = ('Yes', 'Single', 125K)
• The goal is to predict the class C.
  • Specifically, we want to find the value c of C that maximizes P(C = c | X).
  • This is the Maximum A Posteriori (MAP) probability estimate.
• Can we estimate P(C | X) directly from the data?
  • This means estimating the probability for all possible values of the class variable.
Bayesian Classifiers

• Approach: compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes' theorem:

$$P(C \mid A_1, A_2, \ldots, A_n) = \frac{P(A_1, A_2, \ldots, A_n \mid C)\,P(C)}{P(A_1, A_2, \ldots, A_n)}$$

• Choose the value of C that maximizes P(C | A1, A2, …, An).
• Since the denominator does not depend on C, this is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C).
• How do we estimate P(A1, A2, …, An | C)?
Naïve Bayes Classifier

• Assume independence among the attributes Ai when the class is given:

$$P(A_1, A_2, \ldots, A_n \mid C) = P(A_1 \mid C)\,P(A_2 \mid C) \cdots P(A_n \mid C)$$

• We can then estimate P(Ai | C) for all values of Ai and C.
• A new point X is classified to the class c for which

$$P(C = c \mid X) \propto P(C = c) \prod_i P(A_i \mid c)$$

is maximum over all possible values of C.
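A minimal sketch of this scoring rule in plain Python; the probability tables hold the values derived from the tax-return data later in these slides:

```python
# Score each class c by P(C = c) * prod_i P(A_i = x_i | c) and pick the max.
from math import prod

prior = {"No": 0.7, "Yes": 0.3}
likelihood = {                        # P(attribute value | class)
    "No":  {"Refund=No": 4/7, "Status=Married": 4/7},
    "Yes": {"Refund=No": 3/3, "Status=Married": 0/3},
}

x = ["Refund=No", "Status=Married"]   # attribute values of the new record
scores = {c: prior[c] * prod(likelihood[c][a] for a in x) for c in prior}
print(max(scores, key=scores.get))    # "No"
```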
How to Estimate Probabilities from Data?

• Class prior probability:

$$P(C = c) = \frac{N_c}{N}$$

where N_c is the number of records of class c and N is the total number of records.
  • e.g., P(C = No) = 7/10, P(C = Yes) = 3/10

• For discrete attributes:

$$P(A_i = a \mid C = c) = \frac{N_{a,c}}{N_c}$$

where N_{a,c} is the number of instances having attribute value A_i = a that belong to class c.
  • Examples: P(Status = Married | No) = 4/7, P(Refund = Yes | Yes) = 0
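These counts are straightforward to compute. A short plain-Python sketch over the tax-return table:

```python
# Prior and discrete conditional probabilities by counting.
# Tuples are (Refund, Marital Status, Income, Class).
records = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes","Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]

N = len(records)
N_c = {c: sum(1 for *_, cls in records if cls == c) for c in ("No", "Yes")}
print({c: N_c[c] / N for c in N_c})   # P(C = No) = 0.7, P(C = Yes) = 0.3

# P(Status = Married | C = No) = N_{a,c} / N_c
n_married_no = sum(1 for _, status, _, cls in records
                   if status == "Married" and cls == "No")
print(n_married_no / N_c["No"])       # 4/7 ≈ 0.571
```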
How to Estimate Probabilities from Data?

• For continuous attributes:
  • Discretize the range into bins:
    • one ordinal attribute per bin
    • violates the independence assumption
  • Two-way split: (A < v) or (A > v):
    • choose only one of the two splits as the new attribute
  • Probability density estimation:
    • assume the attribute follows a normal distribution
    • use the data to estimate the parameters of the distribution (i.e., mean μ and standard deviation σ)
    • once the probability distribution is known, we can use it to estimate the conditional probability P(Ai | c)
How to Estimate Probabilities from Data?

• Normal distribution:

$$P(A_i = a \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}}\, e^{-\frac{(a - \mu_{ij})^2}{2\sigma_{ij}^2}}$$

• One pair of parameters (μij, σij) for each attribute-class pair (Ai, cj).

• For (Income, Class = No):
  • sample mean = 110
  • sample variance = 2975

$$P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi} \cdot 54.54}\, e^{-\frac{(120 - 110)^2}{2 \cdot 2975}} = 0.0072$$
Example of Naïve Bayes Classifier

• Creating a Naïve Bayes classifier essentially means computing counts:

Total number of records: N = 10

Class No: 7 records
  Refund: Yes: 3, No: 4
  Marital Status: Single: 2, Divorced: 1, Married: 4
  Income: mean = 110, variance = 2975

Class Yes: 3 records
  Refund: Yes: 0, No: 3
  Marital Status: Single: 2, Divorced: 1, Married: 0
  Income: mean = 90, variance = 25
Example of Naïve Bayes Classifier

Given a test record:  X = (Refund = No, Married, Income = 120K)

Naïve Bayes conditional probabilities:
  P(Refund = Yes | No) = 3/7        P(Refund = No | No) = 4/7
  P(Refund = Yes | Yes) = 0         P(Refund = No | Yes) = 1
  P(Marital Status = Single | No) = 2/7
  P(Marital Status = Divorced | No) = 1/7
  P(Marital Status = Married | No) = 4/7
  P(Marital Status = Single | Yes) = 2/3
  P(Marital Status = Divorced | Yes) = 1/3
  P(Marital Status = Married | Yes) = 0

For taxable income:
  If Class = No:  sample mean = 110, sample variance = 2975
  If Class = Yes: sample mean = 90,  sample variance = 25

P(X | Class = No) = P(Refund = No | No) × P(Married | No) × P(Income = 120K | No)
                  = 4/7 × 4/7 × 0.0072 = 0.0024
P(X | Class = Yes) = P(Refund = No | Yes) × P(Married | Yes) × P(Income = 120K | Yes)
                   = 1 × 0 × 1.2 × 10⁻⁹ = 0

Priors: P(No) = 0.7, P(Yes) = 0.3

Since P(X | No) P(No) > P(X | Yes) P(Yes), it follows that P(No | X) > P(Yes | X)
=> Class = No
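Putting the pieces together, here is a compact, self-contained sketch of the whole worked example in plain Python: discrete likelihoods by counting, a Gaussian for Income, and a MAP decision.

```python
# End-to-end naive Bayes on the tax-return data (illustrative sketch).
from math import sqrt, pi, exp

# (Refund, Marital Status, Income, Class)
data = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes","Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]

def gaussian(a, mu, var):
    return exp(-(a - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def score(refund, status, income, c):
    rows = [r for r in data if r[3] == c]
    prior = len(rows) / len(data)                       # P(C = c)
    p_refund = sum(r[0] == refund for r in rows) / len(rows)
    p_status = sum(r[1] == status for r in rows) / len(rows)
    incomes = [r[2] for r in rows]
    mu = sum(incomes) / len(incomes)
    var = sum((x - mu) ** 2 for x in incomes) / (len(incomes) - 1)
    return prior * p_refund * p_status * gaussian(income, mu, var)

x = ("No", "Married", 120)
print({c: score(*x, c) for c in ("No", "Yes")})        # No ≈ 0.0016, Yes = 0.0
print(max(("No", "Yes"), key=lambda c: score(*x, c)))  # "No"
```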
Naïve Bayes Classifier

• If one of the conditional probabilities is zero, then the entire expression becomes zero.
• Probability estimation:

$$\text{Original: } P(A_i = a \mid C = c) = \frac{N_{ac}}{N_c}$$

$$\text{Laplace: } P(A_i = a \mid C = c) = \frac{N_{ac} + 1}{N_c + N_i}$$

$$m\text{-estimate: } P(A_i = a \mid C = c) = \frac{N_{ac} + mp}{N_c + m}$$

where N_i is the number of distinct values of attribute A_i, p is a prior probability, and m is a parameter.
Example of Naïve Bayes Classifier (with Laplace smoothing)

Given a test record:  X = (Refund = No, Married, Income = 120K)

Smoothed conditional probabilities:
  P(Refund = Yes | No) = 4/9        P(Refund = No | No) = 5/9
  P(Refund = Yes | Yes) = 1/5       P(Refund = No | Yes) = 4/5
  P(Marital Status = Single | No) = 3/10
  P(Marital Status = Divorced | No) = 2/10
  P(Marital Status = Married | No) = 5/10
  P(Marital Status = Single | Yes) = 3/6
  P(Marital Status = Divorced | Yes) = 2/6
  P(Marital Status = Married | Yes) = 1/6

For taxable income:
  If Class = No:  sample mean = 110, sample variance = 2975
  If Class = Yes: sample mean = 90,  sample variance = 25

P(X | Class = No) = P(Refund = No | No) × P(Married | No) × P(Income = 120K | No)
                  = 5/9 × 5/10 × 0.0072 = 0.0020
P(X | Class = Yes) = P(Refund = No | Yes) × P(Married | Yes) × P(Income = 120K | Yes)
                   = 4/5 × 1/6 × 1.2 × 10⁻⁹ = 1.6 × 10⁻¹⁰

Priors: P(No) = 0.7, P(Yes) = 0.3

Since P(X | No) P(No) > P(X | Yes) P(Yes), it follows that P(No | X) > P(Yes | X)
=> Class = No
Implementation details

• Computing the conditional probabilities involves multiplying many very small numbers.
  • The products get very close to zero, and there is a danger of numeric underflow.
• We can deal with this by working with the logarithm of the conditional probability:

$$\log P(C \mid A) \sim \log P(A \mid C) + \log P(C) = \sum_i \log P(A_i \mid C) + \log P(C)$$
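A small sketch of log-space scoring, reusing the smoothed probabilities from the Laplace example; sums of logs replace products of raw probabilities:

```python
# Log-space class scores: sum log-probabilities to avoid underflow.
from math import log

log_prior = {"No": log(0.7), "Yes": log(0.3)}
log_likelihood = {
    "No":  [log(5/9), log(5/10), log(0.0072)],   # per-attribute P(A_i | No)
    "Yes": [log(4/5), log(1/6),  log(1.2e-9)],   # per-attribute P(A_i | Yes)
}

scores = {c: log_prior[c] + sum(log_likelihood[c]) for c in log_prior}
print(max(scores, key=scores.get))   # "No"
```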
Naïve Bayes for Text Classification

• Naïve Bayes is commonly used for text classification.
• For a document d = (t1, …, tk):

$$P(c \mid d) \propto P(c) \prod_{t_i \in d} P(t_i \mid c)$$

• P(ti | c) = fraction of the terms from all documents in class c that are ti.
• Easy to implement and works relatively well.
• Limitation: hard to incorporate additional features (beyond words).
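A bare-bones sketch of this model on a made-up four-document corpus (the documents and labels are invented for illustration); P(t | c) is the Laplace-smoothed fraction of term occurrences in class c:

```python
# Multinomial naive Bayes for text, with add-one smoothing over the vocabulary.
from collections import Counter
from math import log

train = [("buy cheap pills now",      "spam"),
         ("cheap cheap offer",        "spam"),
         ("meeting notes attached",   "ham"),
         ("see you at the meeting",   "ham")]

docs = {c: " ".join(d for d, cl in train if cl == c).split()
        for c in ("spam", "ham")}
counts = {c: Counter(words) for c, words in docs.items()}
vocab = {t for words in docs.values() for t in words}

def log_score(doc, c):
    prior = sum(cl == c for _, cl in train) / len(train)
    total = sum(counts[c].values())          # term occurrences in class c
    return log(prior) + sum(
        log((counts[c][t] + 1) / (total + len(vocab)))  # smoothed P(t | c)
        for t in doc.split())

doc = "cheap meeting"
print(max(("spam", "ham"), key=lambda c: log_score(doc, c)))  # "spam"
```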
Naïve Bayes (Summary)

• Robust to isolated noise points.
• Handles missing values by ignoring the instance during the probability estimate calculations.
• Robust to irrelevant attributes.
• The independence assumption may not hold for some attributes.
  • Use other techniques such as Bayesian Belief Networks (BBN).
• Naïve Bayes can produce a probability estimate, but it is usually a very biased one.
  • Logistic Regression is better for obtaining probabilities.
Generative vs Discriminative models

• Naïve Bayes is a type of generative model.
• Generative process:
  • First pick the category of the record.
  • Then, given the category, generate the attribute values from the distribution of the category.
• The attributes A1, A2, …, An are conditionally independent given the class C.
• We use the training data to learn the distribution of the values in each class.
Generative vs Discriminative models

• Logistic Regression and SVM are discriminative models.
  • The goal is to find the boundary that discriminates between the two classes from the training data.
• For example, in order to classify the language of a document, you can:
  • either learn the two languages and find which is more likely to have generated the words you see,
  • or learn what differentiates the two languages.
