
Unit 3

Classification & Prediction
Steps in Classification
Classification Process (1): Model Construction

Training data is fed to a classification algorithm, which produces the classifier (model):

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
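As a minimal illustration (an addition to these notes), the learned model can be applied directly as a function:

def tenured(rank, years):
    # Learned model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

print(tenured("Professor", 4))       # yes
print(tenured("Assistant Prof", 3))  # no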
Classification Process (2): Use the Model in Prediction

The classifier is first applied to labeled test data to estimate its accuracy, and then to unseen data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?

In summary, the process has two steps: model construction and model usage.
Supervised vs. Unsupervised
Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
Issues (1): Data Preparation

• Data cleaning
– Preprocess data in order to reduce noise and handle
missing values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data
Classification by Decision Tree
Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
Training Dataset

This follows an example from Quinlan’s ID3.

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
Output: A Decision Tree for “buys_computer”

age?
  <=30   → student?
             no  → no
             yes → yes
  31..40 → yes
  >40    → credit rating?
             excellent → no
             fair      → yes
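As an illustrative sketch (an addition; the nested-dict encoding is just one convenient choice), this tree can be represented and applied in Python:

# The learned tree as nested dicts: internal nodes test an attribute,
# leaves hold the class label ("yes"/"no")
tree = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def classify(node, sample):
    # Walk the tree until a leaf (a plain string) is reached
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[sample[attribute]]
    return node

print(classify(tree, {"age": "<=30", "student": "yes"}))  # yes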
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in
advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
Attribute Selection Measure

• Information gain (ID3/C4.5)


– All attributes are assumed to be categorical
– Can be modified for continuous-valued attributes
• Gini index (IBM IntelligentMiner)
– All attributes are assumed continuous-valued
– Assume there exist several possible split values for each
attribute
– May need other tools, such as clustering, to get the possible
split values
– Can be modified for categorical attributes
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n
elements of class N
– The amount of information needed to decide if an arbitrary
example in S belongs to P or N is defined as

I(p, n) = −(p/(p+n)) log₂(p/(p+n)) − (n/(p+n)) log₂(n/(p+n))
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayes’ Theorem: Basics

• Total probability theorem: P(B) = Σᵢ₌₁ᴹ P(B|Aᵢ) P(Aᵢ)

• Bayes’ theorem: P(H|X) = P(X|H) P(H) / P(X)
– Let X be a data sample (“evidence”): class label is unknown
– Let H be a hypothesis that X belongs to class C
– Classification is to determine P(H|X) (i.e., the posterior probability): the
probability that the hypothesis holds given the observed data sample X
– P(H) (prior probability): the initial probability
• E.g., X will buy computer, regardless of age, income, …
– P(X): probability that sample data is observed
– P(X|H) (likelihood): the probability of observing the sample X, given that
the hypothesis holds
• E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
Prediction Based on Bayes’ Theorem
• Given training data X, posteriori probability of a hypothesis H,
P(H|X), follows the Bayes’ theorem

P(H|X) = P(X|H) P(H) / P(X)

• Informally, this can be viewed as
posterior = likelihood × prior / evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
• Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes’ theorem:

P(Cᵢ|X) = P(X|Cᵢ) P(Cᵢ) / P(X)

• Since P(X) is constant for all classes, only

P(Cᵢ|X) ∝ P(X|Cᵢ) P(Cᵢ)

needs to be maximized

Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
P(X|Cᵢ) = ∏ₖ₌₁ⁿ P(xₖ|Cᵢ) = P(x₁|Cᵢ) × P(x₂|Cᵢ) × … × P(xₙ|Cᵢ)

• This greatly reduces the computation cost: it only requires
counting the class distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
• If Aₖ is continuous-valued, P(xₖ|Cᵢ) is usually computed based on a
Gaussian distribution with mean μ and standard deviation σ:

g(x, μ, σ) = (1 / (√(2π)·σ)) · e^(−(x−μ)² / (2σ²))

and P(xₖ|Cᵢ) = g(xₖ, μ_Cᵢ, σ_Cᵢ)
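A compact sketch (an addition, handling categorical attributes only, with Laplace smoothing to avoid zero probabilities):

from collections import Counter, defaultdict

def train_nb(rows, labels):
    # Count class priors and per-attribute value frequencies
    priors = Counter(labels)
    counts = defaultdict(Counter)   # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for k, value in enumerate(row):
            counts[(c, k)][value] += 1
    return priors, counts

def predict_nb(priors, counts, x):
    # Pick the class Ci maximizing P(Ci) * prod_k P(xk|Ci)
    total = sum(priors.values())
    best, best_p = None, -1.0
    for c, nc in priors.items():
        p = nc / total
        for k, value in enumerate(x):
            p *= (counts[(c, k)][value] + 1) / (nc + len(counts[(c, k)]))
        if p > best_p:
            best, best_p = c, p
    return best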
Naïve Bayes Classifier: Comments
• Advantages
– Easy to implement
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore loss of
accuracy
– Practically, dependencies exist among variables
• E.g., in hospital patient data: profile (age, family history,
etc.), symptoms (fever, cough, etc.), disease (lung cancer,
diabetes, etc.)
• Dependencies among these cannot be modeled by Naïve
Bayes Classifier
• How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
– Rule antecedent/precondition vs. rule consequent
• Assessment of a rule: coverage and accuracy
– ncovers = # of tuples covered by R
– ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
• If more than one rule is triggered, we need conflict resolution
– Size ordering: assign the highest priority to the triggering rule that has
the “toughest” requirement (i.e., with the most attribute tests)
– Class-based ordering: decreasing order of prevalence or misclassification
cost per class
– Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts
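The two rule measures translate directly into code (an illustrative addition; the rule is encoded as a predicate function):

def rule_metrics(rule, consequent, data, labels):
    # coverage(R) = n_covers / |D|;  accuracy(R) = n_correct / n_covers
    covered = [(x, y) for x, y in zip(data, labels) if rule(x)]
    n_covers = len(covered)
    n_correct = sum(1 for _, y in covered if y == consequent)
    return n_covers / len(data), (n_correct / n_covers) if n_covers else 0.0

# R: IF age = youth AND student = yes THEN buys_computer = yes
r = lambda x: x["age"] == "youth" and x["student"] == "yes"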
Rule Extraction from a Decision Tree
 Rules are easier to understand than large trees
 One rule is created for each path from the root
to a leaf
 Each attribute-value pair along a path forms a
conjunction: the leaf holds the class prediction
 Rules are mutually exclusive and exhaustive
• Example: Rule extraction from our buys_computer decision tree (shown
earlier)
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
Rule Induction: Sequential Covering Method
• Sequential covering algorithm: Extracts rules directly from training
data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
• Steps:
– Rules are learned one at a time
– Each time a rule is learned, the tuples covered by the rules are
removed
– Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
• Compared with decision-tree induction, which learns a set of rules
simultaneously
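The loop described in the steps above can be sketched as follows (an addition; learn_one_rule, which greedily grows a single rule, is left abstract):

def sequential_covering(data, labels, target_class, learn_one_rule, min_quality):
    # Learn rules one at a time; remove covered tuples after each rule
    rules = []
    remaining = list(zip(data, labels))
    while any(y == target_class for _, y in remaining):
        rule, quality = learn_one_rule(remaining, target_class)
        if quality < min_quality:
            break   # rule quality below the user-specified threshold
        rules.append(rule)
        remaining = [(x, y) for x, y in remaining if not rule(x)]
    return rules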
Classification by Backpropagation
• Neural networks (NN) are an important data mining tool used for
classification and clustering.
• A neural network is an attempt to build a machine that will mimic
brain activities and be able to learn.
• A NN usually learns by examples. If a NN is supplied with enough
examples, it should be able to perform classification and even
discover new trends or patterns in data.
• A basic NN is composed of three layers: input, output and hidden.
• Each layer can have a number of nodes, and nodes from the input
layer are connected to the nodes of the hidden layer.
• Nodes from the hidden layer are connected to the nodes of the
output layer. Those connections represent weights between nodes.
Cont’d….
• The idea behind the BP algorithm is quite simple: the
output of the NN is evaluated against the desired
output.
• If the results are not satisfactory, the connection
weights between layers are modified and the process
is repeated again and again until the error is small
enough.
Cont’d…
• There are two types of NN based on the learning
technique: supervised, where output values are
known beforehand (back propagation algorithm),
and unsupervised, where output values are not
known (clustering).
Back Propagation Algorithm
• One of the most popular NN algorithms is the back propagation
algorithm. Rojas [2005] claimed that the BP algorithm can be
decomposed into the following four steps:
• i) Feed-forward computation
• ii) Back propagation to the output layer
• iii) Back propagation to the hidden layer
• iv) Weight updates
• The algorithm is stopped when the value of the error function has
become sufficiently small.
• This is a very rough and basic outline of the BP algorithm; there are
some variations proposed by other scientists, but Rojas’s definition
seems to be quite accurate and easy to follow. The last step, weight
updates, happens throughout the algorithm.
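The four steps can be made concrete in a minimal numeric sketch (an addition, assuming NumPy; a tiny 2-4-1 network trained on XOR with a fixed learning rate of 0.5):

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)            # XOR targets
W1, W2 = rng.normal(size=(2, 4)), rng.normal(size=(4, 1))  # connection weights

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    # i) feed-forward computation
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)
    # ii) back propagation to the output layer
    d_out = (out - y) * out * (1 - out)
    # iii) back propagation to the hidden layer
    d_h = (d_out @ W2.T) * h * (1 - h)
    # iv) weight updates
    W2 -= 0.5 * h.T @ d_out
    W1 -= 0.5 * X.T @ d_h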
Support Vector Machine
• A “Support Vector Machine” (SVM) is a supervised
machine learning algorithm which can be used
for both classification and regression
• However, it is mostly used in classification
problems
• In this algorithm, we plot each data item as a
point in n-dimensional space (where n is the number
of features you have), with the value of each
feature being the value of a particular coordinate
• Then, we perform classification by finding the
hyper-plane that differentiates the two classes
very well (see the sketch below)
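As an illustration (an addition, assuming scikit-learn is available), a linear SVM can be fit to toy two-class data in a few lines:

from sklearn import svm

X = [[0, 0], [1, 1], [1, 0], [4, 4], [5, 4], [4, 5]]   # two features per point
y = [0, 0, 0, 1, 1, 1]

clf = svm.SVC(kernel="linear")   # find the separating hyper-plane
clf.fit(X, y)
print(clf.predict([[2, 2], [5, 5]]))   # e.g., [0 1]
print(clf.support_vectors_)            # the points that define the margin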
Lazy Learners (Learning from your
Neighbours)
• Eager learning versus lazy learning
• Eager learners construct a classification model from the
training data before any test data is given
• Lazy learners simply store the training data and build a
classification only when a test tuple is given
• Because they store the training tuples as instances, lazy
learners are also called instance-based learners
Nearest Neighbor Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then it’s
probably a duck
• Approach: compute the distance from the test record to the
training records, then choose the k “nearest” training records
Nearest-Neighbor Classifiers
 Requires three things
– The set of stored records
– A distance metric to compute the
distance between records
– The value of k, the number of
nearest neighbors to retrieve
 To classify an unknown record:
– Compute its distance to the
training records
– Identify the k nearest neighbors
– Use the class labels of the nearest
neighbors to determine the
class label of the unknown record
(e.g., by taking a majority vote)
Definition of Nearest Neighbor

[Figure: the 1-, 2- and 3-nearest-neighbor
neighborhoods of a record x, marked (a), (b) and (c)]

 The k-nearest neighbors of a record x are the data
points that have the k smallest distances to x
Nearest Neighbor Classification
• Compute the distance between two points:
– Euclidean distance:

d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )

• Determine the class from the nearest neighbor list
– take the majority vote of class labels among the k-
nearest neighbors
– optionally, weigh each vote according to distance
K-nearest neighbour classifiers
• The unknown tuple is assigned the most
common class among its k-nearest neighbours
• When k = 1 the unknown tuple is assigned the
class of the training tuple that is closest to it in
pattern space
• For categorical attributes, the distance between two
values is 0 if they are identical and 1 otherwise
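A short sketch (an addition) of a k-NN classifier over numeric features, using Euclidean distance and a majority vote:

import math
from collections import Counter

def knn_classify(train, labels, x, k=3):
    # Return the majority class among the k nearest training tuples
    dist = lambda p, q: math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    nearest = sorted(zip(train, labels), key=lambda t: dist(t[0], x))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

print(knn_classify([(1, 1), (2, 2), (8, 8)], ["a", "a", "b"], (1.5, 1.5)))  # a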
Case Based Reasoning
• CBR classifiers use a database of problem
solutions to solve new problems
• CBR stores the tuples or cases for problem
solving as complex symbolic descriptions
• When given a new case to classify, CBR will
first check if an identical training case exists
• If one is found, then the accompanying
solution to that case is returned
Cont’d…
• If no identical case is found then the CBR will
search for training cases having components
that are similar to those of the new case
• Conceptually these training cases may be
considered as neighbors of the new case
Other Classification Methods
Genetic Algorithms (GA)
• Genetic Algorithm: based on an analogy to biological evolution
• An initial population is created consisting of randomly generated rules
– Each rule is represented by a string of bits
– E.g., IF A1 AND ¬A2 THEN C2 can be encoded as 100
– If an attribute has k > 2 values, k bits can be used
• Based on the notion of survival of the fittest, a new population is formed to
consist of the fittest rules and their offspring
• The fitness of a rule is represented by its classification accuracy on a set of
training examples
• Offspring are generated by crossover and mutation (sketched below)
• The process continues until a population P evolves in which each rule in P
satisfies a prespecified threshold
• Slow but easily parallelizable
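The crossover and mutation operators can be sketched on bit-string rules as follows (an illustrative addition; selection and fitness evaluation are omitted):

import random

def crossover(parent1, parent2):
    # Single-point crossover of two bit-string rules, e.g. "100" x "011"
    point = random.randrange(1, len(parent1))
    return parent1[:point] + parent2[point:]

def mutate(rule, rate=0.1):
    # Flip each bit with a small probability
    return "".join(b if random.random() > rate else str(1 - int(b)) for b in rule)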
Rough Set Approach
• Rough sets are used to approximately or “roughly” define equivalent
classes
• A rough set for a given class C is approximated by two sets: a lower
approximation (certain to be in C) and an upper approximation (cannot be
described as not belonging to C)
• Finding the minimal subsets (reducts) of attributes for feature reduction is
NP-hard but a discernibility matrix (which stores the differences between
attribute values for each pair of data tuples) is used to reduce the
computation intensity



Fuzzy Set Approaches

• Fuzzy logic uses truth values between 0.0 and 1.0 to represent
the degree of membership (such as using fuzzy membership
graph)
• Attribute values are converted to fuzzy values
– e.g., income is mapped into the discrete categories {low,
medium, high} with fuzzy values calculated
• For a given new sample, more than one fuzzy value may apply
• Each applicable rule contributes a vote for membership in the
categories
• Typically, the truth values for each predicted category are
summed, and these sums are combined
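A toy sketch (an addition; the income thresholds are made up for illustration) of mapping a numeric attribute into fuzzy categories:

def fuzzy_income(x):
    # Triangular-style memberships for income -> {low, medium, high}
    low = max(0.0, min(1.0, (30000 - x) / 20000))
    high = max(0.0, min(1.0, (x - 40000) / 20000))
    medium = max(0.0, 1.0 - low - high)
    return {"low": low, "medium": medium, "high": high}

print(fuzzy_income(25000))  # more than one nonzero membership may apply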
What Is Prediction?
• (Numerical) prediction is similar to classification
– construct a model
– use model to predict continuous or ordered value for a given input
• Prediction is different from classification
– Classification refers to predicting categorical class labels
– Prediction models continuous-valued functions
• Major method for prediction: regression
– model the relationship between one or more independent or predictor
variables and a dependent or response variable
• Regression analysis
– Linear and multiple regression
– Non-linear regression
– Other regression methods: generalized linear model, Poisson regression,
log-linear models, regression trees
Linear Regression
• Linear regression: involves a response variable y and a single predictor
variable x
y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
• Method of least squares: estimates the best-fitting straight line

w₁ = Σᵢ₌₁^|D| (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁^|D| (xᵢ − x̄)² ,    w₀ = ȳ − w₁x̄

• Multiple linear regression: involves more than one predictor variable


– Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|)
– Ex. For 2-D data, we may have: y = w0 + w1 x1+ w2 x2
– Solvable by extension of least square method or using SAS, S-Plus
– Many nonlinear functions can be transformed into the above
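The two estimates translate directly into code (a sketch added here, in plain Python):

def least_squares(xs, ys):
    # Fit y = w0 + w1*x by the method of least squares
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    w0 = y_bar - w1 * x_bar
    return w0, w1

print(least_squares([1, 2, 3], [2, 4, 6]))  # (0.0, 2.0)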
Nonlinear Regression
• Some nonlinear models can be modeled by a polynomial function
• A polynomial regression model can be transformed into a linear
regression model. For example,
y = w₀ + w₁x + w₂x² + w₃x³
is convertible to linear form with the new variables x₂ = x², x₃ = x³:
y = w₀ + w₁x + w₂x₂ + w₃x₃
• Other functions, such as the power function, can also be transformed
to a linear model
• Some models are intractably nonlinear (e.g., a sum of exponential
terms)
– it is still possible to obtain least-square estimates through
extensive calculation on more complex formulae
Other Regression-Based Models
• Generalized linear model:
– Foundation on which linear regression can be applied to modeling
categorical response variables
– Variance of y is a function of the mean value of y, not a constant
– Logistic regression: models the prob. of some event occurring as a linear
function of a set of predictor variables
– Poisson regression: models the data that exhibit a Poisson distribution
• Log-linear models: (for categorical data)
– Approximate discrete multidimensional prob. distributions
– Also useful for data compression and smoothing
• Regression trees and model trees
– Trees to predict continuous values rather than class labels



Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other metrics
to consider?
• Use validation test set of class-labeled tuples instead of training set
when assessing accuracy
• Methods for estimating a classifier’s accuracy:
– Holdout method, random subsampling
– Cross-validation
– Bootstrap
• Comparing classifiers:
– Confidence intervals
– Cost-benefit analysis and ROC Curves

Classifier Evaluation Metrics: Confusion
Matrix
Confusion Matrix:

Actual class\Predicted class   C1                    ¬ C1
C1                             True Positives (TP)   False Negatives (FN)
¬ C1                           False Positives (FP)  True Negatives (TN)

Example of Confusion Matrix:

Actual class\Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes             6954                 46                  7000
buy_computer = no              412                  2588                3000
Total                          7366                 2634                10000

• Given m classes, an entry CMi,j in a confusion matrix indicates the
# of tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals
Classifier Evaluation Metrics: Accuracy,
Error Rate, Sensitivity and Specificity

A\P    C    ¬C
C      TP   FN   P
¬C     FP   TN   N
       P’   N’   All

• Classifier accuracy, or recognition rate: percentage of
test set tuples that are correctly classified
Accuracy = (TP + TN)/All
• Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All
• Class imbalance problem:
– One class may be rare, e.g. fraud, or HIV-positive
– Significant majority of the negative class and minority of
the positive class
– Sensitivity: true positive recognition rate
Sensitivity = TP/P
– Specificity: true negative recognition rate
Specificity = TN/N
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
Precision = TP / (TP + FP)
• Recall: completeness – what % of positive tuples did the
classifier label as positive?
Recall = TP / (TP + FN)
• A perfect score is 1.0
• There is an inverse relationship between precision & recall
• F measure (F1 or F-score): harmonic mean of precision and
recall:
F = 2 × precision × recall / (precision + recall)
• Fß: weighted measure of precision and recall:
Fß = (1 + ß²) × precision × recall / (ß² × precision + recall)
– assigns ß times as much weight to recall as to precision
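Plugging in the confusion matrix example from the earlier slide (an added sketch):

TP, FN, FP, TN = 6954, 46, 412, 2588   # buy_computer example

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 0.9542
error_rate  = 1 - accuracy
sensitivity = TP / (TP + FN)                    # recall: about 0.993
specificity = TN / (TN + FP)                    # about 0.863
precision   = TP / (TP + FP)                    # about 0.944
f1 = 2 * precision * sensitivity / (precision + sensitivity)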

Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies
obtained
• Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets,
each approximately equal size
– At i-th iteration, use Di as test set and others as training set
– Leave-one-out: k folds where k = # of tuples, for small sized
data
– *Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
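A small sketch (an addition) of the random partitioning step behind k-fold cross-validation:

import random

def k_fold_indices(n, k=10, seed=0):
    # Randomly partition n tuple indices into k mutually exclusive folds
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# At the i-th iteration, fold i serves as the test set and the rest as training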
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
• Several bootstrap methods exist; a common one is the .632 bootstrap
– A data set with d tuples is sampled d times, with replacement, resulting in
a training set of d samples. The data tuples that did not make it into the
training set end up forming the test set. About 63.2% of the original data
end up in the bootstrap, and the remaining 36.8% form the test set (since
(1 − 1/d)^d ≈ e⁻¹ = 0.368)
– Repeat the sampling procedure k times; the overall accuracy of the model is

Acc(M) = (1/k) Σᵢ₌₁ᵏ (0.632 × Acc(Mᵢ)_test_set + 0.368 × Acc(Mᵢ)_train_set)

Model Selection: ROC Curves
• ROC (Receiver Operating Characteristics) curves: for visual
comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate (vertical
axis) and the false positive rate (horizontal axis)
• The area under the ROC curve is a measure of the accuracy
of the model
• Rank the test tuples in decreasing order: the one that is most
likely to belong to the positive class appears at the top of the list
• The plot also shows a diagonal line; the closer the curve is to it
(i.e., the closer the area is to 0.5), the less accurate is the model
• A model with perfect accuracy will have an area of 1.0
Ensemble Methods: Increasing the Accuracy

• Ensemble methods
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
– Bagging: averaging the prediction over a collection of
classifiers
– Boosting: weighted vote with a collection of classifiers
– Ensemble: combining a set of heterogeneous classifiers

Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
– Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
– A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
– Each classifier Mi returns its class prediction
– The bagged classifier M* counts the votes and assigns the class with the
most votes to X
• Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
• Accuracy
– Often significantly better than a single classifier derived from D
– For noisy data: not considerably worse, more robust
– Proven improved accuracy in prediction
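A schematic sketch (an addition; the base learner producing each Mi is left abstract) of bootstrap sampling and the bagged vote:

import random
from collections import Counter

def bootstrap_sample(data, labels, seed):
    # Sample d tuples with replacement from D
    rnd = random.Random(seed)
    picks = [rnd.randrange(len(data)) for _ in range(len(data))]
    return [data[i] for i in picks], [labels[i] for i in picks]

def bagging_predict(models, x):
    # Each classifier Mi votes; the class with the most votes wins
    return Counter(m(x) for m in models).most_common(1)[0][0]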
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How does boosting work?
– Weights are assigned to each training tuple
– A series of k classifiers is iteratively learned
– After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
– The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data

