Sei sulla pagina 1di 66

Necessity Is the Mother of

Invention
• Data explosion problem

– Automated data collection tools and mature


database technology lead to tremendous
amounts of data accumulated and/or to be
analyzed in databases, data warehouses, and
other information repositories
• We are drowning in data, but starving for
knowledge!
• Solution: Data warehousing and data mining

– Data warehousing and on-line analytical


processing

– Mining interesting knowledge (rules,


regularities, patterns, constraints) from data in
large databases
Why Data Mining?—Potential
Applications

• Data analysis and decision support


– Market analysis and management
• Target marketing, customer relationship
management (CRM), market basket
analysis, cross selling, market
segmentation
– Risk analysis and management
• Forecasting, customer retention, improved
underwriting, quality control, competitive
analysis
– Fraud detection and detection of unusual
patterns (outliers)

• Other Applications
– Text mining (news group, email, documents)
and Web mining
– Stream data mining
– DNA and bio-data analysis
Corporate Analysis & Risk
Management

• Finance planning and asset


evaluation
– cash flow analysis and prediction
– contingent claim analysis to evaluate
assets
– cross-sectional and time series
analysis (financial-ratio, trend
analysis, etc.)
• Resource planning
– summarize and compare the
resources and spending
• Competition
– monitor competitors and market
directions
– group customers into classes and a
class-based pricing procedure
– set pricing strategy in a highly
Data Mining: A KDD
Process

– Data mining— Pattern Evaluation

core of
knowledge
discovery Data Mining
process
Task-relevant Data

Data Selection
Warehouse

Data Cleaning

Data Integration

Databas
es
Steps of a KDD
Process
• Learning the application domain
– relevant prior knowledge and goals of
application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may
take 60% of effort!)
• Data reduction and transformation
– Find useful features, dimensionality/variable
reduction, invariant representation.
• Choosing functions of data mining
– summarization, classification, regression,
association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of
interest
• Pattern evaluation and knowledge
presentation
– visualization, transformation, removing
redundant patterns, etc.
• Use of discovered knowledge
Architecture: Typical Data
Mining System

Graphical user interface

Pattern evaluation

Data mining engine


Knowledge
Database -base
or data
warehouse
Data cleaningserver
& data Filteri
integration ng

Data
Databa Wareho
ses use
Data Mining: On What
Kinds of Data?

• Relational database
• Data warehouse
• Transactional database
• Advanced database and
information repository
– Object-relational database
– Spatial and temporal data
– Time-series data
– Stream data
– Multimedia database
– Heterogeneous and legacy
database
– Text databases & WWW
Data Mining
Functionalities
• Concept description: Characterization and
discrimination
– Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet regions
• Association (correlation and causality)
– Diaper  Beer [0.5%, 75%]
• Classification and Prediction
– Construct models (functions) that describe and
distinguish classes or concepts for future
prediction
• E.g., classify countries based on climate, or
classify cars based on gas mileage
– Presentation: decision-tree, classification rule,
neural network
– Predict some unknown or missing numerical
values
Data Mining
Functionalities (2)
• Cluster analysis
– Class label is unknown: Group data to form new
classes, e.g., cluster houses to find distribution
patterns
– Maximizing intra-class similarity & minimizing
interclass similarity
• Outlier analysis
– Outlier: a data object that does not comply with
the general behavior of the data
– Noise or exception? No! useful in fraud
detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
Data Mining: Confluence of
Multiple Disciplines

Database
Statistics
Systems

Machine
Learning
Data Mining Visualization

Algorithm Other
Disciplines
Major Issues in Data
Mining
• Mining methodology
– Mining different kinds of knowledge from diverse
data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and
scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing
one: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of
abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
Data Mining
Association Rules
Mining Association
Rules in Large
Databases
• Association rule mining
• Mining single-dimensional
Boolean association rules
from transactional databases
• Summary
What Is
Association
Mining?
• Association rule mining:
– Finding frequent patterns, associations,
correlations, or causal structures
among sets of items or objects in
transaction databases, relational
databases, and other information
repositories.
• Applications:
– Basket data analysis, cross-marketing,
catalog design, loss-leader analysis,
clustering, classification, etc.
• Examples.
– Rule form: “Body → Ηead [support,
confidence]”.
– buys(x, “diapers”) → buys(x, “beers”)
[0.5%, 60%]
– major(x, “CS”) ^ takes(x, “DB”) →
grade(x, “A”) [1%, 75%]
Association Rule:
Basic Concepts
• Given: (1) database of transactions,
(2) each transaction is a list of items
(purchased by a customer in a visit)
• Find: all rules that correlate the
presence of one set of items with
that of another set of items
– E.g., 98% of people who purchase tires
and auto accessories also get
automotive services done
• Applications
– * ⇒ Maintenance Agreement (What
the store should do to boost
Maintenance Agreement sales)
– Home Electronics ⇒ * (What other
products should the store stocks up?)
– Attached mailing in direct marketing
– Detecting “ping-pong”ing of patients,
faulty “collisions”
Rule Measures:
Support and
Custo
Confidence
mer Custom • Find all the rules X &
er
buys
both buys Y ⇒ Z with minimum
diaper confidence and
support
– support, s, probability
that a transaction
contains {X  Y  Z}
Customer – confidence, c,
buys beer conditional probability
that a transaction
having {X  Y} also
contains
Let minimum Z
Transaction IDItems Bought
support 50%, and
2000 A,B,C minimum
1000 A,C confidence 50%,
4000 A,D we have
– A ⇒ C (50%,
5000 B,E,F 66.6%)
– C ⇒ A (50%,
100%)
Association Rule
Mining: A Road Map
• Boolean vs. quantitative associations
(Based on the types of values handled)
– buys(x, “SQLServer”) ^ buys(x, “DMBook”) →
buys(x, “DBMiner”) [0.2%, 60%]
– age(x, “30..39”) ^ income(x, “42..48K”) → buys(x,
“PC”) [1%, 75%]
• Single dimension vs. multiple dimensional
associations (see ex. Above)
• Single level vs. multiple-level analysis
– What brands of beers are associated with what
brands of diapers?
• Various extensions
– Correlation, causality analysis
• Association does not necessarily imply correlation
or causality
– Maxpatterns and closed itemsets
– Constraints enforced
• E.g., small sales (sum < 100) trigger big buys (sum
> 1,000)?
Mining Association
Rules in Large
Databases
• Association rule mining
• Mining single-dimensional
Boolean association rules
from transactional databases
• Summary
Mining Association
Min. support 50%
Rules—An Min.
Example
confidence 50%

Transaction ID Items Bought


2000 A,B,C
1000 A,C
4000 A,D
Frequent Itemset Support
5000 B,E,F
{A} 75%
{B} 50%
{C} 50%
For rule A ⇒ C: {A,C} 50%
support = support({A C}) = 50%
confidence = support({A C})/support({A})
= 66.6%
The Apriori principle:
Any subset of a frequent itemset must
be frequent
Mining Frequent
Itemsets: the Key
Step
• Find the frequent itemsets: the
sets of items that have minimum
support
– A subset of a frequent itemset must
also be a frequent itemset
• i.e., if {AB} is a frequent itemset, both {A}
and {B} should be a frequent itemset
– Iteratively find frequent itemsets with
cardinality from 1 to k (k-itemset)
• Use the frequent itemsets to
generate association rules.
The Apriori Algorithm

• Join Step: C is generated by joining


k

Lk-1with itself
• Prune Step: Any (k-1)-itemset that is
not frequent cannot be a subset of a
frequent k-itemset
• Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk !=∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates
in Ck+1 that are
contained in t
Lk+1 = candidates in Ck+1 with
min_support
end
return ∪k Lk;
The Apriori Algorithm
— Example
Database D itemset sup.
L1 itemset sup.
TID Items C1 {1} 2 {1} 2
100 1 3 4 {2} 3 {2} 3
Scan D
200 2 3 5 {3} 3 {3} 3
300 1 2 3 5 {4} 1 {5} 3
400 2 5 {5} 3
C2
itemset sup C2 itemset
L2 itemset sup {1 2} 1 Scan D {1 2}
{1 3} 2 {1 3} 2 {1 3}
{2 3} 2 {1 5} 1 {1 5}
{2 3} 2 {2 3}
{2 5} 3
{2 5} 3 {2 5}
{3 5} 2
{3 5} 2 {3 5}
C3 itemset L3 itemset sup
Scan D
{2 3 5} {2 3 5} 2
How to Generate
Candidates?
• Suppose the items in Lk-1 are listed
in an order
• Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2,
p.itemk-1 < q.itemk-1

• Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
How to Count
Supports of
Candidates?
• Why counting supports of
candidates a problem?
– The total number of candidates can
be very huge
– One transaction may contain many
candidates
• Method:
– Candidate itemsets are stored in a
hash-tree
– Leaf node of hash-tree contains a list
of itemsets and counts
– Interior node contains a hash table
– Subset function: finds all the
candidates contained in a transaction
Example of Generating
Candidates

• L3={abc, abd, acd, ace, bcd}

• Self-joining: L3*L3
– abcd from abc and abd
– acde from acd and ace

• Pruning:
– acde is removed because ade is not in
L3

• C4={abcd}
Methods to Improve
Apriori’s Efficiency

• Hash-based itemset counting: A k-itemset


whose corresponding hashing bucket count is
below the threshold cannot be frequent
• Transaction reduction: A transaction that does
not contain any frequent k-itemset is useless in
subsequent scans
• Partitioning: Any itemset that is potentially
frequent in DB must be frequent in at least one
of the partitions of DB
• Sampling: mining on a subset of given data,
lower support threshold + a method to determine
the completeness
• Dynamic itemset counting: add new candidate
itemsets only when all of their subsets are
estimated to be frequent
Is Apriori Fast Enough?
— Performance
Bottlenecks

• The core of the Apriori algorithm:


– Use frequent (k – 1)-itemsets to generate
candidate frequent k-itemsets
– Use database scan and pattern matching to
collect counts for the candidate itemsets
• The bottleneck of Apriori: candidate
generation
– Huge candidate sets:
• 104 frequent 1-itemset will generate 107
candidate 2-itemsets
• To discover a frequent pattern of size 100,
e.g., {a1, a2, …, a100}, one needs to generate
2100 ≈ 1030 candidates.
– Multiple scans of database:
• Needs (n +1 ) scans, n is the length of the
longest pattern
Chapter 6: Mining
Association Rules
in Large Databases
• Association rule mining
• Mining single-dimensional
Boolean association rules
from transactional databases
• Summary
Chapter 7.
Classification and
Prediction
• What is classification? What is
prediction?
• Issues regarding classification
and prediction
• Classification by decision tree
induction
• Bayesian Classification
• Other Classification Methods
• Prediction
Classification vs.
Prediction
• Classification:
– predicts categorical class labels
– classifies data (constructs a model)
based on the training set and the
values (class labels) in a classifying
attribute and uses it in classifying new
data
• Prediction:
– models continuous-valued functions,
i.e., predicts unknown or missing
values
• Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis
Classification—A Two-
Step Process

• Model construction: describing a set of


predetermined classes
– Each tuple/sample is assumed to belong to a
predefined class, as determined by the class
label attribute
– The set of tuples used for model construction:
training set
– The model is represented as classification
rules, decision trees, or mathematical
formulae
• Model usage: for classifying future or
unknown objects
– Estimate accuracy of the model
• The known label of test sample is
compared with the classified result from
the model
• Accuracy rate is the percentage of test set
samples that are correctly classified by
the model
• Test set is independent of training set,
otherwise over-fitting will occur
Classification
Process (1): Model
Construction

Classification
Algorithms
Trainin
g
Data

AM E RANK YEARS TENURE


Classifier
(Model)
ike Assistant Prof 3 no
ary Assistant Prof 7 yes
ill Professor 2 yes
m Associate Prof 7 yes
ave Assistant Prof 6IF rank = ‘professo
no
OR years > 6
nne Associate Prof 3THEN tenured no= ‘y
Classification
Process (2): Use the
Model in Prediction

Classifier

Testing
Data Unseen Data

(Jeff, Professor, 4)

NAME RANK YEARS TENURED


T om A ssistant P rof 2
Tenured?no
M erlisa A ssociate P rof 7 no
G eorge P rofesso r 5 yes
Joseph A ssistant P rof 7 yes
Supervised vs.
Unsupervised
Learning
• Supervised learning (classification)
– Supervision: The training data
(observations, measurements, etc.)
are accompanied by labels indicating
the class of the observations
– New data is classified based on the
training set
• Unsupervised learning (clustering)
– The class labels of training data is
unknown
– Given a set of measurements,
observations, etc. with the aim of
establishing the existence of classes
or clusters in the data
Chapter 7.
Classification and
Prediction
• What is classification? What is
prediction?
• Issues regarding classification
and prediction
• Classification by decision tree
induction
• Bayesian Classification
• Other Classification Methods
• Prediction
classification and
prediction (1): Data
Preparation

• Data cleaning
– Preprocess data in order to reduce
noise and handle missing values
• Relevance analysis (feature
selection)
– Remove the irrelevant or redundant
attributes
• Data transformation
– Generalize and/or normalize data
classification and prediction
(2): Evaluating
Classification Methods

• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Scalability
– efficiency in disk-resident databases
• Interpretability:
– understanding and insight provded
by the model
• Goodness of rules
– decision tree size
– compactness of classification rules
Chapter 7.
Classification and
Prediction
• What is classification? What is
prediction?
• Issues regarding classification
and prediction
• Classification by decision tree
induction
• Bayesian Classification
• Other Classification Methods
• Prediction
Classification by
Decision Tree
Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class
distribution
• Decision tree generation consists of two
phases
– Tree construction
• At start, all the training examples are at the
root
• Partition examples recursively based on
selected attributes
– Tree pruning
• Identify and remove branches that reflect
noise or outliers
• Use of decision tree: Classifying an
unknown sample
– Test the attribute values of the sample against
the decision tree
Training
Dataset

age income student credit_rating


This <=30 high no fair
follow <=30 high no excellent
s an 31…40 high no fair
exam
>40 medium no fair
ple
>40 low yes fair
from
Quinl >40 low yes excellent
an’s 31…40 low yes excellent
ID3 <=30 medium no fair
<=30 low yes fair
>40 medium yes fair
<=30 medium yes excellent
31…40 medium no excellent
31…40 high yes fair
>40 medium no excellent
Output: A Decision Tree for
“buys_computer”

age?

<=30 overcast
30..40 >40

student? yes credit rating?

no yes excellent fair

no yes no yes
Algorithm for Decision
Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive
divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued,
they are discretized in advance)
– Examples are partitioned recursively based on
selected attributes
– Test attributes are selected on the basis of a
heuristic or statistical measure (e.g.,
information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the
same class
– There are no remaining attributes for further
partitioning – majority voting is employed for
classifying the leaf
– There are no samples left
Attribute
Selection
Measure
• Information gain (ID3/C4.5)
– All attributes are assumed to be
categorical
– Can be modified for continuous-valued
attributes
• Gini index (IBM IntelligentMiner)
– All attributes are assumed continuous-
valued
– Assume there exist several possible split
values for each attribute
– May need other tools, such as clustering,
to get the possible split values
– Can be modified for categorical
attributes
Information Gain
(ID3/C4.5)

• Select the attribute with the highest


information gain
• Assume there are two classes, P
and N
– Let the set of examples S contain p
elements of class P and n elements of
class N
– The amount of information, needed to
decide if an arbitrary example in S
belongs to P or N is defined as

p p n n
I ( p, n) = − log 2 − log 2
p+n p+n p+n p+n
Information Gain in
Decision Tree
Induction
• Assume that using attribute A a set
S will be partitioned into sets {S1,
S2 , …, Sv}
– If Si contains pi examples of P and ni
examples of N, the entropy, or the
expected information needed to
classify objects in all subtrees Si is
ν p +n
E ( A) = ∑ i i
I ( pi , ni )
i =1 p + n
• The encoding information that
would be gained by branching on
A
Gain( A) = I ( p, n) − E ( A)
Attribute Selection by
Information Gain
Computation
Class P: 5 4
buys_computer = E (age) = I (2,3) + I (4,0)
14 14
“yes”
5
Class N: + I (3,2) = 0.69
14
buys_computer = Hence
“no”
I(p, n) = I(9, 5)
=0.940 Similarly
Compute the Gain(age) = I ( p, n) − E (age)
entropy for age:
age pi ni I(pi, ni)
<=30 2 3 0.971
30…40 4 0 0
>40 Gain
3 (income ) = 0.029
2 0.971
Gain( student ) = 0.151
Gain(credit _ rating ) = 0.048
Extracting Classification
Rules from Trees

• Represent the knowledge in the form of IF-THEN


rules
• One rule is created for each path from the root to a
leaf
• Each attribute-value pair along a path forms a
conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
IF age = “<=30” AND student = “no” THEN
buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN
buys_computer = “yes”
IF age = “31…40” THEN
buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent”
THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN
buys_computer = “no”
Avoid Overfitting in
Classification
• The generated tree may overfit the
training data
– Too many branches, some may
reflect anomalies due to noise or
outliers
– Result is in poor accuracy for unseen
samples
• Two approaches to avoid
overfitting
– Prepruning: Halt tree construction
early—do not split a node if this
would result in the goodness
measure falling below a threshold
• Difficult to choose an appropriate
threshold
– Postpruning: Remove branches from
a “fully grown” tree—get a sequence
of progressively pruned trees
• Use a set of data different from the
training data to decide which is the
“best pruned tree”
Enhancements to
basic decision tree
induction
• Allow for continuous-valued
attributes
– Dynamically define new discrete-valued
attributes that partition the continuous
attribute value into a discrete set of
intervals
• Handle missing attribute values
– Assign the most common value of the
attribute
– Assign probability to each of the
possible values
• Attribute construction
– Create new attributes based on existing
ones that are sparsely represented
– This reduces fragmentation, repetition,
and replication
Chapter 7.
Classification and
Prediction
• What is classification? What
is prediction?
• Issues regarding
classification and prediction
• Classification by decision tree
induction
• Bayesian Classification
• Other Classification Methods
• Prediction
Bayesian
Classification: Why?
• Probabilistic learning: Calculate explicit
probabilities for hypothesis, among the most
practical approaches to certain types of
learning problems
• Incremental: Each training example can
incrementally increase/decrease the
probability that a hypothesis is correct.
Prior knowledge can be combined with
observed data.
• Probabilistic prediction: Predict multiple
hypotheses, weighted by their probabilities
• Standard: Even when Bayesian methods
are computationally intractable, they can
provide a standard of optimal decision
making against which other methods can be
measured
Bayesian
Theorem

• Given training data D, posteriori


probability of a hypothesis h, P(h|
D) follows the Bayes theorem
P(h | D) = P(D | h)P(h)
P(D)
• MAP (maximum posteriori)
hypothesis
h ≡ arg max P(h | D) = arg max P(D | h)P(h).
MAP h∈ H h∈ H
• Practical difficulty: require initial
knowledge of many probabilities,
significant computational cost
Naïve Bayes Classifier
(I)
• A simplified assumption:
attributes are conditionally
independent:

n
P( C j |V ) ∞ P( C j )∏ P( v i | C j )
i =1
• Greatly reduces the
computation cost, only count
the class distribution.
Naive Bayesian
Classifier (II)

• Given a training set, we can


compute the probabilities

O utlook P N H um idity P N
sunny 2/9 3/5 high 3/9 4/5
overcast 4/9 0 norm al 6/9 1/5
rain 3/9 2/5
Tem preature W indy
hot 2/9 2/5 true 3/9 3/5
m ild 4/9 2/5 false 6/9 2/5
cool 3/9 1/5
Bayesian classification

• The classification problem may


be formalized using a-posteriori
probabilities:
• P(C|X) = prob. that the
sample tuple
X=<x1,…,xk> is of class C.

• E.g. P(class=N |
outlook=sunny,windy=true,…)

• Idea: assign to sample X the


class label C such that P(C|X) is
maximal
Estimating a-posteriori
probabilities

• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative freq of class C samples
• C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
• Problem: computing P(X|C) is unfeasible!
Naïve Bayesian
Classification

• Naïve assumption: attribute


independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
• If i-th attribute is categorical:
P(xi|C) is estimated as the
relative freq of samples having
value xi as i-th attribute in
class C
• If i-th attribute is continuous:
P(xi|C) is estimated thru a
Gaussian density function
• Computationally easy in both
cases
Play-tennis example:
estimating P(xi|C)
outlook
Outlook Temperature Humidity Windy Class P(sunny|p) = P(sunny|n) =
sunny hot high false N
2/9 3/5
sunny hot high true N
overcast hot high false P P(overcast|p) P(overcast|n)
rain mild high false P = 4/9 =0
rain cool normal false P P(rain|p) = 3/9 P(rain|n) = 2/5
rain cool normal true N
overcast cool normal true P temperature
sunny mild high false N P(hot|p) = 2/9 P(hot|n) = 2/5
sunny cool normal false P
rain mild normal false P P(mild|p) = 4/9 P(mild|n) = 2/5
sunny mild normal true P P(cool|p) = 3/9 P(cool|n) = 1/5
overcast mild high true P
humidity
overcast hot normal false P
rain mild high true N P(high|p) = P(high|n) =
3/9 4/5
P(normal|p) = P(normal|n) =
6/9 2/5
P(p) =
windy
9/14
P(true|p) = 3/9 P(true|n) = 3/5
P(n) =
P(false|p) = P(false|n) =
5/14 6/9 2/5
Play-tennis example:
classifying X
• An unseen sample X = <rain,
hot, high, false>

• P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|
p)·P(false|p)·P(p) =
3/9·2/9·3/9·6/9·9/14 =
0.010582
• P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|
n)·P(false|n)·P(n) =
2/5·2/5·4/5·2/5·5/14 =
0.018286

• Sample X is classified in class


n (don’t play)
Chapter 7.
Classification and
Prediction
• What is classification? What is
prediction?
• Issues regarding classification
and prediction
• Classification by decision tree
induction
• Bayesian Classification
• Other Classification Methods
• Prediction
Other Classification
Methods
• k-nearest neighbor classifier
• case-based reasoning
• Genetic algorithm
• Rough set approach
• Fuzzy set approaches
The k-Nearest
Neighbor Algorithm
• All instances correspond to points in the n-D
space.
• The nearest neighbor are defined in terms of
Euclidean distance.
• The target function could be discrete- or real-
valued.
• For discrete-valued, the k-NN returns the most
common value among the k training examples
nearest to xq.
• Vonoroi diagram: the decision surface induced
by 1-NN for a typical set of training examples.

_
_ _
.
+ _
+
. .
_ .
xq + .
_
+ .
Discussion on the k-
NN Algorithm
• The k-NN algorithm for continuous-valued target
functions
– Calculate the mean values of the k nearest
neighbors
• Distance-weighted nearest neighbor algorithm
– Weight the contribution of each of the k
neighbors according to their distance to the
query point xq
• giving greater weight to closer neighbors
– Similarly, for real-valued target functions
• Robust to noisy data by averaging k-nearest
neighbors
• Curse of dimensionality: distance between
neighbors could be dominated by irrelevant
attributes.
– To overcome it, axes stretch or elimination of
the least relevant attributes.

w≡ 1
d ( xq , xi )2
Remarks on Lazy vs.
Eager Learning
• Instance-based learning: lazy evaluation
• Decision-tree and Bayesian classification:
eager evaluation
• Key differences
– Lazy method may consider query instance xq
when deciding how to generalize beyond the
training data D
– Eager method cannot since they have already
chosen global approximation when seeing the
query
• Efficiency: Lazy - less time training but
more time predicting
• Accuracy
– Lazy method effectively uses a richer hypothesis
space since it uses many local linear functions to
form its implicit global approximation to the target
function
– Eager: must commit to a single hypothesis that
covers the entire instance space
Chapter 7.
Classification and
Prediction
• What is classification? What is
prediction?
• Issues regarding classification
and prediction
• Classification by decision tree
induction
• Bayesian Classification
• Other Classification Methods
• Prediction
What Is
Prediction?

• Prediction is similar to
classification
– First, construct a model
– Second, use model to predict
unknown value
• Major method for prediction is
regression
– Linear and multiple regression
– Non-linear regression
• Prediction is different from
classification
– Classification refers to predict
categorical class label
– Prediction models continuous-valued
functions