
Motivation: "Necessity Is the Mother of Invention"

• Data explosion problem
  – Mature database technology leads to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, and other information repositories
• We are drowning in data, but starving for knowledge!
• Solution: data warehousing and data mining
  – Data warehousing and on-line analytical processing
  – Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Why Data Mining?—Potential Applications

– Market analysis and management
  • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
– Risk analysis and management
  • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
– Fraud detection and detection of unusual patterns (outliers)
• Other applications
  – Text mining (news groups, email, documents) and Web mining
  – Stream data mining
  – DNA and bio-data analysis

Corporate Analysis & Risk Management

• Finance planning and asset evaluation
  – cash flow analysis and prediction
  – contingent claim analysis to evaluate assets
  – cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
• Resource planning
  – summarize and compare the resources and spending
• Competition
  – monitor competitors and market directions
  – group customers into classes and a class-based pricing procedure
  – set pricing strategy in a highly competitive market

Data Mining: A KDD Process

• Data mining: the core of the knowledge discovery process
• [Figure: KDD process flow. Databases → data cleaning and data integration → Data Warehouse → selection of task-relevant data → Data Mining]

Steps of a KDD Process

• Learning the application domain
  – relevant prior knowledge and goals of the application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of the effort!)
• Data reduction and transformation
  – Find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining
  – summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
  – visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge

Architecture: Typical Data Mining System

• [Figure: typical data mining system architecture. Databases and a data warehouse feed a database or data warehouse server through data cleaning, data integration, and filtering; mined patterns are checked by a pattern evaluation module backed by a knowledge base]

Data Mining: On What Kinds of Data?

• Relational databases
• Data warehouses
• Transactional databases
• Advanced databases and information repositories
  – Object-relational databases
  – Spatial and temporal data
  – Time-series data
  – Stream data
  – Multimedia databases
  – Heterogeneous and legacy databases
  – Text databases and the WWW

Data Mining Functionalities

• Concept description: characterization and discrimination
  – Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Association (correlation and causality)
  – Diaper → Beer [0.5%, 75%]
• Classification and prediction
  – Construct models (functions) that describe and distinguish classes or concepts for future prediction
    • E.g., classify countries based on climate, or classify cars based on gas mileage
  – Presentation: decision tree, classification rules, neural network
  – Predict some unknown or missing numerical values

Data Mining Functionalities (2)

• Cluster analysis
  – Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  – Maximize intra-class similarity and minimize inter-class similarity
• Outlier analysis
  – Outlier: a data object that does not comply with the general behavior of the data
  – Noise or exception? No! Useful in fraud detection and rare-events analysis
• Trend and evolution analysis
  – Trend and deviation: regression analysis
  – Sequential pattern mining, periodicity analysis
  – Similarity-based analysis
• Other pattern-directed or statistical analyses

Data Mining: Confluence of Multiple Disciplines

• [Figure: data mining at the confluence of database systems, statistics, machine learning, visualization, algorithms, and other disciplines]

Major Issues in Data Mining

• Mining methodology
  – Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  – Performance: efficiency, effectiveness, and scalability
  – Pattern evaluation: the interestingness problem
  – Incorporation of background knowledge
  – Handling noise and incomplete data
  – Parallel, distributed, and incremental mining methods
  – Integration of the discovered knowledge with existing knowledge: knowledge fusion
• User interaction
  – Data mining query languages and ad-hoc mining
  – Expression and visualization of data mining results
  – Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
  – Domain-specific data mining and invisible data mining
  – Protection of data security, integrity, and privacy

Data Mining: Association Rules

Mining Association Rules in Large Databases

• Association rule mining
• Mining single-dimensional Boolean association rules from transactional databases
• Summary

What Is Association Mining?

• Association rule mining:
  – Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
• Applications:
  – Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
• Examples:
  – Rule form: “Body → Head [support, confidence]”
  – buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
  – major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%, 75%]

Association Rule: Basic Concepts

• Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
  – E.g., 98% of people who purchase tires and auto accessories also get automotive services done
• Applications
  – * ⇒ Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
  – Home Electronics ⇒ * (What other products should the store stock up on?)
  – Attached mailing in direct marketing
  – Detecting “ping-pong”ing of patients, faulty “collisions”

Rule Measures: Support and Confidence

• [Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
• Find all the rules X & Y ⇒ Z with minimum confidence and support
  – support, s: probability that a transaction contains {X, Y, Z}
  – confidence, c: conditional probability that a transaction having {X, Y} also contains Z

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

• Let minimum support = 50% and minimum confidence = 50%; then we have
  – A ⇒ C (50%, 66.6%)
  – C ⇒ A (50%, 100%)
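These two measures are easy to verify directly on the four-transaction table above. The following Python sketch (illustrative only; the function names are ours, not from the slides) reproduces the numbers:

    transactions = {
        2000: {"A", "B", "C"},
        1000: {"A", "C"},
        4000: {"A", "D"},
        5000: {"B", "E", "F"},
    }

    def support(itemset):
        """Fraction of transactions that contain every item in `itemset`."""
        itemset = set(itemset)
        hits = sum(1 for items in transactions.values() if itemset <= items)
        return hits / len(transactions)

    def confidence(lhs, rhs):
        """Conditional probability that a transaction with `lhs` also contains `rhs`."""
        return support(set(lhs) | set(rhs)) / support(lhs)

    print(support({"A", "C"}))       # 0.5   -> support of A => C is 50%
    print(confidence({"A"}, {"C"}))  # 0.666 -> confidence of A => C is 66.6%
    print(confidence({"C"}, {"A"}))  # 1.0   -> confidence of C => A is 100%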

Association Rule Mining: A Road Map

• Boolean vs. quantitative associations (based on the types of values handled)
  – buys(x, “SQLServer”) ^ buys(x, “DMBook”) → buys(x, “DBMiner”) [0.2%, 60%]
  – age(x, “30..39”) ^ income(x, “42..48K”) → buys(x, “PC”) [1%, 75%]
• Single-dimensional vs. multi-dimensional associations (see the examples above)
• Single-level vs. multiple-level analysis
  – What brands of beers are associated with what brands of diapers?
• Various extensions
  – Correlation, causality analysis
    • Association does not necessarily imply correlation or causality
  – Max-patterns and closed itemsets
  – Constraints enforced
    • E.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?

Mining Association Rules in Large Databases

• Association rule mining
• Mining single-dimensional Boolean association rules from transactional databases
• Summary

Mining Association Rules—An Example

Min. support 50%, min. confidence 50%

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

Frequent Itemset | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A, C}           | 50%

For rule A ⇒ C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.

Mining Frequent Itemsets: the Key Step

• Find the frequent itemsets: the sets of items that have minimum support
  – A subset of a frequent itemset must also be a frequent itemset
    • i.e., if {A B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  – Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
• Use the frequent itemsets to generate association rules

The Apriori Algorithm

• Join step: Ck is generated by joining Lk-1 with itself
• Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:

    Ck: candidate itemset of size k
    Lk: frequent itemset of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
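The pseudo-code above translates almost line for line into the following Python sketch (a hedged illustration, not the slides' reference implementation; names such as `apriori` are ours):

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return {itemset: count} for all itemsets whose support >= min_support."""
        required = min_support * len(transactions)
        # L1: frequent 1-itemsets
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        L = {s: c for s, c in counts.items() if c >= required}
        frequent = dict(L)
        k = 1
        while L:
            # Join step: merge frequent k-itemsets that share k-1 items
            candidates = set()
            items = list(L)
            for i in range(len(items)):
                for j in range(i + 1, len(items)):
                    union = items[i] | items[j]
                    # Prune step: every k-subset of a candidate must be frequent
                    if len(union) == k + 1 and all(
                        frozenset(sub) in L for sub in combinations(union, k)
                    ):
                        candidates.add(union)
            # One scan of the database counts all surviving candidates
            counts = {c: 0 for c in candidates}
            for t in transactions:
                t = set(t)
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            L = {s: c for s, c in counts.items() if c >= required}
            frequent.update(L)
            k += 1
        return frequent

    transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
    print(apriori(transactions, 0.5))  # frequent itemsets: {A}, {B}, {C}, {A, C}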

The Apriori Algorithm — Example

Database D
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1: {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1: {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 counts: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (from L2): {2 3 5}
Scan D → L3: {2 3 5}: 2

How to Generate Candidates?

• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

• Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck

How to Count Supports of Candidates?

• Why is counting supports of candidates a problem?
  – The total number of candidates can be very huge
  – One transaction may contain many candidates
• Method:
  – Candidate itemsets are stored in a hash-tree
  – A leaf node of the hash-tree contains a list of itemsets and counts
  – An interior node contains a hash table
  – Subset function: finds all the candidates contained in a transaction
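The hash-tree is an optimization; the core idea of the subset function can be sketched more simply with a hash set: enumerate the k-subsets of each transaction and look them up among the candidates. The sketch below (illustrative only; it does not build an actual hash-tree) reproduces the C2 counts from the Apriori example above:

    from itertools import combinations

    def count_candidates(transactions, candidates, k):
        """Count, for each candidate k-itemset, how many transactions contain it."""
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for subset in combinations(sorted(t), k):
                key = frozenset(subset)
                if key in counts:
                    counts[key] += 1
        return counts

    D = [{"1", "3", "4"}, {"2", "3", "5"}, {"1", "2", "3", "5"}, {"2", "5"}]
    C2 = {frozenset(p) for p in [("1", "2"), ("1", "3"), ("1", "5"),
                                 ("2", "3"), ("2", "5"), ("3", "5")]}
    print(count_candidates(D, C2, 2))  # {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2, ...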

Example of Generating Candidates

• Self-joining: L3 * L3
  – abcd from abc and abd
  – acde from acd and ace
• Pruning:
  – acde is removed because ade is not in L3
• C4 = {abcd}

Methods to Improve Apriori’s Efficiency

• Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
• Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
• Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Sampling: mining on a subset of the given data, with a lower support threshold plus a method to determine the completeness
• Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent

Is Apriori Fast Enough? — Performance Bottlenecks

• The core of the Apriori algorithm:
  – Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  – Use database scans and pattern matching to collect counts for the candidate itemsets
• The bottleneck of Apriori: candidate generation
  – Huge candidate sets:
    • 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
    • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
  – Multiple scans of the database:
    • Needs (n + 1) scans, where n is the length of the longest pattern

Chapter 6: Mining Association Rules in Large Databases

• Association rule mining
• Mining single-dimensional Boolean association rules from transactional databases
• Summary

Chapter 7. Classification and Prediction

• What is classification? What is prediction?
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian Classification
• Other Classification Methods
• Prediction

Classification vs. Prediction

• Classification:
  – predicts categorical class labels
  – classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
• Prediction:
  – models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
  – credit approval
  – target marketing
  – medical diagnosis
  – treatment effectiveness analysis

Classification—A Two-Step Process

• Model construction: describing a set of predetermined classes
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  – The set of tuples used for model construction is the training set
  – The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
  – Estimate the accuracy of the model
    • The known label of each test sample is compared with the classified result from the model
    • Accuracy rate is the percentage of test set samples that are correctly classified by the model
    • The test set is independent of the training set; otherwise over-fitting will occur

Classification Process (1): Model Construction

Training data (NAME, RANK, YEARS, TENURED) is fed to a classification algorithm, which produces a classifier (model):

NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

Resulting model:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’

Classification Process (2): Use the Model in Prediction

The classifier is applied to testing data and then to unseen data:

Testing data (NAME, RANK, YEARS, TENURED)
Tom     | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George  | Professor      | 5 | yes
Joseph  | Assistant Prof | 7 | yes

Unseen data: (Jeff, Professor, 4) → Tenured?
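The two steps can be played through in a few lines of Python: apply the rule model from the previous slide to the testing data to estimate accuracy, then classify the unseen tuple. (A sketch using the slides' toy data; the function name is ours.)

    def tenured(rank, years):
        """Rule model from the construction step."""
        return "yes" if rank == "Professor" or years > 6 else "no"

    # Testing data: (name, rank, years, actual tenured label)
    test_set = [
        ("Tom", "Assistant Prof", 2, "no"),
        ("Merlisa", "Associate Prof", 7, "no"),
        ("George", "Professor", 5, "yes"),
        ("Joseph", "Assistant Prof", 7, "yes"),
    ]

    correct = sum(1 for _, rank, years, label in test_set if tenured(rank, years) == label)
    print(correct / len(test_set))   # accuracy on the test set (0.75 here)

    print(tenured("Professor", 4))   # unseen sample (Jeff, Professor, 4) -> 'yes'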

Supervised vs. Unsupervised Learning

• Supervised learning (classification)
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Chapter 7. Classification and Prediction

• What is classification? What is prediction?
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian Classification
• Other Classification Methods
• Prediction

Issues regarding classification and prediction (1): Data Preparation

• Data cleaning
  – Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
  – Remove irrelevant or redundant attributes
• Data transformation
  – Generalize and/or normalize data

Issues regarding classification and prediction (2): Evaluating Classification Methods

• Predictive accuracy
• Speed and scalability
  – time to construct the model
  – time to use the model
• Robustness
  – handling noise and missing values
• Scalability
  – efficiency in disk-resident databases
• Interpretability
  – understanding and insight provided by the model
• Goodness of rules
  – decision tree size
  – compactness of classification rules

Chapter 7. Classification and Prediction

• What is classification? What is prediction?
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian Classification
• Other Classification Methods
• Prediction

Classification by Decision Tree Induction

• Decision tree
  – A flow-chart-like tree structure
  – An internal node denotes a test on an attribute
  – A branch represents an outcome of the test
  – Leaf nodes represent class labels or class distributions
• Decision tree generation consists of two phases
  – Tree construction
    • At the start, all the training examples are at the root
    • Partition the examples recursively based on selected attributes
  – Tree pruning
    • Identify and remove branches that reflect noise or outliers
• Use of a decision tree: classifying an unknown sample
  – Test the attribute values of the sample against the decision tree

Training Dataset

This follows an example from Quinlan's ID3. Each sample is described by the attributes age, income, student, and credit_rating; the class label attribute is buys_computer:

age    | income | student | credit_rating
<=30   | high   | no      | fair
<=30   | high   | no      | excellent
31…40  | high   | no      | fair
>40    | medium | no      | fair
>40    | low    | yes     | fair
>40    | low    | yes     | excellent
31…40  | low    | yes     | excellent
<=30   | medium | no      | fair
<=30   | low    | yes     | fair
>40    | medium | yes     | fair
<=30   | medium | yes     | excellent
31…40  | medium | no      | excellent
31…40  | high   | yes     | fair
>40    | medium | no      | excellent

Output: A Decision Tree for “buys_computer”

• [Figure: decision tree whose root node tests age?, with branches <=30, 31…40, and >40 leading to yes/no leaf nodes]

Algorithm for Decision Tree Induction

• Basic algorithm (a greedy algorithm)
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (if continuous-valued, they are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  – There are no samples left
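The greedy, top-down procedure above can be sketched compactly in Python. This is an illustrative ID3-style sketch (function and variable names are ours), using information gain as the selection measure described on the next slides:

    import math
    from collections import Counter

    def entropy(labels):
        """I(p, n), generalized to any number of classes."""
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        """Gain(A) = I(p, n) - E(A) for attribute `attr`."""
        expected = 0.0
        for value in set(r[attr] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[attr] == value]
            expected += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - expected

    def build_tree(rows, labels, attributes):
        if len(set(labels)) == 1:            # all samples belong to the same class
            return labels[0]
        if not attributes:                   # no attributes left: majority voting
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: info_gain(rows, labels, a))
        tree = {best: {}}
        remaining = [a for a in attributes if a != best]
        for value in set(r[best] for r in rows):
            idx = [i for i, r in enumerate(rows) if r[best] == value]
            tree[best][value] = build_tree([rows[i] for i in idx],
                                           [labels[i] for i in idx], remaining)
        return tree

Each row would be a dict such as {"age": "<=30", "income": "high", "student": "no", "credit_rating": "fair"}; calling build_tree(rows, labels, ["age", "income", "student", "credit_rating"]) returns a nested dict representing the tree.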

Attribute Selection Measure

• Information gain (ID3/C4.5)
  – All attributes are assumed to be categorical
  – Can be modified for continuous-valued attributes
• Gini index (IBM IntelligentMiner)
  – All attributes are assumed to be continuous-valued
  – Assumes there exist several possible split values for each attribute
  – May need other tools, such as clustering, to get the possible split values
  – Can be modified for categorical attributes

Information Gain (ID3/C4.5)

• Select the attribute with the highest information gain
• Assume there are two classes, P and N
  – Let the set of examples S contain p elements of class P and n elements of class N
  – The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as

    I(p, n) = −(p / (p + n)) · log2(p / (p + n)) − (n / (p + n)) · log2(n / (p + n))

Information Gain in Decision Tree Induction

• Assume that using attribute A, a set S will be partitioned into sets {S1, S2, …, Sv}
  – If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is

    E(A) = Σ (i = 1 … v) [ (pi + ni) / (p + n) ] · I(pi, ni)

• The encoding information that would be gained by branching on A:

    Gain(A) = I(p, n) − E(A)

Attribute Selection by Information Gain Computation

• Class P: buys_computer = “yes”
• Class N: buys_computer = “no”
• I(p, n) = I(9, 5) = 0.940
• Compute the entropy for age:

age    | pi | ni | I(pi, ni)
<=30   | 2  | 3  | 0.971
30…40  | 4  | 0  | 0
>40    | 3  | 2  | 0.971

E(age) = (5/14)·I(2, 3) + (4/14)·I(4, 0) + (5/14)·I(3, 2) = 0.69

Hence Gain(age) = I(p, n) − E(age)

Similarly:
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
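The arithmetic on this slide is easy to reproduce; the short sketch below (illustrative, using the slide's counts) checks I(9, 5), E(age), and the resulting Gain(age):

    import math

    def I(p, n):
        """Expected information I(p, n) for p positive and n negative examples."""
        def term(x):
            return 0.0 if x == 0 else -(x / (p + n)) * math.log2(x / (p + n))
        return term(p) + term(n)

    p, n = 9, 5                       # buys_computer = "yes" / "no"
    print(round(I(p, n), 3))          # 0.940

    # age partitions: (<=30: 2 yes, 3 no), (30...40: 4, 0), (>40: 3, 2)
    E_age = sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in [(2, 3), (4, 0), (3, 2)])
    print(round(E_age, 2))            # 0.69
    print(round(I(p, n) - E_age, 2))  # Gain(age) = 0.25, the largest of the four gains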

Extracting Classification Rules from Trees

• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example

    IF age = “<=30” AND student = “no” THEN buys_computer = “no”
    IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
    IF age = “31…40” THEN buys_computer = “yes”
    IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
    IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
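Because each rule is a conjunction over a root-to-leaf path, the rule set translates directly into ordinary conditional code. A sketch (a plain transcription of the rules listed above, not a learned model):

    def buys_computer(age, student, credit_rating):
        """Apply the extracted IF-THEN rules; returns 'yes', 'no', or None if no rule fires."""
        if age == "<=30" and student == "no":
            return "no"
        if age == "<=30" and student == "yes":
            return "yes"
        if age == "31...40":
            return "yes"
        if age == ">40" and credit_rating == "excellent":
            return "yes"
        if age == ">40" and credit_rating == "fair":
            return "no"
        return None

    print(buys_computer("<=30", "yes", "fair"))   # yes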

Avoid Overfitting in Classification

• The generated tree may overfit the training data
  – Too many branches, some of which may reflect anomalies due to noise or outliers
  – The result is poor accuracy for unseen samples
• Two approaches to avoid overfitting
  – Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    • Difficult to choose an appropriate threshold
  – Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
    • Use a set of data different from the training data to decide which is the “best pruned tree”

Enhancements to Basic Decision Tree Induction

• Allow for continuous-valued attributes
  – Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
• Handle missing attribute values
  – Assign the most common value of the attribute
  – Assign a probability to each of the possible values
• Attribute construction
  – Create new attributes based on existing ones that are sparsely represented
  – This reduces fragmentation, repetition, and replication

Chapter 7. Classification and Prediction

• What is classification? What is prediction?
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian Classification
• Other Classification Methods
• Prediction

Bayesian Classification: Why?

• Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
• Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
• Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
• Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem

• Given training data D, the posteriori probability of a hypothesis h, P(h|D), follows the Bayes theorem:

    P(h|D) = P(D|h) · P(h) / P(D)

• MAP (maximum a posteriori) hypothesis:

    h_MAP ≡ argmax (h ∈ H) P(h|D) = argmax (h ∈ H) P(D|h) · P(h)

• Practical difficulty: requires initial knowledge of many probabilities and significant computational cost

Naïve Bayes Classifier (I)

• A simplified assumption: attributes are conditionally independent:

    P(Cj | V) ∝ P(Cj) · Π (i = 1 … n) P(vi | Cj)

• Greatly reduces the computation cost: only count the class distribution

Naive Bayesian Classifier (II)

Given a training set, we can compute the probabilities:

Outlook  | P   | N        Humidity | P   | N
sunny    | 2/9 | 3/5      high     | 3/9 | 4/5
overcast | 4/9 | 0        normal   | 6/9 | 1/5
rain     | 3/9 | 2/5

Temperature | P   | N      Windy | P   | N
hot         | 2/9 | 2/5    true  | 3/9 | 3/5
mild        | 4/9 | 2/5    false | 6/9 | 2/5
cool        | 3/9 | 1/5

Bayesian Classification

• The classification problem may be formalized using a-posteriori probabilities:
  – P(C|X) = probability that the sample tuple X = <x1, …, xk> is of class C
  – E.g., P(class = N | outlook = sunny, windy = true, …)
• Idea: assign to sample X the class label C such that P(C|X) is maximal

Estimating a-posteriori probabilities

• Bayes theorem: P(C|X) = P(X|C) · P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative frequency of class C samples
• C such that P(C|X) is maximum = C such that P(X|C) · P(C) is maximum
• Problem: computing P(X|C) is unfeasible!

Naïve Bayesian Classification

• Naïve assumption: attribute independence

    P(x1, …, xk | C) = P(x1|C) · … · P(xk|C)

• If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
• If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function
• Computationally easy in both cases

Play-tennis example: estimating P(xi|C)

Training data (Outlook, Temperature, Humidity, Windy, Class):

Outlook  | Temperature | Humidity | Windy | Class
sunny    | hot         | high     | false | N
sunny    | hot         | high     | true  | N
overcast | hot         | high     | false | P
rain     | mild        | high     | false | P
rain     | cool        | normal   | false | P
rain     | cool        | normal   | true  | N
overcast | cool        | normal   | true  | P
sunny    | mild        | high     | false | N
sunny    | cool        | normal   | false | P
rain     | mild        | normal   | false | P
sunny    | mild        | normal   | true  | P
overcast | mild        | high     | true  | P
overcast | hot         | normal   | false | P
rain     | mild        | high     | true  | N

Priors: P(p) = 9/14, P(n) = 5/14

outlook:     P(sunny|p) = 2/9,    P(sunny|n) = 3/5
             P(overcast|p) = 4/9, P(overcast|n) = 0
             P(rain|p) = 3/9,     P(rain|n) = 2/5
temperature: P(hot|p) = 2/9,      P(hot|n) = 2/5
             P(mild|p) = 4/9,     P(mild|n) = 2/5
             P(cool|p) = 3/9,     P(cool|n) = 1/5
humidity:    P(high|p) = 3/9,     P(high|n) = 4/5
             P(normal|p) = 6/9,   P(normal|n) = 1/5
windy:       P(true|p) = 3/9,     P(true|n) = 3/5
             P(false|p) = 6/9,    P(false|n) = 2/5

Play-tennis example: classifying X

• An unseen sample X = <rain, hot, high, false>
• P(X|p) · P(p) = P(rain|p) · P(hot|p) · P(high|p) · P(false|p) · P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
• P(X|n) · P(n) = P(rain|n) · P(hot|n) · P(high|n) · P(false|n) · P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
• Sample X is classified in class n (don’t play)
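The classification above is just a product of the estimated probabilities; the following sketch (illustrative, using the slides' play-tennis estimates) reproduces the two scores:

    # Priors and conditional estimates taken from the previous slides
    P_p, P_n = 9 / 14, 5 / 14
    cond_p = {"rain": 3 / 9, "hot": 2 / 9, "high": 3 / 9, "false": 6 / 9}
    cond_n = {"rain": 2 / 5, "hot": 2 / 5, "high": 4 / 5, "false": 2 / 5}

    X = ["rain", "hot", "high", "false"]   # outlook, temperature, humidity, windy

    score_p, score_n = P_p, P_n
    for value in X:
        score_p *= cond_p[value]
        score_n *= cond_n[value]

    print(round(score_p, 6))                                # 0.010582
    print(round(score_n, 6))                                # 0.018286
    print("play" if score_p > score_n else "don't play")    # don't play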

Chapter 7. Classification and Prediction

• What is classification? What is prediction?
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian Classification
• Other Classification Methods
• Prediction

Other Classification Methods

• k-nearest neighbor classifier
• Case-based reasoning
• Genetic algorithms
• Rough set approach
• Fuzzy set approaches

The k-Nearest Neighbor Algorithm

• All instances correspond to points in the n-dimensional space
• The nearest neighbors are defined in terms of Euclidean distance
• The target function can be discrete- or real-valued
• For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq
• Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
• [Figure: query point xq surrounded by positive (+) and negative (−) training instances]

Discussion on the k-NN Algorithm

• The k-NN algorithm for continuous-valued target functions
  – Calculate the mean value of the k nearest neighbors
• Distance-weighted nearest neighbor algorithm
  – Weight the contribution of each of the k neighbors according to its distance to the query point xq, giving greater weight to closer neighbors:

    w ≡ 1 / d(xq, xi)^2

  – Similarly for real-valued target functions
• Robust to noisy data by averaging the k nearest neighbors
• Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes
  – To overcome it, stretch the axes or eliminate the least relevant attributes
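A minimal sketch of k-NN with the optional distance weighting w = 1 / d(xq, xi)^2 described above (illustrative; the data points and function names are made-up placeholders):

    import math
    from collections import defaultdict

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_classify(training, query, k=3, weighted=False):
        """training: list of (point, label) pairs; returns the predicted label for query."""
        neighbors = sorted(training, key=lambda pl: euclidean(pl[0], query))[:k]
        votes = defaultdict(float)
        for point, label in neighbors:
            d = euclidean(point, query)
            # Unweighted vote, or weight 1/d^2 (small epsilon avoids division by zero)
            votes[label] += 1.0 / (d ** 2 + 1e-9) if weighted else 1.0
        return max(votes, key=votes.get)

    training = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"),
                ((4.0, 4.0), "-"), ((4.2, 3.9), "-"), ((3.8, 4.1), "-")]
    print(knn_classify(training, (1.1, 1.0), k=3))                 # +
    print(knn_classify(training, (3.0, 3.0), k=3, weighted=True))  # -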

Remarks on Lazy vs. Eager Learning

• Instance-based learning: lazy evaluation
• Decision-tree and Bayesian classification: eager evaluation
• Key differences
  – A lazy method may consider the query instance xq when deciding how to generalize beyond the training data D
  – An eager method cannot, since it has already chosen its global approximation before seeing the query
• Efficiency: lazy methods spend less time training but more time predicting
• Accuracy
  – A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
  – An eager method must commit to a single hypothesis that covers the entire instance space

Chapter 7. Classification and Prediction

• What is classification? What is prediction?
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian Classification
• Other Classification Methods
• Prediction

What Is Prediction?

• Prediction is similar to classification
  – First, construct a model
  – Second, use the model to predict unknown values
• The major method for prediction is regression
  – Linear and multiple regression
  – Non-linear regression
• Prediction is different from classification
  – Classification predicts categorical class labels
  – Prediction models continuous-valued functions
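As a concrete illustration of the regression approach the slide refers to, here is a minimal simple-linear-regression sketch via least squares (the data values are made up for illustration):

    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [1.2, 1.9, 3.2, 3.8, 5.1]

    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n

    # Least-squares slope and intercept for the model y ~ w1 * x + w0
    w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    w0 = mean_y - w1 * mean_x

    print(round(w1, 3), round(w0, 3))   # fitted coefficients
    print(round(w0 + w1 * 6.0, 2))      # predicted value for an unseen x = 6.0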
