
Method of k-Nearest Neighbors

Yu Jiangsheng
Institute of Computational Linguistics
Peking University, China, 100871
September 2, 2002
Abstract  The method of k-Nearest Neighbors (k-NN), an important approach to nonparametric classification, is quite easy and efficient. Firstly, we introduce the algorithm of k-NN, then its error estimation. The Bayesian error rate is used in the bound estimation of the k-NN error. By the standard of Maximum Likelihood Estimation (MLE), the Bayesian error is the least attainable, which is why it often acts as an evaluation parameter. It is reported that k-NN works in many applications of classification, such as Text Categorization, face recognition, etc.
Keywords  k-NN, pattern recognition, error estimation
1 Introduction
If we have infinitely many sample points, then the density estimates converge to the actual density function, and the classifier becomes the Bayesian classifier once such a large-scale sample is provided. But in practice, given a small sample, the Bayesian classifier usually fails in the estimation of the Bayes error, especially in a high-dimensional space; this is the so-called curse of dimensionality. The methods of Parzen windows and k-NN are often used in the case of a small sample. The following content is about the approach of k-NN and its error estimation.
There is a famous Chinese idiom which says that the Scotch cousins are not more helpful than the nearest neighbors. Because things gather together by classes (also a Chinese idiom), it is natural to assume that the object of interest belongs to the class of its nearest neighbors. The nearest neighbors are determined once the query vector and the distance measure are given. Usually, we use the Euclidean distance to find the nearest k neighbors.
Problem 1.1  Given the training set T = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, determine the class of the input vector x.
The simplest special case is the 1-NN method, which just searches for the nearest neighbor:
$$j = \arg\min_i \|x - x_i\| \qquad (1)$$
then (x, y_j) is the solution.
Definition 1.1  A training set T = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} is called (k, d%)-stable if the error rate of the k-NN method is d%, where d% is the empirical error rate obtained from independent experiments. If the clusterings of the data are quite distinct (the class distance is the crucial standard of classification), then k can be small. The key idea is that we prefer the least k for which d% is not bigger than the threshold value.
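As a rough sketch of how d% might be estimated in practice (an illustration of this note, not part of the paper: scikit-learn, the iris data, the candidate values of k, and the 5% threshold are all assumed here), one can average the error rate over independent splits of the data:

    # Estimate the empirical error rate d% of k-NN for several candidate k
    # (illustrative choices throughout; any labeled dataset would do).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    threshold = 5.0                                    # hypothetical threshold on d% (percent)

    for k in (1, 3, 5, 7, 9):
        clf = KNeighborsClassifier(n_neighbors=k)      # Euclidean distance by default
        acc = cross_val_score(clf, X, y, cv=5).mean()  # averaged over independent splits
        d = 100.0 * (1.0 - acc)                        # empirical error rate d%
        status = "within threshold" if d <= threshold else "above threshold"
        print(f"k={k}: d% = {d:.1f} ({status})")

The least k whose d% stays under the threshold is then the preferred choice in the sense of the definition above.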
The k-NN method gathers the nearest k neighbors and lets them vote: the class held by most of the neighbors wins. Theoretically, the more neighbors we consider, the smaller the error rate that takes place.
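The voting rule itself is short enough to sketch directly. The following minimal Python implementation (the function and variable names are this note's own, not the paper's) uses the Euclidean distance and a majority vote; with k = 1 it reduces to equation (1):

    # Minimal from-scratch sketch of the k-NN vote described above.
    import numpy as np
    from collections import Counter

    def knn_classify(x, X_train, y_train, k=3):
        """Return the majority class among the k nearest training points to x."""
        dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances ||x - x_i||
        nearest = np.argsort(dists)[:k]              # indices of the k nearest neighbors
        votes = Counter(y_train[nearest])            # let the neighbors vote
        return votes.most_common(1)[0][0]            # the class with the most votes wins

    # With k = 1 this is exactly equation (1): the class of the single nearest neighbor.
    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_classify(np.array([0.2, 0.1]), X_train, y_train, k=3))  # -> 0

Ties (possible when k is even) are resolved arbitrarily in this sketch; in practice one often restricts k to odd values for binary problems.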
2 Error Estimation
First of all, we will consider the method of 1-NN, the easiest approach. Then, the error estimation of 1-NN leads us to the general case. The key step is the construction of P*(y|x), which models the usual situation of our empirical experience.
2.1 Error Estimation of 1-NN
Suppose that x belongs to the ith class, say C_i, but the 1-NN classifier assigns it the class C' (C' ≠ C_i) because x' is the nearest neighbor of x in the sample space. This error is related to the co-occurrence of x and x', denoted by P(e|x, x'). The conditional error rate P(e|x) can be calculated by
$$P(e|x) = \int_{\mathbb{R}^n} P(e \mid x, x')\, P(x' \mid x)\, dx' \qquad (2)$$
Because x' is the nearest neighbor of x, the density of x' concentrates at x. In the ball centered at x with radius ||x − x'||, say B(x, ||x − x'||), there are no other sample points. Therefore, we are able to assume that the density function of x' is the Dirac delta δ(x − x'). Because of the independence of (x, y) and (x', y'), we have
$$P(e \mid x, x') = 1 - \sum_{C} P(y = C,\, y' = C \mid x, x') = 1 - \sum_{C} P(y = C \mid x)\, P(y' = C \mid x') \qquad (3)$$
So, equation (2) turns into:
$$P(e|x) = \int_{\mathbb{R}^n} \Big[ 1 - \sum_{C} P(y = C \mid x)\, P(y' = C \mid x') \Big]\, \delta(x - x')\, dx' = 1 - \sum_{C} \big[ P(y = C \mid x) \big]^2 \qquad (4)$$
Consequently, the expectation of P(e) (denoted by P_{1NN}(e)) is then obtained by
$$P_{1NN}(e) = \int_{\mathbb{R}^n} P(e|x)\, P(x)\, dx = \int_{\mathbb{R}^n} \Big[ 1 - \sum_{C} \big[ P(y = C \mid x) \big]^2 \Big] P(x)\, dx \qquad (5)$$
For the following two extreme cases, P_{1NN}(e) is easy to calculate:

    CASE                                DISTRIBUTION                                      ERROR
    the class of x is certain           P(y|x) = 1 if y = C, and 0 otherwise              P_{1NN}(e) = 0
    the class of x is totally unknown   P(y|x) = 1/N, where N is the number of            P_{1NN}(e) = (N-1)/N
                                        classes in the sample space

Figure 1: two extreme cases
But for the other cases it is difficult to calculate (5) directly. We will use the Bayes error,
$$P_{Bayes}(e) = \int_{\mathbb{R}^n} \big[ 1 - P(y = C \mid x) \big]\, P(x)\, dx \qquad (6)$$
where C is the class that maximizes P(y = C|x), in the bound estimation of P_{1NN}(e). Note that P_{Bayes}(e) coincides with P_{1NN}(e) in the two cases mentioned above.
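As a quick sanity check on these conditional quantities (a toy example of this note; the posterior vector is chosen arbitrarily, not taken from the paper), one can compare the integrand of (5) with the integrand of (6) for a single x:

    # Toy sanity check: conditional 1-NN error vs. conditional Bayes error.
    import numpy as np

    posterior = np.array([0.7, 0.2, 0.1])        # hypothetical P(y = C | x) for N = 3 classes
    err_1nn_cond = 1.0 - np.sum(posterior ** 2)  # 1 - sum_C P(y = C|x)^2, integrand of (5)
    err_bayes_cond = 1.0 - posterior.max()       # 1 - max_C P(y = C|x), integrand of (6)

    # The conditional 1-NN error lies between the conditional Bayes error
    # and twice the conditional Bayes error.
    print(f"{err_bayes_cond:.2f} <= {err_1nn_cond:.2f} <= {2 * err_bayes_cond:.2f}")  # 0.30 <= 0.46 <= 0.60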
A slightly more complicated assumption on P(y|x) is that P(y = C|x) reaches the maximum and the remaining probability is spread evenly, P(y = C'|x) = (1 − P(y = C|x)) / (N − 1) = ε for every C' ≠ C:
$$P^*(y|x) = \begin{cases} P(y = C \mid x) & y = C \\ \varepsilon & y \neq C \end{cases} \qquad (7)$$
Thus Σ_{C' ≠ C} P(y = C'|x) = 1 − P(y = C|x) = P_{Bayes}(e|x), and equation (7) turns into
$$P^*(y|x) = \begin{cases} 1 - P_{Bayes}(e|x) & y = C \\ \dfrac{P_{Bayes}(e|x)}{N - 1} & y \neq C \end{cases} \qquad (8)$$
The distribution (7) is an extreme case in which P(y = C|x) is known but P(y = C'|x) is unknown, where C' ≠ C. What's more,
$$\begin{aligned}
\sum_{C} \big[ P(y = C \mid x) \big]^2 &\ge \sum_{C} \big[ P^*(y = C \mid x) \big]^2 \\
&= \big[ P^*(y = C \mid x) \big]^2 + \sum_{C' \neq C} \big[ P^*(y = C' \mid x) \big]^2 \\
&= \big[ 1 - P_{Bayes}(e|x) \big]^2 + (N - 1) \left[ \frac{P_{Bayes}(e|x)}{N - 1} \right]^2 \\
&= 1 - 2 P_{Bayes}(e|x) + \frac{N}{N - 1} \big[ P_{Bayes}(e|x) \big]^2
\end{aligned} \qquad (9)$$
since, for a fixed total mass on the wrong classes, the sum of squares is minimized by the uniform split used in (8).
So, we get an upper bound of 1 − Σ_C [P(y = C|x)]^2:
$$1 - \sum_{C} \big[ P(y = C \mid x) \big]^2 \;\le\; 2 P_{Bayes}(e|x) - \frac{N}{N - 1} \big[ P_{Bayes}(e|x) \big]^2 \qquad (10)$$
Substituting (10) into (4) and taking the expectation as in (5), we have
$$\begin{aligned}
P_{1NN}(e) &\le \int_{\mathbb{R}^n} \Big[ 2 P_{Bayes}(e|x) - \frac{N}{N - 1} \big[ P_{Bayes}(e|x) \big]^2 \Big] P(x)\, dx \\
&= 2 \int_{\mathbb{R}^n} P_{Bayes}(e|x)\, P(x)\, dx - \frac{N}{N - 1} \int_{\mathbb{R}^n} \big[ P_{Bayes}(e|x) \big]^2 P(x)\, dx \\
&\le 2 P_{Bayes}(e) - \frac{N}{N - 1} \big[ P_{Bayes}(e) \big]^2
\end{aligned} \qquad (11)$$
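The last step in (11) uses the fact that the variance of P_{Bayes}(e|x) under P(x) is non-negative:
$$\int_{\mathbb{R}^n} \big[ P_{Bayes}(e|x) \big]^2 P(x)\, dx \;\ge\; \left[ \int_{\mathbb{R}^n} P_{Bayes}(e|x)\, P(x)\, dx \right]^2 = \big[ P_{Bayes}(e) \big]^2,$$
so replacing the second integral by the smaller quantity [P_{Bayes}(e)]^2 in the negative term can only increase the right-hand side, which gives the stated upper bound.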
[Figure: the error estimation of k-NN; the bounds of P_{1NN}(e) as a function of the Bayes error P, both axes on the unit interval.]
Lastly, we get the bound estimation of P_{1NN}(e), which lies in
$$P_{Bayes}(e) \;\le\; P_{1NN}(e) \;\le\; 2 P_{Bayes}(e) - \frac{N}{N - 1} \big[ P_{Bayes}(e) \big]^2 \qquad (12)$$
For a binary classification (N = 2), this becomes P_{Bayes}(e) ≤ P_{1NN}(e) ≤ 2 P_{Bayes}(e) [1 − P_{Bayes}(e)]. The more classes there are, the larger the upper bound; its limit as N grows is P_{Bayes}(e) [2 − P_{Bayes}(e)]. So, in general, the error of 1-NN lies between x and x(2 − x), where x denotes the Bayes error.
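To get a feel for the bound in (12), the following throwaway snippet (the values of p and N below are arbitrary illustrations, not from the paper) tabulates the upper bound 2p − N/(N−1) p^2 for a few Bayes error rates p, together with the limiting curve p(2 − p):

    # Evaluate the 1-NN upper bound of equation (12) for illustrative p and N.
    def upper_bound(p, N):
        """Upper bound 2p - N/(N-1) * p^2 from equation (12)."""
        return 2 * p - (N / (N - 1)) * p ** 2

    for p in (0.05, 0.1, 0.2, 0.3):
        b2 = upper_bound(p, 2)    # equals 2p(1 - p), the binary case
        b10 = upper_bound(p, 10)
        limit = p * (2 - p)       # the limit as N grows
        print(f"p={p:.2f}: lower={p:.3f}, upper(N=2)={b2:.3f}, upper(N=10)={b10:.3f}, limit={limit:.3f}")

For p = 0.1 and N = 2, for instance, the asymptotic 1-NN error is guaranteed to lie between 0.100 and 0.180.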
2.2 Error Estimation of k-NN
The general case is a little more complex, but intuitively the larger k is, the lower the upper bound on the k-NN error in terms of P_{Bayes}(e), provided N is fixed; an informal simulation of this tendency is sketched below.
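The simulation below (this note's own illustration, assuming numpy, scipy, and scikit-learn; it is not part of the paper) draws a large sample from two overlapping one-dimensional Gaussians whose Bayes error is known in closed form and reports the k-NN test error as k grows:

    # k-NN test error vs. the known Bayes error on a toy problem:
    # two classes, x ~ N(-1, 1) for class 0 and x ~ N(+1, 1) for class 1,
    # equal priors, so the Bayes error is Phi(-1) ~ 0.1587.
    import numpy as np
    from scipy.stats import norm
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)

    def sample(n):
        y = rng.integers(0, 2, size=n)                # equally likely classes
        x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)  # class-conditional means at -1 and +1
        return x.reshape(-1, 1), y

    X_train, y_train = sample(20000)
    X_test, y_test = sample(20000)

    print(f"Bayes error: {norm.cdf(-1.0):.4f}")
    for k in (1, 3, 9, 27, 81):
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        err = 1.0 - clf.score(X_test, y_test)         # empirical test error of k-NN
        print(f"k={k:3d}: test error = {err:.4f}")

With a sample this large, the test error of 1-NN sits well above the Bayes error (consistent with the bound in (12)) and drifts toward the Bayes error as k increases.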
Conclusion
If you provide the machine with only a small training set, you had better not presume that the data will yield many good conclusions. For the information hidden in the data, there is always a limit to what mining can extract.
Acknowledgement
I would like to share the happiness of discovery with my wife, even though she could not understand it. It is strange that, even though she does not understand, she is still able to make me recognize that my result is not good yet. With her encouragement, I continue to improve my method until she finds the self-confidence in my voice.