INTRODUCTION TO K-NEAREST NEIGHBOR ALGORITHM
December 23, 2016 Rahul Saxena
For a linear model, the parameters could be the intercept and the coefficients. For any parametric classification algorithm, we try to find a boundary which successfully separates the different target classes. For a support vector machine, say, we try to find the margins and the support vectors; in this case too, we have a set of parameters which needs to be optimized to get decent accuracy.
But today we are going to learn about a different kind of algorithm: a non-parametric classification algorithm. Wondering how we can model such an algorithm? Walk through this post to find out. Just to give you a one-line summary:
The algorithm uses the information from neighboring points to predict the target class.
The simple version of the K-nearest neighbor classifier algorithm predicts the target label by finding the nearest neighbor class. The closest class is identified using distance measures like Euclidean distance.
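This simple version can be sketched in a few lines of plain Python; the 2-D points and labels below are made up purely for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor_predict(train_points, train_labels, query):
    """Predict the label of `query` from its single nearest training point."""
    distances = [euclidean(p, query) for p in train_points]
    nearest_index = distances.index(min(distances))
    return train_labels[nearest_index]

# Hypothetical 2-D training data with two classes.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
labels = ["white", "white", "orange", "orange"]

print(nearest_neighbor_predict(points, labels, (2.0, 1.5)))  # → white
```

The query point (2.0, 1.5) sits close to the "white" cluster, so its nearest neighbor's label is returned.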
Using the customer's detailed information from the database, it will calculate a credit score (a discrete value) for each customer. The calculated credit score helps the company and lenders clearly understand the credibility of a customer, so they can simply decide whether they should lend money to a particular customer or not.
The company (XYZ) uses these kinds of details to calculate the credit score of a customer. The process of calculating the credit score from the customer's details is expensive. To reduce this cost, they observed that customers with similar background details get similar credit scores. So, they decided to use the already available customer data and predict the credit score of a new customer by comparing it with similar records. These kinds of problems are handled by the k-nearest neighbor classifier, which finds the most similar customers.
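The idea of predicting a new customer's credit score from the most similar existing customer can be sketched as below; the customer features (age, income, years employed) and scores are entirely hypothetical.

```python
import math

# Hypothetical existing customers: (age, annual income in $1000s, years employed)
# paired with credit scores already computed by the expensive process.
customers = [
    ((25, 40, 2), 620),
    ((40, 85, 15), 710),
    ((55, 120, 25), 780),
    ((30, 50, 5), 650),
]

def predict_credit_score(new_customer):
    """Return the credit score of the most similar existing customer."""
    def distance(features):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(features, new_customer)))
    closest_features, closest_score = min(customers, key=lambda c: distance(c[0]))
    return closest_score

print(predict_credit_score((26, 42, 2)))  # → 620
```

In practice, features on such different scales (age vs. income) should be normalized before computing distances, otherwise the largest-valued feature dominates.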
Let (Xi, Ci), where i = 1, 2, …, n, be the data points. Xi denotes the feature values and Ci denotes the label for Xi, for each i.
Assuming the number of classes is c,
Ci ∈ {1, 2, 3, …, c} for all values of i.
Let x be a point whose label is not known; we would like to find its label class using the k-nearest neighbor algorithm.
For data science beginners, the above pseudocode may be hard to understand, so let's understand the KNN algorithm using an example.
K-Nearest neighbor algorithm example
Let's consider the above image, where we have two different target classes: white and orange circles. We have 26 training samples in total. Now we would like to predict the target class for the blue circle. Taking the k value as three, we need to calculate the similarity distance using a similarity measure like Euclidean distance.
A smaller distance means the classes are closer. In the image, we have calculated the distances and placed the three circles with the smallest distances to the blue circle inside the big circle.
With the above example, you got some idea of the process of the KNN algorithm. Now read the next paragraph to understand it in technical terms.
Let's consider a setup with "n" training samples, where xi is a training data point. The training data points are categorized into "c" classes. Using KNN, we want to predict the class of a new data point x. So, the first step is to calculate the distance (Euclidean) between the new data point and all the training data points. The next step is to arrange all the distances in non-decreasing order. Assuming a positive integer value of "K", we filter the K smallest values from the sorted list, giving us the K top distances. Let ki denote the number of points belonging to the ith class among those K points. If ki > kj ∀ i ≠ j, then we put x in class i.
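The steps above (compute distances, sort them, take the K smallest, majority vote) can be sketched as follows, again with made-up data:

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: distance from the query to every training point.
    distances = [(math.dist(p, query), c) for p, c in zip(train, labels)]
    # Step 2: sort in non-decreasing order and keep the k smallest.
    k_nearest = sorted(distances)[:k]
    # Step 3: majority vote over the labels of those k points.
    votes = Counter(c for _, c in k_nearest)
    return votes.most_common(1)[0][0]

train = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
labels = ["white", "white", "white", "orange", "orange", "orange"]
print(knn_predict(train, labels, (2, 2), k=3))  # → white
```

All three of the query's nearest neighbors are "white", so the vote is unanimous here.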
Selecting the value of K in K-nearest neighbor is the most critical problem. A small value of K means that noise will have a higher influence on the result, i.e., the probability of overfitting is very high. A large value of K makes it computationally expensive and defeats the basic idea behind KNN (that points which are near each other are likely to have similar classes). A simple approach to selecting k is k = n^(1/2) = √n.
To optimize the results, we can use cross-validation. Using the cross-validation technique, we can test the KNN algorithm with different values of K; the model which gives the best accuracy can be considered the optimal choice. It depends on the individual case: at times, the best process is to run through each possible value of K and test our results.
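One way to run through candidate values of K is sketched below, using simple leave-one-out cross-validation in plain Python (the data is made up; real projects would typically use a library's cross-validation utilities instead).

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k):
    """Majority-vote KNN classifier."""
    distances = sorted((math.dist(p, query), c) for p, c in zip(train, labels))
    votes = Counter(c for _, c in distances[:k])
    return votes.most_common(1)[0][0]

def loocv_accuracy(points, labels, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    correct = 0
    for i in range(len(points)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(rest_points, rest_labels, points[i], k) == labels[i]:
            correct += 1
    return correct / len(points)

points = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
labels = ["a"] * 4 + ["b"] * 4

# Try several values of K and keep the one with the best accuracy.
best_k = max([1, 3, 5], key=lambda k: loocv_accuracy(points, labels, k))
print(best_k, loocv_accuracy(points, labels, best_k))
```

On this cleanly separated toy data every candidate K scores perfectly; on noisy real data the accuracies would differ and the comparison becomes meaningful.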
Working on a big dataset can be an expensive task. Using the condensed nearest neighbor rule, we can clean our data and sort the important observations out of it. This process can reduce the execution time of the machine learning algorithm, but there is a chance of a reduction in accuracy.
1. Outliers: observations that lie at an abnormal distance from all the other data points. Most of these are extreme values. Removing these observations will increase the accuracy of the model.
2. Prototypes: the minimum set of points in the training set required to recognize the non-outlier points.
3. Absorbed points: points that are correctly identified as non-outlier points by the prototypes, and can therefore be dropped.
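A minimal sketch of the condensed nearest neighbor idea (Hart's rule): seed a store with one point, then keep adding any training point that the current store misclassifies with 1-NN, until every point is either stored or absorbed. The data here is illustrative.

```python
import math

def nearest_label(store, query):
    """Label of the point in `store` closest to `query` (1-NN)."""
    return min(store, key=lambda pc: math.dist(pc[0], query))[1]

def condense(points, labels):
    """Hart's condensed nearest neighbor rule: keep a store that
    1-NN-classifies every training point correctly; the rest are absorbed."""
    data = list(zip(points, labels))
    store = [data[0]]  # seed the store with the first point
    changed = True
    while changed:
        changed = False
        for p, c in data:
            if nearest_label(store, p) != c and (p, c) not in store:
                store.append((p, c))  # the store got this point wrong: keep it
                changed = True
    return store

points = [(1, 1), (1.2, 1.1), (2, 1), (8, 8), (8.2, 8.1), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]
store = condense(points, labels)
print(len(store), "of", len(points), "points kept")  # 2 of 6 points kept
```

For these two tight, well-separated clusters, one prototype per class suffices; the remaining four points are absorbed.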
To diagnose breast cancer, a doctor uses his experience, analyzing the details provided by a) the patient's past medical history and b) the reports of all the tests performed. At times, it becomes difficult to diagnose cancer even for experienced doctors, since the information provided by the patient might be unclear and insufficient.
The breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. It contains 699 samples with 10 attributes plus a class label.
 #  Attribute                     Domain
--  ----------------------------  -----------------------------
 1. Sample code number            id number
 2. Clump Thickness               1 - 10
 3. Uniformity of Cell Size       1 - 10
 4. Uniformity of Cell Shape      1 - 10
 5. Marginal Adhesion             1 - 10
 6. Single Epithelial Cell Size   1 - 10
 7. Bare Nuclei                   1 - 10
 8. Bland Chromatin               1 - 10
 9. Normal Nucleoli               1 - 10
10. Mitoses                       1 - 10
11. Class                         2 for benign, 4 for malignant
The main objective is to predict whether a sample is benign or malignant. Many scientists have worked on this dataset and predicted the class using different kinds of algorithms:
1. Parametric
2. Semiparametric
3. Non-Parametric
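As a sketch of how KNN could be applied to data in this format: the rows below are made-up examples following the attribute table above, not real records (in the actual UCI file, some Bare Nuclei values are missing and marked with "?"; skipping such rows is one simple way to handle them).

```python
import math
from collections import Counter

# Illustrative rows in the dataset's format (not real records):
# sample id, nine 1-10 attribute values, class (2 = benign, 4 = malignant).
raw = """\
1,5,1,1,1,2,1,3,1,1,2
2,3,1,1,1,2,2,3,1,1,2
3,8,10,10,8,7,10,9,7,1,4
4,6,8,8,1,3,4,3,7,1,4
5,4,1,1,3,2,1,3,1,1,2
6,8,7,5,10,7,9,5,5,4,4"""

samples, classes = [], []
for line in raw.splitlines():
    fields = line.split(",")
    if "?" in fields:          # skip rows with missing values
        continue
    samples.append(tuple(int(v) for v in fields[1:-1]))  # drop the id column
    classes.append("benign" if fields[-1] == "2" else "malignant")

def knn_predict(query, k=3):
    """Majority vote among the k training samples closest to `query`."""
    distances = sorted((math.dist(s, query), c) for s, c in zip(samples, classes))
    return Counter(c for _, c in distances[:k]).most_common(1)[0][0]

print(knn_predict((5, 1, 2, 1, 2, 1, 3, 1, 1)))  # low values → benign
```

The id column is dropped because it carries no medical information; leaving it in would badly distort the distances.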