Support Vector Machines

Jan 21, 2020

Thanks for reading. In this article we will go through the Support Vector Machine (SVM), a very powerful and popular machine learning algorithm. We will try to understand the underlying concept on which SVM is based through simple examples from real-world scenarios. We will also cover a few basic concepts of linear algebra, because they will help in understanding the mathematics behind SVM, which we will see in the next article.

So, by reading this article you will get a proper understanding of the following points:

• What a Support Vector Machine is and how it works.
• Before going into the graphical explanation of SVM and its underlying concept, a few important concepts of linear algebra:
  o How we plot data in a space.
  o How the number of features of the data (attributes or feature columns) decides the dimension of the space.
  o The concept of a plane or hyperplane.
• The objective of the SVM and why it is important.
• Margin, the maximum margin hyperplane, and why it is necessary to select the maximum margin hyperplane.

So, after reading this article you will be able to understand the concept of SVM, the objective of SVM and the importance of that objective.

A Support Vector Machine is a supervised machine learning algorithm developed by Vladimir N. Vapnik. It can be used for both classification and regression, but it is mostly used for classification problems. Today it is one of the most widely used algorithms in machine learning. Neural networks are among the most trusted and popular algorithms in machine learning, but SVM has made a dent in their popularity: it often needs much less computational power than a neural network and still gives reliable results on both linear and nonlinear data.

The main idea behind SVM is that it tries to find the classifier, or decision boundary, such that the distance from the decision boundary to the nearest data points of each class is maximum. That is why it is also called a maximal margin classifier.

For people who are familiar with linear classifiers such as logistic regression or neural networks, it is easy to visualize a decision boundary with maximum distance from each class, but let's discuss the entire concept from scratch. To classify any data we first need to plot it in some space. This is a general idea, and we can see many examples of it in day-to-day life. Suppose you were given the task of separating two kinds of fruit, say oranges and mangoes, kept in one bag. You would take them out and place them on a table in two groups, with some appropriate distance between the groups. So when I say we plot data in a space, think of the table as the space and the fruits as the data points.

In the same way, in a Support Vector Machine each data point is plotted in an N-dimensional space, where N is the number of features. Why do we take the dimension of the space to be the same as the number of features? The reason lies in linear algebra. If we want to plot a point where X1 = 2 and X2 = 3, we draw a graph with two axes, as below, and place the point so that its coordinate is 2 along the X1 axis and 3 along the X2 axis.

[Figure: the point (2, 3) plotted on axes X1 and X2.]

So to locate a point we need its position along each axis. Each axis is a dimension in linear algebra, and the entire graph can be thought of as the space. We therefore need a space whose dimension equals the number of coordinates, and the coordinates are simply the values of the features. So the dimension of the space must equal the number of features.

Now take the same example of separating fruits to understand it properly. You have the task of separating two types of fruit that are kept together in a bag. To differentiate between the two types you will usually look at colour, shape, size and so on. Shape, size and colour are the features on the basis of which you can say whether a fruit is a mango or an orange. Let's put these features in a table as below, assuming that the colour code for orange is 1 and for mango is 2.

Colour   Shape   Size   Type of Fruit
1        3       6      Orange

So, all points from the table will be plotted in a 3-dimensional space. Each point has coordinates that are simply the values of its features, so for the row above the coordinates are (1, 3, 6). Each such set of coordinates represents one fruit, since the values belong to that fruit only. We plot each point in an N-dimensional space (N = 3 in this example) and then find the hyperplane that separates these points into two classes, orange and mango, so that we can say the data on one side of the plane belongs to orange and the data on the other side belongs to mango. This gives us an overview of what our objective is and how we can represent a day-to-day problem in mathematical terms, or more specifically in terms of linear algebra.
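The idea that each fruit is one point, and that the number of features fixes the dimension of the space, can be sketched in a few lines of Python. This is only an illustration: the first record is the row from the table, while the mango record and the helper name `space_dimension` are assumptions made up for the example.

```python
# Each record is (feature values, label). Features are (colour, shape, size).
fruits = [
    ((1, 3, 6), "Orange"),  # the row from the table
    ((2, 4, 9), "Mango"),   # assumed values, for illustration only
]

def space_dimension(records):
    """Return the number of features, i.e. the dimension N of the space."""
    dims = {len(features) for features, _ in records}
    if len(dims) != 1:
        raise ValueError("all records must have the same number of features")
    return dims.pop()

n_dims = space_dimension(fruits)
print("Dimension of the space:", n_dims)
for point, label in fruits:
    print(label, "is the point", point)
```

With three feature columns, every fruit becomes a point with three coordinates, so the space is 3-dimensional.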

Now, let's take another example to understand our objective of finding the best possible hyperplane to separate data points in terms of linear algebra. Here we will take only 2 features and 2 classes, as this is easy to visualise. Suppose we have two features, X1 and X2, and two classes, A and B. With n training examples we have n points to draw, and the values of X1 and X2 are the coordinates of these points:

[Figure: training points (X11, X21), (X12, X22), (X13, X23) and (X1n, X2n) plotted on axes X1 and X2, with the labels Class Name-A and Class Name-B.]

Suppose that after placing the above points in a 2-dimensional space we get a graph like the one above. We have two feature columns (X1, X2), and since the dimension of the space depends on the number of features, we have chosen a 2D space.

Now we need a decision boundary. In the example above the decision boundary is a line that separates classes A and B. A line can be drawn in any direction and at any place, and its orientation and position can be decided on several criteria. The main criterion we are interested in is that the line should divide the entire data set into two parts (we have two classes here; with multiple classes the number of parts equals the number of classes), such that points belonging to one class are on one side and points belonging to the other class are on the other side. In the diagram above, the line drawn does this, so it can be our decision boundary.
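The criterion above, that all points of one class fall on one side of the line and all points of the other class on the other side, can be checked mechanically. The sketch below assumes the line is written as n1*X1 + n2*X2 + b = 0 and uses made-up coordinates; the function names `side` and `separates` are hypothetical helpers, not part of any library.

```python
def side(point, normal, b):
    """Return -1, 0 or +1 for the side of the line normal.x + b = 0 the point is on."""
    value = sum(n_i * x_i for n_i, x_i in zip(normal, point)) + b
    return (value > 0) - (value < 0)

def separates(class_a, class_b, normal, b):
    """True if every A point is on the negative side and every B point on the positive side."""
    return (all(side(p, normal, b) < 0 for p in class_a)
            and all(side(p, normal, b) > 0 for p in class_b))

# Assumed training points, for illustration only.
class_a = [(1.0, 2.0), (1.5, 1.0), (2.0, 3.0)]
class_b = [(5.0, 2.5), (6.0, 1.0), (5.5, 3.5)]

# The vertical line X1 = 3.5 corresponds to normal = (1, 0), b = -3.5.
print(separates(class_a, class_b, (1.0, 0.0), -3.5))  # True
# The vertical line X1 = 1.2 cuts through class A.
print(separates(class_a, class_b, (1.0, 0.0), -1.2))  # False
```

Many different lines pass this check for the same data, which is why a further criterion is needed to choose among them.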

Looking at the above graph, it seems obvious that we have found a perfect boundary, and for the training dataset that is true. But if we look carefully, the points (X11, X21) and (X13, X23) are very close to the decision boundary. That is acceptable for the training dataset, but suppose the test dataset contains a point (X1T, X2T) whose values are near (X13, X23), with slight changes in some feature that do not change its class. There is then a high chance that the point will fall on the other side of the decision boundary, as in the figure below.

[Figure: Misclassified Data. The test point (X1T, X2T) shown with the training points (X11, X21), (X12, X22), (X13, X23) and (X1n, X2n) on axes X1 and X2.]

From the above example we can say that although the point is closer to class B, the decision boundary assigns it to class A because it falls slightly to the left of the line. So although the line fit the training data well, it failed on the test data. In machine learning terminology we call this overfitting.
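The situation described above can be reproduced with a small sketch. It assumes a vertical boundary at X1 = 4.9 that passes very close to a class B training point; all numbers and the helper name `classify` are made up for illustration.

```python
def classify(point, normal, b):
    """Assign class B to the positive side of normal.x + b = 0, class A otherwise."""
    value = sum(n_i * x_i for n_i, x_i in zip(normal, point)) + b
    return "B" if value > 0 else "A"

# Assumed boundary X1 = 4.9, written as 1*X1 + 0*X2 - 4.9 = 0.
normal, b = (1.0, 0.0), -4.9

train_point = (5.0, 2.0)  # class B training point, just right of the boundary
test_point = (4.8, 2.1)   # same kind of point with slightly different features

print(classify(train_point, normal, b))  # B
print(classify(test_point, normal, b))   # A
```

A change of 0.2 in one feature is enough to move the point across the boundary, because the boundary leaves almost no room around the training point.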

To avoid such cases we have several options. For example, we can apply a neural network trained through gradient descent, which can draw a boundary of arbitrary shape to achieve the best classification.

See the figure below.

[Figure: the same training points and the test point (X1T, X2T) on axes X1 and X2, captioned Misclassified Data.]

But this requires large computational resources, so the question is: do we have another effective mechanism that solves this problem without using that many resources? The answer is yes. Support Vector Machines can help us here. But before we go into the details of SVM, let's first get an idea of a few concepts that we are going to use in the explanation.

Hyperplane- We have seen above that we draw a hyperplane to separate the points, but what exactly is a hyperplane? A hyperplane is a geometric entity whose dimension is one less than the dimension of the space surrounding it. So if the space has dimension N, then

Dimension of Hyperplane = N-1

So, in a 3-dimensional space the hyperplane has 2 dimensions, and since a 2-dimensional entity is called a plane, the hyperplane in 3D space is a plane. Similarly, a hyperplane in 2D space has one dimension, and a 1-dimensional entity is a line. In the example above we had a 2D space, so our decision boundary had to be 1D; that is why we drew a line to separate our dataset. In other words, a plane does in 3-dimensional space what a line does in 2-dimensional space.

Before we go into the mathematical equation of a hyperplane, we should know what a hyperplane means in machine learning. In machine learning a hyperplane divides the dataset into its respective classes, so if we have two classes the hyperplane should divide the data into 2 parts.

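The equation for a hyperplane is-

$$X^{T} n + b = 0$$

If we expand the equation we get-

$$X_1 n_1 + X_2 n_2 + X_3 n_3 + \dots + X_N n_N + b = 0$$

Generic equation of a plane. Here X is the vector of feature values of a point, n is the vector normal to the hyperplane, b is a constant offset, and N is the number of features.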
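To make the equation concrete, the sketch below evaluates the left-hand side of the hyperplane equation for a given point and divides by the length of the normal vector to get a signed distance. The division by the norm is the standard point-to-hyperplane distance formula, which this article has not derived; the line 3*X1 + 4*X2 - 10 = 0 and the function names are assumptions chosen for easy arithmetic.

```python
import math

def hyperplane_value(x, normal, b):
    """Evaluate X^T n + b for a point x."""
    if len(x) != len(normal):
        raise ValueError("point and normal must have the same dimension")
    return sum(x_i * n_i for x_i, n_i in zip(x, normal)) + b

def signed_distance(x, normal, b):
    """Distance from x to the hyperplane; the sign tells which side x is on."""
    norm = math.sqrt(sum(n_i * n_i for n_i in normal))
    return hyperplane_value(x, normal, b) / norm

# Assumed example: the line 3*X1 + 4*X2 - 10 = 0 in a 2D space.
normal, b = (3.0, 4.0), -10.0

print(hyperplane_value((2.0, 1.0), normal, b))  # 0.0, so the point lies on the line
print(signed_distance((4.0, 2.0), normal, b))   # 2.0, two units away on the positive side
```

A value of zero means the point is on the hyperplane; positive and negative values correspond to the two sides.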

Now that we have seen the concept of a plane, let's look at Support Vector Machines in detail. As we know, the idea behind a support vector machine is to find a plane, or decision boundary, such that the distance from the nearest points of each class to the decision boundary is maximum. Two questions arise here: first, why do we need maximum distance, and second, why do we need maximum distance from each class? To understand this, let's take the same example of separating fruits into 2 classes, orange and mango. This time we will not draw a graph but take a very simple approach; see the figure below.

[Figure: oranges and mangoes arranged along a Size axis, oranges on the left and mangoes on the right.]

Consider only the features size and colour for now. We can say that if the size is small and the colour is orange, the class is orange, and if the size is large and the colour is yellow, it belongs to mango. Our goal is to find a decision boundary such that data on the left-hand side of the boundary belongs to the orange class and data on the right-hand side belongs to the mango class. Looking at the figure above, we could place the decision boundary anywhere between the two groups, so let's take 3 cases: one closer to the orange class, denoted D1 in the figure below; one closer to the mango class, denoted D2; and one with maximum distance from each class, denoted Dm. To get Dm, we take the nearest points of the two classes, Po of the orange class and Pm of the mango class, calculate the distance between them, and place the decision boundary exactly in the middle of that distance.

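[Figure: Size axis with the orange group and its nearest point Po, the mango group and its nearest point Pm, and the candidate boundaries D1, Dm and D2 between them.]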

Take the case of decision boundary D1. It seems okay, since its distance from the nearest point of at least one class, mango, is maximum. But what happens when we get an orange (Po) whose size is a little larger than the others, as in the figure below?

[Figure: Size axis with an orange Po that is slightly larger than the others, lying on the mango side of boundary D1.]

According to our decision boundary, this fruit belongs to the mango class because it is on the right-hand side of the boundary, yet it is very close to the orange class. This is a misclassification: our decision boundary cannot handle a case where the size is a little larger, and from personal experience we know that an orange can be a little larger than usual. In machine learning terminology, our decision boundary is not generalised enough to handle such variations.

Similarly, decision boundary D2, which has maximum distance from the orange class, may look like an appropriate decision boundary. But consider a case where the size of a mango is less than usual, shown as Pm below. Our decision boundary will put it in class orange even though the data point is very close to class mango. So this decision boundary is also not the best, as it too fails to handle slight variations in the data.

[Figure: Size axis with a mango Pm that is slightly smaller than usual, lying on the orange side of boundary D2.]

Now let's take our decision boundary Dm, which has equal distance from the nearest point of each class.

[Figure: Size axis with the points Po and Pm on either side of boundary Dm.]

Here we can say that it is the best decision boundary we can have, as it correctly handles the variations in points Po and Pm that the other 2 boundaries misclassified.
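The comparison of D1, D2 and Dm can be replayed with numbers. The sketch below uses only the size feature; the fruit sizes, the positions chosen for D1 and D2, and the function names are all assumptions for illustration. Dm is placed halfway between the largest orange (Po) and the smallest mango (Pm), as described above.

```python
def midpoint_boundary(oranges, mangoes):
    """Place the boundary halfway between the nearest points of the two classes."""
    p_o = max(oranges)  # orange nearest to the mango group
    p_m = min(mangoes)  # mango nearest to the orange group
    if p_o >= p_m:
        raise ValueError("classes overlap in size; no single threshold separates them")
    return (p_o + p_m) / 2

def classify(size, boundary):
    """Everything left of the boundary is an orange, everything else a mango."""
    return "Orange" if size < boundary else "Mango"

# Assumed training sizes.
oranges = [5.0, 5.5, 6.0]
mangoes = [10.0, 11.0, 12.0]

d1 = 6.2                                  # boundary close to the orange class
d2 = 9.8                                  # boundary close to the mango class
dm = midpoint_boundary(oranges, mangoes)  # 8.0, midway between Po and Pm

big_orange, small_mango = 6.5, 9.5        # new fruits with slight size variations
for name, boundary in [("D1", d1), ("D2", d2), ("Dm", dm)]:
    print(name, classify(big_orange, boundary), classify(small_mango, boundary))
# D1 Mango Mango
# D2 Orange Orange
# Dm Orange Mango
```

Only Dm classifies both the slightly larger orange and the slightly smaller mango correctly, because it leaves equal room on both sides.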

So, from the above example we have the answers to our 2 questions. First, why do we need maximum distance? Because with maximum distance we can avoid misclassifications that occur due to small variations in the data. Second, why do we need maximum distance from each class? Because in that case each class has room to absorb variations in its data.

The distance from the nearest points has a name: the margin. We can define it as below.

Margin- The margin is the distance from the decision surface to the closest points. We can also say that the margin is the distance between the decision boundary and each of the classes. In the figure below, the points (X11, X21) and (X12, X22) are closest to the decision surface, so the distance between these points and the decision surface is the margin.

Let's plot the above points in the graph below to visualize the concept in more detail.

[Figure: the points (X11, X21) and (X12, X22) on axis X1 with the Margin marked, and the labels Class -A and Class -B.]

For the fruit classification example, the points Po and Pm are the nearest points, so the distance between each of them and the decision surface Dm is the margin.

[Figure: Size axis with Po and Pm on either side of boundary Dm; the Margin is marked on each side of Dm.]
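The margin of a given decision boundary can be computed as the smallest distance from any training point to it. The sketch below does this for a hyperplane written as X^T n + b = 0, using the standard point-to-hyperplane distance; the points, the two candidate lines and the function name `margin` are assumptions for illustration, and this is not how an SVM actually searches for its boundary.

```python
import math

def margin(points, normal, b):
    """Smallest distance from any point to the hyperplane X^T n + b = 0."""
    norm = math.sqrt(sum(n_i * n_i for n_i in normal))
    return min(
        abs(sum(x_i * n_i for x_i, n_i in zip(p, normal)) + b) / norm
        for p in points
    )

# Assumed training points for two classes.
class_a = [(1.0, 2.0), (2.0, 3.0)]
class_b = [(6.0, 1.0), (5.0, 2.5)]
points = class_a + class_b

# Two candidate vertical lines that both separate the classes.
print(margin(points, (1.0, 0.0), -2.5))  # 0.5, line X1 = 2.5 sits close to class A
print(margin(points, (1.0, 0.0), -3.5))  # 1.5, line X1 = 3.5 sits midway
```

Both lines separate the training data, but the second has the larger margin, so under the maximum-margin idea it is the preferred one of the two.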

• First, we understood what a support vector machine is.
• Next, we went through the concept of a space through an example.
• Then we saw how the dimension of the space is related to the number of features.
• Then, through a graph and an example, we got the idea of how data points are plotted in a space and how we draw a plane to divide them.
• After that, the idea of a hyperplane was explained with equations.
• Then we went through SVM in detail, and with a few examples we got the idea of margin and why we need maximum margin from each class.
• In the next article we will see the mathematical concepts behind SVM, and we will try to get the intuition behind them using the example of another classifier algorithm, logistic regression.
