
Types of machine learning

Supervised learning: The computer is presented with example inputs and their
desired outputs, given by a “teacher”, and the goal is to learn a general rule that
maps inputs to outputs. The training process continues until the model achieves
the desired level of accuracy on the training data. Some real-life examples are:

○ Image Classification: You train with images/labels. Then in the future you give a new image, expecting that the computer will recognize the new object.
○ Market Prediction/Regression: You train the computer with historical market data and ask the computer to predict the new price in the future.
Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. It is used for clustering a population into different groups. Unsupervised learning can be a goal in itself (discovering hidden patterns in data).

○ Clustering: You ask the computer to separate similar data into clusters; this is essential in research and science.
○ High-Dimensional Visualization: Use the computer to help us visualize high-dimensional data.
○ Generative Models: After a model captures the probability
distribution of your input data, it will be able to generate more
data. This can be very useful to make your classifier more
robust.
Reinforcement learning: A computer program interacts with a dynamic
environment in which it must perform a certain goal (such as driving a vehicle or
playing a game against an opponent). The program is provided feedback in terms
of rewards and punishments as it navigates its problem space.

Steps in developing a machine learning application
1)Collect data. You could collect the samples by scraping a website and
extracting data, or you could get information from an RSS feed or an API.
You could have a device collect wind speed measurements and send them
to you, or blood glucose levels, or anything you can measure. The number
of options is endless. To save some time and effort, you could use publicly
available data.

2)Prepare the input data. Once you have this data, you need to make sure it's in a usable format.

3)Analyze the input data. This means looking at the data prepared in the previous step. You can also look at the data to see if you can recognize any patterns or if there's anything obvious, such as a few data points that are vastly different from the rest of the set.

4)Train the algorithm. This is where the machine learning takes place. This step and the next are where the "core" algorithms lie, depending on the algorithm. You feed the algorithm good clean data from the first two steps and extract knowledge or information.

5)Test the algorithm. This is where the information learned in the previous
step is put to use. When you’re evaluating an algorithm, you’ll test it to see
how well it does.

6)Use it. Here you make a real program to do some task, and once again
you see if all the previous steps worked as you expected.

The McCulloch-Pitts model


The McCulloch-Pitts model was an extremely simple artificial neuron. The inputs could be either a zero or a one, and the output was a zero or a one. Each input could be either excitatory or inhibitory.

The whole point was to sum the inputs. If an input is one and is excitatory in nature, it adds one to the sum. If it is one and inhibitory, it subtracts one from the sum. This is done for all inputs, and a final sum is calculated.

Here is a graphical representation of the McCulloch-Pitts model

The variables w1, w2 and w3 indicate which input is excitatory, and which
one is inhibitory. These are called "weights". So, in this model, if a weight is
1, it is an excitatory input. If it is -1, it is an inhibitory input.

x1, x2, and x3 represent the inputs. There could be more (or less) inputs if
required. And accordingly, there would be more 'w's to indicate if that
particular input is excitatory or inhibitory.
Now, if you think about it, you can calculate the sum using the 'x's and
'w's... something like this:

sum = x1w1 + x2w2 + x3w3 + ...

This is what is called a 'weighted sum'.

Now that the sum has been calculated, we check whether sum < T, where T is a threshold value. If it is, the output is made zero. Otherwise, it is made one.
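
To make the weighted-sum-and-threshold idea concrete, here is a minimal Python sketch (the function name and the AND-gate weights and threshold below are illustrative choices, not part of the original description):

```python
def mcculloch_pitts(inputs, weights, threshold):
    """McCulloch-Pitts neuron: weighted sum followed by a hard threshold.

    Inputs are 0 or 1; each weight is +1 (excitatory) or -1 (inhibitory).
    """
    total = sum(x * w for x, w in zip(inputs, weights))
    return 0 if total < threshold else 1

# Example: an AND gate -- both inputs excitatory, threshold T = 2,
# so the output is 1 only when both inputs are 1.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", mcculloch_pitts([x1, x2], [1, 1], threshold=2))
```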

NOR Gate

Note that this example uses two neurons. The first neuron receives the inputs you give. The second neuron works upon the output of the first neuron; it has no clue what the initial inputs were.

NAND Gate
We can also build a NAND gate with these neurons. A NAND gate gives a zero only when all inputs are 1. This gate needs 4 neurons: the outputs of the first three are the inputs for the fourth neuron. If you try the different combinations of inputs, you will see that the output matches the NAND truth table.

AND Gate (important)
Activation Functions
A neural network without an activation function is essentially just a linear regression model. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.

1). Linear Function :-

● Equation : The linear function has the equation of a straight line, i.e. y = ax.
● No matter how many layers we have, if all of them are linear in nature, the final activation of the last layer is nothing but a linear function of the input of the first layer.
● Range : -inf to +inf
● Uses : The linear activation function is used in just one place, i.e. the output layer.
● Issues : If we differentiate the linear function, the result no longer depends on the input "x" and the gradient becomes a constant, so it won't introduce any non-linear behaviour into our algorithm.

For example: Calculating the price of a house is a regression problem. The house price may take any large or small value, so we can apply a linear activation at the output layer. Even in this case the neural net must have a non-linear function at the hidden layers.

2). Sigmoid Function :-

● It is a function which is plotted as ‘S’ shaped graph.


● Equation :
A = 1/(1 + e^(-x))
● Nature : Non-linear. Notice that for X values between -2 and 2, the Y values are very steep. This means small changes in x bring about large changes in the value of Y.
● Value Range : 0 to 1
● Uses : Usually used in the output layer of a binary classifier, where the result is either 0 or 1. Since the value of the sigmoid function lies between 0 and 1 only, the result can easily be predicted to be 1 if the value is greater than 0.5 and 0 otherwise.

3). Tanh Function :- The activation that works almost always better than the sigmoid function is the Tanh function, also known as the Tangent Hyperbolic function. It is actually a mathematically shifted version of the sigmoid function. Both are similar and can be derived from each other.

Equation :-
f(x) = tanh(x) = 2/(1 + e^(-2x)) - 1

OR

tanh(x) = 2 * sigmoid(2x) - 1

● Value Range :- -1 to +1
● Nature :- non-linear
● Uses :- Usually used in hidden layers of a neural network. As its values lie between -1 and 1, the mean of a hidden layer's outputs comes out to be 0 or very close to it, which helps in centering the data. This makes learning for the next layer much easier.
4). ReLU :- Stands for Rectified Linear Unit. It is the most widely used activation function, chiefly implemented in the hidden layers of a neural network. Equation :- A(x) = max(0, x). It gives an output of x if x is positive and 0 otherwise.

● Value Range :- [0, inf)


● Nature :- non-linear, which means we can easily backpropagate the errors and have multiple layers of neurons being activated by the ReLU function.
● Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any time only a few neurons are activated, making the network sparse and therefore efficient and easy to compute.

In simple words, ReLU learns much faster than the sigmoid and Tanh functions.

5). Softmax Function :- The softmax function is also a type of sigmoid function but is handy when we are trying to handle classification problems.

● Nature :- non-linear
● Uses :- Usually used when trying to handle multiple classes. The softmax function squeezes the output for each class to between 0 and 1 and also divides by the sum of the outputs.
● Output :- The softmax function is ideally used in the output layer of the classifier, where we are actually trying to obtain the probabilities that define the class of each input.
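
As a quick reference, the activation functions above can be sketched in a few lines of NumPy (a minimal sketch; the example vector is arbitrary):

```python
import numpy as np

def linear(x, a=1.0):
    return a * x                      # range: (-inf, +inf)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # range: (0, 1)

def tanh(x):
    return np.tanh(x)                 # range: (-1, 1); equals 2*sigmoid(2x) - 1

def relu(x):
    return np.maximum(0, x)           # range: [0, inf)

def softmax(z):
    # Subtract the max for numerical stability; the outputs sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])        # arbitrary example inputs
print(sigmoid(z), relu(z), softmax(z))
```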
Random Search
Linear Regression
● Linear Regression is a machine learning algorithm based on supervised learning.

● It performs a regression task.

● Regression models a target prediction value based on independent variables.

● It is mostly used for finding out the relationship between variables and

forecasting.

Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x). So, this regression technique finds a linear relationship between x (input) and y (output). Hence the name Linear Regression.
In the figure above, X (input) is the work experience and Y (output) is the salary of a person. The regression line is the best-fit line for our model.

Hypothesis function for Linear Regression :

y_pred = θ1 + θ2·x

While training the model we are given:

x: input training data (univariate – one input variable (parameter))

y: labels for the data (supervised learning)

When training the model, it fits the best line to predict the value of y for a given value of x. The model gets the best regression fit line by finding the best θ1 and θ2 values.

θ1: intercept

θ2: coefficient of x

Once we find the best θ1 and θ2 values, we get the best-fit line. So when we finally use our model for prediction, it will predict the value of y for an input value of x.

Cost Function (J):

By achieving the best-fit regression line, the model aims to predict y values such that the error difference between the predicted value and the true value is minimal. So, it is very important to update the θ1 and θ2 values to reach the best values that minimize the error between the predicted y value (pred) and the true y value (y).
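
A common choice for J is the mean squared error, J(θ1, θ2) = (1/2n) Σ (predᵢ − yᵢ)², with predᵢ = θ1 + θ2·xᵢ. The sketch below minimizes it with plain gradient descent; the toy data, learning rate and iteration count are illustrative assumptions:

```python
import numpy as np

# Toy data: y is roughly 2x + 1 with a little noise (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

theta1, theta2 = 0.0, 0.0      # intercept and coefficient of x
lr, n = 0.01, len(x)

for _ in range(5000):
    pred = theta1 + theta2 * x              # hypothesis
    error = pred - y
    cost = (error ** 2).sum() / (2 * n)     # J(theta1, theta2)
    # Gradients of J with respect to theta1 and theta2
    theta1 -= lr * error.sum() / n
    theta2 -= lr * (error * x).sum() / n

print(f"intercept={theta1:.2f}, slope={theta2:.2f}, cost={cost:.4f}")
```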


Logistic Regression
Classification and Regression Trees
A Classification and Regression Tree (CART) is a predictive algorithm used in machine learning. It explains how a target variable's values can be predicted based on other values.
It is a decision tree where each fork is a split on a predictor variable and each node at the end has a prediction for the target variable.
The CART algorithm is an important decision tree algorithm that lies at the foundation of machine learning. Moreover, it is also the basis for other powerful machine learning algorithms like bagged decision trees, random forests and boosted decision trees.

(i) Classification Trees

A classification tree is an algorithm where the target variable is fixed or categorical. The algorithm is then used to identify the "class" within which a target variable would most likely fall.
An example of a classification-type problem would be determining who will or will not subscribe to a digital platform, or who will or will not graduate from high school. These are examples of simple binary classifications where the categorical dependent variable can assume only one of two mutually exclusive values. In other cases, you might have to predict among a number of different classes. For instance, you may have to predict which type of smartphone a consumer may decide to purchase.
In such cases, there are multiple values for the categorical dependent variable.
[Figure: a classic classification tree]

(ii) Regression Trees

A regression tree refers to an algorithm where the target variable is continuous and the algorithm is used to predict its value. As an example of a regression-type problem, you may want to predict the selling price of a residential house, which is a continuous dependent variable.
This will depend on both continuous factors like square footage as well as categorical factors like the style of home, the area in which the property is located, and so on.
[Figure: a regression tree]
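
For illustration, scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor use an optimized version of the CART algorithm; the tiny datasets below are made up purely to show the two kinds of trees:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: categorical target (0 = will not subscribe, 1 = will subscribe).
X_cls = [[18, 0], [25, 1], [40, 1], [60, 0]]   # illustrative features, e.g. [age, owns_smartphone]
y_cls = [0, 1, 1, 0]
clf = DecisionTreeClassifier(max_depth=2).fit(X_cls, y_cls)
print(clf.predict([[30, 1]]))

# Regression tree: continuous target (house selling price in $1000s).
X_reg = [[800], [1200], [1500], [2000]]        # square footage
y_reg = [150, 200, 240, 320]
reg = DecisionTreeRegressor(max_depth=2).fit(X_reg, y_reg)
print(reg.predict([[1600]]))
```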

Advantages of Classification and Regression Trees

The purpose of the analysis conducted by any classification or regression tree is to create a set of if-else conditions that allow for the accurate prediction or classification of a case.
Classification and regression trees work to produce accurate predictions or predicted classifications, based on the set of if-else conditions. They usually have several advantages over regular decision trees.
(i) The Results are Simplistic

The interpretation of results summarized in classification or regression trees is usually fairly simple. This simplicity is helpful in several ways.
(ii) Classification and Regression Trees are Nonparametric & Nonlinear

Because the results can be summarized in simple if-then conditions, there is no implicit assumption that the relationship between the predictors and the dependent variable is linear or follows any particular functional form.
(iii) Classification and Regression Trees Implicitly Perform Feature Selection

Feature selection or variable screening is an important part of analytics. When we use decision trees, the top few nodes on which the tree is split are the most important variables within the set. As a result, feature selection is performed automatically and we don't need to do it separately.
Limitations of Classification and Regression Trees

(i) Overfitting

Overfitting occurs when the tree takes into account a lot of noise that exists in the
data and comes up with an inaccurate result.
(ii) High variance

In this case, a small variance in the data can lead to a very high variance in the
prediction, thereby affecting the stability of the outcome.
(iii) Low bias

A decision tree that is very complex usually has low bias (and correspondingly high variance). This makes it very difficult for the model to generalize to new data.
Rule Based Classification

IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form −

IF condition THEN conclusion

Let us consider a rule R1,

R1: IF age = youth AND student = yes
    THEN buys_computer = yes

Points to remember −

● The IF part of the rule is called rule antecedent or precondition.

● The THEN part of the rule is called rule consequent.

● The antecedent part (the condition) consists of one or more attribute tests that are logically ANDed.

● The consequent part consists of class prediction.

Note − We can also write rule R1 as follows −

R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)

If the condition holds true for a given tuple, then the antecedent is satisfied.
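
As an illustration, a rule such as R1 can be applied to a tuple by testing its antecedent; the function and dictionary keys below are hypothetical and only mirror the attributes in R1:

```python
def rule_r1(tuple_):
    """R1: IF age = youth AND student = yes THEN buys_computer = yes."""
    # Antecedent: attribute tests, logically ANDed.
    if tuple_["age"] == "youth" and tuple_["student"] == "yes":
        return "yes"          # consequent: class prediction
    return None               # rule does not fire for this tuple

print(rule_r1({"age": "youth", "student": "yes"}))   # -> "yes"
print(rule_r1({"age": "senior", "student": "yes"}))  # -> None (antecedent not satisfied)
```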
Sequential Covering Algorithm
The Sequential Covering Algorithm can be used to extract IF-THEN rules from the training data. We do not need to generate a decision tree first. In this algorithm, each rule for a given class covers many of the tuples of that class.

Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed and the process continues for the remaining tuples. (By contrast, when rules are extracted from a decision tree, the path to each leaf corresponds to a rule.)
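
The general strategy can be sketched as a loop that learns one rule at a time and removes the tuples it covers. This is a hypothetical outline only; learn_one_rule is left abstract, since AQ, CN2 and RIPPER each realize it differently:

```python
def sequential_covering(examples, target_class, learn_one_rule, min_covered=1):
    """Learn a rule set for target_class, one rule at a time."""
    rules = []
    remaining = list(examples)
    while any(ex["class"] == target_class for ex in remaining):
        rule = learn_one_rule(remaining, target_class)   # e.g., a CN2-style search
        covered = [ex for ex in remaining if rule.covers(ex)]
        if len(covered) < min_covered:
            break                                        # no useful rule found
        rules.append(rule)
        # Remove the tuples covered by the new rule and continue with the rest.
        remaining = [ex for ex in remaining if not rule.covers(ex)]
    return rules
```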

Bayesian belief Network


Hidden Markov Model (HMM)
A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (i.e. hidden) states. Hidden Markov models are especially known for their applications in reinforcement learning and in temporal pattern recognition such as speech and handwriting recognition.
EXPECTATION MAXIMIZATION (EM)
Radial Basis Functions

RBF Network Architecture

An RBFN performs classification by measuring the input's similarity to examples from the training set. Each RBFN neuron stores a "prototype", which is just one of the examples from the training set. When we want to classify a new input, each neuron computes the Euclidean distance between the input and its prototype. Roughly speaking, if the input more closely resembles the class A prototypes than the class B prototypes, it is classified as class A.
The above illustration shows the typical architecture of an RBF Network. It consists
of an input vector, a layer of RBF neurons, and an output layer with one node per
category or class of data.
The Input Vector

The input vector is the n-dimensional vector that you are trying to classify. The entire
input vector is shown to each of the RBF neurons.
The RBF Neurons
Each RBF neuron stores a “prototype” vector which is just one of the vectors from the
training set. Each RBF neuron compares the input vector to its prototype, and outputs a
value between 0 and 1 which is a measure of similarity. If the input is equal to the
prototype, then the output of that RBF neuron will be 1. As the distance between the
input and prototype grows, the response falls off exponentially towards 0. The shape of
the RBF neuron’s response is a bell curve, as illustrated in the network architecture
diagram.
The neuron’s response value is also called its “activation” value.

The prototype vector is also often called the neuron’s “center”, since it’s the value at the
center of the bell curve.
The Output Nodes

The output of the network consists of a set of nodes, one per category that we are trying
to classify. Each output node computes a sort of score for the associated category.
Typically, a classification decision is made by assigning the input to the category with
the highest score.
RBF Neuron Activation Function
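
The bell-shaped response described above is commonly the Gaussian φ(x) = exp(−β‖x − μ‖²), where μ is the neuron's prototype (center) and β controls the width of the bell. A minimal sketch:

```python
import numpy as np

def rbf_activation(x, prototype, beta=1.0):
    """Gaussian RBF neuron: outputs 1 when x equals the prototype and decays
    towards 0 as the Euclidean distance between x and the prototype grows."""
    dist_sq = np.sum((np.asarray(x) - np.asarray(prototype)) ** 2)
    return np.exp(-beta * dist_sq)

print(rbf_activation([1.0, 2.0], [1.0, 2.0]))   # 1.0: input equals the prototype
print(rbf_activation([3.0, 4.0], [1.0, 2.0]))   # much closer to 0
```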

Support Vector Machines

"The support vector machine (SVM) is a supervised learning method that generates input-output mapping functions from a set of labeled training data."
A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the margin between the two classes.
The vectors (cases) that define the hyperplane are the support vectors.

Algorithm:

1. Define an optimal hyperplane: maximize margin.
2. Extend the above definition for non-linearly separable problems: have a penalty term for misclassifications.
3. Map data to a high dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data is mapped implicitly to this space.
To define an optimal hyperplane we need to maximize the width of the margin, which is 2/||w|| (so, equivalently, we minimize ||w||).

The beauty of SVM is that if the data is linearly separable, there is a unique global minimum value. An ideal SVM analysis should produce a hyperplane that completely separates the vectors (cases) into two non-overlapping classes. However, perfect separation may not be possible, or it may result in a model that does not classify new cases correctly. In this situation SVM finds the hyperplane that maximizes the margin and minimizes the misclassifications.
1. Maximum Margin Linear Separators

For the maximum margin hyperplane only examples on the margin matter (only these
affect the distances). These are called support vectors. The objective of the support
vector machine algorithm is to find a hyperplane in an N-dimensional space (N — the
number of features) that distinctly classifies the data points.
To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find a plane that has the maximum margin, i.e. the maximum distance between data points of both classes. Maximizing the margin
distance provides some reinforcement so that future data points can be classified with
more confidence.
Hyperplanes

Hyperplanes are decision boundaries that help classify the data points. Data points
falling on either side of the hyperplane can be attributed to different classes. Also, the
dimension of the hyperplane depends upon the number of features. If the number of
input features is 2, then the hyperplane is just a line. If the number of input features is 3,
then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine
when the number of features exceeds 3.
Support Vectors

Support vectors are data points that are closer to the hyperplane and influence the
position and orientation of the hyperplane. Using these support vectors, we maximize
the margin of the classifier. Deleting the support vectors will change the position of the
hyperplane. These are the points that help us build our SVM.

The support vectors are indicated by the circles around them.


To find the maximum margin separator, we have to solve the following optimization problem:
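
In the standard hard-margin form (maximizing the margin width 2/‖w‖ is equivalent to minimizing ½‖w‖²):

```latex
\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
```

With the soft-margin extension mentioned in step 2 of the algorithm, a penalty term C Σ ξᵢ is added to the objective and the constraints become yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ with ξᵢ ≥ 0.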

2. Quadratic Programming Solution to Finding Maximum Margin Separators

3. Kernels for Learning Non-Linear Functions

Linear models are nice and interpretable but have limitations: they can't learn "difficult" nonlinear patterns. Linear models rely on "linear" notions of similarity/distance, which don't work well if the patterns we want to learn are nonlinear.

Replacing the inner product with a kernel is known as the kernel trick or kernel substitution.
Kernels

Kernels, using a feature mapping φ, map the data to a new space where the original learning problem becomes "easy" (e.g., a linear model can be applied).
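
As a concrete illustration, an RBF (Gaussian) kernel k(x, z) = exp(−γ‖x − z‖²) lets a linear method separate data that is not linearly separable in the original space; the tiny XOR-style dataset below is purely illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style data: not linearly separable in the original 2-D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))  # cannot reach 100% on XOR
print("RBF kernel accuracy:", rbf_svm.score(X, y))         # typically 1.0
```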

Decision Tree Algorithm

The decision tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, the decision tree algorithm can be used for solving both regression and classification problems.
The general motive of using a decision tree is to create a training model which can be used to predict the class or value of target variables by learning decision rules inferred from prior data (training data).
Decision trees are easy to understand compared with other classification algorithms. The decision tree algorithm tries to solve the problem by using a tree representation. Each internal node of the tree corresponds to an attribute, and each leaf node corresponds to a class label.

Decision Tree Algorithm Pseudocode

1. Place the best attribute of the dataset at the root of the tree.
2. Split the training set into subsets. Subsets should be made in such a way that
each subset contains data with the same value for an attribute.
3. Repeat step 1 and step 2 on each subset until you find leaf nodes in all the
branches of the tree.

[Figure: Decision Tree classifier. Image credit: www.packtpub.com]

In decision trees, to predict a class label for a record we start from the root of the tree. We compare the values of the root attribute with the record's attribute. On the basis of this comparison, we follow the branch corresponding to that value and jump to the next node. We continue comparing our record's attribute values with other internal nodes of the tree until we reach a leaf node with the predicted class value. Now that we know how the modeled decision tree can be used to predict the target class or value, let's understand how we can create the decision tree model.
The popular attribute selection measures:

● Information gain
● Gini index
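
Both measures can be computed in a few lines. The sketch below is a generic illustration (the labels and the candidate split are made up) of how entropy-based information gain and the Gini index score a split:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the child subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

# Illustrative split: 10 labels split into two branches by some attribute test.
parent = ["yes"] * 6 + ["no"] * 4
left, right = ["yes"] * 5 + ["no"] * 1, ["yes"] * 1 + ["no"] * 3
print("info gain:", information_gain(parent, [left, right]))
print("gini parent:", gini(parent), "gini left:", gini(left))
```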

Problems for decision tree learning

Instances are represented by attribute-value pairs.

Instances are described by a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot). The easiest situation for decision tree learning is when each attribute takes on a small number of disjoint possible values (e.g. Hot, Mild, Cold). However, extensions to the basic algorithm allow handling real-valued attributes as well (e.g., representing Temperature numerically).
The target function has discrete output values.

The decision tree assigns a Boolean classification (e.g., yes or no) to each example. Decision tree methods easily extend to learning functions with more than two possible output values. A more substantial extension allows learning target functions with real-valued outputs, though the application of decision trees in this setting is less common.

The training data may contain errors.

Decision tree learning methods are robust to errors, both errors in the classifications of the training examples and errors in the attribute values that describe these examples.

The training data may contain missing attribute values. Decision tree
methods can be used even when some training examples have unknown
values (e.g., if the Humidity of the day is known for only some of the training
examples).

Independent Component Analysis (ICA)

Independent Component Analysis (ICA) is a machine learning technique to separate independent sources from a mixed signal. Unlike principal component analysis, which focuses on maximizing the variance of the data points, independent component analysis focuses on independence, i.e. independent components.


Problem: To extract independent sources' signals from a mixed signal composed of the signals from those sources.

Given: Mixed signal from five different independent sources.

Aim: To decompose the mixed signal into independent sources:

● Source 1
● Source 2
● Source 3
● Source 4
● Source 5

Solution: Independent Component Analysis (ICA).

Consider the Cocktail Party Problem, or Blind Source Separation problem, to understand the problem which is solved by independent component analysis.

Here, there is a party going on in a room full of people. There are 'n' speakers in that room and they are speaking simultaneously at the party. In the same room, there are also 'n' microphones placed at different distances from the speakers, which are recording the 'n' speakers' voice signals. Hence, the number of speakers is equal to the number of microphones in the room.

Now, using these microphones' recordings, we want to separate all the 'n' speakers' voice signals in the room, given that each microphone recorded the voice signals coming from each speaker with a different intensity due to the differences in distances between them. Decomposing the mixed signal of each microphone's recording into independent source speech signals can be done by using the machine learning technique independent component analysis.


[ X1, X2, ….., Xn ] => [ Y1, Y2, ….., Yn ]

where X1, X2, …, Xn are the original signals present in the mixed signal and Y1, Y2, …, Yn are the new features: independent components which are independent of each other.
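
A minimal sketch of this decomposition using scikit-learn's FastICA; the two synthetic sources and the mixing matrix below are made up purely to illustrate the call:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two illustrative independent sources (e.g., two "speakers").
t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]

# Mix them with an arbitrary mixing matrix to simulate the microphone recordings.
mixing = np.array([[1.0, 0.5], [0.4, 1.2]])
X = sources @ mixing.T          # observed mixed signals (one column per microphone)

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X)        # estimated independent components
print(Y.shape)                  # (2000, 2): one recovered source per column
```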
Difference between PCA and ICA

● PCA reduces the dimensions to avoid the problem of overfitting; ICA decomposes the mixed signal into its independent sources' signals.
● PCA deals with the principal components; ICA deals with the independent components.
● PCA focuses on maximizing the variance; ICA does not focus on the issue of variance among the data points.
● PCA focuses on the mutual orthogonality property of the principal components; ICA does not focus on the mutual orthogonality of the components.
● PCA does not focus on the mutual independence of the components; ICA focuses on the mutual independence of the components.


Dimensionality Reduction

There are two components of dimensionality reduction:

- Feature selection: In this, we try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem. It usually involves three approaches:
1) Filter
2) Wrapper
3) Embedded

- Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional space, i.e. a space with a smaller number of dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include:

● Principal Component Analysis (PCA)
● Linear Discriminant Analysis (LDA)
● Generalized Discriminant Analysis (GDA)

Dimensionality reduction may be either linear or non-linear, depending upon the method used.

Principal Component Analysis

Principal Component Analysis is basically a statistical procedure to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables.

Each of the principal components is chosen in such a way that it describes most of the still-available variance, and all of these principal components are orthogonal to each other. Among all the principal components, the first principal component has the maximum variance.

Uses of PCA:

● It is used to find the inter-relations between variables in the data.
● It is used to interpret and visualize data.
● As the number of variables decreases, it makes further analysis simpler.
● It is often used to visualize genetic distance and relatedness between populations.

PCA is basically performed on a square symmetric matrix. This can be a pure sums-of-squares-and-cross-products matrix, a covariance matrix, or a correlation matrix. A correlation matrix is used if the individual variances differ greatly.

Objectives of PCA:

● It is basically a non-dependent procedure in which it reduces the attribute space from a large number of variables to a smaller number of factors.
● The main task in PCA is to select a subset of variables from a larger set, based on which original variables have the highest correlation with the principal components.
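
A minimal scikit-learn sketch (the random correlated data is illustrative only); the components are orthogonal and the first explains the largest share of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples of 3-D data; the third variable is correlated with the first (illustrative only).
X = rng.normal(size=(200, 3))
X[:, 2] = 2 * X[:, 0] + 0.1 * rng.normal(size=200)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # project onto the top 2 principal components

print(pca.explained_variance_ratio_)  # first component carries the largest share of variance
print(X_reduced.shape)                # (200, 2)
```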
Derivative-Free vs Derivative-Based Optimization

● Derivative-free optimization does not derive or use derivative information about the objective function; derivative-based optimization does.
● Derivative-free optimization makes use of evolutionary concepts; derivative-based optimization does not.
● Derivative-free optimization is slower; derivative-based optimization is faster.
● Derivative-free optimization makes use of a random number generator to find the search directions; derivative-based optimization does not.
● In derivative-free optimization no analysis is done, due to the randomness; in derivative-based optimization analysis is performed at every step.
● Derivative-free optimization has no need of a differentiable function; derivative-based optimization needs a differentiable function.
● Derivative-free optimization uses some natural wisdom based on evolution and thermodynamics; derivative-based optimization uses no natural wisdom.
● Example techniques: Simulated Annealing (derivative-free); Descent Method and Newton's Method (derivative-based).

Newton's method
STEEPEST DESCENT
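
As a brief illustration of the two derivative-based techniques named above, here is a minimal sketch that minimizes the simple function f(x) = (x − 3)² + 1; the function, learning rate and iteration count are illustrative choices:

```python
def f_prime(x):          # first derivative of f(x) = (x - 3)**2 + 1
    return 2 * (x - 3)

def f_double_prime(x):   # second derivative (constant for this quadratic)
    return 2.0

# Steepest descent: repeatedly step against the gradient with a fixed learning rate.
x = 0.0
for _ in range(50):
    x -= 0.1 * f_prime(x)
print("steepest descent:", round(x, 4))   # approaches the minimum at x = 3

# Newton's method: use curvature information; converges in one step on a quadratic.
x = 0.0
x -= f_prime(x) / f_double_prime(x)
print("Newton's method:", x)              # exactly 3.0
```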
