Supervised learning: The computer is presented with example inputs and their
desired outputs, given by a “teacher”, and the goal is to learn a general rule that
maps inputs to outputs. The training process continues until the model achieves
the desired level of accuracy on the training data. Real-life examples include
spam filtering and house-price prediction. A typical workflow looks like this:
1) Collect the data.
2) Prepare the input data. Once you have this data, you need to make sure
it’s in a usable format.
3) Analyze the input data. This means looking over the data you prepared in the
previous task, checking whether you can recognize any patterns and whether
anything is obviously wrong, such as a few data points that are vastly different
from the rest of the set.
4) Train the algorithm. This is where the machine learning takes place. This
step and the next one are where the “core” algorithms lie. You feed the
algorithm the good, clean data from the first two steps and extract knowledge
or information.
5) Test the algorithm. This is where the information learned in the previous
step is put to use. When you’re evaluating an algorithm, you’ll test it to see
how well it does.
6) Use it. Here you make a real program to do some task, and once again
you see if all the previous steps worked as you expected, as in the sketch below.
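To make the six steps concrete, here is a minimal sketch in Python using scikit-learn and its built-in iris dataset; the library, dataset, and model choice are assumptions for illustration, as the text names none.

# A minimal sketch of steps 1-6 with scikit-learn (an assumption; the text
# names no particular library) on a toy built-in dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1) + 2) Collect and prepare the input data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 3) Analyze the input data, e.g. look for obvious outliers.
print("feature means:", X_train.mean(axis=0))

# 4) Train the algorithm on the good, clean data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 5) Test the algorithm on data it has not seen.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 6) Use it: the fitted model can now serve predictions in a real program.
print("prediction:", model.predict(X_test[:1]))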
Now the whole point was to sum the inputs. If an input is 1 and is
excitatory in nature, it adds 1 to the sum. If it is 1 and inhibitory, it
subtracts 1 from the sum. This is done for all inputs, and a final sum is
calculated.
The variables w1, w2 and w3 indicate which input is excitatory and which
one is inhibitory. These are called "weights". So, in this model, if a weight is
1, its input is excitatory; if it is -1, its input is inhibitory.
x1, x2 and x3 represent the inputs. There could be more (or fewer) inputs if
required, and accordingly there would be more 'w's to indicate whether each
particular input is excitatory or inhibitory.
Now, if you think about it, you can calculate the sum using the 'x's and
'w's, like this:
sum = x1*w1 + x2*w2 + x3*w3
Now that the sum has been calculated, we check whether sum < T. If it is,
then the output is made zero. Otherwise, it is made a one.
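Here is a small sketch of this neuron in Python; the function name mp_neuron and the AND-gate demo are illustrative choices, not from the text.

# A sketch of the McCulloch-Pitts neuron described above: weights of +1
# (excitatory) or -1 (inhibitory), a weighted sum, and a threshold T.
def mp_neuron(xs, ws, T):
    """Return 1 if the weighted sum of the inputs reaches the threshold T."""
    s = sum(x * w for x, w in zip(xs, ws))
    return 1 if s >= T else 0

# Illustrative demo (not from the text): an AND gate needs both inputs
# excitatory and threshold 2, so both must fire.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", mp_neuron([x1, x2], [1, 1], T=2))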
NOR Gate
Note that this example uses two neurons. The first neuron receives the
inputs you give. The second neuron works upon the output of the first
neuron; it has no clue what the initial inputs were.
NAND Gate
We can also build a NAND gate with these neurons. A NAND gate gives a zero
only when all inputs are 1. This gate needs 4 neurons: the output of the first
three is the input for the fourth neuron. If you try the different combinations
of inputs, you can verify the truth table, as in the sketch below.
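Reusing mp_neuron from the sketch above, one plausible reading of the 4-neuron description is a 3-input NAND built as OR(NOT a, NOT b, NOT c): three NOT neurons feed one OR neuron, and the fourth neuron never sees the original inputs.

# A plausible reading (an assumption) of the text's 4-neuron NAND: three NOT
# neurons (inhibitory weight -1, threshold 0) invert each input, and a fourth
# OR neuron (threshold 1) combines their outputs, since
# NAND(a, b, c) = OR(NOT a, NOT b, NOT c).
def nand3(a, b, c):
    nots = [mp_neuron([x], [-1], T=0) for x in (a, b, c)]
    return mp_neuron(nots, [1, 1, 1], T=1)

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(a, b, c, "->", nand3(a, b, c))  # 0 only when a = b = c = 1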
AND Gate
Activation Functions
A neural network without an activation function is essentially just a linear regression
model. The activation function performs the non-linear transformation of the input,
making the network capable of learning and performing more complex tasks.
1). Linear Function :-
● Equation :- A linear function has an equation similar to that of a straight line, i.e.
y = ax
● No matter how many layers we have, if all of them are linear in nature, the
final activation of the last layer is still just a linear function of the
input of the first layer.
● Range : -inf to +inf
● Uses : Linear activation function is used at just one place i.e. output layer.
● Issues :- If we differentiate a linear function, the result no longer depends
on the input “x”: the derivative is a constant. So stacking such layers won’t
introduce any ground-breaking behavior to our algorithm.
For example : Calculating the price of a house is a regression problem. A house price
may take any large or small value, so we can apply a linear activation at the output
layer. Even in this case, the neural net must have a non-linear function at the hidden
layers.
3). Tanh Function :- The activation that almost always works better than the sigmoid
function is the Tanh function, also known as the Tangent Hyperbolic function. It is
actually a mathematically shifted version of the sigmoid function. Both are similar and
can be derived from each other.
Equation :-
f(x) = tanh(x) = 2/(1 + e^(-2x)) - 1
OR
tanh(x) = 2 * sigmoid(2x) - 1
● Value Range :- -1 to +1
● Nature :- non-linear
● Uses :- Usually used in hidden layers of a neural network, as its values lie
between -1 and 1, so the mean of the hidden-layer activations comes out to be
0 or very close to it. This helps center the data by bringing the mean close
to 0, which makes learning for the next layer much easier.
4). RELU :- Stands for Rectified Linear Unit. It is the most widely used activation
function, chiefly implemented in the hidden layers of a neural network.
Equation :- A(x) = max(0, x). It gives an output x if x is positive and 0 otherwise.
In simple words, RELU learns much faster than the sigmoid and Tanh functions.
5). Softmax Function :- The softmax function is also a type of sigmoid function, but it
is handy when we are trying to handle multi-class classification problems.
● Nature :- non-linear
● Uses :- Usually used when trying to handle multiple classes. The softmax
function would squeeze the outputs for each class between 0 and 1 and
would also divide by the sum of the outputs.
● Output :- The softmax function is ideally used in the output layer of the
classifier, where we are actually trying to attain the probabilities that define
the class of each input.
All of the activation functions above are sketched in code below.
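The sketch below implements the functions just discussed with NumPy; the test values are arbitrary, and the max-subtraction in softmax is a standard numerical-stability detail, not something from the text.

import numpy as np

# Sketches of the activation functions discussed above.
def linear(x, a=1.0):           # range (-inf, +inf); used at the output layer
    return a * x

def tanh(x):                    # range (-1, 1); zero-centered
    return np.tanh(x)

def relu(x):                    # 0 for negative inputs, identity otherwise
    return np.maximum(0, x)

def softmax(z):                 # squeezes scores into probabilities summing to 1
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])  # arbitrary test values
print(relu(z), tanh(z), softmax(z), sep="\n")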
Linear Regression
● Linear Regression is a machine learning algorithm based on supervised learning.
● It is mostly used for finding out the relationship between variables and
forecasting.
Linear regression performs the task of predicting a dependent variable value (y) based
on a given independent variable (x). So, this regression technique finds a linear
relationship between x (input) and y (output). Hence the name Linear Regression.
For example, X (input) could be the work experience and Y (output) the salary of a
person. The regression line is the best-fit line for our model.
When training the model, it fits the best line to predict the value of y for a given
value of x. The model gets the best regression fit line by finding the best θ1 and θ2
values:
θ1: intercept
θ2: coefficient of x
Once we find the best θ1 and θ2 values, we get the best fit line. So when we are finally
using our model for prediction, it will predict the value of y for the input value of x.
By achieving the best-fit regression line, the model aims to predict the y value such
that the error difference between the predicted value and the true value is minimized.
So it is very important to update the θ1 and θ2 values to reach the best values that
minimize the error.
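A minimal sketch of this fitting process, using gradient descent on the mean squared error with the θ1/θ2 notation above; the experience/salary numbers are invented toy data.

import numpy as np

# Fit y = theta1 + theta2 * x by gradient descent on the mean squared error.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # toy data: years of experience
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0]) # toy data: salary

theta1, theta2, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    err = theta1 + theta2 * x - y            # prediction error per point
    theta1 -= lr * err.mean()                # gradient w.r.t. the intercept
    theta2 -= lr * (err * x).mean()          # gradient w.r.t. the coefficient

print(f"intercept = {theta1:.2f}, coefficient = {theta2:.2f}")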
Regression Trees
A regression tree refers to an algorithm where the target variable is continuous and
the algorithm is used to predict its value. As an example of a regression-type problem,
you may want to predict the selling prices of a residential house, which is a
continuous dependent variable.
This will depend on both continuous factors like square footage as well as
categorical factors like the style of home, area in which the property is located and
so on.
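As a hedged sketch of this example, scikit-learn's DecisionTreeRegressor can fit such data; the numbers below are invented, and the integer-coded style column stands in for a categorical factor that would normally be one-hot encoded.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# columns: square footage, home style as an integer code (toy assumption)
X = np.array([[1200, 0], [1500, 0], [1700, 1], [2100, 1], [2500, 2]])
y = np.array([190_000, 230_000, 260_000, 320_000, 400_000])  # toy selling prices

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[1600, 1]]))  # predicted price for an unseen house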
Limitations of Classification and Regression Trees
(i) Overfitting
Overfitting occurs when the tree takes into account a lot of noise that exists in the
data and comes up with an inaccurate result.
(ii) High variance
In this case, a small variance in the data can lead to a very high variance in the
prediction, thereby affecting the stability of the outcome.
(iii) Low bias
A decision tree that is very complex usually has a low bias. This makes it very
difficult for the model to incorporate any new data.
Rule Based Classification
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can
express a rule in the following form:
IF condition THEN conclusion
For example: IF age = youth AND student = yes THEN buys_computer = yes.
Points to remember −
● The antecedent part, the condition, consists of one or more attribute tests
that are logically ANDed. If the condition holds true for a given tuple, then
the antecedent is satisfied.
Sequential Covering Algorithm
The Sequential Covering Algorithm can be used to extract IF-THEN rules from the training
data. We do not need to generate a decision tree first. In this algorithm, each rule
for a given class covers many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the
general strategy, the rules are learned one at a time. Each time a rule is learned,
the tuples covered by the rule are removed and the process continues for the rest of
the tuples. (By contrast, in a decision tree the path to each leaf corresponds to a
rule, so a tree effectively learns a set of rules simultaneously.) A toy sketch of the
strategy is shown below.
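Below is a toy sketch of the general strategy: learn one rule at a time, remove the covered tuples, repeat. Real algorithms such as RIPPER grow multi-condition rules and handle noise; this sketch only considers single-attribute tests and assumes a perfect rule exists in the invented data.

def learn_one_rule(tuples, target):
    """Pick the single attribute test (attr, value) that best covers `target`."""
    best, best_acc = None, 0.0
    for attrs, _ in tuples:
        for a, v in attrs.items():
            covered = [lab for at, lab in tuples if at.get(a) == v]
            acc = covered.count(target) / len(covered)
            if acc > best_acc:
                best, best_acc = (a, v), acc
    return best

def sequential_covering(tuples, target):
    """Learn IF-THEN rules for `target` one at a time, removing covered tuples."""
    rules, remaining = [], list(tuples)
    while any(lab == target for _, lab in remaining):
        a, v = learn_one_rule(remaining, target)
        rules.append(f"IF {a} = {v} THEN class = {target}")
        # remove every tuple the new rule covers, keep the rest
        remaining = [(at, lab) for at, lab in remaining if at.get(a) != v]
    return rules

data = [({"age": "youth", "student": "yes"}, "buys"),   # invented toy tuples
        ({"age": "youth", "student": "no"}, "no"),
        ({"age": "senior", "student": "yes"}, "buys"),
        ({"age": "senior", "student": "no"}, "no")]
print(sequential_covering(data, "buys"))  # ['IF student = yes THEN class = buys']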
RBF Networks
The input vector is the n-dimensional vector that you are trying to classify. The entire
input vector is shown to each of the RBF neurons.
The RBF Neurons
Each RBF neuron stores a “prototype” vector which is just one of the vectors from the
training set. Each RBF neuron compares the input vector to its prototype, and outputs a
value between 0 and 1 which is a measure of similarity. If the input is equal to the
prototype, then the output of that RBF neuron will be 1. As the distance between the
input and prototype grows, the response falls off exponentially towards 0. The shape of
the RBF neuron’s response is a bell curve, as illustrated in the network architecture
diagram.
The neuron’s response value is also called its “activation” value.
The prototype vector is also often called the neuron’s “center”, since it’s the value at the
center of the bell curve.
The Output Nodes
The output of the network consists of a set of nodes, one per category that we are trying
to classify. Each output node computes a sort of score for the associated category.
Typically, a classification decision is made by assigning the input to the category with
the highest score.
RBF Neuron Activation Function
Each RBF neuron computes a measure of the similarity between the input and its
prototype vector. One common choice, consistent with the bell-curve description above,
is the Gaussian φ(x) = exp(−β‖x − μ‖²), where μ is the neuron’s prototype (center) and
β controls the width of the bell curve.
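A minimal sketch of the forward pass this describes, with Gaussian RBF activations and one score per category; the prototypes, β, and output weights are toy values, not learned ones.

import numpy as np

def rbf_activations(x, prototypes, beta=1.0):
    d2 = ((prototypes - x) ** 2).sum(axis=1)  # squared distance to each center
    return np.exp(-beta * d2)                 # 1 at the prototype, falls to 0 far away

prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])  # toy: one center per RBF neuron
W = np.array([[1.0, -0.5],                       # toy output weights,
              [-0.5, 1.0]])                      # one row per category

x = np.array([0.9, 1.1])                 # input vector to classify
phi = rbf_activations(x, prototypes)     # shown to every RBF neuron
scores = W @ phi                         # each output node scores its category
print("activations:", phi)
print("predicted category:", scores.argmax())  # category with the highest score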
Support Vector Machines (SVM)
The beauty of SVM is that if the data is linearly separable, there is a unique global
minimum value. An ideal SVM analysis should produce a hyperplane that completely
separates the vectors (cases) into two non-overlapping classes. However, perfect
separation may not be possible, or it may result in a model with so many cases that the
model does not classify correctly. In this situation, SVM finds the hyperplane that
maximizes the margin and minimizes the misclassifications.
1. Maximum Margin Linear Separators
For the maximum margin hyperplane only examples on the margin matter (only these
affect the distances). These are called support vectors. The objective of the support
vector machine algorithm is to find a hyperplane in an N-dimensional space (N — the
number of features) that distinctly classifies the data points.
To separate the two classes of data points, there are many possible hyperplanes that
could be chosen. Our objective is to find a plane that has the maximum margin, i.e. the
maximum distance between data points of both classes. Maximizing the margin
distance provides some reinforcement so that future data points can be classified with
more confidence.
Hyperplanes
Hyperplanes are decision boundaries that help classify the data points. Data points
falling on either side of the hyperplane can be attributed to different classes. Also, the
dimension of the hyperplane depends upon the number of features. If the number of
input features is 2, then the hyperplane is just a line. If the number of input features is 3,
then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine
when the number of features exceeds 3.
Support Vectors
Support vectors are data points that are closer to the hyperplane and influence the
position and orientation of the hyperplane. Using these support vectors, we maximize
the margin of the classifier. Deleting the support vectors will change the position of the
hyperplane. These are the points that help us build our SVM.
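As a hedged illustration, scikit-learn's SVC with a linear kernel exposes the support vectors it finds; the points are toy data and the large C value is an assumption that approximates a hard margin.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],   # toy points, class 0
              [4, 4], [5, 4], [4, 5]])  # toy points, class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C approximates a hard margin
print("support vectors:\n", clf.support_vectors_)   # the margin-defining points
print("w =", clf.coef_[0], "b =", clf.intercept_[0])  # the hyperplane w.x + b = 0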
Linear models are nice and interpretable but have limitations: they can’t learn
“difficult” nonlinear patterns. Linear models rely on “linear” notions of
similarity/distance, which don’t work well if the patterns we want to learn are
nonlinear.
Replacing the inner product with a kernel is known as the kernel trick or kernel
substitution.
Kernels
Kernels, using a feature mapping φ, map data to a new space where the original
learning problem becomes “easy” (e.g., a linear model can be applied).
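A quick numerical check of the kernel trick: for the polynomial kernel K(x, z) = (x·z)² in two dimensions, the feature map φ(x) = (x1², √2·x1·x2, x2²) gives exactly the kernel value, so a learner can work in the mapped space without ever computing φ. The test vectors are arbitrary.

import numpy as np

def phi(v):
    # explicit feature map for the kernel K(x, z) = (x . z)^2 in 2-D
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])  # arbitrary test vectors
print(np.dot(x, z) ** 2)        # kernel value:        121.0
print(np.dot(phi(x), phi(z)))   # inner product in the mapped space: 121.0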
Decision Trees
1. Place the best attribute of the dataset at the root of the tree.
2. Split the training set into subsets. Subsets should be made in such a way that
each subset contains data with the same value for an attribute.
3. Repeat step 1 and step 2 on each subset until you find leaf nodes in all the
branches of the tree.
In decision trees, to predict a class label for a record we start from the root of the
tree. We compare the value of the root attribute with the record’s attribute. On the
basis of this comparison, we follow the branch corresponding to that value and jump to
the next node. We continue comparing our record’s attribute values with the other
internal nodes of the tree until we reach a leaf node with the predicted class value.
Now that we know how the modeled decision tree can be used to predict the target class
or value, let’s understand how we can create the decision tree model.
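This prediction walk is easy to sketch if the tree is stored, say, as nested dictionaries (the representation and the weather attributes are assumptions; the text prescribes neither).

# Toy tree: internal nodes name an attribute and branch on its values;
# leaves hold the predicted class.
tree = {"attr": "outlook",
        "branches": {"sunny": {"attr": "humidity",
                               "branches": {"high": "no", "normal": "yes"}},
                     "rain": "yes"}}

def predict(node, record):
    while isinstance(node, dict):                  # internal node: keep walking
        node = node["branches"][record[node["attr"]]]
    return node                                    # leaf: the predicted class

print(predict(tree, {"outlook": "sunny", "humidity": "normal"}))  # -> yes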
The popular attribute selection measures (both sketched in code below):
● Information gain
● Gini index
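Both measures can be sketched directly from the class labels at a node; the label counts below are toy data.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def information_gain(parent, children):
    """Entropy reduction from splitting `parent` into the `children` subsets."""
    n = len(parent)  # assumes the children partition the parent node
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

labels = ["yes"] * 9 + ["no"] * 5  # toy class counts at a node
print(entropy(labels), gini(labels))
print(information_gain(labels, [["yes"] * 6 + ["no"], ["yes"] * 3 + ["no"] * 4]))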
The decision tree assigns a Boolean classification (e.g., yes or no) to each
example. Decision tree methods easily extend to learning functions with
more than two possible output values. A more substantial extension allows
learning target functions with real-valued outputs, though the application of
decision trees in this setting is less common.
The training data may contain missing attribute values. Decision tree
methods can be used even when some training examples have unknown
values (e.g., if the Humidity of the day is known for only some of the training
examples).
Independent Component Analysis (ICA)
Independent Component Analysis is a machine learning technique used to separate
independent sources from a mixed signal. Unlike principal component analysis, which
focuses on maximizing the variance of the data points, independent component analysis
focuses on the independence of the recovered components.
Consider the Cocktail Party Problem, or Blind Source Separation problem, to understand
the main idea behind ICA. Suppose ‘n’ speakers are present in a room and they are
speaking simultaneously at the party. In the same room, there are also ‘n’ microphones
placed at different distances from the speakers, which are recording the ‘n’ speakers’
voice signals; the number of speakers is thus equal to the number of microphones.
Now, using these microphones’ recordings, we want to separate all the ‘n’ speakers’
voice signals in the room, given that each microphone recorded the voice signals coming
from each speaker at a different intensity due to the differences in distance between
them. Decomposing the mixed signal of each microphone’s recording into independent
source speech signals can be done using the machine learning technique Independent
Component Analysis:
[Y1, Y2, …, Yn] = W [X1, X2, …, Xn]
where X1, X2, …, Xn are the original signals present in the mixed signal, W is an
unmixing matrix, and Y1, Y2, …, Yn are the new features: independent components that
are independent of each other.
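A hedged sketch of this separation with scikit-learn's FastICA: two known sources are mixed by a matrix (standing in for the room and the microphone distances) and then recovered from the mixtures alone. The signals and mixing matrix are invented.

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]  # two toy independent sources
A = np.array([[1.0, 0.5], [0.5, 1.0]])            # toy mixing matrix (the "room")
X = S @ A.T                                       # the microphone recordings

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X)   # estimated independent components Y1, Y2
print(Y.shape)             # (2000, 2): one recovered signal per source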
Difference between PCA and ICA
● PCA reduces the dimensions to avoid the problem of overfitting; ICA decomposes the
mixed signal into its independent sources’ signals.
● PCA deals with the Principal Components; ICA deals with the Independent Components.
● PCA focuses on the mutual orthogonality property of the principal components; ICA
does not focus on the mutual orthogonality of the components.
- Feature selection: In this, we try to find a subset of the original set of variables, or
features, to get a smaller subset which can be used to model the problem. It usually
involves three ways:
1) Filter
2) Wrapper
3) Embedded
- Feature extraction: This reduces the data in a high-dimensional space to a
lower-dimensional space, i.e. a space with a smaller number of dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include Principal Component
Analysis (PCA), Linear Discriminant Analysis (LDA), and Generalized Discriminant
Analysis (GDA). In PCA, the new variables, the principal components, are linear
combinations of the original variables.
Each of the principal components is chosen in such a way that it describes most of the
still-available variance, and all these principal components are orthogonal to each
other. Among all the principal components, the first has the maximum variance.
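A minimal sketch of this with NumPy: eigendecompose the covariance matrix, sort so the first principal component carries the maximum variance, and check that the components are mutually orthogonal. The data is randomly generated toy data.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3, 0, 0], [1, 1, 0], [0, 0, 0.2]])
Xc = X - X.mean(axis=0)                 # center the data

cov = np.cov(Xc, rowvar=False)          # square symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
order = eigvals.argsort()[::-1]         # sort so PC1 has maximum variance
components = eigvecs[:, order]

print("variance along each PC:", eigvals[order])
print("PCs are orthogonal:", np.allclose(components.T @ components, np.eye(3)))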
Uses of PCA:
These are basically performed on a square symmetric matrix. It can be a pure
sums-of-squares-and-cross-products matrix, a covariance matrix, or a correlation matrix.
Objectives of PCA:
Random Search vs Steepest Descent:
● Random search is slower; steepest descent is faster.
● Random search makes use of a random number generator to find the search directions;
steepest descent does not.
● In random search, no analysis is done due to the randomness; in steepest descent,
analysis is performed at every step, as the sketch below illustrates.
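For contrast with random search, here is a minimal sketch of steepest descent on a toy quadratic: the search direction at every step comes from analyzing the function (its gradient), not from a random number generator. The function, starting point, and step size are arbitrary choices.

import numpy as np

def grad(x):                       # gradient of f(x, y) = x^2 + 10*y^2
    return np.array([2 * x[0], 20 * x[1]])

x = np.array([5.0, 2.0])           # arbitrary starting point
for i in range(100):
    x = x - 0.04 * grad(x)         # step against the gradient at every iteration
print("minimum found near:", x)    # converges towards (0, 0)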
Newton's method
Steepest Descent