
NED University of Engineering & Technology


Electrical Engineering Department

PROJECT REPORT
Machine Learning and Deep Neural Networks
Course Code EE-264
Name of Students Arman Ahmed Ansari (EE-053)
M. Haziq Saleem (EE-027)
Ashar Mujeeb (EE-047)
Class S.E
Section A

Machine Learning and Deep Neural Networks


Ansari, A., Saleem, H. and Mujeeb, A.

Abstract— Machine learning is now implemented on a large scale in computational programs to gain speed and accuracy. This paper introduces machine learning, discusses its feasibility over the years and its applications, and presents a detailed analytical study of deep neural networks: their architecture, including neurons, weights, biases, input/output layers and hidden layers, and the communication between them. The methods used in training neural networks are also discussed, such as activation by the sigmoid function, finding the changes required in the network by the gradient-descent method, and making the corresponding adjustments through backpropagation, along with how neural networks are deployed to help a computer recognize certain patterns, images or models. The study reflects how a machine is continuously taught or trained by means of efficient programs and algorithms, turning the computer into a self-learning machine that improves its performance as it is used.

Index Terms—Machine Learning, Neural Networks, Deep Learning, Deep Neural Network, Artificial Intelligence

I. INTRODUCTION

Machine Learning is a branch of computer science that involves computers learning from data to accomplish a task without the need for explicit human intervention or programming. The term “Machine Learning” was coined by Arthur Samuel in 1959 [1]. It employs algorithms that are capable of learning from large amounts of data and making predictions based on their previous experience with the data. These buck the common trend of explicitly programming computers to perform tasks.
A Neural Network is one such example of a learning algorithm; it is used to model the link between the input and output of the algorithm and to detect recurring patterns in it.

Deep Neural Networks are an offshoot of Neural Networks which include multiple “hidden layers” in the algorithm, which contributes to increased complexity and more advanced modeling of problems.

II. FEASIBILITY OF MACHINE LEARNING

During the advent of Artificial Intelligence, when it was first being considered as an academic discipline, there were many who showed interest in the possibility of A.I. being so advanced as to deal with problems by learning from data. This possibility was approached with the help of various methods, as well as with what sufficed for “Neural Networks” at that time, which were rather primitive as they were merely general statistical models in a different form, such as “perceptrons”.

Fig. 1. The Mark I Perceptron machine was the first implementation of the perceptron algorithm [2].

Machine Learning and Artificial Intelligence were separated as concepts because of a greater regard for logic and a pre-acquired, knowledge-based approach toward programming. Machine Learning was also held back due to the difficulty of obtaining large-scale data and a lack of the computational power required to represent and process it.

During the decade of 1980, “expert systems” ruled over AI and the statistical approach was not recommended. Also, study on neural networks within the branches of Artificial Intelligence and Computer Science had been left behind.
Fig. 2. A Symbolics Lisp Machine: an early platform for expert systems [3].

However, researchers from other departments of science continued their research on neural networks, naming it “connectionism”. Their most significant achievement was the work on and renewal of backpropagation halfway through the 1980s.

By the 1990s, there was a marked increase in the progress of machine learning as an individual field. The field aimed more towards solving issues regarding real-world applications. It tilted more towards theories regarding statistics and probability rather than the knowledge-based methodology it had acquired from artificial intelligence. Its advancement was boosted and motivated by the greater obtainability of digital information and the ease of transferring data and information over the internet.

Nowadays, a great deal of in-depth work is being carried out within machine learning. Its boundaries are being expanded to greater extents day by day by hard-working data science researchers and computer science experts. There are various reasons for this flourishing of the field. The biggest reason is the flood of available data, thanks to the advancement of internet communication. Large-scale data is now accessible and transferrable at greater speeds than ever, which is significant in order to feed the data samples and train the machine with huge chunks of information.

The second main reason is the ever increasing computational power and processing efficiency, which enables machine learning algorithms to run at greater speeds and perform better within limited time and space. This motivates programmers to work further in the field of machine learning. For example, the Computational Intelligence, Learning, Vision and Robotics (CILVR) lab of New York University's Data Science Center has recently acquired an NVIDIA DGX-1 AI supercomputer to fuel their work on Machine Learning.

Fig. 3. NVIDIA's DGX-1 is a computer tailor-made for deep learning.

The third reason that deserves to be mentioned here is the increasing availability of different algorithms and theories developed by researchers. There are various famous algorithms that are efficient and are being used nowadays for different real-world applications, among them the Naïve Bayes Classifier, Linear Regression, Logistic Regression, Artificial Neural Networks and Random Forests [4].

One other reason for the growing utilization of machine learning is the increasing investment and support from different industries, which helps with the costs and expenditures required for the state-of-the-art machines needed to run the computationally complex algorithms of machine learning.

III. ARTIFICIAL NEURAL NETWORKS

To understand an Artificial Neural Network we will look at a simple example of an artificial neuron called the perceptron, whose inception came about in the 1950s–1960s era. Although it has fallen out of favor, it is useful for gaining an understanding of Neural Networks. A perceptron has many binary inputs and produces a single binary output; the output is computed with the help of simple weights, which are real numbers assigned to assert the value of any given input. The output of the perceptron depends entirely on whether the weighted sum exceeds or falls short of a given value called the threshold. If the inputs are taken as x1, x2, x3, ... and the weights as w1, w2, w3, ..., then the weighted sum can be written as

∑j wj xj

Hence, if this sum is greater than the threshold value then the output is 1 (as the neuron is binary), and if it falls short of that value then the output is 0.
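To make the threshold rule above concrete, here is a minimal Python sketch of the perceptron decision just described. The particular inputs, weights and threshold are hypothetical values chosen only for illustration; they do not come from this report.

```python
import numpy as np

def perceptron_output(x, w, threshold):
    """Binary perceptron: output 1 only if the weighted sum of the
    inputs exceeds the threshold, otherwise output 0."""
    weighted_sum = np.dot(w, x)            # sum_j w_j * x_j
    return 1 if weighted_sum > threshold else 0

# Hypothetical example: three binary inputs with hand-picked weights.
x = np.array([1, 0, 1])
w = np.array([0.6, 0.2, 0.3])
print(perceptron_output(x, w, threshold=0.5))   # prints 1, since 0.9 > 0.5
```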
Fig. 4. A Simple Neural Network.

This example of a neural network has a second layer, which receives weighted input from the first layer, called the input layer. The second layer then further weighs these inputs and gives an output; this layer allows for greater abstraction, more advanced decision making and better accuracy, and further layers can be added to expand upon this functionality.

Before moving further we will tackle the issue of notation and represent the weighted sum as a dot product of w and x, i.e.

w ⋅ x ≡ ∑j wj xj

and instead of a fixed threshold value we will use perceptron-specific biases, which are the negative of the threshold for the given perceptron. If w is the weight vector, x is the input vector and b is the bias, then the output is 1 if w ⋅ x + b is greater than zero, and 0 otherwise.

A. Sigmoid Neurons

Binary output is limiting in nature, and to avoid this restriction we no longer employ the perceptron model; sigmoid neurons are employed instead. Sigmoid neurons essentially work in such a way that minuscule changes in the input cause similarly minute changes in the output of the network, instead of just contributing to a binary output. This property of the network is essential to the working of a “learning” network, because it lets us fine-tune the network and obtain the desired results by changing the values of the weights and biases. This is not possible with perceptrons, as a small change in a weight or bias may completely flip the output if a perceptron is used. This is circumvented by making use of another artificial neuron known as the sigmoid.

The main difference between the two is that instead of having 1 and 0 for the output, where w is the weight, x is the input and b is the bias, we have

σ(w ⋅ x + b)

where σ is called the sigmoid function and is defined as

σ(z) ≡ 1 / (1 + e^(−z))

The shape of this function is similar to the plotted shape of a step function, but different in that it is smoother. The thing to note is that if we had used a step function, the sigmoid neuron would just be another perceptron, as there would be no gradual variance in the output values; instead there would be stark jumps between the minimum and the maximum, as evidenced by the shape of the graph.

This smoothness can be expressed mathematically by saying that small changes Δwj in the weights and Δb in the bias will produce a small change Δoutput in the output from the neuron, so that

Δoutput ≈ ∑j (∂output/∂wj) Δwj + (∂output/∂b) Δb

This is very useful, as Δoutput is a linear function of the changes Δwj and Δb in the weights and bias. The sigmoid serves as our activation function, in that it gives us the value of activation of a particular neuron or output. The sigmoid is the most commonly used activation function, although there may be other functions which possess similar properties and offer a comparable output and waveform; those may be used as substitutes if needed.

Sigmoid output is not just 1 and 0; instead it is any number in the range 0 to 1, so decimal values such as 0.1456 and 0.7568 are both legitimate outputs. This is useful if we consider an example where we might need to identify the average intensity of pixels, which is more suitably represented by this wider range of values than by a binary 1 and 0 [5].
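As a short illustration of the sigmoid neuron, the following sketch computes σ(w ⋅ x + b) for a single neuron. The weights, bias and inputs are hypothetical example values, not parameters used anywhere in this report.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron(x, w, b):
    """Activation of a single sigmoid neuron: sigma(w . x + b)."""
    return sigmoid(np.dot(w, x) + b)

# Hypothetical weights, bias and inputs: the output varies smoothly in (0, 1)
# instead of jumping between 0 and 1 like the perceptron.
x = np.array([0.5, 0.9, 0.1])
w = np.array([0.4, -0.2, 0.7])
print(sigmoid_neuron(x, w, b=-0.1))   # roughly 0.50, a value strictly between 0 and 1
```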
IV. ARCHITECTURE OF A NEURAL NETWORK

In order to represent the working of an artificial neural network, the following architecture is taken into consideration.

The leftmost layer is known as the “input layer” and the neurons it contains are known as “input neurons”. Similarly, the rightmost layer is known as the “output layer” and the neurons it contains are known as “output neurons”. In our situation there is a single output neuron in the output layer. The layers in the middle are known as “hidden layers” because the neurons in these layers are neither input nor output. There can be multiple hidden layers; in this instance we have two hidden layers between the input and the output layer. Such multiple-layer networks are sometimes called multilayer perceptrons, or MLPs.

The structure of the input and output layers in a neural network is frequently straightforward. For instance, consider that our purpose is to identify whether a handwritten picture shows a "9" or not. A straightforward way to create the neural network is to encode the intensities of the image pixels into the input neurons. If the image is a 28 by 28 greyscale image, then we would have 28 x 28 = 784 input neurons, with the intensities scaled suitably between 0.0 and 1.0. The output layer will possess just one neuron, with output values of less than 0.5 representing "input image is not a 9" and values greater than 0.5 representing "input image is a 9".

While the structure of the input and output layers of a neural network is frequently straightforward, there are a lot of possibilities in the design of the hidden layers according to the needs of efficiency and accuracy of the algorithm. In other words, it is not possible to encapsulate the design procedure for the hidden layers in a few simple rubrics. In its place, neural network scientists have established many design heuristics for the hidden layers, which assist people in getting the performance they require out of their neural networks. For instance, such heuristics can be utilized to aid in determining how to trade off the number of hidden layers against the time needed to train the neural network.

The type of neural network that we have discussed till now is known as a “feedforward neural network”. In this type of network, the output values from one layer are fed as input to the next layer. There do not exist any loops which may carry the data backward; the data is only fed in one direction, hence the name. If loops were used, a situation would arise in which the input to the σ (sigmoid) function would become dependent on the output. There exist certain models in which such feedback loops are allowed; these models are called “recurrent neural networks”. In these models, the concept of letting a neuron fire for a limited time duration before becoming inactive is applied. That firing can activate other neurons to fire for a limited time, which again causes the next neurons to fire. Therefore, loops do not create complex problems in this model, as a neuron's output can only influence its input at some later time, not instantly.

Recurrent neural networks are used less than feedforward ones, mainly because their learning algorithms are less efficient and thus less powerful than those of feedforward neural networks. However, recurrent networks are more similar to the biological working of a human being's neural network, and it is likely that they can prove to be quite useful in some cases where feedforward neural networks struggle; it depends on the nature of the problem.
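To make the layered structure concrete before moving on, here is a minimal sketch of a forward pass through a feedforward network of the kind described above (784 input neurons, hidden layers, a single output neuron). The hidden-layer sizes and the random weights are illustrative assumptions, not a design taken from this report.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Propagate an activation vector through each layer: a -> sigma(W a + b)."""
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a

# Hypothetical architecture: 784 inputs -> two hidden layers of 16 -> 1 output neuron.
sizes = [784, 16, 16, 1]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

image = rng.random(784)                      # stand-in for a flattened 28x28 greyscale image
print(feedforward(image, weights, biases))   # single value in (0, 1): "is this a 9?"
```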
V. WORKING OF A NEURAL NETWORK

For the proper working of a neural network we must first obtain the weights and biases, and establish a means to do so. We shall use x to denote a single training input, which, depending on the number of node connections in our network, is a multidimensional vector with a number of dimensions equal to the number of node connections. The required output shall be considered as y = y(x), where y is a multidimensional vector having the same number of dimensions as the number of neurons in the output layer.

Now we require an algorithm that gets the cumulative result of y(x) for every training input x; this can be represented by a cost function

C(w, b) ≡ (1/2n) ∑x ‖y(x) − a‖²

where w denotes the collection of all weights in the network, b all the biases, n is the total number of training inputs, a is the vector of outputs from the network when x is input, and the summation is done over all inputs x.

C shall be referred to as the quadratic cost function, also referred to in some cases as the Mean Squared Error. It should be noted that the cost C becomes small in the situation where a is close to the required output y(x). A training algorithm is considered effective if it has found weights and biases which result in a cost that is very close to zero, but if the cost is large it indicates that the algorithm is not doing an effective job, as the output is not close to the desired output value; in such a case adjustments are made to the weights and biases after every training input so that the cost function is minimized.

So our goal is essentially to find a set of weights and biases that result in the smallest possible value of the cost function. This objective is reached by employing the Gradient Descent algorithm. The algorithm is useful for minimizing any function, not just the particular cost function we are using in our case. Consider a function C(v), where C(v) is any real-valued function of multiple variables. We shall consider C to have two parameter variables, v1 and v2, for the sake of simplicity.

Fig. 4. 3D graph of our cost function.

Let us consider moving across the surface by a small amount Δv1 in the v1 direction and a small amount Δv2 in the v2 direction. We know that

ΔC ≈ (∂C/∂v1) Δv1 + (∂C/∂v2) Δv2

Our aim is to find negative values of ΔC so that we may eventually reach the minimum point of the function. For this purpose we shall consider Δv to be the vector of changes in v, i.e. the transpose of the two values Δv1 and Δv2, and likewise the gradient ∇C of C is the transpose vector of the partial derivatives with respect to v1 and v2:

Δv ≡ (Δv1, Δv2)ᵀ,  ∇C ≡ (∂C/∂v1, ∂C/∂v2)ᵀ

In this situation ΔC can be written in terms of the gradient and Δv, so

ΔC ≈ ∇C ⋅ Δv

i.e. the gradient relates changes in v to changes in C. Now we must choose values of Δv that make ΔC negative, and for this purpose we choose

Δv = −η∇C

where η is the learning rate, a small positive parameter. This guarantees that the value of ΔC will be negative; we update v by this amount and repeat until we have reached the minimum value.

So far we have established that the gradient descent algorithm involves obtaining the gradient repeatedly and then moving opposite to the gradient so that we may reach the required minimum value.

Fig. 5. Showing downward motion.

The essential point here is that we decide upon a suitable learning rate so that the above equation results in an accurate approximation; there is a matter of striking a balance between a value small enough for the approximation to hold and a value of sufficient magnitude so that the algorithm still makes useful progress in reducing the cost function. In real-life implementations of such a system it is often observed that the learning rate is varied so that the algorithm remains efficient while still yielding a suitable approximation of the value.
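As a hedged illustration of the update rule Δv = −η∇C, the sketch below applies gradient descent to a simple two-variable function. The example function C(v1, v2) = v1² + v2² and the learning rate are arbitrary choices made only for demonstration.

```python
import numpy as np

def C(v):
    """Example cost function: C(v1, v2) = v1^2 + v2^2, minimized at (0, 0)."""
    return v[0] ** 2 + v[1] ** 2

def grad_C(v):
    """Gradient of the example cost: (dC/dv1, dC/dv2)."""
    return np.array([2.0 * v[0], 2.0 * v[1]])

eta = 0.1                       # learning rate: a small positive parameter
v = np.array([3.0, -4.0])       # arbitrary starting point on the surface

for step in range(100):
    v = v - eta * grad_C(v)     # the update rule: delta_v = -eta * grad(C)

print(v, C(v))                  # v has moved close to (0, 0) and C(v) close to 0
```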
A. Stochastic Gradient Descent

As efficiency in an algorithm of this sort is paramount, next to functionality, people devised something known as stochastic gradient descent, which increases the efficiency. It achieves this by computing the gradient for a small number of arbitrarily chosen training inputs and averaging over that small sample to obtain an estimate of the true gradient ∇C. These arbitrarily chosen training inputs are considered a mini-batch, and they aid us in approximating the actual gradient. When all of the training inputs have been used in this way, it is considered one epoch or cycle, and a new epoch is begun for further refinement.

B. Backpropagation

The backpropagation algorithm is an essential part of the working of our neural network; it is crucial because it aids in the calculation of the gradient of the cost function, which is then used in the Gradient Descent algorithm. The core concept of the algorithm is an expression for the partial derivative of the cost function with respect to any given weight or bias; this tells us the rate of change of the cost due to any given variation in the weights and biases [5].

VI. DEEP NEURAL NETWORK

A simple neural network is one which consists of a single hidden layer between the input and output layers. However, a neural network in which multiple hidden layers are present between the input and output layers is called a “deep neural network”, and the techniques used to train these networks are termed “deep learning”.

These multiple hidden layers are used to add more detailed checking for a certain purpose. The neurons in the first hidden layer check for smaller and more detailed elements; these neurons then build up a bigger picture for the next layer to check, and so the process continues, ensuring greater accuracy and efficiency of the system. These several layers of abstraction give deep networks strong tools for tackling more intricate problems like pattern recognition.

Deep neural networks are, however, found to be hard to train by the above-mentioned procedure of stochastic gradient descent with backpropagation. The problem arises when the various hidden layers in our network are learning at different rates or speeds. When the later hidden layers are learning quickly, the early hidden layers get stuck at slower rates, and vice versa. Looking deeper, we find that there is instability in learning due to the gradient-based techniques used in the algorithms.

One of the problems that may arise is the “vanishing gradient problem”. In some deep neural networks, as we propagate through the hidden layers in the backward direction, the gradient becomes smaller and smaller, and thus when we reach the earlier layers they cannot learn quickly because the gradient has vanished by the time it reaches them. In contrast, sometimes the gradient can become so large at the earlier layers that learning becomes unstable; this is termed the “exploding gradient problem”.

One argument may arise: if the gradient is becoming so small, should that not be good news, since we are apparently reaching an extremum and do not have to make large adjustments to the weights and biases in the network? The answer is that we initialized the network with random weights and biases, and we certainly cannot obtain accurate results as quickly as it appears when the gradient is vanishing. So actually our network is not performing well; it is just that the earlier layers are not able to learn when the gradient is so small. It seems to them that they are already accurate and do not need any changes, so they remain untrained. These problems need to be taken care of if we want to train a deep network.

The main problem with vanishing or exploding gradients is that the gradient in the early layers is the product of terms coming from the later layers. When there are many later layers, this creates an unstable condition. The only way to balance the learning speeds of all the layers is to somehow manage those products so that they balance out. Some evidence also shows that the sigmoid activation function creates a problem of its own, saturating the final hidden layer near zero early in training, so another activation function which does not suffer from such a saturation problem is also advised.

VII. EXAMPLE (IDENTIFYING NUMBERS)

Now that the intricacies of the working and architecture of a Neural Network have been established, consider an example where we aim to train a network and make it capable of recognizing handwritten digits with great accuracy.

Fig. 6. Four Layer Deep Neural Network [6].

The Neural Network in this figure takes input training data that consists of images of handwritten digits with 28x28 image resolution. Correspondingly, we consider the network to have 28 x 28 = 784 input neurons in total. A simplified view of these is shown in the figure, which omits a fair few of the neurons.
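Since the example that follows relies on the mini-batch stochastic gradient descent procedure of Section V, here is a hedged sketch of that training loop. The function grad_fn stands in for a backpropagation routine, and the toy data and hyperparameters below are placeholders for illustration, not the report's actual implementation.

```python
import numpy as np

def sgd(params, train_x, train_y, grad_fn, epochs=30, batch_size=10, eta=3.0):
    """Mini-batch stochastic gradient descent.

    grad_fn(params, xb, yb) is assumed to return the gradient of the cost
    averaged over the mini-batch (xb, yb), e.g. computed by backpropagation."""
    n = len(train_x)
    for epoch in range(epochs):
        order = np.random.permutation(n)               # shuffle the training data
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]    # one mini-batch
            grad = grad_fn(params, train_x[batch], train_y[batch])
            params = params - eta * grad               # step against the gradient
    return params

# Tiny usage example with a stand-in gradient (least squares on random toy data),
# shown only to exercise the epoch / mini-batch structure; it is not a neural network.
rng = np.random.default_rng(1)
X, y = rng.random((200, 3)), rng.random(200)
def grad_mse(w, xb, yb):
    return 2.0 * xb.T @ (xb @ w - yb) / len(yb)
w = sgd(np.zeros(3), X, y, grad_mse, epochs=20, batch_size=10, eta=0.1)
print(w)
```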
The input pixels are monochromatic, with 1 denoting white pixels, 0 denoting black, and in-between values denoting shades of grey. The hidden layers are the second and third layers of the network. The output layer consists of ten neurons, each corresponding to a number; whichever of these neurons is activated, i.e. has 1 as its value, indicates the number output by the Network. So if the first neuron is activated then the output is considered zero, if the second is active then it is 1, and so on for each number.

The working in this example employs the algorithms mentioned in the earlier section, i.e. Gradient Descent and Backpropagation. The input data set for this example is split into 60,000 training images and 10,000 test images. Initially the biases and weights are all initialized randomly; this allows our gradient descent algorithm, i.e. stochastic gradient descent, to start functioning. Our activation function (the sigmoid function) is used to obtain the activation value of each neuron, and each neuron takes the previous layer's activation values as its input, which are then used in further calculations along with its specific bias. Gradient descent uses the mini-batches in each epoch (randomly selected input data) to minimize the cost function after the first output of the network. The backpropagation algorithm's purpose is to return the value of the gradient of the cost function for every mini-batch in each training cycle. With the help of these algorithms, the weights and biases are tuned after every output according to the desired output and the value of the cost function.

If we set the learning rate to 3.0 and use a mini-batch size of 10, then after running through the first epoch this network correctly identified 9129 of the 10000 test images, with the number increasing after every epoch.

Fig. 7. Output obtained from our network.

We can gather from the above results that the network properly recognizes ~95% of the handwritten numbers used in our test images. There may be variance in this result over different runs of the network, as the weights and biases are randomly initialized, so the network may take longer to find the appropriate weights and biases to work at peak shape. These results were taken from the best of three runs; if we increase the number of neurons in the hidden layers, the program will take longer to reach peak performance but will show a better top score. There is also a large dependence on the learning rate; if the learning rate is poorly chosen, such as 0.001 instead of 3.0, the results will be much worse.

Fig. 8. Output with 0.001 learning rate.

It should be noted that the performance of the network is still showing improvement over time, which is remarkable, as it means that given enough training cycles the network would eventually still reach the accurate weights and biases required for proper function, even though it may take much longer due to the poorly chosen initial parameters. At the opposite end of the spectrum, it is also possible to choose a learning rate that is too high, which likewise ends with poor results.

Fig. 9. Output with 100 learning rate.

REFERENCES

[1] https://en.wikipedia.org/wiki/Machine_learning (Accessed: 16 Feb. 2018)

[2] https://en.wikipedia.org/wiki/Perceptron (Accessed: 16 Feb. 2018)

[3] https://en.wikipedia.org/wiki/Expert_system (Accessed: 17 Feb. 2018)

[4] https://www.dezyre.com/article/top-10-machine-learning-algorithms/202 (Accessed: 16 Feb. 2018)

[5] M. Nielsen, "Using neural nets to recognize handwritten digits," in Neural Networks and Deep Learning, 1st ed. New York, NY, USA: Determination Press, 2015, ch. 1.

[6] https://youtu.be/aircAruvnKk (Accessed: 16 Feb. 2018)
