Definition of the area:
An Artificial Neural Network (ANN), often just called a "neural network" (NN), is a mathematical
model or computational model based on biological neural networks. It consists of an
interconnected group of artificial neurons and processes information using a connectionist
approach to computation. In most cases an ANN is an adaptive system that changes its structure
based on external or internal information that flows through the network during the learning
phase. In more practical terms neural networks are non‐linear statistical data modeling tools. They
can be used to model complex relationships between inputs and outputs or to find patterns in data.
A Feed Forward Neural Network is an artificial neural network where connections between the
units do not form a directed cycle. This is different from recurrent neural networks. The feed
forward neural network was the first and arguably simplest type of artificial neural network
devised. In this network, the information moves in only one direction, forward, from the input
nodes, through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in
the network.
The Back Propagation Algorithm is a common way of teaching artificial neural networks how to
perform a given task. It requires a teacher that knows, or can calculate, the desired output for any
given input. It is most useful for feed-forward networks. The backpropagation algorithm learns the
weights for a multilayer network, given a network with a fixed set of units and interconnections. It
employs the gradient descent rule to attempt to minimize the squared error between the network
output values and the target values for those outputs.
Problem considered in NN:
Character recognition has become an important and interesting industry application. I am
planning to apply a Neural Network to recognizing characters, initially digits. My searches on the
internet and in other materials make me feel that a Neural Network would be a good option for
recognizing characters. I am much interested in learning how a Neural Network can be applied to a
particular problem domain (in this case, character recognition) and why a NN is worth applying.
The Task:
The learning task here involves recognizing characters (digits considered in the beginning). The
target function is to classify a given character image as a particular digit from the target set.
Initial Input Encoding:
There should be a drawing panel where digits can be drawn and then used for either training or
test. These drawings should be converted into a matrix of size 5 by 4, i.e. 20 pixels. To draw a
digit, 7 segments are even enough, but I am considering more in case, in the future, I can cover all
the (English) characters to recognize.
Initial Output Encoding:
I would use 10 distinct output units, each representing one of the 10 digits.
Fig 1: a) Input → b) Converted Matrix → c) Target
Basically, as I will be using the sigmoid function, in practice I would use 0.9 for 1 and 0.1 for 0.
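As an illustration, here is a minimal Java sketch of this output encoding; the helper name encodeTarget is my own, not from the project code:

// Build the target vector for a digit: 0.9 at the digit's index, 0.1 elsewhere.
// Hypothetical helper for illustration, not the original project code.
static double[] encodeTarget(int digit) {
    double[] target = new double[10];
    java.util.Arrays.fill(target, 0.1);
    target[digit] = 0.9;  // e.g. digit 3 -> {0.1, 0.1, 0.1, 0.9, 0.1, ...}
    return target;
}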
Initial Network Structure:
To represent the 20 matrix pixels I would use 20 input layer units, and 10 output layer units for
the 10 distinct digits. For the hidden layer, I am planning to start with four units: 3 would be
enough to map 8 (2^3) outputs, so 4 hidden units (2^4 = 16 > 10) should be fine for 10 output
units. I might change the number of hidden layer units in case I feel it is required for better
performance while testing the system.
Fig 2: NN structure for the proposed digit recognition system
Learning Process:
Basically, the backpropagation algorithm should be used to train the NN structure proposed above. I
will start with random values assigned to the weight matrices, and will gradually adjust the weights
(i.e. train the network) by performing the following procedure for all pattern pairs:
Forward pass
1) Compute the hidden‐layer neuron activation:
h=F(iW1+bias1)
where h is the vector of the hidden‐layer neurons, i the vector of input‐layer neurons, and W1 the
weight matrix between the input and hidden layers. bias1 is the bias on the computed activation of
the hidden-layer neurons. F is the sigmoid activation function (here F(x) = 1/(1 + exp(−x))).
2) Compute the output‐layer neuron activations:
o=F(hW2+bias2)
where o represents the output layer, h the hidden layer, W2 the matrix of synapses connecting the
hidden and output layers, and bias2 the bias on the computed activation of the output-layer neurons.
Backward pass
3) Compute the output-layer error (the difference between the target and the observed output):
d = o(1−o)(t−o)
where d is the vector of errors for each output neuron, o the output-layer vector, and t the
target (correct) activation of the output layer.
4) Compute the hidden-layer error:
e = h(1−h)·(W2 d)
where e is the vector of errors for each hidden-layer neuron.
5) Adjust the weights for the second layer of synapses:
W2 = W2 + CW2_t
where CW2_t is a matrix representing the change in matrix W2 at step t. It is computed as follows:
CW2_t = r·h·d + O·CW2_(t−1)
where r is the learning rate and O the momentum factor, used to allow the previous weight change
to influence the weight change in this time period t. This does not mean that time is somehow
incorporated into the model; it means only that a weight adjustment may depend to some degree on
the previous weight adjustment made.
6) Adjust the weights for the first layer of synapses:
W1 = W1 + CW1_t
where CW1_t = r·i·e + O·CW1_(t−1)
7) Adjust the bias vectors (the hidden-layer bias with the hidden-layer error e, the output-layer
bias with the output-layer error d):
bias1 = bias1 + r·e
bias2 = bias2 + r·d
Repeat steps 1 to 7 on all pattern pairs until the output layer error (vector d) is within the specified
tolerance for each pattern and each neuron.
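To make steps 1 to 7 concrete, below is a minimal Java sketch of one training pass under the notation above. The method and array names are my own illustration, not the original project code; it assumes W1 is sized input by hidden, W2 hidden by output, and that the caller keeps CW1 and CW2 between calls so the momentum terms carry over.

// One backpropagation pass (steps 1-7 above). Illustrative sketch only.
// in: input vector (n), t: target vector (k), W1: n x m, W2: m x k.
static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

static void trainOnePattern(double[] in, double[] t,
                            double[][] W1, double[][] W2,
                            double[] bias1, double[] bias2,
                            double[][] CW1, double[][] CW2,
                            double r, double O) {   // r: learning rate, O: momentum
    int n = in.length, m = bias1.length, k = bias2.length;
    // Step 1: hidden activations h = F(i W1 + bias1)
    double[] h = new double[m];
    for (int j = 0; j < m; j++) {
        double s = bias1[j];
        for (int a = 0; a < n; a++) s += in[a] * W1[a][j];
        h[j] = sigmoid(s);
    }
    // Step 2: output activations o = F(h W2 + bias2)
    double[] o = new double[k];
    for (int c = 0; c < k; c++) {
        double s = bias2[c];
        for (int j = 0; j < m; j++) s += h[j] * W2[j][c];
        o[c] = sigmoid(s);
    }
    // Step 3: output error d = o(1-o)(t-o)
    double[] d = new double[k];
    for (int c = 0; c < k; c++) d[c] = o[c] * (1 - o[c]) * (t[c] - o[c]);
    // Step 4: hidden error e = h(1-h)(W2 d)
    double[] e = new double[m];
    for (int j = 0; j < m; j++) {
        double s = 0;
        for (int c = 0; c < k; c++) s += W2[j][c] * d[c];
        e[j] = h[j] * (1 - h[j]) * s;
    }
    // Steps 5-6: weight updates with momentum, CW_t = r*(activation)*(error) + O*CW_(t-1)
    for (int j = 0; j < m; j++)
        for (int c = 0; c < k; c++) {
            CW2[j][c] = r * h[j] * d[c] + O * CW2[j][c];
            W2[j][c] += CW2[j][c];
        }
    for (int a = 0; a < n; a++)
        for (int j = 0; j < m; j++) {
            CW1[a][j] = r * in[a] * e[j] + O * CW1[a][j];
            W1[a][j] += CW1[a][j];
        }
    // Step 7: bias updates (hidden bias uses e, output bias uses d)
    for (int j = 0; j < m; j++) bias1[j] += r * e[j];
    for (int c = 0; c < k; c++) bias2[c] += r * d[c];
}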
Other Parameters:
I was planning to train using a learning rate of 0.1 to 0.3, and the same range for momentum. Based
on the performance, I intended to adjust their values.
Implementation:
I implemented the idea in Java. The user enters test data in a paint rectangle; the image is then
processed to produce a 5 by 7 matrix as input, and the desired output is the recognition of one
particular digit.
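As a rough sketch of how a painted image might be reduced to such a grid (my own illustration; the original conversion code is not shown here, and the rule of marking a cell "on" if any inked pixel falls inside it is an assumption):

// Downsample a boolean pixel canvas to a cols x rows grid of 0.9/0.1 inputs.
// Illustrative only; here a cell is "on" if any pixel inside it is inked.
static double[] toGrid(boolean[][] pixels, int cols, int rows) { // e.g. cols=5, rows=7
    int w = pixels[0].length, hgt = pixels.length;
    double[] grid = new double[cols * rows];
    java.util.Arrays.fill(grid, 0.1);
    for (int y = 0; y < hgt; y++)
        for (int x = 0; x < w; x++)
            if (pixels[y][x])
                grid[(y * rows / hgt) * cols + (x * cols / w)] = 0.9;
    return grid;
}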
Experiment Setup:
Initially I used 20 input units, as a 4×5 grid matrix representing a digit, with 4 hidden units
and 10 output units. This setup did not work very well. I think this was because 20 input units
(a 4×5 matrix grid) were probably too cramped to represent each digit distinctly. So I changed to
a slightly bigger grid of 5×7, using 35 input units; a 5×7 matrix grid of 35 units was used for
each digit representation. There are 8 hidden units: initially there were 4, and I increased them
to eight for better performance after reading some papers and articles [8] on how many hidden
units to use for a NN application. There is actually no fixed, straightforward way to predict
this, but most of the ideas from those articles and papers amount to balancing the number of
hidden units against the numbers of input and output units. I used 10 output units, one for each
of the 10 digits from 0 to 9. My paper readings and searches on implementing a NN structure made
me believe that for a NN of 35 input units and 10 output units, 8 hidden units would be a good
balance.
Fig 3: Final NN for Digit Recognition
I used bias weights for the sigmoid units (hidden and output layers). Initially I assigned 0.1 as
the initial value of every weight, and later I adjusted the weights (discussed in more detail in
the training section) because I was not getting good results, taking inspiration from the
guidelines for assigning initial weights in Tom Mitchell's Machine Learning textbook, which says
to assign random weights from −0.05 to 0.05. For the learning rate and momentum I used different
values across training runs with different numbers of iterations (epochs) and looked for effects
on squared error convergence and accuracy.
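A small sketch of such an initialization in Java, using java.util.Random; the method name randomWeights is my own, not from the project:

// Initialize a weight matrix with uniform random values in [-0.05, 0.05),
// following Mitchell's guideline. Illustrative sketch, not the original code.
static double[][] randomWeights(int rows, int cols, java.util.Random rng) {
    double[][] w = new double[rows][cols];
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            w[r][c] = rng.nextDouble() * 0.1 - 0.05;
    return w;
}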
Dataset:
Initially I had 10 data instances, one for each of the 10 digits. In the input grid matrix, I
initially put 1 wherever ink (a segment of the digit) was supposed to be, to make the digit
visible, and 0 everywhere else in the grid. After a couple of training iterations without
significant results, I changed the 1's to 0.9 and the 0's to 0.1. This was an inspiration and
suggestion from the course instructor, Professor Dejing Dou, who believed the 0's were not
actually contributing anything to the network. After several more training iterations I was still
getting 30-40% accuracy, and the instructor then suggested that using a small dataset might be a
cause of not getting the expected results. So I made 50 data instances: 5 different sets of 10
instances, one per digit. Initially I used all 5 sets (50 data) for training and 1 set of 10 data
for testing. Later I did cross validation, using 4 sets (40 data) for training and 1 set of 10
data instances for testing.
A sample data instance from the 1st set, representing zero, is given below:
{0.9,0.9,0.9,0.9,0.9,
0.9,0.1,0.1,0.1,0.9,
0.9,0.1,0.1,0.1,0.9,
0.9,0.1,0.1,0.1,0.9,
0.9,0.1,0.1,0.1,0.9,
0.9,0.1,0.1,0.1,0.9,
0.9,0.9,0.9,0.9,0.9}
Data instance for zero from another set
{0.7,0.9,0.9,0.9,0.9,
0.9,0.1,0.1,0.1,0.9,
0.9,0.1,0.1,0.1,0.9,
0.9,0.1,0.1,0.1,0.8,
0.9,0.1,0.1,0.1,0.9,
0.9,0.1,0.1,0.1,0.9,
0.8,0.9,0.9,0.9,0.78}
Training:
I trained the network many times, hour after hour, with different setups for learning rate,
momentum, varying numbers of iterations (epochs), and different stopping criteria.
Some training scenarios with different setups are presented in the following table. It covers some
initial scenarios in which the NN either could not classify all the test data or the accuracy was
very low; I sometimes tried to test whether the NN could classify only 3 or 5 digits, rather than
all ten, to get some idea of the effects of the parameters.
Training | Training Instances | Iterations | Learning Rate | Momentum | Stopping Criteria | Accuracy (%) | Sqr Error Convergence
1        | 3 out of 10        | 500        | 0.3           | 0.8      | After 500         | 100          | Yes
2        | 3 out of 10        | 1000       | 0.3           | 0.8      | After 1000        | 100          | Yes
3        | 4 out of 10        | 1000       | 0.3           | 0.8      | After 1000        | 60           | No
4        | 4 out of 10        | 10000      | 0.3           | 0.8      | After 10000       | 70           | No
5        | 10 data (1 set)    | 1000       | 0.3           | 0.8      | After 1000        | 30           | No
6        | 10 data (1 set)    | 10000      | 0.3           | 0.8      | After 10000       | 30           | No
7        | 10 data (1 set)    | 10000      | 0.6           | 0.8      | After 10000       | 30           | No
8        | 10 data (1 set)    | 10000      | 0.6           | 0.5      | After 10000       | 30           | No
Table 1: Training Scenarios in the Initial Stage of the Project
As the training failed and did not converge, I tried with more data instances: I made the 5 sets
totaling 50 instances described in the Dataset section of this paper. The following table shows
some of the training scenarios. Again, as the training did not converge for these scenarios, I did
not set the squared error (10^-5) as a stopping criterion, but rather chose to observe the effect
of parameter variations.
I learned from the previous training scenarios that momentum has little effect on accuracy, so I
kept momentum at 0.8 for the following scenarios and tried several smaller learning rates, after
suggestions from the instructor and from another character recognition project [6], which
recommends a very small learning rate and warns that a higher learning rate can sometimes diverge
rather than converge.
Table 2: Training Scenarios from the Intermediate Stage of the Project
I was calculating accuracy by how many of the 10 digits the NN could classify correctly, so it was
always n/10, where n is 0 to 10, and looked like 30%, 40%, or 50%, with no fractional accuracy
like 55% or 35.7%. Even though I was not using a squared error threshold as a stopping criterion,
because the error was not converging, I computed the squared error each time to see whether it
converged or not.
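For illustration, the accuracy calculation could look like the following sketch; the names are hypothetical, and outputs[i] stands for the network's 10 output unit values for test case i:

// Accuracy over a test set: the share of digits classified correctly, as a percent.
// Illustrative sketch, not the original project code.
static double accuracy(double[][] outputs, int[] labels) {
    int n = 0;
    for (int i = 0; i < outputs.length; i++) {
        int best = 0;  // index of the strongest output unit
        for (int o = 1; o < outputs[i].length; o++)
            if (outputs[i][o] > outputs[i][best]) best = o;
        if (best == labels[i]) n++;
    }
    return 100.0 * n / outputs.length;  // e.g. 3 correct of 10 -> 30%
}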
The best performance from these trainings came with 100000 iterations and a very low learning rate
(0.001); the accuracy was only 50%, and again the squared error was not converging to any
threshold value like 10^-3.
Finally, following Tom Mitchell's Machine Learning textbook, which suggests using random initial
weights (−0.05 to 0.05) rather than the 0.1 I had used for all initial weights, I got far better
performance than with fixed initial weights of 0.1. I think all the final weights of the hidden
units, and likewise all the final weights of the output units, were coming out identical because
of this initial weight assignment of 0.1. I generated random numbers between −0.05 and 0.05 using
Java's Random class. Tom Mitchell's book has an image recognition project [9] for classifying the
direction in which the person in the image is looking, and for that project he says that a lower
learning rate like 0.1 improves the accuracy but takes much more time, whereas a somewhat higher
learning rate like 0.3 makes the NN learn faster with slightly lower accuracy, with no huge
difference.
When I used random initial weight assignment with a lower learning rate of 0.075, with the 50
training data instances and 3000 iterations, I got dramatically better performance: the NN could
classify all of the 10 digits when I tested it, and the final weights were no longer identical
across hidden units or across output units. Most importantly, the squared error was converging to
the threshold 10^-4.
Table 3: Final training scenarios, in which the squared error converged to the threshold 10^-4
* When the expected output unit's value was over 0.8, as the target was 0.9.
** A mixed stopping criterion was used: training stopped when all data were recognized
(considering an output value over 0.8 and up to 0.9 as recognition) or when the squared error
reached 10^-4, whichever came first. This training actually stopped when it converged to 10^-4
with output values of 0.89 or over.
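A sketch of such a mixed stopping test in Java (names are illustrative, not the original training loop):

// Mixed stopping criterion: stop when every pattern is recognized
// (its target unit's output is over 0.8) or the epoch's squared error <= 1e-4.
// Illustrative sketch, not the original project code.
static boolean shouldStop(double[][] outputs, int[] labels, double squaredError) {
    if (squaredError <= 1e-4) return true;
    for (int i = 0; i < outputs.length; i++)
        if (outputs[i][labels[i]] <= 0.8) return false;  // not yet recognized
    return true;  // all patterns recognized
}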
Results:
After successfully training the network, testing was done in several ways. In one pattern of
testing, after training with the 50 training data, the NN was tested with 10 of those training
data, and it recognized all 10 digits 100% correctly. Cross validation was performed by training
with 40 data instances and then testing with another 10 data; in this case 100% accuracy was also
obtained. I also tested with unknown, clean data using the digit drawing interface and the mouse.
The drawing (shown on the right side of the interface) was converted to the appropriate 5×7 matrix
grid (shown on the left of the interface) as one input for the input layer. In this case I also
got 100% accuracy.
Fig 4: Input for zero without noise
I also got very good performance under noise, as I showed the instructor. I tried several times
with noise to see how reliable my designed NN is, and the performance was really good. I think I
can say it was reliable up to almost 15% noise, which I tested on the drawing user interface. I
think this was possible because I trained the NN with a little noise (5%) in the data set.
Fig 5: Input for zero with noise
Classifier: recognize as the digit with the highest corresponding output node value.
Algorithm:
// Pick the output unit with the highest activation; its index is the digit.
int maxO = 0;
double maxout = outValues[0];
for (int o = 1; o < 10; o++) {
    if (outValues[o] > maxout) {
        maxout = outValues[o];
        maxO = o;
    }
}
System.out.print("Recognized as : " + maxO);
-----------------------------------------------------------------------------
{0.9,0.9,0.9,0.9,0.9,
0.9,0.1,0.1,0.1,0.9,
0.9,0.1,0.1,0.1,0.9,
0.9,0.1,0.1,0.1,0.9,
0.9,0.1,0.1,0.1,0.9,
0.9,0.1,0.1,0.1,0.9,
0.9,0.9,0.9,0.9,0.9}
Recognized as: 0
Similar project:
This was a project [5] done for CMPS 523 at the University of Southwestern Louisiana. The class is
The Computational Basis of Knowledge and is taught by Dr. Anthony Maida.
Fig: Seven Segments Input
The purpose of this network was to do digit recognition from a simple 7-segment LED-style digit.
To this end, the network has seven inputs and eleven outputs. These correspond to 1) the segments
of the LED digit being on or off, and 2) the final output of what the digit is recognized as by
the network. The default network was created with 9 nodes in the hidden layer, a rate parameter of
0.5, and a noise parameter of 0.05.
Fig: NN for digit recognition using Seven Segment Display
Comparison of Seven Segments project with my project:
Even though I could not compare the accuracy of this seven-segment display digit recognition with
that of my project, I think mine is much more flexible for testing and recognizing new test data,
because I was using a drawing interface. With this interface I could make noisy data and test it,
whereas the seven-segment input was fixed. Also, I used 35 input nodes whereas the seven-segment
project used 7 units, so I was able to make significant distinctions between the test data for the
same digit, or even among different digits. I believe this makes my project a better and more
flexible one for digit recognition.
Discussions:
The most important thing I learned from this project is that there is no actual straightforward
way to design a NN for a specific problem. Rather, we have to iteratively apply the learning
process to train the NN and adjust its structure until we find one that solves our problem with
the accuracy we desire.
After I did several trainings, each of which converged to the threshold 10^-4, I used a mixed
stopping criterion: I tried to stop training when all digits had been recognized (I say a digit is
recognized when the target node's output value is over 0.8, up to 0.9 inclusive) or the squared
error was near 10^-5. The last three trainings show the final actual criteria for stopping. Out of
these three, I chose the weights from the training (the row indicated in blue in Table 3) whose
setup was: stop either when all the digits are recognized with output node values as close as
possible to the target ~0.9, or when the squared error converges to 10^-4, whichever comes first.
The output value could not reach 0.9, exactly the target; rather, training stopped when the
squared error converged to 10^-4, giving desired output values close to 0.89 after 1886 iterations
on the 50 training data instances. I think this was the best performance of all my trainings (more
than 200 runs); I then tested even with noise, and I got 100% accuracy with clean data and 100%
accuracy with up to 15% noisy data.
I was really feeling frustrated with the project, but in the end I finished with 100% accuracy on
my own test data and on new clean data, and the NN is reliable up to nearly 15% noisy data. I feel
I produced something which could be used in real-life digit recognition. This could be extended to
alphabet recognition or to whole ASCII character recognition in the future.
The percentage of noisy data recognized was calculated as follows: if at least 3 to 5 input units
(out of 35, i.e. roughly 9 to 14% of the units) did not have their actual, correct values and the
NN could still classify correctly (the expected output unit produced values near the target,
~0.9), I counted the input as recognized under noise. Noisy input was tested using the user
interface, which can be used as a painting board to draw a digit with the mouse.
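As a sketch of how such noise could be injected programmatically (my own illustration; in the project the noise came from drawing on the interface):

// Flip k randomly chosen input units between 0.9 and 0.1 to simulate noise
// (indices may repeat). Illustrative only, not the original project code.
static double[] addNoise(double[] grid, int k, java.util.Random rng) {
    double[] noisy = grid.clone();
    for (int n = 0; n < k; n++) {
        int idx = rng.nextInt(noisy.length);          // pick a random unit
        noisy[idx] = (noisy[idx] > 0.5) ? 0.1 : 0.9;  // invert it
    }
    return noisy;  // k = 5 of 35 units is roughly 14% noise
}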
The effects of learning rate and momentum variation with iterations, for trainings performed over
the 50 data instances in which the squared error converged to 10^-4 and 100% accuracy was obtained
in classifying each of the 10 digits (the desired output unit's value was over 0.8), are listed
below.
Table 4: Effects of Learning rate and Momentum variation with iterations (epochs)
Iteration Squared Error
101 0.3633160306245214
202 0.03395999748923626
303 0.017328892915319045
404 0.01290796911842668
505 0.010908880208753096
606 0.010143869494051464
707 0.009894881600209134
808 0.009849958492450347
909 0.00984722256073006
1010 0.009836698700188417
1111 0.009814743222216782
1212 0.009786460657574856
1313 0.009755744659216553
1414 0.009724222092780588
1515 0.009691884667288564
1616 0.009657703539330887
1717 0.00962116802194066
1818* 0.00959155704944527
Table 5: Iterations Vs Squared Error from the training that gave best performance
* The squared error over iterations shows convergence for the training that was selected for the
final NN weights. A learning rate of 0.1 with momentum 0.6 was used for this training, which
stopped after 1886 iterations, when the squared error converged to 10^-4 and all digits were
classified 100% correctly (each expected node value was close to ~0.9, actually 0.89).
Future Work:
I strongly believe this digit recognition project can be extended to English alphabet recognition
with some modification, perhaps by using 36 output nodes (10 for the digits and 26 for the
letters) and some more hidden units to balance the input and output layer units. This project
might also be extended to recognition of the whole ASCII character set, which I might try in the
future. Having done this project and read several articles and papers [1], [2], [3], [4] on
character recognition, I think a NN would produce very good results and is a good tool to use for
character recognition.
Acknowledgements:
I am grateful to the author [5] of the open source project that I used to create the user
interface for the painting board, on which digits can be drawn and converted to grid input for
testing. Beneath this user interface I integrated my backpropagation algorithm code, which I used
for training, and I also wrote the code for testing.
References:
1. http://cse.stanford.edu/class/sophomorecollege/projects00/neuralnetworks/Applications/character.html
2. http://home.eunet.no/~khunn/papers/2039.html
3. http://www.ccs.neu.edu/home/feneric/charrecnn.html
4. ww1.ucmss.com/books/LFS/CSREA2006/ICA5025.pdf
5. http://www.sff.net/people/dave_slusher/network.htp
6. http://www.codeproject.com/KB/dotnet/simple_ocr.aspx
7. http://cgm.cs.mcgill.ca/~godfried/student_projects/plang_neural/
8. http://www.faqs.org/faqs/aifaq/neuralnets/part3/section10.html
9. Tom Mitchell, Machine Learning, pages 113-116.