2)
There are slight differences between this architecture and LeNet-5: the LeNet-5
architecture does not contain any drop-out layer, whereas the architecture shown in the
homework has a drop-out layer after the sub-sampling layer (S4). In LeNet-5 the
connection from S4 to C5 is a full connection, but here a drop-out layer is added
between S4 and C5.
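For reference, the modified portion of the model looks like the following minimal sketch, using the same Keras layer calls as the full code later in this report:
# S4: sub-sampling layer
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Activation("sigmoid"))
# Extra layer not present in the original LeNet-5: drop-out between S4 and C5
model.add(Dropout(0.2))
# C5: convolution layer, fully connected to S4 in the original LeNet-5
model.add(Convolution2D(120, (2, 2), activation='tanh', kernel_initializer='he_normal'))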
Below is the confusion matrix (also known as the error matrix), which allows the
performance of the algorithm to be visualized.
Here, the columns are the true values and the rows are the predicted values of the
ten digits.
[[97 0 0 1 0 0 1 1 0 0]
[ 0 98 1 7 5 11 1 2 18 7]
[ 0 0 91 5 0 0 0 2 1 0]
[ 0 1 0 75 0 2 0 0 5 0]
[ 0 0 0 0 84 2 1 3 0 18]
[ 1 1 1 11 0 83 2 0 9 0]
[ 1 0 2 0 5 1 95 0 2 2]
[ 0 0 4 0 0 1 0 88 1 2]
[ 1 0 0 0 0 0 0 0 60 0]
[ 0 0 1 1 6 0 0 4 4 71]]
Shown below are the digits which are misclassified to a greater extent than the rest of the
digits.
In this case, 5 is predicted as 1 (11 times) out of the 100 test samples of that digit.
In this case, 8 is predicted as 1 (18 times) out of the 100 test samples of that digit.
In this case, 9 is predicted as 4 (18 times) out of the 100 test samples of that digit.
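These worst confusions can also be read off the matrix programmatically. Below is a small sketch, assuming the matrix above is stored in a NumPy array cm indexed as cm[predicted, true], as in the code further down:
import numpy as np

off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)                       # ignore the correct classifications
top3 = np.argsort(off_diag, axis=None)[::-1][:3]    # three largest off-diagonal counts
for idx in top3:
    pred, true = np.unravel_index(idx, off_diag.shape)
    print("digit %d predicted as %d, %d times" % (true, pred, off_diag[pred, true]))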
Code:
import pandas as pd
import numpy as np
np.random.seed(1337)
import matplotlib.pyplot as plt

from keras.models import Sequential
from keras.layers import Convolution2D, AveragePooling2D, Activation, Dropout, Flatten, Dense, ZeroPadding2D
from keras.utils import np_utils

batch_size = 100
nb_classes = 10
nb_epoch = 30

# train, test, X_train, X_test and in_shape are assumed to come from the
# (not shown) data-loading code.
y_train = train[:, 0]
Y_train = np_utils.to_categorical(y_train, nb_classes)  # one-hot labels for the softmax output
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

model = Sequential()
model.add(ZeroPadding2D((2, 2), input_shape=in_shape))
# Roughly equivalent to C1
model.add(Convolution2D(6, (5, 5), activation='tanh', kernel_initializer='he_normal'))
# Roughly equivalent to S2
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Activation("sigmoid"))
# Roughly equivalent to C3
model.add(Convolution2D(16, (5, 5), activation='tanh', kernel_initializer='he_normal'))
# Roughly equivalent to S4
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Activation("sigmoid"))
model.add(Dropout(0.2))
# Roughly equivalent to C5
model.add(Convolution2D(120, (2, 2), activation='tanh', kernel_initializer='he_normal'))
model.add(Flatten())
# Roughly equivalent to F6
model.add(Dense(84, activation='tanh', kernel_initializer='he_normal'))
# Output Layer
model.add(Dense(nb_classes, activation='softmax', kernel_initializer='he_normal'))  # one output per class
# Let's Learn!!
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])  # compile settings assumed; not shown in the original
model.fit(X_train, Y_train, batch_size=batch_size, epochs=nb_epoch, verbose=1)
yPred = model.predict_classes(X_test)
# Line up our outputs on the test set with the labels from the test set and calculate a confusion matrix
targets = test[:, 0]
cm = np.zeros((nb_classes, nb_classes), dtype=int)
for i in range(len(targets)):
    cm[yPred[i], targets[i]] += 1
print(cm)
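As a sanity check, the reported accuracy can be recovered from the confusion matrix itself, since the correct predictions lie on its diagonal:
accuracy = np.trace(cm) / float(cm.sum())   # correct predictions / total predictions
print("accuracy is", accuracy)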
accuracy is 0.905
[[99 0 0 1 0 0 0 0 1 0]
[ 0 98 1 4 3 7 0 0 7 0]
[ 0 1 94 6 0 0 0 1 1 1]
[ 0 0 0 79 0 1 0 0 1 1]
[ 0 0 0 0 81 2 1 2 1 3]
[ 0 0 0 6 0 85 1 0 0 0]
[ 0 0 0 0 1 0 97 0 0 0]
[ 0 0 3 0 0 1 0 94 0 1]
[ 1 1 1 3 0 3 1 0 87 3]
[ 0 0 1 1 15 1 0 3 2 91]]
# Let's look at some of the ones we got WRONG in the test set
test_wrong = [im for im in zip(X_test, yPred, test[:, 0]) if im[1] != im[2]]
plt.figure(figsize=(10, 10))
for ind, val in enumerate(test_wrong[:100]):
    plt.subplot(10, 10, ind + 1)
    im = 1 - val[0].reshape((28, 28))
    plt.axis("off")
    plt.imshow(im, cmap='gray')
plt.show()
# Let's look at some of the ones we got RIGHT in the test set
test_right = [im for im in zip(X_test, yPred, test[:, 0]) if im[1] == im[2]]
plt.figure(figsize=(10, 10))
for ind, val in enumerate(test_right[:100]):
    plt.subplot(10, 10, ind + 1)
    im = 1 - val[0].reshape((28, 28))
    plt.axis("off")
    plt.imshow(im, cmap='gray')
plt.show()
In this network configuration, we experiment with the activation functions, i.e. changing a
linear activation function to a non-linear activation function in order to estimate the overall
performance of the model.
Since a linear activation function is just an identity function that outputs the same value
that is fed to the node, training is not very effective in most cases.
So, we introduce some non-linearity into the network in order to compute more interesting
features so that it can perform better.
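As a toy illustration of this point (a NumPy sketch, separate from the model code): stacking layers with linear activations collapses to a single linear transformation, so the extra layers add no expressive power.
import numpy as np

x = np.random.randn(4)
W1, W2 = np.random.randn(3, 4), np.random.randn(2, 3)
# Two "linear-activation" layers collapse into the single linear map W2 @ W1
linear_two_layers = W2 @ (W1 @ x)
single_layer = (W2 @ W1) @ x
print(np.allclose(linear_two_layers, single_layer))   # True: no extra expressive power
# With a non-linearity (e.g. tanh) in between, the composition is no longer a single linear map
nonlinear = W2 @ np.tanh(W1 @ x)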
Q. How/why does it do better or worse than the first network?
When replacing relu with tanh, the accuracy improved considerably, from 85% to 90%.
accuracy is 0.905
[[99 0 0 1 0 0 0 0 1 0]
[ 0 98 1 4 3 7 0 0 7 0]
[ 0 1 94 6 0 0 0 1 1 1]
[ 0 0 0 79 0 1 0 0 1 1]
[ 0 0 0 0 81 2 1 2 1 3]
[ 0 0 0 6 0 85 1 0 0 0]
[ 0 0 0 0 1 0 97 0 0 0]
[ 0 0 3 0 0 1 0 94 0 1]
[ 1 1 1 3 0 3 1 0 87 3]
[ 0 0 1 1 15 1 0 3 2 91]]
Unlike relu, the logistic function (sigmoid activation) saturates for large positive and
negative inputs. This makes the sigmoid-based network more prone to saturation in the later layers,
which makes training more difficult.
The math of the relu function is that wherever a negative number occurs, we swap it out for a 0.
Since the sigmoid function is more likely to get stuck in local minima, the accuracy is
reduced.
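A toy sketch of the two functions (hypothetical helper code, not part of the model) makes the difference concrete:
import numpy as np

def relu(x):
    return np.maximum(0, x)          # negative inputs are swapped out for 0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(relu(x))      # [0. 0. 0. 1. 5.]
print(sigmoid(x))   # values squashed into (0, 1); for large |x| the slope is nearly 0 (saturation)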
accuracy is 0.392
[[62 0 0 0 0 1 0 0 0 0]
[ 0 77 12 2 0 7 5 2 4 0]
[ 0 0 8 0 0 0 0 0 0 0]
[ 2 0 37 23 0 5 0 0 8 0]
[22 0 8 3 70 22 63 5 21 33]
[ 3 0 0 0 0 0 0 0 0 0]
[ 0 0 20 1 0 1 30 0 0 0]
[ 1 23 15 66 18 42 1 92 48 37]
[ 0 0 0 0 0 0 0 0 0 0]
[10 0 0 5 12 22 1 1 19 30]]
Replacing sigmoid with tanh in the sub-sampling layers:
The preference for tanh over the logistic function (sigmoid activation) is that the
former is symmetric about 0 while the latter is not.
Tanh produces more natural-looking fits for the data using only linear inputs.
Sigmoid is more prone to getting stuck in local minima.
Replacing the sigmoid activation function with tanh results in a considerable improvement in accuracy.
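The relationship between the two functions makes this easy to check numerically (a small NumPy sketch):
import numpy as np

x = np.linspace(-3, 3, 7)
sig = 1.0 / (1.0 + np.exp(-x))
tanh = np.tanh(x)
# tanh is a rescaled, zero-centred sigmoid: tanh(x) = 2*sigmoid(2x) - 1
print(np.allclose(tanh, 2.0 / (1.0 + np.exp(-2 * x)) - 1))   # True
print(sig.mean(), tanh.mean())   # sigmoid outputs centre near 0.5, tanh outputs centre at 0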
accuracy is 0.973
[[100 0 0 0 0 0 0 0 0 0]
[ 0 97 0 0 0 0 0 0 0 0]
[ 0 1 97 0 0 0 0 3 0 0]
[ 0 2 0 100 0 1 0 0 1 2]
[ 0 0 0 0 96 0 0 0 0 1]
[ 0 0 0 0 1 98 1 0 0 1]
[ 0 0 0 0 0 0 98 0 0 1]
[ 0 0 2 0 1 0 0 97 0 1]
[ 0 0 1 0 0 1 1 0 96 0]
[ 0 0 0 0 2 0 0 0 3 94]]
Code:
import pandas as pd
import numpy as np
np.random.seed(1337)
import matplotlib.pyplot as plt

from keras.models import Sequential
from keras.layers import Convolution2D, AveragePooling2D, Activation, Dropout, Flatten, Dense, ZeroPadding2D
from sklearn import metrics

batch_size = 100
nb_classes = 10
nb_epoch = 30

# X_train, Y_train, X_test, test and in_shape are prepared exactly as in the first configuration.
model = Sequential()
model.add(ZeroPadding2D((2, 2), input_shape=in_shape))  # same padding layer as the first configuration
# Roughly equivalent to C1
model.add(Convolution2D(6, (5, 5), activation='relu', kernel_initializer='he_normal'))
# Roughly equivalent to S2
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Activation("sigmoid"))
# Roughly equivalent to C3
model.add(Convolution2D(16, (5, 5), activation='relu', kernel_initializer='he_normal'))
# Roughly equivalent to S4
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Activation("sigmoid"))
model.add(Dropout(0.2))
# Roughly equivalent to C5
model.add(Convolution2D(120, (2, 2), activation='relu', kernel_initializer='he_normal'))
model.add(Flatten())
# Roughly equivalent to F6
model.add(Dense(84, activation='tanh', kernel_initializer='he_normal'))
# Output Layer
model.add(Dense(nb_classes, activation='softmax', kernel_initializer='he_normal'))  # last layer with one output per class
# Let's Learn!!
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])  # compile settings assumed; not shown in the original
model.fit(X_train, Y_train, batch_size=batch_size, epochs=nb_epoch, verbose=1)
# Line up our outputs on the test set with the labels from the test set and calculate a confusion matrix
yPred = model.predict_classes(X_test)
targets = test[:, 0]
cm = metrics.confusion_matrix(targets, yPred).T  # transpose so rows are predictions, columns are true digits
print(cm)
accuracy is 0.799
[[96 0 0 1 0 2 1 0 0 0]
[ 0 97 1 5 4 11 3 4 15 8]
[ 0 1 92 8 0 0 1 6 2 0]
[ 0 1 1 71 0 7 0 1 8 1]
[ 0 0 1 0 78 2 1 4 1 32]
[ 2 0 1 12 0 76 4 0 5 1]
[ 0 0 1 0 3 1 90 0 0 1]
[ 0 1 2 0 0 1 0 78 0 2]
[ 2 0 0 2 1 0 0 0 66 0]
[ 0 0 1 1 14 0 0 7 3 55]]
# Let's look at some of the ones we got WRONG in the test set
test_wrong = [im for im in zip(X_test, yPred, test[:, 0]) if im[1] != im[2]]
plt.figure(figsize=(10, 10))
for ind, val in enumerate(test_wrong[:100]):
    plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
    plt.subplot(10, 10, ind + 1)
    im = 1 - val[0].reshape((28, 28))
    plt.axis("off")
    plt.text(0, 0, val[2], fontsize=14, color='blue')   # true label
    plt.text(8, 0, val[1], fontsize=14, color='red')    # predicted label
    plt.imshow(im, cmap='gray')
plt.show()
# Let's look at some of the ones we got RIGHT in the test set
test_right = [im for im in zip(X_test, yPred, test[:, 0]) if im[1] == im[2]]
plt.figure(figsize=(10, 10))
for ind, val in enumerate(test_right[:100]):
    plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
    plt.subplot(10, 10, ind + 1)
    im = 1 - val[0].reshape((28, 28))
    plt.axis("off")
    plt.text(0, 0, val[2], fontsize=14, color='blue')
    plt.text(8, 0, val[1], fontsize=14, color='red')
    plt.imshow(im, cmap='gray')
plt.show()
I decided to make these changes because I wanted to experiment with the drop-out layer and
analyze the difference in the performance of the model.
In drop-out, we randomly remove a fraction of the hidden nodes (classically half) during training,
excluding the input and output nodes. As a result, we can avoid overfitting.
The drop-out procedure is like averaging the effects of a very large number of different networks.
So, in this particular case, adding a drop-out layer does not have much impact on the improvement
in accuracy. It helps more when we have a very large training set containing millions of training
samples, where the problem of overfitting is very acute.
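A minimal sketch of what a drop-out layer does to its input during training (inverted drop-out in NumPy, shown with the two rates compared in the next paragraph):
import numpy as np

def dropout(activations, rate):
    # randomly zero a fraction `rate` of the units and rescale the survivors (inverted drop-out)
    mask = (np.random.rand(*activations.shape) >= rate).astype(activations.dtype)
    return activations * mask / (1.0 - rate)

a = np.ones((1, 10))
print(dropout(a, 0.2))   # roughly 2 of 10 units zeroed, survivors scaled by 1/0.8
print(dropout(a, 0.5))   # roughly half the units zeroed, survivors scaled by 1/0.5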
The observation is that on increasing the drop-out rate from 0.2 to 0.5, the accuracy decreased.
accuracy is 0.803
[[93 0 0 1 0 1 1 1 0 0]
[ 0 97 1 4 0 10 4 3 12 0]
[ 0 0 90 4 0 0 1 2 0 0]
[ 0 2 3 84 0 13 0 0 13 0]
[ 0 0 0 0 68 2 2 1 2 7]
[ 1 0 0 1 0 57 2 0 3 0]
[ 0 0 1 0 1 0 90 0 1 1]
[ 0 1 4 3 0 6 0 86 2 10]
[ 5 0 0 1 1 8 0 0 58 2]
[ 1 0 1 2 30 3 0 7 9 80]]
accuracy is 0.799
[[96 0 0 1 0 2 1 0 0 0]
[ 0 97 1 5 4 11 3 4 15 8]
[ 0 1 92 8 0 0 1 6 2 0]
[ 0 1 1 71 0 7 0 1 8 1]
[ 0 0 1 0 78 2 1 4 1 32]
[ 2 0 1 12 0 76 4 0 5 1]
[ 0 0 1 0 3 1 90 0 0 1]
[ 0 1 2 0 0 1 0 78 0 2]
[ 2 0 0 2 1 0 0 0 66 0]
[ 0 0 1 1 14 0 0 7 3 55]]
Since drop-out removes a fraction of the neurons and modifies the network structure itself, we
might need a larger number of epochs in order to train the network well.
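The only change needed is a larger epoch count in the fit call; the exact number of epochs used for this run is not recorded here, so the value below is illustrative:
nb_epoch = 60   # hypothetical value; the original epoch count for this run is not reported
model.fit(X_train, Y_train, batch_size=batch_size, epochs=nb_epoch, verbose=1)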
Below is the improved accuracy (from 79% to 86%) after increasing the number of epochs.
accuracy is 0.865
[[100 0 1 2 1 3 1 1 4 0]
[ 0 97 1 1 1 3 0 0 5 0]
[ 0 0 93 5 0 0 0 3 2 0]
[ 0 2 0 69 0 0 0 0 5 1]
[ 0 0 0 0 68 2 1 1 1 3]
[ 0 1 1 21 1 91 4 0 8 2]
[ 0 0 1 0 2 0 94 0 0 1]
[ 0 0 2 0 0 1 0 93 0 4]
[ 0 0 0 1 0 0 0 0 73 2]
[ 0 0 1 1 27 0 0 2 2 87]]
3rd Network Configuration:
Code:
import pandas as pd
import numpy as np
np.random.seed(1337)
import matplotlib.pyplot as plt

from keras.models import Sequential
from keras.layers import Convolution2D, AveragePooling2D, Activation, Dropout, Flatten, Dense, ZeroPadding2D

batch_size = 100
nb_classes = 10
nb_epoch = 30

# X_train, Y_train, X_test, test and in_shape are prepared exactly as in the first configuration.
model = Sequential()
model.add(ZeroPadding2D((2, 2), input_shape=in_shape))  # same padding layer as the first configuration
# Roughly equivalent to C1
model.add(Convolution2D(6, (5, 5), activation='tanh', kernel_initializer='he_normal'))
# Roughly equivalent to S2
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Activation("sigmoid"))
# Roughly equivalent to C3
model.add(Convolution2D(16, (5, 5), activation='tanh', kernel_initializer='he_normal'))
# Roughly equivalent to S4
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Activation("sigmoid"))
# New layers C and S inserted between S4 and the drop-out layer (described below)
model.add(Convolution2D(32, (2, 2), activation='tanh', kernel_initializer='he_normal'))
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
# Roughly equivalent to C5
model.add(Convolution2D(120, (2, 2), activation='tanh', kernel_initializer='he_normal'))
model.add(Flatten())
# Roughly equivalent to F6
model.add(Dense(84, activation='tanh', kernel_initializer='he_normal'))
# Output Layer
model.add(Dense(nb_classes, activation='softmax', kernel_initializer='he_normal'))  # last layer with one output per class
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])  # compile settings assumed; not shown in the original
model.fit(X_train, Y_train, batch_size=batch_size, epochs=nb_epoch, verbose=1)
# Line up our outputs on the test set with the labels from the test set and calculate a confusion matrix
yPred = model.predict_classes(X_test)
targets = test[:, 0]
cm = np.zeros((nb_classes, nb_classes), dtype=int)
for i in range(len(targets)):
    cm[yPred[i], targets[i]] += 1
print(cm)
accuracy is 0.728
[[79 1 1 3 0 4 0 0 0 0]
[ 2 97 2 4 2 4 0 3 2 1]
[ 2 1 71 11 0 1 1 0 0 0]
[ 2 1 15 66 0 14 0 10 31 1]
[ 0 0 1 0 85 3 4 1 1 13]
[ 7 0 1 3 0 43 1 1 20 5]
[ 0 0 8 0 4 3 94 0 2 0]
[ 0 0 1 7 0 6 0 78 2 2]
[ 8 0 0 5 0 11 0 1 39 2]
[ 0 0 0 1 9 11 0 6 3 76]]
# Let's look at some of the ones we got WRONG in the test set
test_wrong = [im for im in zip(X_test, yPred, test[:, 0]) if im[1] != im[2]]
plt.figure(figsize=(10, 10))
for ind, val in enumerate(test_wrong[:100]):
    plt.subplot(10, 10, ind + 1)
    im = 1 - val[0].reshape((28, 28))
    plt.axis("off")
    plt.imshow(im, cmap='gray')
plt.show()
# Let's look at some of the ones we got RIGHT in the test set
test_right = [im for im in zip(X_test, yPred, test[:, 0]) if im[1] == im[2]]
plt.figure(figsize=(10, 10))
for ind, val in enumerate(test_right[:100]):
    plt.subplot(10, 10, ind + 1)
    im = 1 - val[0].reshape((28, 28))
    plt.axis("off")
    plt.imshow(im, cmap='gray')
plt.show()
In the above network configuration, a new set of layers (a convolution layer 'C' and a
sub-sampling layer 'S') is inserted between S4 and the drop-out layer.
Here, each unit in layer 'C' is connected to a 2x2 receptive field in the previous layer (S4).
Also, each unit in layer 'S' is connected to a 2x2 receptive field in the previous layer (C).
The number of feature maps in layer 'C' is doubled, i.e. from 16 to 32.
The output dimensions of the new layers follow the usual formula
Dimension = [(n + 2p - f) / s] + 1 (rounded down),
where n = input size, p = padding, f = filter size, s = stride.
For layer C: n = 5, p = 0, f = 2, s = 1, so Dimension = [(5 + 0 - 2) / 1] + 1 = 4.
For layer S: n = 4, p = 0, f = 2, s = 2, so Dimension = [(4 + 0 - 2) / 2] + 1 = 2.
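The same arithmetic can be written as a small helper function (a sketch assuming integer sizes and no padding beyond what is stated above):
def output_dim(n, p, f, s):
    # Dimension = floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

print(output_dim(5, 0, 2, 1))   # layer C: 4
print(output_dim(4, 0, 2, 2))   # layer S: 2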
Adding more layers to a convolutional model brings a considerable change in the behaviour of the
model. So, I added an extra layer and experimented a bit with the number of feature maps in order to
analyze the network's performance on the given dataset.
Adding an extra layer to the convolutional network and doubling the number of feature
maps in the layers after it increases the number of trainable parameters, and as a result the
network finds it harder to learn well and to generalize to new, unseen test samples.
Also, adding an extra layer increases the computational complexity.
As we increase the number of hidden layers, the effectiveness of backpropagation decreases.
The network might suffer from the vanishing-gradient problem, which refers to the fact that
as we keep increasing the number of hidden layers, the weights in the lower layers of the
network are learned/updated very slowly.
Thus, the gradients vanish for the earlier layers and those layers effectively stop learning.
Also, we need to deal with the overfitting problem, i.e. the network performs very well on the
training data set but cannot generalize to new data it has not seen, and gives poor
performance on that new data.
As a result, the network performs somewhat worse than before (i.e. without the extra layers)
for the reasons described above.