
Artificial Neural Network - HW#4b

By- Khyati Sinha


(ksinha@pdx.edu)

1. The diagram below shows the configuration of this deep network.

In the above diagram:

f denotes the filter size used between consecutive layers.
C1 has 6 feature maps.
S2 has 6 feature maps.
C3 has 16 feature maps.
S4 has 16 feature maps.
There is an added dropout layer with a dropout fraction of 0.2 between S4 and C5.
C5 consists of 120 feature maps; unlike in LeNet-5, it is not fully connected to S4 (it uses 2x2 kernels on the 5x5 S4 maps).
F6 is a fully connected layer of 84 units whose output goes to a softmax activation function with 10 outputs, giving the probability of each digit (0-9) being recognized. A short shape walk through this stack is sketched below.
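
As a quick check of these sizes, here is a minimal shape-walk sketch (assuming 28x28 inputs zero-padded to 32x32, 'valid' convolutions, and 2x2 average pooling with stride 2, as in the code further below; the helper function is hypothetical, not part of the homework code):

# Hypothetical helper: spatial output size of a conv/pooling layer with 'valid' padding.
def out_size(n, f, s=1, p=0):
    return (n + 2 * p - f) // s + 1

n = 32                     # input, after zero-padding 28x28 to 32x32
n = out_size(n, f=5)       # C1: 6 feature maps of 28x28
n = out_size(n, f=2, s=2)  # S2: 6 feature maps of 14x14
n = out_size(n, f=5)       # C3: 16 feature maps of 10x10
n = out_size(n, f=2, s=2)  # S4: 16 feature maps of 5x5
n = out_size(n, f=2)       # C5: 120 feature maps of 4x4 (2x2 kernels in this model)
print(n)                   # -> 4; flattened, then F6 (84 units), then softmax (10 outputs)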

2)

• There are slight differences between this architecture and LeNet-5. The LeNet-5 architecture does not contain any dropout layer, whereas the architecture shown in the homework has a dropout layer after the sub-sampling layer S4. The connection from S4 to C5 in LeNet-5 is a full connection, but here a dropout layer is added between S4 and C5.

• The LeNet-5 architecture was trained on a much larger dataset (the full MNIST set, with tens of thousands of training and test samples). This particular architecture is trained on a correspondingly smaller dataset of around 5000 training examples and 1000 test examples.

• C5 is not fully connected in this model, as opposed to the LeNet-5 architecture.

• S2 does not contain the trainable coefficient and bias parameters of LeNet-5's sub-sampling layers; plain average pooling is used in this model.

• The output is fed to a softmax activation function in this model instead of the RBF (Radial Basis Function) output layer of LeNet-5; a minimal softmax sketch is shown right after this list.
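
For reference, a minimal numpy sketch of the softmax computation at the output layer (an illustration only, not the Keras internals):

import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability, exponentiate, then normalize.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([1.2, 0.3, -0.8, 2.5, 0.0, -1.1, 0.7, 0.1, -0.4, 1.8])  # hypothetical F6 outputs
probs = softmax(logits)
print(probs.round(3), probs.sum())  # ten class probabilities (digits 0-9) that sum to 1
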
Patterns the network has trouble learning:-

Below is the confusion matrix, also known as an error matrix, which allows the performance of the classifier to be visualized.

Here, the columns are the true labels and the rows are the predicted labels for the ten digits.
[[97 0 0 1 0 0 1 1 0 0]
[ 0 98 1 7 5 11 1 2 18 7]
[ 0 0 91 5 0 0 0 2 1 0]
[ 0 1 0 75 0 2 0 0 5 0]
[ 0 0 0 0 84 2 1 3 0 18]
[ 1 1 1 11 0 83 2 0 9 0]
[ 1 0 2 0 5 1 95 0 2 2]
[ 0 0 4 0 0 1 0 88 1 2]
[ 1 0 0 0 0 0 0 0 60 0]
[ 0 0 1 1 6 0 0 4 4 71]]

Shown below are the digits that are misclassified noticeably more often than the rest (a small sketch for extracting these confusions from the matrix programmatically follows).

Here, 5 is predicted as 1 eleven times out of the 100 test examples of that digit.

Here, 8 is predicted as 1 eighteen times out of the 100 test examples of that digit.

Here, 9 is predicted as 4 eighteen times out of the 100 test examples of that digit.
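
These can also be read off programmatically. A small sketch (assuming the cm array built by the code below, with rows as predicted labels and columns as true labels):

import numpy as np

off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)                 # ignore the correctly classified counts
top = off_diag.flatten().argsort()[::-1][:3]  # indices of the three largest confusions
for idx in top:
    pred, true = divmod(idx, cm.shape[1])
    print("digit %d predicted as %d: %d times" % (true, pred, off_diag[pred, true]))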

Three Different Network Configurations

1st Network Configuration:-

Code:-

import pandas as pd
import numpy as np

np.random.seed(1337)

import matplotlib.pyplot as plt

from keras import backend as K

from keras.models import Sequential

from keras.layers.core import Dense, Dropout, Activation, Flatten

from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D, AveragePooling2D

from keras.utils import np_utils

from keras.optimizers import SGD, RMSprop

img_rows, img_cols = 28, 28

batch_size = 100

nb_classes = 10

nb_epoch = 30

# Read the train and test datasets

train = pd.read_csv("./small_train.csv",header = None).values

test = pd.read_csv("./small_test.csv",header = None).values

# Check Keras backend

if(K.image_dim_ordering()=="th"): # for Theano
    X_train = train[:, 1:].reshape(train.shape[0], 1, img_rows, img_cols)
    X_test = test[:, 1:].reshape(test.shape[0], 1, img_rows, img_cols)
    in_shape = (1, img_rows, img_cols)
else: # for TensorFlow
    X_train = train[:, 1:].reshape(train.shape[0], img_rows, img_cols, 1)
    X_test = test[:, 1:].reshape(test.shape[0], img_rows, img_cols, 1)
    in_shape = (img_rows, img_cols, 1)


# First data is label (already removed from X_train)

y_train = train[:, 0]

# Make the value floats in [0;1] instead of int in [0;255]

X_train = X_train.astype('float32')

X_test = X_test.astype('float32')

X_train /= 255

X_test /= 255

# convert class vectors to binary class matrices (ie one-hot vectors)

Y_train = np_utils.to_categorical(y_train, nb_classes)

# Display the shapes to check if everything's ok

print('X_train shape:', X_train.shape)

print('Y_train shape:', Y_train.shape)

print('X_test shape:', X_test.shape)

model = Sequential()

# Add padding to take 28x28 to 32x32

model.add(ZeroPadding2D((2,2),input_shape=in_shape))

# Roughly equivalent to C1

model.add(Convolution2D(6, (5, 5), activation = 'tanh', kernel_initializer='he_normal'))

# Roughly equivalent to S2

model.add(AveragePooling2D(pool_size=(2, 2)))

model.add(Activation("sigmoid"))

# Roughly equivalent to C3
model.add(Convolution2D(16, (5, 5), activation = 'tanh', kernel_initializer='he_normal'))

# Roughly equivalent to S4

model.add(AveragePooling2D(pool_size=(2, 2)))

model.add(Activation("sigmoid"))

model.add(Dropout(0.2))

# Roughly equivalent to C5

model.add(Convolution2D(120, (2, 2), activation = 'tanh', kernel_initializer='he_normal'))

model.add(Flatten())

# Roughly equivalent to F6

model.add(Dense(84, activation = 'tanh', kernel_initializer='he_normal'))

# Output Layer

model.add(Dense(nb_classes, activation = 'softmax', kernel_initializer='he_normal')) # Last layer with one output per class

# Use RMSprop for training weights

model.compile(loss='categorical_crossentropy', optimizer=RMSprop(), metrics=["accuracy"])

# Let's Learn!!

model.fit(X_train, Y_train, batch_size=batch_size, epochs=nb_epoch, verbose=1)

# Use the test data to see how we do

yPred = model.predict_classes(X_test)

# Line up our outputs on the test set with the labels from the test set and calculate a confusion matrix

from sklearn import metrics

print("accuracy is %s" % metrics.accuracy_score(yPred,test[:,0]))


targets = test[:,0]

cm = np.array([[0] * 10] * 10)

for i in range(len(targets)):
    cm[yPred[i],targets[i]] += 1

print(cm)
accuracy is 0.905
[[99 0 0 1 0 0 0 0 1 0]
[ 0 98 1 4 3 7 0 0 7 0]
[ 0 1 94 6 0 0 0 1 1 1]
[ 0 0 0 79 0 1 0 0 1 1]
[ 0 0 0 0 81 2 1 2 1 3]
[ 0 0 0 6 0 85 1 0 0 0]
[ 0 0 0 0 1 0 97 0 0 0]
[ 0 0 3 0 0 1 0 94 0 1]
[ 1 1 1 3 0 3 1 0 87 3]
[ 0 0 1 1 15 1 0 3 2 91]]

# Let's look at the ones we got WRONG in the test set

test_wrong = [im for im in zip(X_test,yPred,test[:,0]) if im[1] != im[2]]

plt.figure(figsize=(10, 10))

for ind, val in enumerate(test_wrong[:100]):
    plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
    plt.subplot(10, 10, ind + 1)
    im = 1 - val[0].reshape((28,28))
    plt.axis("off")
    plt.text(0, 0, val[2], fontsize=14, color='blue')
    plt.text(8, 0, val[1], fontsize=14, color='red')
    plt.imshow(im, cmap='gray')

plt.show()

# Let's look at some of the ones we got RIGHT in the test set
test_right = [im for im in zip(X_test,yPred,test[:,0]) if im[1] == im[2]]

plt.figure(figsize=(10, 10))

for ind, val in enumerate(test_right[:100]):
    plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
    plt.subplot(10, 10, ind + 1)
    im = 1 - val[0].reshape((28,28))
    plt.axis("off")
    plt.text(0, 0, val[2], fontsize=14, color='blue')
    plt.text(8, 0, val[1], fontsize=14, color='red')
    plt.imshow(im, cmap='gray')

plt.show()

In this network configuration, we experiment with the activation functions, i.e. swapping the activation functions used in the convolution and sub-sampling layers (relu, tanh and sigmoid) in order to estimate their effect on the overall performance of the model.

Q. Why did you choose to make those changes?

A linear activation function is just an identity (or scaling) function that outputs essentially the same value that is fed into the node, so a stack of purely linear layers collapses into a single linear map and training does not work well in most cases.
We therefore introduce non-linearity into the network so that it can compute more interesting features and perform better; different non-linearities behave quite differently in practice.
Q. How/why does it do better or worse than the first network?

Replacing relu with tanh in convolution layers:-

Relu is more prone to overfitting.
Sigmoid and tanh are bounded functions, so their activations cannot blow up, which helps reduce overfitting. As data flows through a deep network, the weights and parameters rescale those values, sometimes making them too large or too small; softmax, sigmoid and tanh have the advantage of "no blowing-up activations", because their outputs are bounded (see the short sketch below).
Linear functions do not compute interesting features, since a linear activation is essentially just an identity/scaling function.
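
To make the bounded-versus-unbounded point concrete, here is a small numpy sketch (illustrative only) comparing the three activations on inputs of growing magnitude:

import numpy as np

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0, 100.0])
relu = np.maximum(0, z)          # unbounded above: grows linearly with the input
tanh = np.tanh(z)                # bounded in (-1, 1)
sigmoid = 1 / (1 + np.exp(-z))   # bounded in (0, 1)
print(relu)                      # [  0.   0.   0.   1.  10. 100.]
print(tanh)                      # saturates at +/-1 for large |z|
print(sigmoid)                   # saturates at 0 or 1 for large |z|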

When replacing relu with tanh, the accuracy improved considerably from 85% to 90%.
accuracy is 0.905
[[99 0 0 1 0 0 0 0 1 0]
[ 0 98 1 4 3 7 0 0 7 0]
[ 0 1 94 6 0 0 0 1 1 1]
[ 0 0 0 79 0 1 0 0 1 1]
[ 0 0 0 0 81 2 1 2 1 3]
[ 0 0 0 6 0 85 1 0 0 0]
[ 0 0 0 0 1 0 97 0 0 0]
[ 0 0 3 0 0 1 0 94 0 1]
[ 1 1 1 3 0 3 1 0 87 3]
[ 0 0 1 1 15 1 0 3 2 91]]

Replacing relu with sigmoid in convolution layers:-

The logistic function (sigmoid activation function), unlike tanh, is not symmetric about 0: its outputs are always positive. This makes the sigmoid more prone to saturating the later layers, which makes training more difficult.
The math of the relu function is simply that wherever a negative number occurs, we swap it out for a 0.

When replacing relu with sigmoid, the accuracy decreased drastically.

Since training with the sigmoid is more prone to getting stuck in local minima (saturated units produce near-zero gradients), the accuracy is reduced.
accuracy is 0.392
[[62 0 0 0 0 1 0 0 0 0]
[ 0 77 12 2 0 7 5 2 4 0]
[ 0 0 8 0 0 0 0 0 0 0]
[ 2 0 37 23 0 5 0 0 8 0]
[22 0 8 3 70 22 63 5 21 33]
[ 3 0 0 0 0 0 0 0 0 0]
[ 0 0 20 1 0 1 30 0 0 0]
[ 1 23 15 66 18 42 1 92 48 37]
[ 0 0 0 0 0 0 0 0 0 0]
[10 0 0 5 12 22 1 1 19 30]]
Replacing sigmoid with tanh in sub-sampling layers:

The preference for tanh over the logistic function (sigmoid activation function) is that the former is symmetric about 0 while the latter is not.
Being zero-centred, tanh tends to produce more natural-looking fits for the data.
The sigmoid is more prone to getting stuck in local minima.

Replacing the sigmoid activation function with tanh results in a considerable improvement in accuracy (97.3%).
accuracy is 0.973
[[100 0 0 0 0 0 0 0 0 0]
[ 0 97 0 0 0 0 0 0 0 0]
[ 0 1 97 0 0 0 0 3 0 0]
[ 0 2 0 100 0 1 0 0 1 2]
[ 0 0 0 0 96 0 0 0 0 1]
[ 0 0 0 0 1 98 1 0 0 1]
[ 0 0 0 0 0 0 98 0 0 1]
[ 0 0 2 0 1 0 0 97 0 1]
[ 0 0 1 0 0 1 1 0 96 0]
[ 0 0 0 0 2 0 0 0 3 94]]

2nd Network Configuration:-

Code:

import pandas as pd
import numpy as np
np.random.seed(1337)
import matplotlib.pyplot as plt

from keras import backend as K


from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D, AveragePooling2D
from keras.utils import np_utils
from keras.optimizers import SGD, RMSprop

img_rows, img_cols = 28, 28

batch_size = 100
nb_classes = 10
nb_epoch = 30

# Read the train and test datasets


train = pd.read_csv("./small_train.csv",header = None).values
test = pd.read_csv("./small_test.csv",header = None).values

# Check Keras backend


if(K.image_dim_ordering()=="th"): # for Theano
    X_train = train[:, 1:].reshape(train.shape[0], 1, img_rows, img_cols)
    X_test = test[:, 1:].reshape(test.shape[0], 1, img_rows, img_cols)
    in_shape = (1, img_rows, img_cols)
else: # for TensorFlow
    X_train = train[:, 1:].reshape(train.shape[0], img_rows, img_cols, 1)
    X_test = test[:, 1:].reshape(test.shape[0], img_rows, img_cols, 1)
    in_shape = (img_rows, img_cols, 1)

# First data is label (already removed from X_train)


y_train = train[:, 0]

# Make the value floats in [0;1] instead of int in [0;255]


X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

# convert class vectors to binary class matrices (ie one-hot vectors)


Y_train = np_utils.to_categorical(y_train, nb_classes)

# Display the shapes to check if everything's ok


print('X_train shape:', X_train.shape)
print('Y_train shape:', Y_train.shape)
print('X_test shape:', X_test.shape)

model = Sequential()

# Add padding to take 28x28 to 32x32


model.add(ZeroPadding2D((2,2),input_shape=in_shape))

# Roughly equivalent to C1
model.add(Convolution2D(6, (5, 5), activation = 'relu',
kernel_initializer='he_normal'))

# Roughly equivalent to S2
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Activation("sigmoid"))

# added another dropout layer with dropout fraction of 0.2


model.add(Dropout(0.2))

# Roughly equivalent to C3
model.add(Convolution2D(16, (5, 5), activation = 'relu',
kernel_initializer='he_normal'))

# Add padding to take 28x28 to 32x32


# model.add(ZeroPadding2D((2,2),input_shape=in_shape))

# Roughly equivalent to S4
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Activation("sigmoid"))

model.add(Dropout(0.2))
# Roughly equivalent to C5
model.add(Convolution2D(120, (2, 2), activation = 'relu',
kernel_initializer='he_normal'))
model.add(Flatten())

# Roughly equivalent to F6
model.add(Dense(84, activation = 'tanh',
kernel_initializer='he_normal'))

# Output Layer
model.add(Dense(nb_classes, activation = 'softmax',
kernel_initializer='he_normal')) #Last layer with one output per class

# Use RMSprop for training weights


model.compile(loss='categorical_crossentropy', optimizer=RMSprop(),
metrics=["accuracy"])

# Alternative training approach using stochastic gradient descent (very very slow)
# sgd = SGD(lr=0.1,decay=1e-6,momentum=1.0)
# model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=["accuracy"])

# Let's Learn!!
model.fit(X_train, Y_train, batch_size=batch_size, epochs=nb_epoch,
verbose=1)

# Use the test data to see how we do


yPred = model.predict_classes(X_test)

# Line up our outputs on the test set with the labels from the test set and calculate a confusion matrix
from sklearn import metrics

print("accuracy is %s" % metrics.accuracy_score(yPred,test[:,0]))


targets = test[:,0]

cm = np.array([[0] * 10] * 10)


for i in range(len(targets)):
    cm[yPred[i],targets[i]] += 1

print(cm)
accuracy is 0.799
[[96 0 0 1 0 2 1 0 0 0]
[ 0 97 1 5 4 11 3 4 15 8]
[ 0 1 92 8 0 0 1 6 2 0]
[ 0 1 1 71 0 7 0 1 8 1]
[ 0 0 1 0 78 2 1 4 1 32]
[ 2 0 1 12 0 76 4 0 5 1]
[ 0 0 1 0 3 1 90 0 0 1]
[ 0 1 2 0 0 1 0 78 0 2]
[ 2 0 0 2 1 0 0 0 66 0]
[ 0 0 1 1 14 0 0 7 3 55]]

# Let's look at the ones we got WRONG in the test set

test_wrong = [im for im in zip(X_test,yPred,test[:,0]) if im[1] != im[2]]

plt.figure(figsize=(10, 10))
for ind, val in enumerate(test_wrong[:100]):
    plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
    plt.subplot(10, 10, ind + 1)
    im = 1 - val[0].reshape((28,28))
    plt.axis("off")
    plt.text(0, 0, val[2], fontsize=14, color='blue')
    plt.text(8, 0, val[1], fontsize=14, color='red')
    plt.imshow(im, cmap='gray')

plt.show()
# Let's look at some of the ones we got RIGHT in the test set

test_right = [im for im in zip(X_test,yPred,test[:,0]) if im[1] == im[2]]

plt.figure(figsize=(10, 10))
for ind, val in enumerate(test_right[:100]):
    plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
    plt.subplot(10, 10, ind + 1)
    im = 1 - val[0].reshape((28,28))
    plt.axis("off")
    plt.text(0, 0, val[2], fontsize=14, color='blue')
    plt.text(8, 0, val[1], fontsize=14, color='red')
    plt.imshow(im, cmap='gray')

plt.show()

In this network configuration, we added an extra dropout layer with a dropout fraction of 0.2, so that both dropout layers (the one between S4 and C5 and the new one between S2 and C3) are present, in order to analyze the change in the network's behavior.

Q. Why did you choose to make those changes?

I decided to make those changes because I wanted to experiment with the dropout layer and analyze the difference in the performance of the model.

In dropout, we randomly drop a fraction of the hidden nodes (here 20% per layer) on each training update, excluding the input and output nodes. As a result, we can avoid overfitting.
The dropout procedure is like averaging the effects of a very large number of different thinned networks (see the sketch below).
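
A minimal numpy sketch of (inverted) dropout at training time, for illustration only; it is not the Keras implementation:

import numpy as np

def dropout(activations, rate=0.2):
    # Keep each unit with probability (1 - rate) and rescale so the expected value is unchanged.
    keep_prob = 1.0 - rate
    mask = (np.random.rand(*activations.shape) < keep_prob).astype(activations.dtype)
    return activations * mask / keep_prob

h = np.random.rand(4, 6).astype('float32')  # hypothetical hidden-layer activations
print(dropout(h, rate=0.2))                 # roughly 20% of the entries are zeroed on this pass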

Q. How/why does it do better or worse than the first network?


Since the dataset we are using is not very complicated and is not especially prone to overfitting, the accuracy does not improve on adding further dropout layers or on increasing the dropout fraction from 0.2 to 0.5 (0.2 already appears to be the better value for this dataset).

So, in this particular case, adding a dropout layer does not have much impact on accuracy. Dropout helps most in settings where overfitting is acute, for example when training large networks on large training sets; that is not the situation here.
The observation is that on increasing the dropout fraction (from 0.2 to 0.5), the accuracy became lower.

accuracy is 0.803
[[93 0 0 1 0 1 1 1 0 0]
[ 0 97 1 4 0 10 4 3 12 0]
[ 0 0 90 4 0 0 1 2 0 0]
[ 0 2 3 84 0 13 0 0 13 0]
[ 0 0 0 0 68 2 2 1 2 7]
[ 1 0 0 1 0 57 2 0 3 0]
[ 0 0 1 0 1 0 90 0 1 1]
[ 0 1 4 3 0 6 0 86 2 10]
[ 5 0 0 1 1 8 0 0 58 2]
[ 1 0 1 2 30 3 0 7 9 80]]

Also, on adding more dropout layers, the accuracy decreased.

accuracy is 0.799
[[96 0 0 1 0 2 1 0 0 0]
[ 0 97 1 5 4 11 3 4 15 8]
[ 0 1 92 8 0 0 1 6 2 0]
[ 0 1 1 71 0 7 0 1 8 1]
[ 0 0 1 0 78 2 1 4 1 32]
[ 2 0 1 12 0 76 4 0 5 1]
[ 0 0 1 0 3 1 90 0 0 1]
[ 0 1 2 0 0 1 0 78 0 2]
[ 2 0 0 2 1 0 0 0 66 0]
[ 0 0 1 1 14 0 0 7 3 55]]

Since dropout removes a random subset of the neurons on each update and so modifies the effective network structure from batch to batch, we might need more epochs in order to train the network well.

Dropout roughly requires more iterations to converge.

Below is the improved accuracy (from 79.9% to 86.5%) after increasing the number of epochs.
accuracy is 0.865
[[100 0 1 2 1 3 1 1 4 0]
[ 0 97 1 1 1 3 0 0 5 0]
[ 0 0 93 5 0 0 0 3 2 0]
[ 0 2 0 69 0 0 0 0 5 1]
[ 0 0 0 0 68 2 1 1 1 3]
[ 0 1 1 21 1 91 4 0 8 2]
[ 0 0 1 0 2 0 94 0 0 1]
[ 0 0 2 0 0 1 0 93 0 4]
[ 0 0 0 1 0 0 0 0 73 2]
[ 0 0 1 1 27 0 0 2 2 87]]
3rd Network Configuration:-
Code:

import pandas as pd
import numpy as np
np.random.seed(1337)
import matplotlib.pyplot as plt

from keras import backend as K


from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D, AveragePooling2D
from keras.utils import np_utils
from keras.optimizers import SGD, RMSprop

img_rows, img_cols = 28, 28

batch_size = 100
nb_classes = 10
nb_epoch = 30

# Read the train and test datasets


train = pd.read_csv("./small_train.csv",header = None).values
test = pd.read_csv("./small_test.csv",header = None).values

# Check Keras backend


if(K.image_dim_ordering()=="th"): # for Theano
    X_train = train[:, 1:].reshape(train.shape[0], 1, img_rows, img_cols)
    X_test = test[:, 1:].reshape(test.shape[0], 1, img_rows, img_cols)
    in_shape = (1, img_rows, img_cols)
else: # for TensorFlow
    X_train = train[:, 1:].reshape(train.shape[0], img_rows, img_cols, 1)
    X_test = test[:, 1:].reshape(test.shape[0], img_rows, img_cols, 1)
    in_shape = (img_rows, img_cols, 1)

# First data is label (already removed from X_train)


y_train = train[:, 0]

# Make the value floats in [0;1] instead of int in [0;255]


X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

# convert class vectors to binary class matrices (ie one-hot vectors)


Y_train = np_utils.to_categorical(y_train, nb_classes)

# Display the shapes to check if everything's ok


print('X_train shape:', X_train.shape)
print('Y_train shape:', Y_train.shape)
print('X_test shape:', X_test.shape)

model = Sequential()

# Add padding to take 28x28 to 32x32


model.add(ZeroPadding2D((2,2),input_shape=in_shape))

# Roughly equivalent to C1
model.add(Convolution2D(6, (5, 5), activation = 'tanh',
kernel_initializer='he_normal'))

# Roughly equivalent to S2
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Activation("sigmoid"))

# Roughly equivalent to C3
model.add(Convolution2D(16, (5, 5), activation = 'tanh',
kernel_initializer='he_normal'))

# Add padding to take 28x28 to 32x32


# model.add(ZeroPadding2D((2,2),input_shape=in_shape))

# Roughly equivalent to S4
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Activation("sigmoid"))

# Equivalent to new convolutional layer added ‘C’


model.add(Convolution2D(32, (2, 2), activation = 'tanh',
kernel_initializer='he_normal'))

# equivalent to new sub-sampling layer added ‘S’


model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Activation("sigmoid"))

model.add(Dropout(0.2))

# Roughly equivalent to C5
model.add(Convolution2D(120, (2, 2), activation = 'tanh',
kernel_initializer='he_normal'))
model.add(Flatten())

# Roughly equivalent to F6
model.add(Dense(84, activation = 'tanh',
kernel_initializer='he_normal'))

# model.add(Dense(48, activation = 'tanh', kernel_initializer='he_normal'))

# Output Layer
model.add(Dense(nb_classes, activation = 'softmax',
kernel_initializer='he_normal')) #Last layer with one output per class

# Use RMSprop for training weights


model.compile(loss='categorical_crossentropy', optimizer=RMSprop(),
metrics=["accuracy"])
# Let's Learn!!
model.fit(X_train, Y_train, batch_size=batch_size, epochs=nb_epoch,
verbose=1)

# Use the test data to see how we do


yPred = model.predict_classes(X_test)

# Line up our outputs on the test set with the labels from the test set and calculate a confusion matrix

from sklearn import metrics

print("accuracy is %s" % metrics.accuracy_score(yPred,test[:,0]))

targets = test[:,0]

cm = np.array([[0] * 10] * 10)

for i in range(len(targets)):
    cm[yPred[i],targets[i]] += 1

print(cm)

accuracy is 0.728
[[79 1 1 3 0 4 0 0 0 0]
[ 2 97 2 4 2 4 0 3 2 1]
[ 2 1 71 11 0 1 1 0 0 0]
[ 2 1 15 66 0 14 0 10 31 1]
[ 0 0 1 0 85 3 4 1 1 13]
[ 7 0 1 3 0 43 1 1 20 5]
[ 0 0 8 0 4 3 94 0 2 0]
[ 0 0 1 7 0 6 0 78 2 2]
[ 8 0 0 5 0 11 0 1 39 2]
[ 0 0 0 1 9 11 0 6 3 76]]

# Let's look at the ones we got WRONG in the test set

test_wrong = [im for im in zip(X_test,yPred,test[:,0]) if im[1] != im[2]]

plt.figure(figsize=(10, 10))

for ind, val in enumerate(test_wrong[:100]):
    plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
    plt.subplot(10, 10, ind + 1)
    im = 1 - val[0].reshape((28,28))
    plt.axis("off")
    plt.text(0, 0, val[2], fontsize=14, color='blue')
    plt.text(8, 0, val[1], fontsize=14, color='red')
    plt.imshow(im, cmap='gray')

plt.show()

# Let's look at some of the ones we got RIGHT in the test set

test_right = [im for im in zip(X_test,yPred,test[:,0]) if im[1] == im[2]]

plt.figure(figsize=(10, 10))

for ind, val in enumerate(test_right[:100]):
    plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
    plt.subplot(10, 10, ind + 1)
    im = 1 - val[0].reshape((28,28))
    plt.axis("off")
    plt.text(0, 0, val[2], fontsize=14, color='blue')
    plt.text(8, 0, val[1], fontsize=14, color='red')
    plt.imshow(im, cmap='gray')

plt.show()
In the above network configuration, a new pair of layers (the convolution layer 'C' and the sub-sampling layer 'S') is inserted between S4 and the dropout layer.

Here, each unit in layer "C" is connected to a 2x2 receptive field in the previous layer (S4).

Also, each unit in layer "S" is connected to a 2x2 receptive field in the previous layer (C).

The number of feature maps in layer 'C' is doubled, i.e. from 16 to 32.

The dimensions of the feature map in C are calculated as follows:

Dimension = ((n + 2p - f) / s) + 1, where

n = dimension of the feature map of the previous layer

p = padding

f = size of the filter

s = stride

So, here n = 5, p = 0, s = 1, f = 2

therefore, dimension = ((5 + 0 - 2) / 1) + 1 = 4

The dimensions of the feature map in S are calculated the same way. Each unit in layer "S" is connected to a 2x2 receptive field of the previous layer.

Dimension = ((n + 2p - f) / s) + 1

Here, n = 4, p = 0, f = 2, s = 2

therefore, dimension = ((4 + 0 - 2) / 2) + 1 = 2

Q. Why did you choose to make those changes?

Adding more layers to a convolutional model brings a considerable change in the behaviour of the model. So, I added an extra pair of layers and experimented a bit with the number of feature maps to analyze the network's performance on the given dataset.

Q. How/why does it do better or worse than the first network?


Adding an extra convolutional and sub-sampling layer between S4 and C5, and doubling the number of feature maps in the new layer C (from 16 to 32), lowers the accuracy noticeably (to 72.8%) compared to the original network configuration.
accuracy is 0.728
[[79 1 1 3 0 4 0 0 0 0]
[ 2 97 2 4 2 4 0 3 2 1]
[ 2 1 71 11 0 1 1 0 0 0]
[ 2 1 15 66 0 14 0 10 31 1]
[ 0 0 1 0 85 3 4 1 1 13]
[ 7 0 1 3 0 43 1 1 20 5]
[ 0 0 8 0 4 3 94 0 2 0]
[ 0 0 1 7 0 6 0 78 2 2]
[ 8 0 0 5 0 11 0 1 39 2]
[ 0 0 0 1 9 11 0 6 3 76]]

Adding an extra pair of layers to the convolutional network and doubling the number of feature maps after it increases the number of trainable parameters; as a result, with only around 5000 training examples, the network finds it more difficult to learn well and to generalize to new, unseen test samples.
Adding an extra layer also increases the computational complexity.
As we increase the number of hidden layers, the effectiveness of backpropagation decreases. The network may suffer from the vanishing gradient problem: as we keep increasing the number of hidden layers, the weights in the lower (earlier) layers of the network are updated very slowly. The gradients effectively vanish for those earlier layers, so they barely learn (a small sketch of this effect follows below).
We also have to deal with the overfitting problem, i.e. the network performs very well on the training data set but cannot generalize to new data it has not seen, giving poor performance on that data.
So, the assumption is that the network performs somewhat worse than before (i.e. without the extra layers) for the reasons above.
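
To illustrate the vanishing gradient point, here is a small numpy sketch (illustrative only) showing how the product of sigmoid derivatives, one factor per layer in a simplified backpropagation chain, shrinks as the depth grows, since the sigmoid's derivative never exceeds 0.25:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)  # maximum value is 0.25, reached at z = 0

np.random.seed(0)
for depth in (2, 5, 10, 20):
    # Backprop multiplies one local derivative per layer (weights ignored for simplicity).
    grads = sigmoid_grad(np.random.randn(depth))
    print(depth, np.prod(grads))  # shrinks roughly like 0.25**depth or faster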
