Deep Learning
A Project report submitted in partial fulfillment
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
[2019-2020]
Submitted by
S. Chandra sekhar (16A51A0536) M. Suresh(16A51A0553)
We are also very much thankful to Dr. G.S.N. Murthy, Head of Computer Science &
Engineering, AITAM, Tekkali, for his help and encouragement.
We have great pleasure in acknowledging our sincere gratitude to our project guide Sri
K. Prasada Rao, Sr. Asst. Prof., Department of Computer Science and Engineering, AITAM,
Tekkali, for his help and guidance during the project. His valuable suggestions and
encouragement helped us a lot in carrying out this project work as well as in bringing this
project to this form.
We are extremely grateful to our department staff members, lab technicians and non-teaching
staff members for their extensive help throughout our project.
Finally, we express our heartfelt thanks to all our friends who helped us in the successful
completion of this project.
PROJECT ASSOCIATES
S. Chandra sekhar(16A51A0536)
M. Suresh Kumar(16A51A0553)
CH. Khageshwara Rao (16A51A0537)
B.Mamatha(16A51A0528)
DECLARATION
We hereby declare that the project titled "Facial Expression Recognition in E-Learning
Environment using Deep Learning" is a bonafide work done by us at AITAM, Tekkali
Affiliated to JNTU, Kakinada towards the partial fulfillment for the award of Degree of
Bachelor of Technology in Computer Science and Engineering during the period 2019-20.
Project Associates
S. Chandra sekhar(16A51A0536)
M. Suresh Kumar(16A51A0553)
CH. Khageshwara Rao (16A51A0537)
B.Mamatha(16A51A0528)
ABSTRACT
Face recognition has become an attractive field in computer-based application development
over the last few decades. E-learning systems are becoming more and more popular among
students. However, the emotion of students is usually neglected in e-learning
systems. This project is mainly concerned with using facial expressions to detect emotion in
an e-learning system. OpenCV and Keras provide many algorithms for facial recognition
and emotion capture. The captured facial expression is used in the e-learning
environment to analyze the learner's mood. Finally, we propose the design of an experiment
to evaluate the performance of this method in a real e-learning system.
INDEX
CHAPTERS
1. INTRODUCTION
1.1 Introduction
4. METHODOLOGY
4.1 Image Acquisition
4.5 Classification
5. DESIGN
5.1 UML Diagrams
5.4 Sequence Diagram
6. DATASETS
6.1 FER Datasets for standard emotion
7. IMPLEMENTATION
7.1 Loading and splitting dataset
10. CODING
10.1 Load and split the dataset
LIST OF FIGURES
13 Sequence Diagram
14 CNN Architecture
LIST OF TABLES
1 Software Used
2 Dependencies Used
3 Hardware Used
CHAPTER-1
INTRODUCTION
1. INTRODUCTION
1.1 Introduction:
Facial expressions are the facial changes in response to a person’s internal emotional states,
intentions, or social communications. Facial emotion recognition is the process of detecting
human emotions from facial expressions. The human brain recognizes emotions
automatically, and software has now been developed that can recognize emotions as well.
This technology is becoming more accurate all the time, and will eventually be able to read
emotions as well as our brains do. AI can detect emotions by learning what each facial
expression means and applying that knowledge to the new information presented to it.
Emotional artificial intelligence, or emotion AI, is a technology that is capable of reading,
imitating, interpreting, and responding to human facial expressions and emotions.
Face acquisition can use a detector that finds faces in every frame, or it can detect the face
in the first frame and then track it through the remainder of the video sequence. To handle
large head motion, a head finder, head tracking, and pose estimation can be applied in a facial
expression analysis system. After the face is located, the next step is to extract and represent
the facial changes caused by facial expressions. In facial feature extraction for expression
analysis, there are mainly two types of approaches: geometric feature-based methods and
appearance-based methods. Geometric facial features represent the shape and locations of facial
components (including the mouth, eyes, brows, and nose). The facial components or facial
feature points are extracted to form a feature vector that represents the face geometry. With
appearance-based methods, image filters, such as Gabor wavelets, are applied to either the
whole face or specific regions of a face image to extract a feature vector. Depending on the
facial feature extraction method, the effects of in-plane head rotation and different face
scales can be eliminated by face normalization before feature extraction or by feature
representation before the expression recognition step.
Car manufacturers around the world are increasingly focusing on making cars more personal and
safer to drive. In their pursuit of smarter car features, it makes sense for manufacturers to
use AI to help them understand human emotions. Using facial emotion detection, smart cars
can alert the driver when he or she is feeling drowsy.
Market Research
Traditional market research companies have mostly employed verbal methods, in the form of
surveys, to find consumers' wants and needs. However, such methods assume that consumers
can formulate their preferences verbally and that the stated preferences correspond to future
actions, which may not always be the case. Detecting emotions with technology is a challenging
task, yet one where machine learning algorithms have shown great promise. Using ParallelDots'
Facial Emotion Detection API, customers can process images and videos in real time for
monitoring video feeds or automating video analytics, thus saving costs. The API is priced on a
Pay-As-You-Go model, which allows the technology to be tested before scaling up.
Facial emotion detection is only a subset of what visual intelligence can do to analyze videos
and images automatically.
As soon as the idea of e-learning was introduced, expectations of converting formal education to
e-learning rose. However, in the early 2000s, web technologies could not meet these
high expectations for e-learning systems and course materials (Fig. 2). As
personal computer usage and internet bandwidth have increased, e-learning systems have
spread widely. Although e-learning has advantages over formal learning in terms of information
accessibility and flexibility of time and place, it does not provide
enough face-to-face interactivity between an educator and learners. In this project, we
propose a hybrid information system that combines computer vision and machine
learning technologies for visual and interactive e-learning systems. E-learning has become
more and more common among universities and colleges because of its advantages over
traditional approaches: students can access educational materials easily at any time and from
anywhere. In addition, with personalized learning technologies, productivity in educating
heterogeneous student groups is also increasing. Today, e-learning is not limited to
computer-aided systems; it is also accessible from mobile devices. Thus, students can take
virtual courses from their personal computers or mobile devices, at home or in a cafe.
This flexibility allows students to feel comfortable, which makes the learning process easier
and more efficient.
1.3 Deep Learning
Deep learning[1] is a machine learning technique that teaches computers to do what comes
naturally to humans: learn by example. Deep learning is a key technology behind driverless
cars, enabling them to recognize a stop sign, or to distinguish a pedestrian from a lamppost. It
is the key to voice control in consumer devices like phones, tablets, TVs, and hands-free
speakers. Deep learning is getting lots of attention lately and for good reason. It’s achieving
results that were not possible before.
Fig 3: Neural networks, which are organized in layers consisting of a set of interconnected
nodes. Networks can have tens or hundreds of hidden layers.
CNNs eliminate the need for manual feature extraction, so you do not need to identify
features used to classify images. CNN works by extracting features directly from images. The
relevant features are not pretrained; they are learned while the network trains on a collection
of images. This automated feature extraction makes deep learning models highly accurate for
computer vision tasks such as object classification.
Activation Function:
1) Linear Function :- No matter how many layers we have, if all of them are linear, the
output of the last layer is just a linear function of the input to
the first layer. Issue : The derivative of a linear function is a constant that does not
depend on the input x, so the gradient carries no information about the input and stacking
linear layers adds no expressive power.
Equation : y = ax
2) Sigmoid Function :- Non-linear. For x between -2 and 2 the curve is
very steep, so small changes in x bring about large changes in the value
of y. It is usually used in the output layer of a binary classifier, where the result is either
0 or 1: because the sigmoid output lies between 0 and 1, the result is predicted as
1 if the value is greater than 0.5 and 0 otherwise.
Equation : y = 1 / (1 + e^(-x))
3) Tanh Function :- The tanh (hyperbolic tangent) function often works better than the sigmoid
function in hidden layers. It is a mathematically shifted and scaled version of the sigmoid
function with outputs between -1 and 1. Both are similar and can be derived from each other.
Equation : tanh(x) = 2 * sigmoid(2x) - 1
4) ReLU :- Stands for rectified linear unit. It is the most widely used activation function and
is chiefly used in the hidden layers of neural networks. It outputs x if x is positive
and 0 otherwise. It is non-linear, so errors can be backpropagated through
multiple layers of neurons activated by the ReLU function. ReLU is less
computationally expensive than tanh and sigmoid because it involves simpler mathematical
operations.
Equation : y = max(0, x)
5) Softmax Function :- The softmax function generalizes the sigmoid function to multiple
classes and is used for multi-class classification problems. It squeezes the output for each
class between 0 and 1 by exponentiating each output and dividing by the sum of the
exponentials, so that the outputs sum to 1. The softmax function is typically used
in the output layer of the classifier, where we want the probability of each class
for each input.
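The scalar activation functions described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the Keras implementations used in the project; the function names and the `stacked_linear` helper are assumptions introduced only for this example. It also checks two claims from the list: tanh can be derived from the sigmoid, and stacking linear layers collapses to a single linear function.

```python
import math


def sigmoid(x: float) -> float:
    """Squash x into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x: float) -> float:
    """tanh expressed as a shifted and scaled sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def relu(x: float) -> float:
    """Return x for positive inputs and 0 otherwise."""
    return max(0.0, x)


def stacked_linear(x: float, slopes: list) -> float:
    """Apply several linear layers y = a * x in sequence (hypothetical helper)."""
    for a in slopes:
        x = a * x
    return x


# Three linear layers with slopes 3, 0.5 and 4 behave like one layer with slope 6.
collapsed = stacked_linear(2.0, [3.0, 0.5, 4.0])
```

Here `collapsed` equals `6 * 2.0`, which is why a purely linear network gains nothing from extra depth.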
CHAPTER-2
LITERATURE SURVEY
2. Literature Survey
Facial emotion recognition is widely used in various fields, for example in medicine (studying
recognition impairment in chronic temporal lobe epilepsy) and in measuring the effect of yoga
therapy on facial emotion recognition. Most of this work uses either a CNN model or an
SVM classifier for prediction.
Handcrafted features are those that are manually extracted from a given image. A
facial image contains feature points that represent the facial parts (such as the nose,
eyes and lips). These feature points are passed to a classifier (such as an SVM or a
random forest) to train the model. The trained model is then used to predict the
class labels.
In this project, we perform face detection and emotion recognition using some of the
packages available in Python (such as Keras and OpenCV). The seven universal emotions are
happiness, sadness, anger, surprise, contempt, fear and disgust. These emotions do not
fit the e-learning environment well, so we use the cognitive emotions boredom, confusion,
engagement and frustration, which are better suited to summarizing the learner's mental
state in an e-learning environment. We propose a system in which facial emotions
are recognised in e-learning systems. The system uses a hybrid architecture[2] (Fig. 4)
in which features are extracted using a CNN and then passed to an
SVM classifier. The layers that follow the fully connected layer are removed from the CNN
architecture so that it produces a feature vector. This hybrid model is expected to perform
better than the regular CNN architecture. All the results are included in section 9.
Fig.4 : Proposed Hybrid Architecture for emotion detection
CHAPTER-3
REQUIREMENTS AND TECHNICAL DESCRIPTION
3. REQUIREMENTS AND TECHNICAL DESCRIPTION
3.1 System Configuration
A system configuration (SC) defines the computers, processes, and devices that compose
the system and its boundary. More generally, the system configuration is the specific
definition of the elements that define and/or prescribe what a system is composed of.
Alternatively, the term "system configuration" can refer to a declarative model
of abstract generalized systems. In this sense, the configuration information is
not tailored to any specific use, but stands alone as a data set.
Software Used:
Python 3.6
Dependencies Used:
Requirements    Version
Keras           2.2.1
Tensorflow      2.2.0
Numpy           1.18.1
DLib            19.9.0
Sklearn         0.21.1
Hardware Used:
HardDisk        1 TB
RAM             8 GB
3.2.1 Python
Python is a general-purpose, interpreted, interactive, object-oriented, high-level
programming language. It was created by Guido van Rossum during 1985-1990. It
provides many packages for image processing (such as OpenCV and Pillow) and deep
learning (such as Keras, Theano and TensorFlow).
Keras:
Inside Convnets:
There exists a filter (also called a neuron or kernel) that lies over some of the pixels of the
input image, depending on the kernel size. The kernel slides over the
input image, multiplying the values in the filter with the original pixel values of the
image and summing the products.
Kernel:
The kernel is a filter that is used to extract features from the images. The
kernel is a matrix that moves over the input data, performs the dot product with each
sub-region of the input data, and produces a matrix of dot products as output. The kernel moves
over the input data by the stride value.
Stride in Convnets:
The stride denotes how many positions the kernel moves at each step of the convolution. By
default it is one. Stride controls how the filter convolves around the input volume. Stride is
normally set so that the output volume has integer dimensions, not fractional ones.
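The sliding dot product and the effect of the stride can be sketched as follows. This is an illustrative plain-Python version that assumes a square kernel and no padding; Keras performs the same operation far more efficiently inside its `Conv2D` layer. The function name `conv2d` and the sample values are assumptions made for this example.

```python
def conv2d(image, kernel, stride=1):
    """Slide a square kernel over a 2-D image (nested lists) without padding."""
    k = len(kernel)
    out_rows = (len(image) - k) // stride + 1
    out_cols = (len(image[0]) - k) // stride + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            top, left = r * stride, c * stride
            # Dot product of the kernel with the sub-region it currently covers.
            total = sum(
                kernel[i][j] * image[top + i][left + j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [[1, 1], [1, 1]]  # sums each 2 x 2 window

result_stride1 = conv2d(image, kernel, stride=1)  # 3 x 3 output
result_stride2 = conv2d(image, kernel, stride=2)  # 2 x 2 output
```

With a stride of 1 the 4 x 4 input yields a 3 x 3 output; a stride of 2 skips every other position and yields a 2 x 2 output.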
Padding:
The output of a convolution is smaller than its input. To keep the output the same size as
the input, we use padding. Padding is the process of adding zeros symmetrically around the
input matrix.
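A small sketch of symmetric zero padding, together with the standard output-size formula, shows why padding preserves the input size. The helper names `zero_pad` and `conv_output_size` are assumptions for illustration; the 48-pixel input matches the frame size used later in this report, and the 3 x 3 kernel is an assumed example.

```python
def zero_pad(matrix, pad):
    """Surround a 2-D matrix (nested lists) with `pad` rows and columns of zeros."""
    width = len(matrix[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in matrix:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded


def conv_output_size(n, k, pad=0, stride=1):
    """Output length along one dimension for input n, kernel k, padding and stride."""
    return (n + 2 * pad - k) // stride + 1


padded = zero_pad([[1, 2], [3, 4]], pad=1)
without_padding = conv_output_size(48, 3)         # output shrinks
with_padding = conv_output_size(48, 3, pad=1)     # output keeps the input size
```

For a 3 x 3 kernel, one ring of zeros is enough to keep a 48 x 48 input at 48 x 48.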
Colab is well suited to running the more time-consuming modules.
Colab[3] is used extensively in the machine learning community with applications including:
CHAPTER-4
METHODOLOGY
4 Methodology
There are some common features that we find on most common human faces :
The idea behind HOG (Histogram of Oriented Gradients) is to extract features into a vector and
feed it into a classification algorithm, such as a Support Vector Machine, that assesses whether
a face (or any object it has been trained to recognize) is present in a region or not. The
features extracted are the distributions (histograms) of the directions of the gradients
(oriented gradients) of the image. Gradients are typically large around edges and corners,
which allows us to detect those regions.
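The core of the HOG idea, a histogram of gradient directions weighted by gradient magnitude, can be sketched as below. This is a simplified illustration over a single region: a full HOG descriptor also divides the image into cells and normalizes over blocks, which is omitted here. The function name, the choice of 9 bins over 0 to 180 degrees, and the sample image are assumptions for the example.

```python
import math


def orientation_histogram(image, bins=9):
    """Histogram of unsigned gradient directions (0 to 180 degrees), weighted by magnitude."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            # Central differences approximate the horizontal and vertical gradients.
            gx = image[r][c + 1] - image[r][c - 1]
            gy = image[r + 1][c] - image[r - 1][c]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# A vertical edge: dark on the left, bright on the right.
edge = [[0, 0, 255, 255] for _ in range(4)]
hist = orientation_histogram(edge)
```

For the vertical edge all gradient energy falls into the first bin (horizontal gradient direction), which is the kind of signature a classifier can learn from.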
Convolutional Neural Networks (CNNs) are feed-forward neural networks that are mostly
used for computer vision. They combine automated image pre-processing (feature extraction)
with a dense neural network part. CNNs are a special type of neural network for processing
data with a grid-like topology. The architecture of the CNN is inspired by the visual cortex
of animals.
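In the code snippet below, faces are detected using a Haar cascade and each one is marked with
a blue rectangle.
import cv2
# face.xml is a pre-trained Haar cascade for frontal faces
faceCascade = cv2.CascadeClassifier("face.xml")
# img is assumed to be a BGR image loaded earlier (for example with cv2.imread)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# returns one (x, y, w, h) bounding box per detected face
faces = faceCascade.detectMultiScale(
    gray,
    scaleFactor=1.1,
    minNeighbors=5,
    minSize=(30, 30),
)
# draw a blue rectangle (OpenCV uses BGR colour order) around each face
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)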
4.3 Image Pre-processing:
Image pre-processing includes the removal of noise and normalization against
variation in pixel position or brightness.
Dlib provides a facial landmark detector with pre-trained models. It is used to estimate the
locations of 68 (x, y) coordinates that map the facial points on a person's face (Fig. 5, marked
with green dots).
Fig. 5 Feature points detected using dlib shape predictor
Detecting feature points is quite straightforward because it relies only on a pretrained model.
A convolutional neural network (CNN) is a type of artificial neural network usually designed
to extract features from high-dimensional data. A CNN is designed specifically to
recognize two-dimensional shapes with a high degree of invariance to translation, scaling,
skewing and other forms of distortion. The structure includes feature extraction, feature
mapping and subsampling layers. A CNN model can be thought of as a combination of two
components: the feature extraction part and the classification part. The convolution and
pooling layers perform feature extraction. For example, given an image, the convolution layers
detect features such as two eyes, long ears, four legs, a short tail and so on. The fully
connected layers then act as a classifier on top of these features, and assign a probability
for the input image being a dog. The convolution layers are the main powerhouse of a CNN model.
Automatically detecting meaningful features given only an image and a label is not an easy
task. The convolution layers learn such complex features by building on top of each other.
The first layers detect edges, the next layers combine them to detect shapes, and the
following layers merge this information to infer that this is a nose. To be clear, the CNN
does not know what a nose is. By seeing a lot of them in images, it learns to detect that as a
feature. The fully connected layers learn how to use the features produced by the convolutions
in order to correctly classify the images.
Pooling:
Pooling progressively reduces the size of the input representation. It makes it possible to
detect objects in an image no matter where they are located. Pooling helps to reduce the
number of required parameters and the amount of computation required. It also helps control
overfitting. If pooling is not applied periodically, the feature maps stay large and the
computation in later layers grows accordingly. Two types of pooling commonly applied in
convnets are global average pooling and max pooling. In global average pooling, each feature
map is replaced by its average; in max pooling, each window of the feature map is replaced by
its maximum value.
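The two pooling operations can be sketched on a small feature map. This is a minimal plain-Python illustration, assuming non-overlapping 2 x 2 windows for max pooling; the function names and sample values are made up for the example and are not taken from the project code.

```python
def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: each size x size window becomes its maximum."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def global_average_pool(feature_map):
    """Global average pooling: the whole feature map becomes a single average value."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 6, 8],
]
pooled = max_pool(feature_map)              # 4 x 4 input reduced to 2 x 2
average = global_average_pool(feature_map)  # 4 x 4 input reduced to one number
```

Max pooling keeps the strongest response in each window, while global average pooling summarizes the entire map, which is why the latter is usually applied once near the end of a network.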
4.5 Classification:
The feature vectors obtained from the feature extraction step are very high-dimensional, so a
classifier is needed to map them to a small set of class labels. Features should take different
values for objects belonging to different classes, and classification is done using the Support
Vector Machine (SVM) algorithm. SVM often offers high accuracy compared to other classifiers
such as logistic regression and decision trees. It is known for the kernel trick, which handles
nonlinear input spaces. It is used in a variety of applications such as face detection,
intrusion detection, classification of emails, news articles and web pages, classification of
genes, and handwriting recognition. It can handle multiple continuous and categorical
variables. SVM constructs a hyperplane in multidimensional space to separate different classes
(Fig. 6). The optimal hyperplane is found iteratively by minimizing an error. The core idea of
SVM is to find the maximum marginal hyperplane (MMH) that best divides the dataset into
classes. Support vectors are the data points closest to the hyperplane. These points
define the separating line by determining the margin, and are therefore the most relevant
to the construction of the classifier.
Fig.6: Depicting the hyperplane and support vectors
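The roles of the hyperplane, the margin and the support vectors can be illustrated for the linear case. This sketch does not train an SVM; it assumes a hyperplane with hypothetical weights `w` and bias `b` has already been found and only shows how it classifies points and how wide its margin is. All names and numbers are assumptions for illustration.

```python
import math


def decision_value(w, b, x):
    """Signed score w . x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Class label +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Distance between the two margin boundaries, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane x1 + x2 - 3 = 0 in a two-dimensional feature space.
w, b = [1.0, 1.0], -3.0

# Support vectors lie exactly on the margin, where the decision value is -1 or +1.
negative_sv = [1.0, 1.0]
positive_sv = [2.0, 2.0]
width = margin_width(w)
```

The two support vectors have decision values of exactly -1 and +1, and the distance between them equals the margin width, which is what the SVM maximizes during training.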
The convolutional neural network (CNN) is a class of deep learning neural networks. CNNs
represent a huge breakthrough in image recognition. They are most commonly used to analyze
visual imagery and frequently work behind the scenes in image classification, in applications
ranging from Facebook's photo tagging to self-driving cars and from healthcare to security.
Image classification is the process of taking an input (like a picture) and outputting a class
(like "cat") or a probability that the input is a particular class ("there is a 90% probability
that this input is a cat").
Softmax Layer:
A softmax function is a type of squashing function. Squashing functions limit the output of
the function to the range 0 to 1. This allows the output to be interpreted directly as a
probability. Softmax functions are multi-class sigmoids, meaning they are used to
determine the probabilities of multiple classes at once. Since the outputs of a softmax function
can be interpreted as probabilities (i.e. they sum to 1), a softmax layer is typically the
final layer of a neural network classifier. It is important to note that a softmax layer must
have the same number of nodes as the output layer. A softmax layer (Fig. 7) allows the neural
network to perform multi-class classification. In short, the neural network will be able to
determine the probability that a dog is in the image, as well as the probabilities of the
other classes.
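A softmax layer's computation can be sketched in a few lines. This is a plain-Python illustration rather than the Keras layer itself; subtracting the largest score before exponentiating is a common numerical-stability step and does not change the result. The example scores are made up.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # avoids overflow for large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
```

The class with the largest raw score always receives the largest probability, so `predicted_class` is the index of the highest score.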
Fig. 7: Illustrating the softmax layer
CHAPTER-5
DESIGN
5. DESIGN
5.1 UML Diagrams
The Unified Modeling Language (UML) is a standard language for specifying, visualizing,
constructing, and documenting the artifacts of software systems, as well as for business
modeling and other non-software systems. The UML represents a collection of best
engineering practices that have proven successful in the modeling of large and complex
systems. The UML is a very important part of developing object-oriented software and of the
software development process. The UML uses mostly graphical notations to express the
design of software projects. It helps teams explore potential designs and validate the
architectural design of the software.
A conceptual model can be defined as a model that is made of concepts and their
relationships.
A conceptual model is the first step before drawing UML diagrams. It helps in understanding
the entities in the real world and how they interact with each other.
To understand how the UML works, we need to know the three elements:
● Rules to connect the building blocks (Rules for how these building blocks may be put
together).
The activity diagram is another important UML diagram for describing the dynamic aspects of
the system.
An activity diagram is essentially a flowchart that represents the flow from one activity to
another. An activity can be described as an operation of the system.
The control flow is drawn from one operation to another. This flow can be sequential,
branched, or concurrent. Activity diagrams handle all types of flow control by using
different elements such as fork and join.
The basic purpose of activity diagrams is to capture the dynamic behavior of the system.
The other four behavioral diagrams show the message flow from one object to another, whereas
the activity diagram shows the flow of control from one activity to another.
Activity diagrams are not only used for visualizing the dynamic nature of a system; they are
also used to construct the executable system by using forward and reverse engineering
techniques. The only thing missing from the activity diagram is the message part: it does not
show any message flow from one activity to another.
Although an activity diagram resembles a flowchart, it is richer than one: it can show
different flows such as parallel, branched, concurrent, and single.
Fig. 9 : Activity Diagram - For training CNN model
Activity Diagram - Testing the model:
Fig. 11 : Activity Diagram - For testing CNN model
A use case diagram at its simplest is a representation of a user's interaction with the system,
depicting the specifications of a use case. A use case diagram can portray the different
types of users of a system and the various ways in which they interact with the system. This
type of diagram is typically used in conjunction with the textual use case and is often
accompanied by other types of diagrams as well. While a use case itself might drill into a lot
of detail about every possibility, a use case diagram can help provide a higher-level view of
the system. It has been said that "use case diagrams are the blueprints for your
system". They provide a simplified, graphical representation of what the system must
actually do.
In its simplest form, a use case can be described as a specific way of using the system from a
user's (actor's) perspective. A use case is a set of scenarios that describe an interaction
between a user and a system. A use case diagram displays the relationships among actors and
use cases. The two main components of a use case diagram are use cases and actors.
Fig 12: Use case Diagram
A sequence diagram is a graphical view of a scenario that shows object interaction in a
time-based sequence: what happens first, what happens next. Sequence diagrams establish the
roles of objects and help provide essential information to determine class responsibilities and
interfaces. Sequence diagrams are normally associated with use cases. Sequence diagrams
are closely related to collaboration diagrams, and both are alternate representations of an
interaction. The main difference between them is that
sequence diagrams show time-based object interaction, while collaboration diagrams show
how objects associate with each other. A sequence diagram has two dimensions: typically,
vertical placement represents time and horizontal placement represents different objects. A
sequence diagram shows, as parallel vertical lines (lifelines), different processes or objects
that live simultaneously and, as horizontal arrows, the messages exchanged between them
in the order in which they occur. This allows simple runtime scenarios to be specified in a
graphical manner.
Fig 13: Sequence Diagram
CHAPTER-6
DATASETS
6. Datasets:
A facial expression database is a collection of images or video clips with facial expressions of
a range of emotions. Well-annotated (emotion-tagged) media content of facial behavior is
essential for training, testing, and validation of algorithms for the development of expression
recognition systems. The emotion annotation can be done in discrete emotion labels or on a
continuous scale. Most of the databases are usually based on the basic emotions theory (by
Paul Ekman) which assumes the existence of six discrete basic emotions (anger, fear, disgust,
surprise, joy, sadness). However, some databases include the emotion tagging in continuous
arousal-valence scale. In posed expression databases, the participants are asked to display
different basic emotional expressions, while in spontaneous expression databases, the
expressions are natural. Spontaneous expressions differ remarkably from posed ones in terms
of intensity, configuration, and duration. Apart from this, the synthesis of some action units
(AUs) is barely achievable without undergoing the associated emotional state. Therefore, in
most cases, posed expressions are exaggerated, while spontaneous ones are subtle and differ in
appearance. Some of the publicly available datasets are tabulated below.
Database: Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)
Facial expressions: Speech: calm, happy, sad, angry, fearful, surprise, disgust, and neutral.
Song: calm, happy, sad, angry, fearful, and neutral. Each expression at two levels of
emotional intensity.
Number of subjects: 24
Number of images/videos: 7356 video and audio files
Gray/Color: Color
Type: Posed

Database: Japanese Female Facial Expressions (JAFFE)
Facial expressions: neutral, sadness, surprise, happiness, fear, anger, and disgust
Number of subjects: 10
Number of images/videos: 213 static images
Gray/Color: Gray
Type: Posed
6.2 FER Dataset for E-learning:
6.2.1 DAiSEE:
The difference between real and virtual worlds is shrinking at an astounding pace. With more
and more users working on computers to perform a myriad of tasks, from online learning to
shopping, interaction with such systems is an integral part of life. In such cases, recognizing
a user's engagement level with the system he or she is interacting with can change the way the
system interacts back with the user. This leads not only to better engagement with the
system but also paves the way for better human-computer interaction. Hence, recognizing user
engagement can play a crucial role in several contemporary vision applications including
advertising, healthcare, autonomous vehicles, and e-learning. However, the lack of
publicly available datasets for recognizing user engagement has severely limited the
development of methodologies that can address this problem. To address this, DAiSEE[5] was
introduced as the first multi-label video classification dataset, comprising 9068 video
snippets captured from 112 users for recognizing the user's affective states of boredom,
confusion, engagement, and frustration "in the wild". The dataset has four levels of labels
(very low, low, high, and very high) for each of the affective states, which are crowd
annotated and correlated with a gold standard annotation created by a team of expert
psychologists. The authors of DAiSEE also established benchmark results on the dataset using
state-of-the-art video classification methods available at the time. DAiSEE provides the
research community with challenges in feature extraction, context-based inference, and the
development of suitable machine learning methods for related tasks, thus providing a
springboard for further research.
DAiSEE Features:
This work uses the above dataset (DAiSEE) to find the learner's emotion.
CHAPTER-7
IMPLEMENTATION
7. Implementation
Our dataset consists of 9068 video sequences from 112 subjects. Each
video in the dataset is labelled with one of the four cognitive states
described in section 2.2. The implementation starts with preprocessing the dataset:
the video sequences are cut into sets of frames, each frame is resized to 48 x 48 pixels,
and contrast limited adaptive histogram equalization (CLAHE) is applied. The dataset is then
split into a training set and a validation set with weightings of 80% and 20% respectively.
The result is stored in pickle format to start the training process.
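The 80/20 split and the pickle step can be sketched with the standard library alone. This is an illustrative outline, not the project's actual loading code: the `(frame_id, label)` pairs stand in for the preprocessed 48 x 48 frames, and the function name, seed and dictionary keys are assumptions. In practice the serialized bytes would be written to a file with `pickle.dump`.

```python
import pickle
import random


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into training and validation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in data: 100 frame ids, each with one of four labels.
samples = [(frame_id, frame_id % 4) for frame_id in range(100)]

train_set, val_set = split_dataset(samples)

# Serialize both sets so that training can start from the stored split.
payload = pickle.dumps({"train": train_set, "val": val_set})
restored = pickle.loads(payload)
```

Fixing the seed keeps the split reproducible between runs, so the validation frames are never mixed into training by accident.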
Convolutional neural networks are neural networks whose design is
inspired by the visual cortex of animals. A CNN consists of two components, for feature
extraction and classification. Initially we used a CNN model to extract features and classify
them into their respective class labels. The architecture takes a 48 x 48 image with a single
grayscale channel. It consists of 24 layers in total (Fig. 14). The fully connected layer
provides a feature vector with 3200 features. These are sent to the softmax layer, which
classifies them according to the labels. However, the performance was not up to the mark. The
performance of a CNN can be improved by replacing the softmax layer with an SVM classifier.
Fig. 14. CNN Architecture
Feature Extraction using Convnets and classification with SVM:
The last five layers of the above architecture were removed so that the network only
extracts features. A new model is created up to the fully connected layer, with the same
weights. The features are obtained by calling the predict method and are then passed to an
SVM[10] classifier. Classification performance can be increased by using an SVM instead of
two fully connected layers[6].
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_3 (Conv2D)            (None, 23, 23, 64)        18496
_________________________________________________________________
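The parameter count in the conv2d_3 row of the model summary can be checked by hand. The sketch below uses the standard formula for a 2-D convolution layer. The 64 filters are read from the output shape (None, 23, 23, 64); the 3 X 3 kernel and the 32 input channels are assumptions, chosen because they reproduce the reported count of 18496.

```python
def conv2d_param_count(kernel_height, kernel_width, in_channels, filters, use_bias=True):
    """Trainable parameters of a 2-D convolution layer."""
    # One weight per kernel position, per input channel, per filter.
    weights = kernel_height * kernel_width * in_channels * filters
    # One bias per filter.
    biases = filters if use_bias else 0
    return weights + biases


# Assumed configuration: 3 X 3 kernel, 32 input channels, 64 filters.
print(conv2d_param_count(3, 3, 32, 64))  # 3*3*32*64 + 64 = 18496
```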
These feature vectors of size 3200 are fed to the SVM, which constructs separating
hyperplanes defined by support vectors to classify the features. The resulting model
achieves 51% accuracy overall. The complete implementation code is available in the public
repository[11].
The trained model is then used to detect the state of the learner in a real-time
environment. These emotion states are used to analyze the learner's mood and behaviour.
An engaged cognitive state indicates that the course contents are good and need not be
changed. A bored cognitive state indicates that the tutor has to change the way the course
is delivered. The other two states (confusion and frustration) indicate that the course has
to be modified.
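The mapping from detected state to course action described above can be sketched as a small lookup. This is an illustration only: the function name and the idea of summarizing a session by its most frequent predicted state are assumptions, not part of the implemented system.

```python
from collections import Counter

# Recommendation for each cognitive state, paraphrased from the text above.
RECOMMENDATIONS = {
    "engaged": "keep the course contents unchanged",
    "boredom": "change the way the course is delivered",
    "confusion": "modify the course contents",
    "frustration": "modify the course contents",
}


def recommend(predicted_states):
    """Return (dominant state, recommendation) for one learning session.

    Assumption: the session is summarized by its most frequent state.
    """
    if not predicted_states:
        raise ValueError("no predictions to summarize")
    counts = Counter(state.lower() for state in predicted_states)
    dominant_state, _ = counts.most_common(1)[0]
    return dominant_state, RECOMMENDATIONS[dominant_state]
```

For example, a session whose frames are mostly classified as Boredom would return the recommendation to change the delivery of the course.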
CHAPTER-8
TESTING
8. TESTING
8.1 Introduction
Testing is the process of detecting errors. It plays a critical role in quality assurance
and in ensuring the reliability of software. The results of testing are also used later,
during maintenance.
Testing should be carried out at every level of the task, because it shows us where our
mistakes are; it is therefore a primary aspect of development. There are different testing
methods, which are divided into two types:
1. White box testing
2. Black box testing
Testing objectives
The main objective of testing is to uncover a host of errors systematically and with
minimum effort and time. Stated formally:
● Testing is a process of executing a program with the intent of finding an error.
● A good test case is one that has a high probability of finding an error, if one exists.
● Tests alone are inadequate to guarantee that every error present is detected.
We perform different types of testing at every stage. Of the many testing levels, we chose
the ones relevant to this project:
1. Unit testing
2. Integration testing
8.4 Test Cases:
S.No | Test case Title | Description | Expected Outcome | Result
1 | Testing an Image with Frustration state | The actual emotion is Frustration | The predicted cognitive state of user is Frustration |
CHAPTER-9
RESULTS AND DISCUSSIONS
9.1 Results obtained for emotions related to E-Learning using proposed architecture:
The architecture described in section 7.2 achieves 51.97% accuracy. The accuracy for each
cognitive state is listed below.
Cognitive state    Accuracy
Boredom            54%
Confusion          54%
Engaged            63%
Frustration        28%
Table 5: Obtained accuracies for each cognitive state for DAiSEE
These results are satisfactory for the E-Learning field, since the dataset was collected in
the wild (not posed) condition. The confusion matrix (Fig. 15) is drawn for the obtained
accuracy; it depicts the correct and incorrect predictions made by the model.
Key Observations:
● Most of the images with the frustration cognitive state are predicted as confusion
(although both states lead to the same outcome, a change to the course contents).
● The accuracy is up to the mark for every emotion state except frustration.
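The relation between a confusion matrix and the per-state accuracies can be sketched as below. The counts are hypothetical and are NOT the values behind Fig. 15: each row assumes 100 samples per state, the diagonal mirrors the percentages in Table 5, and the off-diagonal entries are invented for illustration.

```python
def per_class_accuracy(confusion, labels):
    """Fraction of each true class that was predicted correctly.

    confusion[i][j] counts samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    result = {}
    for i, label in enumerate(labels):
        row_total = sum(confusion[i])
        result[label] = confusion[i][i] / row_total if row_total else 0.0
    return result


def overall_accuracy(confusion):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total if total else 0.0


# Hypothetical counts: rows are true states, columns are predicted states.
labels = ["Boredom", "Confusion", "Engaged", "Frustration"]
confusion = [
    [54, 20, 16, 10],
    [15, 54, 21, 10],
    [12, 15, 63, 10],
    [14, 40, 18, 28],  # most Frustration samples fall in the Confusion column
]
```

With real data the states are not equally frequent, so the overall accuracy is a weighted combination of the per-state values and will generally differ from their plain average.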
9.2 Results obtained for standard emotions using proposed architecture:
The proposed architecture achieves an accuracy of up to 98% on the CK+[7] dataset. These
results are very good; such a high accuracy was obtained because the CK+ dataset contains
images taken in a posed environment. The confusion matrix (Fig. 16) depicts the false and
true predictions. Using only the CNN on the same dataset results in 58% accuracy (Fig. 17).
Fig. 17. Accuracy and loss plot for CK+ using only CNN
9.2.2 Results obtained for JAFFE dataset:
The proposed architecture achieves an accuracy of up to 76% on the FER2013[9] dataset.
These results are very good compared to the CNN alone: the CNN model, trained for 100
epochs, reaches only 22% accuracy (Fig. 19). The confusion matrix (Fig. 18) is drawn for
our hybrid model to depict the false and true predictions.
9.3 Results obtained for emotions related to E-Learning using Handcrafted Features
Here, we used a dlib shape predictor to extract feature points from a given face image
(Fig. 5). The mean point of all 68 feature points is then calculated. For every feature
point we calculate its distance from the center (Fig. 20), its tangent angle to the center
and its coordinates in 2-D space. The total size of the vector is therefore 272 (68 X 4) for
a single image. There are 3219 images in the training set.
The resulting features, 875568 values in total (272 X 3219), are used to train an SVM
classifier. The SVM classifier classifies the feature vectors with an accuracy of 40%. The
true and false predictions made by the SVM on these handcrafted features are illustrated in
the confusion matrix (Fig. 21).
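The 272-value descriptor built from the 68 feature points can be sketched as follows. This is a minimal illustration that takes the landmark coordinates as plain (x, y) pairs instead of calling dlib. The function name, the order of the four values per point, and the use of atan2 (an angle in radians) for the tangent angle are assumptions.

```python
import math


def landmark_features(points):
    """Build a flat descriptor from a list of (x, y) landmarks.

    For every landmark the descriptor holds four values: x, y, the
    distance to the mean point, and the angle to the mean point.
    """
    if not points:
        raise ValueError("at least one landmark is required")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    features = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        features.extend([x, y, math.hypot(dx, dy), math.atan2(dy, dx)])
    return features


# Synthetic stand-in for 68 detected landmarks: points on a circle.
synthetic = [(24 + 10 * math.cos(i), 24 + 10 * math.sin(i)) for i in range(68)]
print(len(landmark_features(synthetic)))  # 68 points * 4 values = 272
```

Stacking one such descriptor per training image gives a matrix with 3219 rows and 272 columns, which is the usual input layout for an SVM classifier.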
CHAPTER - 10
CODING
10. Coding:
The work is divided into two modules: (1) loading and splitting the data, followed by image
preprocessing, and (2) training the model using the proposed hybrid architecture.
import glob
import random

import cv2
import numpy as np
import pandas as pd
from PIL import Image

# Get the file list for one emotion, shuffle it and split it 80/20
def get_files(emotion):
    files = glob.glob("PATH/%s/*" % emotion)
    random.shuffle(files)
    split = int(len(files) * 0.8)
    training = files[:split]    # first 80% of the file list
    prediction = files[split:]  # remaining files, so the two lists never overlap
    return training, prediction

# Load one image, convert it to grayscale, resize it to 48x48, apply CLAHE
# and return the pixels as a string (each value followed by a space)
def preprocess(path, clahe):
    image = cv2.imread(path)  # open image
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # convert to grayscale
    gray = np.asarray(Image.fromarray(gray).resize((48, 48)))
    clahe_image = clahe.apply(gray).reshape(48 * 48)
    return "".join(str(i) + " " for i in clahe_image)

# Load the images of every emotion, perform the CLAHE operation and return the data
def get_img(ems):
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    pixel = []
    usage = []
    em = []
    for xx in range(len(ems)):
        tr, te = get_files(ems[xx])
        # All training images first, then all test images
        for split_name, items in (('train', tr), ('test', te)):
            for item in items:
                pixel.append(preprocess(item, clahe))
                em.append(xx)  # the label is the index of the emotion in ems
                usage.append(split_name)
        print('done', ems[xx])
    df = {'emotion': pd.Series(em), 'pixels': pd.Series(pixel), 'usage': pd.Series(usage)}
    return df
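from keras.models import Model  # or tensorflow.keras.models, matching the package used to build `model`
from sklearn.svm import SVC

# Train the model for a fixed number of epochs
# (model, X_train, y_train, X_val, y_val, class_Weights and train_labs are defined elsewhere)
nb_epoch = 100
batch_size = 64
history = model.fit(X_train, y_train, epochs=nb_epoch, batch_size=batch_size,
                    class_weight=class_Weights, validation_data=(X_val, y_val),
                    shuffle=True, verbose=1)

# Recreate the model up to the fully connected layer and perform classification using an SVM
model_1 = Model(inputs=model.input, outputs=model.layers[-6].output)
clf = SVC(kernel='linear', verbose=True, probability=True, tol=1e-3)
clf.fit(model_1.predict(X_train), train_labs)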
CHAPTER – 11
CONCLUSION AND FUTURE SCOPE
11. CONCLUSION AND FUTURE SCOPE:
In this project, we have proposed a Facial Emotion Recognition system for the E-Learning
environment. The mood of the learner is usually neglected in the E-Learning environment,
and there is no proper system to monitor the learner's cognitive mood. This model can
classify the learner's mood based on the learner's visual features.
Future Work:
In future, this work can be extended into an application placed at the client side to
monitor the mood of the learner continuously. The course contents can then be developed
based on the mood of the user.
This work can also be extended to analyse the mood of the user based on their interaction
(such as assessments completed, feedback posted and active participation) along with the
visual features. This may increase the accuracy of the proposed system.
BIBLIOGRAPHY
[1] https://machinelearningmastery.com/what-is-deep-learning/
[2] Xiao-Xiao Niu, Ching Y. Suen, A novel hybrid CNN–SVM classifier for recognizing
handwritten digits, Pattern Recognition
[3] Pessoa, Tiago & Medeiros, Raul & Nepomuceno, Thiago & Bian, Gui-Bin & Albuquerque,
V.H.C. & Filho, Pedro Pedrosa. (2018). Performance Analysis of Google Colaboratory as a Tool
for Accelerating Deep Learning Applications. IEEE Access. PP. 1-1.
10.1109/ACCESS.2018.2874767
[5] Abhay Gupta, Arjun D'Cunha, Kamal Awasthi and Vineeth Balasubramanian,
DAiSEE: Towards User Engagement Recognition in the Wild, https://arxiv.org/abs/1609.01885
[6] Ahmad, Mubashir. (2019). Re: In CNN, can we replace fully connected layers with SVM as a
classifier?.
[7] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar and I. Matthews, "The Extended
Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified
expression," 2010 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition - Workshops, San Francisco, CA, 2010, pp. 94-101, doi:
10.1109/CVPRW.2010.5543262.
[8] Pramerdorfer, Christopher & Kampel, Martin. (2016). Facial Expression Recognition using
Convolutional Neural Networks: State of the Art.
[10] Yujun Yang, Jianping Li and Yimei Yang, "The research of the fast SVM classifier method,"
2015 12th International Computer Conference on Wavelet Active Media Technology and
Information Processing (ICCWAMTIP), Chengdu, 2015, pp. 121-124, doi:
10.1109/ICCWAMTIP.2015.7493959.
[11] This public GitHub repository includes all modules used in this work,
https://github.com/chandrasekhar36/FER-for-E-Environment