
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

SCHOOL OF ELECTRONICS AND TELECOMMUNICATIONS

INTERNSHIP REPORT
Topic:
PERSON RE-IDENTIFICATION

Instructor: Dr. Vo Le Cuong


Student: Nguyen Tuan Nghia
Student ID: 20122147
Class: ET AP K57

Hanoi, 3-2017

REVIEWS OF INTERNSHIP REPORT

Student's name: Nguyen Tuan Nghia Student ID: 20122147


Class: Electronics and Telecoms AP Course: 57
Instructor: Dr. Vo Le Cuong
Critical Officer:
..............................................................................................................................
1. Internship report content:
..............................................................................................................................
..............................................................................................................................
..............................................................................................................................
..............................................................................................................................
..............................................................................................................................
2. Reviews of Critical Officer:
..............................................................................................................................
..............................................................................................................................
..............................................................................................................................
..............................................................................................................................
..............................................................................................................................
Hanoi, / /2017
Critical Officer
(sign and write full name)


INTRODUCTION
An internship is an important phase for every undergraduate student before working on the graduation thesis. Since there are many differences between theory and practice, an internship grants students a chance to develop their knowledge and apply it to practical situations. In addition, it gives students a view of a professional working environment, in which they also learn to work in teams, to communicate, and to present their work to others. These are essential skills that every engineer should have.
It is becoming easier for students, especially those studying electronics and telecommunications, to find an internship nowadays. As there are many high-tech companies in Viet Nam, students can choose one of interest. Being able to work in a desired environment helps them gain experience much faster than ever. If everything goes well, they also have a chance to continue working for the company after graduation.
I am lucky to have been accepted for an internship under Dr. Vo Le Cuong's supervision at AICS Lab, located in room 618 of the Ta Quang Buu library at Hanoi University of Science and Technology. In this report, I introduce AICS Lab in Section 1. Section 2 focuses on my research during the internship. I would like to sincerely thank Dr. Cuong and all the staff of the School of Electronics and Telecommunications for helping me complete my internship. I would also like to sincerely thank Prof. Hyuk-Jae Lee of the Computer Architecture & Parallel Processing Lab, Seoul National University, for allowing me to use his workstation. Without his kindness, I could not have conducted any experiments due to a lack of hardware.


ABSTRACT
Person re-identification, the process of recognizing an individual across a camera network, is a fundamental task in automated surveillance and has been receiving attention for years. This task is challenging due to problems such as appearance variations of an individual across different cameras and low video quality and image resolution. There have been many proposals to improve the accuracy of this process. In recent years, deep learning based approaches have been proven to outperform most traditional ones for person re-identification. In this report, I introduce the deep learning concept and propose a method to optimize a multi-shot deep learning based approach for person re-identification using Recurrent Neural Networks. I conduct extensive experiments to compare different architectures and find that the Gated Recurrent Unit is the most effective one, achieving the highest accuracy while having a reasonable number of parameters.


TABLE OF CONTENTS

INTRODUCTION............................................................................................................ 3
ABSTRACT ..................................................................................................................... 4
TABLE OF CONTENTS ................................................................................................. 5
LIST OF FIGURES.......................................................................................................... 6
LIST OF TABLES ........................................................................................................... 6
LIST OF ABBREVIATIONS .......................................................................................... 7
SECTION 1: AICS LAB .................................................................................................. 8
1.1. General information .............................................................................................. 8
1.2. Projects and research areas ................................................................................... 8
SECTION 2: INTERNSHIP CONTENT ....................................................................... 10
2.1. Deep learning ...................................................................................................... 10
2.1.1. The concept .................................................................................................. 10
2.1.2. Deep learning for person re-identification ................................................... 11
2.2. Multi-shot deep learning methods ...................................................................... 12
2.2.1. Recurrent Neural Network ........................................................................... 13
2.2.2. Long Short Term Memory Network ............................................................ 14
2.2.3. Gated Recurrent Unit ................................................................................... 17
2.3. Experiment .......................................................................................................... 17
2.3.1. Caffe ............................................................................................................. 18
2.3.2. Datasets and evaluation settings .................................................................. 19
2.3.3. Network implementations ............................................................................ 20
2.3.4. Classifier ...................................................................................................... 22
2.3.5. Result............................................................................................................ 22
2.4. Conclusion .......................................................................................................... 23
SECTION 3: GRADUATION THESIS PLAN ............................................................. 24
REFERENCE ................................................................................................................. 25
APPENDIX: IMPLEMENTATION DETAILS ............................................................ 27


LIST OF FIGURES
Figure 2.1 Problems when choosing algorithm to map input x to category y [1] ......... 11
Figure 2.2 Recurrent Neural Networks with loops [18] ................................................ 13
Figure 2.3 Unrolled recurrent neural network [18] ........................................................ 13
Figure 2.4 The repeating module in a standard RNN [18] ............................................ 13
Figure 2.5 RNN makes use of temporal information [18] ............................................. 14
Figure 2.6 The problem of long term dependencies [18]............................................... 14
Figure 2.7 The repeating module in an LSTM [18] ....................................................... 15
Figure 2.8 Forget gate f [18] .......................................................................................... 15
Figure 2.9 Input gate i and candidate vector C̃ [18] ....................................................... 16
Figure 2.10 Updating cell state C [18] ........................................................................... 16
Figure 2.11 Output gate o and hidden output h [18] ...................................................... 16
Figure 2.12 Repeating module of Gated Recurrent Unit [18] ....................................... 17
Figure 2.13 Experiment procedure ................................................................................. 18
Figure 2.14 Data split settings for PRID-2011 .............................................................. 20
Figure 2.15 LSTM with Peephole connections [15] [18] .............................................. 20
Figure 2.16 LSTM with coupled gate [18] .................................................................... 21
Figure 2.17 Recurrent Feature Aggregation Network [9] .............................................. 21

LIST OF TABLES
Table 2.1 Performance of different LSTM architectures (Rank-1 accuracy) ................ 23
Table 2.2 Size of different models (file .caffemodel) .................................................... 23


LIST OF ABBREVIATIONS
CNN Convolutional Neural Network
GRU Gated Recurrent Unit
LBP Local Binary Pattern
LSTM Long Short Term Memory
RFA Recurrent Feature Aggregation
RNN Recurrent Neural Network
SIFT Scale-Invariant Feature Transform
SVM Support Vector Machine


SECTION 1: AICS LAB

1.1. General information


AICS Lab, located in room 618 of the Ta Quang Buu library, is a laboratory of the School of Electronics and Telecommunications, belonging to a research center of Hanoi University of Science and Technology. Its research fields include IC design, computer vision, and camera sensors. AICS Lab has been making a positive contribution to the development of the School of Electronics and Telecommunications.
AICS Lab was founded in 2010 by Dr. Vo Le Cuong with 5 members. At first, there were many difficulties such as a shortage of facilities and equipment. However, with youthful energy and a passion for research, its members have earned many achievements and completed numerous projects. Currently, there are 5 official members and 10 trainees working on different areas. Members of AICS Lab have a high chance of working for big companies after graduation or studying abroad in developed countries.
AICS Lab provides an open working environment. The lab room has the essential equipment for working, such as desks and computers. Members can also decorate their own workspaces with anything of interest for convenience. Members work on all weekdays from 8 A.M. All work is reported to Dr. Cuong twice a week through a quick discussion and a longer one. The meeting schedule is decided by both the members and Dr. Cuong via email. In the long meeting, each member presents their work and receives comments on what to do the next week. Besides working, members also have extracurricular activities such as eating lunch together.

1.2. Projects and research areas


There are currently three projects and two research areas:
- Lens defect detection (in co-operation with Haesung Vina Co., Ltd): focuses on applying efficient algorithms and building an automated lens defect detection system.
- Rolling door application (in co-operation with Kato Company): focuses on building an application that allows users to operate and control a rolling door and to protect their house from thieves.
- Football player tracker (in co-operation with Vietnam Television): focuses on building an automated system that can recognize and track football players.
- Image processing on FPGA: focuses on implementing image processing algorithms on FPGA for real-time object detection systems.
- Person re-identification: focuses on building an efficient algorithm for the person re-identification task.


SECTION 2: INTERNSHIP CONTENT

2.1. Deep learning

2.1.1. The concept

In computer science, machine learning is a subfield that gives computers the ability to learn without being explicitly programmed. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data [1]. Machine learning is employed in a wide range of computing tasks, including computer vision and pattern recognition.
A machine learning algorithm is an algorithm that is able to learn from data. Mitchell (1997) provides the definition [1]: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." In this context, the task T is generally defined and changes to fit a specific problem. For example, if we want a robot to walk, then walking is the task; if we want a robot to speak, then speaking is the task. Similar to the human learning process, in which an individual improves through experience, a machine builds up experience E by measuring its performance P while trying to solve the task T. Every attempt to solve the task helps the machine learn and gradually construct a model that fits the given data.
Machine learning strongly depends on data. A simple data structure requires only a simple learning algorithm. As the data becomes more complicated, building an equally capable algorithm is essential. Trying to model a complicated data structure with a simple algorithm causes the underfitting problem, which results in inaccurate predictions. The idea of deep learning was proposed to meet the needs of such difficult problems.
Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data. The term "deep" generally describes a property of this kind of learning algorithm. In neural computing, instead of having one hidden layer, a deep feedforward network can have ten times more, each layer feeding its output to the next. Each feedforward network approximates a function f that maps an input x to a category y. Overall, the whole deep network represents a complicated function that is the composition of many sub-functions.
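To make this composition idea concrete, the following minimal NumPy sketch (purely illustrative; the layer sizes are made up and not related to the experiments in this report) builds a small feedforward network as a chain of simple functions:

```python
import numpy as np

def layer(x, W, b):
    # one feedforward layer: affine transform followed by a nonlinearity
    return np.tanh(W @ x + b)

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4]  # input -> two hidden layers -> output (toy sizes)
params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal(sizes[0])
for W, b in params:      # the whole network is f3(f2(f1(x)))
    x = layer(x, W, b)
print(x.shape)           # (4,)
```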
Deep learning used to be mostly a theoretical idea rather than one applied to real tasks, due to its computational expense. However, as hardware has become more powerful in recent years, deep learning has proven to be very effective in many fields such as computer vision. Thus, using deep learning to solve a difficult task like person re-identification is reasonable.

Figure 2.1 Problems when choosing algorithm to map input x to category y [1]

2.1.2. Deep learning for person re-identification

Person re-identification (person re-id) deals with the problem of recognizing an individual across non-overlapping cameras. When a person appears in a camera, the re-id system should be able to distinguish him from other persons. The core idea of re-id is classification. Thus, having a unique set of features for each person is essential.
Traditional methods focus on building an effective algorithm to extract features based on a specific characteristic of the input images. Numerous types of features have been explored to represent persons, including global features like color and texture histograms [2, 3] and local features such as SIFT [4] and LBP [5]. Those features may perform well on some datasets but poorly on others. In order to solve this adaptation problem, learning-based methods have been proposed. Among them, deep learning has been acknowledged for its great and stable performance on multiple datasets.
In person re-id, various deep learning architectures such as convolutional neural networks [6] and recurrent neural networks [7] have been applied. The input of a network is typically an image or a video of a person, and the output is a set of learned features that can be further classified with any metric-learning method. There are two phases in building a deep learning model for a dataset. First, the model learns the features of different persons from the training data. After a number of training iterations, its performance is measured on testing data by how unique the extracted features are, or equivalently, how well those features work with classifiers. Deep learning is similar to the human learning process, in which students are taught over a period of time and their score on the final exam measures how well they learned from the course.
There are also two deep learning based approaches to the re-id problem. The first is single-shot based methods, in which the input as well as the learned features come from one single image of a person. However, in practical surveillance, persons usually appear in a video rather than in a single-shot image. Multi-shot based methods were proposed to make full use of temporal information. In comparison with single-shot, multi-shot features are obtained from a sequence of images, which contains human pose changes and appearance variations as well as frame-wise features. The disadvantage of this approach is its expensive computation, which makes it inappropriate for real applications at present. Therefore, it is of great importance to explore a more effective and efficient scheme to make full use of the richer sequence information for person re-id. Also, the rapid growth of hardware for deep learning is expected to help bring this method to real life.

2.2. Multi-shot deep learning methods


As stated above, multi-shot features are obtained from a sequence of images. The extracted features describe the whole sequence instead of each image separately. In other words, given a set of images, the order in which each image is fed through the network affects the output features. The process is similar to human reasoning: we do not think from scratch but also use information we already have to make decisions. Traditional neural networks cannot do this, and it is a major shortcoming.

2.2.1. Recurrent Neural Network

Recurrent Neural Networks were proposed to address the above issue. They are
networks with loops in them, allowing information to persist.

Figure 2.2 Recurrent Neural Networks with loops [18]

In Figure 2.2, a chunk of neural network, A, looks at some input x_t and outputs a value h_t. The loop connection allows information at time step t to be fed to the next step t+1. An RNN can be thought of as multiple copies of the same network, each passing a message to its successor. Figures 2.3 and 2.4 show what happens if we unroll the loop.

Figure 2.3 Unrolled recurrent neural network [18]

Figure 2.4 The repeating module in a standard RNN [18]


There has been incredible success in applying RNNs to a variety of problems, including person re-id. McLaughlin et al. achieved a rank-1 accuracy of 70% [7] on the PRID-2011 dataset by combining an RNN with a CNN in their proposed model.
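As a concrete illustration, the following NumPy sketch implements the basic recurrent step h_t = tanh(W_x x_t + W_h h_{t-1}) from equation (1) in the appendix and unrolls it over a short sequence. The dimensions are arbitrary toy values, not those used in the experiments.

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, h0):
    """Unroll a vanilla RNN: h_t = tanh(Wx x_t + Wh h_{t-1})."""
    h, hs = h0, []
    for x_t in xs:               # xs: sequence of input vectors
        h = np.tanh(Wx @ x_t + Wh @ h)
        hs.append(h)
    return hs                    # one hidden output per time step

rng = np.random.default_rng(0)
in_dim, hid_dim, T = 6, 4, 5     # toy sizes
Wx = rng.standard_normal((hid_dim, in_dim)) * 0.1
Wh = rng.standard_normal((hid_dim, hid_dim)) * 0.1
hs = rnn_forward([rng.standard_normal(in_dim) for _ in range(T)],
                 Wx, Wh, np.zeros(hid_dim))
print(len(hs), hs[-1].shape)     # 5 (4,)
```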

2.2.2. Long Short Term Memory Network

RNNs connect information from previous inputs to the present one, which is extremely useful for understanding the changes of a person in a video. They are capable of remembering information over time. However, when the time gap grows, RNNs become less effective since they start to forget relevant information and keep only redundant information [8]. In order to solve this long-term dependency problem, Hochreiter & Schmidhuber introduced the Long Short Term Memory [9], a special kind of RNN.

Figure 2.5 RNN makes use of temporal information [18]

Figure 2.6 The problem of long term dependencies [18]

LSTM introduces a more complicated structure inside one chunk of the network, with a new output called the cell state, which works like a memory. The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are composed of a sigmoid neural network layer and a pointwise multiplication operation. The outputs of the sigmoid layer are between zero and one, describing how much of each component should be passed through. A value of zero means "let nothing through" while a value of one means "let everything through".
An LSTM has three gates, named forget (f), input (i) and output (o). As shown in Figure 2.7, all three gates look at both the input x_t at time step t and the hidden output h_{t-1} of the previous time step to decide the flow of information.

Figure 2.7 The repeating module in an LSTM [18]

The forget gate allows relevant information stored in the cell state to pass through while discarding the rest.

Figure 2.8 Forget gate f [18]

After flushing some information, the LSTM looks at the input and decides what to add to the current state. The input gate allows a part of the candidate vector C̃, which is created by a tanh layer, to be added to the current cell state.


Figure 2.9 Input gate i and candidate vector C̃ [18]

The current cell state is updated by combining the remembered part of the previous cell state with the candidate vector.

Figure 2.10 Updating cell state C [18]

Finally, the hidden output h is computed from the current cell state, filtered by the output gate o.

Figure 2.11 Output gate o and hidden output h [18]
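Putting the gates together, the following NumPy sketch implements one LSTM step following equations (2)-(7) in the appendix. The weight shapes and dimensions are illustrative only; the parameter names mirror the appendix notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM time step, following equations (2)-(7) in the appendix."""
    f = sigmoid(p["Wfx"] @ x_t + p["Wfh"] @ h_prev + p["bf"])        # forget gate
    i = sigmoid(p["Wix"] @ x_t + p["Wih"] @ h_prev + p["bi"])        # input gate
    C_tilde = np.tanh(p["WCx"] @ x_t + p["WCh"] @ h_prev + p["bC"])  # candidate vector
    C = f * C_prev + i * C_tilde                                     # new cell state
    o = sigmoid(p["Wox"] @ x_t + p["Woh"] @ h_prev + p["bo"])        # output gate
    h = o * np.tanh(C)                                               # hidden output
    return h, C

# toy usage with random parameters (dimensions are arbitrary)
rng = np.random.default_rng(0)
d_in, d_hid = 6, 4
p = {k: rng.standard_normal((d_hid, d_in if k.endswith("x") else d_hid)) * 0.1
     for k in ["Wfx", "Wfh", "Wix", "Wih", "WCx", "WCh", "Wox", "Woh"]}
p.update({b: np.zeros(d_hid) for b in ["bf", "bi", "bC", "bo"]})
h, C = lstm_step(rng.standard_normal(d_in), np.zeros(d_hid), np.zeros(d_hid), p)
```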

LSTM has proven its effectiveness in many tasks, such as action recognition [10], including person re-id. Yichao Yan et al. proposed a neural network architecture that uses LSTM [11] to extract sequence-level features of a person. The network is trained and tested on the PRID-2011 dataset with a rank-1 accuracy of 58.2%. Even though this is lower than the result of the RNN-based method by McLaughlin et al. [7], the difference between Yan's and McLaughlin's proposals may not come from the choice between RNN and LSTM, but from the choice of inputs. In addition, Yan's proposal combines simple traditional features with an LSTM, resulting in a faster and simpler model that is still able to achieve high accuracy. We do not yet know what happens if the RNN in McLaughlin's proposal is replaced with an LSTM.

2.2.3. Gated Recurrent Unit

Figure 2.12 Repeating module of Gated Recurrent Unit [18]

Besides the original LSTM architecture, there are some popular LSTM variants used in many papers. Most LSTM variants are only slightly different from the original one, so their performance is almost the same. However, a more dramatic variation on LSTM, called the Gated Recurrent Unit [12], is worth mentioning.
Instead of using separate forget and input gates, GRU combines these two into a single update gate. The cell state and hidden state are also merged into one. The resulting model is not only simpler than LSTM but also achieves notably higher accuracy. During my internship, I tried replacing the LSTM in Yan's proposal [11] with a GRU and achieved a rank-1 accuracy of 61.73% on the PRID-2011 dataset. Details of my work are given in the next part.
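For comparison with the LSTM sketch above, here is a minimal NumPy sketch of one GRU step, following equations (19)-(22) in the appendix (toy dimensions, for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step, following equations (19)-(22) in the appendix."""
    z = sigmoid(p["Wzx"] @ x_t + p["Wzh"] @ h_prev + p["bz"])               # update gate
    r = sigmoid(p["Wrx"] @ x_t + p["Wrh"] @ h_prev + p["br"])               # reset gate
    h_tilde = np.tanh(p["WHx"] @ x_t + p["WHh"] @ (r * h_prev) + p["bH"])   # candidate
    return (1.0 - z) * h_prev + z * h_tilde                                 # new hidden state

# toy usage with random parameters (dimensions are arbitrary)
rng = np.random.default_rng(0)
d_in, d_hid = 6, 4
p = {k: rng.standard_normal((d_hid, d_in if k.endswith("x") else d_hid)) * 0.1
     for k in ["Wzx", "Wzh", "Wrx", "Wrh", "WHx", "WHh"]}
p.update({b: np.zeros(d_hid) for b in ["bz", "br", "bH"]})
h = gru_step(rng.standard_normal(d_in), np.zeros(d_hid), p)
```

Note that the GRU keeps only three pairs of weight matrices instead of the LSTM's four and has no separate cell state, which is consistent with the smaller model size reported later in Table 2.2.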

2.3. Experiment
In order to test the LSTM performance on the person re-id task, I followed the evaluation settings of [13]. In [11], the authors proposed using extracted LBP and Color features to train a Recurrent Feature Aggregation network (RFA-Net). The resulting model is used to extract sequence-level features of the persons in the test set. Then a metric learning method uses these multi-shot features to measure the performance of the whole system. I have redone the whole process. In addition, I tested the performance of a new system in which the LSTM was replaced with various LSTM modifications, including GRU, and obtained reasonable results.

Figure 2.13 Experiment procedure

2.3.1. Caffe

Caffe is a deep learning framework developed by the Berkeley Vision and Learning Center and by community contributors [14]. Due to its popularity and strong community, Caffe was chosen over TensorFlow or Theano.
Working with Caffe is straightforward. Simple tasks such as fine-tuning or training a toy model do not require coding, only specifying the architecture and its hyper-parameters for training. Most commonly used features, including the LSTM layer, are already implemented in Caffe. However, I implemented GRU and the other LSTM variants myself since they are not yet available. As a result, I have gained a great deal of knowledge about the deep learning workflow, most of which concerns the forward and backward implementation of a layer.

2.3.2. Datasets and evaluation settings

The two most popular datasets for video-based (or multi-shot) person re-id are PRID-2011 and iLIDS-VID. Both are used to evaluate the performance of most proposed methods.
PRID-2011 dataset. The PRID-2011 dataset includes 400 image sequences for 200 persons from two cameras. Each image sequence has a variable length of 5 to 675 frames, with an average of 100. The images were captured in an uncrowded outdoor environment with a relatively simple and clean background and rare occlusion; however, there are significant viewpoint and illumination variations as well as color inconsistency between the two views.
iLIDS-VID dataset. The iLIDS-VID dataset contains 600 image sequences for 300 persons in two non-overlapping camera views. Each image sequence has a variable length of 23 to 192 frames, with an average of 73. This dataset was created at an airport arrival hall under a multi-camera CCTV network, and the images were captured with significant background clutter, occlusions, and viewpoint/illumination variations, which makes the dataset very challenging.
Following [13], only the sequence pairs with more than 21 frames are used in my experiments. The whole set of human sequence pairs in each dataset is randomly split into two subsets of equal size, one for training and the other for testing. The sequences from the first camera are used as the probe set, while the gallery set comes from the other camera. For both datasets, I report the rank-1 accuracy over 10 trials.
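As an illustration of this evaluation protocol, the sketch below computes rank-1 accuracy from a probe-gallery distance matrix. It assumes plain squared Euclidean distance on the extracted features and one sequence per person and camera, whereas the actual experiments use a metric learning method, so this is only a simplified example with placeholder data:

```python
import numpy as np

def rank1_accuracy(probe_feats, gallery_feats):
    """Probe i and gallery i belong to the same person; count how often
    the nearest gallery sequence is the correct match."""
    d = ((probe_feats ** 2).sum(1)[:, None]
         + (gallery_feats ** 2).sum(1)[None, :]
         - 2.0 * probe_feats @ gallery_feats.T)   # squared Euclidean distances
    return (d.argmin(axis=1) == np.arange(len(probe_feats))).mean()

rng = np.random.default_rng(0)
probe = rng.standard_normal((89, 5120))                     # toy 512*10-dim features
gallery = probe + 0.1 * rng.standard_normal(probe.shape)    # toy "other camera" view
print(rank1_accuracy(probe, gallery))
```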


Figure 2.14 Data split settings for PRID-2011

2.3.3. Network implementations

LSTM Variants Implementation. The original LSTM architecture is already implemented in Caffe. In order to compare the other architectures with the original one, I wrote them in C++ based on the Caffe API (an illustrative sketch of the two variants below is given after Figure 2.16).
LSTM with Peephole connections. Peephole connections allow the gates to look at the cell state and tune the filter. Adding peephole connections results in a bigger model, as the number of parameters increases.

Figure 2.15 LSTM with Peephole connections [15] [18]

LSTM with coupled forget and input gate. Instead of separately deciding what to forget and what to let through, the coupled gate replaces the flushed information with new information at once.


Figure 2.16 LSTM with coupled gate [18]
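The sketch below shows, in the same NumPy style as the earlier LSTM step, how the two variants change the standard equations. It follows equations (8)-(18) in the appendix and is only illustrative; the actual Caffe layers are written in C++.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_variant(x_t, h_prev, C_prev, p, peephole=False, coupled=False):
    """Standard LSTM step with optional peephole connections (eqs. 13-18)
    or a coupled forget/input gate (eqs. 8-12)."""
    peek_f = p["Wfc"] @ C_prev if peephole else 0.0
    f = sigmoid(p["Wfx"] @ x_t + p["Wfh"] @ h_prev + peek_f + p["bf"])
    C_tilde = np.tanh(p["WCx"] @ x_t + p["WCh"] @ h_prev + p["bC"])
    if coupled:
        i = 1.0 - f                                  # input gate tied to forget gate
    else:
        peek_i = p["Wic"] @ C_prev if peephole else 0.0
        i = sigmoid(p["Wix"] @ x_t + p["Wih"] @ h_prev + peek_i + p["bi"])
    C = f * C_prev + i * C_tilde
    peek_o = p["Woc"] @ C if peephole else 0.0       # output peephole sees the new state
    o = sigmoid(p["Wox"] @ x_t + p["Woh"] @ h_prev + peek_o + p["bo"])
    return o * np.tanh(C), C
```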

RFA-Net implementation. When implemented in Caffe, RFA-Net contains an LSTM and some data preparation layers. In the training phase, the outputs of the LSTM are fed through a fully connected layer and then a Softmax layer to calculate the loss. In the testing phase, the outputs of the LSTM are used directly to train and test the classifier. In my experiment, the outputs of the LSTM from all nodes are fused. Each node is a vector of 512 elements, so the sequence-level feature vector contains 512·L elements, where L is the length of an image sequence.
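A minimal NumPy sketch of this fusion step, assuming the per-frame LSTM outputs are already available as an L x 512 array (names and the toy data are illustrative):

```python
import numpy as np

def fuse_sequence_features(lstm_outputs):
    """Concatenate per-frame LSTM outputs (L x 512) into one
    sequence-level feature vector of 512*L elements."""
    L, d = lstm_outputs.shape          # e.g. L = 10 frames, d = 512
    return lstm_outputs.reshape(L * d)

# toy usage: 10 frames, 512-dim hidden output per frame
seq_feature = fuse_sequence_features(np.random.randn(10, 512))
print(seq_feature.shape)               # (5120,)
```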

Figure 2.17 Recurrent Feature Aggregation Network [9]

Network Training. The sequence of image-level features is input to an LSTM network for sequential feature fusion. The network is trained as a classification problem of N classes, where N is the number of persons (N = 89 for the PRID-2011 dataset and N = 150 for the iLIDS-VID dataset). In my experiments, L = 10 was used as the number of frames for each subsequence, meaning that features of 10 consecutive steps are learned. Training for 30,000 iterations with a batch size of 8 took approximately 45 minutes on a Titan X GPU. The loss when training the GRU converged to zero faster than when training the LSTM, and the resulting GRU model is also smaller than the LSTM one.
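As a small illustration of how a full image sequence can be cut into length-10 subsequences for training, consider the sketch below. The non-overlapping windowing is an assumption for illustration; the actual sampling scheme may differ (e.g. random or overlapping windows).

```python
import numpy as np

def make_subsequences(features, sub_len=10):
    """Split a full sequence of per-frame features (T x feat_dim) into
    consecutive, non-overlapping subsequences of sub_len frames each.
    Frames left over at the end are simply dropped in this sketch."""
    T = features.shape[0]
    n = T // sub_len
    return [features[i * sub_len:(i + 1) * sub_len] for i in range(n)]

# toy usage: a 37-frame sequence of 58-dim features -> three 10-frame chunks
chunks = make_subsequences(np.random.randn(37, 58), sub_len=10)
print(len(chunks), chunks[0].shape)    # 3 (10, 58)
```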

2.3.4. Classifier

The classifier is used in the testing phase, after a set of features has been obtained for each person. The Support Vector Machine is chosen due to its popularity.
Support Vector Machine. The Support Vector Machine (SVM) is a supervised learning algorithm that analyzes data for classification problems. Given training data and labels, SVM builds a model that maps a sample to one of the categories it has learned. SVM can efficiently perform non-linear classification using kernels, implicitly mapping its inputs into high-dimensional feature spaces. SVM can be used to solve various practical problems and is very common in computer vision and pattern recognition.
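The report does not name a specific SVM implementation. The following sketch uses scikit-learn's SVC with a linear kernel as one possible choice, on made-up feature vectors, purely to show where the classifier fits in the pipeline; the kernel, library, and data are all assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_ids, feat_dim = 5, 5120                      # toy numbers, not the real setup
# a few sequence-level feature vectors per identity (placeholder random data)
X_train = rng.standard_normal((n_ids * 4, feat_dim))
y_train = np.repeat(np.arange(n_ids), 4)
X_test = X_train + 0.05 * rng.standard_normal(X_train.shape)

clf = SVC(kernel="linear")                     # kernel choice is an assumption
clf.fit(X_train, y_train)                      # learn one model over the identities
print((clf.predict(X_test) == y_train).mean()) # fraction of correct identities
```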

2.3.5. Result

The reason for using RFA-Net in the experiments is that its core layer is an LSTM. In combination with simple LBP & Color features as input and an SVM classifier, it is reasonable to say that the LSTM is powerful and suitable for the person re-id problem, since its performance is barely affected by other factors. The results show that the slight variations of LSTM perform almost the same as the original one. They give better results on iLIDS-VID but lower ones on PRID-2011, with a difference of 1 to 2%. The Coupled model is smaller while the Peephole model is slightly larger than the original LSTM. This result matches the one shown in a survey by Klaus Greff et al. [16, 17]. The GRU model not only performs notably better than the other three models but also has the smallest size. In general, GRU is the most effective and efficient architecture for person re-id, reducing training time and the number of parameters while giving better results.

Table 2.1 Performance of different LSTM architectures (Rank-1 accuracy)

Method                     PRID-2011 (%)   iLIDS-VID (%)
Color&LBP+LSTM+SVM         58.05           39.32
Color&LBP+Coupled+SVM      55.34           40.14
Color&LBP+Peephole+SVM     55.45           40.12
Color&LBP+GRU+SVM          61.73           43.34

Table 2.2 Size of different models (file .caffemodel)

Model           Size
RFA+LSTM        475,884 KB
RFA+Coupled     356,958 KB
RFA+Peephole    477,932 KB
RFA+GRU         355,934 KB

2.4. Conclusion
In this section, I presented the deep learning concept and multi-shot deep learning based approaches for the person re-id task. In general, deep learning aims to model high-level abstractions in data. For person re-id, it has shown its effectiveness in extracting a unique set of features for each person, as well as its stability across different datasets. Using temporal information boosts the performance even further. I followed [11, 13] and proposed using GRU and other LSTM variants to further assess the performance of a multi-shot person re-id system. The results are reasonable and match previous research.


SECTION 3: GRADUATION THESIS PLAN


My graduation thesis will continue with multi-shot deep learning based methods for person re-identification, mainly focusing on the LSTM and GRU architectures. Testing with other types of features is one task that will be done next. One type of feature that I expect to boost the performance of the RFA network is the CNN feature. Since it does not perform well yet, there is more work to do. For better motion analysis, I plan to make use of Optical Flow in combination with spatial features, with the expectation that this would further increase the accuracy of the task. In addition, new LSTM variants can be designed and tested in parallel.

During my internship, I gained experience of a professional working environment. I learned how a research project is carried out, from analysis to developing a proposal. I also had the chance to practice working skills including teamwork, reporting, and presentation.

Overall, I consider my internship successful: I gained a great deal of knowledge and experience, which will be extremely important to my future.


REFERENCE
[1] I. Goodfellow, Y. Bengio, and A. Courville, "Deep Learning". MIT Press, 2016, http://www.deeplearningbook.org.
[2] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, "Person re-identification by symmetry-driven accumulation of local features", in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2360-2367.
[3] D. Gray and H. Tao, "Viewpoint invariant pedestrian recognition with an ensemble of localized features", in European Conference on Computer Vision. Springer, 2008, pp. 262-275.
[4] D. G. Lowe, "Distinctive image features from scale-invariant keypoints", International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[5] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, 2002.
[6] Z. Zheng, L. Zheng, and Y. Yang, "A discriminatively learned CNN embedding for person re-identification", arXiv preprint arXiv:1611.05666, 2016.
[7] N. McLaughlin, J. Martinez del Rincon, and P. Miller, "Recurrent convolutional network for video-based person re-identification", in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1325-1334.
[8] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult", IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157-166, 1994.
[9] S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[10] J. Liu, A. Shahroudy, D. Xu, and G. Wang, "Spatio-temporal LSTM with trust gates for 3D human action recognition", in European Conference on Computer Vision. Springer, 2016, pp. 816-833.
[11] Y. Yan, B. Ni, Z. Song, C. Ma, Y. Yan, and X. Yang, "Person re-identification via recurrent feature aggregation", in European Conference on Computer Vision. Springer, 2016, pp. 701-716.
[12] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation", arXiv preprint arXiv:1406.1078, 2014.
[13] T. Wang, S. Gong, X. Zhu, and S. Wang, "Person re-identification by video ranking", in European Conference on Computer Vision. Springer, 2014, pp. 688-703.
[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding", in Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014, pp. 675-678.
[15] F. A. Gers and J. Schmidhuber, "Recurrent nets that time and count", in Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, vol. 3. IEEE, 2000, pp. 189-194.
[16] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey", IEEE Transactions on Neural Networks and Learning Systems, 2016.
[17] W. Zaremba, "An empirical exploration of recurrent network architectures", 2015.
[18] http://colah.github.io/posts/2015-08-Understanding-LSTMs/, last visited: 3/8/2017.
[19] R. Kohavi and F. Provost, "Glossary of terms", Machine Learning, vol. 30, no. 2-3, pp. 271-274, 1998.


APPENDIX: IMPLEMENTATION DETAILS


Recurrent Neural Network.
h_t = \tanh(W_x x_t + W_h h_{t-1})  (1)

Long Short Term Memory Network.


f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f)  (2)
i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i)  (3)
\tilde{C}_t = \tanh(W_{Cx} x_t + W_{Ch} h_{t-1} + b_C)  (4)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t  (5)
o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o)  (6)
h_t = o_t \odot \tanh(C_t)  (7)

Long Short Term Memory Network with Coupled gate.


f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f)  (8)
\tilde{C}_t = \tanh(W_{Cx} x_t + W_{Ch} h_{t-1} + b_C)  (9)
C_t = f_t \odot C_{t-1} + (1 - f_t) \odot \tilde{C}_t  (10)
o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o)  (11)
h_t = o_t \odot \tanh(C_t)  (12)

Long Short Term Memory Network with Peephole connection.


f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + W_{fc} C_{t-1} + b_f)  (13)
i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + W_{ic} C_{t-1} + b_i)  (14)
\tilde{C}_t = \tanh(W_{Cx} x_t + W_{Ch} h_{t-1} + b_C)  (15)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t  (16)
o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + W_{oc} C_t + b_o)  (17)
h_t = o_t \odot \tanh(C_t)  (18)


Gated Recurrent Unit.


z_t = \sigma(W_{zx} x_t + W_{zh} h_{t-1} + b_z)  (19)
r_t = \sigma(W_{rx} x_t + W_{rh} h_{t-1} + b_r)  (20)
\tilde{h}_t = \tanh(W_{Hx} x_t + W_{Hh} (r_t \odot h_{t-1}) + b_H)  (21)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t  (22)

