INTERNSHIP REPORT
Topic:
PERSON RE-IDENTIFICATION
Hanoi, 3-2017
Nguyen Tuan Nghia - ET AP - K57
INTRODUCTION
An internship is an important phase for every undergraduate student before working
on the graduation thesis. Since there are many differences between theory and practice,
an internship grants students a chance to develop their knowledge and apply it to
practical situations. In addition, it gives students a view of a professional working
environment, in which they also learn to work in teams, to communicate, and to present
their work to others. These are essential skills that every engineer should have.
It is getting easier for students, especially those studying electronics and
telecommunications, to find internships nowadays. As there are many high-tech companies
in Viet Nam, students can choose the ones they are interested in. Being able to work in
an environment of their choice helps them gain experience much faster than ever. If
everything goes well, they also have the chance to continue working for the company
after graduation.
I am lucky to have been accepted for an internship under Dr. Vo Le Cuong's instruction
in AICS Lab, located at room 618 of the Ta Quang Buu library in Hanoi University of
Science and Technology. In this report, I introduce AICS Lab in section 1. Section
2 focuses on my research during the internship. I would like to sincerely thank
Dr. Cuong and all the staff of the School of Electronics and Telecommunications for
helping me complete my internship. I would also like to sincerely thank Prof. Hyuk-Jae
Lee of the Computer Architecture & Parallel Processing Lab, Seoul National University,
for allowing me to use his workstation. Without his kindness, I could not have
conducted any experiments due to a lack of hardware.
ABSTRACT
Person re-identification, the process of recognizing an individual across a
camera network, is a fundamental task in automated surveillance. It has been receiving
attention for years. The task is challenging due to problems such as appearance
variations of an individual across different cameras and the low quality of video and
image resolution. There have been many proposals to improve the accuracy of this
process. In recent years, deep learning based approaches have been shown to outperform
most traditional ones for person re-identification. In this report, I introduce the
concept of deep learning and propose a method to optimize a multi-shot deep learning
based approach for person re-identification using Recurrent Neural Networks. I conduct
extensive experiments to compare different architectures and find that the Gated
Recurrent Unit is the most effective one, achieving the highest accuracy while having
a reasonable number of parameters.
TABLE OF CONTENTS
INTRODUCTION
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
SECTION 1: AICS LAB
1.1. General information
1.2. Projects and research areas
SECTION 2: INTERNSHIP CONTENT
2.1. Deep learning
2.1.1. The concept
2.1.2. Deep learning for person re-identification
2.2. Multi-shot deep learning methods
2.2.1. Recurrent Neural Network
2.2.2. Long Short Term Memory Network
2.2.3. Gated Recurrent Unit
2.3. Experiment
2.3.1. Caffe
2.3.2. Datasets and evaluation settings
2.3.3. Network implementations
2.3.4. Classifier
2.3.5. Result
2.4. Conclusion
SECTION 3: GRADUATION THESIS PLAN
REFERENCE
APPENDIX: IMPLEMENTATION DETAILS
LIST OF FIGURES
Figure 2.1 Problems when choosing an algorithm to map input x to category y [1]
Figure 2.2 Recurrent Neural Networks with loops [18]
Figure 2.3 Unrolled recurrent neural network [18]
Figure 2.4 The repeating module in a standard RNN [18]
Figure 2.5 RNN makes use of temporal information [18]
Figure 2.6 The problem of long-term dependencies [18]
Figure 2.7 The repeating module in an LSTM [18]
Figure 2.8 Forget gate f [18]
Figure 2.9 Input gate i and candidate vector C [18]
Figure 2.10 Updating cell state C [18]
Figure 2.11 Output gate o and hidden output h [18]
Figure 2.12 Repeating module of Gated Recurrent Unit [18]
Figure 2.13 Experiment procedure
Figure 2.14 Data split settings for PRID-2011
Figure 2.15 LSTM with peephole connections [15] [18]
Figure 2.16 LSTM with coupled gate [18]
Figure 2.17 Recurrent Feature Aggregation Network [9]
LIST OF TABLES
Table 2.1 Performance of different LSTM architectures (rank-1 accuracy)
Table 2.2 Size of different models (.caffemodel file)
LIST OF ABBREVIATIONS
CNN Convolutional Neural Network
GRU Gated Recurrent Unit
LBP Local Binary Pattern
LSTM Long Short Term Memory
RFA Recurrent Feature Aggregation
RNN Recurrent Neural Network
SIFT Scale-Invariant Feature Transform
SVM Support Vector Machine
In computer science, machine learning is a subfield that gives computers the ability
to learn without being explicitly programmed. It explores the study and construction
of algorithms that can learn from and make predictions on data [1]. Machine learning
is employed in a wide range of computing tasks, including computer vision and
pattern recognition.
A machine learning algorithm is an algorithm that is able to learn from data.
Mitchell (1997) provides the definition [1]: "A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with experience E." In this
context, the task T is defined generally and changes to fit a specific problem. For
example, if we want a robot to walk, then walking is the task; if we want a robot to
speak, then speaking is the task. Similar to the human learning process, in which an
individual improves through experience, the machine builds up experience E by
measuring its performance P while trying to solve the task T. Every attempt to solve
the task helps the machine learn and gradually construct a model that fits the given
data.
Machine learning strictly depends on data. A simple data structure requires only a
simple learning algorithm. As the data become more complicated, building an equally
capable algorithm is essential. Trying to model a complicated data structure with a
simple algorithm causes the underfitting problem, which results in inaccurate
predictions. The idea of deep learning was proposed to meet the needs of such
difficult problems.
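The underfitting problem can be illustrated with a small sketch: quadratic data fitted by polynomial models of two capacities. The data, noise level, and degrees below are illustrative choices, not anything from the experiments in this report.

```python
import numpy as np

# Synthetic non-linear (quadratic) data with a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 2 * x**2 + 0.05 * rng.standard_normal(50)

def fit_error(degree):
    """Mean squared training error of a least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

simple_err = fit_error(1)   # a linear model is too simple for this data
matched_err = fit_error(2)  # a quadratic model matches the data structure
assert matched_err < simple_err  # the underfitting model predicts worse
```

The linear model cannot represent the curvature of the data, so its error stays large no matter how it is trained, which is exactly the underfitting behavior described above.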
Deep learning is a branch of machine learning based on a set of algorithms that
attempt to model high-level abstractions in data. The term "deep" describes a general
property of this kind of learning algorithm. In neural computing, instead of having one
hidden layer, a deep feedforward network can have ten times more, in which each layer
Figure 2.1 Problems when choosing algorithm to map input x to category y [1]
learning-based methods are proposed. Deep learning is the one that has been
acknowledged for its great and stable performance on multiple datasets.
In person re-id, various deep learning architectures such as convolutional
neural networks [6] and recurrent neural networks [7] have been applied. The input of a
network is typically an image or video of a person, and the output is a set of learned
features that can be further classified with any metric-learning method. There are two
phases in building a deep learning model for a dataset. First, the model needs to learn
the features of different persons in the training data. After a number of training
iterations, its performance is measured on testing data by how distinctive the
extracted features are, or equivalently, how well those features work with classifiers.
Deep learning is similar to the human learning process, in which students are taught
over a period of time and their scores on the final exam measure how well they learned
from the course.
There are also two deep learning based approaches to solving the re-id problem. The
first consists of single-shot methods, in which the input as well as the learned
features come from one single image of a person. However, in practical surveillance, a
person usually appears in a video rather than in a single-shot image. Multi-shot
methods are proposed to make full use of temporal information. In comparison with
single-shot, multi-shot features are obtained from a sequence of images, which captures
human pose changes and appearance variations as well as frame-level features. The
disadvantage of this approach is its expensive computation, which makes it
inappropriate for real applications at present. Therefore, it is of great importance
to explore a more effective and efficient scheme to make full use of the richer
sequence information for person re-id. Also, the rapid growth of hardware for deep
learning is expected to help bring this method to real life.
through the network decides the output features. The process is similar to human
logical deduction: we do not think from scratch but also use information we had before
to make decisions. Traditional neural networks cannot do this, and it is a major
shortcoming.
Recurrent Neural Networks were proposed to address the above issue. They are
networks with loops in them, allowing information to persist.
In Figure 2.2, a chunk of neural network, A, looks at some input x_t and outputs
a value h_t. The loop connection allows information at time step t to be fed to the
next step t+1. An RNN can be thought of as multiple copies of the same network, each
passing a message to a successor. Figures 2.3 and 2.4 show what happens if we unroll
the loop.
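The recurrence in Figures 2.2 and 2.3 can be sketched in a few lines: the same cell, with shared weights, is applied at every time step. The tanh non-linearity and the shapes below are common illustrative choices, not details taken from any specific paper.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence xs; return all hidden outputs h_t."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x in xs:  # "unrolling the loop": one copy of the cell per step
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # h_t depends on x_t and h_{t-1}
        hs.append(h)
    return hs

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
xs = rng.standard_normal((5, d_in))          # a sequence of 5 input vectors
hs = rnn_forward(xs,
                 rng.standard_normal((d_h, d_in)),
                 rng.standard_normal((d_h, d_h)),
                 np.zeros(d_h))
```

Because `h` is carried from one iteration to the next, each output depends on the whole prefix of the sequence, which is the "message passed to a successor" described above.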
RNNs connect information from previous inputs to the present one, which is
extremely useful for understanding the changes of a person in a video. They are capable
of remembering information over time. However, when the time gap grows, RNNs become
less effective, since they start to forget relevant information and keep only redundant
information [8]. In order to solve this long-term dependencies problem, Hochreiter &
Schmidhuber introduced Long Short Term Memory [9], a special kind of RNN.
LSTM introduces a more complicated structure inside one chunk of the network with
a new output, the cell state, which works like a memory. An LSTM has the ability to
remove or add information to the cell state, carefully regulated by structures called
gates. Gates are composed of a sigmoid neural network layer and a pointwise
multiplication operation. The outputs of the sigmoid layer are between zero and one,
describing how much of each
component is passed through. A value of zero means "let nothing through," while a
value of one means "let everything through."
An LSTM has three gates, named forget (f), input (i), and output (o). As shown in
Figure 2.7, all three gates look at both the input x at time step t and the hidden
output h of the previous time step t-1 to decide the flow of information.
The forget gate allows relevant information stored in the cell state to pass
through while the rest is forgotten.
After flushing some information, the LSTM looks at the input and decides what to
add to the current state. The input gate allows a part of the candidate vector C,
which is created by a tanh layer, to be added to the current cell state.
The current cell state is updated by combining the remembered part of the previous
cell state with the candidate vector.
Finally, the hidden output h is computed from the current cell state with the help
of the output gate o.
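The gate computations of Figures 2.7 to 2.11 can be collected into one sketched time step. The weight shapes, and the convention that every gate looks at the concatenation of h_{t-1} and x_t, are illustrative assumptions, not the exact layout used in any particular framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step; W and b hold one weight matrix and bias per gate."""
    z = np.concatenate([h_prev, x])          # every gate sees [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])         # forget gate: what to keep of c_{t-1}
    i = sigmoid(W["i"] @ z + b["i"])         # input gate: what to write
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate vector (tanh layer)
    c = f * c_prev + i * c_tilde             # updated cell state (Figure 2.10)
    o = sigmoid(W["o"] @ z + b["o"])         # output gate
    h = o * np.tanh(c)                       # hidden output (Figure 2.11)
    return h, c

rng = np.random.default_rng(0)
d_x, d_h = 4, 3
W = {k: rng.standard_normal((d_h, d_h + d_x)) for k in "fico"}
b = {k: np.zeros(d_h) for k in "fico"}
h, c = lstm_step(rng.standard_normal(d_x), np.zeros(d_h), np.zeros(d_h), W, b)
```

Note how the cell state `c` is modified only by pointwise multiplications and additions, which is what lets information flow across many time steps with little interference.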
LSTM has proven its effectiveness in many tasks, such as action recognition [10],
including person re-id. Yichao Yan et al. proposed a neural network architecture that
uses LSTM [11] to extract sequence-level features of a person. The network is trained
and tested on the PRID-2011 dataset with a rank-1 accuracy of 58.2%. Even though this
is lower than that of the RNN used by McLaughlin, N. et al. [6], the difference
between Yan's and McLaughlin's proposals may not come from the choice between RNN and
LSTM, but from the choice of inputs. In addition, Yan's proposal combines simple
traditional features with LSTM, resulting in a faster and simpler model that is still
able to achieve high accuracy. It is not yet known what happens if the RNN in
McLaughlin's proposal is replaced with an LSTM.
Besides the original LSTM architecture, there are some popular LSTM variants used
in many papers. Most LSTM variants are only slightly different from the original one,
so their performance is almost the same. However, a dramatic variation on LSTM, called
the Gated Recurrent Unit [12], is worth mentioning.
Instead of using separate forget and input gates, GRU combines these two into
a single update gate. The cell state and hidden state are also merged into one. The
resulting model is not only simpler than LSTM but also achieves notably higher
accuracy. During my internship, I tried replacing the LSTM in Yan's proposal [11] with
GRU and achieved a rank-1 accuracy of 61.73% on the PRID-2011 dataset. Details of my
work are presented in the next part.
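A GRU time step can be sketched the same way as an LSTM step: the update gate plays the role of the combined forget/input gates, and a single state h replaces the separate cell and hidden states. The reset gate and all shapes below are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h_prev, W, b):
    """One GRU step; W and b hold one weight matrix and bias per gate."""
    zx = np.concatenate([h_prev, x])
    z = sigmoid(W["z"] @ zx + b["z"])   # update gate (merged forget + input)
    r = sigmoid(W["r"] @ zx + b["r"])   # reset gate
    h_tilde = np.tanh(W["h"] @ np.concatenate([r * h_prev, x]) + b["h"])
    return (1 - z) * h_prev + z * h_tilde   # one merged state, no cell state

rng = np.random.default_rng(0)
d_x, d_h = 4, 3
W = {k: rng.standard_normal((d_h, d_h + d_x)) for k in "zrh"}
b = {k: np.zeros(d_h) for k in "zrh"}
h = gru_step(rng.standard_normal(d_x), np.zeros(d_h), W, b)
```

With three weight matrices instead of four and only one state vector, the GRU cell is visibly lighter, which is reflected in the model sizes reported later in this section.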
2.3. Experiment
In order to test LSTM performance on the person re-id task, I have followed the
evaluation settings of [13]. In that paper, the authors proposed using extracted LBP
and Color
2.3.1. Caffe
including an LSTM layer. However, I implemented GRU and the other LSTM variants
myself, since they are not yet available. As a result, I have gained a great deal of
knowledge about the deep learning workflow, most of which concerns the forward and
backward implementation of a layer.
The two most popular datasets for video-based (or multi-shot) person re-id are
PRID-2011 and iLIDS-VID. Both are used to evaluate the performance of most proposed
methods.
PRID-2011 dataset. The PRID-2011 dataset includes 400 image
sequences for 200 persons from two cameras. Each image sequence has a
variable length of 5 to 675 frames, with an average of 100. The images
were captured in an uncrowded outdoor environment with a relatively
simple and clean background and rare occlusion; however, there are
significant viewpoint and illumination variations as well as color
inconsistency between the two views.
iLIDS-VID dataset. The iLIDS-VID dataset contains 600 image
sequences for 300 persons in two non-overlapping camera views. Each
image sequence has a variable length of 23 to 192 frames, with an
average of 73. This dataset was created at an airport arrival hall under
a multi-camera CCTV network, and the images were captured with
significant background clutter, occlusions, and viewpoint/illumination
variations, which makes the dataset very challenging.
Following [13], only the sequence pairs with more than 21 frames are used in my
experiments. The whole set of human sequence pairs in each dataset is randomly split
into two subsets of equal size, one for training and the other for testing. The
sequences from the first camera are used as the probe set, while the gallery set comes
from the other camera. For both datasets, I report the rank-1 accuracy over 10 trials.
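The rank-1 metric used above can be sketched directly: a probe is counted as correct when its nearest gallery feature belongs to the same identity. The Euclidean distance and the synthetic feature values below are illustrative assumptions; real evaluations use the features produced by the network.

```python
import numpy as np

def rank1_accuracy(probe, gallery):
    """probe, gallery: (n_ids, d) arrays; row i holds person i's features."""
    # Pairwise distances between every probe and every gallery feature.
    dists = np.linalg.norm(probe[:, None, :] - gallery[None, :, :], axis=2)
    # Rank-1: the closest gallery entry must be the same identity.
    return float(np.mean(dists.argmin(axis=1) == np.arange(len(probe))))

rng = np.random.default_rng(0)
ids = np.arange(5, dtype=float)[:, None]
probe = np.repeat(ids, 8, axis=1)                          # camera-1 features
gallery = probe + 0.01 * rng.standard_normal(probe.shape)  # camera-2 view
acc = rank1_accuracy(probe, gallery)
```

In the full protocol this computation is repeated over the 10 random train/test splits and the accuracies are averaged.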
LSTM with coupled forget and input gate. Instead of separately deciding
what to forget and what to pass through, the coupled gate allows the flushed
information to be replaced with new information at once.
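The coupled variant's cell-state update can be written in one line: the input gate is tied to the forget gate as i = 1 - f, so exactly the forgotten fraction is replaced by the candidate vector. The values below are purely illustrative.

```python
import numpy as np

def coupled_update(f, c_prev, c_tilde):
    """Coupled forget/input gates: the input gate is tied as i = 1 - f."""
    return f * c_prev + (1.0 - f) * c_tilde

f = np.array([1.0, 0.0, 0.5])         # keep all, forget all, mix
c_prev = np.array([2.0, 2.0, 2.0])    # previous cell state
c_tilde = np.array([-1.0, -1.0, -1.0])  # candidate vector
c = coupled_update(f, c_prev, c_tilde)
```

Tying the two gates removes one weight matrix per cell, which explains why the Coupled model in Table 2.2 is smaller than the original LSTM.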
2.3.4. Classifier
The classifier is used in the testing phase, after a set of features has been
obtained for each person. Support Vector Machine is chosen due to its popularity.
Support Vector Machine. Support Vector Machine (SVM) is a supervised
learning algorithm that analyzes data for classification problems. Given training data
and labels, SVM builds a model that maps a sample to one of the categories it has
learned. SVM can efficiently perform non-linear classification using kernels,
implicitly mapping its inputs into high-dimensional feature spaces. SVM can be used to
solve various practical problems and is very common in computer vision and pattern
recognition.
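The testing-phase use of the SVM can be sketched with scikit-learn (assuming it is available); the features and identity labels below are synthetic stand-ins for the sequence-level features extracted by the network, not real experimental data.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
# Three identities, ten 8-dimensional feature vectors each (synthetic).
train_feats = np.vstack([rng.normal(i, 0.1, (10, 8)) for i in range(3)])
train_ids = np.repeat(np.arange(3), 10)   # identity label per feature vector

clf = SVC(kernel="rbf")   # the kernel gives a non-linear decision boundary
clf.fit(train_feats, train_ids)

probe_feat = rng.normal(1.0, 0.1, (1, 8))  # feature of an unseen sequence
pred = clf.predict(probe_feat)
```

In the actual pipeline, the training features come from the training split and the probe features from the testing split, with one SVM decision per probe sequence.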
2.3.5. Result
The reason for using RFA-Net in the experiments is that its core layer is LSTM.
In combination with simple LBP & Color features as input and an SVM classifier, it is
reasonable to say that LSTM is powerful and suitable for the person re-id problem,
since its performance is barely affected by other factors. The results show that the
slight variations of LSTM perform almost the same as the original one. They give
better results on iLIDS-VID but lower ones on PRID-2011, with a difference of 1 to 2%.
The Coupled model is smaller, while the Peephole model is a bit larger than the
original LSTM. This result matches the one shown in a survey by Klaus Greff et al.
[16, 17]. In the case of GRU, the model not only performs notably better than all
three other models but also has the smallest size. In general, GRU is the most
effective and efficient architecture for person re-id; it reduces training time and
the number of parameters while giving better results.
Table 2.2 Size of different models (.caffemodel file)

Model           Size
RFA+LSTM        475,884 KB
RFA+Coupled     356,958 KB
RFA+Peephole    477,932 KB
RFA+GRU         355,934 KB
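The size gap in the table is consistent with a back-of-the-envelope parameter count: an LSTM cell has four weight matrices (three gates plus the candidate), a GRU cell only three, so a GRU cell needs roughly three quarters of the parameters. The layer sizes d_h and d_x below are illustrative, not the ones used in RFA-Net.

```python
def gated_cell_params(n_blocks, d_h, d_x):
    """Each gate/candidate: a (d_h x (d_h + d_x)) matrix plus a d_h bias."""
    return n_blocks * (d_h * (d_h + d_x) + d_h)

d_h, d_x = 512, 1024
lstm_params = gated_cell_params(4, d_h, d_x)  # f, i, o gates + candidate
gru_params = gated_cell_params(3, d_h, d_x)   # update, reset + candidate
ratio = gru_params / lstm_params              # 3/4, regardless of d_h and d_x
```

The measured sizes (355,934 KB vs. 475,884 KB, a ratio of about 0.748) are close to this 3/4 estimate, with the remainder attributable to the non-recurrent layers shared by both models.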
2.4. Conclusion
In this section, I presented the deep learning concept and multi-shot deep learning
based approaches for the person re-id task. In general, deep learning aims to model
high-level abstractions in data. For person re-id, it has shown its effectiveness by
extracting a distinctive set of features, with stable performance across different
datasets. Using temporal information in addition boosts the performance even more. I
followed [11, 13] and proposed using GRU and LSTM variants to further assess the
performance of a multi-shot person re-id system. The results are reasonable and match
previous research.
REFERENCE
[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016,
http://www.deeplearningbook.org.
[2] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, "Person re-
identification by symmetry-driven accumulation of local features," in Computer
Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010,
pp. 2360–2367.
[3] D. Gray and H. Tao, "Viewpoint invariant pedestrian recognition with an ensemble
of localized features," in European Conference on Computer Vision. Springer, 2008,
pp. 262–275.
[4] D. G. Lowe, "Distinctive image features from scale-invariant keypoints,"
International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[5] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation
invariant texture classification with local binary patterns," IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
[6] Z. Zheng, L. Zheng, and Y. Yang, "A discriminatively learned cnn embedding for
person re-identification," arXiv preprint arXiv:1611.05666, 2016.
[7] N. McLaughlin, J. Martinez del Rincon, and P. Miller, "Recurrent convolutional
network for video-based person re-identification," in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2016, pp. 1325–1334.
[8] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with
gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2,
pp. 157–166, 1994.
[9] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural
Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[10] J. Liu, A. Shahroudy, D. Xu, and G. Wang, "Spatio-temporal lstm with trust gates
for 3d human action recognition," in European Conference on Computer Vision.
Springer, 2016, pp. 816–833.
[11] Y. Yan, B. Ni, Z. Song, C. Ma, Y. Yan, and X. Yang, "Person reidentification via
recurrent feature aggregation," in European Conference on Computer Vision.
Springer, 2016, pp. 701–716.
[12] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H.
Schwenk, and Y. Bengio, "Learning phrase representations using rnn encoder-
decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[13] T. Wang, S. Gong, X. Zhu, and S. Wang, "Person re-identification by video
ranking," in European Conference on Computer Vision. Springer, 2014, pp. 688–703.
[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama,
and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in
Proceedings of the 22nd ACM International Conference on Multimedia. ACM,
2014, pp. 675–678.
[15] F. A. Gers and J. Schmidhuber, "Recurrent nets that time and count," in Neural
Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International
Joint Conference on, vol. 3. IEEE, 2000, pp. 189–194.
[16] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber,
"Lstm: A search space odyssey," IEEE Transactions on Neural Networks and
Learning Systems, 2016.
[17] W. Zaremba, "An empirical exploration of recurrent network architectures," 2015.
[18] http://colah.github.io/posts/2015-08-Understanding-LSTMs/, last visited: 3/8/2017.
[19] R. Kohavi and F. Provost, "Glossary of terms," Machine Learning, vol. 30, no. 2-3,
pp. 271–274, 1998.