
2016 International Conference on Frontiers of Information Technology

Crowd Video Classification using Convolutional Neural Networks

Atika Burney
Department of Computer Science
National University of Computer and Emerging Sciences
Email: atika.mustafa@nu.edu.pk

Tahir Q. Syed
Department of Computer Science
National University of Computer and Emerging Sciences
Email: tahir.syed@nu.edu.pk

Abstract—Deep learning tools such as the convolutional neural network (CNN) are extensively used for image analysis and interpretation tasks, but they become relatively expensive to use for a corresponding analysis in videos by requiring memory provision for the additional temporal information. Crowd video analysis is one of the subareas of video analysis that has recently gained prominence. In this paper we show that a 2D CNN can be used to classify videos by using a 3-channel image map input for each video, computed using spatial and temporal information, and that this reduces space and time complexity over the classical 3D CNN usually used for video analysis. We test the model developed against the state-of-the-art method of [1] using their proposed dataset and, without any additional processing steps, improve upon their reported accuracy.

I. INTRODUCTION

Analysing and understanding the behaviour of human crowds is gaining importance, especially in the context of citizen security, safety and city administration, following the ubiquity of video surveillance [2], [3], [4], [5]. Despite a number of works in this area, understanding the dynamics of crowds still remains non-trivial [6]. One of the difficulties is in being able to define the behaviour of an entire crowd in a scene, say for example a crowd in panic, because this varies with the kind of crowd in question. This is why it is essential to categorize a crowd as a preparatory step towards more visible and socially significant applications. In this paper we tackle the problem of crowd video classification, which can serve more meaningful applications in video surveillance such as anomaly detection and crowd planning.

A crowd is defined as a collection of a large number of people gathered together within defined geo-physical boundaries. The primary entities that make up a crowd are groups [1]. A group can be defined as a collection of individuals that show similar behaviour and/or are in close proximity to each other [7], [1]. Groups define local behaviour in the crowd, and their combined behaviour defines the crowd's behaviour.

Crowd behaviour analysis consists of a number of activities: identifying the objects of interest (groups and/or individuals), modelling the appearance of the objects, tracking their movement separately and collectively using some tracking methods, and combining the information from all of these to gain a general understanding of the crowd behaviour. All of these tasks pose a challenge.

Deep learning tools such as 2D convolutional neural networks (CNNs) are extensively being used in image analysis applications like image classification [8], [9] and segmentation [10], [11], but they lack the capability to capture motion information in terms of a 3D volume of video frames. A 3D CNN, which is an extension of the 2D CNN, can be used to capture temporal information, but its computational cost may be exorbitant because of the much larger number of feature maps generated from each convolutional layer.

Consider an RGB image of dimension 32 × 32 × 3 and a convolution layer with a depth of 32. The output of this layer would be 32 × 32 × 32 features. On the other hand, if we use video as input to a 3D CNN, then for a video of length 10 seconds at 15 fps the input to the CNN would be a volume of dimensions 32 × 32 × 3 × 10 × 15 and the output would be 32 × 32 × 32 × 10 × 15 features, amounting to a cost 150 times what the 2D CNN requires in memory space and computation time. For this reason it becomes important to have descriptors that can capture both spatial and temporal information and can be used with a 2D CNN, thus reducing the computational cost in terms of both time and space while giving results comparable to the state of the art. This serves to limit the computational cost of the system, but there are important additional remarks to be made around its use.

Usually deep learning begins constructing features from the smallest unit of information available, for instance a pixel in an image. In the case of videos, a derived quantity such as optical flow might be this smallest unit, as the RGB value was for an image. However, whenever we have prior information about a problem, it may be desirable to use a higher-level unit of information than a raw pixel containing optical flow direction vectors. Such models of deep learning are sometimes called hybrid models. The use of crowd properties as explained in the following is a hybrid approach where a hierarchy of features that has meaning in terms of the application at hand has already been built, namely from tracklets that carry information of an individual's motion history, to a higher-level representation that could be understood in the context of human crowds, say how self-contained a group is. This becomes the input to the deep learning architecture, and the latter derives further features from these primitives. These new deeply-extracted features may not be human-interpretable but offer a better division of the feature hyperspace for the classification task.
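To make the cost comparison above concrete, the following sketch simply multiplies out the dimensions quoted in this section (Python, illustrative only):

```python
# Activations from one 2D convolutional layer on a 32x32x3 image
# with 32 filters (spatial size kept unchanged for simplicity).
feats_2d = 32 * 32 * 32             # 32,768 values

# A 3D CNN over a 10-second clip at 15 fps adds a 150-frame
# temporal axis, multiplying the activation count accordingly.
n_frames = 10 * 15
feats_3d = 32 * 32 * 32 * n_frames  # 4,915,200 values

print(feats_3d // feats_2d)         # 150x the 2D cost
```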

Shao et al. [1] have proposed three descriptors that can be used to define crowd properties independently of the type of crowd, and in [12] have shown that these can be used for describing a video in terms of place, action and actor. In this paper we will use these descriptors to construct an image map of each video. These images are then fed to a 2D CNN for complex-feature generation and classification.

A. Crowd Properties

Shao et al. [1] proposed a framework for detecting groups in a crowd independently of the type of scene. The scene can contain anywhere from a few individuals, e.g. on a street pavement, to a highly dense crowd such as at marathons or train stations. In that work, descriptors for four group properties are defined and quantified. These are collectiveness, uniformity, stability, and conflict.

Collectiveness: This property indicates the degree to which individuals act as a union in collective motion, and is used as a universal measure to study crowds [13]. An example of collective behaviour is when, in a train station, everyone moves towards a common destination such as an exit.

Uniformity: Uniformity characterizes the homogeneity of a group in terms of spatial distribution. In a congested area, a group tends to show more uniformity, as individuals do not have space to move freely and therefore tend to stay close to each other.

Stability: This property characterizes whether a group can keep its internal topological structure over time. During a crowd disaster, the chaos and panic can be measured using the stability property of the groups.

Conflict: The conflict property characterizes interaction/friction between groups when they approach each other. This property is present when groups move in different directions, such as when crossing a road from opposite sides.

Collectiveness, uniformity and stability are intragroup properties that define how each group behaves irrespective of the other groups and the crowd as a whole. Conflict is an intergroup property that describes interaction between groups. Collectiveness and stability measure temporal aspects, whereas uniformity measures the spatial aspect of the groups.
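The descriptors themselves are computed as in [1] and are not reproduced here. The following toy sketch only illustrates the intuition behind the three properties used later in this paper; the cosine-similarity and nearest-neighbour formulations are our own simplifications, not the estimators of [1]:

```python
import numpy as np

def collectiveness(velocities):
    """Toy proxy: mean pairwise cosine similarity of velocity
    vectors within one group (1.0 = perfectly collective motion)."""
    v = velocities / (np.linalg.norm(velocities, axis=1, keepdims=True) + 1e-8)
    sims = v @ v.T
    n = len(v)
    return (sims.sum() - n) / (n * (n - 1))   # exclude self-similarity

def stability(positions_t0, positions_t1, k=3):
    """Toy proxy: fraction of ranked k-nearest-neighbour lists that
    are unchanged from one time step to the next."""
    def knn(p):
        d = np.linalg.norm(p[:, None] - p[None, :], axis=2)
        np.fill_diagonal(d, np.inf)
        return np.argsort(d, axis=1)[:, :k]
    return (knn(positions_t0) == knn(positions_t1)).mean()

def conflict(vel_group_a, vel_group_b):
    """Toy proxy: opposition between the mean headings of two
    approaching groups (1.0 = head-on motion)."""
    ma, mb = vel_group_a.mean(axis=0), vel_group_b.mean(axis=0)
    cos = ma @ mb / (np.linalg.norm(ma) * np.linalg.norm(mb) + 1e-8)
    return (1 - cos) / 2
```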
II. RELATED WORK

Crowd video analysis is an active area of research, especially in the context of automated video surveillance. Earlier works focused mostly on determining crowd behaviour either at the individual/group level, or at a global level where the crowd is considered a single entity. Work at the group/individual level includes detecting and tracking individuals [14] and/or groups [15], [7], [1], [16], and detecting interactions between them. At the global level, the main focus was identifying anomalous behaviour [17], [18], [19], such as riots or panic situations, for crowd management.

Deep learning methods such as CNNs have been effective in static image analysis and are now being used for temporal data as well. Karpathy et al. [20] have shown their effectiveness for video classification on a large dataset of 1M sports videos. Kang et al. [21] proposed a fully convolutional neural network for crowd segmentation by integrating appearance and motion information. Ng et al. [22] proposed and evaluated several deep neural network architectures and explored various convolutional temporal feature pooling architectures, but did not focus on any specific video analysis task. [23], [6] used CNNs for videos with multi-label classification to determine the actor, action and place, and also proposed a large dataset of 10,000 videos for this purpose. Wang et al. [24] also used deep networks for action recognition by extracting motion and appearance information and concatenating the results at the end. Zha et al. [25] explored different strategies for event detection in videos using CNNs trained for image classification.

III. EXPERIMENTS AND RESULTS

Our goal is to classify a crowd into one of the eight classes from [1], using the group and crowd properties defined in [1], [12]. The experiments are run on the crowd dataset of [1]. The group descriptors in [1] were extended by Shao et al. in [12] for characterizing crowd properties. Three out of the four properties are used: collectiveness, stability and conflict. Uniformity is not applicable in our case, as the crowd as a unit is not homogenous. In [12], these modified crowd properties are used for multi-label classification of videos with a convolutional neural network on the WWW dataset of 10,000 videos, where each video is assigned labels describing three properties: action, actor, and place of action. For our purpose, we use the crowd properties defined in [12] on the crowd dataset of [1] to classify the videos using a convolutional neural network.

A. Dataset

The dataset contains 474 videos of varying resolutions and durations. The data suffer from a class imbalance problem, as is evident from Table I: for example, classes 2 and 8 together hold 58% of the videos, while class 4 has only 2% (9 videos). Since the dataset's resolutions vary, to make everything uniform we resize the frames of all the videos to 360 × 640, but the number of frames is kept intact. The data are divided into train and test sets with an 80/20 ratio: the training set contains 377 videos and the test set 94 videos. Three videos are discarded due to invalid and non-readable contents.

The three descriptors, collectiveness, stability and conflict, are computed separately for each video in both the train and test sets. Each descriptor is computed by grouping a fixed number of frames together. For example, if the total number of frames of dimension n × m is 60 and 10 frames are grouped to compute the descriptor, then the whole video will have 6 values for it, giving us a volume of n × m × 6. Since we are using a 2D CNN, we take the average across frames for each video pixel and store it as a one-channel image map of size n × m.
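A sketch of this aggregation step (Python; `descriptor_fn`, which maps a block of frames to an n × m response map, is a hypothetical placeholder for the descriptor computations of [1], [12]):

```python
import numpy as np

def video_to_map(frames, descriptor_fn, group_size=10):
    """frames: array of shape (T, n, m). Computes one descriptor
    map per group of `group_size` frames, then averages the
    resulting volume over time into a single n x m image map."""
    maps = [descriptor_fn(frames[i:i + group_size])
            for i in range(0, len(frames) - group_size + 1, group_size)]
    volume = np.stack(maps)        # e.g. (6, n, m) for 60 frames
    return volume.mean(axis=0)     # one-channel image map, (n, m)
```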

TABLE I
LIST OF CROWD VIDEO CLASSES

Class Label   Class Description                                         Videos   Ratio
1             Highly mixed pedestrian walking                           15       0.03
2             Crowd walking in mainstream in well organized manner      153      0.32
3             Crowd walking in mainstream in poorly organized manner    72       0.15
4             Crowd merge                                               9        0.02
5             Crowd split                                               13       0.03
6             Crowd crossing in opposite direction                      70       0.15
7             Intervened escalator traffic                              21       0.05
8             Smooth escalator traffic                                  121      0.26
Total                                                                   471

In this way we have three image maps per video, each one corresponding to a different crowd property.

Three CNNs with the same architecture are trained separately, one for each descriptor. The network configuration is: conv(5 × 5 × 32), relu, pool(4 × 4), conv(5 × 5 × 32), relu, pool(4 × 4), dropout(0.25), conv(5 × 5 × 64), relu, pool(2 × 2), dropout(0.25), fc(128), relu, dropout(0.5). The activation function in the last layer is sigmoid, the loss function is categorical cross-entropy, and the optimizer is SGD. The learning rate is set at 0.01 with a decay of 1 × 10^-6. The number of epochs run is 150.
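A minimal sketch of this configuration in classic Keras syntax (the 360 × 640 × 1 input shape for a single-descriptor map and the 8 output units are inferred from Sections III and III-A; in current tf.keras, `lr` and `decay` are replaced by `learning_rate` and a learning-rate schedule):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from keras.optimizers import SGD

model = Sequential([
    Conv2D(32, (5, 5), activation='relu', input_shape=(360, 640, 1)),
    MaxPooling2D((4, 4)),
    Conv2D(32, (5, 5), activation='relu'),
    MaxPooling2D((4, 4)),
    Dropout(0.25),
    Conv2D(64, (5, 5), activation='relu'),
    MaxPooling2D((2, 2)),
    Dropout(0.25),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(8, activation='sigmoid'),  # eight crowd classes, as stated in the text
])

model.compile(loss='categorical_crossentropy',
              optimizer=SGD(lr=0.01, decay=1e-6),
              metrics=['accuracy'])
```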
The architecture of the CNN remains the same, but we run a number of experiments based on how the data are prepared. We compare results on the basis of the accuracy reported by the Keras deep learning library. The state of the art for the ensemble of crowd properties is at 77% [1].

B. Experiment 1

In the first experiment no data augmentation is used, and the data bias due to having very different numbers of examples per category is not considered. This experiment is used to establish a baseline for our CNN, and mirrors the way data were prepared in the literature for this dataset.

Table V and Table VI show the training and validation loss and accuracy for 150 epochs. The classification accuracy obtained on the test dataset with no data augmentation is given in Table II.

TABLE II
CLASSIFICATION RESULT (NO AUGMENTATION)

Descriptor       Testing Loss   Testing Accuracy   Validation Loss   Validation Accuracy
Collectiveness   0.37           83%                1.28              77%
Stability        0.14           95%                2.55              53%
Conflict         0.85           65%                0.87              68%
Combined         0.94           66%                0.93              70%

The classification accuracies using the descriptors collectiveness, stability and conflict separately on the given dataset are 77%, 53% and 68% respectively. By combining the three descriptors we were able to achieve an accuracy of 70%, which is the same as the state of the art currently available on this dataset, by [1], which uses the same group properties as features and applies an SVM for classification.

Because of the class imbalance problem in the data, the classes with a higher number of videos (classes 2 & 8) have higher accuracies than the classes with the fewest videos (classes 4 & 5). In the confusion matrix (Table V) for collectiveness, the accuracies for classes 2 & 8 are 80% and 100% respectively, while for class 4 it is 0: since there were only 9 videos in it, the classifier learns to misclassify this class completely in favour of correctly classifying some of the more populous classes.

The bottleneck in the deep learning technique for this dataset is the class imbalance, unlike the SVM, where the confidence level is the criterion. The result may differ if a dataset with uniform classes is used. At present no such dataset is available for crowds.

C. Experiment 1.5

In the next experiment we combine the three single-channel images of each video to form one three-channel image map. The same CNN architecture is used except with this input. The classification result obtained is in the last row of Table II.
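The three-channel input for this experiment can be assembled by stacking the per-property maps along a channel axis, for example (a sketch; the placeholder maps stand in for the outputs of the aggregation step in Section III-A):

```python
import numpy as np

n, m = 360, 640
# Placeholders for the three single-channel maps computed per video.
collectiveness_map = np.zeros((n, m))
stability_map = np.zeros((n, m))
conflict_map = np.zeros((n, m))

# Stack into an (n, m, 3) image, analogous to an RGB input
# for the same 2D CNN.
three_channel_input = np.stack(
    [collectiveness_map, stability_map, conflict_map], axis=-1)
```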

TABLE V
TRAINING AND VALIDATION LOSS AND ACCURACY (NO AUGMENTATION) & CONFUSION MATRICES
(panels: Collectiveness, Stability, Conflict; plots not reproduced)

TABLE VI
LOSS AND ACCURACY (COMBINED FEATURES WITH NO AUGMENTATION)
(panel: Combined; plots not reproduced)

TABLE VII
VISUALIZATION OF FEATURES
(panels: Collectiveness, Stability, Conflict, Combined; images not reproduced)

We try to reduce this impact by using data augmentation, generating 300 videos per class for the training set in Experiment 2.

D. Experiment 2

Experiment 2 is run using data augmentation to reduce the class imbalance problem. We generate 300 videos for each class in the training set and hence have 2400 videos in total. Data augmentation became popular with the seminal work of Krizhevsky and Hinton [26]. It is an attempt to overcome the problem of there not being enough data to match the model complexity of deep networks, by generating from one data example multiple instances that are slight variations in orientation, scale and colour normalization of the original. Data augmentation was performed separately on each class, and was also applied to the test set to generate 600 test videos.

TABLE III
CLASSIFICATION RESULT (WITH DATA AUGMENTATION)

Descriptor       Testing Loss   Testing Accuracy   Validation Loss   Validation Accuracy
Collectiveness   0.41           82%                1.7               52%
Stability        0.09           96%                4.48              34%
Conflict         0.50           77%                2.06              52%
Combined         0.76           71%                1.17              56%

The classification accuracy with data augmentation, using the features individually and combined, is given in Table III. There is an improvement in accuracy when all three features are used compared to the individual features, but the accuracy dropped from 70% to 56% compared to Experiments 1 and 1.5.

The reason is that the features used were not standard image features but features extracted from temporal data. A visualization of the three features and the combined feature of one video is shown in Table VII, which shows them as blobs with no specific shape or orientation. The classic techniques of scaling, rotation and translation that are used to generate augmented data do not perform well with these blobs. The putative reason is that data augmentation for neural networks was developed in the context of object recognition, where convolutional filters matching the shapes of objects are eventually learnt, whereas in our application there are no objects per se; instead we have amorphous blobs that the augmentation routines transform into other blobs of unrelated shapes that may bear no relation to the original feature image. Thus we have reached another bottleneck of deep learning, that is, insufficient data.
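A sketch of such an augmentation routine using Keras's ImageDataGenerator (the specific transform ranges are our assumption; the paper does not state which parameters were used):

```python
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(rotation_range=15,
                               zoom_range=0.1,
                               width_shift_range=0.1,
                               height_shift_range=0.1)

def augment_class(x, target=300):
    """x: (N, n, m, c) image maps of one class. Draws slightly
    transformed copies until `target` samples are produced."""
    flow = augmenter.flow(x, batch_size=1, shuffle=True)
    out = [next(flow)[0] for _ in range(target)]
    return np.stack(out)
```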

E. Experiment 3

Experiment 3 is run with inverse-frequency sampling to generate bootstrapped data and thereby reduce the class imbalance problem. The largest class has 150 videos, which was used as the target number of samples for the other classes. For classes 1, 3, 4, 5, 6 and 7 we replicated the data to reach 140-150 videos per class. The classification accuracies using the descriptors collectiveness, stability and conflict separately are given in Table IV.

TABLE IV
CLASSIFICATION RESULT (WITH REPLICATION)

Descriptor       Testing Loss   Testing Accuracy
Collectiveness   0.6            89%
Stability        1.53           83%
Conflict         0.84           94%
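A sketch of this replication scheme (resampling each minority class with replacement up to the size of the largest class; the helper is our illustration):

```python
import numpy as np

def replicate_to_balance(x, y, target=150, seed=0):
    """x: samples, y: integer class labels. Resamples each class
    with replacement until it holds `target` examples."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        take = rng.choice(idx, size=target, replace=True)
        xs.append(x[take])
        ys.append(np.full(target, c))
    return np.concatenate(xs), np.concatenate(ys)
```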
It is found that standard methods for improving classification in neural networks help our CNN outperform the rather classic method presented in [1], and that it is better suited as a precursor to more involved crowd-understanding tasks.

IV. CONCLUSION

Crowd video classification is the precursor of more useful crowd video analysis applications. A crowd's behaviour can be defined in its own right and also in terms of its components. In this paper we have shown that mid-level descriptors of the groups can be used to classify crowd videos, and also that similar accuracy can be obtained if the descriptors are computed considering the whole crowd as a single entity. The additional tasks of identifying groups, computing their features and then combining them to characterize the crowd can thus be reduced to the single step of computing the features of the crowd.
REFERENCES

[1] J. Shao, C. C. Loy, and X. Wang, "Scene-Independent Group Profiling in Crowd," in IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp. 2227-2234.
[2] B. Zhan, D. N. Monekosso, P. Remagnino, S. A. Velastin, and L.-Q. Xu, "Crowd analysis: a survey," Machine Vision and Applications, vol. 19, no. 5-6, pp. 345-357, Apr. 2008.
[3] C. C. Loy, K. Chen, S. Gong, and T. Xiang, "Crowd Counting and Profiling: Methodology and Evaluation," Springer, vol. 11, pp. 347-382, 2013.
[4] R. Mehran, A. Oyama, and M. Shah, "Abnormal crowd behavior detection using social force model," in IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 935-942.
[5] M. Rodriguez, J. Sivic, I. Laptev, and J.-Y. Audibert, "Data-driven crowd analysis in videos," in IEEE International Conference on Computer Vision, November 2011, pp. 1235-1242.
[6] J. Shao, C. C. Loy, K. Kang, and X. Wang, "Slicing Convolutional Neural Network for Crowd Video Understanding," in Computer Vision and Pattern Recognition, 2016.
[7] L. Bazzani, M. Zanotto, M. Cristani, and V. Murino, "Joint Individual-Group Modeling for Tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 4, pp. 746-759, 2014.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, pp. 1-9, 2012.
[9] J. Yim, J. Ju, H. Jung, and J. Kim, "Image Classification Using Convolutional Neural Networks With Multi-stage Feature," in Robot Intelligence Technology and Applications 3. Springer International Publishing, 2015, pp. 587-594.
[10] J. Long, E. Shelhamer, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," in Computer Vision and Pattern Recognition, 2015.
[11] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks," in CVPR, 2014, pp. 1717-1724.
[12] J. Shao, K. Kang, C. C. Loy, and X. Wang, "Deeply Learned Attributes for Crowded Scene Understanding," in Computer Vision and Pattern Recognition, 2015.
[13] T. Vicsek and A. Zafeiris, "Collective motion," Physics Reports, vol. 517, no. 3-4, pp. 71-140, 2012.
[14] L. Bazzani, M. Cristani, and V. Murino, "Decentralized particle filter for joint individual-group tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2012, pp. 1886-1893.
[15] W. Ge, R. T. Collins, and R. B. Ruback, "Vision-Based Analysis of Small Groups in Pedestrian Crowds," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 5, pp. 1003-1016, 2012.
[16] S. Yi and X. Wang, "Profiling Stationary Crowd Groups," in IEEE International Conference on Multimedia and Expo, 2014.
[17] K.-W. Cheng, Y.-T. Chen, and W.-H. Fang, "Abnormal crowd behavior detection and localization using maximum sub-sequence search," in ACM/IEEE International Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Stream (ARTEMIS '13), vol. 1. New York, NY, USA: ACM Press, 2013, pp. 49-58.
[18] H. Fradi and J.-L. Dugelay, "Sparse Feature Tracking for Crowd Change Detection and Event Recognition," in IEEE International Conference on Pattern Recognition, 2014.
[19] W. Li and V. Mahadevan, "Anomaly Detection and Localization in Crowded Scenes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 1, pp. 18-32, 2014.
[20] A. Karpathy and T. Leung, "Large-scale Video Classification with Convolutional Neural Networks," in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725-1732.
[21] K. Kang and X. Wang, "Fully Convolutional Neural Networks for Crowd Segmentation," arXiv:1411.4464, 2014.
[22] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond Short Snippets: Deep Networks for Video Classification," in Computer Vision and Pattern Recognition, 2015.
[23] J. Shao, C. C. Loy, and X. Wang, "Learning Scene-Independent Group Descriptors for Crowd Understanding," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1-1, 2016.
[24] P. Wang, Y. Cao, C. Shen, L. Liu, and H. T. Shen, "Temporal Pyramid Pooling Based Convolutional Neural Networks for Action Recognition," arXiv, pp. 1-8, 2015.
[25] S. Zha, "Exploiting Image-trained CNN Architectures for Unconstrained Video Classification," in BMVC, 2015, p. 51.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.

