Abstract—Deep learning tools such as the convolutional neural network (CNN) are extensively used for image analysis and interpretation tasks, but they become relatively expensive to use for the corresponding analysis of videos because they require memory provision for the additional temporal information. Crowd video analysis is a subarea of video analysis that has recently gained attention. In this paper we show that a 2D CNN can be used to classify videos by computing, for each video, a 3-channel image map input from spatial and temporal information; this reduces space and time complexity compared with the classical 3D CNN usually used for video analysis. We test the developed model against the state-of-the-art method of [1] on their proposed dataset and, without any additional processing steps, improve upon their reported accuracy.

I. INTRODUCTION

Analysing and understanding the behaviour of human crowds is gaining importance, especially in the context of citizen security, safety and city administration following the ubiquity of video surveillance [2], [3], [4], [5]. Despite a number of works in this area, understanding the dynamics of a crowd remains non-trivial [6]. One of the difficulties lies in defining the behaviour of the entire crowd in a scene, say a crowd in panic, because this varies with the kind of crowd in question. This is why categorizing a crowd is an essential preparatory step towards more visible and socially significant applications. In this paper we tackle the problem of crowd video classification, which can serve more meaningful video surveillance applications such as anomaly detection and crowd planning.

A crowd is defined as a large number of people gathered together within defined geo-physical boundaries. The primary entities that make up a crowd are groups [1]. A group can be defined as a collection of individuals that show similar behaviour and/or are in close proximity to each other [7], [1]. Groups define local behaviour in the crowd, and their combined behaviour defines the crowd's behaviour.

Crowd behaviour analysis consists of a number of activities: identifying objects of interest (groups and/or individuals), describing the appearance of the objects, tracking their movement separately and collectively using some tracking method, and combining the information from all of these to gain a general understanding of the crowd behaviour. Each of these tasks poses a challenge.

Deep learning tools such as 2D convolutional neural networks (CNNs) are extensively used in image analysis applications like image classification [8], [9] and segmentation [10], [11], but they lack the capability to capture motion information in terms of a 3D volume of video frames. A 3D CNN, which is an extension of the 2D CNN, can be used to capture temporal information, but its computational cost may be exorbitant because of the much larger number of feature maps generated by each convolutional layer.

Consider an RGB image of dimension 32×32×3 and a convolution layer with depth 32. The output of this layer would be 32×32×32 features. On the other hand, if we use a video as input to a 3D CNN, then for a video of length 10 seconds at 15 fps the input to the CNN would be a volume of dimensions 32×32×3×(10×15) and the output would be 32×32×32×(10×15) features, a space cost 150 times what the 2D CNN requires in memory and in computation time. For this reason it becomes important to have descriptors that capture both spatial and temporal information and can be used with a 2D CNN, thereby reducing the computational cost in both time and space while giving results comparable to the state of the art. This limits the computational cost of the system, but there are important additional remarks to be made around its use.

Usually deep learning constructs features starting from the smallest unit of information available, for instance a pixel in an image. In the case of videos, a derived quantity such as optical flow might be this smallest unit, as the RGB value is for an image. However, whenever we have prior information about a problem, it may be desirable to use a higher-level unit of information than a raw pixel containing optical flow direction vectors. Such models of deep learning are sometimes called hybrid models. The use of crowd properties, as explained in the following, is a hybrid approach in which a hierarchy of features that has meaning in terms of the application at hand has already been built: from tracklets that carry information about an individual's motion history, to a higher-level representation that can be understood in the context of human crowds, say how self-contained a group is. This becomes the input to the deep learning architecture, and the latter derives further features from these primitives.
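The space-cost arithmetic above can be checked with a small sketch (purely illustrative: the layer shapes follow the 32×32×3 example in the text, and padding/stride effects on the output resolution are ignored):

```python
# Back-of-the-envelope comparison of 2D vs. 3D CNN feature-map sizes
# for the worked example in the text (function names are illustrative).

def conv2d_output_size(h, w, depth):
    """Feature count for one 2D conv layer, keeping spatial resolution."""
    return h * w * depth

def conv3d_output_size(h, w, depth, frames):
    """Feature count when the same layer is extended over a frame volume."""
    return h * w * depth * frames

image_features = conv2d_output_size(32, 32, 32)           # 32*32*32 = 32768
frames = 10 * 15                                           # 10 s at 15 fps
video_features = conv3d_output_size(32, 32, 32, frames)

print(image_features)                     # 32768
print(video_features)                     # 4915200
print(video_features // image_features)   # 150
```

The factor of 150 is exactly the number of frames, which is why per-video descriptors that collapse the temporal axis make a 2D CNN viable.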
TABLE I. LIST OF CROWD VIDEO CLASSES
TABLE V. TRAINING AND VALIDATION LOSS AND ACCURACY (NO AUGMENTATION)
TABLE VI. LOSS AND ACCURACY (COMBINED FEATURES WITH NO AUGMENTATION) & CONFUSION MATRICES
TABLE VII. VISUALIZATION OF FEATURES (panels: Collectiveness, Conflict, Stability, Combined)
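Experiment 3 below reduces class imbalance by oversampling each minority class up to the size of the largest class (150 videos). A minimal sketch of that resampling step, assuming a simple mapping from class label to a list of per-video descriptors (all names here are illustrative, not the authors' code):

```python
import random

def oversample_to_largest(videos_by_class, seed=0):
    """Replicate samples in each class (with replacement) until every
    class has as many samples as the largest one."""
    rng = random.Random(seed)
    target = max(len(v) for v in videos_by_class.values())
    balanced = {}
    for label, videos in videos_by_class.items():
        extra = [rng.choice(videos) for _ in range(target - len(videos))]
        balanced[label] = list(videos) + extra
    return balanced

# Toy example: class sizes 150, 40, 90 all become 150 after resampling.
data = {1: list(range(150)), 3: list(range(40)), 4: list(range(90))}
balanced = oversample_to_largest(data)
print({k: len(v) for k, v in balanced.items()})  # {1: 150, 3: 150, 4: 150}
```

Sampling with replacement keeps every original video and only duplicates minority-class samples, so no data is discarded.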
insufficient data.

E. Experiment 3

Experiment 3 is run with inverse-frequency sampling to generate bootstrapped data and reduce the class-imbalance problem. The largest class has 150 videos, which was used as the target number of samples for the other classes. For classes 1, 3, 4, 5, 6, 7 we replicated the data to obtain 140–150 videos per class. The classification accuracies using the descriptors collectiveness, stability and conflict separately are given in Table IV.

It is found that standard methods for improving classification in neural networks help our CNN outperform the rather classic method presented in [1], and it is better suited as a precursor to more involved crowd understanding tasks.

IV. CONCLUSION

Crowd video classification is the precursor of more useful crowd video analysis applications. A crowd's behaviour can be defined both as its own and in terms of its components. In this paper we have shown that mid-level descriptors of the groups can be used to classify crowd videos, and that similar accuracy can be obtained if the descriptors are computed by considering the whole crowd as a single entity. The additional tasks of identifying groups, computing their features and then combining them to define the crowd can thus be reduced to the single step of computing the features of the crowd.

REFERENCES

[1] J. Shao, C. C. Loy, and X. Wang, "Scene-Independent Group Profiling in Crowd," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, June 2014, pp. 2227–2234. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6909682
[2] B. Zhan, D. N. Monekosso, P. Remagnino, S. A. Velastin, and L.-Q. Xu, "Crowd analysis: a survey," Machine Vision and Applications, vol. 19, no. 5-6, pp. 345–357, Apr. 2008. [Online]. Available: http://link.springer.com/10.1007/s00138-008-0132-4
[3] C. C. Loy, K. Chen, S. Gong, and T. Xiang, "Crowd Counting and Profiling: Methodology and Evaluation," Springer, vol. 11, pp. 347–382, 2013.
[4] R. Mehran, A. Oyama, and M. Shah, "Abnormal crowd behavior detection using social force model," IEEE Conference on Computer Vision and Pattern Recognition, no. 1, pp. 935–942, Jun. 2009. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5206641
[5] M. Rodriguez, J. Sivic, I. Laptev, and J.-Y. Audibert, "Data-driven crowd analysis in videos," in IEEE International Conference on Computer Vision. IEEE, November 2011, pp. 1235–1242. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6126374
[6] J. Shao, C. C. Loy, K. Kang, and X. Wang, "Slicing Convolutional Neural Network for Crowd Video Understanding," in Computer Vision and Pattern Recognition, 2016.
[7] L. Bazzani, M. Zanotto, M. Cristani, and V. Murino, "Joint Individual-Group Modeling for Tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 4, pp. 746–759, 2014. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6891328
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, pp. 1–9, 2012.
[9] J. Yim, J. Ju, H. Jung, and J. Kim, "Image Classification Using Convolutional Neural Networks With Multi-stage Feature," in Robot Intelligence Technology and Applications 3. Springer International Publishing, 2015, pp. 587–594. [Online]. Available: http://link.springer.com/chapter/10.1007%2F978-3-319-16841-8_52
[10] J. Long, E. Shelhamer, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," in Computer Vision and Image Understanding, 2015.
[11] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks," CVPR, pp. 1717–1724, 2014.
[12] J. Shao, K. Kang, C. C. Loy, and X. Wang, "Deeply Learned Attributes for Crowded Scene Understanding," in Computer Vision and Pattern Recognition, 2015.
[13] T. Vicsek and A. Zafeiris, "Collective motion," Physics Reports, 2012.
[14] L. Bazzani, M. Cristani, and V. Murino, "Decentralized particle filter for joint individual-group tracking," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, Jun. 2012, pp. 1886–1893. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6247888
[15] W. Ge, R. T. Collins, and R. B. Ruback, "Vision-Based Analysis of Small Groups in Pedestrian Crowds," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 5, pp. 1003–1016, 2012.
[16] S. Yi and X. Wang, "Profiling Stationary Crowd Groups," in IEEE International Conference on Multimedia and Expo, 2014.
[17] K.-W. Cheng, Y.-T. Chen, and W.-H. Fang, "Abnormal crowd behavior detection and localization using maximum sub-sequence search," in ACM/IEEE International Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Stream - ARTEMIS '13, vol. 1. New York, NY, USA: ACM Press, 2013, pp. 49–58. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2510650.2510655
[18] H. Fradi and S. Antipolis, "Sparse Feature Tracking for Crowd Change Detection and Event Recognition," in IEEE International Conference on Pattern Recognition, 2014.
[19] W. Li and V. Mahadevan, "Anomaly Detection and Localization in Crowded Scenes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 1, pp. 18–32, 2014.
[20] A. Karpathy and T. Leung, "Large-scale Video Classification with Convolutional Neural Networks," Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
[21] K. Kang and X. Wang, "Fully Convolutional Neural Networks for Crowd Segmentation," arXiv, vol. 1411.4464, 2014.
[22] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond Short Snippets: Deep Networks for Video Classification," in Computer Vision and Pattern Recognition, 2015.
[23] J. Shao, C. C. Loy, and X. Wang, "Learning Scene-Independent Group Descriptors for Crowd Understanding," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2016. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=7428859
[24] P. Wang, Y. Cao, C. Shen, L. Liu, and H. T. Shen, "Temporal Pyramid Pooling Based Convolutional Neural Networks for Action Recognition," arXiv, pp. 1–8, 2015.
[25] S. Zha, "Exploiting Image-trained CNN Architectures for Unconstrained Video Classification," BMVC 2015, p. 51, 2015.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.