Abstract
Frequent vehicle thefts severely harm public safety. Thanks to surveillance equipment distributed throughout a city, a large number of videos that can be used to recognize vehicle theft are available. However, vehicle theft involves a small criminal target and subtle movements, so existing action recognition algorithms cannot be applied directly to its recognition. In this paper, we propose a vehicle theft recognition method based on a spatiotemporal attention mechanism. First, a vehicle theft database is established by collecting videos from the Internet and from an existing dataset. Then, we build a vehicle theft recognition network and introduce a spatiotemporal attention mechanism for extracting the spatiotemporal features of theft. By learning adaptive feature weights, the features that contribute most to recognition are emphasized. Experiments show that the proposed algorithm achieves 97.04% accuracy on the collected vehicle theft database.
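The core idea of the abstract, learning adaptive weights so that the most discriminative features are emphasized, can be illustrated with a minimal sketch. This is not the authors' network; it is a hypothetical pure-Python example of temporal attention, where each frame's feature vector is summarized into a score, the scores are passed through a softmax to obtain adaptive weights, and the frame features are reweighted accordingly:

```python
import math

def softmax(xs):
    # Numerically stable softmax: shift by the maximum before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def temporal_attention(features):
    """features: per-frame feature vectors, shape (T frames x C channels).

    Returns the frames reweighted by adaptive attention weights, so
    frames with stronger average activation contribute more.
    """
    scores = [sum(f) / len(f) for f in features]   # one summary score per frame
    weights = softmax(scores)                       # adaptive weights, sum to 1
    return [[w * x for x in f] for w, f in zip(weights, features)]

# Toy input: three frames with two feature channels each; the second
# frame has the strongest activation and thus receives the largest weight.
frames = [[0.1, 0.2], [0.9, 1.1], [0.3, 0.2]]
out = temporal_attention(frames)
```

In a real network the summary scores would come from learned layers (e.g. pooling followed by a small fully connected network, as in squeeze-and-excitation-style attention), and an analogous weighting would be applied over spatial positions as well.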
Acknowledgements
This research work was supported in part by the National Science Foundation of China (61671365, U1903213), and the Key Research and Development Program of Shaanxi Province (2020KW-009).
Cite this article
He, L., Wen, S., Wang, L. et al. Vehicle theft recognition from surveillance video based on spatiotemporal attention. Appl Intell 51, 2128–2143 (2021). https://doi.org/10.1007/s10489-020-01933-8