Abstract
Frequent vehicle thefts severely harm public safety. Thanks to surveillance equipment distributed throughout a city, a large number of videos that can be used to recognize vehicle theft are available. However, vehicle theft involves a small criminal target and subtle movements, so existing action recognition algorithms cannot be applied directly to its recognition. In this paper, we propose a vehicle theft recognition method based on a spatiotemporal attention mechanism. First, a vehicle theft database is established by collecting videos from the Internet and from an existing dataset. Then, we build a vehicle theft recognition network and introduce a spatiotemporal attention mechanism for extracting the spatiotemporal features of theft. By learning adaptive feature weights, the features that contribute most to recognition are emphasized. Experiments show that the proposed algorithm achieves 97.04% accuracy on the collected vehicle theft database.
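The core idea of the abstract, learning adaptive weights so that the most discriminative features are emphasized, can be illustrated with a minimal sketch. This is not the authors' network; it is a hypothetical pure-Python example of temporal attention, where each frame's feature vector is summarized into a score, the scores are passed through a softmax to obtain adaptive weights, and the frame features are reweighted accordingly:

```python
import math

def softmax(xs):
    # Numerically stable softmax: shift by the maximum before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def temporal_attention(features):
    """features: per-frame feature vectors, shape (T frames x C channels).

    Returns the frames reweighted by adaptive attention weights, so
    frames with stronger average activation contribute more.
    """
    scores = [sum(f) / len(f) for f in features]   # one summary score per frame
    weights = softmax(scores)                       # adaptive weights, sum to 1
    return [[w * x for x in f] for w, f in zip(weights, features)]

# Toy input: three frames with two feature channels each; the second
# frame has the strongest activation and thus receives the largest weight.
frames = [[0.1, 0.2], [0.9, 1.1], [0.3, 0.2]]
out = temporal_attention(frames)
```

In a real network the summary scores would come from learned layers (e.g. pooling followed by a small fully connected network, as in squeeze-and-excitation-style attention), and an analogous weighting would be applied over spatial positions as well.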
Acknowledgements
This research work was supported in part by the National Science Foundation of China (61671365, U1903213), and the Key Research and Development Program of Shaanxi Province (2020KW-009).
Cite this article
He, L., Wen, S., Wang, L. et al. Vehicle theft recognition from surveillance video based on spatiotemporal attention. Appl Intell 51, 2128–2143 (2021). https://doi.org/10.1007/s10489-020-01933-8