Pattern Analysis and Applications

1 June 2024 | Original Article

A spatio-temporal model for violence detection based on spatial and temporal attention modules and 2D CNNs

Authors: Javad Mahmoodi, Hossein Nezamabadi-pour

Published in: Pattern Analysis and Applications | Issue 2/2024

Abstract

Violence detection is difficult because it requires analyzing video from many security cameras that are installed in different locations and operate continuously. When a violent crime occurs, a system should detect it reliably in real time and immediately alert the surveillance team. Researchers currently use deep learning models to detect violent behavior. Many of these approaches extract spatio-temporal information from video using either 3D Convolutional Neural Networks (CNNs) or multi-stream networks. Despite their success, these techniques require many more parameters than 2D CNNs and have high computational complexity. We therefore present a simple spatio-temporal attention mechanism combined with a 2D CNN for effective violence detection. We propose a Squeeze Temporal Attention block that allows a 2D CNN to learn spatio-temporal features from videos. The block uses squeeze and temporal attention modules to summarize a video stream into three channels. In addition, we introduce spatial attention and feature fusion modules to improve the system's performance. The spatial attention module, the Entropy Spatial Module, uses an entropy filter and frame differences to focus on the spatial regions of the video with more movement. The fusion module places two dense layers in parallel with a 2D CNN to enhance the classifier's performance. As a result, the proposed model achieves higher accuracy than Long Short-Term Memory models, multi-stream networks, and current 3D CNNs.
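The abstract does not give the formulation of the Entropy Spatial Module, so the following is only a minimal sketch of the general idea it names: take the difference between two consecutive frames, apply a local entropy filter to that difference, and scale the result into a mask that is larger where motion is more irregular. Every function name, the window radius, and the peak normalization are assumptions made for illustration; this is not the authors' module.

```python
import math
from collections import Counter


def frame_difference(prev, curr):
    """Absolute per-pixel difference of two grayscale frames (lists of rows)."""
    return [[abs(c - p) for p, c in zip(row_p, row_c)]
            for row_p, row_c in zip(prev, curr)]


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    n = len(values)
    counts = Counter(values)
    return sum((k / n) * math.log2(n / k) for k in counts.values())


def local_entropy(image, radius=1):
    """Entropy of the neighbourhood around each pixel (window clipped at borders)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = shannon_entropy(window)
    return out


def entropy_attention_mask(prev, curr, radius=1):
    """Hypothetical spatial mask: local entropy of the frame difference, scaled to [0, 1]."""
    ent = local_entropy(frame_difference(prev, curr), radius)
    peak = max(max(row) for row in ent)
    if peak == 0:
        # No change between the frames: nothing to attend to.
        return [[0.0] * len(row) for row in ent]
    return [[v / peak for v in row] for row in ent]


if __name__ == "__main__":
    prev = [[0] * 5 for _ in range(5)]
    curr = [row[:] for row in prev]
    curr[0][0] = 9  # a single changed pixel stands in for "movement"
    for row in entropy_attention_mask(prev, curr):
        print(" ".join(f"{v:.2f}" for v in row))
```

In this toy version the mask would be multiplied element-wise with a frame before it reaches the 2D CNN, so static regions are suppressed. A practical implementation would more likely use array operations and a library entropy filter than nested Python lists.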

Title: A spatio-temporal model for violence detection based on spatial and temporal attention modules and 2D CNNs
Authors: Javad Mahmoodi, Hossein Nezamabadi-pour
Publication date: 1 June 2024
Publisher: Springer London
Published in: Pattern Analysis and Applications | Issue 2/2024
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI: https://doi.org/10.1007/s10044-024-01265-0
