2016 | Original Paper | Book Chapter

Multi-stream Deep Networks for Person to Person Violence Detection in Videos

Authors: Zhihong Dong, Jie Qin, Yunhong Wang

Published in: Pattern Recognition

Publisher: Springer Singapore

Abstract

Violence detection in videos has numerous applications, ranging from parental control and child protection to multimedia filtering and retrieval. A number of approaches have been proposed to detect vital clues for violent actions, most of which employ trajectory-based action recognition techniques. However, such methods model only the general characteristics of human actions and therefore cannot capture the specific high-order information of violent actions, which are typically intense and correlated with specific scenes. In this paper, we propose a novel framework, multi-stream deep convolutional neural networks, for person-to-person violence detection in videos. In addition to the conventional spatial and temporal streams, we develop an acceleration stream to capture the intense motion information usually involved in violent actions. Moreover, we propose a simple and effective score-level fusion strategy to integrate the multi-stream information. We demonstrate the effectiveness of our method on a standard violence dataset, where extensive experimental results show its superiority over state-of-the-art methods.
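The abstract does not specify how the acceleration stream or the score-level fusion is computed; one plausible reading is that acceleration fields are temporal differences of consecutive optical-flow fields, and that fusion is a weighted average of per-stream softmax scores. The sketch below illustrates that interpretation only (the function names, uniform weights, and flow-difference definition are assumptions, not the authors' specification):

```python
import numpy as np

def acceleration_maps(flows):
    """One plausible acceleration stream input: frame-to-frame
    differences of dense optical-flow fields (assumption, not
    the paper's stated definition)."""
    return [flows[i + 1] - flows[i] for i in range(len(flows) - 1)]

def fuse_scores(stream_scores, weights=None):
    """Score-level fusion: weighted average of per-stream
    softmax score vectors (uniform weights by default)."""
    stream_scores = np.asarray(stream_scores, dtype=float)
    if weights is None:
        weights = np.ones(len(stream_scores)) / len(stream_scores)
    return np.average(stream_scores, axis=0, weights=weights)

# Hypothetical softmax scores over {non-violent, violent}
spatial      = [0.4, 0.6]
temporal     = [0.3, 0.7]
acceleration = [0.2, 0.8]
fused = fuse_scores([spatial, temporal, acceleration])
print(fused.argmax())  # -> 1 (the "violent" class)

# Acceleration maps from two toy flow fields (H x W x 2)
flows = [np.zeros((2, 2, 2)), np.ones((2, 2, 2))]
acc = acceleration_maps(flows)
```

In practice each stream would be a separately trained CNN whose scores are fused at test time; the averaging above is the simplest such strategy.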


Metadata
Title
Multi-stream Deep Networks for Person to Person Violence Detection in Videos
Authors
Zhihong Dong
Jie Qin
Yunhong Wang
Copyright year
2016
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-10-3002-4_43