Skip to main content
Top

2019 | OriginalPaper | Chapter

Natural Language Description of Surveillance Events

Authors : Sk. Arif Ahmed, Debi Prosad Dogra, Samarjit Kar, Partha Pratim Roy

Published in: Information Technology and Applied Mathematics

Publisher: Springer Singapore

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

This paper presents a novel method to represent hours of surveillance video in a pattern-based text log. We present a tag and template-based technique that automatically generates natural language descriptions of surveillance events. We combine the output of some of the existing object tracker, deep learning guided object and action classifiers, and graph-based scene knowledge to assign hierarchical tags and generate natural language description of surveillance events. Unlike some state-of-the-art image and short video descriptor methods, our approach can describe videos, specifically surveillance videos by combining frame-level, temporal-level, and behavior-level target tags/features. We evaluate our method against two baseline video descriptors, and our analysis suggests that supervised scene knowledge and template can improve video descriptions, specially in surveillance videos.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Aradhye, H., Toderici, G., Yagnik, J.: Video2text: learning to annotate video content. In: IEEE International Conference on Data Mining Workshops, 2009. ICDMW’09, pp. 144–151. IEEE (2009) Aradhye, H., Toderici, G., Yagnik, J.: Video2text: learning to annotate video content. In: IEEE International Conference on Data Mining Workshops, 2009. ICDMW’09, pp. 144–151. IEEE (2009)
2.
go back to reference Chen, X., Lawrence, Z.C.: Mind’s eye: a recurrent visual representation for image caption generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2422–2431 (2015) Chen, X., Lawrence, Z.C.: Mind’s eye: a recurrent visual representation for image caption generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2422–2431 (2015)
3.
go back to reference Denkowski, M., Lavie, A.: Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380 (2014) Denkowski, M., Lavie, A.: Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380 (2014)
4.
go back to reference Dogra, D.P., Ahmed, A., Bhaskar, H.: Smart video summarization using mealy machine-based trajectory modelling for surveillance applications. Multimed. Tools Appl. 75(11), 6373–6401 (2016)CrossRef Dogra, D.P., Ahmed, A., Bhaskar, H.: Smart video summarization using mealy machine-based trajectory modelling for surveillance applications. Multimed. Tools Appl. 75(11), 6373–6401 (2016)CrossRef
5.
go back to reference Donahue, J., Anne, H.L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634 (2015) Donahue, J., Anne, H.L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634 (2015)
6.
go back to reference Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T., Saenko, K.: Youtube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2712–2719 (2013) Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T., Saenko, K.: Youtube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2712–2719 (2013)
7.
go back to reference Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 583–596 (2015)CrossRef Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 583–596 (2015)CrossRef
8.
go back to reference Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a ranking task: data, models and evaluation metrics. J. Artif. Intell. Res. 47, 853–899 (2013)MathSciNetMATH Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a ranking task: data, models and evaluation metrics. J. Artif. Intell. Res. 47, 853–899 (2013)MathSciNetMATH
9.
go back to reference Huang, H., Lu, Y., Zhang, F., Sun, S.: A multi-modal clustering method for web videos. In: International Conference on Trustworthy Computing and Services, pp. 163–169. Springer (2012)CrossRef Huang, H., Lu, Y., Zhang, F., Sun, S.: A multi-modal clustering method for web videos. In: International Conference on Trustworthy Computing and Services, pp. 163–169. Springer (2012)CrossRef
10.
go back to reference Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137 (2015) Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137 (2015)
11.
go back to reference Kiros, R., Salakhutdinov, R., Zemel, R.S.: Unifying Visual-semantic Embeddings with Multimodal Neural Language Models (2014). arXiv:1411.2539 Kiros, R., Salakhutdinov, R., Zemel, R.S.: Unifying Visual-semantic Embeddings with Multimodal Neural Language Models (2014). arXiv:​1411.​2539
12.
go back to reference Krishnamoorthy, N., Malkarnenkar, G., Mooney, R.J., Saenko, K., Guadarrama, S.: Generating natural-language video descriptions using text-mined knowledge. In: AAAI, vol. 1, p. 2 (2013) Krishnamoorthy, N., Malkarnenkar, G., Mooney, R.J., Saenko, K., Guadarrama, S.: Generating natural-language video descriptions using text-mined knowledge. In: AAAI, vol. 1, p. 2 (2013)
13.
go back to reference Kuznetsova, P., Ordonez, V., Berg, T.L., Choi, Y.: Treetalk: composition and compression of trees for image descriptions. TACL 2(10), 351–362 (2014) Kuznetsova, P., Ordonez, V., Berg, T.L., Choi, Y.: Treetalk: composition and compression of trees for image descriptions. TACL 2(10), 351–362 (2014)
14.
go back to reference Oh, S., Hoogs, A., Perera, A., Cuntoor, N., Chen, C.C., Lee, J.T., Mukherjee, S., Aggarwal, J., Lee, H., Davis, L., et al.: A large-scale benchmark dataset for event recognition in surveillance video. In: IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 3153–3160. IEEE (2011) Oh, S., Hoogs, A., Perera, A., Cuntoor, N., Chen, C.C., Lee, J.T., Mukherjee, S., Aggarwal, J., Lee, H., Davis, L., et al.: A large-scale benchmark dataset for event recognition in surveillance video. In: IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 3153–3160. IEEE (2011)
15.
go back to reference Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics (2002) Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics (2002)
16.
go back to reference Rohrbach, A., Rohrbach, M., Tandon, N., Schiele, B.: A dataset for movie description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3202–3212 (2015) Rohrbach, A., Rohrbach, M., Tandon, N., Schiele, B.: A dataset for movie description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3202–3212 (2015)
17.
go back to reference Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., Schiele, B.: Translating video content to natural language descriptions. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 433–440 (2013) Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., Schiele, B.: Translating video content to natural language descriptions. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 433–440 (2013)
18.
go back to reference Vedantam, R., Lawrence, Zitnick, C., Parikh, D.: Cider: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015) Vedantam, R., Lawrence, Zitnick, C., Parikh, D.: Cider: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)
19.
go back to reference Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K.: Sequence to sequence-video to text. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4534–4542 (2015) Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K.: Sequence to sequence-video to text. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4534–4542 (2015)
20.
go back to reference Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., Saenko, K.: Translating Videos to Natural Language Using Deep Recurrent Neural Networks (2014). arXiv:1412.4729 Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., Saenko, K.: Translating Videos to Natural Language Using Deep Recurrent Neural Networks (2014). arXiv:​1412.​4729
21.
go back to reference Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015) Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)
22.
go back to reference Wei, S., Zhao, Y., Zhu, Z., Liu, N.: Multimodal fusion for video searchreranking. IEEE Trans. Knowl. Data Eng. 22(8), 1191–1199 (2010)CrossRef Wei, S., Zhao, Y., Zhu, Z., Liu, N.: Multimodal fusion for video searchreranking. IEEE Trans. Knowl. Data Eng. 22(8), 1191–1199 (2010)CrossRef
23.
go back to reference Welch, G., Bishop, G.: An introduction to the Kalman filter. In: Annual Conference Computer Graphics Interactions Technology, pp. 12–17. ACM (2001) Welch, G., Bishop, G.: An introduction to the Kalman filter. In: Annual Conference Computer Graphics Interactions Technology, pp. 12–17. ACM (2001)
24.
go back to reference Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., Courville, A.: Describing videos by exploiting temporal structure. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4507–4515 (2015) Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., Courville, A.: Describing videos by exploiting temporal structure. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4507–4515 (2015)
25.
go back to reference Zivkovic, Z.: Improved adaptive Gaussian mixture model for background subtraction. In: Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004, vol. 2, pp. 28–31. IEEE (2004) Zivkovic, Z.: Improved adaptive Gaussian mixture model for background subtraction. In: Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004, vol. 2, pp. 28–31. IEEE (2004)
Metadata
Title
Natural Language Description of Surveillance Events
Authors
Sk. Arif Ahmed
Debi Prosad Dogra
Samarjit Kar
Partha Pratim Roy
Copyright Year
2019
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-10-7590-2_10

Premium Partner