Semantic Image Networks for Human Action Recognition

Authors: Sunder Ali Khowaja, Seok-Lyong Lee

Published in: International Journal of Computer Vision, Issue 2/2020

Published: 22 October 2019

Abstract

In this paper, we propose the use of a semantic image, an improved representation for video analysis, principally in combination with Inception networks. The semantic image is obtained by applying localized sparse segmentation using global clustering prior to approximate rank pooling, which summarizes the motion characteristics in single or multiple images. It incorporates the background information by overlaying a static background from the window onto the subsequent segmented frames. The idea is to improve the action–motion dynamics by focusing on the region that is important for action recognition and by encoding the temporal variances using the frame ranking method. We also propose the sequential combination of Inception-ResNetv2 and a long short-term memory (LSTM) network to leverage the temporal variances for improved recognition performance. Extensive analysis has been carried out on the UCF101 and HMDB51 datasets, which are widely used in action recognition studies. We show that (1) the semantic image generates better activations and converges faster than its original variant, (2) using segmentation prior to approximate rank pooling yields better recognition performance, (3) the use of LSTM leverages the temporal variance information from approximate rank pooling to model the action behavior better than the base network, (4) the proposed representations are adaptive, as they can be used with existing methods such as temporal segment networks and the I3D ImageNet + Kinetics network to improve recognition performance, and (5) the four-stream network architecture pre-trained on ImageNet + Kinetics and fine-tuned using the proposed representation achieves state-of-the-art performance: 99.1% and 83.7% recognition accuracy on UCF101 and HMDB51, respectively.
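For readers unfamiliar with the rank-pooling step the abstract builds on, the following is a minimal sketch (not code from this paper): the commonly used approximation weights frame t of a T-frame clip by the zero-mean coefficient α_t = 2t − T − 1, so that later frames contribute positively and earlier frames negatively, collapsing the temporal order of the clip into a single "dynamic" image. The paper applies this after its segmentation and background-overlay steps; here plain frames are used for illustration.

```python
import numpy as np

def approximate_rank_pooling(frames):
    """Summarize a clip into one image via approximate rank pooling.

    Each frame t (1-indexed) is weighted by alpha_t = 2t - T - 1, a
    zero-mean ramp that encodes temporal order: early frames get
    negative weights, late frames positive ones.
    """
    T = len(frames)
    coeffs = np.array([2 * t - T - 1 for t in range(1, T + 1)], dtype=np.float32)
    # weighted sum over the time axis
    pooled = np.tensordot(coeffs, np.asarray(frames, dtype=np.float32), axes=1)
    # rescale to [0, 255] so the result can be viewed as an image
    pooled -= pooled.min()
    if pooled.max() > 0:
        pooled *= 255.0 / pooled.max()
    return pooled.astype(np.uint8)

# toy clip: 8 frames of 4x4 grayscale noise
clip = [np.random.rand(4, 4) for _ in range(8)]
dyn = approximate_rank_pooling(clip)
print(dyn.shape)  # (4, 4)
```

Because the coefficients sum to zero, static background regions largely cancel while moving regions leave ordered traces, which is why the resulting image can be fed to a standard 2D network such as Inception-ResNetv2.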


Metadata
Title: Semantic Image Networks for Human Action Recognition
Authors: Sunder Ali Khowaja, Seok-Lyong Lee
Publication date: 22 October 2019
Publisher: Springer US
Published in: International Journal of Computer Vision, Issue 2/2020
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI: https://doi.org/10.1007/s11263-019-01248-3
