Multimedia Systems

20-08-2020 | Regular Paper

Integrating Gaussian mixture model and dilated residual network for action recognition in videos

Authors: Ming Fang, Xiaoying Bai, Jianwei Zhao, Fengqin Yang, Chih-Cheng Hung, Shuhua Liu

Published in: Multimedia Systems | Issue 6/2020

Abstract

Action recognition in videos is an important application of computer vision. In recent years, the two-stream architecture has made significant progress in action recognition, but it has not systematically explored spatial–temporal features. This paper therefore proposes GD-RN, an integrated approach that combines a Gaussian mixture model (GMM) with a dilated-convolution residual network for action recognition. The method uses ResNet-101 as both the spatial-stream and the temporal-stream ConvNet. In the spatial stream, the action video is first passed through the GMM for background subtraction, and the resulting video, in which the action profile is marked, is then sent to ResNet-101 for identification and classification. Compared with the baseline, in which the ConvNet takes the original RGB image as input, this reduces both the complexity of the video background and the amount of computation needed to learn spatial information. In the temporal stream, stacked optical flow images are fed to a ResNet-101 augmented with dilated convolution, which expands the convolutional receptive field without lowering the resolution of the optical flow image and thereby improves classification accuracy. The two ConvNets of the GD-RN network learn spatial and temporal features independently; the network is then fine-tuned and the spatio-temporal features are fused to obtain the final action recognition result. The proposed method is tested on the challenging UCF101 and HMDB51 datasets, where it reaches accuracies of 91.3% and 62.4%, respectively, showing that it is competitive.
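The abstract's claim that dilated convolution enlarges the receptive field without lowering resolution can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the abstract does not give the ResNet-101 layer configuration, so the three-layer stacks, the kernel size of 3, and the function names `receptive_field` and `dilated_conv1d` are assumptions chosen purely for illustration, and the example is one-dimensional for brevity.

```python
def receptive_field(layers):
    """Return (receptive_field, total_stride) for a stack of conv layers.

    `layers` is a list of (kernel_size, stride, dilation) tuples.
    Illustrative helper, not taken from the paper.
    """
    rf, jump = 1, 1
    for kernel_size, stride, dilation in layers:
        # A dilated kernel spans dilation * (kernel_size - 1) + 1 inputs.
        span = dilation * (kernel_size - 1) + 1
        rf += (span - 1) * jump
        jump *= stride  # striding is what reduces output resolution
    return rf, jump


def dilated_conv1d(signal, kernel, dilation=1):
    """Stride-1, zero-padded 1D dilated cross-correlation ('same' size)."""
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel size")
    pad = dilation * (k - 1) // 2
    padded = [0.0] * pad + [float(x) for x in signal] + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]


# Hypothetical three-layer stacks, all with kernel size 3.
plain = [(3, 1, 1), (3, 1, 1), (3, 1, 1)]      # no dilation, no stride
strided = [(3, 2, 1), (3, 2, 1), (3, 2, 1)]    # stride 2 in every layer
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]    # dilation 1, 2, 4

rf_plain = receptive_field(plain)      # (7, 1)
rf_strided = receptive_field(strided)  # (15, 8): large field, 8x downsampling
rf_dilated = receptive_field(dilated)  # (15, 1): same field, full resolution
```

In this sketch the dilated stack reaches the same 15-sample receptive field as the strided stack, but its total stride stays at 1, so the output keeps the input resolution; `dilated_conv1d` shows the same property on a concrete signal, returning as many samples as it receives.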

Title: Integrating Gaussian mixture model and dilated residual network for action recognition in videos
Authors: Ming Fang, Xiaoying Bai, Jianwei Zhao, Fengqin Yang, Chih-Cheng Hung, Shuhua Liu
Publication date: 20-08-2020
Publisher: Springer Berlin Heidelberg
Published in: Multimedia Systems / Issue 6/2020
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI: https://doi.org/10.1007/s00530-020-00683-4
