Published in: Multimedia Systems 6/2020

25-07-2020 | Regular Paper

Deep learning-based multi-modal approach using RGB and skeleton sequences for human activity recognition

Authors: Pratishtha Verma, Animesh Sah, Rajeev Srivastava

Abstract

Deep learning techniques have achieved great success in human activity recognition (HAR). In this paper, we propose a HAR technique that combines RGB and skeleton information using a convolutional neural network (ConvNet) and a long short-term memory (LSTM) recurrent neural network (RNN). The proposed method has two parts. First, motion representation images, namely the motion history image (MHI) and the motion energy image (MEI), are created from the RGB videos, and a ConvNet is trained on these images with feature-level fusion. Second, the skeleton data are processed by a proposed algorithm that generates skeleton intensity images for three views (top, front, and side). Each view is first analyzed by a ConvNet that produces a set of feature maps, which are then fused for further analysis. On top of the ConvNet sub-networks, an LSTM is used to exploit temporal dependencies. The softmax scores from these two independent parts are later combined at the decision level. Beyond this HAR approach, the paper also presents a strategy based on cyclic learning rates that builds a multi-modal neural network by training the model only once, making the system more efficient. The suggested approach makes full use of the RGB and skeleton data available from an RGB-D sensor. It has been evaluated on three well-known and challenging multimodal datasets: UTD-MHAD, CAD-60, and NTU RGB+D 120. Results show that the proposed method performs competitively with other state-of-the-art systems.
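The motion representation images named in the abstract, MHI and MEI, are standard constructs from the motion-templates literature. A minimal NumPy sketch (not the authors' implementation; the parameter names `tau` for the history value and `delta` for the frame-difference threshold are illustrative choices) could look like:

```python
import numpy as np

def motion_images(frames, tau=255.0, delta=30.0):
    """Compute a Motion History Image (MHI) and Motion Energy Image (MEI)
    from a list of equally sized grayscale frames.

    MHI: pixels where motion was detected most recently are brightest and
    fade linearly over time. MEI: binary mask of where motion occurred.
    """
    frames = [f.astype(np.float32) for f in frames]
    mhi = np.zeros_like(frames[0])
    for prev, curr in zip(frames, frames[1:]):
        # Binary motion mask from thresholded frame differencing.
        motion = np.abs(curr - prev) > delta
        # Refresh moving pixels to tau; decay the rest by 1, floored at 0.
        mhi = np.where(motion, tau, np.maximum(mhi - 1.0, 0.0))
    # MEI is the binarized MHI: any pixel with remaining motion history.
    mei = (mhi > 0).astype(np.uint8)
    return mhi.astype(np.uint8), mei
```

Images like these collapse a whole clip into single 2D arrays, which is what makes them convenient inputs for an ordinary image ConvNet.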


Metadata
Title
Deep learning-based multi-modal approach using RGB and skeleton sequences for human activity recognition
Authors
Pratishtha Verma
Animesh Sah
Rajeev Srivastava
Publication date
25-07-2020
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 6/2020
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-020-00677-2
