Published in: Multimedia Systems 6/2020

25-07-2020 | Regular Paper

Deep learning-based multi-modal approach using RGB and skeleton sequences for human activity recognition

Authors: Pratishtha Verma, Animesh Sah, Rajeev Srivastava

Abstract

Deep learning techniques have achieved great success in human activity recognition (HAR). In this paper, we propose a HAR technique that combines RGB and skeleton information using a convolutional neural network (ConvNet) and a long short-term memory (LSTM) recurrent neural network (RNN). The proposed method has two parts. First, motion representation images, namely the motion history image (MHI) and the motion energy image (MEI), are created from the RGB videos, and a ConvNet is trained on these images with feature-level fusion. Second, the skeleton data are processed by a proposed algorithm that generates skeleton intensity images for three views (top, front, and side). Each view is first analyzed by a ConvNet that produces a set of feature maps, which are then fused for further analysis. On top of the ConvNet sub-networks, an LSTM is used to exploit temporal dependencies. The softmax scores from these two independent parts are later combined at the decision level. Beyond this HAR approach, the paper also presents a strategy based on cyclic learning rates that builds a multi-modal neural network by training the model only once, making the system more efficient. The suggested approach makes full use of the RGB and skeleton data available from an RGB-D sensor. It has been evaluated on three well-known and challenging multimodal datasets: UTD-MHAD, CAD-60, and NTU RGB+D 120. Results show that the proposed method performs competitively with other state-of-the-art systems.
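The motion representation images named in the abstract, MHI and MEI, are standard constructs from the motion-templates literature. A minimal NumPy sketch (not the authors' implementation; the parameter names `tau` for the history value and `delta` for the frame-difference threshold are illustrative choices) could look like:

```python
import numpy as np

def motion_images(frames, tau=255.0, delta=30.0):
    """Compute a Motion History Image (MHI) and Motion Energy Image (MEI)
    from a list of equally sized grayscale frames.

    MHI: pixels where motion was detected most recently are brightest and
    fade linearly over time. MEI: binary mask of where motion occurred.
    """
    frames = [f.astype(np.float32) for f in frames]
    mhi = np.zeros_like(frames[0])
    for prev, curr in zip(frames, frames[1:]):
        # Binary motion mask from thresholded frame differencing.
        motion = np.abs(curr - prev) > delta
        # Refresh moving pixels to tau; decay the rest by 1, floored at 0.
        mhi = np.where(motion, tau, np.maximum(mhi - 1.0, 0.0))
    # MEI is the binarized MHI: any pixel with remaining motion history.
    mei = (mhi > 0).astype(np.uint8)
    return mhi.astype(np.uint8), mei
```

Images like these collapse a whole clip into single 2D arrays, which is what makes them convenient inputs for an ordinary image ConvNet.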


Metadata
Title
Deep learning-based multi-modal approach using RGB and skeleton sequences for human activity recognition
Authors
Pratishtha Verma
Animesh Sah
Rajeev Srivastava
Publication date
25-07-2020
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 6/2020
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-020-00677-2
