Published in: Multimedia Systems 6/2020

25.07.2020 | Regular Paper

Deep learning-based multi-modal approach using RGB and skeleton sequences for human activity recognition

Authors: Pratishtha Verma, Animesh Sah, Rajeev Srivastava


Abstract

Deep learning techniques have achieved great success in human activity recognition (HAR). In this paper, we propose a HAR technique that exploits RGB and skeleton information using a convolutional neural network (ConvNet) and a long short-term memory (LSTM) recurrent neural network. The proposed method has two parts. First, motion-representation images, namely the motion history image (MHI) and the motion energy image (MEI), are computed from the RGB videos, and a ConvNet is trained on these images with feature-level fusion. Second, the skeleton data are transformed by a proposed algorithm into skeleton intensity images for three views (top, front, and side). Each view is first analyzed by a ConvNet that generates a set of feature maps, which are fused for further analysis; an LSTM on top of the ConvNet sub-networks exploits the temporal dependency. The softmax scores of these two independent parts are then combined at the decision level. Beyond this HAR approach, the paper also presents a strategy that uses cyclic learning rates to build the multi-modal neural network in a single training run, making the system more efficient. The approach thus makes full use of the RGB and skeleton data available from an RGB-D sensor. It has been evaluated on three challenging multimodal datasets: UTD-MHAD, CAD-60, and NTU RGB+D 120. Results show that the method compares favorably with other state-of-the-art systems.
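The abstract names three concrete mechanisms: MHI/MEI motion images computed from the RGB frames, decision-level fusion of the two streams' softmax scores, and a cyclic learning-rate schedule used to obtain the multi-modal network in one training run. The Python sketch below illustrates all three under stated assumptions; the paper's actual thresholds, fusion weights, and schedule settings are not reproduced here, so tau, diff_thresh, w, and lr_max are illustrative placeholders, not the authors' values.

import math
import cv2
import numpy as np

def mhi_mei(frames, tau=30, diff_thresh=32):
    # Classical MHI/MEI construction from a list of grayscale frames;
    # tau (history length) and diff_thresh are illustrative defaults.
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, curr in zip(frames, frames[1:]):
        # Binary motion mask via thresholded frame differencing.
        motion = cv2.absdiff(curr, prev) >= diff_thresh
        # Pixels moving now are set to tau; older motion decays by one step.
        mhi = np.where(motion, float(tau), np.maximum(mhi - 1.0, 0.0))
    mei = (mhi > 0).astype(np.uint8) * 255        # MEI: where motion occurred
    mhi_img = (mhi / tau * 255).astype(np.uint8)  # MHI: how recently it occurred
    return mhi_img, mei

def fuse_scores(softmax_rgb, softmax_skel, w=0.5):
    # Decision-level fusion: a weighted average of the two streams'
    # class-probability vectors; w = 0.5 is an assumed weight.
    fused = w * softmax_rgb + (1.0 - w) * softmax_skel
    return int(np.argmax(fused))

def cyclic_lr(iteration, total_iters, num_cycles, lr_max=0.1):
    # Cyclic cosine-annealed learning rate in the spirit of snapshot
    # ensembles: the rate restarts at lr_max each cycle and decays toward
    # zero, so one run passes through num_cycles converged snapshots.
    cycle_len = math.ceil(total_iters / num_cycles)
    t = iteration % cycle_len
    return lr_max / 2.0 * (math.cos(math.pi * t / cycle_len) + 1.0)

For example, calling mhi_mei on the grayscale frames of one clip yields the pair of motion images fed to the RGB-stream ConvNet, while fuse_scores combines the per-class probabilities that the two trained streams produce for a test clip.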


Metadata
Title
Deep learning-based multi-modal approach using RGB and skeleton sequences for human activity recognition
Authors
Pratishtha Verma
Animesh Sah
Rajeev Srivastava
Publication date
25.07.2020
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 6/2020
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-020-00677-2
