Published in: Neural Computing and Applications 9/2020

30.11.2019 | Emerging Trends of Applied Neural Computation - E_TRAINCO

Spatiotemporal neural networks for action recognition based on joint loss

Authors: Chao Jing, Ping Wei, Hongbin Sun, Nanning Zheng



Abstract

Action recognition is a challenging and important problem in many significant fields, such as intelligent robotics and video surveillance. In recent years, deep learning and neural network techniques have been widely applied to action recognition and have attained remarkable results. However, recognizing actions in complicated scenes remains difficult because of varying illumination conditions, similar motions, and background noise. In this paper, we present a spatiotemporal neural network model with a joint loss to recognize human actions in videos. The network comprises two connected substructures. The first is a two-stream network that extracts optical flow and appearance features from each video frame, characterizing the human actions in the spatial dimension. The second is a group of Long Short-Term Memory (LSTM) structures following the spatial network, which capture the temporal and transition information in the videos. We also present a joint loss function for training the spatiotemporal model; introducing this loss function improves action recognition performance. The proposed method was tested on video samples from two challenging datasets, and the experiments demonstrate that our approach outperforms the baseline comparison methods.
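The abstract does not spell out the form of the joint loss. A common joint formulation for recognition networks combines a softmax cross-entropy term with a feature-clustering (center-loss-style) term weighted by a scalar. The sketch below is a minimal pure-Python illustration under that assumption; the function names and the weighting parameter `lam` are hypothetical, not taken from the paper.

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of raw scores
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, label):
    # negative log-probability of the ground-truth class
    return -math.log(softmax(logits)[label])

def center_loss(feature, center):
    # half the squared Euclidean distance from a feature
    # vector to its class center (pulls features together)
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))

def joint_loss(logits, label, feature, centers, lam=0.1):
    # classification term plus a weighted clustering term
    return cross_entropy(logits, label) + lam * center_loss(feature, centers[label])
```

When the feature coincides with its class center, the clustering term vanishes and the joint loss reduces to plain cross-entropy; increasing `lam` trades classification accuracy against tighter per-class feature clusters.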


Metadata
Title
Spatiotemporal neural networks for action recognition based on joint loss
Authors
Chao Jing
Ping Wei
Hongbin Sun
Nanning Zheng
Publication date
30.11.2019
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 9/2020
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-019-04615-w
