Published in: Neural Computing and Applications 9/2020

30-11-2019 | Emerging Trends of Applied Neural Computation - E_TRAINCO

Spatiotemporal neural networks for action recognition based on joint loss

Authors: Chao Jing, Ping Wei, Hongbin Sun, Nanning Zheng

Abstract

Action recognition is a challenging and important problem in many fields, such as intelligent robotics and video surveillance. In recent years, deep learning and neural network techniques have been widely applied to action recognition and have attained remarkable results. However, recognizing actions in complicated scenes remains difficult because of varying illumination conditions, similar motions, and background noise. In this paper, we present a spatiotemporal neural network model with a joint loss to recognize human actions from videos. The network comprises two connected substructures. The first is a two-stream network that extracts optical flow and appearance features from each video frame, characterizing the human actions in the spatial dimension. The second is a group of Long Short-Term Memory (LSTM) structures following the spatial network, which describes the temporal and transition information in videos. We present a joint loss function for training the spatiotemporal neural network model; introducing this loss function improves action recognition performance. The proposed method was tested on video samples from two challenging datasets. The experiments demonstrate that our approach outperforms the baseline comparison methods.
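The abstract does not state which terms make up the joint loss, so the following is only a minimal sketch of the general idea of training with a sum of two objectives. It assumes, purely for illustration, a softmax cross-entropy term on the class scores plus a weighted term that pulls a feature vector toward a per-class center. The function names, the `centers` mapping, and the `weight` parameter are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Classification term: negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])


def center_term(feature, center):
    """Assumed auxiliary term: half the squared distance to a class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))


def joint_loss(logits, feature, label, centers, weight=0.1):
    """Hypothetical joint loss: classification term plus weighted auxiliary term."""
    return cross_entropy(logits, label) + weight * center_term(feature, centers[label])


if __name__ == "__main__":
    # Illustrative values only: two classes, two-dimensional features.
    centers = {0: [1.0, 1.0], 1: [0.0, 0.0]}
    print(joint_loss([0.0, 0.0], [1.0, 1.0], 0, centers))
```

In this sketch, `weight` controls how strongly the auxiliary term influences training relative to the classification term; setting it to zero recovers plain cross-entropy.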


Metadata
Title
Spatiotemporal neural networks for action recognition based on joint loss
Authors
Chao Jing
Ping Wei
Hongbin Sun
Nanning Zheng
Publication date
30-11-2019
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 9/2020
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-019-04615-w
