Published in: Pattern Analysis and Applications 4/2019

24.07.2018 | Original Article

Human action recognition in videos with articulated pose information by deep networks

Authors: M. Farrajota, João M. F. Rodrigues, J. M. H. du Buf



Abstract

Action recognition is of great importance in understanding human motion from video. It is an important topic in computer vision due to its many applications, such as video surveillance, human–machine interaction and video retrieval. One key problem is to automatically recognize low-level actions and high-level activities of interest. This paper proposes a way to cope with low-level actions by combining information about human body joints to aid action recognition. This is achieved by using high-level features computed by a convolutional neural network, pre-trained on ImageNet, together with articulated body joints as low-level features. These features are then fed to a Long Short-Term Memory network to learn the temporal dependencies of an action. For pose prediction, we focus on articulated relations between body joints. We employ a series of residual auto-encoders to produce multiple predictions, which are then combined to provide a likelihood map of body joints. In the network topology, features are processed across all scales, capturing the various spatial relationships associated with the body. Repeated bottom-up and top-down processing with intermediate supervision of each auto-encoder network is applied. We demonstrate state-of-the-art results on the popular FLIC, LSP and UCF Sports datasets.
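The pipeline described in the abstract — per-frame CNN features concatenated with articulated joint coordinates, fed to an LSTM whose final state is classified into an action — can be illustrated with a minimal NumPy sketch. This is not the authors' actual architecture (which uses a pre-trained ImageNet CNN and stacked residual auto-encoders for pose); all dimensions, names and the toy LSTM cell here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MinimalLSTMCell:
    """Toy single-layer LSTM cell (hypothetical, for illustration only)."""
    def __init__(self, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        # One stacked weight matrix for the four gates (input, forget, cell, output).
        self.W = rng.normal(0, 0.1, (4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)        # update cell state
        h = o * np.tanh(c)                # new hidden state
        return h, c

def classify_sequence(cnn_feats, joint_coords, lstm, W_out):
    """cnn_feats: (T, D) per-frame CNN features; joint_coords: (T, 2J) pose (x, y) pairs."""
    h = np.zeros(lstm.hidden_dim)
    c = np.zeros(lstm.hidden_dim)
    for t in range(cnn_feats.shape[0]):
        # High-level appearance features and low-level joint positions are concatenated.
        x = np.concatenate([cnn_feats[t], joint_coords[t]])
        h, c = lstm.step(x, h, c)
    logits = W_out @ h
    return np.exp(logits) / np.exp(logits).sum()  # softmax over action classes

# Toy example: 16 frames, 512-D CNN features, 13 joints, 10 action classes.
T, D, J, H, K = 16, 512, 13, 128, 10
lstm = MinimalLSTMCell(D + 2 * J, H)
W_out = rng.normal(0, 0.1, (K, H))
probs = classify_sequence(rng.normal(size=(T, D)), rng.normal(size=(T, 2 * J)), lstm, W_out)
```

The sketch only conveys the data flow: in the paper, the joint coordinates themselves come from the residual auto-encoder stages, whose per-stage predictions are combined into a likelihood map per joint before the temporal model sees them.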


Metadata
Title
Human action recognition in videos with articulated pose information by deep networks
Authors
M. Farrajota
João M. F. Rodrigues
J. M. H. du Buf
Publication date
24.07.2018
Publisher
Springer London
Published in
Pattern Analysis and Applications / Issue 4/2019
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI
https://doi.org/10.1007/s10044-018-0727-y
