Published in: International Journal of Computer Vision 2-4/2018

4 October 2016

Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video

Written by: Lionel Pigou, Aäron van den Oord, Sander Dieleman, Mieke Van Herreweghe, Joni Dambre

Abstract

Recent studies have demonstrated the power of recurrent neural networks for machine translation, image captioning and speech recognition. For the task of capturing temporal structure in video, however, numerous research questions remain open. Current research suggests using a simple temporal feature pooling strategy to take into account the temporal aspect of video. We demonstrate that this method is not sufficient for gesture recognition, where temporal information is more discriminative than in general video classification tasks. We explore deep architectures for gesture recognition in video and propose a new end-to-end trainable neural network architecture incorporating temporal convolutions and bidirectional recurrence. Our main contributions are twofold: first, we show that recurrence is crucial for this task; second, we show that adding temporal convolutions leads to significant improvements. We evaluate the different approaches on the Montalbano gesture recognition dataset, where we achieve state-of-the-art results.
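The contrast drawn in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: it is a minimal pure-Python illustration, assuming one scalar feature per frame, fixed hand-picked weights instead of learned ones, and hypothetical function names (`mean_pool`, `temporal_conv`, `rnn_pass`, `bidirectional`, `sequence_features`). It shows only why averaging over time discards frame order, while a temporal convolution followed by a bidirectional recurrent pass does not.

```python
import math


def mean_pool(frames):
    """Temporal mean pooling: average the per-frame feature vectors over time."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def temporal_conv(frames, kernel):
    """1D convolution along the time axis with zero padding, per feature.

    Implemented as cross-correlation, as is usual in neural network code.
    """
    k = len(kernel)
    pad = k // 2
    dim = len(frames[0])
    zero = [0.0] * dim
    padded = [zero] * pad + list(frames) + [zero] * pad
    return [
        [sum(kernel[j] * padded[t + j][i] for j in range(k)) for i in range(dim)]
        for t in range(len(frames))
    ]


def rnn_pass(frames, w_in, w_rec):
    """Simple tanh recurrence with scalar weights (assumed, not learned)."""
    h = [0.0] * len(frames[0])
    states = []
    for f in frames:
        h = [math.tanh(w_in * x + w_rec * prev) for x, prev in zip(f, h)]
        states.append(h)
    return states


def bidirectional(frames):
    """Run one pass forward and one backward in time, concatenate per step."""
    fwd = rnn_pass(frames, w_in=1.0, w_rec=0.5)
    bwd = rnn_pass(frames[::-1], w_in=1.0, w_rec=0.3)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


def sequence_features(frames):
    """Temporal convolution (a frame-difference kernel) then bidirectional recurrence."""
    return bidirectional(temporal_conv(frames, [-1.0, 0.0, 1.0]))


# A toy "gesture": one feature per frame, increasing over time.
gesture = [[0.0], [1.0], [2.0], [3.0]]
reversed_gesture = gesture[::-1]

pooled_same = mean_pool(gesture) == mean_pool(reversed_gesture)
recurrent_same = sequence_features(gesture) == sequence_features(reversed_gesture)
print("pooling gives same output for both orders:", pooled_same)
print("conv + bidirectional recurrence gives same output:", recurrent_same)
```

In this sketch the pooled vector is identical for the gesture and its time-reversed copy, so a classifier on top of it could not tell the two apart, whereas the per-step outputs of the convolution plus bidirectional pass differ.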

Metadata
Title
Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video
Written by
Lionel Pigou
Aäron van den Oord
Sander Dieleman
Mieke Van Herreweghe
Joni Dambre
Publication date
4 October 2016
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 2-4/2018
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-016-0957-7
