Published in: International Journal of Computer Vision 5/2021

02.03.2021

Mimetics: Towards Understanding Human Actions Out of Context

Authors: Philippe Weinzaepfel, Grégory Rogez

Abstract

Recent methods for video action recognition have reached outstanding performance on existing benchmarks. However, they tend to leverage context, such as scenes or objects, instead of focusing on understanding the human action itself. For instance, a tennis court leads to the prediction "playing tennis" irrespective of the actions performed in the video. In contrast, humans have a more complete understanding of actions and can recognize them without context. The best example of out-of-context actions is mime: people can typically recognize mimed actions despite the absence of relevant objects and scenes. In this paper, we propose to benchmark action recognition methods in the absence of context and introduce a novel dataset, Mimetics, consisting of mimed actions for a subset of 50 classes from the Kinetics benchmark. Our experiments show that (a) state-of-the-art 3D convolutional neural networks obtain disappointing results on such videos, highlighting their lack of true understanding of human actions, and (b) models leveraging body language via human pose are less prone to context biases. In particular, we show that a shallow neural network with a single temporal convolution, applied to body pose features transferred to the action recognition task, performs surprisingly well compared to 3D action recognition methods.
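
To make the last point concrete, below is a minimal sketch (in PyTorch) of the kind of shallow pose-based classifier the abstract describes: a single temporal convolution applied over a sequence of per-frame body-pose features, followed by temporal pooling and a linear classifier. The feature dimension, hidden width, kernel size and class count used here are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class ShallowPoseActionNet(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=50, kernel_size=5):
        super().__init__()
        # Single temporal convolution over the per-frame pose feature sequence
        # (all dimensions here are illustrative assumptions, not the paper's exact setup).
        self.temporal_conv = nn.Conv1d(feat_dim, hidden_dim, kernel_size,
                                       padding=kernel_size // 2)
        self.relu = nn.ReLU(inplace=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, pose_feats):
        # pose_feats: (batch, num_frames, feat_dim) pose features extracted per frame
        x = pose_feats.transpose(1, 2)        # (batch, feat_dim, num_frames)
        x = self.relu(self.temporal_conv(x))  # (batch, hidden_dim, num_frames)
        x = x.mean(dim=2)                     # average-pool over time
        return self.classifier(x)             # per-class logits

# Example: 8 clips, 32 frames each, 512-dim pose features per frame.
logits = ShallowPoseActionNet()(torch.randn(8, 32, 512))

The appeal of such a model is that pose features encode body language while discarding most scene and object appearance, which is precisely the context signal the Mimetics benchmark is designed to factor out.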

Metadata
Title
Mimetics: Towards Understanding Human Actions Out of Context
Authors
Philippe Weinzaepfel
Grégory Rogez
Publication date
02.03.2021
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 5/2021
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-021-01446-y
