Published in: International Journal of Computer Vision 5/2021

02.03.2021

Mimetics: Towards Understanding Human Actions Out of Context

Authors: Philippe Weinzaepfel, Grégory Rogez

Abstract

Recent methods for video action recognition have reached outstanding performance on existing benchmarks. However, they tend to leverage context, such as scenes or objects, instead of focusing on understanding the human action itself. For instance, a tennis court leads to the prediction "playing tennis" irrespective of the actions performed in the video. In contrast, humans have a more complete understanding of actions and can recognize them without context. The best example of out-of-context actions is mime: people can typically recognize mimed actions despite the absence of relevant objects and scenes. In this paper, we propose to benchmark action recognition methods in the absence of context and introduce a novel dataset, Mimetics, consisting of mimed actions for a subset of 50 classes from the Kinetics benchmark. Our experiments show that (a) state-of-the-art 3D convolutional neural networks obtain disappointing results on such videos, highlighting their lack of true understanding of human actions, and (b) models leveraging body language via human pose are less prone to context biases. In particular, we show that a shallow neural network with a single temporal convolution, applied to body pose features transferred to the action recognition task, performs surprisingly well compared to 3D action recognition methods.
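
To make the last point concrete, below is a minimal sketch (in PyTorch) of the kind of shallow pose-based classifier the abstract describes: a single temporal convolution applied over a sequence of per-frame body-pose features, followed by temporal pooling and a linear classifier. The feature dimension, hidden width, kernel size and class count used here are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class ShallowPoseActionNet(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=50, kernel_size=5):
        super().__init__()
        # Single temporal convolution over the per-frame pose feature sequence
        # (all dimensions here are illustrative assumptions, not the paper's exact setup).
        self.temporal_conv = nn.Conv1d(feat_dim, hidden_dim, kernel_size,
                                       padding=kernel_size // 2)
        self.relu = nn.ReLU(inplace=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, pose_feats):
        # pose_feats: (batch, num_frames, feat_dim) pose features extracted per frame
        x = pose_feats.transpose(1, 2)        # (batch, feat_dim, num_frames)
        x = self.relu(self.temporal_conv(x))  # (batch, hidden_dim, num_frames)
        x = x.mean(dim=2)                     # average-pool over time
        return self.classifier(x)             # per-class logits

# Example: 8 clips, 32 frames each, 512-dim pose features per frame.
logits = ShallowPoseActionNet()(torch.randn(8, 32, 512))

The appeal of such a model is that pose features encode body language while discarding most scene and object appearance, which is precisely the context signal the Mimetics benchmark is designed to factor out.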

Metadata
Title
Mimetics: Towards Understanding Human Actions Out of Context
Authors
Philippe Weinzaepfel
Grégory Rogez
Publication date
02.03.2021
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 5/2021
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-021-01446-y
