Published in: International Journal of Computer Vision 11/2023

10.07.2023

HiEve: A Large-Scale Benchmark for Human-Centric Video Analysis in Complex Events

Authors: Weiyao Lin, Huabin Liu, Shizhan Liu, Yuxi Li, Hongkai Xiong, Guojun Qi, Nicu Sebe


Abstract

Along with the development of modern smart cities, human-centric video analysis faces the challenge of understanding diverse and complex events in real scenes. A complex event involves dense crowds, anomalous individuals, or collective behaviors. However, limited by the scale and coverage of existing video datasets, few human analysis approaches have reported their performance on such complex events. To this end, we present a new large-scale dataset with comprehensive annotations, named human-in-events or human-centric video analysis in complex events (HiEve), for understanding human motions, poses, and actions in a variety of realistic events, especially in crowded and complex scenes. It contains a record number of poses (> 1 M), the largest number of action instances (> 56 k) under complex events, and one of the largest collections of long-duration trajectories (average trajectory length > 480 frames). Based on these diverse annotations, we present two simple baselines for action recognition and pose estimation, respectively. They leverage cross-label information during training to enhance feature learning in the corresponding visual tasks. Experiments show that they can boost the performance of existing action recognition and pose estimation pipelines. More importantly, they demonstrate that the wide-ranging annotations in HiEve can improve various video tasks. Furthermore, we conduct extensive experiments to benchmark recent video analysis approaches together with our baseline methods, showing that HiEve is a challenging dataset for human-centric video analysis. We expect the dataset to advance the development of cutting-edge techniques in human-centric analysis and the understanding of complex events. The dataset is available at http://humaninevents.org.
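
The baselines themselves are not detailed in this abstract, but as a rough, hypothetical sketch of what "leveraging cross-label information during training" can look like, the PyTorch snippet below trains a shared backbone jointly on pose heatmaps and action labels, so that each task's supervision shapes the features used by the other. All module shapes, joint counts, and class counts are illustrative assumptions, not values taken from HiEve or the authors' method.

```python
# Minimal sketch (NOT the authors' actual baselines) of cross-label training:
# one shared backbone, two task heads, one joint loss.
import torch
import torch.nn as nn

NUM_JOINTS = 14    # hypothetical keypoint count
NUM_ACTIONS = 14   # hypothetical action-class count

class CrossLabelBaseline(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared convolutional backbone over person crops.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Pose head: one heatmap per joint.
        self.pose_head = nn.Conv2d(128, NUM_JOINTS, 1)
        # Action head: global pooling + linear classifier.
        self.action_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, NUM_ACTIONS)
        )

    def forward(self, x):
        feats = self.backbone(x)  # features shared by both tasks
        return self.pose_head(feats), self.action_head(feats)

model = CrossLabelBaseline()
crops = torch.randn(8, 3, 256, 192)              # batch of person crops
gt_heatmaps = torch.rand(8, NUM_JOINTS, 64, 48)  # dummy pose targets
gt_actions = torch.randint(0, NUM_ACTIONS, (8,)) # dummy action labels

pred_hm, pred_act = model(crops)
# Joint objective: pose supervision regularizes the action features
# and vice versa, which is the essence of cross-label learning.
loss = nn.functional.mse_loss(pred_hm, gt_heatmaps) \
     + nn.functional.cross_entropy(pred_act, gt_actions)
loss.backward()
```

In practice the two loss terms would typically be weighted, and the backbone would be a stronger network (e.g., an HRNet- or ResNet-style model), but the design choice illustrated here is the same: a single feature space supervised by both annotation types.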

Metadata
Title
HiEve: A Large-Scale Benchmark for Human-Centric Video Analysis in Complex Events
Authors
Weiyao Lin
Huabin Liu
Shizhan Liu
Yuxi Li
Hongkai Xiong
Guojun Qi
Nicu Sebe
Publication date
10.07.2023
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 11/2023
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-023-01842-6
