Published in: International Journal of Computer Vision 11/2023

10.07.2023

HiEve: A Large-Scale Benchmark for Human-Centric Video Analysis in Complex Events

Authors: Weiyao Lin, Huabin Liu, Shizhan Liu, Yuxi Li, Hongkai Xiong, Guojun Qi, Nicu Sebe


Abstract

Along with the development of modern smart cities, human-centric video analysis faces the challenge of understanding diverse and complex events in real scenes. A complex event involves dense crowds, anomalous individuals, or collective behaviors. However, limited by the scale and coverage of existing video datasets, few human analysis approaches have reported their performance on such complex events. To this end, we present a new large-scale dataset with comprehensive annotations, named human-in-events or human-centric video analysis in complex events (HiEve), for understanding human motions, poses, and actions in a variety of realistic events, especially in crowded and complex scenes. It contains a record number of poses (> 1 M), the largest number of action instances (> 56 k) under complex events, and one of the largest collections of long-duration trajectories (average trajectory length > 480 frames). Based on these diverse annotations, we present two simple baselines for action recognition and pose estimation, respectively. They leverage cross-label information during training to enhance feature learning in the corresponding visual tasks. Experiments show that they can boost the performance of existing action recognition and pose estimation pipelines. More importantly, they demonstrate that the wide-ranging annotations in HiEve can improve various video tasks. Furthermore, we conduct extensive experiments to benchmark recent video analysis approaches together with our baseline methods, showing that HiEve is a challenging dataset for human-centric video analysis. We expect the dataset to advance the development of cutting-edge techniques in human-centric analysis and the understanding of complex events. The dataset is available at http://humaninevents.org.
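
The baselines themselves are not detailed in this abstract, but as a rough, hypothetical sketch of what "leveraging cross-label information during training" can look like, the PyTorch snippet below trains a shared backbone jointly on pose heatmaps and action labels, so that each task's supervision shapes the features used by the other. All module shapes, joint counts, and class counts are illustrative assumptions, not values taken from HiEve or the authors' method.

```python
# Minimal sketch (NOT the authors' actual baselines) of cross-label training:
# one shared backbone, two task heads, one joint loss.
import torch
import torch.nn as nn

NUM_JOINTS = 14    # hypothetical keypoint count
NUM_ACTIONS = 14   # hypothetical action-class count

class CrossLabelBaseline(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared convolutional backbone over person crops.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Pose head: one heatmap per joint.
        self.pose_head = nn.Conv2d(128, NUM_JOINTS, 1)
        # Action head: global pooling + linear classifier.
        self.action_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, NUM_ACTIONS)
        )

    def forward(self, x):
        feats = self.backbone(x)  # features shared by both tasks
        return self.pose_head(feats), self.action_head(feats)

model = CrossLabelBaseline()
crops = torch.randn(8, 3, 256, 192)              # batch of person crops
gt_heatmaps = torch.rand(8, NUM_JOINTS, 64, 48)  # dummy pose targets
gt_actions = torch.randint(0, NUM_ACTIONS, (8,)) # dummy action labels

pred_hm, pred_act = model(crops)
# Joint objective: pose supervision regularizes the action features
# and vice versa, which is the essence of cross-label learning.
loss = nn.functional.mse_loss(pred_hm, gt_heatmaps) \
     + nn.functional.cross_entropy(pred_act, gt_actions)
loss.backward()
```

In practice the two loss terms would typically be weighted, and the backbone would be a stronger network (e.g., an HRNet- or ResNet-style model), but the design choice illustrated here is the same: a single feature space supervised by both annotation types.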

Metadata
Title
HiEve: A Large-Scale Benchmark for Human-Centric Video Analysis in Complex Events
Authors
Weiyao Lin
Huabin Liu
Shizhan Liu
Yuxi Li
Hongkai Xiong
Guojun Qi
Nicu Sebe
Publication date
10.07.2023
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 11/2023
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-023-01842-6
