Published in: International Journal of Computer Vision 4/2024

01.11.2023

Local Compressed Video Stream Learning for Generic Event Boundary Detection

Authors: Libo Zhang, Xin Gu, Congcong Li, Tiejian Luo, Heng Fan

Abstract

Generic event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks. Existing methods typically require video frames to be fully decoded before being fed into the network; the decoded frames contain significant spatio-temporal redundancy and demand considerable computational power and storage space. To remedy these issues, we propose a novel compressed video representation learning method for event boundary detection that is fully end-to-end and leverages rich information in the compressed domain, i.e., RGB frames, motion vectors, residuals, and the internal group-of-pictures (GOP) structure, without fully decoding the video. Specifically, we use lightweight ConvNets to extract features of the P-frames in each GOP, and a spatial-channel attention module (SCAM) is designed to refine the P-frame representations based on the compressed information with bidirectional information flow. To learn a suitable representation for boundary detection, we construct a local frames bag for each candidate frame and use a long short-term memory (LSTM) module to capture temporal relationships. We then compute frame differences via group similarities in the temporal domain, applying this module only within a local window, which is critical for event boundary detection. Finally, a simple classifier determines the event boundaries of a video sequence based on the learned feature representation. To remedy the ambiguity of annotations and speed up training, we preprocess the ground-truth event boundaries with a Gaussian kernel. Extensive experiments conducted on the Kinetics-GEBD and TAPOS datasets demonstrate that the proposed method achieves considerable improvements compared to the previous end-to-end approach while running at the same speed. The code is available at https://github.com/GX77/LCVSL.
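The Gaussian-kernel preprocessing of ground-truth boundaries mentioned above can be made concrete with a short sketch: each annotated boundary frame is expanded into a soft per-frame target by placing a Gaussian centered at that frame. This is a minimal illustration under stated assumptions, not the released implementation; the function name, the per-frame maximum aggregation, and the sigma value are illustrative choices.

    import numpy as np

    def gaussian_boundary_targets(boundary_frames, num_frames, sigma=1.0):
        # Soft per-frame targets: place a Gaussian at each annotated boundary
        # frame and take the per-frame maximum over all boundaries
        # (max aggregation is an assumption for this sketch).
        targets = np.zeros(num_frames, dtype=np.float32)
        frame_ids = np.arange(num_frames, dtype=np.float32)
        for b in boundary_frames:
            kernel = np.exp(-0.5 * ((frame_ids - b) / sigma) ** 2)
            targets = np.maximum(targets, kernel)
        return targets

    # Example: a 20-frame clip annotated with boundaries at frames 5 and 13.
    print(gaussian_boundary_targets([5, 13], num_frames=20, sigma=1.0).round(2))

Training against such soft targets, rather than hard 0/1 labels, tolerates the inherent ambiguity of where annotators place a boundary and provides a denser learning signal around each annotated frame.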


Metadata
Title: Local Compressed Video Stream Learning for Generic Event Boundary Detection
Authors: Libo Zhang, Xin Gu, Congcong Li, Tiejian Luo, Heng Fan
Publication date: 01.11.2023
Publisher: Springer US
Published in: International Journal of Computer Vision, Issue 4/2024
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI: https://doi.org/10.1007/s11263-023-01921-8
