2021 | Original Paper | Book Chapter

From Coarse to Fine: Hierarchical Structure-Aware Video Summarization

Authors: Wenxu Li, Gang Pan, Chen Wang, Zhen Xing, Xiaozhou Zhou, Xiaoxuan Dong, Jiawan Zhang

Published in: Pattern Recognition. ICPR International Workshops and Challenges

Publisher: Springer International Publishing

Abstract

Hierarchical structure is a common characteristic of certain kinds of videos (e.g., sports videos and game videos): such videos are composed hierarchically of several actions whose labels can be enumerated, and temporal dependencies exist among segments of different scales. Our approach rests on two intuitions. First, actions are the fundamental units by which people understand these videos. Second, summarization is naturally a process of observation and refinement, i.e., observing segments in a video and hierarchically refining the boundaries of an important action according to the video's hierarchical structure. Based on these insights, we generate action proposals to exploit the structure and formulate summarization as a hierarchical refining process. We train a hierarchical summarization network with deep Q-learning (HQSN) to carry out the refining process and explore temporal dependencies. In addition, we collect a new dataset of structured game videos with fine-grained actions and importance annotations. The experimental results demonstrate the effectiveness of our framework.
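
To make the idea of refining action boundaries from coarse to fine more concrete, the following is a minimal illustrative sketch and not the HQSN implementation, whose details the abstract does not give. It assumes a window of frame indices, a hypothetical two-action set (shrink the left or the right boundary), and a greedy oracle that uses temporal IoU with a known target segment in place of a learned Q-function. The `q_update` function shows the standard tabular Q-learning rule, which a deep Q-network approximates with a neural network. All names here are assumptions made for illustration.

```python
from typing import Dict, Tuple

Window = Tuple[int, int]  # (start_frame, end_frame), end exclusive
ACTIONS = ("shrink_left", "shrink_right")  # hypothetical action set


def temporal_iou(a: Window, b: Window) -> float:
    """Intersection-over-union of two temporal windows."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def apply_action(window: Window, action: str, step: int) -> Window:
    """Move one boundary inward by `step` frames, keeping at least one frame."""
    start, end = window
    if action == "shrink_left":
        start = min(start + step, end - 1)
    elif action == "shrink_right":
        end = max(end - step, start + 1)
    return (start, end)


def refine_coarse_to_fine(window: Window, target: Window, step: int) -> Window:
    """Greedy stand-in for a learned policy: take the boundary move that most
    improves IoU with `target`; when no move helps, halve the step size."""
    while step >= 1:
        current = temporal_iou(window, target)
        best = max((apply_action(window, a, step) for a in ACTIONS),
                   key=lambda w: temporal_iou(w, target))
        if temporal_iou(best, target) > current:
            window = best
        else:
            step //= 2  # descend to a finer temporal scale
    return window


def q_update(q: Dict, state, action: str, reward: float, next_state,
             alpha: float = 0.5, gamma: float = 0.9) -> None:
    """Standard tabular Q-learning update (not the paper's deep network)."""
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)


# Usage: start from the whole 64-frame clip and refine toward frames 20..32.
refined = refine_coarse_to_fine((0, 64), (20, 32), step=16)
print(refined)  # (20, 32)
```

In this toy setting the large steps first trim the window to (16, 32), and only the finer step of 4 frames recovers the exact left boundary. In a learned setting, the IoU oracle would be replaced by Q-values predicted from video features, with the change in IoU as one possible reward signal.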

Metadata
Title
From Coarse to Fine: Hierarchical Structure-Aware Video Summarization
Authors
Wenxu Li
Gang Pan
Chen Wang
Zhen Xing
Xiaozhou Zhou
Xiaoxuan Dong
Jiawan Zhang
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-68799-1_6
