Published in: International Journal of Machine Learning and Cybernetics 9/2023

19-03-2023 | Original Article

HMNet: a hierarchical multi-modal network for educational video concept prediction

Authors: Wei Huang, Tong Xiao, Qi Liu, Zhenya Huang, Jianhui Ma, Enhong Chen

Abstract

Educational video concept prediction is a challenging task in online education systems that aims to assign appropriate hierarchical concepts to a video. The key to this problem is modeling and fusing the multi-modal information of the video. However, most prior studies ignore the incremental characteristics of educational videos, and most video segmentation strategies do not transfer well to them. Moreover, most existing methods overlook the class hierarchy and do not consider class dependencies when predicting the hierarchical concepts of a video. To that end, in this paper we propose a Hierarchical Multi-modal Network (HMNet) framework that predicts the hierarchical concepts of educational videos by fusing multi-modal information and modeling class dependencies. Specifically, we first apply a video divider that extracts keyframes while accounting for the incremental characteristics of educational videos, splitting each video into a series of sections with subtitles. Then, we utilize a multi-modal encoder to obtain a unified representation across modalities. Finally, we design a hierarchical predictor that fuses the multi-modal representation, models class dependencies, and predicts the hierarchical concepts of a video in a top-down manner. Extensive experiments on two real-world datasets demonstrate the effectiveness and explanatory power of HMNet.
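The top-down prediction idea in the abstract — accepting a child concept only under an accepted parent, so predictions respect the class hierarchy — can be sketched minimally as follows. The two-level taxonomy, function names, and thresholding scheme here are illustrative assumptions, not the paper's actual implementation:

```python
# A minimal sketch of top-down hierarchical concept prediction.
# The taxonomy, score inputs, and threshold are hypothetical; HMNet's
# actual predictor is a learned neural component, not shown here.

from typing import Dict, List

# Hypothetical two-level concept taxonomy: top-level subject -> sub-concepts.
TAXONOMY: Dict[str, List[str]] = {
    "math": ["calculus", "algebra"],
    "physics": ["mechanics", "optics"],
}

def predict_top_down(top_scores: Dict[str, float],
                     child_scores: Dict[str, float],
                     threshold: float = 0.5) -> List[str]:
    """Predict concepts level by level: a child concept is kept only if
    its parent was predicted, enforcing class-hierarchy consistency."""
    predicted: List[str] = []
    for parent, children in TAXONOMY.items():
        if top_scores.get(parent, 0.0) >= threshold:
            predicted.append(parent)
            # Children are considered only under an accepted parent.
            predicted += [c for c in children
                          if child_scores.get(c, 0.0) >= threshold]
    return predicted

# "mechanics" scores high but is dropped because its parent "physics" is rejected.
print(predict_top_down({"math": 0.9, "physics": 0.2},
                       {"calculus": 0.8, "mechanics": 0.9}))
# → ['math', 'calculus']
```

The design choice this illustrates is the one the abstract motivates: flat multi-label classifiers can emit a child without its parent, whereas a top-down pass makes such inconsistent label sets impossible by construction.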

Metadata
Title
HMNet: a hierarchical multi-modal network for educational video concept prediction
Authors
Wei Huang
Tong Xiao
Qi Liu
Zhenya Huang
Jianhui Ma
Enhong Chen
Publication date
19-03-2023
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 9/2023
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-023-01809-6
