6 January 2024 | Research

Movie tag prediction: An extreme multi-label multi-modal transformer-based solution with explanation

Written by: Massimo Guarascio, Marco Minici, Francesco Sergio Pisani, Erika De Francesco, Pasquale Lambardi

Published in: Journal of Intelligent Information Systems

Abstract

Providing rich and accurate metadata for indexing media content is a crucial problem for companies offering streaming entertainment services. These metadata are commonly used to enhance search engine results and to feed recommendation algorithms, improving how well content is matched to user interests. However, labeling multimedia content with informative tags is challenging: the labeling procedure, performed manually by domain experts, is time-consuming and error-prone. Recently, AI-based methods have proven effective for automating this complex process. Developing an effective solution nevertheless requires coping with several challenges, such as data noise and the scarcity of labeled examples in the training phase. In this work, we address these challenges by introducing a Transformer-based framework for multi-modal multi-label classification, enriched with capabilities for explaining model predictions. These explanations can help domain experts understand the system’s predictions. Experiments conducted on two real test cases demonstrate the framework's effectiveness.
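
To make the term "multi-modal multi-label classification" concrete, the following is a minimal sketch and not the authors' architecture. It assumes that a text embedding and an image embedding have already been computed, fuses them by simple concatenation, and scores every tag with its own sigmoid so that several tags can be assigned to one movie. The function name predict_tags, the linear scoring layer, the 0.5 threshold, and all toy values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_tags(text_emb, image_emb, weights, biases, tags, threshold=0.5):
    """Score each tag independently on the fused text+image representation.

    Hypothetical illustration: fusion is plain concatenation and the
    classifier is one linear layer with one sigmoid output per tag.
    """
    # Fuse the two modalities by concatenating their embeddings.
    fused = list(text_emb) + list(image_emb)

    scores = {}
    for tag, w, b in zip(tags, weights, biases):
        logit = sum(wi * xi for wi, xi in zip(w, fused)) + b
        scores[tag] = sigmoid(logit)

    # Multi-label decision: every tag whose score reaches the threshold is kept.
    predicted = [t for t in tags if scores[t] >= threshold]
    return predicted, scores


# Toy example with made-up embeddings and weights.
tags = ["comedy", "violence", "romantic"]
text_emb = [1.0, 0.0]
image_emb = [0.0, 1.0]
weights = [
    [2.0, 0.0, 0.0, 1.0],   # comedy
    [-3.0, 0.0, 0.0, 0.5],  # violence
    [0.5, 0.0, 0.0, 0.5],   # romantic
]
biases = [0.0, 0.0, 0.0]

predicted, scores = predict_tags(text_emb, image_emb, weights, biases, tags)
print(predicted)
```

Because each tag has an independent sigmoid rather than sharing one softmax, the scores need not sum to one, which is what allows a single movie to receive several tags at once.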

Metadata
Title
Movie tag prediction: An extreme multi-label multi-modal transformer-based solution with explanation
Written by
Massimo Guarascio
Marco Minici
Francesco Sergio Pisani
Erika De Francesco
Pasquale Lambardi
Publication date
6 January 2024
Publisher
Springer US
Published in
Journal of Intelligent Information Systems
Print ISSN: 0925-9902
Electronic ISSN: 1573-7675
DOI
https://doi.org/10.1007/s10844-023-00836-7