
Journal of Intelligent Information Systems

6 January 2024 | Research

Movie tag prediction: An extreme multi-label multi-modal transformer-based solution with explanation

Authors: Massimo Guarascio, Marco Minici, Francesco Sergio Pisani, Erika De Francesco, Pasquale Lambardi

Published in: Journal of Intelligent Information Systems


Abstract

Providing rich and accurate metadata for indexing media content is a crucial problem for companies offering streaming entertainment services. Such metadata are commonly used to enhance search engine results and to feed recommendation algorithms so that content better matches user interests. However, labeling multimedia content with informative tags is challenging: the labeling procedure, performed manually by domain experts, is time-consuming and error-prone. Recently, AI-based methods have proven effective for automating this complex process. However, developing an effective solution requires coping with several challenging issues, such as data noise and the scarcity of labeled examples during the training phase. In this work, we address these challenges by introducing a Transformer-based framework for multi-modal multi-label classification enriched with model prediction explanation capabilities. These explanations can help the domain expert understand the system's predictions. Experiments conducted on two real test cases demonstrate the framework's effectiveness.
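The abstract does not describe the architecture, so the following is only an illustrative sketch of what multi-modal multi-label tag prediction means in practice, not the authors' implementation. It assumes that text and image embeddings are fused by simple concatenation, that each tag is scored by its own linear unit followed by a sigmoid, and that a tag is emitted when its probability reaches a threshold of 0.5. The function names, tag names, weights, and threshold are all hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse(text_emb, image_emb):
    """Fuse two modality embeddings by concatenation (an assumption;
    the abstract does not say how modalities are combined)."""
    return list(text_emb) + list(image_emb)


def predict_tags(features, weights, biases, tag_names, threshold=0.5):
    """Score every tag independently and keep those at or above the threshold.

    Unlike single-label classification (softmax, one winner), each tag
    gets its own sigmoid, so any number of tags can be selected.
    """
    selected = []
    for name, w, b in zip(tag_names, weights, biases):
        logit = sum(wi * xi for wi, xi in zip(w, features)) + b
        prob = sigmoid(logit)
        if prob >= threshold:
            selected.append((name, prob))
    return selected


# Toy example with hand-picked (hypothetical) embeddings and weights.
text_emb = [1.0, 0.0]
image_emb = [0.0, 1.0]
features = fuse(text_emb, image_emb)

tag_names = ["comedy", "horror", "romance"]
weights = [
    [2.0, 0.0, 0.0, 2.0],    # comedy: both modalities agree
    [-2.0, 0.0, 0.0, -2.0],  # horror: both modalities disagree
    [1.0, 0.0, 0.0, 0.5],    # romance: weaker positive evidence
]
biases = [0.0, 0.0, 0.0]

predictions = predict_tags(features, weights, biases, tag_names)
predicted_names = [name for name, _ in predictions]
```

In a real system the hand-picked weights would be replaced by a trained model, and the threshold would typically be tuned per tag, since in extreme multi-label settings most tags are rare.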




Title: Movie tag prediction: An extreme multi-label multi-modal transformer-based solution with explanation
Authors: Massimo Guarascio, Marco Minici, Francesco Sergio Pisani, Erika De Francesco, Pasquale Lambardi
Publication date: 6 January 2024
Publisher: Springer US
Published in: Journal of Intelligent Information Systems
Print ISSN: 0925-9902
Electronic ISSN: 1573-7675
DOI: https://doi.org/10.1007/s10844-023-00836-7
