2023 | OriginalPaper | Chapter

Data Homogeneity Dependent Topic Modeling for Information Retrieval

Authors: Keerthana Sureshbabu Kashi, Abigail A. Antenor, Gabriel Isaac L. Ramolete, Adrienne Heinrich

Published in: Intelligent Systems and Machine Learning

Publisher: Springer Nature Switzerland

Abstract

Different topic modeling techniques have been applied over the years to categorize and make sense of large volumes of unstructured textual data. Our observations show that no single technique works well for all domains or for a general use case. We hypothesize that the performance of these algorithms depends on the variation and heterogeneity of the topics mentioned in free text, and we investigate this effect in our study. Our proposed methodology comprises (i) the calculation of a homogeneity score to measure the variation in the data and (ii) the selection of the algorithm with the best performance for the calculated homogeneity score. For each homogeneity score, the performance of popular topic modeling algorithms, namely NMF, LDA, LSA, and BERTopic, was compared using accuracy and Cohen’s kappa. Our results indicate that for highly homogeneous data, BERTopic outperformed the other algorithms (Cohen’s kappa of 0.42 vs. 0.06 for LSA). For data of medium and low homogeneity, NMF was superior to the other algorithms (for medium homogeneity, Cohen’s kappa was 0.3 for NMF vs. 0.15 for LDA, 0.1 for BERTopic, and 0.04 for LSA).
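
The two evaluation measures named in the abstract, and one possible way to quantify homogeneity, can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the abstract does not define the homogeneity score, so `homogeneity_score` below is a hypothetical stand-in (one minus the normalized Shannon entropy of the label distribution). The function names and the toy labels are likewise assumptions, and the predicted topics are assumed to be already mapped onto the ground-truth label names.

```python
from collections import Counter
from math import log


def homogeneity_score(labels):
    """Hypothetical score (not from the paper): 1 - normalized Shannon entropy.

    Returns 1.0 when all documents share one label and 0.0 when the
    observed labels are evenly spread.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    if k <= 1:
        return 1.0
    entropy = -sum((c / n) * log(c / n) for c in counts.values())
    return 1.0 - entropy / log(k)


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted topic equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal label frequencies.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined here; returning 0.0 is a convention assumed
        # for this sketch.
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy data (made up for illustration).
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
print(homogeneity_score(y_true))
print(accuracy(y_true, y_pred))
print(cohen_kappa(y_true, y_pred))
```

Kappa is reported alongside accuracy because accuracy alone can look high on homogeneous data, where always predicting the dominant topic already agrees with most labels by chance.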

Metadata
Title
Data Homogeneity Dependent Topic Modeling for Information Retrieval
Authors
Keerthana Sureshbabu Kashi
Abigail A. Antenor
Gabriel Isaac L. Ramolete
Adrienne Heinrich
Copyright Year
2023
DOI
https://doi.org/10.1007/978-3-031-35081-8_6
