2018 | OriginalPaper | Chapter

10. Topic Detection: A Statistical Model and a Quali-Quantitative Method

Abstract

This chapter compares and contrasts two approaches to the automatic detection of topics in texts, approaches that show interesting similarities and differences. Among the advances that have given new impetus and vitality to the discipline, procedures for identifying topics, that is, thematic groups within texts, appear particularly relevant to scholars' needs and have been developed in different disciplines. The first approach, known as Latent Dirichlet Allocation, was developed as a statistical model within text mining, mainly to classify the texts of large corpora automatically; the second, Reinert's method, was developed mainly in the social sciences to manage the content analysis process reliably by bridging the gap between qualitative and quantitative text analysis. Both procedures proved useful, but in different ways: Latent Dirichlet Allocation enabled us to classify the abstracts automatically under certain topics, while Reinert's method was useful for identifying the internal structure of the abstracts and extracting the macro-topics that characterized them.

Footnotes
1. I wish to thank Martin Ponweiser for his support in developing the necessary instructions in R for implementing LDA and obtaining these graphic results.

2. Special thanks go to Pierre Ratinaud for his support in developing the R instructions needed to recall the results produced by Iramuteq and construct the graphs from a chronological perspective.

Metadata
Title
Topic Detection: A Statistical Model and a Quali-Quantitative Method
Author
Stefano Sbalchiero
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-97064-6_10
