Skip to main content
Erschienen in: Advances in Data Analysis and Classification 4/2020

25.05.2020 | Regular Article

Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data

verfasst von: Laura Anderlucci, Cinzia Viroli

Erschienen in: Advances in Data Analysis and Classification | Ausgabe 4/2020

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Topic detection in short textual data is a challenging task due to its representation as high-dimensional and extremely sparse document-term matrix. In this paper we focus on the problem of classifying textual data on the base of their (unique) topic. For unsupervised classification, a popular approach called Mixture of Unigrams consists in considering a mixture of multinomial distributions over the word counts, each component corresponding to a different topic. The multinomial distribution can be easily extended by a Dirichlet prior to the compound mixtures of Dirichlet-Multinomial distributions, which is preferable for sparse data. We propose a gradient descent estimation method for fitting the model, and investigate supervised and unsupervised classification performance on real empirical problems.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
Zurück zum Zitat Ambroise C, Govaert G (2000) Em algorithm for partially known labels. In: Kiers HAL, Rasson J-P, Groenen PJF, Schader M (eds) Data analysis, classification, and related methods. Springer, Berlin, pp 161–166CrossRef Ambroise C, Govaert G (2000) Em algorithm for partially known labels. In: Kiers HAL, Rasson J-P, Groenen PJF, Schader M (eds) Data analysis, classification, and related methods. Springer, Berlin, pp 161–166CrossRef
Zurück zum Zitat Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth, BelmontMATH Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth, BelmontMATH
Zurück zum Zitat Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297MATH Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297MATH
Zurück zum Zitat Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27CrossRef Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27CrossRef
Zurück zum Zitat Feinerer I, Hornik K (2018) tm: text Mining Package. R package version 0.7-6 Feinerer I, Hornik K (2018) tm: text Mining Package. R package version 0.7-6
Zurück zum Zitat Feinerer I, Hornik K, Meyer D (2008) Text mining infrastructure in R. J Stat Softw 25(5):1–54CrossRef Feinerer I, Hornik K, Meyer D (2008) Text mining infrastructure in R. J Stat Softw 25(5):1–54CrossRef
Zurück zum Zitat Hand D, Yu K (2001) Idiot’s Bayes—not so stupid after all? Int Stat Rev 69:385–398MATH Hand D, Yu K (2001) Idiot’s Bayes—not so stupid after all? Int Stat Rev 69:385–398MATH
Zurück zum Zitat Holmes I, Harris K, Quince C (2012) Dirichlet multinomial mixtures: generative models for microbial metagenomics. PLoS ONE 7(2):e30126CrossRef Holmes I, Harris K, Quince C (2012) Dirichlet multinomial mixtures: generative models for microbial metagenomics. PLoS ONE 7(2):e30126CrossRef
Zurück zum Zitat John G, Langley P (1995) Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the 11th conference on uncertainty in artificial intelligence, pp. 338–345 John G, Langley P (1995) Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the 11th conference on uncertainty in artificial intelligence, pp. 338–345
Zurück zum Zitat Khan A, Baharudin B, Lee LH, Khan K, Tronoh UTP (2010) A review of machine learning algorithms for text-documents classification. J Adv Inf Technol 1:4–20 Khan A, Baharudin B, Lee LH, Khan K, Tronoh UTP (2010) A review of machine learning algorithms for text-documents classification. J Adv Inf Technol 1:4–20
Zurück zum Zitat Ko Y (2012) A study of term weighting schemes using class information for text classification. In: SIGIR’12—proceedings of the international ACM SIGIR conference on research and development in information retrieval Ko Y (2012) A study of term weighting schemes using class information for text classification. In: SIGIR’12—proceedings of the international ACM SIGIR conference on research and development in information retrieval
Zurück zum Zitat Kohavi R et al (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th international joint conference on artificial intelligence, vol 2. Montreal, Canada, pp 1137–1145 Kohavi R et al (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th international joint conference on artificial intelligence, vol 2. Montreal, Canada, pp 1137–1145
Zurück zum Zitat Kumbhar P, Mali M (2016) A survey on feature selection techniques and classification algorithms for efficient text classification. Int J Sci Res 5(5):9 Kumbhar P, Mali M (2016) A survey on feature selection techniques and classification algorithms for efficient text classification. Int J Sci Res 5(5):9
Zurück zum Zitat Lai S, Xu L, Liu K, Zhao J (2015) Recurrent convolutional neural networks for text classification. In: Proceedings of the twenty-ninth AAAI conference on artificial intelligence, AAAI’15. AAAI Press, pp 2267–2273 Lai S, Xu L, Liu K, Zhao J (2015) Recurrent convolutional neural networks for text classification. In: Proceedings of the twenty-ninth AAAI conference on artificial intelligence, AAAI’15. AAAI Press, pp 2267–2273
Zurück zum Zitat Nigam K, McCallum A, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39:103–134CrossRef Nigam K, McCallum A, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39:103–134CrossRef
Zurück zum Zitat Rigouste L, Cappé O, Yvon F (2007) Inference and evaluation of the multinomial mixture model for text clustering. Inf Process Manag 43(5):1260–1280CrossRef Rigouste L, Cappé O, Yvon F (2007) Inference and evaluation of the multinomial mixture model for text clustering. Inf Process Manag 43(5):1260–1280CrossRef
Zurück zum Zitat Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34(1):1–47CrossRef Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34(1):1–47CrossRef
Zurück zum Zitat Tibshirani R, Hastie T, Narasimhan B, Chu G (2003) Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Stat Sci 18:104–117MathSciNetCrossRef Tibshirani R, Hastie T, Narasimhan B, Chu G (2003) Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Stat Sci 18:104–117MathSciNetCrossRef
Zurück zum Zitat Yin J, Wang J (2014) A Dirichlet multinomial mixture model-based approach for short text clustering. In: Proceedings of the 20th ACM SIGKDD international conference on KDDM, KDD ’14, New York. ACM, pp 233–242 Yin J, Wang J (2014) A Dirichlet multinomial mixture model-based approach for short text clustering. In: Proceedings of the 20th ACM SIGKDD international conference on KDDM, KDD ’14, New York. ACM, pp 233–242
Zurück zum Zitat Zhu X, Goldberg AB (2009) Introduction to semi-supervised learning. Morgan & Claypool Publishers, San RafaelCrossRef Zhu X, Goldberg AB (2009) Introduction to semi-supervised learning. Morgan & Claypool Publishers, San RafaelCrossRef
Metadaten
Titel
Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data
verfasst von
Laura Anderlucci
Cinzia Viroli
Publikationsdatum
25.05.2020
Verlag
Springer Berlin Heidelberg
Erschienen in
Advances in Data Analysis and Classification / Ausgabe 4/2020
Print ISSN: 1862-5347
Elektronische ISSN: 1862-5355
DOI
https://doi.org/10.1007/s11634-020-00399-3

Weitere Artikel der Ausgabe 4/2020

Advances in Data Analysis and Classification 4/2020 Zur Ausgabe