Advances in Data Analysis and Classification

25 May 2020 | Regular Article

Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data

Authors: Laura Anderlucci, Cinzia Viroli

Published in: Advances in Data Analysis and Classification | Issue 4/2020

Abstract

Topic detection in short textual data is challenging because such data are represented as a high-dimensional and extremely sparse document-term matrix. In this paper we focus on classifying textual data on the basis of their (unique) topic. For unsupervised classification, a popular approach called Mixture of Unigrams models the word counts with a mixture of multinomial distributions, each component corresponding to a different topic. Placing a Dirichlet prior on the multinomial parameters extends this model to a mixture of compound Dirichlet-Multinomial distributions, which is preferable for sparse data. We propose a gradient descent estimation method for fitting the model, and investigate its supervised and unsupervised classification performance on real empirical problems.
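
To make the model described in the abstract more concrete, the following Python sketch evaluates the standard Dirichlet-Multinomial log-probability of a word-count vector and uses Bayes' rule to assign a document to a topic. It is only an illustration under stated assumptions: the function names, the toy vocabulary of three words, and all parameter values are hypothetical, the parameterization may differ from the one used in the paper, and the gradient descent estimation step that the authors propose is not reproduced here (the parameters are simply taken as given).

```python
from math import exp, lgamma, log


def dm_log_pmf(counts, alpha):
    """Log-probability of a word-count vector under a Dirichlet-Multinomial.

    counts: non-negative integer word counts for one document.
    alpha:  positive Dirichlet parameters, one per vocabulary word.
    """
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient: log n! - sum_j log x_j!
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Ratio of Dirichlet normalising constants after integrating out
    # the multinomial word probabilities.
    log_ratio = lgamma(a) - lgamma(n + a) + sum(
        lgamma(x + al) - lgamma(al) for x, al in zip(counts, alpha)
    )
    return log_coef + log_ratio


def responsibilities(counts, weights, alphas):
    """Posterior probability of each topic for one document (Bayes' rule)."""
    log_joint = [
        log(w) + dm_log_pmf(counts, al) for w, al in zip(weights, alphas)
    ]
    m = max(log_joint)  # subtract the maximum for numerical stability
    unnorm = [exp(v - m) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical toy setting: two topics, vocabulary of three words.
weights = [0.5, 0.5]                          # assumed mixing proportions
alphas = [[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]]   # assumed topic parameters
doc = [3, 0, 1]                               # word counts of a short text

post = responsibilities(doc, weights, alphas)
predicted_topic = post.index(max(post))
```

In the unsupervised setting the mixing proportions and the Dirichlet parameters would have to be estimated from unlabeled documents, whereas in the supervised setting the topic labels of the training documents are known; in both cases a new document can be assigned to the topic with the largest posterior probability, as `predicted_topic` does above.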

Title: Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data
Authors: Laura Anderlucci, Cinzia Viroli
Publication date: 25 May 2020
Publisher: Springer Berlin Heidelberg
Published in: Advances in Data Analysis and Classification / Issue 4/2020
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI: https://doi.org/10.1007/s11634-020-00399-3
