2021 | Original Paper | Book chapter

Short Text Clustering Using Generalized Dirichlet Multinomial Mixture Model

Authors: Samar Hannachi, Fatma Najar, Nizar Bouguila

Published in: Recent Challenges in Intelligent Information and Database Systems

Publisher: Springer Singapore

Abstract

Artificial intelligence is in the spotlight because of its wide use and its effectiveness in solving real-world problems. This decade has seen a notable rise in the amount of data collected and made publicly available. This has opened up many research problems, among them working with short texts and the challenges they pose. In this paper, we propose a collapsed Gibbs sampling algorithm for the generalized Dirichlet multinomial mixture model for short text clustering (GSDMM). The proposed approach is evaluated on the Google News dataset. It proved more efficient than related work and succeeded in overcoming the common challenges that come with short texts.
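
To make the idea of collapsed Gibbs sampling for a mixture over short texts concrete, the following is a minimal sketch only. It implements the sampler for the standard Dirichlet multinomial mixture (symmetric Dirichlet priors), not the generalized Dirichlet variant proposed in the chapter, whose update equations are not given in the abstract. The function name `dmm_gibbs`, the parameters `K`, `alpha`, `beta`, `n_iters`, and their default values are assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def dmm_gibbs(docs, K=8, alpha=0.1, beta=0.1, n_iters=15, seed=0):
    """Collapsed Gibbs sampling for the standard Dirichlet multinomial mixture.

    docs is a list of token lists; returns one cluster label per document.
    Each document belongs to exactly one cluster (the short-text assumption).
    """
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})  # vocabulary size
    m = [0] * K                                # documents per cluster
    n = [0] * K                                # tokens per cluster
    nw = [Counter() for _ in range(K)]         # per-cluster word counts
    counts = [Counter(doc) for doc in docs]    # per-document word counts

    def move(d, k, sign):
        # Add (sign=+1) or remove (sign=-1) document d from cluster k.
        m[k] += sign
        n[k] += sign * len(docs[d])
        for w, c in counts[d].items():
            nw[k][w] += sign * c

    # Random initial assignment of documents to clusters.
    z = []
    for d in range(len(docs)):
        k = rng.randrange(K)
        z.append(k)
        move(d, k, +1)

    for _ in range(n_iters):
        for d in range(len(docs)):
            move(d, z[d], -1)  # take document d out of its current cluster
            logp = []
            for k in range(K):
                # Cluster popularity term; the denominator shared by all
                # clusters is dropped because it cancels on normalization.
                lp = math.log(m[k] + alpha)
                # Word-match term, computed in log space for stability.
                for w, c in counts[d].items():
                    for j in range(c):
                        lp += math.log(nw[k][w] + beta + j)
                for i in range(len(docs[d])):
                    lp -= math.log(n[k] + V * beta + i)
                logp.append(lp)
            mx = max(logp)
            weights = [math.exp(lp - mx) for lp in logp]
            z[d] = rng.choices(range(K), weights=weights)[0]
            move(d, z[d], +1)
    return z
```

Calling `dmm_gibbs([["stock", "market"], ["football", "match"]], K=3)` returns a list with one integer label per document. `K` is only an upper bound on the number of clusters, since some clusters may end up empty after sampling.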

Metadata
Title
Short Text Clustering Using Generalized Dirichlet Multinomial Mixture Model
Authors
Samar Hannachi
Fatma Najar
Nizar Bouguila
Copyright year
2021
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-16-1685-3_13