Skip to main content
Erschienen in: Neural Computing and Applications 14/2024

02.03.2024 | Original Article

A topic-enhanced dirichlet model for short text stream clustering

verfasst von: Kan Liu, Jiarui He, Yu Chen

Erschienen in: Neural Computing and Applications | Ausgabe 14/2024

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Short text streams, such as social media comments, are continuously generated, making effective clustering methods essential for extracting valuable information. However, existing research fails to address the problem of topic concentration in clustering, which leads to multiple topics being confused in one cluster, making it challenging to summarize the center of clustering. To tackle this issue, this paper proposes a novel topic-enhanced clustering method called TEDM, based on the Dirichlet model. The method uses dynamic clustering, leveraging topic information to improve the sampling of documents and better cluster documents on the same topic. TEDM constructs a dynamic word relation graph to extract topic terms, which is updated with the stream of documents to cope with the dynamic changes in topics. Extensive experimental studies demonstrate that TEDM outperforms state-of-the-art works on multiple real datasets.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Aggarwal CC, Philip SY, Han J, et al (2003) A framework for clustering evolving data streams. In: Proceedings 2003 VLDB conference, Elsevier, pp 81–92 Aggarwal CC, Philip SY, Han J, et al (2003) A framework for clustering evolving data streams. In: Proceedings 2003 VLDB conference, Elsevier, pp 81–92
2.
Zurück zum Zitat Blackwell D, MacQueen JB (1973) Ferguson distributions via pólya urn schemes. Anna Statist 1(2):353–355 Blackwell D, MacQueen JB (1973) Ferguson distributions via pólya urn schemes. Anna Statist 1(2):353–355
3.
Zurück zum Zitat Blei DM, Lafferty JD (2006) Dynamic topic models. In: Proceedings of the 23rd international conference on Machine learning, pp 113–120 Blei DM, Lafferty JD (2006) Dynamic topic models. In: Proceedings of the 23rd international conference on Machine learning, pp 113–120
4.
Zurück zum Zitat Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022 Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022
5.
Zurück zum Zitat Cao F, Estert M, Qian W, et al (2006) Density-based clustering over an evolving data stream with noise. In: Proceedings of the 2006 SIAM international conference on data mining, SIAM, pp 328–339 Cao F, Estert M, Qian W, et al (2006) Density-based clustering over an evolving data stream with noise. In: Proceedings of the 2006 SIAM international conference on data mining, SIAM, pp 328–339
6.
Zurück zum Zitat Chen J, Gong Z, Liu W (2019) A nonparametric model for online topic discovery with word embeddings. Inf Sci 504:32–47MathSciNetCrossRef Chen J, Gong Z, Liu W (2019) A nonparametric model for online topic discovery with word embeddings. Inf Sci 504:32–47MathSciNetCrossRef
7.
Zurück zum Zitat Chen J, Gong Z, Liu W (2020) A dirichlet process biterm-based mixture model for short text stream clustering. Appl Intell 50(5):1609–1619CrossRef Chen J, Gong Z, Liu W (2020) A dirichlet process biterm-based mixture model for short text stream clustering. Appl Intell 50(5):1609–1619CrossRef
8.
Zurück zum Zitat Chu D, Reyers M, Thomson J et al (2020) Route identification in the national football league: An application of model-based curve clustering using the em algorithm. J Quantit Anal Sports 16(2):121–132CrossRef Chu D, Reyers M, Thomson J et al (2020) Route identification in the national football league: An application of model-based curve clustering using the em algorithm. J Quantit Anal Sports 16(2):121–132CrossRef
9.
Zurück zum Zitat Duan T, Lou Q, Srihari SN, et al (2019) Sequential embedding induced text clustering, a non-parametric bayesian approach. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, pp 68–80 Duan T, Lou Q, Srihari SN, et al (2019) Sequential embedding induced text clustering, a non-parametric bayesian approach. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, pp 68–80
10.
Zurück zum Zitat Ferguson TS (1973) A bayesian analysis of some nonparametric problems. Annal Statist pp 209–230 Ferguson TS (1973) A bayesian analysis of some nonparametric problems. Annal Statist pp 209–230
11.
Zurück zum Zitat Geng F, Liu Q, Zhang P (2020) A time-aware query-focused summarization of an evolving microblogging stream via sentence extraction. Digit Commun Netw 6(3):389–397CrossRef Geng F, Liu Q, Zhang P (2020) A time-aware query-focused summarization of an evolving microblogging stream via sentence extraction. Digit Commun Netw 6(3):389–397CrossRef
12.
Zurück zum Zitat Iwata T, Watanabe S, Yamada T, et al (2009) Topic tracking model for analyzing consumer purchase behavior. In: Twenty-First international joint conference on artificial intelligence, Citeseer Iwata T, Watanabe S, Yamada T, et al (2009) Topic tracking model for analyzing consumer purchase behavior. In: Twenty-First international joint conference on artificial intelligence, Citeseer
13.
Zurück zum Zitat Kumar J, Shao J, Uddin S, et al (2020) An online semantic-enhanced dirichlet model for short text stream clustering. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 766–776 Kumar J, Shao J, Uddin S, et al (2020) An online semantic-enhanced dirichlet model for short text stream clustering. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 766–776
14.
Zurück zum Zitat Li Y, Li H, Wang Z et al (2020) Esa-stream: Efficient self-adaptive online data stream clustering. IEEE Trans Knowl Data Eng 34(2):617–630MathSciNetCrossRef Li Y, Li H, Wang Z et al (2020) Esa-stream: Efficient self-adaptive online data stream clustering. IEEE Trans Knowl Data Eng 34(2):617–630MathSciNetCrossRef
15.
Zurück zum Zitat Liang S, Yilmaz E, Kanoulas E (2016) Dynamic clustering of streaming short documents. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 995–1004 Liang S, Yilmaz E, Kanoulas E (2016) Dynamic clustering of streaming short documents. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 995–1004
16.
Zurück zum Zitat Lin Y, Jin X, Chen J et al (2019) An analytic computation-driven algorithm for decentralized multicore systems. Future Gener Comput Syst 96:101–110CrossRef Lin Y, Jin X, Chen J et al (2019) An analytic computation-driven algorithm for decentralized multicore systems. Future Gener Comput Syst 96:101–110CrossRef
17.
Zurück zum Zitat Miller E (2009) Rank hotness with newton’s law of cooling. Feb 15:3 Miller E (2009) Rank hotness with newton’s law of cooling. Feb 15:3
18.
Zurück zum Zitat Mills-Tettey GA, Stentz A, Dias MB (2007) The dynamic hungarian algorithm for the assignment problem with changing costs. Robotics Institute, Pittsburgh, PA, Tech Rep CMU-RI-TR-07-27 Mills-Tettey GA, Stentz A, Dias MB (2007) The dynamic hungarian algorithm for the assignment problem with changing costs. Robotics Institute, Pittsburgh, PA, Tech Rep CMU-RI-TR-07-27
19.
Zurück zum Zitat Nigam K, McCallum AK, Thrun S et al (2000) Text classification from labeled and unlabeled documents using em. Mach Learn 39(2):103–134CrossRef Nigam K, McCallum AK, Thrun S et al (2000) Text classification from labeled and unlabeled documents using em. Mach Learn 39(2):103–134CrossRef
20.
Zurück zum Zitat Niwattanakul S, Singthongchai J, Naenudorn E, et al (2013) Using of jaccard coefficient for keywords similarity. In: Proceedings of the international multiconference of engineers and computer scientists, pp 380–384 Niwattanakul S, Singthongchai J, Naenudorn E, et al (2013) Using of jaccard coefficient for keywords similarity. In: Proceedings of the international multiconference of engineers and computer scientists, pp 380–384
21.
Zurück zum Zitat Rakib MRH, Zeh N, Milios E (2021) Efficient clustering of short text streams using online-offline clustering. In: Proceedings of the 21st ACM Symposium on Document Engineering, pp 1–10 Rakib MRH, Zeh N, Milios E (2021) Efficient clustering of short text streams using online-offline clustering. In: Proceedings of the 21st ACM Symposium on Document Engineering, pp 1–10
22.
Zurück zum Zitat Rendón E, Abundez I, Arizmendi A et al (2011) Internal versus external cluster validation indexes. Int J Comput Commun 5(1):27–34 Rendón E, Abundez I, Arizmendi A et al (2011) Internal versus external cluster validation indexes. Int J Comput Commun 5(1):27–34
23.
Zurück zum Zitat Rosenberg A, Hirschberg J (2007) V-measure: A conditional entropy-based external cluster evaluation measure. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), pp 410–420 Rosenberg A, Hirschberg J (2007) V-measure: A conditional entropy-based external cluster evaluation measure. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), pp 410–420
24.
Zurück zum Zitat Sammut C, Webb GI (2011) Encyclopedia of machine learning. Springer Science & Business Media Sammut C, Webb GI (2011) Encyclopedia of machine learning. Springer Science & Business Media
25.
Zurück zum Zitat Santos JM, Embrechts M (2009) On the use of the adjusted rand index as a metric for evaluating supervised classification. In: International conference on artificial neural networks, Springer, pp 175–184 Santos JM, Embrechts M (2009) On the use of the adjusted rand index as a metric for evaluating supervised classification. In: International conference on artificial neural networks, Springer, pp 175–184
26.
Zurück zum Zitat Shou L, Wang Z, Chen K, et al (2013) Sumblr: continuous summarization of evolving tweet streams. In: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pp 533–542 Shou L, Wang Z, Chen K, et al (2013) Sumblr: continuous summarization of evolving tweet streams. In: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pp 533–542
27.
Zurück zum Zitat Strehl A, Ghosh J (2002) Cluster ensembles–a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617MathSciNet Strehl A, Ghosh J (2002) Cluster ensembles–a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617MathSciNet
28.
Zurück zum Zitat Terenin A, Simpson D, Draper D (2020) Asynchronous gibbs sampling. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 144–154 Terenin A, Simpson D, Draper D (2020) Asynchronous gibbs sampling. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 144–154
29.
Zurück zum Zitat Vo T (2022) Gowseqstream: an integrated sequential embedding and graph-of-words for short text stream clustering. Neural Comput Appl 34(6):4321–4341MathSciNetCrossRef Vo T (2022) Gowseqstream: an integrated sequential embedding and graph-of-words for short text stream clustering. Neural Comput Appl 34(6):4321–4341MathSciNetCrossRef
30.
Zurück zum Zitat Wang X, McCallum A (2006) Topics over time: a non-markov continuous-time model of topical trends. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 424–433 Wang X, McCallum A (2006) Topics over time: a non-markov continuous-time model of topical trends. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 424–433
31.
Zurück zum Zitat Wang Y, Agichtein E, Benzi M (2012) Tm-lda: efficient online modeling of latent topic transitions in social media. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 123–131 Wang Y, Agichtein E, Benzi M (2012) Tm-lda: efficient online modeling of latent topic transitions in social media. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 123–131
32.
Zurück zum Zitat Yang S, Huang G, Cai B (2019) Discovering topic representative terms for short text clustering. IEEE Access 7:92037–92047CrossRef Yang S, Huang G, Cai B (2019) Discovering topic representative terms for short text clustering. IEEE Access 7:92037–92047CrossRef
33.
Zurück zum Zitat Yang S, Huang G, Zhou X, et al (2019b) Dynamic clustering of stream short documents using evolutionary word relation network. In: International Conference on Data Service, Springer, pp 418–428 Yang S, Huang G, Zhou X, et al (2019b) Dynamic clustering of stream short documents using evolutionary word relation network. In: International Conference on Data Service, Springer, pp 418–428
34.
Zurück zum Zitat Yin J, Wang J (2014) A dirichlet multinomial mixture model-based approach for short text clustering. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 233–242 Yin J, Wang J (2014) A dirichlet multinomial mixture model-based approach for short text clustering. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 233–242
35.
Zurück zum Zitat Yin J, Wang J (2016) A model-based approach for text clustering with outlier detection. In: 2016 IEEE 32nd International Conference on Data Engineering (ICDE), IEEE, pp 625–636 Yin J, Wang J (2016) A model-based approach for text clustering with outlier detection. In: 2016 IEEE 32nd International Conference on Data Engineering (ICDE), IEEE, pp 625–636
36.
Zurück zum Zitat Yin J, Chao D, Liu Z, et al (2018) Model-based clustering of short text streams. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp 2634–2642 Yin J, Chao D, Liu Z, et al (2018) Model-based clustering of short text streams. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp 2634–2642
37.
Zurück zum Zitat Yoo S, Huang H, Kasiviswanathan SP (2016) Streaming spectral clustering. In: 2016 IEEE 32nd international conference on data engineering (ICDE), IEEE, pp 637–648 Yoo S, Huang H, Kasiviswanathan SP (2016) Streaming spectral clustering. In: 2016 IEEE 32nd international conference on data engineering (ICDE), IEEE, pp 637–648
38.
Zurück zum Zitat Yu G, Huang R, Wang Z (2010) Document clustering via dirichlet process mixture model with feature selection. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 763–772 Yu G, Huang R, Wang Z (2010) Document clustering via dirichlet process mixture model with feature selection. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 763–772
39.
Zurück zum Zitat Zhong S (2005) Efficient streaming text clustering. Neural Netw 18(5–6):790–798CrossRef Zhong S (2005) Efficient streaming text clustering. Neural Netw 18(5–6):790–798CrossRef
40.
Zurück zum Zitat Zhou JY, Wang FY, Zeng DJ (2011) Hierarchical dirichlet processes and their applications: a survey. Zidonghua Xuebao/Acta Automatica Sinica 37(4):389–407MathSciNetCrossRef Zhou JY, Wang FY, Zeng DJ (2011) Hierarchical dirichlet processes and their applications: a survey. Zidonghua Xuebao/Acta Automatica Sinica 37(4):389–407MathSciNetCrossRef
Metadaten
Titel
A topic-enhanced dirichlet model for short text stream clustering
verfasst von
Kan Liu
Jiarui He
Yu Chen
Publikationsdatum
02.03.2024
Verlag
Springer London
Erschienen in
Neural Computing and Applications / Ausgabe 14/2024
Print ISSN: 0941-0643
Elektronische ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-024-09480-w

Weitere Artikel der Ausgabe 14/2024

Neural Computing and Applications 14/2024 Zur Ausgabe

Premium Partner