Published in: Neural Computing and Applications 6/2022

28.10.2021 | Original Article

GOWSeqStream: an integrated sequential embedding and graph-of-words for short text stream clustering

Author: Tham Vo

Abstract

Recently proposed non-parametric Bayesian techniques model short textual documents through a multinomial distribution over the bag-of-words (BOW) representation, also known as the mixture model-based approach. Although existing models can effectively deal with topic/concept drift and textual sparsity, they are unable to exploit the semantic sequential representation of text or the co-occurrence relationships between words. To meet these challenges, we propose a novel approach called GOWSeqStream. The proposed model jointly integrates graph-of-words (GOW) and deep sequential encoding within the Dirichlet Process Mixture Model (DPMM) framework to improve the performance of the text clustering task. Extensive experiments on benchmark real-world datasets demonstrate the effectiveness of GOWSeqStream in comparison with recent state-of-the-art baselines. In terms of the standard normalized mutual information (NMI) metric, GOWSeqStream outperforms recent well-known text stream clustering baselines such as MStream, NPMM and OSDM.
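
To make the graph-of-words idea concrete, the following is a minimal sketch of how word co-occurrence relationships might be extracted from a single short document. It is not the GOWSeqStream implementation: the function name `build_graph_of_words`, the sliding-window rule, and the `window` parameter are assumptions chosen for illustration, since the abstract does not specify how the graph is built.

```python
from collections import Counter


def build_graph_of_words(tokens, window=2):
    """Return undirected co-occurrence edge counts for one tokenized document.

    Hypothetical helper: two distinct words are linked when they occur
    within `window` positions of each other. The window size is an
    illustrative assumption, not a value taken from the paper.
    """
    edges = Counter()
    for i, word in enumerate(tokens):
        # Look ahead at most `window` tokens so each pair is counted once.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                # Sort the pair so (a, b) and (b, a) map to the same edge.
                edges[tuple(sorted((word, other)))] += 1
    return edges


doc = "short text stream clustering of short text".split()
gow = build_graph_of_words(doc, window=2)
for (u, v), weight in sorted(gow.items()):
    print(u, v, weight)
```

Unlike a plain bag-of-words count, the edge weights retain which words appear near each other, which is the kind of information the abstract says BOW-based mixture models cannot exploit.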

Metadata
Title
GOWSeqStream: an integrated sequential embedding and graph-of-words for short text stream clustering
Author
Tham Vo
Publication date
28.10.2021
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 6/2022
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-021-06563-w
