Skip to main content
Erschienen in: Pattern Analysis and Applications 3/2023

10.05.2023 | Theoretical Advances

DEC-transformer: deep embedded clustering with transformer on Chinese long text

verfasst von: Ao Zou, Wenning Hao, Gang Chen, Dawei Jin

Erschienen in: Pattern Analysis and Applications | Ausgabe 3/2023

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Long text clustering is of great significance and practical value in data mining, such as information retrieval, text integration, and data compression. Compared with short text clustering, long text clustering involves more semantic information representation and processing, making it a challenging problem. Most recent techniques merely rely on dynamic word embeddings from pre-training as a transfer learning or only based on a simple combination of transformers and traditional clustering methods, which still need to be expanded to short text due to the quadratic computational complexity. In this paper, a novel model combining transfer learning and dynamic feedback called deep embedded clustering with transformer(DEC-transformer) is proposed. To better capture the semantic relationships between sentences in documents, a novel transfer learning technology is firstly applied to long text clustering tasks for pre-training. Unlike previous papers, a two-stage training task is designed by treating semantic representation and text clustering as a united process, and the parameter is dynamically optimized by adaptive feedback to further improve efficiency. Experimental results on the test set show that the proposed model has made great progress in accuracy compared with several benchmarks. Furthermore, it also has good robustness and can get good performance on noisy datasets.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
4.
Zurück zum Zitat Seifzadeh S, Farahat AK, Kamel MS, Karray F Short-text clustering using statistical semantics. In: Proceedings of the 24th international conference on World Wide Web, New York Seifzadeh S, Farahat AK, Kamel MS, Karray F Short-text clustering using statistical semantics. In: Proceedings of the 24th international conference on World Wide Web, New York
8.
Zurück zum Zitat Revanasiddappa MB, Harish BS, Kumar SVA (2017) Clustering text documents using kernel possibilistic c-means. In: Proceedings of international conference on cognition and recognition. Springer, Berlin Revanasiddappa MB, Harish BS, Kumar SVA (2017) Clustering text documents using kernel possibilistic c-means. In: Proceedings of international conference on cognition and recognition. Springer, Berlin
10.
Zurück zum Zitat Li T, Ma S, Ogihara M (2004) Entropy-based criterion in categorical clustering. In: 21st international conference on machine learning—ICML’04, New York Li T, Ma S, Ogihara M (2004) Entropy-based criterion in categorical clustering. In: 21st international conference on machine learning—ICML’04, New York
12.
Zurück zum Zitat Aggarwal CC, Zhai CA (2012) Survey of text clustering algorithms. In: Mining text data. Springer, New York Aggarwal CC, Zhai CA (2012) Survey of text clustering algorithms. In: Mining text data. Springer, New York
13.
Zurück zum Zitat Wang B, Liu W, Lin Z, Hu X, Wei J, Liu C (2018) Text clustering algorithm based on deep representation learning. J Eng 2018(16):1407–1414CrossRef Wang B, Liu W, Lin Z, Hu X, Wei J, Liu C (2018) Text clustering algorithm based on deep representation learning. J Eng 2018(16):1407–1414CrossRef
14.
Zurück zum Zitat Bengio Y, Ducharme R, Vincent P, Janvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155MATH Bengio Y, Ducharme R, Vincent P, Janvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155MATH
15.
Zurück zum Zitat Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst 26:3111–3119. Accessed 03 Jun 2021 Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst 26:3111–3119. Accessed 03 Jun 2021
16.
Zurück zum Zitat Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Xing EP, Jebara T (eds) Proceedings of the 31st international conference on machine learning. Proceedings of machine learning research, vol 32. PMLR, Bejing, pp 1188–1196. http://proceedings.mlr.press/v32/le14.html Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Xing EP, Jebara T (eds) Proceedings of the 31st international conference on machine learning. Proceedings of machine learning research, vol 32. PMLR, Bejing, pp 1188–1196. http://​proceedings.​mlr.​press/​v32/​le14.​html
17.
Zurück zum Zitat Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs]. Accessed 03 Jun 2021 Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:​1810.​04805 [cs]. Accessed 03 Jun 2021
18.
Zurück zum Zitat Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692 [cs]. Accessed 03 Jun 2021 Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv:​1907.​11692 [cs]. Accessed 03 Jun 2021
23.
Zurück zum Zitat Yang B, Fu X, Sidiropoulos N, Hong M (2017) Towards k-means-friendly spaces: simultaneous deep learning and clustering, In: Proceedings of machine learning research, PMLR, Sydney, pp 3861–3870 Yang B, Fu X, Sidiropoulos N, Hong M (2017) Towards k-means-friendly spaces: simultaneous deep learning and clustering, In: Proceedings of machine learning research, PMLR, Sydney, pp 3861–3870
24.
Zurück zum Zitat Huang P, Huang Y, Wang W, Wang L (2014) Deep embedding network for clustering. In: 2014 22nd international conference on pattern recognition, Stockholm Huang P, Huang Y, Wang W, Wang L (2014) Deep embedding network for clustering. In: 2014 22nd international conference on pattern recognition, Stockholm
25.
Zurück zum Zitat Chen D, Lv J, Zhang Y (2017) Unsupervised multi-manifold clustering by learning deep representation. In: AAAI workshops Chen D, Lv J, Zhang Y (2017) Unsupervised multi-manifold clustering by learning deep representation. In: AAAI workshops
26.
Zurück zum Zitat Dizaji GK, Herandi A, Deng C, Cai W, Huang H (2017) Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In: Proceedings of the IEEE international conference on computer vision, pp 5736–5745 Dizaji GK, Herandi A, Deng C, Cai W, Huang H (2017) Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In: Proceedings of the IEEE international conference on computer vision, pp 5736–5745
29.
Zurück zum Zitat Xie J, Girshick R, Farhadi A (2016) Unsupervised deep embedding for clustering analysis. In: International conference on machine learning, pp 478–487. PMLR Xie J, Girshick R, Farhadi A (2016) Unsupervised deep embedding for clustering analysis. In: International conference on machine learning, pp 478–487. PMLR
32.
Zurück zum Zitat Hu W, Miyato T, Tokui S, Matsumoto E, Sugiyama M (2017) Learning discrete representations via information maximizing self augmented training. In: International conference on machine learning, pp 1558–1567. PMLR Hu W, Miyato T, Tokui S, Matsumoto E, Sugiyama M (2017) Learning discrete representations via information maximizing self augmented training. In: International conference on machine learning, pp 1558–1567. PMLR
33.
Zurück zum Zitat Yang J, Parikh D, Batra D (2016) Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5147–5156 Yang J, Parikh D, Batra D (2016) Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5147–5156
34.
Zurück zum Zitat Chang J, Wang L, Meng G, Xiang S, Pan C (2017) Deep adaptive image clustering. In Proceedings of the IEEE international conference on computer vision, pp 5879–5887 Chang J, Wang L, Meng G, Xiang S, Pan C (2017) Deep adaptive image clustering. In Proceedings of the IEEE international conference on computer vision, pp 5879–5887
35.
Zurück zum Zitat Jiang Z, Zheng Y, Tan H, Tang B, Zhou H (2017) Variational deep embedding: an unsupervised and generative approach to clustering. In Proceedings of the 26th international joint conference on artificial intelligence, pp 1965–1972 Jiang Z, Zheng Y, Tan H, Tang B, Zhou H (2017) Variational deep embedding: an unsupervised and generative approach to clustering. In Proceedings of the 26th international joint conference on artificial intelligence, pp 1965–1972
36.
Zurück zum Zitat Dilokthanakul N, Mediano AMP, Garnelo M, Lee CHM, Salimbeni H, Arulkumaran K, Shanahan M (2017) Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv: Learning Dilokthanakul N, Mediano AMP, Garnelo M, Lee CHM, Salimbeni H, Arulkumaran K, Shanahan M (2017) Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv:​ Learning
37.
Zurück zum Zitat Chen X, Duan Y, Houthooft R, Schulman J, Sutskever I, Abbeel P (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems, pp 2172–2180 Chen X, Duan Y, Houthooft R, Schulman J, Sutskever I, Abbeel P (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems, pp 2172–2180
38.
Zurück zum Zitat Hadifar A, Sterckx L, Demeester T, Develder C (2019) A self-training approach for short text clustering. ACL 2019:194 Hadifar A, Sterckx L, Demeester T, Develder C (2019) A self-training approach for short text clustering. ACL 2019:194
41.
Zurück zum Zitat Rakib MRH, Zeh N, Jankowska M, Milios E Enhancement of short text clustering by iterative classification. In: Natural language processing and information systems. Springer, Berlin Rakib MRH, Zeh N, Jankowska M, Milios E Enhancement of short text clustering by iterative classification. In: Natural language processing and information systems. Springer, Berlin
43.
Zurück zum Zitat Aljalbout E, Golkov V, Siddiqui Y, Strobel M, Cremers D (2018) Clustering with deep learning: taxonomy and new methods. arXiv:1801.07648 [cs, stat]. Accessed 03 Jun 2021 Aljalbout E, Golkov V, Siddiqui Y, Strobel M, Cremers D (2018) Clustering with deep learning: taxonomy and new methods. arXiv:​1801.​07648 [cs, stat]. Accessed 03 Jun 2021
44.
Zurück zum Zitat Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(11): 2579–2605MATH Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(11): 2579–2605MATH
46.
Zurück zum Zitat Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kpf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) PyTorch: an Imperative style, high-performance deep learning library. arXiv:1912.01703 [cs, stat]. Accessed 03 Jun 2021 Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kpf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) PyTorch: an Imperative style, high-performance deep learning library. arXiv:​1912.​01703 [cs, stat]. Accessed 03 Jun 2021
47.
Zurück zum Zitat Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830MathSciNetMATH Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830MathSciNetMATH
48.
Zurück zum Zitat Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Davison J, Shleifer S, von Platen P, Ma C, Jernite Y, Plu J, Xu C, Scao TL, Gugger S, Drame M, Lhoest Q, Rush A (2020) Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, Stroudsburg Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Davison J, Shleifer S, von Platen P, Ma C, Jernite Y, Plu J, Xu C, Scao TL, Gugger S, Drame M, Lhoest Q, Rush A (2020) Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, Stroudsburg
49.
Zurück zum Zitat Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov R, Le QV (2020) XLNet: generalized autoregressive pretraining for language understanding. arXiv:1906.08237 [cs]. Accessed 03 Jun 2021 Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov R, Le QV (2020) XLNet: generalized autoregressive pretraining for language understanding. arXiv:​1906.​08237 [cs]. Accessed 03 Jun 2021
52.
Zurück zum Zitat Hu X, Sun N, Zhang C, Chua T-S (2009) Exploiting internal and external semantics for the clustering of short texts using world knowledge. In: Proceeding of the 18th ACM conference on information and knowledge management—CIKM’09, New York Hu X, Sun N, Zhang C, Chua T-S (2009) Exploiting internal and external semantics for the clustering of short texts using world knowledge. In: Proceeding of the 18th ACM conference on information and knowledge management—CIKM’09, New York
Metadaten
Titel
DEC-transformer: deep embedded clustering with transformer on Chinese long text
verfasst von
Ao Zou
Wenning Hao
Gang Chen
Dawei Jin
Publikationsdatum
10.05.2023
Verlag
Springer London
Erschienen in
Pattern Analysis and Applications / Ausgabe 3/2023
Print ISSN: 1433-7541
Elektronische ISSN: 1433-755X
DOI
https://doi.org/10.1007/s10044-023-01161-z

Weitere Artikel der Ausgabe 3/2023

Pattern Analysis and Applications 3/2023 Zur Ausgabe

Premium Partner