Published in: Neural Computing and Applications 4/2015

01.05.2015 | Original Article

Text clustering using VSM with feature clusters

Authors: Cao Qimin, Guo Qiao, Wang Yongliang, Wu Xianghua

Published in: Neural Computing and Applications | Issue 4/2015

Abstract

Document representation is the basis of clustering systems. Moreover, non-contiguous phrases appear increasingly often in text in the Web 2.0 era, and such phrases can affect text clustering results. To improve the quality of text clustering, this paper proposes a feature cluster-based vector space model (FC-VSM), which represents documents with a feature-cluster co-occurrence matrix, and proposes identifying non-contiguous phrases in the text preprocessing stage. Compared with the traditional VSM-based model, our method reduces the dimensionality of the feature space. It identifies non-contiguous phrases, uses distributed representations of features, and builds feature clusters. Despite their simplicity, these methods are surprisingly effective, and the experimental results show that they significantly improve clustering accuracy.
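To make the general recipe in the abstract concrete, the following is a minimal sketch of a feature cluster-based document representation: learn distributed word representations, group the words into a small number of feature clusters, and represent each document by counts over those clusters instead of over individual terms. The library choices (gensim, scikit-learn), the toy corpus, the cluster count k, and all parameter values are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a feature-cluster-based VSM (FC-VSM-style representation).
# Assumptions: gensim for word embeddings, scikit-learn KMeans for feature
# clustering, and a toy tokenized corpus; none of this is the paper's code.

from gensim.models import Word2Vec
from sklearn.cluster import KMeans
import numpy as np

# Tokenized corpus: one list of tokens per document (toy example).
docs = [
    ["text", "clustering", "vector", "space", "model"],
    ["word", "embedding", "distributed", "representation"],
    ["document", "clustering", "feature", "cluster", "model"],
]

# 1. Distributed representations of features (words).
w2v = Word2Vec(docs, vector_size=50, window=5, min_count=1, epochs=50, seed=1)
vocab = list(w2v.wv.index_to_key)
word_vecs = np.array([w2v.wv[w] for w in vocab])

# 2. Group words into k feature clusters (k much smaller than the vocabulary).
k = 4
km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(word_vecs)
word2cluster = dict(zip(vocab, km.labels_))

# 3. Represent each document as a k-dimensional vector of cluster counts;
#    this is the reduced-dimension vector fed to the document clustering step.
def fc_vsm(doc):
    vec = np.zeros(k)
    for token in doc:
        if token in word2cluster:
            vec[word2cluster[token]] += 1
    return vec

doc_matrix = np.vstack([fc_vsm(d) for d in docs])
print(doc_matrix.shape)  # (num_docs, k) instead of (num_docs, vocabulary size)
```

The document matrix produced this way can then be passed to any standard clustering algorithm; the dimensionality reduction comes from replacing the term axis with the much smaller feature-cluster axis.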


Metadata
Title
Text clustering using VSM with feature clusters
Authors
Cao Qimin
Guo Qiao
Wang Yongliang
Wu Xianghua
Publication date
01.05.2015
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 4/2015
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-014-1792-9
