Published in: Advances in Data Analysis and Classification 3/2018

28-02-2017 | Regular Article

Cluster-based sparse topical coding for topic mining and document clustering

Authors: Parvin Ahmadi, Iman Gholampour, Mahmoud Tabandeh


Abstract

In this paper, we introduce Cluster-based Sparse Topical Coding, a document clustering method based on Sparse Topical Coding. Topic modeling can improve the clustering of text documents by representing each document as a bag of words and projecting it into a topic space. The latent semantic descriptions derived by the topic model can then be used as features in a clustering process. In our proposed method, document clustering and topic modeling are integrated in a unified framework to achieve the best performance. This framework combines Sparse Topical Coding, which is responsible for topic mining, with K-means, which discovers the latent clusters in the document collection. Experimental results on widely used datasets show that the proposed method significantly outperforms traditional clustering methods and other topic-model-based clustering methods. It improves clustering accuracy by 4% to 39% and normalized mutual information by 2% to more than 44%.
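
As a rough illustration of the two-stage idea the abstract mentions (topic-space features handed to a clustering step) and of the two reported metrics, here is a minimal Python sketch. It is not the paper's unified Cluster-based Sparse Topical Coding method. The topic proportions in `doc_topics` are made-up toy values standing in for the output of some topic model, the K-means uses a simple deterministic initialization, and the metric definitions are assumptions, since the abstract does not define them: accuracy is taken under the best one-to-one mapping of cluster ids to classes, and normalized mutual information is normalized by the geometric mean of the two entropies.

```python
import math
from collections import Counter
from itertools import permutations


def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means; the first k points seed the centroids (toy choice)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: mean of the members; an empty cluster keeps its centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def clustering_accuracy(true, pred):
    """Fraction correct under the best one-to-one cluster-to-class mapping.

    Brute force over mappings, so only suitable for a handful of clusters,
    and it assumes there are no more clusters than classes.
    """
    classes, clusters = sorted(set(true)), sorted(set(pred))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        best = max(best, sum(mapping[p] == t for t, p in zip(true, pred)))
    return best / len(true)


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def nmi(true, pred):
    """Mutual information divided by the geometric mean of the entropies."""
    n = len(true)
    joint = Counter(zip(true, pred))
    ct, cp = Counter(true), Counter(pred)
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    denom = math.sqrt(entropy(true) * entropy(pred))
    return mi / denom if denom > 0 else 0.0


# Hypothetical per-document topic proportions (rows: documents, columns: topics).
doc_topics = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.8, 0.2, 0.0],
    [0.0, 0.9, 0.1],
    [0.7, 0.1, 0.2],
    [0.2, 0.7, 0.1],
]
# Hypothetical ground-truth classes; the ids are deliberately permuted
# relative to the cluster ids K-means will assign.
true_labels = [1, 0, 1, 0, 1, 0]

pred_labels = kmeans(doc_topics, k=2)
print("accuracy:", clustering_accuracy(true_labels, pred_labels))
print("NMI:", nmi(true_labels, pred_labels))
```

Both metrics ignore how cluster ids are numbered, which is why the permuted ground-truth labels above do not penalize the result.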


Metadata
Title
Cluster-based sparse topical coding for topic mining and document clustering
Authors
Parvin Ahmadi
Iman Gholampour
Mahmoud Tabandeh
Publication date
28-02-2017
Publisher
Springer Berlin Heidelberg
Published in
Advances in Data Analysis and Classification / Issue 3/2018
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI
https://doi.org/10.1007/s11634-017-0280-3
