Published in: International Journal of Machine Learning and Cybernetics 12/2018

26-04-2017 | Original Article

Self-organizing weighted incremental probabilistic latent semantic analysis

Authors: Ning Li, Wenjuan Luo, Kun Yang, Fuzhen Zhuang, Qing He, Zhongzhi Shi


Abstract

Probabilistic Latent Semantic Analysis (PLSA) is a popular topic modeling technique that has been widely applied in text mining to discover the underlying topics embedded in a data corpus. However, because data keep growing and changing, it is necessary to discover dynamic topics and to process large datasets incrementally. Moreover, PLSA models have difficulty performing inference on new documents. To overcome these problems, we propose a novel Weighted Incremental PLSA algorithm, called WIPLSA, that dynamically discovers topics and incrementally learns them from new documents. The experiments verify that the proposed WIPLSA can capture the dynamic topics hidden in a dynamically updated data corpus. Compared with PLSA, MAP PLSA and QB PLSA, WIPLSA achieves better perplexity on large datasets, which makes it applicable to big data mining. In addition, WIPLSA performs well in document categorization.
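
For orientation, the baseline that the abstract compares against can be sketched in a few lines. The Python sketch below is not the authors' WIPLSA algorithm. It is a minimal version of standard PLSA fitted by EM, together with a perplexity function, assuming the usual parameterisation P(w|d) = sum over z of P(z|d) P(w|z). The names plsa_em and perplexity, the toy count matrix, and the choice to compute perplexity on the fitted documents are illustrative assumptions; the abstract does not state the evaluation protocol.

```python
import math
import random


def _normalise(row):
    """Scale a non-negative list to sum to 1 (uniform if it is all zeros)."""
    total = sum(row)
    if total <= 0:
        return [1.0 / len(row)] * len(row)
    return [v / total for v in row]


def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Fit standard PLSA by EM on a document-word count matrix.

    counts[d][w] is the number of times word w occurs in document d.
    Returns (p_z_d, p_w_z), i.e. P(z|d) per document and P(w|z) per topic.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random positive initialisation breaks the symmetry between topics.
    p_z_d = [_normalise([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [_normalise([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        new_zd = [[0.0] * n_topics for _ in range(n_docs)]
        new_wz = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                norm = sum(joint)
                for z in range(n_topics):
                    resp = n_dw * joint[z] / norm
                    new_zd[d][z] += resp
                    new_wz[z][w] += resp
        # M-step: renormalise the expected counts into probabilities.
        p_z_d = [_normalise(row) for row in new_zd]
        p_w_z = [_normalise(row) for row in new_wz]
    return p_z_d, p_w_z


def perplexity(counts, p_z_d, p_w_z):
    """Perplexity exp(-log-likelihood / token count) on the given documents."""
    log_lik, n_tokens = 0.0, 0
    for d, row in enumerate(counts):
        for w, n_dw in enumerate(row):
            if n_dw:
                p_w_d = sum(p_z_d[d][z] * p_w_z[z][w]
                            for z in range(len(p_w_z)))
                log_lik += n_dw * math.log(p_w_d)
                n_tokens += n_dw
    return math.exp(-log_lik / n_tokens)


# Hypothetical toy corpus: two documents use words 0-1, two use words 2-3.
toy_counts = [[4, 3, 0, 0], [3, 4, 0, 0], [0, 0, 5, 2], [0, 0, 2, 5]]
doc_topics, topic_words = plsa_em(toy_counts, n_topics=2)
print(perplexity(toy_counts, doc_topics, topic_words))
```

In this formulation P(z|d) is a separate parameter for every training document, so a document that was not part of fitting has no topic mixture of its own. This is the inference problem for new documents that the abstract refers to, and lower perplexity indicates a better fit.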

Metadata
Title
Self-organizing weighted incremental probabilistic latent semantic analysis
Authors
Ning Li
Wenjuan Luo
Kun Yang
Fuzhen Zhuang
Qing He
Zhongzhi Shi
Publication date
26-04-2017
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 12/2018
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-017-0681-9
