21-04-2021 | Original Article

Text mining using nonnegative matrix factorization and latent semantic analysis

Authors: Ali Hassani, Amir Iranmanesh, Najme Mansouri

Published in: Neural Computing and Applications | Issue 20/2021

Abstract

Text clustering is considered one of the most important topics in modern data mining. However, text data require tokenization, which usually yields a very large and highly sparse term-document matrix that is difficult to process with conventional machine learning algorithms. Methods such as latent semantic analysis have helped mitigate this issue, but they are not completely stable in practice. We therefore propose a new feature agglomeration method based on nonnegative matrix factorization: the factorization separates the terms into groups, and each group’s term vectors are then agglomerated into a new feature vector. Together, these feature vectors form a new feature space that is much more suitable for clustering. In addition, we propose a new deterministic initialization for spherical K-means, which proves very useful for this type of data. To evaluate the proposed method, we compare it with some of the latest research in this field, as well as with some of the most widely used methods. Our experiments show that the proposed method either significantly improves clustering performance or matches the performance of other methods while producing more stable results.
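The feature agglomeration idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how terms are assigned to groups or how a group's term vectors are combined, so the sketch assumes that each term joins the NMF component on which it loads most heavily and that a group's term vectors are summed. The function names (`nmf`, `agglomerate_terms`), the multiplicative-update solver, the seeded random initialization, and the toy matrix are all illustrative assumptions.

```python
import random


def matmul(X, Y):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def nmf(A, k, iters=500, seed=0, eps=1e-9):
    """Approximate nonnegative A (terms x docs) as W @ H using the standard
    multiplicative updates for the Frobenius-norm objective."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same result
    m, n = len(A), len(A[0])
    W = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(m)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[j][d] * num[j][d] / (den[j][d] + eps) for d in range(n)]
             for j in range(k)]
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return W, H


def agglomerate_terms(A, k):
    """Group terms by their dominant NMF component (assumption), then sum each
    group's term vectors (assumption) to get k features per document."""
    W, _ = nmf(A, k)
    groups = [max(range(k), key=lambda j: row[j]) for row in W]
    n_docs = len(A[0])
    features = [[0.0] * k for _ in range(n_docs)]
    for term_row, group in zip(A, groups):
        for doc, value in enumerate(term_row):
            features[doc][group] += value
    return groups, features


# Toy term-document matrix: 6 terms (rows) x 4 documents (columns).
# Terms 0-2 occur only in documents 0-1, terms 3-5 only in documents 2-3.
A = [
    [2, 1, 0, 0],
    [4, 2, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 1, 2],
    [0, 0, 2, 4],
]

groups, features = agglomerate_terms(A, k=2)
print("term groups:", groups)
print("document features:", features)
```

In this sketch the six-dimensional term space collapses to two features per document, one per term group. The group labels themselves are arbitrary and depend on the initialization. For real corpora a library solver such as the NMF implementation in scikit-learn would be a more practical choice than the pure-Python loop above.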

Title: Text mining using nonnegative matrix factorization and latent semantic analysis
Authors: Ali Hassani, Amir Iranmanesh, Najme Mansouri
Publication date: 21-04-2021
Publisher: Springer London
Published in: Neural Computing and Applications / Issue 20/2021
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI: https://doi.org/10.1007/s00521-021-06014-6
