Published in: Neural Computing and Applications 20/2021

21 April 2021 | Original Article

Text mining using nonnegative matrix factorization and latent semantic analysis

Authors: Ali Hassani, Amir Iranmanesh, Najme Mansouri


Abstract

Text clustering is considered one of the most important topics in modern data mining. However, text data must be tokenized, which usually yields a very large and highly sparse term-document matrix that is difficult to process with conventional machine learning algorithms. Methods such as latent semantic analysis have helped mitigate this issue, but they are not completely stable in practice. We therefore propose a new feature agglomeration method based on nonnegative matrix factorization: the factorization separates the terms into groups, and each group’s term vectors are then agglomerated into a new feature vector. Together, these feature vectors form a new feature space that is much more suitable for clustering. In addition, we propose a new deterministic initialization for spherical K-means, which proves very useful for this type of data. To evaluate the proposed method, we compare it with some of the latest research in this field as well as some of the most widely used methods. Our experiments show that the proposed method either significantly improves clustering performance or matches the performance of other methods while improving the stability of the results.
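The abstract does not spell out the algorithm, so the following is only a rough sketch of the general idea under stated assumptions, not the authors' implementation. It assumes a nonnegative term-document matrix X with terms as rows, assigns each term to the NMF component with the largest weight, and agglomerates each group by summing its term vectors. The function names (nmf, agglomerate_terms, l2_normalize_rows, spherical_kmeans) are hypothetical, and the farthest-first seeding is a simple deterministic stand-in, not the initialization proposed in the paper.

```python
import numpy as np


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Factor X (terms x docs, nonnegative) as W @ H via multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_terms, n_docs = X.shape
    W = rng.random((n_terms, k)) + eps
    H = rng.random((k, n_docs)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def agglomerate_terms(X, k, **nmf_kwargs):
    """Group terms by dominant NMF component, then sum each group's term vectors.

    Assumption: "group" = argmax over a term's row of W, "agglomerate" = sum.
    Returns Z (docs x k) and the group label of every term.
    """
    W, _ = nmf(X, k, **nmf_kwargs)
    groups = W.argmax(axis=1)
    Z = np.zeros((X.shape[1], k))
    for j in range(k):
        Z[:, j] = X[groups == j].sum(axis=0)  # empty group yields a zero column
    return Z, groups


def l2_normalize_rows(Z):
    """Scale each document vector to unit length (needed for cosine similarity)."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, 1e-12)


def spherical_kmeans(U, k, n_iter=50):
    """Spherical K-means on unit-norm rows of U with a deterministic seeding.

    Seeding (placeholder, NOT the paper's method): start from the row closest
    to the mean direction, then repeatedly add the row least similar to the
    centroids chosen so far.
    """
    centers = [U[int(np.argmax(U @ U.sum(axis=0)))]]
    for _ in range(1, k):
        best_sim = np.max(U @ np.array(centers).T, axis=1)
        centers.append(U[int(np.argmin(best_sim))])
    C = np.array(centers, dtype=float)
    labels = np.argmax(U @ C.T, axis=1)
    for _ in range(n_iter):
        for j in range(k):
            members = U[labels == j]
            if len(members):
                c = members.sum(axis=0)
                C[j] = c / max(np.linalg.norm(c), 1e-12)
        new_labels = np.argmax(U @ C.T, axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, C


# Toy usage: 6 terms x 4 documents with two obvious topics.
X_toy = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [1, 1, 0, 0],
                  [0, 0, 2, 1], [0, 0, 1, 2], [0, 0, 1, 1]], dtype=float)
Z_toy, term_groups = agglomerate_terms(X_toy, k=2)
doc_labels, _ = spherical_kmeans(l2_normalize_rows(Z_toy), k=2)
```

Because every term lands in exactly one group, each document keeps its total term weight in the reduced space; only the dimensionality drops from the vocabulary size to k. In practice X would come from a TF-IDF or count vectorizer, and a library NMF could replace the hand-written updates.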

Metadata
Title
Text mining using nonnegative matrix factorization and latent semantic analysis
Authors
Ali Hassani
Amir Iranmanesh
Najme Mansouri
Publication date
21 April 2021
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 20/2021
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-021-06014-6
