Published in: Knowledge and Information Systems 3/2022

17-02-2022 | Regular Paper

Postimpact similarity: a similarity measure for effective grouping of unlabelled text using spectral clustering

Authors: Arnab Kumar Roy, Tanmay Basu

Abstract

The task of text clustering is to partition a set of text documents into meaningful groups such that the documents in a cluster are more similar to each other than to the documents of other clusters, according to a similarity or dissimilarity measure. The choice of similarity measure is therefore crucial for producing good-quality clusters. The content similarity between two documents, measured by the terms they share, is generally used to form individual clusters. However, content similarity alone may not be effective for a reasonably large and high-dimensional corpus. A similarity measure is therefore proposed here to improve the performance of text clustering using the spectral method. The proposed measure assigns a score to a pair of documents based on their content similarity and on the similarity of each document with their shared neighbours over the corpus. Its effectiveness has been tested by clustering different standard corpora with the spectral clustering method. Empirical results on some well-known text collections show that the proposed method performs better than state-of-the-art text clustering techniques in terms of normalized mutual information, f-measure and v-measure.
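The abstract does not state the formula of the proposed measure, so the following is only a rough sketch of the general idea it describes: a pairwise score that blends content similarity with the similarity of both documents to their shared neighbours. It is not the authors' postimpact similarity. The function names, the neighbour threshold `theta`, the mixing weight `alpha` and the min-based support term are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def neighbour_aware_similarity(docs, theta=0.1, alpha=0.5):
    """Blend content similarity with support from shared neighbours.

    theta (neighbour threshold) and alpha (mixing weight) are hypothetical
    parameters chosen for this sketch; they are not taken from the paper.
    """
    # Bag-of-words term counts from naive whitespace tokenisation.
    vecs = [Counter(d.lower().split()) for d in docs]
    n = len(vecs)
    content = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]

    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                sim[i][j] = 1.0
                continue
            # Shared neighbours: third documents similar to both i and j.
            shared = [
                k for k in range(n)
                if k not in (i, j)
                and content[i][k] > theta
                and content[j][k] > theta
            ]
            # Assumed support term: mean of the weaker link to each neighbour.
            support = (
                sum(min(content[i][k], content[j][k]) for k in shared) / len(shared)
                if shared else 0.0
            )
            sim[i][j] = alpha * content[i][j] + (1 - alpha) * support
    return sim


docs = [
    "spectral clustering of text documents",
    "clustering text documents with spectral methods",
    "football match results today",
    "today football results and match report",
]
S = neighbour_aware_similarity(docs)
```

The resulting matrix `S` is symmetric and non-negative, so it could in principle be passed as a precomputed affinity to a spectral clustering implementation, for example scikit-learn's `SpectralClustering(affinity="precomputed")`.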

Metadata
Title
Postimpact similarity: a similarity measure for effective grouping of unlabelled text using spectral clustering
Authors
Arnab Kumar Roy
Tanmay Basu
Publication date
17-02-2022
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 3/2022
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-022-01658-9
