2021 | OriginalPaper | Chapter

Clustering Research Papers: A Qualitative Study of Concatenated Power Means Sentence Embeddings over Centroid Sentence Embeddings

Authors: Devashish Gaikwad, Venkatesh Yelnoorkar, Atharva Jadhav, Yashodhara Haribhakta

Published in: Advances in Computing and Network Communications

Publisher: Springer Singapore

Abstract

The arithmetic mean of word embeddings is a common baseline for sentence embedding techniques, but it typically falls short of the performance of more complex models such as BERT and InferSent. There has been significant improvement in the field of sentence embeddings, especially towards the development of universal sentence encoders that can be used for transfer learning in a wide variety of downstream tasks. Academic paper retrieval systems are widely used in academic institutions to store and categorise scientific papers and to find connections between them using citation links, but these methods do not account for the content of the papers. For unsupervised clustering of these papers, a new sentence embedding approach is proposed using concatenated power means sentence embeddings and centroid sentence embeddings. The resulting sentence embeddings are clustered using the K-means clustering algorithm. The results show a clear increase of 47.94% in the cosine distance of nearest papers using concatenated power means sentence embeddings with respect to baseline centroid embeddings for the highest-performing GloVe models, proving that the computationally inexpensive P-Means sentence embeddings can be used for unsupervised clustering of scientific research papers using their abstracts.
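The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The toy word vectors, the helper names, and the choice of minimum, arithmetic mean and maximum as the power means are all assumptions made here, since the abstract does not list the powers used. The sketch builds a centroid embedding and a concatenated power means embedding for each abstract, then clusters the latter with a plain K-means loop.

```python
import math
import random

# Toy 3-dimensional word vectors. Hypothetical values, NOT real GloVe embeddings.
TOY_VECTORS = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.1],
    "training": [0.7, 0.3, 0.0],
    "protein": [0.0, 0.9, 0.2],
    "folding": [0.1, 0.8, 0.3],
    "cell": [0.0, 0.7, 0.4],
}


def word_vectors(text):
    """Look up vectors for known tokens; unknown tokens are skipped."""
    return [TOY_VECTORS[t] for t in text.lower().split() if t in TOY_VECTORS]


def centroid_embedding(text):
    """Baseline: per-dimension arithmetic mean of the word vectors."""
    vecs = word_vectors(text)
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def power_means_embedding(text):
    """Concatenate per-dimension min (p = -inf), mean (p = 1) and max (p = +inf).

    The set of powers is an assumption of this sketch.
    """
    cols = list(zip(*word_vectors(text)))
    minimum = [min(c) for c in cols]
    mean = [sum(c) / len(c) for c in cols]
    maximum = [max(c) for c in cols]
    return minimum + mean + maximum


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm with squared Euclidean distance."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical "abstracts", reduced to a few tokens each.
ABSTRACTS = [
    "neural network training",
    "network training",
    "protein folding cell",
    "protein cell",
]

embeddings = [power_means_embedding(a) for a in ABSTRACTS]
labels = kmeans(embeddings, k=2)
```

With three power means, the concatenated embedding is three times as long as the centroid embedding (9 values versus 3 here). On this toy data the two machine-learning abstracts and the two biology abstracts land in separate clusters. Real use would replace `TOY_VECTORS` with pretrained word vectors and would likely use a library K-means instead of the hand-written loop.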

Metadata
Title
Clustering Research Papers: A Qualitative Study of Concatenated Power Means Sentence Embeddings over Centroid Sentence Embeddings
Authors
Devashish Gaikwad
Venkatesh Yelnoorkar
Atharva Jadhav
Yashodhara Haribhakta
Copyright Year
2021
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-33-6987-0_26