Classification of Scientific Texts Based on the Compression of Annotations to Publications

Selivanova, I. V.; Kosyakov, D. V.; Guskov, A. E.

doi:10.3103/S0005105519060062

AUTOMATION OF TEXT PROCESSING
Published: 26 February 2020

Automatic Documentation and Mathematical Linguistics
Volume 53, pages 329–342 (2019)

I. V. Selivanova¹,
D. V. Kosyakov¹ &
A. E. Guskov¹

Abstract

This paper examines whether the semantic proximity of scientific texts can be established by automatically classifying them based on the compression of their annotations. The method rests on the observation that compression algorithms such as PPM (prediction by partial matching) compress terminologically similar texts much better than terminologically distant ones. If a kernel of publications (an analogue of a training set) is formed for each topic, then the kernel that yields the best compression ratio for a text indicates the topic to which that text belongs. Thirty thematic categories were defined; for each of them, annotations of approximately 500 publications were retrieved from the Scopus database, from which 100 annotations for the kernel and 20 annotations for testing were selected in different ways. Building the kernel from highly cited publications gave an error rate of up to 12%, against 32% for random sampling. The quality of classification also depends on the initial number of categories: the fewer categories take part in the classification and the greater the terminological differences between them, the higher the quality.
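
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Python's standard library has no PPM coder, so `zlib` (DEFLATE) stands in for it here. The decision rule (pick the kernel that adds the fewest compressed bytes when the text is appended to it) is one plausible reading of "best compression ratio" and is an assumption. The function names, the two toy kernels and the test sentences are all invented for the example.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the DEFLATE-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(kernel: str, text: str) -> int:
    """Bytes needed for text once the compressor has already seen kernel."""
    return compressed_size(kernel + "\n" + text) - compressed_size(kernel)


def classify(text: str, kernels: dict[str, str]) -> str:
    """Return the topic whose kernel lets text compress into the fewest bytes."""
    return min(kernels, key=lambda topic: extra_bytes(kernels[topic], text))


def error_rate(test_set: list[tuple[str, str]], kernels: dict[str, str]) -> float:
    """Share of (text, true_topic) pairs that classify() puts in a wrong topic."""
    wrong = sum(1 for text, topic in test_set if classify(text, kernels) != topic)
    return wrong / len(test_set)


# Toy kernels, invented for illustration. The study uses 100 annotations per
# topic; DEFLATE only looks back 32 KiB, so kernels of that size would need a
# compressor with a longer context (assumption: e.g. lzma or a real PPM coder).
KERNELS = {
    "physics": (
        "quantum entanglement of photons in an optical lattice. "
        "laser cooling of trapped ions and quantum states. "
        "photon scattering in a superconducting quantum circuit."
    ),
    "biology": (
        "gene expression in bacterial cells under stress. "
        "protein folding and enzyme activity in cell membranes. "
        "genome sequencing of bacterial protein families."
    ),
}

# Hypothetical test annotations paired with their true topic.
TEST_SET = [
    ("quantum states of trapped ions in an optical lattice", "physics"),
    ("enzyme activity and protein folding in bacterial cells", "biology"),
]

if __name__ == "__main__":
    for text, topic in TEST_SET:
        print(topic, "->", classify(text, KERNELS))
    print("error rate:", error_rate(TEST_SET, KERNELS))
```

Measuring the increase over the kernel's own compressed size, rather than the size of the combined text, keeps kernels of different lengths comparable. Swapping the way the kernel is chosen (for example, highly cited publications versus a random sample) only changes what goes into `KERNELS`; `error_rate` can then compare the two selections on the same test set.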

Author information

Authors and Affiliations

State Public Library of Scientific and Technical Information, Siberian Branch, Russian Academy of Sciences, 630102, Novosibirsk, Russia
I. V. Selivanova, D. V. Kosyakov & A. E. Guskov

Corresponding authors

Correspondence to I. V. Selivanova, D. V. Kosyakov or A. E. Guskov.

Ethics declarations

The authors declare that they have no conflicts of interest.

Additional information

Translated by L. Solovyova

About this article

Cite this article

Selivanova, I.V., Kosyakov, D.V. & Guskov, A.E. Classification of Scientific Texts Based on the Compression of Annotations to Publications. Autom. Doc. Math. Linguist. 53, 329–342 (2019). https://doi.org/10.3103/S0005105519060062

Received: 15 October 2019
Published: 26 February 2020
Issue Date: November 2019
DOI: https://doi.org/10.3103/S0005105519060062
