Abstract
This paper examines whether the semantic proximity of scientific texts can be established by automatically classifying them through compression of their annotations. The idea of the method is that compression algorithms such as PPM (prediction by partial matching) compress terminologically similar texts much better than terminologically distant ones. If a kernel of publications (an analogue of a training set) is formed for each topic, then the topic whose kernel yields the best compression ratio for a classified text is taken as the topic of that text. Thirty thematic categories were defined; for each of them, annotations of approximately 500 publications were retrieved from the Scopus database, and from these, 100 annotations for the kernel and 20 annotations for testing were selected in several different ways. Building the kernel from highly cited publications gave an error rate of up to 12%, against 32% for random sampling. The quality of classification also depends on the initial number of categories: the fewer categories take part in the classification and the greater the terminological differences between them, the higher the quality.
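The decision rule described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not from the paper: `zlib` (DEFLATE) from the Python standard library stands in for PPM, the two toy kernels are invented for illustration, and the names `compressed_size`, `extra_bytes`, `classify`, and `TOY_KERNELS` are hypothetical. The text is assigned to the topic whose kernel makes it cheapest to compress.

```python
import zlib


def compressed_size(text):
    """Size in bytes of the text after compression (zlib is a stand-in for PPM)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(kernel, text):
    """Additional compressed bytes needed to encode `text` once `kernel` is known."""
    return compressed_size(kernel + " " + text) - compressed_size(kernel)


def classify(text, kernels):
    """Return the topic whose kernel compresses `text` best (fewest extra bytes)."""
    return min(kernels, key=lambda topic: extra_bytes(kernels[topic], text))


# Hypothetical toy kernels; the paper builds each kernel from 100 Scopus annotations.
TOY_KERNELS = {
    "physics": (
        "quantum entanglement of photons in optical lattices. "
        "superconducting qubits and decoherence of quantum states. "
        "photon scattering in superconducting circuits."
    ),
    "biology": (
        "gene expression in mammalian cells. "
        "protein folding and enzyme kinetics. "
        "regulation of gene expression by transcription factors."
    ),
}

if __name__ == "__main__":
    sample = "decoherence of quantum states in superconducting qubits"
    for topic, kernel in TOY_KERNELS.items():
        print(topic, extra_bytes(kernel, sample))
    print("assigned topic:", classify(sample, TOY_KERNELS))
```

Measuring the extra bytes relative to the compressed kernel, rather than the total size, keeps kernels of different lengths comparable. A general-purpose compressor such as `zlib` only approximates the behaviour of PPM, so error rates from this sketch should not be compared with those reported in the paper.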
Ethics declarations
The authors declare that they have no conflicts of interest.
Additional information
Translated by L. Solovyova
About this article
Cite this article
Selivanova, I.V., Kosyakov, D.V. & Guskov, A.E. Classification of Scientific Texts Based on the Compression of Annotations to Publications. Autom. Doc. Math. Linguist. 53, 329–342 (2019). https://doi.org/10.3103/S0005105519060062
DOI: https://doi.org/10.3103/S0005105519060062