Classification of Scientific Texts Based on the Compression of Annotations to Publications

  • AUTOMATION OF TEXT PROCESSING
Automatic Documentation and Mathematical Linguistics

Abstract

This paper examines whether the semantic proximity of scientific texts can be established by automatically classifying them based on the compression of their annotations. The method rests on the observation that compression algorithms such as PPM (prediction by partial matching) compress terminologically similar texts much better than terminologically distant ones. If a kernel of publications (an analogue of a training set) is formed for each topic, then the kernel that yields the best compression ratio indicates the topic to which the classified text belongs. Thirty thematic categories were defined; for each of them, annotations of approximately 500 publications were retrieved from the Scopus database, from which 100 annotations for the kernel and 20 annotations for testing were selected in different ways. Building the kernel from highly cited publications gave an error rate of up to 12%, against 32% for random sampling. The quality of classification also depends on the initial number of categories: the fewer categories take part in the classification and the greater the terminological differences between them, the higher the quality.
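
The kernel-based decision rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: PPM is not available in the Python standard library, so zlib (DEFLATE) is used here purely as a stand-in compressor, and the function names `compressed_size`, `extra_bytes`, and `classify`, as well as the toy kernels, are hypothetical. The sketch assumes that the cost of a text given a kernel can be approximated by how many additional compressed bytes the text adds when appended to that kernel.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Size in bytes of the data after compression.

    Assumption: zlib stands in for PPM, which the abstract names
    but which is not part of the standard library.
    """
    return len(zlib.compress(data, 9))


def extra_bytes(kernel: str, text: str) -> int:
    """Additional compressed bytes needed to encode `text` after `kernel`.

    A small value means the kernel already contains the terminology
    of the text, so the compressor can reuse it.
    """
    k = kernel.encode("utf-8")
    t = text.encode("utf-8")
    return compressed_size(k + b" " + t) - compressed_size(k)


def classify(text: str, kernels: Mapping[str, str]) -> str:
    """Return the topic whose kernel compresses `text` best."""
    return min(kernels, key=lambda topic: extra_bytes(kernels[topic], text))


if __name__ == "__main__":
    # Toy kernels; in the paper each kernel holds 100 annotations.
    kernels = {
        "physics": "quantum field theory describes particle interactions. "
                   "the scattering amplitude of quarks is computed.",
        "biology": "gene expression regulates protein synthesis in cells. "
                   "the enzyme binds the substrate in the reaction.",
    }
    print(classify("the scattering amplitude of quarks is computed.", kernels))
```

With kernels as small as these toy strings, the differences in compressed size are only a few bytes, so the result is only indicative; the abstract's kernels of 100 annotations per topic give the compressor far more terminology to reuse.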

Author information

Corresponding authors

Correspondence to I. V. Selivanova, D. V. Kosyakov or A. E. Guskov.

Ethics declarations

The authors declare that they have no conflicts of interest.

Additional information

Translated by L. Solovyova

About this article

Cite this article

Selivanova, I.V., Kosyakov, D.V. & Guskov, A.E. Classification of Scientific Texts Based on the Compression of Annotations to Publications. Autom. Doc. Math. Linguist. 53, 329–342 (2019). https://doi.org/10.3103/S0005105519060062

  • DOI: https://doi.org/10.3103/S0005105519060062
