Published in: Automatic Documentation and Mathematical Linguistics 6/2019

01-11-2019 | AUTOMATION OF TEXT PROCESSING

Classification of Scientific Texts Based on the Compression of Annotations to Publications

Authors: I. V. Selivanova, D. V. Kosyakov, A. E. Guskov

Abstract

This paper examines whether the semantic proximity of scientific texts can be established through automatic classification based on the compression of their annotations (abstracts). The method rests on the observation that compression algorithms such as PPM (prediction by partial matching) compress terminologically similar texts much better than terminologically distant ones. If a kernel of publications (an analogue of a training set) is formed for each topic, then the kernel that yields the best compression ratio indicates the topic to which the classified text belongs. Thirty thematic categories were defined; for each, annotations of approximately 500 publications were retrieved from the Scopus database, from which 100 annotations for the kernel and 20 annotations for testing were selected in different ways. Building the kernel from highly cited publications gave an error rate of up to 12%, compared with 32% for random sampling. Classification quality also depends on the initial number of categories: the fewer categories take part in the classification and the greater the terminological differences between them, the higher the quality.
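The compression idea can be illustrated with a minimal sketch. This is not the authors' implementation: zlib from the Python standard library stands in for PPM, the two kernels are invented toy strings rather than sets of 100 Scopus annotations, and the names `compressed_size`, `extra_bytes`, and `classify` are hypothetical. For each topic, the sketch measures how many bytes the compressed kernel grows by when the text is appended, and assigns the text to the topic with the smallest growth.

```python
import zlib
from typing import Mapping


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(kernel: str, text: str) -> int:
    """Growth of the compressed kernel when text is appended to it."""
    return compressed_size(kernel + " " + text) - compressed_size(kernel)


def classify(kernels: Mapping[str, str], text: str) -> str:
    """Return the topic whose kernel lets the text compress best."""
    return min(kernels, key=lambda topic: extra_bytes(kernels[topic], text))


# Toy kernels (assumption): stand-ins for the per-topic sets of annotations.
kernels = {
    "chemistry": (
        "The catalyst accelerates the oxidation reaction of the organic "
        "solvent. The polymer synthesis requires a catalyst and an organic "
        "solvent at high temperature. Oxidation of the polymer changes the "
        "reaction rate."
    ),
    "astronomy": (
        "The telescope observed the spiral galaxy and its supermassive black "
        "hole. The stellar spectrum of the galaxy shows redshift. The "
        "telescope measured the orbit of the exoplanet around the star."
    ),
}

if __name__ == "__main__":
    sample = "The oxidation reaction of the polymer requires a catalyst and an organic solvent."
    print(classify(kernels, sample))
```

A text that shares terminology with a kernel reuses strings the compressor has already seen, so it adds fewer bytes. With kernels this small the effect is only indicative; the error rates reported in the abstract refer to the authors' experiments, not to this sketch.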
Metadata
Title
Classification of Scientific Texts Based on the Compression of Annotations to Publications
Authors
I. V. Selivanova
D. V. Kosyakov
A. E. Guskov
Publication date
01-11-2019
Publisher
Pleiades Publishing
Published in
Automatic Documentation and Mathematical Linguistics / Issue 6/2019
Print ISSN: 0005-1055
Electronic ISSN: 1934-8371
DOI
https://doi.org/10.3103/S0005105519060062
