Published in: Automatic Documentation and Mathematical Linguistics 4/2021

July 1, 2021 | TEXT PROCESSING AUTOMATION

Expert, Journal, and Automatic Classification of Full Texts and Annotations of Scientific Articles

Authors: I. V. Selivanova, D. V. Kosyakov, D. A. Dubovitskii, A. E. Guskov

Abstract

In this article, we consider a fundamentally new information-theoretic approach to classifying scientific texts, based on compression algorithms. A comparative classification of full-text documents from arXiv.org and short annotations from Scopus showed that the proposed method reaches an accuracy of 87–92% and is, in general, not inferior to existing methods. An expert assessment confirmed these conclusions.
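The abstract does not say which compressor or decision rule the authors use, so the following is only a minimal sketch of the general idea behind compression-based classification, not the method of the article. It assumes zlib as the compressor and assigns a document to the class whose reference text makes it cheapest to encode. The function names (`compressed_size`, `extra_bytes`, `classify`) and the toy reference texts are hypothetical and were invented for illustration.

```python
import zlib


def compressed_size(text: str) -> int:
    """Return the zlib-compressed size of `text` in bytes (level 9)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(reference: str, document: str) -> int:
    """Bytes added when `document` is compressed together with `reference`.

    A small value means the compressor could reuse patterns it had
    already seen in `reference`, i.e. the two texts are similar.
    """
    return compressed_size(reference + " " + document) - compressed_size(reference)


def classify(document: str, class_texts: dict) -> str:
    """Pick the class whose reference text makes `document` cheapest to encode."""
    return min(class_texts, key=lambda label: extra_bytes(class_texts[label], document))


# Toy reference texts, invented for illustration only (an assumption,
# not data from the article).
CLASS_TEXTS = {
    "physics": (
        "The quantum field theory of gauge bosons predicts the "
        "scattering cross section of particles in the collider."
    ),
    "biology": (
        "The protein expression in the cell membrane regulates "
        "gene transcription during embryonic development."
    ),
}

label = classify("scattering cross section of particles in the collider", CLASS_TEXTS)
print(label)
```

In this sketch the only class-specific knowledge is the reference text itself: no features are extracted and no model is trained. A realistic setup would use much larger reference corpora per class, and the choice of compressor would likely affect accuracy.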
Metadata
Title
Expert, Journal, and Automatic Classification of Full Texts and Annotations of Scientific Articles
Authors
I. V. Selivanova
D. V. Kosyakov
D. A. Dubovitskii
A. E. Guskov
Publication date
July 1, 2021
Publisher
Pleiades Publishing
Published in
Automatic Documentation and Mathematical Linguistics / Issue 4/2021
Print ISSN: 0005-1055
Electronic ISSN: 1934-8371
DOI
https://doi.org/10.3103/S0005105521040075
