Published in: Automatic Control and Computer Sciences 7/2019

01-12-2019

Analysis of Influence of Different Relations Types on the Quality of Thesaurus Application to Text Classification Problems

Authors: N. S. Lagutina, K. V. Lagutina, I. A. Shchitov, I. V. Paramonov


Abstract

The main purpose of the article is to analyze how effectively different types of thesaurus relations can be used to solve text classification tasks. The study is based on an automatically generated domain thesaurus that contains three types of relations: synonymous, hierarchical, and associative. To generate the thesaurus, the authors use a hybrid method that combines several linguistic and statistical algorithms for extracting semantic relations. The method makes it possible to create a thesaurus with a sufficiently large number of terms and relations among them. The authors consider two problems: topical text classification and sentiment classification of large newspaper articles. To solve them, the authors developed two approaches that complement standard algorithms with a procedure that takes thesaurus relations into account to determine the semantic features of texts. The approach to topical classification combines the standard unsupervised BM25 algorithm with a procedure that takes into account the synonymous and hierarchical relations of the domain thesaurus. The approach to sentiment classification consists of two steps. In the first step, a thesaurus is created in which the polarity weight of each term is calculated from either the term's occurrences in the training set or the weights of related thesaurus terms. In the second step, the thesaurus is used to compute features of the words in the texts, and the texts are classified with the SVM or Naive Bayes algorithm. In experiments with the BBCSport, Reuters, and PubMed text corpora and a corpus of articles about American immigrants, the authors varied the types of thesaurus relations involved in the classification and the degree of their use. The results make it possible to evaluate how effectively thesaurus relations can be applied to the classification of raw texts and to determine the conditions under which particular relations have a greater or lesser effect. In particular, the most useful thesaurus relations are synonymous and hierarchical, as they provide better classification quality.
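
The topical approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not give these details. It assumes that each topic is represented by a few seed terms, that these terms are expanded with their synonyms and hypernyms from a thesaurus, and that a document is assigned to the topic with the highest BM25 score. The toy thesaurus, the function names, the expansion weight of 0.5, and the BM25 parameters k1 = 1.5 and b = 0.75 are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus: term -> related terms grouped by relation type.
THESAURUS = {
    "football": {"synonym": ["soccer"], "hypernym": ["sport"]},
    "finance": {"synonym": ["banking"], "hypernym": ["economy"]},
}


def expand(terms, thesaurus, relations, weight=0.5):
    """Map each seed term to weight 1.0 and add related terms with a reduced weight."""
    weights = Counter()
    for term in terms:
        weights[term] += 1.0
        for relation in relations:
            for related in thesaurus.get(term, {}).get(relation, []):
                weights[related] += weight
    return weights


def bm25_score(query_weights, doc_tokens, doc_freq, n_docs, avg_len, k1=1.5, b=0.75):
    """Okapi BM25 score of one document for a weighted bag of query terms."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term, query_weight in query_weights.items():
        f = tf.get(term, 0)
        if f == 0:
            continue
        n_t = doc_freq.get(term, 0)
        # Smoothed IDF variant that stays non-negative.
        idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
        norm = f + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += query_weight * idf * f * (k1 + 1) / norm
    return score


def topic_scores(doc_tokens, topic_terms, corpus, thesaurus, relations):
    """Score a document against every topic, using the chosen relation types."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    doc_freq = Counter(t for d in corpus for t in set(d))
    return {
        topic: bm25_score(expand(terms, thesaurus, relations),
                          doc_tokens, doc_freq, n_docs, avg_len)
        for topic, terms in topic_terms.items()
    }


def classify(doc_tokens, topic_terms, corpus, thesaurus, relations):
    """Return the topic with the highest BM25 score."""
    scores = topic_scores(doc_tokens, topic_terms, corpus, thesaurus, relations)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    corpus = [["soccer", "match", "goal"],
              ["banking", "market", "shares"],
              ["tennis", "serve", "ace"]]
    topics = {"sport": ["football"], "business": ["finance"]}
    # Vary the relation types, as in the experiments the abstract describes.
    for relations in [(), ("synonym",), ("synonym", "hypernym")]:
        print(relations, topic_scores(corpus[0], topics, corpus, THESAURUS, relations))
```

Passing a different `relations` tuple switches individual relation types on or off, which mirrors the kind of comparison the abstract reports. A document that mentions only "soccer" matches the "sport" topic only when synonym expansion is enabled.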
Metadata
Title
Analysis of Influence of Different Relations Types on the Quality of Thesaurus Application to Text Classification Problems
Authors
N. S. Lagutina
K. V. Lagutina
I. A. Shchitov
I. V. Paramonov
Publication date
01-12-2019
Publisher
Pleiades Publishing
Published in
Automatic Control and Computer Sciences / Issue 7/2019
Print ISSN: 0146-4116
Electronic ISSN: 1558-108X
DOI
https://doi.org/10.3103/S0146411619070277
