Published in: Automatic Documentation and Mathematical Linguistics 6/2021

01-11-2021 | TEXT PROCESSING AUTOMATION

The Problems and Methods of Automatic Text Document Classification

Author: V. A. Yatsko

Abstract

This paper reviews the main problems and methods of automatic text classification. It focuses on the choice of source linguistic material, neutralization of discrepancies in text sizes, application of distance-based and dictionary-based approaches to classification, reduction of text dimensionality, creation of dictionaries, adequate term weighting, and the training and functioning of a classifier program. The author describes the procedures of text undersampling and logarithmic alignment, and presents algorithms for computing the cosine similarity measure and the Z-score in a comprehensible form. The paper also describes the peculiarities of using Bayes' theorem for part-of-speech classification and spam filtering.
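
For orientation, the two measures named in the abstract can be sketched with their standard textbook definitions. This is a minimal illustration only, not the paper's own algorithms, which are not reproduced here. The function names, the whitespace tokenization, the use of raw term frequencies as weights, and the population (rather than sample) standard deviation are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import Sequence


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two raw term-frequency vectors.

    Assumption: terms are lowercased, whitespace-separated tokens.
    """
    freq_a = Counter(text_a.lower().split())
    freq_b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def z_score(value: float, sample: Sequence[float]) -> float:
    """Distance of value from the sample mean, in population standard deviations."""
    mean = sum(sample) / len(sample)
    variance = sum((x - mean) ** 2 for x in sample) / len(sample)
    std = math.sqrt(variance)
    if std == 0:
        return 0.0  # all observations identical: no spread to measure against
    return (value - mean) / std
```

Because the cosine measure divides by both vector lengths, it does not depend on how long either text is, which is one reason it suits the comparison of texts of different sizes.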
Metadata
Title: The Problems and Methods of Automatic Text Document Classification
Author: V. A. Yatsko
Publication date: 01-11-2021
Publisher: Pleiades Publishing
Published in: Automatic Documentation and Mathematical Linguistics, Issue 6/2021
Print ISSN: 0005-1055
Electronic ISSN: 1934-8371
DOI: https://doi.org/10.3103/S0005105521060030
