Published in: Automatic Documentation and Mathematical Linguistics 6/2021

01.11.2021 | TEXT PROCESSING AUTOMATION

The Problems and Methods of Automatic Text Document Classification

Author: V. A. Yatsko

Abstract

This paper reviews the main problems and methods of automatic text classification. It focuses on the choice of source linguistic material, the neutralization of discrepancies in text sizes, the application of distance-based and dictionary-based approaches to classification, the reduction of text dimensionality, the creation of dictionaries, adequate term weighting, and the training and functioning of a classifier program. The author describes, in a comprehensible form, the procedures of text undersampling and logarithmic alignment, as well as algorithms for computing the cosine similarity measure and the Z-score. The paper also describes the peculiarities of using Bayes' theorem for part-of-speech classification and spam filtering.
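The abstract names the cosine similarity measure and the Z-score without giving formulas. As a rough illustration only, the sketch below uses the standard textbook definitions: cosine similarity between two sparse term-frequency vectors, and the Z-score as the deviation from the mean in units of the population standard deviation. The function names `cosine_similarity` and `z_scores` and the sample texts are assumptions made for this example; the variants described in the paper itself may differ.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty texts
    return dot / (norm_a * norm_b)


def z_scores(values):
    """Standard scores: (x - mean) / population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0] * n  # all values equal: no deviation to report
    return [(v - mean) / std for v in values]


# Hypothetical sample texts, tokenized by whitespace.
doc1 = Counter("spam filter spam".split())
doc2 = Counter("spam filter ham".split())
similarity = cosine_similarity(doc1, doc2)
scores = z_scores([1, 2, 3])
```

Because cosine similarity depends on the angle between the vectors rather than on their length, it is largely insensitive to differences in text size, which is one reason it is a common choice for distance-based classification.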
Metadata
Title
The Problems and Methods of Automatic Text Document Classification
Author
V. A. Yatsko
Publication date
01.11.2021
Publisher
Pleiades Publishing
Published in
Automatic Documentation and Mathematical Linguistics / Issue 6/2021
Print ISSN: 0005-1055
Electronic ISSN: 1934-8371
DOI
https://doi.org/10.3103/S0005105521060030
