01-11-2021 | TEXT PROCESSING AUTOMATION
The Problems and Methods of Automatic Text Document Classification
Published in: Automatic Documentation and Mathematical Linguistics | Issue 6/2021
Abstract
This paper reviews the main problems and methods of automatic text classification. It focuses on the choice of source linguistic material, neutralization of discrepancies in text sizes, application of distance-based and dictionary-based approaches to classification, reduction of text dimensionality, creation of dictionaries, adequate term weighting, and the training and operation of a classifier program. The author describes, in a comprehensible form, the procedures of text undersampling and logarithmic alignment and the algorithms for computing the cosine similarity measure and the Z-score. The paper also describes the peculiarities of using Bayes' theorem for part-of-speech classification and spam filtering.
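As a rough illustration of two of the measures named in the abstract, the following Python sketch computes a cosine similarity between two texts and a standard score. It is not taken from the paper. The function names `cosine_similarity` and `z_score` are hypothetical, the whitespace tokenization and raw term counts are simplifying assumptions, and the Z-score shown is the ordinary standard score (value minus mean, divided by the population standard deviation), which may differ from the variant the author uses.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between raw term-count vectors of two texts.

    Assumption: lowercase whitespace tokenization, no term weighting.
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the terms the two texts share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def z_score(value, sample):
    """Standard score of `value` relative to `sample`.

    Assumption: population standard deviation (divide by n).
    """
    mean = sum(sample) / len(sample)
    std = math.sqrt(sum((x - mean) ** 2 for x in sample) / len(sample))
    if std == 0:
        return 0.0  # all observations identical, no spread to scale by
    return (value - mean) / std
```

In a distance-based classifier of the kind the abstract mentions, a document could be assigned to the class whose reference text gives the highest cosine similarity. That decision rule is one common choice and is not confirmed by the source.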