Abstract
The parameter k is the most important setting in a text categorization system based on the k-nearest neighbor algorithm (kNN). To classify a new document, the k documents in the training set nearest to it are determined first; the categories of the document are then predicted from the category distribution among those k neighbors. In practice the class distribution of a training set is rarely even: some classes have many more samples than others. The system's performance is therefore very sensitive to the choice of k, and a fixed k value is likely to bias predictions toward large categories and to under-use the information in the training set. To address these problems, this article proposes an improved kNN strategy in which a different number of nearest neighbors is used for each category instead of a fixed number across all categories. More neighbors are consulted when deciding whether a test document belongs to a category that has more samples in the training set; the number of neighbors selected for each category adapts to its sample size. Experiments on two different datasets show that our methods are less sensitive to the parameter k than the traditional ones and can correctly classify documents belonging to small classes even with a large k. The strategy is especially applicable and promising when estimating k via cross-validation is not possible and the class distribution of the training set is skewed.
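The per-category rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the proportional formula `k_c = ceil(k * n_c / n_max)`, the normalization of votes by the number of neighbors considered, and all function and variable names are assumptions chosen to make the idea concrete.

```python
import math

def adaptive_knn_predict(neighbors, class_counts, k):
    """Classify a document from its ranked nearest neighbors.

    neighbors:     labels of training documents, sorted by decreasing
                   similarity to the test document (at least k of them).
    class_counts:  mapping {label: number of training samples}.
    k:             neighbor budget for the largest class; smaller
                   classes are judged on proportionally shorter lists.
    """
    n_max = max(class_counts.values())
    scores = {}
    for label, n_c in class_counts.items():
        # Assumed proportional rule: a class with fewer training
        # samples gets a smaller neighbor count (at least 1).
        k_c = max(1, math.ceil(k * n_c / n_max))
        votes = sum(1 for lab in neighbors[:k_c] if lab == label)
        # Normalize by the neighbors actually considered, so small
        # classes are not penalized for their shorter lists.
        scores[label] = votes / k_c
    return max(scores, key=scores.get)

# With a 90/10 class imbalance and k = 6, the small class is judged
# on only its single nearest neighbor, so one very close "small"
# document can outvote a majority of "big" neighbors further out.
ranked = ["small", "big", "big", "big", "big", "big"]
print(adaptive_knn_predict(ranked, {"big": 90, "small": 10}, k=6))
```

A fixed-k majority vote over the same six neighbors would return "big" here; the adaptive rule lets the minority class win on the strength of its closest match, which is the behavior the abstract claims for skewed training sets.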