skip to main content
article

An adaptive k-nearest neighbor text categorization strategy

Published:01 December 2004Publication History
Skip Abstract Section

Abstract

k is the most important parameter in a text categorization system based on the k-nearest neighbor algorithm (kNN). To classify a new document, the k-nearest documents in the training set are determined first. The prediction of categories for this document can then be made according to the category distribution among the k nearest neighbors. Generally speaking, the class distribution in a training set is not even; some classes may have more samples than others. The system's performance is very sensitive to the choice of the parameter k. And it is very likely that a fixed k value will result in a bias for large categories, and will not make full use of the information in the training set. To deal with these problems, an improved kNN strategy, in which different numbers of nearest neighbors for different categories are used instead of a fixed number across all categories, is proposed in this article. More samples (nearest neighbors) will be used to decide whether a test document should be classified in a category that has more samples in the training set. The numbers of nearest neighbors selected for different categories are adaptive to their sample size in the training set. Experiments on two different datasets show that our methods are less sensitive to the parameter k than the traditional ones, and can properly classify documents belonging to smaller classes with a large k. The strategy is especially applicable and promising for cases where estimating the parameter k via cross-validation is not possible and the class distribution of a training set is skewed.

References

  1. Allan, J. 2002. Topic Detection and Tracking: Event-based Information Organization. Kluwer Academic Boston, MA.]] Google ScholarGoogle Scholar
  2. Cardoso-Cachopo, A., and Olivera, A. L. 2003. An empirical comparison of text categorization methods. In Proceedings of the 10th International Symposium on String Processing and Information Retrieval (Manaus, Brazil, Oct.8--10, 2003). M.A. Nasciente et al. eds. Springer--Verlag, Heidelberg. 183--196.]]Google ScholarGoogle Scholar
  3. Dasarathy, B.V. 1991. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Las Alamitos, CA.]]Google ScholarGoogle Scholar
  4. Han, E. H., Karypis, G., and Kumar, V. 2001. Text categorization using weight adjusted k-nearest neighbor classification. In Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining (Hong Kong, April 16--18, 2001). D. Cheung, et al. eds. Springer-Verlag, Heidelberg. 53--65.]] Google ScholarGoogle Scholar
  5. He, J., Tan, A.H., and Tan, C. L. 2000. Machine learning methods for Chinese web page categorization. In Proceedings of the ACL'2000 2nd Workshop on Chinese Language Processing (Hong Kong, Oct. 2000). 93--100.]] Google ScholarGoogle Scholar
  6. Joachims, T. 1998. Text categorization with support vector machines: learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (Chemnitz, Germany, April 21--24, 1998). 137--142.]] Google ScholarGoogle Scholar
  7. Lang, K. 1995. Newsweeder: learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning (Tahoe City, CA, July 9--12, 1995). A. Prieditis et al. eds. Morgan Kaufmann. 331--339.]]Google ScholarGoogle Scholar
  8. Manning, C. D. and Schutze, H. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.]] Google ScholarGoogle Scholar
  9. Masand, B., Linoff, G., and Waltz, D. 1992. Classifying news stories using memory based reasoning. In Proceedings of 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Copenhagen, June 21--24, 1992). N. J. Belkin et al. eds. ACM Press, New York. 59--64.]] Google ScholarGoogle Scholar
  10. Mitchell, T. 1997. Machine Learning. McGraw Hill, New York.]] Google ScholarGoogle Scholar
  11. Porter, M. F. 1980. An algorithm for suffix stripping. Program 14, 3 (1980), 130--137.]]Google ScholarGoogle Scholar
  12. Salton, G. 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Longman Publishing, Boston, MA.]] Google ScholarGoogle Scholar
  13. Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys 34, 1 (2002), 1--47.]] Google ScholarGoogle Scholar
  14. Yang, Y. 1994. Expert network: effective and efficient learning from human decisions in text categorization and retrieval. In Proceedings of the 17th International Conference on Research and Development in Information Retrieval (SIGIR'94, Dublin, July 3--6, 1994). W.B. Croft et al. eds. ACM/Springer. 13--22.]] Google ScholarGoogle Scholar
  15. Yang, Y. 1999. An evaluation of statistical approaches to text categorization. J. Information Retrieval 1, 1/2 (1999), 67--88.]] Google ScholarGoogle Scholar
  16. Yang, Y., Ault, T., Pierce, T., and Lattimer, C.W. 2000. Improving text categorization methods for event tracking. In Proceedings of the 23rd International Conference on Research and Development in Information Retrieval (SIGIR-2000, Athens, July 24--28, 2000). N. J. Belkin et al. eds. ACM Press, New York, 65--72.]] Google ScholarGoogle Scholar
  17. Yang, Y. and Chute, C. G. 1994. An example-based mapping method for text categorization and retrieval. ACM Trans. on Information Systems 12, 3 (1994), 252--277.]] Google ScholarGoogle Scholar
  18. Yang, Y. and Liu, X. 1999. A re-examination of text categorization methods. In Proceedings of 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Berkeley, CA, Aug. 15--19, 1999). ACM Press, New York, 42--49.]] Google ScholarGoogle Scholar
  19. Yang, Y. and Pedersen, J. O. 1997. A comparative study on feature selection in text categorization. In Proceedings of Fourteenth International Conference on Machine Learning (Nashville, TN, July 8--12, 1997). D. H. Fisher, ed. Morgan Kaufmann, 412--420.]] Google ScholarGoogle Scholar

Index Terms

  1. An adaptive k-nearest neighbor text categorization strategy

              Recommendations

              Comments

              Login options

              Check if you have access through your login credentials or your institution to get full access on this article.

              Sign in

              Full Access

              PDF Format

              View or Download as a PDF file.

              PDF

              eReader

              View online with eReader.

              eReader