Abstract
The parameter k is the most important setting in a text categorization system based on the k-nearest neighbor algorithm (kNN). To classify a new document, the k documents in the training set nearest to it are determined first; the categories of the document are then predicted from the category distribution among those k neighbors. In practice the class distribution of a training set is rarely even: some classes have many more samples than others. The system's performance is therefore very sensitive to the choice of k, and a fixed k value is likely to bias predictions toward large categories and to under-use the information in the training set. To address these problems, this article proposes an improved kNN strategy in which a different number of nearest neighbors is used for each category instead of a fixed number across all categories. More neighbors are consulted when deciding whether a test document belongs to a category that has more samples in the training set; the number of neighbors selected for each category adapts to its sample size. Experiments on two different datasets show that our methods are less sensitive to the parameter k than the traditional ones and can correctly classify documents belonging to small classes even with a large k. The strategy is especially applicable and promising when estimating k via cross-validation is not possible and the class distribution of the training set is skewed.
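The per-category rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the proportional formula `k_c = ceil(k * n_c / n_max)`, the normalization of votes by the number of neighbors considered, and all function and variable names are assumptions chosen to make the idea concrete.

```python
import math

def adaptive_knn_predict(neighbors, class_counts, k):
    """Classify a document from its ranked nearest neighbors.

    neighbors:     labels of training documents, sorted by decreasing
                   similarity to the test document (at least k of them).
    class_counts:  mapping {label: number of training samples}.
    k:             neighbor budget for the largest class; smaller
                   classes are judged on proportionally shorter lists.
    """
    n_max = max(class_counts.values())
    scores = {}
    for label, n_c in class_counts.items():
        # Assumed proportional rule: a class with fewer training
        # samples gets a smaller neighbor count (at least 1).
        k_c = max(1, math.ceil(k * n_c / n_max))
        votes = sum(1 for lab in neighbors[:k_c] if lab == label)
        # Normalize by the neighbors actually considered, so small
        # classes are not penalized for their shorter lists.
        scores[label] = votes / k_c
    return max(scores, key=scores.get)

# With a 90/10 class imbalance and k = 6, the small class is judged
# on only its single nearest neighbor, so one very close "small"
# document can outvote a majority of "big" neighbors further out.
ranked = ["small", "big", "big", "big", "big", "big"]
print(adaptive_knn_predict(ranked, {"big": 90, "small": 10}, k=6))
```

A fixed-k majority vote over the same six neighbors would return "big" here; the adaptive rule lets the minority class win on the strength of its closest match, which is the behavior the abstract claims for skewed training sets.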