Abstract
We describe the results of extensive experiments using optimized rule-based induction methods on large document collections. The goal of these methods is to automatically discover classification patterns that can be used for general document categorization or personalized filtering of free text. Previous reports indicate that human-engineered rule-based systems, requiring many man-years of development effort, have been successfully built to “read” documents and assign topics to them. We show that machine-generated decision rules appear comparable in performance to human-engineered ones, while using the identical rule-based representation. In comparison with other machine-learning techniques, results on a key benchmark from the Reuters collection show a large gain in performance, from a previously reported 67% recall/precision breakeven point to 80.5%. In the context of a very high-dimensional feature space, several methodological alternatives are examined, including universal versus local dictionaries, and binary versus frequency-related features.
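The recall/precision breakeven point quoted above is the value at which recall and precision are equal (or as close as the data allow). The following is a minimal sketch of how such a figure could be computed, assuming a classifier that emits a numeric score per document and a threshold that is swept over those scores. The abstract does not say how the authors traced the tradeoff for their rule sets, so the function names, the scoring setup, and the toy data here are illustrative assumptions only.

```python
from typing import List, Tuple


def precision_recall(scores: List[float], labels: List[int],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when documents scoring >= threshold are assigned the topic."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def breakeven_point(scores: List[float], labels: List[int]) -> float:
    """Sweep every observed score as a threshold and return the mean of
    precision and recall at the threshold where the two are closest."""
    best_gap = None
    best_value = 0.0
    for t in sorted(set(scores), reverse=True):
        p, r = precision_recall(scores, labels, t)
        gap = abs(p - r)
        if best_gap is None or gap < best_gap:
            best_gap, best_value = gap, (p + r) / 2
    return best_value


# Hypothetical toy data: six documents, three of which truly belong to the topic.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(f"breakeven: {breakeven_point(scores, labels):.3f}")
```

On this toy data, precision and recall coincide at a threshold of 0.7, where both equal 2/3. When no threshold makes the two exactly equal, the sketch reports their average at the closest point, which is one common convention rather than the only one.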