Automated learning of decision rules for text categorization

Published: 01 July 1994

Abstract

We describe the results of extensive experiments using optimized rule-based induction methods on large document collections. The goal of these methods is to discover automatically classification patterns that can be used for general document categorization or personalized filtering of free text. Previous reports indicate that human-engineered rule-based systems, requiring many man-years of developmental efforts, have been successfully built to “read” documents and assign topics to them. We show that machine-generated decision rules appear comparable to human performance, while using the identical rule-based representation. In comparison with other machine-learning techniques, results on a key benchmark from the Reuters collection show a large gain in performance, from a previously reported 67% recall/precision breakeven point to 80.5%. In the context of a very high-dimensional feature space, several methodological alternatives are examined, including universal versus local dictionaries, and binary versus frequency-related features.
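The recall/precision breakeven point quoted above is the value at which recall and precision are equal as a classifier is tuned from conservative to liberal. The abstract does not say how the authors located that point, so the following is only a minimal sketch under an assumption: each document has a numeric score for the topic, and the cutoff is swept down the ranked list. The function name `breakeven_point` and the toy scores are hypothetical.

```python
def breakeven_point(scores, labels):
    """Approximate the recall/precision breakeven for one topic.

    scores: per-document scores (higher = more likely on-topic); an assumption,
            since a rule set does not necessarily produce scores.
    labels: 1 if the document truly belongs to the topic, else 0.
    """
    # Rank documents from highest to lowest score.
    ranked = [lab for _, lab in sorted(zip(scores, labels),
                                       key=lambda pair: pair[0],
                                       reverse=True)]
    total_pos = sum(ranked)
    if total_pos == 0:
        raise ValueError("breakeven is undefined without positive examples")

    best_gap, best_value = None, None
    true_pos = 0
    for k, lab in enumerate(ranked, start=1):
        true_pos += lab
        precision = true_pos / k          # fraction of the top k that are on-topic
        recall = true_pos / total_pos     # fraction of on-topic documents in the top k
        gap = abs(precision - recall)
        if best_gap is None or gap < best_gap:
            # Report the midpoint at the cutoff where the two measures are closest.
            best_gap, best_value = gap, (precision + recall) / 2
    return best_value


scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 1, 0]
print(breakeven_point(scores, labels))
```

With the cutoff set at the number of on-topic documents, precision and recall share the same numerator and denominator, so an exact breakeven always exists in this ranked setting.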

References

  1. APTÉ, C., DAMERAU, F., AND WEISS, S. 1993. Knowledge discovery for document classification. In Working Notes of the AAAI 1993 Workshop on Knowledge Discovery in Databases (KDD-93). AAAI, Menlo Park, Calif., 326-336.
  2. BIEBRICHER, P., FUHR, N., AND LUSTIG, G. 1988. The automatic indexing system (AIR/PHYS)--From research to application. In ACM SIGIR '88. ACM, New York, 333-342.
  3. BREIMAN, L., FRIEDMAN, J., OLSHEN, R., AND STONE, C. 1984. Classification and Regression Trees. Wadsworth, Monterey, Calif.
  4. CHURCH, K. W. AND HANKS, P. 1989. Word association norms, mutual information, and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics. ACL, 76-83.
  5. CLARK, P. AND NIBLETT, T. 1989. The CN2 induction algorithm. Mach. Learn. 3, 261-283.
  6. FLOWER, M. AND JENNINGS, A. 1992. Domain classification of language using neural networks. In 3rd Australian Conference on Neural Networks.
  7. FUHR, N. AND PFEIFER, U. 1991. Combining model-oriented and description-oriented approaches for probabilistic reasoning. In ACM SIGIR '91. ACM, New York, 46-56.
  8. FUNG, R., CRAWFORD, S., AND APPELBAUM, L. 1990. An architecture for probabilistic concept-based information retrieval. In ACM SIGIR '90. ACM, New York, 455-467.
  9. HAYES, P. AND WEINSTEIN, S. 1991. Adding value to financial news by computer. In Proceedings of the 1st International Conference on Artificial Intelligence Applications on Wall Street. 2-8.
  10. HAYES, P. J., ANDERSEN, P. M., NIRENBURG, I. B., AND SCHMANDT, L. M. 1990. TCS: A shell for content-based text categorization. In Proceedings of the 6th IEEE CAIA. IEEE, Piscataway, N.J., 320-326.
  11. HIGHLEYMAN, W. 1962. The design and analysis of pattern recognition experiments. Bell Syst. Tech. J. 41, 723-744.
  12. LEWIS, D. 1992a. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 37-50.
  13. LEWIS, D. 1992b. Feature selection and feature extraction for text categorization. In Proceedings of the Speech and Natural Language Workshop. Defense Advanced Research Projects Agency, Washington, D.C., 212-217.
  14. LEWIS, D. AND RINGUETTE, M. 1994. A comparison of two learning algorithms for text categorization. In Symposium on Document Analysis and Information Retrieval. ISRI, Univ. of Nevada, Las Vegas. To be published.
  15. LIN, S. AND KERNIGHAN, B. 1973. An efficient heuristic for the traveling salesman problem. Oper. Res. 21, 2, 498-516.
  16. MASAND, B., LINOFF, G., AND WALTZ, D. 1992. Classifying news stories using memory based reasoning. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 59-65.
  17. MICHALSKI, R., MOZETIC, I., HONG, J., AND LAVRAC, N. 1986. The multi-purpose incremental learning system AQ15 and its testing application to three medical domains. In Proceedings of the AAAI-86. AAAI, Menlo Park, Calif., 1041-1045.
  18. QUINLAN, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, Calif.
  19. QUINLAN, J. 1987. Simplifying decision trees. Int. J. Man-Mach. Stud. 27, 221-234.
  20. SARACEVIC, T. 1991. Individual differences in organizing, searching and retrieving information. In Proceedings of the 54th Annual Meeting of the Society for Information Science, Jose-Marie Griffiths, Ed. Soc. for Information Science, 82-86.
  21. SHETH, B. AND MAES, P. 1993. Evolving agents for personalized information filtering. In Proceedings of the IEEE CAIA-93. IEEE, New York, 345-352.
  22. WEISS, S. AND INDURKHYA, N. 1993. Optimized rule induction. IEEE Exp. 8, 6, 61-69.
  23. WEISS, S. M. AND KULIKOWSKI, C. A. 1991. Computer Systems That Learn. Morgan Kaufmann, San Mateo, Calif.
  24. WEISS, S., GALEN, R., AND TADEPALLI, P. 1990. Maximizing the predictive value of production rules. Art. Intell. 45, 47-71.

              Reviews

              Ian Hugh Witten

              Can rules for document classification be induced from a training set of manually classified documents, enabling new documents to be classified automatically? This question is important because of the time and skill required for manual classification and the huge volumes of textual information to be processed. The problem splits into two parts: creating a feature set to represent each document, and inferring classification rules based on these features. For the first part, the authors advocate the use of topic-specific dictionaries, one for each classification topic, prepared manually in advance. A document is represented by the most frequently occurring dictionary words it contains, for each dictionary. For the second part, a new rule-learning scheme called Swap-1 is described that uses a dynamic optimization technique to overcome the possible shortcomings of the usual greedy rule-selection procedure. The scheme is tested on a collection of 15,000 Reuters news stories and 90 topics. Three-quarters of the stories are used for training, and the resulting rules are evaluated on the remaining stories in terms of recall and precision. Significant improvement is claimed over previously reported results on the same data, although the experimental conditions are slightly different.

              I found this paper difficult to read and understand, principally because it introduces several new ideas in a sketchy manner and does not evaluate them properly. For example, it would have been interesting to compare results using Swap-1 with those of standard rule-learning schemes such as C4.5 [1] and to evaluate what improvement the local-dictionary feature selection scheme gives over traditional methods. Focusing the experiments on a comparison with a single study, details of which are not included, seems to be of lesser value, particularly in a high-profile journal of fairly general coverage.
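The pipeline the review summarizes (represent each document by words from a topic dictionary, apply decision rules, score with recall and precision) can be illustrated with a minimal sketch. Everything named here is hypothetical: the `wheat` dictionary, the two hand-written rules, and the three toy documents are invented for illustration, and the rules are not learned by Swap-1, whose details the page does not give.

```python
from collections import Counter


def featurize(tokens, dictionary, binary=True):
    """Map a tokenized document onto the words of one topic dictionary.

    binary=True records presence (0/1); binary=False records raw counts,
    standing in for the "frequency-related" features mentioned in the abstract.
    """
    counts = Counter(token.lower() for token in tokens)
    if binary:
        return {word: int(counts[word] > 0) for word in sorted(dictionary)}
    return {word: counts[word] for word in sorted(dictionary)}


def rule_set_fires(features, rules):
    """A rule set in disjunctive normal form: it fires if, for at least
    one rule, every word in that rule is present in the document."""
    return any(all(features.get(word, 0) > 0 for word in rule) for rule in rules)


def precision_recall(predicted, actual):
    """Precision and recall for one topic from parallel boolean lists."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical local dictionary and rules for a "wheat" topic.
wheat_dictionary = {"wheat", "farm", "tonnes", "export"}
wheat_rules = [("wheat", "tonnes"), ("wheat", "farm")]

documents = [
    "Wheat export rose to two mln tonnes",   # on-topic, matches first rule
    "The farm bill passed",                  # off-topic, no rule matches
    "Wheat prices fell",                     # on-topic, but no rule matches
]
actual = [True, False, True]

predicted = [
    rule_set_fires(featurize(doc.split(), wheat_dictionary), wheat_rules)
    for doc in documents
]
print(precision_recall(predicted, actual))
```

In this toy run the rules never fire on an off-topic story but miss one on-topic story, which is the kind of trade-off the recall and precision figures in the evaluation are meant to expose.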


              • Published in

                ACM Transactions on Information Systems  Volume 12, Issue 3
                July 1994
                101 pages
                ISSN:1046-8188
                EISSN:1558-2868
                DOI:10.1145/183422

                Copyright © 1994 ACM

                Publisher

                Association for Computing Machinery

                New York, NY, United States
