Automated learning of decision rules for text categorization

Authors:
Chidanand Apté, IBM T. J. Watson Research Center, Yorktown Heights, NY
Fred Damerau, IBM T. J. Watson Research Center, Yorktown Heights, NY
Sholom M. Weiss, Rutgers Univ., New Brunswick, NJ

ACM Transactions on Information Systems, Volume 12, Issue 3, pp 233–251. https://doi.org/10.1145/183422.183423

Published: 01 July 1994

Abstract

We describe the results of extensive experiments using optimized rule-based induction methods on large document collections. The goal of these methods is to discover automatically classification patterns that can be used for general document categorization or personalized filtering of free text. Previous reports indicate that human-engineered rule-based systems, requiring many man-years of developmental efforts, have been successfully built to “read” documents and assign topics to them. We show that machine-generated decision rules appear comparable to human performance, while using the identical rule-based representation. In comparison with other machine-learning techniques, results on a key benchmark from the Reuters collection show a large gain in performance, from a previously reported 67% recall/precision breakeven point to 80.5%. In the context of a very high-dimensional feature space, several methodological alternatives are examined, including universal versus local dictionaries, and binary versus frequency-related features.
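The recall/precision breakeven point quoted above is the value at which recall and precision are equal. The sketch below is only an illustration of that metric, under the assumption that a classifier supplies a ranking score per document; the abstract does not say how the authors computed their figure, and the function name `breakeven_point` is hypothetical. If the top k documents are assigned to the topic and k equals the number of truly relevant documents, precision and recall coincide.

```python
def breakeven_point(scores, labels):
    """Return the common value of precision and recall at the cutoff where
    the number of documents assigned to the topic equals the number of
    truly relevant documents (labels are 1 for relevant, 0 otherwise).

    Assumption: the classifier yields a score per document, so documents
    can be ranked. This is an illustrative choice, not the paper's method.
    """
    n_relevant = sum(labels)
    if n_relevant == 0:
        raise ValueError("need at least one relevant document")
    # Rank documents from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    # Assign the top n_relevant documents to the topic.
    true_positives = sum(label for _, label in ranked[:n_relevant])
    # precision = tp / n_assigned and recall = tp / n_relevant; with
    # n_assigned == n_relevant the two are the same number.
    return true_positives / n_relevant


if __name__ == "__main__":
    example_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
    example_labels = [1, 0, 1, 1, 0]
    print(breakeven_point(example_scores, example_labels))
```

In the toy data, three documents are relevant and two of them appear in the top three, so both precision and recall are 2/3 at that cutoff.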

References

APTÉ, C., DAMERAU, F., AND WEISS, S. 1993. Knowledge discovery for document classification. In Working Notes of the AAAI 1993 Workshop on Knowledge Discovery in Databases (KDD-93). AAAI, Menlo Park, Calif., 326-336.
BIEBRICHER, P., FUHR, N., AND LUSTIG, G. 1988. The automatic indexing system (AIR/PHYS)--From research to application. In ACM SIGIR '88. ACM, New York, 333-342.
BREIMAN, L., FRIEDMAN, J., OLSHEN, R., AND STONE, C. 1984. Classification and Regression Trees. Wadsworth, Monterey, Calif.
CHURCH, K. W. AND HANKS, P. 1989. Word association norms, mutual information, and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics. ACL, 76-83.
CLARK, P. AND NIBLETT, T. 1989. The CN2 induction algorithm. Mach. Learn. 3, 261-283.
FLOWER, M. AND JENNINGS, A. 1992. Domain classification of language using neural networks. In 3rd Australian Conference on Neural Networks.
FUHR, N. AND PFEIFER, U. 1991. Combining model-oriented and description-oriented approaches for probabilistic reasoning. In ACM SIGIR '91. ACM, New York, 46-56.
FUNG, R., CRAWFORD, S., AND APPELBAUM, L. 1990. An architecture for probabilistic concept-based information retrieval. In ACM SIGIR '90. ACM, New York, 455-467.
HAYES, P. AND WEINSTEIN, S. 1991. Adding value to financial news by computer. In Proceedings of the 1st International Conference on Artificial Intelligence Applications on Wall Street. 2-8.
HAYES, P. J., ANDERSEN, P. M., NIRENBURG, I. B., AND SCHMANDT, L. M. 1990. TCS: A shell for content-based text categorization. In Proceedings of the 6th IEEE CAIA. IEEE, Piscataway, N.J., 320-326.
HIGHLEYMAN, W. 1962. The design and analysis of pattern recognition experiments. Bell Syst. Tech. J. 41, 723-744.
LEWIS, D. 1992a. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 37-50.
LEWIS, D. 1992b. Feature selection and feature extraction for text categorization. In Proceedings of the Speech and Natural Language Workshop. Defense Advanced Research Projects Agency, Washington, D.C., 212-217.
LEWIS, D. AND RINGUETTE, M. 1994. A comparison of two learning algorithms for text categorization. In Symposium on Document Analysis and Information Retrieval. ISRI, Univ. of Nevada, Las Vegas. To be published.
LIN, S. AND KERNIGHAN, B. 1973. An efficient heuristic for the traveling salesman problem. Oper. Res. 21, 2, 498-516.
MASAND, B., LINOFF, G., AND WALTZ, D. 1992. Classifying news stories using memory based reasoning. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 59-65.
MICHALSKI, R., MOZETIC, I., HONG, J., AND LAVRAC, N. 1986. The multi-purpose incremental learning system AQ15 and its testing application to three medical domains. In Proceedings of the AAAI-86. AAAI, Menlo Park, Calif., 1041-1045.
QUINLAN, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, Calif.
QUINLAN, J. 1987. Simplifying decision trees. Int. J. Man-Mach. Stud. 27, 221-234.
SARACEVIC, T. 1991. Individual differences in organizing, searching and retrieving information. In Proceedings of the 54th Annual Meeting of the Society for Information Science, Jose-Marie Griffiths, Ed. Soc. for Information Science, 82-86.
SHETH, B. AND MAES, P. 1993. Evolving agents for personalized information filtering. In Proceedings of the IEEE CAIA-93. IEEE, New York, 345-352.
WEISS, S. AND INDURKHYA, N. 1993. Optimized rule induction. IEEE Exp. 8, 6, 61-69.
WEISS, S. M. AND KULIKOWSKI, C. A. 1991. Computer Systems That Learn. Morgan Kaufmann, San Mateo, Calif.
WEISS, S., GALEN, R., AND TADEPALLI, P. 1990. Maximizing the predictive value of production rules. Art. Intell. 45, 47-71.

Index Terms

Automated learning of decision rules for text categorization
1. Computing methodologies
  1. Artificial intelligence
    1. Knowledge representation and reasoning
  2. Machine learning
    1. Machine learning approaches
      1. Logical and relational learning
        Inductive logic learning
      2. Rule learning
2. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing
  2. Information systems applications
    1. Decision support systems
      1. Expert systems

Reviews

Reviewer: Ian Hugh Witten

Can rules for document classification be induced from a training set of manually classified documents, enabling new documents to be classified automatically? This question is important because of the time and skill required for manual classification and the huge volumes of textual information to be processed.

The problem splits into two parts: creating a feature set to represent each document, and inferring classification rules based on these features. For the first part, the authors advocate the use of topic-specific dictionaries, one for each classification topic, prepared manually in advance. A document is represented by the most frequently occurring dictionary words it contains, for each dictionary. For the second part, a new rule-learning scheme called Swap-1 is described that uses a dynamic optimization technique to overcome the possible shortcomings of the usual greedy rule-selection procedure. The scheme is tested on a collection of 15,000 Reuters news stories and 90 topics. Three-quarters of the stories are used for training, and the resulting rules are evaluated on the remaining stories in terms of recall and precision. Significant improvement is claimed over previously reported results on the same data, although the experimental conditions are slightly different.

I found this paper difficult to read and understand, principally because it introduces several new ideas in a sketchy manner and does not evaluate them properly. For example, it would have been interesting to compare results using Swap-1 with those of standard rule-learning schemes such as C4.5 [1] and to evaluate what improvement the local-dictionary feature selection scheme gives over traditional methods. Focusing the experiments on a comparison with a single study, details of which are not included, seems to be of lesser value, particularly in a high-profile journal of fairly general coverage.
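The pipeline summarized in the review (represent each document by its most frequent dictionary words, apply decision rules, score the result by recall and precision) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' Swap-1 system: the dictionary, the rule, the documents, and all function names are hypothetical, the rule is written by hand rather than learned, and tokenization is a plain whitespace split.

```python
from collections import Counter


def represent(document, dictionary, max_words):
    """Represent a document by the (up to) `max_words` most frequent
    dictionary words it contains. Tokenization is a simple lowercase
    whitespace split, which is an assumption made for brevity."""
    tokens = [t for t in document.lower().split() if t in dictionary]
    return {word for word, _ in Counter(tokens).most_common(max_words)}


def rule_fires(features, rule):
    """A rule is a list of conjunctions (disjunctive normal form). It
    fires when every word of at least one conjunction is present."""
    return any(all(word in features for word in conj) for conj in rule)


def precision_recall(predicted, actual):
    """Precision and recall of boolean predictions against boolean truth."""
    true_positives = sum(p and a for p, a in zip(predicted, actual))
    n_predicted = sum(predicted)
    n_actual = sum(actual)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_actual if n_actual else 0.0
    return precision, recall


# Hypothetical local dictionary and hand-written rule for a "wheat" topic.
WHEAT_DICTIONARY = {"wheat", "grain", "tonnes", "export"}
WHEAT_RULE = [("wheat",), ("grain", "tonnes")]  # wheat OR (grain AND tonnes)

# Hypothetical held-out documents with their manually assigned labels.
TEST_DOCS = [
    ("wheat export rose as grain tonnes shipped wheat", True),
    ("bank raised interest rates", False),
    ("grain harvest and wheat prices fell", True),
    ("wheat futures bank report", False),
    ("corn and barley tonnes shipped", True),
]

if __name__ == "__main__":
    predicted = [
        rule_fires(represent(text, WHEAT_DICTIONARY, 3), WHEAT_RULE)
        for text, _ in TEST_DOCS
    ]
    actual = [label for _, label in TEST_DOCS]
    print(precision_recall(predicted, actual))
```

In this toy set the rule fires on three documents, two of which are labeled with the topic, and it misses one labeled document, so precision and recall are both 2/3. A learned rule set would replace `WHEAT_RULE`, and the held-out documents would be the quarter of the collection not used for training.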

Published in

ACM Transactions on Information Systems Volume 12, Issue 3
July 1994
101 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/183422

Copyright © 1994 ACM
Publisher
Association for Computing Machinery
New York, NY, United States