Distributional clustering of words for text classification

Authors:
L. Douglas Baker

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA and Just Research 4616 Henry Street, Pittsburgh, PA

Andrew Kachites McCallum

Just Research 4616 Henry Street, Pittsburgh, PA and School of Computer Science, Carnegie Mellon University, Pittsburgh, PA

SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, August 1998, Pages 96–103. https://doi.org/10.1145/290941.290970

Published: 1 August 1998

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        1. Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
August 1998
394 pages
ISBN: 1581130155
DOI: 10.1145/290941
Chairmen:
- W. Bruce Croft, Univ. of Massachusetts
- Alistair Moffat, Univ. of Melbourne, Victoria, Australia
- C. J. van Rijsbergen, Univ. of Glasgow, Scotland, UK
- Ross Wilkinson, RMIT Univ., Australia and CSIRO
- Justin Zobel, RMIT Univ., Australia
Copyright © 1998 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%
Article Metrics
- Total Citations: 393
- Total Downloads: 6,638
- Downloads (Last 12 months): 288
- Downloads (Last 6 weeks): 39