Text clustering with extended user feedback

Authors:
Yifen Huang, Carnegie Mellon University, Pittsburgh, PA
Tom M. Mitchell, Carnegie Mellon University, Pittsburgh, PA

SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, August 2006, Pages 413–420
https://doi.org/10.1145/1148170.1148242

Published: 06 August 2006

ABSTRACT

Text clustering is most commonly treated as a fully automated task without user feedback. However, a variety of researchers have explored mixed-initiative clustering methods which allow a user to interact with and advise the clustering algorithm. This mixed-initiative approach is especially attractive for text clustering tasks where the user is trying to organize a corpus of documents into clusters for some particular purpose (e.g., clustering their email into folders that reflect various activities in which they are involved). This paper introduces a new approach to mixed-initiative clustering that handles several natural types of user feedback. We first introduce a new probabilistic generative model for text clustering (the SpeClustering model) and show that it outperforms the commonly used mixture of multinomials clustering model, even when used in fully autonomous mode with no user input. We then describe how to incorporate four distinct types of user feedback into the clustering algorithm, and provide experimental evidence showing substantial improvements in text clustering when this user feedback is incorporated.
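For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of a mixture of multinomials clustering model fit with EM. It is not the SpeClustering model, whose details the abstract does not give, and it includes no user feedback. The function name, the bag-of-words input format, the Laplace smoothing constant `alpha`, and the toy corpus are all assumptions made for illustration.

```python
import math
import random


def em_mixture_of_multinomials(docs, n_clusters, vocab_size,
                               n_iter=50, alpha=1.0, seed=0):
    """Fit a mixture of multinomials with EM (illustrative sketch).

    docs: list of dicts mapping word id -> count (bag of words).
    alpha: Laplace smoothing on word counts (an assumption of this sketch).
    Returns (priors, word_probs, responsibilities).
    """
    rng = random.Random(seed)
    eps = 1e-9  # keeps cluster priors strictly positive

    # Random soft initial assignment of each document to the clusters.
    resp = []
    for _ in docs:
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(row)
        resp.append([r / total for r in row])

    priors, word_probs = [], []
    for _ in range(n_iter):
        # M-step: re-estimate cluster priors and per-cluster word distributions.
        priors = [
            (sum(r[k] for r in resp) + eps) / (len(docs) + n_clusters * eps)
            for k in range(n_clusters)
        ]
        word_probs = []
        for k in range(n_clusters):
            counts = [alpha] * vocab_size
            for doc, r in zip(docs, resp):
                for word, count in doc.items():
                    counts[word] += r[k] * count
            total = sum(counts)
            word_probs.append([c / total for c in counts])

        # E-step: posterior P(cluster | document), computed in log space.
        for i, doc in enumerate(docs):
            log_post = []
            for k in range(n_clusters):
                lp = math.log(priors[k])
                for word, count in doc.items():
                    lp += count * math.log(word_probs[k][word])
                log_post.append(lp)
            peak = max(log_post)
            unnorm = [math.exp(lp - peak) for lp in log_post]
            norm = sum(unnorm)
            resp[i] = [u / norm for u in unnorm]

    return priors, word_probs, resp


# Toy corpus: two documents use words 0 and 1, two use words 2 and 3.
toy_docs = [{0: 10, 1: 8}, {0: 7, 1: 12}, {2: 11, 3: 9}, {2: 8, 3: 10}]
priors, word_probs, resp = em_mixture_of_multinomials(toy_docs, 2, 4)
labels = [max(range(2), key=lambda k: r[k]) for r in resp]
print(labels)
```

In this sketch each document is assigned to the cluster with the highest posterior. A mixed-initiative variant would additionally adjust the parameters or posteriors based on user feedback, which this code does not attempt.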


Index Terms

Text clustering with extended user feedback
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering


Published in
SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
August 2006
768 pages
ISBN:1595933697
DOI:10.1145/1148170
General Chair: Efthimis N. Efthimiadis, University of Washington
Program Chairs:
Susan Dumais, Microsoft Research, Redmond
David Hawking, CSIRO ICT Centre, Canberra, Australia
Kalervo Järvelin, University of Tampere, Finland
Copyright © 2006 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
mixed-initiative learning
text clustering
user feedback
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%