ABSTRACT
Text clustering is most commonly treated as a fully automated task without user feedback. However, a variety of researchers have explored mixed-initiative clustering methods which allow a user to interact with and advise the clustering algorithm. This mixed-initiative approach is especially attractive for text clustering tasks where the user is trying to organize a corpus of documents into clusters for some particular purpose (e.g., clustering their email into folders that reflect various activities in which they are involved). This paper introduces a new approach to mixed-initiative clustering that handles several natural types of user feedback. We first introduce a new probabilistic generative model for text clustering (the SpeClustering model) and show that it outperforms the commonly used mixture of multinomials clustering model, even when used in fully autonomous mode with no user input. We then describe how to incorporate four distinct types of user feedback into the clustering algorithm, and provide experimental evidence showing substantial improvements in text clustering when this user feedback is incorporated.
Index Terms
- Text clustering with extended user feedback