DOI: 10.1145/1148170.1148242

Text clustering with extended user feedback

Published: 06 August 2006

ABSTRACT

Text clustering is most commonly treated as a fully automated task without user feedback. However, a variety of researchers have explored mixed-initiative clustering methods which allow a user to interact with and advise the clustering algorithm. This mixed-initiative approach is especially attractive for text clustering tasks where the user is trying to organize a corpus of documents into clusters for some particular purpose (e.g., clustering their email into folders that reflect various activities in which they are involved). This paper introduces a new approach to mixed-initiative clustering that handles several natural types of user feedback. We first introduce a new probabilistic generative model for text clustering (the SpeClustering model) and show that it outperforms the commonly used mixture of multinomials clustering model, even when used in fully autonomous mode with no user input. We then describe how to incorporate four distinct types of user feedback into the clustering algorithm, and provide experimental evidence showing substantial improvements in text clustering when this user feedback is incorporated.
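For orientation, the mixture-of-multinomials baseline named in the abstract can be sketched as EM over per-document word counts. This is only an illustrative sketch of that baseline, not the SpeClustering model and not the authors' implementation; the function name `fit_multinomial_mixture`, the smoothing parameter `alpha`, the random initialization, and the toy counts are all assumptions made for the example.

```python
import math
import random


def fit_multinomial_mixture(counts, n_clusters, n_iter=50, alpha=1.0, seed=0):
    """EM for a mixture of multinomials over per-document word-count rows."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Start from random soft assignments; each row is normalized to sum to 1.
    resp = []
    for _ in range(n_docs):
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(row)
        resp.append([r / total for r in row])
    for _ in range(n_iter):
        # M-step: smoothed cluster priors and per-cluster word distributions.
        log_prior, log_word = [], []
        for k in range(n_clusters):
            mass = sum(resp[d][k] for d in range(n_docs))
            log_prior.append(math.log((mass + alpha) / (n_docs + alpha * n_clusters)))
            word_mass = [
                alpha + sum(resp[d][k] * counts[d][w] for d in range(n_docs))
                for w in range(n_words)
            ]
            total = sum(word_mass)
            log_word.append([math.log(m / total) for m in word_mass])
        # E-step: posterior over clusters per document, computed in log space.
        # The multinomial coefficient is the same for every cluster, so it is omitted.
        for d in range(n_docs):
            log_joint = [
                log_prior[k] + sum(counts[d][w] * log_word[k][w] for w in range(n_words))
                for k in range(n_clusters)
            ]
            top = max(log_joint)
            unnorm = [math.exp(v - top) for v in log_joint]
            total = sum(unnorm)
            resp[d] = [u / total for u in unnorm]
    return resp


# Toy corpus (hypothetical): rows are documents, columns are vocabulary terms.
counts = [
    [5, 4, 0, 0],
    [6, 3, 0, 1],
    [0, 0, 7, 5],
    [1, 0, 4, 6],
]
resp = fit_multinomial_mixture(counts, n_clusters=2)
labels = [max(range(len(r)), key=r.__getitem__) for r in resp]
print(labels)
```

Each row of `resp` is a posterior distribution over clusters for one document, and `labels` takes the most probable cluster. User feedback of the kinds the abstract describes is not modeled in this sketch.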


      • Published in

        SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
        August 2006
        768 pages
        ISBN: 1595933697
        DOI: 10.1145/1148170

        Copyright © 2006 ACM


        Publisher: Association for Computing Machinery, New York, NY, United States


        Overall Acceptance Rate: 792 of 3,983 submissions, 20%
