DOI: 10.1145/2245276.2245306
research-article

Semi-supervised document clustering with dual supervision through seeding

Published: 26 March 2012

ABSTRACT

Semi-supervised clustering algorithms for general problems use a small number of labeled instances or pairwise instance constraints to guide otherwise unsupervised clustering. In document clustering, however, user supervision can also take other forms, such as labeling a feature by associating it with a document or a cluster. In addition to labeled documents, this paper explores using labeled features to generate the seeds that initialize the clustering. We present a unified framework that uses both labeled documents and labeled features to seed clusters and then refines this information using intermediate clusters. We introduce two methods of using labeled features to generate cluster seeds. Experimental results on several real-world data sets demonstrate that seeding the clustering with both documents and features can significantly improve document clustering performance over random seeding and document-only seeding.
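
The idea of seeding clusters from both kinds of supervision can be illustrated with a small sketch. The abstract does not specify the paper's two feature-seeding methods, so this is not the authors' algorithm: it is a generic seeded spherical k-means under stated assumptions. The function names (`seed_centroids`, `seeded_kmeans`), the toy term-count matrix, and the choice to add a unit weight for each labeled feature are all hypothetical and chosen only for illustration.

```python
import numpy as np

def normalize_rows(M):
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    M = np.asarray(M, dtype=float)
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return M / norms

def seed_centroids(X, k, labeled_docs, labeled_features):
    """Build one seed centroid per cluster (illustrative assumption).

    labeled_docs:     {document index: cluster id}
    labeled_features: {feature index: cluster id}
    """
    X = np.asarray(X, dtype=float)
    seeds = np.zeros((k, X.shape[1]))
    for doc_idx, cluster in labeled_docs.items():
        seeds[cluster] += X[doc_idx]        # labeled document joins its seed
    for feat_idx, cluster in labeled_features.items():
        seeds[cluster, feat_idx] += 1.0     # labeled feature adds unit weight
    return normalize_rows(seeds)

def seeded_kmeans(X, seeds, n_iter=20):
    """Spherical k-means started from the given seeds instead of random ones."""
    X = normalize_rows(X)
    centroids = seeds.copy()
    labels = np.zeros(X.shape[0], dtype=int)
    for _ in range(n_iter):
        labels = (X @ centroids.T).argmax(axis=1)   # cosine-similarity assignment
        for c in range(centroids.shape[0]):
            members = X[labels == c]
            if len(members):                        # keep old centroid if empty
                centroids[c] = members.sum(axis=0)
        centroids = normalize_rows(centroids)
    return labels

# Toy term-count matrix: 6 documents over 4 terms (hypothetical data).
X = np.array([
    [3, 1, 0, 0],
    [2, 2, 0, 0],
    [1, 3, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 2, 2],
    [0, 0, 1, 3],
])
labeled_docs = {0: 0}        # document 0 is labeled as cluster 0
labeled_features = {2: 1}    # term 2 is labeled as indicative of cluster 1

seeds = seed_centroids(X, k=2, labeled_docs=labeled_docs,
                       labeled_features=labeled_features)
labels = seeded_kmeans(X, seeds)
print(labels)
```

On this toy input, one labeled document and one labeled feature are enough to give each cluster a seed: the first three documents are assigned to cluster 0 and the last three to cluster 1.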

Published in

SAC '12: Proceedings of the 27th Annual ACM Symposium on Applied Computing
March 2012
2179 pages
ISBN: 9781450308571
DOI: 10.1145/2245276
Conference Chairs: Sascha Ossowski, Paola Lecca

Copyright © 2012 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

SAC '12 Paper Acceptance Rate: 270 of 1,056 submissions, 26%
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
