ABSTRACT
Semi-supervised clustering algorithms for general problems use a small number of labeled instances or pairwise instance constraints to aid unsupervised clustering. In document clustering, however, user supervision can also take other forms, such as labeling a feature by associating it with a document or a cluster. This paper uses labeled features, in addition to labeled documents, to generate the seeds that initialize the unsupervised clustering. We present a unified framework in which both labeled documents and labeled features are used to seed clusters, and in which this information is refined using intermediate clusters. We introduce two methods of using labeled features to generate cluster seeds. Experimental results on several real-world data sets demonstrate that seeding the clustering with both documents and features can significantly improve document clustering performance over random seeding and document-only seeding.
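The general idea of seeding with both kinds of supervision can be illustrated with a minimal sketch. The abstract does not specify the paper's two feature-seeding methods, so the code below is not the authors' algorithm: it assumes one simple scheme in which each cluster seed is the sum of its labeled document vectors plus a unit weight on each of its labeled features, followed by a spherical k-means loop. The function names (`build_seeds`, `seeded_kmeans`), the unit feature weight, and the toy data are all hypothetical.

```python
import numpy as np

def l2_normalize(M):
    """Scale each row to unit length, leaving all-zero rows as zeros."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.maximum(norms, 1e-12)

def build_seeds(X, doc_labels, feature_labels, k):
    """Build k seed centroids from labeled documents and labeled features.

    X: (n_docs, n_features) non-negative term-weight matrix.
    doc_labels: dict mapping document index -> cluster id.
    feature_labels: dict mapping feature index -> cluster id.
    Assumption: a labeled feature adds a weight of 1.0 to its cluster's seed.
    """
    Xn = l2_normalize(np.asarray(X, dtype=float))
    seeds = np.zeros((k, Xn.shape[1]))
    for doc, cluster in doc_labels.items():
        seeds[cluster] += Xn[doc]
    for feat, cluster in feature_labels.items():
        seeds[cluster, feat] += 1.0
    return l2_normalize(seeds)

def seeded_kmeans(X, seeds, n_iter=20):
    """Spherical k-means (cosine similarity) started from the given seeds."""
    Xn = l2_normalize(np.asarray(X, dtype=float))
    centroids = seeds.copy()
    for _ in range(n_iter):
        # Assign each document to its most similar centroid.
        assign = np.argmax(Xn @ centroids.T, axis=1)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        for c in range(centroids.shape[0]):
            members = Xn[assign == c]
            if len(members):
                centroids[c] = members.sum(axis=0)
        centroids = l2_normalize(centroids)
    return assign

# Toy data: documents 0-2 mostly use features 0-1, documents 3-5 features 2-3.
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [3, 1, 0, 1],
    [0, 0, 2, 1],
    [0, 1, 1, 3],
    [0, 0, 1, 2],
])
# Cluster 0 is seeded by one labeled document, cluster 1 by one labeled feature.
seeds = build_seeds(X, doc_labels={0: 0}, feature_labels={3: 1}, k=2)
assign = seeded_kmeans(X, seeds)
```

In this sketch a cluster with no labeled documents or features gets an all-zero seed and attracts no documents unless every similarity is tied at zero; a fuller implementation would need a fallback such as random seeding for those clusters.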
Index Terms
- Semi-supervised document clustering with dual supervision through seeding