Article

Enhancing semi-supervised clustering: a feature projection perspective

Authors:
Wei Tang

Florida Atlantic University

Florida Atlantic University
View Profile

,
Hui Xiong

Rutgers University

Rutgers University
View Profile

,
Shi Zhong

Yahoo Inc.

Yahoo Inc.
View Profile

,
Jie Wu

Florida Atlantic University

Florida Atlantic University
View Profile

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data miningAugust 2007Pages 707–716https://doi.org/10.1145/1281192.1281268

Published:12 August 2007Publication History

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 707–716

ABSTRACT

Semi-supervised clustering employs limited supervision in the form of labeled instances or pairwise instance constraints to aid unsupervised clustering and often significantly improves the clustering performance. Despite the vast amount of expert knowledge spent on this problem, most existing work is not designed for handling high-dimensional sparse data. This paper thus fills this crucial void by developing a Semi-supervised Clustering method based on spheRical K-mEans via fEature projectioN (SCREEN). Specifically, we formulate the problem of constraint-guided feature projection, which can be nicely integrated with semi-supervised clustering algorithms and has the ability to effectively reduce data dimension. Indeed, our experimental results on several real-world data sets show that the SCREEN method can effectively deal with high-dimensional data and provides an appealing clustering performance.

References

C. C. Aggarwal, C. M. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. In Proc. of the ACM SIGMOD International Conference on Management of Data (SIGMOD-99), pages 61--72, 1999. Google ScholarDigital Library
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. of the ACM SIGMOD International Conference on Management of Data (SIGMOD-98), pages 94--105, 1998. Google ScholarDigital Library
A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning distance functions using equivalence relations. In Proc. of the 20th International Conference on Machine Learning (ICML-03), pages 11--18, 2003.Google Scholar
S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised clustering by seeding. In Proc. of the 19th International Conference on Machine Learning (ICML-02), pages 19--26, 2002. Google ScholarDigital Library
S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-04), pages 59--68, 2004. Google ScholarDigital Library
M. Berry, Z. Drmac, and E. Jessup. Matrics, vector spaces, and information retrieval. SIAM Review, 41(2):335--362, 1999. Google ScholarDigital Library
K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbors meaningful? In Proc. of International Conference on Database Theory (ICDT-99), 1999. Google ScholarDigital Library
T. Bie, M. Momma, and N. Cristianini. Efficiently learning the metric using side-information. In Proc. of the 14th International Conference on Algorithmic Learning Theory (ALT-03), pages 175--189, 2003.Google ScholarCross Ref
C. Burges. Geometric methods for feature extraction and dimensionality reduction: a guided tour. Technical Report MSR-TR-2004-55, Microsoft, 2004.Google Scholar
D. Cohn, R. Caruana, and A. McCallum. Semi-supervised clustering with user feedback. Technical Report TR2003-1892, Cornell Univ., 2003.Google Scholar
T. M. Cover and J. A. Thomas. Elements of information theory. Wiley-Interscience, 1991. Google ScholarDigital Library
I. Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. In Proc. of the 5th SIAM International Conference on Data Mining (SDM-05), 2005.Google ScholarCross Ref
I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In Proc. of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-03), pages 89--98, 2003. Google ScholarDigital Library
I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42:143--175, 2001.Google ScholarDigital Library
R. O. Duda, P. E. Hart, and D. H. Stork. Pattern classification. Wiley Interscience, 2nd edition, 2000. Google ScholarDigital Library
K. Fukunaga. Statistical pattern recognition. Academic Press, San Diego, CA, 2nd edition, 1990. Google ScholarDigital Library
G. Karypis. Cluto - a clustering toolkit, 2002. http://glaros.dtc.umn.edu/gkhome/views/cluto/.Google Scholar
K. Lang. News weeder: learning to filter netnews. In Proc. of the 12th International Conference on Machine Learning (ICML-95), pages 331--339, 1995.Google Scholar
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, 1996. http://www.cs.cmu.edu/<mccallum/bow.Google Scholar
D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/<mlearn/MLRepository.html.Google Scholar
L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. SIGKDD Explorations, 6(1):90--105, 2004. Google ScholarDigital Library
K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical magazine, 2(6):559--572, 1901.Google ScholarCross Ref
S. Roweis. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems 16 (NIPS-97), pages 626--632, 1998. Google ScholarDigital Library
A. Strehl, J. Ghosh, and R. Mooney. Impact of similarity measures on web-page clustering. In Proc. of the Workshop on Artificial Intelligence for Web Search, pages 58--64, 2000.Google Scholar
K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained k-means clustering with background knowledge. In Proc. of the 18th International Conference on Machine Learning (ICML-01), pages 577--584, 2001. Google ScholarDigital Library
E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems 15 (NIPS-02), pages 505--512, 2003.Google Scholar
K. Y. Yip, D. W. Cheung, and M. K. Ng. On discovery of extremely low-dimensional clusters using semi-supervised projected clustering. In Proc. of the 21st International Conference on Data Engineering (ICDE-05), pages 329--340, 2005. Google ScholarDigital Library

Index Terms

Enhancing semi-supervised clustering: a feature projection perspective
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

Semi-supervised Hierarchical Clustering
ICDM '11: Proceedings of the 2011 IEEE 11th International Conference on Data Mining

Semi-supervised clustering (i.e., clustering with knowledge-based constraints) has emerged as an important variant of the traditional clustering paradigms. However, most existing semi-supervised clustering algorithms are designed for partitional ...
Read More
Density-based semi-supervised clustering

Semi-supervised clustering methods guide the data partitioning and grouping process by exploiting background knowledge, among else in the form of constraints. In this study, we propose a semi-supervised density-based clustering method. Density-based ...
Read More
Semi-supervised clustering with metric learning: An adaptive kernel method

Most existing representative works in semi-supervised clustering do not sufficiently solve the violation problem of pairwise constraints. On the other hand, traditional kernel methods for semi-supervised clustering not only face the problem of manually ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN:9781595936097
DOI:10.1145/1281192
General Chair:
Pavel Berkhin
Yahoo!, USA
,
Program Chairs:
Rich Caruana
Cornell University, USA
,
Xindong Wu
University of Vermont, USA
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 12 August 2007
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
feature projection
pairwise instance constraints
semi-supervised clustering
Qualifiers
- Article
Conference

Acceptance Rates
KDD '07 Paper Acceptance Rate111of573submissions,19%Overall Acceptance Rate1,133of8,635submissions,13%
More
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 93
  Total Citations
  View Citations
- 1,129
  Total Downloads
- Downloads (Last 12 months)13
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Enhancing semi-supervised clustering: a feature projection perspective

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining

ABSTRACT

References

Cited By

Index Terms

Recommendations

Semi-supervised Hierarchical Clustering

Density-based semi-supervised clustering

Semi-supervised clustering with metric learning: An adaptive kernel method