
Enhancing semi-supervised clustering: a feature projection perspective

Published: 12 August 2007
DOI: 10.1145/1281192.1281268

ABSTRACT

Semi-supervised clustering employs limited supervision, in the form of labeled instances or pairwise instance constraints, to aid unsupervised clustering and often significantly improves clustering performance. Despite the considerable effort devoted to this problem, most existing work is not designed to handle high-dimensional sparse data. This paper fills this void by developing a Semi-supervised Clustering method based on spheRical K-mEans via fEature projectioN (SCREEN). Specifically, we formulate the problem of constraint-guided feature projection, which integrates naturally with semi-supervised clustering algorithms and effectively reduces the data dimensionality. Experimental results on several real-world data sets show that SCREEN handles high-dimensional data effectively and delivers appealing clustering performance.
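To make the two-stage idea in the abstract concrete, the following is a minimal, hypothetical Python/NumPy sketch of that kind of pipeline: project the high-dimensional data onto a few constraint-guided directions, then cluster the projected data with spherical k-means. The projection criterion used here (top eigenvectors of a cannot-link-minus-must-link scatter matrix) and the function names are illustrative assumptions, not the paper's exact SCREEN formulation.

    # Hypothetical sketch: constraint-guided projection followed by spherical k-means.
    import numpy as np

    def constraint_guided_projection(X, must_links, cannot_links, n_components):
        """Choose orthonormal directions that spread cannot-link pairs apart
        while keeping must-link pairs close. X has shape (n_samples, n_features)."""
        d = X.shape[1]
        S = np.zeros((d, d))
        for i, j in cannot_links:            # reward separating cannot-link pairs
            diff = (X[i] - X[j]).reshape(-1, 1)
            S += diff @ diff.T
        for i, j in must_links:              # penalize separating must-link pairs
            diff = (X[i] - X[j]).reshape(-1, 1)
            S -= diff @ diff.T
        # Top eigenvectors of the symmetric matrix S maximize w'Sw under orthonormality.
        eigvals, eigvecs = np.linalg.eigh(S)
        return eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]  # (n_features, n_components)

    def spherical_kmeans(X, k, n_iter=50, seed=0):
        """Spherical k-means: unit-normalize points and centroids, assign by cosine similarity."""
        rng = np.random.default_rng(seed)
        Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
        centroids = Xn[rng.choice(len(Xn), size=k, replace=False)]
        for _ in range(n_iter):
            labels = np.argmax(Xn @ centroids.T, axis=1)     # cosine similarity on unit vectors
            for c in range(k):
                members = Xn[labels == c]
                if len(members):
                    m = members.sum(axis=0)
                    centroids[c] = m / (np.linalg.norm(m) + 1e-12)
        return labels, centroids

    # Toy usage: random "documents", a few pairwise constraints, 10 projected dimensions.
    X = np.random.default_rng(0).random((100, 500))
    W = constraint_guided_projection(X, must_links=[(0, 1), (2, 3)],
                                     cannot_links=[(0, 50), (2, 75)], n_components=10)
    labels, _ = spherical_kmeans(X @ W, k=4)

Projecting before clustering is what makes the approach attractive for sparse, high-dimensional data: the constraints shape a low-dimensional subspace in which cosine-based spherical k-means remains meaningful.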


Published in

KDD '07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2007, 1080 pages
ISBN: 978-1-59593-609-7
DOI: 10.1145/1281192
Copyright © 2007 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

KDD '07 paper acceptance rate: 111 of 573 submissions (19%). Overall KDD acceptance rate: 1,133 of 8,635 submissions (13%).
