DOI: 10.1145/3221269.3223037
research-article

A unified framework of density-based clustering for semi-supervised classification

Published: 09 July 2018

ABSTRACT

Semi-supervised classification is drawing increasing attention in the era of big data, as the gap between the abundance of cheap, automatically collected unlabeled data and the scarcity of labeled data that are laborious and expensive to obtain is dramatically increasing. In this paper, we introduce a unified framework for semi-supervised classification based on building blocks from density-based clustering. This framework is not only efficient and effective, but it is also statistically sound. Experimental results on a large collection of datasets show the advantages of the proposed framework.


Published in

SSDBM '18: Proceedings of the 30th International Conference on Scientific and Statistical Database Management
July 2018
314 pages
ISBN: 9781450365055
DOI: 10.1145/3221269

Copyright © 2018 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SSDBM '18 paper acceptance rate: 30 of 75 submissions, 40%. Overall acceptance rate: 56 of 146 submissions, 38%.
