ABSTRACT
Semi-supervised classification is drawing increasing attention in the era of big data, as the gap widens between the abundance of cheap, automatically collected unlabeled data and the scarcity of labeled data, which are laborious and expensive to obtain. In this paper, we introduce a unified framework for semi-supervised classification based on building blocks from density-based clustering. This framework is not only efficient and effective, but also statistically sound. Experimental results on a large collection of datasets show the advantages of the proposed framework.
Index Terms
- A unified framework of density-based clustering for semi-supervised classification