Skip to main content
Erschienen in: Data Mining and Knowledge Discovery 6/2019

Open Access 23.08.2019

A unified view of density-based methods for semi-supervised clustering and classification

verfasst von: Jadson Castro Gertrudes, Arthur Zimek, Jörg Sander, Ricardo J. G. B. Campello

Erschienen in: Data Mining and Knowledge Discovery | Ausgabe 6/2019

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Semi-supervised learning is drawing increasing attention in the era of big data, as the gap between the abundance of cheap, automatically collected unlabeled data and the scarcity of labeled data that are laborious and expensive to obtain is dramatically increasing. In this paper, we first introduce a unified view of density-based clustering algorithms. We then build upon this view and bridge the areas of semi-supervised clustering and classification under a common umbrella of density-based techniques. We show that there are close relations between density-based clustering algorithms and the graph-based approach for transductive classification. These relations are then used as a basis for a new framework for semi-supervised classification based on building-blocks from density-based clustering. This framework is not only efficient and effective, but it is also statistically sound. In addition, we generalize the core algorithm in our framework, HDBSCAN*, so that it can also perform semi-supervised clustering by directly taking advantage of any fraction of labeled data that may be available. Experimental results on a large collection of datasets show the advantages of the proposed approach both for semi-supervised classification as well as for semi-supervised clustering.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
1
Please refer to the “Appendix” for a graphical illustration of border objects as well as other fundamental definitions reviewed in this section, which will be subsequently used throughout the remainder of the manuscript.
 
2
Algorithmically, however, \(\mathbf {x}_{3}\) may in practice be labeled depending on how the tie is resolved in the backtracking step of the label expansion procedure, and the final label may in this case be order-dependend.
 
3
HDBSCAN* is equipped with an optional parameter, \(m_{\text {ClSize}}\), which allows the user to specify the minimum size for a component to be considered a cluster, in such a way that components with fewer than \(m_{\text {ClSize}}\) objects are disregarded as noise. This can significantly reduce the size of the resulting clustering hierarchy. By default, HDBSCAN* uses \(m_{\text {ClSize}} = m_{\text {pts}}\), so in practice only \(m_{\text {pts}}\) needs to be given as input to the algorithm (Campello et al. 2013a, 2015). The result in Table 1 corresponds to \(m_{\text {ClSize}} = m_{\text {pts}} = 3\).
 
4
There is no such thing as a “rag bag cluster” because noise objects should not be seen as clustered with each other, in spite of sharing a common label “0”.
 
5
Both properties could be seen as related to the “locality” property of clustering functions as introduced by Ackerman et al. (2010).
 
6
It also makes it possible to compute Stability for the root, \(\mathbf {C}_1\).
 
7
The root can be trivially removed from the set of candidate solutions in FOSC, if desired (Campello et al. 2013b).
 
8
It can be implemented in O(n) w.r.t. memory (Campello et al. 2015).
 
11
Please notice that this statement holds under the assumption that labels are available. In certain application scenarios users will only be able or willing to provide pairwise constraints, and we are by no means claiming that these constraints cannot be useful for clustering. The relevance of pairwise constraints in semi-supervised clustering has been thoroughly investigated in the literature—e.g. see (Lampert et al. 2018) for a recent study in the realm of time-series.
 
12
If a result equivalent to SSDBSCAN’s label assignments would be desired, the result of the label expansion could be post-processed to remove propagated labels from objects that lead to label inconsistencies, declaring them as Noise (see Sect. 3.2). However, since a label assignment for all \(\mathbf {x} \in \mathbf {X}_U\) is required in the context of semi-supervisedclassification, such a noise identification procedure will not be further discussed here.
 
Literatur
Zurück zum Zitat Ackerman M, Ben-David S, Loker D (2010) Characterization of linkage-based clustering. In: COLT 2010—the 23rd conference on learning theory, Haifa, Israel, June 27–29, 2010. Omnipress, pp 270–281 Ackerman M, Ben-David S, Loker D (2010) Characterization of linkage-based clustering. In: COLT 2010—the 23rd conference on learning theory, Haifa, Israel, June 27–29, 2010. Omnipress, pp 270–281
Zurück zum Zitat Ankerst M, Breunig MM, Kriegel H, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: SIGMOD 1999, proceedings ACM SIGMOD international conference on management of data, June 1–3, 1999, Philadelphia, Pennsylvania, USA. ACM Press, pp 49–60. https://doi.org/10.1145/304182.304187 Ankerst M, Breunig MM, Kriegel H, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: SIGMOD 1999, proceedings ACM SIGMOD international conference on management of data, June 1–3, 1999, Philadelphia, Pennsylvania, USA. ACM Press, pp 49–60. https://​doi.​org/​10.​1145/​304182.​304187
Zurück zum Zitat Bagga A, Baldwin B (1998) Entity-based cross-document coreferencing using the vector space model. In: 36th annual meeting of the Association for Computational Linguistics and 17th international conference on computational linguistics, COLING-ACL’98, August 10–14, 1998, Université de Montréal, Montréal, Quebec, Canada, Proceedings of the Conference. Morgan Kaufmann Publishers/ACL, pp 79–85 Bagga A, Baldwin B (1998) Entity-based cross-document coreferencing using the vector space model. In: 36th annual meeting of the Association for Computational Linguistics and 17th international conference on computational linguistics, COLING-ACL’98, August 10–14, 1998, Université de Montréal, Montréal, Quebec, Canada, Proceedings of the Conference. Morgan Kaufmann Publishers/ACL, pp 79–85
Zurück zum Zitat Basu S, Davidson I, Wagstaff K (eds) (2008) Constrained Clustering: Advances in Algorithms. Applications and Theory. CRC Press, Boca Raton Basu S, Davidson I, Wagstaff K (eds) (2008) Constrained Clustering: Advances in Algorithms. Applications and Theory. CRC Press, Boca Raton
Zurück zum Zitat Batista AJL, Campello RJGB, Sander J (2016) Active semi-supervised classification based on multiple clustering hierarchies. In: 2016 IEEE international conference on data science and advanced analytics, DSAA 2016, Montreal, QC, Canada, October 17–19, 2016, IEEE, pp 11–20. https://doi.org/10.1109/DSAA.2016.9 Batista AJL, Campello RJGB, Sander J (2016) Active semi-supervised classification based on multiple clustering hierarchies. In: 2016 IEEE international conference on data science and advanced analytics, DSAA 2016, Montreal, QC, Canada, October 17–19, 2016, IEEE, pp 11–20. https://​doi.​org/​10.​1109/​DSAA.​2016.​9
Zurück zum Zitat Belkin M, Niyogi P, Sindhwani V (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res 7:2399–2434MathSciNetMATH Belkin M, Niyogi P, Sindhwani V (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res 7:2399–2434MathSciNetMATH
Zurück zum Zitat Böhm C, Plant C (2008) HISSCLU: a hierarchical density-based method for semi-supervised clustering. In: EDBT 2008, 11th international conference on extending database technology, Nantes, France, March 25–29, 2008, Proceedings, ACM International Conference Proceeding Series, vol 261, pp 440–451. https://doi.org/10.1145/1353343.1353398 Böhm C, Plant C (2008) HISSCLU: a hierarchical density-based method for semi-supervised clustering. In: EDBT 2008, 11th international conference on extending database technology, Nantes, France, March 25–29, 2008, Proceedings, ACM International Conference Proceeding Series, vol 261, pp 440–451. https://​doi.​org/​10.​1145/​1353343.​1353398
Zurück zum Zitat Campello RJGB, Moulavi D, Sander J (2013a) Density-based clustering based on hierarchical density estimates. In: Advances in knowledge discovery and data mining, 17th Pacific-Asia conference, PAKDD 2013, Gold Coast, Australia, April 14–17, 2013, Proceedings, Part II, Lecture Notes in Computer Science, vol 7819. Springer, Berlin, pp 160–172. https://doi.org/10.1007/978-3-642-37456-2_14 Campello RJGB, Moulavi D, Sander J (2013a) Density-based clustering based on hierarchical density estimates. In: Advances in knowledge discovery and data mining, 17th Pacific-Asia conference, PAKDD 2013, Gold Coast, Australia, April 14–17, 2013, Proceedings, Part II, Lecture Notes in Computer Science, vol 7819. Springer, Berlin, pp 160–172. https://​doi.​org/​10.​1007/​978-3-642-37456-2_​14
Zurück zum Zitat Chapelle O, Schölkopf B, Zien A (2006) Introduction to semi-supervised learning, Chapter 1. MIT Press, Cambridge, pp 1–12CrossRef Chapelle O, Schölkopf B, Zien A (2006) Introduction to semi-supervised learning, Chapter 1. MIT Press, Cambridge, pp 1–12CrossRef
Zurück zum Zitat Davidson I, Wagstaff K, Basu S (2006) Measuring constraint-set utility for partitional clustering algorithms. In: Knowledge discovery in databases: PKDD. Proceedings of the 10th European conference on principles and practice of knowledge discovery in databases, Berlin, Germany, September 18–22, 2006, Lecture Notes in Computer Science, vol 4213. Springer, Berlin, pp 115–126. https://doi.org/10.1007/11871637_15 Davidson I, Wagstaff K, Basu S (2006) Measuring constraint-set utility for partitional clustering algorithms. In: Knowledge discovery in databases: PKDD. Proceedings of the 10th European conference on principles and practice of knowledge discovery in databases, Berlin, Germany, September 18–22, 2006, Lecture Notes in Computer Science, vol 4213. Springer, Berlin, pp 115–126. https://​doi.​org/​10.​1007/​11871637_​15
Zurück zum Zitat de Sousa CAR, Rezende SO, Batista GEAPA (2013) Influence of graph construction on semi-supervised learning. In: Machine learning and knowledge discovery in databases—European conference, ECML PKDD 2013, Prague, Czech Republic, September 23–27, 2013, Proceedings, Part III, Lecture Notes in Computer Science, vol 8190. Springer, Berlin, pp 160–175. https://doi.org/10.1007/978-3-642-40994-3_11 de Sousa CAR, Rezende SO, Batista GEAPA (2013) Influence of graph construction on semi-supervised learning. In: Machine learning and knowledge discovery in databases—European conference, ECML PKDD 2013, Prague, Czech Republic, September 23–27, 2013, Proceedings, Part III, Lecture Notes in Computer Science, vol 8190. Springer, Berlin, pp 160–175. https://​doi.​org/​10.​1007/​978-3-642-40994-3_​11
Zurück zum Zitat Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30MathSciNetMATH Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30MathSciNetMATH
Zurück zum Zitat Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining (KDD-96), Portland, Oregon, USA. AAAI Press, pp 226–231 Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining (KDD-96), Portland, Oregon, USA. AAAI Press, pp 226–231
Zurück zum Zitat Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo-Meullenet P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR (2017) The chembl database in 2017. Nucleic Acids Res 45(Database–Issue):D945–D954. https://doi.org/10.1093/nar/gkw1074 CrossRef Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo-Meullenet P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR (2017) The chembl database in 2017. Nucleic Acids Res 45(Database–Issue):D945–D954. https://​doi.​org/​10.​1093/​nar/​gkw1074 CrossRef
Zurück zum Zitat Gertrudes JC, Zimek A, Sander J, Campello RJGB (2018) A unified framework of density-based clustering for semi-supervised classification. In: Proceedings of the 30th international conference on scientific and statistical database management, SSDBM 2018, Bozen-Bolzano, Italy, July 09–11, 2018. ACM, New York, pp 11:1–11:12. https://doi.org/10.1145/3221269.3223037 Gertrudes JC, Zimek A, Sander J, Campello RJGB (2018) A unified framework of density-based clustering for semi-supervised classification. In: Proceedings of the 30th international conference on scientific and statistical database management, SSDBM 2018, Bozen-Bolzano, Italy, July 09–11, 2018. ACM, New York, pp 11:1–11:12. https://​doi.​org/​10.​1145/​3221269.​3223037
Zurück zum Zitat Hartigan JA (1975) Clustering algorithms. Wiley, New YorkMATH Hartigan JA (1975) Clustering algorithms. Wiley, New YorkMATH
Zurück zum Zitat Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Upper Saddle RiverMATH Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Upper Saddle RiverMATH
Zurück zum Zitat Li J, Sander J, Campello RJGB, Zimek A (2014) Active learning strategies for semi-supervised DBSCAN. In: Advances in artificial intelligence—proceedings of the 27th Canadian conference on artificial intelligence, Canadian AI 2014, Montréal, QC, Canada, May 6–9, 2014, Lecture Notes in Computer Science, vol 8436. Springer, Berlin, pp 179–190. https://doi.org/10.1007/978-3-319-06483-3_16 Li J, Sander J, Campello RJGB, Zimek A (2014) Active learning strategies for semi-supervised DBSCAN. In: Advances in artificial intelligence—proceedings of the 27th Canadian conference on artificial intelligence, Canadian AI 2014, Montréal, QC, Canada, May 6–9, 2014, Lecture Notes in Computer Science, vol 8436. Springer, Berlin, pp 179–190. https://​doi.​org/​10.​1007/​978-3-319-06483-3_​16
Zurück zum Zitat Lichman M (2013) UCI machine learning repository. Accessed 17 June 2017 Lichman M (2013) UCI machine learning repository. Accessed 17 June 2017
Zurück zum Zitat Liu W, Chang S (2009) Robust multi-class transductive learning with graphs. In: 2009 IEEE Computer Society conference on computer vision and pattern recognition (CVPR 2009), 20–25 June 2009, Miami, Florida, USA. IEEE Computer Society, pp 381–388. https://doi.org/10.1109/CVPRW.2009.5206871 Liu W, Chang S (2009) Robust multi-class transductive learning with graphs. In: 2009 IEEE Computer Society conference on computer vision and pattern recognition (CVPR 2009), 20–25 June 2009, Miami, Florida, USA. IEEE Computer Society, pp 381–388. https://​doi.​org/​10.​1109/​CVPRW.​2009.​5206871
Zurück zum Zitat Moulavi D (2014) Finding, evaluating and exploring clustering alternatives unsupervised and semi-supervised. PhD Thesis, University of Alberta Moulavi D (2014) Finding, evaluating and exploring clustering alternatives unsupervised and semi-supervised. PhD Thesis, University of Alberta
Zurück zum Zitat Nemenyi P (1963) Distribution-free multiple comparisons. Princeton University Nemenyi P (1963) Distribution-free multiple comparisons. Princeton University
Zurück zum Zitat Pourrajabi M, Moulavi D, Campello RJGB, Zimek A, Sander J, Goebel R (2014) Model selection for semi-supervised clustering. In: Proceedings of the 17th international conference on extending database technology, EDBT 2014, Athens, Greece, March 24–28, 2014. OpenProceedings.org, pp 331–342. https://doi.org/10.5441/002/edbt.2014.31 Pourrajabi M, Moulavi D, Campello RJGB, Zimek A, Sander J, Goebel R (2014) Model selection for semi-supervised clustering. In: Proceedings of the 17th international conference on extending database technology, EDBT 2014, Athens, Greece, March 24–28, 2014. OpenProceedings.org, pp 331–342. https://​doi.​org/​10.​5441/​002/​edbt.​2014.​31
Zurück zum Zitat Rivera-Borroto OM, Marrero-Ponce Y, de la Vega JMG, del Corazón Grau-Ábalo R (2011) Comparison of combinatorial clustering methods on pharmacological data sets represented by machine learning-selected real molecular descriptors. J Chem Inf Model 51(12):3036–3049. https://doi.org/10.1021/ci2000083 CrossRef Rivera-Borroto OM, Marrero-Ponce Y, de la Vega JMG, del Corazón Grau-Ábalo R (2011) Comparison of combinatorial clustering methods on pharmacological data sets represented by machine learning-selected real molecular descriptors. J Chem Inf Model 51(12):3036–3049. https://​doi.​org/​10.​1021/​ci2000083 CrossRef
Zurück zum Zitat Ruiz C, Spiliopoulou M, Ruiz EM (2007) C-DBSCAN: density-based clustering with constraints. In: Rough sets, fuzzy sets, data mining and granular computing. Proceedings of the 11th International Conference, RSFDGrC 2007, Toronto, Canada, May 14–16, 2007, Lecture Notes in Computer Science, vol 4482. Springer, Berlin, pp 216–223. https://doi.org/10.1007/978-3-540-72530-5_25 Ruiz C, Spiliopoulou M, Ruiz EM (2007) C-DBSCAN: density-based clustering with constraints. In: Rough sets, fuzzy sets, data mining and granular computing. Proceedings of the 11th International Conference, RSFDGrC 2007, Toronto, Canada, May 14–16, 2007, Lecture Notes in Computer Science, vol 4482. Springer, Berlin, pp 216–223. https://​doi.​org/​10.​1007/​978-3-540-72530-5_​25
Zurück zum Zitat Szummer M, Jaakkola TS (2002) Information regularization with partially labeled data. In: Advances in neural information processing systems 15 [Neural Information Processing Systems, NIPS 2002, December 9–14, 2002, Vancouver, British Columbia, Canada]. MIT Press, Cambridge, pp 1025–1032 Szummer M, Jaakkola TS (2002) Information regularization with partially labeled data. In: Advances in neural information processing systems 15 [Neural Information Processing Systems, NIPS 2002, December 9–14, 2002, Vancouver, British Columbia, Canada]. MIT Press, Cambridge, pp 1025–1032
Zurück zum Zitat Zhao L, Luo S, Tian M, Shao C, Ma H (2006) Combining label information and neighborhood graph for semi-supervised learning. In: Advances in neural networks—ISNN 2006. Proceedings of the third international symposium on neural networks, Chengdu, China, May 28–June 1, 2006, Part I, Lecture Notes in Computer Science, vol 3971. Springer, Berlin, pp 482–488. https://doi.org/10.1007/11759966_72 Zhao L, Luo S, Tian M, Shao C, Ma H (2006) Combining label information and neighborhood graph for semi-supervised learning. In: Advances in neural networks—ISNN 2006. Proceedings of the third international symposium on neural networks, Chengdu, China, May 28–June 1, 2006, Part I, Lecture Notes in Computer Science, vol 3971. Springer, Berlin, pp 482–488. https://​doi.​org/​10.​1007/​11759966_​72
Zurück zum Zitat Zhu X (2005) Semi-supervised learning literature survey—TR1530. Technical report, University of Wisconsin, Madison Zhu X (2005) Semi-supervised learning literature survey—TR1530. Technical report, University of Wisconsin, Madison
Zurück zum Zitat Zhu X, Ghahramani Z, Lafferty JD (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the twentieth international conference on machine learning (ICML 2003), August 21–24, 2003, Washington, DC, USA. AAAI Press, pp 912–919 Zhu X, Ghahramani Z, Lafferty JD (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the twentieth international conference on machine learning (ICML 2003), August 21–24, 2003, Washington, DC, USA. AAAI Press, pp 912–919
Metadaten
Titel
A unified view of density-based methods for semi-supervised clustering and classification
verfasst von
Jadson Castro Gertrudes
Arthur Zimek
Jörg Sander
Ricardo J. G. B. Campello
Publikationsdatum
23.08.2019
Verlag
Springer US
Erschienen in
Data Mining and Knowledge Discovery / Ausgabe 6/2019
Print ISSN: 1384-5810
Elektronische ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-019-00651-1

Weitere Artikel der Ausgabe 6/2019

Data Mining and Knowledge Discovery 6/2019 Zur Ausgabe

Premium Partner