Skip to main content
Erschienen in: Data Mining and Knowledge Discovery 6/2021

16.09.2021

A Lagrangian-based score for assessing the quality of pairwise constraints in semi-supervised clustering

verfasst von: Rodrigo Randel, Daniel Aloise, Simon J. Blanchard, Alain Hertz

Erschienen in: Data Mining and Knowledge Discovery | Ausgabe 6/2021

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Clustering algorithms help identify homogeneous subgroups from data. In some cases, additional information about the relationship among some subsets of the data exists. When using a semi-supervised clustering algorithm, an expert may provide additional information to constrain the solution based on that knowledge and, in doing so, guide the algorithm to a more useful and meaningful solution. Such additional information often takes the form of a cannot-link constraint (i.e., two data points cannot be part of the same cluster) or a must-link constraint (i.e., two data points must be part of the same cluster). A key challenge for users of such constraints in semi-supervised learning algorithms, however, is that the addition of inaccurate or conflicting constraints can decrease accuracy and little is known about how to detect whether expert-imposed constraints are likely incorrect. In the present work, we introduce a method to score each must-link and cannot-link pairwise constraint as likely incorrect. Using synthetic experimental examples and real data, we show that the resulting impact score can successfully identify individual constraints that should be removed or revised.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
We focus on discrete labels (e.g., classes) for simplicity of exposition, although there are numerous unsupervised (e.g., latent trait models) and supervised models (e.g., regression) which focus on continuous outcomes.
 
Literatur
Zurück zum Zitat Aloise D, Deshpande A, Hansen P, Popat P (2009) Np-hardness of Euclidean sum-of-squares clustering. Mach Learn 75(2):245–248CrossRef Aloise D, Deshpande A, Hansen P, Popat P (2009) Np-hardness of Euclidean sum-of-squares clustering. Mach Learn 75(2):245–248CrossRef
Zurück zum Zitat Basu S, Banerjee A, Mooney RJ (2002) Semi-supervised clustering by seeding, In: Proceedings of the nineteenth international conference on machine learning, vol 656012. Morgan Kaufmann Publishers Inc., pp 27–34 Basu S, Banerjee A, Mooney RJ (2002) Semi-supervised clustering by seeding, In: Proceedings of the nineteenth international conference on machine learning, vol 656012. Morgan Kaufmann Publishers Inc., pp 27–34
Zurück zum Zitat Basu S, Bilenko M, Banerjee A, Mooney RJ (2006) Probabilistic semi-supervised clustering with constraints. In: Semi-supervised learning. pp 71–98 Basu S, Bilenko M, Banerjee A, Mooney RJ (2006) Probabilistic semi-supervised clustering with constraints. In: Semi-supervised learning. pp 71–98
Zurück zum Zitat Bertsimas D, Tsitsiklis J (1997) Introduction to linear optimization, 1st edn. Athena Scientific, Belmont Bertsimas D, Tsitsiklis J (1997) Introduction to linear optimization, 1st edn. Athena Scientific, Belmont
Zurück zum Zitat Bezdek J (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New YorkCrossRef Bezdek J (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New YorkCrossRef
Zurück zum Zitat Campello RJ, Moulavi D, Zimek A, Sander J (2013) A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. Data Min Knowl Discov 27(3):344–371MathSciNetCrossRef Campello RJ, Moulavi D, Zimek A, Sander J (2013) A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. Data Min Knowl Discov 27(3):344–371MathSciNetCrossRef
Zurück zum Zitat Costa LR, Aloise D, Mladenović N (2017) Less is more: basic variable neighborhood search heuristic for balanced minimum sum-of-squares clustering. Inf Sci 415:247–253CrossRef Costa LR, Aloise D, Mladenović N (2017) Less is more: basic variable neighborhood search heuristic for balanced minimum sum-of-squares clustering. Inf Sci 415:247–253CrossRef
Zurück zum Zitat Davidson I (2012) Two approaches to understanding when constraints help clustering. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining—KDD ’12. ACM, 2339734, pp 1312–1320. https://doi.org/10.1145/2339530.2339734 Davidson I (2012) Two approaches to understanding when constraints help clustering. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining—KDD ’12. ACM, 2339734, pp 1312–1320. https://​doi.​org/​10.​1145/​2339530.​2339734
Zurück zum Zitat Davidson I, Ravi SS (2006)Identifying and generating easy sets of constraints for clustering. In: Proceedings of the 21st national conference on artificial intelligence—Volume 1, vol 1597593. AAAI Press, pp 336–341 Davidson I, Ravi SS (2006)Identifying and generating easy sets of constraints for clustering. In: Proceedings of the 21st national conference on artificial intelligence—Volume 1, vol 1597593. AAAI Press, pp 336–341
Zurück zum Zitat Davidson I, Wagstaff KL, Basu S (2006) Measuring constraint-set utility for partitional clustering algorithms. In: Fürnkranz J, Scheffer T, Spiliopoulou M (eds) Knowledge discovery in databases: PKDD 2006. Springer, Berlin, pp 115–126CrossRef Davidson I, Wagstaff KL, Basu S (2006) Measuring constraint-set utility for partitional clustering algorithms. In: Fürnkranz J, Scheffer T, Spiliopoulou M (eds) Knowledge discovery in databases: PKDD 2006. Springer, Berlin, pp 115–126CrossRef
Zurück zum Zitat Edwards AWF, Cavalli-Sforza LL (1965) A method for cluster analysis. Biometrics 21(2):362–375CrossRef Edwards AWF, Cavalli-Sforza LL (1965) A method for cluster analysis. Biometrics 21(2):362–375CrossRef
Zurück zum Zitat Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7(2):179–188CrossRef Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7(2):179–188CrossRef
Zurück zum Zitat Grossi V, Romei A, Turini F (2017b) Survey on using constraints in data mining. Data Min Knowl Discov 31(2):424–464MathSciNetCrossRef Grossi V, Romei A, Turini F (2017b) Survey on using constraints in data mining. Data Min Knowl Discov 31(2):424–464MathSciNetCrossRef
Zurück zum Zitat Hansen P, Brimberg J, Urošević D, Mladenović N (2009) Solving large p-median clustering problems by primal-dual variable neighborhood search. Data Min Knowl Discov 19(3):351–375MathSciNetCrossRef Hansen P, Brimberg J, Urošević D, Mladenović N (2009) Solving large p-median clustering problems by primal-dual variable neighborhood search. Data Min Knowl Discov 19(3):351–375MathSciNetCrossRef
Zurück zum Zitat Kaufman L, Rousseeuw PJ (2009) Finding groups in data: an introduction to cluster analysis, vol 344. Wiley, HobokenMATH Kaufman L, Rousseeuw PJ (2009) Finding groups in data: an introduction to cluster analysis, vol 344. Wiley, HobokenMATH
Zurück zum Zitat Kim S, Blanchard SJ, DeSarbo WS, Fong DK (2013) Implementing managerial constraints in model-based segmentation: extensions of Kim, Fong, and DeSarbo (2012) with an application to heterogeneous perceptions of service quality. J Mark Res 50(5):664–673CrossRef Kim S, Blanchard SJ, DeSarbo WS, Fong DK (2013) Implementing managerial constraints in model-based segmentation: extensions of Kim, Fong, and DeSarbo (2012) with an application to heterogeneous perceptions of service quality. J Mark Res 50(5):664–673CrossRef
Zurück zum Zitat Kochetov Y, Ivanenko D (2005) Computationally difficult instances for the uncapacitated facility location problem. Operations research/computer science interfaces series. Springer, Boston, pp 351–367 Kochetov Y, Ivanenko D (2005) Computationally difficult instances for the uncapacitated facility location problem. Operations research/computer science interfaces series. Springer, Boston, pp 351–367
Zurück zum Zitat Randel R, Aloise D, Mladenović N, Hansen P (2019) On the k-medoids model for semi-supervised clustering. In: Sifaleras A, Salhi S, Brimberg J (eds) Variable neighborhood search. Springer, Cham, pp 13–27CrossRef Randel R, Aloise D, Mladenović N, Hansen P (2019) On the k-medoids model for semi-supervised clustering. In: Sifaleras A, Salhi S, Brimberg J (eds) Variable neighborhood search. Springer, Cham, pp 13–27CrossRef
Zurück zum Zitat Santi É, Aloise D, Blanchard SJ (2016) A model for clustering data from heterogeneous dissimilarities. Eur J Oper Res 253(3):659–672MathSciNetCrossRef Santi É, Aloise D, Blanchard SJ (2016) A model for clustering data from heterogeneous dissimilarities. Eur J Oper Res 253(3):659–672MathSciNetCrossRef
Zurück zum Zitat Shor NZ, Kiwiel KC, Ruszcaynski A (1985) Minimization methods for nondifferentiable functions. Springer, BerlinCrossRef Shor NZ, Kiwiel KC, Ruszcaynski A (1985) Minimization methods for nondifferentiable functions. Springer, BerlinCrossRef
Zurück zum Zitat Wagstaff KL (2007) Value, cost, and sharing: open issues in constrained clustering. In: Džeroski S, Struyf J (eds) Knowledge discovery in inductive databases. Springer, Berlin, pp 1–10 Wagstaff KL (2007) Value, cost, and sharing: open issues in constrained clustering. In: Džeroski S, Struyf J (eds) Knowledge discovery in inductive databases. Springer, Berlin, pp 1–10
Zurück zum Zitat Wagstaff K, Cardie C, Rogers S, Schrödl S (2001). Constrained k-means clustering with background knowledge, vol ICML ’01. In: Proceedings of the eighteenth international conference on machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 577–584 Wagstaff K, Cardie C, Rogers S, Schrödl S (2001). Constrained k-means clustering with background knowledge, vol ICML ’01. In: Proceedings of the eighteenth international conference on machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 577–584
Zurück zum Zitat Zhu X, Goldberg AB, Brachman R, Dietterich T (2009) Introduction to semi-supervised learning. Morgan and Claypool Publishers, San RafaelCrossRef Zhu X, Goldberg AB, Brachman R, Dietterich T (2009) Introduction to semi-supervised learning. Morgan and Claypool Publishers, San RafaelCrossRef
Metadaten
Titel
A Lagrangian-based score for assessing the quality of pairwise constraints in semi-supervised clustering
verfasst von
Rodrigo Randel
Daniel Aloise
Simon J. Blanchard
Alain Hertz
Publikationsdatum
16.09.2021
Verlag
Springer US
Erschienen in
Data Mining and Knowledge Discovery / Ausgabe 6/2021
Print ISSN: 1384-5810
Elektronische ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-021-00794-0

Weitere Artikel der Ausgabe 6/2021

Data Mining and Knowledge Discovery 6/2021 Zur Ausgabe