nach oben

Data Mining and Knowledge Discovery

Erschienen in:

16.09.2021

A Lagrangian-based score for assessing the quality of pairwise constraints in semi-supervised clustering

verfasst von: Rodrigo Randel, Daniel Aloise, Simon J. Blanchard, Alain Hertz

Erschienen in: Data Mining and Knowledge Discovery | Ausgabe 6/2021

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Clustering algorithms help identify homogeneous subgroups from data. In some cases, additional information about the relationship among some subsets of the data exists. When using a semi-supervised clustering algorithm, an expert may provide additional information to constrain the solution based on that knowledge and, in doing so, guide the algorithm to a more useful and meaningful solution. Such additional information often takes the form of a cannot-link constraint (i.e., two data points cannot be part of the same cluster) or a must-link constraint (i.e., two data points must be part of the same cluster). A key challenge for users of such constraints in semi-supervised learning algorithms, however, is that the addition of inaccurate or conflicting constraints can decrease accuracy and little is known about how to detect whether expert-imposed constraints are likely incorrect. In the present work, we introduce a method to score each must-link and cannot-link pairwise constraint as likely incorrect. Using synthetic experimental examples and real data, we show that the resulting impact score can successfully identify individual constraints that should be removed or revised.

Vorheriger Artikel Implicit consensus clustering from multiple graphs

Nächster Artikel Time series clustering in linear time complexity

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

We focus on discrete labels (e.g., classes) for simplicity of exposition, although there are numerous unsupervised (e.g., latent trait models) and supervised models (e.g., regression) which focus on continuous outcomes.

Aggarwal CC (2015) Data mining. Springer, Berlin. https://doi.org/10.1007/978-3-319-14142-8CrossRefMATH

Aloise D, Deshpande A, Hansen P, Popat P (2009) Np-hardness of Euclidean sum-of-squares clustering. Mach Learn 75(2):245–248CrossRef

Aloise D, Hansen P, Liberti L (2010) An improved column generation algorithm for minimum sum-of-squares clustering. Math Program 131(1–2):195–220. https://doi.org/10.1007/s10107-010-0349-7MathSciNetCrossRefMATH

Anil J, Rong J, Radha C (2015) Semi-supervised clustering. Book section semi-supervised clustering. CRC Press, Boca Raton. https://doi.org/10.1201/b19706-26CrossRefMATH

Ares ME, Parapar J, Barreiro A (2012) An experimental study of constrained clustering effectiveness in presence of erroneous constraints. Inf Process Manag 48(3):537–551. https://doi.org/10.1016/j.ipm.2011.08.006CrossRef

Avella P, Sassano A, Vasil’ev I (2007) Computational study of large-scale p-median problems. Math Program 109(1):89–114. https://doi.org/10.1007/s10107-005-0700-6

Basu S, Banerjee A, Mooney RJ (2002) Semi-supervised clustering by seeding, In: Proceedings of the nineteenth international conference on machine learning, vol 656012. Morgan Kaufmann Publishers Inc., pp 27–34

Basu S, Banerjee A, Mooney RJ (2004) Active semi-supervision for pairwise constrained clustering. Soc Ind Appl Math. https://doi.org/10.1137/1.9781611972740.31CrossRef

Basu S, Bilenko M, Banerjee A, Mooney RJ (2006) Probabilistic semi-supervised clustering with constraints. In: Semi-supervised learning. pp 71–98

Bertsimas D, Tsitsiklis J (1997) Introduction to linear optimization, 1st edn. Athena Scientific, Belmont

Bezdek J (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New YorkCrossRef

Blanchard SJ, Aloise D, DeSarbo WS (2012) The heterogeneous p-median problem for categorization based clustering. Psychometrika 77(4):741–762. https://doi.org/10.1007/s11336-012-9283-3MathSciNetCrossRefMATH

Brucker P (1978) On the complexity of clustering problems. In: Henn R, Korte B, Oettli W (eds) Optimization and operations research. Springer, Berlin, pp 45–54. https://doi.org/10.1007/978-3-642-95322-4_5CrossRef

Campello RJ, Moulavi D, Zimek A, Sander J (2013) A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. Data Min Knowl Discov 27(3):344–371MathSciNetCrossRef

Christou IT (2011) Coordination of cluster ensembles via exact methods. IEEE Trans Pattern Anal Mach Intell 33(2):279–93. https://doi.org/10.1109/TPAMI.2010.85CrossRef

Costa LR, Aloise D, Mladenović N (2017) Less is more: basic variable neighborhood search heuristic for balanced minimum sum-of-squares clustering. Inf Sci 415:247–253CrossRef

Davidson I (2012) Two approaches to understanding when constraints help clustering. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining—KDD ’12. ACM, 2339734, pp 1312–1320. https://doi.org/10.1145/2339530.2339734

Davidson I, Ravi SS (2005) Clustering with constraints: feasibility issues and the k-means algorithm. https://doi.org/10.1137/1.9781611972757.13

Davidson I, Ravi SS (2006)Identifying and generating easy sets of constraints for clustering. In: Proceedings of the 21st national conference on artificial intelligence—Volume 1, vol 1597593. AAAI Press, pp 336–341

Davidson I, Wagstaff KL, Basu S (2006) Measuring constraint-set utility for partitional clustering algorithms. In: Fürnkranz J, Scheffer T, Spiliopoulou M (eds) Knowledge discovery in databases: PKDD 2006. Springer, Berlin, pp 115–126CrossRef

Delattre M, Hansen P (1980) Bicriterion cluster analysis. IEEE Trans Pattern Anal Mach Intell PAMI 2(4):277–291. https://doi.org/10.1109/TPAMI.1980.4767027CrossRefMATH

Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml

Edwards AWF, Cavalli-Sforza LL (1965) A method for cluster analysis. Biometrics 21(2):362–375CrossRef

Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7(2):179–188CrossRef

Fisher ML (1981) The Lagrangian relaxation method for solving integer programming problems. Manag Sci 27(1):1–18. https://doi.org/10.1287/mnsc.27.1.1MathSciNetCrossRefMATH

García S, Labbé M, Marín A (2011) Solving large p-median problems with a radius formulation. INFORMS J Comput 23(4):546–556. https://doi.org/10.1287/ijoc.1100.0418MathSciNetCrossRefMATH

Grossi V, Romei A, Turini F (2017a) Survey on using constraints in data mining. Data Min Knowl Discov 31(2):424–464. https://doi.org/10.1007/s10618-016-0480-zMathSciNetCrossRef

Grossi V, Romei A, Turini F (2017b) Survey on using constraints in data mining. Data Min Knowl Discov 31(2):424–464MathSciNetCrossRef

Hansen P, Jaumard B (1997) Cluster analysis and mathematical programming. Math Program 79(1–3):191–215. https://doi.org/10.1007/bf02614317MathSciNetCrossRefMATH

Hansen P, Brimberg J, Urošević D, Mladenović N (2009) Solving large p-median clustering problems by primal-dual variable neighborhood search. Data Min Knowl Discov 19(3):351–375MathSciNetCrossRef

Held M, Wolfe P, Crowder HP (1974) Validation of subgradient optimization. Math Program 6(1):62–88. https://doi.org/10.1007/bf01580223MathSciNetCrossRefMATH

Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218. https://doi.org/10.1007/BF01908075CrossRefMATH

Kariv O, Hakimi SL (1979) An algorithmic approach to network location problems. II: the p-medians. SIAM J Appl Math 37:539–560. https://doi.org/10.1137/0137041MathSciNetCrossRefMATH

Kaufman L, Rousseeuw PJ (2009) Finding groups in data: an introduction to cluster analysis, vol 344. Wiley, HobokenMATH

Kim S, Blanchard SJ, DeSarbo WS, Fong DK (2013) Implementing managerial constraints in model-based segmentation: extensions of Kim, Fong, and DeSarbo (2012) with an application to heterogeneous perceptions of service quality. J Mark Res 50(5):664–673CrossRef

Kochetov Y, Ivanenko D (2005) Computationally difficult instances for the uncapacitated facility location problem. Operations research/computer science interfaces series. Springer, Boston, pp 351–367

Mallapragada PK, Jin R, Jain AK (2008) Active query selection for semi-supervised clustering. In: 2008 19th international conference on pattern recognition. IEEE, pp 1–4. https://doi.org/10.1109/ICPR.2008.4761792

Pinheiro DN, Aloise D, Blanchard SJ (2020) Convex fuzzy k-medoids clustering. Fuzzy Sets Syst 389:66–92MathSciNetCrossRef

Randel R, Aloise D, Mladenović N, Hansen P (2019) On the k-medoids model for semi-supervised clustering. In: Sifaleras A, Salhi S, Brimberg J (eds) Variable neighborhood search. Springer, Cham, pp 13–27CrossRef

Resende MGC, Werneck RF (2007) A fast swap-based local search procedure for location problems. Ann Oper Res 150(1):205–230. https://doi.org/10.1007/s10479-006-0154-0MathSciNetCrossRefMATH

Santi É, Aloise D, Blanchard SJ (2016) A model for clustering data from heterogeneous dissimilarities. Eur J Oper Res 253(3):659–672MathSciNetCrossRef

Shor NZ, Kiwiel KC, Ruszcaynski A (1985) Minimization methods for nondifferentiable functions. Springer, BerlinCrossRef

Wagstaff KL (2007) Value, cost, and sharing: open issues in constrained clustering. In: Džeroski S, Struyf J (eds) Knowledge discovery in inductive databases. Springer, Berlin, pp 1–10

Wagstaff K, Cardie C, Rogers S, Schrödl S (2001). Constrained k-means clustering with background knowledge, vol ICML ’01. In: Proceedings of the eighteenth international conference on machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 577–584

Xiong S, Azimi J, Fern XZ (2014) Active learning of constraints for semi-supervised clustering. IEEE Trans Knowl Data Eng 26(1):43–54. https://doi.org/10.1109/tkde.2013.22CrossRef

Xiong C, Johnson DM, Corso JJ (2017) Active clustering with model-based uncertainty reduction. IEEE Trans Pattern Anal Mach Intell 39(1):5–17. https://doi.org/10.1109/TPAMI.2016.2539965CrossRef

Zhu X, Goldberg AB, Brachman R, Dietterich T (2009) Introduction to semi-supervised learning. Morgan and Claypool Publishers, San RafaelCrossRef

Titel: A Lagrangian-based score for assessing the quality of pairwise constraints in semi-supervised clustering
verfasst von: Rodrigo Randel
Daniel Aloise
Simon J. Blanchard
Alain Hertz
Publikationsdatum: 16.09.2021
Verlag: Springer US
Erschienen in: Data Mining and Knowledge Discovery / Ausgabe 6/2021
Print ISSN: 1384-5810
Elektronische ISSN: 1573-756X
DOI: https://doi.org/10.1007/s10618-021-00794-0

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Weitere Artikel der Ausgabe 6/2021

BROCCOLI: overlapping and outlier-robust biclustering through proximal stochastic gradient descent

Continuous treatment effect estimation via generative adversarial de-confounding

Chebyshev approaches for imbalanced data streams regression models

CURIE: a cellular automaton for concept drift detection

Characterizing attitudinal network graphs through frustration cloud

Introduction to the special issue of the ECML PKDD 2021 journal track