Published in: Knowledge and Information Systems 1/2017

18-03-2016 | Regular Paper

An adaptive version of k-medoids to deal with the uncertainty in clustering heterogeneous data using an intermediary fusion approach

Authors: Aalaa Mojahed, Beatriz de la Iglesia


Abstract

This paper introduces Hk-medoids, a modified version of the standard k-medoids algorithm. The modification extends the algorithm to the problem of clustering complex heterogeneous objects that are described by a diversity of data types, e.g. text, images, structured data and time series. We first proposed an intermediary fusion approach, SMF, which calculates fused similarities between objects by taking into account the similarities between their component elements, using appropriate similarity measures. The fused approach entails uncertainty for incomplete objects and for objects whose distances diverge across the different components. The implementation of Hk-medoids proposed here works with the fused distances and deals with the uncertainty in the fusion process. We experimentally evaluate the potential of the proposed algorithm using five datasets with different combinations of data types that define the objects. Our results show the feasibility of the algorithm, and they also show a performance enhancement compared with applying the original SMF approach in combination with a standard k-medoids that does not take uncertainty into account. In addition, from a theoretical point of view, the proposed algorithm has lower computational complexity than the popular PAM implementation.
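To make the idea of fused distances and their uncertainty concrete, the following is a minimal illustrative sketch in Python. It is not the Hk-medoids algorithm or the SMF implementation from the paper: the function names (fuse_distances, k_medoids), the use of a plain average for fusion, the two uncertainty measures (fraction of missing component distances, and the range of the available ones), and the toy data are all assumptions made for illustration. The clustering step is an ordinary alternating k-medoids on the fused matrix and does not use the uncertainty values.

```python
def fuse_distances(matrices):
    """Average several per-element distance matrices (values assumed in [0, 1]).

    None marks a distance that cannot be computed (incomplete object).
    Returns (fused, missing, spread), where missing[i][j] is the fraction of
    elements with no distance and spread[i][j] is max - min of those available.
    """
    n = len(matrices[0])
    fused = [[0.0] * n for _ in range(n)]
    missing = [[0.0] * n for _ in range(n)]
    spread = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            vals = [m[i][j] for m in matrices if m[i][j] is not None]
            missing[i][j] = 1.0 - len(vals) / len(matrices)
            if vals:
                fused[i][j] = sum(vals) / len(vals)
                spread[i][j] = max(vals) - min(vals)
            else:
                # Assumption: with no information, treat the pair as maximally distant.
                fused[i][j] = 1.0
    return fused, missing, spread


def assign(dist, medoids):
    """Label each object with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, medoids, max_iter=100):
    """Plain alternating k-medoids on a precomputed distance matrix."""
    medoids = list(medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(old)
                continue
            # New medoid: the member with the smallest total distance to the others.
            new_medoids.append(min(members, key=lambda m: sum(dist[m][o] for o in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy data: four objects, two elements (e.g. a text part and a time-series part).
A = [[0.0, 0.1, 0.9, 0.8], [0.1, 0.0, 0.8, 0.9], [0.9, 0.8, 0.0, 0.2], [0.8, 0.9, 0.2, 0.0]]
B = [[0.0, 0.3, None, 1.0], [0.3, 0.0, 0.7, 0.6], [None, 0.7, 0.0, 0.1], [1.0, 0.6, 0.1, 0.0]]

fused, missing, spread = fuse_distances([A, B])
medoids, labels = k_medoids(fused, medoids=[0, 2])
print(labels)         # cluster label per object
print(missing[0][2])  # share of elements with no distance for objects 0 and 2
```

In this toy setting the pair (0, 2) has one of its two component distances missing, so its fused distance rests on less evidence than the others; an uncertainty-aware method would treat such entries differently from fully observed ones.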


Metadata
Title
An adaptive version of k-medoids to deal with the uncertainty in clustering heterogeneous data using an intermediary fusion approach
Authors
Aalaa Mojahed
Beatriz de la Iglesia
Publication date
18-03-2016
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 1/2017
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-016-0930-3
