
2015 | Original Paper | Book Chapter

Applying Clustering Analysis to Heterogeneous Data Using Similarity Matrix Fusion (SMF)

Authors: Aalaa Mojahed, Joao H. Bettencourt-Silva, Wenjia Wang, Beatriz de la Iglesia

Published in: Machine Learning and Data Mining in Pattern Recognition

Publisher: Springer International Publishing


Abstract

We define a heterogeneous dataset as a set of complex objects, that is, objects described by several data types such as structured data, images, free text or time series. We envisage that this definition could be extended to other data types. There are currently research gaps in how to deal with such complex data. In our previous work, we proposed an intermediary fusion approach called SMF, which produces a pairwise matrix of distances between heterogeneous objects by fusing the distances between the individual data types. More precisely, SMF aggregates partial distances that we compute separately for each data type, taking uncertainty into consideration. The result is a single fused distance matrix that can be passed to a standard clustering algorithm. In this paper we extend the practical work by evaluating SMF using the k-means algorithm to cluster heterogeneous data. We used a dataset of prostate cancer patients in which objects are described by two basic data types, namely structured and time-series data. We assess the clustering results using external validation on multiple possible classifications of our patients. The results show that the SMF approach can improve the clustering configuration when compared with clustering on an individual data type.
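The pipeline outlined in the abstract (partial distances per data type, fusion into one matrix, clustering, external validation) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the toy data, the function names, the max-scaling, the equal weights and the use of plain Euclidean distance for both data types are all assumptions. The uncertainty handling that SMF includes is omitted. The paper evaluates k-means; because this sketch clusters directly from a precomputed distance matrix, it swaps in a small k-medoids loop instead. The Rand index is used here as one possible external validation measure.

```python
import numpy as np

def pairwise_euclidean(X):
    """Return the n x n Euclidean distance matrix for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def normalise(D):
    """Scale a distance matrix to [0, 1] so data types are comparable."""
    top = D.max()
    return D / top if top > 0 else D

def fuse(distance_matrices, weights=None):
    """Weighted average of normalised partial distance matrices.

    Assumption: a plain weighted mean; uncertainty is not modelled here.
    """
    stack = np.stack([normalise(D) for D in distance_matrices])
    if weights is None:
        weights = np.ones(len(distance_matrices))
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w, stack, axes=1) / w.sum()

def k_medoids(D, k, n_iter=100, seed=0):
    """Cluster from a precomputed distance matrix (stand-in for k-means)."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)
        new = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # New medoid: member with the smallest total distance to its cluster.
                within = D[np.ix_(members, members)].sum(axis=0)
                new[c] = members[within.argmin()]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return D[:, medoids].argmin(axis=1), medoids

def rand_index(a, b):
    """Fraction of object pairs on which two labelings agree."""
    a, b = np.asarray(a), np.asarray(b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)
    return float((same_a[iu] == same_b[iu]).mean())

# Hypothetical toy objects: six "patients", each with a structured part
# and an equal-length time series. Two groups of three by construction.
structured = np.array([[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
                       [5.0, 5.1], [5.1, 5.0], [5.05, 5.05]])
rising = np.linspace(0.0, 1.0, 5)
falling = np.linspace(3.0, 1.0, 5)
series = np.vstack([rising, rising + 0.1, rising + 0.05,
                    falling, falling + 0.1, falling + 0.05])

fused = fuse([pairwise_euclidean(structured), pairwise_euclidean(series)])
labels, medoids = k_medoids(fused, k=2)
truth = [0, 0, 0, 1, 1, 1]  # hypothetical external classification
score = rand_index(labels, truth)
```

In practice each data type would get a distance measure suited to it (for example, a time-series-specific measure instead of Euclidean distance), and the fused matrix could also be compared against clustering on each partial matrix alone, which is the comparison the abstract reports.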


Footnotes
1
Contact the authors for more information about the data dictionary.
 
Metadata
Title
Applying Clustering Analysis to Heterogeneous Data Using Similarity Matrix Fusion (SMF)
Authors
Aalaa Mojahed
Joao H. Bettencourt-Silva
Wenjia Wang
Beatriz de la Iglesia
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-21024-7_17