Published in: Neural Computing and Applications 15/2020

30.11.2019 | Original Article

A Hierarchical Clustering algorithm based on Silhouette Index for cancer subtype discovery from genomic data

Written by: N. Nidheesh, K. A. Abdul Nazeer, P. M. Ameer


Abstract

Identifying potential novel subtypes of cancers from genomic data requires techniques for estimating the number of natural clusters in the data. Determining the number of natural clusters in a dataset is a long-standing, challenging problem in Machine Learning. A widely used technique is to pair a clustering algorithm with an internal cluster validity index such as the Silhouette Index, but this approach has limitations. We propose a Hierarchical Agglomerative Clustering algorithm that automatically produces estimates of the number of natural clusters and gives the associated clustering solutions, along with dendrograms for visualizing the clustering structure. While building the clustering hierarchy, the algorithm uses a Silhouette Index-based criterion to select the pair of clusters to merge. When applied to a collection of cancer gene expression datasets and general datasets, the proposed algorithm found decent estimates of the number of natural clusters and the associated clustering solutions. It showed better overall performance than a set of prominent methods for estimating the number of natural clusters that are used for cancer subtype discovery from genomic data. The proposed method is deterministic, and it can be a better alternative to contemporary approaches for identifying potential novel subtypes of cancers from genomic data.
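The abstract does not spell out the exact merge criterion, so the following Python sketch is only one plausible reading of the idea and not the authors' algorithm. It assumes that, at each level of the hierarchy, the pair of clusters whose merge gives the highest mean Silhouette Index is merged, and that the level with the highest index is reported as the estimated number of clusters. The function names `mean_silhouette` and `silhouette_agglomerative`, the Euclidean distance, and the toy data are all illustrative assumptions.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def mean_silhouette(points, clusters):
    """Mean Silhouette Index of a partition given as lists of point indices."""
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for ci, members in enumerate(clusters):
        if len(members) == 1:
            continue  # common convention: s(i) = 0 for a singleton cluster
        for i in members:
            # a: mean distance from i to the other members of its own cluster
            a = sum(dist(points[i], points[j]) for j in members if j != i)
            a /= len(members) - 1
            # b: smallest mean distance from i to the members of another cluster
            b = min(
                sum(dist(points[i], points[j]) for j in other) / len(other)
                for cj, other in enumerate(clusters)
                if cj != ci
            )
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
    return total / len(points)


def silhouette_agglomerative(points):
    """Greedy bottom-up merging guided by the mean Silhouette Index.

    Assumed criterion (hypothetical, not taken from the paper): merge the
    pair of clusters that maximizes the index of the resulting partition.
    Returns (estimated_k, clusters_at_best_level, history), where history
    holds one (number_of_clusters, score) pair per merge.
    """
    clusters = [[i] for i in range(len(points))]
    best_score, best_clusters = -1.0, clusters
    history = []
    while len(clusters) > 2:
        candidates = []
        for p, q in combinations(range(len(clusters)), 2):
            merged = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
            merged.append(clusters[p] + clusters[q])
            candidates.append((mean_silhouette(points, merged), merged))
        # max() returns the first maximal candidate, so ties are broken
        # by enumeration order and the procedure stays deterministic
        score, clusters = max(candidates, key=lambda t: t[0])
        history.append((len(clusters), score))
        if score > best_score:
            best_score, best_clusters = score, clusters
    return len(best_clusters), best_clusters, history


# Toy data: three well-separated groups of three 2-D points each
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
k, clusters, history = silhouette_agglomerative(points)
print(k, sorted(sorted(c) for c in clusters))
```

On this toy input the three visually separated groups are recovered as the best level. The exhaustive pair search recomputes the index from scratch for every candidate merge, which keeps the sketch short but makes it practical only for small inputs; the `history` list records the merge levels that a dendrogram would be drawn from.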

Metadata
Title
A Hierarchical Clustering algorithm based on Silhouette Index for cancer subtype discovery from genomic data
Authors
N. Nidheesh
K. A. Abdul Nazeer
P. M. Ameer
Publication date
30.11.2019
Publisher
Springer London
Published in
Neural Computing and Applications, Issue 15/2020
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-019-04636-5
