Published in: Knowledge and Information Systems 1/2013

1 April 2013 | Regular Paper

Finding best algorithmic components for clustering microarray data

Authors: Milan Vukićević, Kathrin Kirchner, Boris Delibašić, Miloš Jovanović, Johannes Ruhland, Milija Suknović

Abstract

The analysis of microarray data is fundamental to microbiology. Although clustering has long been recognized as central to the discovery of gene functions and to disease diagnosis, researchers have found the construction of good algorithms a surprisingly difficult task. In this paper, we address this problem with a component-based approach to clustering algorithm design for class retrieval from microarray data. The idea is to break existing algorithms into independent building blocks for typical sub-problems, which are then reassembled in new ways to generate as yet unexplored methods. As a test, 432 algorithms were generated and evaluated on published microarray data sets. Their top performers were better than the original, component-providing ancestors and competitive with a set of recently proposed algorithms. Finally, we identified components that performed consistently well for clustering microarray data and that should be considered in the further development of clustering algorithms.
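
The component-based idea can be illustrated with a minimal sketch. It assumes a k-means-style loop with three interchangeable slots (initialization, distance measure, representative update). The slot names, the component choices and the toy data are hypothetical and are not taken from the paper, whose 432 algorithms come from a richer component set that is not listed here.

```python
import itertools
import random
import statistics

# Hypothetical components for illustration only; not the paper's component set.

def init_random(points, k, dist, rng):
    """Pick k distinct points at random as initial representatives."""
    return rng.sample(points, k)

def init_farthest_first(points, k, dist, rng):
    """Start from one random point, then repeatedly add the farthest point."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def dist_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dist_manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def update_mean(members):
    return tuple(sum(col) / len(members) for col in zip(*members))

def update_median(members):
    return tuple(statistics.median(col) for col in zip(*members))

# One entry per sub-problem ("slot"); each slot offers interchangeable components.
COMPONENTS = {
    "init": {"random": init_random, "farthest_first": init_farthest_first},
    "distance": {"euclidean": dist_euclidean, "manhattan": dist_manhattan},
    "update": {"mean": update_mean, "median": update_median},
}

def build_algorithm(init, dist, update):
    """Assemble one clustering algorithm from one component per slot."""
    def cluster(points, k, seed=0, max_iter=50):
        rng = random.Random(seed)
        centers = init(points, k, dist, rng)
        labels = None
        for _ in range(max_iter):
            # Assign every point to its nearest representative.
            new_labels = [min(range(k), key=lambda j: dist(p, centers[j]))
                          for p in points]
            if new_labels == labels:  # assignments stable: converged
                break
            labels = new_labels
            # Recompute each representative from its current members.
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                if members:
                    centers[j] = update(members)
        return labels
    return cluster

def enumerate_algorithms(components=COMPONENTS):
    """Yield (component names, algorithm) for every combination of components."""
    slots = list(components)
    for names in itertools.product(*(components[s] for s in slots)):
        fns = [components[s][n] for s, n in zip(slots, names)]
        yield names, build_algorithm(*fns)

# Toy data: two well-separated groups of three points each.
POINTS = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6),
          (10.0, 10.0), (10.4, 9.8), (9.7, 10.3)]

for names, algorithm in enumerate_algorithms():  # 2 * 2 * 2 = 8 combinations
    print("+".join(names), algorithm(POINTS, k=2))
```

With this layout, adding one component to a slot multiplies the number of generated algorithms, so a modest number of building blocks yields a large design space. Evaluating each generated algorithm against known class labels, as the abstract describes, would be a separate step that is not shown here.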

Metadata
Title
Finding best algorithmic components for clustering microarray data
Authors
Milan Vukićević
Kathrin Kirchner
Boris Delibašić
Miloš Jovanović
Johannes Ruhland
Milija Suknović
Publication date
1 April 2013
Publisher
Springer-Verlag
Published in
Knowledge and Information Systems / Issue 1/2013
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-012-0542-5
