
Knowledge and Information Systems


1 April 2013 | Regular Paper

Finding best algorithmic components for clustering microarray data

Authors: Milan Vukićević, Kathrin Kirchner, Boris Delibašić, Miloš Jovanović, Johannes Ruhland, Milija Suknović

Published in: Knowledge and Information Systems | Issue 1/2013


Abstract

The analysis of microarray data is fundamental to microbiology. Although clustering has long been recognized as central to the discovery of gene functions and to disease diagnosis, researchers have found the construction of good algorithms a surprisingly difficult task. In this paper, we address this problem with a component-based approach to clustering algorithm design, aimed at class retrieval from microarray data. The idea is to break existing algorithms up into independent building blocks for typical sub-problems, which are then reassembled in new ways to generate as yet unexplored methods. As a test, 432 algorithms were generated and evaluated on published microarray data sets. We found the top performers to be better than the original, component-providing ancestors and also competitive with a set of recently proposed algorithms. Finally, we identified components that showed consistently good performance for clustering microarray data and that should be considered in the further development of clustering algorithms.
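The component-based idea in the abstract can be illustrated with a small sketch. The abstract does not name the building blocks, so everything here is an assumption made for illustration: three hypothetical sub-problems (initialization, distance measure, centroid update) with two interchangeable options each, assembled into a k-means-style loop. That gives 2 × 2 × 2 = 8 variants rather than the paper's 432, and the Rand index on a toy data set stands in for the paper's evaluation on published microarray data.

```python
import itertools
import math
import random


# --- Hypothetical component pools (names are illustrative, not the paper's) ---
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def init_random(points, k, rng):
    # Pick k distinct points at random as starting centers.
    return rng.sample(points, k)


def init_farthest(points, k, rng):
    # Farthest-first seeding: each new center maximizes its distance
    # (Euclidean, for simplicity) to the centers chosen so far.
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(euclidean(p, c) for c in centers)))
    return centers


def update_mean(members):
    return tuple(sum(col) / len(col) for col in zip(*members))


def update_median(members):
    # Coordinate-wise (upper) median of the cluster members.
    return tuple(sorted(col)[len(col) // 2] for col in zip(*members))


COMPONENTS = {
    "init": {"random": init_random, "farthest": init_farthest},
    "distance": {"euclidean": euclidean, "manhattan": manhattan},
    "update": {"mean": update_mean, "median": update_median},
}


def build_algorithm(init, distance, update):
    """Assemble one partitioning algorithm from three components."""
    def cluster(points, k, seed=0, max_iter=20):
        rng = random.Random(seed)
        centers = list(init(points, k, rng))
        labels = None
        for _ in range(max_iter):
            new_labels = [min(range(k), key=lambda j: distance(p, centers[j])) for p in points]
            if new_labels == labels:  # assignments stable: converged
                break
            labels = new_labels
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                if members:  # keep the old center if a cluster is empty
                    centers[j] = update(members)
        return labels
    return cluster


def generate_algorithms(components):
    """Yield (component names, algorithm) for every combination of components."""
    slots = list(components)
    for names in itertools.product(*(components[s] for s in slots)):
        parts = {s: components[s][n] for s, n in zip(slots, names)}
        yield names, build_algorithm(**parts)


def rand_index(labels, truth):
    """Fraction of point pairs on which the clustering and the true classes agree."""
    pairs = list(itertools.combinations(range(len(labels)), 2))
    agree = sum((labels[i] == labels[j]) == (truth[i] == truth[j]) for i, j in pairs)
    return agree / len(pairs)


# Toy data: two well-separated groups with known classes.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
truth = [0, 0, 0, 1, 1, 1]

results = {names: rand_index(algo(points, 2), truth)
           for names, algo in generate_algorithms(COMPONENTS)}

for names, score in sorted(results.items(), key=lambda item: -item[1]):
    print("+".join(names), round(score, 3))
```

Ranking the entries of `results` by score, and then counting how often each component appears among the best combinations, mirrors in miniature the kind of component-level comparison the abstract describes.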



Title: Finding best algorithmic components for clustering microarray data
Authors: Milan Vukićević, Kathrin Kirchner, Boris Delibašić, Miloš Jovanović, Johannes Ruhland, Milija Suknović
Publication date: 1 April 2013
Publisher: Springer-Verlag
Published in: Knowledge and Information Systems, Issue 1/2013
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI: https://doi.org/10.1007/s10115-012-0542-5
