2006 | OriginalPaper | Chapter

33. Statistical Methodologies for Analyzing Genomic Data

Authors: Fenghai Duan, Heping Zhang

Published in: Springer Handbook of Engineering Statistics

Publisher: Springer London

Abstract

This chapter describes and reviews a variety of statistical issues and methods related to the analysis of microarray data. The first section gives a brief introduction to DNA microarray technology in biochemical and genetic research, followed by an overview of four levels of statistical analysis. The subsequent sections present the methods and algorithms in detail.

In the second section, we describe methods for identifying genes that are significantly differentially expressed between groups. The methods include fold change, several t-statistics, the empirical Bayesian approach, and significance analysis of microarrays (SAM). We further illustrate SAM using a publicly available colon-cancer dataset. We also discuss multiple-comparison issues and the use of the false discovery rate.
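
As a minimal sketch of three ideas named in this section (fold change, a t-statistic, and false discovery rate control), the following Python uses only the standard library. It assumes Welch's unequal-variance t-statistic and the Benjamini-Hochberg step-up procedure as the concrete choices. The chapter's own formulas, the empirical Bayesian approach, and SAM are not reproduced here, and all function names are illustrative.

```python
import math
from statistics import mean, variance


def log2_fold_change(group_a, group_b):
    """Log2 ratio of mean expression in group_a to group_b.

    Assumes strictly positive expression values (an assumption of this sketch).
    """
    return math.log2(mean(group_a) / mean(group_b))


def welch_t(group_a, group_b):
    """Welch's two-sample t-statistic (does not assume equal variances)."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the corresponding hypothesis
    is rejected while controlling the false discovery rate at level alpha.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical expression values for one gene in two groups.
    tumor = [8.1, 7.9, 8.4, 8.0]
    normal = [4.0, 4.2, 3.9, 4.1]
    print(log2_fold_change(tumor, normal), welch_t(tumor, normal))
    # Hypothetical per-gene p-values.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

In practice a per-gene p-value would be derived from each t-statistic (or from permutations) before the step-up rule is applied. That step is omitted here to keep the sketch free of external dependencies.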

In the third section, we present various algorithms and approaches for studying the relationships among genes, particularly clustering and classification. For clustering analysis, we discuss hierarchical clustering, k-means, and probabilistic model-based clustering in detail, with examples. We also describe the adjusted Rand index as a measure of agreement between different clustering methods. For classification analysis, we first define some basic concepts. We then describe four commonly used classification methods: linear discriminant analysis (LDA), support vector machines (SVM), neural networks, and tree-and-forest-based classification. Examples are included to illustrate SVM and tree-and-forest-based classification.
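
To make the adjusted Rand index concrete, here is a minimal sketch that assumes the standard Hubert-Arabie pair-counting definition. The function name and the toy label vectors are illustrative and are not taken from the chapter.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items.

    Equals 1.0 for identical partitions (up to relabeling) and is close
    to 0 for partitions that agree only as much as expected by chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items")

    # Pairs of items placed together in both partitions (contingency cells).
    cells = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    # Pairs placed together within each partition separately.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case: both partitions are trivial (for example,
        # everything in one cluster), so they agree perfectly.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Same grouping with swapped label names: perfect agreement.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    # Groupings that cut across each other: agreement below chance.
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the index depends only on which items are grouped together, it can compare, for instance, a k-means result with a cut of a hierarchical clustering tree even when the cluster labels differ.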

The fourth section briefly describes the meta-analysis of microarray data in three settings: data from the same biomolecule and the same platform, data from the same biomolecule but different platforms, and data from different biomolecules.

We end the chapter with final remarks on the future prospects of microarray data analysis.

Title: Statistical Methodologies for Analyzing Genomic Data
Authors: Fenghai Duan, Heping Zhang
Publisher: Springer London
Book: Springer Handbook of Engineering Statistics
Print ISBN: 978-1-85233-806-0

Electronic ISBN: 978-1-84628-288-1

Copyright Year: 2006
DOI: https://doi.org/10.1007/978-1-84628-288-1_33
