Abstract
This chapter describes and reviews statistical issues and methods for the analysis of microarray data. The first section briefly introduces DNA microarray technology in biochemical and genetic research and then gives an overview of four levels of statistical analysis. The subsequent sections present the methods and algorithms in detail.
In the second section, we describe methods for identifying genes that are significantly differentially expressed between groups. These methods include fold change, several variants of the t-statistic, an empirical Bayes approach, and significance analysis of microarrays (SAM). We illustrate SAM using a publicly available colon-cancer dataset. We also discuss multiple-comparison issues and the use of the false discovery rate.
In the third section, we present algorithms and approaches for studying relationships among genes, particularly clustering and classification. For clustering, we discuss hierarchical clustering, k-means, and probabilistic model-based clustering in detail, with examples. We also describe the adjusted Rand index as a measure of agreement between the results of different clustering methods. For classification, we first define some basic concepts and then describe four commonly used methods: linear discriminant analysis (LDA), support vector machines (SVM), neural networks, and tree-and-forest-based classification. Examples illustrate SVM and tree-and-forest-based classification.
The fourth section briefly describes the meta-analysis of microarray data in three settings: data on the same biomolecule from the same platform, data on the same biomolecule from different platforms, and data on different biomolecules.
We end the chapter with final remarks on the future prospects of microarray data analysis.