Deoxyribonucleic acid (DNA) microarrays are part of a promising class of biotechnologies that allow the expression levels of thousands of genes in cells to be monitored simultaneously. One of the important issues in microarray experiments is the classification of biological samples and the prediction of clinical or other outcomes from gene expression data. A closely related issue is the identification of marker genes that have good predictive power for an outcome of interest. Although classification is not a new subject in the statistical literature, the large number of genes relative to the small sample sizes typical of microarray experiments raises new computational challenges. In this study, gene expression in breast cancer tumors is investigated, and the performance of several popular classification methods, including decision trees, logistic regression, linear discriminant analysis, and k-nearest neighbors, is compared. The results show that certain genes are significantly differentially expressed across groups of patients, and that the k-nearest neighbor method achieves better class-prediction performance than the other classification methods.
In addition to reviewing and illustrating the implementation of standard statistical tests and classification methods for modeling genome data, we will address several important issues in such studies: the role of experimental design (e.g., split-plot experimental design and analysis), the impact of correlation (within plates, between plates, between probes, etc.), and sampling issues in cross-validation and training-testing splits. While these issues have been discussed for simple statistical problems, they are not yet well understood by bioinformatics researchers modeling complex microarray data. In this talk, we will examine their impact on various standard testing and classification methods and illustrate the potential problems using the cancer tumor microarray experiments.
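As one illustration of a sampling issue in cross-validation, the following minimal Python sketch (not the analysis used in this study) contrasts selecting marker genes once on all samples with selecting them inside each training fold. Everything in it is an assumption made for illustration: the data are simulated noise with labels unrelated to expression, and the helper names top_genes, knn_predict, and cv_accuracy, the mean-difference gene score, and the choices of 10 genes, 5 folds, and k = 3 are hypothetical.

```python
import random
from statistics import mean

def top_genes(X, y, rows, m):
    """Rank genes by absolute difference in class means, using only `rows`."""
    a = [i for i in rows if y[i] == 0]
    b = [i for i in rows if y[i] == 1]
    score = lambda g: abs(mean(X[i][g] for i in a) - mean(X[i][g] for i in b))
    return sorted(range(len(X[0])), key=score, reverse=True)[:m]

def knn_predict(X, y, train, i, genes, k=3):
    """Majority vote among the k training samples nearest to sample i."""
    dist = lambda j: sum((X[i][g] - X[j][g]) ** 2 for g in genes)
    nearest = sorted(train, key=dist)[:k]
    return int(sum(y[j] for j in nearest) * 2 > k)

def cv_accuracy(X, y, m=10, folds=5, select_inside=True):
    """k-fold cross-validated accuracy of k-NN on the m top-ranked genes."""
    n = len(X)
    all_rows = list(range(n))
    # Selecting on all rows lets test-sample labels influence gene choice.
    global_genes = top_genes(X, y, all_rows, m)
    correct = 0
    for f in range(folds):
        test = all_rows[f::folds]
        train = [i for i in all_rows if i % folds != f]
        # Selecting inside the fold uses training samples only.
        genes = top_genes(X, y, train, m) if select_inside else global_genes
        correct += sum(knn_predict(X, y, train, i, genes) == y[i] for i in test)
    return correct / n

# Simulated data (assumption): 40 samples, 500 noise genes, balanced labels
# that are independent of the expression values.
rng = random.Random(0)
n, p = 40, 500
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]

honest = cv_accuracy(X, y, select_inside=True)
leaky = cv_accuracy(X, y, select_inside=False)
print("gene selection inside each fold:", honest)
print("gene selection before splitting:", leaky)
```

Because the simulated labels carry no signal, an unbiased accuracy estimate should be close to 50%. Selecting genes before splitting lets the test samples influence which genes are chosen, which tends to inflate the estimate when the number of genes is large relative to the number of samples.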