
2018 | Book

New Frontiers of Biostatistics and Bioinformatics


About this book

This book comprises presentations delivered at the 5th Workshop on Biostatistics and Bioinformatics, held in Atlanta on May 5–7, 2017. Featuring twenty-two selected papers from the workshop, it showcases the most current advances in the field, presenting new methods, theories, and case applications at the frontiers of biostatistics, bioinformatics, and interdisciplinary areas.

Biostatistics and bioinformatics have been playing a key role in statistics and other scientific research fields in recent years. The goal of the 5th Workshop on Biostatistics and Bioinformatics was to stimulate research, foster interaction among researchers in the field, and offer opportunities for learning and for facilitating research collaborations in the era of big data. The resulting volume offers timely insights for researchers, students, and industry practitioners.

Table of Contents

Frontmatter

Review of Theoretical Framework in Biostatistics

Frontmatter
Chapter 1. Optimal Weighted Wilcoxon–Mann–Whitney Test for Prioritized Outcomes
Abstract
We consider a two-group randomized clinical trial of prioritized endpoints, where mortality affects the assessment of a follow-up continuous outcome. With the continuous outcome as the principal outcome, we combine it with mortality via the worst-rank paradigm into a single composite endpoint. Then, we develop a weighted Wilcoxon–Mann–Whitney test statistic to analyze the data. We determine the optimal weights for the Wilcoxon–Mann–Whitney test statistic that maximize its power. We provide the rationale for the weights and their implications in the application of the method. In addition, we derive a formula for its power and demonstrate its accuracy in simulations. Finally, we apply the method to data from an acute ischemic stroke clinical trial of normobaric oxygen therapy.
Roland A. Matsouaka, Aneesh B. Singhal, Rebecca A. Betensky
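
The worst-rank construction described above can be illustrated with a short sketch: deaths are assigned a value worse than any observed continuous outcome, and the composite endpoint is then compared between arms. The sketch uses hypothetical simulated data and a standard (unweighted) Wilcoxon–Mann–Whitney test; the optimal weights derived in the chapter are not reproduced here.

```r
# Worst-rank composite endpoint, illustrated on hypothetical data.
set.seed(1)
n <- 100
group   <- factor(rep(c("control", "treatment"), each = n))
outcome <- c(rnorm(n, mean = 0), rnorm(n, mean = 0.3))  # follow-up continuous outcome
died    <- rbinom(2 * n, size = 1, prob = 0.15)         # mortality indicator

# Worst-rank paradigm: deaths receive a value worse than any observed outcome,
# so they are ranked below all survivors in the composite endpoint.
composite <- ifelse(died == 1, min(outcome) - 1, outcome)

# Standard (unweighted) Wilcoxon-Mann-Whitney test on the composite endpoint
wilcox.test(composite ~ group)
```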
Chapter 2. A Selective Overview of Semiparametric Mixture of Regression Models
Abstract
Finite mixtures of regression models have been widely used in many applications. In this article, we provide a systematic review of newly developed semiparametric mixtures of regression models. Recent developments and some open questions are also discussed.
Sijia Xiang, Weixin Yao
Chapter 3. Rank-Based Empirical Likelihood for Regression Models with Responses Missing at Random
Abstract
In this paper, a general regression model with responses missing at random is considered. From an imputed rank-based objective function, a rank-based estimator is derived and its asymptotic distribution is established under mild conditions. Inference based on the normal approximation approach suffers from under-coverage or over-coverage issues. To address these issues, we propose an empirical likelihood approach based on the rank-based objective function and establish its asymptotic distribution. Extensive Monte Carlo simulation experiments under different settings of error distributions with different response probabilities are considered. The simulation results show that the proposed approach performs better for the regression parameters than the normal approximation approach and its least-squares counterpart. Finally, a data example is provided to illustrate our method.
Huybrechts F. Bindele, Yichuan Zhao
Chapter 4. Bayesian Nonparametric Spatially Smoothed Density Estimation
Abstract
A Bayesian nonparametric density estimator that changes smoothly in space is developed. The estimator is built using the predictive rule from a marginalized Polya tree, modified so that observations are spatially weighted by their distance from the location of interest. A simple refinement is proposed to accommodate arbitrarily censored data and a test for whether the density is spatially varying is also developed. The method is illustrated on two real datasets, and an R function SpatDensReg is provided for general use.
Timothy Hanson, Haiming Zhou, Vanda Inácio de Carvalho

Wavelet-Based Approach for Complex Data

Frontmatter
Chapter 5. Mammogram Diagnostics Using Robust Wavelet-Based Estimator of Hurst Exponent
Abstract
Breast cancer is one of the leading causes of death in women. Mammography is an effective method for early detection of breast cancer. Like other medical images, mammograms demonstrate a certain degree of self-similarity over a range of scales, which can be used in classifying individuals as cancerous or non-cancerous. In this paper, we study robust estimation of the Hurst exponent (a self-similarity measure) in two-dimensional images based on non-decimated wavelet transforms (NDWT). The robustness is achieved by applying a general trimean estimator to the non-decimated wavelet detail coefficients of the transformed data; the general trimean estimator is derived as a weighted average of the distribution's median and quantiles, combining the median's emphasis on central values with the quantiles' attention to the extremes. The properties of the proposed estimators are studied both theoretically and numerically. Compared with other standard wavelet-based methods (the Veitch and Abry (VA) method, the Soltani, Simard, and Boichu (SSB) method, the median-based estimators MEDL and MEDLA, and the Theil-type (TT) weighted regression method), our methods reduce the variance of the estimators and increase prediction precision in most cases. We apply the proposed methods to digitized mammogram images, estimate the Hurst exponent, and then use it as a discriminatory descriptor to classify mammograms as benign or malignant. Our methods yield the highest classification accuracy, around 65%.
Chen Feng, Yajun Mei, Brani Vidakovic
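
A minimal sketch of the general trimean estimator described above: a weighted average of the sample median and a symmetric pair of quantiles. The quantile level and weight below are illustrative (Tukey's trimean is the special case p = 0.25, w = 0.5); they are not the optimal values derived in the chapter, and the full NDWT-based Hurst estimation pipeline is not shown.

```r
# General trimean estimator: weighted average of the median and the p-th and
# (1 - p)-th quantiles. p = 0.25 and w = 0.5 recover Tukey's trimean.
general_trimean <- function(x, p = 0.25, w = 0.5) {
  q <- quantile(x, probs = c(p, 1 - p), names = FALSE)
  w * median(x) + (1 - w) * mean(q)
}

# Illustration on a stand-in for the detail coefficients at one decomposition
# level; level-wise location estimates of the log2 energies would then feed a
# regression on level to estimate the Hurst exponent.
set.seed(1)
d <- rnorm(512)
general_trimean(log2(d^2))
```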
Chapter 6. Wavelet-Based Profile Monitoring Using Order-Thresholding Recursive CUSUM Schemes
Abstract
With the rapid development of advanced sensing technologies, rich and complex real-time profile or curve data are available in many processes in biomedical sciences and manufacturing. These profile data provide valuable intrinsic information about the performance or properties of the process, subject, or product, and it is often desirable to utilize them to develop efficient methodologies for process monitoring and fault diagnosis. In this article, we propose a novel wavelet-based profile monitoring procedure based on an order-thresholding transformation of recursive CUSUM statistics of multiple wavelet coefficients. Extensive simulation studies and a case study of tonnage profile data demonstrate that the proposed procedure is efficient for detecting unknown local changes in the profile.
Ruizhi Zhang, Yajun Mei, Jianjun Shi
Chapter 7. Estimating the Confidence Interval of Evolutionary Stochastic Process Mean from Wavelet Based Bootstrapping
Abstract
A time series is a realization of a stochastic process, where each observation is in general considered the mean of a Gaussian distribution at each time point t. The classical theory is built on this supposition; however, the assumption may frequently be violated, mainly for non-stationary or evolutionary stochastic processes. Thus, in this work we propose to estimate the uncertainty of the evolutionary mean, μ_t, of a stochastic process based on bootstrapping of wavelet coefficients. The wavelet multiscale decomposition provides wavelet coefficients that have less autocorrelation than the observations in the time domain, making it possible to apply bootstrap methodologies. Several bootstrap methodologies based on the discrete wavelet transform (DWT), also called wavestrapping, have been proposed in the literature to estimate confidence intervals of statistics of a time series, such as the autocorrelation. In this paper we implement these methods with a few modifications and compare them to newly proposed methods based on the non-decimated wavelet transform (NDWT), a translation-invariant transform that is better suited to time series. Each realization of the bootstrap provides a surrogate time series that imitates the trajectories of the original stochastic process, allowing a confidence interval for the mean to be built for both stationary and non-stationary processes. As an application, the confidence interval of the mean rate of bronchiolitis hospitalizations for the state of Paraná, Brazil, is estimated, along with its bias and standard errors.
Aline Edlaine de Medeiros, Eniuce Menezes de Souza
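
A toy sketch of the wavestrapping idea described above, under simplifying assumptions: a single-level Haar transform, resampling of the detail coefficients only, and a lowess smoother standing in for the estimator of the evolutionary mean μ_t. The chapter's DWT/NDWT-based procedures are considerably more refined.

```r
# Toy wavestrapping sketch: one-level Haar transform, bootstrap of the detail
# coefficients, reconstruction of surrogate series, and a pointwise percentile
# band for the evolutionary mean mu_t estimated by a lowess smoother.
set.seed(1)
n <- 256
x <- sin(2 * pi * (1:n) / n) + rnorm(n, sd = 0.5)   # hypothetical non-stationary series

odd  <- x[seq(1, n, by = 2)]
even <- x[seq(2, n, by = 2)]
a <- (odd + even) / sqrt(2)   # Haar approximation (scaling) coefficients: kept fixed
d <- (odd - even) / sqrt(2)   # Haar detail coefficients: resampled

mu_hat <- replicate(500, {
  d_star <- sample(d, replace = TRUE)                # resample detail coefficients
  x_star <- numeric(n)                               # reconstruct a surrogate series
  x_star[seq(1, n, by = 2)] <- (a + d_star) / sqrt(2)
  x_star[seq(2, n, by = 2)] <- (a - d_star) / sqrt(2)
  lowess(x_star, f = 0.2)$y                          # smooth estimate of mu_t
})
band <- apply(mu_hat, 1, quantile, probs = c(0.025, 0.975))  # pointwise 95% band
```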
Chapter 8. A New Wavelet-Based Approach for Mass Spectrometry Data Classification
Abstract
Proteomic patterns can help the diagnosis of the underlying pathological state of an organ such as the ovary, the lung, or the breast, to name a few. Accurate classification of mass spectrometry data is crucial for establishing a reliable diagnosis and decision process regarding the type of cancer. A statistical methodology for classifying mass spectrometry data is proposed, and an overview of wavelets, the principal component analysis T² statistic, and support vector machines is given. The study is performed on low-mass SELDI spectra derived from patients with breast cancer and from normal controls. There are 156 samples: 57 from normal (control) patients and 99 from cancer patients. A hyperparameter optimization based on grid search is conducted to select a support vector machine classification model. Performance was evaluated with k-fold cross-validation and Monte-Carlo simulation with 100 replications. The average accuracy is 100% with a standard error of 0; the average sensitivity, specificity, and area under the curve are likewise 100%. The excellent performance of the proposed method is mainly due to the statistical modeling and the feature extraction procedure proposed.
Achraf Cohen, Chaimaa Messaoudi, Hassan Badir

Clinical Trials and Statistical Modeling

Frontmatter
Chapter 9. Statistical Power and Bayesian Assurance in Clinical Trial Design
Abstract
In clinical trial design, statistical power is defined as the probability of rejecting the null hypothesis at a pre-specified true clinical treatment effect; that is, it is conditioned on the true but unknown effect. In practice, however, this true effect is never a fixed known value but is estimated, with uncertainty, from previous trials, which can lead to underpowered or overpowered trials. In order to incorporate the uncertainty in the observed treatment effect, Bayesian assurance has been proposed as an alternative to conventional statistical power; it is defined as the unconditional probability of rejecting the null hypothesis. In this chapter, we review the transition from conventional statistical power to Bayesian assurance and discuss the computation of Bayesian assurance using a Monte-Carlo simulation-based approach.
Ding-Geng Chen, Jenny K. Chen
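
A minimal Monte-Carlo sketch of the distinction drawn above, under assumed design values: conventional power is computed at a fixed effect, while Bayesian assurance averages that conditional power over a prior on the effect. The sample size, standard deviation, and prior below are hypothetical.

```r
# Conventional power vs. Bayesian assurance for a two-arm normal endpoint.
set.seed(1)
n_per_arm <- 100
sd_y      <- 1
alpha     <- 0.05

# Conditional power at a fixed treatment effect delta (normal approximation)
power_at <- function(delta) {
  se <- sd_y * sqrt(2 / n_per_arm)
  pnorm(delta / se - qnorm(1 - alpha / 2))
}
power_at(0.3)                           # conventional power at delta = 0.3

# Bayesian assurance: average the conditional power over a prior on the effect,
# here delta ~ N(0.3, 0.15^2), reflecting uncertainty from earlier trials
delta_draws <- rnorm(1e5, mean = 0.3, sd = 0.15)
mean(power_at(delta_draws))             # unconditional probability of rejecting H0
```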
Chapter 10. Equivalence Tests in Subgroup Analyses
Abstract
Confirmatory clinical trials that aim to demonstrate the efficacy of drugs are typically performed in broad patient populations so that the patient population is usually heterogeneous with respect to demographic variables and medical conditions. Therefore, regulatory guidelines request that, in addition to the primary comparison of the treatment effects in the total study population, the consistency of the treatment effect be evaluated across medically relevant subgroups (e.g. gender, age or comorbidities).
We propose that the consistency of the treatment effect in two subgroups should be assessed using an equivalence test, which in the current context we call a consistency test. The proposed tests compare the treatment contrasts in the two subgroups, aiming to reject the null hypothesis of heterogeneity.
We present tests for both quantitative and binary outcome variables. While the details of these tests differ for the two types of outcome variable, both tests are based on a generalised linear model in which treatment, subgroup, and subgroup-by-treatment interaction terms are fitted.
In this text, we review the basic properties of these consistency tests using Monte-Carlo simulations. A key objective of these simulations is to suggest suitable equivalence margins, based on the performance of the tests in various settings. The investigation indicates that equivalence tests can be used both to assess the consistency of treatment effects across subgroups and to detect medically relevant heterogeneity in treatment effects across subgroups.
A. Ring, M. Scharpenberg, S. Grill, R. Schall, W. Brannath
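
A minimal sketch of a consistency (equivalence) test for a quantitative outcome, in the spirit of the chapter above: fit a linear model with a subgroup-by-treatment interaction and apply two one-sided tests (TOST) of the interaction against a pre-specified margin. The data, margin, and model are hypothetical, and the binary-outcome version is not shown.

```r
# Consistency test sketch: TOST on the subgroup-by-treatment interaction.
set.seed(1)
n <- 400
subgroup  <- factor(sample(c("A", "B"), n, replace = TRUE))
treatment <- factor(sample(c("control", "active"), n, replace = TRUE),
                    levels = c("control", "active"))
y <- 1 + 0.5 * (treatment == "active") + 0.2 * (subgroup == "B") +
  0.1 * (treatment == "active") * (subgroup == "B") + rnorm(n)

fit <- lm(y ~ treatment * subgroup)
co  <- coef(summary(fit))["treatmentactive:subgroupB", ]
est <- co["Estimate"]; se <- co["Std. Error"]; df <- fit$df.residual

margin <- 0.4                                              # hypothetical equivalence margin
p_low  <- pt((est + margin) / se, df, lower.tail = FALSE)  # H0: interaction <= -margin
p_high <- pt((est - margin) / se, df)                      # H0: interaction >=  margin
max(p_low, p_high)    # consistency concluded if this TOST p-value is below alpha
```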
Chapter 11. Predicting Confidence Interval for the Proportion at the Time of Study Planning in Small Clinical Trials
Abstract
Confidence intervals are commonly used to assess the precision of parameter estimates. Particularly in small clinical trials, such an assessment may be used in place of a power calculation. We discuss "future" confidence interval prediction with binomial outcomes for small clinical trials and sample size calculation, where the term "future" confidence interval emphasizes that the confidence interval is a function of a random sample that is not yet observed at the planning stage of a study. We propose and discuss three probabilistic approaches to future confidence interval prediction when the sample size is small. We demonstrate substantial differences among these approaches in terms of interval width prediction and sample size calculation. We show that the approach based on the expectation of the boundaries has the most desirable properties and is easy to implement. In this chapter, we primarily discuss prediction of the Clopper-Pearson exact confidence interval, and then extend our discussion to other confidence interval methods. In particular, we discuss the arcsine transformation as a viable alternative to the exact confidence interval.
Jihnhee Yu, Albert Vexler
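
A sketch of the "expectation of the boundaries" idea highlighted above: average the Clopper-Pearson limits over the binomial distribution of the as-yet-unobserved count at a planning value of the proportion. The planning values of n and p below are hypothetical.

```r
# Predicted ("future") Clopper-Pearson interval at the planning stage via the
# expectation-of-the-boundaries approach.
expected_cp_interval <- function(n, p, conf.level = 0.95) {
  x <- 0:n
  limits <- sapply(x, function(k) binom.test(k, n, conf.level = conf.level)$conf.int)
  w <- dbinom(x, n, p)                     # probability of each future count
  c(lower = sum(w * limits[1, ]),          # expected lower boundary
    upper = sum(w * limits[2, ]))          # expected upper boundary
}

expected_cp_interval(n = 25, p = 0.3)
diff(expected_cp_interval(n = 25, p = 0.3))  # predicted interval width
```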
Chapter 12. Importance of Adjusting for Multi-stage Design When Analyzing Data from Complex Surveys
Abstract
Social scientists and policy makers commonly use estimates derived from population-based studies; e.g., estimates derived from the Tobacco Use Supplement (TUS) are commonly used in behavioral studies targeting smoking and quitting behaviors. The U.S. Census Bureau and other agencies designing and administering national surveys provide technical guidelines on suitable statistical methodologies, which specify the appropriate methods for estimation and prediction. However, when performing secondary data analyses, scientists may be prone to simplify analytical strategies and use classical statistical methods, i.e., to ignore design specifics and treat the complex design used to gather the data as if it were simple random sampling. In this chapter, we illustrate the importance of following the guidelines when analyzing complex surveys. We discuss three methods: method I ignores any weighting, method II incorporates the main weight only, and method III utilizes the main weight and balanced repeated replications with specified replicate weights. We illustrate possible discrepancies in point estimates and standard errors using 2014–2015 TUS data. The presented examples include smoking status, attitudes toward smoking restrictions in public places and cars, and smoking rules at home among single parents in the USA.
Trung Ha, Julia N. Soulakova
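
A schematic comparison of the three approaches on hypothetical data: an unweighted mean (method I), a weighted mean using the main weight (method II), and the same point estimate with a balanced-repeated-replication (BRR) variance from replicate weights (method III). The variable names, weights, and BRR setup are purely illustrative and do not reflect the TUS design; real analyses should use the survey's documented weights and replicate structure.

```r
# Schematic comparison of the three estimation strategies on simulated data.
set.seed(1)
n <- 500
smoker <- rbinom(n, 1, 0.2)                     # outcome of interest
w_main <- runif(n, 0.5, 3)                      # stand-in main survey weight
R <- 16
w_rep <- matrix(w_main * runif(n * R, 0.6, 1.4), n, R)  # stand-in replicate weights

# Method I: ignore the design entirely
est1 <- mean(smoker); se1 <- sd(smoker) / sqrt(n)

# Method II: main weight only (weighted point estimate)
est2 <- sum(w_main * smoker) / sum(w_main)

# Method III: main weight for the point estimate, BRR replicate weights for the SE
est_r <- colSums(w_rep * smoker) / colSums(w_rep)
se3   <- sqrt(mean((est_r - est2)^2))           # standard BRR variance formula

list(estimates  = c(I = est1, II = est2, III = est2),
     std_errors = c(I = se1, III_BRR = se3))
```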
Chapter 13. Analysis of the High School Longitudinal Study to Evaluate Associations Among Mathematics Achievement, Mentorship and Student Participation in STEM Programs
Abstract
Advancements in science, technology, and medicine have contributed to the economic growth of the United States. However, this requires individuals to be trained for careers in biostatistics, bioinformatics, and other areas of science, technology, engineering, and mathematics (STEM). Despite efforts to improve student recruitment and retention in STEM fields, the proportion of STEM graduates at the undergraduate level remains low in the United States, and it is lowest among underrepresented minorities. Several initiatives by the National Science Foundation (NSF), the American Statistical Association (ASA), and academic organizations have invested in training individuals by developing programs to address issues that may be contributing to low recruitment and retention rates. These factors may include a lack of mentoring from parents and teachers on STEM careers and fewer opportunities to explore STEM career options. In this study, a subsample of the High School Longitudinal Study (2009–2013) dataset (HSLS:09) is analyzed. Regression models were applied to evaluate mathematics achievement and student enrollment in STEM majors/careers based on individual participation in STEM activities and mentorship. Differences based on sex, race/ethnicity, and socioeconomic status (SES) were investigated. In summary, the aim of this work was to assess the significance of these factors in order to give insight into STEM education policy efforts. Our hope is that this work will shed light on the roles of mentors and student participation in STEM activities and motivate future programs aimed at recruiting and retaining STEM students.
Anarina L. Murillo, Hemant K. Tiwari, Olivia Affuso
Chapter 14. Statistical Modeling for the Heart Disease Diagnosis via Multiple Imputation
Abstract
Missing data is a common challenge in the statistical analysis of clinical data. Incomplete datasets can arise in different ways, such as mishandling of samples, low signal-to-noise ratio, measurement error, non-response to questions, or aberrant value deletion. Missing data can cause severe problems in statistical analysis and lead to invalid conclusions. Multiple imputation is a useful strategy for handling missing data, and statistical inference based on multiple imputation is widely accepted as less biased and more valid. In this chapter, we apply multiple imputation to a publicly accessible heart disease dataset with a high missing rate and build a prediction model for heart disease diagnosis.
Lian Li, Yichuan Zhao
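
A minimal sketch of the impute-then-model workflow using the mice package: create several imputed datasets, fit a logistic model to each, and pool the results with Rubin's rules. The data frame, variable names, and model below are placeholders, not the heart disease dataset analyzed in the chapter.

```r
# Multiple imputation followed by a logistic prediction model (sketch).
library(mice)
set.seed(1)

# Placeholder data standing in for the heart disease dataset (not the real data)
heart <- data.frame(
  disease = rbinom(200, 1, 0.4),
  age     = rnorm(200, 55, 9),
  chol    = rnorm(200, 240, 45),
  thalach = rnorm(200, 150, 20)
)
heart$chol[sample(200, 40)]    <- NA   # introduce missing values
heart$thalach[sample(200, 30)] <- NA

imp  <- mice(heart, m = 5, printFlag = FALSE)              # 5 imputed datasets
fits <- with(imp, glm(disease ~ age + chol + thalach,      # model fit per imputation
                      family = binomial))
summary(pool(fits))                                        # pooled by Rubin's rules
```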

High-Dimensional Gene Expression Data Analysis

Frontmatter
Chapter 15. Learning Gene Regulatory Networks with High-Dimensional Heterogeneous Data
Abstract
The Gaussian graphical model is a widely used tool for learning gene regulatory networks from high-dimensional gene expression data. Most existing methods for Gaussian graphical models assume that the data are homogeneous, i.e., all samples are drawn from a single Gaussian distribution. However, for many real problems, the data are heterogeneous: they may contain subgroups or come from different sources. This paper proposes to model the heterogeneous data using a mixture Gaussian graphical model and to apply the imputation-consistency algorithm, combined with the ψ-learning algorithm, to estimate the parameters of the mixture model and cluster the samples into different subgroups. An integrated Gaussian graphical network is learned across the subgroups over the iterations of the imputation-consistency algorithm. The proposed method is compared with an existing method for learning mixture Gaussian graphical models as well as a few other methods developed for homogeneous data, such as graphical Lasso, nodewise regression, and ψ-learning. The numerical results indicate the superiority of the proposed method in all aspects of parameter estimation, cluster identification, and network construction. The numerical results also indicate the generality of the proposed method: it can be applied to homogeneous data without significant harm. The accompanying R package GGMM is available at https://cran.r-project.org.
Bochao Jia, Faming Liang
Chapter 16. Performance Evaluation of Normalization Approaches for Metagenomic Compositional Data on Differential Abundance Analysis
Abstract
Background: In recent years, metagenomics, a combination of research techniques that bypasses cultivation, has become increasingly popular for studying the genomic/genetic variation of microbes in environmental or clinical samples. Though generated by similar sequencing technologies, there is increasing evidence that metagenomic sequence data should not be treated as just another variant of RNA-Seq count data, especially because of their compositional characteristics. While it is often of primary interest to compare taxonomic or functional profiles of microbial communities between conditions, normalization for library size is usually an inevitable step prior to a typical differential abundance analysis. Several methods have been proposed for such normalization, but existing performance evaluations of normalization methods for metagenomic sequence data do not adequately consider these compositional characteristics.

Result: The normalization methods assessed in this chapter include Total Sum Scaling (TSS), Relative Log Expression (RLE), Trimmed Mean of M-values (TMM), Cumulative Sum Scaling (CSS), and Rarefying (RFY). In addition to compositional proportions, simulated data were generated with consideration of overdispersion, zero inflation, and under-sampling. The impact of normalization on subsequent differential abundance analysis was further studied.
Conclusion: Selection of a normalization method for metagenomic compositional data should be made on a case-by-case basis. Simulation using the parameters learned from the experimental data may be carried out to assist the selection.
Ruofei Du, Lingling An, Zhide Fang
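
A minimal sketch of the simplest normalization considered above, total sum scaling (TSS), followed by a naive per-taxon differential abundance screen on simulated counts. The other normalizations (RLE, TMM, CSS, rarefying) rely on edgeR/metagenomeSeq-style tooling and are not shown; the simulation here does not reproduce the chapter's compositional design.

```r
# Total Sum Scaling (TSS) followed by a per-taxon Wilcoxon rank-sum screen.
set.seed(1)
n_taxa <- 50; n_samples <- 20
counts <- matrix(rnbinom(n_taxa * n_samples, size = 0.5, mu = 100),
                 nrow = n_taxa)                 # taxa x samples count table
group  <- rep(c("A", "B"), each = n_samples / 2)

# TSS: divide each sample's counts by its library size (column total)
rel_abund <- sweep(counts, 2, colSums(counts), "/")

# Naive differential abundance screen with BH-adjusted p-values
p_raw <- apply(rel_abund, 1, function(z)
  wilcox.test(z[group == "A"], z[group == "B"])$p.value)
p_adj <- p.adjust(p_raw, method = "BH")
head(sort(p_adj))
```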
Chapter 17. Identification of Pathway-Modulating Genes Using the Biomedical Literature Mining
Abstract
Although the biomedical literature is considered a valuable resource for investigating relationships among genes, it remains challenging to use it effectively for this purpose, mainly because most abstracts contain information about a single gene while the majority of approaches are based on the co-occurrence of genes within an abstract. To address this limitation, we recently developed a Bayesian hierarchical model, namely bayesGO, that identifies indirect relationships between genes by linking them through gene ontology (GO) terms. In addition, this approach facilitates interpretation of the identified pathways by automatically associating relevant GO terms with each gene within a unified framework. In this book chapter, we illustrate this approach using the web interface GAIL, which provides PubMed literature mining results based on human gene entities and GO terms, along with the R package bayesGO, which implements the proposed Bayesian hierarchical model. The web interface GAIL is currently hosted at http://chunglab.io/GAIL and the R package bayesGO is publicly available at its GitHub webpage (https://dongjunchung.github.io/bayesGO/).
Zhenning Yu, Jin Hyun Nam, Daniel Couch, Andrew Lawson, Dongjun Chung
Chapter 18. Discriminant Analysis and Normalization Methods for Next-Generation Sequencing Data
Abstract
Next-generation sequencing has become a powerful tool for gene expression analysis with the development of high-throughput techniques. Discriminating which type of disease a new sample belongs to is a fundamental issue in medical and biological studies. Unlike continuous microarray data, next-generation sequencing reads are mapped onto the reference genome and yield discrete counts. Consequently, existing discriminant analysis methods for microarray data may not be readily applicable to next-generation sequencing data. In recent years, a number of new discriminant analysis methods have been proposed for next-generation sequencing data. In this chapter, we introduce three such methods: Poisson linear discriminant analysis, zero-inflated Poisson logistic discriminant analysis, and negative binomial linear discriminant analysis. Given the importance of data preprocessing, we further introduce several normalization methods for next-generation sequencing data. Simulation studies and analyses of two real datasets are also presented to demonstrate the usefulness of the newly developed methods.
Yan Zhou, Junhui Wang, Yichuan Zhao, Tiejun Tong
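
A simplified Poisson (naive-Bayes style) discriminant for count data with a crude library-size adjustment, in the spirit of the Poisson linear discriminant analysis introduced above. This is an illustrative reduction, not the chapter's estimator, and the zero-inflated and negative binomial variants are not shown.

```r
# Simplified Poisson discriminant for sequencing-style counts (illustrative only).
set.seed(1)
n_genes <- 200; n_train <- 60
class_train <- rep(c(1, 2), each = n_train / 2)
mu <- cbind(rgamma(n_genes, 2, 0.1), rgamma(n_genes, 2, 0.1))  # class mean profiles
X  <- sapply(seq_len(n_train), function(i) rpois(n_genes, mu[, class_train[i]]))

size_factor <- colSums(X) / mean(colSums(X))     # crude library-size normalization

# Class-specific rate per gene, estimated from size-adjusted training counts
lambda <- sapply(1:2, function(k) {
  idx <- which(class_train == k)
  (rowSums(X[, idx]) + 1) / (sum(size_factor[idx]) + 1)   # +1 avoids log(0)
})

# Poisson log-likelihood score (constants dropped) for a new sample x with size s
poisson_score <- function(x, s, k, prior = 0.5) {
  sum(x * log(s * lambda[, k]) - s * lambda[, k]) + log(prior)
}
x_new <- rpois(n_genes, mu[, 2])
s_new <- sum(x_new) / mean(colSums(X))
which.max(c(poisson_score(x_new, s_new, 1), poisson_score(x_new, s_new, 2)))
```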

Survival Analysis

Frontmatter
Chapter 19. On the Landmark Survival Model for Dynamic Prediction of Event Occurrence Using Longitudinal Data
Abstract
In longitudinal cohort studies, participants are often monitored through periodic clinical visits until the occurrence of a terminal clinical event. A question of interest to both scientific research and clinical practice is to predict the risk of the terminal event at each visit, using the longitudinal prognostic information collected up to the visit. This problem is called dynamic prediction: a real-time, personalized prediction of the risk of a future adverse clinical event with longitudinally measured biomarkers and other prognostic information. An important method for dynamic prediction is the landmark Cox model and its variants. A fundamental difficulty in current methodological research on this kind of model is that it is unclear whether there exists a joint distribution of the longitudinal and time-to-event data that satisfies the model assumptions. As a result, this model is often viewed as a working model instead of a probability distribution, and its statistical properties are often studied using data simulated from shared random effect models, under which the landmark model operates under misspecification. In this paper, we demonstrate that a joint distribution of longitudinal and survival data exists that satisfies the modeling assumptions without additional restrictions, and we propose an algorithm to generate data from this joint distribution. We further generalize the results to the more flexible landmark linear transformation models, which include the landmark Cox model as a special case. These results facilitate future theoretical and numerical research on landmark survival models for dynamic prediction.
Yayuan Zhu, Liang Li, Xuelin Huang
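
A minimal sketch of fitting a landmark Cox model at a single landmark time s with prediction horizon w using the survival package. The data frame, variable names, landmark time, and horizon are placeholders; the chapter's landmark linear transformation models and data-generating algorithm are not shown.

```r
# Landmark Cox model at landmark time s with prediction horizon w (sketch).
library(survival)
set.seed(1)

# Placeholder data: biomarker_s is the biomarker value available by the landmark time
dat <- data.frame(
  time        = rexp(500, rate = 0.2),
  status      = rbinom(500, 1, 0.7),
  biomarker_s = rnorm(500)
)

s <- 2; w <- 3                                    # landmark time and horizon
risk_set <- subset(dat, time > s)                 # subjects still event-free at s
risk_set$t_lm <- pmin(risk_set$time, s + w) - s   # administrative censoring at s + w
risk_set$d_lm <- as.numeric(risk_set$status == 1 & risk_set$time <= s + w)

fit_lm <- coxph(Surv(t_lm, d_lm) ~ biomarker_s, data = risk_set)
summary(fit_lm)
# Predicted risk of the event within (s, s + w] for the first at-risk subject:
1 - summary(survfit(fit_lm, newdata = risk_set[1, ]), times = w)$surv
```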
Chapter 20. Nonparametric Estimation of a Cumulative Hazard Function with Right Truncated Data
Abstract
The reverse-time hazard has routinely been evaluated or modeled in the context of right truncation; however, this quantity does not have a natural interpretation. Based on the relation between the reverse-time and forward-time hazards, we develop nonparametric inference for the forward-time hazard. We study a family of weighted tests for comparing the hazard function between two independent samples, establish weak convergence properties, and conduct simulation studies to investigate the practical performance of the proposed variance estimators and tests. Finally, we analyze an AIDS incubation time data set to illustrate estimation and two-sample testing of the cumulative hazard function.
Xu Zhang, Yong Jiang, Yichuan Zhao, Haci Akcin
Chapter 21. Empirical Study on High-Dimensional Variable Selection and Prediction Under Competing Risks
Abstract
Competing risk analysis considers event times due to multiple causes, or of more than one event type. Commonly used regression models for such data include (1) the cause-specific hazards model, which focuses on modeling one type of event while acknowledging other event types simultaneously; and (2) the subdistribution hazards model, which links the covariate effects directly to the cumulative incidence function. Their use, and in particular their statistical properties, in the presence of high-dimensional predictors are largely unexplored. We study the accuracy of prediction and variable selection of existing statistical learning methods under both models using extensive simulation experiments, including different approaches to choosing penalty parameters in each method.
Jiayi Hou, Ronghui Xu
Chapter 22. Nonparametric Estimation of a Hazard Rate Function with Right Truncated Data
Abstract
Left truncation and right truncation coexist in a truncated sample. Earlier research focused on left truncation. Lagakos et al. (Biometrika 75:515–523, 1988) proposed transforming right-truncated data into left-truncated data and then applying the methods developed for left truncation. However, interpretation of survival quantities such as the hazard rate function in reverse time is not natural; although the forward-time hazard function is the most interpretable, researchers seldom use it. In this book chapter we study nonparametric inference for the hazard rate function with right-truncated data. Kernel smoothing techniques are used to obtain smoothed estimates of the hazard rate. Three commonly used kernels (uniform, Epanechnikov, and biweight) are applied to the AIDS data to illustrate the proposed methods.
Haci Akcin, Xu Zhang, Yichuan Zhao
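
A generic sketch of kernel smoothing of Nelson-Aalen increments with an Epanechnikov kernel. It uses simulated right-censored data and ignores the right-truncation risk-set adjustment that is the subject of the chapter; it is only meant to show the smoothing step.

```r
# Kernel-smoothed hazard rate from Nelson-Aalen increments (Epanechnikov kernel).
set.seed(1)
n <- 300
true_t <- rexp(n, rate = 0.5); cens <- rexp(n, rate = 0.2)
time   <- pmin(true_t, cens)
status <- as.numeric(true_t <= cens)

event_t <- sort(unique(time[status == 1]))
d  <- sapply(event_t, function(t) sum(time == t & status == 1))  # events at t
Y  <- sapply(event_t, function(t) sum(time >= t))                # at risk at t
dH <- d / Y                                                      # Nelson-Aalen increments

epan <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)
b <- 0.5                                                         # bandwidth
hazard_hat <- function(t) sum(epan((t - event_t) / b) * dH) / b

grid <- seq(0.2, 4, by = 0.1)
plot(grid, sapply(grid, hazard_hat), type = "l",
     xlab = "time", ylab = "smoothed hazard rate")
```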
Backmatter
Metadata
Title
New Frontiers of Biostatistics and Bioinformatics
Editors
Prof. Yichuan Zhao
Ding-Geng Chen
Copyright Year
2018
Electronic ISBN
978-3-319-99389-8
Print ISBN
978-3-319-99388-1
DOI
https://doi.org/10.1007/978-3-319-99389-8
