
2020 | Book

Statistical Modeling for Biological Systems

In Memory of Andrei Yakovlev


About this book

This book commemorates the scientific contributions of distinguished statistician, Andrei Yakovlev. It reflects upon Dr. Yakovlev’s many research interests including stochastic modeling and the analysis of microarray data, and throughout the book it emphasizes applications of the theory in biology, medicine and public health. The contributions to this volume are divided into two parts. Part A consists of original research articles, which can be roughly grouped into four thematic areas: (i) branching processes, especially as models for cell kinetics, (ii) multiple testing issues as they arise in the analysis of biological data, (iii) applications of mathematical models and of new inferential techniques in epidemiology, and (iv) contributions to statistical methodology, with an emphasis on the modeling and analysis of survival time data. Part B consists of methodological research reported as short communications, ending with some personal reflections on research fields associated with Andrei and on his approach to science. The Appendix contains an abbreviated vita and a list of Andrei’s publications, complete as far as we know. The contributions in this book are written by Dr. Yakovlev’s collaborators and notable statisticians including former presidents of the Institute of Mathematical Statistics and of the Statistics Section of the AAAS. Dr. Yakovlev’s research appeared in four books and almost 200 scientific papers, in mathematics, statistics, biomathematics and biology journals. Ultimately, this book offers a tribute to Dr. Yakovlev’s work and recognizes the legacy of his contributions in the biostatistics community.

Table of Contents

Frontmatter

Research Articles

Frontmatter
Stochastic Models of Cell Proliferation Kinetics Based on Branching Processes
Abstract
The aim of this memorial survey paper is to present some joint work with Andrei Yu. Yakovlev (http://www.biology-direct.com/content/3/1/10) focused on new ideas for the theory of branching processes arising in cell proliferation modeling. The following topics are considered: some basic characteristics of cell cycle temporal organization, distributions of pulse-labeled discrete markers in branching cell populations, distributions of a continuous label in proliferating cell populations, limiting age and residual lifetime distributions for continuous-time branching processes, limit theorems and estimation theory for multitype branching populations and relative frequencies with a large number of ancestors, and age-dependent branching populations with randomly chosen paths of evolution. Some of the presented results have not been published yet.
Nikolay M. Yanev
Age-Dependent Branching Processes with Non-homogeneous Poisson Immigration as Models of Cell Kinetics
Abstract
This article considers age-dependent branching processes with non-homogeneous Poisson immigration as models of cell proliferation kinetics. Asymptotic approximations for the expectation and variance–covariance functions of the process are developed. Estimators relying on the proposed asymptotic approximations are constructed, and their performance investigated using simulations.
Ollivier Hyrien, Nikolay M. Yanev
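The flavor of branching with immigration can be illustrated with a toy simulation. The sketch below uses a simplified discrete-time Galton-Watson process with Poisson immigration, a stand-in for the age-dependent (Bellman-Harris) processes with non-homogeneous Poisson immigration actually studied in the paper; the function name and all parameter values are illustrative.

```python
import numpy as np

def gw_with_immigration(offspring_mean, imm_rate, n_gen, seed=0):
    """Discrete-time Galton-Watson process with Poisson immigration.

    Each cell independently leaves a Poisson(offspring_mean) number of
    daughters per generation, and Poisson(imm_rate) new cells immigrate
    each generation. A simplified stand-in for the age-dependent
    processes in the paper.
    """
    rng = np.random.default_rng(seed)
    z = 0
    path = [z]
    for _ in range(n_gen):
        births = int(rng.poisson(offspring_mean, size=z).sum()) if z else 0
        z = births + int(rng.poisson(imm_rate))
        path.append(z)
    return path

# Subcritical case (offspring mean < 1): immigration keeps the population
# fluctuating around a finite level instead of letting it die out.
path = gw_with_immigration(offspring_mean=0.8, imm_rate=5.0, n_gen=50)
```

For this toy model the expected population size approaches imm_rate / (1 - offspring_mean), here 25, which is the kind of stable asymptotic behavior that makes moment-based estimation feasible.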
A Study of the Correlation Structure of Microarray Gene Expression Data Based on Mechanistic Modeling of Cell Population Kinetics
Abstract
Sample correlations between gene pairs within expression profiles are potentially informative regarding gene regulatory pathway structure. However, as is the case with other statistical summaries, observed correlation may be induced or suppressed by factors which are unrelated to gene functionality. In this paper, we consider the effect of heterogeneity on observed correlations, both at the tissue and subject level. Using gene expression profiles from highly enriched samples of three distinct embryonic glial cell types of the rodent neural tube, the effect of tissue heterogeneity on correlations is directly estimated for a simple two component model. Then, a stochastic model of cell population kinetics is used to assess correlation effects for more complex mixtures. Finally, a mathematical model for correlation effects of subject-level heterogeneity is developed. Although decomposition of correlation into functional and nonfunctional sources will generally not be possible, since this depends on nonobservable parameters, reasonable bounds on the size of such effects can be made using the methods proposed here.
Linlin Chen, Lev Klebanov, Anthony Almudevar, Christoph Proschel, Andrei Yakovlev
Correlation Between the True and False Discoveries in a Positively Dependent Multiple Comparison Problem
Abstract
Testing multiple hypotheses when observations are positively correlated is very common in practice. The dependence between observations can induce dependence between test statistics and distort the joint distribution of the true and false positives. It has a profound impact on the performance of common multiple testing procedures. While the marginal statistical properties of the true and false discoveries such as their means and variances have been extensively studied in the past, their correlation remains unknown.
By conducting a thorough simulation study, we find that the true and false positives are likely to be negatively correlated when testing power is high and positively correlated when testing power is low. The fact that positive dependence between observations can induce negative correlation between the true and false discoveries may assist researchers in designing multiple testing procedures for dependent tests in the future.
Xing Qiu, Rui Hu
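A simulation of this kind can be sketched as follows. This minimal example is not the authors' design: the one-factor model for positive dependence, the fixed one-sided cutoff, and all parameter values are illustrative assumptions. It counts true and false positives per replicate and correlates the two counts.

```python
import numpy as np

def tp_fp_correlation(n_rep=2000, m=100, m1=20, mu=3.0, rho=0.5,
                      z_crit=1.645, seed=0):
    """Correlation between true- and false-positive counts across replicates.

    Test statistics are equicorrelated normals (a shared latent factor
    induces pairwise correlation rho); the first m1 hypotheses are false
    nulls with mean shift mu. Each replicate rejects at the fixed
    one-sided cutoff z_crit and records both rejection counts.
    """
    rng = np.random.default_rng(seed)
    tp = np.empty(n_rep)
    fp = np.empty(n_rep)
    for r in range(n_rep):
        shared = rng.standard_normal()
        z = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.standard_normal(m)
        z[:m1] += mu  # true signals
        reject = z > z_crit
        tp[r] = reject[:m1].sum()   # true positives
        fp[r] = reject[m1:].sum()   # false positives
    return float(np.corrcoef(tp, fp)[0, 1])

corr_high_power = tp_fp_correlation(mu=4.0)  # high power
corr_low_power = tp_fp_correlation(mu=1.0)   # low power
```

To match the abstract's setting, a data-dependent procedure such as Benjamini-Hochberg would be substituted for the fixed cutoff, since the reported sign reversal concerns common multiple testing procedures rather than a fixed per-test threshold.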
Multiple Testing Procedures: Monotonicity and Some of Its Implications
Abstract
We review some results concerning the levels at which multiple testing procedures (MTPs) control certain type I error rates under a general and unknown dependence structure of the p-values on which the MTP is based. The type I error rates we deal with are (1) the classical family-wise error rate (FWER); (2) its immediate generalization: the probability of k or more false rejections (the generalized FWER); (3) the per-family error rate—the expected number of false rejections (PFER). The procedures considered are those satisfying the condition of monotonicity: reduction in some (or all) of the p-values used as input for the MTP can only increase the number of rejected hypotheses. It turns out that this natural condition, either by itself or combined with a property of being a step-down or step-up MTP (where the terms “step-down” and “step-up” are understood in their most general sense), has powerful consequences. Those include optimality results, inequalities, and identities involving different numerical characteristics of a procedure, and computational formulas.
Alexander Y. Gordon
Applications of Sequential Methods in Multiple Hypothesis Testing
Abstract
One of the main computational burdens in genome-wide statistical applications is the evaluation of large scale multiple hypothesis tests. Such tests are often implemented using replication-based methods, such as the permutation test or bootstrap procedure. While such methods are widely applicable, they place a practical limit on the computational complexity of the underlying test procedure. In such cases it would seem natural to apply sequential procedures. For example, suppose we observe the first ten replications of an upper-tailed statistic under a null distribution generated by random permutations, and of those ten, five exceed the observed value. It would seem reasonable to conclude that the P-value will not be small enough to be of interest, and further replications should not be needed.
While such methods have been proposed in the literature, for example by Hall in 1983, by Besag and Clifford in 1991 and by Lock in 1991, they have not been widely applied in multiple testing applications generated by high dimensional data sets, where they would likely be of some benefit. In this article related methods will first be reviewed. It will then be shown how commonly used multiple testing procedures may be modified so as to introduce sequential procedures while preserving the validity of reported error rates. A number of examples will show how such procedures can reduce computation time by an order of magnitude with little loss in power.
Anthony Almudevar
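The early-stopping idea in the abstract's example (stop once enough null replications exceed the observed statistic) can be sketched as a sequential Monte Carlo p-value in the spirit of Besag and Clifford. The function names and the two-sample data below are illustrative, not taken from the paper.

```python
import random

def sequential_perm_pvalue(stat_obs, draw_null, h=10, n_max=999):
    """Sequential Monte Carlo p-value in the spirit of Besag and Clifford.

    Draw null replicates one at a time and stop as soon as h of them
    reach or exceed the observed statistic: large (uninteresting)
    p-values are settled after only a few draws, which is where the
    savings in large-scale multiple testing come from.
    """
    exceed = 0
    for n in range(1, n_max + 1):
        if draw_null() >= stat_obs:
            exceed += 1
            if exceed == h:
                return h / n                  # stopped early: p not small
    return (exceed + 1) / (n_max + 1)         # usual Monte Carlo estimate

# Illustrative two-sample permutation test on made-up data.
random.seed(1)
x, y = [2.9, 3.1, 3.3, 3.0], [1.1, 1.4, 0.9, 1.2]
pooled = x + y
obs = abs(sum(x) / 4 - sum(y) / 4)

def draw_null():
    random.shuffle(pooled)  # permute group labels under the null
    return abs(sum(pooled[:4]) / 4 - sum(pooled[4:]) / 4)

p = sequential_perm_pvalue(obs, draw_null)
```

For a clearly non-significant test the loop typically terminates after roughly h draws, while a significant test runs the full n_max replications; applied across thousands of genes, most of which are null, this is the order-of-magnitude saving the article quantifies.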
Multistage Carcinogenesis: A Unified Framework for Cancer Data Analysis
Abstract
Traditional approaches to the analysis of epidemiologic data are focused on estimation of the relative risk and are based on the proportional hazards model. Proportionality of hazards in epidemiologic data is a strong assumption that is often violated but seldom checked. Risk often depends on detailed patterns of exposure to environmental agents, but detailed exposure histories are difficult to incorporate in the traditional approaches to analyses of epidemiologic data. For epidemiologic data on cancer, an alternative approach to analysis can be based on ideas of multistage carcinogenesis. The process of carcinogenesis is characterized by mutation accumulation and clonal expansion of partially altered cells on the pathway to cancer. Although this paradigm is now firmly established, most epidemiologic studies of cancer incorporate ideas of multistage carcinogenesis neither in their design nor in their analyses. In this paper we will briefly discuss stochastic multistage models of carcinogenesis and the construction of the appropriate likelihoods for analyses of epidemiologic data using these models. Statistical analyses based on multistage models can quite explicitly incorporate detailed exposure histories in the construction of the likelihood. We will give examples to show that using ideas of multistage carcinogenesis can help reconcile seemingly contradictory findings, and yield insights into epidemiologic studies of cancer that would be difficult or impossible to get from conventional methods. Finally, multistage cancer models provide a unified framework for analyses of data from diverse sources.
Suresh Moolgavkar, Georg Luebeck
A Machine-Learning Algorithm for Estimating and Ranking the Impact of Environmental Risk Factors in Exploratory Epidemiological Studies
Abstract
Epidemiological research, such as the identification of disease risks attributable to environmental chemical exposures, is often hampered by small population effects, large measurement error, and limited a priori knowledge regarding the complex relationships between the many chemicals under study. However, even an ideal study design does not preclude the possibility of reported false positive exposure effects due to inappropriate statistical methodology. Three issues often overlooked include (1) definition of a meaningful measure of association; (2) use of model estimation strategies (such as machine-learning) that acknowledge that the true data-generating model is unknown; (3) accounting for multiple testing. In this paper, we propose an algorithm designed to address each of these limitations in turn by combining recent advances in the causal inference and multiple-testing literature along with modifications to traditional nonparametric inference methods.
Jessica G. Young, Alan E. Hubbard, Brenda Eskenazi, Nicholas P. Jewell
A Latent Time Distribution Model for the Analysis of Tumor Recurrence Data: Application to the Role of Age in Breast Cancer
Abstract
Many popular statistical methods for survival data analysis do not allow for the possibility of cure, or at least of considerably lengthened survival, for breast cancer patients. Increasingly prolonged follow-up provides new data about the post-treatment outcome, revealing situations with the possibility of high cure rates for early-stage breast cancer patients. The exclusive use of “classical” statistical models of survival analysis can then lead to conclusions that do not fully reflect clinical reality. It therefore appears preferable to perform additional analyses based on survival models with a cured fraction when studying long-term survival data, especially in oncology. This approach allows statistical methods to be adapted to the biological process of tumor growth and dissemination. After presenting the Cox model, its extension with time-dependent covariates, and the Yakovlev parametric model, we study the prognostic role of age in young women (≤50 years) in Institut Curie breast cancer data. Age is a prognostic factor within all three models, but the interpretation is not the same. With the Cox model the younger women have a worse prognosis (HR = 1.86) compared to the older ones, but the HR does not satisfy the proportional hazards assumption. The Cox model with time-dependent covariates therefore gives a better interpretation: the age effect decreases significantly with time. With the Yakovlev model we find that the decreasing age effect can be viewed through the proportion of cured patients. Moreover, there is an effect of age on survival (palliative effect) and also on the cure rates (curative effect). The cure rate models thus demonstrate their utility in analyzing long-term survival data.
Yann De Rycke, Bernard Asselain
Estimation of Mean Residual Life
Abstract
Yang (Ann Stat, 6:112–116, 1978) considered an empirical estimate of the mean residual life function on a fixed finite interval. She proved it to be strongly uniformly consistent and (when appropriately standardized) weakly convergent to a Gaussian process. These results are extended to the whole half line, and the variance of the limiting process is studied. Also, nonparametric simultaneous confidence bands for the mean residual life function are obtained by transforming the limiting process to Brownian motion.
W. J. Hall, Jon A. Wellner
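The empirical estimator studied above has a one-line form: the mean residual life at t is the average of X - t over the observations exceeding t. A minimal sketch (function name illustrative):

```python
import numpy as np

def empirical_mrl(lifetimes, t):
    """Empirical mean residual life at time t: the average remaining
    lifetime X - t among observations with X > t (NaN if no observation
    survives past t)."""
    x = np.asarray(lifetimes, dtype=float)
    tail = x[x > t]
    return float(tail.mean() - t) if tail.size else float("nan")

# For an exponential distribution the true mean residual life is constant
# (equal to the mean), a handy sanity check for the estimator.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=100_000)
```

The instability the paper addresses is visible here too: as t grows, fewer observations remain in the tail, which is why extending consistency and confidence bands from a fixed finite interval to the whole half line requires care.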
Likelihood Transformations and Artificial Mixtures
Abstract
In this paper we consider the generalized self-consistency approach to maximum likelihood estimation (MLE). The idea is to represent a given likelihood as a marginal one based on artificial missing data. The computational advantage is sought in the likelihood simplification at the complete-data level. Semiparametric survival models and models for categorical data are used as an example. Justifications for the approach are outlined when the model at the complete-data level is not a legitimate probability model or if it does not exist at all.
Alex Tsodikov, Lyrica Xiaohong Liu, Carol Tseng
On the Application of Flexible Designs When Searching for the Better of Two Anticancer Treatments
Abstract
In the search for better treatments, biomedical researchers have developed an increasing number of new anticancer compounds that attack tumours with drugs targeted to specific molecular structures and that act very differently from standard cytotoxic drugs. This has put high pressure on early clinical drug testing, since drugs may need to be tested in parallel when only a limited number of patients (e.g., in rare diseases) or limited funding for a single compound is available. Furthermore, at the planning stage, basic information to define an adequate design may be rudimentary. Therefore, flexibility in the design and conduct of clinical studies has become one of the methodological challenges in the search for better anticancer treatments. Using the example of a comparative phase II study in patients with rare non-clear cell renal cell carcinoma and high uncertainty about effective treatment options, three flexible design options are explored for two-stage two-armed survival trials. Whereas the two classical group sequential approaches considered integrate early stopping for futility into two-sided hypothesis tests, the presented adaptive group sequential design extends these methods with sample size recalculation after the interim analysis if the study has not been stopped for futility. Simulation studies compare the characteristics of the different design approaches.
Christina Kunz, Lutz Edler
Parameter Estimation for Multivariate Nonlinear Stochastic Differential Equation Models: A Comparison Study
Abstract
Statistical methods have been proposed to estimate parameters in multivariate stochastic differential equations (SDEs) from discrete observations. In this paper, we propose a method to improve the performance of the local linearization method proposed by Shoji and Ozaki (Biometrika 85:240–243, 1998), i.e., to avoid the ill-conditioned problem in the computational algorithm. Simulation studies are performed to compare the new method to three other methods, the benchmark Euler method and methods due to Pedersen (1995) and to Hurn et al. (2003). Our results show that the new method performs the best when the sample size is large and the methods proposed by Pedersen and Hurn et al. perform better when the sample size is small. These results provide useful guidance for practitioners.
Wei Gu, Hulin Wu, Hongqi Xue
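As a point of reference for the benchmark Euler method mentioned above, here is an Euler-Maruyama discretization of a univariate Ornstein-Uhlenbeck process. This is only a stand-in: the paper's models are multivariate and nonlinear, and all parameter values here are illustrative.

```python
import math
import random

def euler_maruyama_ou(theta, mu, sigma, x0, dt, n_steps, seed=0):
    """Euler-Maruyama discretization of the Ornstein-Uhlenbeck SDE
    dX_t = theta * (mu - X_t) dt + sigma dW_t: each step adds the drift
    times dt plus a N(0, sigma^2 * dt) Gaussian increment."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(n_steps):
        x = x + theta * (mu - x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

path = euler_maruyama_ou(theta=1.0, mu=0.0, sigma=0.5, x0=5.0,
                         dt=0.01, n_steps=1000)
```

Simulation-based estimators such as Pedersen's simulated likelihood build transition-density approximations on top of exactly this kind of discretized path, which is why the Euler scheme serves as the natural benchmark in the comparison.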
On Frailties, Archimedean Copulas and Semi-Invariance Under Truncation
Abstract
Definitions and basic properties of bivariate Archimedean copula models for survival data are reviewed with an emphasis on their motivation via frailties. I present some new characterization results for Archimedean copula models based on a notion I call semi-invariance under truncation.
David Oakes
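The frailty motivation mentioned above has a direct sampling counterpart, the Marshall-Olkin algorithm: with a gamma frailty W, setting U_i = psi(E_i / W), where psi is the Laplace transform of W and the E_i are unit exponentials, yields the Clayton copula. A minimal sketch with illustrative parameters:

```python
import random

def clayton_via_frailty(theta, n, seed=0):
    """Sample n pairs from a Clayton copula by the gamma-frailty
    (Marshall-Olkin) construction: W ~ Gamma(1/theta, 1), E1, E2 iid
    Exp(1), and U_i = psi(E_i / W) with psi(t) = (1 + t)**(-1/theta),
    the Laplace transform of W."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        w = rng.gammavariate(1.0 / theta, 1.0)  # shared frailty
        u = (1.0 + rng.expovariate(1.0) / w) ** (-1.0 / theta)
        v = (1.0 + rng.expovariate(1.0) / w) ** (-1.0 / theta)
        pairs.append((u, v))
    return pairs

# Larger theta means stronger positive dependence: Kendall's tau is
# theta / (theta + 2) for the Clayton family.
pairs = clayton_via_frailty(theta=4.0, n=1000)
```

The shared frailty w is what couples the two margins, which is why conditioning (for example, truncation of the survival times) acts on the copula through the frailty distribution, the mechanism behind the truncation results in the article.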

Short Communications

Frontmatter
The Generalized ANOVA: A Classic Song Sung with Modern Lyrics
Abstract
The widely used analysis of variance (ANOVA) suffers from a series of flaws that not only raise questions about conclusions drawn from its use, but also undercut its many potential applications to modern clinical and observational research. In this paper, we propose a generalized ANOVA model to address the limitations of this popular approach so that it can be applied to many immediate as well as potential applications ranging from an age-old technical issue in applying ANOVA to cutting-edge methodological challenges. By integrating the classic theory of U-statistics, we develop distribution-free inference for this new class of models to address missing data for longitudinal clinical trials and cohort studies.
Hui Zhang, Xin Tu
Analyzing Gene Pathways from Microarrays to Sequencing Platforms
Abstract
Genetic microarrays have been the primary technology for quantitative transcriptome analysis since the mid-1990s. Via statistical testing methodology developed for microarray data, researchers can study genes and gene pathways involved in a disease. Recently a new technology known as RNA-seq has been developed to quantitatively study the transcriptome. This new technology can also study genes and gene pathways, although the statistical methodology used for microarrays must be adapted to this new platform. In this manuscript, we discuss methods of gene pathway analysis in microarrays and next generation sequencing and their advantages over standard “gene by gene” testing schemes.
Jeffrey Miecznikowski, Dan Wang, Xing Ren, Jianmin Wang, Song Liu
A New Approach for Quantifying Uncertainty in Epidemiology
Abstract
Epidemiology is the branch of science on which public health research is founded. This essay reviews some of the principles underlying current methodology, revealing some ambiguities and inconsistencies. A new approach is then proposed, the Bernoulli space, which is a complete model of uncertainty in a given situation. Each part of the model is necessary, and the entire model is sufficient for describing all relevant aspects of uncertainty. Using the Bernoulli space, two aims are achieved: (1) reliable and accurate predictions are obtained as a basis for the decision-making process; (2) the obtained experimental results admit a unique interpretation.
Elart von Collani
Branching Processes: A Personal Historical Perspective
Abstract
This article is a slightly edited and updated version of an evening talk during the random trees week at the Mathematisches Forschungsinstitut Oberwolfach, January 2009. It gives a—personally biased—sketch of the development of branching processes, from the mid nineteenth century to 2010, emphasizing relations to bioscience and demography, and to society and culture in general.
Peter Jagers
Principles of Mathematical Modeling in Biomedical Sciences: An Unwritten Gospel of Andrei Yakovlev
Abstract
This article describes Dr. Andrei Yakovlev’s unique philosophy of mathematical modeling in biomedical sciences. Although he never formulated it in a systematic way, it has always been central to his work and manifested amply in the course of the author’s 22-year research collaboration with this visionary scholar. We address methodological tensions between mathematics and biomedical sciences, epistemological status of mathematical models, and various methodological questions of a more practical nature arising in mathematical modeling and statistical data analysis including model selection, model identifiability, and concordance between the model and the observables.
Leonid Hanin
Backmatter
Metadata
Title
Statistical Modeling for Biological Systems
Edited by
Anthony Almudevar
David Oakes
Jack Hall
Copyright Year
2020
Electronic ISBN
978-3-030-34675-1
Print ISBN
978-3-030-34674-4
DOI
https://doi.org/10.1007/978-3-030-34675-1