
## About This Book

This book illustrates numerous statistical practices that are commonly used by medical researchers, but which have severe flaws that may not be obvious. For each example, it provides one or more alternative statistical methods that avoid making misleading or incorrect inferences. The technical level is kept to a minimum to make the book accessible to non-statisticians. At the same time, since many of the examples describe methods used routinely by medical statisticians with formal statistical training, the book appeals to a broad readership in the medical research community.

## Table of Contents

### Chapter 1. Why Bother with Statistics?

Abstract
Many statistical practices commonly used by medical researchers, including both statisticians and non-statisticians, have severe flaws that often are not obvious. This chapter begins with a brief list of some of the examples that will be covered in greater detail in later chapters. The point is made, and illustrated repeatedly, that what may seem to be a straightforward application of an elementary statistical procedure may have one or more problems that are likely to lead to incorrect conclusions. Such problems may arise from numerous sources, including misapplication of a method that is not valid in a particular setting, misinterpretation of numerical results, or use of a conventional statistical procedure that is fundamentally wrong. Examples will include being misled by The Innocent Bystander Effect when determining causality, how conditional probabilities may be misinterpreted, the relationship between gambling and medical decision-making, and the use of Bayes’ Law to interpret the results of a test for a disease or to compute the probability that a child will have hemophilia based on what has been observed in family members.
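The chapter's closing example, using Bayes' Law to interpret a disease test result, can be sketched in a few lines. The prevalence, sensitivity, and specificity below are illustrative values, not figures from the book:

```python
def posterior_prob_disease(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' Law."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1.0 - specificity
    # Total probability of a positive test, over diseased and healthy people.
    p_pos = prevalence * p_pos_given_disease + (1 - prevalence) * p_pos_given_healthy
    return prevalence * p_pos_given_disease / p_pos

# For a rare disease, even a fairly accurate test yields a surprisingly
# low posterior probability of disease given a positive result.
post = posterior_prob_disease(prevalence=0.01, sensitivity=0.95, specificity=0.90)
print(f"P(disease | positive) = {post:.3f}")  # about 0.088
```

The point previewed in the abstract is visible in the numbers: with 1% prevalence, most positive results come from the large healthy population, so the conditional probability of disease given a positive test is under 9%.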
Peter F. Thall

### Chapter 2. Frequentists and Bayesians

Abstract
This chapter explains some elementary statistical concepts, including the distinction between a statistical estimator computed from data and the parameter that is being estimated. The process of making inferences from data will be discussed, including the importance of accounting for variability in data and one’s uncertainty when making statistical inferences. A detailed account of binomial confidence intervals will be presented, including a brief history of how Gauss, de Moivre, and Laplace established important ideas that still are relevant today. The relationship between sample size and statistical reliability will be discussed and illustrated. I will introduce Bayesian statistics, which treats parameters as random quantities and thus is fundamentally different from frequentist statistics, which treats parameters as fixed but unknown. Graphical illustrations of posterior distributions of parameters will be given, including posterior credible intervals, illustrations of how a Bayesian analysis combines prior knowledge and data to form posterior distributions and make inferences, and how reliability improves with larger sample size. An example will be given of how being biomarker positive or negative may be related to the probability that someone has a particular disease.
Peter F. Thall

### Chapter 3. Knocking Down the Straw Man

Abstract
An introduction to clinical trials is given, including a list of things to consider when designing a clinical trial. An extensive discussion is given of the Simon (1989) two-stage phase II clinical trial design, because it is used very commonly, and it very often is misunderstood or applied inappropriately. This design provides a useful illustration of more general problems with the way that the frequentist paradigm of testing hypotheses is misused or misinterpreted, and problems that may arise when making inferences based on single-arm trials. These include (1) treatment-trial confounding, (2) making biased treatment comparisons, (3) the common misunderstanding of what an alternative hypothesis means, (4) how rejection of a null hypothesis often is misinterpreted, (5) consequences of ignoring toxicity in phase II trials, (6) assuming incorrectly that a null response probability estimated from historical data is known with certainty, and (7) logistical problems that may arise when making interim adaptive decisions during a trial. Two alternative Bayesian methods will be described. The first is posterior evaluation of binomial data from a phase II trial based on a binary response variable, leading to a conclusion that is substantively different from that based on a test of hypotheses. The second is a practical Bayesian phase II design that monitors both response rate for futility and toxicity rate for safety.
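The first Bayesian alternative described in this abstract, posterior evaluation of binomial phase II data, amounts to computing a posterior probability such as P(response rate > null rate | data). The sketch below uses a Beta prior and Monte Carlo sampling; the prior, interim data, and cutoffs are hypothetical choices, not the book's:

```python
import random

def posterior_prob_above(x, n, p0, a=0.5, b=0.5, draws=200_000):
    """Monte Carlo estimate of P(p > p0 | x responses in n patients)
    under a Beta(a, b) prior; the posterior is Beta(a + x, b + n - x)."""
    return sum(random.betavariate(a + x, b + n - x) > p0
               for _ in range(draws)) / draws

random.seed(7)
# Hypothetical interim data: 2 responses in 20 patients, null rate p0 = 0.25.
prob = posterior_prob_above(2, 20, 0.25)
print(f"P(p > 0.25 | 2/20) = {prob:.3f}")
# A futility monitoring rule might stop accrual if this posterior
# probability falls below a small cutoff such as 0.05; an analogous
# rule on the toxicity rate can monitor safety.
```

Unlike a hypothesis test, this quantity directly answers the clinically relevant question of how plausible a promising response rate is given the observed data.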
Peter F. Thall

### Chapter 4. Science and Belief

Abstract
This chapter discusses the relationship between belief and statistical inference. It begins with a brief history of modern statistics, explains some important elementary statistical ideas, and discusses elements of clinical trials. A discussion is given of how the empirical approach used by statisticians and scientists to establish what one believes may be at odds with how most people actually think and behave. This is illustrated by several examples, including how one might go about determining whether dogs are smarter than cats, belief and religious wars, a story from the early fifteenth century about how one might decide how many teeth are in a horse’s mouth, and how a prominent laboratory researcher once threw a temper tantrum in my office. A discussion and several examples are given of cherry-picking, which is the common practice of selecting and reporting a rare event, and how this misrepresents reality. The relationship of this practice to gambling is explained, including examples of how to compute the expected gain of a bet. Illustrations are given of the relationships between these ideas and medical statistics, news reports, and public policy.
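The expected gain of a bet mentioned in this abstract is a simple weighted average. The function below is generic; the roulette numbers are the standard single-number bet in American roulette (38 pockets, 35-to-1 payout), used here only as a familiar illustration:

```python
def expected_gain(p_win, amount_won, amount_lost):
    """Expected gain of a bet: win amount_won with probability p_win,
    otherwise lose amount_lost."""
    return p_win * amount_won - (1 - p_win) * amount_lost

# Single-number bet in American roulette: a 1-in-38 chance to win $35,
# otherwise the $1 stake is lost.
print(f"Expected gain per $1 bet: ${expected_gain(1/38, 35, 1):.4f}")  # -$0.0526
```

The negative expectation, about 5.3 cents lost per dollar wagered, is the kind of computation the chapter connects to cherry-picking: reporting only the rare wins misrepresents a game that loses on average.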
Peter F. Thall

### Chapter 5. The Perils of P-Values

Abstract
The chapter gives an extensive discussion of p-values. It begins with a metaphorical example of a convention in which an arbitrary cutoff is used to dichotomize numerical information. The ritualistic use of p-values as a basis for constructing tests of hypotheses and computing sample sizes will be presented and discussed. This will be followed by a discussion of the use and misuse of p-values to establish “statistical significance” as a basis for making inferences, and practical problems with how p-values are computed. Bayes Factors will be presented as an alternative to p-values. The way that the hypothesis testing paradigm often is manipulated to obtain a desired sample size will be described, including an example of the power curve of a test as a more honest representation of the test’s properties. An example will be given to show that a p-value should not be used to quantify strength of evidence. An example from the published literature will be given that illustrates how comparing a p-value to the conventional cutoff 0.05 may be misleading and harmful. The problem of dealing with false positive conclusions in multiple testing will be discussed. Type S error and the use of Bayesian posterior probabilities will be given as alternative methods. The chapter will close with an account of the ongoing P-value war in the scientific community.
Peter F. Thall

### Chapter 6. Flipping Coins

Abstract
In this chapter, I will discuss the use of randomization as a fundamental scientific tool in experiments where one wishes to make fair comparisons. I will begin with a brief discussion of the use of randomization in agricultural experiments many years ago, and how much later it became a prominent component of comparative clinical trials in medical research. To illustrate its usefulness and importance, I will describe a famous example of how randomization was used to obtain an unbiased comparison of two very different treatments for breast cancer that changed medical practice worldwide. An explanation of why randomization provides unbiased estimators of causal effects will be given. I will provide an example showing how a covariate effect can be mistaken for a treatment effect if one does not randomize, and instead relies on a conventional regression analysis of observational data. Reviews and illustrations will be given of statistical methods to correct for bias when analyzing observational data from non-randomized studies, including stratification, inverse probability weighting, and pair matching. Rubin’s (1978) Bayesian rationale for randomization will be described. A review will be given of two simulation studies of outcome-adaptive randomization, including potentially severe scientific flaws with this methodology that may not be apparent.
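One of the bias-correction methods listed in this abstract, inverse probability weighting, can be sketched on a tiny hypothetical observational dataset in which a covariate is associated with both treatment choice and outcome. All values are illustrative only:

```python
# Each record: (treated, covariate_z, outcome_y). The covariate z confounds
# the comparison: z = 1 patients do better and are more likely to be treated.
data = [
    (1, 1, 9.0), (1, 1, 10.0), (1, 1, 11.0), (1, 0, 6.0),
    (0, 1, 8.0), (0, 0, 4.0), (0, 0, 5.0), (0, 0, 3.0),
]

def propensity(z):
    """P(treated | z), estimated here by simple within-stratum frequencies."""
    stratum = [(t, zz) for t, zz, _ in data if zz == z]
    return sum(t for t, _ in stratum) / len(stratum)

# Horvitz-Thompson style IPW estimates of the mean outcome under each treatment.
n = len(data)
mu1 = sum(t * y / propensity(z) for t, z, y in data) / n
mu0 = sum((1 - t) * y / (1 - propensity(z)) for t, z, y in data) / n
print("IPW estimate of treatment effect:", round(mu1 - mu0, 3))   # 2.0

# Naive difference in means, ignoring the confounder z, for contrast:
naive = (sum(y for t, _, y in data if t) / 4
         - sum(y for t, _, y in data if not t) / 4)
print("Naive difference in means:", round(naive, 3))              # 4.0
```

Weighting each observation by the inverse of its treatment probability rebalances the covariate across arms, so the IPW estimate (2.0) removes the confounding that inflates the naive comparison (4.0), which is exactly the kind of covariate-for-treatment confusion the chapter warns about.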
Peter F. Thall

### Chapter 7. All Mixed Up

Abstract
Possible relationships between the probability of early response and expected survival time with a given treatment are at the heart of the conventional paradigm for using phase II response data to plan phase III trials. These relationships often are misunderstood, however, which can lead to very bad decisions. To illustrate this, I will present a simple probability computation which gives numerical results that may seem surprising. I then will give an example of how a trial effect can be mistaken for a treatment effect if one compares data from different trials rather than randomizing. A method will be described for computing the predictive probability that a future phase III trial will be successful given observed phase II data, and this will be illustrated by a numerical example. An example will be given of a randomized trial in which between-arm comparisons of the 90-day response probabilities and the 12-month progression-free survival probabilities gave opposite conclusions regarding which treatment was superior. These examples illustrate the facts that probability often can be counterintuitive, and that basing treatment comparisons on early outcomes can be very misleading.
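The simple probability computation this abstract alludes to is a mixture: expected survival time averages survival among responders and non-responders, weighted by the response probability. The numbers below are hypothetical, chosen to show that a higher response rate need not imply longer expected survival:

```python
def mean_survival(p_resp, mean_if_resp, mean_if_no_resp):
    """E[T] = P(resp) * E[T | resp] + (1 - P(resp)) * E[T | no resp]."""
    return p_resp * mean_if_resp + (1 - p_resp) * mean_if_no_resp

# Treatment A: lower response rate, but decent survival either way.
ea = mean_survival(0.30, 30.0, 18.0)   # months
# Treatment B: higher response rate, but poor survival among non-responders.
eb = mean_survival(0.50, 24.0, 8.0)

print(f"E[T | A] = {ea:.1f} months, E[T | B] = {eb:.1f} months")  # 21.6 vs 16.0
```

Here B wins on early response (50% vs 30%) yet loses on mean survival (16.0 vs 21.6 months), illustrating the abstract's warning that basing treatment comparisons on early outcomes can be very misleading.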
Peter F. Thall

### Chapter 8. Sex, Biomarkers, and Paradoxes

Abstract
This chapter will begin with an example of Simpson’s Paradox, which may arise in tables of cross-classified categorical data. An explanation of how the paradox may arise will be given, and a method for using the tabulated data to compute correct statistical estimators that resolve the paradox will be illustrated. A second example will be given in the context of comparing the batting averages of two baseball players, where the paradox cannot be resolved. An example of cross-tabulated data on treatment, biomarker status, and response rates will be given in which there appears to be an interactive treatment–biomarker effect on response rate. This example will be elaborated by also including sex in the cross-classification, which leads to different conclusions about biomarker effects. A discussion of latent variables and causality will be given. Latent effects for numerical valued variables in the context of fitting regression models will be illustrated graphically. The importance of plotting scattergrams of raw data and examining possible covariate–subgroup interactions before fitting regression models will be illustrated. An example will be given of data where a fitted regression model shows that a patient covariate interacts with a between-treatment effect, with the consequence that which treatment is optimal depends on the patient’s covariate value.
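Simpson's Paradox, the chapter's opening topic, can be reproduced with a small cross-classified table. The counts below are patterned after the well-known kidney-stone example and stand in for any stratifying variable such as biomarker status or sex:

```python
def rate(x, n):
    """Observed response rate."""
    return x / n

# (responses, patients) by treatment and stratum, chosen to produce the paradox.
strata = {
    ("A", "low"):  (81, 87),
    ("A", "high"): (192, 263),
    ("B", "low"):  (234, 270),
    ("B", "high"): (55, 80),
}

for s in ("low", "high"):
    ra = rate(*strata[("A", s)])
    rb = rate(*strata[("B", s)])
    print(f"stratum {s}: A = {ra:.3f}, B = {rb:.3f}")  # A wins in BOTH strata

# Collapsing the table over the stratum reverses the comparison:
agg = {t: tuple(map(sum, zip(strata[(t, "low")], strata[(t, "high")])))
       for t in ("A", "B")}
print("pooled:", {t: round(rate(*agg[t]), 3) for t in ("A", "B")})  # B "wins"
```

The reversal happens because treatment A was given mostly to the difficult "high" stratum; a stratified estimator that averages the within-stratum rates with common weights, rather than the raw pooled rates, resolves the paradox, as the chapter describes.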
Peter F. Thall

### Chapter 9. Crippling New Treatments

Abstract
Conventional phase I clinical trials, in which a dose is chosen using adaptive decision rules based on toxicity but ignoring efficacy, are fundamentally flawed. This chapter will provide several illustrations of this important fact. The worst class of phase I “toxicity only” designs are so-called $$3+3$$ algorithms, which are widely used but have terrible properties. Regardless of methodology, the conventionally small sample sizes of phase I trials provide very unreliable inferences about the relationship between dose and the risk of toxicity. More generally, the paradigm of first doing dose-finding in a phase I trial based on toxicity, and then doing efficacy evaluation in a phase II trial, is fundamentally flawed. This chapter will provide numerical illustrations of all of these problems. It will be explained, and illustrated by example, why the class of phase I–II trials, which are based on both efficacy and toxicity, provide a greatly superior general alternative to the conventional phase I $$\rightarrow$$ phase II paradigm. The EffTox design of Thall and Cook (2004) and Thall et al. (2014a) will be reviewed and compared to a $$3+3$$ algorithm and the continual reassessment method of O’Quigley et al. (1990) by computer simulation. It will be argued that, because conventional methods do a very poor job of identifying a safe and effective dose, it is very likely that, at the start of clinical evaluation, they cripple many new treatments by choosing a suboptimal dose that is unsafe, ineffective, or both.
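To make the abstract's criticism concrete, one simplified common version of the $$3+3$$ algorithm (real variants differ in details, such as requiring a confirmation cohort at the declared MTD) can be simulated against a hypothetical dose-toxicity curve:

```python
import random
from collections import Counter

def three_plus_three(true_tox_probs, rng):
    """One simulated trial of a simplified 3+3 algorithm. Returns the index
    of the selected 'MTD' dose, or None if the lowest dose is deemed too toxic."""
    d = 0
    while True:
        tox = sum(rng.random() < true_tox_probs[d] for _ in range(3))
        if tox == 1:                       # 1/3 toxicity: expand the cohort
            tox += sum(rng.random() < true_tox_probs[d] for _ in range(3))
            if tox <= 1 and d + 1 < len(true_tox_probs):
                d += 1                     # <= 1/6: escalate
                continue
            if tox <= 1:
                return d                   # already at the highest dose
            return d - 1 if d > 0 else None  # >= 2/6: MTD is the dose below
        if tox == 0:                       # 0/3: escalate (or stop at the top)
            if d + 1 == len(true_tox_probs):
                return d
            d += 1
        else:                              # >= 2/3: MTD is the dose below
            return d - 1 if d > 0 else None

rng = random.Random(0)
true_probs = [0.05, 0.10, 0.25, 0.45, 0.60]   # hypothetical toxicity curve
picks = [three_plus_three(true_probs, rng) for _ in range(2000)]
print(Counter(picks))   # the selected dose varies widely across trials
```

Running many simulated trials shows the chosen "MTD" scattered across several dose levels, which illustrates the abstract's point that the small sample sizes of conventional phase I trials give very unreliable dose selections, before efficacy is even considered.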
Peter F. Thall

### Chapter 10. Just Plain Wrong

Abstract
This chapter will give examples of particular clinical trial designs that are fundamentally flawed. Each example will illustrate a fairly common practice. The first example is a futility rule that aims to stop accrual to a single-arm trial early if the interim data show that it is unlikely the experimental treatment provides at least a specified level of anti-disease activity. The rule is given in terms of progression-free survival time. An alternative futility monitoring rule, much sounder and more reliable, that accounts for each patient’s complete time-to-event follow-up data will be presented. The second example will show how the routine practice of defining patient evaluability can lead one astray when estimating treatment effects, by misrepresenting the actual patient outcomes. The next two examples pertain to the problems of incompletely or vaguely specified safety monitoring rules. The final example shows what can go wrong when one ignores fundamental experimental design issues, including bias and confounding, when evaluating and comparing multiple treatments. As an alternative approach, the family of randomized select-and-test designs will be presented.
Peter F. Thall

### Chapter 11. Getting Personal

Abstract
In day-to-day practice, a physician uses each patient’s individual characteristics, and possibly diagnostic test results, to make a diagnosis and choose a course of treatment. Because the best treatment choice often varies from patient to patient due to their differing characteristics, routine medical practice involves personalized treatment decisions. The advent of sophisticated machines that provide high-dimensional genetic, proteomic, or other biological data for use in this process has made it much more complex, and this often is called “precision” or “personalized” medicine. In this chapter, I will discuss the simple version of personalized medicine in which one or two patient covariates or subgroups may interact with treatment. Examples will include (1) a biomarker that interacts qualitatively with two treatments, (2) an illustration of why the routine practice of averaging over prognostic subgroups when comparing treatments can lead to erroneous conclusions within subgroups, (3) a randomized trial design that makes within-subgroup decisions, (4) a phase II–III select-and-test design that makes within-subgroup decisions, and (5) a Bayesian nonparametric regression survival analysis that identifies optimal dosing intervals defined in terms of the patient’s age and disease status.
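The first two examples listed in this abstract, a qualitative treatment-biomarker interaction and the hazard of averaging over subgroups, can both be seen in one small table of hypothetical response probabilities:

```python
# Hypothetical response probabilities by treatment and biomarker status,
# chosen to show a qualitative interaction: the optimal treatment reverses
# between biomarker-positive and biomarker-negative patients.
response = {
    ("A", "+"): 0.55, ("A", "-"): 0.20,
    ("B", "+"): 0.25, ("B", "-"): 0.45,
}

def best_treatment(biomarker):
    """The treatment with the higher response probability in the subgroup."""
    return max(("A", "B"), key=lambda t: response[(t, biomarker)])

for m in ("+", "-"):
    print(f"biomarker {m}: best treatment = {best_treatment(m)}")

# Averaging over biomarker status (assuming, say, 50% prevalence) nearly
# erases the interaction and suggests the treatments are interchangeable:
for t in ("A", "B"):
    avg = 0.5 * response[(t, "+")] + 0.5 * response[(t, "-")]
    print(f"treatment {t}: averaged response = {avg:.3f}")
```

The subgroup-specific decision picks A for biomarker-positive patients and B for biomarker-negative patients, while the averaged rates (0.375 vs 0.350) obscure the fact that each treatment is badly suboptimal in one subgroup, which is the abstract's point about averaging over prognostic subgroups.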
Peter F. Thall

### Chapter 12. Multistage Treatment Regimes

Abstract
For each patient, treatment for a disease often is a multistage process involving an alternating sequence of observations and therapeutic decisions, with the physician’s decision at each stage based on the patient’s entire history up to that stage. This chapter begins with discussion of a simple two-stage version of this process, in which a Frontline treatment is given initially and, if and when the patient’s disease worsens, i.e., progresses, a second, Salvage treatment is given, so the two-stage regime is (Frontline, Salvage). The discussion of this case will include examples where, if one only accounts for the effects of Frontline and Salvage separately in each stage, the effect of the entire regime on survival time may not be obvious. Discussions and illustrations will be given of the general paradigms of dynamic treatment regimes (DTRs) and sequential multiple assignment randomized trials (SMARTs). Several statistical analyses of data from a prostate cancer trial designed by Thall et al. (2000) to evaluate multiple DTRs then will be discussed in detail. As a final example, several statistical analyses of observational data from a semi-SMART design of DTRs for acute leukemia, given by Estey et al. (1999), Wahed and Thall (2013), and Xu et al. (2016), will be discussed.
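A toy version of the two-stage (Frontline, Salvage) computation can show why evaluating each stage separately may not identify the best regime. In this hypothetical model, overall expected survival is mean time to progression under the frontline treatment plus, for the fraction of patients who progress, mean residual survival under the salvage treatment, where salvage effectiveness is allowed to depend on the frontline given. All numbers are illustrative:

```python
mean_ttp   = {"A": 14.0, "B": 10.0}   # mean months to progression, by frontline
prob_prog  = {"A": 0.9,  "B": 0.9}    # probability of progressing
mean_resid = {("A", "S1"): 6.0, ("A", "S2"): 4.0,    # mean residual survival
              ("B", "S1"): 8.0, ("B", "S2"): 16.0}   # after salvage

def regime_mean_survival(frontline, salvage):
    """E[T] for the two-stage regime (frontline, salvage)."""
    return mean_ttp[frontline] + prob_prog[frontline] * mean_resid[(frontline, salvage)]

for f in ("A", "B"):
    for s in ("S1", "S2"):
        print(f"({f}, {s}): E[T] = {regime_mean_survival(f, s):.1f} months")
```

Stage-by-stage, frontline A looks best (14 vs 10 months to progression), yet the regime (B, S2) maximizes overall expected survival (24.4 months) because S2 works far better after B; only evaluating entire regimes, as DTR methods do, reveals this.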
Peter F. Thall

### Backmatter

Further Information