
2020 | Book


Case Studies in Applied Bayesian Data Science

CIRM Jean-Morlet Chair, Fall 2018

Editors: Prof. Kerrie L. Mengersen, Prof. Pierre Pudlo, Prof. Christian P. Robert

Publisher: Springer International Publishing

Book Series: Lecture Notes in Mathematics

Part of: Springer Professional "Wirtschaft+Technik", Springer Professional "Wirtschaft"

About this book

Presenting a range of substantive applied problems within Bayesian Statistics along with their Bayesian solutions, this book arises from a research program at CIRM in France in the second semester of 2018, which supported Kerrie Mengersen as a visiting Jean-Morlet Chair and Pierre Pudlo as the local Research Professor.

The field of Bayesian statistics has exploded over the past thirty years and is now an established field of research in mathematical statistics and computer science, a key component of data science, and an underpinning methodology in many domains of science, business and social science. Moreover, while remaining naturally entwined, the three arms of Bayesian statistics, namely modelling, computation and inference, have grown into independent research fields. While the research arms of Bayesian statistics continue to grow in many directions, they are harnessed when attention turns to solving substantive applied problems. Each such problem set has its own challenges and hence draws from the suite of research a bespoke solution.

The book will be useful for theoretical and applied statisticians, as well as practitioners, who can inspect these solutions in the context of the problems in order to draw further understanding, awareness and inspiration.

Frontmatter

Surveys

Frontmatter

Chapter 1. Introduction

Abstract

This chapter is an introduction to this Lecture Note. We briefly describe the contents of this book. Both parts are introduced, namely part A which deals with Bayesian modeling and part B which presents real-world case studies. The last part of the chapter details the organization of the various events related to the Jean-Morlet Chair. It ends with the issues and research directions identified by the participants of the Conference on Bayesian Statistics in the Big Data Era.

Kerrie L. Mengersen, Pierre Pudlo, Christian P. Robert

Chapter 2. A Survey of Bayesian Statistical Approaches for Big Data

Abstract

The modern era is characterised as an era of information or Big Data. This has motivated a huge literature on new methods for extracting information and insights from these data. A natural question is how these approaches differ from those that were available prior to the advent of Big Data. We present a survey of published studies that present Bayesian statistical approaches specifically for Big Data and discuss the reported and perceived benefits of these approaches. We conclude by addressing the question of whether focusing only on improving computational algorithms and infrastructure will be enough to face the challenges of Big Data.

Farzana Jahan, Insha Ullah, Kerrie L. Mengersen

Chapter 3. Bayesian Neural Networks: An Introduction and Survey

Abstract

Neural Networks (NNs) have provided state-of-the-art results for many challenging machine learning tasks such as detection, regression and classification across the domains of computer vision, speech recognition and natural language processing. Despite their success, they are often implemented in a frequentist scheme, meaning they are unable to reason about uncertainty in their predictions. This article introduces Bayesian Neural Networks (BNNs) and the seminal research regarding their implementation. Different approximate inference methods are compared, and used to highlight where future research can improve on current methods.

Ethan Goan, Clinton Fookes

Chapter 4. Markov Chain Monte Carlo Algorithms for Bayesian Computation, a Survey and Some Generalisation

Abstract

This chapter briefly recalls the major simulation-based methods for conducting Bayesian computation, before focusing on partly deterministic Markov processes and a novel modification of the bouncy particle sampler that offers an interesting alternative when dealing with large datasets.

Wu Changye, Christian P. Robert

Chapter 5. Bayesian Variable Selection

Abstract

In this chapter we survey Bayesian approaches for variable selection and model choice in regression models. We explore the methodological developments and computational approaches for these methods. In conclusion we note the available software for their implementation.

Matthew Sutton

Chapter 6. Bayesian Computation with Intractable Likelihoods

Abstract

This chapter surveys computational methods for posterior inference with intractable likelihoods, that is where the likelihood function is unavailable in closed form, or where evaluation of the likelihood is infeasible. We survey recent developments in pseudo-marginal methods, approximate Bayesian computation (ABC), the exchange algorithm, thermodynamic integration, and composite likelihood, paying particular attention to advancements in scalability for large datasets. We also mention R and MATLAB source code for implementations of these algorithms, where they are available.
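As a rough illustration of the simplest member of the ABC family mentioned above, the following sketch implements plain rejection ABC for a toy problem. The function name `abc_rejection`, the toy Normal model, the uniform prior and the tolerance are all assumptions chosen for illustration; they are not taken from the chapter or its accompanying code.

```python
import random
import statistics


def abc_rejection(observed, prior_sampler, simulator, summary, tolerance, n_draws, rng):
    """Keep prior draws whose simulated summary is within `tolerance` of the observed one."""
    target = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler(rng)                       # draw a candidate from the prior
        simulated = simulator(theta, len(observed), rng)  # simulate data; no likelihood is evaluated
        if abs(summary(simulated) - target) <= tolerance:
            accepted.append(theta)
    return accepted


rng = random.Random(0)
# Toy data (hypothetical): 50 draws from Normal(mean=2, sd=1); the mean is treated as unknown.
observed = [rng.gauss(2.0, 1.0) for _ in range(50)]

posterior_draws = abc_rejection(
    observed=observed,
    prior_sampler=lambda r: r.uniform(-5.0, 5.0),
    simulator=lambda theta, n, r: [r.gauss(theta, 1.0) for _ in range(n)],
    summary=statistics.fmean,
    tolerance=0.1,
    n_draws=5000,
    rng=rng,
)
print(len(posterior_draws), statistics.fmean(posterior_draws))
```

The accepted draws approximate the posterior of the mean. A smaller tolerance gives a closer approximation at the cost of fewer accepted draws, which is one reason scalability is a concern for large datasets.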

Matthew T. Moores, Anthony N. Pettitt, Kerrie L. Mengersen

Real World Case Studies in Health

Frontmatter

Chapter 7. A Bayesian Hierarchical Approach to Jointly Model Cortical Thickness and Covariance Networks

Abstract

Estimation of structural biomarkers and covariance networks from MRI has provided valuable insight into the morphological processes and organisation of the human brain. State-of-the-art analyses such as linear mixed effects (LME) models and pairwise descriptive correlation networks are usually performed independently, providing an incomplete picture of the relationships between the biomarkers and network organisation. Furthermore, descriptive network analyses do not generalise to the population level. In this work, we develop a Bayesian generative model based on wombling that allows joint statistical inference on biomarkers and connectivity covariance structure. The parameters of the wombling model were estimated via Markov chain Monte Carlo methods, which allow for simultaneous inference of the brain connectivity matrix and the association of participants’ biomarker covariates. To demonstrate the utility of wombling on real data, the method was used to characterise intrahemispheric cortical thickness and networks in a study cohort of subjects with Alzheimer’s disease (AD), mild cognitive impairment and healthy ageing. The method was also compared with state-of-the-art alternatives. Our Bayesian modelling approach provided posterior probabilities for the connectivity matrix of the wombling model, accounting for the uncertainty for each connection. This provided superior inference in comparison with descriptive networks. On the study cohort, there was a loss of connectivity across diagnosis levels from healthy to Alzheimer’s disease for all network connections (posterior probability ≥ 0.7). In addition, we found that wombling and LME model approaches estimated that cortical thickness progressively decreased along the dementia pathway. The major advantage of the wombling approach was that spatial covariance among the regions and global cortical thickness estimates could be estimated.
Joint modelling of biomarkers and covariance networks using our novel wombling approach allowed accurate identification of probabilistic networks and estimated biomarker changes that took into account spatial covariance. The wombling model provides a novel tool to address multiple brain features, such as morphological and connectivity changes, facilitating a better understanding of disease pathology.

Marcela I. Cespedes, James M. McGree, Christopher C. Drovandi, Kerrie L. Mengersen, Lee B. Reid, James D. Doecke, Jurgen Fripp

Chapter 8. Bayesian Spike Sorting: Parametric and Nonparametric Multivariate Gaussian Mixture Models

Abstract

The analysis of action potentials is an important task in neuroscience research, which aims to characterise neural activity under different subject conditions. The classification of action potentials, or “spike sorting”, can be formulated as an unsupervised clustering problem, and latent variable models such as mixture models are often used. In this chapter, we compare the performance of two mixture-based approaches when applied to spike sorting: the Overfitted Finite Mixture model (OFM) and the Dirichlet Process Mixture model (DPM). Both of these models can be used to cluster multivariate data when the number of clusters is unknown; however, differences in model specification and assumptions may affect the resulting statistical inference. Using real datasets obtained from extracellular recordings of the brain, model outputs are compared with respect to the number of identified clusters and classification uncertainty, with the intent of providing guidance on their application in practice.

Nicole White, Zoé van Havre, Judith Rousseau, Kerrie L. Mengersen

Chapter 9. Spatio-Temporal Analysis of Dengue Fever in Makassar Indonesia: A Comparison of Models Based on CARBayes

Abstract

Background: Dengue fever is one of the world’s most important vector-borne diseases and it is still a major public health problem in the Asia-Pacific region including Indonesia. Makassar is one of the major cities in Indonesia where the incidence of dengue fever is still quite high. Since dengue cases vary between areas and over time, these spatial and temporal components should be taken into consideration. However, unlike many other spatio-temporal contexts, Makassar is comprised of only a small number of areas and data are available over a relatively short timeframe. The aim of this paper is to better understand the spatial and temporal patterns of dengue incidence in Makassar, Indonesia by comparing the performance of six existing spatio-temporal models, taking into account these specific data characteristics (a small number of areas and a small number of time periods), and to select the best model for the Makassar dengue dataset.

Methods: Six different Bayesian spatio-temporal conditional autoregressive (ST CAR) models were compared in the context of a substantive case study, namely annual dengue fever incidence in 14 geographic areas of Makassar, Indonesia, during 2002–2015. The candidate models included linear, ANOVA, separate spatial, autoregressive (AR), adaptive and localised approaches. The models were implemented using CARBayesST and the goodness of fit was compared using the Deviance Information Criterion (DIC) and Watanabe-Akaike Information Criterion (WAIC).
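For readers unfamiliar with WAIC, the sketch below computes it from posterior draws using the widely used deviance-scale definition, WAIC = -2 (lppd - p_WAIC), where lppd is the log pointwise predictive density and p_WAIC sums the posterior variance of each observation's log-likelihood. The function name `waic` and the input layout are assumptions for illustration; this is not code from CARBayesST or from the chapter.

```python
import math
import statistics


def waic(log_lik):
    """WAIC on the deviance scale.

    log_lik[s][i] is log p(y_i | theta_s) for posterior draw s and observation i.
    """
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    lppd = 0.0
    p_waic = 0.0
    for i in range(n_obs):
        column = [log_lik[s][i] for s in range(n_draws)]
        # log of the posterior-mean likelihood, computed stably via log-sum-exp
        m = max(column)
        lppd += m + math.log(sum(math.exp(v - m) for v in column) / n_draws)
        # effective number of parameters: sample variance of the log-likelihood across draws
        p_waic += statistics.variance(column)
    return -2.0 * (lppd - p_waic)


# Hypothetical log-likelihood values: 3 posterior draws, 2 observations.
example = [[-1.0, -2.1], [-1.2, -1.9], [-0.9, -2.0]]
print(waic(example))
```

Lower values indicate better expected predictive fit, which is how the criterion is used when ranking candidate models.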

Results: The six models performed differently in the context of this case study. Among the six models, the spatio-temporal conditional autoregressive localised model had a much better fit than other options in terms of DIC, while the conditional autoregressive model with separate spatial and temporal components performed worst. However, the spatio-temporal CAR AR had a much better fit than other models in terms of WAIC. The different performance of the models may have been influenced by the small number of areas.

Conclusion: Different spatio-temporal models appeared to have a large impact on results. Careful selection of a range of spatio-temporal models is important for assessing the spatial and temporal patterns of dengue fever, especially in a context characterised by relatively few spatial areas and limited time periods.

Aswi Aswi, Susanna Cramb, Wenbiao Hu, Gentry White, Kerrie L. Mengersen

Chapter 10. A Comparison of Bayesian Spatial Models for Cancer Incidence at a Small Area Level: Theory and Performance

Abstract

The increase in Bayesian models available for disease mapping at a small area level can pose challenges to the researcher: which one to use? Models may assume a smooth spatial surface (termed global smoothing), or allow for discontinuities between areas (termed local spatial smoothing). A range of global and local Bayesian spatial models suitable for disease mapping over small areas are examined, including the foundational and still most popular (global) Besag, York and Mollié (BYM) model through to more recent proposals such as the (local) Leroux scale mixture model. Models are applied to simulated data designed to represent the diagnosed cases of (1) a rare and (2) a common cancer using small-area geographical units in Australia. Key comparative criteria considered are convergence, plausibility of estimates, model goodness-of-fit and computational time. These simulations highlighted the dramatic impact of model choice on posterior estimates. The BYM, Leroux and some local smoothing models performed well in the sparse simulated dataset, while centroid-based smoothing models such as geostatistical or P-spline models were less effective, suggesting they are unlikely to succeed unless areas are of similar shape and size. Comparing results from several different models is recommended, especially when analysing very sparse data.

Susanna Cramb, Earl Duncan, Peter Baade, Kerrie L. Mengersen

Chapter 11. An Ensemble Approach to Modelling the Combined Effect of Risk Factors on Age at Parkinson’s Disease Onset

Abstract

Ensemble approaches to statistical modelling combine multiple statistical methods to form a comprehensive analysis. They are of increasing interest for problems that involve diverse data sources, complex systems and subtle outcomes of interest. An example of such an ensemble approach is described in this chapter, in the context of a substantive case study that aimed to tease out factors affecting the age at onset of the neurodegenerative medical condition, Parkinson’s Disease (PD), with a particular focus on the role of one potential risk factor, pesticide exposure.

Aleysha Thomas, Paul Wu, Nicole M. White, Leisa Toms, George Mellick, Kerrie L. Mengersen

Chapter 12. Workplace Health and Workplace Wellness: Synergistic or Disconnected?

Abstract

Workplace health and wellness is paramount in many businesses and industries, for economic and social reasons. Workplace wellness programs have emerged to meet this need. This paper pursues a deeper understanding of the relationship between workplace health and workplace wellness initiatives in Australia. Based on a survey of published literature, Bayesian networks are developed to describe and quantify factors that contribute to each of these components of workplace efficiency. Workplace health was found to be a complex system of acute and chronic occupational medical conditions, as well as lifestyle factors. Successful wellness programs were found to be those that have a high level of participation and positive financial impacts, and are integrated into business strategy and company culture. It was observed that many workplace wellness programs tend to target non-occupational health risks and that there is an opportunity to address other critical components of worker health risk factors. The outputs of the Bayesian networks can provide an interrogative monitor of workplace health and the potential impact of corresponding wellness initiatives, facilitating the development of more targeted and cost-effective programs.

G. Davis, E. Moloney, M. da Palma, Kerrie L. Mengersen, F. Harden

Chapter 13. Bayesian Modelling to Assist Inference on Health Outcomes in Occupational Health Surveillance

Abstract

Objectives: Occupational Health Surveillance (OHS) facilitates early detection of disease and dangerous exposures in the workplace. Current OHS analyses ignore important workplace structures and repeated measurements. There is a need to provide systematic analyses of medical data that incorporate the data structure. Although multilevel statistical models may account for features of OHS data, current applications in occupational health medicine are often not appropriate for OHS. Additionally, typical OHS data have not been analysed in a Bayesian framework, which allows for calculation of probabilities of potential events and outcomes. This paper’s objective is to illustrate the use of Bayesian modeling of OHS. Three analytic aims are addressed: (1) Identify patterns and changes in health outcomes; (2) Explore the effects of a particular risk factor, smoking and industrial exposures over time for individuals and worker groups; (3) Identify risk of chronic conditions in individuals. Method: A Bayesian hierarchical model was developed to provide individual and group level estimates and inferences for health outcomes, FEV1%, BMI, and Diastolic and Systolic blood pressure. Results: We identified individuals with the greatest degree of change over time for each outcome, and demonstrated how to flag individuals with substantive negative health outcome change. We also assigned probabilities of individuals moving into “at risk” health categories 1 year from their last visit. Conclusion: Bayesian models can account for features typically encountered in OHS data, such as individual repeated measurements and group structures. We describe one way to fit these data and obtain informative estimates and predictions of employee health.

Nicholas J. Tierney, Samuel Clifford, Christopher C. Drovandi, Kerrie L. Mengersen

Real World Case Studies in Ecology

Frontmatter

Chapter 14. Bayesian Networks for Understanding Human-Wildlife Conflict in Conservation

Abstract

Human-wildlife conflict is a major threat to survival and viability of many native animal species worldwide. Successful management of this conflict requires evidence-based understanding of the complex system of factors that motivate and facilitate it. However, for many affected species, data on this sensitive subject are too sparse for many statistical techniques. This study considers two iconic wild cats under threat in diverse locations and employs a Bayesian Network approach to integrate expert-elicited information into a probabilistic model of the factors affecting human-wildlife conflict. The two species considered are cheetahs in Botswana and jaguars in the Peruvian Amazon. Results of the individual network models are presented and the relative importance of different conservation management strategies is discussed. The study highlights the strengths of the Bayesian Network approach for quantitatively describing complex, data-poor real world systems.

Jac Davis, Kyle Good, Vanessa Hunter, Sandra Johnson, Kerrie L. Mengersen

Chapter 15. Bayesian Learning of Biodiversity Models Using Repeated Observations

Abstract

Predictive biodiversity distribution models (BDM) are useful for understanding the structure and functioning of ecological communities and managing them in the face of anthropogenic disturbances. In cases where their predictive performance is good, such models can help fill knowledge gaps that could only otherwise be addressed using direct observation, an often logistically and financially onerous prospect. The cornerstones of such models are environmental and spatial predictors. Typically, however, these predictors vary on different spatial and temporal scales than the biodiversity they are used to predict and are interpolated over space and time. We explore the consequences of these scale mismatches between predictors and predictions by comparing the results of BDMs built to predict fish species richness on Australia’s Great Barrier Reef. Specifically, we compared a series of annual models with uninformed priors with models built using the same predictors and observations, but which accumulated information through time via the inclusion of informed priors calculated from previous observation years. Advantages of using informed priors in these models included (1) down-weighting the importance of a large disturbance, (2) more certain species richness predictions, (3) more consistent predictions of species richness and (4) increased certainty in parameter coefficients. Despite such advantages, further research will be required to find additional ways to improve model performance.

Ana M. M. Sequeira, M. Julian Caley, Camille Mellin, Kerrie L. Mengersen

Chapter 16. Thresholds of Coral Cover That Support Coral Reef Biodiversity

Abstract

Global environmental change, such as ocean warming and increased cyclone activity, is driving widespread and rapid declines in the abundance of key ecosystem engineers, reef-building corals, on the Great Barrier Reef. Our ability to understand how coral associated species, such as reef fishes, respond to coral loss can be impeded by uncertainty surrounding natural spatio-temporal variability of coral populations. To address this issue, we developed a semi-parametric hierarchical Bayesian model to estimate long-term trajectories of habitat-forming coral cover as a function of three spatial scales (sub-region, habitat and site) and environmental disturbances. The relationships between coral cover trajectories and fish community structure were examined using posterior predictive distributions of estimated coral cover from the statistical model. In the absence of direct observations of fish community structure, we used the probability of coral cover being above some ecological threshold values as a proxy for potential disruptions of fish community structure. Threshold values were derived from published field studies that estimated changes in the structure of coral-reef fish communities and coral cover after major disturbances. In these studies, fish community structure did not change where post-disturbance coral cover was > 20%. Disruptions in the structure of these communities were observed when coral cover dropped to between 10 and 20%, and declines in fish diversity were typical where coral cover ranged between 5 and 10%. Based on these threshold values, posterior probabilities of coral cover being above 20%, between 10 and 20%, and between 5 and 10% were calculated across spatial scales on the Great Barrier Reef (GBR) from 1995 to 2011. At the GBR scale, probabilities of coral cover being above these thresholds remained relatively stable through time.
Across years, probabilities of coral cover being above 20% remained null for the sub-regions of Cairns, Townsville, Whitsundays and Swain but were highly variable between reef sites within these sub-regions, with the exception of Townsville. In the Townsville area, probabilities of coral cover being between 10 and 20% and between 5 and 10% declined from 0.75 to 0 during the study period. This finding highlights potential sub-regional fish community structure disruptions which have not yet been observed at this spatial scale. As frequency and intensity of disturbance events continue to rise, and consequently, as coral cover declines further, the probabilistic Bayesian approach presented in this chapter could be used to help provide early warnings of major ecological shifts at management relevant scales in the absence of direct observations.
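As a rough illustration of how threshold probabilities of this kind can be read off posterior predictive draws, the sketch below reports the share of draws falling in each coral cover band. The function name, the toy draws and the treatment of the band edges are assumptions made for illustration; they are not the chapter's code or data.

```python
def threshold_probabilities(cover_draws):
    """Share of posterior predictive draws of coral cover (%) in each band.

    Band edges are an assumption: > 20, 10 to 20 inclusive, and 5 up to (not including) 10.
    """
    n = len(cover_draws)
    return {
        "above_20": sum(c > 20 for c in cover_draws) / n,
        "10_to_20": sum(10 <= c <= 20 for c in cover_draws) / n,
        "5_to_10": sum(5 <= c < 10 for c in cover_draws) / n,
    }


# Hypothetical draws of percentage coral cover for one site and year.
draws = [4.0, 7.5, 9.0, 12.0, 15.5, 18.0, 21.0, 25.0, 30.0, 8.0]
probs = threshold_probabilities(draws)
print(probs)
```

In practice the draws would come from the fitted model's posterior predictive distribution at a given spatial scale and year, and the calculation would be repeated across scales and years.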

Julie Vercelloni, M. Julian Caley, Kerrie L. Mengersen

Chapter 17. Application of Bayesian Mixture Models to Satellite Images and Estimating the Risk of Fire-Ant Incursion in the Identified Geographical Cluster

Abstract

Bayesian non-parametric mixture models have found great success in the statistical practice of identifying latent clusters in data. However, fitting such models can be computationally intensive and of less practical use when it comes to tall datasets, such as Landsat imagery. To overcome this issue, we propose to obtain multiple samples from data using stratified random sampling to enforce adequate representation in each sample from sub-populations that may exist in data. The non-parametric model is then fitted to each sample dataset independently to obtain posterior estimates. Label correspondence across multiple estimates is achieved using multivariate component densities of a chosen reference partition followed by pooling multiple posterior estimates to form a consensus posterior inference. The labels for pixels in the entire image are inferred using the conditional posterior distribution given pooled estimates, thereby substantially reducing the computational time and memory requirement.

The method is tested on Landsat images from the Brisbane region in Australia, which were compiled as a part of the national program for the eradication of the imported red fire-ant that was launched in September 2001 and which continues to the present date. The aim is to estimate the risk of fire-ant incursion in each of the identified geographical clusters so that the eradication program focuses on high-risk areas.
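The first step of the approach described in this abstract, drawing several stratified random samples so that small sub-populations are represented in each, might be sketched as follows. The function name, the stratum labels, the sampling fraction and the proportional allocation with a minimum of one item per stratum are all assumptions for illustration. The abstract does not say how strata are defined, and the model fitting and pooling steps are not shown.

```python
import random
from collections import defaultdict


def stratified_samples(items, stratum_of, fraction, n_samples, seed=0):
    """Draw `n_samples` independent stratified random samples from `items`.

    Each stratum contributes about `fraction` of its members, and at least one.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)

    samples = []
    for _ in range(n_samples):
        sample = []
        for members in strata.values():
            k = max(1, round(fraction * len(members)))
            sample.extend(rng.sample(members, k))  # sampling without replacement within a stratum
        samples.append(sample)
    return samples


# Hypothetical pixels labelled by a coarse stratum; the labels are illustrative only.
pixels = (
    [("water", i) for i in range(900)]
    + [("urban", i) for i in range(90)]
    + [("bare", i) for i in range(10)]
)
samples = stratified_samples(pixels, stratum_of=lambda p: p[0], fraction=0.1, n_samples=3)
print([len(s) for s in samples])
```

Each sample here is a tenth of the full dataset yet still contains the rare stratum, which is the property that lets a model be fitted to each sample separately before the estimates are pooled.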

Insha Ullah, Kerrie L. Mengersen

Backmatter

Title: Case Studies in Applied Bayesian Data Science
Editors: Prof. Kerrie L. Mengersen, Prof. Pierre Pudlo, Prof. Christian P. Robert
Publisher: Springer International Publishing
Electronic ISBN: 978-3-030-42553-1
Print ISBN: 978-3-030-42552-4
DOI: https://doi.org/10.1007/978-3-030-42553-1
