
About this book

This book presents modern methods and real-world applications of compositional data analysis. It covers a wide variety of topics, ranging from an updated presentation of basic concepts and ideas in compositional data analysis to recent advances in the context of complex data structures. Further, it illustrates real-world applications in numerous scientific disciplines and includes references to the latest software solutions available for compositional data analysis, thus providing a valuable and up-to-date guide for researchers and practitioners working with compositional data. Featuring selected contributions by leading experts in the field, the book is dedicated to Vera Pawlowsky-Glahn on the occasion of her 70th birthday.



An Interpretable Orthogonal Decomposition of Positive Square Matrices

This study of square matrices with positive entries is motivated by a previous contribution on exchange rate matrices. The sample space of these matrices is endowed with a group operation, the componentwise (Hadamard) product. An inner product, identified with the ordinary inner product of the componentwise logarithms of the matrices, makes the sample space a Euclidean space. This structure allows two orthogonal decompositions to be introduced: the first is inspired by the independence of probability tables, and the second is related to matrices with reciprocal symmetry, whose transpose is the componentwise inverse. Combining them yields an orthogonal decomposition into four easily computable parts. The merit of this decomposition is that, applied to exchange rate matrices, its four matrices admit an intuitive interpretation.
J. J. Egozcue, Wilfredo L. Maldonado
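
As a rough illustration of the idea in this abstract, the sketch below splits a positive square matrix into four Hadamard factors whose logarithms are mutually orthogonal under the ordinary (Frobenius) inner product. The construction is an assumption made for illustration, not the authors' stated procedure: the logarithm is split into an additive row-plus-column part (standing in for "independence") and an interaction part, and each is then split into symmetric and skew-symmetric components. The function names and the exchange rates are hypothetical.

```python
import numpy as np

def four_part_decomposition(A):
    """Split a positive square matrix into four Hadamard factors.

    Assumed construction (not taken from the chapter): work on L = log(A),
    split L into an additive "independent" part and an interaction part,
    then split each into symmetric and skew-symmetric components.
    """
    L = np.log(A)
    row = L.mean(axis=1, keepdims=True)
    col = L.mean(axis=0, keepdims=True)
    independent = row + col - L.mean()   # row effect + column effect
    interaction = L - independent        # double-centred residual

    def sym(M):
        return (M + M.T) / 2

    def skew(M):
        return (M - M.T) / 2

    log_parts = [sym(independent), skew(independent),
                 sym(interaction), skew(interaction)]
    # Back-transform: sums of logs become componentwise products.
    return [np.exp(P) for P in log_parts]

def inner(A, B):
    """Inner product of two positive matrices via their componentwise logs."""
    return float(np.sum(np.log(A) * np.log(B)))

# Hypothetical, arbitrage-free exchange rates: E[i, j] = rates[i] / rates[j].
rates = np.array([1.0, 0.9, 110.0])
E = rates[:, None] / rates[None, :]
parts = four_part_decomposition(E)
```

For such a consistent exchange rate matrix, only the skew-symmetric additive factor differs from a matrix of ones, which is one way to see why the parts may be interpretable in that setting.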



The Information-Geometric Perspective of Compositional Data Analysis

Information geometry uses the formal tools of differential geometry to describe the space of probability distributions as a Riemannian manifold with an additional dual structure. The formal equivalence of compositional data with discrete probability distributions makes it possible to apply the same description to the sample space of Compositional Data Analysis (CoDA). The latter has been formally described as a Euclidean space with an orthonormal basis featuring components that are suitable combinations of the original parts. In contrast to the Euclidean metric, the information-geometric description singles out the Fisher information metric as the only one keeping the manifold’s geometric structure invariant under equivalent representations of the underlying random variables. Well-known concepts that are valid in Euclidean coordinates, e.g., the Pythagorean theorem, are generalized by information geometry to corresponding notions that hold for more general coordinates. In briefly reviewing Euclidean CoDA and, in more detail, the information-geometric approach, we show how the latter justifies the use of distance measures and divergences that so far have received little attention in CoDA as they do not fit the Euclidean geometry favored by current thinking. We also show how Shannon entropy and relative entropy can describe amalgamations in a simple way, while Aitchison distance requires the use of geometric means to obtain more succinct relationships. We proceed to prove the information monotonicity property for Aitchison distance. We close with some thoughts about new directions in CoDA where the rich structure that is provided by information geometry could be exploited.
Ionas Erb, Nihat Ay
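
To make the contrast between the Euclidean and the information-geometric view concrete, the following sketch computes the Aitchison distance and the relative entropy (Kullback-Leibler divergence) for two compositions, and then amalgamates parts. It illustrates the well-known fact that relative entropy cannot increase under amalgamation; it does not reproduce the chapter's own monotonicity result for the Aitchison distance. All names and numbers are illustrative assumptions.

```python
import numpy as np

def closure(x):
    """Rescale a positive vector so that its parts sum to one."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def aitchison_distance(x, y):
    """Euclidean distance between centred log-ratio (clr) vectors."""
    lx = np.log(x) - np.log(x).mean()
    ly = np.log(y) - np.log(y).mean()
    return float(np.linalg.norm(lx - ly))

def kl_divergence(p, q):
    """Relative entropy of p with respect to q (both closed first)."""
    p, q = closure(p), closure(q)
    return float(np.sum(p * np.log(p / q)))

def amalgamate(x, groups):
    """Sum the parts inside each group of indices."""
    x = np.asarray(x, dtype=float)
    return np.array([x[list(g)].sum() for g in groups])

p = closure([0.1, 0.2, 0.3, 0.4])
q = closure([0.25, 0.25, 0.25, 0.25])
groups = [(0, 1), (2, 3)]

kl_full = kl_divergence(p, q)
kl_coarse = kl_divergence(amalgamate(p, groups), amalgamate(q, groups))
```

Here `kl_coarse` is no larger than `kl_full`, while `aitchison_distance` is unchanged when either argument is rescaled by a positive constant.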

Log-Ratio Analysis of Finite Precision Data: Caveats, and Connections to Digital Lines and Number Theory

Log-ratio analysis (LRA) is a popular and theoretically coherent framework for investigating and modelling compositional data. Most empirical compositional data will be measured and recorded with finite precision; count data is a special instance of this in which the fundamental quantity of interest is discrete, but it is also common and practical to round continuous variables to the nearest convenient multiple of the unit of measurement. LRA is often applied to such finite precision measurements without considering the underlying discrete nature of the data (with the exception of the special case of zero values). Here we examine how the characteristics of finite precision data can manifest in LRA so that theoreticians and practitioners can be mindful of situations in which finite precision might affect their conclusions. We focus in particular on log-ratio variance—a fundamental measure of pairwise association between components—and demonstrate situations in which finite precision can have a profound effect on this statistic and related measures of proportionality. We also make connections to computer science concepts about digital lines and to mathematical concepts in number theory, including Farey sequences, to understand how finite precision approximations can affect the value of log-ratio variance even when the underlying continuous variables are perfectly proportional.
David R. Lovell
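
The effect described in this abstract can be reproduced with a minimal sketch: two perfectly proportional variables have zero log-ratio variance, but rounding both to integers makes it non-zero. The proportionality constant, the range, and the rounding unit are arbitrary choices made for illustration.

```python
import numpy as np

def log_ratio_variance(x, y):
    """Sample variance of log(x / y) across observations."""
    return float(np.var(np.log(x / y), ddof=1))

# Perfectly proportional continuous variables (y = 1.5 * x).
x = np.linspace(1.0, 20.0, 200)
y = 1.5 * x

lrv_exact = log_ratio_variance(x, y)

# The same variables recorded to the nearest integer unit.
lrv_rounded = log_ratio_variance(np.round(x), np.round(y))
```

`lrv_exact` is zero up to floating-point error, whereas `lrv_rounded` is clearly positive, with the largest distortions coming from the small values.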

Distributions on the Simplex Revisited

A large number of families of distributions are available to model multivariate real vectors. In contrast, for the simplex sample space, only a limited number of families are available, arising through generalizations of the Dirichlet family or the logratio normal family. This chapter summarizes those models and some generalizations, with special emphasis on the algebraic-geometric structure of the simplex and on the measure that is considered compatible. In particular, the shifted-scaled Dirichlet distribution is studied and the logratio t distribution is rewritten and studied with respect to the Aitchison measure.
Gloria Mateu-Figueras, Gianna S. Monti, J. J. Egozcue

Compositional Biplots: A Story of False Leads and Hidden Features Revealed by the Last Dimensions

Logratio principal component analysis is often one of the first steps in exploring a compositional data set. Compositional biplots based on the first two principal components are frequently used to uncover proportionality between parts or to detect one-dimensional patterns of variability for larger subcompositions. This article argues that this approach is likely to produce false leads and proposes an alternative procedure based on condition indices and low-variance principal components. We advocate the calculation of condition indices, combined with biplots of the last few principal components and lists of subcompositions with large condition numbers, and these are shown to be useful for detecting proportionality and one-dimensional relationships. The detection of such patterns in compositional data sets is shown to be closely related to the analysis of multicollinearity as employed in linear regression. Two example data sets, amino acid compositions in calves and chemical components of coffee aroma, are used as illustrations.
Jan Graffelman
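
A minimal sketch of the procedure advocated here, under stated assumptions: logratio PCA is taken to be an eigendecomposition of the covariance of clr-transformed data, and a condition index is taken to be the square root of the ratio of the largest eigenvalue to each eigenvalue (the usual collinearity diagnostic). The simulated data, with part 1 nearly proportional to part 0, are hypothetical.

```python
import numpy as np

def clr(X):
    """Centred log-ratio transform of each row of a positive matrix."""
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)

def logratio_pca(X):
    """Eigenvalues and eigenvectors of the clr covariance, largest first."""
    Z = clr(X)
    Z = Z - Z.mean(axis=0)
    cov = Z.T @ Z / (len(Z) - 1)
    eigval, eigvec = np.linalg.eigh(cov)      # ascending order
    # Drop the trivial zero eigenvalue caused by the clr sum constraint
    # (assumes no other eigenvalue is numerically zero).
    eigval, eigvec = eigval[1:], eigvec[:, 1:]
    order = np.argsort(eigval)[::-1]
    return eigval[order], eigvec[:, order]

def condition_indices(eigval):
    """Assumed definition: sqrt(largest eigenvalue / each eigenvalue)."""
    return np.sqrt(eigval.max() / eigval)

rng = np.random.default_rng(1)
n = 100
X = np.exp(rng.normal(size=(n, 5)))
# Make part 1 almost proportional to part 0.
X[:, 1] = 2.0 * X[:, 0] * np.exp(0.01 * rng.normal(size=n))

eigval, eigvec = logratio_pca(X)
ci = condition_indices(eigval)
last_pc = eigvec[:, -1]    # lowest-variance principal component
```

In this simulation the last principal component loads almost entirely on parts 0 and 1 with opposite signs, and its condition index is large, which is the kind of signature the abstract associates with proportionality.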

Statistical Methodology


Geographically Weighted Regression Analysis for Two-Factorial Compositional Data

The chapter focuses on the modelling and analysis of spatially dependent two-factorial compositional data. Spatial statistics provides a wide range of methods for the analysis of data with local variations, but only a few of them are adapted to the modelling of relative structures. In this chapter, the geographically weighted regression model is introduced to analyse the relationship between the dependent variable and an explanatory variable reflecting a structure expressed in terms of a compositional table. The methodology is motivated by the problem of modelling local variations of the relationship between at-risk-of-poverty rates and the structure of the highest attained educational level in the German population aged 30–34. The real data study shows how information included in a compositional table and further expressed in real-valued coordinates can be highly valuable in selecting variables and prioritising them with respect to a research interest to facilitate the final interpretation of the model.
Kamila Fačevicová, Petra Kynčlová, Karel Macků

Factor Analysis of Compositional Data with a Total

The sample space of a manifest random vector is of crucial importance for a latent variable model. Compositional data require an appropriate statistical analysis because they provide the relative importance of the parts of a whole. Any statistical model including variables created using the original parts should be formulated according to the geometry of the simplex. Methods based on log-ratio coordinates give a consistent framework for analyzing this type of data. Here, we introduce an approach that includes both the orthonormal log-ratio coordinates and an auxiliary variable carrying absolute information and illustrate it through the factor analysis of two real datasets.
Carles Barceló-Vidal, Josep Antoni Martín-Fernández

A Compositional Three-Way Approach for Student Satisfaction Analysis

Three-way rating data on student satisfaction contain the scores assigned by students to a set of items measuring different aspects of educational quality at different time points. Such data provide information on the magnitude of satisfaction as well as information on how aspects vary with respect to each other and how they contribute to the total satisfaction of each student. Data magnitude is predominant in determining variability patterns; thus, any standard tool applied to these arrays only yields a one-dimensional solution measuring scale differences. The relative changes among items go completely undetected unless a compositional approach is used in combination with a multilinear tool. A case study on student satisfaction is presented to demonstrate that this method provides an insightful analysis of the role played by each aspect in generating satisfaction throughout faculties and years.
Michele Gallo, Violetta Simonacci, Valentin Todorov

Artificial Neural Networks to Impute Rounded Zeros in Compositional Data

Deep learning methods have become increasingly popular in recent years, but they have not yet reached compositional data analysis. Imputation methods for compositional data are typically applied to additive, centered, or isometric log-ratio representations of the data. Generally, methods for compositional data analysis can only be applied to observed positive entries in a data matrix. Therefore, one tries to impute missing values or measurements that were below a detection limit. In this paper, a new method for imputing rounded zeros based on artificial neural networks (ANNs) is presented and compared with conventional methods. We also ask whether, for ANNs, representing the data in log-ratios is relevant for imputation purposes. It is shown that ANNs are competitive with, or even outperform, conventional methods when imputing rounded zeros in data sets of moderate size, and that they deliver better results when data sets are big. Log-ratio transformations within the ANN imputation procedure nevertheless help to improve the results. This shows that the theory of compositional data analysis and the fulfillment of all its properties remain very important in the age of deep learning.
Matthias Templ

Compositional DuPont Analysis. A Visual Tool for Strategic Financial Performance Assessment

DuPont analysis is a classical tool for assessing the determinants of financial performance of firms. It is based on financial ratios comparing revenues with costs (the so-called margin ratio), revenues with assets (turnover ratio), and debt with assets (leverage ratio). DuPont analysis thus focuses on comparing accounting values in relative terms and lends itself naturally to compositional analysis. In this chapter, we show how to graphically display firms according to margin, turnover and leverage by means of a standard compositional biplot, and how to cluster firms into strategic groups by means of k-means compositional cluster analysis. Practitioners who prefer to stick to the classic definitions of industry or cluster-level financial ratios can compute them with the usual formulae from the centre of the composition, i.e. from the industry or cluster geometric averages rather than the totals or arithmetic averages commonly used. An illustration is presented with farm-tourism firms.
Elisabet Saus-Sala, Àngels Farreras-Noguer, Núria Arimany-Serrat, Germà Coenders
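
The remark about computing group-level ratios from the centre of the composition can be sketched as follows. The accounting figures are hypothetical, and the three ratios are written as plain quotients (revenues/costs, revenues/assets, debt/assets) purely as a simplifying assumption; the chapter's exact formulae are not given in this abstract.

```python
import numpy as np

def geometric_mean(values):
    """Geometric mean of a positive array."""
    return float(np.exp(np.mean(np.log(values))))

# Hypothetical accounting figures for three firms (illustrative only).
revenues = np.array([120.0, 80.0, 200.0])
costs = np.array([100.0, 70.0, 150.0])
assets = np.array([300.0, 160.0, 250.0])
debt = np.array([150.0, 40.0, 100.0])

# Centre of the composition: geometric mean of each accounting value.
centre = {
    "revenues": geometric_mean(revenues),
    "costs": geometric_mean(costs),
    "assets": geometric_mean(assets),
    "debt": geometric_mean(debt),
}

# Assumed, simplified ratio definitions evaluated at the centre.
group_ratios = {
    "margin": centre["revenues"] / centre["costs"],
    "turnover": centre["revenues"] / centre["assets"],
    "leverage": centre["debt"] / centre["assets"],
}

# For comparison: the same ratio computed from totals.
turnover_from_totals = revenues.sum() / assets.sum()
```

A ratio of geometric means equals the geometric mean of the firm-level ratios, so each firm contributes equally regardless of its size; the ratio of totals generally gives a different value.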

Spatial Statistics for Distributional Data in Bayes Spaces: From Object-Oriented Kriging to the Analysis of Warping Functions

In the presence of increasingly massive and heterogeneous spatial data, geostatistical modeling of distributional observations plays a key role. Choosing the “right” embedding space for these data is of paramount importance for their statistical processing, to account for their nature and inherent constraints. Bayes space theory provides a natural embedding space for (spatial) distributional data and has been successfully applied in varied settings. The aim of this work is to review the state-of-the-art methods for spatial dependence modeling and prediction of distributional data, while shedding light on the strong links between Compositional Data Analysis, Functional Data Analysis, and, more generally, Object-Oriented Data Analysis, in the context of spatial statistics. We propose extensions of these methods to the multivariate setting, and discuss the applicability of the Bayes space approach to the spatial modeling of phase variability in Functional Data Analysis.
Alessandra Menafoglio

Spatial Simultaneous Autoregressive Models for Compositional Data: Application to Land Use

Econometric land use models study the determinants of land use shares of different classes, for example “agriculture”, “forest”, “urban” and “other”. Land use shares have a compositional nature as well as an important spatial dimension. We compare two compositional regression models with a spatial autoregressive nature in the framework of land use. We study the impact of the choice of coordinate space and prove that the choice of coordinate representation does not have any impact on the parameters in the simplex as long as we do not impose further restrictions. We discuss parameter interpretation, taking into account the non-linear structure as well as the spatial dimension. In order to assess the impact of the explanatory variables, we compute and interpret the semi-elasticities of the shares with respect to the explanatory variables and the spatial impact summary measures.
Christine Thomas-Agnan, Thibault Laurent, Anne Ruiz-Gazen, Thi Huong An Nguyen, Raja Chakir, Anna Lungarska



The Whole Versus the Parts: The Challenge of Compositional Data Analysis (CoDA) Methods for Geochemistry

In complex geochemical aqueous systems, chemical species are conceptually distinct but empirically related through a large number of interactions taking place at different spatial and/or temporal scales. Under these conditions, common elements are shared, multiplicative interactions arise, and feedback mechanisms may be able to maintain the system far from thermodynamic equilibrium, with wide fluctuations. Chemical species can have alternative stable states, and transitions among them could have important consequences for the stability and the resilience of the solutions, also forced by climate change and with an impact on human health. Under the Compositional Data Analysis (CoDA) methodology, it is possible to appreciate the power of some tools that look at the whole instead of the constituent parts, enhancing the understanding of the nature of mutual interactions. In this research work, the role of the perturbation operator governing addition/subtraction in the simplex geometry is explored as a way to trace compositional changes and investigate the system dynamics. The results of our approach on the chemistry of the Arno River waters (Central Italy) highlight the possibility of discovering the resilience of chemical species under the pressure of the environmental drivers affecting the catchment. Geochemical mobility (e.g., ionic potential, ionic strength) can be associated with new tools that provide information on either the resistance to change, predisposing the system to critical shifts, or its adaptive capacity, which instead favors gradual changes. This information appears to be fundamental since river water chemistry makes it possible to decipher processes at the boundaries among lithosphere, biosphere, hydrosphere, and atmosphere, all key reservoirs involved in the dynamics of the Earth. This knowledge will be particularly relevant if the pressure of climatic change on our planet continues to increase.
Antonella Buccianti, Caterina Gozzi
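
The perturbation operator mentioned in this abstract plays the role of addition in the simplex: parts are multiplied componentwise and the result is rescaled to a constant sum. The sketch below shows perturbation and its inverse (perturbation difference), which can be used to express the change between two compositions. The four-part "upstream" and "downstream" compositions are hypothetical numbers, not data from the Arno River study.

```python
import numpy as np

def closure(x):
    """Rescale a positive vector so that its parts sum to one."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def perturb(x, y):
    """Perturbation: componentwise product followed by closure."""
    return closure(np.asarray(x, dtype=float) * np.asarray(y, dtype=float))

def perturbation_difference(x, y):
    """Return the perturbation that moves y to x (simplex subtraction)."""
    return closure(np.asarray(x, dtype=float) / np.asarray(y, dtype=float))

# Hypothetical compositions at two sampling sites.
upstream = closure([40.0, 10.0, 30.0, 20.0])
downstream = closure([30.0, 15.0, 35.0, 20.0])

# Compositional change from upstream to downstream.
change = perturbation_difference(downstream, upstream)
```

Applying `perturb(upstream, change)` recovers `downstream`, and a composition with equal parts is the neutral element, meaning "no change".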

Groundwater Origin Determination in Historic Chemical Datasets Through Supervised Compositional Data Analysis: Brines of the Permian Basin, USA

Data from historic water quality databases often lack critical measurements necessary for focused investigations, such as determining the origin of the water. The U.S. Geological Survey produced waters database contains nearly 7,000 good-quality records for the Permian Basin, the single largest oil-producing province in the United States. However, fewer than 350 of those records contain enough geochemical data (Br concentration or δ18O and δ2H composition) to determine whether the origin of the samples is meteoric water or paleoseawater. Three supervised methods were applied to isometric and pairwise log-ratio transformed major ion data from a subset of samples of known origin, with the Br concentration and δ18O and δ2H composition excluded, to predict origin: linear discriminant analysis (isometric only), support vector machines (isometric and pairwise), and random forests (pairwise only). Validation error rates, computed on data of known origin (excluding Br concentration and δ18O and δ2H composition) that were not used in model development, showed that no method performed exceptionally well. An ensemble approach that assigns a classification only when all three methods agree reduced the error rate on the validation data to 11% but failed to classify 28% of the data. This latter approach was applied to the nearly 7,000 samples which only contained concentrations of major ions (Cl, Ca, HCO3, Mg, Na, and SO4). Spatial mapping of these newly classified data generated insight into the distribution and flow of meteoric water and paleoseawater across the basin.
Mark A. Engle, Julien A. Chaput
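
The unanimity rule described in this abstract is simple to state in code: a sample receives a label only if all classifiers agree, and is otherwise left unclassified. The sketch below assumes the three sets of predictions are already available as lists; the labels and values are made up for illustration and do not come from the study.

```python
def unanimous_vote(*predictions):
    """Return the shared label where all classifiers agree, else None."""
    combined = []
    for labels in zip(*predictions):
        combined.append(labels[0] if len(set(labels)) == 1 else None)
    return combined

def summarize(combined, truth):
    """Error rate among classified samples and share left unclassified."""
    classified = [(c, t) for c, t in zip(combined, truth) if c is not None]
    unclassified_rate = 1 - len(classified) / len(combined)
    if classified:
        error_rate = sum(c != t for c, t in classified) / len(classified)
    else:
        error_rate = float("nan")
    return error_rate, unclassified_rate

# Hypothetical predictions from three classifiers on four validation samples.
lda = ["meteoric", "paleoseawater", "meteoric", "meteoric"]
svm = ["meteoric", "paleoseawater", "paleoseawater", "meteoric"]
rf = ["meteoric", "paleoseawater", "meteoric", "paleoseawater"]
truth = ["meteoric", "meteoric", "meteoric", "meteoric"]

combined = unanimous_vote(lda, svm, rf)
error_rate, unclassified_rate = summarize(combined, truth)
```

The design trades coverage for reliability: disagreement between methods is treated as a signal that the sample should not be assigned an origin at all.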

Chronic Kidney Disease of Uncertain Aetiology and Its Relation with Waterborne Environmental Toxins: An Investigation via Compositional Balances

The occurrence of environmental clusters of Chronic Kidney Disease of uncertain aetiology (CKDu), where there is no known cause for the onset of kidney dysfunction, is a concern globally. Waterborne exposure pathways in the environment may result in indirect or direct ingestion of trace elements with potential health risks. This research examines the relationship between Standardised Incidence Rates (SIRs) of CKDu and the log-ratio balances of Potentially Toxic Elements (PTEs) in regional stream water. Compositional elemental balances were created and regression was used to identify the balances most associated with log-transformed SIR CKDu. At the regional scale, a statistically significant relationship was found between log (CKDu SIR) and the elemental balance Al/As which effectively delineates different geological domains across Northern Ireland. Following stratification by basalt bedrock (the dominant bedrock geology for SIR CKDu), the balance Al/Fe was identified as significantly associated with log (CKDu SIR). Superficial deposits, dissolved organic carbon and pH may act as controls on the balance of Al and Fe. With a high proportion of private water supplies registered in these areas, this research highlights the importance of considering bedrock geology and superficial deposits in understanding multi-element interactions of waterborne environmental toxins and potential links with environmental clusters of CKDu. We would like to acknowledge and thank our esteemed colleague and friend Vera Pawlowsky-Glahn who has been instrumental in the development of this research, pioneering the importance of Compositional Data Analysis (CoDA) in the study of health and the environment. For several of us, Vera provided our first introduction to CoDA and over many years has aided our shared understanding and increased awareness of the need to address the compositional nature of data such as soil and water geochemistry in our research. 
We are pleased to have this work on the use of compositional balances in exploring the explanatory environmental factors related to chronic kidney disease of uncertain aetiology included in this Festschrift. It represents an exciting field of innovative research where an acknowledgement of CoDA principles is vital if we are to understand fully the critical relationship between our health and the environment.
Jennifer M. McKinley, Ute Mueller, Peter M. Atkinson, Ulrich Ofterdinger, Siobhan F. Cox, Rory Doherty, Damian Fogarty, J. J. Egozcue

Multivariate Classification of the Crude Oil Petroleum Systems in Southeast Texas, USA, Using Conventional and Compositional Data Analysis of Biomarkers

Chemically, petroleum is an extraordinarily complex mixture of different types of hydrocarbons that can now be isolated and identified thanks to advances in geochemistry. Here, we use biomarkers and carbon isotopes to establish genetic differences and similarities among oil samples. Conventional approaches for evaluating biomarker and carbon isotope relative abundances include statistical techniques such as principal component and cluster analysis. Considering that proportions of the different hydrocarbon molecules are relative parts of a laboratory sample, the data are compositional in nature, thus requiring the use of log-ratio approaches for adequate mathematical modeling. We apply both traditional and compositional modeling approaches to crude oil samples from an onshore area of about 50,000 square miles in southeast Texas. The data comprise 177 crude oil samples from producing oil fields that include key biomarkers, elemental, and isotopic values commonly used in source rock correlation studies. Our results indicate that compositional modeling has higher discriminating power and lower uncertainty than the traditional approach, allowing the identification of up to 16 clusters. Each cluster represents one oil family from a source rock organofacies ranging from Carboniferous to Paleogene. The families provide new insights into important petroleum systems in the Texas onshore region of the Gulf of Mexico sedimentary basin.
Ricardo A. Olea, Josep Antoni Martín-Fernández, William H. Craddock

Finding the Centre: Compositional Asymmetry in High-Throughput Sequencing Datasets

High-throughput sequencing datasets comprise millions of reads of genomic data and can be modelled as count compositions. These data are used for transcription profiles, microbial diversity, or relative cellular abundance in culture. The data are sparse and high dimensional. Moreover, they are often unbalanced, i.e. there is often systematic variation between groups due to presence or absence of features, and this variation is important to the biological interpretation of the data. The imbalance causes samples in the comparison groups to exhibit varying centres, contributing to false positive and false negative identifications. Here, we extend the centred log-ratio transformation method used for the comparison of differential relative abundance between two groups in a Bayesian compositional context. We demonstrate the pathology in modelled and real unbalanced experimental designs to show how this causes both false negative and false positive inference. We examined four approaches to identify denominator features, and tested them with different proportions of modelled asymmetry; two were relatively robust and are recommended. We recommend the ‘LVHA’ transformation for asymmetric transcriptome datasets, and the ‘IQLR’ method for all other datasets when using the ALDEx2 tool available on Bioconductor.
Jia R. Wu, Jean M. Macklaim, Briana L. Genge, Gregory B. Gloor
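
The idea of choosing "denominator features" can be sketched as a log-ratio transform centred on a subset of features rather than on all of them. The sketch below assumes that the ‘IQLR’ reference set consists of the features whose clr variance lies within the interquartile range of all feature variances; this is an assumption for illustration and not a reimplementation of ALDEx2, which is an R package with its own procedure. The simulated counts and the pseudo-count of 0.5 are likewise hypothetical.

```python
import numpy as np

def clr(X):
    """Centred log-ratio transform of each row of a positive matrix."""
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)

def iqlr_like(X):
    """Log-ratio transform centred on features of mid-range variance.

    Assumed rule: the reference set contains the features whose clr
    variance falls between the first and third quartile.
    """
    variances = clr(X).var(axis=0, ddof=1)
    lower, upper = np.quantile(variances, [0.25, 0.75])
    reference = (variances >= lower) & (variances <= upper)
    logX = np.log(X)
    # Log of the per-sample geometric mean over the reference features only.
    centre = logX[:, reference].mean(axis=1, keepdims=True)
    return logX - centre, reference

rng = np.random.default_rng(0)
# Simulated counts; the added 0.5 is a pseudo-count that avoids log(0).
counts = rng.poisson(lam=50, size=(6, 20)) + 0.5
Z, reference = iqlr_like(counts)
```

With an ordinary clr, every feature contributes to the centre, so a block of features present in only one group shifts it; restricting the centre to a stable subset is one way to limit that shift.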

Bayesian Balance-Regression in Microbiome Studies Using Stochastic Search

In microbiome studies, 16S rRNA sequencing is commonly used to quantify the taxonomic abundance of a microbial community. The resulting data are counts of amplicons. However, the total count is not informative because of the sampling, sample preparation, and sequencing processes. These counts are used to obtain estimates of the relative abundance of the taxa, which is compositional with a unit sum constraint. Analysis of compositional data requires special statistical treatment to account for the intrinsic dependence of the components due to this constraint. A balance, defined as the normalized log-ratio of the geometric means of the values for two groups of components, provides an interesting way of studying microbial community structure, where the two groups represent the beneficial and detrimental taxa, respectively. Such a balance can be used to quantify dysbiosis of the microbial community that is associated with a clinical outcome. However, identification of the outcome-associated balance is challenging. In this paper, we introduce a Bayesian balance-regression and a Markov Chain Monte Carlo (MCMC) stochastic search algorithm to identify the compositional balance that is associated with the outcome. Specifically, we propose a random walk strategy in MCMC that explores the very large space of all possible balances defined from a high-dimensional compositional vector. Simulation studies suggest that the algorithm can identify the bacterial taxa that define the outcome-associated balance with a high probability. The effect of the balance on the outcome can be easily inferred from its predictive posterior distribution. We apply the proposed methods to two human microbiome studies and identify the balance of gut microbiome composition associated with body mass index and risk of inflammatory bowel disease, respectively.
Lu Huang, Hongzhe Li
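
The balance defined in this abstract can be computed directly. The sketch below uses the standard normalizing constant sqrt(r·s/(r+s)) for groups of r and s parts; the relative abundances and the choice of groups are hypothetical, and the Bayesian stochastic search itself is not reproduced.

```python
import numpy as np

def balance(x, numerator, denominator):
    """Normalized log-ratio of the geometric means of two groups of parts."""
    x = np.asarray(x, dtype=float)
    r, s = len(numerator), len(denominator)
    log_gm_num = np.log(x[list(numerator)]).mean()
    log_gm_den = np.log(x[list(denominator)]).mean()
    return float(np.sqrt(r * s / (r + s)) * (log_gm_num - log_gm_den))

# Hypothetical relative abundances of five taxa.
abundance = np.array([0.10, 0.40, 0.20, 0.05, 0.25])

# Balance between taxa {0, 1} and taxa {2, 3}; taxon 4 is left out.
b = balance(abundance, numerator=[0, 1], denominator=[2, 3])
```

The balance depends only on ratios between parts, so it is unchanged if the abundances are rescaled, and swapping the two groups flips its sign.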

Compositional Data Analysis in Physical Activity and Health Research. Looking for the Right Balance

The time spent in different types of sleep, sedentary behaviour and physical activity during a day represents compositional data. Previous compositional analyses of daily activity data, using orthonormal log-ratio coordinates in the form of pivot coordinates of four components (sleep, sedentary behaviour, light physical activity, moderate to vigorous physical activity), have facilitated critical comparison with, and the resolution of, contradictory reports on the influence of daily behaviour on health that had arisen from the use of standard statistical analysis. For example, there has been a long-standing debate about whether the effects of time spent sedentary are independent of the time spent in moderate to vigorous activity. However, daily activity may be usefully broken down into more categories, and other compositional analysis approaches may be relevant. Here we demonstrate the use of log-ratio balances to evaluate the association between daily activity broken into six components (sleep, short periods of sedentary behaviour, long periods of sedentary behaviour, standing, slow walking and fast walking) from a nationally representative sample of adults with health markers in obesity, lipidemia and glycemia. The balance approach to compositional analysis of daily behaviour patterns offers a way to focus the statistical assessment and modelling on variables representing meaningful contrasts of interest between behaviours. This is fully consistent with the view that actual effects on health are associated with exchanges between some classes of daily behaviours.
Duncan E. McGregor, Philippa M. Dall, Javier Palarea-Albaladejo, Sebastien F. M. Chastin
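
For readers unfamiliar with pivot coordinates, the sketch below computes them for a four-part day using the usual construction: each coordinate contrasts one part with the geometric mean of the parts that follow it, scaled so that the coordinates are orthonormal. The minutes per behaviour are invented for illustration and sum to 1440.

```python
import numpy as np

def pivot_coordinates(x):
    """Orthonormal log-ratio (pivot) coordinates of a positive vector.

    Coordinate i contrasts part i with the geometric mean of all later
    parts, so the first coordinate isolates the first part.
    """
    x = np.asarray(x, dtype=float)
    D = len(x)
    logx = np.log(x)
    z = np.empty(D - 1)
    for i in range(D - 1):
        rest = logx[i + 1:]
        z[i] = np.sqrt((D - i - 1) / (D - i)) * (logx[i] - rest.mean())
    return z

# Hypothetical day in minutes: sleep, sedentary, light activity, MVPA.
day = np.array([480.0, 600.0, 300.0, 60.0])
z = pivot_coordinates(day)
```

Reordering the parts so that a different behaviour comes first gives a coordinate system in which that behaviour is the one isolated by the first coordinate.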

Compositional Data Analysis in Time-Use Epidemiology

How we allocate time to activities impacts our health. Daily times spent in activities are interrelated because they compete for time-shares within a finite 24 h window. If more time is spent in one activity, time must be taken from one or more of the remaining activities to maintain the fixed total of 24 h. Thus, time-use data have a relative nature and can be analysed accordingly using compositional data analysis. In this chapter, we demonstrate exploratory and cross-sectional inferential analyses of an eight-part time-use composition using data from the Longitudinal Study of Australian Children (n = 2224, 52% boys, mean age = 34 months, standard deviation = 3). For inferential analyses, time-use compositions are expressed as a specific choice of balance coordinates to separate between types of activities. Considering the balance coordinates as explanatory variables, we explore the relationship between children’s time-use composition and their socio-emotional health. Subsequently, we consider the balance coordinates as dependent variables and explore the relationship between parental perception of neighbourhood liveability and their child’s time-use composition.
Dorothea Dumuid, Željko Pedišić, Javier Palarea-Albaladejo, Josep Antoni Martín-Fernández, Karel Hron, Timothy Olds