Using quantile regression forest to estimate uncertainty of digital soil mapping products
Introduction
Soil maps are simplified representations of more complex and partially unknown patterns of soil variation. Therefore, any prediction of a soil property derived from these soil maps has an irreducible and, most often, substantial uncertainty, which has been considered by a number of soil surveyors (Beckett and Burrough, 1971, Wilding, 1985, Marsman and de Gruijter, 1986). Unable to fully eliminate this uncertainty, soil surveyors communicated it to soil map users essentially by means of soil survey norms (e.g. GEPPA, 1967; Soil Survey Staff, 1993) that related map scales to densities of observations and expected purities of soil mapping units. However, further validations of soil maps (Beckett and Burrough, 1971) showed that this uncertainty was most often underestimated and that the relation between map scale and uncertainty was largely obscured by differences in soil pattern complexity between soil surveys. Finally, it has been stated that the soil surveyor community addressed the evaluation and communication of the uncertainty of soil maps unsatisfactorily (Wilding, 1985).
The introduction of geostatistics in soil mapping in the early eighties (see the review in Webster, 1994) dramatically changed the way uncertainty is addressed in soil mapping. Indeed, the quantitative estimation of the uncertainty was simply a by-product of the geostatistical model from which the soil property was spatially estimated. Therefore, uncertainty maps were provided to users together with the predicted soil map. This uncertainty mapping approach was further taken up by McBratney et al. (2003) in the definition of the principle of Digital Soil Mapping (DSM). Following this principle, a soil class or a soil attribute could be related to, and further predicted from, the so-called scorpan factors by a Spatial Soil Prediction Function with auto-correlated error (SSPFe), the latter representing the uncertainty associated with soil prediction. Moving towards operational use of DSM with the GlobalSoilMap (GSM) project (Sanchez et al., 2009, Arrouays et al., 2014), the DSM principle was further translated into technical specifications that defined the uncertainty as the 90% Prediction Interval (PI). The PI reports the range of values within which the true value is expected to occur 9 times out of 10, with a 1 in 20 probability for each of the two tails (Arrouays et al., 2014). This requirement implies that a DSM model will be considered suitable for delivering GSM products not only for its ability to accurately predict the value of a soil property at a given location but also for its ability to predict how uncertain this prediction is.
Geostatistical models are, by nature, well suited to providing such uncertainty estimates (Goovaerts, 2001, Heuvelink, 2013). In particular, regression kriging has become very popular in the DSM community (Odeh et al., 1995, Malone et al., 2009, Hengl et al., 2015). While incorporating the relationships between soil properties and environmental covariates by means of various linear and non-linear regression models, regression kriging includes a kriging of the regression residuals that provides an estimate of the probability distribution of the true value of a particular soil property at any location. Assuming normality, the lower and upper limits of the 90% PI of the soil property required by the GSM specifications can easily be calculated by subtracting 1.64 times the kriging standard deviation from, and adding it to, the kriging prediction (Heuvelink, 2013). However, the calibration of the geostatistical models of the residuals may be difficult for large study areas with sparse sampling of measured sites (Vaysse and Lagacherie, 2015), a situation that is increasingly frequent in operational DSM applications.
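The normality-based interval described above can be illustrated with a minimal sketch. The function name and the numerical values (a clay prediction of 250 g/kg with a kriging standard deviation of 60 g/kg) are hypothetical and serve only to show the arithmetic. The multiplier is derived from the standard normal distribution rather than hard-coded, which gives about 1.645 for a 90% interval.

```python
from statistics import NormalDist


def gaussian_prediction_interval(prediction, kriging_sd, level=0.90):
    """Return (lower, upper) limits of a symmetric prediction interval,
    assuming normally distributed prediction errors."""
    if kriging_sd < 0:
        raise ValueError("kriging_sd must be non-negative")
    # Two-sided interval: (1 - level) / 2 probability is left in each tail.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return prediction - z * kriging_sd, prediction + z * kriging_sd


# Hypothetical values, for illustration only.
lower, upper = gaussian_prediction_interval(250.0, 60.0)
print(f"90% PI: [{lower:.1f}, {upper:.1f}] g/kg")
```

The interval is only as reliable as the normality assumption and the kriging standard deviation behind it.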
A regression tree (Breiman et al., 1984) can theoretically also provide the PI of a soil property, with the terminal node mean and the within-node standard deviation playing the roles of the kriging prediction and the kriging standard deviation, respectively. However, Breiman et al. (1984, p. 255) warned against this practice because the within-node standard deviation can be underestimated and the node mean poorly estimated because of non-normal node distributions; to overcome these problems, they recommended growing smaller trees with larger terminal nodes. Random Forests (RFs) (Breiman, 2001) were later proposed to improve the performance of regression trees by using bootstrap aggregation techniques, which employ an ensemble of regression trees and therefore yield more robust estimates with less biased internal error estimates. Meinshausen (2006) proposed modifying the outputs of the Random Forest procedure to allow the estimation of prediction intervals for the target variables. Whereas, for each node and each tree, Random Forests keeps only the mean of the observations that fall into this node and neglects all other information, Quantile Regression Forests (QRF) considers the spread of the response variable, from which prediction intervals are constructed. Although Random Forest is one of the most popular regression procedures in the DSM community, to the best of our knowledge, Quantile Regression Forest has not yet been tested in DSM applications.
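The difference between keeping only the node mean and keeping the full spread of the response can be sketched as follows. This is a simplified, pure-Python illustration of the weighting idea in Meinshausen (2006), not the implementation used in this paper. The leaf assignments and response values are hypothetical; in practice they would come from a fitted forest.

```python
def qrf_weights(train_leaves, query_leaves):
    """Weights of the training observations for one query point.

    train_leaves[t][i] is the leaf reached by training sample i in tree t;
    query_leaves[t] is the leaf reached by the query point in tree t.
    """
    n_trees = len(train_leaves)
    weights = [0.0] * len(train_leaves[0])
    for leaves, query_leaf in zip(train_leaves, query_leaves):
        members = [i for i, leaf in enumerate(leaves) if leaf == query_leaf]
        for i in members:
            # Each tree shares a total weight of 1/n_trees among the training
            # samples that fall in the same leaf as the query point.
            weights[i] += 1.0 / (len(members) * n_trees)
    return weights


def weighted_quantile(values, weights, alpha):
    """Smallest value whose weighted empirical CDF reaches alpha."""
    pairs = sorted(zip(values, weights))
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= alpha - 1e-12:
            return value
    return pairs[-1][0]


# Hypothetical toy forest: 3 trees, 6 training samples, one query point.
y = [12.0, 15.0, 18.0, 22.0, 30.0, 41.0]
train_leaves = [
    [0, 0, 1, 1, 2, 2],
    [0, 1, 1, 1, 2, 2],
    [0, 0, 0, 1, 1, 2],
]
query_leaves = [1, 1, 1]

w = qrf_weights(train_leaves, query_leaves)
rf_mean = sum(wi * yi for wi, yi in zip(w, y))  # what a standard RF reports
pi_lower = weighted_quantile(y, w, 0.05)        # lower limit of the 90% PI
pi_upper = weighted_quantile(y, w, 0.95)        # upper limit of the 90% PI
print(rf_mean, pi_lower, pi_upper)
```

The same weights yield both the usual Random Forest mean and any quantile of the conditional distribution, so the interval makes no normality assumption.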
This paper presents a test of Quantile Regression Forest for mapping three soil properties (clay content, organic carbon content and pH at 0–15 cm depth) and the associated uncertainties over the 27,236 km² Languedoc-Roussillon Region (France). The results of QRF are compared with those of a classical regression kriging (hereafter denoted RK) using RFs as the regression algorithm. A particular focus is given to the ability of QRF and RK to estimate a priori the uncertainties of their predictions.
Materials and methods
The case study has been described in a previous paper (Vaysse and Lagacherie, 2015); however, large excerpts are provided here for the sake of understanding this paper.
Variograms of the residuals
Fig. 3 shows the variograms of the RF residuals that were further interpolated following the Regression Kriging procedure (see above).
All the variograms exhibited a nugget effect that largely exceeded the spatially structured variability (sill minus nugget). However, the pH variogram showed a clear spatial structure with a range of 40 km. Conversely, the clay variogram did not show a spatial structure. The OC variogram exhibited an intermediate case in which a weak spatial structure can be
Uncertainty assessments
The uncertainty predictions of QRF and RK were validated by constructing accuracy plots, which still remains an uncommon practice in Digital Soil Mapping (Malone et al., 2011, Lagacherie et al., 2012).
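As an illustration of what an accuracy plot reports, the sketch below computes, for several nominal probability levels, the fraction of validation observations that fall inside the corresponding symmetric prediction interval. The validation data are hypothetical. The intervals are built here under a Gaussian assumption for brevity; with QRF, the interval limits would instead be taken from the predicted quantiles.

```python
from statistics import NormalDist


def interval_coverage(observed, lowers, uppers):
    """Fraction of observed values falling inside their prediction interval."""
    inside = sum(lo <= obs <= up for obs, lo, up in zip(observed, lowers, uppers))
    return inside / len(observed)


def accuracy_plot_points(observed, means, sds, levels):
    """(nominal level, observed coverage) pairs under a Gaussian assumption."""
    points = []
    for level in levels:
        z = NormalDist().inv_cdf(0.5 + level / 2.0)
        lowers = [m - z * s for m, s in zip(means, sds)]
        uppers = [m + z * s for m, s in zip(means, sds)]
        points.append((level, interval_coverage(observed, lowers, uppers)))
    return points


# Hypothetical pH validation data: observed value, predicted mean, predicted SD.
observed = [6.1, 7.4, 8.0, 5.2, 6.9, 7.8, 6.4, 8.3]
means = [6.5, 7.0, 7.9, 6.0, 6.8, 7.1, 6.6, 7.6]
sds = [0.5] * 8

points = accuracy_plot_points(observed, means, sds, [0.5, 0.9])
for level, coverage in points:
    print(f"nominal {level:.2f} -> observed {coverage:.3f}")
```

Points close to the 1:1 line indicate well-calibrated intervals; coverage below the nominal level indicates that uncertainty is underestimated, and coverage above it indicates overestimation.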
It must be highlighted that the performances of DSM measured through accuracy plots (Fig. 4) were very different from those measured by the classical indicators. Indeed, low performances for predicting clay and OC values were registered for QRF (Table 2) whereas accuracy plots revealed that QRF
Conclusions
This paper tested the use of Quantile Regression Forest (QRF) for delivering reliable estimates of uncertainty on Digital Soil Mapping products. QRF was compared with Regression Kriging (RK), the spatial model that is usually applied in operational Digital Soil Mapping. In the specific context of Languedoc-Roussillon and of sparse sampling of soil observations (1 site per 13.5 km²), the lessons that can be drawn from this test are as follows:
- QRF outperformed RK in delivering uncertainty estimates,
Acknowledgements
This research was funded by the French National Institute of Agronomical Research (INRA) and the French Research and Technology Agency (ANRT). The authors are also indebted to BRGM (French Geological Survey) and Jean-François Desprats for providing geological maps at the 1:50,000 scale. Additionally, we thank Dr. Vinatier for his great advice on programming in R and for access to his computing machine, and Dr. Bailly for his useful advice.
References (30)
- et al. Chapter three - GlobalSoilMap: toward a fine-resolution global grid of soil properties.
- et al. Mapping topsoil physical properties at European scale using the LUCAS database. Geoderma (2016).
- et al. Modelling soil attribute depth functions with equal-area quadratic smoothing splines. Geoderma (1999).
- Geostatistical modelling of uncertainty in soil science. Geoderma (2001).
- et al. Mapping continuous depth functions of soil carbon storage and available water capacity. Geoderma (2009).
- et al. Empirical estimates of uncertainty for mapping continuous depth functions of soil attributes. Geoderma (2011).
- et al. On digital soil mapping. Geoderma (2003).
- et al. Spatial prediction of soil properties using EBLUP with the Matérn covariance function. Geoderma (2007).
- et al. Further results on prediction of soil properties from terrain attributes: heterotopic cokriging and regression-kriging. Geoderma (1995).
- et al. Influence of parameter uncertainty estimations in kriging. J. Hydrol. (1996).
- Evaluating digital soil mapping approaches for mapping GlobalSoilMap soil properties from legacy data in Languedoc-Roussillon (France). Geoderma Reg.
- The development of pedometrics. Geoderma.
- The relation between cost and utility in soil survey. J. Soil Sci.
- Random forests. Mach. Learn.
- Classification and Regression Trees.