Elsevier

Geoderma

Volume 291, 1 April 2017, Pages 55-64
Using quantile regression forest to estimate uncertainty of digital soil mapping products

https://doi.org/10.1016/j.geoderma.2016.12.017

Highlights

  • Two methods of uncertainty estimation for three GSM products were tested in southern France.

  • 100 validation sets were built by iterative sampling of 25% of the sites.

  • Accuracy plots were proposed for validating uncertainty predictions.

  • Quantile Regression Forests outperformed Regression Kriging in mapping uncertainty.

  • Quantile Regression Forests is recommended in situations of sparse soil data.

Abstract

Digital Soil Mapping (DSM) products are simplified representations of more complex and partially unknown patterns of soil variation. Therefore, any prediction of a soil property derived from these products has an irreducible uncertainty that needs to be mapped. The objective of this study was to compare the most common DSM method, Regression Kriging (RK), with a new approach derived from Random Forest, Quantile Regression Forest (QRF), with regard to their ability to predict the uncertainties of GlobalSoilMap soil property grids. The comparison was performed for three soil properties (pH, organic carbon and clay content at 5–15 cm depth) in a 27,236 km² French Mediterranean region with a sparse set of measured soil profiles (one per 13.5 km²), using a set of environmental covariates characterizing the relief, climate, geology and land use of the region. In addition to classical performance indicators, the comparison involved accuracy plots and visual examination of the uncertainty maps provided by the two methods.

The results obtained for the three soil properties showed that QRF provided more accurate and more interpretable patterns of predicted uncertainty than RK, while performing similarly in predicting the soil properties themselves. The use of QRF in operational DSM is therefore recommended, especially when the spatial sampling of soil observations is too sparse for applying RK.

Introduction

Soil maps are simplified representations of more complex and partially unknown patterns of soil variation. Therefore, any prediction of a soil property derived from these soil maps has an irreducible and, most often, substantial uncertainty, which has been considered by a number of soil surveyors (Beckett and Burrough, 1971; Wilding, 1985; Marsman and de Gruijter, 1986). Unable to fully eliminate this uncertainty, soil surveyors communicated it to soil map users essentially by means of soil survey norms (e.g. GEPPA, 1967; Soil Survey Staff, 1993) that related map scales to densities of observations and expected purities of soil mapping units. However, further validations of soil maps (Beckett and Burrough, 1971) showed that this uncertainty was most often underestimated and that the relation between map scale and uncertainty was largely blurred by the differences in soil pattern complexity between soil surveys. Finally, it has been stated that the soil surveyor community did not satisfactorily address the evaluation and communication of the uncertainty of soil maps (Wilding, 1985).

The introduction of geostatistics in soil mapping in the early 1980s (see the review in Webster, 1994) dramatically changed the way uncertainty is addressed in soil mapping. Indeed, the quantitative estimation of the uncertainty was simply a by-product of the geostatistical model from which the soil property was spatially estimated. Therefore, uncertainty maps were provided to users together with the predicted soil map. This uncertainty mapping approach was later adopted by McBratney et al. (2003) in the definition of the principle of Digital Soil Mapping (DSM). Following this principle, a soil class or a soil attribute can be related with, and further predicted from, the so-called scorpan factors by a Spatial Soil Prediction Function with auto-correlated error (SSPFe), the latter representing the uncertainty associated with the soil prediction. With the move towards operational use of DSM in the GlobalSoilMap (GSM) project (Sanchez et al., 2009; Arrouays et al., 2014), the DSM principle was further translated into technical specifications that define the uncertainty as the 90% Prediction Interval (PI). The PI reports the range of values within which the true value is expected to occur 9 times out of 10, with a 1 in 20 probability for each of the two tails (Arrouays et al., 2014). This specification implies that a DSM model will be considered suitable for delivering GSM products not only for its ability to accurately predict the value of a soil property at a given location but also for its ability to predict how uncertain this prediction is.

Geostatistical models are, by nature, adequate for providing such uncertainty estimates (Goovaerts, 2001; Heuvelink, 2013). In particular, regression kriging has become very popular in the DSM community (Odeh et al., 1995; Malone et al., 2009; Hengl et al., 2015). While incorporating the relationships between soil properties and environmental covariates by means of various linear and non-linear regression models, regression kriging includes a kriging of the regression residuals that provides an estimate of the probability distribution of the true value of a particular soil property at any location. Under the assumption of normality, the lower and upper limits of the 90% PI required by the GSM specifications can easily be calculated by subtracting 1.64 times the kriging standard deviation from the kriging prediction and adding it to the kriging prediction, respectively (Heuvelink, 2013). However, the calibration of the geostatistical models of the residuals may be difficult in large study areas with sparse sampling of measured sites (Vaysse and Lagacherie, 2015), a situation that is increasingly frequent in operational DSM applications.
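The interval calculation described above can be sketched in a few lines of Python. This is only an illustration under the normality assumption stated in the text: the function name and the numerical values (a predicted pH of 7.0 with a kriging standard deviation of 0.5) are hypothetical, and the multiplier is taken from the standard normal quantile function (about 1.645 for a 90% interval) rather than hard-coded.

```python
from statistics import NormalDist


def gaussian_prediction_interval(prediction, kriging_sd, level=0.90):
    """Return (lower, upper) limits of a symmetric PI, assuming normal errors."""
    # Each tail holds (1 - level) / 2 of the probability, so the multiplier is
    # the standard normal quantile at 0.5 + level / 2 (0.95 for a 90% PI).
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * kriging_sd
    return prediction - half_width, prediction + half_width


# Hypothetical values, for illustration only: predicted pH 7.0, kriging sd 0.5.
lower, upper = gaussian_prediction_interval(7.0, 0.5)
print(f"90% PI: [{lower:.2f}, {upper:.2f}]")
```

The interval is symmetric around the prediction by construction, which is one consequence of the normality assumption.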

A regression tree (Breiman et al., 1984) can theoretically also provide the PI of a soil property, with the terminal node mean and the within-node standard deviation playing the roles of the kriging prediction and the kriging standard deviation, respectively. However, Breiman et al. (1984, p. 255) warned against this practice because the within-node standard deviation can be underestimated and the node mean poorly estimated because of non-normal node distributions; to overcome these problems, they recommended growing smaller trees with larger terminal nodes. Random Forests (RFs) (Breiman, 2001) were later proposed for improving regression tree performance by using bootstrap aggregation techniques, which allow an ensemble of regression trees to be employed and therefore yield more robust estimations with less biased internal error estimates. Meinshausen (2006) proposed modifying the outputs of the Random Forest procedure to allow the estimation of prediction intervals of the targeted variables. Whereas, for each node and each tree, Random Forests keeps only the mean of the observations that fall into this node and neglects all other information, Quantile Regression Forests (QRF) considers the spread of the response variable, from which prediction intervals are constructed. Although Random Forest is one of the most popular regression procedures in the DSM community, to the best of our knowledge, Quantile Regression Forest has not yet been tested in DSM applications.
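The difference between keeping only node means and keeping the spread of the response can be illustrated with a minimal sketch in the spirit of Meinshausen (2006). It assumes a forest has already been grown and that, for a query location, the training sites sharing its terminal node in each tree are known. The function names, the toy clay contents and the leaf memberships are all hypothetical, and this is not the implementation used in the study.

```python
def qrf_weights(leaf_members_per_tree, n_train):
    """Average, over trees, the uniform weight each tree gives to the training
    observations that share a terminal node with the query location."""
    weights = [0.0] * n_train
    n_trees = len(leaf_members_per_tree)
    for members in leaf_members_per_tree:
        for i in members:
            weights[i] += 1.0 / (len(members) * n_trees)
    return weights


def weighted_quantile(y_train, weights, alpha):
    """Smallest observed value whose weighted empirical CDF reaches alpha."""
    cumulative = 0.0
    for y, w in sorted(zip(y_train, weights)):
        cumulative += w
        if cumulative >= alpha - 1e-12:  # small tolerance for rounding error
            return y
    return max(y_train)


# Hypothetical toy data: clay contents (%) at six training sites, and the
# training sites sharing a leaf with the query location in each of three trees.
y_train = [12.0, 18.0, 25.0, 31.0, 40.0, 55.0]
leaves = [[0, 1, 2], [1, 2, 3, 4], [2, 3]]

w = qrf_weights(leaves, len(y_train))
lower = weighted_quantile(y_train, w, 0.05)  # lower limit of the 90% PI
upper = weighted_quantile(y_train, w, 0.95)  # upper limit of the 90% PI
mean_prediction = sum(wi * yi for wi, yi in zip(w, y_train))
print(lower, upper, round(mean_prediction, 2))
```

In this formulation the weighted mean of the responses is the usual forest prediction, while the same weights define a full conditional distribution from which the 5% and 95% quantiles are read. No distributional assumption is needed, so the resulting interval need not be symmetric around the prediction.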

This paper presents a test of Quantile Regression Forest for mapping three soil properties (clay content, organic carbon content and pH at 5–15 cm depth) and the associated uncertainties over the 27,236 km² Languedoc-Roussillon region (France). The results of QRF are compared with those of a classical regression kriging (hereafter RK) using RFs as the regression algorithm. Particular attention is given to the ability of QRF and RK to estimate a priori the uncertainties of their predictions.

Section snippets

Materials and methods

The case study has been described in a previous paper (Vaysse and Lagacherie, 2015); however, large excerpts are provided here for the sake of understanding this paper.

Variograms of the residuals

Fig. 3 shows the variograms of the RF residuals that were further interpolated following the Regression Kriging procedure (see above).
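For readers less familiar with variograms, the following sketch shows how an empirical semivariogram of residuals can be computed with the classical Matheron estimator, that is, half the mean squared difference between pairs of sites grouped by separation distance. The function name, the binning scheme and the toy coordinates and residuals are hypothetical, and the study itself may have used different software and settings.

```python
from math import hypot


def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Matheron estimator: half the mean squared difference per distance bin."""
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            dist = hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
            k = int(dist // lag_width)  # bin k covers [k * lag_width, (k + 1) * lag_width)
            if k < n_lags:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    # Bins without any pair of sites are reported as None.
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]


# Hypothetical residuals at four sites along a transect (coordinates in km).
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
residuals = [1.0, 2.0, 4.0, 7.0]
gamma = empirical_semivariogram(coords, residuals, lag_width=1.0, n_lags=4)
print(gamma)
```

A model fitted to such points is usually summarized by its nugget (semivariance at very short distances), its sill (plateau value) and its range (distance at which the plateau is reached).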

All the variograms exhibited a nugget effect that largely exceeded the spatially structured variability (sill minus nugget). However, the pH variogram showed a clear spatial structure with a range of 40 km. Conversely, the clay variogram did not show any spatial structure. The OC variogram exhibited an intermediate case in which a weak spatial structure can be

Uncertainty assessments

The uncertainty predictions of QRF and RK were validated by constructing accuracy plots, which remain an uncommon practice in Digital Soil Mapping (Malone et al., 2011; Lagacherie et al., 2012).
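As a rough illustration of the idea behind an accuracy plot, the sketch below computes, for a series of nominal probability levels, the fraction of validation observations that fall inside the corresponding prediction interval. The function names and the validation data are hypothetical, and Gaussian intervals are used only to keep the example short; intervals read from QRF quantiles could be supplied through the same callback.

```python
from statistics import NormalDist


def accuracy_plot_points(observed, interval_for, levels):
    """For each nominal level p, return (p, fraction of observations in their p-PI)."""
    points = []
    for p in levels:
        inside = 0
        for i, obs in enumerate(observed):
            lower, upper = interval_for(i, p)
            if lower <= obs <= upper:
                inside += 1
        points.append((p, inside / len(observed)))
    return points


# Hypothetical validation data: observed pH, predicted pH and predicted sd.
observed = [6.8, 7.4, 8.1, 5.9]
predicted = [7.0, 7.0, 7.5, 6.5]
sd = [0.5, 0.5, 0.4, 0.3]


def gaussian_interval(i, p):
    """Symmetric p-PI at validation site i, assuming normal errors."""
    z = NormalDist().inv_cdf(0.5 + p / 2.0)
    return predicted[i] - z * sd[i], predicted[i] + z * sd[i]


points = accuracy_plot_points(observed, gaussian_interval, [0.5, 0.9])
print(points)
```

Plotting the observed fractions against the nominal levels gives the accuracy plot: points close to the 1:1 line indicate that the predicted intervals have roughly the coverage they claim, whereas points below the line suggest that the uncertainty is underestimated.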

It must be highlighted that the performances of DSM measured through accuracy plots (Fig. 4) were very different from those measured by the classical indicators. Indeed, low performances for predicting clay and OC values were registered for QRF (Table 2) whereas accuracy plots revealed that QRF

Conclusions

This paper tested the use of Quantile Regression Forest (QRF) for delivering reliable estimates of uncertainty on Digital Soil Mapping products. QRF was compared with Regression Kriging (RK), the spatial model that is usually applied in operational Digital Soil Mapping. In the specific context of Languedoc-Roussillon and of sparse sampling of soil observations (one per 13.5 km²), the lessons that can be drawn from this test are as follows:

  • QRF outperformed RK in delivering uncertainty estimates,

Acknowledgements

This research was funded by the French National Institute of Agronomical Research (INRA) and the French Research and Technology Agency (ANRT). The authors are also indebted to BRGM (French Geological Survey) and Jean-François Desprats for providing geological maps at the 1:50,000 scale. Additionally, we thank Dr. Vinatier for his valuable advice on programming in R and for access to his computing machine, and Dr. Bailly for his useful advice.

References

  • K. Vaysse et al. Evaluating digital soil mapping approaches for mapping GlobalSoilMap soil properties from legacy data in Languedoc-Roussillon (France). Geoderma Reg. (2015)

  • R. Webster. The development of pedometrics. Geoderma (1994)

  • P.H.T. Beckett et al. The relation between cost and utility in soil survey. J. Soil Sci. (1971)

  • L. Breiman. Random forests. Mach. Learn. (2001)

  • L. Breiman et al. Classification and Regression Trees (1984)