Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Spatial modelling of topsoil properties in Romania using geostatistical methods and machine learning

Abstract

Various research topics from the field of soil science or agriculture require digital maps of soil properties as input data. Such maps can be achieved by digital soil mapping (DSM) techniques which have developed consistently during the last decades. Our research focuses on the application of geostatistical methods (including ordinary kriging, regression-kriging and geographically weighted regression) and machine learning algorithms to produce high resolution digital maps of topsoil properties in Romania. Six continuous predictors were considered in our study (digital elevation model, topographic wetness index, normalized difference vegetation index, slope, latitude and longitude). A tolerance test was performed to ensure that all predictors can be used for the purpose of digital soil mapping. The input soil data was extracted from the LUCAS database and includes 7 chemical properties (pH, electrical conductivity, calcium carbonate, organic carbon, N, P, K) and the particle-size fractions (sand, silt, clay). The spatial autocorrelation is higher for pH, organic carbon and calcium carbonate, as indicated by the partial sill / nugget ratio of semivariograms, meaning that these properties are more predictable than the others by kriging interpolation. The optimal DSM method was selected by independent sample validation, using resampled statistics from 100 samples randomly extracted from the validation dataset. Also, an additional independent sample of soil profiles, comprising legacy soil data, and the 200k Romania soil map were used for a supplementary validation. The results show that machine learning and regression-kriging are the optimal methods in most cases. Among the machine learning tested algorithms, the best performance is associated with Support Vector Machines and Random Forests methods. The geographically weighted regression is also among the optimum methods for pH and calcium carbonates spatial prediction. Good predictions were achieved for pH (R2 of 0.417–0.469, depending on the method), organic carbon (R2 of 0.302–0.443), calcium carbonates (R2 of 0.300–0.330) and moderate predictions for electric conductivity, total nitrogen, silt and sand (R2 of 0.155–0.331), while the lowest prediction characterizes the phosphorous content (R2 of 0.015–0.044). LUCAS proved to be a reliable and useful soil database and the achieved spatial distributions of soil properties can be further used for national and regional soil studies.

1. Introduction

Digital soil mapping (DSM) techniques can help achieve accurate maps of soil properties, which are further needed for reliable assessments of soil quality in relation to different land uses. These techniques have developed notably during the last decades, enhanced by the development of geographical information systems (GIS) software, statistical tools and computer technology [1]. Nowadays, there is a wide variety of statistical methods and software which can be used to produce digital maps of soil properties. Currently, DSM techniques are being applied from local to national [2, 3], continental [4, 5] and even global scales [6].

DSM can be successfully used to derive digital maps of both qualitative (soil taxonomic units, soil horizons) and quantitative (various chemical, physical and biological variables) soil properties, as well as to achieve data on pedogenetic processes and to discover new knowledge on soil processes [7]. In the European context, DSM has been used in several studies to predict the occurrence of soil taxonomic units: in Germany (Rhineland-Palatinate area) using artificial neural networks [8], in Hungary (forest environment) using discriminant and classification trees [9], in Germany (Bavarian forest soils) using random forest [10], in Romania (at watershed scale) using logistic regression and fuzzy techniques [11], in Denmark (at country scale) using boosted-decision tree method [12], in Cyprus using random forest [13]. Much more numerous are the studies applying DSM techniques for quantitative soil variables predictions. Soil organic carbon is a parameter of special interest due to its implications in soil fertility and climate change and has been approached in many studies [14]. [15] used geostatistical and machine learning techniques to predict topsoil organic carbon in Sicily (Italy); [16] applied partial least square regression, multiple linear regression and principal component regression in Galicia (Spain); [17] used multiple linear regression, boosted regression trees, artificial neural networks and least-squares support vector machines to estimate top- and subsoil organic carbon stocks in nature reserves in Belgium. Other studies analyzed different or multiple soil properties: [18] focused on modelling soil depth in France; [19] used geostatistical and machine learning techniques and large datasets of covariables to predict several chemical and physical soil properties in areas from Switzerland; [20] used quantile regression forests and regression kriging to predict pH, organic carbon and clay content in the Mediterranean French region.

Digital soil mapping in Romania is still a less approached subject. The only country-wide GIS soil map of Romania is in vector format and it was produced within SIGSTAR-200 project [21] by the digitalization of the soil map units from the analogue 200k scale soil map of Romania [22] and the creation of an attribute database comprising various supplementary information on the soil map units. This digital map was used to derive the topsoil organic carbon content at 500m resolution, using pedotransfer rules [23]. However, its high degree of generalization and the fact that it only includes qualitative parameters within the attribute database makes it unsuitable for a proper evaluations of terrain suitability for crops. Other studies focused on comparative applications of DSM techniques in different regions of the country. [24] used kriging and regression-kriging to derive quantitative soil parameters and logistic regression–kriging to derive occurrence spatial probabilities for soil horizons in a part of Iaşi County; [11] tested the used of logistic regression and fuzzy techniques for the prediction of qualitative soil variables; [25] applied comparatively several geostatistical methods including ordinary kriging, geographically weighted regression, regression-kriging to map soil quantitative properties; [26] investigated the effect of scale and resolution on DSM; [27] used regression-kriging to map soil heavy metals in Cluj County; [28] explored the potential of land-surface segmentation technique to create predictors for DSM of qualitative soil parameters (eg. soil taxonomic units, textural classes); [29] used RK and OK to interpolate pXRF-derived soil chemical data in an archaeological site.Among DSM methods, kriging was one of the first to be used for the interpolation of soil properties in the 80’s [30, 31]. It has developed ever since and now includes several approaches, some simpler (ordinary kriging, simple kriging), other more complex (co-kriging, universal kriging, indicator kriging, probability kriging, kriging with external drift) [32].

In parallel, regression analysis has been used to correlate quantitative soil properties and terrain variables [33]. Coupled with kriging for residuals interpolation, the hybrid regression-kriging method became a powerful digital mapping technique [3436].

Machine learning (ML) is a more recent approach, the first applications for DSM occurring in the 90’s [37]. It includes a variety of non-linear algorithms originally used for data mining. For DSM purposes, algorithms based on regression and classification, such as random forest [38], which is used in our study among other methods, are the most widespread [3941]. The approach based on ML has the advantage of not making any assumptions regarding data distribution (as it is the case for geostatistical methods) and of handling a large number of cross-correlated predictors. On the other hand, a disadvantage is that the models behind the prediction are very complex and can’t be readily used to explain the prediction [42].

LUCAS (Land Use and Coverage Area frame Survey) (https://esdac.jrc.ec.europa.eu/content/lucas2015-topsoil-data) project began in 2000 following a decision of the European Commission, with the purpose of creating a harmonized soil database across the EU [43]. By repeating the data collection, changes in soil properties can be detected. This database has been successfully used at continental scale to estimate soil erodibility [44], organic carbon [23, 45, 46], copper distribution [47] or other chemical parameters [48, 49]. Several studies also applied LUCAS at country level, eg. to derive soil organic carbon for Belgium and Luxembourg [50], to analyze the evolution of phosphorous content in agricultural soils of Hungary [51]. Some European scale studies regarding the mapping of soil properties through DSM techniques included the Romanian territory as well. [52] modelled the European soil organic carbon stocks under current and future climate and land cover conditions using the regression-kriging approach. [49] used LUCAS 2009/2012 topsoil data to derive maps of 8 chemical parameters for the EU.

The goals of our study are to test several spatial interpolation methods, choose the optimal one through independent validation and to provide high resolution maps of topsoil parameters based on LUCAS database, potentially useful for regional and local agricultural land evaluation. It is the first time that digital mapping techniques, including geostatistical methods and machine learning, are applied for the entire Romanian territory. In relation to these goals, the research questions we address in our study are the following: (i) which is the optimal DSM method for the analyzed soil properties; (ii) how accurate are the resulting spatial models; (iii) how reliable is the LUCAS database for national and regional soil assessments. Compared to the European scale studies which have included the Romanian territory, our study brings more insight for several reasons: it is focused only on Romania; it applies several DSM methods in order to choose the optimal one; it uses more recent LUCAS data (2015 for chemical properties) and provides spatial models for more soil variables (in addition to pH, CaCO3, N, P, K, we mapped organic carbon content, electrical conductivity and particle-size variables–clay, silt, sand); it provides higher resolution maps (100x100 m).

2. Materials and methods

2.1. Study area

Romania is a medium-size European country (about 237,000 km2) situated in the central-eastern part of the continent. It has a temperate continental climate with mean annual temperature of about 9.5°C, ranging from <0°C on the highest mountain peaks to >11°C in the southern and south-eastern plain and plateau regions. The precipitations increase with elevation and from east to west, with a mean annual value of about 635 mm yr-1 and a value range from <400 mm yr-1 in eastern Dobrogea and >1000 mm yr-1 on the highest mountain peaks [53]. The surface lithology is characterized by the dominance of loess and loess-like formations in the plain areas, lime and sandstones in the plateau and hilly areas and hard metamorphic, igneous and sedimentary rocks in the mountain areas. Land use in dominated by agriculture in the lower plain, hilly and plateau areas (57% of the country).

A general overview regarding the soil distribution in Romania is shown in Fig 1. The main characteristic of the Romanian soil cover is the concentric distribution of the major soil zones around the Carpathian Mountains. The mountain area is covered mainly by Cambisols, Podzols and Umbrisols, with a consistent extent of Andosols in the volcanic area from Eastern Carpathians. The hilly area and the large plateaus surrounding the mountains are dominated by Luvisols. They gradually turn into Phaeozems followed by the broad belt of Chernozems which occurs in the peripheral low plains. Fluvisols are present in the large river valleys, usually accompanied by Gleysols and saline soils. The relatively equal percentages of Cambisols (24.8%), Luvisols (24.2%) and Chernozems (22.1%) are consistent with the distribution of major landforms (mountains, plateaus / hills and plains) across Romania.

2.2. LUCAS soil database

The LUCAS (Land Use and Coverage Area frame Survey) project started in 2001, following the 1445/2000 decision of the European Commission, its target being to continue the monitoring of land use and land cover in the EU, providing information related to the sustainable development of environment, landscapes, agriculture and terrain degradation. In 2006, the survey changed its sampling methodology and broadened its focus and frequency. Since 2009 the sampling has included topsoil surveys, in order to construct a robust and standardized database regarding the soil cover. The purpose implied various applications of the dataset, including of modelling soil properties, evaluating soil implication in climate change, nutrient cycles, agriculture or comparing the quality of national / regional soil inventories. The same year, the covered area has been expanded to most of the EU states, Bulgaria and Romania being sampled in 2012. A description of LUCAS 2012 database for Romania and Bulgaria is provided by [43]. In 2015, sampling was expanded to cover locations at altitudes above 1000 m [54].

In order to establish baseline values of topsoil properties, the sampling points have been selected at similar densities in each country. However, differences occur due to a preference for agricultural land in comparison to grasslands and woodlands, which can limit the quality of the statistical extrapolations or the use of the data for soil mapping [48].

The sampling methodology and laboratory analyses are the same since 2009. The methods used for the analysis of physical and chemical properties in topsoil samples are detailed in [55], while detailed description of the sampling efficiency is described by [56].

LUCAS Topsoil Survey is based on the soil parameters that are deemed relevant for agricultural policy, providing an excellent basis to assess the changes in topsoil characteristics across the EU. Therefore, analysis of physical and chemical properties represents the core of the LUCAS Soil survey. Each property has its standard protocol, and all samples are analyzed following the same procedure. This makes the data comparable over time [48, 54]. In this study we analyzed 10 parameters extracted from LUCAS 2012 and 2015 topsoil databases: soil pH, CaCO3, organic carbon (OC), total nitrogen (N), phosphorous content (P), extractible potassium content (K), electrical conductivity (EC) and particle-size distribution (clay, silt, sand).

LUCAS 2012 and 2015 databases were extracted for Romania. The 2015 database, comprising 1083 points, was used to interpolate the chemical soil properties. A random subsample of 160 points (about 15% of the total sample) was extracted and used for independent validation (Fig 2). The 2012 database and the additional points in 2015 database, totaling 1444 points, were used to interpolate the particle-size fractions. A random subsample of 220 points (about 15% of the total sample) was extracted for independent validation.

thumbnail
Fig 2.

Training and test samples used for interpolation of chemical (a) and particle-size (b) properties (extracted from LUCAS database: https://esdac.jrc.ec.europa.eu/projects/lucas) and the supplementary validation sample (c) [57] overlaying the major landform units of Romania [58].

https://doi.org/10.1371/journal.pone.0289286.g002

2.3. Predictor variables

We used 6 predictors in our study as covariates in RK, GWR, GWR-OK and EML methods: digital elevation model (DEM), slope (S), topographic wetness index (TWI), normalized difference vegetation index (NDVI) (Fig 3), latitude (LAT) and longitude (LON). We preferred to use a minimum number of significant predictors, instead of a large predictor dataset, in order to avoid overfitting and multicollinearity problems and to ensure that the achieved models can be explained through the contributions of the factors. Also, a small number of quantitative predictors facilitates model reproduction in different areas, at different scales.

thumbnail
Fig 3.

Predictor maps used in interpolation procedures: a–elevation; b–slope; c–topographic wetness index; d–normalized difference vegetation index.

https://doi.org/10.1371/journal.pone.0289286.g003

The spatial analysis was performed at 100 m resolution using ArcGIS 10.8 software (ESRI, https://www.esri.com/en-us/arcgis/). The DEM was extracted from EUDEM (https://land.copernicus.eu/imagery-in-situ/eu-dem/eu-dem-v1.1/view) and it was further used to derive slope and topographic wetness index, the latter in SAGA-GIS software (http://www.saga-gis.org/). NDVI was derived from LANDSAT 8 imagery using Google Earth Engine (https://developers.google.com/earth-engine/datasets/catalog/landsat-8). The median values of June-August interval from 2015–2020 were used to compute this predictor. The choice of the summer months for NDVI computation is related to the vegetation development, which is at its peak under the temperate continental conditions of Romania, while the different types of vegetation are best discriminated during summer.

The DEM is probably the most important factor, expressing the altitudinal distribution of soils as effect of climate and vegetation zonation, but also as effect of different lithology (soft, loamy loess-like formations in the lower plain and hilly regions and hard rocks in the mountain areas). Slope influences soil erosion, while topographic wetness index expresses the accumulation potential of the relief for water and sediments. NDVI accounts for the influence of vegetation on soil properties, while latitude and longitude may account for general eastward / northward trends induced by climate variations.

2.4. Spatial interpolation methods

Our study compares the outcome of 5 spatial interpolation methods, namely ordinary kriging (OK), regression-kriging (RK), geographically weighted regression (GWR), geographically weighted regression–kriging (GWR-OK) and ensemble machine learning (EML).

Ordinary kriging (OK) is a local interpolation method based on the autocorrelation concept [59], according to which the values of a variable are similar in close locations and the dissimilarity increases with the distance between points. The dissimilarity is measured through semivariance, while the chart displaying the semivariances against point distances is called semivariogram and it is used to derive point weights on the basis of which the estimated values are computed. The empirical variogram, as a nonparametric estimator of the variogram of a spatial process, is described by the following equation according to the Matheron estimator: (1) where:

  • γ(h): semivariance of points separated by distance h;
  • Zi: variable values in points xi;
  • Zi+h: variable values in points situated at distance h from points xi;
  • N(h): number of points situated at distance h.

The kriging estimator at unsampled locations is given by a linear combination of the variable values at known locations: (2) where:

  • Z(x0): kriging estimator at unsampled location x;
  • Z(xi): variable values in xi neighboring points;
  • λi: weighting coefficients of xi neighboring points, ;
  • ε: standard error.

The weighting coefficients (λi) are determined in such a way as to minimize the estimation variance. Ordinary kriging doesn’t take into account co-variables for interpolation, the predicted values depending only on the neighboring values. Ordinary kriging has been extensively used for mapping soil [30, 31, 6062]. In our study, ordinary kriging interpolation was applied by using the Geostatistical Analyst module from ArcGIS 10.8 software (ESRI, https://www.esri.com/en-us/arcgis/).

Regression-kriging (RK) combines regression analysis with ordinary kriging into a theoretically superior interpolation method [33, 34]. Our study uses global linear regression models to derive spatial trends of topsoil parameters: (3) where:

  • : predicted value;
  • Xi: predictors;
  • a: intercept;
  • bi: regression coefficients;
  • ε: standard error.

The deviations from this trend (residuals) are then interpolated by ordinary kriging, the final spatial model resulting from the addition of spatial trend and residuals. As ordinary kriging, the regression-kriging method has numerous applications for mapping soil parameters [32, 36].

A stepwise integration of predictors into the regression models was applied in order to avoid collinearity and insert only the predictors with significant contributions to the prediction.

The geographically weighted regression (GWR) applies local regression models in moving windows with fixed or variable dimensions [63]. Regression models are computed for each location based on the neighboring point values. These points receive different weights according to a gaussian function. The closer a point is to the center of the moving window, the higher is its weight and contribution to the estimated value. In contrast to global regression, GWR has the advantage of rendering local variations. Combining GWR with ordinary kriging of residuals leads to the hybrid geographically weighted regression–kriging (GWR-OK) method. GWR and GWR-OK have also been successfully used for digital mapping of soil parameters [25, 64, 65]. In our study, ordinary kriging interpolation was applied by using the Geographically Weighted Regression module from ArcGIS 10.8 software (ESRI, https://www.esri.com/en-us/arcgis/).

Ensemble machine learning (EML) as described by [66, 67] is a predictive technique which combines two or more models to improve the prediction accuracy. Compared to individual learners, the use of ensemble learners has several advantages: it improves performance, by combining the output of multiple learners, it ensures robustness, by reducing the extrapolation / overshooting effects, and minimizes bias.

For this study, we used EML implemented in landmap package for R [68], which extends the functionality of mlr package [69] and contains the necessary tools and functions for automatic mapping of environmental variables without too much intervention required from the user.

The package has 5 stacked learners implemented by default ("regr.cvglmnet"–Generalized Linear Model with Lasso or Elasticnet Regularization, "regr.ksvm"—Support Vector Machines, "regr.nnet"—Neural Network, "regr.ranger"—Random Forests, "regr.xgboost"—eXtreme Gradient Boosting) and allows the use of oblique coordinates [70] as well as the geographical distance to all as “geographical features” [71, 72]. By default, the landmap package allows the user to train a spatial prediction using a SuperLearner algorithm [73] which consists in computing each individual learner by 5-fold cross-validation and determine the meta-learner. Due to the large numbers of points (over 900) we used only oblique coordinates with a list of 28 angles [70] to reduce the artefacts.

The landmap package integrates several techniques, allowing an automate application of EML, with minimum user intervention [68]: derivation of geographical distances, computation of principal components based on a covariate dataset, automated filling of gaps, automated fitting of variogram, spatial data overlay, cross-validation model training, fitting of the final EML model.

2.5. Validation procedure

For validation purposes, we used randomly selected independent samples which were not included in the model computation stage, representing about 15% of the total samples (160 points for chemical soil properties and 220 points for particle-size fractions). Correlation charts, displaying observed vs predicted values, were computed for each soil parameter and interpolation method, while the selection of the best model was based on the following statistical quality indices:

  • Coefficient of determination (R2), which is the squared Pearson correlation coefficient:
(4)

where:

  1.  ‐ COV(X,Y): covariance of predicted () and observed (Y) values;
  2.  ‐ σX, σY: standard deviations of predicted () and observed (Y) values.
    • Mean absolute error (MAE):
(5)

where:

  1.  ‐ Yi: observed values;
  2.  ‐ : predicted values;
  3.  ‐ N: validation sample size.
    • Root mean square error (RMSE):
(6)

In addition, the mean bias error was computed in order to evaluate the bias of observed–predicted relationships: (7)

High positive values of ME indicate that the model underestimate the observed values, while low negative ME values indicate that the model tend to overestimate the observed values.

In order to refine the validation stage, the test datasets were randomly resampled using 85% of the data points into 100 samples which were further used to compute average and standard deviation values of the validation parameters (R2, MAE, RMSE and ME). Resampling was achieved in Excel environment using Resampling Stats Addin module Bruce P. (https://resample.statistics.com/excel/).

A supplementary independent sample was also used for validation, comprising legacy soil data from 148 soil profiles (Fig 2C). This database was compiled from various soil studies conducted in Romania during 1960–2020 period using standard laboratory determinations. The database has been compiled, verified and used for the first time by [57]. We also performed a validation of our quantitative soil parameter maps with the 200k Romanian soil map by computing average parameter values for soil reference groups and interpreting the plausibility of results.

3. Results

3.1. Descriptive statistics, variable correlations and variography

LUCAS dataset for Romania spans from sea level to about 1500 m altitude and cover slopes up to 29o (Table 1). It is therefore important to keep in mind possible extrapolation errors at higher elevations and on steeper slopes. Among soil properties, CaCO3 and P content present the highest spatial variability. We notice that the median value is 0 for CaCO3, meaning that more than half of the sample has 0 values for this parameter. P content is characterized by the presence of some very high peak values, exceeding 300 mg kg-1. The lowest spatial variability characterizes soil pH, followed by clay and silt contents.

thumbnail
Table 1. Descriptive statistics of LUCAS soil properties and predictor variables.

https://doi.org/10.1371/journal.pone.0289286.t001

The Pearson correlation matrix (Table 2) shows that there are significant correlations among the predictor variables, the highest being between TWI and slope (-0.805). However, the two predictors can’t be seen as redundant because they express different, though complementary, sides of terrain reality: slope reflects the erosion potential, while TWI reflects the accumulation potential. The other correlation values are moderate and low. A tolerance test was carried out to ensure that the predictors can be all used in digital mapping procedures. Tolerance values are computed as 1 –R2 (XLSTAT Tutorial, https://help.xlstat.com/), where R2, for a certain variable, is the determination coefficient of the prediction of all other variables. Tolerance values less than 0.2 [74] or 0.1 [75] indicate the presence of multicollinearity, that is the presence of significant correlations among the predictor variables which may induce wrong interpretations of the variable contributions to the final spatial models. In our situation, the minimum tolerance value is 0.3, indicating that all predictors can be used in prediction models.

thumbnail
Table 2. Pearson correlation matrix and tolerance values for predictors.

https://doi.org/10.1371/journal.pone.0289286.t002

The correlations between soil variables and predictors (Table 2) are generally significant, due to the large sample size, but of low or moderate magnitude. Soil reaction (pH) has the highest number of significant correlations and the highest magnitude, while the silt content seems to be the variable the least correlated with the predictor variables. The correlations among soil variables show, as expected, a close dependency of pH on CaCO3 content and a very tight relationship between OC and N. Also, the high correlation between P and K is probably the effect of their origin from the same source, namely the fertilizers.

The total LUCAS samples are characterized by a mean point distance of about 8.5 km for the chemical properties sample and of 6.7 km for the particle size sample, 50% of the points being separated by distances of less than 19–20 km. The mean point densities of the two samples are 4.5 and 6.1 points / 1000 km2 respectively, 50% of the territory having densities of more than 8.5–10.0 points / 1000 km2.

According to variography (Fig 4), the partial sill / nugget ratio shows that the spatial autocorrelation is higher for pH, OC and CaCO3, these properties being therefore more predictable than the others by kriging interpolation, while the spatial autocorrelation is lower for clay, silt and P, in which case the kriging predictions are expected to be more uncertain. Among the available semivariogram models, the exponential type was chosen to fit the experimental semivariogram cloud because it minimized the nugget effect. The lag size is generally around 44–45 km, except for OC, EC and N in which case it is smaller (around 15 km) indicating more localized spatial variations. Anisotropy was not considered in the variography analysis because a single clear data variation direction could not be identified throughout the country due to landscape complexity.

3.2. Spatial representativeness of predictor variables

Fig 5 displays the spatial representativeness of the predictors (DEM, SLOPE, NDVI, TWI) by comparing the relative frequencies of classes based on predictors and LUCAS 2015 sample points. We notice a good agreement between the histograms for elevation and slope. Some minor differences indicate slight underrepresentation in the medium elevation domain (700–1000 m) and at higher slopes (> 13o). Also, LUCAS 2015 sample doesn’t contain points with elevations higher that 1550 m and with slope values higher than 31o, which may induce model extrapolation problems. For NDVI and TWI there are no extrapolation issues, the point sample covering all value classes. However, the comparison of the frequency curves shows some differences related to the underrepresentation of the sample in the high value domain for NDVI (> 0.8) and in the lower value domain for TWI (<7.6).

thumbnail
Fig 5.

Spatial representativeness of LUCAS 2015 total sample in respect to the main predictors: a–elevation; b–slope; c–topographic wetness index; d–normalized difference vegetation index.

https://doi.org/10.1371/journal.pone.0289286.g005

3.3. Spatial models of soil properties

The coefficients of determination (R2) values show that the most predictable soil property is the pH, in the case of which 41–47% of the spatial variance of the validation sample could be accounted for by the computed spatial models (Table 3 and Fig 6). Good performances were also achieved for OC (R2 of 0.302–0.443) and CaCO3 (R2 of 0.300–0.330). The weakest prediction characterizes the phosphorous content (P), for which none of the spatial interpolation methods provided a statistically significant model with the initial settings. Still, in order to provide a map for this soil parameters as well, and considering the fact that the tolerance test showed no multicollinearity, we made an exception and computed the regression model using all predictors, regardless of their contribution to prediction. The model achieved has a low (R2 of 0.044) but statistically significant explained variance for the 0.05 significance level. Weak performances were also achieved for K (R2 of 0.037–0.150) and clay content (R2 of 0.075–0.112). Moderate performance was achieved for the other soil parameters (EC, N, silt, sand), for which the explained variances computed for validation samples are about 20–30%. The overall determination coefficients may seem small, but in the context of soil prediction by digital soil mapping, R2 values of 0.3–0.5 are commonly found [36, 61, 76].

thumbnail
Fig 6. Mean R2 values computed for resampled validation datasets.

https://doi.org/10.1371/journal.pone.0289286.g006

thumbnail
Table 3. Mean values and standard deviations of quality statistical indices computed for 100 resampled validation datasets*.

https://doi.org/10.1371/journal.pone.0289286.t003

Based on R2, MAE and RMSE values (Table 3), the best models were identified for each soil property (Table 4). For pH, which is the most predictable soil parameter, 4 out of 5 methods presents similar performances (RK, GWR, GWR-OK, EML), qualifying them as best models. The OK model is also acceptable, the quality parameters being close to the ones of the other methods. For EC, the optimal methods are EML and RK. Though GWR-OK presents the highest R2 value, the prediction is very biased (ME = -19.828), resulting from the dominance of predicted values higher than the observed ones, which reflects in the high MAE and RMSE values. For this reason, GWR-OK model was not considered among the optimal one for this parameter. RK model was found to be the optimal one for OC and among the optimal ones for CaCO3, P, N and silt. EML was the optimal method in the case of K, clay and sand, and among the optimal ones for N and silt. Finally, GWR, GRW-OK and RK proved to be the optimal methods for CaCO3 mapping. Overall, the best methods proved to be EML and RK, both cumulating 8 selections as optimal methods (Table 4). GWR and GWR-OK were each selected twice as optimal methods, while OK only once.

thumbnail
Table 4. The best models identified by quality statistical indices computed for validation samples.

https://doi.org/10.1371/journal.pone.0289286.t004

The spatial distributions of the optimal models are classified according to the Romanian standards [77], excepting EC for which the values are lower than the 200 mS/m threshold for saline soils and a natural breaks classification is used. For these selected spatial models, Fig 7 shows the correlations between the observed and predicted values computed for the independent validation samples. We notice better performances in the case of pH and OC (R2 values of 0.461 and 0.439 respectively), followed by CaCO3 and sand fraction (0.337 and 0.332), while the weakest model is the one of P content (0.043).

thumbnail
Fig 7. Correlations between observed and predicted values of the selected spatial models computed for validation samples.

https://doi.org/10.1371/journal.pone.0289286.g007

The validation according to the supplementary dataset is shown in Fig 8. The sample has less points in the case of CaCO3 content (77) because no measurements were carried out for the other points of this parameter. Also, P, K contents and EC data were not available in this dataset. For the other 7 soil variables we may notice that there is a high degree of similarity between these correlation charts and the ones computed for the LUCAS validation sample (Fig 7), which demonstrates that the models are stable. The pH and N content have very similar R2 values (0.455 and 0.268 respectively), the sand fraction has a slightly higher value (0.393), while the R2 values of OC and CaCO3 are slightly lower (0.333 and 0.263 respectively). It is to be noted the higher R2 value in the case of the clay content (0.210 using the supplementary validation dataset, compared to 0.111 using LUCAS validation dataset).

thumbnail
Fig 8. Correlation charts using the supplementary validation dataset.

https://doi.org/10.1371/journal.pone.0289286.g008

Soil pH, CaCO3 and organic C are highly correlated and show strong climatic, vegetation cover and lithological dependence. They reflect the occurrence of various soil processes and have multiple functions in soils in relation with natural vegetation and crops. The distribution of the major macronutrients (N, P, K) reflects both natural processes and the intensity of human activity, representing important parameters for agricultural and environmental applications. Electrical conductivity is a good indicator of salinity, which can cause fertility problems in soil and can affect crop productivity. Particle-size distribution refers to the relative proportions of sand, silt and clay within the fine earth fraction, determining the movement of soil water and implicitly the retention of cations and nutrients.

The soil reaction (pH) spatial distribution (Fig 9A) depends mainly on elevation, the soils becoming more acid towards higher elevations as an effect of increased precipitations. The regression equation shows that there is a general decrease with 0.195 pH units for each 100 m increase in elevation (Table 4). It also reveals a general trend of pH increase towards east (by 0.155 units / 1 degree of longitude) explained by climate aridity increase in the same direction. Strongly and moderately acid soils (<5.9) are present in the mountain areas, weakly acid soils (5.9–6.8) are found in the hilly, plateau areas and parts of Western and Romanian Plains, while neutral and weakly alkaline (6.9–8.4) are specific to Dobrogea Plateau, eastern Romanian Plain and lower sectors of the Moldavian, Transylvanian Plateaus and Western Plain.

thumbnail
Fig 9.

Optimal maps of chemical soil parameters: a–soil reaction; b–electrical conductivity; c–organic carbon; d–calcium carbonates.

https://doi.org/10.1371/journal.pone.0289286.g009

The electrical conductivity (EC) spatial distribution (Fig 9B) depends on elevation and TWI, according to the regression equation (Table 4). The EC values are higher in accumulation zones (eg. floodplains) because of soluble salts accumulation. Also, higher EC values are found in lower, plain or plateau areas, due to the more intense evapotranspiration processes. The higher values in the mountain area are probably the influence of parent materials, these soils being younger and richer in minerals. Overall, the predicted EC values are lower than the 200 mS/m threshold which separates non-saline from saline soils.

The organic carbon (OC) spatial distribution depends mainly on elevation and TWI (Table 4 and Fig 9C). The values of this soil parameter increase with elevation as a consequence of the cooler and wetter climate, which reduces the humification rate and favors the accumulation of weakly decomposed organic matter in larger quantities. On the other hand, OC increases with TWI values, that is at lower relative elevations along floodplains, because of the wetter soil conditions which slow down the decomposition rate of the organic matter. Moreover, the periodical deposition of materials during floods hampers the humification process and allows the accumulation of larger organic matter quantities. The moderate and low OC contents in the hilly, plateau and mountain areas are also the effect of higher soil erosion rates, especially in the Subcarpathians [78]. Most of the country has a moderate OC content (17.4–37.8 g/kg), while in the mountain areas the content is high and very high (> 37.8 g/kg) (Fig 7C).

Calcium carbonates (CaCO3) content is closely and directly related to soil pH. Higher contents are found within arid climates at lower elevations (Dobrogea Plateau, eastern Romanian Plain) because of the lower rainfall and higher evapotranspiration values (Fig 9D). Also, as revealed by the regression equation (Table 4), CaCO3 is higher on steeper slopes because the erosion process brings to surface soil material from lower depths which is richer in carbonates. Moreover, there is a general trend of CaCO3 increase towards east as an effect of increasing aridity.

Nitrogen content (N) is directly related to the OC content, higher values being found in the mountain areas. Also, higher N values may be the consequence of nitrogen fertilizer applications on arable soils. Such values are found in northern Moldavian Plateau, Transylvanian Depression and Western Plain. However, most of the Romanian soils have medium N contents in the top layers (Fig 10A). [79] shows that the repartition of total nitrogen in the upper horizon of agricultural soils respects the same regularities as OM, having values of 0.2–3.5 g kg-1 in the case of Chernozems and Phaeozems, the nitrogen content increasing by 0.04–0.05 g kg-1 for every percentage of clay and decreasing by 0.2–0.21 g kg-1 for every increase in 1°C of the mean annual temperature.

thumbnail
Fig 10.

Optimal maps of macronutrients: a–nitrogen; b–potassium; c–phosphorous.

https://doi.org/10.1371/journal.pone.0289286.g010

Potassium content (K) spatial distribution is directly related to the clay content and to the application of fertilizers on agricultural soils. Several studies in Romania have shown that K content in soils is determined by the content and nature of potassium minerals from parent materials, their weathering, the intensity of leaching and bioaccumulation processes [79]. As displayed in Fig 10B, the Romanian topsoil is rich in potassium. High and very high (> 200 g/kg) K values are found in the lower plain and plateau areas, where soils are richer in clay and mostly used for agriculture, while medium and low K values are present in mountain soils.

Phosphorous content (P) is the least predictable soil variable. The spatial model displayed in Fig 10C was achieved by using all predictors and after the elimination of 4 outliers with values over 290 g kg-1. The problem of P values is acknowledged by the LUCAS authors themselves (https://esdac.jrc.ec.europa.eu/content/lucas2015-topsoil-data), the readme file stating that P contents below 10 mg kg-1 (the limit of detection) could not be quantified with enough uncertainty in the samples by the laboratory. The most important factors determining P content are the parent materials and the intensity of OM accumulation, which concentrates it into organic compounds [79]. The spatial distribution of topsoil P values also shows the relation with fertilization, the highest values being specific to areas of intensive agriculture. Overall, high P values are found in agricultural soils from the plain, hilly and plateau areas, and low P values in mountain soils but also in the Transylvanian Depression.

The particle-size fractions are mainly dependent on parent materials, but are also related to pedogenesis (Fig 11). Clay content values are higher in the lower, plain areas (Romanian Plain, Western Plain), as well as the lower parts of the Transylvanian Depression and Moldavian Plateau because of the higher clay content of parent materials. As elevation increases, clay content generally decreases presenting lower values in the mountain areas, because of the dominance of hard bedrocks and the depletion of clay from topsoil in the wetter mountain climate. Silt content presents higher values on loess and loess-like formations from Dobrogea Plateau and eastern Romanian Plain. Sand content is higher on sandy parent materials and sandstones. Also, higher contents in the mountain areas are the effect of pedogenesis which removes clay and humus from topsoil, allowing the residual accumulation of sand fraction.

thumbnail
Fig 11.

Optimal maps of particle-size parameters: a–clay; b–silt; c–sand.

https://doi.org/10.1371/journal.pone.0289286.g011

Fig 12 displays the interpolated errors (residual values) for the optimal spatial models of soil parameters. The errors were classified according to their magnitude, the yellow color indicating small error ranges. We notice that, for most of the Romanian territory, the spatial errors are small in the case of the chemical soil variables, excepting the K content, which ensures that the models are reliable. It is to be noted that the P content presents small errors for most of the country despite the low statistical significance of the model. In the case of particle-size fractions, the sand content seems to be the most reliable variable.

4. Discussions

Our study applies comparatively, the established (OK, RK), more recent (GWR, GWR-OK) and novel (EML) digital soil mapping techniques with the purpose of choosing the optimal one for each analyzed soil variable, based on independent sample validation. OK, RK and GWR are implemented in various GIS software (eg. Geostatistical Analyst in ArcGIS, Spatial and Geostatistics modules in SAGA or GRASS) and are therefore easier to apply. On the other hand, ML techniques are increasing in popularity, but are much more computationally demanding, making them difficult to use as a processing toolbox in GIS environment. Recent developments on deep learning, including the Generative Adversarial Networks (GAN) [80] and Graph Convolutional Neural Networks (GCN) [81] may prove useful to improve the prediction accuracy of the spatial models.

EML has the advantage of using multiple methods in order to improve performance, avoid overfitting and extrapolation problems. In our study, we assessed the relative performance of the 5 individual learners included in the landmap package by ranking them from 5 (the best performance) to 1 (the weakest performance) in each soil spatial model and by computing the average rank values (Fig 13). According to this measure, the best performance is associated with regr.ksvm (Support Vector Machines) method, followed closely by regr.ranger (Random Forests) method. The regr.ksvm method preformed best (and was assigned a rank value of 5) in 5 cases (EC, CaCO3, P, K, clay), while regr.ranger method had the same number of best performances but for the other 5 soil variables (pH, OC, N, silt, sand).

thumbnail
Fig 13. Relative performance of the individual learners included in landmap package.

https://doi.org/10.1371/journal.pone.0289286.g013

The average soil parameter values computed for soil reference groups of Romania [22, 21] generally confirm the reliability of the results (Table 5). As expected, pH values are the lowest in Podzols (4.13) and Umbrisols (4.59) because they develop in the wet and cool mountain climate and on acid parent rocks (eg. sandstones) and the highest on Kastanozems (7.89), developed on calcium-rich parent materials in the arid climate of Dobrogea Plateau, and Solonchaks (7.51) due to soluble salts saturation. The OC values are the highest also in mountain soils (Umbrisols– 59.00 g kg-1, Podzols– 57.61 g kg-1) because the restrictive climate conditions favor the accumulation of high amounts of weakly decomposed organic matter. Andosols have also high OC contents (53.69 g kg-1) due to their specific mineralogy, which favors the formation of organo-mineral compounds. The lowest OC contents are specific to Arenosols (16.76 g kg-1) because of the sandy texture. CaCO3 contents are higher in arid climate conditions, specifically in Kastanozems (49.75 g kg-1) because of the exudative water regime. Also, Regosols have high values due to the erosion process which exposes Ca-rich soil material. The lowest CaCO3 values are found in soils developed in wet mountain or plateau climates (Podzols, Planosols, Stagnosols, Umbrisols with average values under 2 g kg-1), where high precipitations led to an advanced debasification of soils. N content is higher in Vertisols and Gleysols (> 4 g kg-1) because of the higher contents of humified organic matter of these soils, while the lowest values are specific to Podzols (1.68 g kg-1) due to the intense soil profile depletion. K contents are higher in clay soils (Planosols, Solonetz, Solonchaks with values over 280 g kg-1) and agricultural soils (Chernozems– 262.59 g kg-1) and the lowest in Umbrisols (47.41 g kg-1) and Podzols (54.48 g kg-1) due to the intense profile depletion and weak clay formation processes. P content is somewhat similar to K content, higher values being specific to clay soils (Solonetz, Planosols with values over 22 g kg-1) and soils rich in organic matter (Histosols with 24.15 g kg-1), and lower values (10–15 g kg-1) to acid soils like Umbrisols and Podzols and incipient developed soils like Leptosols and Regosols.

thumbnail
Table 5. Average values of soil parameters for soil reference groups of Romania.

https://doi.org/10.1371/journal.pone.0289286.t005

As expected, clay content is higher in Vertisols (48.72%), Solonetz (46.53%), Planosols (47.16%) and lower in Podzols (28.05%) and Umbrisols (29.43%). High silt values are specific to soils developed on loess and loess-like formations (Kastanozems– 54.50%), while sand content is higher in mountain soils due either to residual accumulation of sand fraction (Podzols– 45.03%) or to soil development on sandstones (part of the Cambisols– 51.08%).

There are also some inconsistencies which need to be noted: the lower sand content in Arenosols and the lower EC values in saline soils like Solonchaks. Regarding the Arenosols, we notice that there is agreement between the presence of these soils, as indicated by the 1:200k soil map of Romania (Fig 1) and the high sand content (Fig 9C) in the south-western part of the country, along the Danube River, and also in the north-western part of the country, along the border. In other areas, like eastern Romanian Plain, Danube Delta and Black Sea coast, the presence of Arenosols is not revealed by the higher sand content values. On one hand, this is because Arenosols occur in small patches and the LUCAS point samples didn’t overlap them (like in the eastern Romanian Plain), and on the other hand, because the Danube Delta and the northern seaside were not sampled. This is also the case of Solonchaks, which also occur in small patches and along the Black Sea coast and in the Danube Delta. Further updates of LUCAS database could take into account our findings and sample these problematic areas.

Our results generally agree with the ones achieved by [49] within a similar study conducted at continental scale using Gaussian Process Regression. For the Romanian territory, the spatial distributions of pH, K and N present similar patterns, the R2 values being of 0.631, 0.399 and 0.314 respectively. There is less agreement in the case of CaCO3 (R2 = 0.103), while there is little agreement for the spatial distribution of P content (R2 = 0.062). The models we achieved have lower RMSE values in the case of CaCO3 (21.366 vs 78.29 g kg-1), N content (0.873 vs 2.40 g kg-1), and higher RMSE values for P (18.845 vs 17.52 mg kg-1), K (203.697 vs 121.89 mg kg-1) contents and pH (0.916 vs 0.78). However, it is to be noted that these RMSE values are not readily comparable because of the scale difference (continental vs. country) and the different validation procedures (k-fold cross-validation vs. independent sample validation).

5. Conclusions

Soil cover is a complex natural system resulting from the integration of both abiotic (rock, water, air) and biotic (plants and fauna) components. This complexity, along with the human interventions on soils manifested especially through agricultural practices, makes it difficult to predict its properties. Our study tested several DSM methods for derivation of soil property maps based on LUCAS topsoil database. The independent validation procedure, which reflects the prediction power of models, showed that EML and RK were the optimal methods in most cases. The use of EML for spatial interpolation can produce better results than kriging or other spatial interpolation method but it requires a certain amount of intervention either to reduce the artefacts through model fine-tuning or to improve the performance of the model. Also, the method is computationally demanding and thus difficult to use as a processing toolbox in GIS software. The most predictable soil property was pH, followed by OC and CaCO3, while the predictive power was less for K and clay content and statistically unsignificant in the case of P. The results are in agreement with the general distribution of soil major groups in Romania. Our models use a minimum number of quantitative predictors in order to make the application straightforward. Some improvements could be achieved by integration of qualitative predictors such as the surface lithology. LUCAS proved to be a reliable and useful soil database, with a proper sample density and relevant topsoil parameters recently acquired. However, a special attention must be paid to the less sampled mountain area and the unsampled Danube Delta in order to avoid possible extrapolation errors. The soil property maps we developed can be used for regional soil studies, their validity at local scale needing further investigation.

References

  1. 1. Savin IY, Zhogolev AV, Prudnikova EY. Modern Trends and Problems of Soil Mapping. Eurasian Soil Sci. 2019; 52: 471–480. https://doi.org/10.1134/S1064229319050107.
  2. 2. Mulder VL, Lacoste M, Richer-de-Forges AC, Arrouays D. GlobalSoilMap France: high-resolution spatial modelling the soils of France up to two meter depth. Sci Total Environ. 2016; 573: 1352–1369. https://doi.org/10.1016/j.scitotenv.2016.07.066 pmid:27436772
  3. 3. Adhikari K, Kheir RB, Greve MB, Bøcher PK, Malone BP, Minasny B, et al. High-resolution 3-D mapping of soil texture in Denmark. Soil Sci Soc Am J. 2013; 77: 860–876.
  4. 4. Hengl T, Heuvelink GBM, Kempen B, Leenaars JGB, Walsh MG, Shepherd K. Mapping soil properties of Africa at 250 m resolution: random forests significantly improve current predictions. PLoS One. 2015; 10(6): 1–26. https://doi.org/10.1371/journal.pone.0125814 pmid:26110833
  5. 5. Ballabio C, Panagos P, Montanarella L. Mapping topsoil physical properties at European scale using the LUCAS database. Geoderma. 2016; 261: 110–123.
  6. 6. Hengl T., Mendes de Jesus J., Heuvelink G. B. M., Ruiperez Gonzalez M., Kilibarda M., Blagotić A, et al. SoilGrids250m: global gridded soil information based on machine learning. PLoS ONE. 2017; 12(2): e0169748. pmid:28207752
  7. 7. Ma Y, Minasty B, Malone B, McBratney AB. Pedology and digital soil mapping (DSM). Eur J Soil Sci. 2019; 70(2): 216–235. https://doi.org/10.1111/ejss.12790.
  8. 8. Behrens T, Förster H, Scholten T, Steinrücken U, Spies ED, Goldschmitt M. Digital soil mapping using artificial neural networks. J Plant Nutr Soil Sc. 2005; 168: 21–33.
  9. 9. Illés G, Kovács G, Heil B. Comparing and evaluating digital soil mapping methods in a Hungarian forest reserve. Can. J. Soil Sci. 2011; 91(4): 615–626. https://doi.org/10.4141/cjss2010-007.
  10. 10. Häring T, Dietz E, Osenstetter S, Koschitzki T, Schröder B. Spatial disaggregation of complex soil map units: A decision-tree based approach in Bavarian forest soils. Geoderma. 2021; 185–186: 37–47. https://doi.org/10.1016/j.geoderma.2012.04.001.
  11. 11. Patriche CV, Rosca B, Pirnau RG, Stoica DL. Spatial prediction of soil qualitative variables using logistic regression and fuzzy techniques. Study region: Dobrovăţ basin (Central Moldavian Plateau), Georeview 2012; 21(1): 46–55.
  12. 12. Adhikari K, Minasny B, Greve MB, Greve MH, 2014. Constructing a soil class map of Denmark based on the FAO legend using digital techniques. Geoderma. 2014; 214–215: 101–113.
  13. 13. Camera C, Zomeni Z, Noller JS, Zissimos AM, Christoforou IC, Bruggeman A. A high resolution map of soil types and physical properties for Cyprus: A digital soil mapping optimization. Geoderma. 2017; 285: 35–49.
  14. 14. Lamichhane S, Kumar L, Wilson B. Digital soil mapping algorithms and covariates for soil organic carbon mapping and their implications: A review. Geoderma. 2019; 352: 395–413. https://doi.org/10.1016/j.geoderma.2019.05.031
  15. 15. Veronesi F, Schillaci C. Comparison between geostatistical and machine learning models as predictors of topsoil organic carbon with a focus on local uncertainty estimation. Ecol Indic. 2019; 101: 1032–1044. https://doi.org/10.1016/j.ecolind.2019.02.026
  16. 16. Rodríguez-Lado L, Martínez-Cortizas A. Modelling and mapping organic carbon content of topsoils in an Atlantic area of southwestern Europe (Galicia, NW-Spain). Geoderma. 2015; 245–246: 65–73. https://doi.org/10.1016/j.geoderma.2015.01.015.
  17. 17. Ottoy S, Van Meerbeek K, Sindayihebura A, Hermy M, Van Orshoven J. Assessing top- and subsoil organic carbon stocks of low-input high-diversity systems using soil and vegetation characteristics. Sci Total Environ. 2017; 589: 153–164. https://doi.org/10.1016/j.scitotenv.2017.02.116 pmid:28258751
  18. 18. Lacoste M, Mulder VL, Richer-de-Forges AC, Martin MP, Arrouays D. Evaluating large-extent spatial modeling approaches: A case study for soil depth for France. Geoderma Regional. 2016; 7(2): 137–152. https://doi.org/10.1016/j.geodrs.2016.02.006.
  19. 19. Nussbaum M, Spiess K, Baltensweiler A, Grob U, Keller A, Greiner L., et al. Evaluation of digital soil mapping approaches with large sets of environmental covariates. SOIL. 2018; 4: 1–22. https://doi.org/10.5194/soil-4-1-2018.
  20. 20. Vaysse K, Lagacherie P. Using quantile regression forest to estimate uncertainty of digital soil mapping products. Geoderma. 2017; 291: 55–64. https://doi.org/10.1016/j.geoderma.2016.12.017.
  21. 21. Vintila R, Munteanu I, Cojocaru G, Radnea C, Turnea D, Curelariu G, et al. The Geographic Information System of the Soil Resources of Romania SIGSTAR-200: Development and main types of applications, Proc. XVII-th Romanian National Conference of Soil Science, 2004; 34A: 439–449 (in Romanian).
  22. 22. Florea N, Bălăceanu V, Munteanu I, Asvadurov H, Conea A., Oancea C, et al. (Eds.). Harta Solurilor României, scara 1:200,000 (Soil Map of Romania, scale 1:200,000). 1963–1993; Geological Institute/IGFCOT, Bucharest, 50 sheets.
  23. 23. Vintila R, Radnea C. First steps in Romania converging toward the GlobalSoilMap specifications. In: Arrouays D, Savin ., Leenaars J, McBratney AB, editors. GlobalSoilMap-Digital Soil Mapping from Country to Globe. CRC Press; 2018. pp. 35–41.
  24. 24. Tanasă IC, Niculită M, Roşca B, Pîrnău RG. Pedometric techniques in spatialisation of soil properties for agricultural land evaluation. Bulletin of University of Agricultural Sciences and Veterinary Medicine Cluj-Napoca. Agriculture, 2010; 67(1).
  25. 25. Patriche CV, Pîrnău RG, Roşca B. Statistical Spatial Models for Soil Parameters: A Comparative Analysis of Several Methods Applied to Dobrovăţ Basin (Eastern Romania), Commun Soil Sci Plan. 2013; 44 (1–4): 104–112.
  26. 26. Vasiliniuc I, Patriche CV, Pîrnău R, Roşca B. Statistical spatial models of soil parameters. An approach using different methods at different scales, Environ Eng Manag J. 2013; 12(3): 457–464.
  27. 27. Rogozan GC, Micle V, Sur IM. Maps of heavy metals in Cluj County soils developed using the regression-kriging method. Environ Eng Manag J. 2016; 15(5): 1035–1039.
  28. 28. Drăguţ L, Dornik A. Land-surface segmentation as a method to create strata for spatial sampling and its potential for digital soil mapping. Int J Geogr Inf Sci. 2016; 30 (7): 1359–1376.
  29. 29. Pîrnău RG, Patriche CV, Roşca B, Vasiliniuc I, Vornicu N, Simina S. Soil spatial patterns analysis at the ancient city of Ibida (Dobrogea, SE Romania), via portable X-ray fluorescence spectrometry and multivariate statistical methods, Catena. 2020; 189: 104506.
  30. 30. Burgess TM, Webster R. Optimal interpolation and isarithmic mapping of soil properties. I. The semivariogram and punctual kriging. J Soil Sci. 1980; 31: 315–331.
  31. 31. Yost RS, Uehara G, Fox R. Geostatistical Analysis of Soil Chemical Properties of Large Land Areas. II. Kriging. Soil Sci Soc Am. 1982; 46(5): 1028–1032.
  32. 32. McBratney AB, Mendonça Santos ML, Minasny B. On digital soil mapping. Geoderma. 2003; 117: 3–52.
  33. 33. Odeh IOA, McBratney AB, Chittleborough DJ. Spatial prediction of soil properties from landform attributes derived from a digital elevation model. Geoderma. 1994; 63: 197–214. https://doi.org/10.1016/0016-7061(94)90063-9.
  34. 34. Hengl T, Heuvelink GBM, Stein A. A generic framework for spatial prediction of soil variables based on regression-kriging. Geoderma. 2004; 120: 75–93.
  35. 35. Hengl T. A Practical Guide to Geostatistical Mapping of Environmental Variables. JRC Scientific and Technical Research series, Luxembourg: Office for Official Publications of the European Communities. EUR 22904 EN; 2007
  36. 36. Keskin H, Grunwald S. Regression kriging as a workhorse in the digital soil mapper’s toolbox. Geoderma. 2018; 326: 22–41.
  37. 37. Lagacherie P. Digital soil mapping: a state of the art. In: Hartemink AE, McBratney A, de Lourdes Mendonça-Santos M, editors. Digital Soil Mapping with Limited Data. Dordrecht, Netherlands: Springer; 2008, pp. 3–14.
  38. 38. Minasny B, McBratney AB. Digital soil mapping: A brief history and some lessons. Geoderma. 2016; 264 (B): 301–311.
  39. 39. Rostaminia M, Rahmani A, Mousavi SR, Taghizadeh-Mehrjardi R, Maghsodi Z. Spatial prediction of soil organic carbon stocks in an arid rangeland using machine learning algorithms. Environ Monit Assess. 2021; 193: 815. https://doi.org/10.1007/s10661-021-09543-8 pmid:34787728
  40. 40. Matinfar HR, Maghsodi Z, Mousavi SR, Rahmani A. Evaluation and Prediction of Topsoil organic carbon using Machine learning and hybrid models at a Field-scale. Catena. 2021; 202: 105258. https://doi.org/10.1016/j.catena.2021.105258.
  41. 41. Mousavi SR, Sarmadian F, Omid M, Bogaert P. Three-dimensional mapping of soil organic carbon using soil and environmental covariates in an arid and semi-arid region of Iran. Measurement 2022; 201: 111706. https://doi.org/10.1016/j.measurement.2022.111706
  42. 42. Wadoux AMJ, Minasny B, McBratney AB. Machine learning for digital soil mapping: Applications, challenges and suggested solutions. Earth-Sci Rev. 2020; 210: 103359. https://doi.org/10.1016/j.earscirev.2020.103359.
  43. 43. Tóth G., Antofie TE, Jones A, Apostol B. The LUCAS 2012 topsoil survey and derived cropland and grassland soil properties of Bulgaria and Romania, Environ Eng Manag J. 2016; 15(12): 2651–2662.
  44. 44. Panagos P, Meusburger K, Ballabio C, Borrelli P, Alewell C. Soil erodibility in Europe: a high‐resolution dataset based on LUCAS. Sci. Total Environ. 2014; 479–480: 189–200. pmid:24561925
  45. 45. de Brogniez D, Ballabio C, Stevens A, Jones R, Montanarella L, van Wesemael B. A map of the topsoil organic carbon content of Europe generated by a generalized additive model. Eur J Soil Sci. 2015; 66: 121–134
  46. 46. Panagos P, Ballabio C, Yigini Y, Dunbar MB. Estimating the soil organic carbon content for European NUTS2 regions based on LUCAS data collection. Sci Total Environ. 2013; 442: 235–246. pmid:23178783
  47. 47. Ballabio C, Panagos P, Lugato E, Huang JH, Orgiazzi A, Jones A, et al. Copper distribution in European topsoils: An assessment based on LUCAS soil survey. Sci Total Environ. 2018; 636: 282–298. https://doi.org/10.1016/j.scitotenv.2018.04.268 pmid:29709848
  48. 48. Toth G, Jones A, Montanarella L. LUCAS Topsoil Survey. Methodology, data and results. Luxembourg: EUR 26102, Publications Office of the European Union; 2013. https://doi.org/10.2788/97922
  49. 49. Ballabio C, Lugato E, Fernández-Ugalde O, Orgiazzi A, Jones A, Borrelli P., et al. Mapping LUCAS topsoil chemical properties at European scale using Gaussian process regression. Geoderma. 2019; 355: Article 113912. pmid:31798185
  50. 50. Castaldi F, Chabrillat S, Chartin C, Genot V, Jones AR, van Wesemael B. Estimation of soil organic carbon in arable soil in Belgium and Luxembourg with the LUCAS topsoil database. Eur J Soil Sci. 2018; 69(4): 592–603.
  51. 51. Kassai P, Tóth G. Agricultural Soil Phosphorus in Hungary: High Resolution Mapping and Assessment of Socioeconomic and Pedological Factors of Spatiotemporal Variability. Sustainability. 2020; 12(13): 5311. https://doi.org/10.3390/su12135311.
  52. 52. Yigini Y, Panagos P. Assessment of soil organic carbon stocks under future climate and land cover changes in Europe. Sci Total Environ. 2016; 557: 838–850. https://doi.org/10.1016/j.scitotenv.2016.03.085 pmid:27082446
  53. 53. Dumitrescu A, Bîrsan MV. ROCADA: a gridded daily climatic dataset over Romania (1961–2013) for nine meteorological variables. Nat Hazards. 2015; 78(2): 1045–1063.
  54. 54. Orgiazzi A, Ballabio C, Panagos P, Jones A, Fernández-Ugalde O. LUCAS Soil, the largest expandable soil dataset for Europe: a review. Eur J Soil Sci. 2018; 69: 140–153. https://doi.org/10.1111/ejss.12499.
  55. 55. Jones A, Fernandez-Ugalde O, Scarpa S. LUCAS 2015 Topsoil Survey. Presentation of dataset and results, Luxembourg: EUR 30332 EN, Publications Office of the European Union; 2020. https://doi.org/10.2760/616084.
  56. 56. Fernández-Ugalde O, Jones A, Meuli RG. Comparison of sampling with a spade and gouge auger for topsoil monitoring at the continental scale. Eur J Soil Sci. 2020; 71: 137–150.
  57. 57. Vasiliniuc I, Patriche CV. Validating soil bulk density pedotransfer functions using a Romanian dataset, Carpath J Earth Env. 2015; 10(2): 225–236.
  58. 58. Posea G, Badea L. 1984. Romania. Relief units. Geomorphological zoning. Scientific and Encyclopedic Publishing, Bucharest.
  59. 59. Goovaerts P. Geostatistics for natural resources evaluation. Oxford University Press; 1997.
  60. 60. Mueller TG, Pusuluri NB, Mathias KK, Cornelius PL, Barnhisel RI, Shearer SA. Map Quality for Ordinary Kriging and Inverse Distance Weighted Interpolation, Soil Science of America Journal. 2004; 68(6): 2042–2047. https://doi.org/10.2136/sssaj2004.2042.
  61. 61. Zhu Q, Lin HS. Comparing ordinary kriging and regression kriging for soil properties in contrasting landscapes. Pedosphere. 2010; 20(5): 594–606.
  62. 62. Wadoux AM, Marchant BP, Lark RM. Efficient sampling for geostatistical surveys. Eur J Soil Sci. 2019; 70: 975–989. https://doi.org/10.1111/ejss.12797.
  63. 63. Fotheringham AS, Brunsdon C, Charlton, M. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Wiley; 2002.
  64. 64. Song XD, Brus DJ, Liu F, Li DC, Zhao YG, Yang JL et al. Mapping soil organic carbon content by geographically weighted regression: A case study in the Heihe River Basin, China. Geoderma. 2016; 261: 11–22.
  65. 65. Costa EM, Tassinari WS, Pinheiro HSK, Beutler SJ, dos Anjos LHC. Mapping soil organic carbon and organic matter fractions by geographically weighted regression. J Environ Qual. 2018; 47 (4): 718–725. pmid:30025066
  66. 66. Seni G, Elder JF. Ensemble methods in data mining: Improving accuracy through combining predictions. Morgan & Claypool Publishers; 2010. https://doi.org/10.2200/S00240ED1V01Y200912DMK002
  67. 67. Zhang C, Ma Y. Ensemble machine learning: Methods and applications. New York: Springer; 2012.
  68. 68. Hengl T. landmap: Automated Spatial Prediction using Ensemble Machine Learning. R package version 0.0.13. 2021 Available from: https://CRAN.R-project.org/package=landmap.
  69. 69. Bischl B, Lang M, Kotthoff L, Schiffner J, Richter J, Studerus E, et al. mlr: Machine Learning in R. J Mach Learn Res. 2016; 17: 1–5.
  70. 70. Møller AB, Beucher AM, Pouladi N, Greve MH. Oblique geographic coordinates as covariates for digital soil mapping. SOIL. 2020; 6(2): 269–289.
  71. 71. Hengl T, Nussbaum M, Wright MN, Heuvelink GB, Gräler B. Random Forest as a generic framework for predictive modeling of spatial and spatio-temporal variables. PeerJ. 2018; 6:e5518. https://doi.org/10.7717/peerj.5518 pmid:30186691
  72. 72. Hengl T, MacMillan RA. Predictive Soil Mapping with R. OpenGeoHub foundation, Wageningen, the Netherlands; 2019. Available from: www.soilmapper.org
  73. 73. Polley EC, van der Laan MJ. Super learner in prediction. U.C. Berkeley Division of Biostatistics Working Paper Series. 2010; Paper 266. https://biostats.bepress.com/ucbbiostat/paper266.
  74. 74. Bai SB, Wang J, Lü GN, Zhou PG, Hou SS, Xu SN. GIS-based logistic regression for landslide susceptibility mapping of the Zhongxian segment in the Three Gorges area, China, Geomorphol. 2010; 115: 23–31.
  75. 75. Dormann CF, Elith J, Bacher S, Buchmann C, Carl G, Carre G. Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography. 2013; 36: 27–36.
  76. 76. Gonzales JP, Jarvis A, Cook SE, Oberthür T, Rincon-Romero M, Bagnell AJ, Dias MB. Chapter 33. Digital soil mapping of soil properties in Honduras using readily available biophysical datasets and Gaussian Process. In: Hartemink AE, McBratney A, de Lourdes Mendonça-Santos M, editors. Digital Soil Mapping With Limited Data. Springer Science & Business Media; 2008. pp. 367–380.
  77. 77. ICPA. Metodologia elaborării studiilor pedologice. Partea a III-a–Indicatorii ecopedologici (Methodology for elaboration of soil studies. Part III–Eco-pedological indicators). Bucharest, Romania: ICPA; 1987.
  78. 78. Patriche CV. Quantitative assessment of rill and interrill soil erosion in Romania. Soil Use Manage. 2019; 35: 257–272. https://doi.org/10.1111/sum.12475.
  79. 79. Vintilă I, Borlan Z, Răuţă C, Daniliuc D, Ţigănaş. Situaţia agrochimică a solurilor din România (Agrochemical state of soils in Romania). Bucharest: Ceres; 1984.
  80. 80. Zhu D, Cheng X Zhang F, Yao X, Gao Y, Liu Y. Spatial interpolation using conditional generative adversarial neural networks. Int J Geogr Inf Sci. 2020; 34(4): 735–758.
  81. 81. Zhu D, Zhang F, Wang S, Wang Y, Cheng X, Huang Z, Liu Y. Understanding place characteristics in geographic contexts through graph convolutional neural networks. Ann Am Assoc Geogr; 110(2): 408–420.