1 Introduction

Algeria covers a large area, most of which is the Sahara. Nomadic populations live in its sunny arid and semi-arid regions. With its very large solar resource, Algeria has considerable opportunities for developing this sector. The country should therefore put in place an incentive policy for the operation and popularization of solar devices. The use of these technologies will open new prospects, preserve current reserves, and provide an alternative to oil and gas, both as a source of national income and as a source of energy.

The Sun emits electromagnetic radiation in the wavelength band from 0.22 to 10 μm. The terrestrial atmosphere receives this radiation at an average power of 1.37 kW/m² (± 3 %). The amount of energy reaching the Earth’s surface rarely exceeds 1200 W/m². Because of the rotation and tilt of the Earth, the energy available at a given point varies with latitude, time of day, and season. Clouds, fog, atmospheric particles, and other weather phenomena cause hourly and daily variations that increase or decrease the solar radiation.

Geothermal energy is a form of renewable energy that consists in extracting the heat stored in the soil, either for electricity production (high-temperature geothermal energy; Bastola and Peterson 2016) or for heating (low-temperature geothermal energy; Shamshirband et al. 2015). The temperature of the soil depends on the depth at which it is measured, as well as on factors such as solar radiation, ambient temperature, and wind speed. The solar radiation received at the Earth’s surface depends on the climatic conditions of a location and on the geological characteristics of the studied area (Wang and Bras 1999). An optimal use of solar energy (Yacef et al. 2014) requires accurate knowledge of solar radiation at a particular location. Nevertheless, these data are not always available, particularly in isolated areas. For this reason, several approaches have been developed in the literature for modeling (Yuan et al. 2008; Chan et al. 2013) and predicting soil temperature. Accurate measurement of soil temperature is a difficult task. Heat flux plates can be used to make direct measurements of soil temperature (Ni et al. 2014; Wang et al. 2011). Candidate data-driven approaches include Gaussian process regression (GPR), the relevance vector machine, and other methods. Soft sensors based on support vector regression (SVR), least squares SVR (LSSVR), and GPR have attracted more attention recently because of their nonlinear modeling ability. However, the selection of suitable parameters for an SVR/LSSVR model is still difficult. Compared with SVR/LSSVR, the GPR model can optimize its parameters automatically. Additionally, GPR can simultaneously provide probabilistic information for its prediction, which is an appealing property in the process modeling area (Yi and Gao 2015).

However, measuring instruments usually need to be placed at a certain depth in the soil, normally a few centimeters below the surface, to avoid disturbances. Several single Gaussian process regression (GPR) models are first constructed for each steady-state grade (Yi et al. 2015). The objective of this work is to develop a simple method to model soil temperature based only on air temperature using Gaussian process regression. The prediction can be achieved using the related steady-state GPR model if the reliability of this model is large enough (Yi et al. 2015). To the best of our knowledge, this is the first work that uses GPR to estimate the DST based only on DAR. The rest of this paper is organized as follows: Sect. 2 presents the site location and data collection. Sect. 3 presents the theory of GPR, and Sect. 4 presents the model validation. Experimental results are presented in Sect. 5, and Sect. 6 concludes the paper and suggests future work.

2 Site location and data collection

The experimental data used in this work (solar radiation, temperature, etc.) were collected at the Applied Research Unit for Renewable Energies (URAER), situated in the south of Algeria (Fig. 1), far from the city of Ghardaïa, at latitude +32.37°, longitude +3.77°, and an altitude of 450 m above mean sea level.

Fig. 1 Site location of Ghardaïa city. a Algeria area, b Ghardaïa area, c relief (3D) of Ghardaïa area

The landscape is characterized by a vast rocky plateau where bare, blackish-brown rock outcrops; the thermal diffusivity of the soil (limestone) is 8.39 × 10⁻⁷ m²/s. This plateau is marked by the strong river erosion of the early Quaternary, which cut flat-topped buttes and valleys into its southern part. The climate of the Ghardaïa region is semi-arid, with air temperature ranging from 14 to 47 °C in the summer months and from 2 to 37 °C in the winter months. The daily global solar radiation (GSR) varies between a minimum of 607 Wh/m²/day and a maximum of 7574 Wh/m²/day, and the annual mean daily GSR is about 5656 Wh/m²/day (Şenkal and Kuleli 2009). The data are recorded every 5 min with high precision by a radiometric station installed at URAER (Fig. 2).

Fig. 2 Instruments of measurement. Photo of the Applied Research Unit for Renewable Energies (URAER), Ghardaïa. a pyrheliometer, b pyranometer

The prediction model uses the data collected between 2005 and 2008. The daily evolution of MDSR is shown in Fig. 3.

Fig. 3 Daily evolution of normalized soil temperature
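The step from 5-min records to a normalized daily series can be sketched as follows. This is an illustration only: the array `samples_5min` is synthetic, the function names are hypothetical, and min-max scaling is an assumption, since the text does not state which normalization was used. The figure of 288 records per day follows from the 5-min recording interval.

```python
import numpy as np

SAMPLES_PER_DAY = 24 * 60 // 5  # 288 records per day at a 5-min interval


def daily_means(samples, per_day=SAMPLES_PER_DAY):
    """Average consecutive 5-min records into one value per complete day."""
    samples = np.asarray(samples, dtype=float)
    n_days = samples.size // per_day  # an incomplete trailing day is dropped
    return samples[: n_days * per_day].reshape(n_days, per_day).mean(axis=1)


def min_max_normalize(values):
    """Scale values linearly to [0, 1] (assumed normalization)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)


# Synthetic stand-in for measured data: 3 days of 5-min temperatures (deg C).
rng = np.random.default_rng(0)
samples_5min = 25.0 + 10.0 * rng.random(3 * SAMPLES_PER_DAY)

daily = daily_means(samples_5min)
normalized = min_max_normalize(daily)
```

Other scalings, such as standardization to zero mean and unit variance, would fit the same pipeline; the choice mainly affects the scale of the kernel hyperparameters.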

3 Theory of Gaussian process regression (GPR)

Gaussian process regression (GPR) has become an increasingly powerful statistical tool for data-driven modeling. GPR models are a Bayesian nonparametric approach that can be applied to supervised machine learning (ML) problems, both classification and regression. GPR has been applied to response surface modeling, system identification, calibration of spectroscopic analyzers (Vapnik and Vapnik 1998; Guermoui et al. 2013) and ensemble learning. The main idea of GPR modeling is to place a prior directly on the space of functions. The combination of the prior and the data leads to the posterior distribution over functions. In this section, we focus on using the GPR approach for modeling (Sozen et al. 2004) the DGSR in a semi-arid area. Let us consider an input vector \({\text{x}}\) containing \({\text{d}}\) variables. In the machine-learning approach, the main objective is to learn the functional relationship between the d-dimensional input \(( {\text{x}} \in {\mathbb{R}}^{\text{d}} )\) and the output variable y:

$$y = f(x)$$
(1)

where \({\mathbb{R}}\) denotes the real space and f the unknown function. The unknown function f can be approximated by the following linear combination of basis functions:

$$\hat{f}\left( {x,w} \right) = \mathop \sum \limits_{j = 1}^{M} w_{j} \phi_{j} \left( x \right)$$
(2)

where \(\left\{ {\phi_{j} \left( x \right)} \right\}_{j = 1}^{M}\) represents a set of basis functions, which can be linear or nonlinear, and \({\text{w}} = \left[ {{\text{w}}_{1} , \ldots ,{\text{w}}_{\text{M}} } \right]^{\text{T}}\) is the unknown weight vector for the M basis functions of f. Including an error term gives:

$$y = \mathop \sum \limits_{j = 1}^{M} w_{j} \phi_{j} \left( x \right) + \varepsilon ,$$
(3)
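As an illustration of the basis-function model in Eq. (3), the following minimal sketch estimates the weights w by ordinary least squares. The polynomial basis, the function names, and the synthetic data are all assumptions made for the example; the text does not prescribe a particular basis. Indices are zero-based in the code, whereas the equations count from j = 1.

```python
import numpy as np


def design_matrix(x, n_basis):
    """Column j holds phi_j(x) = x**j for j = 0..n_basis-1 (assumed basis)."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=n_basis, increasing=True)


def fit_weights(x, y, n_basis):
    """Least-squares estimate of w in y = sum_j w_j * phi_j(x) + eps."""
    phi = design_matrix(x, n_basis)
    w, *_ = np.linalg.lstsq(phi, y, rcond=None)
    return w


# Synthetic, noise-free data generated from y = 1 + 2x + 3x^2.
x_train = np.linspace(-1.0, 1.0, 20)
y_train = 1.0 + 2.0 * x_train + 3.0 * x_train**2

w_hat = fit_weights(x_train, y_train, n_basis=3)
y_fit = design_matrix(x_train, 3) @ w_hat
```

In this parametric setting the number and form of the basis functions must be chosen in advance, which is the design burden that a kernel-based formulation avoids.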

In Eq. (3), \(\upvarepsilon\) represents the error term. A wide range of linear and nonlinear regression models use a set of training data \(D = \left\{ {x_{i} ,y_{i} } \right\}_{i = 1}^{N}\) of N observations to estimate the unknown weights w. The basis functions \(\phi_{j} \left( x \right)\) can be seen as a transformation of the data from the original space into a high-dimensional space, which is not the case in GPR models, as will be shown below. Suykens and Vandewalle (1999) and Williams and Rasmussen (2006) note that the basic building block of GPR is a Gaussian process (GP), which assumes Gaussian priors for the function values and is specified by its second-order statistics:

$$f\left( x \right) \sim GP\left( {m\left( x \right),k\left( {x,x^{{\prime }} } \right)} \right)$$
(4)

where \(m\left( x \right)\) and \(k\left( {x,x^{\prime } } \right)\) represent the mean and the covariance function of f. By definition, a GP is a collection of random variables, any finite number of which have a joint Gaussian distribution (Dong et al. 2005). Under the GP, the prior distribution of f is Gaussian:

$$p\left( {f|X,\theta } \right) \sim {\mathcal{N}}\left( {0,K} \right)$$
(5)

The mean of f is assumed to be zero, and the N × N matrix K is the covariance matrix of f, with its hyperparameters denoted by \(\uptheta\).

If the error term \(\upvarepsilon\) in Eq. (3) is independent and identically Gaussian distributed, the likelihood function of the training targets is also Gaussian:

$$p\left( {y|f,\sigma^{2} } \right) \sim {\mathcal{N}}\left( {f,\sigma^{2} I} \right)$$
(6)

where \(\upsigma^{2}\) and I denote the variance of the model error and the identity matrix, respectively. The posterior distribution of f can then be obtained by applying Bayes’ rule:

$$p\left( {f|y,X,\theta ,\sigma^{2} } \right) = \frac{{p\left( {y|f,\sigma^{2} } \right)p\left( {f|X,\theta } \right)}}{{p\left( {y|X,\theta ,\sigma^{2} } \right)}}$$
(7)

Note that the posterior distribution of f is also Gaussian, since both the prior and the likelihood function are Gaussian. Following Suykens and Vandewalle (1999), the mean and covariance of the posterior distribution are given by:

$$\mu = K^{T} \left( {K + \sigma^{2} I} \right)^{ - 1} \,y$$
(8)
$$\varSigma = K - K^{T} \left( {K + \sigma^{2} I} \right)^{ - 1} \,K$$
(9)

We note that the covariance function \({\text{K }}\left( {.{,}.} \right)\) is referred to as the kernel function in machine learning. In the GPR literature, commonly used kernel functions include the squared exponential, or Gaussian, kernel (Suykens and Vandewalle 1999):

$$k\left( {x,x^{\prime } |\theta } \right) = \sigma_{f}^{2} \exp \left( { - \frac{{r^{2} }}{{2l^{2} }}} \right),\quad \theta = \left\{ {l,\sigma_{f}^{2} } \right\}$$
(10)

and the Matérn family of covariance functions, where \(K_{\nu }\) denotes the modified Bessel function of the second kind:

$$k\left( {x,x^{\prime } |\theta } \right) = \sigma_{f}^{2} \frac{{2^{1 - \nu } }}{{\varGamma \left( \nu \right)}}\left( {\frac{{\sqrt {2\nu } \,r}}{l}} \right)^{\nu } K_{\nu } \left( {\frac{{\sqrt {2\nu } \,r}}{l}} \right),\quad \theta = \left\{ {\nu ,l,\sigma_{f}^{2} } \right\}$$
(11)
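The two kernels can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in this work: the function and parameter names are hypothetical, and the Matérn kernel is shown only for the special case ν = 3/2, where Eq. (11) reduces to the closed form σ_f² (1 + √3 r/l) exp(−√3 r/l) and no Bessel function is needed. Here r is the Euclidean distance between two input points.

```python
import numpy as np


def pairwise_distance(a, b):
    """Euclidean distances r between rows of a (n, d) and rows of b (m, d)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))


def squared_exponential(a, b, length_scale=1.0, signal_var=1.0):
    """Squared exponential kernel of Eq. (10)."""
    r = pairwise_distance(a, b)
    return signal_var * np.exp(-(r**2) / (2.0 * length_scale**2))


def matern32(a, b, length_scale=1.0, signal_var=1.0):
    """Matern kernel of Eq. (11) for the special case nu = 3/2."""
    s = np.sqrt(3.0) * pairwise_distance(a, b) / length_scale
    return signal_var * (1.0 + s) * np.exp(-s)


# Five one-dimensional points, stored as a column so each row is one input.
x = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
K_se = squared_exponential(x, x, length_scale=0.5, signal_var=2.0)
K_m32 = matern32(x, x, length_scale=0.5, signal_var=2.0)
```

Both matrices are symmetric with σ_f² on the diagonal; the Matérn kernel with ν = 3/2 decays more slowly near r = 0 and corresponds to rougher functions than the squared exponential.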

In Eqs. (10) and (11), the term \(r = \left| {x - x^{\prime } } \right|\) denotes the Euclidean distance between two points and \({\varvec{\uptheta}}\) represents the hyperparameters associated with each covariance function. The noise variance \(\left( {\upsigma^{2} } \right)\) is an additional parameter that is determined during the training phase. The marginal probability distribution can be obtained by integrating over the function f (Dong et al. 2005):

$$p\left( {y|X} \right) = \int {p\left( {y|f,\sigma^{2} } \right)p\left( {f|X,\theta } \right)\,df}$$
(12)

The log marginal likelihood is obtained:

$$\log p\left( {y|X} \right) = - \frac{1}{2}y^{T} \left( {K + \sigma^{2} I} \right)^{ - 1} y - \frac{1}{2}\log \left| {K + \sigma^{2} I} \right| - \frac{N}{2}\log \left( {2\pi } \right)$$
(13)
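A numerically stable way to evaluate the log marginal likelihood is through a Cholesky factorization of K + σ²I, which avoids forming the inverse explicitly. The sketch below is illustrative: the function names and the synthetic data are assumptions, the inputs are one-dimensional, and a coarse grid search over the length scale stands in for a proper optimizer.

```python
import numpy as np


def se_kernel(x1, x2, length_scale, signal_var):
    """Squared exponential kernel of Eq. (10) for one-dimensional inputs."""
    r = x1[:, None] - x2[None, :]
    return signal_var * np.exp(-(r**2) / (2.0 * length_scale**2))


def log_marginal_likelihood(x, y, length_scale, signal_var, noise_var):
    """Evaluate the log marginal likelihood via a Cholesky factor of K + sigma^2 I."""
    n = x.size
    k_noisy = se_kernel(x, x, length_scale, signal_var) + noise_var * np.eye(n)
    chol = np.linalg.cholesky(k_noisy)  # k_noisy = chol @ chol.T
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))
    data_fit = -0.5 * y @ alpha
    complexity = -np.log(np.diag(chol)).sum()  # equals -0.5 * log|k_noisy|
    constant = -0.5 * n * np.log(2.0 * np.pi)
    return data_fit + complexity + constant


# Synthetic data for illustration only.
rng = np.random.default_rng(1)
x_train = np.linspace(0.0, 5.0, 30)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(x_train.size)

# Coarse grid search over the length scale (hypothetical candidate values).
grid = [0.1, 0.3, 1.0, 3.0, 10.0]
scores = {ls: log_marginal_likelihood(x_train, y_train, ls, 1.0, 0.01) for ls in grid}
best_length_scale = max(scores, key=scores.get)
```

The first term rewards fitting the data, and the log-determinant term penalizes overly flexible covariance structures, so maximizing the sum balances the two.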

The unknown parameters \(\left( {\uptheta,\upsigma^{2} } \right)\) can then be estimated by maximizing Eq. (13) using a gradient-based algorithm. Since the posterior of f is determined through the training data, we can evaluate the predictive distribution of any test point \(\left( {{\text{x}}_{ *} } \right)\) conditioned on the training results:

$$p\left( {f_{*} |x_{*} ,y,X,\theta ,\sigma^{2} } \right)$$
(14)

Following Suykens and Vandewalle (1999), it can be shown that the predictive distribution in Eq. (14) is Gaussian with mean m and variance \(\vartheta^{2}\) given by:

$$m\left( {x_{*} } \right) = \phi \left( {x_{*} } \right)^{T} \mu = K_{*}^{T} \left( {K + \sigma^{2} I} \right)^{ - 1} \,y$$
(15)
$$\vartheta^{2} \left( {x_{*} } \right) = \phi \left( {x_{*} } \right)^{T} \varSigma \phi \left( {x_{*} } \right) = K_{**} - K_{*}^{T} \left( {K + \sigma^{2} I} \right)^{ - 1} \,K_{*}$$
(16)

where \({\text{K}}_{*} = \left[ {{\text{K}}\left( {{\text{x}}_{*} ,{\text{x}}_{1} } \right), \ldots ,{\text{K}}\left( {{\text{x}}_{*} ,{\text{x}}_{\text{N}} } \right)} \right]^{\text{T}}\), \({\text{K}}_{**} = {\text{K}}\left( {{\text{x}}_{*} ,{\text{x}}_{*} } \right)\), and \(\upmu\) and \(\Sigma\) are the posterior mean and covariance of f. The prior mean is assumed to be zero, and the kernel function used in the present work is the squared exponential.
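The predictive mean and variance of Eqs. (15) and (16) can be sketched as follows for a one-dimensional input, a zero prior mean, and the squared exponential kernel. The hyperparameter values are fixed by hand rather than estimated, and the data are a synthetic stand-in for normalized air temperature (input) and soil temperature (output); neither reflects the values used in this work.

```python
import numpy as np


def se_kernel(x1, x2, length_scale=1.0, signal_var=1.0):
    """Squared exponential kernel of Eq. (10) for one-dimensional inputs."""
    r = x1[:, None] - x2[None, :]
    return signal_var * np.exp(-(r**2) / (2.0 * length_scale**2))


def gpr_predict(x_train, y_train, x_test, length_scale, signal_var, noise_var):
    """Predictive mean (Eq. 15) and variance (Eq. 16) with a zero prior mean."""
    k = se_kernel(x_train, x_train, length_scale, signal_var)
    k_star = se_kernel(x_train, x_test, length_scale, signal_var)  # shape (N, N*)
    k_star_star = se_kernel(x_test, x_test, length_scale, signal_var)
    k_noisy = k + noise_var * np.eye(x_train.size)
    mean = k_star.T @ np.linalg.solve(k_noisy, y_train)
    cov = k_star_star - k_star.T @ np.linalg.solve(k_noisy, k_star)
    return mean, np.diag(cov)


# Synthetic data: a noisy linear relation between input and output.
rng = np.random.default_rng(2)
air_train = np.sort(rng.random(40))
soil_train = 0.9 * air_train + 0.05 + 0.02 * rng.standard_normal(40)
air_test = np.linspace(0.1, 0.9, 9)

mean, var = gpr_predict(air_train, soil_train, air_test,
                        length_scale=0.5, signal_var=1.0, noise_var=0.02**2)
```

The returned variance is that of the latent function; adding σ² gives the predictive variance of a noisy observation. For large N, a Cholesky factorization shared between the two solves would be more efficient than calling `np.linalg.solve` twice.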

4 Model validation

In this section, the performance of GPR modeling of DGSR on a horizontal surface is evaluated by comparing the estimated values with the measured ones using different statistical indexes: mean absolute bias error (MABE), root mean square error (RMSE), relative root mean square error (RRMSE), determination coefficient (R2) and correlation coefficient (r). The MABE gives the mean absolute value of the bias error and is given by Eq. (17):

$$MABE = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left| {H^{{\prime }} - H} \right|$$
(17)

where \(H^{\prime }\) is the estimated value, H is the measured value, and n is the number of observations (i = 1, …, n).

The RMSE represents the difference between the predicted and measured values and thus quantifies the model’s accuracy. It is calculated by Eq. (18):

$$RMSE = \sqrt {\frac{1}{n}\sum\nolimits_{i = 1}^{n} {\left( {H^{{\prime }} - H} \right)^{2} } }$$
(18)

The RRMSE is calculated by dividing the RMSE by the average of the measured data and is expressed as a percentage:

$$RRMSE = \frac{{\sqrt {\frac{1}{n}\mathop \sum \nolimits_{i = 1}^{n} \left( {H^{\prime } - H} \right)^{2} } }}{{\frac{1}{n}\mathop \sum \nolimits_{i = 1}^{n} H}} \times 100$$
(19)

The performance of the model is defined by the RRMSE range as follows:

  • Excellent if: \({\text{RRMSE }} < 10\;\%\)

  • Good if: \(10\;{\text{\% }} < RRMSE < 20\;\%\)

  • Fair if: \(20\;{\text{\% }} < RRMSE < 30\;\%\)

  • Poor if: \({\text{RRMSE }} > 30\;\%\)

The correlation coefficient r indicates the strength of the linear relationship between the measured and predicted values:

$$r = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {H^{\prime } - \bar{H}^{\prime } } \right)\left( {H - \bar{H}} \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{n} \left( {H^{\prime } - \bar{H}^{\prime } } \right)^{2} \mathop \sum \nolimits_{i = 1}^{n} \left( {H - \bar{H}} \right)^{2} } }}.$$
(20)
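The statistical indexes of this section can be computed as in the sketch below. The function names and the five sample values are hypothetical and serve only to make the arithmetic easy to follow. The RRMSE is returned as a percentage, and values falling exactly on a class boundary (10, 20 or 30 %) are assigned to the next class, which is an assumption because the ranges in the text use strict inequalities.

```python
import numpy as np


def mabe(pred, meas):
    """Mean absolute bias error."""
    return np.mean(np.abs(pred - meas))


def rmse(pred, meas):
    """Root mean square error."""
    return np.sqrt(np.mean((pred - meas) ** 2))


def rrmse_percent(pred, meas):
    """RMSE divided by the mean of the measured data, in percent."""
    return 100.0 * rmse(pred, meas) / np.mean(meas)


def correlation(pred, meas):
    """Pearson correlation coefficient r between predictions and measurements."""
    dp = pred - pred.mean()
    dm = meas - meas.mean()
    return np.sum(dp * dm) / np.sqrt(np.sum(dp**2) * np.sum(dm**2))


def rate_model(rrmse_value):
    """Map an RRMSE percentage to the qualitative classes listed in the text."""
    if rrmse_value < 10.0:
        return "excellent"
    if rrmse_value < 20.0:
        return "good"
    if rrmse_value < 30.0:
        return "fair"
    return "poor"


# Hypothetical measured and predicted temperatures (deg C); every error is +/- 1.
measured = np.array([20.0, 22.0, 25.0, 30.0, 28.0])
predicted = np.array([21.0, 21.0, 26.0, 29.0, 29.0])

print(mabe(predicted, measured))           # 1.0
print(rmse(predicted, measured))           # 1.0
print(rrmse_percent(predicted, measured))  # 4.0
print(rate_model(rrmse_percent(predicted, measured)))  # excellent
```

With every error equal to ±1 °C and a measured mean of 25 °C, the MABE and RMSE are both 1 °C and the RRMSE is 4 %, which falls in the excellent class.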

5 Experimental results

In this section, we apply GPR to model MDSR using MDAT as input and MDST as output. Usually, estimating such physical quantities would require formulas that mathematically describe the relationships between the input parameters and the desired output. The experimental database used in the current study contains 1061 days of measurements. To train the GPR model, we split the database into two subsets: the first contains 560 days for training and the second contains 501 days for testing the model.
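The split can be sketched as follows. The text gives only the subset sizes, so taking the first 560 days in chronological order for training is an assumption, as are the function name and the synthetic placeholder series.

```python
import numpy as np

N_DAYS = 1061   # size of the database stated in the text
N_TRAIN = 560   # training days; the remaining 501 days are used for testing


def split_in_order(inputs, targets, n_train):
    """Split paired series into a leading training part and a trailing test part."""
    inputs = np.asarray(inputs)
    targets = np.asarray(targets)
    if inputs.shape[0] != targets.shape[0]:
        raise ValueError("inputs and targets must have the same length")
    train = (inputs[:n_train], targets[:n_train])
    test = (inputs[n_train:], targets[n_train:])
    return train, test


# Synthetic placeholders for the daily air and soil temperature series (deg C).
rng = np.random.default_rng(3)
air_temp = 2.0 + 45.0 * rng.random(N_DAYS)
soil_temp = air_temp + rng.standard_normal(N_DAYS)

(air_train, soil_train), (air_test, soil_test) = split_in_order(
    air_temp, soil_temp, N_TRAIN
)
```

A chronological split keeps the test period strictly after the training period; a random split of days would be an alternative if seasonal coverage in both subsets matters more.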

As shown in Fig. 4, the GPR model based on air temperature as input gives high precision, and the predicted values of MDST are close to the measured values.

Fig. 4 Predicted and measured values of soil temperature

An important observation from Fig. 5 is that using air temperature alone as input achieves high performance due to its high correlation with the soil temperature.

Fig. 5 Correlation between observed and predicted soil temperature (fitted line: y = 1 × x − 0.00034)

The obtained statistical indexes also confirm the performance of the proposed model. The values of these indexes are given in Table 1.

Table 1 The obtained statistical indexes

With the prior and the hyperparameters estimated from the training data, the value of the target variable at a given new point x can be predicted from the predictive mean in Eq. (15) (Fig. 5).

6 Conclusion

In this work, we presented the applicability of Gaussian process regression for modeling soil temperature using only air temperature as input. The obtained results are very satisfactory. This is due to the high correlation between the input and the output and to the good precision of GPR in modeling the nonlinear relationship between soil and air temperature, compared with other recent models such as neural networks and support vector machines.

As future work, we will use GPR to model soil temperature at different soil depths using other available meteorological data.