Article

Monthly Runoff Forecasting Based on Interval Sliding Window and Ensemble Learning

Jinyu Meng, Zengchuan Dong, Yiqing Shao, Shengnan Zhu and Shujun Wu
College of Hydrology and Water Resources, Hohai University, Nanjing 210098, China
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(1), 100; https://doi.org/10.3390/su15010100
Submission received: 16 September 2022 / Revised: 21 November 2022 / Accepted: 19 December 2022 / Published: 21 December 2022
(This article belongs to the Special Issue Sustainable Planning, Management and Utilization of Water Resources)

Abstract
In recent years, machine learning, a popular artificial intelligence technique, has been successfully applied to monthly runoff forecasting. Monthly runoff autoregressive forecasting with machine learning models generally uses a sliding window algorithm to construct the dataset, which requires selecting an optimal time step for the machine learning tool to function as intended. On this basis, this study improves the sliding window algorithm and proposes an interval sliding window (ISW) algorithm based on correlation coefficients, while the least absolute shrinkage and selection operator (LASSO) method is used to combine three machine learning models, Random Forest (RF), LightGBM, and CatBoost, into an ensemble that overcomes the preference problem of individual models. Example analyses were conducted using 46 years of monthly runoff data from Jiutiaoling and Zamusi stations in the Shiyang River Basin, China. The results show that the ISW algorithm handles monthly runoff data effectively and produces a better dataset for the machine learning models than the standard sliding window algorithm. The ensemble model combined the advantages of the single models and achieved the best forecast accuracy.

1. Introduction

Monthly runoff forecasting is of great significance for grasping the future hydrological situation of river basins, realizing optimal reservoir operation, and achieving scientific management and efficient utilization of water resources [1]. Due to the randomness of runoff processes and the nonlinear characteristics of runoff time series, the accuracy of monthly runoff forecasting is often low and struggles to guide production practice effectively, so monthly runoff forecasting remains a challenging topic within the field of hydrological forecasting [2]. At present, monthly runoff forecasting methods fall roughly into two categories: methods based on coupled hydro-meteorological physical models [3,4,5,6,7], and methods based on mathematical and statistical approaches, including traditional statistical methods and modern big-data intelligent mining models [8,9,10,11,12]. The coupled hydro-meteorological pathway couples quantitative precipitation forecasts with hydrological models to provide month-scale incoming water forecasts [13]. However, the highly nonlinear nature of the atmospheric system and the uncertainties in both the numerical weather prediction models and their boundary input conditions introduce large errors and uncertainties into numerical weather prediction results [14]; together with the uncertainties in the hydrological models themselves, this makes coupled hydro-meteorological models difficult to apply in practice. Unlike hydro-meteorological coupling approaches, models based on mathematical and statistical pathways require less data and can provide satisfactory forecast results. This is especially true of modern big-data intelligent mining models such as machine learning models, which have developed rapidly in recent years and are widely used in monthly runoff forecasting as popular artificial intelligence tools. Accordingly, a large number of machine learning models have been investigated for hydrological time-series forecasting [15].
Machine learning applications in monthly runoff forecasting mainly use supervised learning models that forecast from feature sets, which involves two steps: screening the forecast factors and constructing the model [16]. Factors are mainly screened from the perspectives of atmospheric circulation, sea surface temperature, antecedent rainfall, and antecedent runoff; these cover almost all key elements of the hydro-meteorological process, so the resulting factor sets are relatively comprehensive. Model construction usually adopts Random Forest, CatBoost, LightGBM, and similar models. In recent years, with the development of computer technology and modern statistical theory, machine learning models (e.g., neural networks and Random Forests) have also been introduced into monthly runoff time-series forecasting [17,18,19]. In this setting, machine learning models learn the trend, period, state shifts, and other features of the historical time series to forecast future runoff [20].
When a machine learning model is used to forecast a monthly runoff time series, sliding windows and similar methods are often used to process the series [21]. Some scholars have studied the influence of different time steps on the performance of machine learning models. For example, Chen et al. [22] tested different time steps on an LSTM model and found that as the time step increased, performance first improved and then leveled off. Because model performance varies greatly with the time step, selecting an appropriate time step has become an important research topic. To address this issue, the ISW algorithm is proposed in this study: following the principle of selecting forecast factors by correlation coefficient, the monthly runoff values with large correlation coefficients are chosen as predictors and rearranged into a single dataset. There is then no need to build 12 separate monthly models; a single autoregressive model suffices to forecast runoff, and its performance is excellent.
Meanwhile, the machine learning models often used for monthly runoff time series include Random Forest, CatBoost, and LightGBM. Random Forest (RF) [23], a machine learning algorithm with relatively excellent performance, has been widely used for monthly runoff prediction. For example, Ditthakit et al. [24] analyzed hydrological data and physical characteristics of 37 runoff stations in a southern basin of Thailand, used RF to predict monthly runoff, and evaluated performance using the Nash–Sutcliffe efficiency (NSE), correlation coefficient (r), and overall index (OI); the regionalized monthly runoff forecasting using RF performed best. Huang et al. [25] used an RF model to simulate and forecast the monthly runoff series at Huangzhuang hydrological station and showed that RF presented almost the same accuracy as an artificial neural network (ANN), with both outperforming the support vector machine (SVM) method. LightGBM [26] and CatBoost [27] were proposed only in recent years and have since attracted widespread attention, with many scholars applying them in hydrology. Saber et al. [28] used LightGBM and CatBoost to predict flash flood susceptibility in a wadi system (Hurghada, Egypt); the area under the receiver operating characteristic curve (AUROC) of all models exceeded 97%, and LightGBM outperformed the other models in classification metrics and processing time.
Analysis of this literature shows that previous researchers have mostly applied the RF, CatBoost, and LightGBM models individually to monthly runoff time-series forecasting. These models can provide high accuracy but also raise applicability problems: the RF model is suited to the smoother parts of a runoff series, while the CatBoost and LightGBM models are more sensitive to peaks. To overcome these disadvantages, this study coupled the three models using the least absolute shrinkage and selection operator (LASSO) algorithm into the RLC-LASSO model, which can both predict the smooth part of the runoff series and fit the runoff peaks well. The feasibility of the ISW algorithm and the RLC-LASSO model was verified through a case study of monthly runoff series from two hydrological stations in China. The innovative points of this study are as follows. (1) An ISW algorithm is proposed for processing monthly runoff series data when building an autoregressive model. (2) The LASSO method is used to combine three machine learning models for better prediction of monthly runoff. (3) The case study shows that the ISW algorithm outperformed the sliding window algorithm and that the RLC-LASSO model has good applicability in monthly runoff prediction. The ISW algorithm provides a new way of handling monthly runoff series data when building autoregressive models, and the results also verify the feasibility of ensemble learning methods in monthly runoff forecasting.
The rest of this work is organized as follows: Section 2 describes the details of the proposed approaches; Section 3 describes the use of the proposed methods to forecast the monthly runoffs of two stations; and finally, the conclusions are summarized.

2. Materials and Methods

2.1. Study Area

The Shiyang River Basin is located in the eastern part of the Hexi Corridor in Gansu Province, west of the Wushaoling Mountains and at the northern foot of the Qilian Mountains (latitude 36°29′–39°27′ N, longitude 101°41′–104°16′ E). It is connected to Baiyin and Lanzhou City in Gansu Province in the southeast, adjacent to Zhangye City in Gansu Province in the northwest, close to Qinghai Province in the southwest, and borders the Inner Mongolia Autonomous Region in the northeast, with a total area of 41,600 km². Of the three major inland river basins in the Hexi Corridor of Gansu Province, the Shiyang River Basin has the highest population density, the most advanced economic development, the highest degree of water resource development and utilization, the most prominent conflict between water supply and demand, and the most serious ecological challenges. The geographical location of the basin is shown in Figure 1.
The Shiyang River Basin is composed of eight rivers and several small tributaries. From east to west, the rivers are the Dajing, Gulang, Huangyang, Zamu, Jinta, Xiying, Dongda, and Xida Rivers; the topography and water system of the basin are shown in Figure 1. The rivers are recharged by atmospheric precipitation in mountainous areas and by snowmelt in the high mountains, with a flow-producing area of 11,100 km² and an average annual runoff of 1.560 billion m³. The main objects of this study were the Xiying River (Jiutiaoling station) and the Zamu River (Zamusi station).
This study used monthly runoff series data from Jiutiaoling station on the Xiying River and Zamusi station on the Zamu River. The monthly runoff series of both Jiutiaoling station and Zamusi station were taken from January 1972 to December 2018. The detailed parameters are shown in Table 1.

2.2. Interval Sliding Window Algorithm

When machine learning models are used for autoregressive forecasting of monthly runoff series, it is necessary to process the data to fit the machine learning model. A commonly used method is the sliding window method. Suresh et al. [21] chose a sliding window with a time step of 10 to apply to a convolutional neural network (CNN) model. Different time steps can affect the performance of these models, so it is necessary to determine the optimal time step, which increases the complexity of the application. Therefore, this study proposes an interval sliding window algorithm based on sliding windows with correlation coefficients.
The essence of the ISW algorithm is the correlation coefficient: after creating a dataset with a sliding window, it suffices to select the several columns most strongly correlated with the label column as features, thus eliminating the interference of low-correlation features with the model. The algorithm is mainly applicable to periodic time series, such as monthly runoff, monthly power load, and monthly precipitation. The step size is set to the period T + 1; for monthly runoff time-series data, this gives a step size of 13.
The main idea of the ISW algorithm is as follows. For the monthly runoff series X1, X2, …, Xn, the sliding window algorithm is first applied with a time step of 13 to obtain the data matrix Data of dimension (n−12)×13 (Figure 2 shows a sliding window with time step k). Then, the 12×12 correlation coefficient matrix Cor1 between the runoffs of the 12 calendar months is calculated and rearranged into Cor2 as shown in Figure 3, where the first row contains the correlation coefficients of all months with January, the second row those with February, and so on. The m columns with the largest correlation coefficients are selected as predictors. Because the process of extracting these columns from the window can be pictured as a sliding window with gaps, the method is named the interval sliding window.
The selection of forecast factors can thus be achieved via the interval sliding window method, which is presented in Algorithm 1.
Algorithm 1: Interval Sliding Window
Input: n: The length of runoff time series
   Xn: Runoff time series
   m: Number of forecast factors
 Process:
 1. Data = [ ]
 2. for i = 1, 2, …, n−12 do:
 3.  Data[i] = [Xi, Xi+1, …, Xi+12]
 4. end for
 5. Calculate the correlation coefficient matrix Cor1 between each month
 6. Convert Cor1 to Cor2 as shown in Figure 3
 7. Select m columns with large correlation coefficients as forecast factors in Data
 Output: Data: Data set for machine learning
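As a concrete illustration, the following is a minimal NumPy/pandas sketch of Algorithm 1. It is not the authors' code: in particular, selecting the m lag columns with the largest mean absolute correlation across all 12 target months is one plausible reading of step 7, motivated by the observation in Section 3.1 that the selected positions coincide for every month.

```python
import numpy as np
import pandas as pd

def isw_dataset(runoff, m=6):
    """Interval sliding window (Algorithm 1), a minimal sketch.

    Assumes `runoff` is a 1-D array of monthly values starting in
    January; `m` is the number of forecast factors to keep.
    """
    x = np.asarray(runoff, dtype=float)
    n = len(x)

    # Steps 1-4: ordinary sliding window with a step of 13; the last
    # column of each row is the month to be forecast (the label).
    data = np.array([x[i:i + 13] for i in range(n - 12)])

    # Step 5: 12 x 12 correlation matrix Cor1 between calendar months.
    cor1 = pd.DataFrame(x[: n - n % 12].reshape(-1, 12)).corr().to_numpy()

    # Step 6: rearrange into Cor2, whose row k lists the correlations
    # of target month k with its 12 preceding months (lag 12, ..., lag 1),
    # matching the ordering of the 12 lag columns of `data`.
    cor2 = np.array([np.roll(cor1[k], -k) for k in range(12)])

    # Step 7 (an assumption of this sketch): keep the m lag columns with
    # the strongest mean absolute correlation across all target months.
    keep = np.sort(np.argsort(np.abs(cor2).mean(axis=0))[-m:])

    return data[:, keep], data[:, -1]   # features X, labels y
```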

2.3. Ensemble Learning

The RLC-LASSO model is constructed using ensemble learning, which achieves learning tasks by combining multiple learners; the model structure is shown in Figure 4. Through ensemble methods, multiple weak learners can be combined into one powerful learner, so the generalization ability of an ensemble is generally better than that of a single learner [29]. RF, LightGBM, and CatBoost are commonly used in monthly runoff time-series forecasting and have good generalization performance, so these three machine learning models were chosen as the base learners of the ensemble.
Since the predictions of RF, LightGBM, and CatBoost are highly correlated, ordinary linear regression would yield unreliable coefficient estimates. The combination strategy applied in this study was therefore LASSO, which adds a penalty that shrinks the regression coefficients by constraining the sum of their absolute values to be below a fixed value; some coefficients are thereby set exactly to zero, retaining the advantage of subset selection. LASSO is a biased estimator well suited to highly collinear data [30]. The formula for the ensemble of RF, LightGBM, and CatBoost using LASSO (RLC-LASSO) is as follows:
$$H(X)=\sum_{i=1}^{3}\omega_i h_i(X)+\omega_0+\lambda\sum_{i=1}^{3}\left|\omega_i\right|$$
where $h_1(X)$, $h_2(X)$, and $h_3(X)$ are the base learners RF, LightGBM, and CatBoost; $\omega_1$, $\omega_2$, and $\omega_3$ are their weights; $\omega_0$ is the intercept term; and $\lambda$ is the adjustment parameter controlling the strength of the penalty term. In a study of stacked regression, Breiman [31] found that non-negative weights must be used to guarantee that the ensemble performs at least as well as the single best base learner, which requires $\omega_i \geq 0$.
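A minimal sketch of this ensemble using scikit-learn's non-negative LASSO (`positive=True`, which enforces Breiman's constraint) is given below. The hyperparameters here are placeholders rather than the tuned values in Table 3, and a more careful implementation would fit the meta-model on out-of-fold rather than in-sample base predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

def fit_rlc_lasso(X, y, alpha=1.0):
    """Train RF, LightGBM, and CatBoost, then combine them with a
    non-negative LASSO meta-model (weights w_i >= 0)."""
    base = [RandomForestRegressor(n_estimators=1000),
            LGBMRegressor(n_estimators=600),
            CatBoostRegressor(iterations=100, depth=6, verbose=False)]
    for model in base:
        model.fit(X, y)
    preds = np.column_stack([model.predict(X) for model in base])
    meta = Lasso(alpha=alpha, positive=True).fit(preds, y)
    return base, meta

def predict_rlc_lasso(base, meta, X):
    """Ensemble prediction: weighted combination of base predictions."""
    preds = np.column_stack([model.predict(X) for model in base])
    return meta.predict(preds)
```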

2.4. Performance Metrics

Performance was evaluated using the Nash–Sutcliffe efficiency coefficient (NSE), root mean square error (RMSE), mean absolute error (MAE), and correlation coefficient (R), defined as follows:
$$NSE = 1 - \frac{\sum_{i=1}^{N}\left(Q_i^{o}-Q_i^{f}\right)^2}{\sum_{i=1}^{N}\left(Q_i^{o}-\bar{Q}^{o}\right)^2}$$

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(Q_i^{o}-Q_i^{f}\right)^2}$$

$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|Q_i^{o}-Q_i^{f}\right|$$

$$R = \frac{\sum_{i=1}^{N}\left(Q_i^{o}-\bar{Q}^{o}\right)\left(Q_i^{f}-\bar{Q}^{f}\right)}{\sqrt{\sum_{i=1}^{N}\left(Q_i^{o}-\bar{Q}^{o}\right)^2}\sqrt{\sum_{i=1}^{N}\left(Q_i^{f}-\bar{Q}^{f}\right)^2}}$$
where N is the number of data pairs, $Q_i^{o}$ and $Q_i^{f}$ are the observed and forecast runoff values at the i-th time step, and $\bar{Q}^{o}$ and $\bar{Q}^{f}$ are the averages of the observed and forecast values. A p-value significance test (Pearson correlation test) was also applied to confirm the correlation between the two variables. When RMSE and MAE are close to 0 and R is close to 1, the model's monthly predictions are accurate. The NSE value ranges from −∞ to 1: values greater than 0.9 indicate that the model fits well, values of 0.8–0.9 indicate a good model, and values below 0.8 indicate that the model is not ideal.
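For reference, these four metrics translate directly into a few lines of NumPy (a sketch; the p-value for the Pearson test can be obtained from `scipy.stats.pearsonr`):

```python
import numpy as np

def nse(obs, fc):
    """Nash-Sutcliffe efficiency: 1 minus error variance over obs variance."""
    obs, fc = np.asarray(obs), np.asarray(fc)
    return 1.0 - np.sum((obs - fc) ** 2) / np.sum((obs - obs.mean()) ** 2)

def rmse(obs, fc):
    """Root mean square error."""
    return float(np.sqrt(np.mean((np.asarray(obs) - np.asarray(fc)) ** 2)))

def mae(obs, fc):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(obs) - np.asarray(fc))))

def r(obs, fc):
    """Pearson correlation coefficient."""
    return float(np.corrcoef(obs, fc)[0, 1])
```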
The data processing and algorithms for this study were implemented in Python 3.7, using the NumPy library for numerical calculations and pandas for data processing. The ISW algorithm was coded by the authors in Python. RF and LASSO were implemented with the scikit-learn library, LightGBM with the LightGBM library, and CatBoost with the CatBoost library.

3. Results

3.1. Results for ISW

This section describes the use of the ISW algorithm to process the monthly runoff series of Jiutiaoling station and Zamusi station. Firstly, the correlation coefficient matrix of monthly runoff was calculated, and the correlation coefficient matrix was converted according to the description in the interval sliding window algorithm, as shown in Figure 5.
It can be seen that at Jiutiaoling, the correlation coefficients of the runoff in January and the runoffs in January, February, June, July, August, and December of the previous year were higher than 0.6, and the correlation coefficients of the runoff in February and the runoffs in February, March, July, August, and September of the previous year were higher than 0.6. This was used to determine the monthly runoff with the highest correlation for each month and to select the forecast factors. Based on the selection of predictors after the conversion matrix of correlation coefficients, the results of runoff forecast factor selection at Jiutiaoling and Zamusi stations are shown in Table 2.
For example, to forecast the runoff in January, the monthly runoff data of January, February, June, July, August, and December of the previous year were selected as the forecast factors; to forecast the runoff in February, the monthly runoff data of February, March, July, August, and September of the previous year and January of the present year were selected. The results in Figure 5b are obtained by reordering. It can be seen that the forecast factors selected for the 12 target months occupy the same positions in the dataset. Therefore, when building the autoregressive model in this study, the 3rd, 4th, 5th, 9th, 10th, and 11th columns of data were removed, forming an interval sliding window, as sketched below.
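In code, this column removal is a single indexing operation on the window matrix. A sketch follows, with random data standing in for the (n−12)×13 window matrix of Algorithm 1; dropping the 1-indexed columns 3-5 and 9-11 leaves the 0-indexed columns below:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((552, 13))        # placeholder for the (n-12) x 13 window matrix

keep = [0, 1, 5, 6, 7, 11]          # 1-indexed columns 1, 2, 6, 7, 8, 12
X, y = data[:, keep], data[:, 12]   # predictors and label (the 13th column)
```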

3.2. Comparison of ISW and Sliding Windows

The monthly runoff series of Jiutiaoling and Zamusi stations cover January 1972 to December 2018; a training period from January 1972 to December 2008 and a validation period from January 2009 to December 2018 were selected in this study. For comparison with the ISW algorithm, sliding windows with time steps from 2 to 15 were used to build RF, LightGBM, and CatBoost models, and four evaluation metrics (R, NSE, RMSE, and MAE) were computed. We compared the performance of the ISW algorithm against the sliding window algorithm at each time step in each model. Figure 6 and Figure 7 show the performance comparison for the validation period.
For Jiutiaoling station, Figure 6 shows that the RF, LightGBM, and CatBoost models using ISW (ISW-RF, ISW-LightGBM, ISW-CatBoost) achieved correlation coefficients (R) of 0.94, 0.95, and 0.95, respectively. The R values of the three models using the sliding window algorithm increased with the time step and leveled off after a time step of 9, approaching but never exceeding the ISW results. The NSE values of the three sliding window models likewise increased with the time step, leveled off after a time step of 10, and stabilized near but below the NSE values of ISW-RF, ISW-LightGBM, and ISW-CatBoost. These two metrics show that the ISW algorithm significantly outperformed the sliding window algorithm.
In terms of RMSE, ISW-RF achieved 2.720, ISW-LightGBM 2.635, and ISW-CatBoost 2.450. The RMSE values of the RF and CatBoost models using sliding windows first decreased with increasing time step and then tended to increase, and even at their lowest points did not reach the levels of ISW-RF and ISW-CatBoost. The RMSE of the LightGBM model using sliding windows decreased with increasing time step, reached the level of ISW-LightGBM at a time step of 9, and then remained consistent with it. In terms of MAE, ISW-RF achieved 1.847, ISW-LightGBM 1.781, and ISW-CatBoost 1.671. The MAE minima of the RF and CatBoost models using sliding windows did not reach the levels of ISW-RF and ISW-CatBoost, although the MAE of the LightGBM model using sliding windows dropped below that of ISW-LightGBM at a time step of 9. These two metrics show that the ISW algorithm performed better than the sliding window algorithm.
For Zamusi station, as shown in Figure 7, the correlation coefficients R were 0.91, 0.90, and 0.91 for ISW-RF, ISW-LightGBM, and ISW-CatBoost, respectively. When using the sliding window algorithm, R increased with the time step, reaching a maximum and stabilizing after a time step of 8; the R values of all three sliding window models approached but did not exceed the ISW results. For NSE, the three sliding window models reached a maximum and leveled off after a time step of 8, again approaching but not exceeding the values of ISW-RF, ISW-LightGBM, and ISW-CatBoost. The RMSE of ISW-RF was 2.493, that of ISW-LightGBM 2.595, and that of ISW-CatBoost 2.457; the RMSE values of the three sliding window models decreased as the time step increased, and none fell below the ISW values. For MAE, ISW-RF achieved 1.696, ISW-LightGBM 1.730, and ISW-CatBoost 1.675; only the LightGBM model using sliding windows obtained a lower MAE than ISW-LightGBM (at a time step of 6), while the MAE values of the RF and CatBoost models using sliding windows never fell below those of ISW-RF and ISW-CatBoost. Overall, the four metrics indicate that the ISW algorithm outperformed the sliding window algorithm.
Based on the above analysis, the ISW algorithm outperformed the sliding window algorithm in predicting the monthly runoff series, both relieving the burden of step-size selection and offering better performance. The ISW algorithm proposed in this paper can therefore replace the sliding window algorithm in autoregressive monthly runoff prediction.

3.3. Forecast Results

Monthly runoff data for the Jiutiaoling and Zamusi stations from January 1972 to December 2008 and from January 2009 to December 2018 were used as the training and validation datasets, respectively, processed with the ISW algorithm described above. Because the parameters of a machine learning model affect its performance, a grid search was used to determine the optimal parameters; the search procedure itself is not described in detail, as it is not the focus of this paper, but the selected parameter names and values are given in Table 3. R, NSE, RMSE, and MAE were used as evaluation metrics to compare RF, LightGBM, CatBoost, and the fusion of the three models (RLC-LASSO); the specific results are shown in Table 4 and Table 5.
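As an illustration of the tuning step, a scikit-learn grid search sketch is shown below. The parameter grid is hypothetical (the paper reports only the selected values in Table 3), and time-series-aware cross-validation splitting is one reasonable choice rather than the authors' documented procedure.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Hypothetical search ranges around the values reported in Table 3.
param_grid = {"n_estimators": [500, 1000, 1500],
              "max_features": [4, 6, 8]}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid,
                      scoring="neg_root_mean_squared_error",
                      cv=TimeSeriesSplit(n_splits=5))
# search.fit(X_train, y_train); search.best_params_ gives the tuned values.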
As can be seen from the evaluation metrics reported in Table 4 for Jiutiaoling station, among the individual models, RF performed best in the training period, with R reaching 0.99 and NSE reaching 0.976, much higher than the values of the other two models, and with lower RMSE and MAE. In the validation period, CatBoost performed best, with an R of 0.95 and NSE of 0.898, and its RMSE and MAE were also lower than those of the other two models. Comparing the three single models with RLC-LASSO, the ensemble performed far better and achieved excellent values on all four evaluation metrics.
As can be seen from the evaluation metrics reported in Table 5 for Zamusi station, the pattern among the single models was the same as for Jiutiaoling station: RF performed best in the training period and CatBoost in the validation period. Comparing the three single models with RLC-LASSO, the R of RLC-LASSO in the training period was 0.99, the same as that of the best single model, RF, while its NSE was 0.010 higher than that of RF and its RMSE and MAE were both lower. In the validation period, the R of RLC-LASSO exceeded that of every single model, its NSE was 0.039 higher than the highest single-model value (CatBoost), and its RMSE and MAE were lower than those of any single model. The fits of the monthly runoff at Jiutiaoling and Zamusi stations during the training and validation periods are shown in Figure 8 and Figure 9.
To further verify the effectiveness of the RLC-LASSO method, scatter plots for the validation period, comparing each pair of models and each model against the observed values, are given in Figure 10 for Jiutiaoling and Zamusi stations. From the results for Jiutiaoling station in Figure 10a, the R² of the fitted line between RLC-LASSO and the observed values was the highest, at 0.9211. The R² values of the fitted lines between the predictions of RF, LightGBM, and CatBoost were all greater than 0.98, indicating that the three models gave comparable predictions and that integration by LASSO combined the advantages of the three single models.

4. Discussion

In this study, we propose the ISW algorithm as an extension of the sliding window algorithm and demonstrate its effectiveness by direct comparison, providing a new method for constructing datasets when machine learning models are used for monthly runoff forecasting. We also selected three machine learning models that perform relatively well in monthly runoff forecasting and integrated them; the statistics of the prediction results show that the proposed ensemble had the best prediction performance among the models compared. In practical applications, some areas lack rainfall, evaporation, and other data; there, the model can be built from historical monthly runoff data alone and then applied in a loop to predict subsequent monthly runoff. As measured monthly runoff becomes available, it can be fed back into the model, which can be continuously updated to make rolling predictions.
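A sketch of such a rolling forecast is given below, assuming a fitted model and the ISW predictor columns identified in Section 3.1; `rolling_forecast` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def rolling_forecast(model, history, keep=(0, 1, 5, 6, 7, 11), horizon=12):
    """Forecast `horizon` months ahead, feeding each prediction back
    into the 12-month window; `keep` indexes lag columns 12, ..., 1."""
    past = list(history)                 # observed monthly runoff so far
    out = []
    for _ in range(horizon):
        lags = np.array(past[-12:])      # oldest (lag 12) ... newest (lag 1)
        y_hat = float(model.predict(lags[list(keep)].reshape(1, -1))[0])
        out.append(y_hat)
        past.append(y_hat)               # roll forward; replace with the
                                         # measured value once available
    return out
```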
The comparison of ISW with the sliding window algorithm at different time steps showed that ISW was optimal on all four evaluation metrics. It was also found that, as the time step increased, the performance of all three models rose to a maximum and then stabilized; in this sense, the choice of time step is particularly important when applying machine learning to monthly runoff prediction. The ISW algorithm selects as predictors the monthly runoff values of the months with the largest correlation coefficients and rearranges them into a single dataset, so that the monthly runoff forecast is achieved by building not twelve models but a single autoregressive model, with very good forecast performance. The ISW algorithm can be applied not only to monthly runoff series, but also to other periodic time series, such as monthly electric load data.
The forecasting results of RF, LightGBM, and CatBoost were highly correlated, which may be because all three belong to the class of ensemble learning algorithms. The grid search algorithm used for parameter optimization has been applied in many previous studies and was not analyzed in detail here. The RLC-LASSO method proposed in this study outperformed all single models in monthly runoff forecasting: integration via the LASSO algorithm concentrates the advantages of the three single models to some extent, and LASSO can also handle multicollinearity, making the final prediction better than that of any single model.
Although the ISW algorithm and RLC-LASSO have been validated on monthly runoff data from two stations, Jiutiaoling and Zamusi, further research is still needed. The ISW algorithm is currently only applicable to autoregressive forecasting of monthly runoff and needs further development for other use scenarios. While machine learning models have been used extensively in monthly runoff forecasting, their data requirements differ between models, and their applicability still needs to be studied in detail. The RLC-LASSO model, although effective, requires more in-depth research on its application to machine-learning-based monthly runoff forecasting in order to further improve forecast accuracy.

5. Conclusions

In this study, a new algorithm, ISW, for processing monthly runoff data and an ensemble machine learning model, RLC-LASSO, assembled using the LASSO algorithm, are proposed for forecasting monthly runoff. First, to verify the effectiveness of the ISW algorithm, four commonly used statistical evaluation metrics and three models (RF, LightGBM, and CatBoost) were used to compare ISW with the sliding window algorithm; the results show that the proposed ISW algorithm outperformed the sliding window algorithm. Then, to evaluate the ensemble, the same four metrics were used to compare RF, LightGBM, CatBoost, and RLC-LASSO, and the proposed model performed best on all four metrics. The method is easy to understand and implement. The ISW algorithm and the ensemble model proposed in this paper are therefore feasible and promising for improving the accuracy of monthly runoff forecasting, and they also provide a useful tool for forecasting other hydrological time series, such as water levels.

Author Contributions

Conceptualization, J.M. and Z.D.; methodology, J.M. and Y.S.; software, S.W.; validation, J.M., Z.D., and S.Z.; formal analysis, Y.S.; investigation, Z.D.; resources, J.M.; data curation, Z.D.; writing—original draft preparation, J.M.; writing—review and editing, S.Z.; visualization, S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study. Written informed consent has been obtained from the patient(s) to publish this paper.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sivakumar, B.; Berndtsson, R.; Persson, M. Monthly runoff prediction using phase space reconstruction. Hydrol. Sci. J. 2001, 46, 377–387.
  2. Nilsson, P.; Uvo, C.B.; Berndtsson, R. Monthly runoff simulation: Comparing and combining conceptual and neural network models. J. Hydrol. 2006, 321, 344–363.
  3. Abuzied, S.M.; Mansour, B.M.H. Geospatial hazard modeling for the delineation of flash flood-prone zones in Wadi Dahab basin, Egypt. J. Hydroinform. 2019, 21, 180–206.
  4. Abuzied, S.; Yuan, M.; Ibrahim, S.; Kaiser, M.; Saleem, T. Geospatial risk assessment of flash floods in Nuweiba area, Egypt. J. Arid Environ. 2016, 133, 54–72.
  5. Bournas, A.; Baltas, E. Increasing the efficiency of the Sacramento model on event basis in a mountainous river basin. Environ. Process. 2021, 8, 943–958.
  6. Liao, S.L.; Li, G.; Sun, Q.Y.; Li, Z.F. Real-time correction of antecedent precipitation for the Xinanjiang model using the genetic algorithm. J. Hydroinform. 2016, 18, 803–815.
  7. Ren-Jun, Z. The Xinanjiang model applied in China. J. Hydrol. 1992, 135, 371–381.
  8. Chu, H.B.; Wei, J.H.; Jiang, Y. Middle- and long-term streamflow forecasting and uncertainty analysis using lasso-DBN-bootstrap model. Water Resour. Manag. 2021, 35, 2617–2632.
  9. Liang, D.; Xu, J.; Li, S.; Sun, C. Short-term passenger flow prediction of rail transit based on VMD-LSTM neural network combination model. In Proceedings of the 2020 Chinese Control and Decision Conference (CCDC), Hefei, China, 22–24 August 2020; pp. 5131–5136.
  10. Riahi-Madvar, H.; Dehghani, M.; Memarzadeh, R.; Gharabaghi, B. Short to long-term forecasting of river flows by heuristic optimization algorithms hybridized with ANFIS. Water Resour. Manag. 2021, 35, 1149–1166.
  11. Ren, L.; Xiang, X.Y.; Ni, J.J. Forecast modeling of monthly runoff with adaptive neural fuzzy inference system and wavelet analysis. J. Hydrol. Eng. 2013, 18, 1133–1139.
  12. Sharma, S.K.; Tiwari, K.N. Bootstrap based artificial neural network (BANN) analysis for hierarchical prediction of monthly runoff in Upper Damodar Valley Catchment. J. Hydrol. 2009, 374, 209–222.
  13. Bennett, J.C.; Wang, Q.J.; Li, M.; Robertson, D.E.; Schepen, A. Reliable long-range ensemble streamflow forecasts: Combining calibrated climate forecasts with a conceptual runoff model and a staged error model. Water Resour. Res. 2016, 52, 8238–8259.
  14. Hu, Y.; Schmeits, M.J.; Jan van Andel, S.; Verkade, J.S.; Xu, M.; Solomatine, D.P.; Liang, Z. A stratified sampling approach for improved sampling from a calibrated ensemble forecast distribution. J. Hydrometeorol. 2016, 17, 2405–2417.
  15. Zheng, H.; Yuan, J.; Chen, L. Short-term load forecasting using EMD-LSTM neural networks with a Xgboost algorithm for feature importance evaluation. Energies 2017, 10, 1168.
  16. Hou, S.; Shi, H.; Cao, X.; Zhang, X.; Jiao, L. Hyperspectral imagery classification based on contrastive learning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5521213.
  17. Huang, S.; Chang, J.; Huang, Q.; Chen, Y. Monthly streamflow prediction using modified EMD-based support vector machine. J. Hydrol. 2014, 511, 764–775.
  18. Yaseen, Z.M.; Jaafar, O.; Deo, R.C.; Kisi, O.; Adamowski, J.; Quilty, J.; El-Shafie, A. Stream-flow forecasting using extreme learning machines: A case study in a semi-arid region in Iraq. J. Hydrol. 2016, 542, 603–614.
  19. Ni, L.; Wang, D.; Singh, V.P.; Wu, J.; Wang, Y.; Tao, Y.; Zhang, J. Streamflow and rainfall forecasting by two long short-term memory-based models. J. Hydrol. 2020, 583, 124296.
  20. Wang, W.C.; Chau, K.W.; Xu, D.M.; Chen, X.Y. Improving forecasting accuracy of annual runoff time series using ARIMA based on EEMD decomposition. Water Resour. Manag. 2015, 29, 2655–2675.
  21. Suresh, V.; Janik, P.; Rezmer, J.; Leonowicz, Z. Forecasting solar PV output using convolutional neural networks with a sliding window algorithm. Energies 2020, 13, 723.
  22. Chen, X.; Huang, J.; Han, Z.; Gao, H.; Liu, M.; Li, Z.; Huang, Y. The importance of short lag-time in the runoff forecasting model based on long short-term memory. J. Hydrol. 2020, 589, 125359.
  23. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  24. Ditthakit, P.; Pinthong, S.; Salaeh, N.; Binnui, F.; Khwanchum, L.; Pham, Q.B. Using machine learning methods for supporting GR2M model in runoff estimation in an ungauged basin. Sci. Rep. 2021, 11, 19955.
  25. Huang, H.; Liang, Z.; Li, B.; Wang, D.; Hu, Y.; Li, Y. Combination of multiple data-driven models for long-term monthly runoff predictions based on Bayesian model averaging. Water Resour. Manag. 2019, 33, 3321–3338.
  26. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 2991.
  27. Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient boosting with categorical features support. arXiv 2018, arXiv:1810.11363.
  28. Saber, M.; Boulmaiz, T.; Guermoui, M.; Abdrado, K.I.; Kantoush, S.A.; Sumi, T.; Mabrouk, E. Examining LightGBM and CatBoost models for wadi flash flood susceptibility prediction. Geocarto Int. 2021, 3, 1–26.
  29. Dietterich, T.G. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems; Springer: Berlin/Heidelberg, Germany, 2000; pp. 1–15.
  30. Kukreja, S.L.; Löfberg, J.; Brenner, M.J. A least absolute shrinkage and selection operator (LASSO) for nonlinear system identification. IFAC Proc. Vol. 2006, 39, 814–819.
  31. Breiman, L. Stacked regressions. Mach. Learn. 1996, 24, 49–64.
Figure 1. Drainage distribution map of the Shiyang River Basin.
Figure 2. Step k sliding window.
Figure 3. Correlation coefficient matrix transformation.
Figure 4. Ensemble learning diagram.
Figure 5. Correlation coefficient matrix and transformed correlation coefficient matrix: (a) Correlation matrix of monthly runoff series at Jiutiaoling station; (b) reordering of monthly runoff correlation coefficient matrix at Jiutiaoling station; (c) correlation matrix of monthly runoff series at Zamusi station; (d) reordering of monthly runoff correlation coefficient matrix at Zamusi station.
Figure 6. Performance comparison of sliding windows and ISW for Jiutiaoling station.
Figure 7. Performance comparison of sliding windows and ISW for Zamusi station.
Figure 8. Comparison of the forecast results for Jiutiaoling station.
Figure 9. Comparison of the forecast results for Zamusi station.
Figure 10. Comparison of the forecast results for two stations during the validation period: (a) Jiutiaoling station; (b) Zamusi station.
Table 1. Multiyear average runoff at Jiutiaoling station and Zamusi station.

Station        Time-Series Length            Multiyear Average Runoff
Jiutiaoling    January 1972–December 2018    9.96 m³/s
Zamusi         January 1972–December 2018    7.30 m³/s
Table 2. Forecast factor selection for Jiutiaoling and Zamusi stations.

Station: Jiutiaoling
Forecast Month    Forecast Factors
Jan(t)            Jan(t−1), Feb(t−1), Jun(t−1), Jul(t−1), Aug(t−1), Dec(t−1)
Feb(t)            Feb(t−1), Mar(t−1), Jul(t−1), Aug(t−1), Sep(t−1), Jan(t)
Mar(t)            Mar(t−1), Apr(t−1), Aug(t−1), Sep(t−1), Oct(t−1), Feb(t)
Apr(t)            Apr(t−1), May(t−1), Sep(t−1), Oct(t−1), Nov(t−1), Mar(t)
May(t)            May(t−1), Jun(t−1), Oct(t−1), Nov(t−1), Dec(t−1), Apr(t)
Jun(t)            Jun(t−1), Jul(t−1), Nov(t−1), Dec(t−1), Jan(t), May(t)
Jul(t)            Jul(t−1), Aug(t−1), Dec(t−1), Jan(t), Feb(t), Jun(t)
Aug(t)            Aug(t−1), Sep(t−1), Jan(t), Feb(t), Mar(t), Jul(t)
Sep(t)            Sep(t−1), Oct(t−1), Feb(t), Mar(t), Apr(t), Aug(t)
Oct(t)            Oct(t−1), Nov(t−1), Mar(t), Apr(t), May(t), Sep(t)
Nov(t)            Nov(t−1), Dec(t−1), Apr(t), May(t), Jun(t), Oct(t)
Dec(t)            Dec(t−1), Jan(t), May(t), Jun(t), Jul(t), Nov(t)

Station: Zamusi (identical factor selection to Jiutiaoling for all 12 forecast months)
Table 3. Model parameters.

Model       Parameter          Jiutiaoling Station    Zamusi Station
RF          n_estimators       1000                   1500
            max_features       6                      6
CatBoost    iterations         100                    100
            depth              6                      6
            learning_rate      0.1                    0.1
            loss_function      RMSE                   RMSE
            logging_level      Verbose                Verbose
LightGBM    max_depth          6                      6
            learning_rate      0.006                  0.03
            n_estimators       600                    500
            metric             rmse                   rmse
            bagging_fraction   0.8                    0.8
            feature_fraction   0.8                    0.8
LASSO       alpha              1                      1
Table 4. Comparison of evaluation indexes of different models for Jiutiaoling station.

            Training                          Validation
Model       NSE     RMSE    MAE     R         NSE     RMSE    MAE     R
RF          0.976   1.394   0.864   0.99      0.879   2.720   1.847   0.94
LightGBM    0.912   2.869   1.835   0.95      0.888   2.635   1.781   0.95
CatBoost    0.940   2.891   1.864   0.95      0.898   2.450   1.671   0.95
RLC-LASSO   0.987   0.670   0.454   0.99      0.931   2.312   1.638   0.96
Table 5. Comparison of evaluation indexes of different models for Zamusi station.

            Training                          Validation
Model       NSE     RMSE    MAE     R         NSE     RMSE    MAE     R
RF          0.970   1.148   0.704   0.99      0.825   2.493   1.696   0.91
LightGBM    0.905   2.036   1.260   0.95      0.811   2.595   1.730   0.90
CatBoost    0.925   1.812   1.167   0.96      0.837   2.457   1.675   0.91
RLC-LASSO   0.980   0.667   0.443   0.99      0.876   2.190   1.597   0.93
