Introduction

Floods and droughts are natural phenomena that have impacted regions within Peninsular Malaysia throughout recorded history. Recently, continuous heavy rainfall in January 2021 caused high streamflow (SF) within rivers and consequent widespread flooding in Peninsular Malaysia, with Pahang being the worst affected state. Approximately 50,000 individuals were evacuated, while at least six people died. Meanwhile, the worst water shortage affecting Peninsular Malaysia occurred back in 1998, when a prolonged drought caused very low SF and the drying up of dam reservoir water resources. Given the shortage, water was rationed for almost 150 days in the Klang Valley, affecting 3.2 million people. Ultimately, these phenomena can be understood to be a result of extreme values of SF1. Excessively high SF causes a stream to overflow its banks and submerge the surrounding land, causing floods. Droughts, on the other hand, result from excessively low SF, which diminishes water resources as rivers and dam reservoirs dry up simultaneously. SF is even recognized by the World Meteorological Organization (WMO) as a significant predictor of droughts and has been used in existing studies to forecast drought indicators, namely the standardized drought index (SDI) and standardized SF index (SSI)2,3. As history has shown, floods and droughts make the task of water resource management and allocation extremely difficult, while also affecting other industries and activities such as hydropower generation, agriculture, and environmental protection1,4,5,6. Additionally, existing studies have demonstrated the correlation of SF with river suspended sediment load (SSL). SF data has been used to obtain better predictions of SSL7,8,9,10, hence highlighting the effects of SF on SSL, with higher SF typically causing higher SSLs. On top of that, streamflow also affects the capacity of rivers to receive pollution. The water quality index (WQI) is commonly used to describe the water quality of streamflow and is affected by six parameters, namely biochemical oxygen demand, chemical oxygen demand, dissolved oxygen (DO), suspended solids, ammoniacal nitrogen, and potential for hydrogen11. Rivers with large streamflow are better able to receive and dilute pollution discharge concentrated with these substances, while rivers with small streamflow are more easily polluted as they cannot degrade the pollution discharge swiftly. Given the aforementioned factors, a means to predict SF is highly valuable, especially approaching or during periods of floods and droughts, particularly for municipal and environmental damage mitigation; water resource management; continuation of hydropower generation and agricultural activities; and SSL and WQI monitoring.

Machine learning (ML), a branch of artificial intelligence, has been studied and utilized for the purpose of SF prediction. ML algorithms are able to identify trends and patterns in a large database easily and continually improve in predictive ability with time, while not requiring much human intervention as they self-learn. For these reasons, ML is a valuable tool for modelling and predicting SF as different rivers have different SF magnitudes and behaviours, depending on the spatial and temporal variability as well as the water balance component heterogeneity of a particular river1,12. Existing studies from recent years have established and shown several ML algorithms capable of producing SF predictions of high accuracy while outperforming other ML algorithms, namely the support vector machine (SVM) and two deep learning algorithms: artificial neural network (ANN) and long short-term memory (LSTM). Standalone SVMs have been demonstrated to produce more accurate SF predictions compared to extreme learning machine (ELM), adaptive neuro-fuzzy inference system (ANFIS), multivariate adaptive regression splines (MARS), M5 model tree, and ANN6,13,14,15,16,17. Hybridization has also been studied to enhance the predictive ability of SVM for different case studies. Malik et al.18 hybridized SVM with ant lion optimization (ALO), multi-verse optimizer (MVO), spotted hyena optimizer (SHO), Harris’ hawks optimization (HHO), particle swarm optimization (PSO), and Bayesian optimization (BO), to predict the daily SF in the watershed of Naula, India. It was found that SVM hybridized with HHO (SVM-HHO) was superior in SF predictive performance compared to the other hybridized algorithms. The study by Tikhamarine, Souag-Gamane, and Kisi1 hybridized SVM with the grey-wolf optimizer (GWO), shuffled complex evolution (SCE) algorithm, MVO, and PSO, to predict SF for the Ain Bedra and Fermatou stations in Algeria, in which it was found that SVM hybridized with GWO (SVM-GWO) outperformed the other hybridized algorithms. One of the primary advantages of the SVM that makes it perform well in SF prediction is that it is able to deal with overlearning and high dimensionality, which may otherwise cause computational complexity and local extrema19. In addition, tuning or adjustment of only a few hyperparameters needs to be performed, giving the SVM a simple structure and ease of implementation20,21. However, the SVM’s predictive ability is negatively affected when the utilized data set is significantly noisy, as SVMs are sensitive to noise22,23,24. Meanwhile, standalone ANNs have been shown to produce superior performances in SF prediction compared to linear regression (LR), autoregressive integrated moving average (ARIMA), genetic expression programming (GEP), ANFIS, and SVM25,26,27,28,29. The studies by Zaini et al.30 and Sammen et al.31 on Malaysian rivers demonstrated improved ANN predictive performance when hybridized with the bat algorithm (BA) and sunflower optimization algorithm (SFA). ANN hybridization was also performed in the study by Li et al.32 using empirical mode decomposition (EMD), ensemble empirical mode decomposition (EEMD), and discrete wavelet transformation (DWT). It was found that ANN hybridized with EEMD (EEMD-ANN) was the best performing model in the respective study. In addition, the predictive performance of ANN was shown to be improved through the utilization and integration of additional data mining techniques, as shown by Zamanisabzi et al.33 in the study on the Elephant Butte Reservoir.
SF has also been predicted accurately by modelling the relationship between SF and rainfall, as demonstrated in the study by Ali and Shahbaz34 on Pakistani rivers. The upsides that make ANN powerful in SF prediction include being able to easily handle large data sets; detect complex non-linear relationships between input and output parameters; and relate input and output parameters without the utilization of complex mathematical models or calculations35,36,37. A drawback of the ANN is that it is computationally expensive and has a high dependence on the capability of available hardware38,39,40. This means that adequate processing power is required for models to be trained with realistic and efficient training durations. Apart from ANN, LSTM is another deep learning algorithm that has produced good performances in SF prediction. Standalone LSTMs have outperformed other algorithms such as the nonlinear autoregressive exogenous model (NARX), Gaussian process regression (GPR), SVM, ANN, and the standard technique of hydrological model parameters regionalization also known as the HMREG scheme41,42,43. LSTMs have also been hybridized to improve their performances in SF prediction for different case studies. The study by Ghimire et al.44 on the Brisbane River and Teewah Creek in Australia hybridized LSTM with the convolutional neural network (CNN), resulting in SF predictive performances outperforming algorithms such as gradient boosting regression (GBM), extreme gradient boosting (XGB), decision tree (DT), ELM, and MARS. Liu et al. developed an algorithm hybridizing an Encoder-Decoder LSTM with EMD, which was capable of producing accurate SF predictions for the case study of the Yangtze River, China45. The advantages of the LSTM, namely its strong abilities to capture long-term time dependencies between input and output parameters and to learn relationships within complex and high-dimensional data sets, contribute to its good performance in the field of SF prediction46,47. The downside of the LSTM is that it requires high computational power to train and develop models in a reasonable timeframe, given that it is a deep learning ML algorithm48,49. An LSTM model may also take a longer time to train and develop depending on the difficulty of the problem to be solved as well as the LSTM architecture chosen50. Additionally, the LSTM is prone to overfitting effects51,52, which may be reduced with the help of dropout regularization and early-stopping callback mechanisms. Apart from these established algorithms (SVM, ANN, LSTM), other ML algorithms with good potential that have been developed and focused on for the purpose of accurate SF prediction include variations of ELM, ANFIS, and random forest (RF)4,5,15,45,53,54.

Based on the aforementioned existing studies, it can be found that the majority have developed SF prediction models based on data from only one hydrological station or river. As SF is affected by factors namely spatial variability, temporal variability, and water balance component heterogeneity, the magnitude and behaviour of SF in different rivers often vary1,12. Due to this, the suitability of ML algorithms for SF prediction may also vary between rivers. Certain ML models or algorithms may excel in predicting SF accurately for a particular river but perform poorly in predicting SF for a different river, as they may be unable to effectively capture the behaviour of SF for the different river. Existing studies in Peninsular Malaysia have developed ML algorithms namely LR, M5P tree, RF, SVM, ANFIS, ARIMA, ANN, and LSTM to predict SF in rivers such as Sungai Muda in Kedah; Sungai Kuantan and Sungai Kenau in Pahang; Sungai Kelantan in Kelantan; and Sungai Kurau, Sungai Bernam, and Sungai Tualang in Perak26,29,30,31,42,53. Aside from the studies by Zaini et al.30, Sammen et al.31, and Pandhiani et al.53 which utilized data sets from two hydrological stations or rivers to develop SF prediction models, other SF prediction studies in Peninsular Malaysia have focused on data sets from only one hydrological station or river. This brings up a research gap in which it is unknown whether there exists a single ML model or algorithm that is able to accurately predict SF for the many different rivers within Peninsular Malaysia, as there are no existing studies that have developed and tested ML models or algorithms based on data sets from a substantial number of rivers within the region. Therefore, the present study intends to address this research gap by developing SF prediction models based on SF time series data sets of hydrological stations located along 11 different rivers throughout Peninsular Malaysia. The ML algorithms utilized for SF prediction in the present study are the SVM, ANN, and LSTM. This is because the conducted literature review has shown them to produce accurate SF predictions as well as to outperform other ML algorithms in the field of SF prediction, hence indicating their superiority in this field. Additionally, the literature review performed has highlighted the algorithms’ noteworthy advantages which make them suitable to be used for SF prediction in the present study. Hybridization of SVM, ANN, and LSTM is not investigated in the present study, as the present study intends to identify the standalone ML model that is most accurate and suitable as a universal model for the case study of 11 different river streamflow data sets in Peninsular Malaysia, which has not been performed before in existing studies. The findings of the present study may then open up a topic or focus for a future study on the hybridization of the standalone universal model proposed at the end of the present study.

Real-life adoption and application of an ML model proposed from scientific literature for the purpose of SF prediction may be complicated due to doubt on whether the proposed ML model is able to reproduce its accurate performance for different river case studies, which may have different SF magnitudes and behaviours due to variability on a spatial and temporal scale, as well as varying heterogeneity in water balance components. Meanwhile, the development of individual or personalized SF predictive ML models for each river within a region is resource intensive as it may require a significant amount of time and cost. Rather than expending substantial resources to develop many tailor-made SF predictive ML models for each river within a region, it would be more resource-efficient to identify one ML model that is capable of predicting SF with good accuracy for many different rivers within a region. Therefore, the present study was motivated by the idea of proposing a single universal ML model that has been substantially and simultaneously tested on different rivers; and is capable of accurately predicting SF for any river case study within Peninsular Malaysia. The main contribution of the present study is the development and testing of SF prediction models using three ML algorithms and SF data sets of hydrological stations from 11 different rivers throughout Peninsular Malaysia; and the proposal of the best performing ML model in the present study as the universal model for accurate SF prediction in the region. The best performing ML model is selected by considering two factors, which are the number of times a model produced the most accurate predictive performance for a data set, and the reliability of each model in producing relatively high-accuracy predictions for the different data sets. The accuracy of the ML models in the present study is quantified through the utilization of selected performance evaluation measures, namely the mean absolute error (MAE), root mean squared error (RMSE), coefficient of determination (R2), and ranking mean (RM). The findings from the present study may interest hydrological authorities or institutions that are searching for substantially tested ML models within Peninsular Malaysia, or even other regions. The rest of the present study is organized as follows: “Materials and methods” describes the materials and methods used to develop and test the SF prediction models. Section “Results and discussion” reports and discusses the performance of the SF prediction models. Section “Conclusion” concludes the overall study and provides suggestions for future studies.

Materials and methods

The materials and methods used in developing and testing the SF prediction models for the 11 selected rivers within Peninsular Malaysia are explained in this section. Information on the location and data of the case study, the model development process, feature selection, data pre-processing, ML algorithms, and performance measures is described.

Location and data of case study

The western region of Malaysia is known as Peninsular Malaysia. It comprises 13 states and 2 federal territories and has an area of approximately 132,265 km2. Located just north of the equator, Peninsular Malaysia accounts for approximately 40% of Malaysia’s land area. Malaysia’s capital is the Federal Territory of Kuala Lumpur, which is located about 40 km from the coast. There are approximately 1235 river basins in Peninsular Malaysia, of which 74 are classified as main river basins while the remaining 1161 are categorized as small river basins55. The longest river in Peninsular Malaysia is Sungai Pahang, measuring up to 459 km in length.

The raw daily average SF data for different rivers within 11 states in Peninsular Malaysia was obtained from the Water Resources Management and Hydrology Division of the Malaysian Department of Irrigation and Drainage. To conduct the present study, one river is selected per state based on the suitability of its data in terms of volume and time-series continuity, and the significance of the river to its respective state or federal territory. Table 1 provides information on the selected rivers for each state, the SF station numbers as well as latitudes and longitudes, and the data duration provided by each SF station.

Table 1 Information on selected rivers’ data for each state.

Model development process

The process used to develop and test the SF prediction models in the present study comprises raw data collection, feature selection, data pre-processing, model prediction, and performance analysis. The model development process employed in the present study is illustrated in Fig. 1.

Figure 1 The SF prediction model development process employed.

Feature selection

The process of selecting input parameters to be fed to an algorithm for model training is known as feature selection. It is important as a means to identify input parameter combinations that would enable accurate model predictions. For the present study, only the daily average streamflow (SF) data was available and utilized to predict future SF, hence the present study is categorized as univariate. A statistical analysis on the daily average SF for each of the 11 selected rivers is shown in Table 2.

Table 2 Statistical analysis of SF data for the 11 selected rivers.

Given that the present study is univariate and two of the algorithms to be tested (SVM and ANN) are not traditional time-series forecasting algorithms, the SF data sets for each river are organized into sliding windows in order to reframe the time-series forecasting problem into a supervised learning problem. Before the data sets were organized into sliding windows, partial autocorrelation function (PACF) analyses were carried out on all the SF data sets in order to identify the lagged SF data that have significant correlation to the current-day SF data. Based on Fig. 2, it is found that for many of the SF data sets, the lagged SFs that are significantly correlated to the current-day SF [SF(t)] are the 1-day lagged SF [SF(t − 1)], 2-day lagged SF [SF(t − 2)], and 3-day lagged SF [SF(t − 3)].

Figure 2 Partial autocorrelogram for SF for all data sets.
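For illustration, a minimal Python sketch of how such a PACF analysis can be carried out with the statsmodels library is given below; the synthetic placeholder series, lag count, and confidence level are assumptions for demonstration rather than values from the present study.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import pacf
from statsmodels.graphics.tsaplots import plot_pacf
import matplotlib.pyplot as plt

# Placeholder standing in for one river's imputed daily average SF series
rng = np.random.default_rng(0)
sf = pd.Series(50 + rng.normal(0, 5, 2000)).rolling(3, min_periods=1).mean()

# Numerical PACF values for the first 10 lags with 95% confidence intervals
pacf_values, conf_int = pacf(sf, nlags=10, alpha=0.05)
for lag, value in enumerate(pacf_values):
    print(f"lag {lag}: PACF = {value:.3f}")

# Partial autocorrelogram comparable to Fig. 2; lags whose bars exceed the
# confidence band are treated as significantly correlated with SF(t)
plot_pacf(sf, lags=10)
plt.show()
```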

In addition, Pearson’s correlation coefficient is utilized to further analyse and understand the correlation between the current-day SF data [SF(t)] and the selected lagged SF data [SF(t − 1), SF(t − 2), SF(t − 3)]. The mathematical formula used to calculate Pearson’s correlation coefficient, symbolized by \({r}_{xy},\) is represented by:

$${r}_{xy}=\frac{\sum_{i=1}^{n}\left({x}_{i}-\overline{x }\right)\left({y}_{i}-\overline{y }\right)}{\sqrt{\sum_{i=1}^{n}{\left({x}_{i}-\overline{x }\right)}^{2}}\sqrt{\sum_{i=1}^{n}{\left({y}_{i}-\overline{y }\right)}^{2}}}$$
(1)

where \(\overline{x }\),\(\overline{y }\) are respective data means; \({x}_{i},{y}_{i}\) are individual respective data points; and \(n\) is the sample size.
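A minimal sketch of how the lagged correlations behind Table 3 can be computed with pandas is shown below; the helper function name and the placeholder series are illustrative assumptions, not part of the study’s code.

```python
import numpy as np
import pandas as pd

def lagged_correlation_matrix(sf: pd.Series, max_lag: int = 3) -> pd.DataFrame:
    """Return the Pearson correlation matrix of SF(t) and its first `max_lag`
    lags, equivalent to applying Eq. (1) to each pair of columns."""
    frame = pd.DataFrame({"SF(t)": sf})
    for lag in range(1, max_lag + 1):
        frame[f"SF(t-{lag})"] = sf.shift(lag)
    return frame.dropna().corr(method="pearson")

# Placeholder series standing in for one river's daily average SF
rng = np.random.default_rng(1)
sf = pd.Series(50 + rng.normal(0, 5, 2000)).rolling(3, min_periods=1).mean()
print(lagged_correlation_matrix(sf, max_lag=3))
```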

Through the calculation of Pearson’s correlation coefficient, it is found that there is indeed strong correlation between the current-day SF data [SF(t)] and the selected lagged SFs [SF(t − 1), SF(t − 2), SF(t − 3)] in the majority of the data sets. Table 3 shows Pearson’s correlation coefficient matrix for all 11 SF data sets used in the present study.

Table 3 Pearson’s correlation coefficient matrix for data sets of each selected river.

The PACF and Pearson’s correlation coefficient analyses show that the selected lagged SF data [SF(t − 1), SF(t − 2), SF(t − 3)] have strong predictive power for the current-day SF data [SF(t)], hence they are selected to be used as input parameters in the present study. Using these input parameters, three input parameter scenarios are designed and fed to the selected ML algorithms for model training. By feeding and testing different input parameter scenarios to the ML algorithms for model training, as performed by existing studies4,6,15,18,34,43,56, the sensitivity of the models to different input combinations can be analysed and understood, and the best input parameter combination for accurate SF predictions can be determined. Table 4 describes the input parameter scenarios used in the present study. In total, 99 models were run and evaluated, given 3 input parameter scenarios, 3 ML algorithms, and 11 different SF data sets.

Table 4 Input parameter scenarios designed for the present study.
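A sketch of the sliding-window reframing is given below; it assumes, as suggested by Table 4 and the conclusion, that input scenario 3 uses the previous three days of SF, and the helper function and placeholder series are illustrative only.

```python
import numpy as np
import pandas as pd

def make_supervised(sf: pd.Series, n_lags: int):
    """Reframe the univariate SF series into supervised samples: each row of X
    holds [SF(t-1), ..., SF(t-n_lags)] and y holds the corresponding SF(t)."""
    frame = pd.DataFrame({f"SF(t-{lag})": sf.shift(lag) for lag in range(1, n_lags + 1)})
    frame["SF(t)"] = sf
    frame = frame.dropna()
    X = frame.drop(columns="SF(t)").to_numpy()
    y = frame["SF(t)"].to_numpy()
    return X, y

# Placeholder series; input scenario 3 uses the previous three days of SF
sf = pd.Series(np.random.default_rng(2).normal(50, 5, 2000))
X3, y3 = make_supervised(sf, n_lags=3)
print(X3.shape, y3.shape)   # (1997, 3) (1997,)
```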

Data pre-processing

This section explains the pre-processing steps performed on the raw SF time-series data sets of the 11 selected rivers obtained from the Malaysian Department of Irrigation and Drainage. The data pre-processing steps comprise the imputation of missing data, data partitioning, and feature scaling.

Missing data

Machine learning algorithms generate errors when missing values are encountered within a data set. For this reason, the raw SF time-series data sets obtained from the Malaysian Department of Irrigation and Drainage needed to be processed as they contained missing SF values. In existing SF studies, missing data has been imputed by interpolation or by filling in the missing values with the mean, or by removing the missing data rows completely12,26,27,54. In the present study, imputation through interpolation is utilized to fill in the missing data. The imputation is carried out using the imputeTS R-package developed by Moritz and Bartz-Beielstein57. Linear interpolation and spline interpolation were tested to fill the missing data sections. It was found that spline interpolation filled in some missing SF data with negative values, which is not logical as the water in the rivers moves in only one direction. Therefore, linear interpolation was selected to fill in the missing data portions. As a sample, the outcome of the imputation process for missing SF values in the Johor data set is shown in Fig. 3.
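Although the imputation in the present study was performed with the imputeTS R-package, an equivalent of the chosen linear-interpolation approach can be sketched in Python with pandas for illustration; the toy series below is a placeholder, not study data.

```python
import numpy as np
import pandas as pd

# Toy daily SF series containing a gap of missing values
sf = pd.Series([10.2, np.nan, np.nan, 12.8, 11.5])

# Linear interpolation fills gaps along a straight line between neighbouring
# observations and cannot produce negative flows, unlike spline interpolation,
# which can undershoot below zero on steep recessions.
sf_filled = sf.interpolate(method="linear")
print(sf_filled)
```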

Figure 3 SF imputed values for Johor data set (SF values in units of m3/s, time step in units of day).

Data partitioning

The SF data sets in the present study are partitioned into two subsets, which are the training set and the test set. The training set is used to develop the ML models and provide them with the ability to make SF predictions, while the test set is used to evaluate the ML models’ predictive ability using selected performance measures. An optimum ratio for the amount of training data to testing data is found to be 80:20, according to Kannangara et al.58. Existing SF prediction studies have also demonstrated good results using an 80:20 ratio of training data to testing data6,26. Therefore, 80% of each river’s SF data is used for training while the remaining 20% is used for testing in the present study. The training data is further split into a training set and a validation set. The validation set serves to evaluate the model after each epoch, guiding the tuning that improves model performance. The size of the validation set was selected through a trial-and-error process, in which it was found that using 20% of the training data as the validation set produced the best results for SF prediction. The duration of the training and testing set for each river after data partitioning can be seen in Table 5.

Table 5 Data partitioning for each river’s data set.
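A minimal sketch of this chronological 80:20 train/test split, followed by an 80:20 train/validation split of the training portion, is given below; holding out the final block of each subset is an assumption consistent with time-series practice (and with Keras’ validation_split behaviour), not a detail reported in the study.

```python
import numpy as np

def chronological_split(X, y, test_frac=0.2, val_frac=0.2):
    """Hold out the last test_frac of samples as the test set, then the last
    val_frac of the remaining training samples as the validation set."""
    n_test = int(len(X) * test_frac)
    X_train_full, X_test = X[:-n_test], X[-n_test:]
    y_train_full, y_test = y[:-n_test], y[-n_test:]
    n_val = int(len(X_train_full) * val_frac)
    X_train, X_val = X_train_full[:-n_val], X_train_full[-n_val:]
    y_train, y_val = y_train_full[:-n_val], y_train_full[-n_val:]
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

# Placeholder supervised samples from the sliding-window step
X = np.random.rand(1000, 3)
y = np.random.rand(1000)
train, val, test = chronological_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))   # 640 160 200
```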

Feature scaling

As SVM and the deep learning algorithms (ANN and LSTM) are sensitive to data scales, feature scaling needs to be carried out on the SF data sets of each river. Feature scaling ensures that data variables are weighted accurately, so that convergence is fast and errors are minimized during training43. Depending on the ML algorithm to be used, two types of feature scaling methods are utilized, namely normalization and standardization. The present study utilizes standardization before training the SVM models, and normalization before training the deep learning models. Feature scaling is performed on the input data, which is determined through feature selection processes to be the 1-day, 2-day, and 3-day lagged SF; and the output data, which is the current-day SF. The outputs or raw predictions from the ML models are then inverse transformed back into their original scales in order to correctly proceed with evaluation and comparison through the usage of selected performance measures.
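A sketch of this scaling step using scikit-learn is shown below; the toy arrays are placeholders for the lagged-SF inputs and SF(t) targets, and fitting the scalers on the training data only is an assumption consistent with standard practice.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Placeholder arrays standing in for the lagged-SF inputs and SF(t) targets
X_train = np.random.rand(80, 3) * 100
y_train = np.random.rand(80) * 100
X_test = np.random.rand(20, 3) * 100

# Min-max normalization for the deep learning models; swap in StandardScaler()
# for the SVR models. Scalers are fitted on the training data only.
x_scaler, y_scaler = MinMaxScaler(), MinMaxScaler()
X_train_s = x_scaler.fit_transform(X_train)
y_train_s = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()
X_test_s = x_scaler.transform(X_test)

# After prediction, raw model outputs are inverse-transformed back to m3/s
# before MAE, RMSE, and R2 are computed (placeholder predictions shown here)
y_pred_s = y_train_s[:20]
y_pred = y_scaler.inverse_transform(y_pred_s.reshape(-1, 1)).ravel()
print(y_pred[:5])
```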

Machine learning algorithms

In the present study, established ML algorithms in the field, namely the SVM and two deep learning algorithms (ANN and LSTM), were selected for the development and testing of SF prediction models. SVM, ANN, and LSTM are regarded as established in the field of SF prediction due to the numerous studies demonstrating their effectiveness in recent years1,6,13,14,15,16,17,18,25,26,27,28,29,30,31,32,33,34,41,42,43,44,59. The Python programming language was utilized in the development and testing of the SF prediction models due to its ease of use and readability, as well as its vast library support. Table 6 details the experimental setup used in developing the SF prediction models.

Table 6 Experimental setup.

Support vector machine (SVM)

The SVM is a kernel-based algorithm that utilizes structural risk reduction and statistical learning methods in order to produce a good generalization capacity through the minimization of the generalization error rather than the training error1,13,17. The SVM works by using a transfer function to non-linearly map input vectors into a high-dimensional feature space, which helps to reduce the complexity of optimization13,17. The inspiration behind support vector regression (SVR), the regression form of the SVM used in the present study, is the definition of a regression function approximation based on a set of support vectors originating from a training data set1. According to existing studies1,17, the SVM function is given by:

$$f\left(x\right)=\sum_{i=1}^{N}{(\alpha }_{i}-{\alpha }_{i}^{*})K(x,z)+{b}_{i}$$
(2)

where \({(\alpha }_{i}-{\alpha }_{i}^{*})\) are the Lagrange multipliers, \(K(x,z)\) is the kernel function, and \({b}_{i}\) is the bias.

The kernel function is the main SVR hyperparameter that needs to be selected or tuned before running the SVR models. The kernel functions that can be employed are the radial basis function (RBF), linear, polynomial, and sigmoid. Existing literature has backed RBF as the best kernel function due to its optimization efficiency and adaptability1,13. After trial and error, it was indeed determined that RBF produced the best SF predictions, hence it was chosen and finalized as the SVR kernel function in the present study. All other unmentioned SVR hyperparameters were kept at their default values as satisfactory SF predictions were obtained. Table 7 shows the hyperparameter tuning for SVR in the present study.

Table 7 Hyperparameter tuning for SVR algorithm.
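A minimal sketch of an SVR model with the RBF kernel selected above is given below, using scikit-learn; the placeholder arrays stand in for the standardized inputs and targets, and all unspecified hyperparameters are left at library defaults as in Table 7.

```python
import numpy as np
from sklearn.svm import SVR

# Placeholder arrays standing in for standardized lagged-SF inputs and targets
X_train = np.random.rand(80, 3)
y_train = np.random.rand(80)
X_test = np.random.rand(20, 3)

# RBF kernel chosen after trial and error; other hyperparameters left at defaults
svr = SVR(kernel="rbf")
svr.fit(X_train, y_train)
y_pred = svr.predict(X_test)   # inverse-transform to m3/s before evaluation
print(y_pred[:5])
```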

Artificial neural network (ANN)

The ANN is a deep learning algorithm inspired by the neural connections that occur in the biological functions of the human brain33. This algorithm essentially comprises three layers, which are the input layer, hidden layer, and output layer26,27,33. The ANN architecture consists of processing units called neurons, also referred to as nodes26. The ANN layers and nodes are connected together by connections referred to as weights26,27. These weights provide the ANN with a high degree of flexibility, giving it the ability to freely adapt to input data27. The number of ANN layers and nodes required to solve a prediction problem typically depends on the complexity of the problem, with more difficult problems usually requiring more layers or nodes. An ANN architecture is essentially characterized by the training algorithm used to represent the layers, nodes, and connections; the connection weights between neurons; and an activation function26. The training algorithm works to reduce errors through the adjustment of connection weights and biases within an ANN architecture. The adjusted connection weights are then multiplied with the input values, which are then added to the adjusted biases. Finally, the outputs are sent to the activation function to generate the final output, which in the present study is the SF prediction. As explained by Zakaria et al.26, the ANN mathematical model can be described by equation:

$${y}_{j}=f\left(\sum_{i=1}^{N}{\omega }_{ij}{x}_{i}+{b}_{j}\right)$$
(3)

where \({y}_{j}\) is the output of the jth neuron, \(N\) is the number of input neurons, \({\omega }_{ij}\) is the weight connecting the ith input neuron to the jth neuron, \({x}_{i}\) is the ith input, \({b}_{j}\) is the bias of the jth neuron, and \(f\) is the activation function.

As explained by Zamanisabzi et al.33, trial and error is needed to determine the best hyperparameter tuning for an ANN architecture, as different problems have different hidden relationships within the data. After performing the trial-and-error process, it was determined that two hidden layers with 6 neurons in each layer were optimal for SF prediction in the present study, as this architecture provided good adaptability in producing SF predictions for the 11 different river data sets. In addition, different numbers of epochs, training algorithms, activation functions, and batch sizes were tested to discover the best possible ANN architecture within the context of the present study. Through this testing, the best ANN architecture was found and is shown in Table 8. All other unmentioned ANN hyperparameters, including the initializer, regularizer, and constraints, were kept at their default values as satisfactory SF predictions were obtained.

Table 8 Hyperparameter tuning for ANN algorithm.
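For illustration, a Keras sketch of a feed-forward network with the reported architecture (two hidden layers of 6 neurons and a single output neuron) is shown below; the activation function, optimizer, loss, number of epochs, and batch size are placeholder assumptions, with the tuned values being those listed in Table 8.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder arrays standing in for normalized lagged-SF inputs and SF(t) targets
X_train = np.random.rand(200, 3)
y_train = np.random.rand(200)

# Two hidden layers of 6 neurons each, as reported; activation, optimizer, loss,
# epochs, and batch size below are placeholders (tuned values are in Table 8)
ann = keras.Sequential([
    layers.Input(shape=(3,)),
    layers.Dense(6, activation="relu"),
    layers.Dense(6, activation="relu"),
    layers.Dense(1),   # single output neuron: predicted SF(t)
])
ann.compile(optimizer="adam", loss="mse")

# validation_split=0.2 holds out the last 20% of the training data, producing
# train/validation loss curves of the kind shown in Fig. 4
history = ann.fit(X_train, y_train, validation_split=0.2,
                  epochs=50, batch_size=16, verbose=0)
print(min(history.history["val_loss"]))
```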

During each ANN model’s training process, the train and validation loss vs epochs graphs are produced to graphically verify that the losses reduce and converge, and to ensure that overfitting does not occur. As a sample, the losses vs epochs graph for the best performing ANN model (ANN3) for the Johor data set is shown in Fig. 4. It can be seen that the validation loss is lower than the train loss. This is because of the small size of the validation set, which comprises 20% of the training data. The size of the validation set can be increased to reduce the train loss; however, it was found that the best SF predictions were obtained with the training data to validation data ratio set at 80:20. Therefore, this ratio was maintained and utilized in training the ANN models.

Figure 4 Train and validation loss vs epochs for ANN3 model training process.

Long short-term memory (LSTM)

The LSTM is an advanced version of the recurrent neural network (RNN) that helps to overcome the issues of gradient vanishing and explosion that are present in the standalone RNN44. This algorithm utilizes control gates to essentially store, remove, update, and control the flow of information in a unique structure known as the memory cell43,44. There are three types of control gates used by the LSTM, which are the input gate, the output gate, and the forget gate42,43,44. The input gate functions to control the flow of information to be introduced into the cell state, the output gate selects information from the cell state to be forwarded to a dense layer containing a single neuron where the final output value is calculated, while the forget gate determines the amount of information to be removed from the previous cell state43,44. The operation of the control gates helps in filtering relevant information as required, hence contributing towards the minimization of errors. As mentioned by existing studies43,44, the LSTM mathematical model can be described through function:

$${h}_{t}={o}_{t}{\odot \mathrm{tanh}(C}_{t})$$
(4)

where \({h}_{t}\) is the output, \({o}_{t}\) is the output gate, \(\odot\) is the Hadamard product, and \({C}_{t}\) is the cell state value at time t.

As is the case with ANNs, LSTMs also consist of hidden layers filled with neurons, hence a trial-and-error process is needed to find the optimal number of hidden layers and neurons. After performing the trial-and-error process, it was determined that two hidden layers with 50 neurons in each layer were optimal for SF prediction in the present study, as this architecture provided good adaptability in producing SF predictions for the 11 different river data sets. In addition, different numbers of epochs, step numbers, training algorithms, dropout regularization rates on each hidden layer, activation functions, recurrent activation functions, and batch sizes were tested to discover the best possible LSTM architecture within the context of the present study. Through this testing, the best LSTM architecture was found and is shown in Table 9. All other unmentioned LSTM hyperparameters, including the initializer, regularizer, and constraints, were kept at their default values as satisfactory SF predictions were obtained.

Table 9 Hyperparameter tuning of LSTM algorithm.
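For illustration, a Keras sketch of an LSTM with the reported architecture (two hidden layers of 50 units with dropout on each) is shown below; the dropout rate, optimizer, loss, number of epochs, and batch size are placeholder assumptions, with the tuned values being those listed in Table 9.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# The three lagged SF values are reshaped to (samples, timesteps, features),
# here 3 timesteps of a single feature; the arrays are placeholders
X_train = np.random.rand(200, 3, 1)
y_train = np.random.rand(200)

# Two hidden LSTM layers of 50 units with dropout on each, as reported; the
# dropout rate, optimizer, loss, epochs, and batch size are placeholders
# (tuned values are in Table 9)
lstm = keras.Sequential([
    layers.Input(shape=(3, 1)),
    layers.LSTM(50, return_sequences=True),
    layers.Dropout(0.2),
    layers.LSTM(50),
    layers.Dropout(0.2),
    layers.Dense(1),   # predicted SF(t)
])
lstm.compile(optimizer="adam", loss="mse")
history = lstm.fit(X_train, y_train, validation_split=0.2,
                   epochs=50, batch_size=16, verbose=0)
print(min(history.history["val_loss"]))
```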

During each LSTM model’s training process, the train and validation loss vs epochs graphs are produced to graphically verify that the losses reduce and converge, and to ensure that overfitting does not occur. As a sample, the losses vs epochs graph for the best performing LSTM model (LSTM2) for the Johor data set is shown in Fig. 5. It can be seen that the validation loss is lower than the train loss, similar to Fig. 4. This is because of the small size of the validation set, which comprises 20% of the training data. The size of the validation set can be increased to reduce the train loss; however, it was found that the best SF predictions were obtained with the training data to validation data ratio set at 80:20. Therefore, this ratio was maintained and utilized in training the LSTM models. Additionally, the higher train loss may be due to the dropout regularization applied in the LSTM model structure. The dropout regularization was applied to reduce validation loss, hence leading to better generalization outside the validation and test sets. However, dropout regularization may sacrifice train accuracy to enhance validation accuracy, which may cause the train loss to be higher than the validation loss. On top of that, regularization methods are only applied during training and not during validation, which can also cause the train loss to be higher than the validation loss.

Figure 5 Train and validation loss vs epochs for LSTM2 model training process.

Performance measures

Four performance measures were utilized to evaluate the SF prediction models’ performances, namely the mean absolute error (MAE), root mean squared error (RMSE), coefficient of determination (R2), and ranking mean (RM). MAE, RMSE, and R2 have been frequently used in existing SF prediction studies4,13,14,15,18,34,43,54,59. RM was utilized by Ahmed et al.60 as a means to rank overall model performance.

Mean absolute error (MAE)

The MAE calculates the average absolute difference between predicted and actual values; hence a lower MAE is desired. The MAE is measured in cubic meters per second (m3/s) in the present study. MAE is calculated by:

$$MAE=\frac{1}{n}\cdot \left[\sum_{i=1}^{n}\left|{y}_{i}-\widehat{{y}_{i}}\right|\right]$$
(5)

where \({y}_{i}\) is the real value, \(\widehat{{y}_{i}}\) is the predicted value, and \(n\) is the sample size.

Root mean squared error (RMSE)

The RMSE is a metric that places a relatively high weight on large errors, hence making it a useful indicator of large errors. A lower RMSE is typically desired. In the present study, the RMSE is measured in units of cubic meters per second (m3/s). The following equation is used for the computation of RMSE:

$$RMSE=\sqrt{\frac{1}{n}\cdot \left[\sum_{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}\right]}$$
(6)

where \({y}_{i}\) is the real value, \(\widehat{{y}_{i}}\) is the predicted value, and \(n\) is the sample size.

Coefficient of determination (R2)

The R2 quantifies how well the predicted values reproduce the real values, with a maximum possible score of 1; scores at or below 0 indicate that the predictions perform no better than simply using the mean of the real values. An R2 closer to 1 signals strong agreement between real and predicted values. R2 scores are unitless. The following equation is used to calculate R2:

$${R}^{2}=1-\left[\frac{\sum_{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}}{\sum_{i=1}^{n}{\left({y}_{i}-\overline{{y }_{i}}\right)}^{2}}\right]$$
(7)

where \({y}_{i}\) is the real value, \(\widehat{{y}_{i}}\) is the predicted value, \(\overline{{y }_{i}}\) is the mean of the real values, and \(n\) is the sample size.

Ranking mean (RM)

To compute the RM, each model is first ranked based on the scores of the selected performance measures, which are MAE, RMSE, and R2 in the present study, with rank 1 assigned to the best score for each measure. Each model’s RM is then calculated as the average of its ranks across the MAE, RMSE, and R2 scores. A lower RM therefore signals a better overall performance of a model compared to the other models. RM is defined by:

$$RM=\frac{1}{n}\sum_{i=1}^{n}{rank}_{i}$$
(8)

where \(n\) is the number of performance evaluation measures used, which is 3 in the present study, and \({rank}_{i}\) is the model’s rank with respect to the ith performance measure.
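A minimal sketch of how the four measures can be computed, including the rank-then-average logic of the RM, is given below; the model names and scores in the usage example are illustrative placeholders, not results from the present study.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Return MAE and RMSE in m3/s and the unitless R2 for one model."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
        "R2": r2_score(y_true, y_pred),
    }

def ranking_mean(scores_per_model):
    """scores_per_model maps model name -> {'MAE', 'RMSE', 'R2'} scores.
    Rank 1 is best (lowest MAE/RMSE, highest R2); RM is the mean of the ranks."""
    models = list(scores_per_model)
    ranks = {m: [] for m in models}
    for metric, higher_is_better in [("MAE", False), ("RMSE", False), ("R2", True)]:
        ordered = sorted(models, key=lambda m: scores_per_model[m][metric],
                         reverse=higher_is_better)
        for position, m in enumerate(ordered, start=1):
            ranks[m].append(position)
    return {m: float(np.mean(r)) for m, r in ranks.items()}

# Illustrative usage with placeholder values for two hypothetical models
print(evaluate(np.array([10.0, 12.0, 11.0]), np.array([9.5, 12.5, 10.0])))
scores = {"model_A": {"MAE": 5.0, "RMSE": 9.0, "R2": 0.90},
          "model_B": {"MAE": 6.0, "RMSE": 8.5, "R2": 0.92}}
print(ranking_mean(scores))   # model_A: 1.67, model_B: 1.33
```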

Results and discussion

This section presents and discusses the performances of the developed models for SF prediction. A comparison and analysis are then made based on the model performances.

Performance of models based on the Sungai Johor, Johor data set

The best overall performance in predicting SF for the Sungai Johor, Johor data set was produced by model ANN3, which is based on the ANN algorithm and input parameter scenario 3. ANN3 outperformed the other models with MAE, RMSE, and R2 scores of 4.7235 m3/s, 10.0746 m3/s, and 0.9443 respectively, hence obtaining the best RM score of 1.00. SVR2 was the best SVR model (RM = 4.00), while LSTM2 was the best LSTM model (RM = 7.00). The models’ performance scores and actual vs predicted SF of best models based on each algorithm for the Sungai Johor test set are shown in Table 10 and Fig. 6 respectively.

Table 10 Models’ performance scores based on Sungai Johor test set.
Figure 6 Actual vs predicted SF of best models based on each algorithm for Sungai Johor test set.

Performance of models based on the Sungai Muda, Kedah data set

Model SVR3, based on the SVR algorithm and input parameter scenario 3, produced the best overall performance in predicting SF for the Sungai Muda, Kedah data set. SVR3 significantly outperformed the other models in terms of MAE with a score of 12.3853 m3/s, hence obtaining the best RM with a score of 1.67. ANN2 achieved the best RMSE and R2 with scores of 29.6536 m3/s and 0.8911 respectively. ANN2 was the best ANN model (RM = 2.67), while LSTM1 was the best LSTM model (RM = 7.00). The models’ performance scores and actual vs predicted SF of best models from each algorithm for the Sungai Muda test set are shown in Table 11 and Fig. 7 respectively.

Table 11 Models’ performance scores based on Sungai Muda test set.
Figure 7 Actual vs predicted SF of best models based on each algorithm for Sungai Muda test set.

Performance of models based on the Sungai Kelantan, Kelantan data set

The best overall performance in predicting SF for the Sungai Kelantan, Kelantan data set was produced by model SVR3, which is based on the SVR algorithm and input parameter scenario 3. SVR3 outperformed the other models with MAE, RMSE, and R2 scores of 73.0989 m3/s, 173.7072 m3/s, and 0.8529 respectively, hence obtaining the best RM score of 1.00. ANN3 was the best ANN model (RM = 2.67), while LSTM2 was the best LSTM model (RM = 7.33). The models’ performance scores and actual vs predicted SF of best models based on each algorithm for the Sungai Kelantan test set are shown in Table 12 and Fig. 8 respectively.

Table 12 Models’ performance scores based on Sungai Kelantan test set.
Figure 8 Actual vs predicted SF of best models based on each algorithm for Sungai Kelantan test set.

Performance of models based on the Sungai Melaka, Melaka data set

The best overall performance in predicting SF for the Sungai Melaka, Melaka data set was produced by model ANN1, which is based on the ANN algorithm and input parameter scenario 1. ANN1 outperformed the other models with MAE, RMSE, and R2 scores of 2.7113 m3/s, 6.0824 m3/s, and 0.6809 respectively, hence obtaining the best RM score of 1.00. SVR1 was the best SVR model (RM = 3.67), while LSTM1 was the best LSTM model (RM = 7.67). The models’ performance scores and actual vs predicted SF of best models based on each algorithm for the Sungai Melaka test set are shown in Table 13 and Fig. 9 respectively.

Table 13 Models’ performance scores based on Sungai Melaka test set.
Figure 9 Actual vs predicted SF of best models based on each algorithm for Sungai Melaka test set.

Performance of models based on the Sungai Kepis, Negeri Sembilan data set

The best overall performance in predicting SF for the Sungai Kepis, Negeri Sembilan data set was produced by model LSTM3, which is based on the LSTM algorithm and input parameter scenario 3. LSTM3 outperformed the other models with MAE, RMSE, and R2 scores of 0.4969 m3/s, 2.6430 m3/s, and 0.0202 respectively, hence obtaining the best RM score of 1.00. SVR1 and SVR2 were the joint-best SVR models (RM = 4.67), while ANN2 was the best ANN model (RM = 7.00). The models’ performance scores and actual vs predicted SF of best models based on each algorithm for the Sungai Kepis test set are shown in Table 14 and Fig. 10 respectively.

Table 14 Models’ performance scores based on Sungai Kepis test set.
Figure 10 Actual vs predicted SF of best models based on each algorithm for Sungai Kepis test set.

Performance of models based on the Sungai Pahang, Pahang data set

The best overall performance in predicting SF for the Sungai Pahang, Pahang data set was produced by model ANN3, which is based on the ANN algorithm and input parameter scenario 3. ANN3 outperformed the other models with MAE, RMSE, and R2 scores of 59.0621 m3/s, 100.9960 m3/s, and 0.9700 respectively, hence obtaining the best RM score of 1.00. SVR2 was the best SVR model (RM = 3.33), while LSTM2 was the best LSTM model (RM = 7.33). The models’ performance scores and actual vs predicted SF of best models based on each algorithm for the Sungai Pahang test set are shown in Table 15 and Fig. 11 respectively.

Table 15 Models’ performance scores based on Sungai Pahang test set.
Figure 11 Actual vs predicted SF of best models based on each algorithm for Sungai Pahang test set.

Performance of models based on the Sungai Perak, Perak data set

The best overall performance in predicting SF for the Sungai Perak, Perak data set was produced by model ANN2, which is based on the ANN algorithm and input parameter scenario 2. ANN2 outperformed the other models with MAE, RMSE, and R2 scores of 18.1337 m3/s, 29.3009 m3/s, and 0.8286 respectively, hence obtaining the best RM score of 1.00. SVR2 was the best SVR model (RM = 4.33), while LSTM3 was the best LSTM model (RM = 7.00). The models’ performance scores and actual vs predicted SF of best models based on each algorithm for the Sungai Perak test set are shown in Table 16 and Fig. 12 respectively.

Table 16 Models’ performance scores based on Sungai Perak test set.
Figure 12 Actual vs predicted SF of best models based on each algorithm for Sungai Perak test set.

Performance of models based on the Sungai Arau, Perlis data set

The best overall performance in predicting SF for the Sungai Arau, Perlis data set was produced by model ANN3, which is based on the ANN algorithm and input parameter scenario 3. ANN3 outperformed the other models with MAE, RMSE, and R2 scores of 0.5441 m3/s, 1.4007 m3/s, and 0.6857 respectively, hence obtaining the best RM score of 1.00. SVR1 was the best SVR model (RM = 4.00), while LSTM2 was the best LSTM model (RM = 7.00). The models’ performance scores and actual vs predicted SF of best models based on each algorithm for the Sungai Arau test set are shown in Table 17 and Fig. 13 respectively.

Table 17 Models’ performance scores based on Sungai Arau test set.
Figure 13 Actual vs predicted SF of best models based on each algorithm for Sungai Arau test set.

Performance of models based on the Sungai Selangor, Selangor data set

The best overall performance in predicting SF for the Sungai Selangor, Selangor data set was produced by model ANN3, which is based on the ANN algorithm and input parameter scenario 3. ANN3 outperformed the other models with MAE, RMSE, and R2 scores of 7.2175 m3/s, 13.9196 m3/s, and 0.8851 respectively, hence obtaining the best RM score of 1.00. SVR1 was the best SVR model (RM = 4.67), while LSTM3 was the best LSTM model (RM = 7.00). The models’ performance scores and actual vs predicted SF of best models based on each algorithm for the Sungai Selangor test set are shown in Table 18 and Fig. 14 respectively.

Table 18 Models’ performance scores based on Sungai Selangor test set.
Figure 14 Actual vs predicted SF of best models based on each algorithm for Sungai Selangor test set.

Performance of models based on the Sungai Dungun, Terengganu data set

The best overall performance in predicting SF for the Sungai Dungun, Terengganu data set was produced by model ANN1, which is based on the ANN algorithm and input parameter scenario 1. ANN1 outperformed the other models with MAE, RMSE, and R2 scores of 18.8022 m3/s, 51.8025 m3/s, and 0.8631 respectively, hence obtaining the best RM score of 1.00. SVR1 was the best SVR model (RM = 4.00), while LSTM1 was the best LSTM model (RM = 7.00). The models’ performance scores and actual vs predicted SF of best models based on each algorithm for the Sungai Dungun test set are shown in Table 19 and Fig. 15 respectively.

Table 19 Models’ performance scores based on Sungai Dungun test set.
Figure 15 Actual vs predicted SF of best models based on each algorithm for Sungai Dungun test set.

Performance of models based on the Sungai Klang, Kuala Lumpur data set

Model SVR3, based on the SVR algorithm and input parameter scenario 3, produced the best overall performance in predicting SF for the Sungai Klang, Kuala Lumpur data set. SVR3 outperformed the other models in terms of RMSE and R2 with scores of 6.6737 m3/s and − 0.0570 respectively, hence obtaining the best RM with a score of 1.33. SVR1 achieved the best MAE with a score of 3.8143 m3/s. ANN2 was the best ANN model (RM = 4.67), while LSTM3 was the best LSTM model (RM = 5.67). The models’ performance scores and actual vs predicted SF of best models based on each algorithm for the Sungai Klang test set are shown in Table 20 and Fig. 16 respectively.

Table 20 Models’ performance scores based on Sungai Klang test set.
Figure 16 Actual vs predicted SF of best models based on each algorithm for Sungai Klang test set.

Overall comparison and discussion of model performances

Two evaluations are considered in comparing and analysing the models’ performances. The first evaluation is the number of times a model produced the best predictive performance for a data set, and the second evaluation is the reliability of each model in producing SF predictions of relatively high accuracy. In the present study, ANN3 produced the best predictive performance for 4 out of the 11 tested data sets (Sungai Johor, Sungai Pahang, Sungai Arau, Sungai Selangor). Meanwhile, SVR3 was the most accurate model in 3 out of the 11 tested data sets (Sungai Muda, Sungai Kelantan, Sungai Klang); and ANN1 was the most accurate model in 2 out of the 11 tested data sets (Sungai Melaka, Sungai Dungun). Lastly, ANN2 and LSTM3 achieved the best SF predictions for one data set each, namely Sungai Perak and Sungai Kepis respectively. Overall, it is understood that ANN3 produced the most accurate SF predictive performances for more data sets in comparison to the other tested models. Additional analysis reveals that the algorithm and input scenario that produced the best SF predictive performance for the most data sets are the ANN and input scenario 3 respectively, as they produced the best SF predictions for 7 out of 11 data sets and 8 out of 11 data sets respectively. The matrix of most accurate algorithm and input scenario for each data set and the parameters with the highest number of best prediction results can be observed in Tables 21 and 22 respectively.

Table 21 Matrix of most accurate algorithm and input scenario for each data set.
Table 22 Parameters with highest number of best prediction results.

Next, the reliability of each model in producing relatively high-accuracy SF predictions based on different data sets is evaluated by calculating and comparing the average of the RM scores obtained by each model for all 11 tested data sets. This evaluation is significant to identify the predictive models that are most robust and most capable of adapting to different data sets which may vary in SF magnitude and behaviour, depending on spatial and temporal factors as well as the heterogeneity of water balance components. Based on Table 23 and Fig. 17, it is determined that ANN2 exhibits the best (lowest) average RM with a score of 3.21. This makes ANN2 the most reliable model in predicting SF with a relatively high accuracy for different data sets, in comparison to the other tested models. ANN3 produced the second-best average RM score (average RM = 3.27) which is very close to the ANN2 average RM score, while ANN1 produced the third-best average RM score (average RM = 3.79). Overall, it is found that the top three average RM scores were produced by the ANN models.

Table 23 Average RM of each model based on all data sets. Significant values are in bold.
Figure 17 Bar chart of average RM for each model based on all data sets.

The best model for SF prediction in the present study is then selected based on the findings with regards to the first evaluation, which is the number of times a model produced the best predictive performance for a data set; and the second evaluation, which is the reliability of each model in producing SF predictions of relatively high accuracy. For the first evaluation, Table 21 shows that ANN3 was the most accurate SF predictive model for 4 out of the 11 tested data sets, which is more than any of the other tested models. Through the second evaluation, it was found that ANN2 produced the best average RM as shown in Table 23 and Fig. 17, hence indicating that it was the most reliable model in producing relatively high-accuracy SF predictions. Therefore, the two evaluations propose different best models, namely ANN2 and ANN3. To determine the best overall model in the present study, the performances of ANN2 and ANN3 are compared side by side. With regards to the first evaluation, it can be seen in Table 21 that there is a clear and significant difference between the performance of ANN2 and ANN3, as ANN2 produced the best SF predictive performance for only 1 out of the 11 tested data sets while ANN3 managed to outperform the other models in 4 out of the 11 tested data sets. Meanwhile, the second evaluation shows that although ANN2 is superior compared to the other models, the difference between the average RMs of ANN2 and ANN3 is very small and negligible, as can be seen in Table 23 and Fig. 17. Based on these analyses, ANN3 is selected and proposed as the universal ML model that is capable of predicting SF with high accuracy for rivers within the region of Peninsular Malaysia. Although ANN2 obtained the best average RM score, it produced the best predictive performance for only 1 out of the 11 tested data sets, significantly fewer than ANN3, which outperformed all the other models for 4 out of the 11 tested data sets; this is why ANN3 was selected as the best model.

Table 21 and Fig. 17 highlight ANN as the most suitable and successful algorithm in the present study, while SVR is the second-best algorithm and LSTM is the poorest performing algorithm. The LSTM predictive performance was significantly poorer than that of the ANN and SVR algorithms, as the LSTM was able to outperform ANN and SVR for only one data set while exhibiting the poorest average RMs out of all the algorithms. The poor performance of LSTM in the present study is attributed to the volatility and lack of clear time pattern in the SF data sets, as LSTMs are generally effective in solving problems with clear time patterns. On the other hand, ANN and SVR performed better because they are regression-based methods, which appear to be more suited to the current problem of predicting SF in Peninsular Malaysia.

The superiority of the ANN algorithm over the other algorithms in predicting SF may be attributed to the advantages of the ANN algorithm in general. In addition to being able to easily handle large data sets; detect complex non-linear relationships; and easily relate input and output parameters without the need for complex mathematical calculations, the ANN algorithm is also able to learn by itself and produce output or predictions that are not limited to the input provided to it. These advantages appear to have facilitated high-accuracy SF predictive performances by the ANN algorithm, as the ANN algorithm was able to produce the best SF predictive performance for the most data sets (7 out of 11 data sets) compared to the other algorithms. On top of that, it can be seen in Figs. 6 to 16 that the ANN algorithm predicts the extreme SF values or SF spikes more accurately compared to the other algorithms. Input scenario 3 is found to be the most successful when coupled with the ANN algorithm, as the ANN3 model outperformed all other models in 4 out of the 11 tested data sets while obtaining among the best average RM scores in the present study. This may be because input scenario 3 provides an optimum amount of useful historical SF input that can be used by the ANN algorithm to make accurate SF predictions, hence enabling the ANN3 model to produce highly accurate SF predictions and outperform the other SF predictive models in the present case study.

When compared to existing studies, the findings of the study by Ateeq-ur-Rauf25 agree with the findings of the present study, as the ANN algorithm outperforms the SVM algorithm. Additionally, other existing studies also point towards ANN as the superior ML algorithm for SF prediction when compared to other ML algorithms26,27,28,29. On the contrary, there are also existing studies that contradict the present study’s findings, as they have shown the SVM and LSTM algorithms to perform better in predicting SF compared to the ANN algorithm6,13,14,16,17,42,43. This may be due to differences in the experimental setup relating to elements such as input and output parameters; forecast horizons; data set characteristics such as the number of data sets and amount of data available for training and testing; study location; magnitude and behaviour of SF in the selected river; and ML algorithm hyperparameter setup. In the present study, the SVM algorithm has indeed shown that it is capable of outperforming the ANN algorithm, as it predicted SF better in 3 out of the 11 tested data sets, namely the Sungai Muda, Sungai Kelantan, and Sungai Klang data sets. However, the ANN algorithm is superior on an overall scale as it outperformed both the SVM and LSTM algorithms in 7 of the remaining 8 tested data sets while also obtaining better average RMs, as shown in Table 21 and Fig. 17. Therefore, it can be concluded that the ANN algorithm is the most accurate and effective ML algorithm for SF prediction when the present study’s experimental setup is applied, which includes a univariate approach that uses lagged daily average SF to predict current daily average SF for 11 different data sets from rivers throughout Peninsular Malaysia. Although the ANN3 model has produced good SF predictive performance in the present study, it can still potentially be improved. Hybridization and usage of optimization algorithms to improve the selection of ML algorithms’ hyperparameters may enhance prediction capability and accuracy. Rainfall data may also be obtained and utilized as an input parameter to improve SF predictive performance, given that rainfall has been shown in existing studies to have a correlation with SF12,34,61. These elements have not been investigated in the present study; hence they are suggested for future implementations.

Conclusion

In the present study, daily average SF time series data for 11 different rivers throughout Peninsular Malaysia were collected and utilized for the development of ML models that predict future SF. Three types of ML algorithms were used, namely SVM, ANN, and LSTM. The quantitative analyses show that the ANN3 model, which is based on the ANN algorithm and input scenario 3 (inputs comprising the previous 3 days of SF data), represents the best performing model for SF prediction in the present study. ANN3 outperformed all the other tested models in predicting SF for the greatest number of data sets, which is 4 out of the 11 tested data sets. This model also exhibited among the best average RM scores, which indicates that it is highly reliable in producing accurate SF predictions for different data sets which may vary in terms of SF behaviour and magnitude. Additionally, it was found that the algorithm and input scenario that were most effective as model components in predicting SF were the ANN and input scenario 3. The ANN algorithm produced the most accurate SF predictions for 7 out of the 11 tested data sets, while the usage of input scenario 3 led to the best SF predictions for 8 out of the 11 tested data sets.

In conclusion, the present study set out to address the research gap in which a single ML model capable of accurately predicting SF for multiple different rivers within Peninsular Malaysia is yet to be developed and proposed, as the majority of existing studies have focused on the development of SF predictive models based on only one data set or river case study. This research gap has been addressed in the present study by developing and testing 99 ML models, based on different established ML algorithms, input scenarios, and SF data sets in Peninsular Malaysia; and proposing the best performing ML model as a universal model that is capable of predicting SF for rivers within the study region. Based on the findings, the present study proposes the ANN3 model as the universal model that is most capable of SF prediction for rivers within Peninsular Malaysia, hence the main objective of the present study is achieved. Ultimately, it is hoped that the findings from the present study will contribute towards the respective body of knowledge and aid organizations in mitigating the effects of environmental hazards, particularly droughts and floods, through effective and accurate SF predictions using ML models. Further improvement of the ANN3 model for SF prediction in Peninsular Malaysia can be considered as the focus or topic of future studies. Hybridization and utilization of optimization algorithms or more advanced techniques may be used with the ANN3 model to enhance the capability of identifying optimal hyperparameters, resulting in possibly improved accuracy of the model. Rainfall data may also be implemented as an input parameter to improve SF prediction.