Odor emissions from treatment plants are one of the major environmental issues, with negative health consequences and repercussions on economic, commercial, and touristic activities. Addressing this problem requires an accurate assessment of the odor sources. In this paper, different machine learning methods are applied to identify the most suitable model for estimating odor concentrations from the responses of a multiparametric system. Random forest regression is observed to show superior performance compared to the other methods. In this context, advanced data analytics technologies, such as machine learning methods, provide data-driven decision-making capabilities for the challenges that arise in analyzing and evaluating sustainable development. The findings of the proposed study can help implement proactive actions to minimize the effects of odors and prevent potential health and environmental concerns.
1 Introduction
The balance between environmental protection and social responsibility requires the development of adequate and effective territorial monitoring actions aimed at controlling all possible sources of pollution. Odors emitted by treatment plants represent a significant ecological and social problem (Lewkowska et al. 2016), as they negatively affect the environment by stimulating the human sense of smell and causing unpleasant odor sensations (Capelli et al. 2019). Emissions originating from wastewater treatment plants (WWTPs) and industrial facilities can generate unpleasant odors that cause annoyance to residents in close proximity (Zarra et al. 2021b). Investigating odor concentrations in wastewater treatment facilities is challenging because of the complex architecture of the plants, which emit odors from distinct sections, making it crucial to detect their precise origin and to quantify their actual influence on the environment. Odor concentration assessment and accurate prediction are useful to urban planners to ensure community engagement, environmental management, and effective urban planning. Residential areas and other sites, such as hospitals and schools, need to be placed at an appropriate distance from odor concentration zones. In addition, accurate odor prediction allows industry to properly implement odor control strategies.
Generally, measuring odor concentration is challenging because of the nature of gaseous compounds and the complexity of their detection. These challenges encompass the dynamic nature of odors, chemical composition, cost and time constraints, and spatial variability.
There are numerous methods for measuring odor concentration, each with its own advantages and limitations. According to Muñoz et al. (2010) and Hawko et al. (2021), odor measurement techniques fall into two groups: analytical, which includes the identification of specific compounds, the electronic nose, and gas chromatography-mass spectrometry (GC-MS); and sensory, which includes dynamic olfactometry, field surveys, and resident records.
In this context, dynamic olfactometry is the standardized reference method for measuring odor concentration (Zarra et al. 2012), as described by the European standard EN 13725, and is widely used for addressing odor-related complaints.
Its merit lies in its ability to measure the comprehensive impact of odor on human perception, utilizing the human nose as a sensor (Gostelow et al. 2001; Parabucki et al. 2019). The method also offers further benefits, such as its suitability for feeding atmospheric dispersion models and the superior sensitivity of the human nose compared to electronic devices. Its measurements have high accuracy and a small relative error. However, high cost and long analysis time severely limit its applicability for routine detection.
Another sensory method for measuring odor concentration is the field survey. Field inspection is a standardized methodology that involves the collaboration of a group of individuals who assess and quantify odors in the environment. It encompasses several steps, including the formulation of a sampling plan, the use of equipment for odor sample collection, on-site observations, the gathering of odor samples from numerous locations, the analysis of the collected data to determine odor concentrations, and the assessment of other crucial parameters.
On the other hand, the analytical method based on GC-MS is used to monitor variations in odor emissions during industrial processes and to identify the most critical phases (Moufid et al. 2021). This method has found extensive application in the study of air quality, yielding a list of involved substances along with their respective concentrations (Davoli et al. 2003). However, the primary drawback of this approach is linked to the complex nature of the odor since the detected fragrance arises from numerous volatile compounds (Yuwono and Schulze Lammers 2003).
Another type of analytical methodology is the electronic nose. Electronic noses, also referred to as sensor arrays or instrumental odor monitoring systems (IOMS), have been considered valid substitute tools for dynamic olfactometry (Bax et al. 2020). Indeed, as defined within CEN/TC264/WG41, these systems allow continuous monitoring and provide predictions of odor concentration that can be compared to the values obtained by dynamic olfactometry according to EN 13725 (Burgués et al. 2022; Settimo and Avino 2024). The multivariate response of the gas sensor array of an electronic nose can be elaborated to provide qualitative and quantitative characteristics of the odors to which it is exposed (Bax and Capelli 2023b). For this reason, the electronic nose has been implemented in WWTPs (Capelli and Sironi 2017; Prudenza et al. 2022), where the presence of multiple and extensive wastewater tanks is usually associated with the emission of unpleasant odors into the atmosphere (Blanco-Rodríguez et al. 2018). Currently, electronic noses are regarded as the most promising instruments for monitoring odors (Bax et al. 2020; Capelli et al. 2014). The primary objective of these devices is to establish the relationship between sensor response and odor concentration. In terms of the objective evaluation of odor nuisance, the joint use of the dynamic olfactometer and the electronic nose has many advantages over the other techniques (Littarru 2007).
A number of studies have evaluated the effectiveness of electronic noses in WWTPs. A recent study by Jońca et al. (2022) demonstrated the application of electronic noses in WWTPs, with attention mainly focused on refining waste treatment methods such as composting, anaerobic digestion, and biofiltration, among others. The results showed that electronic noses serve as effective tools for overseeing waste treatment processes and evaluating the impact of odors. One notable advantage of employing electronic noses is their capability to determine the odor impact on receptors or on plant fence lines without the absolute necessity of precisely characterizing the odor source, as mentioned by Capelli et al. (2014). Similar outcomes have been demonstrated by other researchers in various WWTP scenarios, including composting (D’Imporzano et al. 2008), anaerobic digestion (Savand-Roumi et al. 2022), biofiltration (López et al. 2011), landfill (Lucernoni et al. 2016), complete plants (Giungato et al. 2016), and even biodiesel production facilities (Mahmodi et al. 2022). Furthermore, Burgués et al. (2022) employed IOMS to monitor the odor from a WWTP in Molina de Segura, Spain. It is worth pointing out that, in recent years, different studies and applications have been developed to analyze data from IOMS in order to identify the most relevant odor source, as well as to assess the odor impact of various industrial activities (Bax and Capelli 2023a). Among the most recent works on this subject, it is also interesting to include those oriented to evaluating the ability of odor sensors to distinguish between treatment plant sections by using a multiclass random forest classifier (Distefano et al. 2024). Cangialosi et al. (2021) provided a field IOMS application to classify odor sources and quantify odor concentrations by using ML algorithms.
On the other hand, several statistical methods have been suggested in the literature to evaluate the applicability and usefulness of machine learning (ML) models in predicting odor concentrations from WWTPs. Byliński et al. (2019) documented the release of odor concentrations from post-digestion sludge in a WWTP. Their approach involved the analysis of volatile fatty acid content and pH levels in the sludge; an artificial neural network (ANN) was applied for prediction, yielding a mean absolute percentage error of \(30\%\). Kang et al. (2020) conducted a study to predict odor concentration from a WWTP using ANNs, achieving a predictive accuracy of \(70\%\) with water quality variables. Bagherzadeh et al. (2021) employed ANN, random forests (RF), and gradient boosting machines (GBM) as independent methodologies; the input parameters, namely ammonia nitrogen (NH-N), biological oxygen demand (BOD), chemical oxygen demand (COD), and mixed liquor suspended solids (MLSS), were used to predict odor concentration in wastewater. Even if the GBM exhibited the highest level of accuracy, further efforts were still required to improve the final results. Zarra et al. (2021a) applied partial least squares (PLS), ANN, multivariate adaptive regression splines (MARS), and response surface regression (RSR) to the responses from IOMS, finding that ANN performed best, followed by MARS. Jiang et al. (2023) applied RF, extreme gradient boosting (XGBoost), and light gradient boosting machine (LGBM) to WWTPs.
RF regression performed very well, highlighting its importance in predicting odor concentration from WWTPs. In terms of mean absolute error, the RF model’s performance was comparable to that of the XGBoost model, while LGBM exhibited the lowest performance. Concerning mean squared error, the RF and XGBoost models were also similar, but the RF’s mean squared error was smaller, indicating better stability. RF also has the advantage of providing reliable variable importance estimates, which sets it apart from other approaches.
Given the above considerations, this study aims to develop prediction models using four different ML regression methods, namely ANN, k-nearest neighbor (k-NN), RF and XGBoost, to estimate odor concentration from an IOMS. This choice is based on their capabilities and their capacity to adapt to the complex nature of the problem. The k-NN algorithm is a straightforward and efficient method that exploits the similarity between data points to produce predictions, which makes it well-suited for predicting odor concentration from historical data (Kang et al. 2020). The ANN is renowned for its capacity to learn complex patterns and correlations within the data and is therefore well suited to predicting odor concentrations (Alsulaili and Refaie 2020). By combining many decision trees, RF is an effective ensemble learning method that improves prediction accuracy and takes into account complex interactions between variables (Shen et al. 2023). XGBoost is a gradient-boosting framework known for its exceptional performance in managing extensive datasets and capturing complex patterns (Nurhayati et al. 2023). Overall, this methodology yields precise and effective predictions of odor concentrations in WWTPs, rendering it the favored option for odor monitoring and management (Shen et al. 2023; Nurhayati et al. 2023).
The fundamental purpose of this work is to create multivariate statistical models that may be used to estimate odor concentration based on responses from IOMS. The methods will be appropriately implemented for the study’s objectives, and the results obtained using the four methods will be compared to determine the best approach.
The rest of this paper is organized as follows: in Sect. 2, the theoretical framework of the ML methods used herein is presented. In Sect. 3, the main statistical indices used for model performance evaluation are reviewed. Then, a case study concerning odor concentration, with data collected by electronic noses installed at WWTPs operating in the Apulia region (Italy), is developed (Sect. 4). In this section, exploratory data analysis, data pre-processing, variable selection and ML modeling are thoroughly discussed. The predictive performances of the proposed ML models are compared in Sect. 5. Finally, Sect. 6 is devoted to the conclusions and some remarks about possible directions for future research.
2 Theoretical framework
As described by Jordan and Mitchell (2015), the term ML indicates a set of techniques whose algorithms learn the relationships underlying the study data and use the acquired knowledge to predict patterns in new raw data. Typically, it is necessary to process the data thoroughly in order to construct the dataset and accurately identify the underlying laws. The features of the input data and the specifications for the output data are then used to select a suitable ML algorithm. To build the appropriate model, the selected method needs to be tested after being trained using carefully prepared data. The resulting ML model is then capable of making predictions on fresh data. While the application of ML in environmental science lags behind other fields, it has shown success in predicting both airborne and waterborne pollutants (Kerckhoffs et al. 2019).
In this paper, four ML methods, namely ANN, k-NN, RF and XGBoost, have been applied to predict odor concentration using the measurements from IOMS installed at various WWTPs, and a comparison of their prediction abilities has been developed. Therefore, in the following subsections, a brief description of the above-mentioned ML approaches is provided.
2.1 Artificial neural networks
Neural networks are among the most widely used methods for prediction. An ANN consists of input, hidden, and output layers, each playing a distinct role: the input layer distributes the data into the network, the hidden layers capture complex patterns, and the output layer generates forecasts for given inputs (Moustris et al. 2010), as illustrated in Fig. 1. In each structure, the input data is conveyed through the hidden layer and eventually reaches the network’s output, if acceptable; otherwise, if the computational discrepancies are significant, the calculations are iterated until satisfactory outcomes are achieved (Malik et al. 2017). According to Yu et al. (2011), in neural network implementations each input signal coming from other nodes or external sources is a real number with an assigned weight that reflects its relevance relative to the other inputs. The weighted inputs are added together, and the node’s activation is calculated by feeding this weighted sum to an activation function.
Fig. 1 Schematic representation of an ANN
The activation function of a node is an abstraction of the action potential in the cell body of biological neurons. The neurons are the fundamental processing units that take in input signals, process them using an activation function, and then output the results (Alsulaili and Refaie 2020).
Thus, the data is moved from the hidden layer to the output layer using activation functions (Mohd Najib et al. 2020; Hmoud Al-Adhaileh and Waselallah Alsaade 2021). Each neuron’s weight and the transfer functions are passed along as the signal travels from one layer to the next (Kang et al. 2020). Several non-linear functions are used to provide the desired outcome (Matheri et al. 2021). A sigmoidal-type activation function, for example, can be utilized to speed up network mapping, whereas a logistic sigmoid function may cause the network to become stuck at a local minimum or maximum during the training period (Sundui et al. 2021). In addition, the sigmoid function has an S-shape curve ranging between 0 and 1; for this reason it can be interpreted as a probabilistic function.
It is necessary to sample the inputs together with their pre-assigned outputs (referred to as labels). The model output for those inputs is then compared to the labels through a loss function, such as the squared error or the absolute error. The data must be scaled correctly before being used to train neural networks: if the scale of the data varies across features, the optimization may favor the portions of the sample with a larger scale, since they contribute more to the loss function.
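To make the weighted-sum-plus-activation mechanics described above concrete, the following minimal NumPy sketch performs a single forward pass through one hidden layer and evaluates a squared-error loss; the layer sizes, random weights, and dummy label are illustrative assumptions, not values from the study.

```python
# Minimal sketch of one forward pass: weighted sum of (scaled) inputs fed to
# an activation, then a linear output unit for regression. All numbers are
# random placeholders, not trained values.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # Logistic sigmoid: S-shaped curve mapping any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = rng.random(6)             # one (already scaled) sensor reading vector
W1 = rng.normal(size=(4, 6))  # input-to-hidden weights
b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4))  # hidden-to-output weights
b2 = np.zeros(1)

h = sigmoid(W1 @ x + b1)      # weighted sum of inputs fed to the activation
y_hat = W2 @ h + b2           # linear output for regression
loss = (y_hat - 1.0) ** 2     # squared-error loss against a dummy label
```

Training would iterate this pass over the sample, adjusting the weights to reduce the loss.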
One of the main advantages of the neural network is its ability to model complex and non-linear relationships. However, there are several drawbacks to ANN that should not be overlooked, with the “black box” aspect ranking as the most important (Kannangara et al. 2018). ANNs are typically unable to explain their reasoning process and provide supporting evidence. For training, ANNs require a lot of data. When there is a lack of training data, they might not work effectively. Another drawback is that ANN is vulnerable to overfitting when the data is insufficient and/or the model structure is very complicated (Younes et al. 2015), which will result in low prediction accuracy for cases outside the input dataset.
2.2 k-nearest neighbor
k-NN is a non-parametric and flexible method that does not assume a functional form for the dependent variable (James et al. 2021). The method predicts the output value of the data based on the k-nearest data points in the training set. The parameter k holds significance in determining the effectiveness of k-NN, serving as a vital tuning factor, as emphasized in research by Qian et al. (2015). To predict the labels of unknown samples among k-selected samples, the average of the dependent variables is computed, as shown in different studies by Akbulut et al. (2017) and Wei et al. (2017). Using a bootstrap procedure facilitates the determination of the parameter k. According to Imandoust and Bolandraftar (2013), the k-NN for prediction requires the definition of the training set \(P = \{(x_1, y_1), (x_2, y_2),..., (x_N, y_N)\}\) with a distance metric d, where \(x_i = (x_{i1}, x_{i2},..., x_{im})\) is the i-th instance (\(i=1, \ldots , N\)) characterized by m features with its output \(y_i\). Given a test instance x, it is necessary to calculate the distance \(d_i\) between the test instance and each instance \(x_i\) in P and sort the distances. If \(d_i\) is ranked in the i-th position, its output is denoted as \(y_i(x)\) and the corresponding instance is known as the i-th nearest neighbor \(NN_i(x)\). Finally, the regression output for x, denoted by \(\hat{y}(x)\), is the average of the outputs of its k closest neighbors, i.e., \(\hat{y}(x) = \frac{1}{k}\displaystyle\sum _{i=1}^{k} y_i(x)\).
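As a hedged illustration of the procedure just described, the sketch below computes the distance from a test instance to every training instance, sorts them, and averages the responses of the k nearest neighbors; the data points are synthetic placeholders.

```python
# Minimal k-NN regression: distances d_i, sort, average the k nearest outputs.
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    # Euclidean distance d_i between x_test and each training instance x_i
    d = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k nearest neighbors NN_1(x), ..., NN_k(x)
    nn = np.argsort(d)[:k]
    # Regression output: average of the k neighbors' responses
    return y_train[nn].mean()

X_train = np.array([[0.1, 0.2], [0.4, 0.4], [0.9, 0.8], [0.3, 0.1]])
y_train = np.array([10.0, 20.0, 50.0, 12.0])
print(knn_predict(X_train, y_train, np.array([0.2, 0.2]), k=2))  # 11.0
```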
As suggested by Connor and Kumar (2010), the advantages of k-NN analysis are the capacity to handle big datasets, low space consumption, improved cache efficiency, the fast creation of k-NN graphs in practice on multi-core machines, and ease of parallelization and implementation. However, there are some disadvantages: the choice of the distance or similarity measure has a large impact on k-NN performance, and some measures are less influenced by additional noise than others (Prasath et al. 2017). Moreover, there are no general strategies for making an empirical choice of k (Hall et al. 2008).
2.3 Ensemble learning algorithms
RF and XGBoost are ensemble learning algorithms, which combine multiple ML models. In general, an ensemble learning algorithm joins several ML methods into a single predictive model in order to reduce variance using bagging, reduce bias using boosting, or improve prediction accuracy using stacking (Zhang and Ma 2012). Ensemble learning is based on the notion of combining multiple weak learning models to create a powerful learning model that improves ML performance, resulting in more accurate predictions than those obtained by a single model. The term “boosting algorithm” is commonly employed to denote a variety of strategies that enhance the capabilities of weak learners, transforming them into powerful ones (Kearns 1988). In order to enhance the model’s ability to generalize, many fundamental models are combined (Rajsingh et al. 2018). The boosting technique improves prediction accuracy by iteratively constructing several models, with a focus on the data that are challenging to estimate. The primary distinction between the bagging and boosting methodologies lies in their respective approaches to reducing prediction variance. Bagging generates additional training data by resampling with replacement, producing multiple versions of the original dataset; boosting, on the contrary, modifies the weight of an observation based on the previous iteration. Unlike the bagging strategy, which selects each sample with equal probability when generating a training dataset, the boosting technique selects samples with non-uniform probability: samples with a higher weight, namely those estimated poorly by earlier models, are more likely to be selected, so that any subsequent model can focus on the specific samples that were incorrectly handled by its predecessors (Zhang and Haghani 2015). Figure 2 displays the difference between bagging and boosting, both of which belong to the category of ensemble learning and can be used to reduce the prediction variance and improve accuracy. More specifically, in Fig. 2a the prediction is performed by simply averaging the individual results to obtain an overall prediction; the boosting method, illustrated in Fig. 2b, is instead an iterative procedure that, at each iteration, creates a new model whose data weights are assigned according to the errors of the previous model.
Fig. 2 a Bagging and b Boosting algorithms
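The contrast in Fig. 2 can be reproduced with off-the-shelf components, as in the following sketch: bagging averages trees fitted to bootstrap resamples, while AdaBoost, used here as one concrete example of a sample-reweighting boosting scheme, increases the weight of observations that earlier models predicted poorly. The dataset and settings are illustrative only.

```python
# Bagging vs. boosting on synthetic regression data with shallow trees as the
# weak learners; scores are in-sample R^2 for illustration only.
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# Bagging: each tree sees a uniform bootstrap resample; predictions averaged
bagging = BaggingRegressor(DecisionTreeRegressor(max_depth=3),
                           n_estimators=100, random_state=0).fit(X, y)
# Boosting: each tree focuses on observations the previous trees got wrong
boosting = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3),
                             n_estimators=100, random_state=0).fit(X, y)
print(bagging.score(X, y), boosting.score(X, y))
```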
2.3.1 Random forest
RF is an ensemble technique that combines the predictions of multiple regression trees into a single prediction. The fundamental concept underlying RF is bagging, as described by Lahouar and Slama (2015): bagging aims to reduce the variance of the model while minimizing the increase in bias. RF possesses one additional feature compared to the typical bagging procedure: each split made during the learning phase is determined by a randomly selected subset of the features. Integrating feature bagging (a form of random subspace technique) with the standard sample bagging method further decreases the correlation among the individual trees in the ensemble. Data points are selected at random, and each tree is trained independently of the others; moreover, previously selected data points can be re-selected, as the procedure involves replacement. The final prediction is obtained by averaging the predictions of all the trees. The RF regression prediction can be formalized as proposed by Narayanan et al. (2021).
Let D be a training dataset with n observations, used to construct prediction rules by executing the RF algorithm on K trees, and let \(D_{test}\) be the test dataset used to evaluate the predictions. Let \(f_i\), with \(i = 1,..., n_{test}\), be the true response for the i-th observation from \(D_{test}\), which can either be a numerical value (in the case of regression) or the binary label 0 vs. 1 (in the case of binary classification). The predicted value obtained as the output of the entire RF is denoted by \(\hat{f}_i\), while the predicted value output from the k-th tree is indicated as \(\hat{f}_{ik}\), with \(k = 1,..., K\). In the case of regression, \(\hat{f}_i\) is obtained by averaging the predicted outputs over the K trees, i.e., \(\hat{f}_i = \frac{1}{K}\displaystyle\sum _{k=1}^{K} \hat{f}_{ik}\).
RF has some advantages compared to other statistical modeling approaches (Tasan et al. 2008). Both continuous and categorical variables can be handled. Furthermore, it is worth mentioning that RF only involves two hyper-parameters: the number of variables in the random subset at each node and the number of trees in the forest. These hyper-parameters are generally not highly sensitive, which is why using the default values is often a prudent choice (Liaw and Wiener 2001). In addition, RF utilizes a variable relevance index (Breiman 2001) to rank variables, considering the interaction between factors and indicating the significance or “importance” of each variable. The RF technique determines the relevance of a variable by measuring the increase in prediction error when the data for that variable are permuted in the observations not included in the bootstrap sample, while keeping all other variables unchanged. This dataset, referred to as out-of-bag (OOB) data by Breiman (2001), is used for the calculation. During the construction of the RF, the necessary calculations are carried out sequentially for each tree. Díaz-Uriarte and Alvarez de Andrés (2006) stated that the significance score is commonly presented in a hierarchical arrangement based on rank. Nevertheless, in the literature there is no specification of the optimal number of trees that should be used to train an RF, and it has been found that increasing the number of trees does not always result in a significantly better performance compared to an RF with few trees (Paul et al. 2018; Oshiro et al. 2012).
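A minimal sketch of these ideas, using scikit-learn on synthetic data standing in for the sensor measurements, is given below. The two hyper-parameters discussed above appear as n_estimators and max_features; note that scikit-learn's built-in feature_importances_ is impurity-based, whereas the permutation-based OOB importance described in the text corresponds, for instance, to the importance measure of R's randomForest package.

```python
# RF regression sketch: 500 trees, 3 candidate variables per split, OOB score.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=285, n_features=6, noise=5.0, random_state=1)

rf = RandomForestRegressor(
    n_estimators=500,  # number of trees grown on bootstrap samples
    max_features=3,    # variables tried at each split (mtry)
    oob_score=True,    # evaluate on the out-of-bag (OOB) observations
    random_state=1,
).fit(X, y)

print("OOB R^2:", rf.oob_score_)
# Impurity-based importances, as would appear in a variable importance plot
print(sorted(rf.feature_importances_, reverse=True))
```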
2.3.2 XGBoost
XGBoost operates similarly to a decision tree by building a predetermined number of trees that correct the mistakes produced by the previous tree, hence improving its prediction capacity (Pesantez-Narvaez et al. 2019). XGBoost represents an advanced and scalable version of the GBM method and employs ensembles of decision trees to enhance the goodness-of-fit and maximize performance (Chen and Guestrin 2016). The approach recalls regularization and boosting techniques, in which decision trees rectify mistakes caused by preceding trees to enhance the accuracy of predictions (Chen et al. 2018). XGBoost offers several hyper-parameters for adjusting in order to improve the performance of the model. For example, the first regularization term incorporates a shrinkage parameter that acts similarly to lasso regression, punishing the magnitudes of variable weights. In addition, the second regularization term penalizes the squared magnitudes of variable weights. The gamma parameter determines the least loss required for a tree to be divided. Moreover, column subsampling is used in the model to make it behave like RF regression by picking a subset of features (columns) for each iteration of boosting when decision trees are being built. This involves random selection of a certain proportion of predictors in each iteration, as described by Chen and Guestrin (2016). Additionally, the risk of overfitting is restricted by employing shrinkage and selective column sampling.
The better success of XGBoost may be attributed to its shorter training times (Trizoglou et al. 2021) and to the ability of sophisticated ML methods to capture non-linear relationships between variables that traditional regression is unable to discover. However, XGBoost is a complex algorithm that requires the tuning of numerous hyper-parameters; finding the appropriate hyper-parameters can be time-consuming and requires a lot of experimentation. Moreover, XGBoost generally requires a substantial volume of data to achieve optimal performance. When data is scarce, less complex models may achieve better performance.
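The regularization knobs mentioned above map onto the parameters of the xgboost Python API roughly as in the following sketch; the data and parameter values are illustrative assumptions, not the tuned configuration of this study.

```python
# XGBoost regression sketch exposing the regularization parameters discussed
# in the text; synthetic data stands in for the sensor readings.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(2)
X = rng.random((285, 6))
y = X @ rng.random(6) + rng.normal(scale=0.1, size=285)

model = xgb.XGBRegressor(
    n_estimators=200,      # number of boosting rounds (nrounds)
    learning_rate=0.1,     # shrinkage (eta)
    max_depth=3,
    reg_alpha=0.1,         # L1 penalty on leaf weights (lasso-like)
    reg_lambda=1.0,        # L2 penalty on squared leaf weights
    gamma=0.0,             # minimum loss reduction required to split a node
    colsample_bytree=0.8,  # column subsampling, RF-like behavior
)
model.fit(X, y)
print(model.predict(X[:3]))
```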
3 Model performance evaluation
In this section, the metrics used to evaluate the performance of the prediction models obtained with the different ML approaches are briefly described.
Let \(\hat{y}_i\) denote the predicted values and \(y_{i}\) the observed ones, with \(i=1,\ldots , N\), and let \(\bar{y}\) be the mean value of the observed data. The model performance metrics can be defined as described below.
1.
Mean squared error (MSE), which represents the average of the squared errors, i.e.
\(\text{MSE} = \dfrac{1}{N}\displaystyle\sum _{i=1}^{N} (y_i - \hat{y}_i)^2.\)
This ensures that the direction of the error is irrelevant and that all errors receive the same weight. A lower MSE value indicates superior predictive performance.
2.
Mean absolute error (MAE), which measures the average magnitude of the errors in a collection of predictions, regardless of their direction (Chai and Draxler 2014), is defined as follows:
\(\text{MAE} = \dfrac{1}{N}\displaystyle\sum _{i=1}^{N} |y_i - \hat{y}_i|.\)
3.
Root mean squared error (RMSE), which represents the most commonly used evaluation and comparison tool for predictive models, according to Kuhn and Johnson (2013). It computes the square root of the average squared error between observed and predicted values as follows:
\(\text{RMSE} = \sqrt{\dfrac{1}{N}\displaystyle\sum _{i=1}^{N} (y_i - \hat{y}_i)^2}.\)
Although it is quite similar to the MAE, it penalizes larger absolute errors by giving them more weight. The more widely MAE and RMSE diverge, the greater the variance in the individual errors. The smaller the value of RMSE, the better the model (Nguyen et al. 2021).
4.
Coefficient of variation of the root mean squared error (CV-RMSE), which is a measure of the cumulative errors normalized to the mean of the observed values (Royapoor and Roskilly 2015). It is mainly useful in showing the degree of error accumulation, thus offering insights into the model’s accuracy. It measures how much the predicted series varies from the observed one, as follows:
\(\text{CV-RMSE} = \dfrac{\text{RMSE}}{\bar{y}}.\)
5.
Nash-Sutcliffe efficiency (NSE), which is a normalized statistic comparing the residual variance (also known as “noise”) with the variance of the measured data (Nash and Sutcliffe 1970), i.e.,
\(\text{NSE} = 1 - \dfrac{\sum _{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum _{i=1}^{N}(y_i - \bar{y})^2}.\)
NSE, which can range from \(-\infty\) to 1, helps evaluate the predictive ability of a model. In particular, NSE equal to 1 indicates complete agreement between predictions and observations, while an NSE \(< 0\) implies that the predictions are less accurate than the observation mean.
6.
Coefficient of determination (\(R^2\)), which quantifies the extent to which the independent variables in a regression model account for the variance in the dependent variable (Pan et al. 2022), as given below:
\(R^2 = \dfrac{\text{SSE}}{\text{SST}},\)
where \(\text{SST} =\displaystyle \sum _{i=1}^N (y_i - {\bar{y}})^2\) and \(\text{SSE} = \displaystyle \sum _{i=1}^N (\hat{y}_i - {\bar{y}})^2.\) An \(R^2\) value equal to 1 suggests a perfect match between predictions and actual values.
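A compact sketch computing all six metrics with NumPy is given below; y and y_hat are illustrative stand-ins for observed and predicted odor concentrations, and \(R^2\) follows the SSE/SST definition given above.

```python
# Evaluation metrics from Sect. 3 computed on toy observed/predicted vectors.
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])      # observed values y_i
y_hat = np.array([1.1, 1.9, 3.2, 3.8])  # predicted values ŷ_i

err = y - y_hat
mse = np.mean(err ** 2)
mae = np.mean(np.abs(err))
rmse = np.sqrt(mse)
cv_rmse = rmse / y.mean()                # RMSE normalized by the observed mean
nse = 1 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)
r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)  # SSE/SST

print(mse, mae, rmse, cv_rmse, nse, r2)
```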
4 Dataset description and pre-processing
The analyzed dataset concerns measurements from a multi-parametric odor monitoring system. The data were recorded during the period January–April 2022 by an Italian company specialized in providing environmental support and consulting services, with advanced experience in evaluating the effects of odor emissions from WWTPs. In particular, the data are measurements from a sensor array consisting of 10 metal oxide semiconductor (MOS) sensors specifically engineered to detect the volatile organic compounds (VOCs) released at the different phases of a WWTP, plus two other sensors which measure hydrogen sulfide (\(\text {H}_{2}\text {S}\)) and ammonia (\(\text {NH}_{3}\)). Figure 3 illustrates a typical WWTP, where one can distinguish between primary treatment (grit removal and primary sedimentation), secondary or biological treatment (stabilization of sludge, sludge storage and thickening), secondary sedimentation (separation of lighter particles and micro-organisms) and tertiary treatment (denitrification and equalization). As a consequence, several groups of odoriferous compounds can be generated at each stage.
Fig. 3 Scheme of a typical WWTP (Source: Arpat.it)
The data obtained from the sensors, as shown in Table 1, allow for the identification of particular odor components, with their corresponding measurements expressed in milliamperes (mA). The dataset comprises 285 measurements collected from 12 sensors at different phases of diverse treatment processes. The sensors used as independent variables to measure the gaseous compounds were W1C, W3C, W5C, W2W, W2S, W1S, W5S, W6S, W3S, W1W, \(\text {H}_{2}\text {S}\) and \(\text {NH}_{3}\). The response variable in olfactory investigations is the odor concentration of the gaseous compounds, expressed in odor units per cubic meter \((ou_E/m^{3})\). This concentration is crucial in determining the perception of odor, since the gaseous compounds are its major determinants. Thus, the main goal of this case study is to build a prediction model for the odor concentration based on the gaseous compounds as covariates, using the aforementioned ML approaches.
Table 1
Description of 10 sensors of an IOMS plus \(\text {H}_{2}\text {S}\), \(\text {NH}_{3}\) and odor concentration

Gaseous compounds | Unit of measurement
Odor concentration | \(ou_E/m^3\)
\(\text {H}_{2}\text {S}\)-Hydrogen sulfide | mA
\(\text {NH}_{3}\)-Ammonia | mA
W1C (Aromatic) | mA
W1S-Broad methane | mA
W1W-Sulphur organic (terpenes and sulphur organic compounds) | mA
W3S-Methane-aliph (sensitive to high concentrations of methane) | mA
W5C-Arom-aliph (alkanes, aromatic compounds) | mA
W5S-Broad range (broad range sensitivity, reacts to nitrogen oxides) | mA
W6S-Hydrogen | mA
Exploratory data analysis (EDA) is an essential step in constructing an ML model, since it involves a thorough examination of the data to obtain insights and discover possible patterns and relationships among the variables. EDA encompasses essential procedures such as data gathering, data cleaning, basic descriptive statistics, and visualization. Descriptive statistics include measures such as the mean, median, range, minimum and maximum values, standard deviation, kurtosis, and skewness. Significant variability in the dataset can result in substantial errors during training and must be controlled to ensure effective performance during testing. The descriptive statistics for the data captured by the 10 sensors of the IOMS and the 2 variables \(\text {H}_{2}\text {S}\) and \(\text {NH}_{3}\) are presented in Table 2. From these statistics, it is clear that W1W, W2W, \(\text {H}_{2}\text {S}\), and \(\text {NH}_{3}\) had the highest average values, namely 25.826, 20.454, 642.152, and 22.004, respectively, whereas sensors W1C, W3C, and W5C had the lowest average values. Significant changes in the gaseous compound levels were observed across different parts of the WWTPs, as indicated by the high variability recorded on sensors W5S, W1W, and W2W, whose coefficients of variation were, respectively, 1.948, 1.572, and 1.854. In addition, these sensors displayed wide ranges of values, namely 130.72, 282.53, and 266.25, respectively. In order to evaluate normality, both kurtosis and skewness have been analyzed. According to the obtained results, only sensors W1C, W3C and W5C exhibited negative skewness values, equal respectively to \(-\)1.176, \(-\)1.242, and \(-\)1.087, confirming the findings of Hair et al. (2019). Similarly, the kurtosis values displayed a consistent pattern, suggesting that the dataset deviates from a normal distribution. Hence, it is imperative to carry out a data transformation.
Table 2
Descriptive statistics for the distributions of the study data

Gaseous compounds | Range | Mean | Median | Standard deviation | Coefficient of variation | Skewness | Kurtosis
\(\text {H}_{2}\text {S}\) | 19,975 | 642.152 | 32 | 1951.11 | 2.822 | 5.959 | 47.474
\(\text {NH}_{3}\) | 4140 | 22.004 | 3 | 200.890 | 12.226 | 16.530 | 273.124
W1C | 0.80 | 0.686 | 0.720 | 0.158 | 0.230 | \(-\)1.176 | 1.481
W1S | 42.79 | 7.494 | 5.470 | 5.943 | 0.793 | 2.743 | 10.091
W1W | 282.53 | 25.826 | 11.91 | 40.61 | 1.572 | 3.575 | 15.118
W2S | 47.23 | 9.242 | 6.190 | 8.179 | 0.885 | 2.416 | 6.423
W2W | 266.25 | 20.454 | 7.130 | 37.930 | 1.854 | 3.816 | 16.261
W3C | 0.820 | 0.724 | 0.770 | 0.172 | 0.238 | \(-\)1.242 | 1.211
W3S | 25.13 | 5.752 | 4.4508 | 3.98 | 0.692 | 2.162 | 5.035
W5C | 0.710 | 0.777 | 0.820 | 0.152 | 0.196 | \(-\)1.087 | 0.553
W5S | 130.72 | 5.97 | 3.130 | 11.63 | 1.948 | 7.232 | 63.414
W6S | 14.25 | 1.708 | 1.640 | 0.810 | 0.474 | 13.185 | 205.4
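The summaries in Table 2 can be reproduced with a few lines of pandas, as sketched below; the DataFrame and its column names are hypothetical placeholders for the actual sensor data.

```python
# Descriptive statistics in the layout of Table 2, computed column by column.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.default_rng(3).random((285, 3)),
                  columns=["H2S", "NH3", "W1C"])  # placeholder sensor data

stats = pd.DataFrame({
    "Range": df.max() - df.min(),
    "Mean": df.mean(),
    "Median": df.median(),
    "Standard deviation": df.std(),
    "Coefficient of variation": df.std() / df.mean(),
    "Skewness": df.skew(),
    "Kurtosis": df.kurt(),
})
print(stats.round(3))
```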
Within the EDA, the variable dependence plots (Fig. 4) allow the analyst to understand the relationships between the covariates and the response variable and to discover any non-linear relationships that may exist. If a dependence plot shows a strong and steady link between the sensory measurements and the odor concentration, the corresponding input variables are likely important for predicting the target variable (Choi and Yim 2016). From Fig. 4, a positive linear relationship between \(\text {H}_{2}\text {S}\) and odor concentration is evident, indicating that higher levels of \(\text {H}_{2}\text {S}\) are linked to higher odor concentrations. The data collected from W1W, W2W, W1S and W2S also indicate a direct correlation with the odor concentration values; these variables have the potential to make a substantial contribution to the predictive capability of the model (Choi and Yim 2016). Note that the observations for W1C, W3C, and W5C exhibit a significant degree of dispersion. This aspect could pose difficulties for algorithms, such as k-NN, that depend heavily on the proximity of data points and may therefore be dominated by variables with a wider range.
Fig. 4 Variable dependence plots between the covariates and the response variable
4.1 Pre-processing of the data
Before developing a prediction model, a sequence of data pre-processing activities was carried out. Data pre-processing helps improve the prediction capability of ML models, enabling more precise data mining, as suggested by Jamshed et al. (2019), and enhances the efficacy of the resulting models (Guo et al. 2015). The most effective data pre-processing methods here were normalization and data transformation. In this case study, the min-max normalization approach was applied, shifting and rescaling the study data to fit within the range 0 to 1. This is especially advantageous because features with wide ranges can cause instability during model training, as observed by Mousavi et al. (2018). With this normalization method, the study data were transformed as follows:
\(x_i' = \dfrac{x_i - x_{(1)}}{x_{(n)} - x_{(1)}},\)
where \(x_{i}\) are the measurements for the observed variable X, and \(x_{(1)}\) and \(x_{(n)}\) are, respectively, the minimum and the maximum values recorded for X.
As already pointed out through the dependence plots (Fig. 4), certain sensors exhibited a significant dispersion in their distribution, which could create difficulties for algorithms like k-NN that depend heavily on the proximity of data points. For this reason, the cube-root transformation has been used to mitigate the fluctuations and skewness in the odor concentrations measured by the sensors (Prolhac 2008). Afterwards, a variable selection has been performed, as detailed in the following section.
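A minimal sketch of the two pre-processing steps, assuming the raw sensor readings are stored in a NumPy array X, could look as follows.

```python
# Min-max scaling to [0, 1] followed by a cube-root transform to damp skewness.
import numpy as np

X = np.random.default_rng(4).random((285, 6)) * 50  # hypothetical raw readings

x_min = X.min(axis=0)  # x_(1), the per-variable minimum
x_max = X.max(axis=0)  # x_(n), the per-variable maximum
X_scaled = (X - x_min) / (x_max - x_min)

X_transformed = np.cbrt(X_scaled)  # cube root mitigates fluctuations/skewness
```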
4.2 Variable selection
Variable selection is the process of scoring input variables according to their predictive capability for a target variable (Guyon and Elisseeff 2003). Variable selection scores are crucial in predictive modeling because they improve data comprehension, offer insights into the model, and facilitate feature selection. Consequently, this can improve the effectiveness of predictive models for a specific problem. Correlation analysis, which seeks to characterize the connection between two or more quantitative variables (Gogtay and Thatte 2017), is one technique for selecting variables. In this study, the correlation matrix (Fig. 5) encompasses the correlation coefficients between all the features in the dataset, including the variable representing the odor concentration. The main goal was to select characteristics strongly correlated with the dependent variable while taking into account the presence of multicollinearity: in the presence of a strong link between two variables, one of them was excluded. Indeed, significant multicollinearity among the independent variables poses difficulties in regression analysis, making it hard to accurately evaluate the influence of individual predictors. More precisely, the factors that had a strong correlation with the odor concentration were retained for the model development phase, in which the ANN, k-NN, RF and XGBoost approaches were applied. On the basis of this analysis, the variables \(\text {H}_{2}\text {S}\), W1S, W3S, W5S, W2S and W1W were considered relevant for the subsequent steps, which is also consistent with the variable importance plot depicted in Fig. 8.
Fig. 5 Correlation matrix among the study variables
While the ANN, RF and XGBoost models are typically robust against multicollinearity, the correlation study was necessary to address the detrimental effects of highly correlated variables on k-NN during model creation. The resulting dataset was then divided into two subsets: the training dataset \((70\%)\) and the testing dataset \((30\%)\).
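A hedged sketch of this correlation-based selection and the 70/30 split is given below; the DataFrame, target column name, and correlation thresholds are illustrative assumptions, as the text does not report the cut-off values used.

```python
# Correlation-based variable selection followed by a 70/30 train/test split.
import pandas as pd
from sklearn.model_selection import train_test_split

def select_features(data: pd.DataFrame, target: str,
                    target_corr_min=0.4, collinearity_max=0.9):
    corr = data.corr()
    # Keep features strongly correlated with the response variable...
    keep = [c for c in data.columns if c != target
            and abs(corr.loc[c, target]) >= target_corr_min]
    # ...then drop one of any pair of highly collinear features
    selected = []
    for c in keep:
        if all(abs(corr.loc[c, s]) < collinearity_max for s in selected):
            selected.append(c)
    return selected

# Hypothetical usage, assuming `data` holds the sensor readings and response:
# features = select_features(data, target="odor_concentration")
# train, test = train_test_split(data, train_size=0.7, random_state=0)
```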
4.3 Hyper-parameter selection
One of the most crucial elements in creating prediction models with ML algorithms is the setting of the hyper-parameters, which is often even more important than the choice of the model (Yang and Shami 2020). Indeed, the adjustment of the hyper-parameters critically influences how well ML models perform on unseen data, and it should be based on the dataset rather than being specified a priori (Arnold et al. 2024). Common approaches for searching hyper-parameters include grid search, Bayesian optimization, heuristic search, and randomized search (Kumar 2019). The randomized search approach is employed here because of its superior efficiency in simultaneously optimizing several hyper-parameters, allowing the identification of their optimal combination. In particular, in random search the model is trained on a set of random combinations of hyper-parameters.
The recommended steps for hyper-parameter tuning of an ML algorithm can be outlined as follows:
Identify the most appropriate method for the research problem;
Split the original dataset into training and test set;
Tune the network hyper-parameters referring to the training process;
Choose the optimal hyper-parameters based on their performance on the test set.
According to the four points above, it is worth pointing out the following details (a code sketch consolidating the selected settings appears after the list):
1.
For the ANN, the activation function used in the hidden layers of this specific model was the hyperbolic tangent function. The hyperbolic tangent function compresses the input values within the interval \(-1\) to 1, allowing the neural network to capture complex correlations present in the data. Unlike several other activation functions, the hyperbolic tangent has a zero-centered property that helps mitigate issues like the vanishing gradient problem. This attribute enables more consistent and effective training by aligning the activations around zero, which is especially advantageous when working with variables that have different scales and distributions. The inclusion of the hyperbolic tangent function in the hidden layers enhances the model’s ability to catch complex and subtle patterns in the input data, hence increasing its overall expressive capacity. The training and test sets were carefully selected to verify the model’s effectiveness and ability to generalize. Hyper-parameter optimization was carried out using a random search to refine parameters such as learning rates and regularization strengths; a careful optimization process was necessary to find a good balance between model complexity and generalization ability, leading to the best performance of the ANN across the datasets. The final model was constructed using a four-layer design, consisting of one input layer, two hidden layers, and one output layer, with the input layer consisting of six variables.
2.
In the k-NN, the model was fitted using the training set, which accounted for the majority of the data. During this stage, the model acquired knowledge of the connections and patterns inherent in the data, enabling it to produce predictions based on the proximity of input data points to their nearest neighbors. By evaluating the model’s performance on unfamiliar data, it was feasible to confirm its capacity to generate precise predictions outside the training dataset. The optimal value of k was carefully determined using RMSE validation: the value utilized in this analysis was \(k=15\), as it resulted in the lowest RMSE, indicating improved predictive accuracy.
3.
In the RF model, the training process started by arranging the data in random order. The dataset was then divided into two subsets: 70% of the data was used for training, and 30% for testing (or validation) to evaluate the model’s ability to generalize. This separation guaranteed that the model’s predictive ability was evaluated on data it had not been trained on, thereby measuring its capacity to make accurate predictions beyond the data it was familiar with. Next, the optimal RF model was obtained by tuning the hyper-parameters to find the configuration with the highest testing prediction accuracy. In particular, the number of trees developed from bootstrap samples of the original observations was set to 500, the number of variables used to split at each node (known as the mtry parameter) was set to 3, and the node size, that is the minimum size of the terminal nodes of the trees, was set to 1.
4.
In the XGBoost, the original dataset was split into training and testing sets. By strategically dividing the data, the model learned the fundamental patterns and their correlations during the training phase. The study examined specific parameters, namely the minimum child weight, the maximum depth (ranging from 2 to 30), nrounds (ranging from 10 to 2000), and eta (ranging from 0.01 to 0.3), as mentioned by Chen and Guestrin (2016). The XGBoost model was implemented by setting a maximum depth of 3, an eta value of 0.1, and a verbosity level of 0 to monitor the training process. The findings were consistent with the study conducted by Chen and Guestrin (2016). The best testing RMSE was reached after 50 attempts, suggesting a robust model that balanced complexity and generalization, reducing the risk of overfitting.
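The following sketch consolidates the settings reported in points 1–4 above; the hidden-layer sizes of the ANN and the mapping of the RF node size onto scikit-learn's min_samples_leaf are assumptions, since the text does not report these details explicitly.

```python
# The four tuned models: tanh ANN with two hidden layers, k-NN with k=15,
# RF with 500 trees / mtry=3 / node size 1, XGBoost with max_depth=3, eta=0.1.
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

models = {
    "ANN": MLPRegressor(hidden_layer_sizes=(16, 8),  # two hidden layers (sizes assumed)
                        activation="tanh", random_state=0),
    "k-NN": KNeighborsRegressor(n_neighbors=15),
    "RF": RandomForestRegressor(n_estimators=500, max_features=3,
                                min_samples_leaf=1,  # node size of 1 (assumed mapping)
                                random_state=0),
    "XGBoost": XGBRegressor(max_depth=3, learning_rate=0.1, verbosity=0),
}
# Each model would then be fitted on the 70% training split and scored on the
# 30% test split, e.g. models["RF"].fit(X_train, y_train).
```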
5 Prediction model results
To verify the accuracy of the constructed models, the statistical metrics described in Sect. 3 have been compared. Table 3 displays the results of the comparative analysis among ANN, k-NN, RF and XGBoost. The model with the lowest RMSE value was deemed preferable. In addition, the coefficient of determination (\(R^2\)) is included to quantify the proportion of the observed variation in the outcomes that can be explained by the inputs, thereby enhancing the interpretation of the results.
Table 3
Evaluation metrics for the performance of the ML methods

Performance measures | ANN | k-NN | RF | XGBoost
MSE | 0.012 | 0.013 | 0.002 | 0.011
MAE | 0.079 | 0.084 | 0.032 | 0.075
RMSE | 0.109 | 0.114 | 0.046 | 0.106
CV-RMSE | 0.035 | 0.039 | 0.021 | 0.034
NSE | 0.711 | 0.633 | 0.873 | 0.784
\(R^2\) | 0.724 | 0.713 | 0.953 | 0.771
Fig. 6 Predicted and observed odor plots with respect to the different ML models: a ANN, b k-NN, c RF, d XGBoost
The findings shown in Table 3 demonstrate that the RF model exhibited superior accuracy and precision in comparison to the other three techniques. The RF model yielded the best outcomes, with an MSE of 0.002, MAE of 0.032, RMSE of 0.046, CV-RMSE of 0.021, and NSE of 0.873; its coefficient of determination was \(95.3\%\). XGBoost exhibited the second-highest level of performance, while the ANN and k-NN models achieved coefficients of determination of \(72.4\%\) and \(71.3\%\) respectively. The results obtained are comparable to those reported by Jiang et al. (2023) in predicting odor gas generation from treatment plants: in that work, the RF model had a coefficient of determination equal to \(89.6\%\), whereas the XGBoost model obtained \(89.4\%\). The results were nearly identical, indicating the effectiveness of ML techniques in predicting odors. The ML models listed in Table 3 demonstrated high accuracy in predicting odor concentrations, with the RF method showing superior performance. Additionally, the predicted values closely matched the observed odor, as shown in Fig. 6. As a result, the proposed model offers valuable support to industry decision-makers in attaining more accurate predictions.
It is crucial to emphasize that the generated models successfully tackled the problems of overfitting and training errors. The residual plots were assessed to ensure the validity of the proposed models. Residual plots are decisive in ML prediction because they shed light on how the observed and predicted values relate to one another; they are also important for assessing the bias and variance of nonlinear regressions (Searle 1988). According to Espinheira et al. (2021), these graphs aid in evaluating the significance and dependability of predictive statistical models. Additionally, researchers can modify the prediction model to increase its accuracy and dependability by examining the residual plot (Searle 1988). The residual plots of the RF and XGBoost models represented in Fig. 7 demonstrate the good reliability of the analysis. The residuals show a dispersed distribution along the horizontal axis (zero line), indicating randomness and the absence of any visible patterns. This confirms that these models successfully represent the relationships underlying the data. Additionally, the mean of the residuals is close to zero, indicating a lack of bias, and their spread is nearly constant across all predicted values, demonstrating homoscedasticity. Furthermore, only a small number of outliers are found, highlighting how reliable these models are. However, the k-NN and ANN models show several noteworthy outliers in their residual plots, despite performing satisfactorily overall. This difference implies that XGBoost and RF did better at managing the variability and patterns within the data.
Fig. 7 Residual plots with respect to the different ML models: a ANN, b k-NN, c RF, d XGBoost
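A residual plot like those in Fig. 7 can be produced with a few lines of matplotlib, as sketched below, assuming a fitted model rf and the held-out test split X_test, y_test from Sect. 4.2.

```python
# Residual plot: residuals should scatter randomly around the zero line, with
# roughly constant spread, if the model captures the underlying relationships.
import matplotlib.pyplot as plt

y_pred = rf.predict(X_test)        # assumes a fitted model from Sect. 4.3
residuals = y_test - y_pred

plt.scatter(y_pred, residuals, s=12)
plt.axhline(0.0, linestyle="--")   # zero line
plt.xlabel("Predicted odor concentration")
plt.ylabel("Residual")
plt.show()
```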
5.1 Variables importance
During the process of odor prediction, the models can also provide insights into the importance of the features used for making predictions. The contribution of compound gases to predicting odor concentration was investigated by Wei et al. (2017).
Fig. 8 Variable importance plots
Variable importance is a technique used to identify the most crucial variables in the development of a model. Identifying the paramount features is crucial for aiding companies and governments in their planning efforts, as it enables them to prioritize resources toward the features that yield the most significant effects. The bar plots in Fig. 8 display the importance of the variables, sorted by their significance, under the RF and XGBoost models, which demonstrated the best performance in predicting odor. Hydrogen sulfide (\(\text {H}_{2}\text {S}\)) made the most significant contribution in both the RF and XGBoost models, showing a considerable gap with respect to the other variables. These findings suggest that \(\text {H}_{2}\text {S}\) had the greatest impact on the predicted outcomes. Sulphur organic (W1W) had the second highest relevance for both the RF and the XGBoost. Furthermore, a significant and positive association between W1W and odor concentration was noted. Methane-aliph (W3S) and broad-alcohol (W2S) had the least impact on odor prediction for the RF and XGBoost models, respectively.
6 Conclusions
This study focused on assessing the efficacy of various ML algorithms, namely ANN, k-NN, RF and XGBoost, in predicting odor emissions from a WWTP. The findings highlighted that RF exhibited the highest performance in predicting odor concentration, achieving a coefficient of determination equal to \(95.3\%\). As a result, the RF model can be considered more reasonable and dependable than the other methods. This study proved that ML models can be an effective tool for odor prediction. They can help companies address odor concerns proactively, decrease the environmental impact, comply with regulations, and maintain positive relationships with nearby communities. Although the proposed models yielded satisfactory outcomes, certain limitations were observed, which encompass:
The availability of data for training and evaluation, both in terms of quality and quantity, may represent a limitation. Insufficient or non-representative data can result in less-than-ideal model performance, and the limited size of the training dataset can cause ML algorithms to under-perform;
The efficacy of ML algorithms is greatly influenced by the configuration of hyper-parameters. Inadequate or unsystematic hyper-parameter tuning may result in unsatisfactory outcomes.
In order to guide future study, the following considerations can be useful:
A wider sample size may help improve the performance of particular models;
The investigation of advanced neural network structures, such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs), can potentially enhance the detection of specific patterns in odor concentration data;
The analysis of automated ML (AutoML) methods may allow the identification of the most effective model and hyper-parameter configurations for predicting odor concentration.
Finally, it is worth highlighting some positive practical consequences that could arise from the proposed study, namely:
Better comprehension of odor emissions: ML techniques can be employed to identify patterns and correlations in wastewater data that are responsible for the release of unpleasant odors. This knowledge can aid in understanding the factors and components that mainly contribute to the production of odors in wastewater treatment facilities;
Enhanced odor control techniques: theoretical advancements within this domain could lead to the development of odor control strategies with enhanced efficacy. Properly predicting odor emissions can optimize treatment methods and reduce the environmental impact of WWTPs;
Deeper control of health risks: progress in odor prediction theory can aid in clarifying the potential health risks associated with exposure to odors from wastewater. This information can contribute to the development of public health policies and enhance the quality of life in regions next to wastewater treatment facilities.
Declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.