
Open Access 06-05-2025 | Original Article

Comparative analysis of CatBoost, LightGBM, XGBoost, RF, and DT methods optimised with PSO to estimate the number of k-barriers for intrusion detection in wireless sensor networks

Author: Kadir Ileri

Published in: International Journal of Machine Learning and Cybernetics


Abstract

The article presents a comparative analysis of machine learning models optimized with Particle Swarm Optimization (PSO) for estimating the number of k-barriers in wireless sensor networks (WSNs). It highlights the critical role of WSNs in border security and intrusion detection, addressing the limitations of conventional patrolling methods. The study evaluates the performance of CatBoost, LightGBM, XGBoost, Random Forest (RF), and Decision Tree (DT) algorithms, demonstrating the superiority of the CatBoost-PSO model in terms of accuracy, efficiency, and adaptability. The article provides a detailed examination of feature importance, revealing that network area and transmission range are the most significant factors in predicting the number of k-barriers. Additionally, it discusses the time complexity and computational efficiency of the models, emphasizing the practicality of real-time intrusion detection. The comparative analysis with existing methods in the literature further underscores the robustness and effectiveness of the proposed CatBoost-PSO approach, making it a pivotal read for those interested in advanced intrusion detection technologies.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Abbreviations
CatBoost
Categorical Boosting
D1
Device 1
D2
Device 2
DT
Decision Tree
GPR
Gaussian Process Regression
GPS
Global Positioning System
LightGBM
Light Gradient Boosting Machine
M
Mobile
MAE
Mean Absolute Error
MDI
Mean Decrease Impurity
MSE
Mean Squared Error
NS
Network Simulator
PCA
Principal Component Analysis
PSO
Particle Swarm Optimisation
R2
Coefficient of Determination
RF
Random Forest
RMSE
Root Mean Squared Error
S
Stationary
TBS
Target Based Statistics
UAV
Unmanned Aerial Vehicle
WSN
Wireless Sensor Network
XGBoost
Extreme Gradient Boosting

1 Introduction

The protection of borders is a critical concern for every country in the world. To ensure the security of their citizens, nations must take measures against unauthorized refugee influx, attempts by enemy soldiers to clandestinely steal national security information, the entry of drug smugglers, and other prohibited intrusions [1]. Many countries depend on their regular armies to safeguard their international borders from potential threats. However, conventional and periodic patrolling approaches come with inherent limitations. Because international boundaries are long, it is not feasible to station soldiers at every point, which leaves unguarded areas vulnerable to enemy infiltration. Addressing these security challenges requires identifying intruders and unauthorized activities accurately and rapidly. WSNs can be employed effectively to accomplish monitoring and surveillance along international borders [2–4].
WSNs are extensively adopted and consist of sensors capable of processing, sensing, and transmitting data. Owing to advantages such as rapid and simple installation, cost-effectiveness, and the absence of a need for pre-existing infrastructure, WSNs are widely used for intrusion detection [5]. A WSN can be deployed to create a sensor barrier across any potential intrusion path of an intruder in a rectangular region, as illustrated in Fig. 1. Apart from intrusion detection, WSNs are used in various fields such as health monitoring, environment monitoring, landmine detection, and fire detection [6–8].
Fig. 1
Example of k-barrier coverage for an intruder
Table 1 provides a concise summary of the contributions and key findings of various studies on barrier coverage applications using WSNs [9–20]. For each work, it includes the authors, publication year, a short description, advantages, limitations, and the sensor type used, mobile (M) or stationary (S). In one of these studies, Yang et al. [9] introduced an energy-efficient border intrusion detection system with the aim of minimizing intruder entries and improving detection accuracy. The focus of the research is on coverage quality and techniques to improve it. The study proposes a repair method to identify and fix weak zones to ensure the desired barrier coverage quality. Additionally, a one-directional coverage model is presented for intrusion detection in border regions, covering both single-intruder and few-intruder scenarios. The proposed algorithm also extends the lifetime of the WSN and maintains energy efficiency. Similarly, Raza et al. [10] investigated the performance of an intrusion detection system in a heterogeneous WSN. The investigation covers WSNs formed through both Gaussian and uniform sensor distributions. The impact of sensor node density on both distributions is analysed using k-sensing detection. Simulation results indicate that the heterogeneous WSN outperforms the homogeneous WSN in intrusion detection. Sharma et al. [11] derived analytic expressions to compute the k-barrier coverage probability in a mobile WSN for an intruder moving within a rectangular region. Various system and network parameters, such as sensing range, sensor density, and intruder path angle, are analysed for their impact on the coverage probability. The research shows that the k-barrier coverage probability increases with the intruder's path angle, leading to a higher detection probability because the intruder spends more time within the region.
Table 1
Summary of related works

| Reference | Year | Description | Advantages | Limitations | M/S |
| --- | --- | --- | --- | --- | --- |
| Yang et al. [9] | 2014 | Identifying weak zones to deal with bad coverage quality | Energy-efficient detection | One-directional coverage | S |
| Raza et al. [10] | 2015 | Analysing the influence of various sensor node densities in computing detection probability in a heterogeneous network | Operates in both Gaussian and uniform distribution scenarios | Analysed only for a single intruder | S |
| Sharma et al. [11] | 2020 | Analysing the analytic expressions for computing the k-barrier coverage probability in a mobile WSN for an intruder | Analysed for different path types of an intruder | Not analysed for different mobility and sensing models | M |
| Singh et al. [12] | 2021 | An implementation of three machine learning methods (GPR, S-GPR, and C-GPR) to estimate the k-barrier coverage probability for intrusion detection | Improved computational efficiency and accuracy; reduced time complexity | Lack of performance comparison (compared only with SVR) | S |
| Nurellari et al. [13] | 2021 | Designing the optimal trajectories for mobile sensors for intrusion detection | Reduced energy consumption; maximised coverage area | Not a fully distributed wireless mobile sensor network | M |
| Saraereh et al. [14] | 2021 | A method for secondary implementation of mobile sensors using a set-based maximum flow approach | Improved computation-time efficiency | Consumes energy faster | M |
| Amutha et al. [15] | 2021 | A distributed border surveillance system in different environments using WSN | Operates in both rectangular and circular regions; operates under both shadowed and non-shadowed environmental conditions | Limited number of sensing models | S |
| Sharma et al. [16] | 2021 | An autonomous surveillance system capable of detecting weapons, human infiltrations, and unmanned aerial vehicles (UAVs) | Capable of detecting the location of the intruder; experimental setup with microcontroller boards | Lack of performance comparison | S |
| Singh et al. [17] | 2022 | A smart border surveillance system that identifies an intrusion and subsequently sends alert messages in the presence of an intruder | Experimental setup with microcontroller boards; can distinguish between persons and animals | Lacks the capability to track the intruder | S |
| Rajan et al. [18] | 2023 | An intrusion detection system for armed attackers | Efficient energy cost | Lack of performance comparison | M and S |
| Mahjoub et al. [19] | 2023 | Designing a WSN using mobile robots for intruder detection and border surveillance | Autonomous travel capability of the sensors | Multiple swarms cannot collaborate to achieve common goals | M |
In another study, Singh et al. [12] presented a comprehensive framework for accurately predicting the k-barrier coverage probability for intruder detection. Three Gaussian Process Regression (GPR) based machine learning methods (GPR, S-GPR, and C-GPR) are implemented for the prediction. Simulation results show that the native GPR model performs best, predicting the k-barrier coverage probability more accurately than S-GPR and C-GPR. Further, Nurellari et al. [13] designed optimal trajectories of mobile sensors for intruder detection. These trajectories maximised the coverage area and reduced energy consumption to extend the lifetime of the sensor network. The performance of the sensor network is optimised by carefully designing the trajectories and dynamically adapting them to changes for better intrusion detection outcomes. Besides, Saraereh et al. [14] proposed a method for secondary deployment of mobile sensors to increase the utilization rate. A set-based maximum flow method is used to compute the number of weak points in the network, and a binary search method is employed to optimise the approach. The primary objective of the research is to strengthen the weak points while minimizing the total movement distance of the mobile sensors. Nevertheless, certain deficiencies persist: the search for weaknesses in the barrier does not account for actual intrusions in the environment, and nodes with increased intruder perception consume energy at a faster rate. Moreover, Amutha et al. [15] proposed a distributed border surveillance system in a WSN that uses a log-normal shadowing approach for both circular and rectangular regions. The performance of the system is assessed by analysing the number of barriers obtained to track the region of interest, and the influence of varied system and network parameters on the number of barriers is examined. According to the results, the number of barriers increases with the number of sensors and the sensing range. The proposed surveillance system with the log-normal shadowing approach outperforms the surveillance system with the binary sensing approach. Furthermore, Sharma et al. [16] presented a surveillance system designed for volatile border areas vulnerable to terrorist intrusions. The system is capable of detecting weapons, human infiltrations, and unmanned aerial vehicles (UAVs) using machine learning algorithms, and it determines the location of the intruder with a Global Positioning System (GPS) module. This artificial-intelligence-embedded autonomous surveillance system makes border surveillance more effective and efficient. It reduces the need for physical presence and surveillance by troops along the fence, minimizing risks to life and allowing existing surveillance equipment to be utilized for other crucial tasks. In a different study, Singh et al. [17] presented a smart border surveillance system utilizing wireless sensors to detect intrusions and issue alerts. A prototype is developed by integrating Raspberry Pi boards with camera, ultrasonic, and infrared sensors. The prototype communicates via Zigbee serial communication, enabling video streams to be forwarded to a control station for further analysis or action. The developed system can distinguish between persons and animals. Rajan et al. [18] introduced a WSN model that combines mobile and stationary sensor nodes to enhance intrusion detection against armed attackers. To address the issue of armed intruders, a collaborative approach called the vehicle partnership sensing network is presented, where mobile and static sensor nodes work together for intrusion detection, and a system for identifying intrusions is developed. To efficiently track armed intruders, a decentralized target pursuit method utilizing mobile sensor nodes is proposed. Finally, Mahjoub et al. [19] developed a system of mobile robots that mimics natural swarms and is capable of monitoring borders and detecting intrusions. The robots are equipped with simple sensors such as ultrasonic, camera, temperature, and humidity sensors, and each robot can move autonomously within the swarm.
The methods outlined above are valuable for detecting intruders in border regions. However, they exhibit notable drawbacks such as high computational cost and time complexity. Due to the extensive data generated by WSNs, a significant amount of time is required for processing and analysis. Time is vital in scenarios such as preventing intrusions in border regions, where even a delay of a few seconds could lead to catastrophic outcomes. Therefore, rapid detection of any intrusion along borders and in restricted areas is crucial. This challenge can be addressed by implementing machine learning approaches known for their computational efficiency [20].
In this study, an effective machine learning model is proposed to predict the number of k-barriers for rapid and robust intrusion detection and prevention in a rectangular area using features of the WSN. The proposed model integrates the CatBoost algorithm with PSO: CatBoost is utilized for the predictive modelling task, while PSO optimises its parameters to enhance prediction performance. Four features (network area, transmission range, sensing range, and number of sensor nodes), obtained through a Monte-Carlo simulation process, are used for the prediction. The performance of the proposed CatBoost-PSO model is evaluated using four metrics: MAE, MSE, RMSE, and R2. The major contributions of this paper are as follows:
  • An efficient and robust CatBoost model optimised with the metaheuristic PSO algorithm has been employed for k-barrier estimation in wireless systems. This yields a more finely tuned model with improved efficiency, accuracy, and adaptability to the specific challenges of intrusion detection in WSNs.
  • The proposed model has been compared with state-of-the-art models such as LightGBM, XGBoost, RF, and DT to ensure the reliability of the model.
  • The proposed model has been compared with other existing methods in the literature which are implemented for the same dataset to demonstrate the superiority and effectiveness of the model. It has shown better results than existing approaches in all metrics.
  • The feature importance of input parameters such as network area, number of sensors, sensing range, and transmission range has been analyzed, with the analysis identifying the network area and transmission range as the two most significant factors in predicting the number of k-barriers. Additionally, timing analyses of the models have been conducted, and the proposed model demonstrates the best prediction time, making it suitable for real-time systems.
The rest of this paper is structured as follows: Sect. 2 presents concise details regarding the communication protocols employed in WSNs. Section 3 provides an overview of the dataset used and the machine learning methods employed. In Sect. 4, a comprehensive analysis and discussion of the experimental findings is presented. Lastly, Sect. 5 outlines the conclusion of the study.

2 Communication protocols of WSNs

WSNs employ various communication protocols to facilitate effective and reliable data transmission among sensor nodes. These protocols are designed to address the challenges of WSNs such as limited energy resources, dynamic network topology, and the need for scalable and self-organizing communication. Here are some commonly used WSN communication protocols [21, 22]:
  • Bluetooth: Specified by the Bluetooth 5.3 standard, Bluetooth is recognized for its low power consumption of 100 mW. With a communication range of 1 to 10 m, it operates in the 2.4 GHz frequency band. Its point-to-point topology enables direct and efficient communication between paired devices.
  • Zigbee: Following the IEEE 802.15.4 standard, Zigbee features low power consumption of 36.9 mW, providing an energy-efficient solution. It operates in the 2.4 GHz frequency band, offering a communication range of 1 to 75 m. With a mesh topology, Zigbee enables interconnected networks, enhancing reliability and flexibility for diverse applications.
  • NFC: Following the ISO/IEC 18000-3 standard, NFC operates at a close range of approximately 10 cm. With a frequency of 13.56 MHz, it facilitates short-range wireless communication.
  • Wi-Fi: Adhering to various IEEE standards such as 802.11a, b, g, and n, Wi-Fi is known for its high power consumption of 835 mW. Operating in the 2.4 GHz and 5 GHz frequency bands, it provides a broad communication range of up to 100 m. With a star topology, Wi-Fi networks are structured around a central access point to which multiple devices connect for efficient data exchange.
  • Sigfox: Operating under the Sigfox standard, Sigfox is distinguished by its low power consumption, making it an energy-efficient option for many applications. With an impressive communication range of up to 100 km, it is well suited for long-range deployments. Employing a star topology, Sigfox networks are organized around a central base station, simplifying the network structure and enhancing scalability. It operates in the frequency range of 862 to 928 MHz.
  • Cellular network: Operating on the 4G standard, cellular technology exhibits a communication range of 25 to 35 km, with frequencies typically ranging from 600 MHz to 2.5 GHz. It enables high-speed wireless communication over vast geographical areas.
Designing WSNs necessitates careful consideration of critical factors, owing to challenges arising from environmental characteristics and network application requirements. Addressing issues such as limited energy capacity, fault tolerance, scalability, hardware constraints, latency, range, and production costs is crucial in formulating effective routing protocols for WSNs. It is essential to adapt protocols to specific WSN deployments and use cases, as different protocols may be more suitable for different scenarios [23]. The proposed algorithm processes the network area, sensing range, transmission range, and number of sensor nodes to predict the number of k-barriers for intruder detection. Once these features are available, the algorithm can function independently of the protocol; the protocols merely modify the values of certain features, such as the sensing and transmission ranges. Therefore, the proposed method generalizes across protocols, regardless of environmental considerations.

3 Materials and methods

3.1 Wireless sensor network dataset

The dataset [24] used in this study was synthetically generated through Monte-Carlo simulations using the NS-2.35 network simulator, a specific version of the Network Simulator 2 (NS-2) software. The dataset involves five columns: network area, sensing range, transmission range, number of sensor nodes, and number of barriers. The first four columns are the features, while the last column is the target variable to be predicted.
The data in a WSN can be classified into two categories, real and synthetic, based on the acquisition process. Real data is collected through direct measurements using sensors, but generating real data can be costly and labour-intensive. Synthetic data, on the other hand, can be generated using mathematical rules, statistical models, and simulations, and acquiring it is more efficient and cost-effective than collecting real data. Therefore, the dataset has been synthetically generated. Furthermore, the number of barriers has been chosen as the target variable because predicting it in WSNs is crucial for optimizing network performance, resource allocation, and energy efficiency, and for ensuring security and reliability in various deployment scenarios [25].
In the dataset generation process, a finite set of sensors is randomly and uniformly placed within a rectangular region. Each sensor is considered identical in terms of sensing, transmission, and computational capabilities. The network region dimensions range from 100 × 50 m² to 250 × 200 m². Two sensors within the deployed WSN can establish communication if they meet the condition Rtx > 2Rs, i.e., the transmission range (Rtx) is greater than twice the sensing range (Rs). In this context, the binary sensing model, a widely used sensing model, has been adopted to evaluate WSN performance. Under this model, a sensor detects a target with probability one if the target falls within its sensing range and with probability zero otherwise [25]. The simulation parameters used for dataset generation are given in Table 2.
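The deployment and sensing rules above can be summarized in a few lines of code. The following Python sketch (with illustrative function names, not taken from the NS-2 simulation) places sensors uniformly at random in a rectangular region and applies the binary sensing model:

```python
import numpy as np

def deploy_sensors(n_sensors, width, height, seed=0):
    """Place sensors uniformly at random in a width x height rectangle."""
    rng = np.random.default_rng(seed)
    return rng.uniform([0, 0], [width, height], size=(n_sensors, 2))

def binary_detection(sensor, target, sensing_range):
    """Binary sensing model: detection probability is 1 if the target lies
    within the sensing range, and 0 otherwise."""
    return 1.0 if np.linalg.norm(sensor - target) <= sensing_range else 0.0

sensors = deploy_sensors(100, width=100, height=50)  # 100 x 50 m region
print(binary_detection(sensors[0], np.array([50.0, 25.0]), sensing_range=15))
```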
Table 2
Simulation parameters used in dataset generation

| Parameter name | Value |
| --- | --- |
| Network area | 100 × 50 m² to 250 × 200 m² |
| Sensing range | 15 to 40 m |
| Transmission range | 30 to 80 m |
| Number of sensor nodes | 100 to 400 |
| Network region | Rectangular region |
| Deployment type of sensor | Uniform distribution |
| Sensing model | Binary sensing model |
| Simulator | NS-2.35 |
The dataset consists of 183 examples, and an overview of its key statistics, covering the minimum, maximum, and mean values, is provided in Table 3. The mean values of the network area, sensing range, transmission range, and number of sensors are 24,375, 27.5, 55, and 250, respectively. The mean value of the target variable is 94.0714.
Table 3
Overview of key statistics of each feature

| Feature name | Minimum value | Maximum value | Mean value |
| --- | --- | --- | --- |
| Network area | 5000 | 50,000 | 24,375 |
| Sensing range | 15 | 40 | 27.5 |
| Transmission range | 30 | 80 | 55 |
| Number of sensor nodes | 100 | 400 | 250 |
The distributions of the features can affect the performance of machine learning models, and smoother feature distributions contribute positively to model performance. Figure 2 illustrates the histogram of each feature for the dataset used in this experiment. The network area takes 7 distinct values ranging from 5000 to 50,000, each repeated 26 times. The sensing range, transmission range, and number of sensor nodes each take 26 distinct values, ranging from 15 to 40, 30 to 80, and 100 to 400, respectively, with each value repeated 7 times. Finally, the target variable takes 122 distinct values ranging from 12 to 320, each repeated between one and five times. Thus, no distribution adjustment is needed, since the feature distributions are regular (not skewed to the right or left).
Fig. 2
Histogram of the features; (a) network area, (b) sensing range, (c) transmission range, and (d) number of sensor nodes
Identifying and addressing outliers in the data is crucial for regression tasks, ensuring that the model is not disproportionately influenced by extreme values, which could result in inaccurate predictions or biased parameter estimates [26]. Figure 3 depicts boxplots for each feature, providing insight into the presence of outliers within the dataset. As clearly seen, no outlier cleaning is needed, as none are present in the features.
Fig. 3
Boxplots of the features; (a) network area, (b) sensing range, (c) transmission range, and (d) number of sensor nodes

3.2 Machine learning methods

3.2.1 Decision tree algorithm

Decision Tree (DT) is a widely used supervised machine learning algorithm for regression and classification tasks. A decision tree employed for classification is referred to as a classification tree, whereas one employed for regression is referred to as a regression tree. It is based on the binary tree, one of the most popular data structures in computer science. The aim of the DT is to build a model that estimates the target output by learning simple decision rules from the data features, dividing a complex problem into smaller ones.
The process of creating a DT employs a multistage, hierarchical decision-making scheme. The structure of a DT consists of three parts, arranged from top to bottom: a root node, a series of internal nodes, and terminal nodes, also called leaf nodes. Processing involves traversing down the tree until a leaf is reached.
The DT provides several advantages such as reduced data preparation, interpretability, and efficacy in data exploration [27, 28]. Nevertheless, it also comes with drawbacks such as overfitting and instability. The tree's structure can be notably influenced by minor changes in the data due to its instability, while increasing its depth may lead to overfitting [29]. To mitigate overfitting, pruning the tree after construction or limiting its maximum depth can be effective.
3.2.1.1 Random forest algorithm
Random Forest (RF), proposed by Breiman (2001) [29], is an ensemble learning algorithm based on decision trees. It is used for both classification and regression problems. The RF gathers information by randomly selecting features from each decision tree in the forest. It then employs a majority voting approach for classification or an averaging approach for regression problems to reach the final prediction.
The RF exhibits robustness against overfitting, a common concern in machine learning. Additionally, RF demonstrates a noteworthy capability in handling large datasets characterized by high dimensionality, making it a suitable choice for complex and extensive data scenarios. Besides its advantages, RF also has disadvantages; one such limitation is its potential underperformance on imbalanced datasets [30].
Figure 4 illustrates the operation of the RF as an ensemble model with \(L\) decision trees. Each tree independently processes inputs (\({x}_{i}\)) and produces individual output estimations (\({P}_{i}^{*}\)) for \(i\) = 1,…,\(L\). These estimations from the decision trees are then aggregated to generate the final prediction. In classification, the final prediction (\({P}_{final}\)) is determined by the majority of trees, while in regression, it is computed as the average of the decision trees' outputs by using Eq. 1.
Fig. 4
Structure of RF algorithm
$${P}_{final}=\frac{1}{L}\sum_{i=1}^{L}{P}_{i}^{*}$$
(1)
The RF's strength lies in its ability to create diverse decision trees within the forest. This is achieved by utilizing different subsets of the training dataset through the bagging process. By allowing each decision tree to be trained on distinct sets of data, the ensemble model can effectively prevent overfitting and enhance its overall performance and generalization capability.
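The bagging-and-averaging mechanism described above can be stated directly in code. The following minimal Python sketch (using scikit-learn trees for illustration; the study itself uses the library RF implementation) trains L trees on bootstrap samples and averages their predictions as in Eq. 1:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, L=10, seed=0):
    """Train L decision trees, each on a bootstrap sample of the training set."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(L):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        trees.append(DecisionTreeRegressor(random_state=seed).fit(X[idx], y[idx]))
    return trees

def rf_predict(trees, X):
    """Eq. 1: the final regression output is the average of the trees' outputs."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```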
3.2.1.2 Extreme gradient boosting algorithm
Extreme Gradient Boosting (XGBoost), proposed by Chen and Guestrin (2016) [31], is a supervised machine learning technique based on gradient-boosted decision trees. As depicted in Fig. 5, it is a boosting model that utilizes a sequential ensemble technique, combining multiple basic regression trees in series. New decision trees are added and trained sequentially to fit the residuals from previous iterations [32]. The estimates of all these decision trees are then accumulated to derive the final estimate. Meanwhile, XGBoost enhances estimation performance by minimizing bias in the model and substantially lowering the risk of overfitting [33].
Fig. 5
Structure of XGBoost algorithm
In contrast to the conventional gradient boosting decision tree algorithm, the XGBoost alters the objective function as follows:
$${F}_{Loss}^{(k)}= \sum_{i=1}^{n}\left[{g}_{i}\,{\delta }_{k}\left({x}_{i}\right)+\frac{1}{2}{h}_{i}\,{\delta }_{k}^{2}\left({x}_{i}\right)\right]+\Omega ({\delta }_{k})$$
(2)
$${g}_{i}= {\partial }_{{\widehat{y}}^{\left(k-1\right)}}\, l\left({y}_{i}, {\widehat{y}}_{i}^{\left(k-1\right)}\right),\qquad {h}_{i}={\partial }_{{\widehat{y}}^{\left(k-1\right)}}^{2}\, l\left({y}_{i}, {\widehat{y}}_{i}^{\left(k-1\right)}\right)$$
(3)
$$\Omega \left({\delta }_{k}\right)= \alpha K+\frac{1}{2}\lambda \sum_{n=1}^{K}{\omega }_{n}^{2}$$
(4)
where:
  • \(n\) indicates the number of training samples,
  • \({\delta }_{k}\left({x}_{i}\right)\) indicates the score of the \({i}^{th}\) sample in the \({k}^{th}\) tree,
  • \(\Omega \left({\delta }_{k}\right)\) indicates the regularization term,
  • \({g}_{i}\) is the first partial derivative of the loss function,
  • \({h}_{i}\) is the second partial derivative of the loss function,
  • \({y}_{i}\) indicates the observed value,
  • \({\widehat{y}}_{i}\) indicates the predicted value,
  • \(l\) indicates the loss function,
  • \(\alpha\) indicates the complexity cost of the terminal nodes,
  • \(\lambda\) indicates the scaling factor,
  • \(K\) indicates the number of leaves,
  • \({\omega }_{n}\) indicates the weight of the \({n}^{th}\) terminal node in the tree.
The objective function, after eliminating the constant term, can be obtained as follows:
$${F}_{Loss}=- \frac{1}{2}\sum_{j=1}^{K}{C}_{j}+\alpha K$$
(5)
where \({C}_{j}\) indicates the contribution of leaf node \(j\) to the current model loss function and is obtained as follows:
$${C}_{j}=\frac{{G}_{j}^{2}}{{H}_{j}+\lambda }$$
(6)
The XGBoost traverses all split leaf nodes by employing the greedy algorithm. If the gain of the target after the split is below the predefined threshold, the split can be ignored. The XGBoost does not utilize entropy or information gain for decision tree splits; instead, it employs the following gain metric:
$$Gain= \frac{1}{2}\bullet \left({S}_{Left}+{S}_{Right}-{S}_{Both}\right)-\alpha$$
(7)
where \({S}_{Left}\) indicates the score of the left subtree, \({S}_{Right}\) indicates the score of the right subtree, and \({S}_{Both}\) indicates the score when it is not split. These scores are obtained as follows:
$${S}_{Left}=\frac{{G}_{L}^{2}}{{H}_{L}+\lambda }$$
(8)
$${S}_{Right}=\frac{{G}_{R}^{2}}{{H}_{R}+\lambda }$$
(9)
$${S}_{Both}=\frac{{\left({G}_{L}+{G}_{R}\right)}^{2}}{{H}_{L}+{H}_{R}+\lambda }$$
(10)
3.2.1.3 Light gradient boosting machine algorithm
Light Gradient Boosting Machine (LightGBM), proposed by Ke et al. (2017) [34], is an efficient, open-source algorithm based on gradient-boosted decision trees. It is used for regression, ranking, and classification problems. The primary distinction from XGBoost is that LightGBM employs a histogram-based algorithm, resulting in an accelerated training process and decreased computational complexity [35, 36]. Meanwhile, it uses a leaf-wise growth strategy with a pre-defined maximum depth parameter. With this strategy, LightGBM avoids splitting all leaves on the same level simultaneously, as done by the level-wise growth strategy used in XGBoost. This can enhance accuracy by reducing training errors; however, the improvement comes with an increased risk of overfitting. This trade-off can be mitigated by setting an appropriate maximum depth limit, which is therefore crucial for developing a robust and generalizable model.
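In practice, this trade-off is controlled by a handful of parameters. The sketch below is illustrative only: the values echo the optima later reported in Table 5, the data is random, and min_child_samples is LightGBM's scikit-learn name for min_data_in_leaf:

```python
import numpy as np
import lightgbm as lgb

# num_leaves bounds the leaf-wise growth; max_depth caps the tree depth.
X, y = np.random.rand(200, 4), np.random.rand(200)
model = lgb.LGBMRegressor(num_leaves=7, max_depth=15,
                          min_child_samples=10, learning_rate=0.1)
model.fit(X, y)
```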
3.2.1.4 Categorical boosting algorithm
Categorical Boosting (CatBoost), proposed by Prokhorenkova et al. (2018) [37], is a gradient-boosted decision tree algorithm developed to manage categorical features more efficiently. CatBoost utilizes enhanced and revised target-based statistics (Greedy TBS). Within this procedure, categorical features can be replaced with their corresponding average label values, which then serve as the criteria for node splitting during the construction of decision trees. This approach effectively minimizes overfitting and accelerates model estimation. The Greedy TBS is obtained as follows [36]:
$$\frac{\sum_{j=1}^{p}\left[{x}_{j,t}={x}_{i,t}\right]\bullet {Y}_{j}}{\sum_{j=1}^{p}\left[{x}_{j,t}={x}_{i,t}\right]}$$
(11)
In most cases, features carry more information than labels. However, forcefully using the average label value to represent features may cause a conditional shift, i.e., a change in the relationship or distribution between features and labels arising from the way the average label value is used to represent features [38]. To handle this, CatBoost adds a prior value to the Greedy TBS. Consider a dataset of observations O = {\({X}_{i}\),\({Y}_{i}\)}, \(i=1, \dots , N,\) where \({X}_{i}\) = (\({X}_{i,1}\),......,\({X}_{i,M}\)) is a vector including both categorical and numerical features, with \(M\) representing the number of features, and \({Y}_{i}\), the corresponding label for each observation, is a real number.
Initially, the dataset undergoes a random permutation. For each sample within the permutation, the mean label value is computed using only the preceding samples that share the same category value, which mitigates overfitting. If a permutation is σ = (σ1,…, σN), the permuted observation \({x}_{{\sigma }_{p,t}}\) is substituted with [36]:
$${x}_{{\sigma }_{p,t}}=\frac{\sum_{j=1}^{p-1}\left[{x}_{{\sigma }_{j,t}}={x}_{{\sigma }_{p,t}}\right]\bullet {Y}_{{\sigma }_{j}}+c\bullet P}{\sum_{j=1}^{p-1}\left[{x}_{{\sigma }_{j,t}}={x}_{{\sigma }_{p,t}}\right]+c}$$
(12)
where \(P\) is a prior value, which in regression refers to the mean label value, and \(c > 0\) is the weight of the prior. This approach reduces the noise for low-frequency categories [29].
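A compact way to see Eq. 12 in action is the following Python sketch (function and variable names are illustrative, not CatBoost internals): each sample's category is encoded using only the labels of samples that precede it in a random permutation and share its category value, smoothed by the prior.

```python
import numpy as np

def ordered_target_statistic(categories, labels, prior, c=1.0, seed=0):
    """Ordered target statistic per Eq. 12 (sketch)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(categories))
    encoded = np.empty(len(categories))
    sums, counts = {}, {}  # running label sum and count per category
    for p in perm:
        v = categories[p]
        s, k = sums.get(v, 0.0), counts.get(v, 0)
        encoded[p] = (s + c * prior) / (k + c)  # Eq. 12
        sums[v], counts[v] = s + labels[p], k + 1  # update running stats
    return encoded
```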
During the construction of a new split node in the current tree, the CatBoost utilizes a greedy technique to explore all possible combinations of different feature types. It dynamically transforms these combined categorical features into numeric features (see Fig. 6). Additionally, ordered boosting is employed by CatBoost to replace the gradient prediction in the traditional boosting algorithm, effectively handling estimation shifts induced by gradient bias and improving the generalization ability of the model.
Fig. 6
Structure of CatBoost algorithm

3.3 Particle swarm optimisation algorithm

Particle Swarm Optimisation (PSO), proposed by Kennedy and Eberhart [39], is a metaheuristic optimisation algorithm originally inspired by the food-seeking behaviour of bird flocks. PSO possesses several advantages, including simplicity, relatively low computational cost, and effectiveness [40]. Thanks to these advantages, it has gained substantial popularity in computer science in recent years [39, 41–45].
The PSO is initialized with random solutions called particles. Each particle \(i\) has a current velocity vector \({V}_{i}=\left[{v}_{i}^{1},{v}_{i}^{2},...,{v}_{i}^{D}\right]\) and position vector \({X}_{i}=\left[{x}_{i}^{1},{x}_{i}^{2},...,{x}_{i}^{D}\right]\). The process begins with randomized \({V}_{i}\) and \({X}_{i}\). The particles then traverse the multidimensional problem space, each remembering its best solution (\({P}_{Best}\)), while the global best solution (\({G}_{Best}\)) is also tracked. During each iteration, the particles adjust their velocities and positions in every dimension towards the \({P}_{Best}\) and \({G}_{Best}\) locations using Eqs. 13 and 14 [45]. In this way, PSO efficiently converges to the optimal solution.
$${v}_{i}^{d}\left(t+1\right)={v}_{i}^{d}\left(t\right)+{c}_{1}\bullet {\varepsilon }_{1}\bullet \left({{P}_{Best}}_{i}^{d}(t)-{x}_{i}^{d}\left(t\right)\right)+{c}_{2}\bullet {\varepsilon }_{2}\bullet \left({{G}_{Best}}^{d}(t)-{x}_{i}^{d}\left(t\right)\right)$$
(13)
$${x}_{i}^{d}\left(t+1\right)={x}_{i}^{d}\left(t\right)+{v}_{i}^{d}\left(t+1\right)$$
(14)
where \({c}_{1}\) and \({c}_{2}\) are two positive acceleration constants, and \({\varepsilon }_{1}\) and \({\varepsilon }_{2}\) are two uniform random values generated in the interval [0, 1] at each iteration.
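The update rule translates directly to code. One caveat: Eq. 13 is written without an inertia weight, while Table 4 lists w = 0.9, so this Python sketch includes w as in the common inertia-weight variant of PSO:

```python
import numpy as np

def pso_step(x, v, p_best, g_best, c1=0.5, c2=0.3, w=0.9, rng=np.random):
    """One PSO update for all particles (vectorized over the swarm)."""
    eps1 = rng.random(x.shape)  # epsilon_1 ~ U[0, 1]
    eps2 = rng.random(x.shape)  # epsilon_2 ~ U[0, 1]
    v_new = w * v + c1 * eps1 * (p_best - x) + c2 * eps2 * (g_best - x)  # Eq. 13
    return x + v_new, v_new                                              # Eq. 14
```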

4 Results and discussion

In this study, the CatBoost algorithm, optimised with the metaheuristic PSO approach, was implemented to forecast the number of barriers needed for intrusion detection and prevention. The performance of the model was compared with other state-of-the-art models, namely LightGBM, XGBoost, RF, and DT, which were also optimised with PSO to ensure an unbiased comparison. The feature importance of the input parameters was analysed for all models. Moreover, the performance of the model was compared with other existing methods in the literature that use the same dataset. The flowchart representing the detailed methodology can be seen in Fig. 7.
Fig. 7
Flowchart representing the detailed methodology
The dataset was divided into two subsets: a training set comprising 75% of the total data and a test set comprising the remaining 25%. To assess the effectiveness of each model, training employed fivefold cross-validation. The experiments were carried out in two phases. In the first phase, the algorithms were tuned using a grid search over a limited set of configurations. In the second phase, hyperparameter optimisation was performed using the metaheuristic PSO algorithm to enhance model performance. The robustness of the PSO algorithm was demonstrated through 30 tests with different seeds, using the parameter settings in Table 4.
Table 4
Values of the PSO parameters used in the experiment

| Parameter | Value |
| --- | --- |
| c1 (cognitive coefficient) | 0.5 |
| c2 (social coefficient) | 0.3 |
| w (inertia weight) | 0.9 |
| Number of iterations | 50 |
| Number of particles | 20 |
The pseudo-code of the implemented methodology is as follows:
1.  For each model m in [DT, RF, LightGBM, XGBoost, CatBoost]
2.      Repeat until the maximum number of runs is reached (1 to 30)
3.          Initialize the particle swarm with random velocities V_i = [v_i1, v_i2, ..., v_iD]
4.          Initialize the particle swarm with random positions X_i = [x_i1, x_i2, ..., x_iD]
5.          Set the PSO parameters (c1 = 0.5, c2 = 0.3, w = 0.9, population size = 20)
6.          Set the maximum and minimum values of each hyperparameter of model m
7.          Repeat until the maximum number of iterations is reached (1 to 50)
8.              For each particle i in the swarm
9.                  Initialize the hyperparameters of model m randomly within the given ranges
10.                 Evaluate the fitness value using the objective function (lowest MSE)
11.                 If the fitness value is better than the personal best (P_Best):
12.                     Set the current fitness value as P_Best
13.             Choose the particle with the best fitness value as G_Best
14.             For each particle i in the swarm
15.                 Update the velocity with Eq. 13
16.                 Update the position with Eq. 14
17.         Train model m with the optimised hyperparameters on the training set
18.         Evaluate the performance of model m on the test set using MAE, MSE, RMSE, and R2
19.         Store the performance results of model m
20. Compare the stored results and select the optimal model
By providing a structured and readable pseudo-code, the paper aims to facilitate a smoother implementation of the proposed approach and encourages collaboration and further exploration in the field of intrusion detection in WSNs using machine learning methods optimised with PSO.
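For concreteness, the following runnable Python sketch mirrors the pseudo-code for the CatBoost case. The three searched hyperparameters and their bounds follow Table 5, and the fitness is the fivefold cross-validated MSE as in step 10; this is an illustration under those assumptions, not the exact experimental script:

```python
import numpy as np
from catboost import CatBoostRegressor
from sklearn.model_selection import cross_val_score

# Searched hyperparameters and their bounds (subset of Table 5).
BOUNDS = [("max_depth", 1, 15), ("learning_rate", 0.01, 0.1), ("iterations", 10, 400)]

def fitness(pos, X, y):
    """Fivefold cross-validated MSE of CatBoost at a given position."""
    model = CatBoostRegressor(max_depth=int(round(pos[0])),
                              learning_rate=float(pos[1]),
                              iterations=int(round(pos[2])), verbose=False)
    return -cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_squared_error").mean()

def pso_search(X, y, n_particles=20, n_iter=50, c1=0.5, c2=0.3, w=0.9, seed=0):
    rng = np.random.default_rng(seed)
    lo = np.array([b[1] for b in BOUNDS], float)
    hi = np.array([b[2] for b in BOUNDS], float)
    x = lo + rng.random((n_particles, len(BOUNDS))) * (hi - lo)  # positions
    v = np.zeros_like(x)                                         # velocities
    p_best = x.copy()
    p_fit = np.array([fitness(p, X, y) for p in x])
    g_best = p_best[p_fit.argmin()]
    for _ in range(n_iter):
        e1, e2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * e1 * (p_best - x) + c2 * e2 * (g_best - x)  # Eq. 13
        x = np.clip(x + v, lo, hi)                                   # Eq. 14, bounded
        fit = np.array([fitness(p, X, y) for p in x])
        better = fit < p_fit
        p_best[better], p_fit[better] = x[better], fit[better]
        g_best = p_best[p_fit.argmin()]
    return dict(zip([b[0] for b in BOUNDS], g_best))
```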

4.1 Evaluation metrics

The performances of the models are compared using four evaluation metrics: the mean absolute error (MAE), the mean squared error (MSE), the root mean squared error (RMSE), and the coefficient of determination (R2). These metrics are obtained as follows:
$$MAE=\frac{\sum_{i=1}^{N}\left|{A}_{i} -{ E}_{i}\right|}{N}$$
(15)
$$MSE=\frac{\sum_{i=1}^{N}{\left({A}_{i} -{ E}_{i}\right)}^{2}}{N}$$
(16)
$$RMSE=\sqrt{\frac{\sum_{i=1}^{N}{\left({A}_{i} -{ E}_{i}\right)}^{2}}{N}}$$
(17)
$${R}^{2}=1-\frac{\sum_{i=1}^{N}{\left({A}_{i} -{ E}_{i}\right)}^{2}}{\sum_{i=1}^{N}{\left({A}_{i} -\overline{A}\right)}^{2}}$$
(18)
where N is the number of test samples, \({A}_{i}\) is the actual value, \(\overline{A}\) is the average of the actual values, and \({E}_{i}\) is the estimated value.
The R2 value lies within the range of 0 to 1; a score of 1 represents perfect precision, while a score of 0 indicates that the model's predictions are at their worst. The other three metrics, MAE, MSE, and RMSE, are highly correlated with each other [46], and their values range from 0 to +\(\infty\). Lower values of MAE, MSE, and RMSE indicate better model performance, whereas higher values of R2 indicate better performance.
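For reference, the four metrics can be computed directly from Eqs. 15-18; the short Python sketch below is a plain transcription of those definitions:

```python
import numpy as np

def evaluate(actual, estimated):
    """Compute MAE, MSE, RMSE, and R2 as defined in Eqs. 15-18."""
    a, e = np.asarray(actual, float), np.asarray(estimated, float)
    mae = np.mean(np.abs(a - e))                                     # Eq. 15
    mse = np.mean((a - e) ** 2)                                      # Eq. 16
    rmse = np.sqrt(mse)                                              # Eq. 17
    r2 = 1.0 - np.sum((a - e) ** 2) / np.sum((a - a.mean()) ** 2)    # Eq. 18
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}
```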

4.2 Hyperparameter optimisation

The best configuration for the decision-tree-based machine learning models has been determined through hyperparameter optimisation. Each model has its own set of internal parameters that need to be fine-tuned to achieve its best performance, and PSO was employed to optimize these parameters.
In selecting the hyperparameters for the machine learning models employed in this study, careful consideration was given to balancing model complexity, predictive performance, and computational efficiency. For the CatBoost algorithm, hyperparameters such as max_depth and iterations were selected to effectively manage tree complexity and avoid overfitting, and learning_rate was tuned to maintain a stable training process while ensuring optimal convergence speed. Similarly, the XGBoost hyperparameters eta and max_depth were chosen to achieve a trade-off between convergence speed and model complexity. The LightGBM hyperparameters num_leaves and min_data_in_leaf were tailored to prevent overfitting while keeping the tree structure manageable. The RF hyperparameters n_estimators and max_features were fine-tuned to manage model complexity efficiently. Additionally, the DT hyperparameters were chosen to regulate tree depth and the criteria for node splitting. All hyperparameter selections were informed by empirical evidence, domain expertise, and extensive experimentation to ensure optimal model performance for the given prediction task. Table 5 presents the hyperparameters of the models with their descriptions, lower and upper bounds, and the optimal values obtained after the optimisation process.
Table 5
Hyperparameters of the models and their descriptions, lower and upper bounds, and optimal values obtained by PSO

| Model name | Hyperparameter | Description | Range | Optimal value |
| --- | --- | --- | --- | --- |
| CatBoost | max_depth | Maximum depth of a tree | [1, 15] | 5 |
| CatBoost | learning_rate | A parameter that determines the step size at each iteration | [0.01, 0.1] | 0.0985 |
| CatBoost | iterations | Number of trees | [10, 400] | 375 |
| CatBoost | random_strength | The amount of randomness to use for scoring splits | [0, 0.2] | 0.0593 |
| CatBoost | l2_leaf_reg | L2 regularization term for the leaf weights | [1, 10] | 1 |
| XGBoost | eta | A parameter that determines the step size at each iteration | [0.1, 0.9] | 0.2130 |
| XGBoost | max_depth | Maximum depth of a tree | [3, 20] | 3 |
| XGBoost | n_estimators | Number of trees | [10, 200] | 200 |
| XGBoost | alpha | L1 regularization term on weights | [0, 0.0001] | 0.0001 |
| XGBoost | lambda | L2 regularization term on weights | [0.5, 1] | 0.5 |
| XGBoost | min_child_weight | Minimum sum of instance weight needed in a child | [1, 6] | 1 |
| XGBoost | subsample | Subsample ratio of training samples | [0.6, 1] | 0.6 |
| LightGBM | num_leaves | Maximum tree leaves | [2, 100] | 7 |
| LightGBM | min_data_in_leaf | The minimum number of data points required in a leaf node | [10, 300] | 10 |
| LightGBM | max_depth | Maximum tree depth | [1, 15] | 15 |
| LightGBM | learning_rate | A parameter that determines the step size at each iteration | [0.01, 0.1] | 0.0999 |
| RF | n_estimators | Number of trees | [5, 400] | 12 |
| RF | min_samples_split | Minimum number of samples placed in a node before the node is split | [2, 50] | 2 |
| RF | min_samples_leaf | Minimum number of samples allowed in a leaf node | [1, 50] | 1 |
| RF | max_depth | Maximum depth of a tree | [1, 100] | 45 |
| RF | max_features | Maximum number of features considered for splitting a node | [1, 15] | 14 |
| RF | min_weight_fraction_leaf | Minimum weighted fraction of the total sum of weights required to be at a leaf node | [0, 0.5] | 0.0 |
| DT | max_features | Maximum number of features considered for splitting a node | [0.4, 1] | 0.4 |
| DT | min_samples_split | Minimum number of samples placed in a node before the node is split | [2, 10] | 2 |
| DT | min_samples_leaf | Minimum number of samples allowed in a leaf node | [1, 20] | 1 |
| DT | max_depth | Maximum depth of a tree | [1, 15] | 13 |
The R2 results of the models, obtained through hyperparameter optimisation with 30 independent runs, are presented in Table 6. The performance comparison of the models was conducted using three statistics: the mean, the standard deviation, and the best of R2. These statistics provide insight into the overall performance and variability of the models' results.
Table 6
Statistical analysis of R2 values of the models

| Model name | Mean | Standard deviation | Best |
| --- | --- | --- | --- |
| CatBoost | 0.9997 | 7.2476e-05 | 0.9998 |
| XGBoost | 0.9983 | 0.0004 | 0.9989 |
| LightGBM | 0.9826 | 0.0267 | 0.9903 |
| RF | 0.9871 | 0.0103 | 0.9919 |
| DT | 0.9781 | 0.0119 | 0.9821 |

Best results are in bold
As given in Table 6, the highest R2 value has been achieved by the CatBoost with the best value of 0.9998. Moreover, it demonstrates a mean value of 0.9997 and a standard deviation of 7.2476e-05, highlighting its consistency and performance superiority. On the other hand, the LightGBM presents the largest standard deviation with a value of 0.0267, indicating that it is slightly more unstable compared to the other models.
The R2 results are also visualized in Fig. 8, illustrating the distributions of all 30 values of R2 for each model in detail. In the visualization, the red dashed line indicates the mean value, while the green dashed line indicates the median value. It is observed that in the proposed CatBoost model, these two lines are closest to each other, indicating the robustness of the model [22].
Fig. 8
Distributions of the R2 values obtained from the models

4.3 Performance comparison with other up-to-date models

In Fig. 9, the convergence of the PSO algorithm during the search process is illustrated. The figure reveals that during the initial 10 iterations, the algorithm rapidly converges towards local optima. Between iterations 10 and 20, however, its progress is impeded, suggesting difficulty in escaping local optima during this period; afterwards, the overall convergence speed decreases slightly. The mean R2 values derived from the 30 PSO-configured models are 0.9997, 0.9983, 0.9826, 0.9871, and 0.9781 for CatBoost, XGBoost, LightGBM, RF, and DT, respectively. Notably, the CatBoost model exhibits the most outstanding performance, followed by the XGBoost model in second place. By contrast, the DT model shows the worst performance among the five models.
Fig. 9
Convergence graph of the PSO process
The performance of the CatBoost is compared with other up-to-date models such as XGBoost, LightGBM, RF, and DT by using their best configurations. The comparison results are given in Table 7. In addition to the R2 value, the other evaluation metrics which are MAE, MSE, and RMSE values are also used to strengthen the comparison. As clearly seen, the CatBoost outperforms the XGBoost, LightGBM, RF, and DT with values of 0.9998 R2, 0.6298 MAE, 0.6018 MSE, and 0.7758 RMSE. The DT shows the worst performance considering all metrics.
Table 7
Performance comparison of the models

| Model name | MAE | MSE | RMSE | R2 |
| --- | --- | --- | --- | --- |
| CatBoost-PSO | 0.6298 | 0.6018 | 0.7758 | 0.9998 |
| XGBoost-PSO | 1.4660 | 3.6055 | 1.8988 | 0.9989 |
| LightGBM-PSO | 4.2256 | 32.8267 | 5.7295 | 0.9903 |
| RF-PSO | 3.9239 | 27.4825 | 5.2424 | 0.9919 |
| DT-PSO | 7.1304 | 60.5652 | 7.7824 | 0.9821 |

Best results are in bold
Figure 10 depicts the estimated number of barriers alongside the actual number of barriers for all models. The results show that the data are not uniformly distributed and are mostly clustered within the 0–100 range. The CatBoost model estimated the number of barriers with a narrower error range than XGBoost, LightGBM, RF, and DT, while the DT model performed poorly, with a wider error range. Built-in techniques of CatBoost, such as random permutations and bootstrapping, reduce overfitting, making it advantageous for smaller datasets. Additionally, its ability to handle missing data during both training and prediction eliminates the need for explicit imputation strategies, making it beneficial for datasets with missing values.
Fig. 10
Regression lines of the models; (a) CatBoost, (b) XGBoost, (c) LightGBM, (d) RF, and (e) DT

4.4 Analysis of time complexity

Time complexity is a crucial metric in the analysis and evaluation of algorithms, offering insight into their efficiency and scalability. Expressed in Big-O notation, time complexity characterizes the upper bound of an algorithm's runtime, representing the worst-case scenario. Common time complexities include constant (O(1)), linear (O(n)), logarithmic (O(log n)), quasilinear (O(n log n)), polynomial (O(n^a)), exponential (O(a^n)), and factorial (O(n!)). A lower time complexity generally implies faster algorithmic performance, making it a key consideration in algorithm design and selection [47].
The training time, prediction time and space complexities of the models used in this study are given in Table 8 [31, 48]. The parameters are defined as follows:
  • m is the number of features,
  • n is the number of data points,
  • n_tree refers to the number of individual decision trees,
  • n_boost represents the number of boosting rounds,
  • n_ensemble refers to the number of leaves in the ensemble,
  • K is the number of unique categories across all categorical features,
  • n_categories is the total number of unique categories across all features (both categorical and numerical)
Table 8
Training time, prediction time, and space complexities of the models

| Model name | Training time complexity | Prediction time complexity | Space complexity |
| --- | --- | --- | --- |
| DT | O(n*m*log(n)) | O(log(n)) | O(n*m) |
| RF | O(n*m*log(n)*n_tree) | O(n_tree*log(n)) | O(n*m*n_tree) |
| LightGBM | O(n*m*log(M)*n_boost) | O(n_ensemble) | O(n*m) |
| XGBoost | O(n*m*log(M)*n_boost) | O(n_ensemble) | O(n*m) |
| CatBoost | O(n*m*log(M)*n_boost) | O(n_ensemble) | O(n*m + K*n_categories) |
While LightGBM, XGBoost, and CatBoost are based on gradient boosting trees and share a similar overall time complexity, there are some nuances to consider. LightGBM utilizes leaf-wise splitting and histogram-based decision tree learning, which can lead to faster training compared to XGBoost's level-wise approach and exact gradient calculations. On the other hand, CatBoost excels in handling categorical data and can be fast in specific scenarios, but its overall training speed might be much slower than that of LightGBM and XGBoost.
Table 9 presents the training and prediction times of each model. To ensure the reliability of the computational results, every model underwent 10 runs on two devices (D1 and D2) with distinct specifications, and the average run times were calculated separately for each device. D1 is equipped with an Intel(R) Core(TM) i7-8700 CPU (6 cores) running at 3.20 GHz, 16 GB RAM, and Windows 10 Pro 64-bit. D2 features an Intel(R) Core(TM) i9-12900 CPU (16 cores) running at 2.4 GHz, 64 GB RAM, and Microsoft Windows 11 Pro 64-bit.
Table 9
Comparison of the computing times of the models

| Model name | Training time D1 (ms) | Training time D2 (ms) | Prediction time D1 (ms) | Prediction time D2 (ms) |
| --- | --- | --- | --- | --- |
| DT-PSO | 1.5930 | 1.2701 | 0.7084 | 0.4787 |
| RF-PSO | 72.5962 | 45.7289 | 3.6933 | 1.5708 |
| LightGBM-PSO | 8.8765 | 6.2572 | 2.1063 | 1.5566 |
| XGBoost-PSO | 67.4722 | 33.3975 | 2.0150 | 1.6232 |
| CatBoost-PSO | 365.0140 | 116.0589 | 0.7009 | 0.4326 |

Best results are in bold
The computational results reveal that all the machine learning models considered can predict the number of barriers within milliseconds. This efficiency highlights the practicality of employing these models for real-time intruder detection [49, 50]. Although the proposed CatBoost-PSO model has the worst training time, it shows the best prediction time. To assess the speed of the system, the prediction time matters more than the training time, because the model is trained once before it starts operating live. Therefore, the proposed model is the best in terms of real-time operation.
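A hedged sketch of the measurement procedure behind Table 9 follows; the paper does not give the exact timing script, so `model` and `X_test` here stand for any fitted regressor and the test features:

```python
import time
import numpy as np

def mean_prediction_time_ms(model, X_test, runs=10):
    """Average the prediction latency of a fitted model over several runs."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        model.predict(X_test)
        times.append((time.perf_counter() - t0) * 1e3)  # seconds -> milliseconds
    return float(np.mean(times))
```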

4.5 Analysis of feature importance

The prominence of each feature has been evaluated for all models using the Mean Decrease Impurity (MDI) approach, a commonly used algorithm for calculating the overall importance of each feature in decision-tree-based ensemble methods [51]. MDI determines the significance of each feature by assessing its impact on reducing impurity within the tree nodes during model training; features that contribute the most significant decrease in impurity are regarded as more important. Since all models used in this study are decision-tree-based, the MDI algorithm is preferred for the feature importance analysis. The bars depicted in Fig. 11 represent the relative feature importance score of each feature. As clearly seen, in the estimation of the number of barriers for intrusion detection and prevention using CatBoost, the network area and the transmission range emerge as the most relevant features. For XGBoost and LightGBM, however, only the network area and sensing range demonstrate significant effects, while the other features appear to have no influence on the estimation. Interestingly, all four features exhibit almost equal importance in the RF model, showing similar contributions to the estimation. As for the DT model, the sensing range stands out as the most effective feature, followed by the network area.
Fig. 11
Feature importance of the input parameters
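Extracting MDI importances from a fitted tree-based model takes one attribute access. The Python sketch below runs on synthetic data shaped like the four WSN features; the toy target is invented for demonstration only, and the real importances in Fig. 11 come from the fitted study models:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Feature ranges follow Table 2: area, sensing range, transmission range, nodes.
X = rng.uniform([5000, 15, 30, 100], [50000, 40, 80, 400], size=(183, 4))
y = X[:, 0] / 500.0 + X[:, 2]  # toy target, not the real simulation output
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
names = ["network area", "sensing range", "transmission range", "number of sensor nodes"]
for name, score in zip(names, model.feature_importances_):  # MDI importances
    print(f"{name}: {score:.3f}")
```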

4.6 Performance comparison with other existing methods in the literature

In this study, a comprehensive performance comparison of the proposed method with other existing methods in the literature was conducted, all of which have been applied to the same dataset. Because the dataset is relatively new, released in January 2022 [24], only a limited number of studies are available so far. In each study, multiple machine learning methods were employed, and their results were presented. These outcomes are provided in Table 10.
Table 10
Performance comparison of the proposed model with other existing methods

| Model name | MAE | MSE | RMSE | R2 |
| --- | --- | --- | --- | --- |
| CatBoost-PSO (This paper) | 0.6298 | 0.6018 | 0.7758 | 0.9998 |
| LT-FS-ID [25] | - | 41.87 | 6.47 | 0.98 |
| ANN [25] | - | 2150.20 | 46.37 | 0.38 |
| GRNN [25] | - | 3312.00 | 57.56 | 0.96 |
| RF [25] | - | 1033.60 | 32.15 | 0.99 |
| GPR [25] | - | 4074.70 | 63.83 | 0.94 |
| FANN [2] | - | - | 48.36 | 0.79 |
| IT3FLS [52] | - | - | 5.36 | 0.997 |
| P2CA-GAM-ID [53] | - | - | 15.22 | 0.97 |
| ENFS-Uni0-reg [4] | - | - | 11.16 | - |

Best results are in bold
As clearly seen in Table 10, the proposed model, CatBoost optimised with PSO, outperforms the other existing methods in estimating the number of barriers for intrusion detection and prevention, with an MAE of 0.6298, an MSE of 0.6018, an RMSE of 0.7758, and an R2 of 0.9998. By contrast, the ANN shows the worst performance, with an R2 of 0.38. Considering the MSE and RMSE metrics, the GPR model exhibits the poorest performance among the models listed in the table, with an MSE of 4074.70 and an RMSE of 63.83. Nevertheless, the ANN model can be regarded as relatively worse overall, since the R2 metric is more informative for performance evaluation than MAE, MSE, and RMSE [46, 54].
CatBoost utilizes an optimised ordered boosting technique, enhancing the training process and enabling more effective learning from data. This strategy contributes to improved performance, particularly in regression tasks where accurately capturing relationships between features is crucial. Also, this technique naturally handles categorical data, making it convenient for datasets with a mix of categorical and numerical features [55, 56]. As seen in Fig. 2, the features exhibit a categorical distribution of values, which contributes to the high performance demonstrated by the CatBoost algorithm. Additionally, in the proposed approach, the CatBoost model is optimised with PSO to further enhance its performance.

5 Conclusion

This study proposed an effective machine learning model to estimate the number of k-barriers for robust intrusion detection and prevention for a rectangular area using features of the WSN. The features were obtained through the Monte-Carlo simulation process. The proposed CatBoost model, which is optimised with the PSO algorithm, estimated the number of barriers accurately. To ensure a fair assessment, the performance of the proposed model was compared with state-of-the-art methods such as LightGBM, XGBoost, RF, and DT using the MAE, MSE, RMSE, and R2 metrics. The proposed model showed a clear superiority. Additionally, the performance of the proposed model was assessed against the other existing methods in the literature that use the same dataset, and it yielded the best results as well. Afterwards, the feature importance was analysed. The network area and transmission range appeared as the most relevant features for the proposed model.
The limitations of the proposed CatBoost-PSO approach can be grouped into theoretical and practical aspects. From a theoretical perspective, CatBoost requires more computational resources during training than the alternative models. Furthermore, PSO is prone to getting stuck in local optima, especially in complex optimisation scenarios, and its effectiveness decreases in high-dimensional optimisation spaces, where the expansive search space limits the exploration capability of the particles. From a practical perspective, the evaluation of the proposed model was limited to a rectangular WSN region, and it would be beneficial to explore its performance in a circular network region as well. Additionally, the proposed model was tested only under a uniform distribution of sensors; further research could evaluate its effectiveness under a Gaussian distribution of sensors. These potential expansions present promising paths for future work and could enhance the versatility and robustness of the proposed model across various deployment scenarios.
In future work, feature engineering techniques could also be used to enhance the predictive capability of the proposed model, for example by generating new features from the existing ones or by transforming them. Additionally, correlation analysis could be conducted to identify highly correlated features; removing one feature from each highly correlated pair can help reduce the time complexity of the proposed model (a minimal sketch of such a correlation screen is given below). Finally, surrogate-assisted optimisation algorithms, such as Bayesian optimisation, could be implemented to reduce training time while maintaining reasonable detection accuracy.
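As a sketch of the correlation screen suggested above, the snippet below flags one feature from each highly correlated pair; the DataFrame contents and the 0.9 threshold are illustrative assumptions.

```python
# Minimal sketch: dropping one feature from each highly correlated pair.
# Values and the 0.9 threshold are illustrative assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "network_area": [5000, 12000, 9000, 20000],
    "sensing_range": [15, 20, 25, 30],
    "transmission_range": [30, 40, 50, 60],   # perfectly correlated with sensing_range here
    "num_sensors": [100, 300, 150, 250],
})

corr = df.corr().abs()
# Keep only the upper triangle so each feature pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

print("Highly correlated features to drop:", to_drop)
reduced = df.drop(columns=to_drop)
```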

Declarations

Conflict of interest

The author declares no competing interests.

Ethical approval

Ethics approval was not required for this study.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.



Literature
4.
16. Sharma M, Kumar CRS (2021) Machine learning-based smart surveillance and intrusion detection system for national geographic borders. In: Artificial Intelligence and Technologies: Select Proceedings of ICRTAC-AIT 2020, pp 165–176. Springer, Singapore
18. Antony Joseph Rajan D, Gomathy CK (2023) A robust intrusion detection mechanism in wireless sensor networks against well-armed attackers. Int J Intell Syst Appl Eng 11(2):180–187
25. Singh A, Amutha J, Nagar J, Sharma S, Lee CC (2022) LT-FS-ID: Log-transformed feature learning and feature-scaling-based machine learning algorithms to predict the k-barriers for intrusion detection using wireless sensor network. Sensors 22(3). https://doi.org/10.3390/s22031070
34. Ke G et al (2017) LightGBM: A highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst, pp 3147–3155
37. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A (2018) CatBoost: Unbiased boosting with categorical features. Adv Neural Inf Process Syst, pp 6638–6648
38. Zhang K, Schölkopf B, Muandet K, Wang Z (2013) Domain adaptation under target and conditional shift. In: 30th Int Conf Mach Learn (ICML 2013), vol 28, part 3, pp 1856–1864