
A novel approach for spam detection using horse herd optimization algorithm

  • Open Access
  • 29-03-2022
  • Original Article

Abstract

The article introduces a novel approach for spam detection using the Horse Herd Optimization Algorithm (HOA), which is transformed into a binary and multiobjective version for feature selection and classification in email spam detection. The study highlights the significant challenges posed by spam emails and the limitations of existing detection methods. The proposed MOBHOA algorithm is shown to outperform traditional methods in terms of accuracy, precision, and computational efficiency. The article also provides a detailed explanation of the HOA algorithm and its adaptation for spam detection, as well as a comprehensive evaluation of the proposed method using the Spam Base dataset from the UCI repository. The results demonstrate the superior performance of MOBHOA in detecting spam emails, making it a promising solution for improving email security and reducing the burden of spam on users and data centers.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

There are several types of email messages that computer users do not opt to receive in their email inboxes, such as spam, bulk email, junk email, promotion and commercial emails, and so on. These messages have some differences; however, in this study, they are all considered spam emails. Inappropriate messages on a large scale on the Internet that do not have useful content for the user would be classified as spam. Spam can be distributed in different formats and on various platforms. Social media spam, web spam, forum spam, spam instant messaging, email spam, and so on are examples of various types of spam. Although the majority of internet-based platforms can be successfully utilized to transmit spam, email spamming has grown in popularity due to its widespread use for a variety of purposes [34]. Text REtrieval Conference (TREC) has a definition for spam: "Spam is unsolicited mail that is sent vaguely, directly or indirectly by someone who has no relationship with the recipient of the letter" [11].
Although emails are an effective and easily accessible means of communication, they can become a nuisance when marketers exploit them to advertise their products and scammers use them to deceive people. The significant negative effect of spam emails is not limited to the severe waste of resources, time, and effort; it also increases the burden of communication and cybercrime, affecting even the global economy and costing businesses and individuals millions of dollars annually. Unwanted emails, in addition to consuming resources such as bandwidth, removal time, and storage space, also pose a security threat [5]. Attackers use a variety of methods to gain access to the victim's information. Email systems are one of the platforms used by attackers to spread malware. A recent McAfee report states that more than 97% of spam emails in the last four months of 2017 were sent via the Necurs and Gamut botnets [33].
Manual inspection of suspicious emails by users can prevent attackers from reaching their goals. To facilitate identifying suspicious emails, users should, after observing their characteristics, immediately take the necessary actions to prevent spam distribution and inform the relevant institutions [16]. However, developing efficient mechanisms to automatically identify unsolicited emails is very important. Some of the characteristics of emails that are believed to be malicious are listed in “Appendix”.
Spam detection is a challenging problem, and several techniques have been developed and introduced to automatically detect spam emails; however, not all of them show an accuracy of 100%. Machine learning and deep learning techniques have proven to be the most successful of the methods introduced. In recent years, one of the common applications of machine learning has been spam detection [53]. Natural Language Processing (NLP) helps these methods to increase their accuracy. These spam detection methods consist of two stages: feature selection and classification [15].
Optimization algorithms can also help in developing spam detection systems. The Horse herd Optimisation Algorithm (HOA) [35] is a novel meta-heuristic algorithm with high exploration and exploitation performance. It excels at finding optimal solutions to high-dimensional problems. In this article, our objective is to present a new method for detecting spam emails using HOA. To do this, we first convert the basic HOA, which is a continuous algorithm, to a discrete algorithm and then modify it into a multiobjective algorithm to solve multiobjective problems. Finally, the new multiobjective binary HOA is used to select the important features of spam emails so that received emails are correctly classified as spam or genuine. These two categories are then evaluated.
This study's main motivation for using HOA in solving spam detection problems was its outstanding performance in addressing complex high-dimensional problems. It is exceptionally efficient in exploration and exploitation. It can find the optimal solution very fast, with a low cost and complexity. With regards to accuracy and efficiency, it outperforms many well-known optimization algorithms such as the grasshopper optimization algorithm [48], the sine cosine algorithm [38], the multi-verse optimizer [39], the moth-flame optimization algorithm [36], the dragonfly algorithm [37], and the grey wolf optimizer [40].
Overall, the current study has the following main contributions:
  • HOA, a novel metaheuristic algorithm for high exploration and faster convergence, has been used in the study. To the best of the authors’ knowledge, this algorithm has not yet been used for spam detection.
  • The original HOA was a single objective algorithm developed to solve continuous problems. In this study, HOA was discretized and converted to a multiobjective algorithm.
  • The original HOA was transformed into a binary opposition-based algorithm.
  • Using HOA for feature selection, a novel spam detection method is proposed.
  • After selecting the optimal features, the K-Nearest Neighbours (KNN) classification method was used to classify the collection of spam emails.
  • According to the evaluation results, the proposed method outperforms well-known algorithms in terms of accuracy, precision, and sensitivity.
The remainder of this article is organized as follows: Sect. 2 introduces the related works. In Sect. 3, the original horse herd optimization algorithm is presented. Section 4 introduces the new proposed approach, and finally, in Sect. 5, the evaluation results and conclusion are discussed.

2 Related works

Unsolicited spam emails sent by marketers to promote their products are regarded as annoying since they take up a lot of space on servers [45]. Some innocent users may also fall prey to fake emails [21]. Scammers send such emails to obtain users' bank account details and steal money. Attackers and hackers also hide viruses and other malicious software behind attractive and exciting offer links in spam emails [23]. Therefore, the problem of spam emails should be addressed immediately, and effective measures should be taken to control it. Efforts have been made to reduce spam emails, including the development of advanced filtering tools and anti-spam laws in the United States [5].
Many researchers have focused their attention on the email spam detection problem, and in the literature, several notable approaches have been proposed. This section discusses some of the previous studies focusing on detecting and classifying spam through machine learning techniques and deep learning algorithms. One of the widely used algorithms for this problem is Naive Bayes [4, 47, 50]. There are various techniques introduced for detecting spam; however, our main focus would be on metaheuristic optimization algorithms in the present study.
A decision tree was applied in the study by Carreras and Marquez [8] to filter unwanted emails. Because the features of spam emails are difficult to define, this method is not extensively employed in spam filtering. K-nearest neighbours (KNN), Naïve Bayes, and Reverse DBSCAN algorithms were used by Harisinghaney et al. [18] to classify image-based and text-based spam, and a performance comparison of these algorithms was provided based on four measuring factors.
Egozi and Verma [13] used natural language processing techniques to detect phishing emails. Their model applies a feature selection method to select 26 features in order to determine if an email is a genuine email or spam. With only 26 features, their approach correctly identified more than 95% of ham emails as well as 80% of phishing emails.
Sharma and Bhardwaj [51] introduced a spam mail detection (SMD) system based on hybrid machine learning applying Naive Bayes and the J48 decision tree. This system consists of four modules: data set preparation, data preprocessing, feature selection, and a hybrid bagged approach. A total of three experiments were performed; the first two were conducted based on Naive Bayes and J48, and the third was the proposed SMD, which achieved an accuracy of 87.5%.
A new model for spam detection (THEMIS) was introduced by Soni [52] that examines emails at the header, body, character, and word level simultaneously. This approach uses deep convolutional neural network algorithms to recognize spam emails. The evaluation results show that THEMIS's accuracy of 99.84% is higher than that of LSTM and CNN models.
In the study by GuangJun et al. [17], a method is proposed for spam classification in mobile communication using predictive machine learning models (e.g., logistic regression, K-Nearest Neighbor, and decision tree). Experiment results suggest that this method is accurate and timely in detecting spam and can protect email communication in mobile systems.
The study by Bibi et al. [6] provides a comparison of past spam filtering algorithms discussing their accuracy and the employed data sets. The study presents in-depth knowledge of the simple Naive Bayes algorithm, which is one of the best algorithms for text classification. This study evaluated classifier machine learning algorithms in spam detection and found that using WEKA, the Naïve Bayes algorithm provides effective accuracy and precision.
Mohmmadzadeh [42] developed a new hybrid model by combining the whale optimization algorithms and the flower pollination algorithms to solve the feature selection problem on the basis of opposition-based learning for detecting spams. The new model has higher accuracy in spam detection compared to previous approaches.
A spam detection approach using word embedding based on deep learning architecture in the NLP context was introduced by Srinivasan et al. [53]. The study reveals that deep learning outperforms standard machine learning classifiers when it comes to spam detection.
Apart from the sample methods described earlier, other methods are also available that only used metaheuristic algorithms, but none of the proposed methods are entirely accurate, and they are all erroneous to some extent. Moreover, only the classification phase was carried out in many previous methods, and the feature selection phase was not implemented. Feature selection reduces the dimensions of computation and increases classification accuracy by removing unnecessary features. Due to the lack of the feature selection process, the majority of the previous solutions spend a tremendous amount of time running the algorithm and do not have a high accuracy percentage. Table 1 demonstrates some examples of optimization methods used in spam detection that have been published recently, with some drawbacks that the proposed method in this study attempts to rectify.
Table 1
Examples of the recent spam detection methods
  • Abdulhamid et al. [1]. Classifier: rotation forest algorithm. The whale optimization algorithm was used for selecting spam features, and the rotation forest algorithm was used for classification. Disadvantage: whale optimization algorithms rely too much on the optimal member of the population to find the optimal position.
  • Pandey and Rajpoot [43]. Classifier: spiral cuckoo search clustering. The whale optimization algorithm was employed for selecting spam features, and the spiral cuckoo search clustering method was applied to solve the convergence problem in spam detection and classification. Disadvantages: whale optimization algorithms rely too much on the optimal member of the population to find the optimal position, and the cuckoo algorithm has difficulty finding the optimal solution.
  • Batra et al. [5]. Classifier: BIC and k-NN. The integration of BIC metaheuristic algorithms with k-NN has been used to detect spam. Disadvantages: it has a low classification rate and employs a classification technique with a probability of inaccuracy.
  • Yaseen [57]. Classifier: BERT Base Cased. The pre-trained transformer model BERT is used for spam detection. Disadvantage: it has a significant error rate, mainly due to the inclusion of all the spam features in the training phase.
  • Srinivasan et al. [53]. Classifier: CNN-LSTM. Detection of spam is based on black classification and machine learning classifiers. Disadvantage: it has a significant error rate, mainly due to the inclusion of all the spam features in the training phase.
  • Wang et al. [54]. Classifier: Isomap + SVM. The Laplace feature map algorithm is used to obtain geometric information from the email text dataset and to extract the features. Disadvantage: it has a significant error rate, mainly due to the inclusion of all the spam features in the training phase.
  • Dedeturk and Akay [12]. Classifier: ABC-LR. A combination of the artificial bee colony algorithm and a logistic regression model was employed for spam detection. Disadvantage: uses filtering techniques to select spam features.
  • Pashiri et al. [44]. Classifier: ANN. Feature selection is performed using the sine–cosine algorithm. Disadvantage: has computational complexity.
As can be seen in Table 1, even the most recent methods are not 100% accurate, need considerable time to execute, and some have high computational complexity and high error rates. Thus, the objective of the current study was to employ a robust metaheuristic optimization algorithm, highly efficient in exploration and exploitation, to enhance the computation speed and accuracy of spam detection and reduce the error rate. After a comprehensive search of the literature and examination of several optimization algorithms, the authors decided to use the novel metaheuristic optimization algorithm HOA for the feature selection phase of the proposed approach; as a result, the spam detection method suggested by the current study is based on HOA. This optimization algorithm has been tested on multiple well-known test functions in high dimensions and has proven able to solve challenging and high-dimensional problems.
In order to carefully assess and evaluate the performance and efficiency of the proposed method, some of the most popular and highly efficient optimization and classification algorithms in the literature were selected for the simulation, and their performance was compared to the proposed method’s performance. The simulation results indicate that the proposed method outperforms the previous methods, and demonstrates a high level of accuracy and precision, spends less execution time, and has lower error rate. Thus, the new method’s superiority is its higher accuracy and speed, and lower error rate and complexity.
As stated earlier, to be able to use HOA for feature selection, we converted it into a discrete algorithm, since it was originally a continuous algorithm. Then, because feature selection is also a multiobjective problem, we transformed HOA into a multiobjective HOA and used it to select spam features. To the best of our knowledge, this is the first research in the field that presents a binary and multiobjective version of HOA. The following section introduces the horse herd optimization algorithm.

3 Horse herd optimization algorithm

In recent years, various metaheuristic algorithms have been employed to solve a wide range of optimization problems [10, 29, 56]. A reason for this is the ability of metaheuristic algorithms to mathematically model and solve a variety of real-world problems [49]. This study aimed to employ a novel metaheuristic algorithm for solving the feature selection problem for detecting spam emails. Therefore, the Horse herd Optimisation Algorithm (HOA) was used as the primary method for this purpose. HOA, proposed in the study by MiarNaeimi et al. [35], is a robust metaheuristic algorithm inspired by the horses’ herding behaviors at various ages. Because of the vast number of control factors based on the behavior of horses of various ages, HOA shows an outstanding performance at addressing complex high-dimensional problems. Its performance at high dimensions (up to 10,000) has been evaluated using popular test functions, and it was discovered to be extremely efficient in exploration and exploitation. It has the ability to find the best solution in the shortest time, at the lowest cost, and with the least amount of complexity, and in terms of accuracy and efficiency, it outperforms many well-known metaheuristic optimization algorithms. This algorithm is discussed in greater detail in the following section.
At different ages, horses show various behaviors [35]. A horse's maximum lifespan is around 25–30 years [25]. In HOA, horses are divided into four categories according to their age: 0–5, 5–10, 10–15, and older than 15 years, represented by δ, γ, β, and α, respectively. HOA uses six general horse behaviors at these ages to simulate their social life: "grazing, hierarchy, sociability, imitation, defence mechanism and roaming".
Equation (1) describes the horse movement at each iteration:
$$X_{m}^{{\text{Iter,AGE}}} = \vec{V}_{m}^{{\text{Iter,AGE}}} + X_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} , \quad {\text{AGE}} = \alpha ,\beta ,\gamma ,\delta$$
(1)
where \(X_{m}^{{{\text{Iter}},{\text{AGE}}}}\) is the position of the mth horse, \(\vec{V}_{m}^{{{\text{Iter}},{\text{AGE}}}}\) is the velocity vector of the mth horse, AGE is the horse age range, and Iter is the current iteration.
At each iteration, a complete matrix of responses is required to determine the horses' ages. The matrix is sorted according to the best responses, and the top 10% of the horses are chosen as α. The β, γ, and δ horses comprise the next 20%, 30%, and 40% of the remaining horses, respectively. To determine the velocity vector, the steps simulating the six behaviors mentioned above are implemented mathematically. During each cycle of the algorithm, the motion vector of horses of various ages can be expressed by Eq. (2) [35]:
$$\begin{aligned} \vec{V}_{m}^{{{\text{Iter,}}\alpha }} & = \vec{G}_{m}^{{{\text{Iter,}}\alpha }} + \vec{D}_{m}^{{{\text{Iter,}}\alpha }} \\ \vec{V}_{m}^{{{\text{Iter}},\beta }} & = \vec{G}_{m}^{{{\text{Iter}},\beta }} + \vec{H}_{m}^{{{\text{Iter}},\beta }} + \vec{S}_{m}^{{{\text{Iter}},\beta }} + \vec{D}_{m}^{{{\text{Iter}},\beta }} \\ \vec{V}_{m}^{{{\text{Iter,}}\gamma }} & = \vec{G}_{m}^{{{\text{Iter,}}\gamma }} + \vec{H}_{m}^{{{\text{Iter,}}\gamma }} + \vec{S}_{m}^{{{\text{Iter,}}\gamma }} + \vec{I}_{m}^{{{\text{Iter,}}\gamma }} + \vec{D}_{m}^{{{\text{Iter,}}\gamma }} + \vec{R}_{m}^{{{\text{Iter,}}\gamma }} \\ \vec{V}_{m}^{{{\text{Iter}},\delta }} & = \vec{G}_{m}^{{{\text{Iter}},\delta }} + \vec{I}_{m}^{{{\text{Iter}},\delta }} + \vec{R}_{m}^{{{\text{Iter}},\delta }} \\ \end{aligned}$$
(2)
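The ranking rule above can be sketched in Python. The helper name is ours, and the rounding at the group boundaries is an assumption; the split follows the conventional 10/20/30/40% proportions for α/β/γ/δ described in the text.

```python
import numpy as np

def assign_age_groups(costs):
    """Rank horses by fitness (lower cost is better) and split the sorted
    population into HOA age groups: best 10% -> alpha, next 20% -> beta,
    next 30% -> gamma, remaining 40% -> delta."""
    n = len(costs)
    order = np.argsort(costs)                  # best horses first
    groups = np.empty(n, dtype=object)
    n_alpha = max(1, round(0.10 * n))
    n_beta = round(0.20 * n)
    n_gamma = round(0.30 * n)
    groups[order[:n_alpha]] = "alpha"
    groups[order[n_alpha:n_alpha + n_beta]] = "beta"
    groups[order[n_alpha + n_beta:n_alpha + n_beta + n_gamma]] = "gamma"
    groups[order[n_alpha + n_beta + n_gamma:]] = "delta"
    return groups
```

The assignment is recomputed every iteration, since horses change groups as their fitness ranks change.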
As stated earlier, HOA is inspired by horses and their six general and social behaviors in various ages. The six behaviors and their mathematical implementation are discussed as follows.
Grazing: Horses are grazing animals that graze at all stages of their lives for about 16–20 h per day [25]. Equations (3) and (4) mathematically implement this behavior in HOA [35].
$$\vec{G}_{m}^{{\text{Iter,AGE}}} = g_{m}^{{\text{Iter,AGE}}} \left( {\check{u}} + \rho {\check{l}} \right) + \left[ {X_{m}^{{({\text{Iter}} - 1)}} } \right],\quad {\text{AGE}} = \alpha ,\beta ,\gamma ,\delta$$
(3)
$$g_{m}^{{{\text{Iter}},{\text{AGE}}}} = g_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{g}$$
(4)
In the above equations, \(\vec{G}_{m}^{{{\text{Iter}},{\text{AGE}}}}\) is the mth horse's motion parameter indicating its tendency to graze; this factor decreases linearly with \({\omega }_{g}\) in each iteration. \({\check{u}}\) is the upper bound of the grazing space, with a recommended value of 1.05, and \({\check{l}}\) is the lower bound, with a recommended value of 0.95. \(\rho\) is a random number between 0 and 1. The coefficient \(g\) is recommended to be set to 1.5 for all age ranges.
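A minimal sketch of the grazing update, Eqs. (3) and (4). The function name, the default reduction factor, and the random-number handling are illustrative assumptions; only the recommended bounds 1.05 and 0.95 come from the text.

```python
import numpy as np

def grazing_term(x_prev, g_prev, u_hat=1.05, l_hat=0.95, omega_g=0.95, rng=None):
    """Grazing motion of one horse.

    x_prev : previous position vector of the horse
    g_prev : its grazing coefficient from the previous iteration
    Returns the grazing vector G (Eq. 3) and the reduced coefficient (Eq. 4).
    """
    rng = np.random.default_rng() if rng is None else rng
    g = g_prev * omega_g                      # Eq. (4): coefficient shrinks each cycle
    rho = rng.uniform(0.0, 1.0)               # random number in [0, 1]
    return g * (u_hat + rho * l_hat) + x_prev, g   # Eq. (3)
```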
Hierarchy: Horses are not self-sufficient, and they usually follow a leader, which could be a human, an adult stallion, or a mare. This occurs in the hierarchy law [7]. The most experienced and strongest horse tends to lead in a herd of horses, and others follow it. Horses between the ages of 5 and 15 (β and γ) were shown to follow the hierarchy law. The hierarchy is implemented according to Eqs. (5) and (6) below [35]:
$$\vec{H}_{m}^{{\text{Iter,AGE}}} = h_{m}^{{\text{Iter,AGE}}} \left[ {X_{*}^{{({\text{Iter}} - 1)}} - X_{m}^{{({\text{Iter}} - 1)}} } \right], \quad {\text{AGE}} = \beta \;{\text{and}}\;\gamma$$
(5)
$$h_{m}^{{\text{Iter,AGE}}} = h_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{h}$$
(6)
where \(\vec{H}_{m}^{{{\text{Iter}},{\text{AGE}}}}\) is the impact of the location of the leader horse on the velocity, and \(X_{*}^{{({\text{Iter}} - 1)}}\) indicates the location of that horse.
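The hierarchy term, Eqs. (5) and (6), is a pull toward the leader (best) horse. This is a sketch under our own naming; the default reduction factor is an assumption.

```python
import numpy as np

def hierarchy_term(x_prev, x_leader, h_prev, omega_h=0.95):
    """Pull toward the leader horse's position.

    Returns the hierarchy vector H (Eq. 5) and the reduced
    coefficient h (Eq. 6)."""
    h = h_prev * omega_h                      # Eq. (6)
    return h * (x_leader - x_prev), h         # Eq. (5)
```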
Sociability: Sociability is another horse behavior that inspired HOA. Horses require social interaction and may coexist with other animals, which increases their chances of survival. Some horses appear to enjoy being with other animals such as cattle and sheep [25]. Horses between the ages of 5 and 15 years show this behavior. Socialization in HOA is modeled as movement towards the position of other horses in the herd and is implemented using Eqs. (7) and (8) [35]:
$$\vec{S}_{m}^{{\text{Iter,AGE}}} = s_{m}^{{\text{Iter,AGE}}} \left[ {\left( {\frac{1}{N}\mathop \sum \limits_{j = 1}^{N} X_{j}^{{({\text{Iter}} - 1)}} } \right) - X_{m}^{{({\text{Iter}} - 1)}} } \right], \quad {\text{AGE}} = \beta ,\gamma$$
(7)
$$s_{m}^{{\text{Iter,AGE}}} = s_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{s}$$
(8)
where \(\vec{S}_{m}^{{\text{Iter,AGE}}}\) is the mth horse's social motion vector, and \(s_{m}^{{\text{Iter,AGE}}}\) is the same horse's orientation towards the herd in the Iterth iteration; \(s_{m}^{{\text{Iter,AGE}}}\) decreases by a factor \(\omega_{s}\) in each cycle. The total number of horses is denoted by N, and AGE is each horse's age range in the herd. The s coefficient of β and γ horses is calculated in the parameters' sensitivity analysis.
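The sociability term, Eqs. (7) and (8), pulls a horse toward the mean position of the whole herd. A sketch with assumed names and defaults:

```python
import numpy as np

def sociability_term(x_prev, positions, s_prev, omega_s=0.95):
    """Pull toward the herd's mean position.

    positions : (N, d) array of all horses' previous positions
    Returns the social vector S (Eq. 7) and the reduced
    coefficient s (Eq. 8)."""
    s = s_prev * omega_s                      # Eq. (8)
    herd_mean = positions.mean(axis=0)        # (1/N) * sum over all X_j
    return s * (herd_mean - x_prev), s        # Eq. (7)
```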
Imitation: Horses learn each other's good and undesirable habits and behaviors by imitating one another [7]. Imitation is another horse behavior that inspired HOA. Young horses attempt to imitate others, and this behavior persists throughout their lives. Imitation is described by Eqs. (9) and (10) [35]:
$$\vec{I}_{m}^{{\text{Iter,AGE}}} = i_{m}^{{\text{Iter,AGE}}} \left[ {\left( {\frac{1}{pN}\mathop \sum \limits_{j = 1}^{pN} \hat{X}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right],\quad {\text{AGE}} = \gamma$$
(9)
$$i_{m}^{{\text{Iter,AGE}}} = i_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{i}$$
(10)
In the above equations, \(\vec{I}_{m}^{{\text{Iter,AGE}}}\) shows the mth horse's motion vector towards the average of the best horses, whose locations are denoted by \(\widehat{X}\). pN represents the number of horses with the best locations, and p is recommended to be set to 10% of the total horses in the herd. \({\omega }_{i}\) is the per-cycle reduction factor for \(i_{m}^{{\text{Iter,AGE}}}\).
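The imitation term, Eqs. (9) and (10), pulls a horse toward the average of the best p·N horses. A sketch with assumed names; p = 10% follows the recommendation in the text.

```python
import numpy as np

def imitation_term(x_prev, positions, costs, i_prev, p=0.10, omega_i=0.95):
    """Pull toward the average of the best p*N horses.

    positions : (N, d) array of positions; costs : matching fitness
    values (lower is better). Returns the imitation vector I (Eq. 9)
    and the reduced coefficient i (Eq. 10)."""
    i = i_prev * omega_i                      # Eq. (10)
    n_best = max(1, round(p * len(costs)))    # pN, at least one horse
    best = positions[np.argsort(costs)[:n_best]]
    return i * (best.mean(axis=0) - x_prev), i   # Eq. (9)
```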
Defense: Horses use fight-or-flight behavior to defend themselves. Their initial impulse is to flee, and, when trapped, they usually buck. They fight for food and water to keep rivals away, and they also fight to avoid dangerous situations with enemies such as wolves [25, 55]. The horses' defense mechanism is the other behavior used in HOA and is defined as running away from horses that exhibit non-optimal responses. Equations (11) and (12) describe the defense mechanism [35]:
$$\vec{D}_{m}^{{\text{Iter,AGE}}} = - d_{m}^{{\text{Iter,AGE}}} \left[ {\left( {\frac{1}{qN}\mathop \sum \limits_{j = 1}^{qN} {\check{X}}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right], \quad {\text{AGE}} = \alpha ,\beta \;{\text{and}}\;\gamma$$
(11)
$$d_{m}^{{\text{Iter,AGE}}} = d_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{d}$$
(12)
In the above equations, \(\vec{D}_{m}^{{\text{Iter,AGE}}}\) indicates "the escape vector of ith horse from the average of some horses with worst locations, which are shown by the \({\check{X}}\) vector". The number of horses with the worst locations is qN, and the value of q is recommended to be set to 20% of the total number of horses. \(\omega_{d}\) is the per-cycle reduction factor for \(d_{m}^{{\text{Iter,AGE}}}\).
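The defense term, Eqs. (11) and (12), is a push away from the average of the worst q·N horses; note the leading minus sign in Eq. (11). A sketch with assumed names; q = 20% follows the recommendation in the text.

```python
import numpy as np

def defense_term(x_prev, positions, costs, d_prev, q=0.20, omega_d=0.95):
    """Escape from the average of the worst q*N horses.

    Returns the defense vector D (Eq. 11, note the minus sign) and
    the reduced coefficient d (Eq. 12)."""
    d = d_prev * omega_d                      # Eq. (12)
    n_worst = max(1, round(q * len(costs)))   # qN, at least one horse
    worst = positions[np.argsort(costs)[-n_worst:]]
    return -d * (worst.mean(axis=0) - x_prev), d   # Eq. (11)
```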
Roaming: The last behavior of horses that HOA simulates is their roaming habit. In pursuit of food, horses in nature roam and graze from one pasture to another if they are not kept in stables. A horse may abruptly change its grazing site. Horses are incredibly curious, as they frequently visit different pastures and get to know their surroundings [55]. The Roaming behavior is considered as a random movement of a horse in the herd and can be described by Eqs. (13) and (14) [35]:
$$\vec{R}_{m}^{{\text{Iter,AGE}}} = r_{m}^{{\text{Iter,AGE}}} pX^{{({\text{Iter}} - 1)}} ,\quad {\text{AGE}} = \gamma \;{\text{and}}\;\delta$$
(13)
$$r_{m}^{{\text{Iter,AGE}}} = r_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{r}$$
(14)
\(\vec{R}_{m}^{{\text{Iter,AGE}}}\) is “the random velocity vector of ith horse for a local search and an escape from local minima”. The reduction factor of \(r_{ m}^{{\text{Iter,AGE}}}\) per cycle is represented by \(\omega_{r}\).
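The roaming term, Eqs. (13) and (14), is read here as a random perturbation of the previous position; treating the factor p in Eq. (13) as a fresh uniform random number in [0, 1] is an assumption of this sketch, as is the default reduction factor.

```python
import numpy as np

def roaming_term(x_prev, r_prev, omega_r=0.95, rng=None):
    """Random local-search move for escaping local minima.

    Returns the roaming vector R (Eq. 13) and the reduced
    coefficient r (Eq. 14)."""
    rng = np.random.default_rng() if rng is None else rng
    r = r_prev * omega_r                      # Eq. (14)
    p = rng.uniform(0.0, 1.0)                 # assumed random factor
    return r * p * x_prev, r                  # Eq. (13)
```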
The horses’ general velocity can be calculated by substituting Eqs. (3)–(14) in Eq. (2). The velocity of horses at different ages (δ, γ, β, and α, respectively) are obtained according to Eqs. (15)–(18).
$$\vec{V}_{m}^{{{\text{Iter}},\delta }} = \left[ {g_{m}^{{({\text{Iter}} - 1),\delta }} \omega_{g} \left( {{\check{u}} + \rho {\check{l}} } \right) + [X_{m}^{{({\text{Iter}} - 1)}} ]} \right] + \left[ {i_{m}^{{({\text{Iter}} - 1),\delta }} \omega_{i} \left[ {\left( {\frac{1}{pN}\mathop \sum \limits_{j = 1}^{pN} \hat{X}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right]} \right] + \left[ {r_{m}^{{({\text{Iter}} - 1),\delta }} \omega_{r} pX^{{({\text{Iter}} - 1)}} } \right]$$
(15)
where \(\vec{V}_{m}^{{{\text{Iter}},\delta }}\) is the δ horses’ velocity (horses at the age of 0–5).
$$\begin{aligned} \vec{V}_{m}^{{{\text{Iter}},\gamma }} & = \left[ {g_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{g} \left( {{\check{u}} + \rho {\check{l}} } \right) + [X_{m}^{{({\text{Iter}} - 1)}} ]} \right] + \left[ {h_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{h} \left[ {X_{*}^{{({\text{Iter}} - 1)}} - X_{m}^{{({\text{Iter}} - 1)}} } \right]} \right] \\ & \quad + \left[ {s_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{s} \left[ {\left( {\frac{1}{N}\mathop \sum \limits_{j = 1}^{N} X_{j}^{{({\text{Iter}} - 1)}} } \right) - X_{m}^{{({\text{Iter}} - 1)}} } \right]} \right] + \left[ {i_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{i} \left[ {\left( {\frac{1}{pN}\mathop \sum \limits_{j = 1}^{pN} \hat{X}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right]} \right] \\ & \quad - \left[ {d_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{d} \left[ {\left( {\frac{1}{qN}\mathop \sum \limits_{j = 1}^{qN} {\check{X}}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right]} \right] + \left[ {r_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{r} pX^{{({\text{Iter}} - 1)}} } \right] \\ \end{aligned}$$
(16)
where \(\vec{V}_{m}^{{{\text{Iter}},\gamma }}\) is the γ horses' velocity (horses at the age of 5–10).
$$\begin{aligned} \vec{V}_{m}^{{{\text{Iter}},\beta }} & = \left[ {g_{m}^{{({\text{Iter}} - 1),\beta }} \omega_{g} \left( {{\check{u}} + \rho {\check{l}} } \right) + [X_{m}^{{({\text{Iter}} - 1)}} ]} \right] + \left[ {h_{m}^{{({\text{Iter}} - 1),\beta }} \omega_{h} \left[ {X_{*}^{{({\text{Iter}} - 1)}} - X_{m}^{{({\text{Iter}} - 1)}} } \right]} \right] \\ & \quad + \left[ {s_{m}^{{({\text{Iter}} - 1),\beta }} \omega_{s} \left[ {\left( {\frac{1}{N}\mathop \sum \limits_{j = 1}^{N} X_{j}^{{({\text{Iter}} - 1)}} } \right) - X_{m}^{{({\text{Iter}} - 1)}} } \right]} \right] \\ & \quad - \left[ {d_{m}^{{({\text{Iter}} - 1),\beta }} \omega_{d} \left[ {\left( {\frac{1}{qN}\mathop \sum \limits_{j = 1}^{qN} {\check{X}}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right]} \right] \\ \end{aligned}$$
(17)
where \(\vec{V}_{m}^{{{\text{Iter}},\beta }}\) is the β horses’ velocity (horses at the age between 10 and 15 years).
$$\vec{V}_{m}^{{{\text{Iter}},\alpha }} = \left[ {g_{m}^{{({\text{Iter}} - 1),\alpha }} \omega_{g} \left( {{\check{u}} + \rho {\check{l}} } \right) + [X_{m}^{{({\text{Iter}} - 1)}} ]} \right] - \left[ {d_{m}^{{({\text{Iter}} - 1),\alpha }} \omega_{d} \left[ {\left( {\frac{1}{qN}\mathop \sum \limits_{j = 1}^{qN} {\check{X}}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right]} \right]$$
(18)
where \(\vec{V}_{m}^{{{\text{Iter}},\alpha }}\) is the α horses’ velocity (horses older than 15).
The findings validated HOA's capacity to cope with difficult situations involving a large number of unknown variables in high-dimensional domains. Adult α horses start a local search around the global optimum with extremely high precision. The β horses look for other near situations around the adult α horses, intending to approach them; nevertheless, the γ horses have less interest in approaching the α horses. They show a strong drive to explore new regions and discover new global optimum spots. Because of their specific behavioral features, young δ horses are excellent candidates for the random search phase.

4 Proposed approach

In this study, the metaheuristic HOA is modified first, and then the modified version of HOA is used in feature selection for detecting spam emails. First, the continuous HOA is changed to a binary algorithm so it can be used for feature selection, which is a discrete problem. The inputs of the resulting algorithm are then made opposition-based. Next, the binary opposition-based HOA is upgraded to a multiobjective algorithm in order to solve multiobjective problems. Finally, the multiobjective opposition-based binary HOA (MOBHOA) is applied to spam detection.
Users usually receive spam from anonymous senders with strange email addresses. This certainly does not mean that every email sent by an anonymous sender is spam. Therefore, it is necessary to use appropriate methods to detect and separate spam emails from legitimate emails that contain important information. In the proposed method, every email received from the server goes through a series of steps to be classified as spam or genuine. The first step after receiving an email from the server is feature extraction, in which a series of general or specific features are extracted from the email body. The next phase is feature selection, which identifies relevant features and removes irrelevant and duplicate ones. The final step is classification, which labels each email as spam or genuine. The overall structure of this method is depicted in Fig. 1, which shows the flowchart of the new approach and how it operates for detecting spam emails. The next sections provide further details of each step in modifying the HOA.
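The final two steps above can be illustrated with a minimal sketch: a binary mask (as the optimizer would produce) drops unselected feature columns, and a small KNN classifier votes on the label. This is a generic sketch under our own naming, not the study's exact implementation.

```python
import numpy as np

def apply_mask(X, mask):
    """Keep only the feature columns selected by a 0/1 mask."""
    return X[:, np.asarray(mask, dtype=bool)]

def knn_classify(X_train, y_train, x, k=3):
    """Classify sample x by majority vote among its k nearest
    training samples (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    votes = np.bincount(nearest)              # count labels 0 (genuine) / 1 (spam)
    return int(np.argmax(votes))
```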
Fig. 1
Overall structure and steps of the proposed approach

4.1 Binary HOA

The optimization process in a binary search space differs significantly from that in a continuous search space. Horse search agents can update their positions in a continuous search space by adding a step length to their position vector. In a binary search space, however, a position cannot be updated by adding a step length, because each element of the position vector can only take the value 0 or 1. Therefore, we needed to develop a binary version of the HOA for feature selection, which is a discrete problem.
Developing the binary version of the HOA is straightforward. We only need to set the variables' minimum and maximum values between zero and one and then run the algorithm. Just before the values are passed to the cost function, they are processed with the greatest integer function, which rounds them to a vector of zeros and ones. The variables remain continuous throughout; they become binary via the greatest integer function only on entering the cost function. In other words, the algorithm treats the problem as continuous, while the cost function treats it as discrete, and the greatest integer function acts as the translation layer between the discrete (binary) cost function and the continuous algorithm. This is performed by applying the greatest integer function in Eq. (19), where x represents a real value between two consecutive integers m and n, and k is the integer resulting from applying the greatest integer function to x. This strategy allows a continuous algorithm to be used for discrete problems.
$$k = \left\lfloor x \right\rfloor$$
(19)
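The bridging described above can be sketched in a few lines of Python (the study's implementation was in MATLAB; this is only an illustration). Note that for the floor in Eq. (19) to yield both 0 and 1, the continuous bounds must extend beyond 1; the sketch assumes bounds of [0, 2), and the `to_binary` helper name is ours, not the paper's.

```python
import math
import random

def greatest_integer(x):
    """Eq. (19): k = floor(x), the greatest integer not exceeding x."""
    return math.floor(x)

def to_binary(position):
    """Binarize a continuous position vector just before it enters the
    discrete cost function; the algorithm itself keeps working on the
    continuous values.  Bounds of [0, 2) are assumed here so that the
    floor can produce both 0 and 1."""
    return [greatest_integer(x) for x in position]

random.seed(0)
position = [random.uniform(0, 2) for _ in range(6)]  # continuous position
bits = to_binary(position)                           # binary view for the cost function
assert all(b in (0, 1) for b in bits)
```

The cost function would then operate on `bits` while the search operators continue to move `position` through the continuous space.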

4.2 Opposition-based binary HOA

By exploring opposite solutions, opposition-based learning increases the chance of starting with a better initial population [48]. This approach can be applied not only to the initial solutions but also, continuously, to any solution in the current population. Generally, the opposition-based learning method is employed in metaheuristic approaches to improve convergence and to keep the growth of their time complexity in check. The strategy makes the metaheuristic method also examine solutions in the direction opposite to the current solution and then choose the better of the two, the current solution or its opposite. This accelerates convergence and brings the search closer to the optimal solution [48]. A sample application of opposition-based learning was discussed in the study by Ibrahim et al. [22].
Starting from a suitable initial population in evolutionary algorithms is an essential and challenging task, as the starting point affects the algorithm's convergence speed and the quality of the final solution [48]. In an opposition-based algorithm, to determine the members of the initial population, an upper and a lower limit is first defined for each of the genes that make up the population members. The genes are then initialized randomly between the two limits. To use opposite numbers during population initialization, the opposite of each member is computed according to Eq. (20): assuming that X is the position of the horse between a and b, the opposite \(\overline{X}\) is defined by Eq. (20). If the cost function of the opposite point is lower than that of the original point, the point is substituted; otherwise, the original is kept. The gene and its opposite are thus evaluated simultaneously, and the search proceeds with the more suitable one.
$$\overline{X} = a + b - X$$
(20)
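A minimal Python sketch of opposition-based initialization with Eq. (20) follows; the sphere function stands in for the real cost function, and all names here are illustrative assumptions rather than the study's code.

```python
import random

def opposite(x, a, b):
    """Eq. (20): the opposite of position x within the interval [a, b]."""
    return a + b - x

def opposition_init(pop_size, dim, a, b, cost):
    """Generate a random member and its gene-wise opposite, then keep
    whichever of the pair has the lower cost (minimization)."""
    population = []
    for _ in range(pop_size):
        x = [random.uniform(a, b) for _ in range(dim)]
        x_opp = [opposite(g, a, b) for g in x]
        population.append(min(x, x_opp, key=cost))
    return population

# Hypothetical cost: the sphere function (to be minimized)
sphere = lambda v: sum(g * g for g in v)
random.seed(0)
pop = opposition_init(pop_size=10, dim=5, a=-1.0, b=1.0, cost=sphere)
```

Each pair is evaluated simultaneously, as described above, so the population starts no worse than a purely random one at the cost of one extra evaluation per member.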

4.3 Multiobjective opposition-based binary HOA

Models used to optimize problems that only have one objective function are known as single-objective models. In a single-objective problem, we attempt to find the best solution among available solutions. In practice, there is more than one objective function in many designing and engineering problems. These problems are known as multiobjective optimization problems. In many cases, the objective functions defined in multiobjective optimization problems are in conflict with each other [9]. That means the objectives are not compatible [37].
Spam detection is a multiobjective problem. The objectives pursued in this problem are the number of features and the classification accuracy, in which the quantity of features should be minimum, whereas the classification accuracy should be maximum. Higher classification accuracy means that most emails are categorized into the correct category after the classification is completed, and the error rate of the classification is minimal. Furthermore, because the classification is reliant on the selected features by the modified HOA metaheuristic algorithm, the number of features should be kept as minimal as feasible to prevent complexity. Since more than one objective function must be investigated, it is necessary to use a multiobjective optimization method. The essential aspect of such approaches is that they provide engineers and system designers with more than one solution. These solutions demonstrate the balance between the various objective functions [24]. A multiobjective optimization problem can be expressed mathematically as a minimization problem using Eq. (21) [60]:
$$\begin{aligned} & {\text{Minimize:}}\;f_{m} (x), \quad m = 1,2, \ldots ,M \\ & {\text{Subject }}\;{\text{to:}}\; g_{j} (x) \ge 0, \quad j = 1,2, \ldots ,J \\ & h_{k} (x) = 0,\quad k = 1,2, \ldots ,K \\ & L_{i} \le x_{i} \le U_{i} ,\quad i = 1,2, \ldots ,n \\ \end{aligned}$$
(21)
In Eq. (21), M represents the number of objectives, J the number of inequality constraints, K the number of equality constraints, and [Li, Ui] the boundaries of the ith variable. The solutions of a multiobjective problem cannot be compared with arithmetic relational operators; instead, the Pareto dominance concept is used to compare two solutions in a multiobjective search space [60].
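The Pareto dominance test for the minimization form of Eq. (21) can be written as a small predicate; this is the standard definition, sketched here in Python for illustration.

```python
def dominates(f_a, f_b):
    """True if objective vector f_a Pareto-dominates f_b (minimization):
    f_a is no worse in every objective and strictly better in at least one."""
    no_worse = all(a <= b for a, b in zip(f_a, f_b))
    strictly_better = any(a < b for a, b in zip(f_a, f_b))
    return no_worse and strictly_better

assert dominates([1.0, 2.0], [1.5, 2.0])      # better in one, equal in the other
assert not dominates([1.0, 3.0], [1.5, 2.0])  # a trade-off: neither dominates
```

Two solutions on a trade-off, like the second pair above, are incomparable, which is why multiobjective methods keep an archive of non-dominated solutions rather than a single best.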
To date, several single-objective metaheuristic methods have been converted to multiobjective versions [58]. This section explains how we converted the single-objective HOA to a multiobjective HOA. The multiobjective HOA employs a general objective function with a weight vector, given in Eq. (22), to find the relationship between horses in a multiobjective search space. In this equation, the M objectives of each horse are combined into a single objective.
$$F(x_{i} ) = \frac{1}{M}\mathop \sum \limits_{j = 1}^{M} f_{j} (x_{i} )$$
(22)
The main difference between the single-objective and multiobjective HOA lies in how they update the objective. In a single-objective search space, the objective can be selected simply by choosing the best solution obtained so far. In multiobjective HOA, however, the objective must be selected from a set of optimal solutions: the optimal solutions are stored, and the ultimate objective is one of them. The challenge is to choose an objective that improves the distribution of the stored solutions. To this end, the number of neighboring solutions in each existing solution's neighborhood is first calculated [41]; this is similar to the MOPSO approach in the study by Zouache et al. [60]. The number of neighboring solutions is then taken as a quantitative measure of how crowded each area is. Equation (23) gives the probability of choosing a particular solution as the objective.
$$p_{i} = \frac{1}{N_{i}}$$
(23)
In Eq. (23), Ni denotes the number of solutions in the neighborhood of the ith solution. With these probabilities, a roulette-wheel method is used to choose the objective, which improves the coverage of the search space's less densely populated areas. A further benefit is that, in the event of premature convergence, solutions with a crowded neighborhood may be chosen as the objective to resolve the problem [59]. The archive's storage space is limited: to lower the computational cost of the multiobjective HOA, only a small number of solutions should be kept in the archive, and the archive must be updated frequently. When comparing an out-of-archive solution with the in-archive solutions, several cases arise, and the multiobjective HOA must handle all of them to maintain the archive. The simplest case is when at least one archive member dominates the external solution; in that case, the external solution is discarded immediately. Another case is when the new solution is dominated by no archive member and dominates none; since the archive stores the non-dominated solutions achieved so far, such a solution is added to the archive. Finally, if the new solution dominates one or more archive members, it replaces them.
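The archive-update cases and the crowding-based selection of Eq. (23) can be sketched together in Python. This is an illustrative reconstruction, not the study's code: the neighborhood `radius`, the archive size, and the Chebyshev crowding measure are our assumptions.

```python
import random

def dominates(f_a, f_b):
    """Standard Pareto dominance for minimization."""
    return (all(a <= b for a, b in zip(f_a, f_b))
            and any(a < b for a, b in zip(f_a, f_b)))

def update_archive(archive, candidate, max_size=50):
    """Apply the cases described above: discard the candidate if any archive
    member dominates it; otherwise remove every member it dominates and add it."""
    if any(dominates(m, candidate) for m in archive):
        return archive                       # dominated: discard immediately
    archive = [m for m in archive if not dominates(candidate, m)]
    archive.append(candidate)
    return archive[:max_size]                # keep the archive small

def select_leader(archive, radius=0.5):
    """Roulette-wheel selection with p_i = 1/N_i (Eq. 23), where N_i counts
    the archive members within `radius` of solution i (plus the solution
    itself, to avoid division by zero -- an assumption of this sketch)."""
    def n_neighbours(i):
        return 1 + sum(1 for j, m in enumerate(archive) if j != i
                       and max(abs(a - b) for a, b in zip(archive[i], m)) < radius)
    weights = [1.0 / n_neighbours(i) for i in range(len(archive))]
    return random.choices(archive, weights=weights, k=1)[0]

archive = []
for f in [[1.0, 4.0], [2.0, 2.0], [4.0, 1.0], [3.0, 3.0]]:
    archive = update_archive(archive, f)
# [3.0, 3.0] is dominated by [2.0, 2.0] and is discarded; three solutions remain
```

Solutions in sparse regions get larger weights, so the roulette wheel pulls the search toward poorly covered parts of the Pareto front.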
In spam detection, feature selection is considered a multiobjective optimization problem with two opposing objectives: (1) a minimum number of selected features and (2) a maximum classification accuracy. Therefore, to define the feature selection objective function, a classification algorithm is required [19, 20]. Because most studies in the literature have employed the KNN classification algorithm, the same classifier is used to define the feature selection objective function in the current study; the opposition-based binary HOA was converted to a multiobjective version and then applied to the spam detection problem.
Equation (24) is applied as a multiobjective function for selecting features. This equation balances between two opposing objectives so that a near-optimal solution is chosen.
A smaller number of features contributes to a more optimal solution, yet reducing the number of features can sometimes raise the classification error rate. Likewise, a smaller classification error makes the solution more optimal, but the number of features may have to be increased to achieve it. In other words, fewer features do not always optimize the solution, and dropping below a certain number of features may reduce classification accuracy; conversely, a lower classification error rate does not always optimize the solution and may require more features to be selected. There is a threshold for each, and this threshold differs from problem to problem. A balance must therefore be struck between the two, and Eq. (24) establishes this balance.
$${\text{Fitness}} = \alpha \gamma_{R} (D) + \beta \frac{\left| R \right|}{{\left| N \right|}}$$
(24)
In Eq. (24), \(\gamma_{R} (D)\) indicates the classifier's error rate on the selected subset R, \(\left| R \right|\) denotes the cardinality of the selected subset, and \(\left| N \right|\) is the total number of features in the data set. α and β weight the classification quality and the subset length, respectively. The α and β values have been adapted from Emary, Zawbaa and Hassanien [14], where α ∈ [0, 1] and β = 1 − α. The initial value of α in this study is set to 0.99; thus, β is 0.01. KNN helps to evaluate the features selected by the suggested method and other similar methods accurately, and it serves as a benchmark for all algorithms [2, 3, 27, 30–32, 46].
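Equation (24) can be sketched as a wrapper fitness in Python. The study used a KNN classifier; a hand-rolled 1-NN stands in for it here so the sketch stays self-contained, and the toy data and helper names are our assumptions.

```python
def knn_error_rate(train_X, train_y, test_X, test_y, mask):
    """Error rate of a simple 1-NN classifier restricted to the features
    selected by the 0/1 mask (a stand-in for the study's KNN wrapper)."""
    def dist(a, b):
        # squared Euclidean distance over the selected features only
        return sum((ai - bi) ** 2 for ai, bi, m in zip(a, b, mask) if m)
    errors = 0
    for x, y in zip(test_X, test_y):
        nearest = min(range(len(train_X)), key=lambda i: dist(train_X[i], x))
        errors += train_y[nearest] != y
    return errors / len(test_y)

def fitness(mask, train_X, train_y, test_X, test_y, alpha=0.99):
    """Eq. (24): alpha * error rate + beta * |R| / |N|, with beta = 1 - alpha."""
    beta = 1.0 - alpha
    error = knn_error_rate(train_X, train_y, test_X, test_y, mask)
    return alpha * error + beta * sum(mask) / len(mask)

# Toy data: two well-separated classes, so the 1-NN error is zero and
# the fitness reduces to the feature-count penalty beta * |R|/|N| = 0.01
X = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.2], [0.9, 0.8]]
y = [0, 1, 0, 1]
score = fitness([1, 1], X[:2], y[:2], X[2:], y[2:])
```

With α = 0.99, the error term dominates and the subset-size term acts as a small tie-breaker, exactly the balance described above.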

4.4 Spam detection using multiobjective opposition-based binary HOA

The current study employed a data set on which preliminary processing was performed, and a set of features was extracted. MOBHOA selects several extracted features that distinguish spam emails from genuine emails. This is accomplished through the use of HOA's natural processes, which are discussed as follows.
Feature selection is a four-step process that includes the generation of feature subsets, the evaluation of subsets, the checking of termination criteria, and the validation of the results [26]. First, a feature subset is generated from the data set; candidate features are searched based on the search strategy of MOBHOA. Candidate subsets are then evaluated and compared with the best value of the evaluation measure found so far, and if a better subset is produced, it replaces the previous best. This generation and evaluation of subsets is iterated until the termination criterion of MOBHOA is reached; MOBHOA is repeated several times before achieving the best global solution. After each cycle, the fitness function calculates the classifier's accuracy for the candidate subset. Candidate generation, fitness calculation, and evaluation continue until the final criteria are met. In general, the termination criteria are defined on the basis of two factors: the error rate and the total number of iterations. The algorithm stops if the error rate falls below a certain threshold or if the specified number of iterations is exceeded [26].
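The four-step wrapper process above can be sketched as a generic loop. The `generate_candidate` and `evaluate` stand-ins below are illustrative assumptions; in the actual method they would be MOBHOA's search operators and the KNN-based fitness.

```python
import random

def wrapper_feature_selection(n_features, evaluate, generate_candidate,
                              max_iters=100, error_threshold=0.05):
    """Generic wrapper loop: generate a candidate subset, evaluate it, keep
    it if it improves on the best so far, and stop when either termination
    criterion (error threshold or iteration budget) is reached."""
    best_mask, best_score = None, float("inf")
    for _ in range(max_iters):
        mask = generate_candidate(n_features, best_mask)
        score = evaluate(mask)                     # lower is better
        if score < best_score:
            best_mask, best_score = mask, score
        if best_score < error_threshold:
            break                                  # error criterion met
    return best_mask, best_score

# Illustrative stand-ins (not the actual MOBHOA operators):
random.seed(1)
gen = lambda n, _best: [random.randint(0, 1) for _ in range(n)]
ev = lambda mask: sum(mask) / len(mask)            # toy score favoring few features
mask, score = wrapper_feature_selection(8, ev, gen)
```

The real search strategy differs in how candidates are generated, but the generate-evaluate-compare-terminate skeleton is the one described in the text.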
As stated earlier, this study attempted to propose a new optimization approach for feature selection in detecting spam emails using MOBHOA. Figure 2 illustrates the flowchart related to the proposed approach.
Fig. 2
The proposed algorithm's flowchart

5 Simulation and evaluation

The new algorithm was implemented and simulated in the MATLAB R2014a environment on a PC with a 64-bit i5 CPU and 4 GB of memory. For the simulation, the 'Spam Base' data set from the UCI data repository was used to evaluate the algorithm's performance in detecting spam; 20% of the data was allocated for training and 80% for testing. The data set includes 4601 emails, of which 1813 (39.4%) are spam and 2788 (60.6%) are non-spam. Every record contains fifty-eight features, the last of which indicates whether the email is spam (1) or genuine (0). The first forty-eight features indicate the frequency of specific keywords, that is, the percentage of words or phrases in the email matching a specific word or phrase. The next six features indicate the frequency of specific characters, and the remaining three features describe runs of capital letters in the email. In Liu et al. [28], this data set was recommended as one of the most valid and suitable data sets for spam.
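The split described above can be sketched in Python. The file name `spambase.data` and the loading comment reflect the UCI distribution of the data set but are assumptions of this sketch; synthetic records are used here so the example is self-contained.

```python
import random

def train_test_split(rows, train_fraction=0.2, seed=42):
    """Shuffle and split records; the study allocated 20% of the data
    for training and 80% for testing."""
    rows = rows[:]                       # avoid mutating the caller's list
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# The real records would be read from the UCI file (e.g. 'spambase.data',
# 58 comma-separated values per row, the last being the spam/genuine label);
# integer stand-ins are used here instead.
rows = list(range(100))
train, test = train_test_split(rows)
```

With 4601 records this yields 920 training and 3681 test emails.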
The remainder of this section discusses the classification accuracy of the proposed method in detecting spam compared with GWO and KNN. The simulation results of GWO, KNN, and MOBHOA in terms of classification accuracy over different numbers of iterations are presented in Table 2 and Fig. 3. In this simulation, the number of iterations ranged from 1 to 100, and the population size was set to 20.
Table 2
Comparison of GWO, KNN, and MOBHOA in terms of classification accuracy (%)

Iterations   10   20   30   40   50   60   70   80   90   100
GWO          84   86   87   87   88   89   91   92   92   93
KNN          77   78   78   79   81   82   87   89   88   90
MOBHOA       88   89   92   92   93   92   92   93   94   96
Fig. 3
Performance comparison of GWO, KNN, and MOBHOA in terms of the spam detection's accuracy
According to Table 2 and Fig. 3, MOBHOA obtained much better results than the GWO and KNN algorithms in detecting spam as the number of iterations increased. The performance of MOBHOA was similar to that of the other two algorithms in the first iterations, but with more iterations it clearly outperformed GWO and KNN. This was due to the application of the opposition-based approach, which develops solutions in the opposite region of the search space.
In the next evaluation, MOBHOA was compared with K-Nearest Neighbours-Grey Wolf Optimisation (KNN-GWO), KNN, Multilayer Perceptron (MLP), Naive Bayes (NB), and Support Vector Machine (SVM) classifiers with regard to accuracy, sensitivity, and precision in detecting spam emails. Table 3 and Fig. 4 show the evaluation results.
Table 3
Comparison of MOBHOA with other classifiers in terms of accuracy, precision, and sensitivity

Algorithm     KNN       NB        SVM       MLP       KNN-GWO   KNN-MOBHOA
Accuracy      0.76572   0.62571   0.81870   0.72739   0.91942   0.94516
Precision     0.83493   0.94331   0.77585   0.61739   0.89732   0.94120
Sensitivity   0.84384   0.34687   0.97805   0.98658   0.93211   0.96324
Fig. 4
Comparison of MOBHOA with KNN and KNN-GWO classifiers in accuracy, sensitivity, and precision
As mentioned earlier, detecting spam emails is carried out in two steps: the first is feature selection and the second is classification. The results in Table 3 reflect this division: in the MOBHOA-KNN method, the feature selection step is carried out with MOBHOA and the classification step with KNN; likewise, in the KNN-GWO method, feature selection is carried out with GWO and classification with KNN. The KNN, MLP, SVM, and NB methods, in contrast, classify the data without a feature selection step.
The proposed approach improves accuracy, precision, and sensitivity and also reduces runtime, because optimal feature selection eliminates redundant or insignificant features so that operations are performed only on significant ones. As a result, the algorithm's execution time decreases while accuracy, precision, and sensitivity increase. In this experiment, the KNN classifier was kept constant across all feature selection methods; in the second run, KNN was combined with the proposed method for optimal feature selection, and the results are shown in Table 3.
As shown in Fig. 4, the evaluation results indicate that MOBHOA improved on KNN and KNN-GWO with respect to accuracy, sensitivity, and precision; relative to the weakest baseline, it increased accuracy by around 50%.
The results show that the multiobjective opposition-based binary horse herd optimizer running on the UCI data set has been more successful in the average size of selection and the accuracy of classification compared with some other standard metaheuristic methods. According to the results, the proposed algorithm is substantially more accurate in detecting spam emails in the data set than other similar algorithms. This is due to the application of HOA, which is a highly efficient optimization algorithm and has an outstanding performance in solving high-dimensional problems. The other reason is implementing the feature selection phase besides the classification phase. Feature selection decreases the computational complexity and increases classification accuracy by removing unnecessary features.
Machine learning-based techniques are one of the most efficient ways to solve a variety of problems. However, most machine learning algorithms have the problem of computational complexity. There is a need to employ more advanced techniques and algorithms that can improve the accuracy and decrease the complexity and error rate of the spam detection problem; therefore, we used the horse herd optimization algorithm to further improve the computation speed and accuracy. New advances in deep learning demonstrate that they can still be utilized for solving spam detection problems. A limited number of studies in the literature have examined the performance of deep learning algorithms for spam detection. In addition, the majority of the used datasets are either small in size or artificially developed. Thus, future studies are expected to consider big data solutions, large datasets, and deep learning algorithms to develop more efficient techniques for detecting spam. Furthermore, the focus of this study was specifically on email spam detection, and spam detection in other platforms, such as social networking spam and so on was not examined in the current study. Future studies may focus on using this approach for spam detection on other platforms.

6 Conclusion

Unwanted emails, or spam, have become a problem for Internet users and data centers, as these emails waste a large amount of storage and other resources. Moreover, they provide a basis for intrusion and cyber-attacks as well as unauthorized access to user information. Several techniques and methods exist for detecting, filtering, and classifying spam and facilitating its removal; however, most of the proposed approaches carry a rate of error, and none of the spam detection techniques, despite the optimizations performed, has been fully effective on its own. The objective of this paper was to apply a robust metaheuristic optimization algorithm to detecting spam emails in email services. For this purpose, the horse herd optimization algorithm was employed, a novel nature-inspired metaheuristic developed for solving highly complex optimization problems. The problem of detecting spam is discrete and has multiple objectives. To make HOA applicable to it, the original continuous algorithm was first binarised and then transformed into a multiobjective opposition-based algorithm to solve the feature selection problem in spam detection. The new algorithm, the multiobjective opposition-based binary horse herd optimization algorithm (MOBHOA), was implemented and simulated in MATLAB, and experiments were conducted on the Spam Base data set from the UCI data repository to evaluate its performance in detecting spam. According to the simulation results, in comparison with similar approaches such as KNN, GWO, MLP, SVM, and NB, the new approach performs better in classification as well as in accuracy, precision, and sensitivity. The findings demonstrate that the new approach outperforms similar metaheuristic solutions introduced in the literature; it could therefore be used for feature selection in spam detection systems.
Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Title: A novel approach for spam detection using horse herd optimization algorithm
Authors: Ali Hosseinalipour, Reza Ghanbarzadeh
Publication date: 29-03-2022
Publisher: Springer London
Published in: Neural Computing and Applications, Issue 15/2022
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI: https://doi.org/10.1007/s00521-022-07148-x

Appendix

The following are some characteristics of emails that are believed to be malicious:
- The email is located in the spam list
- The sender of the email is anonymous
- The sender's email address is related to free email services
- The sender's email address is entirely or even slightly different from the trusted email address
- The recipient's exact name is not mentioned in the content of the email; rather, general names are used, for example, "Dear customer" or "Dear expert"
- The email expresses a sense of urgency; for example, the sender threatens to immediately close the recipient's account if the requested action is not taken
- The email contains persuasive content while the sender is not credible, for instance, promises of money, participation in a lottery, winning the lottery, discount vouchers for famous stores or brands, or requests to help a charity or an accident survivor
- The email asks for personal information such as a username, password, or bank account details
- The email content has major spelling and grammatical errors
- The email was supposedly sent from a trusted organization, while the organization is not expected to send an email at that specific time
- The entire body of the email is an embedded photo of the content in text format
- Images in the email contain a link to a fake website
- The email contains links or attachments that are not expected; in other words, the name or format of the attachments differ from the expected name or format
- The attached files have two or more extensions for their format
In the case of suspicious emails, or after detecting spam, users [16]:
- should not click on the links in the email
- should not open email attachments in any way
- should not reply to the email or contact the sender
- should not enter any information on the opened website if they accidentally click on a link in a suspicious email
- should report suspicious emails to the body responsible for handling such emails
References

1. Abdulhamid SM, Shuaib M, Alhassan JK, Adebayo OS, Ismaila I, Osho O, Rans N (2019) Whale optimization algorithm based email spam feature selection method using rotation forest for classification. SN Appl Sci 1:1–17
2. Abualigah LM, Khader AT, Hanandeh ES (2018) A combination of objective functions and hybrid krill herd algorithm for text document clustering analysis. Eng Appl Artif Intell 73:111–125
3. Abualigah LMQ (2019) Feature selection and enhanced krill herd algorithm for text document clustering. Springer, Berlin
4. Awad W, ELseuofi S (2011) Machine learning methods for spam e-mail classification. Int J Comput Sci Inf Technol (IJCSIT) 3(1):173–184
5. Batra J, Jain R, Tikkiwal VA, Chakraborty A (2021) A comprehensive study of spam detection in e-mails using bio-inspired optimization techniques. Int J Inf Manag Data Insights 1(1):100006
6. Bibi A, Latif R, Khalid S, Ahmed W, Shabir RA, Shahryar T (2020) Spam mail scanning using machine learning algorithm. J Comput 15(2):73–84
7. Bogner F (2011) A comprehensive summary of the scientific literature on Horse Assisted Education in Germany. Van Hall Larenstein
8. Carreras X, Marquez L (2001) Boosting trees for anti-spam email filtering. arXiv preprint cs/0109015
9. Chang K-H (2014) Design theory and methods using CAD/CAE: the computer aided engineering design series. Academic Press, Cambridge
10. Chen H, Jiao S, Heidari AA, Wang M, Chen X, Zhao X (2019) An opposition-based sine cosine approach with local search for parameter estimation of photovoltaic models. Energy Convers Manag 195:927–942
11. DeBarr D, Wechsler H (2009) Spam detection using clustering, random forests, and active learning. In: Sixth conference on email and anti-spam. Mountain View, California
12. Dedeturk BK, Akay B (2020) Spam filtering using a logistic regression model trained by an artificial bee colony algorithm. Appl Soft Comput 91:106229
13. Egozi G, Verma R (2018) Phishing email detection using robust NLP techniques. In: IEEE international conference on data mining workshops (ICDMW)
14. Emary E, Zawbaa HM, Hassanien AE (2016) Binary ant lion approaches for feature selection. Neurocomputing 213:54–65
15. Faris H, Aljarah I, Al-Shboul B (2016) A hybrid approach based on particle swarm optimization and random forests for e-mail spam filtering. In: International conference on computational collective intelligence
16. Guo D, Chen C (2014) Detecting non-personal and spam users on geo-tagged Twitter network. Trans GIS 18(3):370–384
17. GuangJun L, Nazir S, Khan HU, Haq AU (2020) Spam detection approach for secure mobile message communication using machine learning algorithms. Secur Commun Netw 2020:8873639. https://doi.org/10.1155/2020/8873639
18. Harisinghaney A, Dixit A, Gupta S, Arora A (2014) Text and image based spam email classification using KNN, Naïve Bayes and Reverse DBSCAN algorithm. In: International conference on reliability optimization and information technology (ICROIT)
19. Hosseinalipour A, Gharehchopogh FS, Masdari M, Khademi A (2021) A novel binary farmland fertility algorithm for feature selection in analysis of the text psychology. Appl Intell 51:4824–4859
20. Hosseinalipour A, Gharehchopogh FS, Masdari M, Khademi A (2021) Toward text psychology analysis using social spider optimization algorithm. Concurr Comput Pract Exp 33:e6325
21. Hu H, Wang G (2018) Revisiting email spoofing attacks. arXiv preprint arXiv:1801.00853
22. Ibrahim RA, Abd Elaziz M, Oliva D, Cuevas E, Lu S (2019) An opposition-based social spider optimization for feature selection. Soft Comput 23(24):13547–13567
23. Karim A, Azam S, Shanmugam B, Kannoorpatti K, Alazab M (2019) A comprehensive survey for intelligent spam email detection. IEEE Access 7:168261–168295
24. Khanmohammadi S, Kizilkan O, Musharavati F (2021) Multiobjective optimization of a geothermal power plant. In: Thermodynamic analysis and optimization of geothermal power plants. Elsevier, pp 279–291
25. Krueger K, Heinze J (2008) Horse sense: social status of horses (Equus caballus) affects their likelihood of copying other horses' behavior. Anim Cognit 11(3):431–439
26. Kumar A, Khorwal R, Chaudhary S (2016) A survey on sentiment analysis using swarm intelligence. Indian J Sci Technol 9(39):1–7
27. Liao TW, Kuo R (2018) Five discrete symbiotic organisms search algorithms for simultaneous optimization of feature subset and neighborhood size of KNN classification models. Appl Soft Comput 64:581–595
28. Liu J, Jing H, Tang YY (2002) Multi-agent oriented constraint satisfaction. Artif Intell 136(1):101–144
29. Luo J, Chen H, Heidari AA, Xu Y, Zhang Q, Li C (2019) Multi-strategy boosted mutative whale-inspired optimization approaches. Appl Math Model 73:109–123
30. Mafarja M, Aljarah I, Heidari AA, Hammouri AI, Faris H, Ala'M A-Z, Mirjalili S (2018) Evolutionary population dynamics and grasshopper optimization approaches for feature selection problems. Knowl Based Syst 145:25–45
31. Mafarja M, Mirjalili S (2018) Whale optimization approaches for wrapper feature selection. Appl Soft Comput 62:441–453
32. Mafarja MM, Mirjalili S (2017) Hybrid whale optimization algorithm with simulated annealing for feature selection. Neurocomputing 260:302–312
33. Marinos L, Lourenço M (2019) ENISA threat landscape report 2018: 15 top cyberthreats and trends. European Union Agency for Network and Information Security (ENISA)
34. Mendez JR, Cotos-Yanez TR, Ruano-Ordas D (2019) A new semantic-based feature selection method for spam filtering. Appl Soft Comput 76:89–104
35. MiarNaeimi F, Azizyan G, Rashki M (2021) Horse herd optimization algorithm: a nature-inspired algorithm for high-dimensional optimization problems. Knowl Based Syst 213:106711
36. Mirjalili S (2015) Moth-flame optimization algorithm: a novel nature-inspired heuristic paradigm. Knowl Based Syst 89:228–249
37. Mirjalili S (2016) Dragonfly algorithm: a new meta-heuristic optimization technique for solving single-objective, discrete, and multi-objective problems. Neural Comput Appl 27(4):1053–1073
38. Mirjalili S (2016) SCA: a sine cosine algorithm for solving optimization problems. Knowl Based Syst 96:120–133
39. Mirjalili S, Mirjalili SM, Hatamlou A (2016) Multi-verse optimizer: a nature-inspired algorithm for global optimization. Neural Comput Appl 27(2):495–513
40. Mirjalili S, Mirjalili SM, Lewis A (2014) Grey wolf optimizer. Adv Eng Softw 69:46–61
41. Mirjalili SZ, Mirjalili S, Saremi S, Faris H, Aljarah I (2018) Grasshopper optimization algorithm for multi-objective optimization problems. Appl Intell 48(4):805–820
42. Mohmmadzadeh H (2020) Case study email spam detection of two metaheuristic algorithm for optimal feature selection
43. Pandey AC, Rajpoot DS (2019) Spam review detection using spiral cuckoo search clustering method. Evolut Intell 12(2):147–164
44. Pashiri RT, Rostami Y, Mahrami M (2020) Spam detection through feature selection using artificial neural network and sine–cosine algorithm. Math Sci 14(3):193–199
45. Raad M, Yeassen NM, Alam GM, Zaidan BB, Zaidan AA (2010) Impact of spam advertisement through e-mail: a study to assess the influence of the anti-spam on the e-mail marketing. Afr J Bus Manag 4(11):2362–2367
46. Rajamohana S, Umamaheswari K (2018) Hybrid approach of improved binary particle swarm optimization and shuffled frog leaping for feature selection. Comput Electr Eng 67:497–508
47. Saab SA, Mitri N, Awad M (2014) Ham or spam? A comparative study for some content-based classification algorithms for email filtering. In: MELECON 2014–2014 17th IEEE Mediterranean electrotechnical conference
48. Saremi S, Mirjalili S, Lewis A (2017) Grasshopper optimisation algorithm: theory and application. Adv Eng Softw 105:30–47
49. Shadravan S, Naji H, Bardsiri VK (2019) The Sailfish Optimizer: a novel nature-inspired metaheuristic algorithm for solving constrained engineering optimization problems. Eng Appl Artif Intell 80:20–34
50.
go back to reference Shajideen NM, Bindu V (2018) Spam filtering: a comparison between different machine learning classifiers. In: Second international conference on electronics, communication and aerospace technology (ICECA)
51.
go back to reference Sharma P, Bhardwaj U (2018) Machine learning based spam e-mail detection. Int J Intell Eng Syst 11(3):1–10
52.
go back to reference Soni AN (2019) Spam-e-mail-detection-using-advanced-deep-convolution-neuralnetwork-algorithms. J Innov Dev Pharm Tech Sci 2(5):74–80
53.
go back to reference Srinivasan S, Ravi V, Alazab M, Ketha S, Ala’M A-Z, Padannayil SK (2021) Spam emails detection based on distributed word embedding with deep learning. In: Machine intelligence and big data analytics for cybersecurity applications. Springer, pp 161–189
54.
go back to reference Wang C, Li Q, Ren TY, Wang XH, Guo GX (2021) High efficiency spam filtering: a manifold learning-based approach. In: Mathematical problems in engineering
55.
go back to reference Waring G (1983) The behavioral traits and adaptations of domestic and wild horses, including ponies. Horse Behavor
56.
go back to reference Xu Y, Chen H, Heidari AA, Luo J, Zhang Q, Zhao X, Li C (2019) An efficient chaotic mutative moth-flame-inspired optimizer for global optimization tasks. Expert Syst Appl 129:135–155
57.
go back to reference Yaseen Q (2021) Spam email detection using deep learning techniques. Procedia Comput Sci 184:853–858
58.
go back to reference Zhang Y, Gong D-W, Gao X-Z, Tian T, Sun X-Y (2020) Binary differential evolution with self-learning for multi-objective feature selection. Inf Sci 507:67–85MathSciNetMATH
59.
go back to reference Zhang Y, Wang J, Lu H (2019) Research and application of a novel combined model based on multiobjective optimization for multistep-ahead electric load forecasting. Energies 12(10):1931
60.
go back to reference Zouache D, Arby YO, Nouioua F, Abdelaziz FB (2019) Multi-objective chicken swarm optimization: a novel algorithm for solving multi-objective optimization problems. Comput Ind Eng 129:377–391
