Hybrid email spam detection model with negative selection algorithm and differential evolution

https://doi.org/10.1016/j.engappai.2013.12.001

Abstract

Email spam is a growing problem that affects not only ordinary internet users but also companies and organizations. Earlier techniques have been undermined by the adaptive nature of unsolicited email spam. Inspired by adaptive algorithms, this paper introduces a modified version of the negative selection algorithm (NSA), a machine learning technique modeled on the human immune system. A local selection differential evolution (DE) generates detectors at the random detector generation phase of NSA; the resulting model is code-named NSA–DE. Local outlier factor (LOF) is implemented as the fitness function to maximize the distance of generated spam detectors from the non-spam space. The problem of overlapping detectors is also addressed by calculating the minimum and maximum distance of two overlapped detectors in the spam space. In the experiments, NSA–DE achieves a detection accuracy of 83.06%, compared with 68.86% for the standard negative selection algorithm, at 7000 generated detectors.
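For orientation, the standard NSA baseline mentioned above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes real-valued feature vectors scaled to the unit hypercube, Euclidean matching, and a single fixed detector radius. The names `generate_detectors`, `is_spam`, `radius` and `max_tries` are hypothetical.

```python
import math
import random


def generate_detectors(self_samples, n_detectors, radius, dim, rng, max_tries=100_000):
    """Random detector generation with censoring (assumed real-valued NSA).

    A random candidate is kept only if it lies farther than `radius`
    from every self (non-spam) training sample.
    """
    detectors = []
    tries = 0
    while len(detectors) < n_detectors and tries < max_tries:
        tries += 1
        candidate = [rng.random() for _ in range(dim)]
        if all(math.dist(candidate, s) > radius for s in self_samples):
            detectors.append(candidate)
    return detectors


def is_spam(sample, detectors, radius):
    """Flag a sample as spam if it falls within `radius` of any detector."""
    return any(math.dist(sample, d) <= radius for d in detectors)


# Toy usage: non-spam samples clustered near (0.2, 0.2) in a 2-D feature space.
rng = random.Random(1)
non_spam = [[0.2 + 0.05 * rng.random(), 0.2 + 0.05 * rng.random()] for _ in range(20)]
detectors = generate_detectors(non_spam, n_detectors=50, radius=0.15, dim=2, rng=rng)
print(len(detectors), is_spam([0.21, 0.22], detectors, radius=0.15))
```

Because candidates are drawn uniformly at random, coverage of the spam space depends heavily on how many detectors are generated, which is the phase the paper replaces with DE.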

Graphical abstract

Shows the overall average of the NSA and NSA-DE.


Introduction

Email is the cheapest and most important form of communication in the world today. It is effective, simple and available to all computer users. This simplicity makes it vulnerable to many threats. One of the most important threats to email is spam; virtually all email users across the world suffer from email spam (Cormack et al., 2011). The word spam describes unwanted junk mail sent to an internet user's inbox. It is very convenient for spammers to send millions of spam emails all over the world at virtually no cost (Carpinter and Hunt, 2006). As a result, it is common for internet users to receive junk mail hundreds of times daily. Different techniques have been adopted to stop the threat of spam or to drastically reduce the amount of spam reaching internet users across the world. Anti-spam laws have been enacted that legislate penalties for spammers who distribute email spam (Schryen, 2007). In addition, two general approaches have been used in email spam detection: a knowledge engineering approach and a machine learning approach (Wamli et al., 2009). In the knowledge engineering approach, the use of network information and internet protocol address techniques to determine whether a message is spam or non-spam is called origin-based filtering. Sets of rules have to be specified in order to determine which emails are categorized as spam or non-spam. Such rules could be created by the user of the filter or by some other authority, for example the software company that provides a particular rule-based spam filtering tool. This method gives promising results. However, the rules need to be maintained and updated continuously, which is time-consuming and inconvenient for most users.

Machine learning is more efficient than the knowledge engineering approach (Guzella and Caminhas, 2009) and does not require rules to be specified; instead, a set of pre-classified email messages (training samples) is used, and specific algorithms learn the classification rules from these messages. Filtering techniques are the most commonly used methods; they identify whether a message is spam or non-spam based solely on the message content and some other characteristics of the message. Despite the different approaches and techniques adopted to fight spam, the internet today still witnesses a huge amount of spam (Zhang et al., 2004; Massey et al., 2003), and adaptive techniques need more attention as a way to drastically reduce the menace, if not eliminate it.

Because the machine learning approach is widely known, several algorithms have been used for email spam detection (Guzella and Caminhas, 2009). They include artificial immune system (AIS), support vector machine (SVM), neural network (NN), Naïve Bayes (NB) and k-nearest neighbour (KNN), among others. In this paper, we propose a new approach inspired by the artificial immune system model: a negative selection algorithm (NSA) combined with differential evolution (DE), which modifies the standard negative selection algorithm in order to generate more accurate results. The engineering goals of the hybrid negative selection algorithm are threefold: first, to generate an efficient detector set; second, to limit the number of detectors generated; and third, to maximize the detector set distance as much as possible. The problems that require attention in this research work are: (i) generating detectors in the spam space; (ii) maximizing the distance between spam detectors and the non-spam space; and (iii) solving the problem of overlapping detectors in the spam space. These problems are addressed by implementing local differential evolution to generate detectors, applying local outlier factor as the fitness function to maximize the distance between the generated detectors in the spam space and the non-spam space, and calculating the minimum and maximum distance between two overlapped generated detectors as a fitness function. The performance of NSA is determined by detector generation and by how effectively it utilizes the detector coverage space of spam and non-spam. This paper is organized into six sections: Section 1 is the introduction; Section 2 discusses related work on the negative selection algorithm; Section 3 presents the proposed improved model and its constituent frameworks; Section 4 presents the empirical study, results and discussion; Section 5 discusses the experimental results; and Section 6 presents conclusions and recommendations.
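The idea of replacing random detector generation with DE can be sketched as below. This is an illustration under stated assumptions, not the paper's method: it uses the classic DE/rand/1/bin scheme rather than the paper's local selection variant, and it swaps in a simpler stand-in fitness (distance to the nearest non-spam sample) instead of LOF. The overlap test assumes hyperspherical detectors with a centre and radius, which the text above does not specify. All function and parameter names are hypothetical.

```python
import math
import random


def nearest_self_distance(point, self_samples):
    """Stand-in fitness (NOT LOF): distance to the closest non-spam sample."""
    return min(math.dist(point, s) for s in self_samples)


def de_generate_detectors(self_samples, pop_size, dim, generations, rng, f=0.5, cr=0.9):
    """Evolve candidate detectors with DE/rand/1/bin, maximizing the fitness."""
    if pop_size < 4:
        raise ValueError("DE/rand/1 needs at least 4 individuals")
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    fit = [nearest_self_distance(p, self_samples) for p in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, all different from i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for k in range(dim):
                if rng.random() < cr or k == j_rand:
                    v = pop[a][k] + f * (pop[b][k] - pop[c][k])
                    v = min(1.0, max(0.0, v))  # clamp to the unit hypercube
                else:
                    v = pop[i][k]
                trial.append(v)
            trial_fit = nearest_self_distance(trial, self_samples)
            # Greedy selection, applied in place: keep the trial if it is no worse.
            if trial_fit >= fit[i]:
                pop[i], fit[i] = trial, trial_fit
    return pop, fit


def overlaps(center_a, radius_a, center_b, radius_b):
    """Two hyperspherical detectors overlap if their centres are closer than
    the sum of their radii (an assumed detector shape)."""
    return math.dist(center_a, center_b) < radius_a + radius_b


non_spam = [[0.2, 0.2], [0.25, 0.3], [0.3, 0.2]]
detectors, fitness = de_generate_detectors(
    non_spam, pop_size=10, dim=2, generations=30, rng=random.Random(0)
)
print(max(fitness))
```

With a pure distance fitness the population tends to drift toward the corners farthest from the non-spam samples, which is one reason a density-aware measure such as LOF and an explicit overlap penalty are plausible design choices.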

Section snippets

Related work

Over the past years, the rapid expansion of computer network systems has changed the world. This expansion makes an effective computer security system essential, because attacks and criminal intent are increasingly common in computer networks (Golovko et al., 2010). The negative selection algorithm uses the immune system's capability to detect unknown antigens while not reacting to self cells. Its mechanism protects the body against self-reactive lymphocytes. Receptors are made through a pseudo-random genetic

The proposed improved model and its constituent frameworks

Hybrid systems have recently had extensive success in solving many complex real-world problems. The importance of a hybrid system is clear: an individual system has its weaknesses, and a hybrid system is meant to compensate for the weaknesses of the individual intelligent systems. A smart hybridization of negative selection algorithm and differential evolution is investigated in order to complement the parameters of each component of the system by using the

Empirical study, results and discussion

To carry out the empirical study, the Spambase dataset was acquired. The entire dataset was divided, using a stratified sampling approach, into training and testing sets in order to evaluate the performance of the negative selection algorithm and the proposed hybrid model. 70% of the dataset was used for training and constructing the proposed model, while the remaining 30% was used for testing and validating the model. For effective comparative study of testing and
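The stratified 70/30 split described in this section could be reproduced along the following lines. This is a sketch only: the function name `stratified_split`, the use of integer class labels (1 for spam, 0 for non-spam) and the toy label counts are assumptions for illustration, not details from the paper.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction, rng):
    """Split sample indices so each class keeps roughly the same
    proportion in the training and testing sets."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


# Toy labels: 40 spam (1) and 60 non-spam (0) messages.
labels = [1] * 40 + [0] * 60
train_idx, test_idx = stratified_split(labels, 0.7, random.Random(0))
print(len(train_idx), len(test_idx))
```

Splitting per class keeps the spam ratio the same in both sets, so accuracy on the test set is not distorted by an unrepresentative class mix.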

Experimental results and discussion

The performance comparison between the negative selection algorithm model and the proposed hybrid model on unseen validation data is summarized in Fig. 11, Fig. 12 and Fig. 13. The hybridized NSA–DE model outperforms the NSA model. The proposed hybrid model shows improved accuracy compared with the standard model, which performs poorly by all measures. It is clear that the hybrid model is better than the individual models due to the good forecasting scheme used in the

Conclusion and recommendations

A new hybrid model that combines negative selection algorithm (NSA) and differential evolution (DE) has been proposed and implemented. The uniqueness of this model is that DE is implemented at the random generation phase of NSA; in addition, the distance of the generated detectors was maximized and the overlapping of detectors was minimized. The detector generation phase of NSA and the detector coverage area determine how robustly and effectively the algorithm will perform. DE implementation improved detector

Acknowledgements

The Universiti Teknologi Malaysia (UTM) under IDF Scholarships, Ministry of Higher Education (MOHE) Malaysia under research Grant Vot: R.J130000.7828.4F087 and under Research University Funding Scheme Universiti Teknologi Malaysia (Q.J130000.2510.03H02) are hereby acknowledged for some of the facilities utilized during the course of this research work and for supporting the related research.

References (52)

  • A. Prakash et al. Modified immune algorithm for job selection and operation allocation problem in flexible manufacturing systems. Adv. Eng. Softw. (2008)
  • I. Yevseyeva et al. Optimising anti-spam filters with evolutionary algorithms. Expert Syst. Appl. (2013)
  • A.R. Yildiz. A new hybrid differential evolution algorithm for the selection of optimal machining parameters in milling operations. Appl. Soft Comput. (2013)
  • A.R. Yıldız. An effective hybrid immune-hill climbing optimization approach for solving design and manufacturing optimization problems in industry. J. Mater. Process. Technol. (2009)
  • A.R. Yildiz. Comparison of evolutionary-based optimization algorithms for structural design optimization. Eng. Appl. Artif. Intell. (2013)
  • A.R. Yildiz. Hybrid Taguchi-differential evolution algorithm for optimization of multi-pass turning operations. Appl. Soft Comput. (2013)
  • A.R. Yildiz. A new hybrid artificial bee colony algorithm for robust optimal design and manufacturing. Appl. Soft Comput. (2013)
  • A. Abi-Haidar et al. Adaptive spam detection inspired by a cross-regulation model of immune dynamics: a study of concept drift
  • Balthrop, J., Forrest, S., Glickman, M.R., 2002. Revisiting LISYS: parameters and normal behavior. In: Proceedings of...
  • G. Bezerra et al. An immunological filter for spam
  • Cormack, G., Lynam, T., 2007. TREC Public Spam Corpus. 〈http://plg.uwaterloo.ca/~gvcormac/treccorpus07/〉 (cited...
  • G. Cormack et al. Efficient and effective spam filtering and re-ranking for large web datasets. Inform. Retr. (2011)
  • Forrest, S., Perelson, A.S., 1994. Self nonself discrimination in...
  • Golovko, V., Bezobrazov, S., Kachurka, P., Vaitsekhovich, L., 2010. Neural network and artificial immune systems for...
  • Gonzalez, F., Gomez, J., Madhavi, K., Dipankar, D., 2003. An evolutionary approach to generate fuzzy anomaly (attack)...
  • Guangchen, R., Ying, T., 2007. Intelligent detection approaches for spam. In: Third International Conference on Natural...