Published in: Knowledge and Information Systems 3/2019

10 September 2018 | Regular Paper

A modified content-based evolutionary approach to identify unsolicited emails

Authors: Shrawan Kumar Trivedi, Shubhamoy Dey

Abstract

This computational study classifies emails as unsolicited or legitimate. A modified version of an existing genetic programming (GP) classifier, termed modified genetic programming (MGP), is implemented to build an ensemble of classifiers that identifies unsolicited emails. The proposed classifier is assessed on informative features extracted from two corpora (Enron and SpamAssassin) using the greedy stepwise feature search method. A comparative study is then performed against other popular classifiers, such as Bayesian network, naïve Bayes, decision tree, random forest (RF), support vector machine (SVM), and GP. The results are validated with 20-fold cross-validation and a paired t-test. They show that the proposed classifier performs better in terms of accuracy and false-positive detection than the other machine learning classifiers tested in this study. Using different training and testing sets of email files from the Enron corpus, ensemble-based classifiers, such as boosted SVM, boosted Bayesian, boosted naïve Bayesian, RF, and the proposed MGP classifier, are tested and compared on all metrics, including training and testing time. The findings suggest that the MGP classifier with the greedy stepwise feature search method offers an improvement over alternative methods in detecting unsolicited emails.
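
To illustrate two of the building blocks named in the abstract, the following is a minimal sketch of a greedy stepwise (forward) feature search and of the paired t statistic used to compare two classifiers on the same cross-validation folds. It is not the authors' implementation: the function names `greedy_stepwise` and `paired_t_statistic`, the stopping rule (stop when no single feature improves the score), and the toy scoring function are all assumptions made for illustration. The abstract does not state which evaluation criterion the search used.

```python
import math
from statistics import mean, stdev
from typing import Callable, FrozenSet, List, Sequence


def greedy_stepwise(n_features: int,
                    score: Callable[[FrozenSet[int]], float]) -> List[int]:
    """Forward greedy search (hypothetical helper).

    Starting from the empty set, repeatedly add the single feature that
    yields the highest score, and stop when no addition improves it.
    """
    selected: List[int] = []
    best = score(frozenset())
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        scored = [(score(frozenset(selected + [f])), f) for f in candidates]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no single feature improves the current subset
        selected.append(top_feature)
        best = top_score
    return selected


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t statistic for per-fold scores of two classifiers.

    a[i] and b[i] must come from the same fold. The statistic has
    len(a) - 1 degrees of freedom.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long sequences with at least 2 folds")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all fold differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Toy score (assumption): features 0 and 2 are informative, and every
    # selected feature costs a small penalty.
    weights = {0: 0.3, 2: 0.2, 3: 0.005}

    def toy_score(subset: FrozenSet[int]) -> float:
        return sum(weights.get(f, 0.0) for f in subset) - 0.01 * len(subset)

    print("selected features:", greedy_stepwise(4, toy_score))

    # Made-up per-fold accuracies for two classifiers on the same folds.
    acc_a = [0.90, 0.92, 0.91, 0.93]
    acc_b = [0.88, 0.91, 0.90, 0.90]
    print("paired t:", paired_t_statistic(acc_a, acc_b))
```

In practice `score` would be a cross-validated estimate computed on the training data only, and the t statistic would be compared against the t distribution with one fewer degrees of freedom than the number of folds.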

Metadata
Title
A modified content-based evolutionary approach to identify unsolicited emails
Authors
Shrawan Kumar Trivedi
Shubhamoy Dey
Publication date
10 September 2018
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 3/2019
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-018-1271-1