research-article

Interaction between feature subset selection techniques and machine learning classifiers for detecting unsolicited emails

Published: 01 March 2014

Abstract

Detecting spam within a set of email files has become a challenging task for researchers. An effective classifier is identified not only by high detection accuracy but also by a low false alarm rate and by the need to use as few features as possible. In view of these challenges, this research examines the effects of using features selected by four feature subset selection methods (i.e., Genetic, Greedy Stepwise, Best First, and Rank Search) on popular machine learning classifiers such as Bayesian, Naive Bayes, Support Vector Machine, Genetic Algorithm, J48, and Random Forest. Tests were performed on three publicly available spam email datasets: "Enron", "SpamAssassin", and "LingSpam". The results show that Greedy Stepwise Search is a good method for feature subset selection in spam email detection. Among the machine learning classifiers, Support Vector Machine was found to be the best in terms of both accuracy and false positive rate, although the results of Random Forest were very close to those of Support Vector Machine. The Genetic classifier was identified as a weak classifier.
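The abstract does not say how the search or the classifiers were implemented, so the following is only a minimal sketch of the general idea behind greedy stepwise (forward) feature selection, scored by accuracy, together with the false positive rate used as the second evaluation criterion. Everything in it is an assumption made for illustration: the function names are hypothetical, a toy nearest-centroid classifier stands in for the classifiers named above, the data are invented, and the score is computed on the training rows rather than on a held-out set.

```python
from typing import Callable, List, Sequence, Tuple


def accuracy_and_fpr(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, false positive rate), with 1 = spam and 0 = legitimate."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate mail flagged as spam
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return correct / len(pairs), fpr


def centroid_predict(X, y, rows, features: List[int]) -> List[int]:
    """Toy nearest-centroid classifier restricted to the chosen feature columns."""
    def centroid(label: int) -> List[float]:
        members = [r for r, t in zip(X, y) if t == label]
        return [sum(r[f] for r in members) / len(members) for f in features]

    def dist(row, c) -> float:
        return sum((row[f] - v) ** 2 for f, v in zip(features, c))

    c0, c1 = centroid(0), centroid(1)
    # Ties go to the legitimate class, which keeps false positives down.
    return [1 if dist(r, c1) < dist(r, c0) else 0 for r in rows]


def greedy_stepwise(n_features: int, score: Callable[[List[int]], float]) -> List[int]:
    """Forward selection: add the single best feature until the score stops improving."""
    selected: List[int] = []
    best = float("-inf")
    while len(selected) < n_features:
        top_f, top_score = None, best
        for f in range(n_features):
            if f in selected:
                continue
            s = score(selected + [f])
            if s > top_score:  # strict improvement required
                top_f, top_score = f, s
        if top_f is None:
            break  # no remaining feature improves the score
        selected.append(top_f)
        best = top_score
    return selected


# Invented toy data: column 0 matches the label exactly, columns 1 and 2 do not.
X = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 0], [0, 1, 0]]
y = [1, 1, 1, 0, 0, 0]


def score(features: List[int]) -> float:
    """Accuracy of the toy classifier on the training rows (illustration only)."""
    return accuracy_and_fpr(y, centroid_predict(X, y, X, features))[0]


selected = greedy_stepwise(3, score)
print(selected, accuracy_and_fpr(y, centroid_predict(X, y, X, selected)))
```

In a real experiment the score would come from cross-validation or a separate test set, and the stopping rule could also trade accuracy against the number of features retained.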

