Abstract
Detecting spam within a set of email files has become a challenging task for researchers. An effective classifier must deliver not only high detection accuracy but also a low false alarm rate, while using as few features as possible. In view of these challenges, this research examines the effects of using features selected by four feature subset selection methods (Genetic, Greedy Stepwise, Best First, and Rank Search) on popular machine learning classifiers: Bayesian, Naive Bayes, Support Vector Machine, Genetic Algorithm, J48, and Random Forest. Tests were performed on three publicly available spam email datasets: "Enron", "SpamAssassin", and "LingSpam". The results show that Greedy Stepwise Search is a good method for feature subset selection in spam email detection. Among the classifiers, Support Vector Machine was found to be the best in terms of both accuracy and false positive rate, although the results of Random Forest were very close to those of Support Vector Machine. The Genetic classifier was identified as a weak classifier.
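To make the two ideas in the abstract concrete, the following is a minimal sketch of a forward greedy stepwise search over feature subsets, scored by accuracy, together with a helper that reports accuracy and false positive rate. It is not the authors' implementation: the six-message toy dataset, the word features, and the rule-based `predict` classifier are all hypothetical and chosen only to keep the example self-contained.

```python
def accuracy_and_fpr(y_true, y_pred):
    """Return (accuracy, false positive rate); label 1 = spam, 0 = legitimate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    accuracy = (tp + tn) / len(pairs)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


def greedy_stepwise(n_features, score):
    """Forward greedy search: repeatedly add the single feature that
    improves the score the most; stop when no feature improves it."""
    selected = []
    best = score(selected)
    while len(selected) < n_features:
        best_feature, best_score = None, best
        for f in range(n_features):
            if f in selected:
                continue
            s = score(selected + [f])
            if s > best_score:  # strict improvement only
                best_feature, best_score = f, s
        if best_feature is None:
            break
        selected.append(best_feature)
        best = best_score
    return selected, best


# Hypothetical toy data: word-presence features per message.
FEATURES = ["free", "winner", "meeting", "click"]
X = [
    [1, 0, 0, 1],  # spam
    [0, 1, 0, 0],  # spam
    [1, 1, 0, 0],  # spam
    [0, 0, 1, 0],  # legitimate
    [0, 0, 1, 1],  # legitimate
    [0, 0, 0, 0],  # legitimate
]
y = [1, 1, 1, 0, 0, 0]


def predict(subset, row):
    """Toy stand-in classifier: flag as spam if any selected word is present."""
    return 1 if any(row[f] for f in subset) else 0


def subset_accuracy(subset):
    """Score a feature subset by the toy classifier's accuracy on X, y."""
    preds = [predict(subset, row) for row in X]
    return accuracy_and_fpr(y, preds)[0]


selected, best = greedy_stepwise(len(FEATURES), subset_accuracy)
print([FEATURES[f] for f in selected], best)
```

In a real study the score would come from a held-out or cross-validated evaluation of an actual classifier rather than training accuracy of a hand-written rule; the sketch only illustrates the search strategy and the two reported metrics.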