Abstract
Detecting spam emails within a set of email files has become a challenging task for researchers. An effective classifier is identified not only by high detection accuracy but also by a low false alarm rate and by the need to use as few features as possible. In view of these challenges, this research examines the effects of using features selected by four feature subset selection methods (Genetic, Greedy Stepwise, Best First, and Rank Search) on popular machine learning classifiers such as Bayesian, Naive Bayes, Support Vector Machine, Genetic Algorithm, J48, and Random Forest. Tests were performed on three publicly available spam email datasets: "Enron", "SpamAssassin", and "LingSpam". The results show that Greedy Stepwise Search is a good method for feature subset selection in spam email detection. Among the machine learning classifiers, Support Vector Machine was found to be the best in terms of both accuracy and false positive rate. However, the results of Random Forest were very close to those of Support Vector Machine. The Genetic classifier was identified as a weak classifier.
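As a rough illustration of two ideas named in the abstract, the sketch below shows a generic forward greedy stepwise search over feature indices and the computation of accuracy and false positive rate from predicted labels. It is not the authors' implementation: the function names, the `score` callback, and the stopping rule (stop when no single added feature improves the score) are assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple


def accuracy_and_fpr(y_true: Sequence[int], y_pred: Sequence[int],
                     spam: int = 1) -> Tuple[float, float]:
    """Return (accuracy, false positive rate), treating `spam` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == spam and truth == spam:
            tp += 1
        elif pred == spam:
            fp += 1  # legitimate mail flagged as spam (a false alarm)
        elif truth == spam:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


def greedy_stepwise(n_features: int,
                    score: Callable[[List[int]], float]) -> Tuple[List[int], float]:
    """Forward greedy search: repeatedly add the single feature that most
    improves `score`, and stop when no addition improves it.

    `score` is a hypothetical merit function supplied by the caller, for
    example the cross-validated accuracy of a classifier on that subset.
    """
    selected: List[int] = []
    best = score(selected)
    remaining = set(range(n_features))
    while remaining:
        candidate, candidate_score = None, best
        for feature in sorted(remaining):
            s = score(selected + [feature])
            if s > candidate_score:
                candidate, candidate_score = feature, s
        if candidate is None:
            break  # no single feature improves the current subset
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best
```

In practice the `score` callback would wrap whichever classifier is being studied, so the same search routine can be reused across classifiers and datasets.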
Index Terms
- Interaction between feature subset selection techniques and machine learning classifiers for detecting unsolicited emails