research-article

Interaction between feature subset selection techniques and machine learning classifiers for detecting unsolicited emails

Authors:
Shrawan Kumar Trivedi

Indian Institute of Management, Prabandh Shikhar, Rau, Indore, India

Shubhamoy Dey

Indian Institute of Management, Prabandh Shikhar, Rau, Indore, India

ACM SIGAPP Applied Computing Review, Volume 14, Issue 1, March 2014, pp. 53–61. https://doi.org/10.1145/2600617.2600622

Published: 01 March 2014

Abstract

Detecting spam emails within a set of email files has become a challenging task for researchers. An effective classifier is identified not only by high detection accuracy but also by a low false alarm rate and by the need to use as few features as possible. In view of these challenges, this research examines the effects of using features selected by four feature subset selection methods (Genetic, Greedy Stepwise, Best First, and Rank Search) on popular machine learning classifiers: Bayesian, Naive Bayes, Support Vector Machine, Genetic Algorithm, J48, and Random Forest. Tests were performed on three publicly available spam email datasets: "Enron", "SpamAssassin", and "LingSpam". Results show that Greedy Stepwise Search is a good method for feature subset selection for spam email detection. Among the machine learning classifiers, Support Vector Machine was found to be the best in terms of both accuracy and false positive rate. However, the results of Random Forest were very close to those of Support Vector Machine. The Genetic classifier was identified as a weak classifier.
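To make the idea of greedy stepwise feature subset selection concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the subset evaluator, the tooling, or the data representation, so the sketch assumes a wrapper-style forward search scored by validation accuracy, binary word-presence features, and a small Bernoulli Naive Bayes classifier. The toy emails, the feature meanings, and all function names are hypothetical.

```python
from math import log

def train_nb(X, y, features):
    """Fit a Bernoulli Naive Bayes model on the chosen binary feature columns."""
    model = {}
    for c in (0, 1):  # 0 = legitimate email, 1 = spam
        rows = [x for x, label in zip(X, y) if label == c]
        prior = (len(rows) + 1) / (len(X) + 2)
        # Laplace-smoothed estimate of P(feature = 1 | class)
        probs = {f: (sum(r[f] for r in rows) + 1) / (len(rows) + 2) for f in features}
        model[c] = (prior, probs)
    return model

def predict_nb(model, x):
    """Return 1 (spam) only if the spam log-score is strictly higher."""
    scores = {}
    for c, (prior, probs) in model.items():
        score = log(prior)
        for f, p in probs.items():
            score += log(p if x[f] else 1 - p)
        scores[c] = score
    return 1 if scores[1] > scores[0] else 0

def evaluate(y_true, y_pred):
    """Return (accuracy, false positive rate), with spam as the positive class."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr

def greedy_forward_selection(X_tr, y_tr, X_val, y_val, n_features):
    """Add one feature at a time while validation accuracy keeps improving."""
    selected, best_acc = [], -1.0
    remaining = list(range(n_features))
    while remaining:
        candidates = []
        for f in remaining:
            model = train_nb(X_tr, y_tr, selected + [f])
            preds = [predict_nb(model, x) for x in X_val]
            candidates.append((evaluate(y_val, preds)[0], f))
        # Best accuracy wins; ties go to the lowest feature index.
        acc, f = max(candidates, key=lambda t: (t[0], -t[1]))
        if acc <= best_acc:
            break  # no candidate improves the current subset
        selected.append(f)
        remaining.remove(f)
        best_acc = acc
    return selected, best_acc

# Hypothetical toy data. Columns: [contains "free", contains "meeting", contains "the"]
X_train = [[1, 0, 1], [1, 0, 0], [1, 0, 1], [1, 0, 0],
           [0, 1, 1], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]
X_val = [[1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 0, 0]]
y_val = [1, 1, 0, 0]

selected, best_acc = greedy_forward_selection(X_train, y_train, X_val, y_val, 3)
final_model = train_nb(X_train, y_train, selected)
accuracy, fpr = evaluate(y_val, [predict_nb(final_model, x) for x in X_val])
print(selected, accuracy, fpr)
```

The search stops as soon as no remaining feature improves the score, which is how a greedy method keeps the subset small. Reporting the false positive rate alongside accuracy mirrors the abstract's point that a spam filter should rarely flag legitimate email.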

Index Terms

Interaction between feature subset selection techniques and machine learning classifiers for detecting unsolicited emails
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
  2. Machine learning

Published in
ACM SIGAPP Applied Computing Review Volume 14, Issue 1
March 2014
56 pages
ISSN:1559-6915
EISSN:1931-0161
DOI:10.1145/2600617
Editor:
Sung Y. Shin
South Dakota State University
Copyright © 2014 Authors
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Bayesian
F-value
J48
classification accuracy
email spam classification
false positive rate
feature selection
genetic
naive bayes
probabilistic classifiers
random forest
spam filtering
support vector machine
Qualifiers
- research-article