ABSTRACT
Web spam detection has become one of the top challenges for the Internet search industry. Rather than relying on heuristic rules, we propose a feature re-extraction strategy to improve detection results. Starting from the spamicity predicted by a preliminary detection step, three types of features are extracted over the host-level web graph. Experiments on the WEBSPAM-UK2006 benchmark show that this strategy clearly improves web spam detection performance.
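The re-extraction idea can be illustrated with a minimal sketch. The abstract does not say what the three feature types are, so the features below (a host's own predicted spamicity and the mean predicted spamicity of its in-link and out-link neighbors) are hypothetical stand-ins, not the authors' definitions. The function name `reextract_features`, the edge-list input format, and the fallback for hosts without neighbors are likewise assumptions made for illustration.

```python
from statistics import mean


def reextract_features(edges, spamicity):
    """Derive per-host features from first-pass spamicity scores.

    edges:     iterable of (source_host, target_host) pairs (host-level graph).
    spamicity: dict mapping host -> spam probability predicted by a
               preliminary classifier (assumed to lie in [0, 1]).

    The three features returned here are illustrative assumptions only.
    """
    in_nbrs = {host: [] for host in spamicity}
    out_nbrs = {host: [] for host in spamicity}

    # Build adjacency lists, ignoring edges that touch unscored hosts.
    for src, dst in edges:
        if src in spamicity and dst in spamicity:
            out_nbrs[src].append(dst)
            in_nbrs[dst].append(src)

    def neighbor_mean(host, nbrs):
        # Assumption: a host with no neighbors falls back to its own score.
        scores = [spamicity[n] for n in nbrs[host]]
        return mean(scores) if scores else spamicity[host]

    return {
        host: {
            "own": spamicity[host],
            "in_mean": neighbor_mean(host, in_nbrs),
            "out_mean": neighbor_mean(host, out_nbrs),
        }
        for host in spamicity
    }


if __name__ == "__main__":
    # Toy host graph with made-up scores, for demonstration only.
    toy_edges = [("a", "b"), ("c", "b"), ("b", "d")]
    toy_scores = {"a": 1.0, "b": 0.5, "c": 0.5, "d": 0.25}
    for host, feats in reextract_features(toy_edges, toy_scores).items():
        print(host, feats)
```

In a pipeline of this shape, the derived features would be appended to each host's original feature vector and a second classifier trained on the result. That second stage is omitted here because the abstract gives no details about it.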
Index Terms
- Improving web spam detection with re-extracted features