ABSTRACT
In this paper, we continue our investigations of "web spam": the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
- S. Adali, T. Liu and M. Magdon-Ismail. Optimal Link Bombs are Uncoordinated. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google Scholar
- E. Amitay, D. Carmel, A. Darlow, R. Lempel and A. Soffer. The Connectivity Sonar: Detecting Site Functionality by Structural Patterns. In 14th ACM Conference on Hypertext and Hypermedia, Aug. 2003. Google ScholarDigital Library
- R. Baeza-Yates, C. Castillo and V. López. PageRank Increase under Different Collusion Topologies. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google Scholar
- A. Benczúr, K. Csalogány, T. Sarlós and M. Uher. SpamRank -- Fully Automatic Link Spam Detection. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google Scholar
- L. Breiman. Bagging Predictors. In Machine Learning, Vol. 24, No. 2, pages 123--140, 1996. Google ScholarDigital Library
- U.S. Census Bureau. Quarterly Retail E-Commerce Sales -- 4th Quarter 2004. http://www.census.gov/mrts/www/data/html/04Q4.html (dated Feb. 2005, visited Sept. 2005)Google Scholar
- B. Davison. Recognizing Nepotistic Links on the Web. In AAAI-2000 Workshop on Artificial Intelligence for Web Search, July 2000.Google Scholar
- D. Fetterly, M. Manasse and M. Najork. Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages. In 7th International Workshop on the Web and Databases, June 2004. Google ScholarDigital Library
- D. Fetterly, M. Manasse and M. Najork. Detecting Phrase-Level Duplication on the World Wide Web. In 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Aug. 2005. Google ScholarDigital Library
- Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, 1995. Google ScholarDigital Library
- Z. Gyöngyi, H. Garcia-Molina and J. Pedersen. Combating Web Spam with TrustRank. In 30th International Conference on Very Large Data Bases, Aug. 2004. Google ScholarDigital Library
- Z. Gyöngyi and H. Garcia-Molina. Link Spam Alliances. In 31st International Conference on Very Large Data Bases, Aug. 2005. Google ScholarDigital Library
- Z. Gyöngyi and H. Garcia-Molina. Web Spam Taxonomy. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google Scholar
- GZIP. http://www.gzip.org/Google Scholar
- M. Henzinger, R. Motwani and C. Silverstein. Challenges in Web Search Engines. SIGIR Forum 36(2), 2002. Google ScholarDigital Library
- J. Hidalgo. Evaluating cost-sensitive Unsolicited Bulk Email categorization. In 2002 ACM Symposium on Applied Computing, Mar. 2002. Google ScholarDigital Library
- B. Jansen and A. Spink. An Analysis of Web Documents Retrieved and Viewed. In International Conference on Internet Computing, June 2003.Google Scholar
- C. Johnson. US eCommerce: 2005 To 2010. http://www.forrester.com/Research/Document/Excerpt/0,7211,37626,00.html (dated Sept. 2005, visited Sept. 2005)Google Scholar
- C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999, Cambridge, Massachusetts. Google ScholarDigital Library
- P. Metaxas and J. DeStefano. Web Spam, Propaganda and Trust. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google Scholar
- G. Mishne, D. Carmel and R. Lempel. Blocking Blog Spam with Language Model Disagreement. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google Scholar
- MSN Search. http://search.msn.com/Google Scholar
- J. Nielsen. Statistics for Traffic Referred by Search Engines and Navigation Directories to Useit. http://useit.com/about/searchreferrals.html (dated April 2004, visited Sept. 2005)Google Scholar
- L. Page, S. Brin, R. Motwani and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Technologies Project, 1998.Google Scholar
- A. Perkins. The Classification of Search Engine Spam. http://www.silverdisc.co.uk/articles/spam-classification/ (dated Sept. 2001, visited Sept. 2005)Google Scholar
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan-Kaufman, 1993. Google ScholarDigital Library
- J. R. Quinlan. Bagging, Boosting, and C4.5. In 13th National Conference on Artificial Intelligence and 8th Innovative Applications of Artificial Intelligence Conference, Vol. 1, 725--730, Aug. 1996. Google ScholarDigital Library
- M. Sahami, S. Dumais, D. Heckerman and E. Horvitz. A Bayesian Approach to Filtering Junk E-Mail. In Learning for Text Categorization: Papers from the 1998 Workshop, AAAI Technical Report WS-98-05, 1998.Google Scholar
- B. Wu and B. Davison. Identifying Link Farm Spam Pages. In 14th International World Wide Web Conference, May 2005. Google ScholarDigital Library
- B. Wu and B. Davison. Cloaking and Redirection: a preliminary study. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google Scholar
- H. Zhang, A. Goel, R. Govindan, K. Mason and B. Van Roy. Making Eigenvector-Based Systems Robust to Collusion. In 3rd International Workshop on Algorithms and Models for the Web Graph, Oct. 2004.Google ScholarCross Ref
Index Terms
- Detecting spam web pages through content analysis
Recommendations
Spam, damn spam, and statistics: using statistical analysis to locate spam web pages
WebDB '04: Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to ...
SAAD, a content based Web Spam Analyzer and Detector
Web Spam is one of the main difficulties that crawlers have to overcome and therefore one of the main problems of the WWW. There are several studies about characterising and detecting Web Spam pages. However, none of them deals with all the possible ...
Analysis and detection of web spam by means of web content
IRFC'12: Proceedings of the 5th conference on Multidisciplinary Information RetrievalWeb Spam is one of the main difficulties that crawlers have to overcome. According to Gyöngyi and Garcia-Molina it is defined as "any deliberate human action that is meant to trigger an unjustifiably favourable relevance or importance of some web pages ...
Comments