skip to main content
10.1145/1135777.1135794acmconferencesArticle/Chapter ViewAbstractPublication PageswwwConference Proceedingsconference-collections
Article

Detecting spam web pages through content analysis

Published:23 May 2006Publication History

ABSTRACT

In this paper, we continue our investigations of "web spam": the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).

References

  1. S. Adali, T. Liu and M. Magdon-Ismail. Optimal Link Bombs are Uncoordinated. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google ScholarGoogle Scholar
  2. E. Amitay, D. Carmel, A. Darlow, R. Lempel and A. Soffer. The Connectivity Sonar: Detecting Site Functionality by Structural Patterns. In 14th ACM Conference on Hypertext and Hypermedia, Aug. 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Baeza-Yates, C. Castillo and V. López. PageRank Increase under Different Collusion Topologies. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google ScholarGoogle Scholar
  4. A. Benczúr, K. Csalogány, T. Sarlós and M. Uher. SpamRank -- Fully Automatic Link Spam Detection. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google ScholarGoogle Scholar
  5. L. Breiman. Bagging Predictors. In Machine Learning, Vol. 24, No. 2, pages 123--140, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. U.S. Census Bureau. Quarterly Retail E-Commerce Sales -- 4th Quarter 2004. http://www.census.gov/mrts/www/data/html/04Q4.html (dated Feb. 2005, visited Sept. 2005)Google ScholarGoogle Scholar
  7. B. Davison. Recognizing Nepotistic Links on the Web. In AAAI-2000 Workshop on Artificial Intelligence for Web Search, July 2000.Google ScholarGoogle Scholar
  8. D. Fetterly, M. Manasse and M. Najork. Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages. In 7th International Workshop on the Web and Databases, June 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. D. Fetterly, M. Manasse and M. Najork. Detecting Phrase-Level Duplication on the World Wide Web. In 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Aug. 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Z. Gyöngyi, H. Garcia-Molina and J. Pedersen. Combating Web Spam with TrustRank. In 30th International Conference on Very Large Data Bases, Aug. 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Z. Gyöngyi and H. Garcia-Molina. Link Spam Alliances. In 31st International Conference on Very Large Data Bases, Aug. 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Z. Gyöngyi and H. Garcia-Molina. Web Spam Taxonomy. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google ScholarGoogle Scholar
  14. GZIP. http://www.gzip.org/Google ScholarGoogle Scholar
  15. M. Henzinger, R. Motwani and C. Silverstein. Challenges in Web Search Engines. SIGIR Forum 36(2), 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. J. Hidalgo. Evaluating cost-sensitive Unsolicited Bulk Email categorization. In 2002 ACM Symposium on Applied Computing, Mar. 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. B. Jansen and A. Spink. An Analysis of Web Documents Retrieved and Viewed. In International Conference on Internet Computing, June 2003.Google ScholarGoogle Scholar
  18. C. Johnson. US eCommerce: 2005 To 2010. http://www.forrester.com/Research/Document/Excerpt/0,7211,37626,00.html (dated Sept. 2005, visited Sept. 2005)Google ScholarGoogle Scholar
  19. C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999, Cambridge, Massachusetts. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. P. Metaxas and J. DeStefano. Web Spam, Propaganda and Trust. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google ScholarGoogle Scholar
  21. G. Mishne, D. Carmel and R. Lempel. Blocking Blog Spam with Language Model Disagreement. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google ScholarGoogle Scholar
  22. MSN Search. http://search.msn.com/Google ScholarGoogle Scholar
  23. J. Nielsen. Statistics for Traffic Referred by Search Engines and Navigation Directories to Useit. http://useit.com/about/searchreferrals.html (dated April 2004, visited Sept. 2005)Google ScholarGoogle Scholar
  24. L. Page, S. Brin, R. Motwani and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Technologies Project, 1998.Google ScholarGoogle Scholar
  25. A. Perkins. The Classification of Search Engine Spam. http://www.silverdisc.co.uk/articles/spam-classification/ (dated Sept. 2001, visited Sept. 2005)Google ScholarGoogle Scholar
  26. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan-Kaufman, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. J. R. Quinlan. Bagging, Boosting, and C4.5. In 13th National Conference on Artificial Intelligence and 8th Innovative Applications of Artificial Intelligence Conference, Vol. 1, 725--730, Aug. 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. M. Sahami, S. Dumais, D. Heckerman and E. Horvitz. A Bayesian Approach to Filtering Junk E-Mail. In Learning for Text Categorization: Papers from the 1998 Workshop, AAAI Technical Report WS-98-05, 1998.Google ScholarGoogle Scholar
  29. B. Wu and B. Davison. Identifying Link Farm Spam Pages. In 14th International World Wide Web Conference, May 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. B. Wu and B. Davison. Cloaking and Redirection: a preliminary study. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google ScholarGoogle Scholar
  31. H. Zhang, A. Goel, R. Govindan, K. Mason and B. Van Roy. Making Eigenvector-Based Systems Robust to Collusion. In 3rd International Workshop on Algorithms and Models for the Web Graph, Oct. 2004.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Detecting spam web pages through content analysis

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            WWW '06: Proceedings of the 15th international conference on World Wide Web
            May 2006
            1102 pages
            ISBN:1595933239
            DOI:10.1145/1135777

            Copyright © 2006 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 23 May 2006

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • Article

            Acceptance Rates

            Overall Acceptance Rate1,899of8,196submissions,23%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader