Article

Detecting spam web pages through content analysis

Authors:
Alexandros Ntoulas

UCLA Computer Science Dept., Los Angeles, CA

UCLA Computer Science Dept., Los Angeles, CA
View Profile

,
Marc Najork

Microsoft Research, Mountain View, CA

Microsoft Research, Mountain View, CA
View Profile

,
Mark Manasse

Microsoft Research, Mountain View, CA

Microsoft Research, Mountain View, CA
View Profile

,
Dennis Fetterly

Microsoft Research, Mountain View, CA

Microsoft Research, Mountain View, CA
View Profile

WWW '06: Proceedings of the 15th international conference on World Wide WebMay 2006Pages 83–92https://doi.org/10.1145/1135777.1135794

Published:23 May 2006Publication History

WWW '06: Proceedings of the 15th international conference on World Wide Web

Pages 83–92

ABSTRACT

In this paper, we continue our investigations of "web spam": the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).

References

S. Adali, T. Liu and M. Magdon-Ismail. Optimal Link Bombs are Uncoordinated. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google Scholar
E. Amitay, D. Carmel, A. Darlow, R. Lempel and A. Soffer. The Connectivity Sonar: Detecting Site Functionality by Structural Patterns. In 14th ACM Conference on Hypertext and Hypermedia, Aug. 2003. Google ScholarDigital Library
R. Baeza-Yates, C. Castillo and V. López. PageRank Increase under Different Collusion Topologies. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google Scholar
A. Benczúr, K. Csalogány, T. Sarlós and M. Uher. SpamRank -- Fully Automatic Link Spam Detection. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google Scholar
L. Breiman. Bagging Predictors. In Machine Learning, Vol. 24, No. 2, pages 123--140, 1996. Google ScholarDigital Library
U.S. Census Bureau. Quarterly Retail E-Commerce Sales -- 4th Quarter 2004. http://www.census.gov/mrts/www/data/html/04Q4.html (dated Feb. 2005, visited Sept. 2005)Google Scholar
B. Davison. Recognizing Nepotistic Links on the Web. In AAAI-2000 Workshop on Artificial Intelligence for Web Search, July 2000.Google Scholar
D. Fetterly, M. Manasse and M. Najork. Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages. In 7th International Workshop on the Web and Databases, June 2004. Google ScholarDigital Library
D. Fetterly, M. Manasse and M. Najork. Detecting Phrase-Level Duplication on the World Wide Web. In 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Aug. 2005. Google ScholarDigital Library
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, 1995. Google ScholarDigital Library
Z. Gyöngyi, H. Garcia-Molina and J. Pedersen. Combating Web Spam with TrustRank. In 30th International Conference on Very Large Data Bases, Aug. 2004. Google ScholarDigital Library
Z. Gyöngyi and H. Garcia-Molina. Link Spam Alliances. In 31st International Conference on Very Large Data Bases, Aug. 2005. Google ScholarDigital Library
Z. Gyöngyi and H. Garcia-Molina. Web Spam Taxonomy. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google Scholar
GZIP. http://www.gzip.org/Google Scholar
M. Henzinger, R. Motwani and C. Silverstein. Challenges in Web Search Engines. SIGIR Forum 36(2), 2002. Google ScholarDigital Library
J. Hidalgo. Evaluating cost-sensitive Unsolicited Bulk Email categorization. In 2002 ACM Symposium on Applied Computing, Mar. 2002. Google ScholarDigital Library
B. Jansen and A. Spink. An Analysis of Web Documents Retrieved and Viewed. In International Conference on Internet Computing, June 2003.Google Scholar
C. Johnson. US eCommerce: 2005 To 2010. http://www.forrester.com/Research/Document/Excerpt/0,7211,37626,00.html (dated Sept. 2005, visited Sept. 2005)Google Scholar
C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999, Cambridge, Massachusetts. Google ScholarDigital Library
P. Metaxas and J. DeStefano. Web Spam, Propaganda and Trust. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google Scholar
G. Mishne, D. Carmel and R. Lempel. Blocking Blog Spam with Language Model Disagreement. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google Scholar
MSN Search. http://search.msn.com/Google Scholar
J. Nielsen. Statistics for Traffic Referred by Search Engines and Navigation Directories to Useit. http://useit.com/about/searchreferrals.html (dated April 2004, visited Sept. 2005)Google Scholar
L. Page, S. Brin, R. Motwani and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Technologies Project, 1998.Google Scholar
A. Perkins. The Classification of Search Engine Spam. http://www.silverdisc.co.uk/articles/spam-classification/ (dated Sept. 2001, visited Sept. 2005)Google Scholar
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan-Kaufman, 1993. Google ScholarDigital Library
J. R. Quinlan. Bagging, Boosting, and C4.5. In 13th National Conference on Artificial Intelligence and 8th Innovative Applications of Artificial Intelligence Conference, Vol. 1, 725--730, Aug. 1996. Google ScholarDigital Library
M. Sahami, S. Dumais, D. Heckerman and E. Horvitz. A Bayesian Approach to Filtering Junk E-Mail. In Learning for Text Categorization: Papers from the 1998 Workshop, AAAI Technical Report WS-98-05, 1998.Google Scholar
B. Wu and B. Davison. Identifying Link Farm Spam Pages. In 14th International World Wide Web Conference, May 2005. Google ScholarDigital Library
B. Wu and B. Davison. Cloaking and Redirection: a preliminary study. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.Google Scholar
H. Zhang, A. Goel, R. Govindan, K. Mason and B. Van Roy. Making Eigenvector-Based Systems Robust to Collusion. In 3rd International Workshop on Algorithms and Models for the Web Graph, Oct. 2004.Google ScholarCross Ref

Index Terms

Detecting spam web pages through content analysis

Recommendations

Spam, damn spam, and statistics: using statistical analysis to locate spam web pages
WebDB '04: Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004

The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to ...
Read More
SAAD, a content based Web Spam Analyzer and Detector

Web Spam is one of the main difficulties that crawlers have to overcome and therefore one of the main problems of the WWW. There are several studies about characterising and detecting Web Spam pages. However, none of them deals with all the possible ...
Read More
Analysis and detection of web spam by means of web content
IRFC'12: Proceedings of the 5th conference on Multidisciplinary Information Retrieval

Web Spam is one of the main difficulties that crawlers have to overcome. According to Gyöngyi and Garcia-Molina it is defined as "any deliberate human action that is meant to trigger an unjustifiably favourable relevance or importance of some web pages ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
WWW '06: Proceedings of the 15th international conference on World Wide Web
May 2006
1102 pages
ISBN:1595933239
DOI:10.1145/1135777
General Chairs:
Leslie Carr
University of Southampton
,
David De Roure
University of Southampton
,
Arun Iyengar
IBM Research
,
Program Chairs:
Carole Goble
University of Manchester, UK
,
Mike Dahlin
University of Texas at Austin
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 May 2006
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
data mining
web characterization
web pages
web spam
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate1,899of8,196submissions,23%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 373
  Total Citations
  View Citations
- 3,735
  Total Downloads
- Downloads (Last 12 months)62
- Downloads (Last 6 weeks)13
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Detecting spam web pages through content analysis

WWW '06: Proceedings of the 15th international conference on World Wide Web

ABSTRACT

References

Cited By

Index Terms

Recommendations

Spam, damn spam, and statistics: using statistical analysis to locate spam web pages

SAAD, a content based Web Spam Analyzer and Detector

Analysis and detection of web spam by means of web content