Article

Identifying link farm spam pages

Authors:
Baoning Wu, Lehigh University, Bethlehem, PA
Brian D. Davison, Lehigh University, Bethlehem, PA

WWW '05: Special interest tracks and posters of the 14th international conference on World Wide Web, May 2005, Pages 820–829, https://doi.org/10.1145/1062745.1062762

Published: 10 May 2005

ABSTRACT

With the increasing importance of search in guiding today's web traffic, more and more effort has been spent on creating search engine spam. Since link analysis is one of the most important factors in current commercial search engines' ranking systems, new kinds of spam that target links have appeared. Building link farms is one technique that can degrade the results of link-based ranking algorithms. In this paper, we present algorithms for detecting these link farms automatically by first generating a seed set based on the common link set between incoming and outgoing links of Web pages and then expanding it. Links between identified pages are re-weighted, providing a modified web graph to use in ranking page importance. Experimental results show that we can identify most link farm spam pages and that the final ranking results are improved for almost all tested queries.
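The three steps named in the abstract (seed selection from the overlap of incoming and outgoing links, expansion, and re-weighting) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give thresholds, the expansion rule, or the re-weighting scheme, so the threshold of 3, the rule "flag a page that links to enough flagged pages", the zero weight for links between flagged pages, and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def find_seed_set(out_links, threshold=3):
    """Flag pages whose in-link and out-link sets share >= threshold pages.

    The threshold value is an assumption, not taken from the abstract.
    """
    in_links = defaultdict(set)
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].add(src)
    return {
        page
        for page, targets in out_links.items()
        if len(set(targets) & in_links[page]) >= threshold
    }


def expand(out_links, seeds, threshold=3):
    """Assumed expansion rule: repeatedly flag any page that links to
    at least `threshold` already-flagged pages, until nothing changes."""
    flagged = set(seeds)
    changed = True
    while changed:
        changed = False
        for page, targets in out_links.items():
            if page not in flagged and len(set(targets) & flagged) >= threshold:
                flagged.add(page)
                changed = True
    return flagged


def reweight(out_links, flagged, penalty=0.0):
    """Assumed re-weighting: links between two flagged pages get `penalty`,
    every other link keeps weight 1.0."""
    return {
        (src, dst): penalty if src in flagged and dst in flagged else 1.0
        for src, targets in out_links.items()
        for dst in targets
    }


# Hypothetical toy graph: a, b, c, d form a densely interlinked farm,
# e only points into the farm, f and g are ordinary pages.
out_links = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a", "b", "c"},
    "f": {"a", "g"},
    "g": {"f"},
}

seeds = find_seed_set(out_links)
flagged = expand(out_links, seeds)
weights = reweight(out_links, flagged)
```

In this toy graph the seed step picks up only the mutually linked pages, and the expansion step then adds the page that points into the farm without being linked back. The resulting edge weights could feed a weighted variant of a link-based ranking algorithm such as PageRank or HITS.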

Index Terms

Identifying link farm spam pages
1. Information systems
  1. Information retrieval

Published in
WWW '05: Special interest tracks and posters of the 14th international conference on World Wide Web
May 2005
454 pages
ISBN:1595930515
DOI:10.1145/1062745
Conference Chairs: Allan Ellis (Southern Cross University, Australia), Tatsuya Hagino (Keio University, Japan)
Program Chairs: Fred Douglis (IBM Research), Prabhakar Raghavan (Verity, Inc.)
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
HITS
PageRank
link analysis
spam
web search engine
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%