DOI: 10.1145/1062745.1062762
Article

Identifying link farm spam pages

Published: 10 May 2005

ABSTRACT

With the increasing importance of search in guiding today's web traffic, more and more effort has gone into creating search engine spam. Since link analysis is one of the most important factors in current commercial search engines' ranking systems, new kinds of spam that target links have appeared. Building link farms is one technique that can degrade the results of link-based ranking algorithms. In this paper, we present algorithms for detecting these link farms automatically: we first generate a seed set based on the common link set between the incoming and outgoing links of Web pages, and then expand it. Links between identified pages are re-weighted, providing a modified web graph to use in ranking page importance. Experimental results show that we can identify most link farm spam pages and that the final ranking results are improved for almost all tested queries.
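The three steps named in the abstract (seed selection from the overlap of incoming and outgoing links, expansion, and re-weighting) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the two thresholds, the rule used for expansion (flag a page that links to enough already-flagged pages), and the penalty weight are all assumptions, since the abstract does not specify them.

```python
from collections import defaultdict


def find_link_farm(out_links, seed_threshold=2, expand_threshold=2):
    """Return the set of pages flagged as likely link farm members.

    out_links maps each page to the set of pages it links to.
    Both thresholds are hypothetical; the abstract gives no values.
    """
    # Build the reverse adjacency (incoming links) from the outgoing links.
    in_links = defaultdict(set)
    for src, dsts in out_links.items():
        for dst in dsts:
            in_links[dst].add(src)
    pages = set(out_links) | set(in_links)

    # Seed set: pages whose incoming and outgoing link sets share
    # at least seed_threshold pages.
    flagged = {
        p for p in pages
        if len(out_links.get(p, set()) & in_links.get(p, set())) >= seed_threshold
    }

    # Expansion (assumed rule): repeatedly flag any page that links to
    # at least expand_threshold already-flagged pages.
    changed = True
    while changed:
        changed = False
        for p in pages - flagged:
            if len(out_links.get(p, set()) & flagged) >= expand_threshold:
                flagged.add(p)
                changed = True
    return flagged


def reweight(out_links, flagged, penalty=0.0):
    """Assign each link a weight; links between two flagged pages get `penalty`.

    The penalty value is an assumption made for illustration.
    """
    return {
        (src, dst): (penalty if src in flagged and dst in flagged else 1.0)
        for src, dsts in out_links.items()
        for dst in dsts
    }


# Toy graph: a, b, c link to each other; d links into that group;
# e links to only one member of it.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"a", "f"},
    "f": set(),
}
farm = find_link_farm(graph)
weights = reweight(graph, farm)
```

The resulting weights could then feed a link-based ranking computation over the modified graph in place of uniform link weights.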


Published in

WWW '05: Special interest tracks and posters of the 14th international conference on World Wide Web
May 2005
454 pages
ISBN: 1595930515
DOI: 10.1145/1062745

Copyright © 2005 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 1,899 of 8,196 submissions (23%)
