DOI: 10.1145/1244408.1244417
Article

A large-scale study of link spam detection by graph algorithms

Published: 08 May 2007

ABSTRACT

Link spam refers to attempts to promote the ranking of spammers' web sites by deceiving the link-based ranking algorithms used in search engines. Spammers often create densely connected link structures of sites, so-called "link farms". In this paper, we study the overall structure and distribution of link farms in a large-scale graph of the Japanese Web with 5.8 million sites and 283 million links. To examine the spam structure, we apply three graph algorithms to the web graph. First, the web graph is decomposed into strongly connected components (SCCs). Besides the largest SCC (the core) in the center of the web, we observed that most of the large components consist of link farms. Next, to extract spam sites in the core, we enumerate maximal cliques as seeds of link farms. Finally, treating these link farms as a reliable spam seed set, we expand them with a minimum cut technique that separates spam sites from non-spam sites by cutting the links between them. We found about 0.6 million spam sites in SCCs around the core, and extracted an additional 8 thousand and 49 thousand spam sites with high precision in the core by the maximal clique enumeration and by the minimum cut technique, respectively.
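The first two steps named in the abstract can be illustrated on a toy graph. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the function names (`strongly_connected_components`, `mutual_link_graph`, `maximal_cliques`) and the six-site graph are hypothetical, Kosaraju's algorithm stands in for whichever SCC method the paper used, and Bron-Kerbosch with pivoting stands in for its clique enumeration. Restricting cliques to mutually linked sites is also an assumption, since the abstract does not say how the directed graph is made undirected. The minimum cut expansion step is omitted.

```python
from collections import defaultdict


def strongly_connected_components(graph):
    """Kosaraju's algorithm (iterative). graph maps a site to the sites it links to."""
    nodes = set(graph)
    reverse = defaultdict(set)
    for source, targets in graph.items():
        for target in targets:
            nodes.add(target)
            reverse[target].add(source)

    # Pass 1: depth-first search on the original graph, recording finish order.
    visited, order = set(), []
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:  # all successors explored: node is finished
                order.append(node)
                stack.pop()

    # Pass 2: search the reversed graph in decreasing finish order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = set(), [start]
        while todo:
            node = todo.pop()
            component.add(node)
            for prev in reverse.get(node, ()):
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


def mutual_link_graph(graph, members):
    """Undirected adjacency over members, keeping only reciprocal links (assumption)."""
    adj = {site: set() for site in members}
    for u in members:
        for v in graph.get(u, ()):
            if v != u and v in members and u in graph.get(v, ()):
                adj[u].add(v)
    return adj


def maximal_cliques(adj):
    """Bron-Kerbosch with pivoting; returns every maximal clique as a set."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical toy graph: a, b, c link to each other; d and e link to each other.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
         "d": ["e"], "e": ["d"], "f": ["a"]}
components = strongly_connected_components(graph)
largest = max(components, key=len)
print(sorted(sorted(c) for c in components))
print([sorted(c) for c in maximal_cliques(mutual_link_graph(graph, largest))])
```

Both searches are iterative or shallowly recursive so that the SCC pass does not hit Python's recursion limit on long link chains. At the scale reported in the abstract (5.8 million sites), a production system would need a far more memory-efficient representation than Python dictionaries.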

Published in

AIRWeb '07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
May 2007
98 pages
ISBN: 9781595937322
DOI: 10.1145/1244408

Copyright © 2007 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States
