ABSTRACT
Link spam refers to attempts to promote the ranking of spammers' web sites by deceiving the link-based ranking algorithms of search engines. Spammers often create a densely connected link structure of sites, a so-called "link farm". In this paper, we study the overall structure and distribution of link farms in a large-scale graph of the Japanese Web with 5.8 million sites and 283 million links. To examine the spam structure, we apply three graph algorithms to the web graph. First, the web graph is decomposed into strongly connected components (SCCs). Besides the largest SCC (the core) at the center of the web, we observed that most of the large components consist of link farms. Next, to extract spam sites in the core, we enumerate maximal cliques as seeds of link farms. Finally, we expand these link farms, used as a reliable spam seed set, with a minimum cut technique that cuts the links between spam and non-spam sites. We found about 0.6 million spam sites in the SCCs around the core, and extracted an additional 8 thousand and 49 thousand spam sites in the core with high precision by maximal clique enumeration and by the minimum cut technique, respectively.
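The first two steps named in the abstract can be illustrated on a toy site graph. The sketch below is not the authors' implementation: it swaps in Kosaraju's algorithm for the SCC decomposition and Bron-Kerbosch with pivoting for the clique enumeration, and it assumes that a clique means a set of sites that all link to each other in both directions. The graph `web`, the site names, and the `min_size` threshold are hypothetical.

```python
from collections import defaultdict

def strongly_connected_components(graph):
    """Kosaraju's algorithm, iterative. graph maps a site to the sites it links to."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    # Pass 1: depth-first search, recording sites in order of completion.
    visited, order = set(), []
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                order.append(node)
                stack.pop()
    # Pass 2: walk the reversed graph in decreasing completion order.
    reverse = defaultdict(list)
    for src, targets in graph.items():
        for dst in targets:
            reverse[dst].append(src)
    assigned, components = set(), []
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, todo = set(), [root]
        while todo:
            node = todo.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(component)
    return components

def mutual_link_graph(graph):
    """Undirected adjacency that keeps only pairs of sites linking to each other (assumption)."""
    adj = defaultdict(set)
    for src, targets in graph.items():
        for dst in targets:
            if src != dst and src in graph.get(dst, ()):
                adj[src].add(dst)
                adj[dst].add(src)
    return adj

def maximal_cliques(adj, min_size=3):
    """Bron-Kerbosch with pivoting; returns maximal cliques with at least min_size sites."""
    found = []
    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            if len(clique) >= min_size:
                found.append(frozenset(clique))
            return
        pivot = max(candidates | excluded, key=lambda v: len(adj[v] & candidates))
        for v in list(candidates - adj[pivot]):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}
    expand(set(), set(adj), set())
    return found

# Hypothetical toy graph: a, b, c form a cycle; f1..f4 form a fully interlinked farm.
web = {
    "a": ["b"], "b": ["c"], "c": ["a", "f1"],
    "f1": ["f2", "f3", "f4"],
    "f2": ["f1", "f3", "f4"],
    "f3": ["f1", "f2", "f4"],
    "f4": ["f1", "f2", "f3"],
    "leaf": ["a"],
}
sccs = sorted(strongly_connected_components(web), key=len, reverse=True)
cliques = maximal_cliques(mutual_link_graph(web))
```

Here the cycle a, b, c is strongly connected but contains no mutual links, so only the farm f1..f4 is reported as a clique. The third step, expanding seeds with a minimum cut, is left out to keep the sketch short.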
Index Terms
- A large-scale study of link spam detection by graph algorithms