
Pattern Recognition Letters

Volume 34, Issue 13, 1 October 2013, Pages 1462-1469

Combating Web spam through trust–distrust propagation with confidence

https://doi.org/10.1016/j.patrec.2013.05.017

Highlights

  • We propose the GBR algorithm that propagates both trust and distrust.

  • The propagation of a page’s trust/distrust is determined by its probability of being trustworthy/untrustworthy.

  • GBR takes advantage of both trust and distrust propagation.

  • GBR outperforms other typical algorithms for both spam demotion and spam detection.

Abstract

Semi-automatic anti-spam algorithms propagate either trust through links from a set of good seed pages or distrust through inverse-links from a set of bad seed pages to the entire Web. It has been suggested that a combined use of trust and distrust propagation can lead to better results; however, little work is known to have realized this insight successfully. In this paper, we take the view that each Web page has both a trustworthy side and an untrustworthy side, and propose to assign each Web page two scores that denote its trustworthy side and its untrustworthy side, respectively. We then propose the Good-Bad Rank (GBR) algorithm, which propagates trust and distrust simultaneously from both directions. In GBR, the propagation of a page’s trust/distrust is determined by its probability of being trustworthy/untrustworthy. GBR takes advantage of both trust and distrust propagation and is therefore more powerful than propagating trust or distrust alone. Experimental results show that GBR outperforms other typical link-based anti-spam algorithms that propagate only trust or only distrust. GBR achieves performance comparable to TDR, another algorithm that propagates both trust and distrust, while being much more efficient.

Introduction

Querying a search engine is the dominant way for people to find useful information on the Web, and users are usually interested only in the top-ranked results on the first few pages. Since a higher ranking in search results brings more traffic, and more traffic means more profit to the owners of Web sites, there are economic incentives to manipulate a search engine’s ranking results through unethical methods. This kind of unethical manipulation is termed spam (Gyöngyi and Garcia-Molina, 2005). Spam has a negative economic and social impact on the whole Web community and has been identified as one of the major challenges faced by search engines (Henzinger et al., 2002). Commercial search engines have to take measures to eliminate its negative effects.

Two kinds of techniques are usually used to combat spam: (1) spam demotion, which punishes spam pages by demoting them in the search result ranking; and (2) spam detection, which identifies spam pages so that search engines can take further actions to filter them out. Many anti-spam algorithms have been proposed, and among them link-based semi-automatic algorithms, which propagate human experts’ judgments over a set of seed pages, are the most promising in terms of both effectiveness and efficiency. These algorithms fall into two categories: trust propagation and distrust propagation. Trust propagation algorithms are based on the intuition that a good page seldom points to spam pages or low-quality pages. They propagate trust from a seed set of good pages recursively through links to the entire Web. TrustRank (Gyöngyi et al., 2004) is the first trust propagation algorithm; Topical TrustRank (Wu et al., 2006) and CredibleRank (Caverlee and Liu, 2007), among others, are alternatives or improvements over TrustRank. Trust propagation algorithms are mainly used to demote spam sites in the ranking results. Distrust propagation algorithms hold the philosophy that a page should be penalized for pointing to bad pages. They propagate distrust from a seed set of bad pages recursively through inverse-links to the entire Web. Anti-Trust Rank (Krishnan and Raj, 2006) is a distrust propagation algorithm, and it focuses on the task of spam detection.
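To make the two propagation styles concrete, the following is a minimal sketch of seed-biased propagation in the spirit of TrustRank (forward links from good seeds) and Anti-Trust Rank (inverse links from bad seeds). The dictionary-based graph representation, damping factor and iteration count are illustrative assumptions of ours, not settings taken from the cited papers.

```python
# Minimal sketch of seed-biased propagation in the spirit of TrustRank /
# Anti-Trust Rank. Graph format, damping factor and iteration count are
# illustrative assumptions, not the exact settings of the cited papers.

def propagate(out_links, seeds, alpha=0.85, iters=20):
    """Propagate scores from `seeds` along `out_links` (page -> list of pages)."""
    pages = set(out_links) | {q for qs in out_links.values() for q in qs} | set(seeds)
    seed_mass = 1.0 / len(seeds)
    d = {p: (seed_mass if p in seeds else 0.0) for p in pages}  # seed jump vector
    score = dict(d)
    for _ in range(iters):
        new = {p: (1 - alpha) * d[p] for p in pages}  # random jump back to seeds
        for p, targets in out_links.items():
            if targets:
                share = alpha * score[p] / len(targets)  # split score over out-links
                for q in targets:
                    new[q] += share
        score = new
    return score

# TrustRank-style: propagate trust forward from good seeds.
#   trust = propagate(forward_links, good_seeds)
# Anti-Trust-Rank-style: propagate distrust along inverse links from bad seeds,
# i.e. run the same routine on the transposed graph.
#   distrust = propagate(inverse_links, bad_seeds)
```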

Trust propagation algorithms have inherent limitations that come from their philosophy. Note that “a good page seldom points to spam pages” does not mean that there are no good-to-spam links, and trust propagation algorithms cannot handle such links effectively. Consequently, trust propagation is fragile and can easily be manipulated by spammers. For example, spammers can place their links in the comments section of a good page to gain a high rank. The philosophy of distrust propagation algorithms is not convincing either, because in many cases a good page is not at fault for having some links to bad pages. For such a good page, it is the distrust propagation process, not the page itself, that should be penalized. To the best of our knowledge, these issues have not been well addressed. It has been suggested that a combined use of trust and distrust propagation can lead to better results (Zhao et al., 2008, Wu et al., 2006, Zhang et al., 2006). Wu et al. (2006) simply formed a linear combination of the TrustRank and Anti-Trust Rank scores of each page. However, this linear combination achieves only a slight improvement over TrustRank or Anti-Trust Rank, and only when one of them dominates the combination; it performs worse than either of them when the two are weighted nearly equally. This shows that linear combination gains limited benefit, since trust and distrust are propagated separately in this framework. Zhang et al. (2011) proposed the TDR algorithm, which propagates both trust and distrust; in TDR, the propagation of trust/distrust is penalized by the target page’s current distrust/trust value.
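For reference, one natural form of such a linear combination (the exact weighting scheme of Wu et al. (2006) may differ) is

$$s_{\lambda}(p) = \lambda \cdot \mathrm{TR}(p) - (1-\lambda) \cdot \mathrm{ATR}(p), \qquad \lambda \in [0,1],$$

where TR(p) and ATR(p) denote the TrustRank and Anti-Trust Rank scores of page p. The observation above is that performance degrades precisely when λ is near 0.5, i.e., when neither signal dominates the combination.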

In this paper, we propose another framework in which trust and distrust affect each other’s propagation. We view each page as having both a good side and a bad side, and assign each page a GoodRank score and a BadRank score to denote the good side and the bad side, respectively. GoodRank represents the page’s value and trustworthiness, while BadRank reflects the page’s spamicity or the possibility that it has been manipulated by spam pages. We then propose the Good-Bad Rank (GBR) algorithm, which propagates both GoodRank and BadRank from good and bad seeds, respectively. In GBR, the propagation of a page’s GoodRank is penalized by its BadRank, and the propagation of a page’s BadRank is weakened by its GoodRank. More precisely, the propagation of a page’s trust/distrust is determined by its probability of being trustworthy/untrustworthy. Thus the disadvantages of both trust propagation and distrust propagation algorithms are overcome. For example, spam pages benefit little from manipulating good pages with high GoodRank scores, because the propagation of GoodRank is penalized by those pages’ BadRank. GBR can be seen as a combinatorial generalization of both TrustRank and Anti-Trust Rank, since it propagates both trust and distrust and takes advantage of both good and bad seeds. We use the GoodRank scores of GBR for the spam demotion task and its BadRank scores for the spam detection task. GBR differs from TDR in that GBR penalizes good-to-bad trust propagation and bad-to-good distrust propagation at the source end, while TDR penalizes them at the target end. GBR improves on TDR because it imposes a stricter penalty on good-to-bad trust propagation and bad-to-good distrust propagation. Experimental results show that GBR outperforms other typical link-based anti-spam algorithms that propagate only trust or only distrust. GBR achieves performance comparable to TDR, but is much more efficient.
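As a rough illustration of this mutual penalization (not the exact GBR update rules, which are given in Section 4), one iteration of such a scheme might weight each page’s outgoing trust by its estimated probability of being good, g(p)/(g(p)+b(p)), and its outgoing inverse-link distrust by b(p)/(g(p)+b(p)). The damping factor, seed vectors and normalization below are our own illustrative choices.

```python
# Rough illustration of mutually penalized propagation; NOT the exact GBR
# update rules from Section 4 of the paper. g/b map every page to its current
# GoodRank/BadRank; d_good/d_bad are normalized good/bad seed jump vectors.

def gbr_step(g, b, d_good, d_bad, out_links, in_links, alpha=0.85, eps=1e-12):
    new_g = {p: (1 - alpha) * d_good[p] for p in g}  # random jump to good seeds
    new_b = {p: (1 - alpha) * d_bad[p] for p in b}   # random jump to bad seeds
    for p in g:
        total = g[p] + b[p] + eps
        prob_good, prob_bad = g[p] / total, b[p] / total
        targets = out_links.get(p, [])
        if targets:  # trust flows along forward links, weighted by prob_good
            share = alpha * prob_good * g[p] / len(targets)
            for q in targets:
                new_g[q] += share
        sources = in_links.get(p, [])
        if sources:  # distrust flows along inverse links, weighted by prob_bad
            share = alpha * prob_bad * b[p] / len(sources)
            for q in sources:
                new_b[q] += share
    return new_g, new_b
```

In this sketch a page whose BadRank dominates passes on almost no trust, and a page whose GoodRank dominates passes on almost no distrust, which captures the source-end penalization described above.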

The remainder of this paper is organized as follows: Section 2 gives an overview of related work. Section 3 provides basic concepts and preliminaries. Section 4 introduces the GBR algorithm. Section 5 presents our evaluation process and experimental results. The last section concludes and points out future work.

Section snippets

Related work

Link-based spam, according to Davison (2000), can be defined as nepotistic links between pages that are present for reasons other than merit. Link-based spam consists of the creation of a link structure, e.g. a link farm, to take advantage of link-based ranking algorithms, such as PageRank (Page et al., 1999) and HITS (Kleinberg, 1999). Gyöngyi and Garcia-Molina (2005) identified link-based spam as a very important issue not only because it can render significant gains in the rankings of target

PageRank, TrustRank and Anti-Trust Rank

The Web can be modeled as a directed graph G=(V,E) consisting of a set V={p1,p2,…,pN} of N pages (vertices) and a set E of directed links (edges). In this paper, for a page p, we use IN(p) to represent the set of pages that link to it, and OUT(p) to denote the set of pages linked by page p.
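As a concrete, purely illustrative representation (not taken from the paper), IN(p) and OUT(p) can be materialized from an edge list as follows:

```python
from collections import defaultdict

# Build OUT(p) and IN(p) from a list of directed edges (source, target).
def build_link_maps(edges):
    out_links, in_links = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)  # src links to dst
        in_links[dst].append(src)   # dst is linked by src
    return out_links, in_links

# Example: a tiny three-page graph p1 -> p2 -> p3, p1 -> p3.
out_links, in_links = build_link_maps([("p1", "p2"), ("p2", "p3"), ("p1", "p3")])
```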

PageRank, a well-known algorithm that uses link information to assign global importance scores to all pages on the Web, is the basis of most anti-spam algorithms. The PageRank score r(p) of a page p is
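For reference, the standard damped formulation of this score (the full paper’s exact notation may differ) is

$$r(p) = \frac{1-\alpha}{N} + \alpha \sum_{q \in IN(p)} \frac{r(q)}{|OUT(q)|},$$

where α ∈ (0,1) is the damping factor, N is the total number of pages, and each in-neighbor q splits its score evenly over its out-links.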

Motivation

TrustRank, Anti-Trust Rank and their variants propagate only trust or only distrust to combat Web spam. Propagating only trust or only distrust is not effective enough, and in the following we use an example to illustrate the ineffectiveness of TrustRank and Anti-Trust Rank.

Fig. 1 shows a small nine-page Web graph in which good pages are drawn in white and bad pages in black. Bad pages 6, 7, 8 and 9 make up a small link farm. Since two good pages, 4 and 5, point to bad ones, trust can be

Data set

We chose the WEBSPAM-UK2007 dataset (Yahoo, 2007) and the TREC Category B portion of the ClueWeb09 dataset (Callan et al., 2009) to evaluate the GBR algorithm. The WEBSPAM-UK2007 dataset contains 105,896,555 pages from 114,529 hosts in the .UK domain, and the ClueWeb09 dataset contains 428,136,613 English Web pages. In WEBSPAM-UK2007, a portion of 6479 hosts was labeled as spam/reputable/undecided by a group of volunteers; for ClueWeb09, there are 86,823,693 spam pages together with 61,323,911 non-spam pages (the

Conclusion and future work

In this paper we have proposed a novel algorithm, GBR, to combat spam by taking advantage of both good seeds and bad seeds. The algorithm assigns each page two metrics (GoodRank and BadRank) and propagates trust and distrust simultaneously through links and inverse-links. GBR is a combinatorial generalization of both TrustRank and Anti-Trust Rank. By estimating each page’s probability of being reputable (spam), the process of propagating its trustworthiness (spamicity) to others is restricted

References (35)

  • Abernethy, J., Chapelle, O., Castillo, C., 2008. Web spam identification through content and hyperlinks. In:...
  • Becchetti, L., Leonardi, S., 2006. Using rank propagation and probabilistic counting for link-based spam detection. In:...
  • Benczúr, A.A., Csalogány, K., Sarlós, T., Uher, M., 2005. Spamrank – fully automatic link spam detection. In: AIRWeb...
  • Benczúr, A.A., Bíró, I., Csalogány, K., Uher, M., 2006. Detecting nepotistic links by language model disagreement. In:...
  • Callan, J., Hoy, M., Yoo, C., Zhao, L., 2009. The clueweb09 data set. http://boston.lti.cs.cmu.edu/Data/clueweb09/...
  • Caverlee, J., Liu, L., 2007. Countering web spam with credibility-based link analysis. In: PODC ’07: Proceedings of the...
  • Chen, S.-N., Yu, S., Cheng, S., 2008. Link variable trustrank for fighting web spam. In: CSSE ’08: Proceedings of the 2008...
  • Chung, Y.-j., Toyoda, M., Kitsuregawa, M., 2009. A study of link farm distribution and evolution using a time series of...
  • Davison, B.D., 2000. Recognizing nepotistic links on the web. In: Artificial Intelligence for Web Search, AAAI Press,...
  • Gyöngyi, Z., Garcia-Molina, H., 2005. Web spam taxonomy. In: AIRWeb ’05: Proceedings of the First International...
  • Gyöngyi, Z., Garcia-Molina, H., 2005. Link spam alliances. In: VLDB ’05: Proceedings of the 31st international...
  • Gyöngyi, Z., Garcia-Molina, H., Pedersen, J., 2004. Combating web spam with trustrank. In: VLDB ’04: Proceedings of the...
  • Gyöngyi, Z., Berkhin, P., Garcia-Molina, H., Pedersen, J., 2006. Link spam detection based on mass estimation. In: VLDB...
  • Henzinger, M.R., et al., 2002. Challenges in web search engines. SIGIR Forum.
  • Heymann, P., et al., 2007. Fighting spam on social web sites: a survey of approaches and future challenges. IEEE Internet Comput.
  • Jiang, Q., Zhang, L., Zhu, Y., Zhang, Y., 2008. Larger is better: seed selection in link-based anti-spamming...
  • Kleinberg, J.M., 1999. Authoritative sources in a hyperlinked environment. J. ACM.

Supported by National Science Foundation of China (No. 61272374), Program for New Century Excellent Talents in University (NCET) of China (No. NCET-11-0056), Fundamental Research Funds for the Central Universities of China (No. DUT11ZD107), Specialized Research Fund for the Doctoral Program of Higher Education (No. 20120041110046) and Key Project of Chinese Ministry of Education (No. 313011).
