DOI: 10.1145/1277741.1277814
Article

Know your neighbors: web spam detection using the web topology

Published: 23 July 2007

ABSTRACT

Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among the Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large and public dataset, using algorithms that can be applied in practice to large-scale Web data.
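The three ways of using the host graph named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the tie-breaking rule, the blending weight `alpha`, and the toy graph are all assumptions made for illustration, and the propagation step uses simple neighbor averaging as a stand-in for whatever scheme the authors actually use.

```python
from collections import defaultdict


def cluster_majority(pred, cluster_of):
    """(i) Relabel every host with the majority predicted label of its cluster.

    pred: host -> 0 (non-spam) or 1 (spam), from a base classifier.
    cluster_of: host -> cluster id (assumed to come from some graph clustering).
    Ties are resolved as non-spam here; that choice is an assumption.
    """
    votes = defaultdict(list)
    for host, label in pred.items():
        votes[cluster_of[host]].append(label)
    majority = {c: int(sum(v) / len(v) > 0.5) for c, v in votes.items()}
    return {host: majority[cluster_of[host]] for host in pred}


def propagate(score, neighbors, alpha=0.5, rounds=1):
    """(ii) Blend each host's spam score with the mean score of its neighbors.

    alpha is a hypothetical blending weight; hosts without neighbors keep
    their own score.
    """
    for _ in range(rounds):
        updated = {}
        for host, s in score.items():
            nbrs = neighbors.get(host) or []
            if nbrs:
                mean_nbr = sum(score[n] for n in nbrs) / len(nbrs)
                updated[host] = (1 - alpha) * s + alpha * mean_nbr
            else:
                updated[host] = s
        score = updated
    return score


def neighbor_feature(pred, neighbors):
    """(iii) Mean predicted label of each host's neighbors.

    The value would be appended to the host's feature vector before the
    classifier is retrained; retraining itself is not shown.
    """
    feature = {}
    for host in pred:
        nbrs = neighbors.get(host) or []
        feature[host] = sum(pred[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
    return feature


# Toy host graph: a, b, c link to each other; d is isolated.
neighbors = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"], "d": []}
pred = {"a": 1, "b": 1, "c": 0, "d": 0}          # base classifier output
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1}    # assumed clustering

relabeled = cluster_majority(pred, cluster_of)
smoothed = propagate({h: float(p) for h, p in pred.items()}, neighbors)
extra_feature = neighbor_feature(pred, neighbors)
```

In the toy graph, host `c` is predicted non-spam but sits among spam neighbors, so all three variants pull it toward the spam class. That reflects the observation in the abstract that linked hosts tend to share a class.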


      Published in

      SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
      July 2007
      946 pages
      ISBN: 9781595935977
      DOI: 10.1145/1277741

      Copyright © 2007 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 792 of 3,983 submissions, 20%
