ABSTRACT
Web spam can significantly degrade the quality of search engine results, so commercial search engines have a strong incentive to detect spam pages efficiently and accurately. In this paper we present a spam detection system that combines link-based and content-based features and exploits the topology of the Web graph through the link dependencies among Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph and assigning a label to all hosts in each cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large public dataset, using algorithms that can be applied in practice to large-scale Web data.
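The three ways of using the host graph described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy graph, the 0.5 decision threshold, the `damping` and `rounds` parameters, and all function names are assumptions made for illustration, and the propagation step is a plain neighbor-averaging scheme rather than whatever propagation rule the paper actually evaluates.

```python
from collections import Counter


def cluster_majority_vote(clusters, labels):
    """(i) Give every host in a cluster the cluster's majority label."""
    smoothed = {}
    for hosts in clusters:
        majority, _ = Counter(labels[h] for h in hosts).most_common(1)[0]
        for host in hosts:
            smoothed[host] = majority
    return smoothed


def propagate(neighbors, scores, damping=0.5, rounds=1):
    """(ii) Blend each host's spam score with the mean score of its neighbors.

    Assumed scheme: simple averaging, repeated `rounds` times.
    """
    current = dict(scores)
    for _ in range(rounds):
        updated = {}
        for host, score in current.items():
            linked = neighbors.get(host, [])
            if linked:
                mean = sum(current[n] for n in linked) / len(linked)
                updated[host] = (1 - damping) * score + damping * mean
            else:
                updated[host] = score
        current = updated
    return current


def neighbor_spam_fraction(neighbors, labels):
    """(iii) New feature: fraction of a host's neighbors predicted as spam.

    A classifier would then be retrained with this feature appended.
    """
    feature = {}
    for host, linked in neighbors.items():
        if linked:
            spam = sum(1 for n in linked if labels[n] == "spam")
            feature[host] = spam / len(linked)
        else:
            feature[host] = 0.0
    return feature


# Hypothetical toy data: two groups of mutually linked hosts.
neighbors = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"],
             "d": ["e"], "e": ["d"]}
# Hypothetical spam scores from a base classifier (higher means more spam-like).
scores = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1, "e": 0.2}
labels = {h: ("spam" if s >= 0.5 else "nonspam") for h, s in scores.items()}
clusters = [["a", "b", "c"], ["d", "e"]]

voted = cluster_majority_vote(clusters, labels)
propagated = propagate(neighbors, scores)
feature = neighbor_spam_fraction(neighbors, labels)
print(voted["c"], round(propagated["c"], 3), feature["c"])
```

In this toy graph, host `c` is scored as non-spam by the base classifier but links only to hosts predicted as spam, so each of the three functions moves it toward the spam class. That is the intuition behind the observation that linked hosts tend to share a class.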
Index Terms
- Know your neighbors: web spam detection using the web topology