Article

Know your neighbors: web spam detection using the web topology

Authors:
Carlos Castillo (Yahoo! Research Barcelona, Catalunya, Spain)
Debora Donato (Yahoo! Research Barcelona, Catalunya, Spain)
Aristides Gionis (Yahoo! Research Barcelona, Catalunya, Spain)
Vanessa Murdock (Yahoo! Research Barcelona, Catalunya, Spain)
Fabrizio Silvestri (ISTI-CNR, Pisa, Italy)

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
July 2007, Pages 423–430
https://doi.org/10.1145/1277741.1277814

Published: 23 July 2007

ABSTRACT

Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among the Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large and public dataset, using algorithms that can be applied in practice to large-scale Web data.
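
The three ways of using the host graph listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the tie-breaking rule, the mixing weight `alpha`, the single propagation step, and the use of a mean neighbor score as the added feature are all assumptions made for illustration. Base-classifier output is assumed to be available as dictionaries keyed by host name.

```python
from collections import defaultdict


def cluster_majority(pred, clusters):
    """(i) Relabel every host with the majority label of its cluster.

    pred maps host -> 0 (non-spam) or 1 (spam); clusters maps host -> cluster id.
    """
    votes = defaultdict(list)
    for host, cid in clusters.items():
        votes[cid].append(pred[host])
    relabeled = {}
    for host, cid in clusters.items():
        v = votes[cid]
        # Assumption: ties resolve to non-spam.
        relabeled[host] = 1 if 2 * sum(v) > len(v) else 0
    return relabeled


def propagate(scores, neighbors, alpha=0.5):
    """(ii) One propagation step: mix a host's score with its neighbors' mean score.

    scores maps host -> predicted spam score in [0, 1];
    neighbors maps host -> list of linked hosts.
    """
    updated = {}
    for host, score in scores.items():
        nbrs = neighbors.get(host, [])
        if nbrs:
            mean = sum(scores[n] for n in nbrs) / len(nbrs)
            updated[host] = (1 - alpha) * score + alpha * mean
        else:
            # Isolated hosts keep their own score.
            updated[host] = score
    return updated


def neighbor_feature(features, scores, neighbors):
    """(iii) Append the mean predicted score of the neighbors as a new feature.

    The extended feature vectors would then be used to retrain the classifier.
    """
    extended = {}
    for host, vec in features.items():
        nbrs = neighbors.get(host, [])
        if nbrs:
            mean = sum(scores[n] for n in nbrs) / len(nbrs)
        else:
            # Assumption: an isolated host falls back to its own score.
            mean = scores[host]
        extended[host] = list(vec) + [mean]
    return extended


# Toy host graph (hypothetical data).
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": []}
scores = {"a": 0.2, "b": 0.9, "c": 0.8, "d": 0.1}
labels = {"a": 1, "b": 1, "c": 0, "d": 0}
clusters = {"a": 0, "b": 0, "c": 0, "d": 1}

by_cluster = cluster_majority(labels, clusters)
smoothed = propagate(scores, neighbors)
extended = neighbor_feature({"a": [1.0], "d": [2.0]}, scores, neighbors)
```

In this toy graph, host `a` has a low score of its own but two high-scoring neighbors, so both the propagation step and the added feature pull it toward the spam class, which is the effect the linked-hosts observation in the abstract relies on.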

Index Terms

1. Information systems
  1. Information systems applications

Published in
SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
July 2007
946 pages
ISBN: 9781595935977
DOI: 10.1145/1277741
General Chairs: Wessel Kraaij (TNO, The Netherlands), Arjen P. de Vries (CWI, The Netherlands)
Program Chairs: Charles L. A. Clarke (University of Waterloo, Canada), Norbert Fuhr (University of Duisburg-Essen, Germany), Noriko Kando (National Institute of Informatics, Japan)
Copyright © 2007 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
content spam
link spam
web spam
Qualifiers
- Article
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%