Measuring similarity to detect qualified links

Authors:
Xiaoguang Qi (Lehigh University)
Lan Nie (Lehigh University)
Brian D. Davison (Lehigh University)

AIRWeb '07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, May 2007, Pages 49–56
https://doi.org/10.1145/1244408.1244418

Published: 08 May 2007

ABSTRACT

The early success of link-based ranking algorithms was predicated on the assumption that links imply merit of the target pages. However, many links today exist for purposes other than conferring authority. Such links introduce noise into link analysis and harm retrieval quality. To provide high-quality search results, it is important to detect these links and reduce their influence. This paper proposes a method that detects such links by considering multiple similarity measures over the source and target pages. A classifier identifies the noisy links, which are then dropped, and link analysis algorithms are run on the reduced link graph. The usefulness of a number of features is also tested. Experiments across 53 query-specific datasets show that our approach almost doubles the performance of Kleinberg's HITS and boosts Bharat and Henzinger's imp algorithm by close to 9% in terms of precision. It also outperforms a previous approach focusing on link farm detection.
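The pipeline described in the abstract (score each link by source/target similarity, drop low-scoring links, then run link analysis on the reduced graph) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not list the similarity measures or the classifier, so the sketch assumes a single bag-of-words cosine similarity and a fixed threshold standing in for the trained classifier. The names `cosine_similarity`, `filter_links`, `hits`, the toy pages, and the `threshold` value are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words term-frequency vectors."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_links(links, page_text, threshold=0.1):
    """Keep links whose source and target texts are similar enough.

    Assumption: the fixed threshold is a stand-in for the trained
    classifier mentioned in the abstract.
    """
    return [(s, t) for s, t in links
            if cosine_similarity(page_text[s], page_text[t]) >= threshold]


def _normalize(scores):
    """Scale scores to unit L2 norm (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return scores
    return {n: v / norm for n, v in scores.items()}


def hits(links, iterations=50):
    """Basic HITS power iteration; returns (authority, hub) score dicts."""
    nodes = {n for edge in links for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of pages linking to it.
        new_auth = {n: 0.0 for n in nodes}
        for s, t in links:
            new_auth[t] += hub[s]
        # Hub score of a page: sum of authority scores of pages it links to.
        new_hub = {n: 0.0 for n in nodes}
        for s, t in links:
            new_hub[s] += new_auth[t]
        auth, hub = _normalize(new_auth), _normalize(new_hub)
    return auth, hub


# Toy data (hypothetical): page "c" is off-topic and links to on-topic pages.
page_text = {
    "a": "jaguar car dealer review",
    "b": "jaguar car engine review",
    "c": "cheap pills casino",
    "d": "jaguar car price guide",
}
links = [("a", "b"), ("d", "b"), ("c", "b"), ("c", "a"), ("a", "d")]

kept = filter_links(links, page_text)
authority, hub_scores = hits(kept)
```

In this toy graph the two links from the off-topic page "c" share no terms with their targets, so they are dropped before HITS runs and "c" no longer contributes hub weight. A real system would replace the threshold with a classifier trained on several similarity features, as the abstract describes.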

References

A. A. Benczur, I. Biro, K. Csalogany, and M. Uher. Detecting nepotistic links by language model disagreement. In Proceedings of the 15th WWW conference, pages 939--940, New York, NY, USA, 2006. ACM Press.
K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in hyperlinked environments. In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104--111, Aug. 1998.
S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh International World Wide Web Conference, pages 107--117, Brisbane, Australia, Apr. 1998.
D. Cai, X. He, J.-R. Wen, and W.-Y. Ma. Block-level link analysis. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, July 2004.
S. Chakrabarti, B. E. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. M. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the Seventh International World Wide Web Conference, pages 65--74, Brisbane, Australia, Apr. 1998.
J. Cho, H. Garcia-Molina, T. Haveliwala, W. Lam, A. Paepcke, S. Raghavan, and G. Wesley. Stanford WebBase components and applications. ACM Transactions on Internet Technology, 6(2):153--186, 2006.
A. L. da Costa Carvalho, P. A. Chirita, E. S. de Moura, P. Calado, and W. Nejdl. Site level noise removal for search engines. In Proceedings of the 15th WWW conference, pages 73--82, New York, NY, USA, 2006. ACM Press.
B. D. Davison. Recognizing nepotistic links on the Web. In Artificial Intelligence for Web Search, pages 23--28. AAAI Press, July 2000. Presented at the AAAI-2000 workshop on Artificial Intelligence for Web Search, Technical Report WS-00-01.
I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: learning to identify link spam. In Proceeding of the ECML, 2005.
D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of WebDB, pages 1--6, June 2004.
Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), Toronto, Canada, 2004.
K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, 2000.
T. Joachims. Making large-scale support vector machine learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA, 1998.
M.-Y. Kan and H. O. N. Thi. Fast webpage classification using url features. In Proceeding of the 14th ACM International Conference on Information and Knowledge Management (CIKM), pages 325--326, 2005. Poster abstract.
J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604--632, 1999.
R. Lempel and S. Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks, 33(1--6):387--401, 2000.
L. Li, Y. Shang, and W. Zhang. Improvement of HITS-based algorithms on web documents. In Proceedings of the 11th International World Wide Web Conference, pages 527--535. ACM Press, 2002.
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International Conference on the World Wide Web, Edinburgh, Scotland, May 2006.
Open Directory Project (ODP), 2007. http://www.dmoz.com/.
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Unpublished draft, 1998.
S. E. Robertson. Overview of the OKAPI projects. Journal of Documentation, 53:3--7, 1997.
G. Salton, editor. The SMART Retrieval System: Experiments in Automatic Document Retrieval. Prentice Hall, Englewood Cliffs, NJ, 1971.
C. J. Van Rijsbergen. Information Retrieval. Butterworths, London, 1979. 2nd edition.
B. Wu and B. D. Davison. Identifying link farm spam pages. In Proceedings of the 14th International World Wide Web Conference, pages 820--829, Chiba, Japan, May 2005.

Index Terms

Measuring similarity to detect qualified links
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information retrieval

Published in
AIRWeb '07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
May 2007
98 pages
ISBN:9781595937322
DOI:10.1145/1244408
Conference Chairs:
Carlos Castillo (Yahoo! Research)
Kumar Chellapilla (Microsoft Live Labs)
Brian D. Davison (Lehigh University)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
link analysis
link classification
web search engine
web spam
Qualifiers
- Article
Conference