ABSTRACT
The early success of link-based ranking algorithms was predicated on the assumption that links imply merit of the target pages. However, today many links exist for purposes other than to confer authority. Such links bring noise into link analysis and harm the quality of retrieval. To provide high-quality search results, it is important to detect these links and reduce their influence. In this paper, we propose a method to detect such links by considering multiple similarity measures over the source pages and target pages. With the help of a classifier, these noisy links are detected and dropped. Link analysis algorithms are then run on the reduced link graph. The usefulness of a number of features is also tested. Experiments across 53 query-specific datasets show that our approach almost doubles the performance of Kleinberg's HITS and boosts Bharat and Henzinger's imp algorithm by close to 9% in terms of precision. It also outperforms a previous approach focusing on link farm detection.
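The pipeline described in the abstract (score each link by source/target similarity, drop low-scoring links, then run link analysis on what remains) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a single bag-of-words cosine similarity in place of the paper's multiple similarity measures, and a fixed cutoff (the hypothetical `threshold` parameter) in place of the trained classifier. The function names, page texts, and links are all invented for the example.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_links(links, page_text, threshold=0.1):
    """Keep a link only if its source and target pages look similar.

    ASSUMPTION: a fixed threshold stands in for the paper's classifier.
    """
    return [
        (src, dst)
        for src, dst in links
        if cosine_similarity(page_text[src], page_text[dst]) >= threshold
    ]


def normalize(scores):
    """Scale a score dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(links, iterations=50):
    """Basic HITS: alternate authority and hub updates on the link graph."""
    nodes = {node for link in links for node in link}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of pages linking to it.
        auth = {node: 0.0 for node in nodes}
        for src, dst in links:
            auth[dst] += hub[src]
        auth = normalize(auth)
        # Hub score of a page: sum of authority scores of pages it links to.
        hub = {node: 0.0 for node in nodes}
        for src, dst in links:
            hub[src] += auth[dst]
        hub = normalize(hub)
    return hub, auth


# Toy data (invented): page "c" is off-topic relative to its link target.
page_text = {
    "a": "python web search ranking tutorial",
    "b": "web search ranking with link analysis",
    "c": "cheap pills casino bonus",
    "d": "link analysis for web search",
}
links = [("a", "b"), ("c", "b"), ("d", "b"), ("a", "d")]

kept = filter_links(links, page_text)
hub, auth = hits(kept)
print(kept)
print(sorted(auth, key=auth.get, reverse=True))
```

In this toy graph the link from the off-topic page `c` shares no terms with its target and is dropped before HITS runs, so it contributes nothing to the authority score of `b`. A classifier over several similarity features, as the abstract describes, would replace the single threshold test in `filter_links`.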
References
- A. A. Benczur, I. Biro, K. Csalogany, and M. Uher. Detecting nepotistic links by language model disagreement. In Proceedings of the 15th WWW conference, pages 939--940, New York, NY, USA, 2006. ACM Press.
- K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in hyperlinked environments. In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104--111, Aug. 1998.
- S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh International World Wide Web Conference, pages 107--117, Brisbane, Australia, Apr. 1998.
- D. Cai, X. He, J.-R. Wen, and W.-Y. Ma. Block-level link analysis. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, July 2004.
- S. Chakrabarti, B. E. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. M. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the Seventh International World Wide Web Conference, pages 65--74, Brisbane, Australia, Apr. 1998.
- J. Cho, H. Garcia-Molina, T. Haveliwala, W. Lam, A. Paepcke, S. Raghavan, and G. Wesley. Stanford WebBase components and applications. ACM Transactions on Internet Technology, 6(2): 153--186, 2006.
- A. L. da Costa Carvalho, P. A. Chirita, E. S. de Moura, P. Calado, and W. Nejdl. Site level noise removal for search engines. In Proceedings of the 15th WWW conference, pages 73--82, New York, NY, USA, 2006. ACM Press.
- B. D. Davison. Recognizing nepotistic links on the Web. In Artificial Intelligence for Web Search, pages 23--28. AAAI Press, July 2000. Presented at the AAAI-2000 workshop on Artificial Intelligence for Web Search, Technical Report WS-00-01.
- I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: learning to identify link spam. In Proceeding of the ECML, 2005.
- D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of WebDB, pages 1--6, June 2004.
- Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), Toronto, Canada, 2004.
- K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, 2000.
- T. Joachims. Making large-scale support vector machine learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA, 1998.
- M.-Y. Kan and H. O. N. Thi. Fast webpage classification using url features. In Proceeding of the 14th ACM International Conference on Information and Knowledge Management (CIKM), pages 325--326, 2005. Poster abstract.
- J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604--632, 1999.
- R. Lempel and S. Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks, 33(1--6):387--401, 2000.
- L. Li, Y. Shang, and W. Zhang. Improvement of HITS-based algorithms on web documents. In Proceedings of the 11th International World Wide Web Conference, pages 527--535. ACM Press, 2002.
- A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
- A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International Conference on the World Wide Web, Edinburgh, Scotland, May 2006.
- Open Directory Project (ODP), 2007. http://www.dmoz.com/.
- L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Unpublished draft, 1998.
- S. E. Robertson. Overview of the OKAPI projects. Journal of Documentation, 53:3--7, 1997.
- G. Salton, editor. The SMART Retrieval System: Experiments in Automatic Document Retrieval. Prentice Hall, Englewood Cliffs, NJ, 1971.
- C. J. Van Rijsbergen. Information Retrieval. Butterworths, London, 1979. 2nd edition.
- B. Wu and B. D. Davison. Identifying link farm spam pages. In Proceedings of the 14th International World Wide Web Conference, pages 820--829, Chiba, Japan, May 2005.
Index Terms
- Measuring similarity to detect qualified links