DOI: 10.1145/1244408.1244418
Article

Measuring similarity to detect qualified links

Published: 08 May 2007

ABSTRACT

The early success of link-based ranking algorithms was predicated on the assumption that links imply merit of the target pages. However, today many links exist for purposes other than to confer authority. Such links bring noise into link analysis and harm the quality of retrieval. To provide high-quality search results, it is important to detect them and reduce their influence. In this paper, a method is proposed to detect such links by considering multiple similarity measures over the source pages and target pages. With the help of a classifier, these noisy links are detected and dropped. Link analysis algorithms are then run on the reduced link graph. The usefulness of a number of features is also tested. Experiments across 53 query-specific datasets show that our approach almost doubles the performance of Kleinberg's HITS and boosts Bharat and Henzinger's imp algorithm by close to 9% in terms of precision. It also outperforms a previous approach focusing on link farm detection.
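The pipeline outlined in the abstract (score each link by source/target similarity, drop the links judged noisy, then run link analysis on the reduced graph) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the single bag-of-words cosine measure stands in for the paper's multiple similarity measures, the fixed `threshold` stands in for its trained classifier, and the page texts, link list, and function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def filter_links(links, text, threshold=0.2):
    """Keep links whose source and target pages look similar enough.

    Assumption: a fixed threshold on one similarity score replaces the
    classifier over several similarity features described in the abstract.
    """
    return [(s, t) for s, t in links if cosine(text[s], text[t]) >= threshold]


def hits(links, iterations=50):
    """Plain HITS by power iteration; returns (hub, authority) score dicts."""
    nodes = {n for edge in links for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        auth = {n: sum(hub[s] for s, t in links if t == n) for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score of a page: sum of authority scores of the pages it links to.
        hub = {n: sum(auth[t] for s, t in links if s == n) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Hypothetical toy data: page "c" is off-topic and links to "b" and "d".
text = {
    "a": "jaguar car engine review",
    "b": "jaguar car dealer prices",
    "c": "cheap pills casino bonus",
    "d": "jaguar car engine specs",
}
links = [("a", "b"), ("a", "d"), ("b", "d"), ("c", "d"), ("c", "b")]

kept = filter_links(links, text)   # links out of "c" share no terms and are dropped
hub, auth = hits(kept)             # link analysis on the reduced graph
print(kept)
print(max(auth, key=auth.get))
```

In this toy graph the two links from the off-topic page no longer contribute to any authority score, because they are removed before HITS runs. A real system would replace the threshold with a classifier trained on labeled links and would combine several similarity features rather than one.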

References

  1. A. A. Benczur, I. Biro, K. Csalogany, and M. Uher. Detecting nepotistic links by language model disagreement. In Proceedings of the 15th WWW conference, pages 939--940, New York, NY, USA, 2006. ACM Press.
  2. K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in hyperlinked environments. In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104--111, Aug. 1998.
  3. S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh International World Wide Web Conference, pages 107--117, Brisbane, Australia, Apr. 1998.
  4. D. Cai, X. He, J.-R. Wen, and W.-Y. Ma. Block-level link analysis. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, July 2004.
  5. S. Chakrabarti, B. E. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. M. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the Seventh International World Wide Web Conference, pages 65--74, Brisbane, Australia, Apr. 1998.
  6. J. Cho, H. Garcia-Molina, T. Haveliwala, W. Lam, A. Paepcke, S. Raghavan, and G. Wesley. Stanford WebBase components and applications. ACM Transactions on Internet Technology, 6(2):153--186, 2006.
  7. A. L. da Costa Carvalho, P. A. Chirita, E. S. de Moura, P. Calado, and W. Nejdl. Site level noise removal for search engines. In Proceedings of the 15th WWW conference, pages 73--82, New York, NY, USA, 2006. ACM Press.
  8. B. D. Davison. Recognizing nepotistic links on the Web. In Artificial Intelligence for Web Search, pages 23--28. AAAI Press, July 2000. Presented at the AAAI-2000 workshop on Artificial Intelligence for Web Search, Technical Report WS-00-01.
  9. I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: learning to identify link spam. In Proceeding of the ECML, 2005.
  10. D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of WebDB, pages 1--6, June 2004.
  11. Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), Toronto, Canada, 2004.
  12. K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, 2000.
  13. T. Joachims. Making large-scale support vector machine learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA, 1998.
  14. M.-Y. Kan and H. O. N. Thi. Fast webpage classification using url features. In Proceeding of the 14th ACM International Conference on Information and Knowledge Management (CIKM), pages 325--326, 2005. Poster abstract.
  15. J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604--632, 1999.
  16. R. Lempel and S. Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks, 33(1--6):387--401, 2000.
  17. L. Li, Y. Shang, and W. Zhang. Improvement of HITS-based algorithms on web documents. In Proceedings of the 11th International World Wide Web Conference, pages 527--535. ACM Press, 2002.
  18. A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
  19. A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International Conference on the World Wide Web, Edinburgh, Scotland, May 2006.
  20. Open Directory Project (ODP), 2007. http://www.dmoz.com/.
  21. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Unpublished draft, 1998.
  22. S. E. Robertson. Overview of the OKAPI projects. Journal of Documentation, 53:3--7, 1997.
  23. G. Salton, editor. The SMART Retrieval System: Experiments in Automatic Document Retrieval. Prentice Hall, Englewood Cliffs, NJ, 1971.
  24. C. J. Van Rijsbergen. Information Retrieval. Butterworths, London, 1979. 2nd edition.
  25. B. Wu and B. D. Davison. Identifying link farm spam pages. In Proceedings of the 14th International World Wide Web Conference, pages 820--829, Chiba, Japan, May 2005.

          Published in

          AIRWeb '07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
          May 2007
          98 pages
          ISBN: 9781595937322
          DOI: 10.1145/1244408

          Copyright © 2007 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States
