skip to main content
10.1145/1060745.1060839acmconferencesArticle/Chapter ViewAbstractPublication PageswwwConference Proceedingsconference-collections
Article

Scaling link-based similarity search

Authors Info & Claims
Published:10 May 2005Publication History

ABSTRACT

To exploit the similarity information hidden in the hyperlink structure of the web, this paper introduces algorithms scalable to graphs with billions of vertices on a distributed architecture. The similarity of multi-step neighborhoods of vertices are numerically evaluated by similarity functions including SimRank [20], a recursive refinement of cocitation; PSimRank, a novel variant with better theoretical characteristics; and the Jaccard coefficient, extended to multi-step neighborhoods. Our methods are presented in a general framework of Monte Carlo similarity search algorithms that precompute an index database of random fingerprints, and at query time, similarities are estimated from the fingerprints. The performance and quality of the methods were tested on the Stanford Webbase [19] graph of 80M pages by comparing our scores to similarities extracted from the ODP directory [26]. Our experimental results suggest that the hyperlink structure of vertices within four to five steps provide more adequate information for similarity search than single-step neighborhoods.

References

  1. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Z. Bar-Yossef, A. Berg, S. Chien, J. Fakcharoenphol, and D. Weitz. Approximating aggregate queries about web pages via random walks. Proc. of VLDB, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Z. Bar-Yossef, A. Z. Broder, R. Kumar, and A. Tomkins. Sic transit gloria telae: towards an understanding of the web's decay. Proc. of WWW13, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Proc. of WWW9, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. Z. Broder. On the resemblance and containment of documents. Proc. of Compression and Complexity of Sequences (SEQUENCES'97), pages 21--29. IEEE Computer Society, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60(3):630--659, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Z. Broder, S. C. Glassman, M. S. Manasse, G. Zweig. Syntactic clustering of the Web. Proc. of WWW6, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. S. Chakrabarti, B. E. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. Proc. of SIGMOD, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Y.-Y. Chen, Q. Gan, and T. Suel. I/O-efficient techniques for computing PageRank. Proc. of CIKM, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci., 55(3):441--453, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Dean and M. R. Henzinger. Finding related pages in the World Wide Web. Computer Networks (Amsterdam, Netherlands: 1999), 31(11--16):1467--1479, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. D. Fogaras and B. Rácz. A scalable randomized method to compute link-based similarity rank on the web graph. Proc. of Clustering Information over the Web (ClustWeb), in conjunction with EDBT, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. Fogaras and B. Rácz. Scaling link-based similarity search. Technical report, MTA SZTAKI, 2004. http://www.ilab.sztaki.hu/websearch/Publications/.Google ScholarGoogle Scholar
  14. D. Fogaras and B. Rácz. Towards scaling fully personalized PageRank. Proc. of Third Workshop on Algorithms and Models for the Web-Graph (WAW) in conjunction with FOCS, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  15. T. H. Haveliwala. Efficient computation of PageRank. Technical Report 1999-31, Stanford University, 1999.Google ScholarGoogle Scholar
  16. T. H. Haveliwala, A. Gionis, D. Klein, and P. Indyk. Evaluating strategies for similarity search on the web. Proc. of WWW11, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. T. H. Haveliwala, S. Kamvar, and G. Jeh. An analytical comparison of approaches to personalizing PageRank. Technical report, Stanford University, 2003.Google ScholarGoogle Scholar
  18. M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near-uniform url sampling. Proc. of WWW9, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. J. Hirai, S. Raghavan, H. Garcia-Molina, and A. Paepcke. Webbase: a repository of web pages. Proc. of WWW9, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. G. Jeh and J. Widom. SimRank: A measure of structural-context similarity. Proc. of SIGKDD, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. Kleinberg. Authoritative sources in a hyperlinked environment. J. of the ACM, 46(5):604--632, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. Proc. of CIKM, pages 556--559. ACM Press, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. W. Lu, J. Janssen, E. Milios, and N. Japkowicz. Node similarity in networked information spaces. Proc. of 2001 conference of the Centre for Advanced Studies on Collaborative research, page~11. IBM Press, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. U. Meyer, P. Sanders, and J. Sibeyn. Algorithms for Memory Hierarchies, Advanced Lectures. LNCS, Springer, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Open Directory Project (ODP). http://www.dmoz.org.Google ScholarGoogle Scholar
  27. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.Google ScholarGoogle Scholar
  28. C. R. Palmer, P. B. Gibbons, and C. Faloutsos. ANF: a fast and scalable tool for data mining in massive graphs. Proc. of SIGKDD, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. P. Rusmevichientong, D. M. Pennock, S. Lawrence, and C. L. Giles. Methods for sampling pages uniformly from the world wide web. Proc. of AAAI Fall Symposium on Using Uncertainty Within Computation, pages 121--128, 2001.Google ScholarGoogle Scholar

Index Terms

  1. Scaling link-based similarity search

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in
            • Published in

              cover image ACM Conferences
              WWW '05: Proceedings of the 14th international conference on World Wide Web
              May 2005
              781 pages
              ISBN:1595930469
              DOI:10.1145/1060745

              Copyright © 2005 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 10 May 2005

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • Article

              Acceptance Rates

              Overall Acceptance Rate1,899of8,196submissions,23%

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader