ABSTRACT
To exploit the similarity information hidden in the hyperlink structure of the web, this paper introduces algorithms scalable to graphs with billions of vertices on a distributed architecture. The similarity of multi-step neighborhoods of vertices are numerically evaluated by similarity functions including SimRank [20], a recursive refinement of cocitation; PSimRank, a novel variant with better theoretical characteristics; and the Jaccard coefficient, extended to multi-step neighborhoods. Our methods are presented in a general framework of Monte Carlo similarity search algorithms that precompute an index database of random fingerprints, and at query time, similarities are estimated from the fingerprints. The performance and quality of the methods were tested on the Stanford Webbase [19] graph of 80M pages by comparing our scores to similarities extracted from the ODP directory [26]. Our experimental results suggest that the hyperlink structure of vertices within four to five steps provide more adequate information for similarity search than single-step neighborhoods.
- R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999. Google ScholarDigital Library
- Z. Bar-Yossef, A. Berg, S. Chien, J. Fakcharoenphol, and D. Weitz. Approximating aggregate queries about web pages via random walks. Proc. of VLDB, 2000. Google ScholarDigital Library
- Z. Bar-Yossef, A. Z. Broder, R. Kumar, and A. Tomkins. Sic transit gloria telae: towards an understanding of the web's decay. Proc. of WWW13, 2004. Google ScholarDigital Library
- A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Proc. of WWW9, 2000. Google ScholarDigital Library
- A. Z. Broder. On the resemblance and containment of documents. Proc. of Compression and Complexity of Sequences (SEQUENCES'97), pages 21--29. IEEE Computer Society, 1997. Google ScholarDigital Library
- A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60(3):630--659, 2000. Google ScholarDigital Library
- A. Z. Broder, S. C. Glassman, M. S. Manasse, G. Zweig. Syntactic clustering of the Web. Proc. of WWW6, 1997. Google ScholarDigital Library
- S. Chakrabarti, B. E. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. Proc. of SIGMOD, 1998. Google ScholarDigital Library
- Y.-Y. Chen, Q. Gan, and T. Suel. I/O-efficient techniques for computing PageRank. Proc. of CIKM, 2002. Google ScholarDigital Library
- E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci., 55(3):441--453, 1997. Google ScholarDigital Library
- J. Dean and M. R. Henzinger. Finding related pages in the World Wide Web. Computer Networks (Amsterdam, Netherlands: 1999), 31(11--16):1467--1479, 1999. Google ScholarDigital Library
- D. Fogaras and B. Rácz. A scalable randomized method to compute link-based similarity rank on the web graph. Proc. of Clustering Information over the Web (ClustWeb), in conjunction with EDBT, 2004. Google ScholarDigital Library
- D. Fogaras and B. Rácz. Scaling link-based similarity search. Technical report, MTA SZTAKI, 2004. http://www.ilab.sztaki.hu/websearch/Publications/.Google Scholar
- D. Fogaras and B. Rácz. Towards scaling fully personalized PageRank. Proc. of Third Workshop on Algorithms and Models for the Web-Graph (WAW) in conjunction with FOCS, 2004.Google ScholarCross Ref
- T. H. Haveliwala. Efficient computation of PageRank. Technical Report 1999-31, Stanford University, 1999.Google Scholar
- T. H. Haveliwala, A. Gionis, D. Klein, and P. Indyk. Evaluating strategies for similarity search on the web. Proc. of WWW11, 2002. Google ScholarDigital Library
- T. H. Haveliwala, S. Kamvar, and G. Jeh. An analytical comparison of approaches to personalizing PageRank. Technical report, Stanford University, 2003.Google Scholar
- M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near-uniform url sampling. Proc. of WWW9, 2000. Google ScholarDigital Library
- J. Hirai, S. Raghavan, H. Garcia-Molina, and A. Paepcke. Webbase: a repository of web pages. Proc. of WWW9, 2000. Google ScholarDigital Library
- G. Jeh and J. Widom. SimRank: A measure of structural-context similarity. Proc. of SIGKDD, 2002. Google ScholarDigital Library
- J. Kleinberg. Authoritative sources in a hyperlinked environment. J. of the ACM, 46(5):604--632, 1999. Google ScholarDigital Library
- D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. Proc. of CIKM, pages 556--559. ACM Press, 2003. Google ScholarDigital Library
- W. Lu, J. Janssen, E. Milios, and N. Japkowicz. Node similarity in networked information spaces. Proc. of 2001 conference of the Centre for Advanced Studies on Collaborative research, page~11. IBM Press, 2001. Google ScholarDigital Library
- U. Meyer, P. Sanders, and J. Sibeyn. Algorithms for Memory Hierarchies, Advanced Lectures. LNCS, Springer, 2003. Google ScholarDigital Library
- R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995. Google ScholarDigital Library
- Open Directory Project (ODP). http://www.dmoz.org.Google Scholar
- L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.Google Scholar
- C. R. Palmer, P. B. Gibbons, and C. Faloutsos. ANF: a fast and scalable tool for data mining in massive graphs. Proc. of SIGKDD, 2002. Google ScholarDigital Library
- P. Rusmevichientong, D. M. Pennock, S. Lawrence, and C. L. Giles. Methods for sampling pages uniformly from the world wide web. Proc. of AAAI Fall Symposium on Using Uncertainty Within Computation, pages 121--128, 2001.Google Scholar
Index Terms
- Scaling link-based similarity search
Recommendations
Efficient link-based similarity search in web networks
The pre-computation cost in the off-line stage is significantly reduced.The efficiency of query processing is optimized by proposing a pruning algorithm.The accuracy loss of pruning algorithm is controlled by tuning threshold.The effectiveness of ...
Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge Graphs
SimRank is an attractive link-based similarity measure used in fertile fields of Web search and sociometry. However, the existing deterministic method by Kusumoto et al. [24] for retrieving SimRank does not always produce high-quality similarity results, ...
Effective Similarity Search on Indoor Moving-Object Trajectories
DASFAA 2016: Proceedings, Part II, of the 21st International Conference on Database Systems for Advanced Applications - Volume 9643In this paper, we propose a new approach to measuring the similarity among indoor moving-object trajectories. Particularly, we propose to measure indoor trajectory similarity based on spatial similarity and semantic pattern similarity. For spatial ...
Comments