Article

Scaling link-based similarity search

Authors:
Dániel Fogaras

Budapest University of Technology and Economics, Budapest, Hungary

Budapest University of Technology and Economics, Budapest, Hungary
View Profile

,
Balázs Rácz

Computer and Automation Research Institute of the Hungarian Academy of Sciences, Budapest, Hungary

Computer and Automation Research Institute of the Hungarian Academy of Sciences, Budapest, Hungary
View Profile

WWW '05: Proceedings of the 14th international conference on World Wide WebMay 2005Pages 641–650https://doi.org/10.1145/1060745.1060839

Published:10 May 2005Publication History

WWW '05: Proceedings of the 14th international conference on World Wide Web

Pages 641–650

ABSTRACT

To exploit the similarity information hidden in the hyperlink structure of the web, this paper introduces algorithms scalable to graphs with billions of vertices on a distributed architecture. The similarity of multi-step neighborhoods of vertices are numerically evaluated by similarity functions including SimRank [20], a recursive refinement of cocitation; PSimRank, a novel variant with better theoretical characteristics; and the Jaccard coefficient, extended to multi-step neighborhoods. Our methods are presented in a general framework of Monte Carlo similarity search algorithms that precompute an index database of random fingerprints, and at query time, similarities are estimated from the fingerprints. The performance and quality of the methods were tested on the Stanford Webbase [19] graph of 80M pages by comparing our scores to similarities extracted from the ODP directory [26]. Our experimental results suggest that the hyperlink structure of vertices within four to five steps provide more adequate information for similarity search than single-step neighborhoods.

References

R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999. Google ScholarDigital Library
Z. Bar-Yossef, A. Berg, S. Chien, J. Fakcharoenphol, and D. Weitz. Approximating aggregate queries about web pages via random walks. Proc. of VLDB, 2000. Google ScholarDigital Library
Z. Bar-Yossef, A. Z. Broder, R. Kumar, and A. Tomkins. Sic transit gloria telae: towards an understanding of the web's decay. Proc. of WWW13, 2004. Google ScholarDigital Library
A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Proc. of WWW9, 2000. Google ScholarDigital Library
A. Z. Broder. On the resemblance and containment of documents. Proc. of Compression and Complexity of Sequences (SEQUENCES'97), pages 21--29. IEEE Computer Society, 1997. Google ScholarDigital Library
A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60(3):630--659, 2000. Google ScholarDigital Library
A. Z. Broder, S. C. Glassman, M. S. Manasse, G. Zweig. Syntactic clustering of the Web. Proc. of WWW6, 1997. Google ScholarDigital Library
S. Chakrabarti, B. E. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. Proc. of SIGMOD, 1998. Google ScholarDigital Library
Y.-Y. Chen, Q. Gan, and T. Suel. I/O-efficient techniques for computing PageRank. Proc. of CIKM, 2002. Google ScholarDigital Library
E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci., 55(3):441--453, 1997. Google ScholarDigital Library
J. Dean and M. R. Henzinger. Finding related pages in the World Wide Web. Computer Networks (Amsterdam, Netherlands: 1999), 31(11--16):1467--1479, 1999. Google ScholarDigital Library
D. Fogaras and B. Rácz. A scalable randomized method to compute link-based similarity rank on the web graph. Proc. of Clustering Information over the Web (ClustWeb), in conjunction with EDBT, 2004. Google ScholarDigital Library
D. Fogaras and B. Rácz. Scaling link-based similarity search. Technical report, MTA SZTAKI, 2004. http://www.ilab.sztaki.hu/websearch/Publications/.Google Scholar
D. Fogaras and B. Rácz. Towards scaling fully personalized PageRank. Proc. of Third Workshop on Algorithms and Models for the Web-Graph (WAW) in conjunction with FOCS, 2004.Google ScholarCross Ref
T. H. Haveliwala. Efficient computation of PageRank. Technical Report 1999-31, Stanford University, 1999.Google Scholar
T. H. Haveliwala, A. Gionis, D. Klein, and P. Indyk. Evaluating strategies for similarity search on the web. Proc. of WWW11, 2002. Google ScholarDigital Library
T. H. Haveliwala, S. Kamvar, and G. Jeh. An analytical comparison of approaches to personalizing PageRank. Technical report, Stanford University, 2003.Google Scholar
M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near-uniform url sampling. Proc. of WWW9, 2000. Google ScholarDigital Library
J. Hirai, S. Raghavan, H. Garcia-Molina, and A. Paepcke. Webbase: a repository of web pages. Proc. of WWW9, 2000. Google ScholarDigital Library
G. Jeh and J. Widom. SimRank: A measure of structural-context similarity. Proc. of SIGKDD, 2002. Google ScholarDigital Library
J. Kleinberg. Authoritative sources in a hyperlinked environment. J. of the ACM, 46(5):604--632, 1999. Google ScholarDigital Library
D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. Proc. of CIKM, pages 556--559. ACM Press, 2003. Google ScholarDigital Library
W. Lu, J. Janssen, E. Milios, and N. Japkowicz. Node similarity in networked information spaces. Proc. of 2001 conference of the Centre for Advanced Studies on Collaborative research, page~11. IBM Press, 2001. Google ScholarDigital Library
U. Meyer, P. Sanders, and J. Sibeyn. Algorithms for Memory Hierarchies, Advanced Lectures. LNCS, Springer, 2003. Google ScholarDigital Library
R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995. Google ScholarDigital Library
Open Directory Project (ODP). http://www.dmoz.org.Google Scholar
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.Google Scholar
C. R. Palmer, P. B. Gibbons, and C. Faloutsos. ANF: a fast and scalable tool for data mining in massive graphs. Proc. of SIGKDD, 2002. Google ScholarDigital Library
P. Rusmevichientong, D. M. Pennock, S. Lawrence, and C. L. Giles. Methods for sampling pages uniformly from the world wide web. Proc. of AAAI Fall Symposium on Using Uncertainty Within Computation, pages 121--128, 2001.Google Scholar

Index Terms

Scaling link-based similarity search
1. Information systems
  1. Information retrieval
2. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory
      1. Graph algorithms
  2. Probability and statistics
    1. Probabilistic algorithms
    2. Probabilistic reasoning algorithms
      1. Markov-chain Monte Carlo methods
      2. Sequential Monte Carlo methods

Recommendations

Efficient link-based similarity search in web networks

The pre-computation cost in the off-line stage is significantly reduced.The efficiency of query processing is optimized by proposing a pruning algorithm.The accuracy loss of pruning algorithm is controlled by tuning threshold.The effectiveness of ...
Read More
Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge Graphs
SimRank is an attractive link-based similarity measure used in fertile fields of Web search and sociometry. However, the existing deterministic method by Kusumoto et al. [24] for retrieving SimRank does not always produce high-quality similarity results, ...
Read More
Effective Similarity Search on Indoor Moving-Object Trajectories
DASFAA 2016: Proceedings, Part II, of the 21st International Conference on Database Systems for Advanced Applications - Volume 9643

In this paper, we propose a new approach to measuring the similarity among indoor moving-object trajectories. Particularly, we propose to measure indoor trajectory similarity based on spatial similarity and semantic pattern similarity. For spatial ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
WWW '05: Proceedings of the 14th international conference on World Wide Web
May 2005
781 pages
ISBN:1595930469
DOI:10.1145/1060745
Conference Chairs:
Allan Ellis
Southern Cross University
,
Tatsuya Hagino
Keio University
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 10 May 2005
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
fingerprint
link-analysis
scalability
similarity search
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate1,899of8,196submissions,23%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 135
  Total Citations
  View Citations
- 1,197
  Total Downloads
- Downloads (Last 12 months)31
- Downloads (Last 6 weeks)4
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Scaling link-based similarity search

WWW '05: Proceedings of the 14th international conference on World Wide Web

ABSTRACT

References

Cited By

Index Terms

Recommendations

Efficient link-based similarity search in web networks

Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge Graphs

Effective Similarity Search on Indoor Moving-Object Trajectories

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Scaling link-based similarity search

WWW '05: Proceedings of the 14th international conference on World Wide Web

ABSTRACT

References

Cited By

Index Terms

Recommendations

Efficient link-based similarity search in web networks

Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge Graphs

Effective Similarity Search on Indoor Moving-Object Trajectories

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media