Article

To randomize or not to randomize: space optimal summaries for hyperlink analysis

Authors:
Tamás Sarlós, Hungarian Academy of Sciences (MTA SZTAKI) and Eötvös University, Budapest
András A. Benczúr, Hungarian Academy of Sciences (MTA SZTAKI) and Eötvös University, Budapest
Károly Csalogány, Hungarian Academy of Sciences (MTA SZTAKI) and Eötvös University, Budapest
Dániel Fogaras, Hungarian Academy of Sciences (MTA SZTAKI) and Budapest University of Technology and Economics
Balázs Rácz, Hungarian Academy of Sciences (MTA SZTAKI) and Budapest University of Technology and Economics

WWW '06: Proceedings of the 15th international conference on World Wide Web, May 2006, Pages 297–306, https://doi.org/10.1145/1135777.1135823

Published: 23 May 2006

ABSTRACT

Personalized PageRank expresses link-based page quality around user-selected pages. The only previous personalized PageRank algorithm that can serve on-line queries for an unrestricted choice of pages on large graphs is our Monte Carlo algorithm [WAW 2004]. In this paper we achieve unrestricted personalization by combining rounding and randomized sketching techniques in the dynamic programming algorithm of Jeh and Widom [WWW 2003]. We evaluate the precision of the approximation experimentally on large-scale real-world data and find a significant improvement over previous results. As a key theoretical contribution, we show that our algorithms use an optimal amount of space, and in doing so we also improve earlier asymptotic worst-case lower bounds. Our lower bounds and algorithms apply to SimRank as well; of independent interest is the reduction of the SimRank computation to personalized PageRank.
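
The idea of rounding inside a dynamic programming computation can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the decomposition recurrence of Jeh and Widom, PPV(u) = c·e_u + (1 − c)·(average of PPV(v) over the out-neighbours v of u), and after every iteration it rounds each score down to a multiple of a hypothetical precision parameter eps so that small entries disappear and the stored vectors stay sparse. The names rounded_ppr, out_links, c, eps and iterations are illustrative assumptions, and the sketch requires every node to have at least one out-link.

```python
from math import floor


def rounded_ppr(out_links, c=0.15, eps=1e-4, iterations=100):
    """Approximate personalized PageRank vectors for all nodes.

    out_links: dict mapping each node to a non-empty list of out-neighbours
               (every node must appear as a key).
    c:         teleport probability (assumed value, not from the paper).
    eps:       rounding granularity (hypothetical parameter).
    Returns a dict: node u -> sparse dict {target: approximate score}.
    """
    # Start from empty (all-zero) sparse vectors.
    ppv = {u: {} for u in out_links}
    for _ in range(iterations):
        nxt = {}
        for u, succ in out_links.items():
            # Teleport term c * e_u of the recurrence.
            vec = {u: c}
            weight = (1.0 - c) / len(succ)
            # Average the previous vectors of the out-neighbours.
            for v in succ:
                for target, score in ppv[v].items():
                    vec[target] = vec.get(target, 0.0) + weight * score
            # Round down to a multiple of eps and drop entries below eps.
            nxt[u] = {t: floor(s / eps) * eps for t, s in vec.items() if s >= eps}
        ppv = nxt
    return ppv
```

For example, rounded_ppr({"a": ["b"], "b": ["a"]}) returns two sparse vectors, one personalized to each node of a two-node cycle. In this sketch a larger eps gives smaller vectors at the cost of a larger rounding error, which is the kind of space versus precision trade-off the abstract refers to.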

References

Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Sampling algorithms: Lower bounds and applications. Proc of 33rd STOC, 2001.
B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422--426, 1970.
P. Boldi and S. Vigna. The webgraph framework I: Compression techniques. Proc of 13th WWW, pp. 595--602, 2004.
A. Z. Broder and M. Mitzenmacher. Network applications of Bloom filters: A survey. Internet Mathematics, 1(4):485--509, 2005.
M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. Proc of 29th ICALP, pp. 693--703, 2002.
Y.-Y. Chen, Q. Gan, and T. Suel. Local methods for estimating PageRank values. Proc of 12th CIKM, pp. 381--389, 2004.
G. Cormode and S. Muthukrishnan. An improved data stream summary: The Count-Min sketch and its applications. Journal of Algorithms, 55(1):58--75, 2005.
G. Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. Proc of 5th SIAM Intl. Conf. on Data Mining, 2005.
R. Fagin, R. Kumar, M. Mahdian, D. Sivakumar, and E. Vee. Comparing and aggregating rankings with ties. Proc of 23rd PODS, 2004.
D. Fogaras. Where to start browsing the web? Proc of 3rd I2CS, Springer LNCS vol. 2877, pp. 65--79, 2003.
D. Fogaras and B. Rácz. Towards scaling fully personalized PageRank. Proc of 3rd WAW, pp. 105--117, 2004. Full version to appear in Internet Mathematics.
D. Fogaras and B. Rácz. Scaling link-based similarity search. Proc of 14th WWW, pp. 641--650, 2005. Full version available at www.ilab.sztaki.hu/websearch/Publications/.
T. H. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering, 15(4):784--796, 2003.
M. R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. In External Memory Algorithms, DIMACS Book Series vol. 50, pp. 107--118. American Mathematical Society, 1999.
J. Hirai, S. Raghavan, H. Garcia-Molina, and A. Paepcke. WebBase: A repository of web pages. Proc of 9th WWW, pp. 277--293, 2000.
G. Jeh and J. Widom. SimRank: A measure of structural-context similarity. Proc of 8th SIGKDD, pp. 538--543, 2002.
G. Jeh and J. Widom. Scaling personalized web search. Proc of 12th WWW, pp. 271--279, 2003.
S. Kamvar, T. H. Haveliwala, C. Manning, and G. Golub. Exploiting the block structure of the web for computing PageRank. Technical Report 2003-17, Stanford University, 2003.
M. G. Kendall. Rank Correlation Methods. Hafner Publishing Co., New York, 1955.
J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604--632, 1999.
E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, 1997.
F. McSherry. A uniform approach to accelerated PageRank computation. Proc of 14th WWW, pp. 575--582, 2005.
S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Comp. Sci., 1(2), 2005.
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford University, 1998.
C. R. Palmer, P. B. Gibbons, and C. Faloutsos. ANF: A fast and scalable tool for data mining in massive graphs. Proc of 8th SIGKDD, pp. 81--90, 2002.
P. K. C. Singitham, M. S. Mahabhashyam, and P. Raghavan. Efficiency-quality tradeoffs for vector score aggregation. Proc of 30th VLDB, pp. 624--635, 2004.
J. S. Vitter. External memory algorithms and data structures: Dealing with massive data. ACM Computing Surveys, 33(2):209--271, 2001.

Index Terms

1. Information systems
  1. Information retrieval
2. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory
      1. Graph algorithms
  2. Probability and statistics
    1. Probabilistic algorithms
    2. Probabilistic reasoning algorithms
      1. Markov-chain Monte Carlo methods
      2. Sequential Monte Carlo methods

Published in
WWW '06: Proceedings of the 15th international conference on World Wide Web
May 2006
1102 pages
ISBN: 1595933239
DOI: 10.1145/1135777
General Chairs: Leslie Carr (University of Southampton), David De Roure (University of Southampton), Arun Iyengar (IBM Research)
Program Chairs: Carole Goble (University of Manchester, UK), Mike Dahlin (University of Texas at Austin)
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data streams
link-analysis
scalability
similarity search
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
Article Metrics
- Total Citations: 66
- Total Downloads: 605
- Downloads (Last 12 months): 7
- Downloads (Last 6 weeks): 1