research-article

Fast nearest-neighbor search in disk-resident graphs

Authors:
Purnamrita Sarkar

Carnegie Mellon University, Pittsburgh, USA

Carnegie Mellon University, Pittsburgh, USA
View Profile

,
Andrew W. Moore

Google inc, Pittsburgh, USA

Google inc, Pittsburgh, USA
View Profile

KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data miningJuly 2010Pages 513–522https://doi.org/10.1145/1835804.1835871

Published:25 July 2010Publication History

KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 513–522

ABSTRACT

Link prediction, personalized graph search, fraud detection, and many such graph mining problems revolve around the computation of the most "similar" k nodes to a given query node. One widely used class of similarity measures is based on random walks on graphs, e.g., personalized pagerank, hitting and commute times, and simrank. There are two fundamental problems associated with these measures. First, existing online algorithms typically examine the local neighborhood of the query node which can become significantly slower whenever high-degree nodes are encountered (a common phenomenon in real-world graphs). We prove that turning high degree nodes into sinks results in only a small approximation error, while greatly improving running times. The second problem is that of computing similarities at query time when the graph is too large to be memory-resident. The obvious solution is to split the graph into clusters of nodes and store each cluster on a disk page; ideally random walks will rarely cross cluster boundaries and cause page-faults. Our contributions here are twofold: (a) we present an efficient deterministic algorithm to find the k closest neighbors (in terms of personalized pagerank) of any query node in such a clustered graph, and (b) we develop a clustering algorithm (RWDISK) that uses only sequential sweeps over data files. Empirical results on several large publicly available graphs like DBLP, Citeseer and Live-Journal (~ 90 M edges) demonstrate that turning high degree nodes into sinks not only improves running time of RWDISK by a factor of 3 but also boosts link prediction accuracy by a factor of 4 on average. We also show that RWDISK returns more desirable (high conductance and small size) clusters than the popular clustering algorithm METIS, while requiring much less memory. Finally our deterministic algorithm for computing nearest neighbors incurs far fewer page-faults (factor of 5) than actually simulating random walks.

Supplemental Material

kdd2010_sarkar_fnnsdrg_01.mov

mov

144.8 MB

Download

References

D. Aldous and J. A. Fill. Reversible Markov Chains. 2001.Google Scholar
R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. In FOCS, 2006. Google ScholarDigital Library
A. Balmin, V. Hristidis, and Y. Papakonstantinou. ObjectRank: Authority-based keyword search in databases. In VLDB, 2004. Google ScholarDigital Library
P. Berkhin. Bookmark-Coloring Algorithm for Personalized PageRank Computing. Internet Mathematics, 2006.Google ScholarCross Ref
M. Brand. A Random Walks Perspective on Maximizing Satisfaction and Profit. In SIAM '05, 2005.Google ScholarCross Ref
S. Chakrabarti. Dynamic personalized pagerank in entity-relation graphs. In WWW '07, New York, NY, USA. Google ScholarDigital Library
S. Chakrabarti, J. Mirchandani, and A. Nandi. Spin: searching personal information networks. In SIGIR '05. Google ScholarDigital Library
F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. On compressing social networks. In KDD '09. Google ScholarDigital Library
B. B. Dalvi, M. Kshirsagar, and S. Sudarshan. Keyword search on external memory data graphs. Proceedings of VLDB Endowment, 1(1):1189--1204, 2008. Google ScholarDigital Library
D. Fogaras, B. Rcz, K. Csalogany, and T. Sarlos. Towards scaling fully personalized pagerank: Algorithms, lower bounds, and experiments. Internet Mathematics, 2004.Google Scholar
F. C. Graham and W. Zhao. Pagerank and random walks on graphs. In Fete of Combinatorics".Google Scholar
Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. In VLDB, pages 576{587, 2004. Google ScholarDigital Library
J. Hopcroft and D. Sheldon. Manipulation-resistant reputations using hitting time. Technical report, Cornell University, 2007.Google Scholar
G. Jeh and J. Widom. Scaling personalized web search. In Stanford University Technical Report, 2002.Google Scholar
G. Jeh and J. Widom. Simrank: A measure if structural-context similarity. In ACM SIGKDD, 2002. Google ScholarDigital Library
A. Joshi, R. Kumar, B. Reed, and A. Tomkins. Anchor-based proximity measures. In WWW '07. Google ScholarDigital Library
G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20(1):359--392, 1998. Google ScholarDigital Library
D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In CIKM '03, 2003. Google ScholarDigital Library
Q. Mei, D. Zhou, and K. Church. Query suggestion using hitting time. In CIKM '08. Google ScholarDigital Library
H. Qiu and E. R. Hancock. Image segmentation using commute times. In Proc. BMVC, 2005.Google ScholarCross Ref
H. Qiu and E. R. Hancock. Robust multi-body motion tracking using commute time clustering. In ECCV'06, 2006. Google ScholarDigital Library
P. Sarkar and A. Moore. Fast nearest-neighbor search in disk-resident graphs. Technical report:10-005, Machine Learning Department, Carnegie Mellon University, 2010.Google Scholar
T. Sarlos, A. A. Benczur, K. Csalogany, D. Fogaras, and B. Racz. To randomize or not to randomize: space optimal summaries for hyperlink analysis. In WWW, 2006. Google ScholarDigital Library
A. D. Sarma, S. Gollapudi, and R. Panigrahy. Estimating pagerank on graph streams. In PODS, 2008. Google ScholarDigital Library
D. Spielman and S. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the STOC'04. Google ScholarDigital Library
X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In ICML, volume 20, 2003.Google Scholar

Index Terms

Fast nearest-neighbor search in disk-resident graphs
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Recommendations

Random Walks on Regular and Irregular Graphs

For an undirected graph and an optimal cyclic list of all its vertices, the cyclic cover time is the expected time it takes a simple random walk to travel from vertex to vertex along the list until it completes a full cycle. The main result of this ...
Read More
Fast edge searching and fast searching on graphs

Given a graph G=(V,E) in which a fugitive hides on vertices or along edges, graph searching problems are usually to find the minimum number of searchers required to capture the fugitive. In this paper, we consider the problem of finding the minimum ...
Read More
A Characterization of $K_{2,4}$-Minor-Free Graphs

We provide a complete structural characterization of $K_{2,4}$-minor-free graphs. The 3-connected $K_{2,4}$-minor-free graphs consist of nine small graphs on at most eight vertices, together with a family of planar graphs that contains $2n-8$ ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
July 2010
1240 pages
ISBN:9781450300551
DOI:10.1145/1835804
General Chairs:
Bharat Rao
Siemens
,
Balaji Krishnapuram
Siemens
,
Program Chairs:
Andrew Tomkins
Google Inc.
,
Qiang Yang
Hong Kong University of Science and Technology
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 25 July 2010
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
external memory
graphs
link prediction
random walks
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate1,133of8,635submissions,13%
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 38
  Total Citations
  View Citations
- 776
  Total Downloads
- Downloads (Last 12 months)19
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Fast nearest-neighbor search in disk-resident graphs

KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

Random Walks on Regular and Irregular Graphs

Fast edge searching and fast searching on graphs

A Characterization of $K_{2,4}$-Minor-Free Graphs