ABSTRACT
Link prediction, personalized graph search, fraud detection, and many such graph mining problems revolve around the computation of the most "similar" k nodes to a given query node. One widely used class of similarity measures is based on random walks on graphs, e.g., personalized pagerank, hitting and commute times, and simrank. There are two fundamental problems associated with these measures. First, existing online algorithms typically examine the local neighborhood of the query node which can become significantly slower whenever high-degree nodes are encountered (a common phenomenon in real-world graphs). We prove that turning high degree nodes into sinks results in only a small approximation error, while greatly improving running times. The second problem is that of computing similarities at query time when the graph is too large to be memory-resident. The obvious solution is to split the graph into clusters of nodes and store each cluster on a disk page; ideally random walks will rarely cross cluster boundaries and cause page-faults. Our contributions here are twofold: (a) we present an efficient deterministic algorithm to find the k closest neighbors (in terms of personalized pagerank) of any query node in such a clustered graph, and (b) we develop a clustering algorithm (RWDISK) that uses only sequential sweeps over data files. Empirical results on several large publicly available graphs like DBLP, Citeseer and Live-Journal (~ 90 M edges) demonstrate that turning high degree nodes into sinks not only improves running time of RWDISK by a factor of 3 but also boosts link prediction accuracy by a factor of 4 on average. We also show that RWDISK returns more desirable (high conductance and small size) clusters than the popular clustering algorithm METIS, while requiring much less memory. Finally our deterministic algorithm for computing nearest neighbors incurs far fewer page-faults (factor of 5) than actually simulating random walks.
Supplemental Material
- D. Aldous and J. A. Fill. Reversible Markov Chains. 2001.Google Scholar
- R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. In FOCS, 2006. Google ScholarDigital Library
- A. Balmin, V. Hristidis, and Y. Papakonstantinou. ObjectRank: Authority-based keyword search in databases. In VLDB, 2004. Google ScholarDigital Library
- P. Berkhin. Bookmark-Coloring Algorithm for Personalized PageRank Computing. Internet Mathematics, 2006.Google ScholarCross Ref
- M. Brand. A Random Walks Perspective on Maximizing Satisfaction and Profit. In SIAM '05, 2005.Google ScholarCross Ref
- S. Chakrabarti. Dynamic personalized pagerank in entity-relation graphs. In WWW '07, New York, NY, USA. Google ScholarDigital Library
- S. Chakrabarti, J. Mirchandani, and A. Nandi. Spin: searching personal information networks. In SIGIR '05. Google ScholarDigital Library
- F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. On compressing social networks. In KDD '09. Google ScholarDigital Library
- B. B. Dalvi, M. Kshirsagar, and S. Sudarshan. Keyword search on external memory data graphs. Proceedings of VLDB Endowment, 1(1):1189--1204, 2008. Google ScholarDigital Library
- D. Fogaras, B. Rcz, K. Csalogany, and T. Sarlos. Towards scaling fully personalized pagerank: Algorithms, lower bounds, and experiments. Internet Mathematics, 2004.Google Scholar
- F. C. Graham and W. Zhao. Pagerank and random walks on graphs. In Fete of Combinatorics".Google Scholar
- Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. In VLDB, pages 576{587, 2004. Google ScholarDigital Library
- J. Hopcroft and D. Sheldon. Manipulation-resistant reputations using hitting time. Technical report, Cornell University, 2007.Google Scholar
- G. Jeh and J. Widom. Scaling personalized web search. In Stanford University Technical Report, 2002.Google Scholar
- G. Jeh and J. Widom. Simrank: A measure if structural-context similarity. In ACM SIGKDD, 2002. Google ScholarDigital Library
- A. Joshi, R. Kumar, B. Reed, and A. Tomkins. Anchor-based proximity measures. In WWW '07. Google ScholarDigital Library
- G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20(1):359--392, 1998. Google ScholarDigital Library
- D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In CIKM '03, 2003. Google ScholarDigital Library
- Q. Mei, D. Zhou, and K. Church. Query suggestion using hitting time. In CIKM '08. Google ScholarDigital Library
- H. Qiu and E. R. Hancock. Image segmentation using commute times. In Proc. BMVC, 2005.Google ScholarCross Ref
- H. Qiu and E. R. Hancock. Robust multi-body motion tracking using commute time clustering. In ECCV'06, 2006. Google ScholarDigital Library
- P. Sarkar and A. Moore. Fast nearest-neighbor search in disk-resident graphs. Technical report:10-005, Machine Learning Department, Carnegie Mellon University, 2010.Google Scholar
- T. Sarlos, A. A. Benczur, K. Csalogany, D. Fogaras, and B. Racz. To randomize or not to randomize: space optimal summaries for hyperlink analysis. In WWW, 2006. Google ScholarDigital Library
- A. D. Sarma, S. Gollapudi, and R. Panigrahy. Estimating pagerank on graph streams. In PODS, 2008. Google ScholarDigital Library
- D. Spielman and S. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the STOC'04. Google ScholarDigital Library
- X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In ICML, volume 20, 2003.Google Scholar
Index Terms
- Fast nearest-neighbor search in disk-resident graphs
Recommendations
Random Walks on Regular and Irregular Graphs
For an undirected graph and an optimal cyclic list of all its vertices, the cyclic cover time is the expected time it takes a simple random walk to travel from vertex to vertex along the list until it completes a full cycle. The main result of this ...
Fast edge searching and fast searching on graphs
Given a graph G=(V,E) in which a fugitive hides on vertices or along edges, graph searching problems are usually to find the minimum number of searchers required to capture the fugitive. In this paper, we consider the problem of finding the minimum ...
A Characterization of $K_{2,4}$-Minor-Free Graphs
We provide a complete structural characterization of $K_{2,4}$-minor-free graphs. The 3-connected $K_{2,4}$-minor-free graphs consist of nine small graphs on at most eight vertices, together with a family of planar graphs that contains $2n-8$ ...
Comments