skip to main content
10.1145/1835804.1835871acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Fast nearest-neighbor search in disk-resident graphs

Published:25 July 2010Publication History

ABSTRACT

Link prediction, personalized graph search, fraud detection, and many such graph mining problems revolve around the computation of the most "similar" k nodes to a given query node. One widely used class of similarity measures is based on random walks on graphs, e.g., personalized pagerank, hitting and commute times, and simrank. There are two fundamental problems associated with these measures. First, existing online algorithms typically examine the local neighborhood of the query node which can become significantly slower whenever high-degree nodes are encountered (a common phenomenon in real-world graphs). We prove that turning high degree nodes into sinks results in only a small approximation error, while greatly improving running times. The second problem is that of computing similarities at query time when the graph is too large to be memory-resident. The obvious solution is to split the graph into clusters of nodes and store each cluster on a disk page; ideally random walks will rarely cross cluster boundaries and cause page-faults. Our contributions here are twofold: (a) we present an efficient deterministic algorithm to find the k closest neighbors (in terms of personalized pagerank) of any query node in such a clustered graph, and (b) we develop a clustering algorithm (RWDISK) that uses only sequential sweeps over data files. Empirical results on several large publicly available graphs like DBLP, Citeseer and Live-Journal (~ 90 M edges) demonstrate that turning high degree nodes into sinks not only improves running time of RWDISK by a factor of 3 but also boosts link prediction accuracy by a factor of 4 on average. We also show that RWDISK returns more desirable (high conductance and small size) clusters than the popular clustering algorithm METIS, while requiring much less memory. Finally our deterministic algorithm for computing nearest neighbors incurs far fewer page-faults (factor of 5) than actually simulating random walks.

Skip Supplemental Material Section

Supplemental Material

kdd2010_sarkar_fnnsdrg_01.mov

mov

144.8 MB

References

  1. D. Aldous and J. A. Fill. Reversible Markov Chains. 2001.Google ScholarGoogle Scholar
  2. R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. In FOCS, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Balmin, V. Hristidis, and Y. Papakonstantinou. ObjectRank: Authority-based keyword search in databases. In VLDB, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. P. Berkhin. Bookmark-Coloring Algorithm for Personalized PageRank Computing. Internet Mathematics, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  5. M. Brand. A Random Walks Perspective on Maximizing Satisfaction and Profit. In SIAM '05, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  6. S. Chakrabarti. Dynamic personalized pagerank in entity-relation graphs. In WWW '07, New York, NY, USA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Chakrabarti, J. Mirchandani, and A. Nandi. Spin: searching personal information networks. In SIGIR '05. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. On compressing social networks. In KDD '09. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. B. B. Dalvi, M. Kshirsagar, and S. Sudarshan. Keyword search on external memory data graphs. Proceedings of VLDB Endowment, 1(1):1189--1204, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. D. Fogaras, B. Rcz, K. Csalogany, and T. Sarlos. Towards scaling fully personalized pagerank: Algorithms, lower bounds, and experiments. Internet Mathematics, 2004.Google ScholarGoogle Scholar
  11. F. C. Graham and W. Zhao. Pagerank and random walks on graphs. In Fete of Combinatorics".Google ScholarGoogle Scholar
  12. Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. In VLDB, pages 576{587, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. J. Hopcroft and D. Sheldon. Manipulation-resistant reputations using hitting time. Technical report, Cornell University, 2007.Google ScholarGoogle Scholar
  14. G. Jeh and J. Widom. Scaling personalized web search. In Stanford University Technical Report, 2002.Google ScholarGoogle Scholar
  15. G. Jeh and J. Widom. Simrank: A measure if structural-context similarity. In ACM SIGKDD, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. A. Joshi, R. Kumar, B. Reed, and A. Tomkins. Anchor-based proximity measures. In WWW '07. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20(1):359--392, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In CIKM '03, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Q. Mei, D. Zhou, and K. Church. Query suggestion using hitting time. In CIKM '08. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. H. Qiu and E. R. Hancock. Image segmentation using commute times. In Proc. BMVC, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  21. H. Qiu and E. R. Hancock. Robust multi-body motion tracking using commute time clustering. In ECCV'06, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. P. Sarkar and A. Moore. Fast nearest-neighbor search in disk-resident graphs. Technical report:10-005, Machine Learning Department, Carnegie Mellon University, 2010.Google ScholarGoogle Scholar
  23. T. Sarlos, A. A. Benczur, K. Csalogany, D. Fogaras, and B. Racz. To randomize or not to randomize: space optimal summaries for hyperlink analysis. In WWW, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. A. D. Sarma, S. Gollapudi, and R. Panigrahy. Estimating pagerank on graph streams. In PODS, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. D. Spielman and S. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the STOC'04. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In ICML, volume 20, 2003.Google ScholarGoogle Scholar

Index Terms

  1. Fast nearest-neighbor search in disk-resident graphs

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
          July 2010
          1240 pages
          ISBN:9781450300551
          DOI:10.1145/1835804

          Copyright © 2010 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 25 July 2010

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          Overall Acceptance Rate1,133of8,635submissions,13%

          Upcoming Conference

          KDD '24

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader