skip to main content
research-article

Walking in the cloud: parallel SimRank at scale

Published:01 September 2015Publication History
Skip Abstract Section

Abstract

Despite its popularity, SimRank is computationally costly, in both time and space. In particular, its recursive nature poses a great challenge in using modern distributed computing power, and also prevents querying similarities individually. Existing solutions suffer greatly from these practical issues. In this paper, we break such dependency for maximum efficiency possible. Our method consists of offline and online phases. In offline phase, a length-n indexing vector is derived by solving a linear system in parallel. At online query time, the similarities are computed instantly from the index vector. Throughout, the Monte Carlo method is used to maximally reduce time and space. Our algorithm, called CloudWalker, is highly parallelizable, with only linear time and space. Remarkably, it responses to both single-pair and single-source queries in constant time. CloudWalker is orders of magnitude more efficient and scalable than existing solutions for large-scale problems. Implemented on Spark with 10 machines and tested on the web-scale clue-web graph with 1 billion nodes and 43 billion edges, it takes 110 hours for offline indexing, 64 seconds for a single-pair query, and 188 seconds for a single-source query. To the best of our knowledge, our work is the first to report results on clue-web, which is 10x larger than the largest graph ever reported for SimRank computation.

References

  1. Apache Spark - Lightning-Fast Cluster Computing. http://spark.apache.org/.Google ScholarGoogle Scholar
  2. P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In WWW, pages 1--13, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. P. Boldi and S. Vigna. The WebGraph Framework I: Compression Techniques. In WWW, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. L. Cao, B. Cho, H. D. Kim, Z. Li, M.-H. Tsai, and I. Gupta. Delta-SimRank computing on MapReduce. In BigMine, pages 28--35, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. J. Cheng, Q. Liu, Z. Li, W. Fan, J. C. Lui, and C. He. VENUS: Vertex-Centric Streamlined Graph Computation on a Single PC. In ICDE, 2015.Google ScholarGoogle ScholarCross RefCross Ref
  6. D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. In STOC, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms. MIT press, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J. Dean and M. R. Henzinger. Finding related pages in the world wide web. Computer networks, 31(11):1467--1479, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. P. G. Doyle and J. L. Snell. Random walks and electric networks. AMC, 10:12, 1984.Google ScholarGoogle Scholar
  10. D. Fogaras and B. Rácz. Scaling link-based similarity search. In WWW, pages 641--650, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. TKDE, 19(3):355--369, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. G. H. Golub and C. F. Van Loan. Matrix computations, volume 3. JHU Press, 2012.Google ScholarGoogle Scholar
  13. G. He, H. Feng, C. Li, and H. Chen. Parallel SimRank computation on large graphs with iterative aggregation. In KDD, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. Jamali and M. Ester. Trustwalker: a random walk model for combining trust-based and item-based recommendation. In KDD, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In KDD, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. Kusumoto, T. Maehara, and K. Kawarabayashi. Scalable similarity search for SimRank. In SIGMOD, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a Social Network or a News Media? In WWW, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.Google ScholarGoogle Scholar
  19. C. Li, J. Han, G. He, X. Jin, Y. Sun, Y. Yu, and T. Wu. Fast computation of SimRank for static and dynamic information networks. In EDBT, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. L. Li, C. Li, H. Chen, and X. Du. Mapreduce-based SimRank computation and its application in social recommender system. In BigData Congress, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. P. Li, H. Liu, J. X. Yu, J. He, and X. Du. Fast single-pair SimRank computation. In SDM, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  22. D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov. Accuracy estimate and optimization techniques for SimRank computation. VLDBJ, 19(1):45--66, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. T. Maehara, M. Kusumoto, and K. Kawarabayashi. Efficient SimRank computation via linearization. CoRR, abs/1411.7228, 2014.Google ScholarGoogle Scholar
  24. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, 1999.Google ScholarGoogle Scholar
  25. L. Sun, R. Cheng, X. Li, D. Cheung, and J. Han. On link-based similarity join. VLDB, 4(11):714--725, 2011.Google ScholarGoogle Scholar
  26. X.-M. Wu, Z. Li, and S.-F. Chang. New insights into laplacian similarity search. In CVPR, 2012.Google ScholarGoogle Scholar
  27. X.-M. Wu, Z. Li, A. M. So, J. Wright, and S.-F. Chang. Learning with partially absorbing random walks. In NIPS, 2012.Google ScholarGoogle Scholar
  28. W. Yu and J. A. McCann. Efficient partial-pairs SimRank search on large networks. VLDB, 8(5), 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. W. Yu, W. Zhang, X. Lin, Q. Zhang, and J. Le. A space and time efficient algorithm for SimRank computation. WWW, 15(3):327--353, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 9, Issue 1
    September 2015
    35 pages

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 September 2015
    Published in pvldb Volume 9, Issue 1

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader