Abstract
Despite its popularity, SimRank is computationally costly, in both time and space. In particular, its recursive nature poses a great challenge in using modern distributed computing power, and also prevents querying similarities individually. Existing solutions suffer greatly from these practical issues. In this paper, we break such dependency for maximum efficiency possible. Our method consists of offline and online phases. In offline phase, a length-n indexing vector is derived by solving a linear system in parallel. At online query time, the similarities are computed instantly from the index vector. Throughout, the Monte Carlo method is used to maximally reduce time and space. Our algorithm, called CloudWalker, is highly parallelizable, with only linear time and space. Remarkably, it responses to both single-pair and single-source queries in constant time. CloudWalker is orders of magnitude more efficient and scalable than existing solutions for large-scale problems. Implemented on Spark with 10 machines and tested on the web-scale clue-web graph with 1 billion nodes and 43 billion edges, it takes 110 hours for offline indexing, 64 seconds for a single-pair query, and 188 seconds for a single-source query. To the best of our knowledge, our work is the first to report results on clue-web, which is 10x larger than the largest graph ever reported for SimRank computation.
- Apache Spark - Lightning-Fast Cluster Computing. http://spark.apache.org/.Google Scholar
- P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In WWW, pages 1--13, 2011. Google ScholarDigital Library
- P. Boldi and S. Vigna. The WebGraph Framework I: Compression Techniques. In WWW, 2004. Google ScholarDigital Library
- L. Cao, B. Cho, H. D. Kim, Z. Li, M.-H. Tsai, and I. Gupta. Delta-SimRank computing on MapReduce. In BigMine, pages 28--35, 2012. Google ScholarDigital Library
- J. Cheng, Q. Liu, Z. Li, W. Fan, J. C. Lui, and C. He. VENUS: Vertex-Centric Streamlined Graph Computation on a Single PC. In ICDE, 2015.Google ScholarCross Ref
- D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. In STOC, 1987. Google ScholarDigital Library
- T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms. MIT press, 2009. Google ScholarDigital Library
- J. Dean and M. R. Henzinger. Finding related pages in the world wide web. Computer networks, 31(11):1467--1479, 1999. Google ScholarDigital Library
- P. G. Doyle and J. L. Snell. Random walks and electric networks. AMC, 10:12, 1984.Google Scholar
- D. Fogaras and B. Rácz. Scaling link-based similarity search. In WWW, pages 641--650, 2005. Google ScholarDigital Library
- F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. TKDE, 19(3):355--369, 2007. Google ScholarDigital Library
- G. H. Golub and C. F. Van Loan. Matrix computations, volume 3. JHU Press, 2012.Google Scholar
- G. He, H. Feng, C. Li, and H. Chen. Parallel SimRank computation on large graphs with iterative aggregation. In KDD, 2010. Google ScholarDigital Library
- M. Jamali and M. Ester. Trustwalker: a random walk model for combining trust-based and item-based recommendation. In KDD, 2009. Google ScholarDigital Library
- G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In KDD, 2002. Google ScholarDigital Library
- M. Kusumoto, T. Maehara, and K. Kawarabayashi. Scalable similarity search for SimRank. In SIGMOD, 2014. Google ScholarDigital Library
- H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a Social Network or a News Media? In WWW, 2010. Google ScholarDigital Library
- J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.Google Scholar
- C. Li, J. Han, G. He, X. Jin, Y. Sun, Y. Yu, and T. Wu. Fast computation of SimRank for static and dynamic information networks. In EDBT, 2010. Google ScholarDigital Library
- L. Li, C. Li, H. Chen, and X. Du. Mapreduce-based SimRank computation and its application in social recommender system. In BigData Congress, 2013. Google ScholarDigital Library
- P. Li, H. Liu, J. X. Yu, J. He, and X. Du. Fast single-pair SimRank computation. In SDM, 2010.Google ScholarCross Ref
- D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov. Accuracy estimate and optimization techniques for SimRank computation. VLDBJ, 19(1):45--66, 2010. Google ScholarDigital Library
- T. Maehara, M. Kusumoto, and K. Kawarabayashi. Efficient SimRank computation via linearization. CoRR, abs/1411.7228, 2014.Google Scholar
- L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, 1999.Google Scholar
- L. Sun, R. Cheng, X. Li, D. Cheung, and J. Han. On link-based similarity join. VLDB, 4(11):714--725, 2011.Google Scholar
- X.-M. Wu, Z. Li, and S.-F. Chang. New insights into laplacian similarity search. In CVPR, 2012.Google Scholar
- X.-M. Wu, Z. Li, A. M. So, J. Wright, and S.-F. Chang. Learning with partially absorbing random walks. In NIPS, 2012.Google Scholar
- W. Yu and J. A. McCann. Efficient partial-pairs SimRank search on large networks. VLDB, 8(5), 2015. Google ScholarDigital Library
- W. Yu, W. Zhang, X. Lin, Q. Zhang, and J. Le. A space and time efficient algorithm for SimRank computation. WWW, 15(3):327--353, 2012. Google ScholarDigital Library
- M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012. Google ScholarDigital Library
Recommendations
Walking on minimax paths for k-NN search
AAAI'13: Proceedings of the Twenty-Seventh AAAI Conference on Artificial IntelligenceLink-based dissimilarity measures, such as shortest path or Euclidean commute time distance, base their distance on paths between nodes of a weighted graph. These measures are known to be better suited to data manifold with nonconvex-shaped clusters, ...
Scalable real-time OLAP on cloud architectures
In contrast to queries for on-line transaction processing (OLTP) systems that typically access only a small portion of a database, OLAP queries may need to aggregate large portions of a database which often leads to performance issues. In this paper we ...
Querying Massive Trajectories by Path on the Cloud
SIGSPATIAL '17: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information SystemsA path query aims to find the trajectories that pass a given sequence of connected road segments within a time period. It is very useful in many urban applications, e.g., 1) traffic modeling, 2) frequent path mining, and 3) traffic anomaly detection. ...
Comments