research-article

Walking in the cloud: parallel SimRank at scale

Authors:
Zhenguo Li

Huawei Noah's Ark Lab, Hong Kong

Huawei Noah's Ark Lab, Hong Kong
View Profile

,
Yixiang Fang

The University of Hong Kong, Hong Kong

The University of Hong Kong, Hong Kong
View Profile

,
Qin Liu

The Chinese University of Hong Kong, Hong Kong

The Chinese University of Hong Kong, Hong Kong
View Profile

,
Jiefeng Cheng

Huawei Noah's Ark Lab, Hong Kong

Huawei Noah's Ark Lab, Hong Kong
View Profile

,
Reynold Cheng

The University of Hong Kong, Hong Kong

The University of Hong Kong, Hong Kong
View Profile

,
John C. S. Lui

The Chinese University of Hong Kong, Hong Kong

The Chinese University of Hong Kong, Hong Kong
View Profile

Proceedings of the VLDB Endowment Volume 9 Issue 1pp 24–35https://doi.org/10.14778/2850469.2850472

Published:01 September 2015Publication History

Proceedings of the VLDB Endowment

Abstract

Despite its popularity, SimRank is computationally costly, in both time and space. In particular, its recursive nature poses a great challenge in using modern distributed computing power, and also prevents querying similarities individually. Existing solutions suffer greatly from these practical issues. In this paper, we break such dependency for maximum efficiency possible. Our method consists of offline and online phases. In offline phase, a length-n indexing vector is derived by solving a linear system in parallel. At online query time, the similarities are computed instantly from the index vector. Throughout, the Monte Carlo method is used to maximally reduce time and space. Our algorithm, called CloudWalker, is highly parallelizable, with only linear time and space. Remarkably, it responses to both single-pair and single-source queries in constant time. CloudWalker is orders of magnitude more efficient and scalable than existing solutions for large-scale problems. Implemented on Spark with 10 machines and tested on the web-scale clue-web graph with 1 billion nodes and 43 billion edges, it takes 110 hours for offline indexing, 64 seconds for a single-pair query, and 188 seconds for a single-source query. To the best of our knowledge, our work is the first to report results on clue-web, which is 10x larger than the largest graph ever reported for SimRank computation.

References

Apache Spark - Lightning-Fast Cluster Computing. http://spark.apache.org/.Google Scholar
P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In WWW, pages 1--13, 2011. Google ScholarDigital Library
P. Boldi and S. Vigna. The WebGraph Framework I: Compression Techniques. In WWW, 2004. Google ScholarDigital Library
L. Cao, B. Cho, H. D. Kim, Z. Li, M.-H. Tsai, and I. Gupta. Delta-SimRank computing on MapReduce. In BigMine, pages 28--35, 2012. Google ScholarDigital Library
J. Cheng, Q. Liu, Z. Li, W. Fan, J. C. Lui, and C. He. VENUS: Vertex-Centric Streamlined Graph Computation on a Single PC. In ICDE, 2015.Google ScholarCross Ref
D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. In STOC, 1987. Google ScholarDigital Library
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms. MIT press, 2009. Google ScholarDigital Library
J. Dean and M. R. Henzinger. Finding related pages in the world wide web. Computer networks, 31(11):1467--1479, 1999. Google ScholarDigital Library
P. G. Doyle and J. L. Snell. Random walks and electric networks. AMC, 10:12, 1984.Google Scholar
D. Fogaras and B. Rácz. Scaling link-based similarity search. In WWW, pages 641--650, 2005. Google ScholarDigital Library
F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. TKDE, 19(3):355--369, 2007. Google ScholarDigital Library
G. H. Golub and C. F. Van Loan. Matrix computations, volume 3. JHU Press, 2012.Google Scholar
G. He, H. Feng, C. Li, and H. Chen. Parallel SimRank computation on large graphs with iterative aggregation. In KDD, 2010. Google ScholarDigital Library
M. Jamali and M. Ester. Trustwalker: a random walk model for combining trust-based and item-based recommendation. In KDD, 2009. Google ScholarDigital Library
G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In KDD, 2002. Google ScholarDigital Library
M. Kusumoto, T. Maehara, and K. Kawarabayashi. Scalable similarity search for SimRank. In SIGMOD, 2014. Google ScholarDigital Library
H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a Social Network or a News Media? In WWW, 2010. Google ScholarDigital Library
J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.Google Scholar
C. Li, J. Han, G. He, X. Jin, Y. Sun, Y. Yu, and T. Wu. Fast computation of SimRank for static and dynamic information networks. In EDBT, 2010. Google ScholarDigital Library
L. Li, C. Li, H. Chen, and X. Du. Mapreduce-based SimRank computation and its application in social recommender system. In BigData Congress, 2013. Google ScholarDigital Library
P. Li, H. Liu, J. X. Yu, J. He, and X. Du. Fast single-pair SimRank computation. In SDM, 2010.Google ScholarCross Ref
D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov. Accuracy estimate and optimization techniques for SimRank computation. VLDBJ, 19(1):45--66, 2010. Google ScholarDigital Library
T. Maehara, M. Kusumoto, and K. Kawarabayashi. Efficient SimRank computation via linearization. CoRR, abs/1411.7228, 2014.Google Scholar
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, 1999.Google Scholar
L. Sun, R. Cheng, X. Li, D. Cheung, and J. Han. On link-based similarity join. VLDB, 4(11):714--725, 2011.Google Scholar
X.-M. Wu, Z. Li, and S.-F. Chang. New insights into laplacian similarity search. In CVPR, 2012.Google Scholar
X.-M. Wu, Z. Li, A. M. So, J. Wright, and S.-F. Chang. Learning with partially absorbing random walks. In NIPS, 2012.Google Scholar
W. Yu and J. A. McCann. Efficient partial-pairs SimRank search on large networks. VLDB, 8(5), 2015. Google ScholarDigital Library
W. Yu, W. Zhang, X. Lin, Q. Zhang, and J. Le. A space and time efficient algorithm for SimRank computation. WWW, 15(3):327--353, 2012. Google ScholarDigital Library
M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012. Google ScholarDigital Library

Recommendations

Walking on minimax paths for k-NN search
AAAI'13: Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence

Link-based dissimilarity measures, such as shortest path or Euclidean commute time distance, base their distance on paths between nodes of a weighted graph. These measures are known to be better suited to data manifold with nonconvex-shaped clusters, ...
Read More
Scalable real-time OLAP on cloud architectures

In contrast to queries for on-line transaction processing (OLTP) systems that typically access only a small portion of a database, OLAP queries may need to aggregate large portions of a database which often leads to performance issues. In this paper we ...
Read More
Querying Massive Trajectories by Path on the Cloud
SIGSPATIAL '17: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems

A path query aims to find the trajectories that pass a given sequence of connected road segments within a time period. It is very useful in many urban applications, e.g., 1) traffic modeling, 2) frequent path mining, and 3) traffic anomaly detection. ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Proceedings of the VLDB Endowment Volume 9, Issue 1
September 2015
35 pages
ISSN:2150-8097
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 September 2015
Published in pvldb Volume 9, Issue 1
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 23
  Total Citations
  View Citations
- 124
  Total Downloads
- Downloads (Last 12 months)6
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Walking in the cloud: parallel SimRank at scale

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Walking on minimax paths for k-NN search

Scalable real-time OLAP on cloud architectures

Querying Massive Trajectories by Path on the Cloud

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Walking in the cloud: parallel SimRank at scale

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Walking on minimax paths for k-NN search

Scalable real-time OLAP on cloud architectures

Querying Massive Trajectories by Path on the Cloud

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media