ABSTRACT
Given a directed graph G, a source node s, and a target node t, the personalized PageRank (PPR of t with respect to s is the probability that a random walk starting from s terminates at t. The average of the personalized PageRank score of t with respect to each source node υ∈ V is exactly the PageRank score π( t ) of node t , which denotes the overall importance of node t in the graph. A heavy hitter of node t is a node whose contribution to π( t ) is above a φ fraction, where φ is a value between 0 and 1. Finding heavy hitters has important applications in link spam detection, classification of web pages, and friend recommendations.
In this paper, we propose BLOG, an efficient framework for three types of heavy hitter queries: the pairwise approximate heavy hitter (AHH), the reverse AHH, and the multi-source reverse AHH queries. For pairwise AHH queries, our algorithm combines the Monte-Carlo approach and the backward propagation approach to reduce the cost of both methods, and incorporates new techniques to deal with high in-degree nodes. For reverse AHH and multi-source reverse AHH queries, our algorithm extends the ideas behind the pairwise AHH algorithm with a new "logarithmic bucketing'' technique to improve the query efficiency. Extensive experiments demonstrate that our BLOG is far more efficient than alternative solutions on the three queries.
- Reid Andersen, Christian Borgs, Jennifer T. Chayes, John E. Hopcroft, Vahab S. Mirrokni, and Shang-Hua Teng. 2007. Local Computation of PageRank Contributions. In WAW. 150--165. Google ScholarDigital Library
- Reid Andersen, Fan R. K. Chung, and Kevin J. Lang. 2006. Local Graph Partitioning using PageRank Vectors. In FOCS. 475--486. Google ScholarDigital Library
- Konstantin Avrachenkov, Nelly Litvak, Danil Nemirovsky, and Natalia Osipova. 2007. Monte Carlo methods in PageRank computation: When one iteration is sufficient. SIAM J. Numerical Analysis Vol. 45, 2 (2007), 890--904. Google ScholarDigital Library
- Lars Backstrom and Jure Leskovec. 2011. Supervised random walks: predicting and recommending links in social networks. In WSDM. 635--644. Google ScholarDigital Library
- Bahman Bahmani, Kaushik Chakrabarti, and Dong Xin. 2011. Fast personalized PageRank on MapReduce. In SIGMOD. 973--984. Google ScholarDigital Library
- András A. Benczúr, Károly Csalogány, Tamás Sarlós, and Máté Uher. 2005. SpamRank -- Fully Automatic Link Spam Detection. In AIRWeb. 25--38.Google Scholar
- Pavel Berkhin. 2005. Survey: A Survey on PageRank Computing. Internet Mathematics Vol. 2, 1 (2005), 73--120.Google ScholarCross Ref
- Pavel Berkhin. 2006. Bookmark-coloring algorithm for personalized pagerank computing. Internet Mathematics Vol. 3, 1 (2006), 41--62.Google ScholarCross Ref
- Soumen Chakrabarti. 2007. Dynamic personalized pagerank in entity-relation graphs WWW. 571--580. Google ScholarDigital Library
- Fan R. K. Chung and Lincoln Lu. 2006. Survey: Concentration Inequalities and Martingale Inequalities: A Survey. Internet Mathematics Vol. 3, 1 (2006), 79--127.Google ScholarCross Ref
- Dániel Fogaras, Balázs Rácz, Károly Csalogány, and Tamás Sarlós. 2005. Towards scaling fully personalized pagerank: Algorithms, lower bounds, and experiments. Internet Mathematics Vol. 2, 3 (2005), 333--358.Google ScholarCross Ref
- Yasuhiro Fujiwara, Makoto Nakatsuji, Takeshi Yamamuro, Hiroaki Shiokawa, and Makoto Onizuka. 2012. Efficient personalized pagerank with accuracy assurance KDD. 15--23. Google ScholarDigital Library
- Tao Guo, Xin Cao, Gao Cong, Jiaheng Lu, and Xuemin Lin. 2017 a. Distributed Algorithms on Exact Personalized PageRank SIGMOD. 479--494. Google ScholarDigital Library
- Wentian Guo, Yuchen Li, Mo Sha, and Kian-Lee Tan. 2017 b. Parallel Personalized Pagerank on Dynamic Graphs. PVLDB Vol. 11, 1 (2017), 93--106. Google ScholarDigital Library
- Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, and Reza Zadeh. 2013. Wtf: The who to follow service at twitter. In WWW. 505--514. Google ScholarDigital Library
- Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, and Jan O. Pedersen. 2006. Link Spam Detection Based on Mass Estimation. In VLDB. 439--450. Google ScholarDigital Library
- Zoltan Gyongyi, Hector Garcia-Molina, and Jan Pedersen. 2006. Web content categorization using link information. Technical Report. Stanford.Google Scholar
- Glen Jeh and Jennifer Widom. 2003. Scaling personalized web search. In WWW. 271--279. Google ScholarDigital Library
- Jinhong Jung, Namyong Park, Lee Sael, and U. Kang. 2017. BePI: Fast and Memory-Efficient Method for Billion-Scale Random Walk with Restart. In SIGMOD. 789--804. Google ScholarDigital Library
- Jinhong Jung, Kijung Shin, Lee Sael, and U. Kang. 2016. Random Walk with Restart on Large Graphs Using Block Elimination. ACM Trans. Database Syst. Vol. 41, 2 (2016), 12:1--12:43. Google ScholarDigital Library
- David Liben-Nowell and Jon M. Kleinberg. 2007. The link-prediction problem for social networks. JASIST Vol. 58, 7 (2007), 1019--1031. Google ScholarDigital Library
- Peter Lofgren, Siddhartha Banerjee, and Ashish Goel. 2016. Personalized pagerank estimation and search: A bidirectional approach WSDM. 163--172. Google ScholarDigital Library
- Peter Lofgren, Siddhartha Banerjee, Ashish Goel, and C Seshadhri. 2014. Fast-ppr: Scaling personalized pagerank estimation for large graphs KDD. 1436--1445. Google ScholarDigital Library
- Takanori Maehara, Takuya Akiba, Yoichi Iwata, and Ken-ichi Kawarabayashi. 2014. Computing personalized PageRank quickly by exploiting graph structures. PVLDB Vol. 7, 12 (2014), 1023--1034. Google ScholarDigital Library
- Naoto Ohsaka, Takanori Maehara, and Ken-ichi Kawarabayashi. 2015. Efficient PageRank Tracking in Evolving Networks. In SIGKDD 2015. 875--884. Google ScholarDigital Library
- Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: bringing order to the web. (1999).Google Scholar
- Atish Das Sarma, Anisur Rahaman Molla, Gopal Pandurangan, and Eli Upfal. 2013. Fast Distributed PageRank Computation. In ICDCN. 11--26.Google Scholar
- Kijung Shin, Jinhong Jung, Lee Sael, and U. Kang. 2015. BEAR: Block Elimination Approach for Random Walk with Restart on Large Graphs SIGMOD. 1571--1585. Google ScholarDigital Library
- Sibo Wang, Youze Tang, Xiaokui Xiao, Yin Yang, and Zengxiang Li. 2016. HubPPR: Effective Indexing for Approximate Personalized PageRank. PVLDB Vol. 10, 3 (2016), 205--216. Google ScholarDigital Library
- Sibo Wang, Renchi Yang, Xiaokui Xiao, Zhewei Wei, and Yin Yang. 2017. FORA: Simple and Effective Approximate Single-Source Personalized PageRank. In SIGKDD. 505--514. Google ScholarDigital Library
- Hongyang Zhang, Peter Lofgren, and Ashish Goel. 2016. Approximate Personalized PageRank on Dynamic Graphs KDD. 1315--1324. Google ScholarDigital Library
- Fanwei Zhu, Yuan Fang, Kevin Chen-Chuan Chang, and Jing Ying. 2013. Incremental and Accuracy-Aware Personalized PageRank through Scheduled Approximation. PVLDB Vol. 6, 6 (2013), 481--492. Google ScholarDigital Library
Index Terms
- Efficient Algorithms for Finding Approximate Heavy Hitters in Personalized PageRanks
Recommendations
Beating CountSketch for heavy hitters in insertion streams
STOC '16: Proceedings of the forty-eighth annual ACM symposium on Theory of ComputingGiven a stream p1, …, pm of items from a universe U, which, without loss of generality we identify with the set of integers {1, 2, …, n}, we consider the problem of returning all ℓ2-heavy hitters, i.e., those items j for which fj ≥ є √F2, where fj is ...
Finding Subcube Heavy Hitters in Analytics Data Streams
WWW '18: Proceedings of the 2018 World Wide Web ConferenceModern data streams typically have high dimensionality. For example, digital analytics streams consist of user online activities (e.g., web browsing activity, commercial site activity, apps and social behavior, and response to ads). An important problem ...
Conditional heavy hitters: detecting interesting correlations in data streams
The notion of heavy hitters--items that make up a large fraction of the population--has been successfully used in a variety of applications across sensor and RFID monitoring, network data analysis, event mining, and more. Yet this notion often fails to ...
Comments