ABSTRACT
Graphs appear in numerous applications including cyber-security, the Internet, social networks, protein networks, recommendation systems, and many more. Graphs with millions or even billions of nodes and edges are common-place. How to store such large graphs efficiently? What are the core operations/queries on those graph? How to answer the graph queries quickly? We propose GBASE, a scalable and general graph management and mining system. The key novelties lie in 1) our storage and compression scheme for a parallel setting and 2) the carefully chosen graph operations and their efficient implementation. We designed and implemented an instance of GBASE using MapReduce/Hadoop. GBASE provides a parallel indexing mechanism for graph mining operations that both saves storage space, as well as accelerates queries. We ran numerous experiments on real graphs, spanning billions of nodes and edges, and we show that our proposed GBASE is indeed fast, scalable and nimble, with significant savings in space and time.
- Hadoop information. http://hadoop.apache.org/.Google Scholar
- Daniel J. Abadi, Peter A. Boncz, and Stavros Harizopoulos. Column oriented database systems. PVLDB, 2009. Google ScholarDigital Library
- Daniel J. Abadi, Samuel Madden, and Nabil Hachem. Column-stores vs. row-stores: how different are they really? In SIGMOD Conference, pages 967--980, 2008. Google ScholarDigital Library
- Louigi Addario-Berry, W. Sean Kennedy, Andrew D. King, Zhentao Li, and Bruce A. Reed. Finding a maximum-weight induced k-partite subgraph of an i-triangulated graph. Discrete Applied Mathematics, 158(7):765--770, 2010. Google ScholarDigital Library
- Charu C. Aggarwal, Yan Xie, and Philip S. Yu. Gconnect: A connectivity index for massive disk-resident graphs. PVLDB, 2(1):862--873, 2009. Google ScholarDigital Library
- Leman Akoglu, Mary McGlohon, and Christos Faloutsos. oddball: Spotting anomalies in weighted graphs. In PAKDD (2), pages 410--421, 2010. Google ScholarDigital Library
- I. Alvarez-Hamelin, L. Dall'Asta, A. Barrat, and A. Vespignani. k-core decompositions: A tool for the visualization of large scale networks. http://arxiv.org/abs/cs.NI/0504107.Google Scholar
- Periklis Andritsos, Renée J. Miller, and Panayiotis Tsaparas. Information-theoretic tools for mining database structure from large data sets. In SIGMOD, 2004. Google ScholarDigital Library
- David A. Bader, Shiva Kintali, Kamesh Madduri, and Milena Mihail. Approximating betweenness centrality. WAW, 2007. Google ScholarDigital Library
- Paolo Boldi and Sebastiano Vigna. The webgraph framework i: compression techniques. In WWW, 2004. Google ScholarDigital Library
- Ronnie Chaiken, Bob Jenkins, Per-Ake Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. Scope: easy and efficient parallel processing of massive data sets. VLDB, 2008. Google ScholarDigital Library
- Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Michael Mitzenmacher, Alessandro Panconesi, and Prabhakar Raghavan. On compressing social networks. In KDD, 2009. Google ScholarDigital Library
- Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. OSDI'04, December 2004. Google ScholarDigital Library
- P. Erdos and A. Rényi. On random graphs. Publicationes Mathematicae, 6:290--297, 1959.Google ScholarCross Ref
- Robert L. Grossman and Yunhong Gu. Data mining using high performance data clouds: experimental studies using sector and sphere. KDD, 2008. Google ScholarDigital Library
- Sándor Héman, Marcin Zukowski, Niels J. Nes, Lefteris Sidirourgos, and Peter A. Boncz. Positional update handling in column stores. In SIGMOD, 2010. Google ScholarDigital Library
- Michael Isard and Yuan Yu. Distributed data-parallel computing using a high-level programming language. In SIGMOD Conference, pages 987--994, 2009. Google ScholarDigital Library
- Milena Ivanova, Martin L. Kersten, Niels J. Nes, and Romulo Goncalves. An architecture for recycling intermediates in a column-store. In SIGMOD, 2009. Google ScholarDigital Library
- U Kang, C.E Tsourakakis, Ana Paula Appel, C Faloutsos, and Jure Leskovec. Radius plots for mining tera-byte scale graphs: Algorithms, patterns, and observations. SIAM International Conference on Data Mining, 2010.Google ScholarCross Ref
- U Kang, C.E Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system - implementation and observations. ICDM, 2009. Google ScholarDigital Library
- G. Karypis and V. Kumar. Parallel multilevel k-way partitioning for irregular graphs. SIAM Review, 41(2):278--300, 1999. Google ScholarDigital Library
- George Karypis and Vipin Kumar. Multilevel k-way hypergraph partitioning. In DAC, pages 343--348, 1999. Google ScholarDigital Library
- Jure Leskovec, Deepayan Chakrabarti, Jon M. Kleinberg, and Christos Faloutsos. Realistic, mathematically tractable graph generation and evolution, using kronecker multiplication. PKDD, pages 133--145, 2005. Google ScholarDigital Library
- Ching-Yung Lin, Nan Cao, Shixia Liu, Spiros Papadimitriou, Jimeng Sun, and Xifeng Yan. Smallblue: Social network analysis for expertise search and collective intelligence. In ICDE, pages 1483--1486, 2009. Google ScholarDigital Library
- Chao Liu, Fan Guo, and Christos Faloutsos. Bbm: bayesian browsing model from petabyte-scale data. In KDD, 2009. Google ScholarDigital Library
- Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD Conference, pages 135--146, 2010. Google ScholarDigital Library
- Hossein Maserrat and Jian Pei. Neighbor query friendly compression of social networks. In KDD, 2010. Google ScholarDigital Library
- Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD '08, pages 1099--1110, 2008. Google ScholarDigital Library
- Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.Google Scholar
- Spiros Papadimitriou and Jimeng Sun. Disco: Distributed co-clustering with map-reduce. ICDM, 2008. Google ScholarDigital Library
- Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel analysis with sawzall. Scientific Programming Journal, 2005. Google ScholarDigital Library
- Purnamrita Sarkar and Andrew W. Moore. Fast nearest-neighbor search in disk-resident graphs. In KDD, pages 513--522, 2010. Google ScholarDigital Library
- Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth J. O'Neil, Patrick E. O'Neil, Alex Rasin, Nga Tran, and Stanley B. Zdonik. C-store: A column-oriented dbms. In VLDB, 2005. Google ScholarDigital Library
- Jimeng Sun, Huiming Qu, Deepayan Chakrabarti, and Christos Faloutsos. Neighborhood formation and anomaly detection in bipartite graphs. In ICDM, pages 418--425, 2005. Google ScholarDigital Library
- Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast random walk with restart and its applications. In ICDM, pages 613--622, 2006. Google ScholarDigital Library
- Silke Trißl and Ulf Leser. Fast and practical indexing and querying of very large graphs. In SIGMOD, 2007. Google ScholarDigital Library
- D. Xin, J. Han, X. Yan, and H. Cheng. Mining compressed frequent-pattern sets. In VLDB, pages 709--720, 2005. Google ScholarDigital Library
- Xifeng Yan, Hong Cheng, Jiawei Han, and Philip S. Yu. Mining significant graph patterns by leap search. In SIGMOD Conference, pages 433--444, 2008. Google ScholarDigital Library
- Peixiang Zhao, Jeffrey Xu Yu, and Philip S. Yu. Graph indexing: Tree + delta >= graph. In VLDB, 2007. Google ScholarDigital Library
Index Terms
- GBASE: a scalable and general graph management system
Recommendations
gbase: an efficient analysis platform for large graphs
Graphs appear in numerous applications including cyber security, the Internet, social networks, protein networks, recommendation systems, citation networks, and many more. Graphs with millions or even billions of nodes and edges are common-place. How to ...
Equimatchable Graphs on Surfaces
A graph G is equimatchable if each matching in G is a subset of a maximum-size matching and it is factor critical if G-v has a perfect matching for each vertex v of G. It is known that any 2-connected equimatchable graph is either bipartite or factor ...
Acyclic 3-choosability of sparse graphs with girth at least 7
Every planar graph is known to be acyclically 7-choosable and is conjectured to be acyclically 5-choosable (Borodin et al. 2002 [4]). This conjecture if proved would imply both Borodin's acyclic 5-color theorem (1979) and Thomassen's 5-choosability ...
Comments