ABSTRACT
There is an increasing need to ingest, manage, and query large volumes of graph-structured data arising in applications such as social networks, communication networks, and biological networks. Graph databases that can explicitly reason about the graphical nature of the data, support flexible schemas, and enable node-centric or edge-centric analysis and querying are ideal for storing such data. However, although there is much work on single-site graph databases and on efficiently executing specific types of queries over large graphs, to date there is little work on understanding the challenges of distributed graph databases, which are needed to handle the large scale of such data. In this paper, we propose the design of an in-memory, distributed graph data management system aimed at managing a large-scale, dynamically changing graph and supporting low-latency query processing over it. The key challenge in a distributed graph database is that partitioning a graph across a set of machines inherently results in a large number of distributed traversals across partitions to answer even simple queries. We propose aggressive replication of the nodes in the graph to support low-latency querying, and we investigate three novel techniques to minimize the communication bandwidth and the storage requirements. First, we develop a hybrid replication policy that monitors node read-write frequencies to dynamically decide what data to replicate and whether to do so eagerly or lazily. Second, we propose a clustering-based approach to amortize the cost of making these replication decisions. Finally, we propose using a fairness criterion to dictate how replication decisions should be made. We provide both theoretical analysis and efficient algorithms for the optimization problems that arise. We have implemented our framework as a middleware on top of the open-source CouchDB key-value store.
We evaluate our system on a social graph and show that it handles very large graphs efficiently while significantly reducing network bandwidth consumption.
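To make the hybrid replication policy concrete, the sketch below shows one plausible way a per-node decision could be driven by monitored read-write frequencies. This is an illustrative assumption, not the paper's actual algorithm: the thresholds, the `replication_decision` function, and the policy names are all hypothetical.

```python
def replication_decision(reads, writes,
                         replicate_threshold=2.0,
                         eager_threshold=10.0):
    """Hypothetical per-node replication policy (illustrative only).

    Given monitored read and write counts for a node, return one of:
      "no-replica" -- writes dominate; remote copies would churn constantly
      "lazy"       -- replicate, but propagate updates on demand
      "eager"      -- reads dominate heavily; push every update immediately
    The two threshold parameters are assumed tuning knobs, not values
    from the paper.
    """
    if writes == 0:
        ratio = float("inf") if reads > 0 else 0.0
    else:
        ratio = reads / writes

    if ratio < replicate_threshold:
        # Frequent writes: keeping remote replicas consistent would cost
        # more bandwidth than the remote reads it saves.
        return "no-replica"
    if ratio >= eager_threshold:
        # Overwhelmingly read-heavy: eager (push-based) replication keeps
        # remote reads local and fast.
        return "eager"
    # Moderately read-heavy: replicate, but update lazily to cap bandwidth.
    return "lazy"
```

For example, a celebrity node with 100 reads per write would be replicated eagerly, while a node updated as often as it is read would not be replicated at all. The actual system described in the paper additionally amortizes these decisions over clusters of nodes and applies a fairness criterion across partitions.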
Index Terms
- Managing large dynamic graphs efficiently