research-article

TrillionG: A Trillion-scale Synthetic Graph Generator using a Recursive Vector Model

Authors:
Himchan Park

Daegu Gyeongbuk Institute of Science and Technology, Daegu, Republic of Korea

Daegu Gyeongbuk Institute of Science and Technology, Daegu, Republic of Korea
View Profile

,
Min-Soo Kim

Daegu Gyeongbuk Institute of Science and Technology, Daegu, Republic of Korea

Daegu Gyeongbuk Institute of Science and Technology, Daegu, Republic of Korea
View Profile

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of DataMay 2017Pages 913–928https://doi.org/10.1145/3035918.3064014

Published:09 May 2017Publication History

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

Pages 913–928

ABSTRACT

As many applications encounter exponential growth in graph sizes, a fast and scalable graph generator has become more important than ever before due to lack of large-scale realistic graphs for evaluating the performance of graph processing methods. Although there have been proposed a number of methods to generate synthetic graphs, they are not very efficient in terms of space and time complexities, and so, cannot generate even trillion-scale graphs using a moderate size cluster of commodity machines. Here, we propose an efficient and scalable disk-based graph generator, TrillionG that can generate massive graphs in a short time only using a small amount of memory. It can generate a graph of a trillion edges following the RMAT or Kronecker models within two hours only using 10 PCs. We first generalize existing graph generation models to the scope-based generation model, where RMAT and Kronecker correspond to two extremes. Then, we propose a new graph generation model called the recursive vector model, which compromises two extremes, and so, solves the space and time complexity problems existing in RMAT and Kronecker. We also extend the recursive vector model so as to generate a semantically richer graph database. Through extensive experiments, we have demonstrated that TrillionG outperforms the state-of-the-art graph generators by up to orders of magnitude.

References

Amazon S3. https://aws.amazon.com/s3.Google Scholar
Apache HBase. http://hbase.apache.org/.Google Scholar
Apache Spark. http://spark.apache.org/.Google Scholar
Graph500 benchmark. http://www.graph500.org/.Google Scholar
Hadoop implementation of RMAT-WES/P. https://github.com/paulmw/rmat.Google Scholar
Human Connectome Project. http://www.humanconnectomeproject.org/.Google Scholar
Rich Data and the Increasing Value of the Internet of Things. http://www.emc.com/leadership/digital-universe/2014iview/internet-of-things.htm.Google Scholar
Stanford Network Analysis Project. http://snap.stanford.edu/.Google Scholar
G. Aluç, O. Hartig, M. T. Özsu, and K. Daudjee. Diversified stress testing of RDF data management systems. In International Semantic Web Conference, pages 197--212. Springer, 2014. Google ScholarDigital Library
G. Bagan, A. Bonifati, R. Ciucanu, G. H. Fletcher, A. Lemay, and N. Advokaat. gMark: schema-driven generation of graphs and queries. IEEE Transactions on Knowledge and Data Engineering, 2016. Google ScholarDigital Library
A.-L. Barabási and R. Albert. Emergence of scaling in random networks. science, 286(5439):509--512, 1999.Google Scholar
D. Borthakur. The Hadoop distributed file system: Architecture and design. Hadoop Project Website, 11(2007):21, 2007.Google Scholar
U. K. ByungSoo Jeon, Inah jeon. TeGViz: Distributed Tera-Scale Graph Generation and Visualization. In ICDM, 2015. Google ScholarDigital Library
D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A Recursive Model for Graph Mining. In SDM, volume 4, pages 442--446. SIAM, 2004.Google ScholarCross Ref
F. Checconi and F. Petrini. Traversing trillions of edges in real time: Graph exploration on large-scale parallel machines. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 425--434. IEEE, 2014. Google ScholarDigital Library
A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan. One trillion edges: graph processing at Facebook-scale. VLDB, 8(12):1804--1815, 2015. Google ScholarDigital Library
C. Curino, E. Jones, Y. Zhang, and S. Madden. Schism: a workload-driven approach to database replication and partitioning. Proceedings of the VLDB Endowment, 3(1--2):48--57, 2010. Google ScholarDigital Library
C. Dobre and F. Xhafa. Intelligent services for big data science. Future Generation Computer Systems, 37:267--281, 2014.Google ScholarCross Ref
P. Erdős and A. Rényi. On random graphs. Publicationes Mathematicae Debrecen, 6:290--297, 1959.Google ScholarCross Ref
O. Erling, A. Averbuch, J. Larriba-Pey, H. Chafi, A. Gubichev, A. Prat, M.-D. Pham, and P. Boncz. The LDBC social network benchmark: Interactive workload. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 619--630. ACM, 2015. Google ScholarDigital Library
E. N. Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):1141--1144, 1959.Google ScholarCross Ref
J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In OSDI, pages 599--613, 2014. Google ScholarDigital Library
A. Hadian, S. Nobari, B. Minaei-Bidgoli, and Q. Qu. ROLL: Fast In-Memory Generation of Gigantic Scale-free Networks. In Proceedings of the 2016 International Conference on Management of Data, pages 1829--1842. ACM, 2016. Google ScholarDigital Library
A. Iosup, T. Hegeman, W. L. Ngai, S. Heldens, A. Prat-Pérez, T Manhardto, H. Chafio, M. Capotă, N. Sundaram, M. Anderson, et al. LDBC graphalytics: A benchmark for large-scale graph analysis on parallel and distributed platforms. volume 9, pages 1317--1328. VLDB Endowment, 2016. Google ScholarDigital Library
M.-S. Kim, K. An, H. Park, H. Seo, and J. Kim. GTS: A Fast and Scalable Graph Processing Method based on Streaming Topology to GPUs. In Proceedings of the 2016 International Conference on Management of Data, pages 447--461. ACM, 2016. Google ScholarDigital Library
P. Kumar and H. H. Huang. G-store: high-performance graph store for trillion-edge processing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 71. IEEE Press, 2016. Google ScholarDigital Library
J. Leskovec, D. Chakrabarti, J. Kleinberg, and C. Faloutsos. Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication. In PKDD, pages 133--145. Springer, 2005.Google ScholarCross Ref
J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. JMLR, 11:985--1042, 2010. Google ScholarDigital Library
J. Lothian, S. Powers, B. D. Sullivan, M. Baker, J. Schrock, and S. W. Poole. Synthetic Graph Generation for Data-Intensive HPC Benchmarking: Background and Framework. 2013.Google Scholar
A. Pavlo, C. Curino, and S. Zdonik. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 61--72. ACM, 2012. Google ScholarDigital Library
M.-D. Pham, P. Boncz, and O. Erling. S3G2: A scalable structure-correlated social graph generator. In Technology Conference on Performance Evaluation and Benchmarking, pages 156--172. Springer, 2012.Google Scholar
A. Quamar, K. A. Kumar, and A. Deshpande. SWORD: scalable workload-aware data placement for transactional workloads. In Proceedings of the 16th International Conference on Extending Database Technology, pages 430--441. ACM, 2013. Google ScholarDigital Library
A. Roy, L. Bindschaedler, J. Malicevic, and W. Zwaenepoel. Chaos: Scale-out graph processing from secondary storage. In Proceedings of the 25th Symposium on Operating Systems Principles, pages 410--424. ACM, 2015. Google ScholarDigital Library
M. Schmidt, T. Hornung, G. Lausen, and C. Pinkel. SP 2 Bench: a SPARQL performance benchmark. In Data Engineering, 2009. ICDE'09. IEEE 25th International Conference on, pages 222--233. IEEE, 2009. Google ScholarDigital Library
C. Seshadhri, A. Pinar, and T. G. Kolda. An in-depth study of stochastic Kronecker graphs. In ICDM, pages 587--596. IEEE, 2011. Google ScholarDigital Library
K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In MSST, pages 1--10. IEEE, 2010. Google ScholarDigital Library
C. L. Staudt, A. Sazonovs, and H. Meyerhenke. Networkit: An interactive tool suite for high-performance network analysis. arXiv preprint arXiv:1403.3005, 2014.Google Scholar
M. Wu, F. Yang, J. Xue, W. Xiao, Y. Miao, L. Wei, H. Lin, Y. Dai, and L. Zhou. G ra M: scaling graph computation to the trillions. In Proceedings of the Sixth ACM Symposium on Cloud Computing, pages 408--421. ACM, 2015. Google ScholarDigital Library
J. Zhang and Y. Tay. GSCALER: Synthetically scaling a given graph. In EDBT, pages 53--64, 2016.Google Scholar
G. K. Zipf. Human behavior and the principle of least effort: An introduction to human ecology. 2016Google Scholar

Index Terms

TrillionG: A Trillion-scale Synthetic Graph Generator using a Recursive Vector Model
1. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory
      1. Random graphs

Recommendations

EvoGraph: An Effective and Efficient Graph Upscaling Method for Preserving Graph Properties
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Nowadays, many researchers and industry groups often suffer from the lack of a variety of large-scale real graphs. Although a lot of synthetic graph generation methods,(or models) such as RMAT and BA have been developed, their output graphs tend to be ...
Read More
Exhaustive generation of k‐critical H‐free graphs

AbstractWe describe an algorithm for generating all k‐critical H‐free graphs, based on a method of Hoàng et al. A graph G is k‐critical H‐free if G is H‐free, k‐chromatic, and every H‐free proper subgraph of G is k−1‐colorable. Using this algorithm, we ...
Read More
Exhaustive Generation of k-Critical $${\mathcal H}$$-Free Graphs
WG 2016: Revised Selected Papers of the 42nd International Workshop on Graph-Theoretic Concepts in Computer Science - Volume 9941

We describe an algorithm for generating all k-critical $${\mathcal H}$$-free graphs, based on a method of Hoíng et al. Using this algorithm, we prove that there are only finitely many 4-critical $$P_7,C_k$$-free graphs, for both $$k=4$$ and $$k=5$$. We ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
May 2017
1810 pages
ISBN:9781450341974
DOI:10.1145/3035918
General Chairs:
Rada Chirkova
North Carolina State University, USA
,
Jun Yang
Duke University, USA
,
Program Chair:
Dan Suciu
University of Washington, USA
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 9 May 2017
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
benchmark
distributed
graph generation
ldbc
rdf
realistic
recursive vector model
rmat
scalability
synthetic
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 31
  Total Citations
  View Citations
- 1,044
  Total Downloads
- Downloads (Last 12 months)50
- Downloads (Last 6 weeks)6
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

TrillionG: A Trillion-scale Synthetic Graph Generator using a Recursive Vector Model

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

EvoGraph: An Effective and Efficient Graph Upscaling Method for Preserving Graph Properties

Exhaustive generation of k‐critical H‐free graphs

Exhaustive Generation of k-Critical $${\mathcal H}$$-Free Graphs