ABSTRACT
As many applications encounter exponential growth in graph sizes, a fast and scalable graph generator has become more important than ever before due to lack of large-scale realistic graphs for evaluating the performance of graph processing methods. Although there have been proposed a number of methods to generate synthetic graphs, they are not very efficient in terms of space and time complexities, and so, cannot generate even trillion-scale graphs using a moderate size cluster of commodity machines. Here, we propose an efficient and scalable disk-based graph generator, TrillionG that can generate massive graphs in a short time only using a small amount of memory. It can generate a graph of a trillion edges following the RMAT or Kronecker models within two hours only using 10 PCs. We first generalize existing graph generation models to the scope-based generation model, where RMAT and Kronecker correspond to two extremes. Then, we propose a new graph generation model called the recursive vector model, which compromises two extremes, and so, solves the space and time complexity problems existing in RMAT and Kronecker. We also extend the recursive vector model so as to generate a semantically richer graph database. Through extensive experiments, we have demonstrated that TrillionG outperforms the state-of-the-art graph generators by up to orders of magnitude.
- Amazon S3. https://aws.amazon.com/s3.Google Scholar
- Apache HBase. http://hbase.apache.org/.Google Scholar
- Apache Spark. http://spark.apache.org/.Google Scholar
- Graph500 benchmark. http://www.graph500.org/.Google Scholar
- Hadoop implementation of RMAT-WES/P. https://github.com/paulmw/rmat.Google Scholar
- Human Connectome Project. http://www.humanconnectomeproject.org/.Google Scholar
- Rich Data and the Increasing Value of the Internet of Things. http://www.emc.com/leadership/digital-universe/2014iview/internet-of-things.htm.Google Scholar
- Stanford Network Analysis Project. http://snap.stanford.edu/.Google Scholar
- G. Aluç, O. Hartig, M. T. Özsu, and K. Daudjee. Diversified stress testing of RDF data management systems. In International Semantic Web Conference, pages 197--212. Springer, 2014. Google ScholarDigital Library
- G. Bagan, A. Bonifati, R. Ciucanu, G. H. Fletcher, A. Lemay, and N. Advokaat. gMark: schema-driven generation of graphs and queries. IEEE Transactions on Knowledge and Data Engineering, 2016. Google ScholarDigital Library
- A.-L. Barabási and R. Albert. Emergence of scaling in random networks. science, 286(5439):509--512, 1999.Google Scholar
- D. Borthakur. The Hadoop distributed file system: Architecture and design. Hadoop Project Website, 11(2007):21, 2007.Google Scholar
- U. K. ByungSoo Jeon, Inah jeon. TeGViz: Distributed Tera-Scale Graph Generation and Visualization. In ICDM, 2015. Google ScholarDigital Library
- D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A Recursive Model for Graph Mining. In SDM, volume 4, pages 442--446. SIAM, 2004.Google ScholarCross Ref
- F. Checconi and F. Petrini. Traversing trillions of edges in real time: Graph exploration on large-scale parallel machines. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 425--434. IEEE, 2014. Google ScholarDigital Library
- A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan. One trillion edges: graph processing at Facebook-scale. VLDB, 8(12):1804--1815, 2015. Google ScholarDigital Library
- C. Curino, E. Jones, Y. Zhang, and S. Madden. Schism: a workload-driven approach to database replication and partitioning. Proceedings of the VLDB Endowment, 3(1--2):48--57, 2010. Google ScholarDigital Library
- C. Dobre and F. Xhafa. Intelligent services for big data science. Future Generation Computer Systems, 37:267--281, 2014.Google ScholarCross Ref
- P. Erdős and A. Rényi. On random graphs. Publicationes Mathematicae Debrecen, 6:290--297, 1959.Google ScholarCross Ref
- O. Erling, A. Averbuch, J. Larriba-Pey, H. Chafi, A. Gubichev, A. Prat, M.-D. Pham, and P. Boncz. The LDBC social network benchmark: Interactive workload. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 619--630. ACM, 2015. Google ScholarDigital Library
- E. N. Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):1141--1144, 1959.Google ScholarCross Ref
- J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In OSDI, pages 599--613, 2014. Google ScholarDigital Library
- A. Hadian, S. Nobari, B. Minaei-Bidgoli, and Q. Qu. ROLL: Fast In-Memory Generation of Gigantic Scale-free Networks. In Proceedings of the 2016 International Conference on Management of Data, pages 1829--1842. ACM, 2016. Google ScholarDigital Library
- A. Iosup, T. Hegeman, W. L. Ngai, S. Heldens, A. Prat-Pérez, T Manhardto, H. Chafio, M. Capotă, N. Sundaram, M. Anderson, et al. LDBC graphalytics: A benchmark for large-scale graph analysis on parallel and distributed platforms. volume 9, pages 1317--1328. VLDB Endowment, 2016. Google ScholarDigital Library
- M.-S. Kim, K. An, H. Park, H. Seo, and J. Kim. GTS: A Fast and Scalable Graph Processing Method based on Streaming Topology to GPUs. In Proceedings of the 2016 International Conference on Management of Data, pages 447--461. ACM, 2016. Google ScholarDigital Library
- P. Kumar and H. H. Huang. G-store: high-performance graph store for trillion-edge processing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 71. IEEE Press, 2016. Google ScholarDigital Library
- J. Leskovec, D. Chakrabarti, J. Kleinberg, and C. Faloutsos. Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication. In PKDD, pages 133--145. Springer, 2005.Google ScholarCross Ref
- J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. JMLR, 11:985--1042, 2010. Google ScholarDigital Library
- J. Lothian, S. Powers, B. D. Sullivan, M. Baker, J. Schrock, and S. W. Poole. Synthetic Graph Generation for Data-Intensive HPC Benchmarking: Background and Framework. 2013.Google Scholar
- A. Pavlo, C. Curino, and S. Zdonik. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 61--72. ACM, 2012. Google ScholarDigital Library
- M.-D. Pham, P. Boncz, and O. Erling. S3G2: A scalable structure-correlated social graph generator. In Technology Conference on Performance Evaluation and Benchmarking, pages 156--172. Springer, 2012.Google Scholar
- A. Quamar, K. A. Kumar, and A. Deshpande. SWORD: scalable workload-aware data placement for transactional workloads. In Proceedings of the 16th International Conference on Extending Database Technology, pages 430--441. ACM, 2013. Google ScholarDigital Library
- A. Roy, L. Bindschaedler, J. Malicevic, and W. Zwaenepoel. Chaos: Scale-out graph processing from secondary storage. In Proceedings of the 25th Symposium on Operating Systems Principles, pages 410--424. ACM, 2015. Google ScholarDigital Library
- M. Schmidt, T. Hornung, G. Lausen, and C. Pinkel. SP 2 Bench: a SPARQL performance benchmark. In Data Engineering, 2009. ICDE'09. IEEE 25th International Conference on, pages 222--233. IEEE, 2009. Google ScholarDigital Library
- C. Seshadhri, A. Pinar, and T. G. Kolda. An in-depth study of stochastic Kronecker graphs. In ICDM, pages 587--596. IEEE, 2011. Google ScholarDigital Library
- K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In MSST, pages 1--10. IEEE, 2010. Google ScholarDigital Library
- C. L. Staudt, A. Sazonovs, and H. Meyerhenke. Networkit: An interactive tool suite for high-performance network analysis. arXiv preprint arXiv:1403.3005, 2014.Google Scholar
- M. Wu, F. Yang, J. Xue, W. Xiao, Y. Miao, L. Wei, H. Lin, Y. Dai, and L. Zhou. G ra M: scaling graph computation to the trillions. In Proceedings of the Sixth ACM Symposium on Cloud Computing, pages 408--421. ACM, 2015. Google ScholarDigital Library
- J. Zhang and Y. Tay. GSCALER: Synthetically scaling a given graph. In EDBT, pages 53--64, 2016.Google Scholar
- G. K. Zipf. Human behavior and the principle of least effort: An introduction to human ecology. 2016Google Scholar
Index Terms
- TrillionG: A Trillion-scale Synthetic Graph Generator using a Recursive Vector Model
Recommendations
EvoGraph: An Effective and Efficient Graph Upscaling Method for Preserving Graph Properties
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data MiningNowadays, many researchers and industry groups often suffer from the lack of a variety of large-scale real graphs. Although a lot of synthetic graph generation methods,(or models) such as RMAT and BA have been developed, their output graphs tend to be ...
Exhaustive generation of k‐critical H‐free graphs
AbstractWe describe an algorithm for generating all k‐critical H‐free graphs, based on a method of Hoàng et al. A graph G is k‐critical H‐free if G is H‐free, k‐chromatic, and every H‐free proper subgraph of G is k−1‐colorable. Using this algorithm, we ...
Exhaustive Generation of k-Critical $${\mathcal H}$$-Free Graphs
WG 2016: Revised Selected Papers of the 42nd International Workshop on Graph-Theoretic Concepts in Computer Science - Volume 9941We describe an algorithm for generating all k-critical $${\mathcal H}$$-free graphs, based on a method of Hoíng et al. Using this algorithm, we prove that there are only finitely many 4-critical $$P_7,C_k$$-free graphs, for both $$k=4$$ and $$k=5$$. We ...
Comments