skip to main content
10.1145/3035918.3064014acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

TrillionG: A Trillion-scale Synthetic Graph Generator using a Recursive Vector Model

Authors Info & Claims
Published:09 May 2017Publication History

ABSTRACT

As many applications encounter exponential growth in graph sizes, a fast and scalable graph generator has become more important than ever before due to lack of large-scale realistic graphs for evaluating the performance of graph processing methods. Although there have been proposed a number of methods to generate synthetic graphs, they are not very efficient in terms of space and time complexities, and so, cannot generate even trillion-scale graphs using a moderate size cluster of commodity machines. Here, we propose an efficient and scalable disk-based graph generator, TrillionG that can generate massive graphs in a short time only using a small amount of memory. It can generate a graph of a trillion edges following the RMAT or Kronecker models within two hours only using 10 PCs. We first generalize existing graph generation models to the scope-based generation model, where RMAT and Kronecker correspond to two extremes. Then, we propose a new graph generation model called the recursive vector model, which compromises two extremes, and so, solves the space and time complexity problems existing in RMAT and Kronecker. We also extend the recursive vector model so as to generate a semantically richer graph database. Through extensive experiments, we have demonstrated that TrillionG outperforms the state-of-the-art graph generators by up to orders of magnitude.

References

  1. Amazon S3. https://aws.amazon.com/s3.Google ScholarGoogle Scholar
  2. Apache HBase. http://hbase.apache.org/.Google ScholarGoogle Scholar
  3. Apache Spark. http://spark.apache.org/.Google ScholarGoogle Scholar
  4. Graph500 benchmark. http://www.graph500.org/.Google ScholarGoogle Scholar
  5. Hadoop implementation of RMAT-WES/P. https://github.com/paulmw/rmat.Google ScholarGoogle Scholar
  6. Human Connectome Project. http://www.humanconnectomeproject.org/.Google ScholarGoogle Scholar
  7. Rich Data and the Increasing Value of the Internet of Things. http://www.emc.com/leadership/digital-universe/2014iview/internet-of-things.htm.Google ScholarGoogle Scholar
  8. Stanford Network Analysis Project. http://snap.stanford.edu/.Google ScholarGoogle Scholar
  9. G. Aluç, O. Hartig, M. T. Özsu, and K. Daudjee. Diversified stress testing of RDF data management systems. In International Semantic Web Conference, pages 197--212. Springer, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. G. Bagan, A. Bonifati, R. Ciucanu, G. H. Fletcher, A. Lemay, and N. Advokaat. gMark: schema-driven generation of graphs and queries. IEEE Transactions on Knowledge and Data Engineering, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A.-L. Barabási and R. Albert. Emergence of scaling in random networks. science, 286(5439):509--512, 1999.Google ScholarGoogle Scholar
  12. D. Borthakur. The Hadoop distributed file system: Architecture and design. Hadoop Project Website, 11(2007):21, 2007.Google ScholarGoogle Scholar
  13. U. K. ByungSoo Jeon, Inah jeon. TeGViz: Distributed Tera-Scale Graph Generation and Visualization. In ICDM, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A Recursive Model for Graph Mining. In SDM, volume 4, pages 442--446. SIAM, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  15. F. Checconi and F. Petrini. Traversing trillions of edges in real time: Graph exploration on large-scale parallel machines. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 425--434. IEEE, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan. One trillion edges: graph processing at Facebook-scale. VLDB, 8(12):1804--1815, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. C. Curino, E. Jones, Y. Zhang, and S. Madden. Schism: a workload-driven approach to database replication and partitioning. Proceedings of the VLDB Endowment, 3(1--2):48--57, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. C. Dobre and F. Xhafa. Intelligent services for big data science. Future Generation Computer Systems, 37:267--281, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  19. P. Erdős and A. Rényi. On random graphs. Publicationes Mathematicae Debrecen, 6:290--297, 1959.Google ScholarGoogle ScholarCross RefCross Ref
  20. O. Erling, A. Averbuch, J. Larriba-Pey, H. Chafi, A. Gubichev, A. Prat, M.-D. Pham, and P. Boncz. The LDBC social network benchmark: Interactive workload. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 619--630. ACM, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. E. N. Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):1141--1144, 1959.Google ScholarGoogle ScholarCross RefCross Ref
  22. J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In OSDI, pages 599--613, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. A. Hadian, S. Nobari, B. Minaei-Bidgoli, and Q. Qu. ROLL: Fast In-Memory Generation of Gigantic Scale-free Networks. In Proceedings of the 2016 International Conference on Management of Data, pages 1829--1842. ACM, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. A. Iosup, T. Hegeman, W. L. Ngai, S. Heldens, A. Prat-Pérez, T Manhardto, H. Chafio, M. Capotă, N. Sundaram, M. Anderson, et al. LDBC graphalytics: A benchmark for large-scale graph analysis on parallel and distributed platforms. volume 9, pages 1317--1328. VLDB Endowment, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. M.-S. Kim, K. An, H. Park, H. Seo, and J. Kim. GTS: A Fast and Scalable Graph Processing Method based on Streaming Topology to GPUs. In Proceedings of the 2016 International Conference on Management of Data, pages 447--461. ACM, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. P. Kumar and H. H. Huang. G-store: high-performance graph store for trillion-edge processing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 71. IEEE Press, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. J. Leskovec, D. Chakrabarti, J. Kleinberg, and C. Faloutsos. Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication. In PKDD, pages 133--145. Springer, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  28. J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. JMLR, 11:985--1042, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. J. Lothian, S. Powers, B. D. Sullivan, M. Baker, J. Schrock, and S. W. Poole. Synthetic Graph Generation for Data-Intensive HPC Benchmarking: Background and Framework. 2013.Google ScholarGoogle Scholar
  30. A. Pavlo, C. Curino, and S. Zdonik. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 61--72. ACM, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. M.-D. Pham, P. Boncz, and O. Erling. S3G2: A scalable structure-correlated social graph generator. In Technology Conference on Performance Evaluation and Benchmarking, pages 156--172. Springer, 2012.Google ScholarGoogle Scholar
  32. A. Quamar, K. A. Kumar, and A. Deshpande. SWORD: scalable workload-aware data placement for transactional workloads. In Proceedings of the 16th International Conference on Extending Database Technology, pages 430--441. ACM, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. A. Roy, L. Bindschaedler, J. Malicevic, and W. Zwaenepoel. Chaos: Scale-out graph processing from secondary storage. In Proceedings of the 25th Symposium on Operating Systems Principles, pages 410--424. ACM, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. M. Schmidt, T. Hornung, G. Lausen, and C. Pinkel. SP 2 Bench: a SPARQL performance benchmark. In Data Engineering, 2009. ICDE'09. IEEE 25th International Conference on, pages 222--233. IEEE, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. C. Seshadhri, A. Pinar, and T. G. Kolda. An in-depth study of stochastic Kronecker graphs. In ICDM, pages 587--596. IEEE, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In MSST, pages 1--10. IEEE, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. C. L. Staudt, A. Sazonovs, and H. Meyerhenke. Networkit: An interactive tool suite for high-performance network analysis. arXiv preprint arXiv:1403.3005, 2014.Google ScholarGoogle Scholar
  38. M. Wu, F. Yang, J. Xue, W. Xiao, Y. Miao, L. Wei, H. Lin, Y. Dai, and L. Zhou. G ra M: scaling graph computation to the trillions. In Proceedings of the Sixth ACM Symposium on Cloud Computing, pages 408--421. ACM, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. J. Zhang and Y. Tay. GSCALER: Synthetically scaling a given graph. In EDBT, pages 53--64, 2016.Google ScholarGoogle Scholar
  40. G. K. Zipf. Human behavior and the principle of least effort: An introduction to human ecology. 2016Google ScholarGoogle Scholar

Index Terms

  1. TrillionG: A Trillion-scale Synthetic Graph Generator using a Recursive Vector Model

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
      May 2017
      1810 pages
      ISBN:9781450341974
      DOI:10.1145/3035918

      Copyright © 2017 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 9 May 2017

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader