skip to main content
research-article

Columnar storage and list-based processing for graph database management systems

Published:01 July 2021Publication History
Skip Abstract Section

Abstract

We revisit column-oriented storage and query processing techniques in the context of contemporary graph database management systems (GDBMSs). Similar to column-oriented RDBMSs, GDBMSs support read-heavy analytical workloads that however have fundamentally different data access patterns than traditional analytical workloads. We first derive a set of desiderata for optimizing storage and query processors of GDBMS based on their access patterns. We then present the design of columnar storage, compression, and query processing techniques based on these desiderata. In addition to showing direct integration of existing techniques from columnar RDBMSs, we also propose novel ones that are optimized for GDBMSs. These include a novel list-based query processor, which avoids expensive data copies of traditional block-based processors under many-to-many joins, a new data structure we call single-indexed edge property pages and an accompanying edge ID scheme, and a new application of Jacobson's bit vector index for compressing NULL values and empty lists. We integrated our techniques into the GraphflowDB in-memory GDBMS. Through extensive experiments, we demonstrate the scalability and query performance benefits of our techniques.

References

  1. Daniel J. Abadi. 2007. Column Stores for Wide and Sparse Data. In Third Biennial Conference on Innovative Data Systems Research, CIDR 2007. 292--297. http://cidrdb.org/cidr2007/papers/cidr07p33.pdfGoogle ScholarGoogle Scholar
  2. Daniel J. Abadi, Samuel Madden, and Miguel Ferreira. 2006. Integrating Compression and Execution in Column-Oriented Database Systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2006. 671--682. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Daniel J. Abadi, Samuel Madden, and Nabil Hachem. 2008. Column-Stores vs. Row-Stores: How Different Are They Really?. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008. 967--980. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Daniel J. Abadi, Adam Marcus, Samuel Madden, and Katherine J. Hollenbach. 2007. Scalable Semantic Web Data Management Using Vertical Partitioning. In Proceedings of the 33rd International Conference on Very Large Data. 411--422. http://www.vldb.org/conf/2007/papers/research/p411-abadi.pdf Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Amazon. 2020. Amazon Neptune. https://aws.amazon.com/neptune/. Last Accessed July 25, 2021.Google ScholarGoogle Scholar
  6. JanusGraph Authors. 2020. JanusGraph. https://janusgraph.org. Last Accessed July 25, 2021.Google ScholarGoogle Scholar
  7. Nurzhan Bakibayev, Tomás Kociský, Dan Olteanu, and Jakub Zavodny. 2013. Aggregation and Ordering in Factorised Databases. Proceedings of the VLDB Endowment 6, 14 (2013), 1990--2001. http://www.vldb.org/pvldb/vol6/p1990-zavodny.pdf Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Nurzhan Bakibayev, Dan Olteanu, and Jakub Zavodny. 2012. FDB: A Query Engine for Factorised Relational Databases. Proceedings of the VLDB Endowment 5, 11 (2012), 1232--1243. http://vldb.org/pvldb/vol5/p1232_nurzhanbakibayev_vldb2012.pdf Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Jennifer L. Beckmann, Alan Halverson, Rajasekar Krishnamurthy, and Jeffrey F. Naughton. 2006. Extending RDBMSs to Support Sparse Datasets using an Interpreted Attribute Storage Format. In Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006. 58. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Peter Boncz. 2002. Monet: A Next-Generation Database Kernel for Query-Intensive Applications. Ph.D. Dissertation. Universiteit van Amsterdam. https://ir.cwi.nl/pub/14832/14832A.pdfGoogle ScholarGoogle Scholar
  11. Peter A. Boncz, Marcin Zukowski, and Niels Nes. 2005. MonetDB/X100: Hyper-Pipelining Query Execution. In Second Biennial Conference on Innovative Data Systems Research, CIDR 2005. 225--237. http://cidrdb.org/cidr2005/papers/P19.pdfGoogle ScholarGoogle Scholar
  12. Angela Bonifati, George H. L. Fletcher, Hannes Voigt, and Nikolay Yakovets. 2018. Querying Graphs. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Angela Bonifati, Peter Furniss, Alastair Green, Russ Harmer, Eugenia Oshurko, and Hannes Voigt. 2019. Schema Validation and Evolution for Graph Databases. In Conceptual Modeling - 38th International Conference, ER 2019 (Lecture Notes in Computer Science), Vol. 11788. 448--456. Google ScholarGoogle ScholarCross RefCross Ref
  14. Prosenjit Bose, Meng He, Anil Maheshwari, and Pat Morin. 2009. Succinct Orthogonal Range Search Structures on a Grid with Applications to Text Indexing. In Algorithms and Data Structures, 11th International Symposium, WADS 2009 (Lecture Notes in Computer Science), Vol. 5664. 98--109. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. DGraph. 2020. DGraph Github Repository. https://github.com/dgraph-io/dgraph. Last Accessed July 25, 2021.Google ScholarGoogle Scholar
  16. Laxman Dhulipala, Guy E. Blelloch, and Julian Shun. 2019. Low-latency Graph Streaming using Compressed Purely-functional Trees. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019. 918--934. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. DuckDB. 2020. DuckDB. https://duckdb.org/. Last Accessed July 25, 2021.Google ScholarGoogle Scholar
  18. Orri Erling, Alex Averbuch, Josep Lluís Larriba-Pey, Hassan Chafi, Andrey Gubichev, Arnau Prat-Pérez, Minh-Duc Pham, and Peter A. Boncz. 2015. The LDBC Social Network Benchmark: Interactive Workload. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD 2015. 619--630. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. 2018. Cypher: An Evolving Query Language for Property Graphs. In Proceedings of the 2018 ACM SIGMOD International Conference on Management of Data, SIGMOD 2018. 1433--1445. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1998. Compressing Relations and Indexes. In Proceedings of the Fourteenth International Conference on Data Engineering, ICDE 1998. 370--379. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Gaston H. Gonnet, Ricardo A. Baeza-Yates, and Tim Snider. 1992. New Indices for Text: Pat Trees and Pat Arrays. In Information Retrieval: Data Structures & Algorithms. 66--82. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Goetz Graefe. 1994. Volcano - An Extensible and Parallel Query Evaluation System. IEEE Transactions on Knowledge and Data Engineering, TKDE 6, 1 (1994), 120--135. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. G. Graefe and L.D. Shapiro. 1991. Data Compression and Database Performance. In Proceedings of the 1991 Symposium on Applied Computing. 22--27. Google ScholarGoogle ScholarCross RefCross Ref
  24. Graphflow. 2020. GraphflowDB Source Code. https://github.com/queryproc/optimizing-subgraph-queries-combining-binary-and-worst-case-optimal-joins/. Last Accessed July 25, 2021.Google ScholarGoogle Scholar
  25. Graphflow. 2021. GraphflowDB Columnar Techniques. https://github.com/graphflow/graphflow-columnar-techniques. Last Accessed July 25, 2021.Google ScholarGoogle Scholar
  26. Pranjal Gupta, Amine Mhedhbi, and Semih Salihoglu. 2021. Columnar Storage and List-based Processing for Graph Database Management Systems. Technical Report. https://github.com/graphflow/graphflow-columnar-techniques/blob/master/paper.pdfGoogle ScholarGoogle Scholar
  27. Olaf Hartig and Jan Hidders. 2019. Defining Schemas for Property Graphs by using the GraphQL Schema Definition Language. In Proceedings of the 2nd Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), GRADES-NDA 2019. 6:1--6:11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Sándor Héman, Marcin Zukowski, Niels J. Nes, Lefteris Sidirourgos, and Peter A. Boncz. 2010. Positional Update Handling in Column Stores. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010. 543--554. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Stratos Idreos, Fabian Groffen, Niels Nes, Stefan Manegold, K. Sjoerd Mullender, and Martin L. Kersten. 2012. MonetDB: Two Decades of Research in Column-oriented Database Architectures. IEEE Data Engineering Bulletin 35, 1 (2012), 40--45. http://sites.computer.org/debull/A12mar/monetdb.pdfGoogle ScholarGoogle Scholar
  30. Guy Jacobson. 1989. Space-efficient Static Trees and Graphs. In 30th Annual Symposium on Foundations of Computer Science, FOCS 1989. 549--554. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Guy Jacobson. 1989. Succinct Static Data Structures. Ph.D. Dissertation. Carnegie Mellon University. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Chathura Kankanamge, Siddhartha Sahu, Amine Mhedhbi, Jeremy Chen, and Semih Salihoglu. 2017. Graphflow: An Active Graph Database. In Proceedings of the 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017. 1695--1698. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Jérôme Kunegis. 2013. KONECT: The Koblenz Network Collection. In 22nd International World Wide Web Conference, WWW 2013. 1343--1350. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Jérôme Kunegis. 2021. Wikipedia Dynamic (de), (Konect). http://konect.cc/networks/link-dynamic-dewiki/. Last Accessed July 25, 2021.Google ScholarGoogle Scholar
  35. Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter A. Boncz, Alfons Kemper, and Thomas Neumann. 2015. How Good Are Query Optimizers, Really? Proceedings of the VLDB Endowment 9, 3 (2015), 204--215. http://www.vldb.org/pvldb/vol9/p204-leis.pdf Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Daniel Lemire and Leonid Boytsov. 2015. Decoding Billions of Integers per Second through Vectorization. Software: Practice and Experience 45, 1 (2015), 1--29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. 2005. Graphs Over Time: Densification Laws, Shrinking Diameters and Possible Explanations. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, SIGKDD 2005. 177--187. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Chunbin Lin, Benjamin Mandel, Yannis Papakonstantinou, and Matthias Springer. 2016. Fast In-Memory SQL Analytics on Typed Graphs. Proceedings of the VLDB Endowment 10, 3 (2016), 265--276. http://www.vldb.org/pvldb/vol10/p265-lin.pdf Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Peter Macko, Virendra J. Marathe, Daniel W. Margo, and Margo I. Seltzer. 2015. LLAMA: Efficient graph analytics using Large Multiversioned Arrays. In 31st IEEE International Conference on Data Engineering, ICDE 2015. 363--374. Google ScholarGoogle ScholarCross RefCross Ref
  40. Memgraph. 2020. Memgraph. https://memgraph.com/. Last Accessed July 25, 2021.Google ScholarGoogle Scholar
  41. Amine Mhedhbi and Semih Salihoglu. 2019. Optimizing Subgraph Queries by Combining Binary and Worst-Case Optimal Joins. Proceedings of the VLDB Endowment 12, 11 (2019), 1692--1704. http://www.vldb.org/pvldb/vol12/p1692-mhedhbi.pdf Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Alan Mislove, Hema Swetha Koppula, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. 2008. Growth of the Flickr Social Network. In Proceedings of the first Workshop on Online Social Networks, WOSN 2008. 25--30. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. MonetDB. 2020. MonetDB source code, (Jun2020-SP1). https://github.com/MonetDB/MonetDB/releases/tag/Jun2020_SP1_release. Last Accessed July 25, 2021.Google ScholarGoogle Scholar
  44. Gonzalo Navarro and Veli Mäkinen. 2007. Compressed Full-Text Indexes. Comput. Surveys 39, 1 (2007), 2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Gonzalo Navarro, Yakov Nekrich, and Luís M. S. Russo. 2013. Space-efficient Data-Analysis Queries on Grids. Theoretical Computer Science 482 (2013), 60--72. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Neo4j. 2020. Neo4j Blog on Deletions. https://neo4j.com/developer/kb/how-deletes-workin-neo4j/. Last Accessed July 25, 2021.Google ScholarGoogle Scholar
  47. Neo4j. 2020. Neo4j Community Edition. https://neo4j.com/download-center/#community. Last Accessed July 25, 2021.Google ScholarGoogle Scholar
  48. Neo4j. 2020. Neo4j Property Graph Model. https://neo4j.com/developer/graph-database. Last Accessed July 25, 2021.Google ScholarGoogle Scholar
  49. Thomas Neumann and Guido Moerkotte. 2011. Characteristic Sets: Accurate Cardinality Estimation for RDF Queries with Multiple Joins. In Proceedings of the 27th International Conference on Data Engineering, ICDE 2011. 984--994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Thomas Neumann and Gerhard Weikum. 2010. The RDF-3X Engine for Scalable Management of RDF Data. VLDB Journal 19, 1 (2010), 91--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Dan Olteanu and Maximilian Schleich. 2016. Factorized Databases. SIGMOD Record 45, 2 (2016), 5--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Dan Olteanu and Jakub Závodný. 2015. Size Bounds for Factorised Representations of Query Results. ACM Transactions on Database Systems (TODS) 40, 1 (2015), 2:1--2:44. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Oracle. 2020. Oracle In-Memory Column Store Architecture. https://tinyurl.com/vkvb6p6. Last Accessed July 25, 2021.Google ScholarGoogle Scholar
  54. Oracle. 2020. Oracle Spatial and Graph. https://www.oracle.com/database/technologies/spatialandgraph.html. Last Accessed July 25, 2021.Google ScholarGoogle Scholar
  55. Michael Rudolf, Marcus Paradies, Christof Bornhövd, and Wolfgang Lehner. 2013. The Graph Story of the SAP HANA Database. In Datenbanksysteme für Business, Technologie und Web (BTW), 15. Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme" (DBIS) (LNI), Vol. P-214. 403--420. https://dl.gi.de/20.500.12116/17334Google ScholarGoogle Scholar
  56. Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, and M. Tamer Özsu. 2020. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing: Extended Survey. VLDB Journal 29, 2--3 (2020), 595--618. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth J. O'Neil, Patrick E. O'Neil, Alex Rasin, Nga Tran, and Stan Zdonik. 2019. C-store: A Column-Oriented DBMS. In Making Databases Work: the Pragmatic Wisdom of Michael Stonebraker. 491--518. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Yuanyuan Tian, En Liang Xu, Wei Zhao, Mir Hamid Pirahesh, Suijun Tong, Wen Sun, Thomas Kolanko, Md. Shahidul Haque Apu, and Huijuan Peng. 2020. IBM Db2 Graph: Supporting Synergistic and Retrofittable Graph Queries Inside IBM Db2. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020. 345--359. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. TigerGraph. 2020. TigerGraphDB. https://www.tigergraph.com. Last Accessed July 25, 2021.Google ScholarGoogle Scholar
  60. Syunsuke Uemura, Toshitsugu Yuba, Akio Kokubu, Ryoichi Ooomote, and Yasuo Sugawara. 1980. The Design and Implementaion of a Magnetic-Bubble Database Machine. In Information Processing, Proceedings of the 8th IFIP Congress 1980. 433--438.Google ScholarGoogle Scholar
  61. Vertica. 2020. Vertica 10.0.x Documentation. https://www.vertica.com/docs/10.0.x/HTML/Content/Home.html. Last Accessed July 25, 2021.Google ScholarGoogle Scholar
  62. Cathrin Weiss, Panagiotis Karras, and Abraham Bernstein. 2008. Hexastore: Sextuple Indexing for Semantic Web Data Management. Proceedings of the VLDB Endowment 1, 1 (2008), 1008--1019. http://www.vldb.org/pvldb/vol1/1453965.pdf Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Till Westmann, Donald Kossmann, Sven Helmer, and Guido Moerkotte. 2000. The Implementation and Performance of Compressed Databases. SIGMOD Record 29, 3 (2000), 55--67. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. Huanchen Zhang, Hyeontaek Lim, Viktor Leis, David G. Andersen, Michael Kaminsky, Kimberly Keeton, and Andrew Pavlo. 2020. Succinct Range Filters. ACM Transactions on Database Systems (TODS) 45, 2 (2020), 5:1--5:31. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. Xiaowei Zhu, Marco Serafini, Xiaosong Ma, Ashraf Aboulnaga, Wenguang Chen, and Guanyu Feng. 2020. LiveGraph: A Transactional Graph Storage System with Purely Sequential Adjacency List Scans. Proceedings of the VLDB Endowment 13, 7 (2020), 1020--1034. http://www.vldb.org/pvldb/vol13/p1020-zhu.pdf Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. Marcin Zukowski and Peter A. Boncz. 2012. From X100 to Vectorwise: Opportunities, Challenges and Things Most Researchers Do Not Think About. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012. 861--862. Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. Marcin Zukowski and Peter A. Boncz. 2012. Vectorwise: Beyond Column Stores. IEEE Data Engineering Bulletin 35, 1 (2012), 21--27. http://sites.computer.org/debull/A12mar/vectorwise.pdfGoogle ScholarGoogle Scholar
  68. Marcin Zukowski, Sándor Héman, Niels Nes, and Peter A. Boncz. 2006. Super-Scalar RAM-CPU Cache Compression. In Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006. 59. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Columnar storage and list-based processing for graph database management systems
          Index terms have been assigned to the content through auto-classification.

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image Proceedings of the VLDB Endowment
            Proceedings of the VLDB Endowment  Volume 14, Issue 11
            July 2021
            732 pages
            ISSN:2150-8097
            Issue’s Table of Contents

            Publisher

            VLDB Endowment

            Publication History

            • Published: 1 July 2021
            Published in pvldb Volume 14, Issue 11

            Qualifiers

            • research-article

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader