Abstract
We revisit column-oriented storage and query processing techniques in the context of contemporary graph database management systems (GDBMSs). Similar to column-oriented RDBMSs, GDBMSs support read-heavy analytical workloads that however have fundamentally different data access patterns than traditional analytical workloads. We first derive a set of desiderata for optimizing storage and query processors of GDBMS based on their access patterns. We then present the design of columnar storage, compression, and query processing techniques based on these desiderata. In addition to showing direct integration of existing techniques from columnar RDBMSs, we also propose novel ones that are optimized for GDBMSs. These include a novel list-based query processor, which avoids expensive data copies of traditional block-based processors under many-to-many joins, a new data structure we call single-indexed edge property pages and an accompanying edge ID scheme, and a new application of Jacobson's bit vector index for compressing NULL values and empty lists. We integrated our techniques into the GraphflowDB in-memory GDBMS. Through extensive experiments, we demonstrate the scalability and query performance benefits of our techniques.
- Daniel J. Abadi. 2007. Column Stores for Wide and Sparse Data. In Third Biennial Conference on Innovative Data Systems Research, CIDR 2007. 292--297. http://cidrdb.org/cidr2007/papers/cidr07p33.pdfGoogle Scholar
- Daniel J. Abadi, Samuel Madden, and Miguel Ferreira. 2006. Integrating Compression and Execution in Column-Oriented Database Systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2006. 671--682. Google ScholarDigital Library
- Daniel J. Abadi, Samuel Madden, and Nabil Hachem. 2008. Column-Stores vs. Row-Stores: How Different Are They Really?. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008. 967--980. Google ScholarDigital Library
- Daniel J. Abadi, Adam Marcus, Samuel Madden, and Katherine J. Hollenbach. 2007. Scalable Semantic Web Data Management Using Vertical Partitioning. In Proceedings of the 33rd International Conference on Very Large Data. 411--422. http://www.vldb.org/conf/2007/papers/research/p411-abadi.pdf Google ScholarDigital Library
- Amazon. 2020. Amazon Neptune. https://aws.amazon.com/neptune/. Last Accessed July 25, 2021.Google Scholar
- JanusGraph Authors. 2020. JanusGraph. https://janusgraph.org. Last Accessed July 25, 2021.Google Scholar
- Nurzhan Bakibayev, Tomás Kociský, Dan Olteanu, and Jakub Zavodny. 2013. Aggregation and Ordering in Factorised Databases. Proceedings of the VLDB Endowment 6, 14 (2013), 1990--2001. http://www.vldb.org/pvldb/vol6/p1990-zavodny.pdf Google ScholarDigital Library
- Nurzhan Bakibayev, Dan Olteanu, and Jakub Zavodny. 2012. FDB: A Query Engine for Factorised Relational Databases. Proceedings of the VLDB Endowment 5, 11 (2012), 1232--1243. http://vldb.org/pvldb/vol5/p1232_nurzhanbakibayev_vldb2012.pdf Google ScholarDigital Library
- Jennifer L. Beckmann, Alan Halverson, Rajasekar Krishnamurthy, and Jeffrey F. Naughton. 2006. Extending RDBMSs to Support Sparse Datasets using an Interpreted Attribute Storage Format. In Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006. 58. Google ScholarDigital Library
- Peter Boncz. 2002. Monet: A Next-Generation Database Kernel for Query-Intensive Applications. Ph.D. Dissertation. Universiteit van Amsterdam. https://ir.cwi.nl/pub/14832/14832A.pdfGoogle Scholar
- Peter A. Boncz, Marcin Zukowski, and Niels Nes. 2005. MonetDB/X100: Hyper-Pipelining Query Execution. In Second Biennial Conference on Innovative Data Systems Research, CIDR 2005. 225--237. http://cidrdb.org/cidr2005/papers/P19.pdfGoogle Scholar
- Angela Bonifati, George H. L. Fletcher, Hannes Voigt, and Nikolay Yakovets. 2018. Querying Graphs. Google ScholarDigital Library
- Angela Bonifati, Peter Furniss, Alastair Green, Russ Harmer, Eugenia Oshurko, and Hannes Voigt. 2019. Schema Validation and Evolution for Graph Databases. In Conceptual Modeling - 38th International Conference, ER 2019 (Lecture Notes in Computer Science), Vol. 11788. 448--456. Google ScholarCross Ref
- Prosenjit Bose, Meng He, Anil Maheshwari, and Pat Morin. 2009. Succinct Orthogonal Range Search Structures on a Grid with Applications to Text Indexing. In Algorithms and Data Structures, 11th International Symposium, WADS 2009 (Lecture Notes in Computer Science), Vol. 5664. 98--109. Google ScholarDigital Library
- DGraph. 2020. DGraph Github Repository. https://github.com/dgraph-io/dgraph. Last Accessed July 25, 2021.Google Scholar
- Laxman Dhulipala, Guy E. Blelloch, and Julian Shun. 2019. Low-latency Graph Streaming using Compressed Purely-functional Trees. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019. 918--934. Google ScholarDigital Library
- DuckDB. 2020. DuckDB. https://duckdb.org/. Last Accessed July 25, 2021.Google Scholar
- Orri Erling, Alex Averbuch, Josep Lluís Larriba-Pey, Hassan Chafi, Andrey Gubichev, Arnau Prat-Pérez, Minh-Duc Pham, and Peter A. Boncz. 2015. The LDBC Social Network Benchmark: Interactive Workload. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD 2015. 619--630. Google ScholarDigital Library
- Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. 2018. Cypher: An Evolving Query Language for Property Graphs. In Proceedings of the 2018 ACM SIGMOD International Conference on Management of Data, SIGMOD 2018. 1433--1445. Google ScholarDigital Library
- Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1998. Compressing Relations and Indexes. In Proceedings of the Fourteenth International Conference on Data Engineering, ICDE 1998. 370--379. Google ScholarDigital Library
- Gaston H. Gonnet, Ricardo A. Baeza-Yates, and Tim Snider. 1992. New Indices for Text: Pat Trees and Pat Arrays. In Information Retrieval: Data Structures & Algorithms. 66--82. Google ScholarDigital Library
- Goetz Graefe. 1994. Volcano - An Extensible and Parallel Query Evaluation System. IEEE Transactions on Knowledge and Data Engineering, TKDE 6, 1 (1994), 120--135. Google ScholarDigital Library
- G. Graefe and L.D. Shapiro. 1991. Data Compression and Database Performance. In Proceedings of the 1991 Symposium on Applied Computing. 22--27. Google ScholarCross Ref
- Graphflow. 2020. GraphflowDB Source Code. https://github.com/queryproc/optimizing-subgraph-queries-combining-binary-and-worst-case-optimal-joins/. Last Accessed July 25, 2021.Google Scholar
- Graphflow. 2021. GraphflowDB Columnar Techniques. https://github.com/graphflow/graphflow-columnar-techniques. Last Accessed July 25, 2021.Google Scholar
- Pranjal Gupta, Amine Mhedhbi, and Semih Salihoglu. 2021. Columnar Storage and List-based Processing for Graph Database Management Systems. Technical Report. https://github.com/graphflow/graphflow-columnar-techniques/blob/master/paper.pdfGoogle Scholar
- Olaf Hartig and Jan Hidders. 2019. Defining Schemas for Property Graphs by using the GraphQL Schema Definition Language. In Proceedings of the 2nd Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), GRADES-NDA 2019. 6:1--6:11. Google ScholarDigital Library
- Sándor Héman, Marcin Zukowski, Niels J. Nes, Lefteris Sidirourgos, and Peter A. Boncz. 2010. Positional Update Handling in Column Stores. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010. 543--554. Google ScholarDigital Library
- Stratos Idreos, Fabian Groffen, Niels Nes, Stefan Manegold, K. Sjoerd Mullender, and Martin L. Kersten. 2012. MonetDB: Two Decades of Research in Column-oriented Database Architectures. IEEE Data Engineering Bulletin 35, 1 (2012), 40--45. http://sites.computer.org/debull/A12mar/monetdb.pdfGoogle Scholar
- Guy Jacobson. 1989. Space-efficient Static Trees and Graphs. In 30th Annual Symposium on Foundations of Computer Science, FOCS 1989. 549--554. Google ScholarDigital Library
- Guy Jacobson. 1989. Succinct Static Data Structures. Ph.D. Dissertation. Carnegie Mellon University. Google ScholarDigital Library
- Chathura Kankanamge, Siddhartha Sahu, Amine Mhedhbi, Jeremy Chen, and Semih Salihoglu. 2017. Graphflow: An Active Graph Database. In Proceedings of the 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017. 1695--1698. Google ScholarDigital Library
- Jérôme Kunegis. 2013. KONECT: The Koblenz Network Collection. In 22nd International World Wide Web Conference, WWW 2013. 1343--1350. Google ScholarDigital Library
- Jérôme Kunegis. 2021. Wikipedia Dynamic (de), (Konect). http://konect.cc/networks/link-dynamic-dewiki/. Last Accessed July 25, 2021.Google Scholar
- Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter A. Boncz, Alfons Kemper, and Thomas Neumann. 2015. How Good Are Query Optimizers, Really? Proceedings of the VLDB Endowment 9, 3 (2015), 204--215. http://www.vldb.org/pvldb/vol9/p204-leis.pdf Google ScholarDigital Library
- Daniel Lemire and Leonid Boytsov. 2015. Decoding Billions of Integers per Second through Vectorization. Software: Practice and Experience 45, 1 (2015), 1--29. Google ScholarDigital Library
- Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. 2005. Graphs Over Time: Densification Laws, Shrinking Diameters and Possible Explanations. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, SIGKDD 2005. 177--187. Google ScholarDigital Library
- Chunbin Lin, Benjamin Mandel, Yannis Papakonstantinou, and Matthias Springer. 2016. Fast In-Memory SQL Analytics on Typed Graphs. Proceedings of the VLDB Endowment 10, 3 (2016), 265--276. http://www.vldb.org/pvldb/vol10/p265-lin.pdf Google ScholarDigital Library
- Peter Macko, Virendra J. Marathe, Daniel W. Margo, and Margo I. Seltzer. 2015. LLAMA: Efficient graph analytics using Large Multiversioned Arrays. In 31st IEEE International Conference on Data Engineering, ICDE 2015. 363--374. Google ScholarCross Ref
- Memgraph. 2020. Memgraph. https://memgraph.com/. Last Accessed July 25, 2021.Google Scholar
- Amine Mhedhbi and Semih Salihoglu. 2019. Optimizing Subgraph Queries by Combining Binary and Worst-Case Optimal Joins. Proceedings of the VLDB Endowment 12, 11 (2019), 1692--1704. http://www.vldb.org/pvldb/vol12/p1692-mhedhbi.pdf Google ScholarDigital Library
- Alan Mislove, Hema Swetha Koppula, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. 2008. Growth of the Flickr Social Network. In Proceedings of the first Workshop on Online Social Networks, WOSN 2008. 25--30. Google ScholarDigital Library
- MonetDB. 2020. MonetDB source code, (Jun2020-SP1). https://github.com/MonetDB/MonetDB/releases/tag/Jun2020_SP1_release. Last Accessed July 25, 2021.Google Scholar
- Gonzalo Navarro and Veli Mäkinen. 2007. Compressed Full-Text Indexes. Comput. Surveys 39, 1 (2007), 2. Google ScholarDigital Library
- Gonzalo Navarro, Yakov Nekrich, and Luís M. S. Russo. 2013. Space-efficient Data-Analysis Queries on Grids. Theoretical Computer Science 482 (2013), 60--72. Google ScholarDigital Library
- Neo4j. 2020. Neo4j Blog on Deletions. https://neo4j.com/developer/kb/how-deletes-workin-neo4j/. Last Accessed July 25, 2021.Google Scholar
- Neo4j. 2020. Neo4j Community Edition. https://neo4j.com/download-center/#community. Last Accessed July 25, 2021.Google Scholar
- Neo4j. 2020. Neo4j Property Graph Model. https://neo4j.com/developer/graph-database. Last Accessed July 25, 2021.Google Scholar
- Thomas Neumann and Guido Moerkotte. 2011. Characteristic Sets: Accurate Cardinality Estimation for RDF Queries with Multiple Joins. In Proceedings of the 27th International Conference on Data Engineering, ICDE 2011. 984--994. Google ScholarDigital Library
- Thomas Neumann and Gerhard Weikum. 2010. The RDF-3X Engine for Scalable Management of RDF Data. VLDB Journal 19, 1 (2010), 91--113. Google ScholarDigital Library
- Dan Olteanu and Maximilian Schleich. 2016. Factorized Databases. SIGMOD Record 45, 2 (2016), 5--16. Google ScholarDigital Library
- Dan Olteanu and Jakub Závodný. 2015. Size Bounds for Factorised Representations of Query Results. ACM Transactions on Database Systems (TODS) 40, 1 (2015), 2:1--2:44. Google ScholarDigital Library
- Oracle. 2020. Oracle In-Memory Column Store Architecture. https://tinyurl.com/vkvb6p6. Last Accessed July 25, 2021.Google Scholar
- Oracle. 2020. Oracle Spatial and Graph. https://www.oracle.com/database/technologies/spatialandgraph.html. Last Accessed July 25, 2021.Google Scholar
- Michael Rudolf, Marcus Paradies, Christof Bornhövd, and Wolfgang Lehner. 2013. The Graph Story of the SAP HANA Database. In Datenbanksysteme für Business, Technologie und Web (BTW), 15. Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme" (DBIS) (LNI), Vol. P-214. 403--420. https://dl.gi.de/20.500.12116/17334Google Scholar
- Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, and M. Tamer Özsu. 2020. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing: Extended Survey. VLDB Journal 29, 2--3 (2020), 595--618. Google ScholarDigital Library
- Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth J. O'Neil, Patrick E. O'Neil, Alex Rasin, Nga Tran, and Stan Zdonik. 2019. C-store: A Column-Oriented DBMS. In Making Databases Work: the Pragmatic Wisdom of Michael Stonebraker. 491--518. Google ScholarDigital Library
- Yuanyuan Tian, En Liang Xu, Wei Zhao, Mir Hamid Pirahesh, Suijun Tong, Wen Sun, Thomas Kolanko, Md. Shahidul Haque Apu, and Huijuan Peng. 2020. IBM Db2 Graph: Supporting Synergistic and Retrofittable Graph Queries Inside IBM Db2. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020. 345--359. Google ScholarDigital Library
- TigerGraph. 2020. TigerGraphDB. https://www.tigergraph.com. Last Accessed July 25, 2021.Google Scholar
- Syunsuke Uemura, Toshitsugu Yuba, Akio Kokubu, Ryoichi Ooomote, and Yasuo Sugawara. 1980. The Design and Implementaion of a Magnetic-Bubble Database Machine. In Information Processing, Proceedings of the 8th IFIP Congress 1980. 433--438.Google Scholar
- Vertica. 2020. Vertica 10.0.x Documentation. https://www.vertica.com/docs/10.0.x/HTML/Content/Home.html. Last Accessed July 25, 2021.Google Scholar
- Cathrin Weiss, Panagiotis Karras, and Abraham Bernstein. 2008. Hexastore: Sextuple Indexing for Semantic Web Data Management. Proceedings of the VLDB Endowment 1, 1 (2008), 1008--1019. http://www.vldb.org/pvldb/vol1/1453965.pdf Google ScholarDigital Library
- Till Westmann, Donald Kossmann, Sven Helmer, and Guido Moerkotte. 2000. The Implementation and Performance of Compressed Databases. SIGMOD Record 29, 3 (2000), 55--67. Google ScholarDigital Library
- Huanchen Zhang, Hyeontaek Lim, Viktor Leis, David G. Andersen, Michael Kaminsky, Kimberly Keeton, and Andrew Pavlo. 2020. Succinct Range Filters. ACM Transactions on Database Systems (TODS) 45, 2 (2020), 5:1--5:31. Google ScholarDigital Library
- Xiaowei Zhu, Marco Serafini, Xiaosong Ma, Ashraf Aboulnaga, Wenguang Chen, and Guanyu Feng. 2020. LiveGraph: A Transactional Graph Storage System with Purely Sequential Adjacency List Scans. Proceedings of the VLDB Endowment 13, 7 (2020), 1020--1034. http://www.vldb.org/pvldb/vol13/p1020-zhu.pdf Google ScholarDigital Library
- Marcin Zukowski and Peter A. Boncz. 2012. From X100 to Vectorwise: Opportunities, Challenges and Things Most Researchers Do Not Think About. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012. 861--862. Google ScholarDigital Library
- Marcin Zukowski and Peter A. Boncz. 2012. Vectorwise: Beyond Column Stores. IEEE Data Engineering Bulletin 35, 1 (2012), 21--27. http://sites.computer.org/debull/A12mar/vectorwise.pdfGoogle Scholar
- Marcin Zukowski, Sándor Héman, Niels Nes, and Peter A. Boncz. 2006. Super-Scalar RAM-CPU Cache Compression. In Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006. 59. Google ScholarDigital Library
Index Terms
- Columnar storage and list-based processing for graph database management systems
Recommendations
Relational Database Management Systems: The Business Explosion
This special issue (part 2 of a series began with the special issue in October-December 2012) tells the history of how IBM and several new, independent software companies built companies that supplanted the database management system companies and their ...
Comments