1 Introduction
- We extend a unified benchmark for both RDBMSs and GDBMSs to evaluate them under the same datasets as well as the same metrics.
- We propose a graph-to-relation inter-mapping scheme under which graph data and relational data are inter-transformed. We rewrite all SQL queries in TPC-H to graph queries and rewrite all graph queries in LDBC to SQL queries.
- We evaluate the performance of the GDBMSs using different back-end storage engines and report our findings.
- We conduct extensive experimental evaluations for existing popular RDBMSs and GDBMSs over both standard TPC-H and LDBC, and report our findings in detail.
2 Related Work
3 A Unified Benchmark for RDBMSs and GDBMSs
- We utilize the standard RDBMSs benchmark, TPC-H, and extend it to evaluate the performance for GDBMSs.
- Similarly, we utilize a widely used GDBMSs benchmark, LDBC, and extend it to evaluate the performance for RDBMSs.
3.1 Data Generation
3.1.1 Data Generation Schemes
RDBMSs table | GDBMSs vertex |
---|---|
PART | Part |
SUPPLIER | Supplier |
PARTSUPP | Partsupp |
LINEITEM | Lineitem |
ORDERS | Orders |
CUSTOMER | Customer |
NATION | Nation |
REGION | Region |
From key | To key | Relationship |
---|---|---|
PART.PARTKEY | PARTSUPP.PARTKEY | Part2partsupp |
SUPPLIER.SUPPKEY | PARTSUPP.SUPPKEY | Supplier2partsupp |
PARTSUPP.PARTSUPPKEY | LINEITEM.PARTSUPPKEY | Partsupp2lineitem |
ORDERS.ORDERKEY | LINEITEM.ORDERKEY | orders2lineitem |
CUSTOMER.CUSTKEY | ORDERS.CUSTKEY | Customer2orders |
NATION.NATIONKEY | SUPPLIER.NATIONKEY | Nation2supplier |
NATION.NATIONKEY | CUSTOMER.NATIONKEY | Nation2customer |
REGION.REGIONKEY | NATION.REGIONKEY | Region2nation |
3.1.2 Datasets to be Used
ID | Size | Vertices | Edges |
---|---|---|---|
tpch-0.05 | 50 MB | 432,844 | 2,261,723 |
tpch-0.1 | 100 MB | 866,602 | 4,530,029 |
tpch-0.5 | 500 MB | 4,330,622 | 22,634,256 |
tpch-1 | 1 GB | 8,661,245 | 45,268,530 |
Graphs | Vertices | Edges | Size | Domain |
---|---|---|---|---|
Wiki-Vote | 7115 | 103,689 | S | Social |
Cit-HepTh | 27,770 | 352,807 | M | Citation |
Web-Stanford | 281,903 | 2,312,497 | L | Web graphs |
Wiki-Talk | 2,394,385 | 5,021,410 | XL | Communication |
3.2 Query Workload
Category | Operations | # Of queries |
---|---|---|
Atomic relational queries | Project, Aggregation, Join, Order by | 4 |
TPC-H workloads | All the TPC-H query workloads | 22 |
Graph query workloads | BFS, CDLP, PR, LCC, WCC | 5 |
3.2.1 Atomic Relational Queries
3.2.2 TPC-H Query Workloads
DB/Operations | Projection | Aggregation | Join | Order by |
---|---|---|---|---|
RDBMS | Select | Group by/sum/average ... | Join | Order by |
GDBMS | Match | Group by/sum/average ... | Edges between vertexes | Order by |
3.2.3 Graph Query workload
3.3 Representatives of RDBMSs and GDBMSs to be Compared
DBMS | VERSION | CATEGORY |
---|---|---|
PostgreSQL | 9.5 | RDBMS |
Oracle | 11 g | RDBMS |
MSSQL | 2017 | RDBMS |
Neo4j | 3.4.6 | GDBMS |
ArangoDB | 3.3.19 | GDBMS |
3.4 Metrics
- Query processing time: the execution time of a graph query or an SQL query, which is returned and collected by RDBMSs and GDBMSs;
- Memory usage ratio: the peak usage ratio of memory during the execution of the whole workload;
- CPU usage ratio: the peak usage ratio of CPU during the execution of the whole workload;
4 Experiments
4.1 Experimental Setup
Component | Parameter |
---|---|
CPU | Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz |
Cores | 40 (80 threads) |
Memory | 256 G |
Disk | 60 T |
GPU | Titan V |
Back-end storage engine | Version | Category |
---|---|---|
RocksDB | 5.8 | Key-value storage engine |
Hbase | 2.2 | Key-value storage engine |
Cassandra | 3.11 | Key-value storage engine |
Neo4j | 3.4.6 | Native graph storage engine |
MySQL | 5.7 | Row storage engine |
- Back-end storage engines of GDBMSs The GDBMSs use a variety of back-end storage engines, which will bring different query performance for the same data model. Therefore, we first investigate the impact of different back-end storage engines on read and write performance under the same graph model.
- Relational operations We evaluate the performance of each database when processing general TPC-H queries and some extra evaluation queries. For TPC-H queries, we execute 22 queries on three databases: PostgreSQL as the representative for RDBMSs, Neo4j as the representative GDBMSs, and ArangoDB as a typical system for multi-model NoSQL database which includes graph data models. Twenty-two queries are executed on each of them, and meanwhile, the processing time, CPU usage, and main memory usage are recorded. Mean values are calculated for the processing results of the 22 queries to measure the general capacity of handling business-oriented ad hoc queries and concurrent data modifications for 3 databases. The evaluation for TPC-H queries only measures general capacity for business-oriented scenarios, yet every query in TPC-H consists of many atomic operations such as Projection, Aggregation, Join and so on. Four extra queries have been designed to measure the performance of processing 4 typical atomic operations: Projection, Join, Aggregation, and Order by.
- Graph algorithms For graph algorithms, we execute 5 graph algorithms we have introduced in Sect. 3 on four databases: Neo4j as the representative for GDBMSs; Oracle, PostgreSQL, and Microsoft SQL Server as 3 typical RDBMSs. Every graph algorithm is executed on 4 databases, and the processing time is recorded for each database.
4.2 Experimental Results and Analysis
4.2.1 Back-end Storage Engines Evaluation
4.2.2 Relational Operation Evaluation
4.2.3 Graph Algorithm Evaluation
5 Conclusion
Perspectives | Categories | Databases | Performance description |
---|---|---|---|
Data model | Relation model | RDBMSs | Outperformed in handling aggregate, group and sort large amounts of data, and algorithms with fewer iterations and less depth |
Graph model | GDBMSs | Outperformed in handling multi-table join and multilayer iterative calculation | |
Back-end storage Engine | Row storage | RDBMSs | Row-based storage engine used by RDBMSs did not stand out in either read or write operations |
Key-value storage | GDBMSs | Key-value storage engines with LSM -Tree outperforms in write operations | |
Native graph storage | GDBMSs | Native storage of graphs performs well in read operations |
Database category | Performance summary |
---|---|
RDBMSs | RDBMSs outperform GDMBSs by a substantial margin under the workloads which mainly consist of group by, sort, and aggregation operations, and their combinations |
GDBMSs | GDMBSs show their superiority under the workloads that mainly consist of multi-table join, pattern match, path identification, and their combinations |