Skip to main content
Erschienen in: The VLDB Journal 1/2020

20.12.2019 | Special Issue Paper

\(\varvec{\textsc {Orpheus}}\)DB: bolt-on versioning for relational databases (extended version)

verfasst von: Silu Huang, Liqi Xu, Jialin Liu, Aaron J. Elmore, Aditya Parameswaran

Erschienen in: The VLDB Journal | Ausgabe 1/2020

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Data science teams often collaboratively analyze datasets, generating dataset versions at each stage of iterative exploration and analysis. There is a pressing need for a system that can support dataset versioning, enabling such teams to efficiently store, track, and query across dataset versions. We introduce OrpheusDB, a dataset version control system that “bolts on” versioning capabilities to a traditional relational database system, thereby gaining the analytics capabilities of the database “for free.” We develop and evaluate multiple data models for representing versioned data, as well as a lightweight partitioning scheme, LyreSplit, to further optimize the models for reduced query latencies. With LyreSplit, OrpheusDB is on average \(10^3\times \) faster in finding effective (and better) partitionings than competing approaches, while also reducing the latency of version retrieval by up to \(20\times \) relative to schemes without partitioning. LyreSplit can be applied in an online fashion as new versions are added, alongside an intelligent migration scheme that reduces migration time by \(10\times \) on average.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
Orpheus is a musician and poet from ancient Greek mythology with the ability to raise the dead with his music, much like OrpheusDB has the ability to retrieve old (“dead”) dataset versions on demand.
 
2
We also tried alternative join methods—the findings were unchanged; we will discuss this further in Sect. 4.1. We also tried using an additional secondary index for vlist for split-by-vlist which reduced the time for checkout but increased the time for commit even further.
 
3
Table 3 shows the commit and checkout time for split-by-rid-vid without building an index on vid for the versioning table. When built an index on vid for the versioning table, the checkout time for split-by-rid-vid is reduced to 69.382 s, while the commit time is increased to 21.235 s.
 
4
A lyre was the musical instrument of choice for Orpheus.
 
5
In each iteration r, topological sorting algorithm finds vertices \(V'\) with in-degree equals 0, removes \(V'\), and updates in-degree of other vertices. \(l(v_i) = r, \forall v_i\in V'\).
 
6
If the version graph is a DAG instead, we first transform it into a version tree as discussed in Sect. 5.1.
 
7
If each attribute is of different size, we can simply replace “the number of attributes” with “the number of bytes” in the whole algorithm.
 
8
PostgreSQL ’s version 9.5 added the feature of dynamically adjusting the number of buckets for hash-join.
 
9
Shingles are calculated as signatures of each partition based on a min-hashing based technique.
 
Literatur
2.
Zurück zum Zitat Bhardwaj, A., Bhattacherjee, S., Chavan, A., Deshpande, A., Elmore, A.J., Madden, S., Parameswaran, A.G.: Datahub: collaborative data science & dataset version management at scale. In: CIDR (2015) Bhardwaj, A., Bhattacherjee, S., Chavan, A., Deshpande, A., Elmore, A.J., Madden, S., Parameswaran, A.G.: Datahub: collaborative data science & dataset version management at scale. In: CIDR (2015)
3.
Zurück zum Zitat Consortium, G.O., et al.: Gene ontology consortium: going forward. Nucleic Acids Res. 43(D1), D1049–D1056 (2015)CrossRef Consortium, G.O., et al.: Gene ontology consortium: going forward. Nucleic Acids Res. 43(D1), D1049–D1056 (2015)CrossRef
5.
Zurück zum Zitat Maddox, M., Goehring, D., Elmore, A.J., Madden, S., Parameswaran, A., Deshpande, A.: Decibel: the relational dataset branching system. Proc. VLDB Endow. 9(9), 624–635 (2016)CrossRef Maddox, M., Goehring, D., Elmore, A.J., Madden, S., Parameswaran, A., Deshpande, A.: Decibel: the relational dataset branching system. Proc. VLDB Endow. 9(9), 624–635 (2016)CrossRef
6.
Zurück zum Zitat Bhattacherjee, S., Chavan, A., Huang, S., Deshpande, A., Parameswaran, A.: Principles of dataset versioning: exploring the recreation/storage tradeoff. Proc. VLDB Endow. 8(12), 1346–1357 (2015)CrossRef Bhattacherjee, S., Chavan, A., Huang, S., Deshpande, A., Parameswaran, A.: Principles of dataset versioning: exploring the recreation/storage tradeoff. Proc. VLDB Endow. 8(12), 1346–1357 (2015)CrossRef
7.
Zurück zum Zitat Tansel, A.U., Clifford, J., Gadia, S., Jajodia, S., Segev, A., Snodgrass, R.: Temporal databases: theory, design, and implementation. Benjamin-Cummings Publishing Co., Inc (1993) Tansel, A.U., Clifford, J., Gadia, S., Jajodia, S., Segev, A., Snodgrass, R.: Temporal databases: theory, design, and implementation. Benjamin-Cummings Publishing Co., Inc (1993)
8.
Zurück zum Zitat Jensen, C.S., Snodgrass, R.T.: Temporal data management. IEEE Trans. Knowl. Data Eng. 11(1), 36–44 (1999)CrossRef Jensen, C.S., Snodgrass, R.T.: Temporal data management. IEEE Trans. Knowl. Data Eng. 11(1), 36–44 (1999)CrossRef
9.
Zurück zum Zitat Ozsoyoglu, G., Snodgrass, R.T.: Temporal and real-time databases: a survey. IEEE Trans. Knowl. Data Eng. 7(4), 513–532 (1995)CrossRef Ozsoyoglu, G., Snodgrass, R.T.: Temporal and real-time databases: a survey. IEEE Trans. Knowl. Data Eng. 7(4), 513–532 (1995)CrossRef
10.
Zurück zum Zitat Kulkarni, K., Michels, J.-E.: Temporal features in sql: 2011. ACM Sigmod Record 41(3), 34–43 (2012)CrossRef Kulkarni, K., Michels, J.-E.: Temporal features in sql: 2011. ACM Sigmod Record 41(3), 34–43 (2012)CrossRef
11.
Zurück zum Zitat Huang, S., Xu, L., Liu, J., Elmore, A.J., Parameswaran, A.: O rpheus db: bolt-on versioning for relational databases. Proc. VLDB Endow. 10(10), 1130–1141 (2017)CrossRef Huang, S., Xu, L., Liu, J., Elmore, A.J., Parameswaran, A.: O rpheus db: bolt-on versioning for relational databases. Proc. VLDB Endow. 10(10), 1130–1141 (2017)CrossRef
12.
Zurück zum Zitat Xu, L., Huang, S., Hui, S., Elmore, A., Parameswaran, A.: (2017) OrpheusDB : A lightweight approach to relational dataset versioning. Technical Report Xu, L., Huang, S., Hui, S., Elmore, A., Parameswaran, A.: (2017) OrpheusDB : A lightweight approach to relational dataset versioning. Technical Report
13.
Zurück zum Zitat Miller, R.J., Hernández, M.A., Haas, L.M., Yan, L.-L., Ho, C.H., Fagin, R., Popa, L.: The clio project: managing heterogeneity. SIgMOD Record 30(1), 78–83 (2001)CrossRef Miller, R.J., Hernández, M.A., Haas, L.M., Yan, L.-L., Ho, C.H., Fagin, R., Popa, L.: The clio project: managing heterogeneity. SIgMOD Record 30(1), 78–83 (2001)CrossRef
14.
Zurück zum Zitat Fisher, K., Walker, D., Zhu, K.Q., White, P.: From dirt to shovels: fully automatic tool generation from ad hoc data. In: ACM SIGPLAN Notices, vol. 43, pp. 421–434. ACM (2008) Fisher, K., Walker, D., Zhu, K.Q., White, P.: From dirt to shovels: fully automatic tool generation from ad hoc data. In: ACM SIGPLAN Notices, vol. 43, pp. 421–434. ACM (2008)
18.
Zurück zum Zitat Buneman, P., Khanna, S., Tajima, K., Tan, W.-C.: Archiving scientific data. ACM Trans. Database Syst. (TODS) 29(1), 2–42 (2004)CrossRef Buneman, P., Khanna, S., Tajima, K., Tan, W.-C.: Archiving scientific data. ACM Trans. Database Syst. (TODS) 29(1), 2–42 (2004)CrossRef
19.
Zurück zum Zitat Jain, S., Moritz, D., Halperin, D., Howe, B., Lazowska, E.: Sqlshare: Results from a multi-year sql-as-a-service experiment. In: Proceedings of the 2016 International Conference on Management of Data, pp. 281–293. ACM (2016) Jain, S., Moritz, D., Halperin, D., Howe, B., Lazowska, E.: Sqlshare: Results from a multi-year sql-as-a-service experiment. In: Proceedings of the 2016 International Conference on Management of Data, pp. 281–293. ACM (2016)
20.
Zurück zum Zitat De Castro, C., Grandi, F., Scalas, M.R.: (1995) On schema versioning in temporal databases. In: Recent Advances in Temporal Databases, pp. 272–291. Springer De Castro, C., Grandi, F., Scalas, M.R.: (1995) On schema versioning in temporal databases. In: Recent Advances in Temporal Databases, pp. 272–291. Springer
21.
Zurück zum Zitat De Castro, C., Grandi, F., Scalas, M.R.: Schema versioning for multitemporal relational databases. Inf. Syst. 22(5), 249–290 (1997)CrossRef De Castro, C., Grandi, F., Scalas, M.R.: Schema versioning for multitemporal relational databases. Inf. Syst. 22(5), 249–290 (1997)CrossRef
22.
Zurück zum Zitat Moon, H.J., Curino, C.A., Deutsch, A., Hou, C.-Y., Zaniolo, C.: Managing and querying transaction-time databases under schema evolution. Proc. VLDB Endow. 1(1), 882–895 (2008)CrossRef Moon, H.J., Curino, C.A., Deutsch, A., Hou, C.-Y., Zaniolo, C.: Managing and querying transaction-time databases under schema evolution. Proc. VLDB Endow. 1(1), 882–895 (2008)CrossRef
23.
Zurück zum Zitat Moon, H.J., Curino, C.A., Zaniolo, C.: Scalable architecture and query optimization fortransaction-time dbs with evolving schemas. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 207–218. ACM (2010) Moon, H.J., Curino, C.A., Zaniolo, C.: Scalable architecture and query optimization fortransaction-time dbs with evolving schemas. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 207–218. ACM (2010)
24.
Zurück zum Zitat Quamar, A., Deshpande, A., Lin, J.: (2014) Nscale: neighborhood-centric large-scale graph analytics in the cloud. VLDB J. 1–26 Quamar, A., Deshpande, A., Lin, J.: (2014) Nscale: neighborhood-centric large-scale graph analytics in the cloud. VLDB J. 1–26
25.
Zurück zum Zitat Wang, S., Dinh, T.T.A., Lin, Q., Xie, Z., Zhang, M., Cai, Q., Chen, G., Fu, W., Ooi, B.C., Ruan, P.: (2018) Forkbase: an efficient storage engine for blockchain and forkable applications. arXiv preprint arXiv:1802.04949 Wang, S., Dinh, T.T.A., Lin, Q., Xie, Z., Zhang, M., Cai, Q., Chen, G., Fu, W., Ooi, B.C., Ruan, P.: (2018) Forkbase: an efficient storage engine for blockchain and forkable applications. arXiv preprint arXiv:​1802.​04949
26.
Zurück zum Zitat Chavan, A., Huang, S., Deshpande, A., Elmore, A., Madden, S., Parameswaran, A.: Towards a unified query language for provenance and versioning. In: 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15) (2015) Chavan, A., Huang, S., Deshpande, A., Elmore, A., Madden, S., Parameswaran, A.: Towards a unified query language for provenance and versioning. In: 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15) (2015)
27.
Zurück zum Zitat Arab, B., Gawlick, D., Radhakrishnan, V., Guo, H., Glavic, B.: A generic provenance middleware for queries, updates, and transactions. In: 6th \(\{\)USENIX\(\}\) Workshop on the Theory and Practice of Provenance (TaPP 2014) (2014) Arab, B., Gawlick, D., Radhakrishnan, V., Guo, H., Glavic, B.: A generic provenance middleware for queries, updates, and transactions. In: 6th \(\{\)USENIX\(\}\) Workshop on the Theory and Practice of Provenance (TaPP 2014) (2014)
28.
Zurück zum Zitat Curino, C.A., Moon, H.J., Zaniolo, C.: Graceful database schema evolution: the prism workbench. Proc. VLDB Endow. 1(1), 761–772 (2008)CrossRef Curino, C.A., Moon, H.J., Zaniolo, C.: Graceful database schema evolution: the prism workbench. Proc. VLDB Endow. 1(1), 761–772 (2008)CrossRef
29.
Zurück zum Zitat Ahmad, Y., Kennedy, O., Koch, C., Nikolic, M.: Dbtoaster: higher-order delta processing for dynamic, frequently fresh views. Proc. VLDB Endow. 5(10), 968–979 (2012)CrossRef Ahmad, Y., Kennedy, O., Koch, C., Nikolic, M.: Dbtoaster: higher-order delta processing for dynamic, frequently fresh views. Proc. VLDB Endow. 5(10), 968–979 (2012)CrossRef
30.
Zurück zum Zitat Ahn, I., Snodgrass, R.: Performance evaluation of a temporal database management system. In: ACM SIGMOD Record, vol. 15, pp. 96–107. ACM (1986) Ahn, I., Snodgrass, R.: Performance evaluation of a temporal database management system. In: ACM SIGMOD Record, vol. 15, pp. 96–107. ACM (1986)
31.
Zurück zum Zitat Snodgrass, R., Ahn, I.: A taxonomy of time databases. ACM Sigmod Record 14(4), 236–246 (1985)CrossRef Snodgrass, R., Ahn, I.: A taxonomy of time databases. ACM Sigmod Record 14(4), 236–246 (1985)CrossRef
32.
Zurück zum Zitat Snodgrass, R.T., Ahn, I., Ariav, G., Batory, D.S., Clifford, J., Dyreson, C.E., Elmasri, R., Grandi, F., Jensen, C.S., Käfer, W., et al.: Tsql2 language specification. Sigmod Record 23(1), 65–86 (1994)CrossRef Snodgrass, R.T., Ahn, I., Ariav, G., Batory, D.S., Clifford, J., Dyreson, C.E., Elmasri, R., Grandi, F., Jensen, C.S., Käfer, W., et al.: Tsql2 language specification. Sigmod Record 23(1), 65–86 (1994)CrossRef
33.
Zurück zum Zitat Torp, K., Jensen, C.S., Snodgrass, R.T.: Stratum approaches to temporal DBMS implementation. In: Database Engineering and Applications Symposium, 1998. Proceedings. IDEAS’98. International, pp. 4–13. IEEE (1998) Torp, K., Jensen, C.S., Snodgrass, R.T.: Stratum approaches to temporal DBMS implementation. In: Database Engineering and Applications Symposium, 1998. Proceedings. IDEAS’98. International, pp. 4–13. IEEE (1998)
34.
Zurück zum Zitat Chen, C.X., Kong, J., Zaniolo, C.: Design and implementation of a temporal extension of sql. In: 19th International Conference on Data Engineering, 2003. Proceedings, pp. 689–691. IEEE (2003) Chen, C.X., Kong, J., Zaniolo, C.: Design and implementation of a temporal extension of sql. In: 19th International Conference on Data Engineering, 2003. Proceedings, pp. 689–691. IEEE (2003)
35.
Zurück zum Zitat Saracco, C.M., Nicola, M., Gandhi, L.: A matter of time: temporal data management in db2 for z. IBM Corporation, New York (2010) Saracco, C.M., Nicola, M., Gandhi, L.: A matter of time: temporal data management in db2 for z. IBM Corporation, New York (2010)
36.
Zurück zum Zitat Al-Kateb, M., Ghazal, A., Crolotte, A., Bhashyam, R., Chimanchode, J.,. Pakala, S.P.: Temporal query processing in teradata. In: Proceedings of the 16th International Conference on Extending Database Technology, pp. 573–578. ACM (2013) Al-Kateb, M., Ghazal, A., Crolotte, A., Bhashyam, R., Chimanchode, J.,. Pakala, S.P.: Temporal query processing in teradata. In: Proceedings of the 16th International Conference on Extending Database Technology, pp. 573–578. ACM (2013)
37.
Zurück zum Zitat Kaufmann, M., Manjili, A.A., Vagenas, P., Fischer, P.M., Kossmann, D., Färber, F., May, N.: Timeline index: a unified data structure for processing queries on temporal data in sap hana. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 1173–1184. ACM (2013) Kaufmann, M., Manjili, A.A., Vagenas, P., Fischer, P.M., Kossmann, D., Färber, F., May, N.: Timeline index: a unified data structure for processing queries on temporal data in sap hana. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 1173–1184. ACM (2013)
38.
Zurück zum Zitat Färber, F., May, N., Lehner, W., Große, P., Müller, I., Rauhe, H., Dees, J.: The sap hana database-an architecture overview. IEEE Data Eng. Bull. 35(1), 28–33 (2012) Färber, F., May, N., Lehner, W., Große, P., Müller, I., Rauhe, H., Dees, J.: The sap hana database-an architecture overview. IEEE Data Eng. Bull. 35(1), 28–33 (2012)
39.
Zurück zum Zitat Kaufmann, M., Fischer, P.M., May, N., Kossmann, D.: Benchmarking bitemporal database systems: ready for the future or stuck in the past? In: EDBT, pp. 738–749 (2014) Kaufmann, M., Fischer, P.M., May, N., Kossmann, D.: Benchmarking bitemporal database systems: ready for the future or stuck in the past? In: EDBT, pp. 738–749 (2014)
40.
Zurück zum Zitat Salzberg, B., Tsotras, V.J.: Comparison of access methods for time-evolving data. ACM Comput. Surv. (CSUR) 31(2), 158–221 (1999)CrossRef Salzberg, B., Tsotras, V.J.: Comparison of access methods for time-evolving data. ACM Comput. Surv. (CSUR) 31(2), 158–221 (1999)CrossRef
41.
Zurück zum Zitat Lee, J.W., Loaiza, J., Stewart, M.J., Hu, W.-M., Bridge Jr, W.H.: Flashback database, Feb. 20 2007. US Patent 7,181,476 Lee, J.W., Loaiza, J., Stewart, M.J., Hu, W.-M., Bridge Jr, W.H.: Flashback database, Feb. 20 2007. US Patent 7,181,476
42.
Zurück zum Zitat Gao, D., Jensen, S., Snodgrass, T., Soo, D.: Join operations in temporal databases. VLDB J. Int. J. Very Large Data Bases 14(1), 2–29 (2005)CrossRef Gao, D., Jensen, S., Snodgrass, T., Soo, D.: Join operations in temporal databases. VLDB J. Int. J. Very Large Data Bases 14(1), 2–29 (2005)CrossRef
43.
Zurück zum Zitat Landau, G.M., Schmidt, J.P., Tsotras, V.J.: Historical queries along multiple lines of time evolution. VLDB J. Int. J. Very Large Data Bases 4(4), 703–726 (1995)CrossRef Landau, G.M., Schmidt, J.P., Tsotras, V.J.: Historical queries along multiple lines of time evolution. VLDB J. Int. J. Very Large Data Bases 4(4), 703–726 (1995)CrossRef
44.
Zurück zum Zitat Salzberg, B.J., Lomet, D.B.: Branched and Temporal Index Structures. Northeastern University, College of Computer Science (1995) Salzberg, B.J., Lomet, D.B.: Branched and Temporal Index Structures. Northeastern University, College of Computer Science (1995)
45.
Zurück zum Zitat Lanka, S., Mays, E.: Fully Persistent B+-trees, vol. 20. ACM, New York (1991) Lanka, S., Mays, E.: Fully Persistent B+-trees, vol. 20. ACM, New York (1991)
46.
Zurück zum Zitat Jiang, L., Salzberg, B., Lomet, D.B., García, M.B.: (2000) The bt-tree: a branched and temporal access method. In: VLDB, pp. 451–460 Jiang, L., Salzberg, B., Lomet, D.B., García, M.B.: (2000) The bt-tree: a branched and temporal access method. In: VLDB, pp. 451–460
51.
Zurück zum Zitat Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998)MathSciNetCrossRef Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998)MathSciNetCrossRef
52.
Zurück zum Zitat Liu, D.-R., Shekhar, S.: Partitioning similarity graphs: a framework for declustering problems. Inf. Syst. 21(6), 475–496 (1996)CrossRef Liu, D.-R., Shekhar, S.: Partitioning similarity graphs: a framework for declustering problems. Inf. Syst. 21(6), 475–496 (1996)CrossRef
53.
54.
Zurück zum Zitat Karypis, G., Kumar, V.: Multilevel k-way hypergraph partitioning. VLSI Des. 11(3), 285–300 (2000)CrossRef Karypis, G., Kumar, V.: Multilevel k-way hypergraph partitioning. VLSI Des. 11(3), 285–300 (2000)CrossRef
55.
Zurück zum Zitat Kumar, K.A., Quamar, A., Deshpande, A., Khuller, S.: Sword: workload-aware data placement and replica selection for cloud data management systems. VLDB J. 23(6), 845–870 (2014)CrossRef Kumar, K.A., Quamar, A., Deshpande, A., Khuller, S.: Sword: workload-aware data placement and replica selection for cloud data management systems. VLDB J. 23(6), 845–870 (2014)CrossRef
Metadaten
Titel
DB: bolt-on versioning for relational databases (extended version)
verfasst von
Silu Huang
Liqi Xu
Jialin Liu
Aaron J. Elmore
Aditya Parameswaran
Publikationsdatum
20.12.2019
Verlag
Springer Berlin Heidelberg
Erschienen in
The VLDB Journal / Ausgabe 1/2020
Print ISSN: 1066-8888
Elektronische ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-019-00594-5

Weitere Artikel der Ausgabe 1/2020

The VLDB Journal 1/2020 Zur Ausgabe