research-article

Public Access

The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models

Authors:
Stratos Idreos

Harvard University, Cambridge, MA, USA

Harvard University, Cambridge, MA, USA
View Profile

,
Kostas Zoumpatianos

Harvard University, Cambridge, MA, USA

Harvard University, Cambridge, MA, USA
View Profile

,
Brian Hentschel

Harvard University, Cambridge, MA, USA

Harvard University, Cambridge, MA, USA
View Profile

,
Michael S. Kester

Harvard University, Cambridge, MA, USA

Harvard University, Cambridge, MA, USA
View Profile

,
Demi Guo

Harvard University, Cambridge, MA, USA

Harvard University, Cambridge, MA, USA
View Profile

SIGMOD '18: Proceedings of the 2018 International Conference on Management of DataMay 2018Pages 535–550https://doi.org/10.1145/3183713.3199671

Published:27 May 2018Publication History

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

Pages 535–550

ABSTRACT

Data structures are critical in any data-driven scenario, but they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware which evolve continuously. We present a design engine, the Data Calculator, which enables interactive and semi-automated design of data structures. It brings two innovations. First, it offers a set of fine-grained design primitives that capture the first principles of data layout design: how data structure nodes lay data out, and how they are positioned relative to each other. This allows for a structured description of the universe of possible data structure designs that can be synthesized as combinations of those primitives. The second innovation is computation of performance using learned cost models. These models are trained on diverse hardware and data profiles and capture the cost properties of fundamental data access primitives (e.g., random access). With these models, we synthesize the performance cost of complex operations on arbitrary data structure designs without having to: 1) implement the data structure, 2) run the workload, or even 3) access the target hardware. We demonstrate that the Data Calculator can assist data structure designers and researchers by accurately answering rich what-if design questions on the order of a few seconds or minutes, i.e., computing how the performance (response time) of a given data structure design is impacted by variations in the: 1) design, 2) hardware, 3) data, and 4) query workloads. This makes it effortless to test numerous designs and ideas before embarking on lengthy implementation, deployment, and hardware acquisition steps. We also demonstrate that the Data Calculator can synthesize entirely new designs, auto-complete partial designs, and detect suboptimal design choices.

References

Daniel J. Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, and Samuel Madden. 2013. The Design and Implementation of Modern Column-Oriented Database Systems. Foundations and Trends in Databases Vol. 5, 3 (2013), 197--280. Google ScholarDigital Library
Dana Van Aken, Andrew Pavlo, Geoffrey J Gordon, and Bohan Zhang. 2017. Automatic Database Management System Tuning Through Large-scale Machine Learning Proceedings of the ACM SIGMOD International Conference on Management of Data. 1009--1024. Google ScholarDigital Library
Ioannis Alagiannis, Stratos Idreos, and Anastasia Ailamaki. 2014. H2O: A Hands-free Adaptive Store. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1103--1114. Google ScholarDigital Library
Victor Alvarez, Felix Martin Schuhknecht, Jens Dittrich, and Stefan Richter. 2014. Main Memory Adaptive Indexing for Multi-Core Systems Proceedings of the International Workshop on Data Management on New Hardware (DAMON). 3:1---3:10. Google ScholarDigital Library
Michael R. Anderson, Dolan Antenucci, Victor Bittorf, Matthew Burgess, Michael J. Cafarella, Arun Kumar, Feng Niu, Yongjoo Park, Christopher Ré, and Ce Zhang. 2013. Brainwash: A Data System for Feature Engineering Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR). http://web.eecs.umich.edu/Google Scholar
Paul M Aoki. 1998. Generalizing "Search" in Generalized Search Trees (Extended Abstract) Proceedings of the IEEE International Conference on Data Engineering (ICDE). 380--389. Google ScholarDigital Library
Paul M Aoki. 1999. How to Avoid Building DataBlades That Know the Value of Everything and the Cost of Nothing Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM). 122--133. Google ScholarDigital Library
Joy Arulraj, Andrew Pavlo, and Prashanth Menon. 2016. Bridging the Archipelago between Row-Stores and Column-Stores for Hybrid Workloads Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
Manos Athanassoulis, Michael S. Kester, Lukas M. Maas, Radu Stoica, Stratos Idreos, Anastasia Ailamaki, and Mark Callaghan. 2016. Designing Access Methods: The RUM Conjecture. In Proceedings of the International Conference on Extending Database Technology (EDBT). 461--466.Google Scholar
Shivnath Babu, Nedyalko Borisov, Songyun Duan, Herodotos Herodotou, and Vamsidhar Thummala. 2009. Automated Experiment-Driven Management of (Database) Systems Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS). http://www.usenix.org/events/hotos09/tech/full Google ScholarDigital Library
Don S Batory, J R Barnett, J F Garza, K P Smith, K Tsukuda, B C Twichell, and T E Wise. 1988. GENESIS: An Extensible Database Management System. IEEE Transactions on Software Engineering (TSE), Vol. 14, 11 (1988), 1711--1730. Google ScholarDigital Library
Philip A. Bernstein and David B. Lomet. 1987. CASE Requirements for Extensible Database Systems. IEEE Data Engineering Bulletin Vol. 10, 2 (1987), 2--9. http://sites.computer.org/debull/87JUN-CD.pdfGoogle Scholar
Alfonso F. Cardenas. 1973. Evaluation and Selection of File Organization - A Model and System. Commun. ACM, Vol. 16, 9 (1973), 540--548. Google ScholarDigital Library
Michael J Carey and David J DeWitt. 1987. An Overview of the EXODUS Project. IEEE Data Engineering Bulletin Vol. 10, 2 (1987), 47--54. http://sites.computer.org/debull/87JUN-CD.pdfGoogle Scholar
ch M.J. Steindorfer, and J.J. Vinju. ch2016. chTowards a software product line of trie-based collections chProceedings of the ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences. Google ScholarDigital Library
Surajit Chaudhuri and Vivek R. Narasayya. 1997. An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server Proceedings of the International Conference on Very Large Data Bases (VLDB). 146--155. http://dl.acm.org/citation.cfm?id=645923.673646 Google ScholarDigital Library
Surajit Chaudhuri and Gerhard Weikum. 2000. Rethinking Database System Architecture: Towards a Self-Tuning RISC-Style Database System Proceedings of the International Conference on Very Large Data Bases (VLDB). 1--10. http://www.vldb.org/conf/2000/P001.pdf Google ScholarDigital Library
chC. Loncaric, E.Torlak, and M.D Ernst. ch2016. chFast synthesis of fast collections. In chProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Google ScholarDigital Library
chD. Cohen and N. Campbell. ch1993. chAutomating Relational Operations on Data Structures. chIEEE Software Vol. ch10, ch3 (. ch1993), ch53--60. Google ScholarDigital Library
chE. Schonberg, J.T. Schwartz and Sharir. ch1979. chAutomatic Data Structure Selection in SETL. chProceedings of the ACM Symposium on Principles of Programming Languages. Google ScholarDigital Library
chE. Schonberg, J.T. Schwartz and Sharir. ch1981. chAn Automatic Technique for Selection of Data Representations in SETL Programs. chACM Transactions on Programming Languages and Systems, Vol. ch3, ch2 (. ch1981), ch126--143. Google ScholarDigital Library
Alvin Cheung. 2015. Towards Generating Application-Specific Data Management Systems Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR). http://www.cidrdb.org/cidr2015/Papers/17Google Scholar
chmStratos Idreos, Kostas Zoumpatianos, Brian Hentschel, Mike Kester, Demi Guo. chm2018. chmThe Internals of the Data Calculator. chmHarvard Data Systems Laboratory, Technical Report (. chm2018).Google Scholar
chO. Shacham, M. Vechev, and E. Yahav. ch2009. chChameleon: Adaptive Selection of Collections. chProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Google ScholarDigital Library
chP. Hawkins, A. Aiken, K. Fisher, M.C. Rinard and M. Sagiv. ch2011. chData representation synthesis. In chProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Google ScholarDigital Library
chP. Hawkins, A. Aiken, K. Fisher, M.C. Rinard, and M. Sagiv. ch2012. chConcurrent data representation synthesis. In chProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Google ScholarDigital Library
chY. Smaragdakis, and D. Batory. ch1997. chDiSTiL: A Transformation Library for Data Structures chProceedings of the Conference on Domain-Specific Languages on Conference on Domain-Specific Languages (DSL). Google ScholarDigital Library
Niv Dayan, Manos Athanassoulis, and Stratos Idreos. 2017. Monkey: Optimal Navigable Key-Value Store. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 79--94. Google ScholarDigital Library
Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. 1997. A Reliable Randomized Algorithm for the Closest-Pair Problem. J. Algorithms (1997). Google ScholarDigital Library
Jens Dittrich and Alekh Jindal. 2011. Towards a One Size Fits All Database Architecture Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR). 195--198.Google Scholar
David Goldhirsch and Jack A Orenstein. 1987. Extensibility in the PROBE Database System. IEEE Data Engineering Bulletin Vol. 10, 2 (1987), 24--31. http://sites.computer.org/debull/87JUN-CD.pdfGoogle Scholar
Goetz Graefe. 1994. Volcano - An Extensible and Parallel Query Evaluation System. IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 6, 1 (feb. 1994), 120--135. http://dl.acm.org/citation.cfm?id=627558 Google ScholarDigital Library
Goetz Graefe. 2011. Modern B-Tree Techniques. Foundations and Trends in Databases Vol. 3, 4 (2011), 203--402. Google ScholarDigital Library
Goetz Graefe, Felix Halim, Stratos Idreos, Harumi Kuno, and Stefan Manegold. 2012. Concurrency control for adaptive indexing. Proceedings of the VLDB Endowment Vol. 5, 7 (2012), 656--667. http://dl.acm.org/citation.cfm?id=2180918 Google ScholarDigital Library
Jim Gray. 2000. What Next? A Few Remaining Problems in Information Technlogy. ACM SIGMOD Digital Symposium Collection Vol. 2, 2 (2000).Google Scholar
Richard A Hankins and Jignesh M Patel. 2003. Data Morphing: An Adaptive, Cache-Conscious Storage Technique Proceedings of the International Conference on Very Large Data Bases (VLDB). 417--428. http://www.vldb.org/conf/2003/papers/S13P03.pdf Google ScholarDigital Library
Max Heimel, Martin Kiefer, and Volker Markl. 2015. Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation Proceedings of the ACM SIGMOD International Conference on Management of Data. 1477--1492. Google ScholarDigital Library
Joseph M. Hellerstein, Jeffrey F. Naughton, and Avi Pfeffer. 1995. Generalized Search Trees for Database Systems. Proceedings of the International Conference on Very Large Data Bases (VLDB). 562--573. http://dl.acm.org/citation.cfm?id=645921.673145 Google ScholarDigital Library
Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2007. Database Cracking Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR).Google Scholar
Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2009. Self-organizing Tuple Reconstruction in Column-Stores Proceedings of the ACM SIGMOD International Conference on Management of Data. 297--308. Google ScholarDigital Library
Yannis E Ioannidis and Eugene Wong. 1987. Query Optimization by Simulated Annealing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 9--22. Google ScholarDigital Library
Oliver Kennedy and Lukasz Ziarek. 2015. Just-In-Time Data Structures. In Biennial Conference on Innovative Data Systems Research (CIDR).Google Scholar
Michael S. Kester, Manos Athanassoulis, and Stratos Idreos. 2017. Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe? Proceedings of the ACM SIGMOD International Conference on Management of Data. 715--730. Google ScholarDigital Library
Felix Martin Schuhknecht, Alekh Jindal, and Jens Dittrich. 2013. The Uncracked Pieces in Database Cracking. Proceedings of the VLDB Endowment Vol. 7, 2 (2013), 97--108. http://www.vldb.org/pvldb/vol7/p97-schuhknecht.pdf Google ScholarDigital Library
Daniel Dominic Sleator and Robert Endre Tarjan. 1985. Self-Adjusting Binary Search Trees. J. ACM Vol. 32, 3 (1985), 652--686. Google ScholarDigital Library
Michael Stonebraker, Jeff Anton, and Michael Hirohama. 1987. Extendability in POSTGRES. IEEE Data Engineering Bulletin Vol. 10, 2 (1987), 16--23. http://sites.computer.org/debull/87JUN-CD.pdfGoogle Scholar
R.E. Tarjan. 1978. Complexity of combinatorial algorithms. SIAM Rev (1978).Google Scholar
Toby J. Teorey and K. Sundar Das. 1976. Application of an Analytical Model to Evaluate Storage Structures Proceedings of the ACM SIGMOD International Conference on Management of Data. Google Scholar
Peng Wang, Di Wang, and Adam Chlipala. 2017. TiML: a functional language for practical complexity analysis with invariants. PACMPL Vol. 1 (2017), 79:1--79:26. Google ScholarDigital Library
Leland Wilkinson. 2005. The Grammar of Graphics. Springer (2005). Google ScholarDigital Library
S Bing Yao. 1977. An Attribute Based Model for Database Access Cost Analysis. ACM Transactions on Database Systems (TODS), Vol. 2, 1 (1977), 45--67. Google ScholarDigital Library
S. Bing Yao and D. DeJong. 1978. Evaluation of Database Access Paths. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
S. Bing Yao and Alan G. Merten. 1975. Selection of File Organization Using an Analytic Model Proceedings of the International Conference on Very Large Data Bases. Google ScholarDigital Library
Ming Zhou. 1999. Generalizing Database Access Methods. Ph.D. Thesis. University of Waterloo (1999).Google Scholar
Kostas Zoumpatianos, Stratos Idreos, and Themis Palpanas. 2014. Indexing for interactive exploration of big data series Proceedings of the ACM SIGMOD International Conference on Management of Data. 1555--1566. Google ScholarDigital Library

Index Terms

The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data layout
  2. Information storage systems
    1. Record storage systems

Recommendations

Fast synthesis of fast collections
PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation

Many applications require specialized data structures not found in the standard libraries, but implementing new data structures by hand is tedious and error-prone. This paper presents a novel approach for synthesizing efficient implementations of ...
Read More
Cozy: synthesizing collection data structures
FSE 2016: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering

Many applications require specialized data structures not found in standard libraries. Implementing new data structures by hand is tedious and error-prone. To alleviate this difficulty, we built a tool called Cozy that synthesizes data structures using ...
Read More
Fast synthesis of fast collections
PLDI '16

Many applications require specialized data structures not found in the standard libraries, but implementing new data structures by hand is tedious and error-prone. This paper presents a novel approach for synthesizing efficient implementations of ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
May 2018
1874 pages
ISBN:9781450347037
DOI:10.1145/3183713
General Chairs:
Gautam Das
University of Texas at Arlington, USA
,
Christopher Jermaine
Rice University, USA
,
Philip Bernstein
Microsoft Research, USA
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 May 2018
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
data structure synthesis
learned cost models
Qualifiers
- research-article
Conference

Acceptance Rates
SIGMOD '18 Paper Acceptance Rate90of461submissions,20%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 60
  Total Citations
  View Citations
- 1,468
  Total Downloads
- Downloads (Last 12 months)270
- Downloads (Last 6 weeks)23
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Fast synthesis of fast collections

Cozy: synthesizing collection data structures

Fast synthesis of fast collections

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Fast synthesis of fast collections

Cozy: synthesizing collection data structures

Fast synthesis of fast collections

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media