ABSTRACT
Data structures are critical in any data-driven scenario, but they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware which evolve continuously. We present a design engine, the Data Calculator, which enables interactive and semi-automated design of data structures. It brings two innovations. First, it offers a set of fine-grained design primitives that capture the first principles of data layout design: how data structure nodes lay data out, and how they are positioned relative to each other. This allows for a structured description of the universe of possible data structure designs that can be synthesized as combinations of those primitives. The second innovation is computation of performance using learned cost models. These models are trained on diverse hardware and data profiles and capture the cost properties of fundamental data access primitives (e.g., random access). With these models, we synthesize the performance cost of complex operations on arbitrary data structure designs without having to: 1) implement the data structure, 2) run the workload, or even 3) access the target hardware. We demonstrate that the Data Calculator can assist data structure designers and researchers by accurately answering rich what-if design questions on the order of a few seconds or minutes, i.e., computing how the performance (response time) of a given data structure design is impacted by variations in the: 1) design, 2) hardware, 3) data, and 4) query workloads. This makes it effortless to test numerous designs and ideas before embarking on lengthy implementation, deployment, and hardware acquisition steps. We also demonstrate that the Data Calculator can synthesize entirely new designs, auto-complete partial designs, and detect suboptimal design choices.
- Daniel J. Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, and Samuel Madden. 2013. The Design and Implementation of Modern Column-Oriented Database Systems. Foundations and Trends in Databases Vol. 5, 3 (2013), 197--280. Google ScholarDigital Library
- Dana Van Aken, Andrew Pavlo, Geoffrey J Gordon, and Bohan Zhang. 2017. Automatic Database Management System Tuning Through Large-scale Machine Learning Proceedings of the ACM SIGMOD International Conference on Management of Data. 1009--1024. Google ScholarDigital Library
- Ioannis Alagiannis, Stratos Idreos, and Anastasia Ailamaki. 2014. H2O: A Hands-free Adaptive Store. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1103--1114. Google ScholarDigital Library
- Victor Alvarez, Felix Martin Schuhknecht, Jens Dittrich, and Stefan Richter. 2014. Main Memory Adaptive Indexing for Multi-Core Systems Proceedings of the International Workshop on Data Management on New Hardware (DAMON). 3:1---3:10. Google ScholarDigital Library
- Michael R. Anderson, Dolan Antenucci, Victor Bittorf, Matthew Burgess, Michael J. Cafarella, Arun Kumar, Feng Niu, Yongjoo Park, Christopher Ré, and Ce Zhang. 2013. Brainwash: A Data System for Feature Engineering Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR). http://web.eecs.umich.edu/Google Scholar
- Paul M Aoki. 1998. Generalizing "Search" in Generalized Search Trees (Extended Abstract) Proceedings of the IEEE International Conference on Data Engineering (ICDE). 380--389. Google ScholarDigital Library
- Paul M Aoki. 1999. How to Avoid Building DataBlades That Know the Value of Everything and the Cost of Nothing Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM). 122--133. Google ScholarDigital Library
- Joy Arulraj, Andrew Pavlo, and Prashanth Menon. 2016. Bridging the Archipelago between Row-Stores and Column-Stores for Hybrid Workloads Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Manos Athanassoulis, Michael S. Kester, Lukas M. Maas, Radu Stoica, Stratos Idreos, Anastasia Ailamaki, and Mark Callaghan. 2016. Designing Access Methods: The RUM Conjecture. In Proceedings of the International Conference on Extending Database Technology (EDBT). 461--466.Google Scholar
- Shivnath Babu, Nedyalko Borisov, Songyun Duan, Herodotos Herodotou, and Vamsidhar Thummala. 2009. Automated Experiment-Driven Management of (Database) Systems Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS). http://www.usenix.org/events/hotos09/tech/full Google ScholarDigital Library
- Don S Batory, J R Barnett, J F Garza, K P Smith, K Tsukuda, B C Twichell, and T E Wise. 1988. GENESIS: An Extensible Database Management System. IEEE Transactions on Software Engineering (TSE), Vol. 14, 11 (1988), 1711--1730. Google ScholarDigital Library
- Philip A. Bernstein and David B. Lomet. 1987. CASE Requirements for Extensible Database Systems. IEEE Data Engineering Bulletin Vol. 10, 2 (1987), 2--9. http://sites.computer.org/debull/87JUN-CD.pdfGoogle Scholar
- Alfonso F. Cardenas. 1973. Evaluation and Selection of File Organization - A Model and System. Commun. ACM, Vol. 16, 9 (1973), 540--548. Google ScholarDigital Library
- Michael J Carey and David J DeWitt. 1987. An Overview of the EXODUS Project. IEEE Data Engineering Bulletin Vol. 10, 2 (1987), 47--54. http://sites.computer.org/debull/87JUN-CD.pdfGoogle Scholar
- ch M.J. Steindorfer, and J.J. Vinju. ch2016. chTowards a software product line of trie-based collections chProceedings of the ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences. Google ScholarDigital Library
- Surajit Chaudhuri and Vivek R. Narasayya. 1997. An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server Proceedings of the International Conference on Very Large Data Bases (VLDB). 146--155. http://dl.acm.org/citation.cfm?id=645923.673646 Google ScholarDigital Library
- Surajit Chaudhuri and Gerhard Weikum. 2000. Rethinking Database System Architecture: Towards a Self-Tuning RISC-Style Database System Proceedings of the International Conference on Very Large Data Bases (VLDB). 1--10. http://www.vldb.org/conf/2000/P001.pdf Google ScholarDigital Library
- chC. Loncaric, E.Torlak, and M.D Ernst. ch2016. chFast synthesis of fast collections. In chProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Google ScholarDigital Library
- chD. Cohen and N. Campbell. ch1993. chAutomating Relational Operations on Data Structures. chIEEE Software Vol. ch10, ch3 (. ch1993), ch53--60. Google ScholarDigital Library
- chE. Schonberg, J.T. Schwartz and Sharir. ch1979. chAutomatic Data Structure Selection in SETL. chProceedings of the ACM Symposium on Principles of Programming Languages. Google ScholarDigital Library
- chE. Schonberg, J.T. Schwartz and Sharir. ch1981. chAn Automatic Technique for Selection of Data Representations in SETL Programs. chACM Transactions on Programming Languages and Systems, Vol. ch3, ch2 (. ch1981), ch126--143. Google ScholarDigital Library
- Alvin Cheung. 2015. Towards Generating Application-Specific Data Management Systems Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR). http://www.cidrdb.org/cidr2015/Papers/17Google Scholar
- chmStratos Idreos, Kostas Zoumpatianos, Brian Hentschel, Mike Kester, Demi Guo. chm2018. chmThe Internals of the Data Calculator. chmHarvard Data Systems Laboratory, Technical Report (. chm2018).Google Scholar
- chO. Shacham, M. Vechev, and E. Yahav. ch2009. chChameleon: Adaptive Selection of Collections. chProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Google ScholarDigital Library
- chP. Hawkins, A. Aiken, K. Fisher, M.C. Rinard and M. Sagiv. ch2011. chData representation synthesis. In chProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Google ScholarDigital Library
- chP. Hawkins, A. Aiken, K. Fisher, M.C. Rinard, and M. Sagiv. ch2012. chConcurrent data representation synthesis. In chProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Google ScholarDigital Library
- chY. Smaragdakis, and D. Batory. ch1997. chDiSTiL: A Transformation Library for Data Structures chProceedings of the Conference on Domain-Specific Languages on Conference on Domain-Specific Languages (DSL). Google ScholarDigital Library
- Niv Dayan, Manos Athanassoulis, and Stratos Idreos. 2017. Monkey: Optimal Navigable Key-Value Store. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 79--94. Google ScholarDigital Library
- Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. 1997. A Reliable Randomized Algorithm for the Closest-Pair Problem. J. Algorithms (1997). Google ScholarDigital Library
- Jens Dittrich and Alekh Jindal. 2011. Towards a One Size Fits All Database Architecture Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR). 195--198.Google Scholar
- David Goldhirsch and Jack A Orenstein. 1987. Extensibility in the PROBE Database System. IEEE Data Engineering Bulletin Vol. 10, 2 (1987), 24--31. http://sites.computer.org/debull/87JUN-CD.pdfGoogle Scholar
- Goetz Graefe. 1994. Volcano - An Extensible and Parallel Query Evaluation System. IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 6, 1 (feb. 1994), 120--135. http://dl.acm.org/citation.cfm?id=627558 Google ScholarDigital Library
- Goetz Graefe. 2011. Modern B-Tree Techniques. Foundations and Trends in Databases Vol. 3, 4 (2011), 203--402. Google ScholarDigital Library
- Goetz Graefe, Felix Halim, Stratos Idreos, Harumi Kuno, and Stefan Manegold. 2012. Concurrency control for adaptive indexing. Proceedings of the VLDB Endowment Vol. 5, 7 (2012), 656--667. http://dl.acm.org/citation.cfm?id=2180918 Google ScholarDigital Library
- Jim Gray. 2000. What Next? A Few Remaining Problems in Information Technlogy. ACM SIGMOD Digital Symposium Collection Vol. 2, 2 (2000).Google Scholar
- Richard A Hankins and Jignesh M Patel. 2003. Data Morphing: An Adaptive, Cache-Conscious Storage Technique Proceedings of the International Conference on Very Large Data Bases (VLDB). 417--428. http://www.vldb.org/conf/2003/papers/S13P03.pdf Google ScholarDigital Library
- Max Heimel, Martin Kiefer, and Volker Markl. 2015. Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation Proceedings of the ACM SIGMOD International Conference on Management of Data. 1477--1492. Google ScholarDigital Library
- Joseph M. Hellerstein, Jeffrey F. Naughton, and Avi Pfeffer. 1995. Generalized Search Trees for Database Systems. Proceedings of the International Conference on Very Large Data Bases (VLDB). 562--573. http://dl.acm.org/citation.cfm?id=645921.673145 Google ScholarDigital Library
- Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2007. Database Cracking Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR).Google Scholar
- Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2009. Self-organizing Tuple Reconstruction in Column-Stores Proceedings of the ACM SIGMOD International Conference on Management of Data. 297--308. Google ScholarDigital Library
- Yannis E Ioannidis and Eugene Wong. 1987. Query Optimization by Simulated Annealing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 9--22. Google ScholarDigital Library
- Oliver Kennedy and Lukasz Ziarek. 2015. Just-In-Time Data Structures. In Biennial Conference on Innovative Data Systems Research (CIDR).Google Scholar
- Michael S. Kester, Manos Athanassoulis, and Stratos Idreos. 2017. Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe? Proceedings of the ACM SIGMOD International Conference on Management of Data. 715--730. Google ScholarDigital Library
- Felix Martin Schuhknecht, Alekh Jindal, and Jens Dittrich. 2013. The Uncracked Pieces in Database Cracking. Proceedings of the VLDB Endowment Vol. 7, 2 (2013), 97--108. http://www.vldb.org/pvldb/vol7/p97-schuhknecht.pdf Google ScholarDigital Library
- Daniel Dominic Sleator and Robert Endre Tarjan. 1985. Self-Adjusting Binary Search Trees. J. ACM Vol. 32, 3 (1985), 652--686. Google ScholarDigital Library
- Michael Stonebraker, Jeff Anton, and Michael Hirohama. 1987. Extendability in POSTGRES. IEEE Data Engineering Bulletin Vol. 10, 2 (1987), 16--23. http://sites.computer.org/debull/87JUN-CD.pdfGoogle Scholar
- R.E. Tarjan. 1978. Complexity of combinatorial algorithms. SIAM Rev (1978).Google Scholar
- Toby J. Teorey and K. Sundar Das. 1976. Application of an Analytical Model to Evaluate Storage Structures Proceedings of the ACM SIGMOD International Conference on Management of Data. Google Scholar
- Peng Wang, Di Wang, and Adam Chlipala. 2017. TiML: a functional language for practical complexity analysis with invariants. PACMPL Vol. 1 (2017), 79:1--79:26. Google ScholarDigital Library
- Leland Wilkinson. 2005. The Grammar of Graphics. Springer (2005). Google ScholarDigital Library
- S Bing Yao. 1977. An Attribute Based Model for Database Access Cost Analysis. ACM Transactions on Database Systems (TODS), Vol. 2, 1 (1977), 45--67. Google ScholarDigital Library
- S. Bing Yao and D. DeJong. 1978. Evaluation of Database Access Paths. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- S. Bing Yao and Alan G. Merten. 1975. Selection of File Organization Using an Analytic Model Proceedings of the International Conference on Very Large Data Bases. Google ScholarDigital Library
- Ming Zhou. 1999. Generalizing Database Access Methods. Ph.D. Thesis. University of Waterloo (1999).Google Scholar
- Kostas Zoumpatianos, Stratos Idreos, and Themis Palpanas. 2014. Indexing for interactive exploration of big data series Proceedings of the ACM SIGMOD International Conference on Management of Data. 1555--1566. Google ScholarDigital Library
Index Terms
- The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models
Recommendations
Fast synthesis of fast collections
PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and ImplementationMany applications require specialized data structures not found in the standard libraries, but implementing new data structures by hand is tedious and error-prone. This paper presents a novel approach for synthesizing efficient implementations of ...
Cozy: synthesizing collection data structures
FSE 2016: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software EngineeringMany applications require specialized data structures not found in standard libraries. Implementing new data structures by hand is tedious and error-prone. To alleviate this difficulty, we built a tool called Cozy that synthesizes data structures using ...
Fast synthesis of fast collections
PLDI '16Many applications require specialized data structures not found in the standard libraries, but implementing new data structures by hand is tedious and error-prone. This paper presents a novel approach for synthesizing efficient implementations of ...
Comments