skip to main content
10.1145/3183713.3199671acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Public Access

The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models

Published:27 May 2018Publication History

ABSTRACT

Data structures are critical in any data-driven scenario, but they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware which evolve continuously. We present a design engine, the Data Calculator, which enables interactive and semi-automated design of data structures. It brings two innovations. First, it offers a set of fine-grained design primitives that capture the first principles of data layout design: how data structure nodes lay data out, and how they are positioned relative to each other. This allows for a structured description of the universe of possible data structure designs that can be synthesized as combinations of those primitives. The second innovation is computation of performance using learned cost models. These models are trained on diverse hardware and data profiles and capture the cost properties of fundamental data access primitives (e.g., random access). With these models, we synthesize the performance cost of complex operations on arbitrary data structure designs without having to: 1) implement the data structure, 2) run the workload, or even 3) access the target hardware. We demonstrate that the Data Calculator can assist data structure designers and researchers by accurately answering rich what-if design questions on the order of a few seconds or minutes, i.e., computing how the performance (response time) of a given data structure design is impacted by variations in the: 1) design, 2) hardware, 3) data, and 4) query workloads. This makes it effortless to test numerous designs and ideas before embarking on lengthy implementation, deployment, and hardware acquisition steps. We also demonstrate that the Data Calculator can synthesize entirely new designs, auto-complete partial designs, and detect suboptimal design choices.

References

  1. Daniel J. Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, and Samuel Madden. 2013. The Design and Implementation of Modern Column-Oriented Database Systems. Foundations and Trends in Databases Vol. 5, 3 (2013), 197--280. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Dana Van Aken, Andrew Pavlo, Geoffrey J Gordon, and Bohan Zhang. 2017. Automatic Database Management System Tuning Through Large-scale Machine Learning Proceedings of the ACM SIGMOD International Conference on Management of Data. 1009--1024. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Ioannis Alagiannis, Stratos Idreos, and Anastasia Ailamaki. 2014. H2O: A Hands-free Adaptive Store. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1103--1114. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Victor Alvarez, Felix Martin Schuhknecht, Jens Dittrich, and Stefan Richter. 2014. Main Memory Adaptive Indexing for Multi-Core Systems Proceedings of the International Workshop on Data Management on New Hardware (DAMON). 3:1---3:10. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Michael R. Anderson, Dolan Antenucci, Victor Bittorf, Matthew Burgess, Michael J. Cafarella, Arun Kumar, Feng Niu, Yongjoo Park, Christopher Ré, and Ce Zhang. 2013. Brainwash: A Data System for Feature Engineering Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR). http://web.eecs.umich.edu/Google ScholarGoogle Scholar
  6. Paul M Aoki. 1998. Generalizing "Search" in Generalized Search Trees (Extended Abstract) Proceedings of the IEEE International Conference on Data Engineering (ICDE). 380--389. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Paul M Aoki. 1999. How to Avoid Building DataBlades That Know the Value of Everything and the Cost of Nothing Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM). 122--133. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Joy Arulraj, Andrew Pavlo, and Prashanth Menon. 2016. Bridging the Archipelago between Row-Stores and Column-Stores for Hybrid Workloads Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Manos Athanassoulis, Michael S. Kester, Lukas M. Maas, Radu Stoica, Stratos Idreos, Anastasia Ailamaki, and Mark Callaghan. 2016. Designing Access Methods: The RUM Conjecture. In Proceedings of the International Conference on Extending Database Technology (EDBT). 461--466.Google ScholarGoogle Scholar
  10. Shivnath Babu, Nedyalko Borisov, Songyun Duan, Herodotos Herodotou, and Vamsidhar Thummala. 2009. Automated Experiment-Driven Management of (Database) Systems Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS). http://www.usenix.org/events/hotos09/tech/full Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Don S Batory, J R Barnett, J F Garza, K P Smith, K Tsukuda, B C Twichell, and T E Wise. 1988. GENESIS: An Extensible Database Management System. IEEE Transactions on Software Engineering (TSE), Vol. 14, 11 (1988), 1711--1730. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Philip A. Bernstein and David B. Lomet. 1987. CASE Requirements for Extensible Database Systems. IEEE Data Engineering Bulletin Vol. 10, 2 (1987), 2--9. http://sites.computer.org/debull/87JUN-CD.pdfGoogle ScholarGoogle Scholar
  13. Alfonso F. Cardenas. 1973. Evaluation and Selection of File Organization - A Model and System. Commun. ACM, Vol. 16, 9 (1973), 540--548. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Michael J Carey and David J DeWitt. 1987. An Overview of the EXODUS Project. IEEE Data Engineering Bulletin Vol. 10, 2 (1987), 47--54. http://sites.computer.org/debull/87JUN-CD.pdfGoogle ScholarGoogle Scholar
  15. ch M.J. Steindorfer, and J.J. Vinju. ch2016. chTowards a software product line of trie-based collections chProceedings of the ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Surajit Chaudhuri and Vivek R. Narasayya. 1997. An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server Proceedings of the International Conference on Very Large Data Bases (VLDB). 146--155. http://dl.acm.org/citation.cfm?id=645923.673646 Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Surajit Chaudhuri and Gerhard Weikum. 2000. Rethinking Database System Architecture: Towards a Self-Tuning RISC-Style Database System Proceedings of the International Conference on Very Large Data Bases (VLDB). 1--10. http://www.vldb.org/conf/2000/P001.pdf Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. chC. Loncaric, E.Torlak, and M.D Ernst. ch2016. chFast synthesis of fast collections. In chProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. chD. Cohen and N. Campbell. ch1993. chAutomating Relational Operations on Data Structures. chIEEE Software Vol. ch10, ch3 (. ch1993), ch53--60. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. chE. Schonberg, J.T. Schwartz and Sharir. ch1979. chAutomatic Data Structure Selection in SETL. chProceedings of the ACM Symposium on Principles of Programming Languages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. chE. Schonberg, J.T. Schwartz and Sharir. ch1981. chAn Automatic Technique for Selection of Data Representations in SETL Programs. chACM Transactions on Programming Languages and Systems, Vol. ch3, ch2 (. ch1981), ch126--143. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Alvin Cheung. 2015. Towards Generating Application-Specific Data Management Systems Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR). http://www.cidrdb.org/cidr2015/Papers/17Google ScholarGoogle Scholar
  23. chmStratos Idreos, Kostas Zoumpatianos, Brian Hentschel, Mike Kester, Demi Guo. chm2018. chmThe Internals of the Data Calculator. chmHarvard Data Systems Laboratory, Technical Report (. chm2018).Google ScholarGoogle Scholar
  24. chO. Shacham, M. Vechev, and E. Yahav. ch2009. chChameleon: Adaptive Selection of Collections. chProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. chP. Hawkins, A. Aiken, K. Fisher, M.C. Rinard and M. Sagiv. ch2011. chData representation synthesis. In chProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. chP. Hawkins, A. Aiken, K. Fisher, M.C. Rinard, and M. Sagiv. ch2012. chConcurrent data representation synthesis. In chProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. chY. Smaragdakis, and D. Batory. ch1997. chDiSTiL: A Transformation Library for Data Structures chProceedings of the Conference on Domain-Specific Languages on Conference on Domain-Specific Languages (DSL). Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Niv Dayan, Manos Athanassoulis, and Stratos Idreos. 2017. Monkey: Optimal Navigable Key-Value Store. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 79--94. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. 1997. A Reliable Randomized Algorithm for the Closest-Pair Problem. J. Algorithms (1997). Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Jens Dittrich and Alekh Jindal. 2011. Towards a One Size Fits All Database Architecture Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR). 195--198.Google ScholarGoogle Scholar
  31. David Goldhirsch and Jack A Orenstein. 1987. Extensibility in the PROBE Database System. IEEE Data Engineering Bulletin Vol. 10, 2 (1987), 24--31. http://sites.computer.org/debull/87JUN-CD.pdfGoogle ScholarGoogle Scholar
  32. Goetz Graefe. 1994. Volcano - An Extensible and Parallel Query Evaluation System. IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 6, 1 (feb. 1994), 120--135. http://dl.acm.org/citation.cfm?id=627558 Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Goetz Graefe. 2011. Modern B-Tree Techniques. Foundations and Trends in Databases Vol. 3, 4 (2011), 203--402. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Goetz Graefe, Felix Halim, Stratos Idreos, Harumi Kuno, and Stefan Manegold. 2012. Concurrency control for adaptive indexing. Proceedings of the VLDB Endowment Vol. 5, 7 (2012), 656--667. http://dl.acm.org/citation.cfm?id=2180918 Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Jim Gray. 2000. What Next? A Few Remaining Problems in Information Technlogy. ACM SIGMOD Digital Symposium Collection Vol. 2, 2 (2000).Google ScholarGoogle Scholar
  36. Richard A Hankins and Jignesh M Patel. 2003. Data Morphing: An Adaptive, Cache-Conscious Storage Technique Proceedings of the International Conference on Very Large Data Bases (VLDB). 417--428. http://www.vldb.org/conf/2003/papers/S13P03.pdf Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Max Heimel, Martin Kiefer, and Volker Markl. 2015. Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation Proceedings of the ACM SIGMOD International Conference on Management of Data. 1477--1492. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Joseph M. Hellerstein, Jeffrey F. Naughton, and Avi Pfeffer. 1995. Generalized Search Trees for Database Systems. Proceedings of the International Conference on Very Large Data Bases (VLDB). 562--573. http://dl.acm.org/citation.cfm?id=645921.673145 Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2007. Database Cracking Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR).Google ScholarGoogle Scholar
  40. Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2009. Self-organizing Tuple Reconstruction in Column-Stores Proceedings of the ACM SIGMOD International Conference on Management of Data. 297--308. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Yannis E Ioannidis and Eugene Wong. 1987. Query Optimization by Simulated Annealing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 9--22. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Oliver Kennedy and Lukasz Ziarek. 2015. Just-In-Time Data Structures. In Biennial Conference on Innovative Data Systems Research (CIDR).Google ScholarGoogle Scholar
  43. Michael S. Kester, Manos Athanassoulis, and Stratos Idreos. 2017. Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe? Proceedings of the ACM SIGMOD International Conference on Management of Data. 715--730. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Felix Martin Schuhknecht, Alekh Jindal, and Jens Dittrich. 2013. The Uncracked Pieces in Database Cracking. Proceedings of the VLDB Endowment Vol. 7, 2 (2013), 97--108. http://www.vldb.org/pvldb/vol7/p97-schuhknecht.pdf Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Daniel Dominic Sleator and Robert Endre Tarjan. 1985. Self-Adjusting Binary Search Trees. J. ACM Vol. 32, 3 (1985), 652--686. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Michael Stonebraker, Jeff Anton, and Michael Hirohama. 1987. Extendability in POSTGRES. IEEE Data Engineering Bulletin Vol. 10, 2 (1987), 16--23. http://sites.computer.org/debull/87JUN-CD.pdfGoogle ScholarGoogle Scholar
  47. R.E. Tarjan. 1978. Complexity of combinatorial algorithms. SIAM Rev (1978).Google ScholarGoogle Scholar
  48. Toby J. Teorey and K. Sundar Das. 1976. Application of an Analytical Model to Evaluate Storage Structures Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarGoogle Scholar
  49. Peng Wang, Di Wang, and Adam Chlipala. 2017. TiML: a functional language for practical complexity analysis with invariants. PACMPL Vol. 1 (2017), 79:1--79:26. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Leland Wilkinson. 2005. The Grammar of Graphics. Springer (2005). Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. S Bing Yao. 1977. An Attribute Based Model for Database Access Cost Analysis. ACM Transactions on Database Systems (TODS), Vol. 2, 1 (1977), 45--67. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. S. Bing Yao and D. DeJong. 1978. Evaluation of Database Access Paths. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. S. Bing Yao and Alan G. Merten. 1975. Selection of File Organization Using an Analytic Model Proceedings of the International Conference on Very Large Data Bases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Ming Zhou. 1999. Generalizing Database Access Methods. Ph.D. Thesis. University of Waterloo (1999).Google ScholarGoogle Scholar
  55. Kostas Zoumpatianos, Stratos Idreos, and Themis Palpanas. 2014. Indexing for interactive exploration of big data series Proceedings of the ACM SIGMOD International Conference on Management of Data. 1555--1566. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
        May 2018
        1874 pages
        ISBN:9781450347037
        DOI:10.1145/3183713

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 27 May 2018

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        SIGMOD '18 Paper Acceptance Rate90of461submissions,20%Overall Acceptance Rate785of4,003submissions,20%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader