Instant loading for main memory databases

Published: 01 September 2013

Abstract

eScience and big data analytics applications are facing the challenge of efficiently evaluating complex queries over vast amounts of structured text data archived in network storage solutions. To analyze such data in traditional disk-based database systems, it needs to be bulk loaded, an operation whose performance largely depends on the wire speed of the data source and the speed of the data sink, i.e., the disk. As the speed of network adapters and disks has stagnated in the past, loading has become a major bottleneck. The delays it is causing are now ubiquitous as text formats are a preferred storage format for reasons of portability.

But the game has changed: ever-increasing main memory capacities have fostered the development of in-memory database systems, and very fast network infrastructures are on the verge of becoming economical. While hardware limitations for fast loading have disappeared, current approaches for main memory databases fail to saturate the now-available wire speeds of tens of Gbit/s. With Instant Loading, we contribute a novel CSV loading approach that allows scalable bulk loading at wire speed. This is achieved by optimizing all phases of loading for modern super-scalar multi-core CPUs. Large main memory capacities and Instant Loading thereby facilitate a very efficient data staging processing model consisting of instantaneous load-work-unload cycles across data archives on a single node. Once data is loaded, updates and queries are efficiently processed with the flexibility, security, and high performance of relational main memory databases.
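
The abstract does not spell out how loading is parallelized, so the following is only a generic illustration of chunk-parallel CSV parsing, not the paper's algorithm. It assumes a buffer already held in memory, no quoted fields or embedded newlines, and hypothetical helper names (`split_at_newlines`, `parse_chunk`, `load_csv`) chosen for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def split_at_newlines(data: bytes, n_chunks: int) -> list[bytes]:
    """Cut data into roughly equal chunks that each end on a record boundary."""
    chunks, start = [], 0
    target = max(1, len(data) // n_chunks)
    while start < len(data):
        end = min(start + target, len(data))
        if end < len(data):
            # Extend the chunk to the next newline so no record is split.
            nl = data.find(b"\n", end)
            end = len(data) if nl == -1 else nl + 1
        chunks.append(data[start:end])
        start = end
    return chunks


def parse_chunk(chunk: bytes) -> list[list[str]]:
    """Parse one chunk. Assumption: plain comma-separated fields, no quoting."""
    return [line.split(",") for line in chunk.decode("utf-8").splitlines() if line]


def load_csv(data: bytes, n_workers: int = 4) -> list[list[str]]:
    """Parse chunks concurrently and concatenate the rows in input order."""
    chunks = split_at_newlines(data, n_workers)
    # Threads only illustrate the structure here; CPython's GIL limits real
    # speedup for pure-Python parsing.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(parse_chunk, chunks)
    return [row for part in parts for row in part]


rows = load_csv(b"1,a\n2,b\n3,c\n", n_workers=2)
```

Splitting only at newlines keeps each worker independent, which is the property that lets such a scheme scale with core count. Handling quoted fields that contain newlines would need additional logic that this sketch omits.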

Published in

Proceedings of the VLDB Endowment, Volume 6, Issue 14
September 2013
384 pages

Publisher

VLDB Endowment