DOI: 10.1145/2618243.2618265
research-article

Efficient data management and statistics with zero-copy integration

Published: 30 June 2014

ABSTRACT

Statistical analysts have long struggled with ever-growing data volumes. While specialized data management systems such as relational databases can handle the data, statistical analysis tools are far more convenient for expressing complex analyses. Integrating these two classes of systems has the potential to overcome the data management issue while keeping analysis convenient. However, one must keep a careful eye on implementation overheads such as serialization. In this paper, we propose the in-process integration of data management and analytical tools. Furthermore, we argue that a zero-copy integration is feasible due to the omnipresence of C-style arrays containing native types. We discuss the general concept and present a prototype of this integration based on the columnar relational database MonetDB and the R environment for statistical computing. We evaluate the performance of this prototype in a series of micro-benchmarks of common data management tasks.
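
The zero-copy idea can be illustrated with a minimal sketch. This is not the MonetDB/R prototype described in the paper; it is a Python analogy that assumes a "database column" stored as a contiguous C-style array of native doubles. The names `column`, `shared`, `copied`, and `mean` are hypothetical and chosen only for illustration. The sketch contrasts handing the analysis side a view of the same buffer with handing it a serialized copy.

```python
from array import array

# Hypothetical "database column": a contiguous C-style array of native doubles.
column = array("d", [1.5, 2.5, 4.0])

# Zero-copy hand-over: a memoryview exposes the same buffer without duplicating it.
shared = memoryview(column)

# Copying hand-over, for contrast: serialize to bytes, then rebuild a new array.
copied = array("d")
copied.frombytes(column.tobytes())


def mean(values):
    """Toy 'analysis' routine that accepts any sequence of floats."""
    return sum(values) / len(values)


# Update the column on the "database" side.
column[0] = 7.5

zero_copy_mean = mean(shared)  # reads the updated value from the shared memory
copied_mean = mean(copied)     # still reads the snapshot taken before the update
```

Because `shared` refers to the same memory as `column`, the update is visible to the analysis routine without any serialization step, whereas `copied` pays for a full duplicate and goes stale. A real in-process integration would also have to deal with ownership, lifetime, and type mapping between the two systems, which this sketch leaves out.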

        Published in

          SSDBM '14: Proceedings of the 26th International Conference on Scientific and Statistical Database Management
          June 2014
          417 pages
          ISBN: 9781450327220
          DOI: 10.1145/2618243

          Copyright © 2014 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Qualifiers

          • research-article

          Acceptance Rates

          SSDBM '14 paper acceptance rate: 26 of 71 submissions, 37%
          Overall acceptance rate: 56 of 146 submissions, 38%
