ABSTRACT
Statistical analysts have long struggled with ever-growing data volumes. While specialized data management systems such as relational databases can handle the data, statistical analysis tools are far more convenient for expressing complex data analyses. Integrating these two classes of systems has the potential to overcome the data management issue while keeping analysis convenient. However, one must keep a careful eye on implementation overheads such as serialization. In this paper, we propose the in-process integration of data management and analytical tools. Furthermore, we argue that a zero-copy integration is feasible due to the omnipresence of C-style arrays containing native types. We discuss the general concept and present a prototype of this integration based on the columnar relational database MonetDB and the R environment for statistical computing. We evaluate the performance of this prototype in a series of micro-benchmarks of common data management tasks.
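The difference between a zero-copy hand-over and a serialized one can be illustrated with a small sketch. This is not the MonetDB/R prototype; it assumes only that a column is stored as a contiguous C-style array of native doubles, and it uses Python's standard-library `array` and `memoryview` as stand-ins. The function names `share_column` and `copy_column` are hypothetical and chosen purely for illustration.

```python
from array import array


def share_column(column: array) -> memoryview:
    """Zero-copy hand-over: the view refers to the column's own buffer."""
    return memoryview(column)


def copy_column(column: array) -> array:
    """Serializing hand-over: the values are written to bytes and re-read."""
    return array(column.typecode, column.tobytes())


# Hypothetical stand-in for a database column of native doubles.
column = array("d", [1.5, 2.5, 4.0])

shared = share_column(column)
copied = copy_column(column)

# A later write to the column is visible through the shared view,
# because both refer to the same memory ...
column[0] = 10.0
assert shared[0] == 10.0

# ... whereas the serialized copy still holds the old value.
assert copied[0] == 1.5

# The view spans exactly the column's bytes; no second buffer was allocated for it.
assert shared.nbytes == len(column) * column.itemsize
```

The shared view costs a constant amount of work regardless of column length, while the copy scales with the data size, which is the overhead the abstract refers to as serialization.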