Abstract
Enterprise databases use storage tiering to lower capital and operational expenses. In such a setting, data waterfalls from an SSD-based high-performance tier when it is "hot" (frequently accessed) to a disk-based capacity tier and finally to a tape-based archival tier when it is "cold" (rarely accessed). To address the unprecedented growth in the amount of cold data, hardware vendors have introduced new devices, named Cold Storage Devices (CSD), that explicitly target cold data workloads. With access latencies in the tens of seconds and costs as low as $0.01/GB/month, CSD provide a middle ground between the low-latency (ms), high-cost, HDD-based capacity tier and the high-latency (min to h), low-cost, tape-based archival tier.
Driven by the price/performance characteristics of CSD, this paper makes a case for using CSD as a replacement for both the capacity and archival tiers of enterprise databases. Although CSD offer major cost savings, we show that current database systems can suffer a severe performance drop when CSD replace HDD, due to the mismatch between the design assumptions made by the query execution engine and the actual storage characteristics of CSD. We then build a CSD-driven query execution framework, called Skipper, that modifies both the database execution engine and the CSD scheduling algorithms to be aware of each other. Using results from our implementation of the architecture based on PostgreSQL and OpenStack Swift, we show that Skipper is capable of completely masking the high latency overhead of CSD, thereby opening up CSD for wider adoption as a storage tier for cheap data analytics over cold data.