research-article

Cheap data analytics using cold storage devices

Authors:
Renata Borovica-Gajić, École Polytechnique Fédérale de Lausanne
Raja Appuswamy, École Polytechnique Fédérale de Lausanne
Anastasia Ailamaki, École Polytechnique Fédérale de Lausanne and RAW Labs SA
Proceedings of the VLDB Endowment, Volume 9, Issue 12, pp. 1029–1040. https://doi.org/10.14778/2994509.2994521

Published: 01 August 2016

Abstract

Enterprise databases use storage tiering to lower capital and operational expenses. In such a setting, data waterfalls from an SSD-based high-performance tier when it is "hot" (frequently accessed) to a disk-based capacity tier, and finally to a tape-based archival tier when it is "cold" (rarely accessed). To address the unprecedented growth in the amount of cold data, hardware vendors have introduced new devices, named Cold Storage Devices (CSD), explicitly targeted at cold data workloads. With access latencies in the tens of seconds and costs as low as $0.01/GB/month, CSD provide a middle ground between the low-latency (milliseconds), high-cost, HDD-based capacity tier and the high-latency (minutes to hours), low-cost, tape-based archival tier.

Driven by the price/performance aspect of CSD, this paper makes a case for using CSD as a replacement for both the capacity and archival tiers of enterprise databases. Although CSD offer major cost savings, we show that current database systems can suffer a severe performance drop when CSD are used as a replacement for HDD, due to the mismatch between the design assumptions made by the query execution engine and the actual storage characteristics of CSD. We then build a CSD-driven query execution framework, called Skipper, that modifies both the database execution engine and the CSD scheduling algorithms to be aware of each other. Using results from our implementation of the architecture based on PostgreSQL and OpenStack Swift, we show that Skipper is capable of completely masking the high latency overhead of CSD, thereby opening up CSD for wider adoption as a storage tier for cheap data analytics over cold data.
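
To put the price point quoted in the abstract in concrete terms, the following is a minimal arithmetic sketch, not taken from the paper. Only the $0.01/GB/month figure comes from the abstract; the function name `monthly_cost_usd` and the 100 TB dataset size are hypothetical and chosen purely for illustration.

```python
def monthly_cost_usd(size_gb: float, price_per_gb_month: float) -> float:
    """Return the monthly storage bill for size_gb at the given unit price."""
    if size_gb < 0 or price_per_gb_month < 0:
        raise ValueError("size and price must be non-negative")
    return size_gb * price_per_gb_month


# Lower-bound CSD price quoted in the abstract, in USD per GB per month.
CSD_PRICE = 0.01

# Hypothetical dataset of 100 TB (decimal units); this size is an assumption.
DATASET_GB = 100_000

if __name__ == "__main__":
    cost = monthly_cost_usd(DATASET_GB, CSD_PRICE)
    print(f"Storing {DATASET_GB} GB on CSD: about ${cost:.2f} per month")
```

The abstract does not give prices for the HDD or tape tiers, so the sketch leaves them out; they can be compared by passing their own unit prices to the same function.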

Index Terms

Cheap data analytics using cold storage devices
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing

Published in
Proceedings of the VLDB Endowment, Volume 9, Issue 12
August 2016
345 pages
ISSN: 2150-8097
Editors: Surajit Chaudhuri (Microsoft Research), Jayant Haritsa (I.I.Sc. Bangalore)
Publisher: VLDB Endowment