Cheap data analytics using cold storage devices

Published: 01 August 2016
Abstract

Enterprise databases use storage tiering to lower capital and operational expenses. In such a setting, data waterfalls from an SSD-based high-performance tier when it is "hot" (frequently accessed) to a disk-based capacity tier and finally to a tape-based archival tier when it is "cold" (rarely accessed). To address the unprecedented growth in the amount of cold data, hardware vendors have introduced new devices, named Cold Storage Devices (CSD), explicitly targeted at cold data workloads. With access latencies in the tens of seconds and costs as low as $0.01/GB/month, CSD provide a middle ground between the low-latency (ms), high-cost, HDD-based capacity tier and the high-latency (min to h), low-cost, tape-based archival tier.

Driven by the price/performance characteristics of CSD, this paper makes a case for using CSD as a replacement for both the capacity and archival tiers of enterprise databases. Although CSD offer major cost savings, we show that current database systems can suffer a severe performance drop when CSD are used as a replacement for HDD, due to the mismatch between the design assumptions made by the query execution engine and the actual storage characteristics of CSD. We then build a CSD-driven query execution framework, called Skipper, that modifies both the database execution engine and the CSD scheduling algorithms to be aware of each other. Using results from our implementation of the architecture based on PostgreSQL and OpenStack Swift, we show that Skipper is capable of completely masking the high latency overhead of CSD, thereby opening up CSD for wider adoption as a storage tier for cheap data analytics over cold data.

Published in

Proceedings of the VLDB Endowment, Volume 9, Issue 12
August 2016
345 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Qualifiers

research-article
