research-article

SQL-on-Hadoop: full circle back to shared-nothing database architectures

Published: 01 August 2014

Abstract

SQL query processing for analytics over Hadoop data has recently gained significant traction. Among many systems providing some SQL support over Hadoop, Hive is the first native Hadoop system that uses an underlying framework such as MapReduce or Tez to process SQL-like statements. Impala, on the other hand, represents the new emerging class of SQL-on-Hadoop systems that exploit a shared-nothing parallel database architecture over Hadoop. Both systems optimize their data ingestion via columnar storage, and promote different file formats: ORC and Parquet. In this paper, we compare the performance of these two systems by conducting a set of cluster experiments using a TPC-H-like benchmark and two TPC-DS-inspired workloads. We also closely study the I/O efficiency of their columnar formats using a set of micro-benchmarks. Our results show that Impala is 3.3X to 4.4X faster than Hive on MapReduce and 2.1X to 2.8X faster than Hive on Tez for the overall TPC-H experiments. Impala is also 8.2X to 10X faster than Hive on MapReduce and about 4.3X faster than Hive on Tez for the TPC-DS-inspired experiments. Through detailed analysis of experimental results, we identify the reasons for this performance gap and examine the strengths and limitations of each system.

        • Published in

          Proceedings of the VLDB Endowment, Volume 7, Issue 12
          August 2014
          296 pages
          ISSN: 2150-8097

          Publisher

          VLDB Endowment
