research-article

SQL-on-Hadoop: full circle back to shared-nothing database architectures

Authors:
Avrilia Floratou

IBM Almaden Research Center

IBM Almaden Research Center
View Profile

,
Umar Farooq Minhas

IBM Almaden Research Center

IBM Almaden Research Center
View Profile

,
Fatma Özcan

IBM Almaden Research Center

IBM Almaden Research Center
View Profile

Proceedings of the VLDB Endowment Volume 7 Issue 12pp 1295–1306https://doi.org/10.14778/2732977.2733002

Published:01 August 2014Publication History

Proceedings of the VLDB Endowment

Abstract

SQL query processing for analytics over Hadoop data has recently gained significant traction. Among many systems providing some SQL support over Hadoop, Hive is the first native Hadoop system that uses an underlying framework such as MapReduce or Tez to process SQL-like statements. Impala, on the other hand, represents the new emerging class of SQL-on-Hadoop systems that exploit a shared-nothing parallel database architecture over Hadoop. Both systems optimize their data ingestion via columnar storage, and promote different file formats: ORC and Parquet. In this paper, we compare the performance of these two systems by conducting a set of cluster experiments using a TPC-H like benchmark and two TPC-DS inspired workloads. We also closely study the I/O efficiency of their columnar formats using a set of micro-benchmarks. Our results show that Impala is 3.3X to 4.4X faster than Hive on MapReduce and 2.1X to 2.8X than Hive on Tez for the overall TPC-H experiments. Impala is also 8.2X to 10X faster than Hive on MapReduce and about 4.3X faster than Hive on Tez for the TPC-DS inspired experiments. Through detailed analysis of experimental results, we identify the reasons for this performance gap and examine the strengths and limitations of each system.

References

D. J. Abadi, P. A. Boncz, and S. Harizopoulos. Column oriented Database Systems. PVLDB, 2(2): 1664--1665, 2009. Google ScholarDigital Library
A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, and A. Silberschatz. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. PVLDB, 2(1): 922--933, 2009. Google ScholarDigital Library
A. Ailamaki, D. J. DeWitt, and M. D. Hill. Data Page Layouts for Relational Databases on Deep Memory Hierarchies. VLDB J., 11(3): 198--215, 2002. Google ScholarDigital Library
Apache Drill. http://www.mapr.com/resources/community-resources/apache-drill.Google Scholar
Apache Hadoop. http://hadoop.apache.org/.Google Scholar
Apache Hive. http://hive.apache.org/.Google Scholar
Apache Shark. http://shark.cs.berkeley.edu/.Google Scholar
Apache Spark. https://spark.incubator.apache.org/.Google Scholar
L. Chang et al. HAWQ: A Massively Parallel Processing SQL Engine in Hadoop. In ACM SIGMOD, pages 1223--1234, 2014. Google ScholarDigital Library
Cloudera Impala.http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html.Google Scholar
D. J. DeWitt, R. V. Nehme, S. Shankar, J. Aguilar-Saborit, A. Avanes, M. Flasza, and J. Gramling. Split Query Processing in Polybase. In ACM SIGMOD, pages 1255--1266, 2013. Google ScholarDigital Library
A. Floratou, J. M. Patel, E. J. Shekita, and S. Tata. Column-oriented Storage Techniques for MapReduce. PVLDB, 4(7): 419--429, 2011. Google ScholarDigital Library
A. Floratou, N. Teletia, D. J. DeWitt, J. M. Patel, and D. Zhang. Can the Elephants Handle the NoSQL Onslaught? PVLDB, 5(12): 1712--1723, 2012. Google ScholarDigital Library
Y. He, R. Lee, Y. Huai, Z. Shao, N. Jain, X. Zhang, and Z. Xu. RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems. In ICDE, pages 1199--1208, 2011. Google ScholarDigital Library
R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and X. Zhang. YSmart: Yet Another SQL-to-MapReduce Translator. In ICDCS, pages 25--36, 2011. Google ScholarDigital Library
S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive Analysis of Web-scale Datasets. PVLDB, 3(1-2): 330--339, 2010. Google ScholarDigital Library
Presto. http://prestodb.io/.Google Scholar
M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik. C-Store: A Column-oriented DBMS. In PVLDB, pages 553--564, 2005. Google ScholarDigital Library
M. Stonebraker, D. J. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: Friends or Foes? CACM, 53(1): 64--71, 2010. Google ScholarDigital Library
Tajo. http://tajo.incubator.apache.org/.Google Scholar
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive - a Petabyte Scale Data Warehouse using Hadoop. In ICDE, pages 996--1005, 2010.Google ScholarCross Ref
TPC-DS like Workload on Impala. http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/.Google Scholar
Trevni Columnar Format. http://avro.apache.org/docs/1.7.6/trevni/spec.html.Google Scholar
R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale. In ACM SIGMOD, pages 13--24, 2013. Google ScholarDigital Library

Index Terms

SQL-on-Hadoop: full circle back to shared-nothing database architectures
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Parallel and distributed DBMSs
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Recommendations

Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware
IDEAS '17: Proceedings of the 21st International Database Engineering & Applications Symposium

Big Data is currently conceptualized as data whose volume, variety or velocity impose significant difficulties in traditional techniques and technologies. Big Data Warehousing is emerging as a new concept for Big Data analytics. In this context, SQL-on-...
Read More
SQL-on-hadoop systems: tutorial
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii

Enterprises are increasingly using Apache Hadoop, more specifically HDFS, as a central repository for all their data; data coming from various sources, including operational systems, social media and the web, sensors and smart devices, as well as their ...
Read More
Practical Hadoop Ecosystem: A Definitive Guide to Hadoop-Related Frameworks and Tools
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 7, Issue 12
August 2014
296 pages
ISSN:2150-8097
Editors:
H. V. Jagadish
University of Michigan
,
Aoying Zhou
East Normal University, China
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2014
Published in pvldb Volume 7, Issue 12
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 39
  Total Citations
  View Citations
- 705
  Total Downloads
- Downloads (Last 12 months)37
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

SQL-on-Hadoop: full circle back to shared-nothing database architectures

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware

SQL-on-hadoop systems: tutorial

Practical Hadoop Ecosystem: A Definitive Guide to Hadoop-Related Frameworks and Tools

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

SQL-on-Hadoop: full circle back to shared-nothing database architectures

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware

SQL-on-hadoop systems: tutorial

Practical Hadoop Ecosystem: A Definitive Guide to Hadoop-Related Frameworks and Tools

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media