Abstract
SQL query processing for analytics over Hadoop data has recently gained significant traction. Among the many systems providing some SQL support over Hadoop, Hive is the first native Hadoop system that uses an underlying framework such as MapReduce or Tez to process SQL-like statements. Impala, on the other hand, represents the emerging class of SQL-on-Hadoop systems that exploit a shared-nothing parallel database architecture over Hadoop. Both systems optimize their data ingestion via columnar storage, and promote different file formats: ORC (Hive) and Parquet (Impala). In this paper, we compare the performance of these two systems by conducting a set of cluster experiments using a TPC-H-like benchmark and two TPC-DS-inspired workloads. We also closely study the I/O efficiency of their columnar formats using a set of micro-benchmarks. Our results show that Impala is 3.3X to 4.4X faster than Hive on MapReduce and 2.1X to 2.8X faster than Hive on Tez for the overall TPC-H experiments. Impala is also 8.2X to 10X faster than Hive on MapReduce and about 4.3X faster than Hive on Tez for the TPC-DS-inspired experiments. Through detailed analysis of the experimental results, we identify the reasons for this performance gap and examine the strengths and limitations of each system.
Index Terms
- SQL-on-Hadoop: full circle back to shared-nothing database architectures