Practical Hive | springerprofessional.de

Springer Professional

2016 | Book

Practical Hive

A Guide to Hadoop's Data Warehouse System

Written by: Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard

Publisher: Apress

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About This Book

Dive into the world of SQL on Hadoop and get the most out of your Hive data warehouses. This book is your go-to resource for using Hive: authors Scott Shaw, Ankur Gupta, David Kjerrumgaard, and Andreas François Vermeulen take you through learning HiveQL, the SQL-like language specific to Hive, to analyze, export, and massage the data stored across your Hadoop environment. From deploying Hive on your hardware or virtual machine and setting up its initial configuration to learning how Hive interacts with Hadoop, MapReduce, Tez, and other big data technologies, Practical Hive gives you a detailed treatment of the software.

In addition, this book discusses the value of open source software, Hive performance tuning, and how to leverage semi-structured and unstructured data.

What You Will Learn

Install and configure Hive for new and existing datasets

Perform DDL operations

Execute efficient DML operations

Use tables, partitions, buckets, and user-defined functions

Discover performance tuning tips and Hive best practices

Who This Book Is For

Developers, companies, and professionals who deal with large amounts of data and could use software that can efficiently manage large volumes of input. It is assumed that readers have the ability to work with SQL.

Table of Contents

Frontmatter

Chapter 1. Setting the Stage for Hive: Hadoop

Abstract

By now, any technical specialist with even a sliver of curiosity has heard the term Hadoop tossed around at the water cooler. The discussion likely ranges from “Hadoop is a waste of time” to “This is big. This will solve all our current problems.” You may also have heard your company director, manager, or even CIO ask the team to begin implementing this new Big Data thing and to somehow identify a problem it is meant to solve.

Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard

Chapter 2. Introducing Hive

Abstract

As much as the Hadoop ecosystem evolves and provides exceptional means to access new types of data and structures, we cannot deny the influence and purpose of traditional relational systems. Relational systems and especially the data access methods employed by these systems have served as a valuable tool for over 30 years.

Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard

Chapter 3. Hive Architecture

Abstract

This chapter digs deeper into the core Hive components and architecture and sets the stage for even deeper discussions in later chapters. Here you will see what makes Hive tick and what value its architecture provides over traditional relational systems. Make no mistake about it: Hive is complicated, but its complexity is surmountable and will be familiar to those who make a living accessing data. Keep in mind too that, like any software development project, Hive is changing constantly and fast. Competition in the SQL-on-Hadoop space is driving community innovation at a phenomenal rate. This chapter helps you navigate the core of Hive and guides you along the ride.

Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard

Chapter 4. Hive Tables DDL

Abstract

By now, you know that Hive was created as a means to query the unstructured world of Hadoop without writing complex MapReduce programs. It gives users the ability to write simple queries using the expressiveness of SQL, the language that so many are already familiar with. Hive query language (HiveQL or HQL) is based on ANSI standard SQL, and hence is very easy to understand for anyone familiar with SQL. A user can log in to the Hive command-line interface and start querying the data on HDFS.

Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard

Chapter 5. Data Manipulation Language (DML)

Abstract

The Hive data manipulation language is the base for all data processing in the Hive ecosystem.

Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard

Chapter 6. Loading Data into Hive

Abstract

Let’s say you have built a data lake in your organization and one of the lines of business has requested that a new use case be implemented, for example, a 360-degree view of the customer. When you consider the details of the use case, you find that analytics needs to occur on all the customer data residing in the existing operational systems and data warehouse, and on all new data generated from social media, customer service, and call centers, to get a complete picture of the customer. Hadoop, being a general-purpose, large-scale distributed processing platform, is quite suitable for this.

Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard

Chapter 7. Querying Semi-Structured Data

Abstract

Hive would not be much of a useful data warehouse tool without the ability to query data. Luckily, querying and providing schema-on-read capabilities at scale is the core foundation for Hive use cases.

Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard

Chapter 8. Hive Analytics

Abstract

Analytics is the scientific procedure of transforming data into understanding by implementing value-added decisions.

Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard

Chapter 9. Performance Tuning: Hive

Abstract

One of the biggest challenges Hive users face is the slow response time experienced by end users who are running ad hoc queries. When compared to the performance achieved by traditional relational database queries, Hive’s response times are often unacceptably slow and leave you wondering how you can achieve the type of performance your end users are accustomed to.

Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard

Chapter 10. Hive Security

Abstract

Data is one of the most valuable assets of any organization. Loss of information is probably one of the worst nightmares in any organization.

Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard

Chapter 11. The Future of Hive

Abstract

The future of Hive is a roadmap of enhancements and improvements.

Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard

Appendix A. Building a Big Data Team

Abstract

Building a Big Data team is a fundamental requirement to ensure the success of business responsibilities for maintenance of production jobs and active projects.

Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard

Appendix B. Hive Functions

Abstract

Hive offers a comprehensive set of functions.

Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard

Backmatter

Title: Practical Hive
Written by: Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard
Copyright year: 2016
Publisher: Apress
Electronic ISBN: 978-1-4842-0271-5
Print ISBN: 978-1-4842-0272-2
DOI: https://doi.org/10.1007/978-1-4842-0271-5
