
2016 | Book

Practical Hive

A Guide to Hadoop's Data Warehouse System

Written by: Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard

Publisher: Apress

About This Book

Dive into the world of SQL on Hadoop and get the most out of your Hive data warehouses. This book is your go-to resource for using Hive: authors Scott Shaw, Ankur Gupta, David Kjerrumgaard, and Andreas François Vermeulen take you through learning HiveQL, the SQL-like language specific to Hive, to analyze, export, and massage the data stored across your Hadoop environment. From deploying Hive on your hardware or virtual machine and setting up its initial configuration to learning how Hive interacts with Hadoop, MapReduce, Tez, and other big data technologies, Practical Hive gives you a detailed treatment of the software.

In addition, this book discusses the value of open source software, Hive performance tuning, and how to leverage semi-structured and unstructured data.

What You Will Learn

Install and configure Hive for new and existing datasets

Perform DDL operations

Execute efficient DML operations

Use tables, partitions, buckets, and user-defined functions

Discover performance tuning tips and Hive best practices

Who This Book Is For

This book is for developers, companies, and professionals who deal with large amounts of data and could use software that efficiently manages large volumes of input. Readers are assumed to be able to work with SQL.

Table of Contents

Frontmatter
Chapter 1. Setting the Stage for Hive: Hadoop
Abstract
By now, any technical specialist with even a sliver of curiosity has heard the term Hadoop tossed around at the water cooler. The discussion likely ranges from “Hadoop is a waste of time” to “This is big. This will solve all our current problems.” You may also have heard your company director, manager, or even CIO ask the team to begin implementing this new Big Data thing and to somehow identify a problem it is meant to solve.
Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard
Chapter 2. Introducing Hive
Abstract
As much as the Hadoop ecosystem evolves and provides exceptional means to access new types of data and structures, we cannot deny the influence and purpose of traditional relational systems. Relational systems and especially the data access methods employed by these systems have served as a valuable tool for over 30 years.
Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard
Chapter 3. Hive Architecture
Abstract
This chapter digs deeper into the core Hive components and architecture and will set the stage for even deeper discussions in later chapters. Here you will see what makes Hive tick, and what value its architecture provides over traditional relational systems. Make no mistake about it, Hive is complicated but its complexity is surmountable and will be familiar to those who make a living accessing data. Keep in mind too that, like any software development project, Hive is constantly changing and changing fast. Competition in the SQL-on-Hadoop space is driving community innovation at a phenomenal rate. This chapter helps you navigate the core of Hive and aids you in the ride.
Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard
Chapter 4. Hive Tables DDL
Abstract
By now, you know that Hive was created as a means to query the unstructured world of Hadoop without writing complex MapReduce programs. It gives users the ability to write simple queries using the expressiveness of SQL, the language that so many are already familiar with. Hive query language (HiveQL or HQL) is based on ANSI standard SQL, and hence is very easy to understand for anyone familiar with SQL. A user can log in to the Hive command-line interface and start querying the data on HDFS.
Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard
Chapter 5. Data Manipulation Language (DML)
Abstract
The Hive data manipulation language is the base for all data processing in the Hive ecosystem.
Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard
Chapter 6. Loading Data into Hive
Abstract
Let’s say you have built a data lake in your organization and one of the lines of business has requested a new use case, for example, a 360-degree view of the customer. When you consider the details of the use case, you find that analytics needs to run on all the customer data residing in the existing operational systems and data warehouse, and on all new data generated from social media, customer service, and call centers, to get a complete picture of the customer. Hadoop, being a general-purpose, large-scale distributed processing platform, is well suited to this.
Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard
Chapter 7. Querying Semi-Structured Data
Abstract
Hive would not be much of a useful data warehouse tool without the ability to query data. Luckily, querying and providing schema-on-read capabilities at scale is the core foundation for Hive use cases.
Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard
Chapter 8. Hive Analytics
Abstract
Analytics is the scientific procedure of transforming data into understanding by implementing value-added decisions.
Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard
Chapter 9. Performance Tuning: Hive
Abstract
One of the biggest challenges Hive users face is the slow response time experienced by end users who are running ad hoc queries. Compared to the performance achieved by traditional relational database queries, Hive’s response times are often unacceptably slow and may leave you wondering how you can achieve the type of performance your end users are accustomed to.
Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard
Chapter 10. Hive Security
Abstract
Data is one of the most valuable assets of any organization. Loss of information is probably one of the worst nightmares in any organization.
Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard
Chapter 11. The Future of Hive
Abstract
The future of Hive is a roadmap of enhancements and improvements.
Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard
Appendix A. Building a Big Data Team
Abstract
Building a Big Data team is a fundamental requirement for the business to succeed in maintaining production jobs and delivering active projects.
Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard
Appendix B. Hive Functions
Abstract
Hive offers a comprehensive set of functions.
Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard
Backmatter
Metadata
Title
Practical Hive
Written by
Scott Shaw
Andreas François Vermeulen
Ankur Gupta
David Kjerrumgaard
Copyright Year
2016
Publisher
Apress
Electronic ISBN
978-1-4842-0271-5
Print ISBN
978-1-4842-0272-2
DOI
https://doi.org/10.1007/978-1-4842-0271-5