
About this book

This book is a practical guide to using the Apache Hadoop projects, including MapReduce, HDFS, Apache Hive, Apache HBase, Apache Kafka, Apache Mahout, and Apache Solr. From setting up the environment to running sample applications, each chapter is a practical tutorial on using an Apache Hadoop ecosystem project. While several books on Apache Hadoop are available, most are based on the main projects, MapReduce and HDFS, and none discusses the other Apache Hadoop ecosystem projects and how they all work together as a cohesive big data development platform.

What you'll learn:
How to set up a Linux environment for Hadoop projects using the Cloudera Hadoop Distribution CDH 5
How to run a MapReduce job
How to store data with Apache Hive and Apache HBase
How to index data in HDFS with Apache Solr
How to develop a Kafka messaging system
How to develop a Mahout user recommender system
How to stream logs to HDFS with Apache Flume
How to transfer data from a MySQL database to Hive, HDFS, and HBase with Sqoop
How to create a Hive table over Apache Solr

Who this book is for:
The primary audience is Apache Hadoop developers. Prerequisite knowledge of Linux and some knowledge of Hadoop are required.





Chapter 1. Introduction

Apache Hadoop is the de facto framework for processing and storing large quantities of data, often referred to as “big data”. The Apache Hadoop ecosystem consists of dozens of projects providing functionality such as storing, querying, indexing, transferring, streaming, and messaging, to list a few.

Deepak Vohra

Chapter 2. HDFS and MapReduce

Apache Hadoop is a distributed framework for storing and processing large quantities of data. Going over each of the terms in the previous statement, "distributed" implies that Hadoop is distributed across several (tens, hundreds, or even thousands of) nodes in a cluster. "Storing and processing" means that Hadoop uses two different frameworks: the Hadoop Distributed File System (HDFS) for storage and MapReduce for processing. This is illustrated in Figure 2-1.

Deepak Vohra
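
As a rough illustration of the processing half of the Chapter 2 abstract, the MapReduce model can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real job would read its input from HDFS and run the phases across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over in-memory lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hello hadoop", "hello hdfs"]))
```

The point of the sketch is the division of labor: the map and reduce functions hold the application logic, while the grouping step in between is what a framework would normally handle.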

Storing & Querying

Chapter 3. Apache Hive

Apache Hive is a data warehouse framework for querying and managing large datasets stored in the Hadoop Distributed File System (HDFS). Hive provides a SQL-like query language called HiveQL, and HiveQL queries may be run in the Hive CLI shell. By default, Hive stores data in HDFS, but it also supports the Amazon S3 filesystem.

Deepak Vohra

Chapter 4. Apache HBase

Apache HBase is a distributed, scalable database designed for Apache Hadoop. HBase is a flexible-schema NoSQL database with three main components: HMaster, ZooKeeper, and RegionServers. The HMaster handles DDL (create and delete) operations and manages region assignment. ZooKeeper is a distributed coordination service for the HBase cluster. RegionServers manage HBase table data and serve client requests. An HBase table is split by row key ranges into one or more regions, and more regions are used as the table grows. Regions are hosted on RegionServers, which serve PUT/GET requests from clients. Each RegionServer is collocated with an HDFS DataNode, and HBase table data is stored in HDFS. The metadata for the Region->RegionServer mapping is kept in a meta table, whose location is tracked in ZooKeeper. A client first asks ZooKeeper where the meta table is, looks up the RegionServer that holds the requested data, and then GETs/PUTs data directly on that RegionServer. The HBase architecture is illustrated in Figure 4-1.

Deepak Vohra
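
The row-key-range lookup described in the Chapter 4 abstract can be sketched as a sorted list of region start keys searched with bisection. This is only a toy model under stated assumptions: the RegionMap class, the server names, and the split points are hypothetical, and it does not use the HBase client API.

```python
import bisect


class RegionMap:
    """Toy stand-in for a Region->RegionServer mapping (hypothetical)."""

    def __init__(self, regions):
        # regions: list of (start_key, server_name) pairs.
        # The first region must start at "" so that every row key is covered.
        self.regions = sorted(regions)
        self.start_keys = [start for start, _ in self.regions]

    def locate(self, row_key):
        """Return the server whose region covers row_key."""
        # Pick the last region whose start key is <= row_key.
        index = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.regions[index][1]


if __name__ == "__main__":
    region_map = RegionMap([("", "rs1"), ("g", "rs2"), ("p", "rs3")])
    for key in ("apple", "grape", "zebra"):
        print(key, "->", region_map.locate(key))
```

Once the client knows which server covers a row key, it can talk to that server directly, which is why the lookup itself stays off the data path.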

Bulk Transferring & Streaming

Chapter 5. Apache Sqoop

Apache Sqoop is a tool for transferring large quantities of data between a relational database, such as MySQL or Oracle Database, and the Hadoop ecosystem, which includes the Hadoop Distributed File System (HDFS), Apache Hive, and Apache HBase. Sqoop supports bi-directional transfer between a relational database and HDFS, but only uni-directional transfer from a relational database to Apache Hive and Apache HBase. The data transfer paths supported by Apache Sqoop are illustrated in Figure 5-1.

Deepak Vohra

Chapter 6. Apache Flume

Apache Flume is a framework based on streaming data flows for collecting, aggregating, and transferring large quantities of data. Flume is an efficient and reliable distributed service. A unit of data flow in Flume is called an event. The main components in the Flume architecture are the Flume source, Flume channel, and Flume sink, all of which are hosted by a Flume agent. A Flume source consumes events from an external source such as a log file or a web server, and stores the events it receives in a passive data store called a Flume channel. Examples of Flume channel types are a JDBC channel, a file channel, and a memory channel. The Flume sink removes events from the Flume channel and puts them in external storage such as HDFS. A Flume sink can also forward events to another Flume source to be processed by another Flume agent. The Flume architecture for a single-hop data flow is shown in Figure 6-1.

Deepak Vohra
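
The single-hop flow from the Chapter 6 abstract (source, then channel, then sink) can be mimicked with an in-memory queue. This is a hedged sketch, not Flume itself: MemoryChannel, run_source, and run_sink are hypothetical names, a Python list stands in for external storage such as HDFS, and a real agent is configured through a properties file rather than written as code.

```python
from queue import Empty, Full, Queue


class MemoryChannel:
    """Passive, bounded buffer between a source and a sink (toy model)."""

    def __init__(self, capacity=100):
        self._queue = Queue(maxsize=capacity)

    def put(self, event):
        """Store one event; returns False if the channel is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except Full:
            return False

    def take(self):
        """Remove and return the oldest event, or None if empty."""
        try:
            return self._queue.get_nowait()
        except Empty:
            return None


def run_source(lines, channel):
    """Wrap each input line in an event and hand it to the channel."""
    for line in lines:
        channel.put({"headers": {}, "body": line})


def run_sink(channel, storage):
    """Drain the channel and append event bodies to the storage list."""
    event = channel.take()
    while event is not None:
        storage.append(event["body"])
        event = channel.take()


if __name__ == "__main__":
    channel, storage = MemoryChannel(), []
    run_source(["GET /index.html", "GET /about.html"], channel)
    run_sink(channel, storage)
    print(storage)
```

The channel does no work of its own here, which mirrors its description as a passive store: the source pushes events in and the sink pulls them out.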



Chapter 7. Apache Avro

Apache Avro is a compact binary data serialization format providing varied data structures. Avro uses schemas written in JSON notation to serialize and deserialize data. Avro data is stored in a container file (an .avro file), and its schema (the .avsc file) is stored with the data file. Unlike some similar systems such as Protocol Buffers, Avro does not require code generation and uses dynamic typing. Data is untagged because the schema accompanies the data, resulting in a compact data file. Avro supports versioning; different versions (having different columns) of Avro data files may coexist along with their schemas. Another benefit of Avro is interoperability with other languages because of its efficient binary format. The Apache Hadoop ecosystem supports Apache Avro in several of its projects. Apache Hive provides support to store a table as Avro. The Apache Sqoop import command supports importing relational data to an Avro data file. Apache Flume supports Avro as a source and sink type.

Deepak Vohra
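
Since an Avro schema is a JSON document, the versioning idea in the Chapter 7 abstract can be illustrated with the standard json module alone. The record name, namespace, and fields below are hypothetical, and the sketch only builds schema text of the kind an .avsc file holds; it does not read or write Avro data files, which would need an Avro library.

```python
import copy
import json

# Version 1 of a hypothetical record schema.
schema_v1 = {
    "type": "record",
    "name": "LogEntry",
    "namespace": "example.avro",
    "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "message", "type": "string"},
    ],
}


def add_optional_field(schema, name, avro_type):
    """Return a new schema version with a nullable field defaulting to null."""
    new_schema = copy.deepcopy(schema)
    new_schema["fields"].append(
        {"name": name, "type": ["null", avro_type], "default": None}
    )
    return new_schema


# Version 2 adds a column without touching version 1.
schema_v2 = add_optional_field(schema_v1, "host", "string")
avsc_text = json.dumps(schema_v2, indent=2)

if __name__ == "__main__":
    print(avsc_text)
```

Giving the new field a default is what allows data written under the older schema to be read with the newer one, so both versions can coexist.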

Chapter 8. Apache Parquet

Apache Parquet is an efficient, structured, column-oriented (also called columnar storage), compressed, binary file format. Parquet supports several compression codecs, including Snappy, GZIP, deflate, and BZIP2. Snappy is the default. Structured file formats such as RCFile, Avro, SequenceFile, and Parquet offer better performance with compression support, which reduces the size of the data on the disk and consequently the I/O and CPU resources required to deserialize data.

Deepak Vohra

Messaging & Indexing

Chapter 9. Apache Kafka

Apache Kafka is a publish-subscribe, high-throughput, distributed messaging system. Kafka is fast, with a single broker handling hundreds of megabytes per second of reads and writes from several clients.

Deepak Vohra

Chapter 10. Apache Solr

Apache Solr is an Apache Lucene-based enterprise search platform providing features such as full-text search, near real-time indexing, and database integration. The Apache Hadoop ecosystem provides support for Solr in several of its projects. The Apache Hive Storage Handler for Solr can be used to index Hive table data in Solr. Apache HBase-Solr supports indexing of HBase table data. Apache Flume provides a MorphlineSolrSink for streaming data to Apache Solr for indexing. This chapter introduces Apache Solr and creates a Hive table stored by Solr.

Deepak Vohra

Chapter 11. Apache Mahout

Apache Mahout is a scalable machine learning library with support for several classification, clustering, and collaborative filtering algorithms. Mahout runs on top of Hadoop using the MapReduce model. Mahout also provides a Java API. This chapter explains how to get started with Mahout; you’ll install Mahout and run some sample Mahout applications. You will also see how to develop a user recommender system using the Mahout Java API.

Deepak Vohra

