
About This Book

Learn advanced analytical techniques and leverage existing toolkits to make your analytic applications more powerful, precise, and efficient. This book provides the right combination of architecture, design, and implementation information to create analytical systems which go beyond the basics of classification, clustering, and recommendation.

Pro Hadoop Data Analytics emphasizes best practices to ensure coherent, efficient development. A complete example system is developed using standard third-party components: toolkits, libraries, visualization and reporting code, and the supporting glue needed to provide a working, extensible end-to-end system.

The book emphasizes four important topics:

- The importance of end-to-end, flexible, configurable, high-performance data pipeline systems with analytical components, as well as appropriate visualization of results.
- Best practices and structured design principles, covering both strategic topics and hands-on example portions.
- The importance of mix-and-match or hybrid systems, using different analytical components in one application to accomplish application goals. The hybrid approach is prominent in the examples.
- The use of existing third-party libraries as the key to effective development. Deep-dive examples of the functionality of some of these toolkits are showcased as you develop the example system.

What You'll Learn

- The what, why, and how of building big data analytic systems with the Hadoop ecosystem
- Libraries, toolkits, and algorithms to make development easier and more effective
- Best practices to use when building analytic systems with Hadoop, and metrics to measure performance and efficiency of components and systems
- How to connect to standard relational databases, NoSQL data sources, and more
- Useful case studies and example components which assist you in creating your own systems

Who This Book Is For

Software engineers, architects, and data scientists with an interest in the design and implementation of big data analytical systems using Hadoop, the Hadoop ecosystem, and other associated technologies.





Chapter 1. Overview: Building Data Analytic Systems with Hadoop

This book is about designing and implementing software systems that ingest, analyze, and visualize big data sets. Throughout the book, we’ll use the acronym BDA or BDAs (big data analytics system) to describe this kind of software. Big data itself deserves a word of explanation. As computer programmers and architects, we know that what we now call “big data” has been with us for a very long time—decades, in fact, because “big data” has always been a relative, multi-dimensional term, a space which is not defined by the mere size of the data alone. Complexity, speed, veracity—and of course, size and volume of data—are all dimensions of any modern “big data set”.

Kerry Koitzsch

Chapter 2. A Scala and Python Refresher

This chapter contains a quick review of the Scala and Python programming languages used throughout the book. The material discussed here is primarily aimed at Java/C++ programmers who need a quick review of Scala and Python.

Kerry Koitzsch

Chapter 3. Standard Toolkits for Hadoop and Analytics

In this chapter, we take a look at the necessary ingredients for a BDA system: the standard libraries and toolkits most useful for building BDAs. We describe an example system (which we develop throughout the remainder of the book) using standard toolkits from the Hadoop and Spark ecosystems. We also use other analytical toolkits, such as R and Weka, with mainstream development components such as Ant, Maven, npm, pip, Bower, and other system building tools. "Glueware components" such as Apache Camel, Spring Framework, Spring Data, Apache Kafka, Apache Tika, and others can be used to create a Hadoop-based system appropriate for a variety of applications.

Kerry Koitzsch

Chapter 4. Relational, NoSQL, and Graph Databases

In this chapter, we describe the role of databases in distributed big data analysis. Database types include relational, document, and graph databases, among others, which may be used as data sources or sinks in our analytical pipelines. Most of these database types integrate well with Hadoop ecosystem components, as well as with Apache Spark. Connectivity between the different kinds of databases and Hadoop/Apache Spark distributed processing may be provided by “glueware” such as Spring Data or Apache Camel. We describe relational databases such as MySQL, NoSQL databases such as Cassandra, and graph databases such as Neo4j, and how to integrate them with the Hadoop ecosystem.
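The cursor-based access pattern used when a relational database acts as a pipeline data source can be sketched in a few lines. This is a minimal illustration, not code from the book: SQLite (from the Python standard library) stands in for MySQL, and the table and column names are invented for the example.

```python
import sqlite3

# SQLite stands in for MySQL here; with MySQL you would swap the connection
# for a driver such as mysql-connector-python and keep the same pattern.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT, value REAL)")
conn.executemany(
    "INSERT INTO events (kind, value) VALUES (?, ?)",
    [("click", 1.0), ("view", 0.5), ("click", 2.0)],
)

# Pull aggregated rows out in the shape a downstream analytic stage expects.
cursor = conn.execute(
    "SELECT kind, SUM(value) FROM events GROUP BY kind ORDER BY kind"
)
totals = {kind: total for kind, total in cursor.fetchall()}
print(totals)  # {'click': 3.0, 'view': 0.5}
```

The same source/sink pattern applies whether the rows feed a Spark job or a search index; only the connection and driver change.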

Kerry Koitzsch

Chapter 5. Data Pipelines and How to Construct Them

In this chapter, we will discuss how to construct basic data pipelines using standard data sources and the Hadoop ecosystem. We provide an end-to-end example of how data sources may be linked and processed using Hadoop and other analytical components, and how this is similar to a standard ETL process.
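The three-stage ETL shape mentioned above can be sketched compactly. This is a toy illustration under invented data, not the book's pipeline: in a real system the extract stage would read from HDFS or a database and the load stage would write to a data sink, but the extract/transform/load structure is the same.

```python
import csv
import io

def extract(csv_text):
    """Extract: parse raw CSV rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: normalize types and filter out malformed records."""
    out = []
    for row in rows:
        try:
            out.append({"name": row["name"].strip().lower(),
                        "amount": float(row["amount"])})
        except (KeyError, ValueError):
            continue  # drop bad rows, as a real pipeline might log-and-skip
    return out

def load(rows, sink):
    """Load: append cleaned rows to the sink (a list standing in for a DB)."""
    sink.extend(rows)
    return sink

raw = "name,amount\n Alice ,10.5\nBob,notanumber\nCarol,3\n"
sink = load(transform(extract(raw)), [])
print(sink)  # [{'name': 'alice', 'amount': 10.5}, {'name': 'carol', 'amount': 3.0}]
```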

Kerry Koitzsch

Chapter 6. Advanced Search Techniques with Hadoop, Lucene, and Solr

In this chapter, we describe the structure and use of the Apache Lucene and Solr third-party search engine components, how to use them with Hadoop, and how to develop advanced search capability customized for an analytical application. We also investigate some newer Lucene-based search frameworks, primarily Elasticsearch, a premier search tool particularly well suited to building distributed analytic data pipelines, and we discuss the extended Lucene/Solr ecosystem along with some real-world programming examples of how to use Lucene and Solr in distributed big data analytics applications.
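The core data structure behind Lucene-style search is the inverted index: a map from each term to the set of documents containing it. The toy sketch below illustrates that idea in plain Python; it is not the Lucene or Solr API, and the documents are invented for the example.

```python
from collections import defaultdict

docs = {
    1: "hadoop distributed data analytics",
    2: "spark streaming analytics",
    3: "lucene full text search",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """AND-query: return ids of documents containing every query term."""
    term_sets = [index.get(term, set()) for term in query.split()]
    return sorted(set.intersection(*term_sets)) if term_sets else []

print(search("analytics"))         # [1, 2]
print(search("hadoop analytics"))  # [1]
```

Real Lucene indexes add tokenization, stemming, term weighting, and compressed on-disk postings lists, but the term-to-documents mapping is the same.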

Kerry Koitzsch

Architectures and Algorithms


Chapter 7. An Overview of Analytical Techniques and Algorithms

In this chapter, we provide an overview of four categories of algorithms: statistical, Bayesian, ontology-driven, and hybrid. These leverage the more basic algorithms found in standard libraries to perform more in-depth and accurate analyses using Hadoop.
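At the heart of the Bayesian category is Bayes' rule: posterior = likelihood x prior / evidence. A worked numeric example (the probabilities are illustrative, not from the book) shows how a strong prior can dominate a sensitive test:

```python
# A diagnostic-style application of Bayes' rule.
p_disease = 0.01              # prior P(D)
p_pos_given_disease = 0.95    # sensitivity P(+|D)
p_pos_given_healthy = 0.05    # false positive rate P(+|not D)

# Total probability of a positive result, P(+).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule: P(D|+) = P(+|D) * P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # 0.161
```

Despite a 95%-sensitive test, the posterior probability of disease given a positive result is only about 16%, because the prior is so low; this prior-sensitivity is what the Bayesian components in later chapters exploit.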

Kerry Koitzsch

Chapter 8. Rule Engines, System Control, and System Orchestration

In this chapter, we describe the JBoss Drools rule engine and how it may be used to control and orchestrate Hadoop analysis pipelines. We describe an example rule-based controller which can be used for a variety of data types and applications in combination with the Hadoop ecosystem.
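The condition-action pattern a rule engine applies to facts can be sketched in a few lines. This is an illustration of the pattern in Python, not Drools itself (Drools rules are authored in its own DRL language on the JVM), and the rule names and fact fields are invented:

```python
# Rules as (name, condition on a fact, resulting control action).
# List order stands in for Drools-style salience: first match wins.
rules = [
    ("reject-empty", lambda f: f.get("records", 0) == 0, "reject"),
    ("flag-large",   lambda f: f.get("records", 0) > 1000, "route-to-batch"),
    ("default-pass", lambda f: True, "pass"),
]

def evaluate(fact):
    """Fire the first rule whose condition matches the fact."""
    for name, condition, action in rules:
        if condition(fact):
            return name, action

print(evaluate({"records": 0}))     # ('reject-empty', 'reject')
print(evaluate({"records": 5000}))  # ('flag-large', 'route-to-batch')
print(evaluate({"records": 10}))    # ('default-pass', 'pass')
```

A real Drools controller adds working memory, pattern matching over many facts at once, and conflict resolution, but the declarative condition-to-action mapping used to steer a pipeline is the same idea.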

Kerry Koitzsch

Chapter 9. Putting It All Together: Designing a Complete Analytical System

In this chapter, we describe an end-to-end design example, using many of the components discussed so far. We also discuss “best practices” to use during the requirements acquisition, planning, architecting, development, testing, and deployment phases of the system development project.

Kerry Koitzsch

Components and Systems


Chapter 10. Data Visualizers: Seeing and Interacting with the Analysis

In this chapter, we will talk about how to look at—to visualize—our analytical results. This can be quite a complex process: it is a matter of choosing an appropriate technology stack for the kind of visualizing you need to do for your application. The visualization task in an analytics application can range from creating simple reports to full-fledged interactive systems. We will primarily discuss AngularJS and its ecosystem, including ElasticUI and the Kibana visualization tool, as well as other visualization components for graphs, charts, and tables, including some JavaScript-based tools like D3.js and sigma.js.
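Front-end tools like D3.js typically consume JSON produced by the analytics back end, so one recurring back-end task is shaping analysis output into chart-ready records. A minimal sketch (the label/count record shape is an assumption for the example, not a D3.js requirement):

```python
import json
from collections import Counter

# Raw per-record analysis output, e.g. classifier labels.
results = ["spam", "ham", "spam", "spam", "ham"]

# Aggregate into the records a simple bar-chart component might expect.
chart_data = [{"label": label, "count": count}
              for label, count in sorted(Counter(results).items())]
print(json.dumps(chart_data))
# [{"label": "ham", "count": 2}, {"label": "spam", "count": 3}]
```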

Kerry Koitzsch

Case Studies and Applications


Chapter 11. A Case Study in Bioinformatics: Analyzing Microscope Slide Data

In this chapter, we describe an application to analyze microscopic slide data, such as might be found in medical examinations of patient samples or forensic evidence from a crime scene. We illustrate how a Hadoop system might be used to organize, analyze, and correlate bioinformatics data.

Kerry Koitzsch

Chapter 12. A Bayesian Analysis Component: Identifying Credit Card Fraud

In this chapter, we describe a Bayesian analysis software component plug-in that may be used to analyze streams of credit card transactions in order to identify fraudulent use.
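The scoring step of such a component can be sketched as a tiny naive Bayes classifier: compare the log-likelihood of a transaction's features under a "fraud" model and a "legit" model. The feature names and probabilities below are invented for illustration; in practice they would be estimated from labeled transaction history.

```python
import math

# P(feature | class), normally estimated offline from labeled transactions.
likelihoods = {
    "fraud": {"foreign_ip": 0.6, "night": 0.5, "high_amount": 0.7},
    "legit": {"foreign_ip": 0.05, "night": 0.2, "high_amount": 0.1},
}
priors = {"fraud": 0.01, "legit": 0.99}

def log_posterior(cls, features):
    """Unnormalized log posterior: log prior + sum of log likelihoods."""
    score = math.log(priors[cls])
    for f in features:
        score += math.log(likelihoods[cls][f])
    return score

def classify(features):
    """Pick the class with the higher unnormalized log posterior."""
    return max(priors, key=lambda cls: log_posterior(cls, features))

print(classify(["foreign_ip", "night", "high_amount"]))  # fraud
print(classify(["night"]))                               # legit
```

Working in log space avoids numeric underflow when many features are multiplied together, which matters once the feature set grows beyond a toy example.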

Kerry Koitzsch

Chapter 13. Searching for Oil: Geographical Data Analysis with Apache Mahout

In this chapter, we discuss a particularly interesting application for distributed big data analytics: using a domain model to look for likely geographic locations for valuable minerals, such as petroleum, bauxite (aluminum ore), or natural gas. We touch on a number of convenient technology packages to ingest, analyze, and visualize the resulting data, especially those well-suited for processing geolocations and other geography-related data types.
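Processing geolocations usually begins with great-circle distance between coordinate pairs. The standard haversine formula below is a generic implementation (not code from the book), applied here to two Texas cities as a sanity check:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Houston to Dallas: roughly 360 km as the crow flies.
print(haversine_km(29.76, -95.37, 32.78, -96.80))
```

Distances like this feed directly into proximity queries and spatial clustering over candidate mineral sites.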

Kerry Koitzsch

Chapter 14. “Image As Big Data” Systems: Some Case Studies

In this chapter, we will provide a brief introduction to an example toolkit, the Image as Big Data Toolkit (IABDT), a Java-based open source framework for performing a wide variety of distributed image processing and analysis tasks in a scalable, highly available, and reliable manner. IABDT is an image processing framework developed over the last several years in response to the rapid evolution of big data technologies in general and of distributed image processing technologies in particular. It is designed to accept many formats of imagery, signals, sensor data, metadata, and video as data input.
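Treating an image as data ultimately means operating on arrays of pixel values. The toy sketch below (plain Python lists, no imaging library, data invented for the example) shows the kind of embarrassingly parallel per-pixel operation, thresholding to a binary mask, that a framework like IABDT distributes across a cluster:

```python
# A 3x3 grayscale "image" as nested lists of 0-255 intensities.
image = [
    [ 12, 200,  34],
    [180,  90, 250],
    [  7, 130,  60],
]

def threshold(img, cutoff):
    """Binary mask: 1 where the pixel is at least `cutoff`, else 0."""
    return [[1 if px >= cutoff else 0 for px in row] for row in img]

mask = threshold(image, 128)
print(mask)  # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

Because each pixel (or tile of pixels) is independent, such operations map naturally onto Hadoop- or Spark-style partitioned processing.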

Kerry Koitzsch

Chapter 15. Building a General Purpose Data Pipeline

In this chapter, we detail an end-to-end analytical system that uses many of the techniques discussed throughout the book, providing an evaluation system the reader may extend and edit to create their own Hadoop data analysis system. We discuss five basic strategies to use when developing data pipelines, and then see how these strategies may be applied to build a general-purpose data pipeline component.

Kerry Koitzsch

Chapter 16. Conclusions and the Future of Big Data Analysis

In this final chapter, we sum up what we have learned in the previous chapters and discuss some of the developing trends in big data analytics, including “incubator” projects and “young” projects for data analysis. We also speculate on what the future holds for big data analysis and the Hadoop ecosystem—“Future Hadoop” (which may also include Apache Spark and others).

Kerry Koitzsch

