2018 | Book

Next-Generation Big Data

A Practical Guide to Apache Kudu, Impala, and Spark

Written by: Butch Quinto

Publisher: Apress

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

Utilize this practical and easy-to-follow guide to modernize traditional enterprise data warehouse and business intelligence environments with next-generation big data technologies.

Next-Generation Big Data takes a holistic approach, covering the most important aspects of modern enterprise big data. The book covers not only the main technology stack but also the next-generation tools and applications used for big data warehousing, data warehouse optimization, real-time and batch data ingestion and processing, real-time data visualization, big data governance, data wrangling, big data cloud deployments, and distributed in-memory big data computing. Finally, the book offers extensive, detailed coverage of big data case studies from Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard.

What You’ll Learn

Install Apache Kudu, Impala, and Spark to modernize enterprise data warehouse and business intelligence environments, complete with real-world, easy-to-follow examples, and practical advice

Integrate HBase, Solr, Oracle, SQL Server, MySQL, Flume, Kafka, HDFS, and Amazon S3 with Apache Kudu, Impala, and Spark

Use StreamSets, Talend, Pentaho, and CDAP for real-time and batch data ingestion and processing

Utilize Trifacta, Alteryx, and Datameer for data wrangling and interactive data processing

Turbocharge Spark with Alluxio, a distributed in-memory storage platform

Deploy big data in the cloud using Cloudera Director

Perform real-time data visualization and time series analysis using Zoomdata, Apache Kudu, Impala, and Spark

Understand enterprise big data topics such as big data governance, metadata management, data lineage, impact analysis, and policy enforcement, and how to use Cloudera Navigator to perform common data governance tasks

Implement big data use cases such as big data warehousing, data warehouse optimization, Internet of Things, real-time data ingestion and analytics, complex event processing, and scalable predictive modeling

Study real-world big data case studies from innovative companies, including Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard

Who This Book Is For

BI and big data warehouse professionals interested in gaining practical and real-world insight into next-generation big data processing and analytics using Apache Kudu, Impala, and Spark; and those who want to learn more about other advanced enterprise topics

Table of Contents

Frontmatter

Chapter 1. Next-Generation Big Data

Abstract

Despite all the excitement around big data, the large majority of mission-critical data is still stored in relational database management systems. This fact is supported by recent studies online and confirmed by my own professional experience working on numerous big data and business intelligence projects. Despite widespread interest in unstructured and semi-structured data, structured data still represents a significant percentage of data under management for most organizations, from the largest corporations and government agencies to small businesses and technology start-ups. Use cases that deal with unstructured and semi-structured data, while valuable and interesting, are few and far between. Unless you work for a company that does a lot of unstructured data processing, such as Google, Facebook, or Apple, you are most likely working with structured data.

Butch Quinto

Chapter 2. Introduction to Kudu

Abstract

Kudu is an Apache-licensed open source columnar storage engine built for the Apache Hadoop platform. It supports fast sequential and random reads and writes, enabling real-time stream processing and analytic workloads.ⁱ It integrates with Impala, allowing you to insert, delete, update, upsert, and retrieve data using SQL. Kudu also integrates with Spark (and MapReduce) for fast and scalable data processing and analytics. Like other projects in the Apache Hadoop ecosystem, Kudu runs on commodity hardware and was designed to be highly scalable and highly available.

Butch Quinto

Chapter 3. Introduction to Impala

Abstract

Impala is a massively parallel processing (MPP) SQL engine designed and built from the ground up to run on Hadoop platforms.ⁱ Impala provides fast, low-latency response times appropriate for business intelligence applications and ad hoc data analysis. Impala’s performance matches and, in most cases, surpasses that of commercial MPP engines.

Butch Quinto

Chapter 4. High Performance Data Analysis with Impala and Kudu

Abstract

Impala is the default MPP SQL engine for Kudu. Impala allows you to interact with Kudu using SQL. If you have experience with traditional relational databases where the SQL and storage engines are tightly integrated, you might find it unusual that Kudu and Impala are decoupled from each other. Impala was designed to work with other storage engines such as HDFS, HBase, and S3, not just Kudu. There’s also work underway to integrate other SQL engines such as Apache Drill (DRILL-4241) and Hive (HIVE-12971) with Kudu. Decoupling storage, SQL, and processing engines is common in the open source community.

Butch Quinto

Chapter 5. Introduction to Spark

Abstract

Spark is the next-generation big data processing framework for processing and analyzing large data sets. Spark features a unified processing framework that provides high-level APIs in Scala, Python, Java, and R and powerful libraries including Spark SQL for SQL support, MLlib for machine learning, Spark Streaming for real-time streaming, and GraphX for graph processing.ⁱ Spark was created by Matei Zaharia at the University of California, Berkeley’s AMPLab and was later donated to the Apache Software Foundation, becoming a top-level project on February 24, 2014.ⁱⁱ Version 1.0 was released on May 30, 2014.ⁱⁱⁱ

Butch Quinto

Chapter 6. High Performance Data Processing with Spark and Kudu

Abstract

Kudu is just a storage engine. You need a way to get data into it and out. As Cloudera’s default big data processing framework, Spark is the ideal data processing and ingestion tool for Kudu. Not only does Spark provide excellent scalability and performance, Spark SQL and the DataFrame API make it easy to interact with Kudu.

Butch Quinto

Chapter 7. Batch and Real-Time Data Ingestion and Processing

Abstract

Data ingestion is the process of transferring, loading, and processing data into a data management or storage platform. This chapter discusses various tools and methods for ingesting data into Kudu in batch and real time. I’ll cover native tools that come with popular Hadoop distributions. I’ll show examples of how to use Spark to ingest data into Kudu using the Data Source API, as well as the Kudu client APIs in Java, Python, and C++. There is a group of next-generation commercial data ingestion tools that provide native Kudu support. Internet of Things (IoT) is also a hot topic. I’ll discuss all of them in detail in this chapter, starting with StreamSets.

Butch Quinto

Chapter 8. Big Data Warehousing

Abstract

In this chapter, I’ll discuss how big data (and more specifically Cloudera Enterprise) is disrupting data warehousing. I assume the readers are familiar with data warehousing so I won’t cover the basics and theory of data warehousing.

Butch Quinto

Chapter 9. Big Data Visualization and Data Wrangling

Abstract

It’s easy to understand why self-service data analysis and visualization have become popular these past few years. They make users more productive by letting them perform their own analysis and interactively explore and manipulate data based on their own needs, without relying on traditional business intelligence developers to build reports and dashboards, a task that can take days, weeks, or longer. Users can perform ad hoc analysis and run follow-up queries to answer their own questions. They’re also not limited by static reports and dashboards. Output from self-service data analysis can take various forms depending on the type of analysis: interactive charts and dashboards, pivot tables, OLAP cubes, predictions from machine learning models, or results returned by a SQL query.

Butch Quinto

Chapter 10. Distributed In-Memory Big Data Computing

Abstract

Alluxio is a distributed memory-centric storage system originally developed as a research project in 2012 by Haoyuan Li, then a PhD student and a founding Apache Spark committer at AMPLab.

Butch Quinto

Chapter 11. Big Data Governance and Management

Abstract

Data governance is perhaps one of the most important parts of an organization’s information management strategy. Data governance issues such as poor data quality or compromised data security can sink data-driven projects or cause massive revenue loss. For organizations such as banks and government agencies, data governance is a must. There are several data governance and management frameworks to choose from, such as the DAMA framework.

Butch Quinto

Chapter 12. Big Data in the Cloud

Abstract

Big data deployments in the cloud have become increasingly popular these past few years. The flexibility and agility of the cloud are ideal for running Hadoop clusters. The cloud significantly reduces IT cost while giving applications the ability to scale. Expanding and shrinking clusters takes minutes, and systems administrators are not needed for most tasks. While some organizations still prefer on-premises big data deployments, most big data environments these days are deployed on one of the three main public cloud providers.

Butch Quinto

Chapter 13. Big Data Case Studies

Abstract

Big data has disrupted entire industries. Innovative use cases in the fields of financial services, telecommunications, transportation, health care, retail, insurance, utilities, energy, and technology (to mention a few) have revolutionized the way organizations manage, process, and analyze data.

Butch Quinto

Backmatter

Title: Next-Generation Big Data
Written by: Butch Quinto
Publisher: Apress
Electronic ISBN: 978-1-4842-3147-0
Print ISBN: 978-1-4842-3146-3
DOI: https://doi.org/10.1007/978-1-4842-3147-0