2015 | Book

Big Data Analytics with Spark

A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing

About this book

Big Data Analytics with Spark is a step-by-step guide to learning Spark, a fast, general-purpose, open-source cluster computing framework for large-scale data analysis. You will learn how to use Spark for different types of big data analytics projects, including batch, interactive, graph, and stream data analysis as well as machine learning. In addition, this book will help you become a much sought-after Spark expert.

Spark is one of the hottest big data technologies. The amount of data generated today by devices, applications, and users is exploding. Therefore, there is a critical need for tools that can analyze large-scale data and unlock value from it. Spark is a powerful technology that meets that need. You can, for example, use Spark to perform low-latency computations through efficient caching and iterative algorithms; use its shell for easy, interactive data analysis; and employ its fast batch processing and low-latency features to process real-time data streams. As a result, adoption of Spark is growing rapidly, and it is replacing Hadoop MapReduce as the technology of choice for big data analytics.

This book provides an introduction to Spark and related big data technologies. It covers Spark core and its add-on libraries, including Spark SQL, Spark Streaming, GraphX, and MLlib. Big Data Analytics with Spark is written for busy professionals who prefer to learn a new technology from a consolidated source instead of spending countless hours on the Internet piecing it together from different sources.

The book also includes a chapter on Scala, the hottest functional programming language and the language in which Spark is written. You'll learn the basics of functional programming in Scala, so that you can write Spark applications in it.

What's more, Big Data Analytics with Spark provides an introduction to other big data technologies that are commonly used along with Spark, such as Hive, Avro, and Kafka. The book is therefore self-contained: it covers all the technologies you need to know to use Spark. The only prerequisite is programming experience in any language.

There is a critical shortage of people with big data expertise, so companies are willing to pay top dollar for skills in areas like Spark and Scala. Reading this book and absorbing its principles can therefore give your career a boost, possibly a big one.

Table of Contents

Frontmatter
Chapter 1. Big Data Technology Landscape
Abstract
We are in the age of big data. Data has not only become the lifeblood of any organization, but is also growing exponentially. Data generated today is several orders of magnitude larger than what was generated just a few years ago. The challenge is how to get business value out of this data. This is the problem that big data–related technologies aim to solve. Therefore, big data has become one of the hottest technology trends over the last few years. Some of the most active open source projects are related to big data, and the number of these projects is growing rapidly. The number of startups focused on big data has exploded in recent years. Large established companies are making significant investments in big data technologies.
Mohammed Guller
Chapter 2. Programming in Scala
Abstract
Scala is one of the hottest modern programming languages. It is the Cadillac of programming languages. It is not only a powerful language but also a beautiful one. Learning Scala will provide a boost to your professional career.
Mohammed Guller
Chapter 3. Spark Core
Abstract
Spark is the most active open source project in the big data world. It has become hotter than Hadoop. It is considered the successor to Hadoop MapReduce, which we discussed in Chapter 1. Spark adoption is growing rapidly. Many organizations are replacing MapReduce with Spark.
Mohammed Guller
Chapter 4. Interactive Data Analysis with Spark Shell
Abstract
One of the reasons for Spark’s hockey-stick growth is its usability. It not only provides a rich expressive API in multiple languages, but it also makes it easy to get started. It comes with a command-line tool called Spark shell, which allows you to interactively write Spark applications in Scala. The Spark shell is similar to the Scala shell, discussed in Chapter 2. In fact, it is based on the Scala shell.
Mohammed Guller
Chapter 5. Writing a Spark Application
Abstract
This chapter discusses how to write a data processing application in Scala using Spark. Chapter 2 covered the basics of Scala, which you need to know to get started.
Mohammed Guller
Chapter 6. Spark Streaming
Abstract
Batch processing of historical data was one of the first use cases for big data technologies such as Hadoop and Spark. In batch processing, data is collected for a period of time and processed in batches. A batch processing system processes data spanning from hours to years, depending on the requirements. For example, some organizations run nightly batch processing jobs, which process data collected throughout the day by various systems.
Mohammed Guller
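The batch model described in the Chapter 6 abstract (collect data over a period, then process it in one go) can be illustrated with a minimal sketch. This is plain Python rather than Spark code, and the event records, the function names `group_by_day` and `run_nightly_job`, and the per-day summary are all assumptions invented for illustration; they do not come from the book.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event records (ISO timestamp, amount), invented for illustration.
EVENTS = [
    ("2015-03-01T09:15:00", 20.0),
    ("2015-03-01T17:40:00", 5.5),
    ("2015-03-02T08:05:00", 12.0),
    ("2015-03-02T23:59:00", 3.0),
    ("2015-03-03T00:01:00", 7.5),
]


def group_by_day(events):
    """Collect events into one batch per calendar day."""
    batches = defaultdict(list)
    for timestamp, amount in events:
        day = datetime.fromisoformat(timestamp).date()
        batches[day].append(amount)
    return dict(batches)


def run_nightly_job(batch):
    """Process one complete batch; here it only counts and totals the amounts."""
    return {"count": len(batch), "total": sum(batch)}


# Each day's data is processed only after the whole day has been collected.
daily_batches = group_by_day(EVENTS)
daily_report = {
    day.isoformat(): run_nightly_job(batch)
    for day, batch in sorted(daily_batches.items())
}
```

The point of the sketch is the latency trade-off: an event recorded at 09:15 is not reflected in any result until its day's batch runs, which is the gap that stream processing is meant to close.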
Chapter 7. Spark SQL
Abstract
Ease of use is one of the reasons Spark became popular. It provides a simpler programming model than Hadoop MapReduce for processing big data. However, the number of people who are fluent in the languages supported by the Spark core API is a lot smaller than the number of people who know the venerable SQL.
Mohammed Guller
Chapter 8. Machine Learning with Spark
Abstract
Interest in machine learning is growing by leaps and bounds. It has gained a lot of momentum in recent years for a few reasons. The first reason is performance improvements in hardware and algorithms. Machine learning is compute-intensive. With the proliferation of multi-CPU and multi-core machines and efficient algorithms, it has become feasible to do machine learning computations in a reasonable time. The second reason is that machine learning software has become freely available. Many good-quality open source machine learning packages are now available for anyone to download. The third reason is that MOOCs (massive open online courses) have created tremendous awareness of machine learning. These courses have democratized the knowledge required to use machine learning. Machine learning skills are no longer limited to a few people with a Ph.D. in statistics. Anyone can now learn and apply machine learning techniques.
Mohammed Guller
Chapter 9. Graph Processing with Spark
Abstract
Data is generally stored and processed as a collection of records or rows. It is represented as a two-dimensional table with data divided into rows and columns. However, collections or tables are not the only way to represent data. Sometimes, a graph provides a better representation of data than a collection.
Mohammed Guller
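The contrast drawn in the Chapter 9 abstract between a table of rows and a graph can be sketched in a few lines. This is plain Python, not the GraphX API; the "follows" rows and the helper names `build_adjacency` and `hops_from` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict, deque

# Tabular view: each row is one "follows" relationship (hypothetical data).
FOLLOWS_ROWS = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("alice", "erin"),
]


def build_adjacency(rows):
    """Graph view: map each vertex to the list of vertices it points to."""
    adjacency = defaultdict(list)
    for source, target in rows:
        adjacency[source].append(target)
    return adjacency


def hops_from(adjacency, start):
    """Breadth-first search: fewest edges from start to each reachable vertex."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbor in adjacency.get(vertex, []):
            if neighbor not in distance:
                distance[neighbor] = distance[vertex] + 1
                queue.append(neighbor)
    return distance


graph = build_adjacency(FOLLOWS_ROWS)
hops = hops_from(graph, "alice")
```

A question such as "how many hops separate alice and dave?" is a single traversal in the graph view, whereas in the row-oriented view it would take one self-join of the table per hop.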
Chapter 10. Cluster Managers
Abstract
A cluster manager manages a cluster of computers. To be more specific, it manages resources such as CPU, memory, storage, ports, and other resources available on a cluster of nodes. It pools together the resources available on each cluster node and enables different applications to share these resources. Thus, it turns a cluster of commodity computers into a virtual super-computer that can be shared by multiple applications.
Mohammed Guller
Chapter 11. Monitoring
Abstract
Monitoring is a critical part of application management. It plays an even more important role in distributed computing, which involves a lot more moving parts. The likelihood of something failing, or of an application not performing at an optimal level, is high.
Mohammed Guller
Chapter 12. Bibliography
Abstract
Armbrust, Michael, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, Matei Zaharia. Spark SQL: Relational Data Processing in Spark. https://amplab.cs.berkeley.edu/publication/spark-sql-relational-data-processing-in-spark .
Mohammed Guller
Backmatter
Metadata
Title
Big Data Analytics with Spark
Author
Mohammed Guller
Copyright year
2015
Publisher
Apress
Electronic ISBN
978-1-4842-0964-6
Print ISBN
978-1-4842-0965-3
DOI
https://doi.org/10.1007/978-1-4842-0964-6