
2015 | Book


Big Data Analytics with Spark

A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing

Author: Mohammed Guller

Publisher: Apress

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"


About this book

Big Data Analytics with Spark is a step-by-step guide for learning Spark, a fast, general-purpose, open-source cluster computing framework for large-scale data analysis. You will learn how to use Spark for different types of big data analytics projects, including batch, interactive, graph, and stream data analysis as well as machine learning. In addition, this book will help you become a much sought-after Spark expert.

Spark is one of the hottest big data technologies. The amount of data generated today by devices, applications, and users is exploding. Therefore, there is a critical need for tools that can analyze large-scale data and unlock value from it. Spark is a powerful technology that meets that need. You can, for example, use Spark to perform low-latency computations through efficient caching and iterative algorithms, use its shell for easy, interactive data analysis, and employ its fast batch processing and low-latency features to process real-time data streams. As a result, Spark adoption is growing rapidly, and Spark is replacing Hadoop MapReduce as the technology of choice for big data analytics.

This book provides an introduction to Spark and related big data technologies. It covers Spark core and its add-on libraries, including Spark SQL, Spark Streaming, GraphX, and MLlib. Big Data Analytics with Spark is written for busy professionals who prefer learning a new technology from a consolidated source instead of spending countless hours on the Internet trying to pick bits and pieces from different sources.

The book also provides a chapter on Scala, one of the hottest functional programming languages and the language in which Spark is written. You’ll learn the basics of functional programming in Scala, so that you can write Spark applications in it.

What's more, Big Data Analytics with Spark provides an introduction to other big data technologies that are commonly used along with Spark, such as Hive, Avro, and Kafka. The book is therefore self-contained: it covers all the technologies that you need to know to use Spark. The only prerequisite is programming experience in any language.

There is a critical shortage of people with big data expertise, so companies are willing to pay top dollar for people with skills in areas like Spark and Scala. Reading this book and absorbing its principles will therefore provide a boost, possibly a big one, to your career.

Table of Contents

Frontmatter

Chapter 1. Big Data Technology Landscape

Abstract

We are in the age of big data. Data has not only become the lifeblood of any organization, but is also growing exponentially. Data generated today is several orders of magnitude larger than what was generated just a few years ago. The challenge is how to get business value out of this data. This is the problem that big data–related technologies aim to solve. As a result, big data has become one of the hottest technology trends over the last few years. Some of the most active open source projects are related to big data, and the number of these projects is growing rapidly. The number of startups focused on big data has exploded in recent years. Large established companies are making significant investments in big data technologies.

Mohammed Guller

Chapter 2. Programming in Scala

Abstract

Scala is one of the hottest modern programming languages. It is the Cadillac of programming languages. It is not only a powerful language but also a beautiful one. Learning Scala will provide a boost to your professional career.

Mohammed Guller

Chapter 3. Spark Core

Abstract

Spark is the most active open source project in the big data world. It has become hotter than Hadoop. It is considered the successor to Hadoop MapReduce, which we discussed in Chapter 1. Spark adoption is growing rapidly. Many organizations are replacing MapReduce with Spark.

Mohammed Guller

Chapter 4. Interactive Data Analysis with Spark Shell

Abstract

One of the reasons for Spark’s hockey-stick growth is its usability. It not only provides a rich, expressive API in multiple languages, but it also makes it easy to get started. It comes with a command-line tool called Spark shell, which allows you to interactively write Spark applications in Scala. The Spark shell is similar to the Scala shell, discussed in Chapter 2. In fact, it is based on the Scala shell.

Mohammed Guller

Chapter 5. Writing a Spark Application

Abstract

This chapter discusses how to write a data processing application in Scala using Spark. Chapter 2 covered the basics of Scala, which you need to know to get started.

Mohammed Guller

Chapter 6. Spark Streaming

Abstract

Batch processing of historical data was one of the first use cases for big data technologies such as Hadoop and Spark. In batch processing, data is collected for a period of time and processed in batches. A batch processing system processes data spanning from hours to years, depending on the requirements. For example, some organizations run nightly batch processing jobs, which process data collected throughout the day by various systems.
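As a rough illustration of the batch model described in this abstract, the following plain-Python sketch groups timestamped events by calendar day and then processes each day's batch as a unit. It is not taken from the book and does not use Spark; the function names, the sample events, and the count/total summary are all hypothetical choices made for the example.

```python
from collections import defaultdict
from datetime import datetime


def collect_daily_batches(events):
    """Group (ISO timestamp, value) events by the calendar day they occurred on."""
    batches = defaultdict(list)
    for timestamp, value in events:
        day = datetime.fromisoformat(timestamp).date()
        batches[day].append(value)
    return dict(batches)


def process_batch(values):
    """Summarize one batch; count and total stand in for real processing."""
    return {"count": len(values), "total": sum(values)}


# Hypothetical events collected throughout two days by some upstream system.
events = [
    ("2015-06-01T09:15:00", 10),
    ("2015-06-01T17:40:00", 5),
    ("2015-06-02T08:05:00", 7),
]

# A "nightly job": each day's data is processed only after it has been collected.
daily_summary = {
    day.isoformat(): process_batch(values)
    for day, values in sorted(collect_daily_batches(events).items())
}
```

The point of the sketch is only the shape of the workflow: data accumulates first and is processed later in one pass per period, in contrast to handling each event as it arrives.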

Mohammed Guller

Chapter 7. Spark SQL

Abstract

Ease of use is one of the reasons Spark became popular. It provides a simpler programming model than Hadoop MapReduce for processing big data. However, the number of people who are fluent in the languages supported by the Spark core API is a lot smaller than the number of people who know the venerable SQL.

Mohammed Guller

Chapter 8. Machine Learning with Spark

Abstract

Interest in machine learning is growing by leaps and bounds. It has gained a lot of momentum in recent years for a few reasons. The first reason is performance improvements in hardware and algorithms. Machine learning is compute-intensive. With the proliferation of multi-CPU and multi-core machines and efficient algorithms, it has become feasible to do machine learning computations in reasonable time. The second reason is that machine learning software has become freely available. Many good-quality open source machine learning packages are now available for anyone to download. The third reason is that MOOCs (massive open online courses) have created tremendous awareness about machine learning. These courses have democratized the knowledge required to use machine learning. Machine learning skills are no longer limited to a few people with a Ph.D. in statistics. Anyone can now learn and apply machine learning techniques.

Mohammed Guller

Chapter 9. Graph Processing with Spark

Abstract

Data is generally stored and processed as a collection of records or rows. It is represented as a two-dimensional table with data divided into rows and columns. However, collections or tables are not the only way to represent data. Sometimes, a graph provides a better representation of data than a collection.
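To make the contrast between a table and a graph concrete, here is a small plain-Python sketch, not drawn from the book and not using GraphX. It starts from hypothetical rows of "follows" relationships and builds an adjacency structure from them; the names `rows`, `to_adjacency`, and `reachable` are assumptions made for this illustration.

```python
# Tabular form: one row per directed "follows" relationship (hypothetical data).
rows = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "carol"),
]


def to_adjacency(edge_rows):
    """Turn (source, destination) rows into a vertex -> set-of-neighbors mapping."""
    graph = {}
    for src, dst in edge_rows:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())  # keep vertices that have no outgoing edges
    return graph


def reachable(graph, start):
    """Return every vertex that can be reached from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        vertex = stack.pop()
        for neighbor in graph.get(vertex, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


graph = to_adjacency(rows)
```

A question such as "whom can alice reach through any chain of follows?" is a short traversal on the graph form (`reachable(graph, "alice")`), whereas on the row form it would require repeatedly joining the table with itself.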

Mohammed Guller

Chapter 10. Cluster Managers

Abstract

A cluster manager manages a cluster of computers. To be more specific, it manages resources such as CPU, memory, storage, and ports that are available on a cluster of nodes. It pools together the resources available on each cluster node and enables different applications to share these resources. Thus, it turns a cluster of commodity computers into a virtual super-computer that can be shared by multiple applications.
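The idea of pooling per-node resources and sharing them among applications can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not the behavior of any real cluster manager: the `Node` and `ToyClusterManager` classes, the node sizes, and the first-fit placement rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cores: int
    free_memory_gb: int


class ToyClusterManager:
    """Tracks free resources per node and hands them out to applications."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def total_free(self):
        """Pooled view: (cores, memory in GB) still free across the whole cluster."""
        return (
            sum(node.free_cores for node in self.nodes),
            sum(node.free_memory_gb for node in self.nodes),
        )

    def allocate(self, cores, memory_gb):
        """First-fit placement (an assumption): reserve on the first node that fits."""
        for node in self.nodes:
            if node.free_cores >= cores and node.free_memory_gb >= memory_gb:
                node.free_cores -= cores
                node.free_memory_gb -= memory_gb
                return node.name
        return None  # no single node can satisfy the request


manager = ToyClusterManager([Node("node-1", 4, 16), Node("node-2", 8, 32)])
placement_a = manager.allocate(cores=4, memory_gb=8)   # fits on node-1
placement_b = manager.allocate(cores=6, memory_gb=16)  # node-1 is out of cores
placement_c = manager.allocate(cores=4, memory_gb=8)   # no node has 4 free cores
```

Real cluster managers add scheduling policies, isolation, and failure handling on top of this bookkeeping; the sketch only shows how several applications can draw from one shared pool of node resources.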

Mohammed Guller

Chapter 11. Monitoring

Abstract

Monitoring is a critical part of application management. It plays an even more important role in distributed computing, which involves a lot more moving parts. The possibility of something failing or of an application not performing at an optimal level is high.

Mohammed Guller

Chapter 12. Bibliography

Abstract

Armbrust, Michael, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, Matei Zaharia. Spark SQL: Relational Data Processing in Spark. https://amplab.cs.berkeley.edu/publication/spark-sql-relational-data-processing-in-spark .

Mohammed Guller

Backmatter

Title: Big Data Analytics with Spark
Author: Mohammed Guller
Publisher: Apress
Electronic ISBN: 978-1-4842-0964-6
Print ISBN: 978-1-4842-0965-3
DOI: https://doi.org/10.1007/978-1-4842-0964-6
