About This Book

Learn to use Apache Pig to develop lightweight big data applications easily and quickly. This book shows you many optimization techniques and covers every context where Pig is used in big data analytics. Beginning Apache Pig shows you how Pig is easy to learn and requires relatively little time to develop big data applications.

The book is divided into four parts: the complete features of Apache Pig; integration with other tools; how to solve complex business problems; and optimization of tools.

You'll discover topics such as MapReduce and why it cannot meet every business need; the features of Pig Latin such as data types for each load, store, joins, groups, and ordering; how Pig workflows can be created; submitting Pig jobs using Hue; and working with Oozie. You'll also see how to extend the framework by writing UDFs and custom load, store, and filter functions. Finally, you'll cover different optimization techniques such as gathering statistics about a Pig script, joining strategies, parallelism, and the role of data formats in good performance.

What You Will Learn

• Use all the features of Apache Pig
• Integrate Apache Pig with other tools
• Extend Apache Pig
• Optimize Pig Latin code
• Solve different use cases for Pig Latin

Who This Book Is For

All levels of IT professionals: architects, big data enthusiasts, engineers, developers, and big data administrators

Table of Contents

Frontmatter

Chapter 1. MapReduce and Its Abstractions

In this chapter, you will learn about the technologies that existed before Apache Hadoop, about how Hadoop has addressed the limitations of those technologies, and about the new developments since Hadoop was released.
Balaswamy Vaddeman

Chapter 2. Data Types

In this chapter, you will start learning how to code using Pig Latin. This chapter covers data types, type casting among data types, identifiers, and finally some operators.
Balaswamy Vaddeman

Chapter 3. Grunt

In the previous chapter, you learned the Pig Latin fundamentals such as data types, type casting among data types, and operators. In this chapter, you will learn about the command-line interface (CLI) of Pig, called Grunt.
Balaswamy Vaddeman

Chapter 4. Pig Latin Fundamentals

In this chapter, you will learn the basics of Pig Latin. You will learn how to run Pig Latin code, and you will come to understand Pig Latin basic relational operators and parameter substitution.
Balaswamy Vaddeman

Chapter 5. Joins and Functions

Many times you need to retrieve data from more than one relation to generate more meaningful and readable reports. You can use the joins feature of Apache Pig to retrieve data from more than one relation.
Balaswamy Vaddeman

Chapter 6. Creating and Scheduling Workflows Using Apache Oozie

Big data processing in Hadoop usually involves multiple technologies that have to be implemented in a certain order and manner. Often, these technologies also interact with one another. For instance, a certain step n in the workflow can be executed if and only if step n-1 has been successfully executed. Manually executing each of these multiple steps is time-consuming. Apache Oozie addresses this problem by providing dependency management among different steps and technologies.
Balaswamy Vaddeman

Chapter 7. HCatalog

As we discussed in Chapter 1, Apache Hive is a scalable data warehousing technology built on Apache Hadoop. Hive comes with a metastore service that maintains metadata so that users can run any number of queries on already created tables. However, data processing technologies such as MapReduce and Pig do not have a built-in metadata service, so users must define a schema each time they want to run a query.
Balaswamy Vaddeman

Chapter 8. Pig Latin in Hue

Every technology in the Hadoop ecosystem comes with a command-line interface that enhances the user experience with the technology. The Hadoop ecosystem is replete with technologies, and it is impossible to remember all of the commands, which are also case sensitive. Hue (which stands for Hadoop User Experience) alleviates this problem by providing web interfaces for most of the technologies in the Hadoop ecosystem.
Balaswamy Vaddeman

Chapter 9. Pig Latin Scripts in Apache Falcon

In this chapter, you will learn all about Apache Falcon and how to use Pig Latin scripts in Falcon. Apache Falcon is a Hadoop framework used for data lifecycle management. Its applications include data feed management, data replication from one cluster to another, and a lineage of data applications. Although developed by InMobi, it is now an Apache project.
Balaswamy Vaddeman

Chapter 10. Macros

In this chapter, you will learn how to write macros in Pig Latin.
Balaswamy Vaddeman

Chapter 11. User-Defined Functions

In this chapter, you will learn how to write user-defined functions (UDFs) in Pig Latin.
Balaswamy Vaddeman

Chapter 12. Writing Eval Functions

In the previous chapter, you learned how to write user-defined functions. In this chapter, you will learn in detail how to write Eval functions using Java and how to access MapReduce features and Pig features inside Eval functions.
Balaswamy Vaddeman

Chapter 13. Writing Load and Store Functions

Pig provides many load/store functions such as PigStorage, HBaseStorage, and TextLoader, and more are available in PiggyBank. However, your requirements may call for writing your own load/store functions. You will learn how to write load and store functions in this chapter.
Balaswamy Vaddeman

Chapter 14. Troubleshooting

You may get stuck both while developing applications and while running them, so it is important to know how to troubleshoot Pig scripts. Pig provides features and operators for troubleshooting, and you will learn about some of them in this chapter.
Balaswamy Vaddeman

Chapter 15. Data Formats

Storing and maintaining a huge amount of data is one of the problems created by big data. In this chapter, you will learn how to store data efficiently using a few different data formats and compression algorithms.
Balaswamy Vaddeman

Chapter 16. Optimization

In big data processing, performance is important so that people can make quicker decisions based on the available reports. Producing output is not enough; you should also check how long a script took to produce it and try to reduce the running time.
Balaswamy Vaddeman

Chapter 17. Hadoop Ecosystem Tools

In this chapter, you will learn the basics of some other Hadoop ecosystem tools such as Zookeeper, Cascading, Presto, Tez, and Spark.
Balaswamy Vaddeman

Backmatter
