About this book

Use this practical guide to successfully handle the challenges encountered when designing an enterprise data lake and learn industry best practices to resolve issues.

When designing an enterprise data lake you often hit a roadblock when you must leave the comfort of the relational world and learn the nuances of handling non-relational data. Starting from sourcing data into the Hadoop ecosystem, you will go through stages that can raise tough questions about data processing, data querying, and security. Concepts such as change data capture and data streaming are covered. The book takes an end-to-end solution approach in a data lake environment that includes data security, high availability, data processing, data streaming, and more.
Each chapter includes application of a concept, code snippets, and use case demonstrations to provide you with a practical approach. You will learn the concept, scope, application, and starting point.

What You'll Learn

Get to know data lake architecture and design principles
Implement data capture and streaming strategies
Implement data processing strategies in Hadoop
Understand the data lake security framework and availability model

Who This Book Is For

Big data architects and solution architects

Table of Contents

Frontmatter

Chapter 1. Introduction to Enterprise Data Lakes

Abstract
It was in 1861 that Charles Joseph Minard, an 80-year-old French civil engineer, attempted to develop a visual that could narrate Napoleon’s disastrous Russian campaign of 1812. The image (Figure 1-1) depicted not just the movement of people but also details of geography, time, temperature, troop count, course, and direction.
Saurabh Gupta, Venkata Giri

Chapter 2. Data Lake Ingestion Strategies

Abstract
A big data strategy, as we learned, is a cost-effective, analytics-driven package of flexible, pluggable, and customized technology stacks. Organizations that have set foot in the big data world have realized that it is not just a trend to follow but a journey to live. The big data trend offers an open ground of unprecedented challenges that demand logical and analytical exploitation of data-driven technologies. Early adopters who began their journeys with trivial solutions for data extraction and ingestion accept that conventional techniques were rather pro-relational and do not offer a cakewalk in the big data world. Traditional approaches to data storage, processing, and ingestion fall well short of handling the variety, disparity, and volume of data.
Saurabh Gupta, Venkata Giri

Chapter 3. Capture Streaming Data with Change-Data-Capture

Abstract
It would be fruitless to design a data lake without factoring in the velocity of data flowing from data sources. Streaming data sources are becoming more critical than ever from a real-time or “freshness” perspective. An era in which the Internet of Things and mobile trends are as prevalent as human rights demands a system that not only matches the pace of data flow but also puts it into action. Data lake beneficiaries and analytics consumers face the tough task of ingesting data in continuous motion.
Saurabh Gupta, Venkata Giri

Chapter 4. Data Processing Strategies in Data Lakes

Abstract
Data analytics trends have been disruptive. It would be an understatement to say that within the data analytics practitioner community there exists a lean school of thought on processing data and drawing insights that are meaningful for business. With the steep increase in data appetite, data management practices have multiplied, which in turn has reinforced advanced analytics expertise and data management policies in the industry. The thought process behind crafting a data strategy is driven by use cases and depends on technical capacity, learning momentum, and, most importantly, the ability to cherry-pick key discoveries that can be magnified into actionable insights to engage customers and drive business. The success mantra for a data analytics practice is to maintain a “preamble” that envisions end goals aligned with the business use cases, in both the short run and the long run. In our earlier chapters, we discussed the pillars of data analytics, i.e., data engineering, data discovery, data science, and data visualization. Data engineering offers a relatively bigger playground, encapsulating ingestion principles, processing techniques, and development.
Saurabh Gupta, Venkata Giri

Chapter 5. Data Archiving Strategies in Data Lakes

Abstract
The linearly growing sophistication of data lakes has empowered the rise of data analytics from descriptive to predictive, and further to prescriptive. The strategies that drive meaty business outcomes rely heavily on data initiatives that offer quality and relevance. An enterprise data lake, being the mainstay of modern cognitive data analytics, relies on a body that guards its lifecycle through the stages of transformation and consumption. How often do you see an analyst questioning data sufficiency for a data model? How often do security analysts mark risk zones for data lake applications to measure their vulnerabilities? This is where data governance comes in, a key pillar of the overall data strategy in an organization.
Saurabh Gupta, Venkata Giri

Chapter 6. Data Security in Data Lakes

Abstract
With an enormous volume and high value of data comes the responsibility to secure that data from external intrusions and mitigate the chances of unwanted attacks. Every year, the world sees ample cases of cyber theft, security breaches, and digital attacks. According to Gartner’s report in Q1 2017, worldwide expenditure on security in 2017 was estimated to be $90 billion, which was 7.6% more than the 2016 figure.
Saurabh Gupta, Venkata Giri

Chapter 7. Ensure High Availability of Data Lake

Abstract
Such is the power of data analytics that enterprises now rely on daily nuggets of information that can unlock and drive new business opportunities. The art and exercise of data accumulation, real-time processing, and data crunching provide businesses with the most distilled form of information. It helps them keep pace with market information, understand industry trends, and act fast. With such a day-to-day dependency on data, organizations pay the utmost attention to the support functions of an enterprise data lake. Data lake support functions include data quality, governance, architecture, and administration. One of the administrative aspects of a data lake is availability and disaster recovery.
Saurabh Gupta, Venkata Giri

Chapter 8. Managing Data Lake Operations

Abstract
By now, readers should have a fair understanding of data analytics in the real world and how a data lake caters to the needs of data analytics. All organizational data assets converge under one hood, consolidating complex data sets into a full-blown data lake. It is essential to understand how to strive for a healthy, stable, and secure data lake. How does an organization tackle security, stability, and availability challenges to ensure that the data lake remains live and adheres to compliance guidelines?
Saurabh Gupta, Venkata Giri

Backmatter
