
About this book

Gain insight into essential data science skills in a holistic manner using data engineering and associated scalable computational methods. This book covers the most popular Python 3 frameworks for both local and distributed (on-premises and cloud-based) processing. Along the way, you will be introduced to many popular open-source frameworks, such as SciPy, scikit-learn, Numba, and Apache Spark. The book is structured around examples, so you will grasp core concepts via case studies and Python 3 code.
As data science projects grow larger and more complex, software engineering knowledge and experience are crucial for producing evolvable solutions. You'll see how to create maintainable software for data science and how to document data engineering practices.
This book is a good starting point for people who want to gain practical skills to perform data science. All the code will be available in the form of IPython notebooks and Python 3 programs, which allow you to reproduce all analyses from the book and customize them for your own purpose. You'll also benefit from advanced topics like Machine Learning, Recommender Systems, and Security in Data Science.
Practical Data Science with Python will empower you to analyze data, formulate proper questions, and produce actionable insights, three core stages in most data science endeavors.

What You'll Learn

- Play the role of a data scientist when completing increasingly challenging exercises using Python 3
- Work with proven data science techniques/technologies
- Review scalable software engineering practices to ramp up data analysis abilities in the realm of Big Data
- Apply the theory of probability, statistical inference, and algebra to understand data science practices

Who This Book Is For

Anyone who would like to venture into the realm of data science using Python 3.

Table of Contents

Frontmatter

Chapter 1. Introduction to Data Science

Abstract
Let me start by making an analogy between software engineering and data science. Software engineering may be summarized as the application of engineering principles and methods to the development of software. The aim is to produce a dependable software product. In a similar vein, data science may be described as the application of scientific principles and methods in working with data. The goal is to synthesize reliable and actionable insights from data (sometimes referred to as a data product). To continue with our analogy, the systems/software development life cycle (SDLC) prescribes the major phases of a software development process: project initiation, requirements engineering, design, construction, testing, deployment, and maintenance. The data science process also encompasses multiple phases: project initiation, data acquisition, data preparation, data analysis, reporting, and execution of actions (another “phase” is data exploration, which is more of an all-embracing activity than a stand-alone phase). As in software development, these phases are quite interwoven, and the process is inherently iterative and incremental. An overarching activity that is indispensable in both software engineering and data science (and any other iterative and incremental endeavor) is retrospection, which involves reviewing a project or process to determine what was successful and what could be improved. Another similarity to software engineering is that data science also relies on a multidisciplinary team or team of teams. A typical project requires domain experts, software engineers specializing in various technologies, and mathematicians (a single person may take different roles at various times). Yet another common denominator with software engineering is a penchant for automation (via programmability of most activities) to increase productivity, reproducibility, and quality. The aim of this chapter is to explain the key concepts regarding data science and put them into proper context.
Ervin Varga

Chapter 2. Data Engineering

Abstract
After project initiation, the data engineering team takes over to build the necessary infrastructure to acquire (identify, retrieve, and query), munge, explore, and persist data. The goal is to enable further data analysis tasks. Data engineering requires different expertise than is required in later stages of a data science process. It is typically an engineering discipline oriented toward craftsmanship to provide the necessary input to later phases. Often disparate technologies must be orchestrated to handle data communication protocols and formats, perform exploratory visualizations, and preprocess (clean, integrate, and package), scale, and transform data. All these tasks must be done in the context of a global project vision and mission, relying on domain knowledge. It is extremely rare that raw data from sources is immediately in perfect shape for analysis. Even in the case of a clean dataset, there is often a need to simplify it. Consequently, dimensionality reduction coupled with feature selection (remove, add, and combine) is also part of data engineering. This chapter illustrates data engineering through two detailed case studies, which highlight most aspects of it.
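As a rough illustration of the munging steps mentioned above (the column names and values here are hypothetical, not taken from the book's case studies), a few lines of pandas are often enough to clean rows, derive a new feature, and drop a non-informative column:

```python
# A minimal sketch of typical data preparation steps with pandas.
import pandas as pd

raw = pd.DataFrame({
    "height_cm": [180, None, 165, 172],
    "weight_kg": [80, 75, None, 68],
    "id": [1, 2, 3, 4],
})

df = raw.dropna()                                              # clean: drop incomplete rows
df = df.assign(bmi=df.weight_kg / (df.height_cm / 100) ** 2)   # combine features into a new one
df = df.drop(columns=["id"])                                   # remove a non-informative column
print(df)
```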
Ervin Varga

Chapter 3. Software Engineering

Abstract
An integral part of data science is executing efficient computations on large datasets. Such computations are driven by computer programs, and as problems increase in size and complexity, the accompanying software solutions tend to become larger and more intricate, too. Such software is built by organizations structured around teams. Data science is also a team effort, so effective communication and collaboration are extremely important in both software engineering and data science. Software developed by one team must be comprehensible to other teams to foster use, reuse, and evolution. This is where software maintainability, as a pertinent quality attribute, kicks in. This chapter presents important lessons from the realm of software engineering in the context of data science. The aim is to teach data science practitioners how to craft evolvable programs and increase productivity.
Ervin Varga

Chapter 4. Documenting Your Work

Abstract
Data science and scientific computing are human-centered, collaborative endeavors that gather teams of experts covering multiple domains. You will rarely perform any serious data analysis task alone. Therefore, efficient intra-team communication that ensures proper information exchange within a team is required. There is also a need to convey all details of an analysis to relevant external parties. Your team is part of a larger scientific community, so others must be able to easily validate and verify your team’s findings. Reproducibility of an analysis is as important as the result itself. Achieving this requirement—to effectively deliver data, programs, and associated narrative as an interactive bundle—is not a trivial task. You cannot assume that everybody who wants to peek into your analysis is an experienced software engineer. On the other hand, all stakeholders aspire to make decisions based on available data. Fortunately, there is a powerful open-source solution for reconciling differences in individuals’ skill sets. This chapter introduces Project Jupyter (see https://jupyter.org), the most popular ecosystem for documenting and sharing data science work.
Ervin Varga

Chapter 5. Data Processing

Abstract
Data analysis is the central phase of a data science process. It is similar to the construction phase in software development, where actual code is produced. The focus is on being able to handle large volumes of data to synthesize actionable insights and knowledge. Data processing is the major phase where math and software engineering skills interplay to cope with all sorts of scalability issues (size, velocity, complexity, etc.). It isn’t enough to simply pile up various technologies in the hope that all will auto-magically align and deliver the intended outcome. Knowing the basic paradigms and mechanisms is indispensable. This is the main topic of this chapter: to introduce and exemplify pertinent concepts related to scalable data processing. Once you properly understand these concepts, you will be in a much better position to comprehend why a particular choice of technologies would be the best way to go.
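To make the idea of a scalable processing paradigm concrete, here is a minimal sketch (assuming a local PySpark installation; the example is not from the chapter) that expresses a computation as parallel transformations over data partitions:

```python
# A minimal local Spark job: the map/reduce pair runs across partitions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()

rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)  # sum of squares, computed in parallel
print(total)

spark.stop()
```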
Ervin Varga

Chapter 6. Data Visualization

Abstract
Visualization is a powerful method to gain more insight into data and underlying processes. It is extensively used during the exploration phase to attain a better sense of how data is structured. Visual presentation of results is also an effective way to convey the key findings; people usually grasp nice and informative diagrams much more easily than they grasp facts embedded in convoluted tabular form. Of course, this doesn’t mean that visualization should replace other approaches; the best effect is achieved by judiciously combining multiple ways to describe data and outcomes. The topic of visualization itself is very broad, so this chapter focuses on two aspects that are not commonly discussed elsewhere: how visualization may help in optimizing applications, and how to create dynamic dashboards for high-velocity data. You can find examples of other uses throughout this book.
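As a hedged sketch of the high-velocity idea (not the chapter's dashboard code; the data stream is simulated), the following loop redraws a Matplotlib plot as new measurements arrive:

```python
# Redraw a plot incrementally as simulated measurements stream in.
import random
import matplotlib.pyplot as plt

plt.ion()                          # interactive mode: the figure updates live
fig, ax = plt.subplots()
line, = ax.plot([], [])
xs, ys = [], []

for t in range(100):               # simulated stream of measurements
    xs.append(t)
    ys.append(random.gauss(0, 1))
    line.set_data(xs, ys)
    ax.relim()
    ax.autoscale_view()
    plt.pause(0.05)                # let the GUI event loop redraw
```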
Ervin Varga

Chapter 7. Machine Learning

Abstract
Machine learning is regarded as a subfield of artificial intelligence that deals with algorithms and technologies to squeeze out knowledge from data. Its fundamental ingredient is Big Data, since without the help of a machine, our attempt to manually process huge volumes of data would be hopeless. As a product of computer science, machine learning tries to approach problems algorithmically rather than purely via mathematics. An external spectator of a machine learning module would admire it as some sort of magic happening inside a box. Eager reductionism may lead us to say that it is all just “bare” code executed on a classical computer system. Of course, such a statement would be an abomination. Machine learning does belong to a separate branch of software, which learns from data instead of blindly following predefined rules. Nonetheless, for its efficient application, we must know how and what such algorithms learn as well as what type of algorithm(s) to apply in a given context. No machine learning system can notice that it is being misapplied. The goal of this chapter is to lay down the foundational concepts and principles of machine learning exclusively through examples.
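For a taste of the learn-from-data idea, here is a minimal, self-contained sketch using scikit-learn (the dataset and model choice are illustrative, not taken from the chapter):

```python
# Fit a simple classifier on a sample dataset and measure its accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)                          # the "learning from data" step
print(accuracy_score(y_test, model.predict(X_test)))  # how well the learned rules generalize
```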
Ervin Varga

Chapter 8. Recommender Systems

Abstract
When someone asks where machine learning is broadly applied in an industrial context, recommender systems is a typical answer. Indeed, these systems are ubiquitous, and we rely on them a lot. Amazon is maybe the best example of an e-commerce site that utilizes many types of recommendations to enhance users’ experience and help them quickly find what they are looking for. Spotify, whose domain is music, is another good example. Despite heavy usage of machine learning, recommender systems differ in two crucial ways from classically trained ones:
Ervin Varga

Chapter 9. Data Security

Abstract
Data science is all about data, which inevitably also includes sensitive information about people, organizations, government agencies, and so on. Any confidential information must be handled with utmost care and kept secret from villains. Protecting privacy and squeezing out value from data are opposing forces, like performance optimization and maintainability of a software system. As you improve one, you diminish the other. As data scientists, we must ensure both that data is properly protected and that our data science product is capable of fending off abuse as well as unintended usage (for example, prevent it from being used to manipulate people by recommending specific items to them or convincing them to act in some particular manner). All protective actions should interplay nicely with the usefulness of a data science product; otherwise, there is no point in developing the product.
Ervin Varga

Chapter 10. Graph Analysis

Abstract
A graph is the primary data structure for representing different types of networks (for example, directed, undirected, weighted, signed, and bipartite). Networks are most naturally represented as a set of nodes with links between them. Social networks are one very prominent class of networks. The Internet is the biggest real-world network, easily codified as a graph. Furthermore, road networks and related algorithms to calculate various routes are again based on graphs. This chapter introduces the basic elements of a graph, shows how to manipulate graphs using a powerful Python framework, NetworkX, and exemplifies ways to transform various problems into graph problems. This chapter also unravels pertinent metrics to evaluate properties of graphs and associated networks. Special attention will be given to bipartite graphs and corresponding algorithms like graph projections. We will also cover some methods to generate artificial networks with predefined properties.
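A small sketch of the kind of NetworkX manipulation described above, including a bipartite graph and its weighted one-mode projection (the nodes and edges are made up for illustration):

```python
# Build a toy user-product bipartite graph and project it onto the user set.
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
B.add_nodes_from(["u1", "u2", "u3"], bipartite=0)   # users
B.add_nodes_from(["p1", "p2"], bipartite=1)         # products
B.add_edges_from([("u1", "p1"), ("u2", "p1"), ("u2", "p2"), ("u3", "p2")])

# Users become linked if they share at least one product; weights count shared products.
users = {n for n, d in B.nodes(data=True) if d["bipartite"] == 0}
P = bipartite.weighted_projected_graph(B, users)
print(P.edges(data=True))

# A couple of the metrics mentioned above.
print(nx.is_bipartite(B))
print(nx.degree_centrality(B))
```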
Ervin Varga

Chapter 11. Complexity and Heuristics

Abstract
As a data scientist, you must solve essential problems in an economical fashion, taking into account all constraints and functional requirements. The biggest mistake is to get emotionally attached to some particular technique and try to find “suitable” problems to which to apply it; this is pseudo-science, at best. A similar issue arises when you attempt to apply some convoluted approach where a simpler and more elegant method exists. I’ve seen these two recurring themes myriad times. For example, logistic regression (an intuitive technique) can often outperform kernelized support vector machines in classification tasks, and a bit of cleverness in feature engineering may completely obviate the need for an opaque neural network–oriented approach. Start simple and keep it simple is perhaps the best advice for any data scientist. In order to achieve simplicity, you must invest an enormous amount of energy and time, but the end result will be more than rewarding (especially when including long-term benefits). Something that goes along with simplicity is pragmatism—good enough is usually better than perfect. This chapter will exemplify an attitude toward simplicity and pragmatism as well as acquaint you with the notions of complexity and heuristics. It contains many small use cases, each highlighting a particular topic pertaining to succinctness and effectiveness of a data product.
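The logistic-regression-versus-kernelized-SVM remark can be illustrated with a short, hedged comparison (not the author's code; the dataset is an arbitrary scikit-learn sample chosen only to make the sketch runnable):

```python
# Compare a simple linear model against a kernelized SVM on the same data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

simple = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
fancy = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

print("logistic regression:", cross_val_score(simple, X, y, cv=5).mean())
print("kernelized SVM:     ", cross_val_score(fancy, X, y, cv=5).mean())
```

Depending on the dataset, the scores are frequently close, which is the point: prefer the simpler, more interpretable model unless the extra complexity clearly pays off.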
Ervin Varga

Chapter 12. Deep Learning

Abstract
Deep learning is a field of machine learning that is a true enabler of cutting-edge achievements in the domain of artificial intelligence. The term deep implies a complex structure that is designed to handle massive datasets using intensive parallel computations (mostly by leveraging clusters of GPU-equipped machines). The term learning in this context means that feature engineering and customization of model parameters are left to the machine. In practice, the combination of these terms in the form of deep learning implies multilayered neural networks. Neural networks are heavily used for tasks like image classification, voice recognition/synthesis, time series analysis, and so forth. Neural networks tend to mimic how our brain cells work in tandem in decision-making activities. This chapter introduces you to neural networks and how to build them using PyTorch, an open-source Python framework (visit https://pytorch.org) with an API familiar to those accustomed to NumPy. Furthermore, as the last chapter in this book, it exemplifies many stages of the data science life cycle model (data preparation, feature engineering, data visualization, data analysis, and data product deployment). First, though, let’s consider the notion of intelligence as well as when, how, and why it matters.
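As a minimal sketch of a multilayered neural network in PyTorch (layer sizes, data, and hyperparameters are made up, not taken from the chapter):

```python
# A tiny feed-forward classifier trained on random data for a few epochs.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 16),   # input layer -> hidden layer
    nn.ReLU(),
    nn.Linear(16, 3),   # hidden layer -> 3 output classes
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(32, 4)              # fake batch of 32 samples, 4 features each
y = torch.randint(0, 3, (32,))      # fake class labels

for epoch in range(10):             # tiny training loop
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()                 # autograd computes the gradients
    optimizer.step()
```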
Ervin Varga

Backmatter
