
2018 | Book


Computing with Data

An Introduction to the Data Industry

Authors: Guy Lebanon, Mohamed El-Geish

Publisher: Springer International Publishing

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"


About this book

This book introduces basic computing skills designed for industry professionals without a strong computer science background. Written in an easily accessible manner, and accompanied by a user-friendly website, it serves as a self-study guide to survey data science and data engineering for those who aspire to start a computing career, or expand on their current roles, in areas such as applied statistics, big data, machine learning, data mining, and informatics.

The authors draw from their combined experience working at software and social network companies, on big data products at several major online retailers, as well as their experience building big data systems for an AI startup. Spanning from the basic inner workings of a computer to advanced data manipulation techniques, this book opens doors for readers to quickly explore and enhance their computing knowledge.

Computing with Data comprises a wide range of computational topics essential for data scientists, analysts, and engineers, providing them with the necessary tools to be successful in any role that involves computing with data. The introduction is self-contained, and chapters progress from basic hardware concepts to operating systems, programming languages, graphing and processing data, testing and programming tools, big data frameworks, and cloud computing.

The book is fashioned with several audiences in mind. Readers without a strong educational background in CS, or those who need a refresher, will find the chapters on hardware, operating systems, and programming languages particularly useful. Readers with a strong educational background in CS, but without significant industry background, will find the following chapters especially beneficial: learning R, testing, programming, visualizing and processing data in Python and R, system design for big data, data stores, and software craftsmanship.

Table of contents

Frontmatter

Chapter 1. Introduction: How to Use This Book?

Abstract

Machine learning, data analysis, and artificial intelligence are becoming increasingly ubiquitous in our lives, and more central to the high-tech industry. These fields play a central role in many of the recent and upcoming revolutions in computing; for example, social networks, streaming video on demand, personal assistants (e.g., Alexa, Siri, and Google Assistant), and self-driving cars. Alphabet’s Executive Chairman, Eric Schmidt, went a step further at the 2016 Google Cloud Computing Conference in San Francisco when he said, “Machine learning and crowdsourcing data will be the basis and fundamentals of every successful huge IPO win in five years.”

Guy Lebanon, Mohamed El-Geish

Chapter 2. Essential Knowledge: Hardware

Abstract

In order to implement efficient computer programs, it’s essential to understand the basic hardware structure of computers. In this chapter we examine the hardware components of a typical computer (CPU, memory, storage, GPU, etc.) focusing on issues that are relevant for software development and algorithm design. We also explore concepts like binary representations of numbers and strings, assembly language, multiprocessors, and the memory hierarchy.

Guy Lebanon, Mohamed El-Geish
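
The binary representations mentioned in the Chapter 2 abstract can be illustrated with a short Python sketch. It is not taken from the book; the helper name `to_binary` and the 8-bit default width are assumptions chosen for illustration.

```python
import struct


def to_binary(n: int, bits: int = 8) -> str:
    """Return the fixed-width two's-complement bit string of an integer.

    Hypothetical helper for illustration; masking with (2**bits - 1)
    maps negative values onto their two's-complement bit pattern.
    """
    return format(n & ((1 << bits) - 1), f"0{bits}b")


# Integers: a positive value and its two's-complement negative.
five = to_binary(5)        # "00000101"
minus_five = to_binary(-5)  # "11111011"

# Strings: each character is stored as one or more bytes (UTF-8 here).
text_bits = [to_binary(b) for b in "data".encode("utf-8")]

# Floats: IEEE 754 single precision, shown as big-endian hex.
half_hex = struct.pack(">f", 0.5).hex()  # "3f000000"

print(five, minus_five, text_bits, half_hex)
```

The same value can therefore look very different in memory depending on whether it is treated as an integer, a character, or a floating-point number.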

Chapter 3. Essential Knowledge: Operating Systems

Abstract

In Chap. 2, we discussed the CPU and how it executes a sequence of assembly language instructions. This suggests the following model: a programmer writes a computer program as a sequence of assembly language instructions, loads them into memory, and instructs the CPU to execute them one by one by pointing the program counter at the relevant memory address. Unfortunately, there are many problems with this scheme: only one program can run at any particular time, one programmer may overwrite information important for another programmer, and it is hard to reuse instructions implementing common tasks. The operating system mediates between the computer hardware and programmers, and resolves difficulties such as the ones mentioned above. This chapter describes the concepts of an operating system, then delves into the Linux and Windows operating systems in some detail as concrete examples; in addition, it explores command-line interfaces like bash, Command Prompt, and PowerShell, which are essential for developers to know intimately.

Guy Lebanon, Mohamed El-Geish
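
The execution model described in the Chapter 3 abstract, in which the CPU steps through instructions by following a program counter, can be sketched as a toy interpreter. The instruction set below (`LOAD`, `ADD`, `JNZ`, `HALT`) is hypothetical and far simpler than any real assembly language; it only illustrates how the program counter drives execution.

```python
def run(program):
    """Execute a list of toy instructions and return the final registers.

    The instruction names and tuple format are assumptions made for
    this sketch, not a real instruction set.
    """
    regs = {}
    pc = 0  # program counter: index of the next instruction to execute
    while pc < len(program):
        op, *args = program[pc]
        pc += 1  # by default, continue with the following instruction
        if op == "LOAD":      # LOAD reg, constant
            regs[args[0]] = args[1]
        elif op == "ADD":     # ADD dst, src  (dst = dst + src)
            regs[args[0]] += regs[args[1]]
        elif op == "JNZ":     # JNZ reg, target: jump if reg is not zero
            if regs[args[0]] != 0:
                pc = args[1]
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown instruction: {op}")
    return regs


# Sum 3 + 2 + 1 with a loop that jumps back to instruction 3.
program = [
    ("LOAD", "acc", 0),
    ("LOAD", "n", 3),
    ("LOAD", "minus_one", -1),
    ("ADD", "acc", "n"),
    ("ADD", "n", "minus_one"),
    ("JNZ", "n", 3),
    ("HALT",),
]
result = run(program)
print(result["acc"])  # 6
```

Only one program runs at a time in this model and nothing protects its memory from other code, which are the limitations the abstract says an operating system addresses.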

Chapter 4. Learning C++

Abstract

C++ is a programming language that is especially well suited for computationally intensive programs and for interfacing with hardware or the operating system. In this chapter, we describe C++ starting with low-level features such as variable types, operators, pointers, arrays, I/O, and control flow, and concluding with object-oriented programming and the standard template library. We consider the latest version of C++ at the time of writing: C++17.

Guy Lebanon, Mohamed El-Geish

Chapter 5. Learning Java

Abstract

The Java programming language is heavily used in big data applications like Apache Cassandra and Elasticsearch. In this chapter, we describe Java starting with compilation, types, operators, I/O, control flow, etc., and concluding with object-oriented programming and other features. We cover parallel programming using Java in Chap. 10.

Guy Lebanon, Mohamed El-Geish

Chapter 6. Learning Python and a Few More Things

Abstract

Python is one of the most popular programming languages. It’s broadly used in programming web applications, writing scripts for automation, accessing data, processing text, data analysis, etc. Many software packages that are useful for data analysis (like NumPy, SciPy, and Pandas) and machine learning (scikit-learn, TensorFlow, Keras, and PyTorch) can be integrated within a Python application in a few lines of code. In this chapter, we explore the programming language with an approach similar to the one we took for C++ and Java. In addition, we explore tools and packages that help accelerate the development of data-driven applications using Python.

Guy Lebanon, Mohamed El-Geish

Chapter 7. Learning R

Abstract

R is a programming language that’s especially designed for data analysis and data visualization. In some cases, it’s more convenient to use R than C++ or Java, making R a key data analysis tool. In this chapter, we describe similarities and differences between R and its close relatives: Matlab and Python. We then delve into the R programming language to learn about data types, control flow, interfacing with C++, etc.

Guy Lebanon, Mohamed El-Geish

Chapter 8. Visualizing Data in R and Python

Abstract

Visualizing data is key to effective data analysis: to perform initial investigations, to confirm or refute data models, and to elucidate mathematical or algorithmic concepts. In this chapter, we explore different types of data graphs using the R programming language, which has excellent graphics functionality; we end the chapter with a description of Python’s matplotlib module, a popular Python tool for data visualization.

Guy Lebanon, Mohamed El-Geish

Chapter 9. Processing Data in R and Python

Abstract

There is no shortcut to knowledge; and there are no worthwhile data without preprocessing. In the first three sections of this chapter, we discuss situations that necessitate data preprocessing and how to handle them. In the final section we discuss how to manipulate data in general; specifically, how to manipulate data in R using the reshape2 and plyr packages and in Python using the pandas module.

Guy Lebanon, Mohamed El-Geish

Chapter 10. Essential Knowledge: Parallel Programming

Abstract

At the heart of any big data system is a plethora of processes and algorithms that run in parallel to crunch data and produce results that would have taken ages if they were run in a sequential manner. Parallel computing is what enables companies like Google to index the Internet and provide big data systems like email, video streaming, etc. Once workloads can be distributed effectively over multiple processes, scaling the processing horizontally becomes an easier task. In this chapter, we will explore how to parallelize work among concurrent processing units; such concepts apply for the most part whether said processing units are concurrent threads in the same process, or multiple processes running on the same machine or on multiple machines. If you haven’t already read about process management and scheduling in Sect. 3.4 of Chap. 3, now would be a good time to visit that.

Guy Lebanon, Mohamed El-Geish
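
The idea in the Chapter 10 abstract of distributing a workload over concurrent processing units can be sketched with Python's standard `concurrent.futures` module. This is an illustrative example, not the book's code; the function names `chunked`, `count_words`, and `parallel_word_count` are assumptions, and threads are used here only for simplicity.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, n_chunks):
    """Split a list into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(lines):
    """Work done by one processing unit: count words in its chunk."""
    return sum(len(line.split()) for line in lines)


def parallel_word_count(lines, workers=4):
    """Distribute chunks over worker threads, then combine the partial results."""
    chunks = chunked(lines, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_words, chunks))


lines = ["a b c", "d e", "f"] * 10
total = parallel_word_count(lines)
print(total)  # 60
```

The split, process, and combine structure stays the same if the workers are separate processes or separate machines. In standard CPython builds, threads give little speedup for CPU-bound work because of the global interpreter lock, so a process pool is the usual choice for that case.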

Chapter 11. Essential Knowledge: Testing

Abstract

We believe that testing is inevitable; the question is: When do you want it to happen? You can let users test your code for you in production, at a hefty cost; or you can catch bugs as early as possible in your code’s lifecycle and save yourself the embarrassment and the money. Improper testing can cause many things to go wrong, sometimes with fatal consequences.

Guy Lebanon, Mohamed El-Geish

Chapter 12. A Few More Things About Programming

Abstract

In this chapter, we explore a few more things about programming; specifically, tools used to write, store (version control), build, debug, and document code. In addition, we explore miscellaneous topics like exceptions.

Guy Lebanon, Mohamed El-Geish

Chapter 13. Essential Knowledge: Data Stores

Abstract

This chapter is all about data in the various shapes, forms, and formats they take. We explore different ways to format and store data: JSON, databases, SQL, NoSQL, and memory mapping. We also discuss concepts like atomicity, consistency, isolation, and durability of data transactions.

Guy Lebanon, Mohamed El-Geish
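
Atomicity, one of the transaction properties named in the Chapter 13 abstract, can be illustrated with Python's built-in `sqlite3` module. This is a minimal sketch under assumptions: the `accounts` table, the account names, and the `transfer` function are hypothetical and do not come from the book.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move funds between two accounts as a single atomic transaction."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises an exception.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()

transfer(conn, "alice", "bob", 30)       # succeeds and is committed
try:
    transfer(conn, "alice", "bob", 500)  # fails; both updates are undone
except ValueError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 70, 'bob': 80}
```

The failed transfer leaves no partial update behind: either both `UPDATE` statements take effect or neither does.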

Chapter 14. Thoughts on System Design for Big Data

Abstract

In the context of computing with data, what exactly is a system? Generally speaking, a system is an aggregation of computing components (and the links between them) that collectively provide a solution to a problem. System design covers choices that system designers make regarding such components: hardware (e.g., servers, networks, sensors, etc.); software (e.g., operating systems, cluster managers, applications, etc.); data (e.g., collection, retention, processing, etc.); and other components that vary based on the nature of each solution. There’s no free lunch in system design and no silver bullet; instead, there are patterns that can jumpstart a solution; and for the most part, there will always be tradeoffs. Skilled system designers learn how to deal with novel problems and ambiguity; one of the skills they practice is decomposing a complex problem into more manageable subproblems that look analogous to ones that can be solved using known patterns, then connecting those components together to solve the complex problem. In this chapter, we put on our designer hats and explore various aspects of system design in practice by creating a hypothetical big-data solution: a productivity bot.

Guy Lebanon, Mohamed El-Geish

Chapter 15. Thoughts on Software Craftsmanship

Abstract

Why should we care about code craftsmanship? The long answer is in this chapter. First, let’s see what a French philosopher, by the name of Guillaume Ferrero, discovered: the Principle of Least Effort. He wrote about it in 1894—long before a single line of computer code, as we know it today, was written. Nevertheless, we can’t help but think that, had he been alive to witness how most code is written, he would have demonstrated coding as the epitome of the least effort principle: “[Coding] stops as soon as minimally acceptable results are found.”

Guy Lebanon, Mohamed El-Geish

Title: Computing with Data
Authors: Guy Lebanon, Mohamed El-Geish
Publisher: Springer International Publishing
Electronic ISBN: 978-3-319-98149-9
Print ISBN: 978-3-319-98148-2
DOI: https://doi.org/10.1007/978-3-319-98149-9