2024 | Book

Introduction to Data Science

A Python Approach to Concepts, Techniques and Applications

About this book

This accessible and classroom-tested textbook/reference presents an introduction to the fundamentals of the interdisciplinary field of data science. The coverage spans key concepts from statistics, machine/deep learning and responsible data science, useful techniques for network analysis and natural language processing, and practical applications of data science such as recommender systems or sentiment analysis.

Topics and features:

Provides numerous practical case studies using real-world data throughout the book
Supports understanding through hands-on experience of solving data science problems using Python
Describes concepts, techniques and tools for statistical analysis, machine learning, graph analysis, natural language processing, deep learning and responsible data science
Reviews a range of applications of data science, including recommender systems and sentiment analysis of text data
Provides supplementary code resources and data at an associated website

This practically focused textbook provides an ideal introduction to the field for upper-tier undergraduate and beginning graduate students from computer science, mathematics, statistics, and other technical disciplines. The work is also eminently suitable for professionals taking continuing-education short courses and for researchers following self-study courses.

Table of Contents

Frontmatter
Chapter 1. Introduction to Data Science
Abstract
Data science is a new interdisciplinary field that has attracted a lot of interest from the media in recent years, where it has been presented as a novelty. The first aim of this chapter is to describe the main factors behind this novelty: the datafication process and the democratization of analytical techniques. In the first case, we show that rendering different aspects of our lives into data, as individuals or as members of organizations, has produced rich descriptions of the world that open the door to the development of predictive models. In the second case, we stress the impact of “open source” culture on the analytical software community. As a result of this impact, analytical techniques are now mostly developed within an open domain, enabling their widespread use and the fast dissemination of results. This book has been designed as a small contribution to this democratization process by showing that anyone interested in this topic can become a junior data scientist in a few weeks.
Laura Igual, Santi Seguí
Chapter 2. Data Science Tools
Abstract
In this chapter, we first introduce some of the cornerstone tools that data scientists use. The toolbox of any data scientist, as for any kind of programmer, is an essential ingredient for success and enhanced performance. Choosing the right tools can save a lot of time and thereby allow us to focus on data analysis.
Eloi Puertas
Chapter 3. Descriptive Statistics
Abstract
Descriptive statistics helps to simplify large amounts of data in a sensible way. In contrast to inferential statistics, which will be introduced in a later chapter, in descriptive statistics we do not draw conclusions beyond the data we are analyzing; neither do we reach any conclusions regarding hypotheses we may make. We do not try to infer characteristics of the “population” (see below) of the data, but claim to present quantitative descriptions of it in a manageable form. It is simply a way to describe the data.
Laura Igual, Santi Seguí
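To give a flavor of what "quantitative descriptions in a manageable form" can look like, here is a minimal sketch using only Python's standard `statistics` module. The `describe` helper and the `ages` sample are hypothetical illustrations, not material taken from the book.

```python
import statistics


def describe(values):
    """Return a few summary statistics for a list of numbers."""
    ordered = sorted(values)
    return {
        "n": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),  # sample standard deviation
        "min": ordered[0],
        "max": ordered[-1],
    }


# Made-up sample, used only for illustration.
ages = [23, 35, 31, 48, 52, 29, 41]
summary = describe(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The summary describes only the values at hand; it makes no claim about any larger population.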
Chapter 4. Statistical Inference
Abstract
In the previous chapter we have seen how to describe a sample in order to produce potentially interesting hypotheses about its population. Some of the descriptions we have seen are based on graphical representations that are easily interpreted by humans, while others are based on parameters that summarize important properties of the sample distribution. In this chapter we see how to infer predictions about a population. To this end we explore the relationship between sample parameters and population parameters and we propose some methods, both theoretical and computational, to assess the quality of parameter estimates from a sample.
Laura Igual, Santi Seguí
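One common computational way to assess the quality of a parameter estimate is the percentile bootstrap. The sketch below is an assumption about the kind of method the abstract alludes to, not the book's own code; `bootstrap_mean_ci`, its parameters and the `sample` values are hypothetical.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Read the interval bounds off the sorted resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up measurements, used only for illustration.
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.9, 5.1, 4.4, 6.2, 5.0]
lower, upper = bootstrap_mean_ci(sample)
print(f"sample mean = {statistics.fmean(sample):.2f}")
print(f"95% bootstrap interval = ({lower:.2f}, {upper:.2f})")
```

The width of the interval gives a rough sense of how much the sample mean would vary across samples of the same size.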
Chapter 5. Supervised Learning
Abstract
In this chapter, we introduce the basics of classification: a type of supervised machine learning. We also give a brief practical tour of learning theory and good practices for successful use of classifiers in a real case using Python. The chapter starts by introducing the classic machine learning pipeline, defining features, and evaluating the performance of a classifier. After that, the notion of generalization error is needed, which allows us to show learning curves in terms of the number of examples and the complexity of the classifier, and also to define the notion of overfitting. That notion will then allow us to develop a strategy for model selection. Finally, two of the best-known techniques in machine learning are introduced: support vector machines and random forests. These are then applied to the proposed problem of predicting those loans that will not be successfully covered once they have been accepted.
Laura Igual, Santi Seguí
Chapter 6. Regression Analysis
Abstract
In this chapter, we introduce regression analysis and some of its applications in data science. Regression is related to how to make predictions about real-world quantities such as, for instance, the predictions alluded to in the following questions. How does sales volume change with changes in price? How is sales volume affected by the weather? How does the title of a book affect its sales? How does the amount of a drug absorbed vary with the patient’s body weight; and does this relationship depend on blood pressure? At what time should I go home to avoid traffic jams? What is the chance of rain on the next two Mondays; and what is the expected temperature?
Laura Igual, Santi Seguí
Chapter 7. Unsupervised Learning
Abstract
In this chapter, we address the problem of analyzing a set of inputs/data without labels with the goal of finding “interesting patterns” or structures in the data. This type of problem is sometimes called a knowledge discovery problem. Compared to other machine learning problems such as supervised learning, this is a much more open problem, since in general there is no well-defined metric to use, nor is there any specific kind of pattern that we wish to look for. Within unsupervised machine learning, the most common type of problem is clustering, though other problems such as novelty detection, dimensionality reduction and outlier detection are also part of this area. So here we discuss different clustering methods, compare their advantages and disadvantages, and discuss measures for evaluating their quality. The chapter finishes with a case study using a real data set that analyzes the expenditure of different countries on education.
Laura Igual, Santi Seguí
Chapter 8. Network Analysis
Abstract
Network data are currently generated and collected to an increasing extent from different fields. In this chapter, we show how network data analysis allows us to gain insight into the data that would be hard to acquire by other means. We introduce some tools in network analysis and visualization. We present important concepts such as connected components, centrality measures and egonetworks, as well as community detection. We use a Python toolbox (NetworkX) to build graphs easily and analyze them. We motivate concepts in network analysis by a real problem dealing with a Facebook network dataset and answering a set of questions. For instance: Which is the most representative member of the network in terms of the most “connected”, the most “circulated”, the “closest” or the most “accessible” to the rest of the members?
Laura Igual, Santi Seguí
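Two of the concepts named in the Chapter 8 abstract, connected components and a centrality measure, can be sketched in a few lines. The book uses NetworkX; the version below deliberately relies only on the standard library so that it runs anywhere, and the helper names and the tiny friendship graph are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def connected_components(graph):
    """Return the list of connected components, each as a set of nodes."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            component.add(node)
            for nbr in graph[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(component)
    return components


# Made-up friendship network, used only for illustration.
edges = [("ana", "ben"), ("ana", "cat"), ("ben", "cat"),
         ("cat", "dan"), ("eve", "fay")]
graph = build_graph(edges)
centrality = degree_centrality(graph)
components = connected_components(graph)
most_connected = max(centrality, key=centrality.get)
print(most_connected, centrality[most_connected], len(components))
```

Degree centrality answers the "most connected" question from the abstract; the other notions it lists (such as "closest") correspond to different centrality measures not shown here.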
Chapter 9. Recommender Systems
Abstract
In this chapter, we introduce what recommender systems are, how they work and how they can be implemented in Python. We also present the taxonomy of different types of recommender systems based on the information they use, as well as the output they produce. We provide some insights in order to see how recommender systems can deal with questions such as: Which movie should I rent? Which TV should I buy? Or: Which is the best place for me and my family to travel to? These are typical questions that companies like Netflix or Amazon address in their products. We also discuss how recommender systems should be evaluated. Finally, a practical case with the MovieLens dataset is presented.
Laura Igual, Santi Seguí
Chapter 10. Basics of Natural Language Processing
Abstract
In this chapter, we review the basics of Natural Language Processing and we apply them to the problem of sentiment analysis from text data. Generally, sentiment analysis is performed based on the processing of natural language, the analysis of text and computational linguistics. Although data can come from different data sources, in this chapter we analyze sentiment in text data, using two particular text data examples: one from film critics, where the text is highly structured and maintains text semantics; and another example coming from social networks, where the text can show a lack of structure and users may use text abbreviations. We review basic mechanisms required to perform sentiment analysis, including data cleaning, producing a general representation of the text, and performing some statistical inference on the text represented to determine positive and negative sentiments.
Laura Igual, Santi Seguí
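The three steps listed in the Chapter 10 abstract (cleaning, building a representation of the text, and statistical inference) can be illustrated with a bag-of-words Naive Bayes classifier. This is a minimal sketch under the assumption that such a model is a reasonable stand-in; it is not the book's implementation, and the function names and the four training sentences are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Cleaning step: lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z']+", text.lower())


def train(docs):
    """Representation step: per-label bag-of-words counts from (text, label) pairs."""
    word_counts, label_counts = {}, Counter()
    for text, label in docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, label_counts


def predict(text, word_counts, label_counts):
    """Inference step: pick the label with the highest Naive Bayes log-score."""
    vocab = set().union(*word_counts.values())
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(label_counts[label] / total_docs)  # log prior
        denom = sum(counts.values()) + len(vocab)  # add-one (Laplace) smoothing
        for token in tokenize(text):
            score += math.log((counts[token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training sentences, used only for illustration.
docs = [
    ("A wonderful, moving film", "pos"),
    ("Great acting and a great story", "pos"),
    ("Boring plot and terrible acting", "neg"),
    ("A dull, terrible waste of time", "neg"),
]
word_counts, label_counts = train(docs)
print(predict("great film", word_counts, label_counts))
print(predict("terrible and dull", word_counts, label_counts))
```

With real reviews or social-network posts, the cleaning step would typically need more care (abbreviations, emoticons, stop words) than this regular expression provides.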
Chapter 11. Deep Learning
Abstract
In this chapter, we introduce the main concepts of Neural Networks and Deep Learning. We review the mathematical foundations and implement the classical Multilayer Perceptron and Convolutional Neural Networks with Keras on top of TensorFlow. We show how to construct an image classification model, bridging theory with hands-on experience. Furthermore, to ensure effective model training and prevent overfitting, we introduce various techniques, such as regularization, dropout and data augmentation.
Laura Igual, Santi Seguí
Chapter 12. Responsible Data Science
Abstract
Data science has an increasing responsibility in society, which means it needs to consider more than just technical skills. Data scientists must recognize and embrace this responsibility, acknowledging its ethical, moral, and societal implications. Addressing these responsibilities ensures that data science is used for the benefit of society while preserving individual rights. Data science’s impact on privacy, autonomy, and well-being requires prioritizing personal data protection and respecting privacy rights. Ethical data handling, informed consent, and robust security measures are imperative to prevent unauthorized access and misuse of personal information. Upholding these principles fosters trust between individuals and the data-driven systems influencing their lives, ultimately guiding data science toward a socially responsible and ethically sound future.
Laura Igual, Santi Seguí
Backmatter
Metadata
Title
Introduction to Data Science
Written by
Laura Igual
Santi Seguí
Copyright year
2024
Electronic ISBN
978-3-031-48956-3
Print ISBN
978-3-031-48955-6
DOI
https://doi.org/10.1007/978-3-031-48956-3