
2024 | Book

Online Machine Learning

A Practical Guide with Examples in Python


About this Book

This book deals with the exciting, seminal topic of Online Machine Learning (OML). The content is divided into three parts: the first part looks in detail at the theoretical foundations of OML, comparing it to Batch Machine Learning (BML) and discussing what criteria should be developed for a meaningful comparison. The second part provides practical considerations, and the third part substantiates them with concrete practical applications.

The book is equally suitable as a reference manual for experts working with OML, as a textbook for beginners approaching the topic, and as a scientific publication for researchers, since it reflects the latest state of research. It can also serve as quasi-consulting on OML: decision-makers and practitioners can use its explanations to tailor OML to their needs, apply it to their own use cases, and assess whether the benefits of OML outweigh the costs.

OML will soon become established in practice; it is worthwhile to get involved with it now. This book already presents some tools that will facilitate the practice of OML in the future. A breakthrough can be expected because practice shows that, given the large amounts of data that accumulate, conventional BML is no longer sufficient. OML is the solution for evaluating and processing data streams in real time and delivering results that are relevant for practice.

Table of Contents

Frontmatter
Chapter 1. Introduction: From Batch to Online Machine Learning
Abstract
Batch Machine Learning (BML), which is also referred to as “offline machine learning”, reaches its limits when dealing with very large amounts of data. This is especially true for available memory, handling drift in data streams, and processing new, unknown data. Online Machine Learning (OML) is an alternative to BML that overcomes the limitations of BML. In this chapter, the basic terms and concepts of OML are introduced and the differences to BML are shown.
Thomas Bartz-Beielstein
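The core difference the chapter introduces, learning from one observation at a time instead of from a stored batch, can be sketched in a few lines of plain Python. This is an illustrative toy (not code from the book or the River package; the class name, the `learn_one`/`predict_one` method names, and the learning rate are assumptions chosen for the sketch):

```python
class OnlineLinearRegression:
    """Learns y ≈ w·x + b one observation at a time (stochastic gradient descent)."""

    def __init__(self, n_features, lr=0.05):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_one(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def learn_one(self, x, y):
        # One gradient step on the squared error of this single observation;
        # only the current weights are kept in memory, never the data stream.
        error = self.predict_one(x) - y
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error

# Process a stream generated by y = 2*x + 1, one observation at a time.
model = OnlineLinearRegression(n_features=1)
for t in range(1000):
    x = [(t % 10) / 10]
    model.learn_one(x, 2 * x[0] + 1)
```

In contrast to a batch learner, memory use is constant in the stream length, which is exactly the property that lets OML handle data sets that no longer fit into memory.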
Chapter 2. Supervised Learning: Classification and Regression
Abstract
This chapter provides an overview and evaluation of Online Machine Learning (OML) methods and algorithms, with a special focus on supervised learning. First, methods from the areas of classification (Sect. 2.1) and regression (Sect. 2.2) are presented. Then, ensemble methods are described in Sect. 2.3. Clustering methods are briefly mentioned in Sect. 2.4. An overview is given in Sect. 2.5.
Thomas Bartz-Beielstein
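The supervised-classification setting of this chapter can be illustrated with a minimal hand-rolled streaming classifier (a sketch only, not one of the algorithms presented in the chapter; names such as `predict_proba_one` merely imitate the online-learning style):

```python
import math

class OnlineLogisticRegression:
    """Binary classifier updated one observation at a time (SGD on the log loss)."""

    def __init__(self, n_features, lr=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba_one(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # Gradient of the log loss for a single labeled observation.
        error = self.predict_proba_one(x) - y
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error

# Stream: the label is 1 whenever the feature exceeds 0.5.
clf = OnlineLogisticRegression(n_features=1)
for t in range(2000):
    x = [(t % 100) / 100]
    clf.learn_one(x, 1 if x[0] > 0.5 else 0)
```

The same test-then-train loop structure carries over unchanged to the regression and ensemble methods surveyed in Sects. 2.2 and 2.3.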
Chapter 3. Drift Detection and Handling
Abstract
Structural changes (“drift”) in the data cause problems for many algorithms. Based on the drift definitions given in Chap. 1, methods for drift detection and handling are discussed. For the algorithms presented in Chap. 2, it is clarified to what extent they react to concept drift. The extent to which catastrophic forgetting is an issue is described in Sect. 4.3. Section 3.1 describes three architectures for implementing drift detection algorithms. Basic properties of window-based approaches are presented in Sect. 3.2. Section 3.3 presents commonly used drift detection techniques. Section 3.4 describes how the drift detection techniques introduced in Sect. 3.3 are used in Online Machine Learning (OML) algorithms and summarizes the tree-based OML techniques implemented in the River package. Section 3.5 introduces scaling methods for handling drift.
Thomas Bartz-Beielstein, Lukas Hans
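The window-based idea of Sect. 3.2 can be sketched in a few lines: keep a reference window of early values and a sliding window of recent values, and flag drift when their means diverge. This toy detector is an illustration only (class name, window size, and threshold are assumptions; it is not ADWIN or any of the detectors from the chapter):

```python
from collections import deque

class WindowDriftDetector:
    """Toy window-based detector: flags drift when the mean of the most
    recent window deviates strongly from the mean of a reference window."""

    def __init__(self, window_size=50, threshold=1.0):
        self.reference = deque(maxlen=window_size)
        self.current = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, value):
        # Fill the reference window first, then compare against it.
        if len(self.reference) < self.reference.maxlen:
            self.reference.append(value)
            return False
        self.current.append(value)
        if len(self.current) < self.current.maxlen:
            return False
        mean_ref = sum(self.reference) / len(self.reference)
        mean_cur = sum(self.current) / len(self.current)
        return abs(mean_cur - mean_ref) > self.threshold

# A stream whose mean jumps from 0 to 5 halfway through (abrupt drift).
detector = WindowDriftDetector(window_size=50, threshold=1.0)
drift_at = None
for i in range(400):
    value = 0.0 if i < 200 else 5.0
    if detector.update(value) and drift_at is None:
        drift_at = i
```

Real detectors differ mainly in how they size the windows and how they test for a significant change, but the update-per-observation structure is the same.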
Chapter 4. Initial Selection and Subsequent Updating of OML Models
Abstract
In Sect. 4.1, we describe a current best-practice methodology for the initial model selection of Online Machine Learning (OML) models, taking into account that the model is continuously updated. In Sect. 4.2, we discuss possibilities for removing or changing observations/instances that have already been added to the model. We describe how completely new features can be added to the model afterwards. In addition, we show how to ensure that the model quality remains adequate after a model update. Catastrophic forgetting (catastrophic interference) is considered in Sect. 4.3 in the OML context: the continuous updating of OML models carries the risk that learning fails because correctly learned older relationships are erroneously forgotten (“unlearned”).
Thomas Bartz-Beielstein
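The tension described above, updating on new data while not discarding still-valid older knowledge, can be made concrete with the simplest possible updated statistic: an exponentially weighted mean. The sketch below is an illustration under assumed names (`alpha` controls how aggressively old observations are forgotten); it is not a method from the chapter:

```python
class ExponentiallyWeightedMean:
    """Running mean with exponential forgetting: each update down-weights
    older observations, so the estimate tracks change quickly but also
    forgets older, possibly still-valid information."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = x
        else:
            # Convex combination: new observation vs. accumulated past.
            self.value = (1 - self.alpha) * self.value + self.alpha * x
        return self.value

ewm = ExponentiallyWeightedMean(alpha=0.2)
for x in [10.0] * 30:   # old regime
    ewm.update(x)
for x in [0.0] * 30:    # new regime: the old level is rapidly forgotten
    ewm.update(x)
```

Choosing `alpha` too large is a miniature version of catastrophic forgetting: correct older structure is overwritten by the most recent observations.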
Chapter 5. Evaluation and Performance Measurement
Abstract
This chapter discusses aspects to be considered when evaluating Online Machine Learning (OML) algorithms, especially when comparing them to Batch Machine Learning (BML) algorithms. The following considerations play an important role:
1. How are training and test data selected?
2. How can performance be measured?
3. What procedures are available for generating benchmark data sets?
Section 5.1 describes the selection of training and test data. Section 5.2 presents an implementation in Python for selecting training and test data. Section 5.3 describes the calculation of performance. Section 5.4 introduces the generation of benchmark data sets in the field of OML.
Thomas Bartz-Beielstein
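The standard way to answer the first two questions jointly for data streams is prequential ("test-then-train") evaluation: each observation is first used for prediction, then for learning, so no separate hold-out set is needed. The following is a minimal self-contained sketch (function and class names are illustrative assumptions, not the implementation from Sect. 5.2):

```python
def prequential_mae(model, stream):
    """Test-then-train evaluation: predict on each observation before
    learning from it. Returns the mean absolute error over the stream."""
    total_error, n = 0.0, 0
    for x, y in stream:
        total_error += abs(model.predict_one(x) - y)  # test first ...
        model.learn_one(x, y)                          # ... then train
        n += 1
    return total_error / n if n else 0.0

class RunningMeanModel:
    """Trivial baseline: always predicts the running mean of past targets."""
    def __init__(self):
        self.total, self.n = 0.0, 0
    def predict_one(self, x):
        return self.total / self.n if self.n else 0.0
    def learn_one(self, x, y):
        self.total += y
        self.n += 1

stream = [([i], 3.0) for i in range(100)]  # constant target of 3.0
mae = prequential_mae(RunningMeanModel(), stream)
```

Because every observation is scored before the model has seen it, the resulting error curve reflects how the model would actually have performed in deployment, which is also what makes fair OML-versus-BML comparisons delicate.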
Chapter 6. Special Requirements for Online Machine Learning Methods
Abstract
This chapter investigates whether Online Machine Learning (OML) algorithms require special steps and considerations compared to batch learning with respect to typical practice challenges such as missing data (Sect. 6.1), categorical attributes (Sect. 6.2), outliers (Sect. 6.3), imbalanced data (Sect. 6.4), or an extremely large number of variables (Sect. 6.5). Section 6.6 describes important aspects such as fairness (Fair Machine Learning (ML)) or interpretability (Interpretable ML) in the context of OML algorithms.
Thomas Bartz-Beielstein
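As one concrete instance of the challenges listed above, missing values in a stream cannot be imputed from the full column as in batch learning; the imputation statistic itself must be maintained incrementally. A minimal sketch (names and the `None`-for-missing convention are assumptions for illustration):

```python
class RunningMeanImputer:
    """Streaming imputation: a missing value (None) is replaced by the
    running mean of the values seen so far for that feature."""

    def __init__(self):
        self.totals = {}
        self.counts = {}

    def transform_one(self, x):
        filled = {}
        for feature, value in x.items():
            if value is None:
                count = self.counts.get(feature, 0)
                filled[feature] = self.totals.get(feature, 0.0) / count if count else 0.0
            else:
                # Update the running statistic only on observed values.
                self.totals[feature] = self.totals.get(feature, 0.0) + value
                self.counts[feature] = self.counts.get(feature, 0) + 1
                filled[feature] = value
        return filled

imputer = RunningMeanImputer()
imputer.transform_one({"temp": 10.0})
imputer.transform_one({"temp": 20.0})
out = imputer.transform_one({"temp": None})  # imputed from the stream so far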
Chapter 7. Practical Applications of Online Machine Learning
Abstract
This chapter addresses prerequisites, challenges, and potentials of applying Online Machine Learning (OML) methods in practice. These aspects are illustrated by means of domain-specific examples from different application fields. The first surveyed application field is official statistics (Sect. 7.1). Section 7.1.1 shows that OML offers forward-looking potential for official statistics, but presently also comes with many challenges. In particular, compliance with quality assurance procedures (Sect. 7.1.2) and integration into existing process architectures (Sect. 7.1.3) prove to be major challenges. A survey of Machine Learning (ML) usage in German and other international statistical institutions shows that OML is currently still a niche topic in official statistics (Sect. 7.1.4). However, there are also domains closely linked to official statistics that either already feature OML applications or show promising potential for OML usage (Sect. 7.1.5). The second surveyed application field is the process of hot rolling in the steel industry (Sect. 7.2). In general, the process quality of hot rolling (Sect. 7.2.1) benefits from ML predictions (Sect. 7.2.2). However, because it is susceptible to drift, the complex hot rolling process cannot be adequately described without continuously updated models (Sect. 7.2.3). These characteristics make industrial hot rolling a suitable use case for the application of OML (Sect. 7.2.4). General aspects important for using OML in practice are briefly summarized in Sect. 7.3, including reflections on model deployment (Sect. 7.3.1) and considerations regarding differences in required labor hours in comparison to Batch Machine Learning (BML) (Sect. 7.3.2).
Steffen Moritz, Florian Dumpert, Christian Jung, Thomas Bartz-Beielstein, Eva Bartz
Chapter 8. Open-Source Software for Online Machine Learning
Abstract
In contrast to Batch Machine Learning (BML), there are only a few open-source software packages for Online Machine Learning (OML). This chapter describes the availability of open-source software packages (especially in R/Python) that provide OML methods and algorithms to solve tasks such as regression, classification, clustering, or outlier detection. Section 8.1 gives an overview of the software, followed by a description of the corresponding packages. Then, Sect. 8.2 provides a comparative overview of the scope of the individual software packages. The chapter concludes with a comparison of the most important programming languages in the field of Machine Learning (ML) (Sect. 8.3).
Thomas Bartz-Beielstein
Chapter 9. An Experimental Comparison of Batch and Online Machine Learning Algorithms
Abstract
This chapter presents the results of the experimental analyses. The first study (Sect. 9.1) examines the use of Batch Machine Learning (BML) and Online Machine Learning (OML) models for predicting the demand for bicycles at a bike-sharing station. The second study (Sect. 9.2) investigates the use of BML and OML models for prediction when very large data sets are available and drift is present. The synthetic Friedman-drift data set (see Definition 1.8) is used for this purpose. All data sets were standardized using the StandardScaler method so that the models were trained on data with mean zero and standard deviation one. In Sect. 9.3, we conducted a comprehensive investigation to evaluate the efficacy of scaling techniques in the context of drifting events. Our primary hypothesis centered on the potential benefits of scaling in handling dynamic data streams. Through rigorous experimentation and analysis, we compared various scaling methods to determine if one specific approach outperforms others in adapting to evolving data distributions.
Thomas Bartz-Beielstein, Lukas Hans
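Standardizing a stream, as mentioned above, cannot use the full data set's mean and standard deviation; both must be maintained incrementally. A common way to do this is Welford's algorithm for running mean and variance. The sketch below is an illustration of that idea (class and method names are assumptions, not the scaler used in the studies):

```python
import math

class OnlineStandardScaler:
    """Scales a stream toward zero mean / unit variance using running
    statistics (Welford's algorithm), updated one value at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def learn_one(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform_one(self, x):
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))  # sample standard deviation
        return (x - self.mean) / std if std > 0 else 0.0

scaler = OnlineStandardScaler()
for value in [2.0, 4.0, 6.0, 8.0]:
    scaler.learn_one(value)
```

Under drift, the running statistics lag behind the true distribution, which is precisely why Sect. 9.3 compares different scaling strategies for drifting streams.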
Chapter 10. Hyperparameter Tuning
Abstract
The Online Machine Learning (OML) methods presented in the previous chapters require the specification of many hyperparameters. For example, a variety of “splitters” are available for Hoeffding trees to generate subtrees. There are different methods for limiting the tree size in order to keep the time and memory requirements within reasonable limits. In addition, there are many other parameters, so that a manual search for the optimal hyperparameter setting is very time-consuming and doomed to fail due to the complexity of the possible combinations. Therefore, this chapter explains how an automatic optimization (or “tuning”) of the hyperparameters can be performed. In addition to the optimization of the OML procedure, Hyperparameter Tuning (HPT) performed with the Sequential Parameter Optimization Toolbox (SPOT) is also important for the explainability and interpretability of OML procedures and can lead to a more efficient and thus resource-saving algorithm (“Green IT”).
Thomas Bartz-Beielstein
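To make the tuning task concrete, the sketch below shows the simplest automatic alternative to a manual search: random search over a hyperparameter space. It is a generic stand-in for illustration only, not the model-based SPOT procedure the chapter actually uses (all names, the parameter space, and the objective are assumptions):

```python
import random

def random_search(objective, param_space, n_trials=200, seed=42):
    """Toy random search: sample hyperparameter settings and keep the one
    with the lowest score. A stand-in for model-based tuning (e.g. SPOT),
    not an implementation of it."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical objective: error is minimized at lr=0.1 and window=100.
def objective(params):
    return abs(params["lr"] - 0.1) + abs(params["window"] - 100) / 100

space = {"lr": [0.001, 0.01, 0.1, 1.0], "window": [10, 50, 100, 500]}
best, score = random_search(objective, space)
```

Model-based tuners such as SPOT improve on this by fitting a surrogate model of the objective, so that far fewer evaluations of the (expensive) OML training run are needed.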
Chapter 11. Summary and Outlook
Abstract
This chapter presents an assessment of the potential of Online Machine Learning (OML) for practitioners. The results of the studies are summarized and discussed and concrete recommendations for OML practice are given. The importance of a suitable comparison methodology for Batch Machine Learning (BML) and OML methods is highlighted to avoid “comparing apples to oranges”. We also point out the great potential of OML that is available through the development of the open-source software River.
Thomas Bartz-Beielstein, Eva Bartz
Backmatter
Metadata
Title
Online Machine Learning
edited by
Eva Bartz
Thomas Bartz-Beielstein
Copyright Year
2024
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-9970-07-0
Print ISBN
978-981-9970-06-3
DOI
https://doi.org/10.1007/978-981-99-7007-0
