2017 | OriginalPaper | Chapter

5. Supervised Learning

Authors : Laura Igual, Santi Seguí

Published in: Introduction to Data Science

Publisher: Springer International Publishing

Abstract

In this chapter, we introduce the basics of classification, a type of supervised machine learning. We also give a brief practical tour of learning theory and of good practices for the successful use of classifiers in a real case using Python. The chapter starts by introducing the classic machine learning pipeline, defining features, and evaluating the performance of a classifier. We then introduce the notion of generalization error, which allows us to show learning curves in terms of the number of examples and the complexity of the classifier, and also to define overfitting. That notion in turn allows us to develop a strategy for model selection. Finally, two of the best-known techniques in machine learning are introduced: support vector machines and random forests. These are then applied to the proposed problem of predicting those loans that will not be successfully covered once they have been accepted.

https://www.lendingclub.com/info/download-data.action.

Several well-known techniques, such as support vector machines and adaptive boosting (AdaBoost), were originally defined for the binary case. Any binary classifier can be extended to the multiclass case in two different ways. We may change the formulation of the learning/optimization process, which requires deriving a new learning algorithm capable of handling the new model. Alternatively, we may adopt ensemble techniques: divide the multiclass problem into several binary problems, solve them, and then aggregate the results. Readers interested in these techniques can look up one-versus-all, one-versus-one, and error-correcting output codes methods.
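The decomposition approach can be sketched with the wrappers in `sklearn.multiclass`. This is a minimal illustration, not the chapter's own code: the synthetic four-class dataset and the choice of `LinearSVC` as the binary learner are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic 4-class problem (assumed data, for illustration only).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=4, random_state=0
)

# One-versus-all: one binary problem per class (that class against the rest).
ova = OneVsRestClassifier(LinearSVC(max_iter=10000, random_state=0)).fit(X, y)

# One-versus-one: one binary problem per pair of classes.
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000, random_state=0)).fit(X, y)

print(len(ova.estimators_))  # number of binary classifiers: one per class
print(len(ovo.estimators_))  # number of binary classifiers: one per class pair
```

With K classes, one-versus-all trains K binary classifiers and one-versus-one trains K(K-1)/2, so the second strategy grows faster but each of its subproblems uses only the examples of two classes.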

Many problems are described using categorical data. In these cases, we need either classifiers capable of coping with this kind of data or a change of representation that turns those variables into numerical values.
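One common way to change the representation is one-hot encoding, sketched here with `pandas.get_dummies`. The toy table and its column names are hypothetical and are not taken from the chapter's dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "annual_income": [40000, 85000, 62000, 30000],
})

# One-hot encoding: each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["home_ownership"], dtype=int)
print(encoded.columns.tolist())
```

One-hot encoding avoids imposing an artificial order on the categories, at the cost of one extra column per category.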

The companion notebook shows the preprocessing steps, from reading the dataset through cleaning and imputing data to saving a subsampled, clean version of the original dataset.

The term unbalanced describes data in which the ratio of positives to negatives is small. In these scenarios, always predicting the majority class usually yields high accuracy, though the prediction is not informative. This kind of problem is very common when we want to model unusual events such as rare diseases, machinery failures, or fraudulent credit card operations. In these scenarios, gathering data from usual events is very easy, but collecting data from unusual events is difficult and results in a comparatively small dataset.
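The effect of always predicting the majority class can be checked with a small sketch. The 5-to-95 class ratio is an assumed value chosen for the example, and `DummyClassifier` stands in for a classifier that ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 5 positives among 100 examples (assumed ratio).
y = np.array([1] * 5 + [0] * 95)
X = np.zeros((100, 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)  # fraction of correct predictions
rec = recall_score(y, y_pred)    # fraction of positives that were detected
print(acc, rec)
```

The accuracy is high only because negatives dominate; the recall on the positive class shows that no unusual event is ever detected.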

sklearn allows us to easily automate the train/test splitting using the function train_test_split(...).
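A minimal usage sketch follows. The toy arrays, the 30% test fraction, and the use of `stratify` are assumptions made for the example rather than the chapter's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 examples with 2 features, 20% positives (assumed values).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Hold out 30% of the examples for testing. stratify=y keeps the class
# proportions the same in both parts; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)
```

Stratifying is optional, but it is a reasonable default when one class is rare, since a purely random split could leave very few positives in the test part.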

The reader should note that there are several bounds in machine learning to characterize the generalization error. Most of them come from variations of Hoeffding’s inequality.
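For reference, one standard form of Hoeffding's inequality in this setting is given below. The notation is assumed here and is not taken from the chapter: $E_{\text{test}}$ is the error of a single fixed classifier measured on $N$ independent examples, $E_{\text{out}}$ is its true generalization error, and $\epsilon > 0$ is the tolerated deviation.

$$P\left( \left| E_{\text{test}} - E_{\text{out}} \right| > \epsilon \right) \le 2e^{-2\epsilon ^2 N}.$$

The bound tightens exponentially with $N$, which is why a larger held-out set gives a more reliable estimate of the generalization error.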

This set cannot be used to select a classifier, model or hyperparameter; nor can it be used in any decision process.

This reduction in the complexity of the best model should not surprise us. Remember that complexity and the number of examples are intimately related for the learning to succeed. By using a test set we perform model selection with a smaller dataset than in the former case.

These techniques have been shown to be two of the most powerful families for classification [1].

Remember the regularization cure for overfitting.

Note the strict inequalities in the formulation. Informally, we can consider the smallest satisfied constraint, and observe that the rest must be satisfied with a larger value. Thus, we can arbitrarily set that value to 1 and rewrite the problem as

$$a^Ts_i+b\ge 1\; \text {and}\; a^Tr_i+b\le -1.$$
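The normalization step can be verified numerically. In this sketch the points $s_i$ (positives) and $r_i$ (negatives) and the separating pair $(a, b)$ are hypothetical values chosen for illustration; dividing $(a, b)$ by the smallest constraint value leaves the hyperplane unchanged and makes that value equal to 1.

```python
import numpy as np

# Hypothetical 2-D points: s holds the positive examples, r the negative ones.
s = np.array([[2.0, 2.0], [3.0, 3.0]])
r = np.array([[-1.0, -1.0], [-3.0, -1.0]])

# A pair (a, b) that separates the classes strictly:
# a^T s_i + b > 0 for all positives and a^T r_i + b < 0 for all negatives.
a = np.array([0.5, 0.5])
b = -0.5

# Smallest absolute value attained over all the constraints.
m = min((s @ a + b).min(), (-(r @ a + b)).min())

# Rescaling by m keeps the same hyperplane and normalizes that value to 1.
a_n, b_n = a / m, b / m

print(s @ a_n + b_n)  # every entry is at least 1
print(r @ a_n + b_n)  # every entry is at most -1
```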

It is worth mentioning that another useful tool for visualizing the trade-off between true positives and false positives in order to choose the operating point of the classifier is the receiver-operating characteristic (ROC) curve. This curve plots the true positive rate/sensitivity/recall (TP/(TP+FN)) with respect to the false positive rate (FP/(FP+TN)).
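The points of the curve can be computed with `sklearn.metrics.roc_curve`. The labels and scores below are toy values assumed for the example; in practice the scores would come from a fitted classifier, for instance through `decision_function` or `predict_proba`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy ground truth and classifier scores (assumed values).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold on the score yields one (false positive rate,
# true positive rate) pair, that is, one candidate operating point.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under the ROC curve: a single-number summary of the trade-off.
auc = roc_auc_score(y_true, scores)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC={auc:.2f}")
```

Choosing the operating point then amounts to picking the threshold whose pair of rates best matches the relative cost of false positives and false negatives.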

[1] M. Fernández-Delgado, E. Cernadas, S. Barro, D. Amorim, Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research 15, 3133 (2014). http://jmlr.org/papers/v15/delgado14a.html

Title: Supervised Learning
Authors: Laura Igual, Santi Seguí
Publisher: Springer International Publishing
Book: Introduction to Data Science
Print ISBN: 978-3-319-50016-4

Electronic ISBN: 978-3-319-50017-1

Copyright Year: 2017
DOI: https://doi.org/10.1007/978-3-319-50017-1_5
