
2017 | Original Paper | Book Chapter

5. Supervised Learning

Authors: Laura Igual, Santi Seguí

Published in: Introduction to Data Science

Publisher: Springer International Publishing


Abstract

In this chapter, we introduce the basics of classification, a type of supervised machine learning. We also give a brief practical tour of learning theory and good practices for the successful use of classifiers in a real case using Python. The chapter starts by introducing the classic machine learning pipeline, defining features, and evaluating the performance of a classifier. We then introduce the notion of generalization error, which allows us to show learning curves in terms of the number of examples and the complexity of the classifier, and to define the notion of overfitting. That notion in turn allows us to develop a strategy for model selection. Finally, two of the best-known techniques in machine learning are introduced: support vector machines and random forests. These are then applied to the proposed problem of predicting which loans will not be successfully covered once they have been accepted.


Footnotes
2
Several well-known techniques, such as support vector machines or adaptive boosting (AdaBoost), are originally defined for the binary case. Any binary classifier can be extended to the multiclass case in two different ways. We may change the formulation of the learning/optimization process; this requires deriving a new learning algorithm capable of handling the new modeling. Alternatively, we may adopt ensemble techniques. The idea behind this latter approach is to divide the multiclass problem into several binary problems, solve them, and then aggregate the results. Readers interested in these techniques should look into one-versus-all, one-versus-one, and error-correcting output codes methods.
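The ensemble route is available off the shelf in scikit-learn. A minimal sketch, assuming a synthetic three-class dataset and LinearSVC as the underlying binary classifier (both are illustrative choices, not the chapter's setup):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic 3-class problem standing in for a real multiclass dataset
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# One-versus-all: one binary classifier per class
ova = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
# One-versus-one: one binary classifier per pair of classes
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)

print(len(ova.estimators_))  # 3 binary problems (one per class)
print(len(ovo.estimators_))  # 3*(3-1)/2 = 3 pairwise binary problems
```

For three classes the two strategies happen to train the same number of binary classifiers; with k classes, one-versus-all trains k and one-versus-one trains k(k-1)/2.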
 
3
Many problems are described using categorical data. In these cases, either we need classifiers capable of coping with this kind of data, or we need to change the representation of those variables into numerical values.
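One common numerical re-representation is one-hot encoding, sketched here with pandas (the column name and values are hypothetical):

```python
import pandas as pd

# Hypothetical categorical feature, e.g. the purpose of a loan
df = pd.DataFrame({"purpose": ["car", "house", "car", "education"]})

# One-hot encoding: one binary column per category
encoded = pd.get_dummies(df, columns=["purpose"])
print(encoded.columns.tolist())
# ['purpose_car', 'purpose_education', 'purpose_house']
```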
 
4
The notebook companion shows the preprocessing steps, from reading the dataset, cleaning and imputing data, up to saving a subsampled clean version of the original dataset.
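A hedged sketch of such a preprocessing flow with pandas; the column names, values, and file name below are hypothetical stand-ins, and the companion notebook's actual steps may differ:

```python
import numpy as np
import pandas as pd

# Hypothetical loans data with a missing value
df = pd.DataFrame({"amount": [1000, np.nan, 5000, 2500],
                   "status": ["paid", "paid", "default", "paid"]})

df = df.dropna(subset=["status"])                        # drop rows with no label
df["amount"] = df["amount"].fillna(df["amount"].mean())  # impute missing values

sample = df.sample(frac=0.5, random_state=0)             # subsample a working set
sample.to_csv("loans_clean.csv", index=False)            # save the clean version
```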
 
5
The term unbalanced describes data where the ratio between positives and negatives is small. In these scenarios, always predicting the majority class usually yields high accuracy, though it is not very informative. This kind of problem is very common when we want to model unusual events such as rare diseases, the occurrence of a failure in machinery, fraudulent credit card operations, etc. In these scenarios, gathering data from usual events is very easy, but collecting data from unusual events is difficult and results in a comparatively small dataset.
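The accuracy trap can be seen directly with a trivial majority-class predictor; the 95/5 split below is an illustrative assumption:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# 95 negatives, 5 positives: a heavily unbalanced toy dataset
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant for this baseline

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))              # 0.95: looks good...
print(f1_score(y, pred, zero_division=0))   # 0.0: ...but finds no positives
```

High accuracy while the F1 score is zero is exactly the uninformative behavior the footnote warns about.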
 
6
sklearn allows us to easily automate the train/test splitting using the function train_test_split(...).
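A minimal usage sketch on toy data (current scikit-learn exposes the function in `sklearn.model_selection`; older releases contemporary with the book placed it elsewhere):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy examples, 2 features each
y = np.arange(10) % 2             # toy binary labels

# Hold out 30% of the examples as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

For unbalanced data, passing `stratify=y` preserves the class ratio in both splits.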
 
7
The reader should note that there are several bounds in machine learning that characterize the generalization error. Most of them come from variations of Hoeffding’s inequality.
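One standard form of such a bound, for a finite hypothesis class of size $M$ and $N$ training examples: with probability at least $1-\delta$, every hypothesis $h$ in the class satisfies
$$E_{\text {out}}(h)\le E_{\text {in}}(h)+\sqrt{\frac{1}{2N}\ln \frac{2M}{\delta }},$$
where $E_{\text {in}}$ and $E_{\text {out}}$ denote the training and generalization errors. It follows from applying Hoeffding’s inequality to each hypothesis and taking a union bound over the class.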
 
8
This set cannot be used to select a classifier, model or hyperparameter; nor can it be used in any decision process.
 
9
This reduction in the complexity of the best model should not surprise us. Remember that complexity and the number of examples are intimately related for the learning to succeed. By using a test set we perform model selection with a smaller dataset than in the former case.
 
10
These techniques have been shown to be two of the most powerful families for classification [1].
 
11
Remember the regularization cure for overfitting.
 
12
Note the strict inequalities in the formulation. Informally, we can consider the smallest satisfied constraint, and observe that the rest must be satisfied with a larger value. Thus, we can arbitrarily set that value to 1 and rewrite the problem as
$$a^Ts_i+b\ge 1\; \text {and}\; a^Tr_i+b\le -1.$$
 
13
It is worth mentioning that another useful tool for visualizing the trade-off between true positives and false positives, in order to choose the operating point of the classifier, is the receiver operating characteristic (ROC) curve. This curve plots the true positive rate (also called sensitivity or recall, TP/(TP+FN)) against the false positive rate (FP/(FP+TN)).
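A minimal sketch of computing the curve with scikit-learn, on four hand-picked scores (toy values, not data from the chapter):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])           # toy ground-truth labels
scores = np.array([0.1, 0.4, 0.35, 0.8])  # toy classifier scores

# Each threshold on the scores yields one (FPR, TPR) operating point
fpr, tpr, thresholds = roc_curve(y_true, scores)

print(roc_auc_score(y_true, scores))  # 0.75: area under the ROC curve
```

The area under this curve (AUC) summarizes the trade-off in a single threshold-independent number.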
 
References
Metadata
Title
Supervised Learning
Authors
Laura Igual
Santi Seguí
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-50017-1_5