
2018 | OriginalPaper | Chapter

3. Machine Learning Basics

Author: Sandro Skansi

Published in: Introduction to Deep Learning

Publisher: Springer International Publishing

Abstract

This chapter explores the fundamentals of machine learning, since deep learning is, above everything else, a technique for machine learning. We explore the idea of classification and what it means for a classifier to classify data, and proceed to evaluating the performance of a general classifier. The first actual classifier we present is naive Bayes (which includes a general discussion on data encoding and normalization), and we also present the simplest neural network, logistic regression, which is the bread and butter of deep learning. We introduce the classic MNIST dataset of handwritten digits, the so-called ‘fruit fly of machine learning’. We also present two showcase techniques of unsupervised learning: K-means, to explain clustering and the general principle of learning without labels, and principal component analysis (PCA), to explain how to learn representations. PCA is also explored in more detail later on. We conclude with a brief exposition on how to represent language for learning with bag of words.


Footnotes
1
You may wonder how a side gets a label. This procedure differs across machine learning algorithms and has a number of peculiarities, but for now you may just think that the side will get the label which the majority of datapoints on that side have. This will usually be true, but it is not an elegant definition. One case where this is not true is when you have only one dog with two cats overlapping it (in 2D space) and four other cats. Most classifiers will place the dog and the two overlapping cats in the category ‘dog’. Cases like this are rare, but they may be quite meaningful.
 
2
A dataset is simply a set of datapoints, some labelled, some unlabelled.
 
3
Noise is just a name for the random oscillations present in the data. These imperfections simply happen, and we do not want to learn to predict the noise, but rather the elements that are actually relevant to our task.
 
4
It does not have to be a perfect separation; a good separation will do.
 
5
Think about how one-hot encoding can boost the understanding of n-dimensional space.
 
6
Deep learning is no exception.
 
7
Notice that to do one-hot encoding, we need to make two passes over the data: the first pass collects the names of the new columns, then we create the columns, and then we make another pass over the data to fill them.
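A minimal Python sketch of this two-pass procedure (illustrative code, not taken from the chapter; the feature names and values are made up):

    # Two-pass one-hot encoding of a categorical column (illustrative sketch).
    rows = [
        {"height": 54, "length": 17, "colour": "brown"},
        {"height": 47, "length": 15, "colour": "white"},
        {"height": 52, "length": 16, "colour": "brown"},
    ]

    # Pass 1: collect the names of the new columns.
    values = sorted({row["colour"] for row in rows})   # ['brown', 'white']

    # Pass 2: create the new 0/1 columns and fill them, dropping the original one.
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != "colour"}
        for v in values:
            new_row["colour_" + v] = 1 if row["colour"] == v else 0
        encoded.append(new_row)

    print(encoded[0])   # {'height': 54, 'length': 17, 'colour_brown': 1, 'colour_white': 0}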
 
8
Strictly speaking, these vectors would not look exactly the same: the training sample would be (54,17,1,0,0, Dog), which is a row vector of length 6, and the row vector for which we want to predict the label would have to be of length 5 (without the last component which is the label), e.g. (47,15,0,0,1).
 
9
If we need more precision, we will keep more decimals, but in this book we will usually round off to four.
 
10
It is mostly a matter of choice; there is no objective way of determining how much to split.
 
11
The prior probability is just a matter of counting. If you have a dataset with 20 datapoints and in some feature there are five values of ‘New Vegas’ while the others (15 of them) are ‘Core region’, the prior probability is \(\mathbb{P}(\text{New Vegas}) = 0.25\).
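A quick illustrative Python sketch of this counting (the variable names are made up):

    from collections import Counter

    # The 20 datapoints from the example above: 5 are 'New Vegas', 15 are 'Core region'.
    region = ["New Vegas"] * 5 + ["Core region"] * 15

    counts = Counter(region)
    prior_new_vegas = counts["New Vegas"] / len(region)
    print(prior_new_vegas)   # 0.25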
 
12
If we were to have n features, this would be an n-dimensional row vector such as \((x_1,x_2,\ldots , x_n)\), but now we have only one feature so we have a 1D row vector of the form \((x_1)\). A 1D vector is exactly the same as the scalar \(x_1\) but we keep referring to it as a vector to delineate that in the general case it would be an n-dimensional vector.
 
13
That is, the assumption that features are conditionally independent given the target.
 
14
Regression problems can be simulated with classification. For example, if we had to find the proper value between 0 and 1, rounded to two decimals, we could treat it as a 100-class classification problem. The opposite also holds, and we have actually seen this in the naive Bayes section, where we had to pick a threshold over which we would consider it a 1 and below which it would be a 0.
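A small illustrative sketch of this mapping in Python (the helper names are hypothetical, not from the chapter):

    # Turn a bounded regression target (a value in [0, 1] rounded to two decimals)
    # into an integer class label, and back.
    def value_to_class(y):
        return int(round(y * 100))

    def class_to_value(c):
        return c / 100

    print(value_to_class(0.4172))   # 42
    print(class_to_value(42))       # 0.42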
 
15
Afterwards, we may do a bit of feature engineering and use an altogether different model. This is important when we do not have an understanding of the data we use, which is often the case in industry.
 
16
We will see later that logistic regression has more than one neuron, since each component of the input vector will have to have an input neuron, but it has ‘one’ neuron in the sense of having a single ‘workhorse’ neuron.
 
17
If the training set consists of n-dimensional row vectors, then there are exactly \(n-1\) features—the last one is the target or label.
 
18
Mathematically, the bias is useful for making an offset, called the intercept.
 
19
There are other error functions that can be used, but the SSE is one of the simplest.
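For reference, the standard definition of the SSE over \(N\) training examples with targets \(y_i\) and predictions \(\hat{y}_i\) (standard notation, not necessarily the chapter's) is \(SSE = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2\).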
 
20
Recall that this is not the same as a \(3\times 5\) matrix.
 
21
In the older literature, this is sometimes called the activation function.
 
24
The interested reader may look up the details in Chap. 4 of [10].
 
25
But PCA itself is not that simple to understand.
 
26
K-means (also called the Lloyd-Forgy algorithm) was first proposed independently by S. P. Lloyd in [16] and E. W. Forgy in [17].
 
27
Usually a predefined number of times; there are other tactics as well.
 
28
Imagine that a centroid is pinned down and connected to all its datapoints with rubber bands, and then you unpin it from the surface. It will move so that the rubber bands are less tense in total (even though individual rubber bands may become more tense).
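This intuition corresponds to the standard K-means update, which moves each centroid to the mean of the datapoints assigned to it (the mean minimizes the total squared distance, which is one way to make the rubber-band picture precise). A minimal Python sketch, with made-up points and assignments:

    import numpy as np

    points = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 8.5]])
    assignments = np.array([0, 0, 1, 1])   # cluster index of each datapoint
    k = 2

    # Move each centroid to the mean of the points currently assigned to it.
    new_centroids = np.array([points[assignments == j].mean(axis=0) for j in range(k)])
    print(new_centroids)   # centroid 0: [1.25, 1.9], centroid 1: [5.5, 8.25]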
 
29
Recall that a cluster in K-means is a region around a centroid separated by the hyperplane.
 
30
We have to use the same number of centroids in both clusterings for this to work.
 
31
These features are known as latent variables in statistics.
 
32
One of the reasons for this is that we have not yet developed all the tools we need to write out the details now.
 
33
See Chap. 2.
 
34
And if a feature is always the same, it has a variance of 0 and it carries no information useful for drawing the hyperplane.
 
35
An example of an expansion of the basic bag of words model is a bag of n-grams. An n-gram is an n-tuple consisting of n words that occur next to each other. If we have the sentence ‘I will go now’, the set of its 2-grams will be \(\{(`I', `will'), (`will', `go'), (`go', `now')\}\).
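A minimal Python sketch of building such a set of 2-grams (illustrative, not the chapter's code):

    def ngrams(words, n):
        # Return the set of n-tuples of consecutive words.
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    print(ngrams("I will go now".split(), 2))
    # {('I', 'will'), ('will', 'go'), ('go', 'now')}  (order may vary, since it is a set)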
 
36
For most language processing tasks, especially tasks requiring the use of data collected from social media, it makes sense to convert all text to lowercase first and get rid of all commas, apostrophes and other non-alphanumerics, which we have already done here.
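One simple way to do this kind of preprocessing in Python (an illustrative choice of regular expression, not necessarily the exact cleanup used in the chapter):

    import re

    def clean(text):
        # Lowercase, then replace everything that is not a letter, digit or whitespace.
        text = text.lower()
        return re.sub(r"[^a-z0-9\s]", " ", text)

    print(clean("I'll go now, OK?").split())   # ['i', 'll', 'go', 'now', 'ok']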
 
Literature
1.
R. Tibshirani, T. Hastie, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. (Springer, New York, 2016)
2.
F. van Harmelen, V. Lifschitz, B. Porter, Handbook of Knowledge Representation (Elsevier Science, New York, 2008)
3.
R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction (MIT Press, Cambridge, 1998)
4.
J.R. Quinlan, Induction of decision trees. Mach. Learn. 1, 81–106 (1986)
5.
M.E. Maron, Automatic indexing: an experimental inquiry. J. ACM 8(3), 404–417 (1961)
6.
D.R. Cox, The regression analysis of binary sequences (with discussion). J. Roy. Stat. Soc. B (Methodol.) 20(2), 215–242 (1958)
7.
P.J. Grother, NIST special database 19: handprinted forms and characters database (1995)
8.
Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
9.
M.A. Nielsen, Neural Networks and Deep Learning (Determination Press, 2015)
10.
P.N. Klein, Coding the Matrix (Newtonian Press, London, 2013)
11.
I. Färber, S. Günnemann, H.P. Kriegel, P. Kröger, E. Müller, E. Schubert, T. Seidl, A. Zimek, On using class-labels in evaluation of clusterings, in MultiClust: Discovering, Summarizing, and Using Multiple Clusterings, ed. by X.Z. Fern, I. Davidson, J. Dy (ACM SIGKDD, 2010)
13.
K. Pearson, On lines and planes of closest fit to systems of points in space. Phil. Mag. 2(11), 559–572 (1901)
14.
C. Manning, H. Schütze, Foundations of Statistical Natural Language Processing (MIT Press, Cambridge, 1999)
15.
D. Jurafsky, J. Martin, Speech and Language Processing (Prentice Hall, New Jersey, 2008)
16.
S.P. Lloyd, Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
17.
E.W. Forgy, Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21(3), 768–769 (1965)
Metadata
Title
Machine Learning Basics
Author
Sandro Skansi
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-73004-2_3
