2008 | Book

Machine Learning for Audio, Image and Video Analysis

Theory and Applications

Authors: Francesco Camastra, PhD, Alessandro Vinciarelli, PhD

Publisher: Springer London

Book series: Advanced Information and Knowledge Processing

About this book

1.1 Two Fundamental Questions

There are two fundamental questions that should be answered before buying, and even more before reading, a book:
  • Why should one read the book?
  • What is the book about?
This is the reason why this section, the first of the whole text, proposes some motivations for potential readers (Section 1.1.1) and an overall description of the content (Section 1.1.2). If the answers are convincing, further information can be found in the rest of this chapter: Section 1.2 shows in detail the structure of the book, Section 1.3 presents some features that can help the reader to better move through the text, and Section 1.4 provides some reading tracks targeting specific topics.

1.1.1 Why Should One Read the Book?

One of the most interesting technological phenomena in recent years is the diffusion of consumer electronic products with constantly increasing acquisition, storage and processing power. As an example, consider the evolution of digital cameras: the first models available in the market in the early nineties produced images composed of 1.6 million pixels (this is the meaning of the expression 1.6 megapixels), carried an onboard memory of 16 megabytes, and had an average cost higher than 10,000 U.S. dollars. At the time this book is being written, the best models are close to or even above 8 megapixels, have internal memories of one gigabyte and they cost around 1,000 U.S. dollars.

Table of Contents

Frontmatter

From Perception to Computation

1. Introduction
There are two fundamental questions that should be answered before buying, and even more before reading, a book:
  • Why should one read the book?
  • What is the book about?
This is the reason why this section, the first of the whole text, proposes some motivations for potential readers (Section 1.1.1) and an overall description of the content (Section 1.1.2). If the answers are convincing, further information can be found in the rest of this chapter: Section 1.2 shows in detail the structure of the book, Section 1.3 presents some features that can help the reader to better move through the text, and Section 1.4 provides some reading tracks targeting specific topics.
2. Audio Acquisition, Representation and Storage
The goal of this chapter is to provide basic notions about digital audio processing technologies. These are applied in many everyday products such as phones, radio and television, videogames, CD players, cellular phones, etc. Although there is a wide spectrum of applications, the main problems to be addressed in order to manipulate digital sound are essentially three: acquisition, representation and storage. Acquisition is the process of converting the physical phenomenon we call sound into a form suitable for digital processing, representation is the problem of extracting from the sound the information necessary to perform a specific task, and storage is the problem of reducing the number of bits necessary to encode the acoustic signals.
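As a rough illustration of why storage is a concern, here is a minimal sketch computing the size of uncompressed PCM audio; the CD-quality parameters (44.1 kHz, 16 bit, stereo) are generic figures, not taken from the chapter:

```python
# Uncompressed PCM size: sample_rate x bit_depth x channels x duration.
# CD-quality figures are used here as a generic illustration.
sample_rate = 44_100   # samples per second
bit_depth = 16         # bits per sample
channels = 2           # stereo
duration_s = 60        # one minute of audio

size_bytes = sample_rate * bit_depth * channels * duration_s // 8
print(f"One minute of CD-quality audio: {size_bytes / 2**20:.1f} MB")  # ~10.1 MB
```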
3. Image and Video Acquisition, Representation and Storage
The eye is the organ that allows our brain to acquire the visual information around us. One of the most challenging tasks in science consists in developing a machine that can see, that is, a machine that can acquire, integrate and interpret the visual information embedded in still images and videos. This is the topic of the scientific domain called image processing. The topic of image processing is too large to be described in a single chapter; for comprehensive surveys, the reader can refer to [10], [23], [27].

Machine Learning

4. Machine Learning
The ability to learn is one of the distinctive attributes of intelligent behavior. Following a seminal work [5], we can say that “Learning process includes the acquisition of new declarative knowledge, the development of motor and cognitive skills through instruction or practice, the organization of new knowledge into general, effective representations, and the discovery of new facts and theories through observation and experimentation.”
5. Bayesian Theory of Decision
Bayesian theory of decision (BTD) is a fundamental tool of analysis in machine learning, and several machine learning algorithms have been derived using it. The fundamental idea in BTD is that the decision problem can be solved using probabilistic considerations. To introduce the theory, consider the following example. Suppose we have a classroom in which there are students of both genders, and an examiner, outside the classroom, who has to call the students for the examination. He has a list of the students' surnames, but the surnames are not accompanied by first names. How can the examiner decide whether a given surname corresponds to a girl or a boy?
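To make the decision rule concrete, here is a minimal sketch of the Bayes decision rule underlying the example; the priors and surname likelihoods are invented purely for illustration:

```python
# Bayes decision rule: pick the class with the highest posterior,
# P(class | x) proportional to P(x | class) * P(class).
# All numbers below are illustrative assumptions, not data from the book.
priors = {"girl": 0.4, "boy": 0.6}          # assumed class priors
likelihoods = {                              # assumed P(surname | class)
    "Rossi":   {"girl": 0.02,  "boy": 0.01},
    "Bianchi": {"girl": 0.005, "boy": 0.02},
}

def decide(surname: str) -> str:
    scores = {c: likelihoods[surname][c] * priors[c] for c in priors}
    return max(scores, key=scores.get)

print(decide("Rossi"))    # girl: 0.008 beats boy: 0.006
print(decide("Bianchi"))  # boy: 0.012 beats girl: 0.002
```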
6. Clustering Methods
Given a set of examples of a concept, the learning problem can be described as finding a general rule that explains the examples, even though only a sample of limited size is available. The examples are generally referred to as data. The difficulty of the learning problem is similar to that of children learning to speak from the sounds emitted by grown-ups. The learning problem can be stated as follows: given an example sample of limited size, find a concise description of the data. Learning methods can be grouped into three broad families: supervised learning, reinforcement learning and unsupervised learning.
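As a concrete instance of the unsupervised family, here is a minimal sketch of k-means, one of the clustering methods covered in this chapter; the two-cluster toy data are invented for illustration:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # init from data points
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new = np.array([X[labels == j].mean(0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Two well-separated Gaussian blobs as toy data.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```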
7. Foundations of Statistical Learning and Model Selection
This chapter has two main topics: model selection and the learning problem.
8. Supervised Neural Networks and Ensemble Methods
9. Kernel Methods
Kernel methods are algorithms that implicitly project the data into a high-dimensional space. The use of kernel functions to make computations was introduced by [1] in 1964. Two decades later, several authors [60], [68], [70] proposed a neural network based on kernel functions, the radial basis function (RBF) network, which was widely used in many application fields. Since 1995, when support vector machines (SVMs) were proposed, kernel methods have occupied a fundamental place in machine learning. In several applications, SVMs have shown better performance than other machine learning algorithms. The SVM strategy can be summarized in two steps. In the first step, the data are projected implicitly onto a high-dimensional space by means of the kernel trick [74], which consists of replacing the inner product between data vectors with a kernel function. The second step consists of applying a linear classifier to the projected data. Since a linear classifier can solve only a very limited class of problems, the kernel trick is used to empower the linear classifier, making the SVM capable of solving a larger class of problems.
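A hedged sketch of this two-step strategy using scikit-learn, not the book's own code; the concentric-circles dataset is chosen because it is not linearly separable in the input space:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: no straight line separates the two classes in 2-D.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1 (implicit): the RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2)
# replaces inner products, projecting the data into a high-dimensional space.
# Step 2: a maximum-margin linear classifier is applied in that space.
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.2f}")
```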
10. Markovian Models for Sequential Data
Most of the techniques presented in this book are aimed at making decisions about data. By data we mean, in general, vectors representing, in some sense, real-world objects that cannot be handled directly by computers. The components of the vectors, the so-called features, are supposed to contain enough information to allow a correct decision and to distinguish between different objects (see Chapter 5). The algorithms are typically capable, after a training procedure, of associating input vectors with output decisions. On the other hand, in some cases the real-world objects of interest cannot be represented with a single vector because they are sequential in nature. This is the case for speech and handwriting, which can be thought of as sequences of phonemes (see Chapter 2) and letters, respectively, as well as for time series, biological sequences (e.g. protein and DNA chains), natural language sentences, music, etc. The goal of this chapter is to show how some of the techniques presented so far for single vectors can be extended to sequential data.
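As a small illustration of how probabilistic machinery extends to sequences, here is a minimal sketch of the forward algorithm for a hidden Markov model; the two-state parameters are toy values invented for illustration:

```python
import numpy as np

def forward(pi, A, B, obs):
    """P(observation sequence) under an HMM, via the forward algorithm.
    pi: initial state probabilities (N,); A: state transitions (N, N);
    B: emission probabilities (N, M); obs: sequence of symbol indices."""
    alpha = pi * B[:, obs[0]]            # initialize with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate states, absorb next symbol
    return alpha.sum()                   # sum over final states

# Toy two-state, two-symbol model (illustrative values only).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward(pi, A, B, [0, 1, 0]))
```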
11. Feature Extraction Methods and Manifold Learning Methods
In the previous chapters we presented several learning algorithms for classification and regression tasks. In many practical problems the data cannot be used straightaway to feed learning algorithms; they first need to undergo a preliminary preprocessing. To illustrate this concept, consider the following example. Suppose we want to build an automatic handwritten character recognizer, that is, a system able to associate the correct letter or digit with a given bitmap. We assume that the data are bitmaps of n × m pixels, all of the same size; for the sake of simplicity we assume n = m = 28. The number of possible binary configurations is therefore 2^(n×m) = 2^784. This consideration implies that a learning machine fed directly with character bitmaps will perform poorly, since a representative training set cannot be built. A common approach for overcoming this problem consists in representing each bitmap by a vector of d (with d ≪ nm) measures computed on the bitmap, called features, and then feeding the learning machine with the feature vector. The feature vector has the aim of representing in a concise way the distinctive characteristics of each letter. The better the features represent the distinctive characteristics of each single character, the higher the performance of the learning machine. In machine learning, the preprocessing stage that converts the data into feature vectors is called feature extraction. One of the main aims of feature extraction is to obtain the most representative feature vector using as few features as possible. Using more features than strictly necessary leads to several problems. One problem is the space needed to store the data: as the amount of available information increases, compression for storage purposes becomes even more important. Moreover, the speed of learning machines using the data depends on the dimension of the vectors, so a reduction of the dimension can result in reduced computation time. The most important problem, however, is the sparsity of data when the dimensionality of the features is high. This sparsity implies that it is usually hard to build learning machines with good performance when the dimensionality of the input data (that is, the feature dimensionality) is high. This phenomenon, described by Bellman, is called the curse of dimensionality [7].
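One classical technique for obtaining a small number d of representative features is principal component analysis (PCA), covered in this chapter; a minimal sketch follows, with the 28 × 28 bitmap dimensions of the example above and an arbitrarily chosen d = 40:

```python
import numpy as np

def pca(X, d):
    """Project X (n_samples, n_features) onto its top-d principal components."""
    X_centered = X - X.mean(axis=0)
    # Principal directions are right singular vectors of the centered data.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:d].T

# 100 hypothetical 28x28 bitmaps, flattened to 784-dimensional vectors,
# reduced to d = 40 features (d << nm).
X = np.random.rand(100, 28 * 28)
features = pca(X, d=40)
print(features.shape)  # (100, 40)
```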

Applications

12. Speech and Handwriting Recognition
This chapter presents speech and handwriting recognition, i.e. two major applications involving the Markovian models described in Chapter 10. The goal is not only to present some of the most widely investigated applications in the literature, but also to show how the same machine learning techniques can be applied to recognize apparently different data such as handwritten word images and speech recordings. In fact, the only differences between handwriting and speech recognition systems concern the so-called front-end, i.e. the low-level processing steps dealing directly with the raw data (see Section 12.2 for more details). Once the raw data have been converted into sequences of vectors, the same recognition approach, based on hidden Markov models and N-grams, is applied to both problems, and no further domain-specific knowledge is needed. The possibility of dealing with different data using the same approach is one of the main advantages of machine learning: it makes it possible to work on a wide spectrum of problems even in the absence of deep problem-specific knowledge.
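As a small illustration of the N-gram component mentioned above, here is a minimal sketch of a maximum-likelihood bigram model; the toy corpus is invented for illustration:

```python
from collections import Counter

# Maximum-likelihood bigram model: P(w2 | w1) = count(w1, w2) / count(w1).
corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(w2, w1):
    return bigrams[(w1, w2)] / unigrams[w1]

print(p("cat", "the"))  # 2/3: "the" is followed by "cat" in two of three cases
```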
13. Automatic Face Recognition
The problem of automatic face recognition (AFR) can be stated as follows: given an image or a video showing one or more persons, recognize the portrayed individuals using a predefined dataset of face images [72]. The task has been studied for several decades: the earliest works appeared at the beginning of the 1970s [28], [29], but it is only in the last few years that the domain has reached maturity. The reason is twofold: on one hand, the necessary computational resources are now easily available and recognition approaches achieve, at least in controlled conditions, satisfactory results. On the other hand, several applications of commercial interest require robust face recognition systems, e.g. multimedia indexing, tracking, human-computer interaction, etc.
14. Video Segmentation and Keyframe Extraction
The goal of this chapter is to show how clustering techniques are applied to perform video segmentation, i.e. to split videos into segments that are meaningful from a semantic point of view. Segmentation is the first step of any process aimed at extracting high-level information from videos, i.e. information which is not explicitly stated in the data but rather requires an abstraction process [10], [17], [22]. Video segmentation can be thought of as the partitioning of a text into chapters, sections and other parts that help the reader to better access the content. In more general terms, the segmentation of a long document (text, video, audio, etc.) into smaller parts addresses the limits of the human mind in dealing with large amounts of information. In fact, humans are known to be more effective when managing five to nine information chunks rather than a single information block corresponding to the sum of the chunks [30].
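A hedged sketch of one common low-level ingredient of video segmentation, not necessarily the book's exact method: detecting shot boundaries from frame-to-frame gray-level histogram differences. The threshold and bin count are arbitrary illustrative values:

```python
import numpy as np

def shot_boundaries(frames, bins=32, threshold=0.5):
    """Flag frame indices whose gray-level histogram differs sharply from
    the previous frame's; a crude hard-cut detector."""
    boundaries, prev = [], None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hist = hist / hist.sum()                      # normalize to a distribution
        if prev is not None and 0.5 * np.abs(hist - prev).sum() > threshold:
            boundaries.append(i)                      # large change: likely a cut
        prev = hist
    return boundaries

# Toy video: five dark frames followed by five bright frames.
frames = [np.zeros((240, 320), np.uint8)] * 5 + [np.full((240, 320), 200, np.uint8)] * 5
print(shot_boundaries(frames))  # [5]
```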
Backmatter
Metadata
Title
Machine Learning for Audio, Image and Video Analysis
Authors
Francesco Camastra, PhD
Alessandro Vinciarelli, PhD
Copyright year
2008
Publisher
Springer London
Electronic ISBN
978-1-84800-007-0
Print ISBN
978-1-84800-006-3
DOI
https://doi.org/10.1007/978-1-84800-007-0
