
2013 | Book

Intelligent Audio Analysis

Written by: Björn W. Schuller

Publisher: Springer Berlin Heidelberg

Book series: Signals and Communication Technology


About this book

This book provides the reader with the knowledge necessary for comprehension of the field of Intelligent Audio Analysis. It first introduces standard methods and discusses the typical Intelligent Audio Analysis chain going from audio data to audio features to audio recognition. Further, an introduction to audio source separation, enhancement, and robustness is given.

After the introductory parts, the book shows several applications for the three types of audio: speech, music, and general sound. Each task is briefly introduced, followed by a description of the specific data and methods applied, experiments and results, and a conclusion for this specific task.

The book provides benchmark results and standardised test-beds for a broad range of audio analysis tasks. The main focus lies on the parallel advancement of realism in audio analysis, as today's results are too often overly optimistic owing to idealised testing conditions. The book also serves to stimulate synergies arising from the transfer of methods, leading towards a holistic audio analysis.

Table of Contents

Frontmatter

Introduction

Frontmatter
Chapter 1. Intelligent Audio Analysis: A Definition
Abstract
A definition of Intelligent Audio Analysis is given. Further, real-life conditions are defined comprising the aspects of non-prototypical and non-preselected independent test data that has never been used during optimisation, and fully automatic processing.
Björn Schuller
Chapter 2. Motivation, Aims, and Solutions
Abstract
Research in the field of Intelligent Audio Analysis possesses rich application potential of commercial and practical interest. Applications comprise audio alteration, retrieval, interaction, monitoring and surveillance, and entertainment. A detailed description is given alongside current aims to advance the field.
Björn Schuller
Chapter 3. Structure of the Book
Abstract
A division into four main parts is outlined. It leads from the introduction, with a short motivation, general aims, solutions for reaching them, and a definition of audio analysis per se, to audio analysis methods and applications, and ends with a conclusion.
Björn Schuller

Intelligent Audio Analysis Methods

Frontmatter
Chapter 4. Chain of Audio Processing
Abstract
The chain of processing in a typical Intelligent Audio Analysis system is outlined. It leads from preprocessing through Low Level Descriptor extraction, chunking, supra-segmental analysis and hierarchical functional extraction, feature reduction, feature selection and generation, parameter selection, and model learning to the actual classification or regression. This can be followed by a fusion with other information streams and by encoding for the application context. The individual steps are explained in more detail.
Björn Schuller
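The first steps of the chain described in Chapter 4 can be illustrated with a minimal sketch. It is not taken from the book: the function names, the frame length of 400 samples, the hop of 160 samples, and the choice of log energy as the only Low Level Descriptor are assumptions made for illustration. The sketch frames a signal, extracts one descriptor per frame, and applies a few statistical functionals to obtain a fixed-length feature vector for the whole chunk.

```python
import math
from statistics import mean, pstdev


def frame_signal(samples, frame_len, hop):
    """Split a sample sequence into overlapping frames (incomplete tail is dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def log_energy(frame):
    """Low Level Descriptor: log of the mean squared amplitude of one frame."""
    power = sum(x * x for x in frame) / len(frame)
    return math.log(power + 1e-10)  # small floor avoids log(0) on silence


def functionals(contour):
    """Supra-segmental step: map a variable-length contour to fixed statistics."""
    return {
        "mean": mean(contour),
        "std": pstdev(contour),
        "min": min(contour),
        "max": max(contour),
    }


def chunk_features(samples, frame_len=400, hop=160):
    """Frame the chunk, extract the descriptor contour, then apply functionals."""
    contour = [log_energy(f) for f in frame_signal(samples, frame_len, hop)]
    return functionals(contour)


# Toy input (assumption): one second of a 440 Hz sine sampled at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
features = chunk_features(tone)
print(features)
```

The resulting dictionary has the same size regardless of the chunk duration, which is what allows a static classifier or regressor to be trained on chunks of varying length.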
Chapter 5. Audio Data
Abstract
In order to train and test Intelligent Audio Analysis systems, audio data is needed. In fact, this is often considered one of the main bottlenecks, and the common opinion is that there is "no data like more data". In this light, the requirements for collecting and providing audio databases are outlined. This includes in particular the establishment of a reliable gold standard. Illustrative examples are given for the three types of audio (speech, music, and general sound) by the corpora TUM AVIC, NTWICM, and the FindSounds database.
Björn Schuller
Chapter 6. Audio Features
Abstract
To represent the information contained in an audio signal or stream compactly while focussing on a task of interest, a parameterised form is usually chosen. These parameters describe properties of the audio, usually in a highly information-reduced form and typically at a considerably lower rate, such as the mean energy or pitch over a longer period of time. As different Intelligent Audio Analysis tasks are often best represented by different such 'features', a broad selection of the most typical ones is presented. This includes a description of the digitisation and segmentation of the audio as a first step. Features include intensity, zero-crossings, autocorrelation, spectrum and cepstrum, linear prediction, line spectral pairs, perceptual linear prediction, formants, fundamental frequency and voicing probability, and jitter and shimmer from the speech domain. Further, music, sound, and textual descriptors are included. Then, the principle of supra-segmental brute-forcing and the subsequent reduction and selection are explained. The widely used openSMILE feature extractor serves as an example.
Björn Schuller
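Two of the features named in the Chapter 6 abstract, zero-crossings and autocorrelation, are simple enough to sketch directly. The code below is an illustration only and does not reproduce openSMILE or the book's implementation: the function names, the pitch search range of 50 to 500 Hz, and the unnormalised autocorrelation are assumptions.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def autocorrelation(frame, lag):
    """Unnormalised autocorrelation of one frame at a given lag (in samples)."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))


def estimate_f0(frame, sample_rate, f_min=50.0, f_max=500.0):
    """Return the frequency whose lag maximises the autocorrelation in the search range."""
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag = max(range(lag_min, lag_max + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    return sample_rate / best_lag


# Toy frame (assumption): 50 ms of a 200 Hz sine sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
f0 = estimate_f0(frame, sample_rate)
zcr = zero_crossing_rate(frame)
print(f0, zcr)
```

On real speech, a practical extractor would add windowing, normalisation, and a voicing decision; this sketch omits all of these to keep the principle visible.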
Chapter 7. Audio Recognition
Abstract
A huge variety of learning algorithms is applied in the field of Intelligent Audio Analysis, depending on the nature of the target task of interest, such as being static or dynamic. In addition, manifold other factors, such as efficiency or reliability, are decisive when selecting the optimal algorithm. The most frequently found representatives are explained in detail: Decision Trees, Support Vector-based approaches, Artificial Neural Networks including the Long Short-Term Memory paradigm, as well as dynamic modelling by hidden Markov models. In addition, bootstrapping, meta-learning, and tandem learning are described. This is followed by the optimal evaluation of such algorithms, touching on partitioning and balancing, and on evaluation measures frequently used in the field.
Björn Schuller
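The Chapter 7 abstract mentions balancing and evaluation measures without naming them. As a hedged illustration of why the choice of measure matters on unbalanced data, the sketch below contrasts plain accuracy with unweighted average recall (the mean of the per-class recalls). Selecting these two measures, the function names, and the toy labels are assumptions and are not taken from the abstract.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of instances whose predicted label equals the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls; every class counts equally regardless of size."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy data (assumption): an unbalanced two-class task and a majority-class predictor.
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10

print(accuracy(y_true, y_pred))                   # 0.8
print(unweighted_average_recall(y_true, y_pred))  # 0.5
```

A predictor that ignores the minority class still reaches 80 % accuracy here, whereas the unweighted average recall falls to the chance level of 50 % for two classes.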
Chapter 8. Audio Source Separation
Abstract
In order to enhance the (audio) signal of interest in the case of added audio sources, one can aim at their separation. Although very demanding, audio source separation has many interesting applications: for example, in Music Information Retrieval, it allows for polyphonic transcription or recognition of lyrics in singing after decomposing the original recording into voices and/or instruments such as drums or guitars, or vocals, e.g., for 'query by humming'. Here, non-negative matrix factorisation-based (NMF) approaches are explained. Further, 'NMF Activation Features' are introduced and exemplified in the speech processing domain.
Björn Schuller
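The idea behind NMF can be sketched as follows: a non-negative matrix V (for instance a magnitude spectrogram) is approximated by the product of a matrix W of spectral templates and a matrix H of their activations over time. The abstract does not state which NMF variant the book uses; the sketch below assumes the common multiplicative update rules for the squared Euclidean error, and all function names and the toy matrix are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorise v (rows x cols) into w (rows x rank) and h (rank x cols), all non-negative."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n_rows)]
    return w, h


def reconstruction_error(v, w, h):
    """Squared Euclidean distance between v and the product w h."""
    approx = matmul(w, h)
    return sum((a - b) ** 2 for ra, rb in zip(v, approx) for a, b in zip(ra, rb))


# Toy "spectrogram" (assumption): two spectral templates active in different frames.
templates = [[1.0, 0.1], [0.2, 1.0], [0.8, 0.0], [0.0, 0.6]]
activations = [[1.0, 0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0, 1.0]]
v = matmul(templates, activations)
w, h = nmf(v, rank=2)
print(reconstruction_error(v, w, h))
```

The rows of `h` indicate when each learnt template is active, which is the kind of information an activation-based feature would draw on. A practical system would operate on real spectrograms and use an optimised numerical library rather than nested lists.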
Chapter 9. Audio Enhancement and Robustness
Abstract
Once an audio recognition system that functions under idealised conditions is established, the primary concern shifts towards making it robust in the real world. Several options exist for system improvement along the chain of processing, and have proved to be promising especially in the monaural case. Here, the most frequently used methods and some recent candidates are explained, first including advanced front-end feature extraction, unsupervised spectral subtraction, and feature enhancement and normalisation by Cepstral Mean Subtraction, Mean and Variance Normalisation, and Histogram Equalisation. Then, model-based feature enhancement based on (switching) linear dynamical modelling is followed by model architectures such as (hidden) conditional random fields, and switching autoregressive approaches.
Björn Schuller
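Two of the normalisation methods named in the Chapter 9 abstract, Cepstral Mean Subtraction and Mean and Variance Normalisation, can be sketched in a few lines. The sketch assumes per-utterance statistics and a feature matrix given as a list of frames, each a list of coefficients; the function names and the toy values are hypothetical and not taken from the book.

```python
from statistics import mean, pstdev


def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    means = [mean(col) for col in zip(*frames)]
    return [[x - m for x, m in zip(frame, means)] for frame in frames]


def mean_variance_normalisation(frames, eps=1e-10):
    """Shift each coefficient to zero mean and scale it to unit standard deviation."""
    means = [mean(col) for col in zip(*frames)]
    stds = [pstdev(col) for col in zip(*frames)]
    return [[(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
            for frame in frames]


# Toy features (assumption): three frames with two coefficients each.
frames = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
# A constant offset per coefficient mimics a stationary channel effect.
shifted = [[x + 5.0, y - 3.0] for x, y in frames]

cms_clean = cepstral_mean_subtraction(frames)
cms_shifted = cepstral_mean_subtraction(shifted)
mvn = mean_variance_normalisation(frames)
print(cms_clean, cms_shifted, mvn)
```

Because a constant offset changes only the mean of each coefficient, both inputs yield the same output after mean subtraction, which is the property that makes the method useful against stationary channel distortions.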

Intelligent Audio Analysis Applications

Frontmatter
Chapter 10. Applications in Intelligent Speech Analysis
Abstract
Speech is broadly considered the most natural communication form for humans. Manifold applications open up for general technical and computer systems once they are able to recognise speech as well as humans do, be it for interaction purposes with humans, mediation purposes between humans, or speech retrieval. Here, state-of-the-art methodology is presented for highly robust speech recognition, nonlinguistic vocalisation recognition, and the recognition of paralinguistic speaker states and traits as exemplified by sentiment, emotion, interest, age, gender, intoxication and sleepiness. All examples stem from the author's recent work. In particular the latter are chosen from a series of Challenges co-organised by the author at Interspeech from 2009 onwards.
Björn Schuller
Chapter 11. Applications in Intelligent Music Analysis
Abstract
As digitised music has conquered the market for more than a decade, advanced techniques of Intelligent Music Analysis are gaining interest and importance. From this exciting field, recent application examples were selected for presentation in detail from the work of the author including current performance benchmarks. These comprise drum-beat separation, onset detection, tempo, metre, ballroom dance style, and mood determination, key and chord recognition, and structure analysis alongside singer trait classification. The latter includes singer age, gender, race, and height recognition.
Björn Schuller
Chapter 12. Applications in Intelligent Sound Analysis
Abstract
Apart from speech and music, general sound can also carry relevant information. This is, however, a considerably less researched field to date. Most prominent in this area are the tasks of acoustic event detection and classification, which can be subsumed under the area of computational auditory scene analysis. Fields of application include media retrieval including affective content analysis or human-machine and human-robot interaction, animal vocalisation recognition, and monitoring of industrial processes. Here, three applications in real-life Intelligent Sound Analysis are given from the work of the author: audio-based animal recognition, acoustic event classification, and prediction of emotion as induced in sound listeners. In particular, weakly supervised learning techniques are presented to cope with the typical label-sparseness in this field.
Björn Schuller

Conclusion

Frontmatter
Chapter 13. Discussion
Abstract
First, a statement is provided on how the state of the art in the field of Intelligent Audio Analysis was advanced more recently. Based upon this, a distilled 'best practice' recommendation is given to the reader. This includes aspects of high realism, standardised, multi-faceted and machine-aided data collection, source separation, feature brute-forcing, temporal evolution modelling, coupling of tasks, and standardisation. Then, missing aspects and remaining research steps are critically discussed. Considerations in this direction comprise the request for more robustness, blind separation and multi-task processing of real-life streams, massive weakly supervised and evolutionary learning, closure of the gap between analysis and synthesis, cross-cultural and cross-lingual widening, novel tasks, further unification and transfer of methods, confidence measures, distributed processing, and new competitive research challenges.
Björn Schuller
Chapter 14. Vision
Abstract
Based on recent examples of the author’s work from the domains of Intelligent Speech, Music, and Sound Analysis, a comprehensive overview is given on currently obtainable performances in the field of Intelligent Audio Analysis. This comprises discrete classification tasks, namely, digit, spelling, phoneme, word, and non-linguistic vocalisation recognition alongside writer sentiment and speaker emotion, age, gender, intoxication, and sleepiness recognition. Further, continuous writer sentiment, and speaker interest and height determination as well as sound listener induced arousal and valence prediction are contained. Based on these performances, future perspectives are given.
Björn Schuller
Backmatter
Metadata
Title
Intelligent Audio Analysis
Written by
Björn W. Schuller
Copyright year
2013
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-36806-6
Print ISBN
978-3-642-36805-9
DOI
https://doi.org/10.1007/978-3-642-36806-6
