
2016 | Book

Real-time Speech and Music Classification by Large Audio Feature Space Extraction


About this Book

This book reports on an outstanding thesis that has significantly advanced the state-of-the-art in the automated analysis and classification of speech and music. It defines several standard acoustic parameter sets and describes their implementation in a novel, open-source, audio analysis framework called openSMILE, which has been accepted and intensively used worldwide. The book offers extensive descriptions of key methods for the automatic classification of speech and music signals in real-life conditions and reports on the evaluation of the framework developed and the acoustic parameter sets that were selected. It is not only intended as a manual for openSMILE users, but also and primarily as a guide and source of inspiration for students and scientists involved in the design of speech and music analysis methods that can robustly handle real-life conditions.

Table of Contents

Frontmatter
Chapter 1. Introduction
Abstract
The number of interactions with computerized devices in our daily lives is rapidly increasing. In order to make these feel more intuitive and natural, automated processing of information beyond the linguistic and semantic content contained in audio recordings (speech and music) is becoming more and more important. This includes automatic detection of emotion, affect, mental and health states, speaker traits such as gender and age, and voice qualities from audio/speech signals.
Florian Eyben
Chapter 2. Acoustic Features and Modelling
Abstract
This chapter gives an overview of the methods for speech and music analysis implemented by the author in the openSMILE toolkit. The methods described include all the relevant processing steps from an audio signal to a classification result: pre-processing and segmentation of the input, feature extraction (i.e., computation of acoustic Low-level Descriptors (LLDs) and summarisation of these descriptors over high-level segments), and modelling (e.g., classification).
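The pipeline outlined in the abstract (segmentation into frames, per-frame LLD extraction, segment-level summarisation by functionals) might be sketched as follows. This is a minimal illustration, not the openSMILE implementation: the frame/hop sizes, the choice of log-energy and zero-crossing rate as LLDs, and mean/standard deviation as functionals are assumed here for brevity.

```python
import math

def frame_signal(signal, frame_len=400, hop=160):
    """Pre-processing/segmentation: split a waveform into overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def lld(frame):
    """Two illustrative Low-level Descriptors for one frame:
    log energy and zero-crossing rate."""
    energy = math.log(sum(x * x for x in frame) + 1e-10)
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / len(frame)
    return [energy, zcr]

def functionals(lld_track):
    """Summarise each LLD contour over the segment with mean and
    standard deviation functionals, yielding a fixed-length vector."""
    feats = []
    for descriptor in zip(*lld_track):
        n = len(descriptor)
        mean = sum(descriptor) / n
        var = sum((x - mean) ** 2 for x in descriptor) / n
        feats += [mean, math.sqrt(var)]
    return feats

# From raw signal to a fixed-length feature vector, ready for a classifier:
signal = [math.sin(0.02 * i) for i in range(16000)]  # 1 s dummy tone at 16 kHz
frames = frame_signal(signal)
track = [lld(f) for f in frames]
features = functionals(track)  # [energy_mean, energy_std, zcr_mean, zcr_std]
```

Whatever the actual descriptors, the key property is the same: the functionals map a variable-length LLD track onto a vector of fixed dimensionality, which is what standard static classifiers require.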
Florian Eyben
Chapter 3. Standard Baseline Feature Sets
Abstract
A central aim of this thesis was to define standard acoustic feature sets for both speech and music that contain a large and comprehensive set of acoustic descriptors. Based on previous efforts to combine features and the author's experience from evaluations across several databases and tasks, 12 standard acoustic parameter sets have been proposed and thoroughly evaluated for this thesis. These sets include the acoustic baseline feature sets of the INTERSPEECH challenges on Emotion and Paralinguistics from 2009–2013 (ComParE) as well as the Audio-Visual Emotion Challenges (2011–2013). Further, two sets for music processing and two minimalistic speech parameter sets (GeMAPS and eGeMAPS) are proposed.
Florian Eyben
Chapter 4. Real-time Incremental Processing
Abstract
The features and modelling methods used in this thesis have been selected with on-line processing in mind; however, most are general methods suitable for both on-line and off-line processing. This section deals specifically with the issues encountered in on-line (i.e., incremental) processing, such as segmentation, constraints on feature extraction, and complexity and run-time constraints.
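One core constraint of incremental processing is that segment-level statistics must be computable as frames arrive, without buffering the entire segment. A generic way to do this for mean and variance functionals is Welford's on-line algorithm, sketched below; this is an assumption-laden illustration of the principle, not openSMILE's actual code.

```python
class IncrementalFunctional:
    """Running mean and variance of one LLD contour, updated one frame
    at a time (Welford's algorithm), so no full LLD track is stored."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        # Called as soon as a new frame's LLD value becomes available.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0

# Feed LLD values one by one, as an on-line system would:
stat = IncrementalFunctional()
for value in [1.0, 2.0, 3.0, 4.0]:
    stat.update(value)
# stat.mean == 2.5, stat.variance == 1.25 (population variance)
```

At any point during the stream, `stat.mean` and `stat.variance` hold the statistics of the data seen so far, which is exactly what a low-latency recogniser needs to emit intermediate results before a segment ends.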
Florian Eyben
Chapter 5. Real-Life Robustness
Abstract
With the rapidly growing interest in and market value of social signal and media analysis, a large demand has arisen for robust technology that works in adverse situations with poor audio quality and high levels of background noise and reverberation. Application areas include, e.g., interactive speech systems on mobile devices, multi-modal user profiling for better user adaptation of smart agents, call centre speech analytics, voice analytics for marketing research, and health monitoring for stress and depression. This chapter discusses methods for robust speech pre-processing and for enhancing the robustness of audio classification algorithms in degraded acoustic conditions.
Florian Eyben
Chapter 6. Evaluation
Abstract
The baseline acoustic feature sets and the methods for robust and incremental audio analysis have been evaluated extensively by the author of this thesis. This chapter first introduces a set of 12 affective speech databases and two music style datasets, which are used for a systematic evaluation of the proposed methods and baseline acoustic feature sets. Next, the effectiveness of the proposed noise-robust affective speech classification approach is evaluated on two of the affective speech databases in Sect. 6.2. Then, recognition results obtained with all the baseline acoustic feature sets on a large set of 10 speech and two music databases are presented and discussed in Sect. 6.3. Finally, recognition results for continuous, dimensional affect recognition with an incremental recognition method are shown in Sect. 6.4.
Florian Eyben
Chapter 7. Discussion and Outlook
Abstract
This chapter summarises the presented methods for automatic speech and music analysis and the results obtained for speech emotion analytics and music genre identification with the openSMILE toolkit developed by the author. Further, it discusses whether and how the aims defined at the outset were achieved, and outlines open issues for future work.
Florian Eyben
Backmatter
Metadata
Title
Real-time Speech and Music Classification by Large Audio Feature Space Extraction
Author
Florian Eyben
Copyright Year
2016
Electronic ISBN
978-3-319-27299-3
Print ISBN
978-3-319-27298-6
DOI
https://doi.org/10.1007/978-3-319-27299-3