
2018 | Book

Computational Analysis of Sound Scenes and Events

Edited by: Prof. Dr. Tuomas Virtanen, Prof. Dr. Mark D. Plumbley, Prof. Dan Ellis

Publisher: Springer International Publishing


About this Book

This book presents computational methods for extracting useful information from audio signals, collecting the state of the art in the field of sound event and scene analysis. The authors cover the entire procedure for developing such methods, ranging from data acquisition and labeling, through the design of taxonomies used in the systems, to signal processing methods for feature extraction and machine learning methods for sound recognition. The book also covers advanced techniques for dealing with environmental variation and multiple overlapping sound sources, and for taking advantage of multiple microphones or other modalities. The book gives examples of usage scenarios in large media databases, acoustic monitoring, bioacoustics, and context-aware devices. Graphical illustrations of sound signals and their spectrographic representations are presented, as well as block diagrams and pseudocode of algorithms.

Table of Contents

Frontmatter

Foundations

Frontmatter
Chapter 1. Introduction to Sound Scene and Event Analysis
Abstract
Sounds carry a great deal of information about our environments, from individual physical events to sound scenes as a whole. In recent years several novel methods have been proposed to analyze this information automatically, and several new applications have emerged. This chapter introduces the basic concepts, research problems, and engineering challenges in computational environmental sound analysis. We motivate the field by briefly describing various applications where the methods can be used. We discuss the commonalities and differences between environmental sound analysis and other major audio content analysis fields such as automatic speech recognition and music information retrieval. We discuss the main challenges in the field and give a short historical perspective on its development. We also briefly summarize the role of each chapter in the book.
Tuomas Virtanen, Mark D. Plumbley, Dan Ellis
Chapter 2. The Machine Learning Approach for Analysis of Sound Scenes and Events
Abstract
This chapter explains the basic concepts in computational methods used for analysis of sound scenes and events. Even though the analysis tasks in many applications seem different, the underlying computational methods are typically based on the same principles. We explain the commonalities between analysis tasks such as sound event detection, sound scene classification, and audio tagging. We focus on the machine learning approach, where the sound categories (i.e., classes) to be analyzed are defined in advance. We explain the typical components of an analysis system, including signal pre-processing, feature extraction, and pattern classification. We also present an example system based on multi-label deep neural networks, which has been found to be applicable in many analysis tasks discussed in this book. Finally, we explain the whole processing chain involved in developing computational audio analysis systems.
Toni Heittola, Emre Çakır, Tuomas Virtanen
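The multi-label setup described in this chapter's abstract can be illustrated with a minimal sketch. This is not code from the book: the layer sizes, class names, and random weights below are all invented for illustration. The key point is the sigmoid output layer with a per-class threshold, which lets several sound classes be active in the same frame.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

# Hypothetical network: 40 features per frame, one hidden layer, 3 classes.
CLASSES = ["speech", "car", "birdsong"]          # invented labels
W1, b1 = rng.normal(size=(40, 64)), np.zeros(64)  # untrained toy weights
W2, b2 = rng.normal(size=(64, 3)), np.zeros(3)

def predict_frame(features, threshold=0.5):
    """Multi-label prediction: independent sigmoid outputs,
    one binary decision per class."""
    h = relu(features @ W1 + b1)
    probs = sigmoid(h @ W2 + b2)
    return {c: bool(p >= threshold) for c, p in zip(CLASSES, probs)}

frame = rng.normal(size=40)   # stand-in for one acoustic feature vector
print(predict_frame(frame))
```

Unlike a softmax classifier, which would force exactly one class per frame, this formulation naturally handles overlapping sounds.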
Chapter 3. Acoustics and Psychoacoustics of Sound Scenes and Events
Abstract
Auditory scenes are made of several different sounds overlapping in time and frequency, propagating through space, and resulting in complex arrays of acoustic information reaching the listeners’ ears. Despite the complexity of the signal, human listeners effortlessly segregate these scenes into different meaningful sound events. This chapter provides an overview of the auditory mechanisms subserving this ability. First, we briefly introduce the major characteristics of sound production and propagation and basic notions of psychoacoustics. The next part describes one specific family of auditory scene analysis models (how listeners segregate the scene into auditory objects), based on multidimensional representations of the signal, temporal coherence analysis to form auditory objects, and the attentional processes that make the foreground pop out from the background. Then, the chapter reviews different approaches to studying the perception and identification of sound events (how listeners make sense of the auditory objects): the identification of different properties of sound events (size, material, velocity, etc.), and a more general approach that investigates the acoustic and auditory features subserving sound recognition. Overall, this review of the acoustics and psychoacoustics of sound scenes and events provides a backdrop for the development of computational methods reported in the other chapters of this volume.
Guillaume Lemaitre, Nicolas Grimault, Clara Suied

Core Methods

Frontmatter
Chapter 4. Acoustic Features for Environmental Sound Analysis
Abstract
Most of the time it is nearly impossible to differentiate between particular types of sound events from the waveform alone. Therefore, frequency-domain and time-frequency-domain representations have been used for years, providing representations of sound signals that are more in line with human perception. However, these representations are usually too generic and often fail to describe specific content that is present in a sound recording. A lot of work has been devoted to designing features that allow extracting such specific information, leading to a wide variety of hand-crafted features. In the past years, owing to the increasing availability of medium-scale and large-scale sound datasets, an alternative approach to feature extraction has become popular: so-called feature learning. Finally, processing the amount of data at hand nowadays can quickly become overwhelming, so it is of paramount importance to be able to reduce the size of the data in the feature space. This chapter describes the general processing chain to convert a sound signal into a feature vector that can be efficiently exploited by a classifier, as well as the relation to features used for speech and music processing.
Romain Serizel, Victor Bisot, Slim Essid, Gaël Richard
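The first steps of the processing chain mentioned in the abstract (framing, windowing, and a time-frequency transform) can be sketched with plain NumPy. The frame length, hop size, and test tone below are illustrative choices, not values prescribed by the chapter.

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160, eps=1e-10):
    """Frame the signal, apply a Hann window, and take the log magnitude
    of the FFT of each frame -- the front end of most feature chains."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectrum
    return np.log(spectrum + eps)                   # log compression

# Hypothetical input: 1 s of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = log_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)   # (number of frames, number of frequency bins)
```

Hand-crafted features such as mel energies or MFCCs would be computed from this representation by further pooling the frequency bins; learned features would instead take it (or the raw frames) as network input.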
Chapter 5. Statistical Methods for Scene and Event Classification
Abstract
This chapter surveys methods for pattern classification in audio data. Broadly speaking, these methods take as input some representation of audio, typically the raw waveform or a time-frequency spectrogram, and produce semantically meaningful classification of its contents. We begin with a brief overview of statistical modeling, supervised machine learning, and model validation. This is followed by a survey of discriminative models for binary and multi-class classification problems. Next, we provide an overview of generative probabilistic models, including both maximum likelihood and Bayesian parameter estimation. We focus specifically on Gaussian mixture models and hidden Markov models, and their application to audio and time-series data. We then describe modern deep learning architectures, including convolutional networks, different variants of recurrent neural networks, and hybrid models. Finally, we survey model-agnostic techniques for improving the stability of classifiers.
Brian McFee
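As a toy illustration of the generative modeling surveyed in this chapter, here is a minimal class-conditional Gaussian classifier with maximum-likelihood parameter estimates (a single diagonal-covariance Gaussian per class, i.e., a degenerate one-component mixture). The data and dimensions are invented; a real system would use Gaussian mixtures or neural networks as the abstract describes.

```python
import numpy as np

class GaussianClassifier:
    """Generative classifier: one diagonal-covariance Gaussian per class,
    maximum-likelihood estimates, prediction by maximum posterior."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.means = {c: X[y == c].mean(axis=0) for c in self.classes}
        self.vars = {c: X[y == c].var(axis=0) + 1e-6 for c in self.classes}
        self.priors = {c: np.mean(y == c) for c in self.classes}
        return self

    def log_posterior(self, x, c):
        m, v = self.means[c], self.vars[c]
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v)
        return log_lik + np.log(self.priors[c])

    def predict(self, x):
        return max(self.classes, key=lambda c: self.log_posterior(x, c))

# Toy data: two well-separated classes in a 2-D feature space.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
clf = GaussianClassifier().fit(X, y)
print(clf.predict(np.array([5.0, 5.0])))
```

The Bayesian estimation and HMM extensions discussed in the chapter build on exactly this likelihood computation, adding priors over the parameters and temporal structure over the frames.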
Chapter 6. Datasets and Evaluation
Abstract
Developing computational systems requires methods for evaluating their performance to guide development and compare alternate approaches. A reliable evaluation procedure for a classification or recognition system will involve a standard dataset of example input data along with the intended target output, and well-defined metrics to compare the systems’ outputs with this ground truth. This chapter examines the important factors in the design and construction of evaluation datasets and goes through the metrics commonly used in system evaluation, comparing their properties. We include a survey of currently available datasets for environmental sound scene and event recognition and conclude with advice for designing evaluation protocols.
Annamaria Mesaros, Toni Heittola, Dan Ellis
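A sketch of one metric family covered in this chapter: segment-based precision, recall, and F-score, computed by comparing the sets of (segment, class) pairs active in the ground truth and in the system output. The annotations below are invented; real evaluations would use a standard toolkit and dataset.

```python
def segment_metrics(reference, estimated):
    """Segment-based precision, recall, and F-score over
    (segment index, class label) pairs."""
    ref, est = set(reference), set(estimated)
    tp = len(ref & est)            # active in both
    fp = len(est - ref)            # predicted but not annotated
    fn = len(ref - est)            # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Hypothetical annotations: (segment index, class label) pairs.
reference = [(0, "speech"), (1, "speech"), (1, "car"), (2, "car")]
estimated = [(0, "speech"), (1, "car"), (2, "speech")]
print(segment_metrics(reference, estimated))
```

Event-based metrics differ in that they match whole events with onset (and possibly offset) tolerances rather than fixed-length segments, which the chapter contrasts in detail.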

Advanced Methods

Frontmatter
Chapter 7. Everyday Sound Categorization
Abstract
This chapter reviews theories and empirical research on the ways in which people spontaneously and effortlessly categorize sounds into meaningful categories to make sense of their environment. We begin with an overview of prominent theories of categorization in the psychological literature, followed by data collection and analysis methods used in empirical research on categorization with human participants. We then focus on auditory categorization, synthesizing the main findings of studies on isolated sound events as well as complex sound scenes. Finally, we review recently proposed taxonomies for everyday sounds and conclude by providing directions for integrating insights from cognitive psychology into the design and evaluation of computational systems.
Catherine Guastavino
Chapter 8. Approaches to Complex Sound Scene Analysis
Abstract
This chapter presents state-of-the-art research and open topics for analyzing complex sound scenes in the single-microphone case. First, the concept of sound scene recognition is presented from the perspective of the different paradigms (classification, tagging, clustering, segmentation) and methods used. The core section is on sound event detection and classification, presenting various paradigms and practical considerations along with methods for monophonic and polyphonic sound event detection. The chapter then focuses on the concepts of context and “language modeling” for sound scenes, also covering relationships between sound events. Work on sound scene recognition based on event detection is also presented. Finally, the chapter summarizes the topic and provides directions for future research.
Emmanouil Benetos, Dan Stowell, Mark D. Plumbley
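The polyphonic event detection mentioned in the abstract typically post-processes per-frame, per-class activations into discrete events. The sketch below is an invented minimal version: threshold each class track, then discard runs shorter than a minimum duration (a crude stand-in for the median-filter smoothing commonly used).

```python
def activations_to_events(probs, threshold=0.5, min_len=2):
    """Turn per-frame class probabilities into (class, onset, offset)
    events: threshold each track, drop runs shorter than min_len frames."""
    events = []
    for cls, track in probs.items():
        active = [p >= threshold for p in track]
        start = None
        for i, a in enumerate(active + [False]):   # sentinel closes last run
            if a and start is None:
                start = i
            elif not a and start is not None:
                if i - start >= min_len:
                    events.append((cls, start, i))  # [onset, offset) frames
                start = None
    return sorted(events)

# Hypothetical network outputs for two overlapping classes.
probs = {
    "speech": [0.1, 0.8, 0.9, 0.7, 0.2, 0.1],
    "car":    [0.6, 0.7, 0.1, 0.9, 0.8, 0.6],
}
print(activations_to_events(probs))
```

Note that the two classes overlap in frames 1 and 3, which a polyphonic system must allow; the "language modeling" ideas in the chapter would additionally constrain which event sequences and co-occurrences are plausible.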
Chapter 9. Multiview Approaches to Event Detection and Scene Analysis
Abstract
This chapter addresses sound scene and event classification in multiview settings, that is, settings where the observations are obtained from multiple sensors, each sensor contributing a particular view of the data (e.g., audio microphones, video cameras, etc.). We briefly introduce some of the techniques that can be exploited to effectively combine the data conveyed by the different views under analysis for a better interpretation. We first provide a high-level presentation of generic methods that are particularly relevant in the context of multiview and multimodal sound scene analysis. Then, we more specifically present a selection of techniques used for audiovisual event detection and microphone array-based scene analysis.
Slim Essid, Sanjeel Parekh, Ngoc Q. K. Duong, Romain Serizel, Alexey Ozerov, Fabio Antonacci, Augusto Sarti

Applications

Frontmatter
Chapter 10. Sound Sharing and Retrieval
Abstract
Multimedia sharing has experienced enormous growth in recent years, and sound sharing has not been an exception. Nowadays one can find online sound sharing sites in which users can search, browse, and contribute large amounts of audio content such as sound effects, field and urban recordings, music tracks, and music samples. This poses many challenges to enable search, discovery, and ultimately reuse of this content. In this chapter we give an overview of different ways to approach such challenges. We describe how to build an audio database by outlining different aspects to be taken into account. We discuss metadata-based descriptions of audio content and different searching and browsing techniques that can be used to navigate the database. In addition to metadata, we show sound retrieval techniques based on the extraction of audio features from (possibly) unannotated audio. We end the chapter by discussing advanced approaches to sound retrieval and by drawing some conclusions about the present and future of sound sharing and retrieval. In addition to our explanations, we provide code examples that illustrate some of the concepts discussed.
Frederic Font, Gerard Roma, Xavier Serra
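Feature-based retrieval of the kind the abstract describes can be reduced to ranking database items by the similarity of their feature vectors to a query. The sound names and three-dimensional vectors below are invented stand-ins for, e.g., averaged MFCCs; this is a sketch, not the chapter's actual code examples.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, database, k=2):
    """Rank database sounds by similarity to the query; return top k ids."""
    ranked = sorted(database.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Invented database: sound id -> feature vector.
database = {
    "rain.wav":    [0.9, 0.1, 0.0],
    "thunder.wav": [0.8, 0.3, 0.1],
    "speech.wav":  [0.0, 0.2, 0.9],
}
print(retrieve([1.0, 0.0, 0.0], database))
```

A production system would replace the linear scan with an index structure and combine this content-based ranking with the metadata-based search the chapter also covers.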
Chapter 11. Computational Bioacoustic Scene Analysis
Abstract
The analysis of natural and animal sound makes a demonstrable contribution to important challenges in conservation, animal behaviour, and evolution. Bioacoustics has now entered its big data era, which makes automation important, as well as scalability, in many cases to very large amounts of audio data and to real-time processing. This chapter focuses on the data science and computational methods that can enable this. Computational bioacoustics has some commonalities with wider audio scene analysis, as well as with speech processing and other disciplines. However, the tasks required and the specific characteristics of bioacoustic data require new and adapted techniques. This chapter surveys the tasks and methods of computational bioacoustics, placing particular emphasis on existing work and future prospects that address scalable analysis. We mostly focus on airborne sound; there has also been much work on freshwater and marine bioacoustics, and a small amount on solid-borne sound.
Dan Stowell
Chapter 12. Audio Event Recognition in the Smart Home
Abstract
After giving a brief overview of the relevance and value of deploying automatic audio event recognition (AER) in the smart home market, this chapter reviews three aspects of the productization of AER which are important to consider when developing pathways to impact between fundamental research and “real-world” applicative outlets. The first section shows that applications introduce a variety of practical constraints which elicit new research topics in the field: clarifying the definition of sound events, thus suggesting interest in the explicit modeling of temporal patterns and interruption; running and evaluating AER in 24/7 sound detection setups, which suggests recasting the problem as open-set recognition; and running AER applications on consumer devices with limited audio quality and computational power, thus triggering interest in scalability and robustness. The second section explores the definition of user experience for AER. After reporting field observations about the ways in which system errors affect user experience, it proposes introducing opinion scoring into AER evaluation methodology. It then explores the link between standard AER performance metrics and subjective user experience metrics, and draws attention to the fact that F-score metrics actually mash up the objective evaluation of acoustic discrimination with the subjective choice of an application-dependent operating point. Solutions for separating discrimination and calibration in system evaluation are introduced, allowing a more explicit separation of acoustic modeling optimization from that of application-dependent user experience. Finally, the last section analyses the ethical and legal issues involved in deploying AER systems which are “listening” at all times in the users’ private space. A review of the key notions underpinning European data and privacy protection laws, questioning if and when these apply to audio data, suggests a set of guidelines which can be summarized as empowering users to consent by fully informing them about the use of their data, as well as taking reasonable information security measures to protect users’ personal data.
Sacha Krstulović
Chapter 13. Sound Analysis in Smart Cities
Abstract
This chapter introduces the concept of smart cities and discusses the importance of sound as a source of information about urban life. It describes a wide range of applications for the computational analysis of urban sounds and focuses on two high-impact areas, audio surveillance and noise pollution monitoring, which sit at the intersection of dense sensor networks and machine listening. For sensor networks we focus on the pros and cons of mobile versus static sensing strategies, and the description of a low-cost solution to acoustic sensing that supports distributed machine listening. For sound event detection and classification we focus on the challenges presented by this task, solutions including feature design and learning strategies, and how a combination of convolutional networks and data augmentation results in the current state of the art. We close with a discussion of the potential and challenges of mobile sensing, the limitations imposed by the data currently available for research, and a few areas for future exploration.
Juan Pablo Bello, Charlie Mydlarz, Justin Salamon
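The data augmentation mentioned in the abstract enlarges a training set by applying label-preserving transformations to each clip. The sketch below shows two of the simplest such transformations, a random circular time shift and additive noise at a chosen SNR; the parameter values and toy waveform are invented, and real pipelines also use pitch shifting, time stretching, and background mixing.

```python
import numpy as np

def augment(signal, rng, snr_db=20.0, max_shift=100):
    """Label-preserving augmentation: random circular time shift,
    then additive Gaussian noise scaled to the requested SNR."""
    shifted = np.roll(signal, rng.integers(-max_shift, max_shift + 1))
    signal_power = np.mean(shifted ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=shifted.shape)
    return shifted + noise

rng = np.random.default_rng(0)
clip = np.sin(2 * np.pi * np.arange(1600) / 16)   # toy waveform
batch = [augment(clip, rng) for _ in range(4)]    # four distinct variants
print(len(batch), batch[0].shape)
```

Each variant keeps the original class label, so a convolutional network trained on the augmented set sees many more acoustic conditions than the raw recordings provide.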

Perspectives

Frontmatter
Chapter 14. Future Perspective
Abstract
This book has covered the underlying principles and technologies of sound recognition, and described several current application areas. However, the field is still very young; this chapter briefly outlines several emerging areas, particularly relating to the provision of the very large training sets that can be exploited by deep learning approaches. We also forecast some of the technological and application advances we expect in the short-to-medium future.
Dan Ellis, Tuomas Virtanen, Mark D. Plumbley, Bhiksha Raj
Backmatter
Metadata
Title
Computational Analysis of Sound Scenes and Events
Edited by
Prof. Dr. Tuomas Virtanen
Prof. Dr. Mark D. Plumbley
Prof. Dan Ellis
Copyright Year
2018
Electronic ISBN
978-3-319-63450-0
Print ISBN
978-3-319-63449-4
DOI
https://doi.org/10.1007/978-3-319-63450-0
