
2007 | Book

Information Retrieval for Music and Motion


About this Book

A general scenario that has attracted considerable attention in multimedia information retrieval is based on the query-by-example paradigm: retrieve all documents from a database containing parts or aspects similar to a given data fragment. However, multimedia objects, even if they are similar from a structural or semantic viewpoint, often reveal significant spatial or temporal differences. This makes content-based multimedia retrieval a challenging research field with many unsolved problems.

Meinard Müller details concepts and algorithms for robust and efficient information retrieval by means of two different types of multimedia data: waveform-based music data and human motion data. In Part I, he discusses in depth several approaches in music information retrieval, in particular general strategies as well as efficient algorithms for music synchronization, audio matching, and audio structure analysis. He also shows how the analysis results can be used in an advanced audio player to facilitate additional retrieval and browsing functionality. In Part II, he introduces a general and unified framework for motion analysis, retrieval, and classification, highlighting the design of suitable features, the notion of similarity used to compare data streams, and data organization. The detailed chapters at the beginning of each part give consideration to the interdisciplinary character of this field, covering information science, digital signal processing, audio engineering, musicology, and computer graphics.

This first monograph specializing in music and motion retrieval appeals to a wide audience, from graduate students and lecturers to scientists working in the above-mentioned fields in academia or industry. Lecturers and students will benefit from the didactic style, and each unit is suitable for stand-alone use in specialized graduate courses. Researchers will be interested in the detailed description of original research results and their application in real-world browsing and retrieval scenarios.

Table of Contents

Frontmatter

Analysis and Retrieval Techniques for Music Data

Frontmatter
1. Introduction
In this chapter, we provide motivating and domain-specific introductions to the information retrieval problems raised in this book. Sect. 1.1 covers the music domain and Sect. 1.2 the motion domain. These two sections also include an outline of the two parts, provide a summary of all chapters, and discuss general literature relevant to music information retrieval and motion retrieval, respectively. Finally, in Sect. 1.3, we reveal the conceptual relations between the two parts. In particular, we point out general concepts for content-based information retrieval that apply to both the music and the motion domain, and even beyond.
2. Fundamentals on Music and Audio Data
In the first part of this monograph, we discuss content-based analysis and retrieval techniques for music and audio data. To account for the interdisciplinary character of this research field, we start in this chapter with some fundamentals on music representations and digital signal processing. In particular, we summarize basic facts on the score, MIDI, and audio formats (Sect. 2.1). We then review various forms of the Fourier transform (Sect. 2.2) and give a short account of digital convolution filters (Sect. 2.3). In doing so, we hope to refine and sharpen the understanding of the required basic signal transforms, which will be essential for the design as well as the proper interpretation of musically relevant audio features (see Chap. 3).
3. Pitch- and Chroma-Based Audio Features
Automatic music processing poses a number of challenging questions because of the complexity and diversity of music data. As discussed in Sect. 2.1, one generally has to account for various aspects such as the data format (e.g., score, MIDI, audio), the instrumentation (e.g., orchestra, piano, drums, voice), and many other parameters such as articulation, dynamics, or tempo. To make music data comparable and algorithmically accessible, the first step in all music processing tasks is to extract suitable features that capture relevant key aspects while suppressing irrelevant details or variations. Here, the notion of similarity is of crucial importance in the design of audio features. In some applications, and particularly in the case of music retrieval, one may be interested in characterizing an audio recording irrespective of certain details concerning the interpretation or instrumentation. Conversely, other applications may be concerned with measuring just those nuances that relate to a musician's individual articulation or emotional expressiveness.
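To make the feature design tangible, the following minimal sketch pools the bins of a magnitude spectrum into a 12-dimensional chroma vector by folding pitches onto pitch classes. This is our own illustration, not the book's reference implementation; the tuning frequency, the nearest-pitch binning, and the normalization are all simplifying assumptions.

```python
import numpy as np

def chroma_from_spectrum(mag_spec, sr, n_fft, tuning_a4=440.0):
    """Pool STFT magnitude bins into a 12-dimensional chroma vector.

    Each frequency bin is mapped to its nearest MIDI pitch and then
    folded onto one of the 12 pitch classes (C, C#, ..., B).
    """
    freqs = np.arange(1, len(mag_spec)) * sr / n_fft  # skip the DC bin
    midi = 69 + 12 * np.log2(freqs / tuning_a4)       # bin -> MIDI pitch
    pitch_class = np.round(midi).astype(int) % 12
    chroma = np.zeros(12)
    for pc, mag in zip(pitch_class, mag_spec[1:]):
        chroma[pc] += mag
    return chroma / max(chroma.sum(), 1e-12)          # normalize to unit sum
```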
4. Dynamic Time Warping
Dynamic time warping (DTW) is a well-known technique to find an optimal alignment between two given (time-dependent) sequences under certain restrictions (Fig. 4.1). Intuitively, the sequences are warped in a nonlinear fashion to match each other. Originally, DTW was used to compare different speech patterns in automatic speech recognition; see [170]. In fields such as data mining and information retrieval, DTW has been successfully applied to automatically cope with time deformations and different speeds associated with time-dependent data.
In this chapter, we introduce and discuss the main ideas of classical DTW (Sect. 4.1) and summarize several modifications concerning local as well as global parameters (Sect. 4.2). To speed up classical DTW, we describe in Sect. 4.3 a general multiscale DTW approach. In Sect. 4.4, we show how DTW can be employed to identify all subsequences within a long data stream that are similar to a given query sequence. A discussion of related alignment techniques and references to the literature can be found in Sect. 4.5.
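As a concrete illustration of the classical recursion, here is a minimal Python sketch using the basic step sizes (1, 0), (0, 1), and (1, 1); it omits the local and global modifications of Sect. 4.2 and the multiscale speed-up of Sect. 4.3, and the function and parameter names are ours.

```python
import numpy as np

def dtw(X, Y, dist=lambda x, y: np.linalg.norm(x - y)):
    """Classical DTW: fill the accumulated cost matrix D, then backtrack
    an optimal warping path. Quadratic in time and memory."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist(X[i - 1], Y[j - 1]) + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to (1, 1) along optimal predecessors.
    path, i, j = [], n, m
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda s: D[s])
    path.append((0, 0))
    return D[n, m], path[::-1]
```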
5. Music Synchronization
Modern digital music libraries contain textual, visual, and audio data. Recall that musical information is represented in diverse data formats which, depending upon the respective application, differ fundamentally in their structure and content. In this chapter, we introduce various synchronization tasks to automatically link data streams in different formats (score, MIDI, audio) that represent the same piece of music (Sect. 5.1). In particular, two different synchronization procedures are described in detail. First, we present an efficient and robust multiscale DTW approach for time-aligning two different CD recordings of the same piece (Sect. 5.2). Using chroma-based audio features, our algorithm yields good synchronization results for harmony-based music at a resolution level that is sufficient for music retrieval and navigation applications. Second, we discuss an algorithm for score-audio synchronization, which aligns the musical onset times given by a score with their physical occurrences in a CD recording of the same piece (Sect. 5.3). Using semantically meaningful onset features, this algorithm works particularly well for piano music and yields alignments at a high temporal resolution. In Sect. 5.4, we describe possible research directions, give further references to the literature, and discuss some problems related to music synchronization. The first three sections of this chapter closely follow [5], [142], and [141], respectively.
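Once a warping path between two interpretations has been computed, e.g., with the DTW sketch after the Chap. 4 summary, navigation reduces to reading positions off the path. The following fragment is a deliberately naive sketch of this step: the 10 Hz feature rate and the linear scan over the path are illustrative assumptions, not the book's procedure.

```python
def map_position(path, t_seconds, feature_rate=10.0):
    """Map a time position in recording X to the aligned position in
    recording Y via a DTW warping path over feature frames."""
    frame_x = round(t_seconds * feature_rate)
    # Pick the path cell whose X-frame is closest to the query frame.
    i, j = min(path, key=lambda cell: abs(cell[0] - frame_x))
    return j / feature_rate  # aligned position in Y, in seconds
```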
6. Audio Matching
In the context of music retrieval, the query-by-example paradigm has attracted a large amount of attention: given a query in the form of a music excerpt, the task is to automatically retrieve all excerpts from the database containing parts or aspects which are somehow similar to the query. This problem is particularly difficult for digital waveform-based audio data such as CD recordings. Because of the complexity of such data, the notion of similarity used to compare different audio clips is a delicate issue and largely depends on the respective application as well as on the user requirements.
In this chapter, we consider the problem of audio matching. Here the goal is to retrieve all audio clips from the database that in some sense represent the same musical content as the query clip. This is typically the case when the same piece of music is available in several interpretations and arrangements. For example, given a 20-s excerpt of Bernstein’s interpretation of the theme of Beethoven’s Fifth Symphony, the goal is to find all other corresponding audio clips in the database; this includes the repetition in the exposition or in the recapitulation within the same interpretation as well as the corresponding excerpts in all recordings of the same piece conducted, e.g., by Karajan or Sawallisch. Even more challenging is to also include arrangements such as Liszt’s piano transcription of Beethoven’s Fifth or a synthesized version of a corresponding MIDI file. Obviously, the degree of difficulty increases with the degree of variations one wants to permit in the audio matching.
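The following sketch conveys the flavor of such a matching procedure: a chroma-feature query is slid over the database feature sequence, and windows of low average distance are reported. It is a strong simplification of the chapter's technique; in particular, tempo variations (handled, e.g., by creating multiple scaled versions of the query) and the suppression of overlapping hits are omitted, and the threshold is an arbitrary assumption.

```python
import numpy as np

def audio_matches(query, db, threshold=0.2):
    """Report database start frames whose window is close to the query.

    query: array of shape (L, 12), normalized chroma frames
    db:    array of shape (N, 12), the database feature sequence
    """
    q, d = np.asarray(query), np.asarray(db)
    L, hits = len(q), []
    for start in range(len(d) - L + 1):
        cost = np.mean(np.linalg.norm(d[start:start + L] - q, axis=1))
        if cost < threshold:
            hits.append((start, cost))
    return sorted(hits, key=lambda h: h[1])  # best matches first
```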
7. Audio Structure Analysis
The alignments and crosslinks obtained from audio synchronization (Chap. 5) and audio matching (Chap. 6) can be used to conveniently switch between different versions of a piece of music (interdocument navigation). We will now address the problem of audio structure analysis, which lays the basis for intradocument navigation. One major goal of the structural analysis of an audio recording is to automatically extract the repetitive structure or, more generally, the musical form of the underlying piece of music. Recent approaches such as [14, 37, 46, 49, 81, 127, 129, 139, 161] work well for music where the repetitions largely agree with respect to instrumentation and tempo, as is typically the case for popular music. For other classes of music, including Western classical music, however, musically similar audio segments may exhibit significant variations in parameters such as dynamics, timbre, execution of note groups, musical key, articulation, and tempo progression. In this chapter, we propose robust and efficient algorithms for structure analysis that identify musically similar segments. To obtain a flexible and robust algorithm, the idea is to simultaneously account for possible variations at various stages and levels. At the feature level, we use coarse chroma-based audio features that absorb microvariations. To cope with local variations, we design an advanced cost measure by integrating contextual information (Sect. 7.2). Finally, we describe a new strategy for structure extraction that can cope with more global variations (Sects. 7.3 and 7.4). Our experimental results with classical and popular music show that our algorithm performs successfully even in the presence of significant musical variations (Sect. 7.5). In Sect. 7.1, we start by summarizing a general strategy for audio structure analysis and introduce some notation that is used throughout this chapter. Related work and future research directions are discussed in Sect. 7.6. In this chapter, we closely follow [139]. The enhancement of self-similarity matrices by means of a contextual local cost measure was first described in [137].
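The starting point of such an analysis is a self-similarity matrix over the feature sequence, in which repeated segments appear as low-cost paths parallel to the main diagonal. The sketch below computes such a matrix and smooths it along the diagonal; this diagonal averaging is only a crude stand-in for the contextual cost measure of Sect. 7.2, and the parameters are illustrative.

```python
import numpy as np

def self_similarity(F, context=4):
    """Pairwise frame distances S, enhanced by averaging each entry
    over a short diagonal context. F has shape (N, d)."""
    F = np.asarray(F)
    N = len(F)
    S = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=2)
    E = np.full((N, N), np.inf)
    for i in range(N - context + 1):
        for j in range(N - context + 1):
            E[i, j] = np.mean(np.diagonal(S[i:i + context, j:j + context]))
    return E  # low-cost diagonals indicate repeated segments
```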
8. SyncPlayer: An Advanced Audio Player
In the previous chapters, we have discussed various MIR techniques and algorithms for automatically generating annotations and linking structures of interrelated music data. The generated data can be used to support inter- and intradocument browsing and retrieval in complex and inhomogeneous music collections, thus allowing users to discover and explore music in an intuitive and multimodal way. To demonstrate the potential of our MIR techniques, we have developed the SyncPlayer system [114], which is a client-server-based advanced audio player. The SyncPlayer integrates novel functionalities for multimodal presentation of audio as well as symbolic data and comprises a search engine for lyrics and other metadata. In Sect. 8.1, we give an overview of the SyncPlayer system, which consists of a server as well as a client component. The server component, as described in Sect. 8.2, includes functionalities such as audio identification, data retrieval, and data delivery. In contrast, the client component constitutes the user front end of the system and provides the user interfaces for the services offered by the server (Sect. 8.3). A discussion of related work and possible extensions of our system can be found in Sect. 8.4. A demo version of the SyncPlayer is available at [199].

Analysis and Retrieval Techniques for Motion Data

Frontmatter
9. Fundamentals on Motion Capture Data
The second part of this monograph deals with content-based analysis and retrieval of 3D motion capture data as used in computer graphics for animating virtual human characters. In this chapter, we provide the reader with some fundamental facts on motion representations. We start with a short introduction on motion capturing and introduce a mathematical model for the motion data as used throughout the subsequent chapters (Sect. 9.1). We continue with a detailed discussion of general similarity aspects that are crucial in view of motion comparison and retrieval (Sect. 9.2). Then, in Sect. 9.3, we formally introduce the concept of kinematic chains, which are generally used to model flexibly linked rigid bodies such as robot arms or human skeletons. Kinematic chains are parameterized by joint angles, which in turn can be represented in various ways. In Sect. 9.4, we describe and compare three important angle representations based on rotation matrices, Euler angles, and quaternions. Each of these representations has its strengths and weaknesses depending on the respective analysis or synthesis application.
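As a small illustration of the quaternion representation compared in Sect. 9.4, the sketch below builds a unit quaternion from an axis-angle pair, composes rotations via the Hamilton product, and rotates a vector. The function names are our own; the formulas are the standard ones.

```python
import numpy as np

def quat_from_axis_angle(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` radians
    about the vector `axis` (normalized internally)."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    return np.concatenate(([np.cos(angle / 2)], np.sin(angle / 2) * axis))

def quat_multiply(q, r):
    """Hamilton product q * r; when composing rotations, r acts first."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def rotate(q, v):
    """Rotate 3D vector v by unit quaternion q via q * (0, v) * q^-1."""
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    return quat_multiply(quat_multiply(q, np.concatenate(([0.0], v))),
                         q_conj)[1:]
```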
10. DTW-Based Motion Comparison and Retrieval
As we have seen in Chap. 4, dynamic time warping is a flexible tool for comparing time series in the presence of nonlinear time deformations. In this context, the choice of suitable local cost or distance measures is of crucial importance, since they determine the kind of (spatial) similarity between the elements (frames) of the two sequences to be aligned. For the mocap domain, we introduce two conceptually different local distance measures – one based on joint angle parameters and the other based on 3D coordinates – and discuss their respective strengths and weaknesses (Sect. 10.1). The importance of DTW is then illustrated by some synthesis and analysis applications (Sect. 10.2). By comparing a motion data stream to itself, one obtains a cost or distance matrix that exhibits self-similarities within the motion. In Sect. 10.3, we describe how this idea can be exploited for motion retrieval. Finally, in Sect. 10.4, we discuss some work related to DTW-based motion retrieval.
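A minimal sketch of a 3D-coordinate-based local distance measure in the spirit of Sect. 10.1 is given below; it can be passed directly as the dist argument of the DTW sketch after the Chap. 4 summary. As a simplification on our part, only the global translation is factored out, whereas a full measure would also remove the rotation about the vertical axis.

```python
import numpy as np

def pose_distance_3d(P, Q):
    """Average joint-wise Euclidean distance between two poses given as
    (J, 3) arrays of 3D joint coordinates, after centering both poses."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    return float(np.mean(np.linalg.norm(P - Q, axis=1)))
```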
11. Relational Features and Adaptive Segmentation
Even though there is a rapidly growing corpus of motion capture data, there is still a lack of efficient motion retrieval systems that allow one to identify and extract user-specified motions. Previous retrieval systems often require manually generated textual annotations, which roughly describe the motions in words. Since the manual generation of reliable and descriptive labels is infeasible for large datasets, one needs efficient content-based retrieval methods that only access the raw data itself. In this context, the query-by-example (QBE) paradigm has attracted a large amount of attention: given a query in the form of a motion fragment, the task is to automatically retrieve all motion clips from the database containing parts or aspects similar to the query. The crucial point in such an approach is the notion of similarity used to compare the query with the database motions. For the motion scenario, two motions may be regarded as similar if they represent variations of the same action or sequence of actions. These variations may concern the spatial as well as the temporal domain. For example, the two jumps shown in Fig. 11.1 describe the same kind of motion, even though they differ considerably with respect to timing, intensity, and execution style (note, e.g., the arm swing). Similarly, the kicks shown in Fig. 1.1 describe the same kind of motion, even though they differ considerably with respect to direction and height of the kick. In other words, semantically similar motions need not be numerically similar, as is also pointed out in [107].
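A generic relational feature of the kind developed in this chapter can be sketched as a plane test: the boolean value records on which side of an oriented plane, spanned by three joints, a fourth joint lies. This is our minimal rendering of the idea; the concrete joint assignment in the docstring is only one illustrative instance.

```python
import numpy as np

def in_front_of_plane(p, a, b, c):
    """1 if point p lies in front of the oriented plane through a, b, c
    (normal by the right-hand rule), else 0. With, e.g., p = right ankle
    and a, b, c = left hip, right hip, neck, this yields a boolean
    'right foot raised to the front' feature that is invariant to global
    position, global orientation, and uniform body size differences."""
    normal = np.cross(b - a, c - a)
    return int(np.dot(p - a, normal) > 0.0)
```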
12. Index-Based Motion Retrieval
In Chap. 11, we addressed the question of how to make motion capture data efficiently comparable by introducing the concept of feature sequences, which represent motion capture data streams as coarse sequences of binary vectors. In Sect. 12.1, we will formally introduce the concepts of exact hits, fuzzy hits, and adaptive fuzzy hits. We then describe how one can compute such hits using an inverted file index. The proposed indexing and matching techniques can be put to use in a variety of query modes, ranging from isolated pose-based queries up to query-by-example (QBE), where the user supplies the system with a short query motion clip. In Sect. 12.2, we present a flexible and efficient QBE-based motion retrieval system and report on experimental results. Furthermore, we show how our relational approach to motion comparison can be used as a general tool for efficient motion preprocessing (Sect. 12.3). Finally, we discuss some problems and limitations of the presented index-based techniques and close with a discussion of related work (Sect. 12.4).
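The core of such an index can be sketched in a few lines: each distinct binary feature vector acts as a word in an inverted file, and an exact hit is a database position at which the query vectors occur consecutively. This is our minimal rendering of the idea; fuzzy and adaptive fuzzy hits relax the per-frame equality and are not shown.

```python
from collections import defaultdict

def build_index(db_sequences):
    """Inverted file: map each binary feature vector (as a tuple) to
    all (sequence id, frame) positions where it occurs."""
    index = defaultdict(list)
    for sid, seq in enumerate(db_sequences):
        for t, vec in enumerate(seq):
            index[tuple(vec)].append((sid, t))
    return index

def exact_hits(index, query):
    """All (sequence id, start frame) positions where the query feature
    sequence occurs verbatim, found by intersecting posting lists."""
    hits = set(index.get(tuple(query[0]), []))
    for k, vec in enumerate(query[1:], start=1):
        hits &= {(sid, t - k) for sid, t in index.get(tuple(vec), [])}
    return sorted(hits)
```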
13. Motion Templates
In this chapter, we introduce a method for capturing the spatio-temporal characteristics of an entire class of semantically related motions in a compact and explicit matrix representation called a motion template (MT). Motion templates, which can be regarded as generalized boolean feature matrices, are formally introduced in Sect. 13.1. Employing an iterative warping and averaging strategy, we then describe an efficient algorithm that automatically derives a motion template from a class of training motions. We summarize the main ideas of this algorithm in Sect. 13.2 before giving the technical details in Sect. 13.3. In Sect. 13.4, we report on our experiments on template learning and discuss a number of illustrative examples to demonstrate the descriptive power of motion templates. Finally, in Sect. 13.5, we close with some general remarks on the multiple alignment problem underlying our learning procedure. In this and the following chapter, we closely follow Müller and Röder [143]. An accompanying video is available at [144].
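Stripped of the iterative warping, the averaging step at the heart of template learning is easy to state: once all training motions have been warped to a common time axis (e.g., with DTW), their boolean feature matrices are averaged entrywise. Entries near 0 or 1 then mark class-consistent aspects, entries in between mark inconsistent ones. The sketch below shows only this final step and presupposes pre-aligned input, which is an assumption on our part.

```python
import numpy as np

def motion_template(aligned_motions):
    """Entrywise average of a list of equally sized boolean feature
    matrices (all warped to a common length beforehand). The result is
    a real-valued matrix with entries in [0, 1]."""
    stack = np.stack([np.asarray(M, dtype=float) for M in aligned_motions])
    return stack.mean(axis=0)
```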
14. MT-Based Motion Annotation and Retrieval
Given a class of semantically related motions, we have derived a class motion template that captures the consistent as well as the inconsistent aspects of all motions in the class. The application of MTs to automatic motion annotation and retrieval, which is the content of this chapter, is based on the following interpretation: the consistent aspects of a class MT represent the class characteristics that are shared by all motions, whereas the inconsistent aspects represent the class variations that are due to different realizations. The key idea in designing a distance measure for comparing a class MT with unknown motion data is to mask out the inconsistent aspects, a kind of class-dependent adaptive feature selection, so that related motions can be identified even in the presence of significant spatio-temporal variations. In Sect. 14.1, we define such a distance measure, which is based on a subsequence variant of DTW. Our concepts of MT-based annotation and retrieval are then described in Sect. 14.2 and Sect. 14.3, respectively, where we also report on our extensive experiments [143, 144]. To substantially speed up the annotation and retrieval process, we introduce an index-based preprocessing step, with an index that is independent of the class MTs, to cut down the set of candidate motions using suitable keyframes (Sect. 14.4). In Sect. 14.5, we compare MT-based matching to several baseline methods (based on numerical features) as well as to adaptive fuzzy querying. Finally, related work and future research directions are discussed in Sect. 14.6.
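The masking idea can be sketched as a local distance between one template column and one boolean feature frame: template entries away from 0 and 1 are treated as wildcards and excluded from the comparison. Plugged into subsequence DTW, this yields a toy version of MT-based matching; the threshold delta is an illustrative choice of ours, not a value from the book.

```python
import numpy as np

def masked_frame_distance(mt_col, frame, delta=0.1):
    """Compare one motion template column with one boolean feature
    frame, ignoring inconsistent (masked) template entries."""
    mt_col = np.asarray(mt_col, dtype=float)
    frame = np.asarray(frame, dtype=float)
    consistent = (mt_col <= delta) | (mt_col >= 1.0 - delta)
    if not consistent.any():
        return 0.0  # a fully masked column matches any frame
    return float(np.mean(np.abs(mt_col[consistent] - frame[consistent])))
```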
Backmatter
Metadata
Title
Information Retrieval for Music and Motion
Author
Meinard Müller
Copyright Year
2007
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-74048-3
Print ISBN
978-3-540-74047-6
DOI
https://doi.org/10.1007/978-3-540-74048-3
