
About this book

Covering some of the most cutting-edge research on the delivery and retrieval of interactive multimedia content, this volume of specially chosen contributions provides an up-to-date perspective on one of the hottest contemporary topics. The material represents extended versions of papers presented at the 11th International Workshop on Image Analysis for Multimedia Interactive Services, a vital international forum in this fast-moving field.

Logically organized in discrete sections that approach the subject from its various angles, the content deals in turn with content analysis, motion and activity analysis, high-level descriptors and video retrieval, 3-D and multi-view, and multimedia delivery. The chapters cover the finest detail of emerging techniques such as the use of high-level audio information in improving scene segmentation and the use of subjective logic for forensic visual surveillance. On content delivery, the book examines both images and video, focusing on key subjects including an efficient pre-fetching strategy for JPEG 2000 image sequences. Further contributions look at new methodologies for simultaneous block reconstruction and provide a trellis-based algorithm for faster motion-vector decision making.



Multimedia content analysis

Chapter 1. On the Use of Audio Events for Improving Video Scene Segmentation

This work deals with the problem of automatic temporal segmentation of a video into elementary semantic units known as scenes. Its novelty lies in the use of high-level audio information, in the form of audio events, for the improvement of scene segmentation performance. More specifically, the proposed technique is built upon a recently proposed audio-visual scene segmentation approach that involves the construction of multiple scene transition graphs (STGs) that separately exploit information coming from different modalities. In the extension of the latter approach presented in this work, audio event detection results are introduced to the definition of an audio-based scene transition graph, while a visual-based scene transition graph is also defined independently. The results of these two types of STGs are subsequently combined. The results of the application of the proposed technique to broadcast videos demonstrate the usefulness of audio events for scene segmentation and highlight the importance of introducing additional high-level information to the scene segmentation algorithms.
Panagiotis Sidiropoulos, Vasileios Mezaris, Ioannis Kompatsiaris, Hugo Meinedo, Miguel Bugalho, Isabel Trancoso

Chapter 2. Region-Based Caption Text Extraction

This chapter presents a method for caption text detection. The proposed method will be included in a generic indexing system dealing with other semantic concepts which are to be automatically detected as well. To have a coherent detection system, the various object detection algorithms use a common image description, a hierarchical region-based image model. The proposed method takes advantage of texture and geometric features to detect the caption text. Texture features are estimated using wavelet analysis and mainly applied for text candidate spotting. In turn, text characteristics verification relies on geometric features, which are estimated exploiting the region-based image model. Analysis of the region hierarchy provides the final caption text objects. The final step of consistency analysis for output is performed by a binarization algorithm that robustly estimates the thresholds on the caption text area of support.
Miriam Leon, Veronica Vilaplana, Antoni Gasull, Ferran Marques

Chapter 3. k-NN Boosting Prototype Learning for Object Classification

Image classification is a challenging task in computer vision. For example, fully understanding real-world images may involve both scene and object recognition. Many approaches have been proposed to extract meaningful descriptors from images and classify them in a supervised learning framework. In this chapter, we revisit the classic k-nearest neighbors (k-NN) classification rule, which has been shown to be very effective when dealing with local image descriptors. However, k-NN still has some major drawbacks, mainly due to the uniform voting among the nearest prototypes in the feature space. We propose a generalization of the classic k-NN rule in a supervised learning (boosting) framework. Namely, we redefine the voting rule as a strong classifier that linearly combines predictions from the k closest prototypes. In order to induce this classifier, we propose a novel learning algorithm, MLNN (Multiclass Leveraged Nearest Neighbors), which gives a simple procedure for performing prototype selection very efficiently. We tested our method first on object classification using 12 categories of objects, then on scene recognition as well, using 15 real-world categories. Experiments show significant improvement over classic k-NN in terms of classification performance.
Paolo Piro, Michel Barlaud, Richard Nock, Frank Nielsen
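
The difference between uniform voting and a linearly weighted vote over the k closest prototypes, as described in the Chapter 3 abstract, can be illustrated with a minimal sketch. This is not the MLNN algorithm: the per-prototype weights below are hypothetical hand-picked values, whereas the chapter learns them by boosting. The function name `knn_predict` and the toy data are assumptions made for illustration.

```python
import math
from collections import defaultdict

def knn_predict(query, prototypes, k, weights=None):
    """Classify `query` by a (possibly weighted) vote of its k nearest prototypes.

    prototypes: list of (feature_vector, label) pairs.
    weights: optional per-prototype coefficients (hypothetical here;
             a boosting procedure would learn them).
    """
    if weights is None:
        weights = [1.0] * len(prototypes)  # uniform voting = classic k-NN
    # Indices of the k prototypes closest to the query (Euclidean distance).
    nearest = sorted(range(len(prototypes)),
                     key=lambda i: math.dist(query, prototypes[i][0]))[:k]
    scores = defaultdict(float)
    for i in nearest:
        scores[prototypes[i][1]] += weights[i]  # linear combination of votes
    return max(scores, key=scores.get)

# Toy data: two prototypes of class "a", one of class "b".
prototypes = [((0.0, 0.0), "a"), ((0.2, 0.0), "a"), ((1.0, 0.0), "b")]
query = (0.45, 0.0)

uniform = knn_predict(query, prototypes, k=3)
weighted = knn_predict(query, prototypes, k=3, weights=[0.2, 0.2, 1.0])
```

With uniform votes the two "a" prototypes outvote the single "b" prototype; giving the "b" prototype a larger coefficient reverses the decision, which is the kind of effect a learned voting rule can exploit.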

Motion and activity analysis

Chapter 4. Semi-Automatic Object Tracking in Video Sequences by Extension of the MRSST Algorithm

The objective of this work is to investigate a new approach for segmentation of real-world objects in video sequences. While some amount of user interaction is still necessary for most algorithms in this field to produce adequate results, this interaction can be reduced by making use of certain properties of graph-based image segmentation algorithms. Based on one of these algorithms, a framework is proposed that tracks individual foreground objects through arbitrary video sequences and partly automates the corrections required from the user. Experimental results suggest that the proposed algorithm performs well on both low- and high-resolution video sequences and can even, to a certain extent, cope with motion blur and gradual object deformations.
Marko Esche, Mustafa Karaman, Thomas Sikora

Chapter 5. A Multi-Resolution Particle Filter Tracking with a Dual Consistency Check for Model Update in a Multi-Camera Environment

In this chapter, we present a novel tracking method with a multi-resolution approach and a dual model check to track a non-rigid object in an uncalibrated static multi-camera environment. It is based on particle filter methods using color features. The major contributions of the method are: multi-resolution tracking to handle strong and non-biased object motion by short-term particle filters; and a stratified model consistency check by means of the Kolmogorov-Smirnov test and object-trajectory-based view-corresponding deformation in a multi-camera environment.
Yifan Zhou, Jenny Benois-Pineau, Henri Nicolas

Chapter 6. Activity Detection Using Regular Expressions

In this chapter we propose a novel method to analyze trajectories in surveillance scenarios by means of Context-Free Grammars (CFGs). Given a training corpus of trajectories associated with a set of actions, a preliminary processing phase is carried out to characterize the paths as sequences of symbols. This representation turns the numerical representation of the coordinates into a syntactical description of the activity structure, which is subsequently adopted to identify different behaviors through the CFG models. Such modeling is the basis for the classification and matching of new trajectories against the learned templates, and it is carried out through a parsing engine that enables the online recognition of human activities. An additional module is provided to recover from parsing errors (i.e., insertion, deletion, or substitution of symbols) and update the activity models previously learned. The proposed system has been validated indoors, in an assisted living context, demonstrating good capabilities in recognizing activity patterns in different configurations, in particular in the presence of noise in the acquired trajectories and in the case of concatenated and nested actions.
Mattia Daldoss, Nicola Piotto, Nicola Conci, Francesco G. B. De Natale

Chapter 7. Shape Adaptive Mean Shift Object Tracking Using Gaussian Mixture Models

GMM-SAMT, a new object tracking algorithm based on a combination of the mean shift principle and Gaussian mixture models (GMMs), is presented. GMM-SAMT stands for Gaussian mixture model based shape adaptive mean shift tracking. Instead of a symmetrical kernel as in traditional mean shift tracking, GMM-SAMT uses an asymmetric shape-adapted kernel which is retrieved from an object mask. During the mean shift iterations the kernel scale is altered according to the object scale, providing an initial adaptation of the object shape. The final shape of the kernel is then obtained by segmenting the area inside and around the adapted kernel into object and non-object segments using Gaussian mixture models.
Katharina Quast, André Kaup
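
For readers unfamiliar with the mean shift principle that Chapter 7 builds on, the following is a minimal sketch of the basic iteration with a symmetric Gaussian kernel. It is the traditional baseline, not GMM-SAMT: the shape-adapted asymmetric kernel, the scale adaptation, and the GMM segmentation are all omitted. The function name, the bandwidth value, and the sample points are assumptions for illustration.

```python
import math

def mean_shift(points, start, bandwidth=1.0, tol=1e-6, max_iter=100):
    """Move `start` to a nearby density mode of `points` (2D tuples).

    Each step replaces the current position by the kernel-weighted mean
    of the samples; iteration stops when the shift falls below `tol`.
    """
    x, y = start
    for _ in range(max_iter):
        wsum = sx = sy = 0.0
        for px, py in points:
            # Symmetric Gaussian kernel weight (assumption: isotropic bandwidth).
            w = math.exp(-((px - x) ** 2 + (py - y) ** 2) / (2 * bandwidth ** 2))
            wsum += w
            sx += w * px
            sy += w * py
        nx, ny = sx / wsum, sy / wsum
        if math.hypot(nx - x, ny - y) < tol:
            return nx, ny
        x, y = nx, ny
    return x, y

# Hypothetical samples clustered symmetrically around (5, 5).
samples = [(4.0, 5.0), (6.0, 5.0), (5.0, 4.0), (5.0, 6.0), (5.0, 5.0)]
mode_x, mode_y = mean_shift(samples, start=(4.5, 4.6))
```

In a tracker, the samples would be pixel positions weighted by how well their color matches the target model, and the converged position would serve as the object location in the new frame.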

High-level descriptors and video retrieval

Chapter 8. Forensic Reasoning upon Pre-Obtained Surveillance Metadata Using Uncertain Spatio-Temporal Rules and Subjective Logic

This chapter presents an approach to modeling uncertain contextual rules using subjective logic for forensic visual surveillance. Unlike traditional real-time visual surveillance, forensic analysis of visual surveillance data requires mating of high-level contextual cues with observed evidential metadata, where both the specification of the context and the metadata suffer from uncertainties. To address this aspect, there has been work on the use of declarative logic formalisms to represent and reason about contextual knowledge, and on the use of different uncertainty handling formalisms. In such approaches, attaching uncertainty to logical rules and facts is crucial. However, there are often cases in which the truth value of a rule itself is also uncertain, so the uncertainty attached to the rule should be functional rather than fixed. Knowledge of the type "the more X, the more Y" is one example. To enable this type of rule modeling, we propose a reputational subjective opinion function upon logic programming, which is similar to a fuzzy membership function but can also take into account the uncertainty of the membership value itself. We then adopt subjective logic's fusion operator to accumulate the acquired opinions over time. To verify our approach, we present a preliminary experimental case study on reasoning about the likelihood of being a good witness, which uses metadata extracted by a person tracker and evaluates the relationship between the tracked persons. The case study is further extended to demonstrate more complex forensic reasoning by considering additional contextual rules.
Seunghan Han, Bonjung Koo, Andreas Hutter, Walter Stechele

Chapter 9. AIR: Architecture for Interoperable Retrieval on Distributed and Heterogeneous Multimedia Repositories

Nowadays, multimedia data is produced and consumed at an ever-increasing rate. In parallel with this trend, diverse storage approaches for multimedia data have been introduced. As a result, distributed and heterogeneous multimedia repositories exist, but easy and unified access to the stored multimedia data is not available. This chapter presents an architecture, named AIR, that offers such unified retrieval. To ensure interoperability, AIR makes use of recently issued standards, namely the MPEG Query Format (multimedia query language) and the JPSearch transformation rules (metadata interoperability).
Florian Stegmaier, Mario Döller, Harald Kosch, Andreas Hutter, Thomas Riegel

Chapter 10. Local Invariant Feature Tracks for High-Level Video Feature Extraction

In this work, the use of feature tracks for the detection of high-level features (concepts) in video is proposed. Extending previous work on local interest point detection and description in images, feature tracks are defined as sets of local interest points that are found in different frames of a video shot and exhibit spatio-temporal and visual continuity, thus defining a trajectory in the 2D+Time space. These tracks jointly capture the spatial attributes of 2D local regions and their corresponding long-term motion. The extraction of feature tracks and the selection and representation of an appropriate subset of them allow the generation of a Bag-of-Spatiotemporal-Words model for the shot, which facilitates capturing the dynamics of video content. Experimental evaluation of the proposed approach on two challenging datasets (TRECVID 2007, TRECVID 2010) highlights how the selection, representation and use of such feature tracks enhance the results of traditional keyframe-based concept detection techniques.
Vasileios Mezaris, Anastasios Dimou, Ioannis Kompatsiaris

3D and multi-view

Chapter 11. A New Evaluation Criterion for Point Correspondences in Stereo Images

In this chapter, we present a new criterion to evaluate point correspondences within a stereo setup. Many applications such as stereo matching, triangulation, lens distortion correction, and camera calibration require an evaluation criterion for point correspondences. The common criterion here is the epipolar distance. The uncertainty of the epipolar geometry provides additional information, and our method uses this information for a new distance measure. The basic idea behind our criterion is to determine the most probable epipolar geometry that explains the point correspondence in the two views. This criterion considers the fact that the uncertainty increases for point correspondences induced by world points that are located at a different depth-level compared to those that were used for the fundamental matrix computation. Furthermore, we show that by using Lagrange multipliers, this constrained minimization problem can be reduced to solving a set of three linear equations with a computational complexity practically equal to the complexity of the epipolar distance.
Aleksandar Stojanovic, Michael Unger
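
The epipolar distance that the Chapter 11 abstract names as the common baseline criterion can be sketched as follows. This shows only that baseline (the distance from a point in the second image to the epipolar line induced by its correspondence in the first), not the chapter's uncertainty-aware criterion. The fundamental matrix below is a hypothetical one for an ideal rectified stereo pair, chosen so that the result is easy to check by hand.

```python
import math

def epipolar_distance(F, x1, x2):
    """Distance from point x2 (second image) to the epipolar line F * x1.

    F: 3x3 fundamental matrix as nested lists.
    x1, x2: (x, y) pixel coordinates in the first and second image.
    """
    p = (x1[0], x1[1], 1.0)  # homogeneous coordinates of x1
    # Epipolar line l = F p, written as a*u + b*v + c = 0 in the second image.
    a, b, c = (sum(F[r][j] * p[j] for j in range(3)) for r in range(3))
    return abs(a * x2[0] + b * x2[1] + c) / math.hypot(a, b)

# Assumed fundamental matrix of a rectified pair: epipolar lines are
# horizontal, so the distance reduces to the vertical offset |y1 - y2|.
F_rectified = [[0, 0, 0],
               [0, 0, -1],
               [0, 1, 0]]

d = epipolar_distance(F_rectified, (120.0, 80.0), (95.0, 82.5))
```

For this rectified setup the value of `d` is the row difference between the two points, 2.5 pixels. Note that this baseline treats the fundamental matrix as exact, which is the limitation the chapter addresses.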

Chapter 12. Local Homography Estimation Using Keypoint Descriptors

This chapter presents a novel learning-based approach to estimating the local homography of points belonging to a given surface and shows that it is more accurate than specific affine region detection methods. While other works attempt this by using iterative algorithms developed for template matching, our method introduces a direct estimation of the transformation. It performs the following steps. First, a training set of features captures geometry and appearance information about keypoints taken from multiple views of the surface. Then, incoming keypoints are matched against the training set in order to retrieve a cluster of features representing their identity. Finally, the retrieved clusters are used to estimate the local homography of the regions around keypoints. Thanks to the high accuracy, outliers and bad estimates are filtered out by a multiscale Summed Square Difference (SSD) test.
Alberto Del Bimbo, Fernando Franco, Federico Pernici

Chapter 13. A Cognitive Source Coding Scheme for Multiple Description 3DTV Transmission

Multiple Description Coding has recently proved to be an effective solution for the robust transmission of 3D video sequences over unreliable channels. However, adapting the characteristics of the source coding strategy (Cognitive Source Coding) permits improving the quality of 3D visualization experienced by the end-user. This strategy has been successfully employed for standard video signals, but it can be applied to Multiple Description video coding for an effective transmission of 3D signals. The chapter presents a novel Cognitive Source Coding scheme that improves the performance of traditional Multiple Description Coding approaches by adaptively combining traditional predictive and Wyner-Ziv coders according to the characteristics of the video sequence and to the channel conditions. The approach is employed for video+depth 3D transmissions improving the average PSNR value up to 2.5 dB with respect to traditional MDC schemes.
Simone Milani, Giancarlo Calvagno
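
The PSNR figure quoted in the Chapter 13 abstract can be put in context with the standard definition, PSNR = 10 log10(peak² / MSE). The sketch below assumes 8-bit samples (peak value 255) and uses tiny hypothetical images; it says nothing about how the chapter's codec achieves its gain.

```python
import math

def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are nested lists of sample values; `peak` is the maximum
    possible sample value (255 assumed for 8-bit content).
    """
    ref = [v for row in reference for v in row]
    dis = [v for row in distorted for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(ref, dis)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

# Hypothetical 2x2 images differing by one gray level everywhere (MSE = 1).
reference = [[10, 20], [30, 40]]
distorted = [[11, 21], [31, 41]]
value_db = psnr(reference, distorted)

# A PSNR gain of 2.5 dB corresponds to this reduction factor in MSE.
mse_ratio = 10 ** (2.5 / 10)
```

Because PSNR is logarithmic, a 2.5 dB improvement means the mean squared error is lower by a factor of about 1.78.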

Multimedia delivery

Chapter 14. An Efficient Prefetching Strategy for Remote Browsing of JPEG 2000 Image Sequences

This chapter proposes an efficient prefetching strategy for interactive remote browsing of sequences of high-resolution JPEG 2000 images. Given the inherent latency of client-server communication, the experiments in this study show that a significant benefit can be achieved, in terms of both quality and responsiveness, by anticipating certain data from the rest of the sequence while an image is being explored. A model based on the quality progression of the image is proposed in order to estimate what percentage of the bandwidth should be dedicated to prefetching. This solution can be easily implemented on top of any existing remote browsing architecture.
Juan Pablo García Ortiz, Vicente González Ruiz, Inmaculada García, Daniel Müller, George Dimitoglou

Chapter 15. Comparing Spatial Masking Modelling in Just Noticeable Distortion Controlled H.264/AVC Video Coding

This chapter studies the integration of a just noticeable distortion model in the H.264/AVC standard video codec to improve the final rate-distortion performance. Three masking aspects related to lossy transform coding and natural video contents are considered: frequency band decomposition, luminance component variations and pattern masking. For the latter aspect, three alternative models are considered, namely the Foley–Boynton, Foley–Boynton adaptive and Wei–Ngan models. Their performance, measured for high definition video contents and reported in terms of bitrate improvement and objective quality loss, reveals that the Foley–Boynton model and its adaptive version provide the best performance, with up to 35.6 % bitrate reduction at the cost of at most 1.4 % objective quality loss.
Matteo Naccari, Fernando Pereira

Chapter 16. Coherent Video Reconstruction with Motion Estimation at the Decoder

In traditional motion compensated predictive video coding, both the motion vector and the prediction residue are encoded and stored or sent for every predicted block. The motion vector brings displacement information with respect to a reference frame while the residue represents what we really consider to be the innovation of the current block with respect to that reference frame. This encoding scheme has proved to be extremely effective in terms of rate distortion performance. Nevertheless, one may argue that full description of motion and residue could be avoided if the decoder could be made able to exploit a proper a priori model for the signal to be reconstructed. In particular, it was recently shown that a smart enough decoder could exploit such an a priori model to partially infer motion information for a single block given only neighboring blocks and the innovation of that block. This chapter presents an improvement over the single-block method. In particular, it is shown that higher performance can be achieved by simultaneously reconstructing a frame region composed of several blocks, rather than reconstructing those blocks separately. A trellis based algorithm is developed in order to make a global decision on many motion vectors at a time instead of many single separate decisions on different vectors.
Claudia Tonoli, Marco Dalai
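
The idea in the Chapter 16 abstract of making one global decision over many motion vectors, instead of many separate decisions, can be illustrated with a generic Viterbi-style search over a trellis. This is a sketch under stated assumptions, not the chapter's algorithm: the blocks are arranged in a simple 1D chain, and the per-block costs, the candidate vectors, and the smoothness penalty are all hypothetical.

```python
def trellis_decode(unary, pairwise):
    """Find the minimum-cost sequence of candidate indices along a chain.

    unary[t][s]: cost of choosing candidate s for block t.
    pairwise(a, b): cost of neighbouring blocks choosing candidates a and b.
    Returns (best_path, best_total_cost).
    """
    n_states = len(unary[0])
    cost = list(unary[0])  # best cost of any path ending in each state
    back = []              # back-pointers, one list per trellis stage
    for t in range(1, len(unary)):
        new_cost, pointers = [], []
        for s in range(n_states):
            best_prev = min(range(n_states),
                            key=lambda p: cost[p] + pairwise(p, s))
            new_cost.append(cost[best_prev] + pairwise(best_prev, s) + unary[t][s])
            pointers.append(best_prev)
        cost = new_cost
        back.append(pointers)
    # Trace the best path backwards from the cheapest final state.
    state = min(range(n_states), key=lambda s: cost[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1], min(cost)

# Hypothetical candidate motion vectors and per-block matching costs.
candidates = [(0, 0), (1, 0), (2, 0)]
unary = [[0.0, 2.0, 5.0],
         [1.5, 1.0, 4.0],
         [0.0, 3.0, 6.0]]

def smoothness(a, b):
    """Assumed penalty: twice the L1 difference between neighbouring vectors."""
    (ax, ay), (bx, by) = candidates[a], candidates[b]
    return 2.0 * (abs(ax - bx) + abs(ay - by))

joint_path, joint_cost = trellis_decode(unary, smoothness)
# Separate per-block decisions ignore the coupling between neighbours.
independent_path = [min(range(len(row)), key=row.__getitem__) for row in unary]
```

In this toy case the separate decisions pick a different vector for the middle block, while the joint search keeps all three blocks on the same vector because the smoothness penalty outweighs the small per-block saving.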