
2011 | Book

Advances in Multimedia Modeling

17th International Multimedia Modeling Conference, MMM 2011, Taipei, Taiwan, January 5-7, 2011, Proceedings, Part I

Edited by: Kuo-Tien Lee, Wen-Hsiang Tsai, Hong-Yuan Mark Liao, Tsuhan Chen, Jun-Wei Hsieh, Chien-Cheng Tseng

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This two-volume set constitutes the refereed proceedings of the 17th International Multimedia Modeling Conference, MMM 2011, held in Taipei, Taiwan, in January 2011. The 51 revised regular papers, 25 special session papers, 21 poster session papers, and 3 demo session papers were carefully reviewed and selected from 450 submissions. The papers are organized in topical sections on audio, image and video processing, coding and compression; media content browsing and retrieval; multi-camera, multi-view, and 3D systems; multimedia indexing and mining; multimedia content analysis; multimedia signal processing and communications; and multimedia applications. The special session papers deal with content analysis for human-centered multimedia applications; large-scale rich media data management; multimedia understanding for consumer electronics; image object recognition and compression; and interactive image and video search.

Table of Contents

Frontmatter

Regular Papers

Audio, Image, Video Processing, Coding and Compression

A Generalized Coding Artifacts and Noise Removal Algorithm for Digitally Compressed Video Signals

A generalized coding artifact reduction algorithm is proposed for a variety of artifacts, including blocking artifacts, ringing artifacts, and mosquito noise. The algorithm requires neither grid position detection nor any coding parameters; all filtering is based on local content analysis. Essentially, the algorithm applies mild low-pass filtering to more informative regions to preserve the sharpness of object details, and strong low-pass filtering to less informative regions to remove severe artifacts. The size and parameters of the low-pass filters are varied continuously based on the entropy of the local region.

Ling Shao, Hui Zhang, Yan Liu
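The entropy-driven filter adaptation described in the abstract above can be sketched as follows. This is an illustrative reconstruction, not the authors' algorithm: the patch representation (a flat list of 8-bit pixel values) and the normalization by 8 bits of maximum entropy are assumptions.

```python
import math
from collections import Counter

def local_entropy(patch):
    """Shannon entropy (bits) of the pixel-value distribution in a patch."""
    counts = Counter(patch)
    n = len(patch)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def filter_strength(patch, max_entropy=8.0):
    """Map low-entropy (flat) regions to strong smoothing and high-entropy
    (detailed) regions to mild smoothing. Returns a weight in [0, 1],
    where 1.0 means strong low-pass filtering and 0.0 means mild."""
    h = min(local_entropy(patch), max_entropy)
    return 1.0 - h / max_entropy

flat = [128] * 64           # uniform patch: entropy 0 -> strong filtering
detailed = list(range(64))  # 64 distinct values: 6 bits -> mild filtering
```

A real implementation would map this weight onto the size and cutoff of the adaptive low-pass filter.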
Efficient Mode Selection with BMA Based Pre-processing Algorithms for H.264/AVC Fast Intra Mode Decision

In an H.264/AVC intra-frame encoder, the complicated computations of the mode decision make real-time applications difficult. In this paper, we propose an efficient fast algorithm, called the Block Matching Algorithm (BMA), to predict the best direction mode for fast intra mode decision. The edge-detection technique can predict luma-4x4, luma-16x16, and chroma-8x8 modes directly. The BMA method uses the relations between the current block and the predictive block to predict edge directions. We partition the intra prediction procedure into two steps. In the first step, we use the pre-processing mode selection algorithms to find the primary mode for fast prediction. In the second step, the few selected high-potential candidate modes are used to calculate the RD cost for the mode decision. The encoding time is largely reduced while the same video quality is maintained. Simulation results show that the proposed BMA method reduces the encoding time by 75%, with a bit-rate increase of about 2.5% and a peak signal-to-noise ratio (PSNR) decrease of about 0.07 dB on QCIF and CIF sequences, compared with the H.264/AVC JM 14.2 software. Our methods achieve less PSNR degradation and bit-rate increase than previous methods, with greater encoding time reduction.

Chen-Hsien Miao, Chih-Peng Fan
Perceptual Motivated Coding Strategy for Quality Consistency

In this paper, we propose a novel quality control scheme which aims to keep quality consistent within a frame. Quality consistency is an important requirement in video coding. However, many existing schemes treat quality consistency as quantization parameter (QP) consistency. Moreover, the metric most frequently used to evaluate quality consistency is PSNR, which is well known to be a poor indicator of subjective quality. We point out these flaws of the existing methods and show why they are unreasonable. For optimization, we take the effect of texture complexity on subjective evaluation into consideration to build a new D-Q model. We use the new model to adjust the quantization parameters of different regions to keep quality consistent. Simulation results show that the new scheme achieves better subjective quality and higher coding efficiency than the traditional approach.

Like Yu, Feng Dai, Yongdong Zhang, Shouxun Lin
Compressed-Domain Shot Boundary Detection for H.264/AVC Using Intra Partitioning Maps

In this paper, a novel technique for shot boundary detection operating on H.264/AVC-compressed sequences is presented. Due to new and improved coding tools in H.264/AVC, the characteristics of the resulting sequences differ from those of former video coding standards. Although several algorithms for this new standard have already been proposed, the presence of IDR frames can still lead to low accuracy for abrupt transitions. To solve this issue, we present the motion-compensated intra partitioning map, which relies on the intra partitioning modes and the motion vectors present in the compressed video stream. Experimental results show that this motion-compensated map achieves high accuracy and exceeds related work.

Sarah De Bruyne, Jan De Cock, Chris Poppe, Charles-Frederik Hollemeersch, Peter Lambert, Rik Van de Walle
Adaptive Orthogonal Transform for Motion Compensation Residual in Video Compression

Among the orthogonal transforms used in video and image compression, the Discrete Cosine Transform (DCT) is the most commonly used. In existing video codecs, the motion-compensation residual (MC-residual) is transformed with the DCT. In this paper, we propose an adaptive orthogonal transform that performs better on the MC-residual than the DCT. We formulate the proposed transform as an L1-norm minimization with orthogonality constraints. With the DCT matrix as the starting point, the method is guaranteed to derive a better orthogonal transform matrix in terms of L1-norm minimization. The experimental results confirm that, with little side information, our method leads to higher compression efficiency for the MC-residual. Notably, the proposed transform performs better in high or complex motion situations.

Zhouye Gu, Weisi Lin, Bu-sung Lee, Chiew Tong Lau
Parallel Deblocking Filter for H.264/AVC on the TILERA Many-Core Systems

For the purpose of accelerating the deblocking filter, which accounts for a significant percentage of H.264/AVC decoding time, some studies use the wavefront method to achieve the required performance on multi-core platforms. We study the problem in the context of many-core systems and present a new method to exploit the implicit parallelism. We apply our implementation to the deblocking filter of the H.264/AVC reference software JM15.1 on a 64-core TILERA system and achieve a speedup of more than eleven times for 1280x720 (HD) videos. The proposed method also achieves an overall decoding speedup of 140% for HD videos. Compared to the wavefront method, we obtain a significant 200% speedup for 720x576 (SD) videos.

Chenggang Yan, Feng Dai, Yongdong Zhang
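The wavefront scheduling that this paper takes as its baseline can be illustrated with a short sketch (a generic reconstruction of the technique, not the paper's TILERA code): deblocking block (r, c) depends on its upper and left neighbours, so all blocks on the same anti-diagonal are independent and can be filtered in parallel.

```python
def wavefront_order(rows, cols):
    """Yield macroblock coordinates in anti-diagonal (wavefront) order.

    Block (r, c) appears only after (r-1, c) and (r, c-1), so every
    block within one yielded anti-diagonal can be processed in parallel.
    """
    for d in range(rows + cols - 1):
        yield [(r, d - r)
               for r in range(max(0, d - cols + 1), min(rows, d + 1))]
```

For a 2x3 grid this produces four waves: [(0,0)], [(0,1),(1,0)], [(0,2),(1,1)], [(1,2)].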
Image Distortion Estimation by Hash Comparison

Perceptual hashing is conventionally used for content identification and authentication. In this work, we explore a new application of image hashing techniques. By comparing the hash values of original images and their compressed versions, we are able to estimate the distortion level. A particular image hash algorithm is proposed for this application. The distortion level is measured by the signal-to-noise ratio (SNR), estimated from the bit error rate (BER) of the hash values. The estimation performance is evaluated by experiments covering JPEG and JPEG 2000 compression and additive white Gaussian noise. We show that a theoretical model does not work well in practice. To improve estimation accuracy, we introduce a correction term into the theoretical model. We find that the correction term is highly correlated with the BER and the uncorrected SNR, so it can be predicted using a linear model. A new estimation procedure is defined accordingly, and the new experimental results are much improved.

Li Weng, Bart Preneel
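The shape of the correction described in the abstract above can be sketched as follows. Both the theoretical BER-to-SNR curve and the linear-correction coefficients here are placeholders for illustration; the paper's actual model and fitted coefficients are not given in the abstract.

```python
import math

def snr_theoretical(ber):
    """Hypothetical theoretical model: an SNR estimate that decreases as
    the hash bit error rate grows (placeholder, not the paper's curve)."""
    return -20.0 * math.log10(max(ber, 1e-6))

def snr_corrected(ber, a=5.0, b=-0.1, c=1.0):
    """Add a correction term that is linear in the BER and the uncorrected
    SNR, as the abstract describes; a, b, c are illustrative values that
    would in practice be fitted on calibration data."""
    snr0 = snr_theoretical(ber)
    return snr0 + (a * ber + b * snr0 + c)
```

The key design point is that the correction is predicted from quantities already available at estimation time (BER and the uncorrected SNR), so no extra side information is needed.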

Media Content Browsing and Retrieval

Sewing Photos: Smooth Transition between Photos

In this paper, a new smooth slideshow transition effect, Sewing Photos, is proposed that considers both smooth content transition and smooth camera motion. In contrast to traditional photo browsing and display work, which focuses on presenting splendid visual effects, Sewing Photos emphasizes a smooth transition process between photos while taking the photo contents into account. Compared with splendid visual effects, smooth transitions provide a more comfortable viewing experience and are well suited to long-term photo display. To smooth the content transition, the system first finds the similar parts between the photos and then decides which camera operations should be applied to the corresponding regions. A virtual camera path is then generated from the extracted Regions of Interest (ROIs). To make the camera motion smoother, the camera path is generated as a cubic interpolation spline.

Tzu-Hao Kuo, Chun-Yu Tsai, Kai-Yin Cheng, Bing-Yu Chen
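A cubic-spline camera path through ROI centres, as the abstract above describes, can be sketched with Catmull-Rom interpolation; the specific spline the authors use is not stated, and Catmull-Rom is simply one common choice that passes through every control point.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """One cubic Catmull-Rom segment: interpolates between p1 and p2."""
    def coord(a, b, c, d):
        return 0.5 * ((2 * b) + (-a + c) * t
                      + (2 * a - 5 * b + 4 * c - d) * t * t
                      + (-a + 3 * b - 3 * c + d) * t ** 3)
    return tuple(coord(a, b, c, d) for a, b, c, d in zip(p0, p1, p2, p3))

def camera_path(rois, steps=10):
    """Sample a smooth 2D camera path through ROI centres, duplicating the
    end points so the curve passes through every centre."""
    pts = [rois[0]] + list(rois) + [rois[-1]]
    path = []
    for i in range(len(rois) - 1):
        for s in range(steps):
            path.append(catmull_rom(pts[i], pts[i + 1],
                                    pts[i + 2], pts[i + 3], s / steps))
    path.append(rois[-1])
    return path
```

Because each segment starts exactly at an ROI centre, the virtual camera visits every region of interest while moving along a smooth cubic curve.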
Employing Aesthetic Principles for Automatic Photo Book Layout

Photos are a common way of preserving our personal memories. The visual souvenir of a personal event is often composed into a photo collage or the pages of a photo album. Today there are many tools to help users create such compositions. Some template-based approaches generate nice presentations but mostly come with limited design variations. Creating complex and fancy designs, e.g. for a personal photo book, still demands design and composition skills to achieve results that are really pleasing to the eye, skills which many users simply lack. Professional designers instead follow general design principles such as spatial layout rules, symmetry, balance among the elements, as well as color schemes and harmony. In this paper, we propose an approach to deliver principles of design and composition to the end user by embedding them into an automatic composition application. We identify and analyze common design and composition principles and transfer these to the automatic creation of pleasant photo compositions by employing genetic algorithms. In contrast to other approaches, we strictly base our design system on common design principles, consider additional media types besides text in the photo book, and specifically take the content of photos into account. Our approach is implemented both in a web-based rich media application and in a tool for the automatic transformation of blogs into photo books.

Philipp Sandhaus, Mohammad Rabbath, Susanne Boll
Video Event Retrieval from a Small Number of Examples Using Rough Set Theory

In this paper, we develop an example-based event retrieval method which constructs a model for retrieving events of interest in a video archive from examples provided by a user. This is challenging because shots of an event are characterized by significantly different features, due to camera techniques, settings, and so on. That is, the video archive contains a large variety of shots of the event, while the user can only provide a small number of examples. Considering this, we use rough set theory to capture the various characteristics of the event. Specifically, rough set theory lets us extract classification rules which can correctly identify different subsets of positive examples. Furthermore, in order to extract a larger variety of classification rules, we incorporate bagging and the random subspace method into rough set theory. Here, we define indiscernibility relations among examples based on the outputs of classifiers built on different subsets of examples and different subsets of feature dimensions. Experimental results on TRECVID 2009 video data validate the effectiveness of our example-based event retrieval method.

Kimiaki Shirahama, Yuta Matsuoka, Kuniaki Uehara
Community Discovery from Movie and Its Application to Poster Generation

Discovering roles and their relationships is critical in movie content analysis. However, most conventional approaches ignore the correlations among roles or require rich metadata such as casts and scripts, which makes them impractical when little metadata is available, especially in IPTV and VOD scenarios. To solve this problem, we propose a new method to discover key roles and their relationships by treating a movie as a small community. We first segment a movie into a hierarchical structure (scenes, shots, and key-frames) and perform face detection and grouping on the detected key-frames. Based on this information, we then create a community by exploiting the key roles and their correlations in the movie. The discovered community enables a wide variety of applications. In particular, we present the automatic generation of video posters (with four different visualizations) based on the community, as well as preliminary experimental results.

Yan Wang, Tao Mei, Xian-Sheng Hua
A BOVW Based Query Generative Model

Bag-of-visual-words (BOVW) is a local-feature-based framework for content-based image and video retrieval. Its performance relies on the discriminative power of the visual vocabulary, i.e. the cluster set on local features. However, optimising the visual vocabulary is highly complex for a large collection. This paper aims to relax this dependence by adapting the query generative model to BOVW-based retrieval. Local features are directly projected onto latent content topics to create effective visual queries; visual word distributions are learnt around local features to estimate the contribution of a visual word to a query topic; and relevance is judged by considering concept distributions on visual words as well as on local features. Extensive experiments are carried out on the TRECVid 2009 collection. The notable improvement in retrieval performance shows that this probabilistic framework alleviates the problem of visual ambiguity and can afford a visual vocabulary with relatively low discriminative power.

Reede Ren, John Collomosse, Joemon Jose
Video Sequence Identification in TV Broadcasts

We present a video sequence identification approach that can reliably and quickly detect equal or similar recurrences of a given video sequence in long video streams, such as TV broadcasts. The method relies on motion-based video signatures and has low run-time requirements. For TV broadcasts it makes it easy to track recurring broadcasts of a specific video sequence and to locate their positions, even across different TV channels. In an evaluation with 48 hours of video content recorded from local TV broadcasts, we show that our method is highly reliable and accurate and runs in a fraction of real-time.

Klaus Schoeffmann, Laszlo Boeszoermenyi
Content-Based Multimedia Retrieval in the Presence of Unknown User Preferences

Content-based multimedia retrieval requires an appropriate similarity model which reflects user preferences. When these preferences are unknown or when the structure of the data collection is unclear, retrieving the most preferable objects the user has in mind is challenging, as the notion of similarity varies from data to data, from task to task, and ultimately from user to user. Based on a specific query object and unknown user preferences, retrieving the most similar objects according to some default similarity model does not necessarily include the most preferable ones. In this work, we address the problem of content-based multimedia retrieval in the presence of unknown user preferences. Our idea consists in performing content-based retrieval by considering all possibilities in a family of similarity models simultaneously. To this end, we propose a novel content-based retrieval approach which aims at retrieving all potentially preferable data objects with respect to any preference setting in order to meet individual user requirements as much as possible. We demonstrate that our approach improves the retrieval performance regarding unknown user preferences by more than 57% compared to the conventional retrieval approach.

Christian Beecks, Ira Assent, Thomas Seidl

Multi-Camera, Multi-View, and 3D Systems

People Localization in a Camera Network Combining Background Subtraction and Scene-Aware Human Detection

In a network of cameras, people localization is an important issue. Traditional methods use camera calibration and combine background subtraction results from different views to locate people in three-dimensional space. Previous methods usually solve the localization problem iteratively based on background subtraction results, neglecting high-level image information. In order to fully exploit the image information, we incorporate human detection into multi-camera video surveillance. We develop a novel method combining human detection and background subtraction for multi-camera human localization using convex optimization. This convex optimization problem is independent of the image size; in fact, the problem size depends only on the number of locations of interest in the ground plane. Experimental results show that this combination performs better than background subtraction-based methods and demonstrate the advantage of combining these two types of complementary information.

Tung-Ying Lee, Tsung-Yu Lin, Szu-Hao Huang, Shang-Hong Lai, Shang-Chih Hung
A Novel Depth-Image Based View Synthesis Scheme for Multiview and 3DTV

Depth-Image Based View Synthesis (DIVS) is considered a good method to reduce the number of views for multiview and 3DTV. However, it cannot guarantee the quality of the occluded areas of the produced views. In order to reduce the number of views for multiview and 3DTV, a novel depth-image based view synthesis scheme is proposed to produce four views from one view with depth. Artifact detection and repair functions are added to address the problem of occluded areas. Artifact regions are detected by comparison with the original view, and the repair function uses motion compensation to fix them from the previous virtual view. This repair function can correctly fix the artifact regions at the cost of some additional bits for each virtual view. In our experiments, only one view is used to produce four virtual views. A gain of about 4~6 dB can be achieved at the cost of some additional bit stream. The total bit rate of the four views is only 56% of that of one color image at the same QP.

Xun He, Xin Jin, Minghui Wang, Satoshi Goto
Egocentric View Transition for Video Monitoring in a Distributed Camera Network

Current visual surveillance systems usually include multiple cameras to monitor the activities of targets over a large area. An important issue for a guard or user of such a system is understanding a series of events occurring in the environment, for example tracking a target walking across multiple cameras. Unlike traditional systems, which switch the camera view from one to another directly, we propose a novel approach to egocentric view transition, which synthesizes virtual views during the period of switching cameras to ease the mental effort required of users to understand the events. An important property of our system is that it can be applied to situations where the fields of view of the transition cameras are not close enough or are even disjoint. To the best of our knowledge, such situations have never been taken into consideration in state-of-the-art view transition techniques.

Kuan-Wen Chen, Pei-Jyun Lee, Yi-Ping Hung
A Multiple Camera System with Real-Time Volume Reconstruction for Articulated Skeleton Pose Tracking

We present a multi-camera system for recovering skeleton body pose by performing real-time volume reconstruction and using a hierarchical stochastic pose search algorithm. Unlike many multi-camera systems that require several connected workstations, our system uses only a single PC to control 8 cameras for synchronous image acquisition. Silhouettes from the 8 cameras are extracted via a color-based background subtraction algorithm and used as input to the 3D volume reconstruction. Our system can perform real-time volume reconstruction rendered as point clouds, voxels, or textured voxels. The full-body skeleton pose (a 29-D vector) is then recovered by fitting an articulated body model to the volume sequences. The pose estimation is performed hierarchically, using a particle swarm optimization (PSO) based search strategy combined with soft constraints. A 3D distance transform (DT) is used to reduce the computing time of objective evaluations.

Zheng Zhang, Hock Soon Seah, Chee Kwang Quah, Alex Ong, Khalid Jabbar
A New Two-Omni-Camera System with a Console Table for Versatile 3D Vision Applications and Its Automatic Adaptation to Imprecise Camera Setups

A new two-omni-camera system for 3D vision applications and a method for adaptation of the system to imprecise camera setups are proposed in this study. First, an efficient scheme for calibration of several omni-camera parameters using a set of analytic formulas is proposed. Also proposed is a technique to adapt the system to imprecise camera configuration setups for in-field 3D feature point data computation. The adaptation is accomplished by the use of a line feature of the console table boundary. Finally, analytic formulas for computing 3D feature point data after adaptation are derived. Good experimental results are shown to prove the feasibility and correctness of the proposed method.

Shen-En Shih, Wen-Hsiang Tsai
3D Face Recognition Based on Local Shape Patterns and Sparse Representation Classifier

In recent years, 3D face recognition has been considered a major solution to the unsolved issues of reliable 2D face recognition, i.e. illumination and pose variations. This paper focuses on two critical aspects of 3D face recognition: facial feature description and classifier design. To address the former, a novel local descriptor, namely Local Shape Patterns (LSP), is proposed. Since the LSP operator extracts both differential structure and orientation information, it can describe local shape attributes comprehensively. For the latter, a Sparse Representation Classifier (SRC) is applied to classify these 3D shape-based facial features. SRC has recently attracted increasing attention for its power in 2D image-based face recognition; this paper investigates its competency in shape-based face recognition. The proposed approach is evaluated on the IV² 3D face database, which contains rich facial expression variations, and promising experimental results are achieved, proving its effectiveness for 3D face recognition and its insensitivity to expression changes.

Di Huang, Karima Ouji, Mohsen Ardabilian, Yunhong Wang, Liming Chen
An Effective Approach to Pose Invariant 3D Face Recognition

One critical challenge encountered by existing face recognition techniques lies in the difficulties of handling varying poses. In this paper, we propose a novel pose invariant 3D face recognition scheme to improve regular face recognition from two aspects. Firstly, we propose an effective geometry based alignment approach, which transforms a 3D face mesh model to a well-aligned 2D image. Secondly, we propose to represent the facial images by a Locality Preserving Sparse Coding (LPSC) algorithm, which is more effective than the regular sparse coding algorithm for face representation. We conducted a set of extensive experiments on both 2D and 3D face recognition, in which the encouraging results showed that the proposed scheme is more effective than the regular face recognition solutions.

Dayong Wang, Steven C. H. Hoi, Ying He

Multimedia Indexing and Mining

Score Following and Retrieval Based on Chroma and Octave Representation

Building on effective representations of music signals and music scores, i.e. chroma and octave features, this work conducts score following and score retrieval. To compensate for the shortcomings of the chromagram representation, energy distributions in different octaves are used to describe tone height information. By transforming music signals and scores into sequences of feature vectors, score following is cast as a sequence matching problem, solved by the dynamic time warping (DTW) algorithm. For score retrieval, we modify the backtracking step of DTW to determine multiple partial matches between the query and a score. Experimental results show the effectiveness of the proposed features and the feasibility of the modified DTW algorithm for score retrieval.

Wei-Ta Chu, Meng-Luen Li
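The sequence-matching core of the abstract above is classical dynamic time warping; a minimal sketch (on scalar features, for brevity, whereas the paper matches chroma/octave feature vectors) looks like this:

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic time warping cost between two feature sequences: the
    minimum cumulative distance over all monotone alignments of a and b."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Because warping allows one score symbol to align with several signal frames, a sequence played at a different tempo (e.g. each value repeated) still matches with zero cost; the paper's retrieval variant additionally modifies the backtracking step to recover multiple partial matches.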
Incremental Multiple Classifier Active Learning for Concept Indexing in Images and Videos

Active learning with multiple classifiers has shown good performance for concept indexing in images and video shots in the case of highly imbalanced data. However, it involves a large number of computations. In this paper, we propose a new incremental active learning algorithm based on multiple SVMs for image and video annotation. The experimental results show that the best performance (MAP) is reached when 15-30% of the corpus is annotated, and that the new method achieves almost the same precision while saving 50 to 63% of the computation time.

Bahjat Safadi, Yubing Tong, Georges Quénot
A Semantic Higher-Level Visual Representation for Object Recognition

With the huge amount of digital images now available, effective methods to access images containing a desired object are essential. We propose a semantic higher-level visual representation which improves the traditional part-based bag-of-words image representation in two aspects. First, we propose a semantic model that generates semantic visual words and phrases in order to bridge the semantic gap. Second, the approach strengthens the discriminative power of classical visual words by constructing a mid-level descriptor, the Semantic Visual Phrase, from sets of Semantic Visual Words that frequently co-occur in the same local context.

Ismail El Sayad, Jean Martinet, Thierry Urruty, Chabane Dejraba
Mining Travel Patterns from GPS-Tagged Photos

The phenomenal advances of photo-sharing services such as Flickr™ have led to voluminous community-contributed photos with socially generated textual, temporal, and geographical metadata on the Internet. The photos, together with their time and geo-references, implicitly document the photographers' spatiotemporal movement paths. This study aims to leverage the wealth of these enriched online photos to analyze people's travel patterns at the local level of a tour destination. First, from a noisy pool of GPS-tagged photos downloaded from the Internet, we build a statistically reliable database of travel paths and mine a list of regions of attraction (RoA). We then investigate the tourist traffic flow among different RoAs by exploiting a Markov chain model. Tests on four major cities demonstrate promising results for the proposed system.

Yan-Tao Zheng, Yiqun Li, Zheng-Jun Zha, Tat-Seng Chua
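The Markov-chain traffic-flow step in the abstract above amounts to estimating transition probabilities between regions of attraction from observed visit sequences; a minimal sketch (with hypothetical RoA labels) is:

```python
from collections import defaultdict

def transition_probs(paths):
    """Estimate first-order Markov transition probabilities between
    regions of attraction (RoAs) from a list of visit sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

# Hypothetical mined travel paths over three RoAs:
paths = [["A", "B", "C"], ["A", "B"], ["A", "C"]]
```

Here P(B|A) = 2/3 and P(C|A) = 1/3, i.e. two of the three tourists who started at RoA A moved on to B.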
Augmenting Image Processing with Social Tag Mining for Landmark Recognition

Social multimedia computing is a new approach which combines the contextual information available in social networks with multimedia content to achieve greater accuracy in traditional multimedia problems such as face and landmark recognition. Tian et al. [12] introduce this concept and suggest various fields where this approach yields significant benefits. In this paper, the approach is applied to the landmark recognition problem. The flickr.com dataset was used to select a set of images for a given landmark. Image processing techniques were then applied to the images, and text mining techniques were applied to the accompanying social metadata, to determine independent rankings. These rankings were combined, using models similar to meta search engines, into an improved integrated ranking system. Experiments show that the combined approach gives better results than the separate analyses.

Amogh Mahapatra, Xin Wan, Yonghong Tian, Jaideep Srivastava
News Shot Cloud: Ranking TV News Shots by Cross TV-Channel Filtering for Efficient Browsing of Large-Scale News Video Archives

TV news programs are an important target of multimedia content analysis, since they are one of the major information sources in ordinary daily life. As computer storage costs have dropped significantly, today we can digitally archive a huge number of TV news programs. On the other hand, as the archive grows larger, the cost of browsing and utilizing the video archive also increases significantly. To circumvent this problem, we present a visualization method for TV news shots using popularity-based filtering across multiple TV channels. This method can be regarded as social filtering by TV broadcasters, or popularity ranking among TV channels. To examine the effectiveness of our approach, we conducted an experiment on a video archive on the order of a thousand hours, storing streams from 6 TV channels over one month. To the best of our knowledge, no previous work has applied this scheme to such a huge archive with a quantitative evaluation.

Norio Katayama, Hiroshi Mo, Shin’ichi Satoh

Multimedia Content Analysis (I)

Speaker Change Detection Using Variable Segments for Video Indexing

Video indexing based on shots obtained from visual features is useful for content-based video browsing, but has had more limited success in facilitating semantic search of videos. Meanwhile, recent developments in speech recognition offer the option of bypassing many of the difficulties associated with detecting semantic meaning from visual features by operating directly on the verbal content. The use of language-based indexing inspires a new video segmentation technique based on speaker change detection. This paper deals with the improvement of existing speaker change detectors by introducing an extra preprocessing step which aligns the audio features with syllables. We investigate the benefits of such synchronization and propose a variable presegmentation scheme that uses both magnitude and frequency information to attain the alignment. The experimental results show that the quality of the extracted audio features is improved, resulting in a better recall rate.

King Yiu Tam, Jose Lay, David Levy
Correlated PLSA for Image Clustering

Probabilistic Latent Semantic Analysis (PLSA) has become a popular topic model for image clustering. However, the traditional PLSA method considers each image (document) independently, which often conflicts with reality. In this paper, we present an improved PLSA model, named Correlated Probabilistic Latent Semantic Analysis (C-PLSA). Unlike PLSA, the topics of a given image are modeled using the images that are related to it. In our method, each image is represented by a bag of visual words. With this representation, we calculate the cosine similarity between each pair of images to capture their correlations. We then use the C-PLSA model to generate K latent topics, with the Expectation Maximization (EM) algorithm used for parameter estimation. Based on the latent topics, image clustering is carried out according to the estimated conditional probabilities. Extensive experiments are conducted on a publicly available database. The comparison results show that our approach is superior to traditional PLSA for image clustering.

Peng Li, Jian Cheng, Zechao Li, Hanqing Lu
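The correlation step described above, computing cosine similarities between bag-of-visual-words histograms, can be sketched as follows (a minimal illustration, not the authors' implementation; the function names are our own):

```python
import numpy as np

def cosine_similarity(h1, h2):
    """Cosine similarity between two bag-of-visual-words histograms."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    denom = np.linalg.norm(h1) * np.linalg.norm(h2)
    return float(h1 @ h2 / denom) if denom > 0 else 0.0

def correlation_matrix(hists):
    """Pairwise image correlations used to couple topics across images."""
    n = len(hists)
    S = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = cosine_similarity(hists[i], hists[j])
    return S
```

The resulting symmetric matrix S is what a correlated topic model can consume to tie each image's topic distribution to those of its most similar images.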
Genre Classification and the Invariance of MFCC Features to Key and Tempo

Musical genre classification is a promising yet difficult task in the field of musical information retrieval. As a widely used feature in genre classification systems, MFCC is typically believed to encode timbral information, since it represents short-duration musical textures. In this paper, we investigate the invariance of MFCC to musical key and tempo, and show that MFCCs in fact encode both timbral and key information. We also show that musical genres, which should be independent of key, are in fact influenced by the fundamental keys of the instruments involved. As a result, genre classifiers based on the MFCC features will be influenced by the dominant keys of the genre, resulting in poor performance on songs in less common keys. We propose an approach to address this problem, which consists of augmenting classifier training and prediction with various key and tempo transformations of the songs. The resulting genre classifier is invariant to key, and thus more timbre-oriented, resulting in improved classification accuracy in our experiments.

Tom L. H. Li, Antoni B. Chan
Combination of Local and Global Features for Near-Duplicate Detection

This paper presents a new method that combines local and global features for near-duplicate image detection. It consists of three main steps. First, keypoints are extracted from the images and preliminarily matched. Second, the matched keypoints vote for an estimate of the affine transform, which is used to discard falsely matched keypoints. Finally, to further confirm the matching, the Local Binary Pattern (LBP) and color histograms of the areas formed by matched keypoints in the two images are compared. This method has the advantage of handling the case when there are only a few matched keypoints. The proposed algorithm has been tested on the Columbia dataset and compared quantitatively with the RANdom SAmple Consensus (RANSAC) and Scale-Rotation Invariant Pattern Entropy (SR-PE) methods. The results show that the proposed method compares favorably with the state of the art.

Yue Wang, ZuJun Hou, Karianto Leman, Nam Trung Pham, TeckWee Chua, Richard Chang
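For the verification step, a basic 8-neighbour LBP histogram of the kind compared between matched regions can be computed like this (a simplified sketch of standard LBP without the rotation-invariant or uniform-pattern variants; not the authors' code):

```python
import numpy as np

def lbp_histogram(gray):
    """8-neighbour Local Binary Pattern codes and their normalised
    256-bin histogram over the image interior."""
    g = np.asarray(gray, float)
    H, W = g.shape
    c = g[1:-1, 1:-1]                       # centre pixels (border skipped)
    code = np.zeros(c.shape, dtype=int)
    # neighbour offsets in clockwise order starting at top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = g[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        code += (neigh >= c).astype(int) << bit   # set one bit per neighbour
    hist = np.bincount(code.ravel(), minlength=256).astype(float)
    return hist / hist.sum()
```

Two matched regions can then be compared with any histogram distance (e.g. L1 or chi-square) between their LBP histograms.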
Audio Tag Annotation and Retrieval Using Tag Count Information

Audio tags correspond to keywords that people use to describe different aspects of a music clip, such as the genre, mood, and instrumentation. With the explosive growth of digital music available on the Web, automatic audio tagging, which can be used to annotate unknown music or retrieve desirable music, is becoming increasingly important. This can be achieved by training a binary classifier for each tag based on the labeled music data. However, since social tags are usually assigned by people with different levels of musical knowledge, they inevitably contain noisy information. To address the noisy label problem, we propose a novel method that exploits the tag count information. By treating the tag counts as costs, we model the audio tagging problem as a cost-sensitive classification problem. The results of audio tag annotation and retrieval experiments show that the proposed approach outperforms our previous method, which won the MIREX 2009 audio tagging competition.

Hung-Yi Lo, Shou-De Lin, Hsin-Min Wang
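The idea of treating tag counts as misclassification costs can be illustrated with a cost-weighted logistic regression, where each example's gradient contribution is scaled by its normalised tag count. This is only a sketch of cost-sensitive training in general, not the classifier used in the paper:

```python
import numpy as np

def train_cost_sensitive(X, y, counts, lr=0.1, epochs=300):
    """Logistic regression whose per-example loss is scaled by tag counts.

    Examples tagged by many users (high count) incur a larger penalty when
    misclassified, so the classifier trusts confidently labeled data more.
    """
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    w = np.zeros(X.shape[1])
    b = 0.0
    c = np.asarray(counts, float) / np.sum(counts)   # normalised costs
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # predicted probabilities
        grad = c * (p - y)                           # cost-weighted gradient
        w -= lr * (X.T @ grad)
        b -= lr * grad.sum()
    return w, b
```

Setting all counts equal recovers ordinary logistic regression; raising the count of a noisy-but-popular tag pulls the decision boundary toward fitting it.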
Similarity Measurement for Animation Movies

Considering the quantity of multimedia content that people and professionals accumulate day by day on their storage devices, the need for appropriate intelligent tools for searching and navigating becomes pressing. Nevertheless, the richness of such media is difficult to handle with today's video analysis algorithms. In this context, we propose a similarity measure dedicated to animation movies. The measure is based on the fuzzy fusion of low-level descriptors. We focus on a fuzzy approach based on the Choquet integral, which proves to be a good way to account for complementarity or conflict between the fused data and thus to model a human-like similarity measure. Subjective tests with human observers have been carried out to validate the model.

Alexandre Benoit, Madalina Ciobotaru, Patrick Lambert, Bogdan Ionescu
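The discrete Choquet integral at the heart of such a fusion can be computed directly from its telescoping definition. The sketch below assumes the fuzzy measure is given as a dictionary from coalitions (frozensets of descriptor indices) to capacities in [0, 1]:

```python
import numpy as np

def choquet_integral(scores, measure):
    """Discrete Choquet integral of criterion scores w.r.t. a fuzzy measure.

    Uses the telescoping form: sum over ranks of
    (x_(i) - x_(i+1)) * mu({criteria with the i largest scores}),
    with scores sorted in decreasing order and x_(n+1) = 0.
    """
    idx = np.argsort(scores)[::-1]              # criteria by decreasing score
    sorted_scores = [scores[i] for i in idx]
    total = 0.0
    coalition = set()
    for rank, i in enumerate(idx):
        coalition.add(i)
        nxt = sorted_scores[rank + 1] if rank + 1 < len(idx) else 0.0
        total += (sorted_scores[rank] - nxt) * measure[frozenset(coalition)]
    return total
```

With an additive measure this reduces to a weighted mean; non-additive capacities let the fusion reward agreement (or penalise conflict) between descriptors, which is the property the paper exploits.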

Multimedia Content Analysis (II)

A Feature Sequence Kernel for Video Concept Classification

Kernel methods such as Support Vector Machines are widely applied to classification problems, including concept detection in video. Nonetheless, issues such as modeling specific distance functions of feature descriptors, or incorporating the temporal sequence of features into the kernel, have received comparatively little attention in multimedia research. We review work on kernels for commonly used MPEG-7 visual features and propose a kernel for matching temporal sequences of these features. The sequence kernel is based on ideas from string matching, but does not require discretization of the input feature vectors and deals with partial matches and gaps. Evaluation on the TRECVID 2007 high-level feature extraction data set shows that the sequence kernel clearly outperforms the radial basis function (RBF) kernel and the MPEG-7 visual feature kernels using only single key frames.

Werner Bailer
Bottom-Up Saliency Detection Model Based on Amplitude Spectrum

In this paper, we propose a saliency detection model based on the amplitude spectrum. The proposed model first divides the input image into small patches, and then uses the amplitude spectrum of the Quaternion Fourier Transform (QFT) to represent the color, intensity, and orientation distributions of each patch. The saliency of each patch is determined by two factors: the difference between the amplitude spectra of the patch and its neighboring patches, and the Euclidean distance between the associated patches. The novel saliency measure for image patches based on the QFT amplitude spectrum proves promising: experimental results show that this saliency detection model performs better than the relevant existing models.

Yuming Fang, Weisi Lin, Bu-Sung Lee, Chiew Tong Lau, Chia-Wen Lin
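A grayscale stand-in for the patch-based amplitude-spectrum comparison (using an ordinary 2D FFT in place of the quaternion transform, and a simple inverse-distance weighting) might look like this; the patch size and weighting are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def patch_saliency(img, psize=8):
    """Patch saliency from amplitude-spectrum differences.

    Each patch's saliency sums its spectrum difference to every other
    patch, weighted so that nearby patches contribute more.
    """
    H, W = img.shape
    ph, pw = H // psize, W // psize
    amps, centers = [], []
    for r in range(ph):
        for c in range(pw):
            patch = img[r * psize:(r + 1) * psize, c * psize:(c + 1) * psize]
            amps.append(np.abs(np.fft.fft2(patch)).ravel())   # amplitude spectrum
            centers.append(((r + 0.5) * psize, (c + 0.5) * psize))
    amps, centers = np.array(amps), np.array(centers)
    sal = np.zeros(len(amps))
    for i in range(len(amps)):
        diff = np.linalg.norm(amps - amps[i], axis=1)         # spectral difference
        dist = np.linalg.norm(centers - centers[i], axis=1)   # spatial distance
        sal[i] = np.sum(diff / (1.0 + dist))                  # nearer patches weigh more
    return sal.reshape(ph, pw)
```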
L2-Signature Quadratic Form Distance for Efficient Query Processing in Very Large Multimedia Databases

The rapidly increasing amount of multimedia data leads to huge databases whose contents users need to search and explore. Content-based searching for similar objects inside such vivid and voluminous multimedia databases is typically accompanied by an immense number of costly similarity computations among the stored data objects. In order to process the similarity computations arising in content-based similarity queries efficiently, we present the L2-Signature Quadratic Form Distance, which maintains high retrieval quality and improves the computation time of the Signature Quadratic Form Distance by more than one order of magnitude. As a result, we can process millions of similarity computations in a few seconds.

Christian Beecks, Merih Seran Uysal, Thomas Seidl
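The basic Signature Quadratic Form Distance between two feature signatures can be sketched as follows, here with a 1/(1 + L2) similarity kernel as an illustrative choice. This shows only the generic SQFD form; the paper's contribution is an efficient L2-based variant of it:

```python
import numpy as np

def sqfd(cx, wx, cy, wy):
    """Signature Quadratic Form Distance with a 1/(1 + L2) similarity kernel.

    cx, cy: centroid arrays (n x d and m x d); wx, wy: their weights.
    SQFD(X, Y) = sqrt(w A w^T), where w concatenates the weights of X with
    the negated weights of Y, and A holds pairwise centroid similarities.
    """
    C = np.vstack([cx, cy])                            # all centroids
    w = np.concatenate([wx, -np.asarray(wy, float)])   # signed weights
    D = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
    A = 1.0 / (1.0 + D)                                # similarity matrix
    return float(np.sqrt(max(w @ A @ w, 0.0)))
```

The quadratic form makes every query cost O((n + m)^2) kernel evaluations, which is exactly the per-comparison expense the paper's L2 variant attacks.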
Generating Representative Views of Landmarks via Scenic Theme Detection

Visual summarization of landmarks is an interesting and non-trivial task given the availability of gigantic community-contributed resources. In this work, we investigate ways to generate representative and distinctive views of landmarks by automatically discovering the underlying Scenic Themes (e.g. sunny, night view, snow, foggy views, etc.) via content-based analysis. The challenge is that the task suffers from the subjectivity of scenic theme understanding, and there is no prior knowledge of the scenic themes. In addition, the visual variations of scenic themes result from the joint effects of factors including weather, time, and season. To tackle these issues, we exploit the Dirichlet Process Gaussian Mixture Model (DPGMM). The major advantage of the DPGMM is that it is fully unsupervised and does not require the number of components to be fixed beforehand, which avoids the difficulty of adjusting model complexity to prevent over-fitting. This work makes a first attempt at generating representative views of landmarks via scenic theme mining. Tests on seven famous world landmarks show promising results.

Yi-Liang Zhao, Yan-Tao Zheng, Xiangdong Zhou, Tat-Seng Chua
Regularized Semi-supervised Latent Dirichlet Allocation for Visual Concept Learning

Topic models are a popular tool for visual concept learning. Current topic models are either unsupervised or fully supervised. Although large numbers of labeled images can significantly improve the performance of topic models, they are very costly to acquire, while billions of unlabeled images are freely available on the internet. In this paper, to take advantage of both limited labeled training images and abundant unlabeled images, we propose a novel technique called regularized Semi-supervised Latent Dirichlet Allocation (r-SSLDA) for learning visual concept classifiers. Instead of introducing a new topic model, we seek an efficient way to learn topic models in a semi-supervised manner. r-SSLDA considers both semi-supervised learning and the supervised topic model simultaneously within a regularization framework. Experiments on Caltech 101 and Caltech 256 show that r-SSLDA outperforms unsupervised LDA and achieves performance competitive with fully supervised LDA, while sharply reducing the number of labeled images required.

Liansheng Zhuang, Lanbo She, Jingjing Huang, Jiebo Luo, Nenghai Yu
Boosted Scene Categorization Approach by Adjusting Inner Structures and Outer Weights of Weak Classifiers

Scene categorization plays an important role in computer vision and image content understanding. It is a multi-class pattern classification problem, which is usually solved with several component classifiers, each discriminating some patterns from the others. Due to biases in the training data and the local optima reached during training, some weak classifiers may not be well trained. In particular, some component classifiers of a weak classifier may not perform as well as the others, which keeps the weak classifier from performing as well as it should. In this paper, the inner structures of weak classifiers are adjusted before their outer weights are determined. Experimental results on three AdaBoost algorithms show the effectiveness of the proposed approach.

Xueming Qian, Zhe Yan, Kaiyu Hang
A User-Centric System for Home Movie Summarisation

In this paper we present a user-centric summarisation system that combines automatic visual-content analysis with user-interface design features as a practical method for home movie summarisation. The proposed summarisation system is designed in such a manner that the video segmentation results generated by the automatic content analysis tools are further subject to refinement through the use of an intuitive user-interface so that the automatically created summaries can be effectively tailored to each individual’s personal need. To this end, we study a number of content analysis techniques to facilitate the efficient computation of video summaries, and more specifically emphasise the need for employing an efficient and robust optical flow field computation method for sub-shot segmentation in home movies. Due to the subjectivity of video summarisation and the inherent challenges associated with automatic content analysis, we propose novel user-interface design features as a means to enable the creation of meaningful home movie summaries in a simple manner. The main features of the proposed summarisation system include the ability to automatically create summaries of different visual comprehension, interactively defining the target length of the desired summary, easy and interactive viewing of the content in terms of a storyboard, and manual refinement of the boundaries of the automatically selected video segments in the summary.

Saman H. Cooray, Hyowon Lee, Noel E. O’Connor

Multimedia Signal Processing and Communications

Image Super-Resolution by Vectorizing Edges

As the resolution of output devices increases, the demand for high-resolution content grows, and image super-resolution algorithms become more important. Edges in a digital image matter greatly to human perception, so most recent research aims to enhance image edges to achieve better visual quality. In this paper, we propose an edge-preserving image super-resolution algorithm that vectorizes the image edges. We first parameterize the image edges to fit their shapes, and then use these data as constraints for super-resolution. However, the color near image edges is usually a combination of two different regions; a matting technique is utilized to solve this problem. Finally, we perform super-resolution based on the edge shape, position, and nearby color information to compute an image with sharp edges.

Chia-Jung Hung, Chun-Kai Huang, Bing-Yu Chen
Vehicle Counting without Background Modeling

In general, vision-based methods may face problems of severe illumination variation, shadows, or swaying trees. Here, we propose a novel vehicle detection method without background modeling to overcome these problems. First, a modified block-based frame differential method is established to quickly detect moving targets without being influenced by rapid illumination variations. Second, the precise target regions are extracted with a dual-foreground fusion method. Third, a texture-based object segmentation method is proposed to segment each vehicle from the merged foreground image blob and remove the shadows. Fourth, a false-foreground filtering method based on the concept of motion entropy is developed to remove false object regions caused by swaying trees or moving clouds. Finally, a texture-based target tracking method is proposed to track each detected target, and a virtual-loop detector is then applied to compute the traffic flow. Experimental results show that the proposed system runs at above 20 fps and that the average vehicle-counting accuracy approaches 86%.

Cheng-Chang Lien, Ya-Ting Tsai, Ming-Hsiu Tsai, Lih-Guong Jang
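The first step, block-based frame differencing with an adaptive threshold, can be sketched as below. The per-frame statistical threshold (mean plus k standard deviations of the block scores) is our own illustrative stand-in for the paper's modified differential method:

```python
import numpy as np

def moving_blocks(prev, curr, bsize=8, k=1.0):
    """Flag blocks whose mean absolute frame difference exceeds an
    adaptive, per-frame threshold.

    Thresholding relative to the frame's own difference statistics gives
    some robustness to global illumination change.
    """
    diff = np.abs(curr.astype(float) - prev.astype(float))
    H, W = diff.shape
    bh, bw = H // bsize, W // bsize
    # mean absolute difference per bsize x bsize block
    scores = diff[:bh * bsize, :bw * bsize].reshape(bh, bsize, bw, bsize).mean(axis=(1, 3))
    thresh = scores.mean() + k * scores.std()
    return scores > thresh
```

The boolean block map would then feed the dual-foreground fusion and segmentation stages described above.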
Effective Color-Difference-Based Interpolation Algorithm for CFA Image Demosaicking

This paper proposes an effective color-difference-based (ECDB) interpolation algorithm for CFA image demosaicking. A color filter array (CFA) consists of a set of spectrally selective filters arranged in an interleaved pattern such that only one color component is sampled at each pixel location. To improve the quality of full-color images reconstructed from CFA images, the ECDB algorithm first analyzes the neighboring samples around each missing green pixel to determine suitable samples for interpolating its value. After the interpolation of all missing green pixels, a complete green plane (the $\overline{G}$ plane) is obtained. The ECDB algorithm then exploits the high correlation between the R, G, and B planes to produce the red–green and blue–green color-difference planes, and further reconstructs the red and blue planes in successive operations. Because the green plane provides twice as much information as the red and blue planes, the algorithm exploits the green plane more than the red and blue planes, so that the full-color image can be reconstructed more accurately. In essence, the ECDB algorithm uses the red–green and blue–green color-difference planes and develops different conditional operations according to the horizontal, vertical, and diagonal neighboring pixel information with a suitable weighting technique. The experimental results demonstrate that the proposed algorithm has outstanding performance.

Yea-Shuan Huang, Sheng-Yi Cheng
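The core color-difference idea, interpolating the smoother R − G plane rather than R itself, can be sketched as follows. This is a simplified box interpolation over a hypothetical sampling mask, not the ECDB algorithm's directional conditional weighting:

```python
import numpy as np

def box_sum3(a):
    """3x3 neighbourhood sum with zero padding (same output size)."""
    p = np.pad(a, 1, mode="constant")
    h, w = a.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3))

def interp_red_by_color_difference(G, R_samples, R_mask):
    """Reconstruct a full red plane from sparse samples via the R - G plane.

    The colour-difference plane is smoother than R itself, so interpolating
    it and adding G back produces fewer colour artifacts than interpolating
    R directly.
    """
    D = np.where(R_mask, R_samples - G, 0.0)          # sparse R - G differences
    num = box_sum3(D)                                 # neighbourhood sum of differences
    den = box_sum3(R_mask.astype(float))              # count of known neighbours
    D_full = np.where(R_mask, D, num / np.maximum(den, 1e-9))
    return G + D_full
```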
Utility Max-Min Fair Rate Allocation for Multiuser Multimedia Communications

In this paper, we study the rate allocation problem in multiuser multimedia communications. Two essential objectives, efficiency and fairness, are often considered: efficiency concerns maximizing the sum of video qualities over all users, while fairness concerns the video quality differences among users. Generally, increasing efficiency and maintaining fairness are conflicting goals. To address this problem, we design a utility function that takes both the video quality and the allocated rate into consideration. We then propose a utility max-min fair rate allocation scheme, which achieves a good tradeoff between efficiency and fairness. Simulation results demonstrate the validity of the proposed scheme.

Qing Zhang, Guizhong Liu, Fan Li
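Max-min fairness is classically computed by progressive filling: raise all unsatisfied users' allocations equally until capacity is exhausted. The sketch below works directly on rates with demand caps and assumes identical utility functions for all users, whereas the paper applies the principle in utility space:

```python
def maxmin_fair(capacity, demands):
    """Progressive-filling max-min fair rate allocation with demand caps."""
    alloc = [0.0] * len(demands)
    remaining = capacity
    active = [i for i, d in enumerate(demands) if d > 0]
    while active and remaining > 1e-12:
        share = remaining / len(active)          # equal increment this round
        still_active = []
        for i in active:
            give = min(share, demands[i] - alloc[i])
            alloc[i] += give
            remaining -= give
            if alloc[i] < demands[i] - 1e-12:    # user still unsatisfied
                still_active.append(i)
        active = still_active
    return alloc
```

Replacing the rate increments with increments of per-user utility (and inverting the utility to recover rates) turns this into the utility max-min scheme the abstract describes.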

Multimedia Applications

Adaptive Model for Robust Pedestrian Counting

Toward robust pedestrian counting under partial occlusion, we put forward a novel model-based approach for pedestrian detection. Our approach consists of two stages: pre-detection and verification. First, based on a whole-pedestrian model built in advance, adaptive models are dynamically determined by the occlusion conditions of the corresponding body parts; a heuristic approach with grid masks is proposed to examine the visibility of each body part. Using part models for template matching, we adopt an approximate branch structure for preliminary detection. Second, a Bayesian framework is utilized to verify and optimize the pre-detection results, and the Reversible Jump Markov Chain Monte Carlo (RJMCMC) algorithm is used to solve this high-dimensional problem. Experiments and comparisons demonstrate the promise of the proposed approach.

Jingjing Liu, Jinqiao Wang, Hanqing Lu
Multi Objective Optimization Based Fast Motion Detector

Many surveillance applications require fast action, and since moving objects carry the most critical information, a fast detection algorithm becomes a necessity. A recurring problem in computer vision is the determination of weights for multiple-objective function optimization. In this paper we propose techniques for automatically determining the weights and discuss their properties. The Min-Max Principle, which avoids the problems of extremely low or high weights, is introduced. Expressions are derived relating the optimal weights, objective function values, and total cost. Simulation results show that, compared to conventional work, the method achieves around 40% time savings and higher detection accuracy for both outdoor and indoor surveillance videos.

Jia Su, Xin Wei, Xiaocong Jin, Takeshi Ikenaga
Narrative Generation by Repurposing Digital Videos

Storytelling and narrative creation are very popular research issues in the field of interactive media design. In this paper, we propose a framework for generating a video narrative from existing videos in which the user is involved in only two steps: (1) selecting the background video and avatars, and (2) setting up the movement and trajectory of the avatars. To generate a realistic video narrative, several important steps have to be implemented. First, a video scene generation process is designed to generate a video mosaic, which serves as a basis for narrative planning. Second, an avatar preprocessing procedure with moderate avatar control technologies is designed to regulate a number of specific properties, such as the size or the length of constituent motion clips, and to control the motion of the avatars. Third, a layer merging algorithm and a spatiotemporal replacement algorithm are developed to ensure the visual quality of the generated video narrative. To demonstrate the efficacy of the proposed method, we generated several realistic video narratives from chosen video clips, and the results turned out to be visually pleasing.

Nick C. Tang, Hsiao-Rong Tyan, Chiou-Ting Hsu, Hong-Yuan Mark Liao
A Coordinate Transformation System Based on the Human Feature Information

In this paper, we propose a method that finds features on a human object using the SURF algorithm and maps this information into 3D coordinates via a coordinate system transformation. First, we use a thinning algorithm to obtain the skeleton of the object and find the endpoints of the skeleton. Second, we use these endpoints to cluster the skeleton into six parts, and then cluster the human object according to the skeleton clustering result. Third, we use the SURF algorithm to find features in each part of the clustered object image; in this step, we also use the SAD measure to verify that the feature points produced by SURF are correct. Finally, we apply the coordinate system transformation, mapping the image coordinate system into the world coordinate system, and show the results in our experiments.

Shih-Ming Chang, Joseph Tsai, Timothy K. Shih, Hui-Huang Hsu
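An image-to-world coordinate transformation needs an extra constraint, since a pixel only defines a ray; a standard choice is to intersect the back-projected ray with a known world plane. A minimal pinhole-camera sketch (with assumed intrinsics K and world-to-camera extrinsics R, t; not the paper's exact formulation) follows:

```python
import numpy as np

def image_to_world(u, v, K, R, t, Z=0.0):
    """Back-project pixel (u, v) to the world plane at height Z.

    K: 3x3 intrinsic matrix; R, t: world-to-camera rotation and translation
    (x_cam = R @ x_world + t). Intersecting the viewing ray with the plane
    z = Z makes the 2D -> 3D inversion well defined.
    """
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # ray direction, camera frame
    Rw = R.T                                          # camera-to-world rotation
    cam_center = -Rw @ t                              # camera centre, world frame
    d = Rw @ ray                                      # ray direction, world frame
    s = (Z - cam_center[2]) / d[2]                    # scale to hit plane z = Z
    return cam_center + s * d
```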
An Effective Illumination Compensation Method for Face Recognition

Face recognition is very useful in many applications, such as safety and surveillance, intelligent robots, and computer login. The reliability and accuracy of such systems are influenced by variations in background illumination, so an effective illumination compensation method for face images is a key technology for face recognition. Our study uses several computer vision techniques to develop an illumination compensation algorithm for single-channel (such as grey-level or illumination-intensity) face images. The proposed method mainly consists of three processing modules: (1) Homomorphic Filtering, (2) Ratio Image Generation, and (3) Anisotropic Smoothing. Experiments show that, after applying the proposed method, face images can be recognized by conventional classifiers with high accuracy.

Yea-Shuan Huang, Chu-Yung Li
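The first module, homomorphic filtering, separates slowly varying illumination from detail-carrying reflectance by filtering in the log domain. A minimal frequency-domain sketch follows; the Gaussian-shaped high-emphasis transfer function and its gains are illustrative assumptions, not the paper's parameters:

```python
import numpy as np

def homomorphic_filter(img, cutoff=0.1, low_gain=0.5, high_gain=1.5):
    """Homomorphic filtering: log -> FFT -> high-emphasis filter -> IFFT -> exp.

    Illumination varies slowly (low frequencies) while reflectance carries
    detail (high frequencies); attenuating lows and boosting highs in the
    log domain compresses lighting variation while preserving structure.
    """
    logI = np.log1p(np.asarray(img, float))
    F = np.fft.fft2(logI)
    H, W = logI.shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    r2 = fy ** 2 + fx ** 2
    # Gaussian-shaped high-emphasis transfer function: low_gain at DC,
    # rising toward high_gain at high spatial frequencies
    Hf = low_gain + (high_gain - low_gain) * (1 - np.exp(-r2 / (2 * cutoff ** 2)))
    return np.exp(np.real(np.fft.ifft2(F * Hf))) - 1.0
```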
Shape Stylized Face Caricatures

Facial caricatures exaggerate key features to emphasize unique structural and personality traits. It is quite a challenge to retain the identity of the original person despite the exaggerations. We find that primitive shapes are well known for representing certain personality traits, in art and psychology literature. Unfortunately, current automated caricature generation techniques ignore the role of primitive shapes in stylization. These methods are limited to emphasizing key distances from a fixed Golden Ratio, or computing the best mapping in a proprietary example set of (real-image, cartoon portrait) pairs. We propose a novel stylization algorithm that allows expressive vector control with primitive shapes. We propose three shape-inspired ideas for caricature generation from input frontal face portraits: 1) Extrapolation in the Golden Ratio and Primitive Shape Spaces; 2) Use of art and psychology stereotype rules; 3) Constrained adaptation to a supplied cartoon mask. We adopt a recent mesh-less parametric image warp algorithm for the hair, face and facial features (eyes, mouth, eyebrows, nose, and ears) that provides fast results. The user can synthesize a range of caricatures by changing the number of identity constraints, relaxing shape change constraints, and controlling a global exaggeration scaling factor. Different cartoon templates and art rules can make the person’s caricature mimic different personalities, and yet retain basic identity. The proposed method is easy to use and implement, and can be extended to create animated facial caricatures for games, film and interactive media applications.

Nguyen Kim Hai Le, Yong Peng Why, Golam Ashraf
i-m-Breath: The Effect of Multimedia Biofeedback on Learning Abdominal Breath

Breathing is a natural and important exercise for human beings, and breathing correctly can make people healthier and even happier. i-m-Breath was developed to assist users in learning abdominal breathing; it uses Respiration Girth Sensors (RGS) to measure the user's breathing pattern and provides visual feedback to assist learning. In this paper, we study the effect of the biofeedback mechanism on learning abdominal breathing. We cooperated with the College of Medicine at National Taiwan University to conduct experiments exploring whether the biofeedback mechanism affects the learning of abdominal breathing. The results showed that i-m-Breath can help people shift their breathing habit from chest to abdominal breathing, and in the future the system will be used in hospitals. This study is important in providing a biofeedback mechanism that helps users better understand their breathing patterns and improve their breathing habits.

Meng-Chieh Yu, Jin-Shing Chen, King-Jen Chang, Su-Chu Hsu, Ming-Sui Lee, Yi-Ping Hung
Backmatter
Metadaten
Titel
Advances in Multimedia Modeling
herausgegeben von
Kuo-Tien Lee
Wen-Hsiang Tsai
Hong-Yuan Mark Liao
Tsuhan Chen
Jun-Wei Hsieh
Chien-Cheng Tseng
Copyright-Jahr
2011
Verlag
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-17832-0
Print ISBN
978-3-642-17831-3
DOI
https://doi.org/10.1007/978-3-642-17832-0