
2008 | Book

Advances in Multimedia Modeling

14th International Multimedia Modeling Conference, MMM 2008, Kyoto, Japan, January 9-11, 2008. Proceedings

Edited by: Shin’ichi Satoh, Frank Nack, Minoru Etoh

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

Welcome to the 14th International Multimedia Modeling Conference (MMM2008), held January 9–11, 2008 at Kyoto University, Kyoto, Japan. MMM is a leading international conference for researchers and industry practitioners to share their new ideas, original research results and practical development experiences from all multimedia-related areas. It was a great honor to have MMM2008, one of the most long-standing multimedia conferences, at one of the most beautiful and historically important Japanese cities. Kyoto was an ancient capital of Japan, and was and still is at the heart of Japanese culture and history. Kyoto in winter may distinctively offer the sober atmosphere of an ink painting. You can enjoy old shrines and temples which are designated as World Heritage Sites. The conference venue was the Clock Tower Centennial Hall in Kyoto University, which is one of the oldest universities in Japan. MMM2008 featured a comprehensive program including three keynote talks, six oral presentation sessions, and two poster and demo sessions. The 133 submissions included a large number of high-quality papers in multimedia content analysis, multimedia signal processing and communications, and multimedia applications and services. We thank our 137 Technical Program Committee members and reviewers who spent many hours reviewing papers and providing valuable feedback to the authors. Based on the 3 or 4 reviews per paper, the Program Chairs decided to accept only 23 as oral papers and 24 as poster papers, where each type of presentation could in addition present the work as a demo. The acceptance rate of 36% follows the MMM tradition of fulfilling fruitful discussions throughout the conference.

Table of Contents

Frontmatter

Media Understanding

A Novel Approach for Filtering Junk Images from Google Search Results

Keyword-based image search engines such as Google Images are now very popular for accessing the large number of images on the Internet. Because only the text information directly or indirectly linked to the images is used for image indexing and retrieval, most existing image search engines such as Google Images may return large numbers of junk images which are irrelevant to the given queries. To filter out the junk images from Google Images, we have developed a kernel-based image clustering technique to partition the images returned by Google Images into multiple visually similar clusters. In addition, users are allowed to input their feedback for updating the underlying kernels to achieve a more accurate characterization of the diversity of visual similarities between the images. To help users assess the goodness of image kernels and the relevance between the returned images, a novel framework is developed to achieve a more intuitive visualization of large numbers of returned images according to their visual similarity. Experiments on diverse queries on Google Images have shown that our proposed algorithm can filter out the junk images effectively. An online demo is also released for public evaluation at http://www.cs.uncc.edu/~jfan/google_demo/.
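
The filtering step hinges on partitioning the returned images into visually similar clusters with a kernel-based clustering method. The sketch below illustrates the general idea with kernel k-means over an RBF kernel on precomputed image feature vectors; the feature extraction, kernel choice, and cluster count are illustrative assumptions rather than the authors' exact formulation.

    import numpy as np

    def rbf_kernel(X, gamma=0.5):
        # Pairwise RBF kernel over image feature vectors (assumed given).
        sq = np.sum(X ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        return np.exp(-gamma * d2)

    def kernel_kmeans(K, n_clusters=5, n_iter=20, seed=0):
        # Assign each image to the cluster whose feature-space centroid
        # is nearest, using only the kernel matrix K.
        n = K.shape[0]
        labels = np.random.default_rng(seed).integers(n_clusters, size=n)
        for _ in range(n_iter):
            dist = np.full((n, n_clusters), np.inf)
            for c in range(n_clusters):
                mask = labels == c
                if not mask.any():
                    continue
                m = mask.sum()
                # ||phi(x) - mu_c||^2 expanded in kernel terms
                dist[:, c] = (np.diag(K)
                              - 2.0 * K[:, mask].sum(axis=1) / m
                              + K[np.ix_(mask, mask)].sum() / m ** 2)
            labels = dist.argmin(axis=1)
        return labels

Clusters dominated by images far from the query concept would then be candidates for junk removal; in the paper, the kernels are additionally updated from user feedback.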

Yuli Gao, Jianping Fan, Hangzai Luo, Shin’ichi Satoh
Object-Based Image Retrieval Beyond Visual Appearances

The performance of object-based image retrieval systems remains unsatisfactory, as it relies heavily on visual similarity and regularity among images of the same semantic class. In order to retrieve images beyond their visual appearances, we propose a novel image representation, i.e. the bag of visual synsets. A visual synset is defined as a probabilistic relevance-consistent cluster of visual words (quantized vectors of region descriptors such as SIFT), in which the member visual words w induce similar semantic inferences P(c|w) towards the image class c. The visual synset can be obtained by finding an optimal distributional clustering of visual words, based on the Information Bottleneck principle. Testing on the Caltech-256 dataset shows that by fusing the visual words in a relevance-consistent way, the visual synset can partially bridge the visual differences between images of the same class and deliver satisfactory retrieval of relevant images with different visual appearances.
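
As a rough illustration of distributional clustering of visual words, the sketch below groups words whose class posteriors P(c|w) are close in Jensen-Shannon divergence, the merge cost used by agglomerative Information Bottleneck variants; the counts matrix and cluster count are assumed inputs, and the paper's actual optimization may differ.

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, fcluster

    def visual_synsets(counts, n_synsets=50):
        # counts[w, c]: occurrences of visual word w in images of class c.
        p_c_given_w = counts / counts.sum(axis=1, keepdims=True)
        # Words with similar semantic inference P(c|w) get merged.
        d = pdist(p_c_given_w, metric="jensenshannon")
        Z = linkage(d, method="average")
        return fcluster(Z, t=n_synsets, criterion="maxclust")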

Yan-Tao Zheng, Shi-Yong Neo, Tat-Seng Chua, Qi Tian
MILC2: A Multi-Layer Multi-Instance Learning Approach to Video Concept Detection

Video is a kind of structured data with multi-layer (ML) information; e.g., a shot consists of three layers: shot, key-frame, and region. Moreover, a multi-instance (MI) relation is embedded along the consecutive layers. Both the ML structure and the MI relation are essential for video concept detection. Previous work [5] dealt with the ML structure and the MI relation by constructing an MLMI kernel in which each layer is assumed to contribute equally. However, such an equal weighting technique can neither model the MI relation well nor handle the ambiguity propagation problem, i.e., the propagation of the uncertainty of sub-layer labels through multiple layers, as it has been proved that different layers contribute differently to the kernel. In this paper, we propose a novel algorithm named MILC² (Multi-Layer Multi-Instance Learning with Inter-layer Consistency Constraint) to tackle the ambiguity propagation problem: an inter-layer consistency constraint is explicitly introduced to measure the disagreement between layers, so that the MI relation is better modeled. The learning task is formulated in a regularization framework with three components: hyper-bag prediction error, inter-layer inconsistency measure, and classifier complexity. We apply the proposed MILC² to video concept detection over the TRECVID 2005 development corpus and report better performance than both the standard Support Vector Machine and the MLMI kernel method.
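
The three components named above suggest a regularized objective of roughly the following shape (a hedged reconstruction; the abstract does not give the exact losses or weights):

$$\min_{f}\; \sum_{i}\ell\big(f(B_i),\,y_i\big) \;+\; \lambda_1\sum_{i}\sum_{l} D\big(f_l(B_i),\,f_{l+1}(B_i)\big) \;+\; \lambda_2\,\Omega(f)$$

where ℓ is the hyper-bag prediction error on bag B_i with label y_i, D measures the inconsistency between predictions at adjacent layers l and l+1, and Ω(f) penalizes classifier complexity.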

Zhiwei Gu, Tao Mei, Jinhui Tang, Xiuqing Wu, Xian-Sheng Hua

Poster I

Creative Media

A Multimodal Input Device for Music Authoring for Children

We present a novel interface for a digital music authoring tool designed for children. Our system consists of a unique multimodal input device and a corresponding platform to support children's music authoring. By departing from the conventional design approaches of existing tools, our system provides a simple and intuitive environment for young children. The proposed interface was tested with children in a focus group study, and the results are encouraging in that the children were able to perform some of the complicated multitrack recording tasks within a single session lasting 15 to 20 minutes.

Yasushi Akiyama, Sageev Oore
Free-Shaped Video Collage

With the explosive growth of multimedia data, video presentation has become an important technology for fast browsing of video content. In this paper, we present a novel video presentation technique called “Free-Shaped Video Collage” (FS-Collage), which is motivated by and built upon our previous work on Video Collage [3]. Video Collage is a kind of static summary which selects the most representative regions of interest (ROIs) from a video and seamlessly arranges them on a synthesized image. Unlike Video Collage, in which the shapes of both the ROIs and the final collage are fixed as rectangles, FS-Collage supports arbitrary ROI shapes and a set of collage templates. Furthermore, we design three ROI arrangement schemes (book, diagonal, and spiral) to suit different video genres. We formulate the generation of FS-Collage as an energy minimization problem and solve it with a random sampling process. The experimental results show that our FS-Collage achieves satisfying performance.
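
The abstract casts collage generation as energy minimization solved by random sampling. Below is a minimal sketch of such a sampler, with the proposal and energy functions left abstract (both are assumptions standing in for the paper's concrete ROI arrangement and energy terms):

    import random

    def sample_minimize(init, propose, energy, n_samples=10000, seed=0):
        # Repeatedly propose a random ROI arrangement and keep the
        # lowest-energy collage found so far.
        rng = random.Random(seed)
        best, best_e = init, energy(init)
        for _ in range(n_samples):
            cand = propose(best, rng)
            e = energy(cand)
            if e < best_e:
                best, best_e = cand, e
        return best, best_e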

Bo Yang, Tao Mei, Li-Feng Sun, Shi-Qiang Yang, Xian-Sheng Hua
Aesthetics-Based Automatic Home Video Skimming System

In this paper, we propose an automatic home video skimming system based on media aesthetics. Unlike other similar works, the proposed system takes video editing theory into account and realizes the idea of computational media aesthetics. Given a home video and an incidental piece of background music, the system automatically generates a music video (MV) style skimming video, with consideration of video quality, music tempo, and editing theory. The background music is analyzed so that the visual rhythm created by shot changes in the skimming video is synchronous with the music tempo. Our work focuses on rhythm over other aesthetic features, as it is more recognizable and better suited to describing the relationship between video and audio. Experiments show that the generated skimming video is effective in representing the original input video, and that the audio-video conformity is satisfactory.
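
One simple way to synchronize shot changes with the music tempo, as the system requires, is to snap each tentative cut to the nearest detected beat. The sketch below assumes beat times have already been extracted from the music; it illustrates the idea, not the authors' specific algorithm.

    import bisect

    def snap_cuts_to_beats(cut_times, beat_times):
        # Move each tentative shot boundary to the nearest music beat.
        beats = sorted(beat_times)
        snapped = []
        for t in cut_times:
            i = bisect.bisect_left(beats, t)
            nearby = beats[max(0, i - 1):i + 1]
            snapped.append(min(nearby, key=lambda b: abs(b - t)))
        return snapped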

Wei-Ting Peng, Yueh-Hsuan Chiang, Wei-Ta Chu, Wei-Jia Huang, Wei-Lun Chang, Po-Chung Huang, Yi-Ping Hung
Using Fuzzy Lists for Playlist Management

The increasing popularity of music recommendation systems and the recent growth of online music communities emphasize the need for effective playlist management tools able to create, share, and personalize playlists. This paper proposes the development of generic playlists and presents a concrete scenario to illustrate their possibilities. Additionally, to enable the development of playlist management tools, a formal foundation is provided: the concept of fuzzy lists is defined and a corresponding algebra is developed. Fuzzy lists offer a solution well suited to meet the demands of playlist management.
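
To give a flavor of the fuzzy-list idea, the sketch below models a playlist entry with a membership degree and implements the standard fuzzy-set union (maximum membership). This is a toy illustration only; the algebra defined in the paper also accounts for list order, which this sketch omits.

    from dataclasses import dataclass

    @dataclass
    class FuzzyEntry:
        track: str
        membership: float  # degree of belonging, in [0, 1]

    def fuzzy_union(a, b):
        # Standard fuzzy union: keep each track with its maximum degree.
        degrees = {}
        for e in list(a) + list(b):
            degrees[e.track] = max(degrees.get(e.track, 0.0), e.membership)
        return [FuzzyEntry(t, m) for t, m in degrees.items()]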

François Deliège, Torben Bach Pedersen

Visual Content Representation

Tagging Video Contents with Positive/Negative Interest Based on User’s Facial Expression

Recently, a vast number of videos have become available, making it hard for people to choose what to watch. To address this problem, we propose a tagging system for video content based on facial expressions, which can be used for video content recommendation. The viewer's face, captured by a camera, is extracted by Elastic Bunch Graph Matching, and the facial expression is recognized by Support Vector Machines. The facial expression is classified into Neutral, Positive, Negative, and Rejective. Recognition results are recorded as “facial expression tags” in synchronization with the video content. Experimental results achieved an average recall rate of 87.61% and an average precision rate of 88.03%.

Masanori Miyahara, Masaki Aoki, Tetsuya Takiguchi, Yasuo Ariki
Snap2Play: A Mixed-Reality Game Based on Scene Identification

The ubiquity of camera phones provides a convenient platform for developing immersive mixed-reality games. In this paper we introduce such a game, loosely based on the popular card game “Memory”, in which players are asked to match a pair of identical cards among a set of overturned cards by revealing only two cards at a time. In our game, the players are asked to match a “physical card”, which is an image of a scene in the real world, to a “digital card”, which corresponds to a scene in a virtual world. The objective is to convey a mixed-reality sensation. Cards are matched with a scene identification engine consisting of multiple classifiers trained on previously collected images. We present our overall game design as well as implementation details and results. Additionally, we describe how we constructed our scene identification engine and report its performance.

Tat-Jun Chin, Yilun You, Celine Coutrix, Joo-Hwee Lim, Jean-Pierre Chevallet, Laurence Nigay
Real-Time Multi-view Object Tracking in Mediated Environments

In this paper, we present a robust approach to real-time tracking of multiple objects in mediated environments using a set of calibrated color and IR cameras. Challenges addressed in this paper include robust object tracking in the presence of color projections on the ground plane and partial or complete occlusions. To improve tracking in such complex environments, false candidates introduced by ground-plane projections or by mismatching objects between views are removed using the epipolar constraint and the planar homography. A mixture of Gaussians is learned with the expectation-maximization algorithm for each target to further refine the 3D location estimates. Experimental results demonstrate that the proposed approach is capable of robust and accurate 3D object tracking in a complex environment with a large amount of visual projection and partial or complete occlusions.
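
The false-candidate removal relies on the epipolar constraint between camera views. A minimal sketch of that standard test follows; the fundamental matrix F and the pixel tolerance are assumed to come from calibration.

    import numpy as np

    def epipolar_consistent(x1, x2, F, tol=2.0):
        # A genuine 3D target seen at x1 in view 1 must lie close to the
        # epipolar line F @ x1 in view 2; large residuals flag false
        # candidates such as ground-plane projections.
        x1h = np.append(np.asarray(x1, float), 1.0)
        x2h = np.append(np.asarray(x2, float), 1.0)
        line = F @ x1h
        dist = abs(x2h @ line) / np.hypot(line[0], line[1])
        return dist < tol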

Huan Jin, Gang Qian, David Birchfield
Reconstruct 3D Human Motion from Monocular Video Using Motion Library

In this paper, we present a new approach to reconstructing 3D human motion from video clips with the assistance of a precaptured motion library. Given a monocular video clip of one person performing some kind of locomotion and a motion library consisting of similar motions, we can infer the 3D motion from the video clip. We segment the video clip into fixed-length segments and, using a shape matching method, find several candidate motion sequences from the motion library for every video segment; from these sequences a coarse motion clip is generated by performing a continuity test on the boundaries of the candidate sequences. We propose a pose deformation algorithm to refine the coarse motion, and to guarantee the naturalness of the recovered motion, we apply a motion splicing algorithm to the motion clip. We tested the approach using synthetic and real sports videos, and the experimental results show its effectiveness.

Wenzhong Wang, Xianjie Qiu, Zhaoqi Wang, Rongrong Wang, Jintao Li
An Implicit Active Contour Model for Feature Regions and Lines

We present a level-set based implicit active contour method which can detect innermost homogeneous regions, which are often considered feature regions or lines depending on their width. The curve evolution algorithm is derived from the optimization of an energy defined for the evolving curves. The energy has three basic terms: the first is the energy of the regions inside the curves, the second the energy of the bands inside the curves, and the third the energy of the bands outside the curves. If the band width is small, the total energy is minimized when the evolving curves lie at the boundaries of the innermost homogeneous regions, and the regions inside the curves are considered feature regions. Our method contrasts with the Chan-Vese model, which does not have the notion of innermost homogeneous regions but tries to find curves such that the regions inside and outside them are both homogeneous. Our model approaches the Chan-Vese model as the band width is increased, and is equivalent to it when the band width is sufficiently large that all points inside/outside the curves lie within the bands inside/outside the curves, respectively.
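
In Chan-Vese-style notation, the three terms described above suggest an energy of roughly the following form, where in(C) is the region inside the curves and B_in(C), B_out(C) are the bands of width w just inside and outside them (a hedged reconstruction, not the paper's exact functional):

$$E(C) = \mu\int_{\mathrm{in}(C)}\lvert I-c_{\mathrm{in}}\rvert^{2}\,dx \;+\; \lambda_{1}\int_{B_{\mathrm{in}}(C)}\lvert I-b_{\mathrm{in}}\rvert^{2}\,dx \;+\; \lambda_{2}\int_{B_{\mathrm{out}}(C)}\lvert I-b_{\mathrm{out}}\rvert^{2}\,dx$$

with c_in, b_in, b_out the mean intensities of the respective regions. As w grows, the bands cover everything inside and outside the curves and the energy reduces to the Chan-Vese form, matching the equivalence the abstract states.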

Ho-Ryong Jung, Moon-Ryul Jung
New Approach for Hierarchical Classifier Training and Multi-level Image Annotation

In this paper, we propose a novel algorithm to achieve automatic multi-level image annotation by incorporating concept ontology and multi-task learning for hierarchical image classifier training. To achieve more reliable image classifier training in a high-dimensional heterogeneous feature space, a new algorithm is proposed that incorporates multiple kernels for diverse image similarity characterization, and a multiple kernel learning algorithm is developed to train the SVM classifiers for the atomic image concepts at the first level of the concept ontology. To enable automatic multi-level image annotation, a novel hierarchical boosting algorithm is proposed that incorporates the concept ontology and multi-task learning to achieve hierarchical image classifier training.
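
At the heart of the multiple kernel learning step is a convex combination of base kernels, one per feature or similarity type. A minimal sketch (the base kernels and weights are assumed to be given or learned elsewhere):

    import numpy as np

    def combine_kernels(kernels, betas):
        # K = sum_m beta_m * K_m with beta_m >= 0 and sum(beta) = 1;
        # the combined K is then used to train a standard SVM.
        betas = np.asarray(betas, dtype=float)
        assert np.all(betas >= 0) and np.isclose(betas.sum(), 1.0)
        return sum(b * K for b, K in zip(betas, kernels))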

Jianping Fan, Yuli Gao, Hangzai Luo, Shin’ichi Satoh
Extracting Text Information for Content-Based Video Retrieval

In this paper we present a novel video text detection and segmentation system. In the detection stage, we utilize an edge density feature, a pyramid strategy, and some weak rules to search for text regions, so that a high detection rate can be achieved. Meanwhile, to eliminate false alarms and improve the precision rate, a multilevel verification strategy is adopted. In the segmentation stage, a precise polarity estimation algorithm is first applied. Then, multiple frames containing the same text are integrated to enhance the contrast between text and background. Finally, a novel connected-components-based binarization algorithm is proposed to improve the recognition rate. Experimental results show the superior performance of the proposed system.
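
The detection stage starts from an edge density feature. A plausible blockwise version using OpenCV follows; the block size and Canny thresholds are illustrative assumptions, not the paper's settings.

    import cv2

    def edge_density_map(gray, block=16):
        # Text regions tend to have high edge density; thresholding this
        # blockwise map yields candidate text blocks. `gray` is a uint8
        # grayscale frame.
        edges = cv2.Canny(gray, 100, 200)
        h, w = edges.shape
        hb, wb = h // block, w // block
        e = edges[:hb * block, :wb * block]
        return e.reshape(hb, block, wb, block).mean(axis=(1, 3)) / 255.0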

Lei Xu, Kongqiao Wang
Real-Time Video Surveillance Based on Combining Foreground Extraction and Human Detection

In this paper, we present an adaptive foreground object extraction algorithm for real-time video surveillance, in conjunction with a human detection technique applied to the extracted foreground regions using the AdaBoost learning algorithm and Histograms of Oriented Gradients (HOG) descriptors. Furthermore, a RANSAC-based temporal tracking algorithm is applied to refine and trace the detected human windows in order to increase the detection accuracy and reduce the false alarm rate. The traditional background subtraction technique usually cannot work well in scenes with lighting variations. The proposed algorithm employs a two-stage foreground/background classification procedure to perform background subtraction and to remove the undesirable subtraction results caused by shadow, automatic white balance, and sudden illumination changes. Experimental results on real surveillance videos demonstrate the good performance of the proposed adaptive foreground extraction algorithm and human detection system in a variety of environments with lighting variations.
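
The detection component pairs AdaBoost-selected HOG features with foreground extraction. As a rough stand-in, OpenCV's stock HOG pedestrian detector shows what running such a detector inside extracted foreground regions looks like; the paper trains its own classifier, so this default model is only a placeholder.

    import cv2

    # OpenCV's built-in HOG person detector (placeholder for the
    # paper's AdaBoost/HOG classifier).
    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

    def detect_people(foreground_roi):
        # Run the detector only inside the extracted foreground region
        # to cut cost and false alarms.
        boxes, _weights = hog.detectMultiScale(foreground_roi,
                                               winStride=(8, 8))
        return boxes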

Hui-Chi Zeng, Szu-Hao Huang, Shang-Hong Lai
Detecting and Clustering Multiple Takes of One Scene

In applications such as video post-production, users are confronted with large amounts of redundant unedited raw material, called rushes. Viewing and organizing this material are crucial but time-consuming tasks. Typically, multiple but slightly different takes of the same scene can be found in the rushes video. We propose a method for detecting and clustering takes of one scene shot from the same or very similar camera positions. It uses a variant of the LCSS algorithm to find matching subsequences in sequences of visual features extracted from the source video. Hierarchical clustering is used to group the takes of one scene. The approach is evaluated in terms of correctly assigned takes using manually annotated ground truth.
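
The take-matching step uses an LCSS variant over sequences of visual features. For reference, here is a basic LCSS for real-valued features, where two frames count as matching when their feature distance is below a tolerance eps (the distance function and tolerance are assumptions):

    def lcss(a, b, dist, eps=0.1):
        # Longest common subsequence length between feature sequences a, b.
        n, m = len(a), len(b)
        dp = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if dist(a[i - 1], b[j - 1]) <= eps:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[n][m]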

Werner Bailer, Felix Lee, Georg Thallinger
An Images-Based 3D Model Retrieval Approach

This paper presents an image-based 3D model retrieval method in which each model is described by six 2D images. The images are generated in three steps: 1) the model is normalized based on the distribution of the surface normal directions; 2) the normalized model is uniformly sampled to generate a number of random points; 3) the random points are projected along six directions to create six images, each of which is described by a Zernike moment feature. When comparing two models, the six images of each model are naturally divided into three pairs, and the similarity between the two models is calculated by summing the distances of all corresponding pairs. The effectiveness of our method is verified by comparative experiments. Meanwhile, high matching speed is achieved: it takes about 3e-5 seconds to compare two models on a computer with a Pentium IV 3.00 GHz CPU.

Yuehong Wang, Rujie Liu, Takayuki Baba, Yusuke Uehara, Daiki Masumoto, Shigemi Nagata
‘Oh Web Image, Where Art Thou?’

Web image search today is mostly keyword-based and explores the content surrounding the image. Searching for images related to a certain location quickly shows that Web images typically do not reveal an explicit relation to an actual geographic position. The geographic semantics of Web images are either not available at all or hidden somewhere within the Web pages' content. Our spatial search engine crawls and identifies Web pages with a spatial relationship. Analyzing location-related Web pages, we identify photographs based on content-based image analysis as well as image context. Following the photograph classification, a location-relevance classification process evaluates image context and content against the previously identified address. The results of our experiments show that our approach is a viable method for Web image location assessment. Thus, a large number of potentially geographically related Web images are unlocked for commercially relevant spatial Web image search.

Dirk Ahlers, Susanne Boll
Complementary Variance Energy for Fingerprint Segmentation

This paper presents a method for fingerprint segmentation which executes in two phases. The enrolled image is partitioned into blocks, and a preliminary segmentation is first carried out by thresholding the blockwise variance. The segmentation result is then refined using energy information, where the blocks with the most “reliable” energy are selected to polish the preliminary result and yield the final segmentation map. The reliability of the method has been demonstrated using NIST and FVC data sets, and a comparison with other methods is presented.
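
A minimal sketch of the first phase, blockwise variance thresholding, is shown below; the block size and threshold are illustrative, and the second, energy-based refinement phase is omitted.

    import numpy as np

    def variance_mask(img, block=16, thresh=100.0):
        # High-variance blocks are likely ridge (foreground) area;
        # low-variance blocks are background.
        h, w = img.shape
        hb, wb = h // block, w // block
        blocks = img[:hb * block, :wb * block].astype(float)
        blocks = blocks.reshape(hb, block, wb, block)
        return blocks.var(axis=(1, 3)) > thresh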

Z. Hou, W. Y. Yau, N. L. Than, W. Tang
Similarity Search in Multimedia Time Series Data Using Amplitude-Level Features

Effective similarity search in multimedia time series such as video or audio sequences is important for content-based multimedia retrieval applications. We propose a framework that extracts a sequence of local features from large multimedia time series that reflect the characteristics of complex structured time series more accurately than global features. In addition, we propose a set of suitable local features that can be derived by our framework. These features are scanned from a time series amplitude level by amplitude level and are therefore called amplitude-level features. Our experimental evaluation shows that our method models the intuitive similarity of multimedia time series better than existing techniques.

Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter Kunath, Alexey Pryakhin, Matthias Renz
Sound Source Localization with Non-calibrated Microphones

We propose a new method for localizing a sound source in a known space with non-calibrated microphones. Our method does not need the accurate microphone positions that are required by traditional sound source localization, and it can make use of a wide variety of microphone layouts in a large space because no calibration step is needed when installing the microphones. After a number of sampling points have been stored in a database, our system can estimate the sampling point nearest to a sound by utilizing the set of time delays between microphone pairs. We conducted a simulation experiment to determine the best microphone layout for maximizing localization accuracy. We also conducted a preliminary experiment in a real environment and obtained promising results.
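
Once the database of sampling points is built, localization reduces to a nearest-neighbor lookup over time-delay vectors, roughly as sketched below (the delay extraction itself is assumed done):

    import numpy as np

    def localize(query_delays, db_points, db_delays):
        # db_delays[i] holds the pairwise microphone time delays measured
        # at sampling point db_points[i]; return the closest stored point.
        d = np.linalg.norm(np.asarray(db_delays) - query_delays, axis=1)
        return db_points[int(d.argmin())]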

Tomoyuki Kobayashi, Yoshinari Kameda, Yuichi Ohta
PriSurv: Privacy Protected Video Surveillance System Using Adaptive Visual Abstraction

Recently, video surveillance has received a lot of attention as a technology for realizing a secure and safe community. Video surveillance is useful for crime deterrence and investigations, but may cause an invasion of privacy. In this paper, we propose a video surveillance system named PriSurv, which is characterized by visual abstraction. The system protects the privacy of objects in a video by referring to their privacy policies, which are determined according to the closeness between objects in the video and the viewers monitoring it. A prototype of PriSurv is able to protect privacy adaptively through visual abstraction.

Kenta Chinomi, Naoko Nitta, Yoshimichi Ito, Noboru Babaguchi
Distribution-Based Similarity for Multi-represented Multimedia Objects

In modern multimedia databases, objects can be represented by a large variety of feature representations. In order to employ all available information in the best possible way, a joint statement about object similarity must be derived. In this paper, we present a novel technique for multi-represented similarity estimation which is based on probability distributions modeling the connection between a distance value and object similarity. To tune these distribution functions to model the similarity in each representation, we propose a bootstrapping approach that maximizes the agreement between the distributions. We thus capture the general notion of similarity that is implicitly given by the distance relationships in the available feature representations, so our approach does not need any training examples. In our experimental evaluation, we demonstrate that our new approach offers superior precision and recall compared to standard similarity measures on a real-world audio data set.

Hans-Peter Kriegel, Peter Kunath, Alexey Pryakhin, Matthias Schubert

Poster II

Appropriate Segment Extraction from Shots Based on Temporal Patterns of Example Videos

Videos are composed of shots, each of which is recorded continuously by a camera, and video editing can be considered a process of re-sequencing shots selected from original videos. Shots usually include redundant segments, which are often edited out by professional editors. This paper proposes a method for automatically extracting from shots the appropriate segments to be included in edited videos, based on temporal patterns of audio and visual features learned with Hidden Markov Models from appropriate and redundant segments in example videos. The effectiveness of the proposed method is verified by experiments using original shots extracted from movies and, as example videos, appropriate segments extracted from movie trailers.

Yousuke Kurihara, Naoko Nitta, Noboru Babaguchi
Fast Segmentation of H.264/AVC Bitstreams for On-Demand Video Summarization

Video summarization methods need fast segmentation of a video into smaller units as a first step, especially when used in an on-demand fashion. We propose an efficient segmentation algorithm for H.264/AVC bitstreams that is able to segment a video in approximately 10% of the time required to decode it. This is possible because our approach uses only features available after entropy decoding, which is the very first stage of the decoding process. More precisely, we use a combination of two features with different characteristics, both particularly appropriate to H.264/AVC, to decide whether a new segment starts: (1) the L1 distance between partition histograms and (2) the ratio of intra-coded macroblocks on a per-frame basis. Our results show that this approach performs well and works for several different encoders used in practice today.
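
A schematic version of the per-frame decision, combining the two entropy-decoding-level features, might look as follows; the thresholds and the OR-combination are illustrative assumptions, not the paper's tuned rule.

    def is_segment_boundary(hist_prev, hist_cur, intra_ratio,
                            t_hist=0.4, t_intra=0.5):
        # (1) L1 distance between macroblock-partition histograms of
        #     consecutive frames, and
        # (2) ratio of intra-coded macroblocks in the current frame.
        l1 = sum(abs(a - b) for a, b in zip(hist_prev, hist_cur))
        return l1 > t_hist or intra_ratio > t_intra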

Klaus Schöffmann, Laszlo Böszörmenyi
Blurred Image Detection and Classification

Digital photos are massively produced as digital cameras become popular; however, not every photo has good quality. Blur is one of the most common forms of image quality degradation and is caused by various factors. In this paper, we propose a scheme to detect blurred images and classify them into several categories. The blur detector uses support vector machines to estimate the blur extent of an image. Blurred images are further classified as either locally or globally blurred. For globally blurred images, we estimate their point spread functions and classify them into camera-shake or out-of-focus images. For locally blurred images, we find the blurred regions using a segmentation method, and point spread function estimation on the blurred region can sort out images with depth of field or a moving object. The blur detection and classification processes are fully automatic and can help users filter out blurred images before importing photos into their digital photo albums.

Ping Hsu, Bing-Yu Chen
Cross-Lingual Retrieval of Identical News Events by Near-Duplicate Video Segment Detection

Recently, for reusing the large quantities of accumulated news video, technology for news topic searching and tracking has become necessary. Moreover, since we need to understand a certain topic from various viewpoints, we focus on detecting identical events in news programs from different countries. Currently, text information is generally used to retrieve news video. However, cross-lingual retrieval is complicated by machine translation performance and by differences in viewpoints and cultures. In this paper, we propose a cross-lingual retrieval method for detecting identical news events that exploits image information together with text information. In an experiment, we verified the effectiveness of exploiting near-duplicate video segments and the possibility of improving retrieval performance.

Akira Ogawa, Tomokazu Takahashi, Ichiro Ide, Hiroshi Murase
Web Image Gathering with a Part-Based Object Recognition Method

We propose a new Web image gathering system which employs a part-based object recognition method. The novelty of our work is in introducing the bag-of-keypoints representation into a Web image gathering task, instead of the color histograms or segmented regions our previous system used. The bag-of-keypoints representation has been proven to have an excellent ability to represent image concepts in the context of visual object categorization/recognition, in spite of its simplicity. Most object recognition work assumes that complete training data is available. In the Web image gathering task, on the other hand, since images associated with the given keywords are gathered from the Web fully automatically, complete training images are not available.

In this paper, we combine HTML-based automatic selection of positive training images with bag-of-keypoints-based image selection using an SVM, a supervised machine learning method. This combination enables the system to gather many images related to given concepts with high precision, fully automatically and with no human intervention. Our main objective is to examine whether the bag-of-keypoints model is also effective for the Web image gathering task, where training images always include some noise. Our experiments show that the new system greatly outperforms our previous systems, other systems, and Google Image Search.
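
For reference, a compact bag-of-keypoints pipeline: local descriptors are quantized against a k-means codebook, and each image becomes a word-frequency histogram that an SVM can consume. Descriptor extraction and the vocabulary size are assumed inputs; this is the generic representation, not this system's tuned setup.

    import numpy as np
    from scipy.cluster.vq import kmeans2, vq

    def bag_of_keypoints(descriptor_sets, k=500):
        # descriptor_sets: one (n_i, d) float array of local descriptors
        # (e.g., SIFT) per image.
        codebook, _ = kmeans2(np.vstack(descriptor_sets), k, minit="++")
        hists = []
        for d in descriptor_sets:
            words, _ = vq(d, codebook)
            h = np.bincount(words, minlength=k).astype(float)
            hists.append(h / max(h.sum(), 1.0))
        return np.array(hists), codebook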

Keiji Yanai
A Query Language Combining Object Features and Semantic Events for Surveillance Video Retrieval

In this paper, we propose a novel query language for video indexing and retrieval that (1) enables queries both at the image level and at the semantic level, (2) enables users to define their own scenarios based on semantic events, and (3) retrieves videos with both exact matching and similarity matching. For a query language, four main issues must be addressed: data modeling, query formulation, query parsing, and query matching. In this paper we focus on and contribute to data modeling, query formulation, and query matching. We currently use color histograms and SIFT features at the image level and 10 types of events at the semantic level. We have tested the proposed query language for the retrieval of surveillance videos of a metro station. In our experiments the database contains more than 200 indexed physical objects and 48 semantic events. The results using different types of queries are promising.

Thi-Lan Le, Monique Thonnat, Alain Boucher, François Brémond
Semantic Quantization of 3D Human Motion Capture Data Through Spatial-Temporal Feature Extraction

3D motion capture is a form of multimedia data that is widely used in animation and medical fields (such as physical medicine and rehabilitation, where body joint analysis is needed). These applications typically create large repositories of motion capture data and need efficient and accurate content-based retrieval techniques. 3D motion capture data takes the form of multi-dimensional time series data. To reduce the dimensionality of human motion data while maintaining semantically important features, we quantize human motion data by extracting spatial-temporal features through SVD and translate them into a 1-dimensional sequential representation through our proposed sGMMEM (semantic Gaussian Mixture Modeling with EM). We thus achieve good classification accuracies for primitive human motion categories (walking 92.85%, run 91.42%, jump 94.11%) and even for subtle categories (dance 89.47%, laugh 83.33%, basketball signal 85.71%, golf putting 80.00%).

Yohan Jin, B. Prabhakaran
Fast Intermode Decision Via Statistical Learning for H.264 Video Coding

Although the variable-block-size motion compensation scheme significantly reduces the compensation error, it also tremendously increases the computational complexity of motion estimation (ME). To reduce the complexity of the variable-block-size ME algorithm, we propose a statistical learning approach that simplifies the computation involved in sub-macroblock (sub-MB) mode selection. Representative features are extracted during ME with fixed block sizes. Then, an off-line pre-classification approach is used to predict the most probable sub-MB modes according to the run-time features, so that ME needs to be performed only for those probable sub-MB modes. Experimental results show that the computational complexity is significantly reduced while the video quality degradation and bitrate increase are negligible.
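
The off-line pre-classification could be realized with any standard classifier mapping run-time ME features to probable sub-MB modes; the sketch below uses a decision tree, with the feature set and model choice being assumptions rather than the paper's specification.

    from sklearn.tree import DecisionTreeClassifier

    def train_mode_predictor(features, modes):
        # features: statistics gathered during fixed-size ME (e.g., SAD
        # values, motion vector costs); modes: observed best sub-MB modes.
        clf = DecisionTreeClassifier(max_depth=8)
        clf.fit(features, modes)
        return clf  # clf.predict(x) -> candidate modes to actually search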

Wei-Hau Pan, Chen-Kuo Chiang, Shang-Hong Lai
A Novel Motion Estimation Method Based on Normalized Cross Correlation for Video Compression

In this paper we propose to use normalized cross correlation (NCC) as the similarity measure for block-based motion estimation (ME), replacing the sum of absolute differences (SAD). NCC is a more suitable similarity measure than SAD for reducing temporal redundancy in video compression, since using NCC in motion estimation yields a flatter residual after motion compensation. A flatter residual results in a larger DC term and smaller AC terms, which means less information is lost after quantization; thus we obtain better quality in the compressed video. Experimental results show that the proposed NCC-based motion estimation algorithm provides similar PSNR but better SSIM than traditional full-search ME with the SAD measure.
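
A direct sketch of full-search block motion estimation that maximizes NCC instead of minimizing SAD follows; the block size and search radius are illustrative defaults.

    import numpy as np

    def ncc(a, b):
        # Normalized cross correlation between two equally sized blocks.
        a = a - a.mean()
        b = b - b.mean()
        den = np.sqrt((a * a).sum() * (b * b).sum())
        return float((a * b).sum() / den) if den > 0 else 0.0

    def full_search_ncc(cur, ref, bx, by, bs=16, rad=8):
        # Search the +/-rad window in the reference frame for the
        # displacement with maximum NCC.
        block = cur[by:by + bs, bx:bx + bs].astype(float)
        best, mv = -2.0, (0, 0)
        for dy in range(-rad, rad + 1):
            for dx in range(-rad, rad + 1):
                y, x = by + dy, bx + dx
                if 0 <= y and 0 <= x and y + bs <= ref.shape[0] \
                        and x + bs <= ref.shape[1]:
                    s = ncc(block, ref[y:y + bs, x:x + bs].astype(float))
                    if s > best:
                        best, mv = s, (dx, dy)
        return mv, best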

Shou-Der Wei, Wei-Hau Pan, Shang-Hong Lai
Curved Ray-Casting for Displacement Mapping in the GPU

To achieve interactive speed, displacement mapping in the GPU is typically implemented in two steps: vertex shading/rasterization of the base surface, and pixel shading. Pixel shading applies the height map relative to the image plane of the base surface, casts view rays to the height field through each pixel, finds the intersection point with the height field, and computes the color of that point. Here, the ray-casting process involves significant errors: the spatial relationship between the ray and the base surface is not preserved between the ray and the image plane of the base surface. These errors result in incorrect silhouettes. To address this problem, we curve the ray so that the spatial relationship between the (linear) ray and the base surface is preserved between the curved ray and the image plane of the base surface. This method reduces intersection errors, producing more satisfactory silhouettes, self-occlusions, and shadows.

Kyung-Gun Na, Moon-Ryul Jung
Emotion-Based Music Visualization Using Photos

Music players for personal computers often feature music visualization, generating animated patterns according to the music's low-level features such as loudness and spectrum. This paper proposes an emotion-based music player which synchronizes visualization (photos) with music based on the emotions evoked by the auditory stimulus of the music and the visual content of the visualization. For emotion detection from photos, we collected 398 photos with their emotions annotated by 496 users through the Web. With these annotations, a Bayesian classification method is proposed for automatic photo emotion detection. For emotion detection from music, we adopt an existing method. Finally, for the composition of music and photos, in addition to matching high-level emotions, we also consider low-level feature harmony and temporal visual coherence. This is formulated as an optimization problem and solved by a greedy algorithm. Subjective evaluation shows that emotion-based music visualization enriches users' listening experiences.
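
The composition step is formulated as optimization and solved greedily. Below is a schematic of such a greedy assignment, with the scoring function (emotion match plus low-level harmony and temporal coherence) left abstract as an assumed input.

    def greedy_photo_sequence(music_segments, photos, score):
        # For each music segment, pick the unused photo with the best
        # combined score given the sequence chosen so far.
        # Assumes len(photos) >= len(music_segments).
        remaining = set(range(len(photos)))
        chosen = []
        for seg in music_segments:
            best = max(remaining, key=lambda i: score(seg, photos[i], chosen))
            remaining.remove(best)
            chosen.append(best)
        return chosen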

Chin-Han Chen, Ming-Fang Weng, Shyh-Kang Jeng, Yung-Yu Chuang
LightCollabo: Distant Collaboration Support System for Manufacturers

This paper introduces our LightCollabo system, a distant collaboration tool designed for manufacturers. For distant collaboration between factory workers and members of a product design team, it is important to support interactive discussion over actual objects. LightCollabo enables remotely located people to share a view of actual objects and to annotate the objects by using a camera and a projector whose optical axes and view angles are precisely aligned. LightCollabo has been experimentally deployed in several locations in China and Japan and proven to be effective. Initial results of our use case study are also described.

Tetsuo Iyoda, Tsutomu Abe, Kiwame Tokai, Shoji Sakamoto, Jun Shingu, Hiroko Onuki, Meng Shi, Shingo Uchihashi
Backmatter
Metadata
Title
Advances in Multimedia Modeling
Edited by
Shin’ichi Satoh
Frank Nack
Minoru Etoh
Copyright year
2008
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-77409-9
Print ISBN
978-3-540-77407-5
DOI
https://doi.org/10.1007/978-3-540-77409-9
