
2008 | Book

Advances in Multimedia Information Processing - PCM 2008

9th Pacific Rim Conference on Multimedia, Tainan, Taiwan, December 9-13, 2008. Proceedings

Editors: Yueh-Min Ray Huang, Changsheng Xu, Kuo-Sheng Cheng, Jar-Ferr Kevin Yang, M. N. S. Swamy, Shipeng Li, Jen-Wen Ding

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

Welcome to the proceedings of the 9th Pacific-Rim Conference on Multimedia (PCM 2008), held at National Cheng Kung University, Tainan, Taiwan during December 9–13, 2008. The first PCM was held in Sydney in 2000. Since then, it has been held successfully around the Pacific Rim, including Beijing in 2001, Hsinchu in 2002, Singapore in 2003, Tokyo in 2004, Jeju in 2005, Zhejiang in 2006, Hong Kong in 2007 and finally Tainan. PCM is a major annual international conference bringing together researchers, developers, and educators in the field of multimedia from around the world. It covers a wide spectrum of multimedia research, from state-of-the-art theoretical breakthroughs to practical systems for multimedia analysis and processing. PCM 2008 featured a comprehensive program including tutorials, keynote talks, regular oral presentations, special sessions, and poster sessions. This year, we accepted 79 papers out of 210 submissions, giving an acceptance rate of 37%. In addition, 39 papers were accepted for poster presentation. The submissions, totaling 210 from 18 countries and regions, were categorized into five tracks: multimedia compression, communication and networking; multimedia processing, analysis and retrieval; multimedia databases, systems, and applications; multimedia human–computer interfaces; and multimedia security and digital rights management. Among the five tracks, “multimedia analysis and retrieval” received the most submissions (34% of the total). We greatly appreciate the effort made by the Program Committee members and the additional reviewers in reviewing the submissions.

Table of Contents

Frontmatter

Session 1: Next Generation Video Coding Techniques

Accurate Correlation Modeling for Transform-Domain Wyner-Ziv Video Coding

In Wyner-Ziv (WZ) video coding, low-complexity encoding is achieved by generating the prediction signal only at the decoder. An accurate model of the correlation between the original frame and its prediction is necessary for efficient coding. Firstly, we propose an improvement for the pixel-domain correlation estimation. In transform-domain WZ video coding, current models estimate the necessary correlation parameters directly in the transform domain. We propose an alternative approach, where an accurate estimation in the pixel domain is followed by a novel method of transforming the pixel-domain estimation into the transform domain. The experimental results show that our model leads to an average bit-rate gain of 3.5–8%.

Jozef Škorupa, Jürgen Slowack, Stefaan Mys, Peter Lambert, Rik Van de Walle
High-Quality Multi-Mode Mipmapping Texture Compression with Alpha Map

This paper presents a high-quality multi-mode mipmapping texture compression with alpha map (MMMTC) system. Based on the wavelet transform, a hierarchical approach is adopted for mipmapping in the YCbCr color space to embed three levels of mipmap in a single bitstream. In addition, by inspecting the similarity between the alpha channel and the luminance channel, the alpha high coefficients are efficiently encoded with linear prediction from the luminance high coefficients in three modes. Simulation results show that MMMTC can provide better image quality with similar memory bandwidth for the RGB channels and outperforms other texture compression systems for the alpha channel. VLSI implementation results show that the hardware cost of MMMTC is similar to that of DXTC, making it suitable for integration in a GPU to provide high-quality textures with low memory bandwidth requirements.

Chih-Hao Sun, Tse-Wei Chen, You-Ming Tsao, Shao-Yi Chien
Combined Prediction Mode for AVS

Audio Video Coding Standard (AVS) is China's new audio and video coding standard, which provides performance close to the H.264/AVC main profile with lower complexity. As in H.264/AVC, prediction plays an important role in AVS. A macroblock (MB) uses either intra or inter prediction to exploit spatial and temporal correlation and achieve good compression efficiency. In this paper, a new coding technique based on combined prediction is proposed for AVS. Based on a distribution analysis of the residuals generated by intra prediction and inter prediction, the characteristics in both the spatial and temporal domains are utilized to combine two selected modes, Intra_8×8_DC and Inter_P_16×16, together. The limited complexity increase and the standard compatibility of the proposed combined prediction mode (CPM) are discussed. Experimental results show that by applying the proposed CPM to the AVS rate-distortion optimized encoding flow, up to 2.70% bit-rate reduction for CIF sequences and 1.57% for 720p sequences is achieved, because CPM generates a more precise prediction with minimal complexity increase in both the encoder and decoder.

Zhiyi Zhu, Xin Jin
A JND Guided Foveation Video Coding

This paper presents a novel just noticeable distortion (JND) guided foveation video coding method, by which the foveation region can be adaptively selected according to the video content. In the proposed method, two factors are taken into account: first, we set a fixed foveation point and then obtain the cut-off frequency for every macroblock. Second, we compute the JND values for each pixel in the frame to be encoded, and then use the JND variance of each macroblock to tune the corresponding cut-off frequency. The method can be implemented without any modification of the decoder and is compatible with all existing coding standards. Based on the proposed model, a preprocessing framework is put forward. Experimental results demonstrate that foveated pictures with higher perceptual quality can be obtained compared with the traditional foveation model.

Erli Zhang, Debin Zhao, Yongbing Zhang, Hongbin Liu, Siwei Ma, Ronggang Wang

Session 2: Audio Processing and Classification

Pitch Shifting of Music Based on Adaptive Order Estimation of Linear Predictor

A method for timbre-preserving pitch shifting of music signals based on estimating the order in linear predictive coding (LPC) is described. LPC is used for estimating a transfer function approximated using autoregressive (AR) models. We need to determine the appropriate number of poles in the AR model for LPC. For general audio signals, the number of poles varies with time because the number and the kind of sound sources such as musical instruments change dynamically. To estimate the order, we utilize the inequality of arithmetic and geometric means (AM-GM inequality) and fractional bandwidth at each pole of the model. The estimated order can be applied to LPC for the timbre-preserving pitch shifting. A listening test evaluated by the mean opinion score (MOS) shows that our approach improves the sound quality of pitch-shifted signals.

Naoki Koshikawa, Takahiro Murakami, Toshihisa Tanaka
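As a rough illustration of the order-selection idea described above (not the authors' implementation: they use the AM-GM inequality and the fractional bandwidth of the model poles, which are not reproduced here), the sketch below picks the smallest LPC order whose prediction residual is spectrally flat, i.e. whose geometric and arithmetic spectral means nearly coincide. The candidate order range and the flatness threshold are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def lpc(x, order):
    """Autocorrelation-method LPC coefficients via the Levinson-Durbin recursion."""
    x = np.asarray(x, dtype=float)
    r = np.correlate(x, x, mode="full")[len(x) - 1: len(x) + order]
    a, err = np.array([1.0]), r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:], r[i - 1:0:-1])
        k = -acc / err
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]
        err *= 1.0 - k * k
    return a, err

def residual_flatness(x, order):
    """Spectral flatness (geometric mean / arithmetic mean) of the LPC residual."""
    a, _ = lpc(x, order)
    e = lfilter(a, [1.0], x)                 # prediction-error (inverse) filtering
    p = np.abs(np.fft.rfft(e)) ** 2 + 1e-12
    return np.exp(np.mean(np.log(p))) / np.mean(p)

def estimate_order(frame, candidates=range(8, 65, 4), threshold=0.5):
    """Pick the smallest order whose residual is 'flat enough' (AM close to GM)."""
    for p in candidates:
        if residual_flatness(frame, p) >= threshold:
            return p
    return max(candidates)
```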
An Efficient Approach for Classification of Speech and Music

A new method to classify an audio segment into speech and music, related to the automatic transcription of broadcast news, is presented. To discriminate between speech and music, sample entropy (SampEn), a time complexity measure, mainly operates as a feature. SampEn is a variant of the approximate entropy (ApEn) that measures the regularity of time series. The basic idea is to label a given audio segment as speech or music depending on its regularity. Based on the SampEn sequence calculated over a window, the regularity of a given audio stream is measured. The effectiveness of the proposed method is tested in experiments, including broadcast news shows from BBC radio stations, WBAI news, UN news and music genres with different temporal distributions. Results show the robustness of the proposed method, achieving high discrimination accuracy for all tested experiments.

Ei Mon Mon Swe, Moe Pwint
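For readers unfamiliar with the SampEn feature the abstract relies on, here is a minimal, simplified sketch of sample entropy computed over sliding windows of an audio signal. The window length, hop size, template length m, and tolerance factor are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def sample_entropy(x, m=2, r_factor=0.2):
    """Simplified sample entropy (SampEn) of a 1-D sequence."""
    x = np.asarray(x, dtype=float)
    r = r_factor * np.std(x)

    def pairs_within_r(length):
        templates = np.array([x[i:i + length] for i in range(len(x) - length)])
        # Chebyshev distance between every pair of templates (quadratic in window size)
        d = np.max(np.abs(templates[:, None, :] - templates[None, :, :]), axis=2)
        return (np.sum(d <= r) - len(templates)) / 2.0   # exclude self-matches

    b = pairs_within_r(m)        # template matches of length m
    a = pairs_within_r(m + 1)    # template matches of length m + 1
    return -np.log(a / b) if a > 0 and b > 0 else np.inf

def sampen_sequence(signal, win=1024, hop=512, m=2):
    """SampEn sequence over sliding windows, as used for regularity profiling."""
    return [sample_entropy(signal[s:s + win], m)
            for s in range(0, len(signal) - win + 1, hop)]
```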
Speech Emotion Classification on a Riemannian Manifold

We present a novel algorithm for speech emotion classification. In contrast to previous methods, we additionally consider the relations between simple features by incorporating covariance matrices as the new feature descriptors. Since non-singular covariance matrices do not lie on a linear space, we endow the space with an affine-invariant metric and render it into a Riemannian manifold. We then use the tangent space to approximate the manifold. Classification is performed in the tangent space, and a generalized principal component analysis is presented. We test the algorithm on speech emotion classification, and the experimental results show an improvement of around 13% (+3% with PCA) in recognition accuracy. Based on that, we are able to train one simple model to accurately differentiate the emotions of both genders.

Chengxi Ye, Jia Liu, Chun Chen, Mingli Song, Jiajun Bu
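A minimal sketch of the covariance-descriptor idea described above, assuming frame-level features are already extracted: each utterance is summarized by a covariance matrix, which is mapped into the tangent space of the SPD manifold at a reference point (for example, the mean covariance of the training set) before a standard classifier is applied. This is a generic affine-invariant log map, not the authors' exact pipeline or their generalized PCA.

```python
import numpy as np
from scipy.linalg import inv, logm, sqrtm

def covariance_descriptor(frame_features):
    """frame_features: (n_frames, d) low-level features of one utterance."""
    d = frame_features.shape[1]
    return np.cov(frame_features, rowvar=False) + 1e-6 * np.eye(d)  # keep it SPD

def tangent_vector(C, C_ref):
    """Project an SPD matrix onto the tangent space at C_ref and vectorise it."""
    s = sqrtm(C_ref)
    s_inv = inv(s)
    T = np.real(s @ logm(s_inv @ C @ s_inv) @ s)   # log map at the reference point
    i, j = np.triu_indices_from(T)
    return T[i, j]                                  # upper triangle as a feature vector
```

The resulting tangent-space vectors can then be fed to any linear classifier or to PCA, which is roughly where the abstract's "classification is performed in the tangent space" step would take place.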
Toward Multi-modal Music Emotion Classification

The performance of categorical music emotion classification that divides emotion into classes and uses audio features alone for emotion classification has reached a limit due to the presence of a semantic gap between the object feature level and the human cognitive level of emotion perception. Motivated by the fact that lyrics carry rich semantic information of a song, we propose a multi-modal approach to help improve categorical music emotion classification. By exploiting both the audio features and the lyrics of a song, the proposed approach improves the 4-class emotion classification accuracy from 46.6% to 57.1%. The results also show that the incorporation of lyrics significantly enhances the classification accuracy of valence.

Yi-Hsuan Yang, Yu-Ching Lin, Heng-Tze Cheng, I-Bin Liao, Yeh-Chin Ho, Homer H. Chen

Session 3: Interactive Multimedia Systems

Using GPU-Based Ray Tracing for Real-Time Composition in the Real Scene

This paper presents a novel rendering system using GPU-based ray tracing for interactive composition of a synthetic object in a real scene. In order to generate correct lighting in the scene, we extract the texture information and the light samples from a panoramic image in real time. The proposed interactive system provides a powerful way to represent shadows and reflections from surrounding environments. The simulation results demonstrate that the proposed method is useful in real-time applications such as rendering services on mobile devices.

Sungmin Bae, Kyunghee Hwang, Hyunki Hong
Vision-Based Hand Gesture Interactions for Large LCD-TV Display Tabletop Systems

The advent of LCD displays and computer vision technology has encouraged the development of large interactive systems that allow several users to interact simultaneously. We introduce a vision-based hand-gesture interactive tabletop system which provides high-definition display output and can recognize several standard gestures and track hand motion using a camera and computer vision technology. The system focuses on fast recognition and tracking techniques that minimize the response time between users and multimedia, which is especially important for motion games with immediate and large hand movements. A prototype system, capable of recognizing 5 gesture types and tracking hand motion in real time, was built. A demonstration game, Virtual Air Hockey, was built on the prototype system to demonstrate the interactive techniques, which received positive user feedback.

Chi-Ho Yeung, Man-Wa Lam, Hong-Ching Chan, Oscar C. Au
Fast Content-Based Mining of Web2.0 Videos

The accumulation of many transformed versions of the same original videos on Web2.0 sites has a negative impact on the quality of the results presented to the users and on the management of content by the provider. An automatic identification of such content links between video sequences can address these difficulties. We put forward a fast solution to this video mining problem, relying on a compact keyframe descriptor and an adapted indexing solution. Two versions are developed: an off-line one for mining large databases and an online one to quickly post-process the results of keyword-based interactive queries. After demonstrating the reliability of the method on a ground truth, the scalability on a database of 10,000 hours of video and the speed on 3 interactive queries, some results obtained on Web2.0 content are illustrated.

Sébastien Poullot, Michel Crucianu, Olivier Buisson
An Interactive Painting System by Using Enhanced Auto-positioning and Pattern Matching Techniques

In this paper, we propose a low-cost interactive painting system, which is realized with a general camera and a liquid crystal display (LCD) monitor. By using a newly designed initialization grid and a pattern matching technique, the low-cost camera can act as a natural painting brush, further featuring advanced capabilities such as video and image capturing, positioning, controlling, and painting with existing textures and backgrounds. The proposed interactive painting system helps people draw a painting directly on any LCD monitor, successfully realizing an edutainment system. Through natural human behaviors, the proposed system can be further extended to interactive multimedia education, gaming, and other entertainment applications.

Ling-Erl Cheng, Jiun-Yu Chen, Jar-Ferr Yang

Session 4: Advances in H.264/AVC

VLC Table Prediction Algorithm for CAVLC in H.264 Using the Characteristics of Mode Information

The most recent H.264 video coding standard adopted context-based adaptive variable length coding (CAVLC) as the entropy coding tool. By combining adaptive variable length coding (VLC) with context modeling, we can achieve better coding performance. However, CAVLC in H.264 suffers from low accuracy of VLC table prediction. In this paper, we propose a new VLC table prediction algorithm using the correlation of coding modes between the current and neighboring blocks and the statistics of the mode distribution in both intra and inter frames. Moreover, we can further increase the accuracy of VLC table prediction by considering the structural characteristics of the mode information in inter frames. Experimental results show that the proposed algorithm increases the accuracy of VLC table prediction by 10.07% and reduces the bit rate by 1.21% on average.

Jin Heo, Yo-Sung Ho
Perceptually Motivated Adaptive Quantization Algorithm for Region-of-Interest Coding in H.264

Encoding a video sequence at very low bit rate with good quality remains a major challenge in video coding research. Improvements based on the Region of Interest (ROI) can provide better perceptual quality. This paper proposes an adaptive perceptual quantization algorithm for ROI video coding. Our goal is to find a solution that achieves significant ROI quality improvement and a perceptually pleasing background. Based on the masking effects of the Human Visual System (HVS), the area inside the ROI is divided into partitions by a grid model, and macroblocks (MBs) are classified as non-ROI, grids inside the ROI, and gradient region. The quantization parameters for different regions are modulated by an adaptive perceptual priority map. Extensive experimental results show that the proposed algorithm improves the ROI objective quality by up to 4.14 dB compared with uniform quantization, with little quality loss in the background. Subjective results show that the perceptual video quality is improved noticeably.

Qiong Liu, Rui-Min Hu
Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding

In H.264, block-based discrete cosine transform (DCT) and block-based motion compensated prediction are used to reduce both spatial and temporal redundancies. Due to the block-based coding, discontinuities at block boundaries, referred to as blocking artifacts, are created. To reduce these blocking artifacts, H.264 employs a deblocking filter. However, it incurs a significant amount of complexity; the deblocking filter occupies one-third of the computational complexity of the decoder. In this paper, we propose a deblocking filtering algorithm with low complexity. Using the boundary strength (BS) of the first Line-of-Pixels (LOP), we determine the BS of successive LOPs in advance. Then, we apply deblocking filters, including newly designed filters. Experimental results show that the proposed algorithm provides an average computation reduction of 73.45% in the BS decision. In the filter implementation, it reduces additions by an average of 57.52%, multiplications by 100%, and shift operations by 5.66% compared to the deblocking filter in H.264, with comparable objective quality.

Jung-Ah Choi, Yo-Sung Ho
Two-Dimensional Map Based Fast Mode Decision for H.264/AVC

The state-of-the-art video coding standard H.264/AVC achieves significant coding performance improvement by choosing the rate-distortion (RD) optimized encoding mode from a list of candidate modes at the cost of intensive computations, which limits the practical application of H.264/AVC. A number of fast mode decision algorithms have been recently presented in the literature in order to reduce H.264/AVC encoding computations. In this paper, a novel two-dimensional (2D) map based fast mode decision scheme is proposed, which produces a flexible mode checking order for fast mode decision. Experimental results demonstrate that with the proposed algorithm, the computational complexity of H.264/AVC coding is significantly reduced while the video quality and compression efficiency are preserved.

Tiesong Zhao, Hanli Wang, Xiangyang Ji, Sam Kwong
A Novel Hierarchical Mode Selection Algorithm for P-Slices in H.264/AVC

In this paper, a novel hierarchical mode selection algorithm is proposed to reduce the complexity of the mode selection process for P-slices in H.264/AVC. The proposed algorithm is divided into two major stages: inter mode selection and intra mode selection. For inter mode selection, a three-level prediction structure is proposed with a flexible early termination strategy, which can efficiently skip unnecessary mode checking while achieving high prediction accuracy. For intra mode selection, the spatial correlation and the information of the best inter mode are exploited to decide whether intra mode selection can be skipped. If intra mode selection is necessary, an early selection between Intra16x16 and Intra4x4 is made. Experimental results show that with this algorithm, 50%–80% of the encoding time can be saved with a negligible loss in PSNR and a very small increase in bit rate.

Xinxin Zhou, Chun Yuan, Chunhua Li, Yuzhuo Zhong
Motion Compensated Interpolation as a New Inter Coding Mode for 8x8 Macroblock Partitions in H.264/AVC B Slices

Thanks to many new or improved coding tools, the H.264/AVC video coding standard achieves compression ratios up to twice as good as those of its predecessors. Unfortunately, many of these new coding tools come at the cost of a considerable number of extra bits needed to send motion vectors to the decoder. Especially at low bit rates, this overhead is an important drawback. In this paper, a new inter coding mode for 8x8 macroblock partitions in B slices is proposed. The mode is based on motion compensated interpolation, a commonly used technique in the world of distributed video coding. Since motion compensated interpolation allows both encoder and decoder to find the same motion vector, no bits need to be spent on coding this motion vector. Experimental results show that introducing the new coding mode can achieve a bit-rate gain of up to 3.03%, especially at low bit rates.

Stefaan Mys, Jürgen Slowack, Jozef Škorupa, Peter Lambert, Rik Van de Walle
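As a rough sketch of the motion compensated interpolation that the proposed coding mode builds on (not the paper's encoder integration), the following interpolates the middle frame between two grayscale frames with a full-search, SAD-based block matcher; the block size and search range are arbitrary assumptions, and hole/overlap handling is omitted for brevity.

```python
import numpy as np

def mci_middle_frame(prev, nxt, block=8, search=8):
    """Block-based motion-compensated interpolation of the frame between prev and nxt."""
    h, w = prev.shape
    mid = np.zeros((h, w), dtype=float)
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            cur = nxt[by:by + block, bx:bx + block].astype(float)
            best_mv, best_sad = (0, 0), np.inf
            # full search in prev around the co-located block of nxt
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if 0 <= y <= h - block and 0 <= x <= w - block:
                        ref = prev[y:y + block, x:x + block].astype(float)
                        sad = np.abs(cur - ref).sum()
                        if sad < best_sad:
                            best_mv, best_sad = (dy, dx), sad
            dy, dx = best_mv
            ref = prev[by + dy:by + dy + block, bx + dx:bx + dx + block].astype(float)
            # place the averaged block halfway along the motion path
            mid[by + dy // 2:by + dy // 2 + block,
                bx + dx // 2:bx + dx // 2 + block] = 0.5 * (cur + ref)
    return mid
```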

Session 5: Multimedia Networking Techniques

On the Server Placement Problem of P2P Live Media Streaming System

Commercial P2P media streaming systems have widely utilized servers or Content Distribution Network (CDN) services to help alleviate the effects of high peer dynamics. As refocusing on servers becomes a must in large-scale commercial P2P streaming systems, the problem arises of how to place server nodes most efficiently so that they can better serve the needs of peers at minimal cost. In line with this, we formulate the Server Placement (SP) problem in P2P streaming systems and propose solution schemes for the sub-problems of server selection and rate assignment. The server selection problem targets minimizing the end-to-end user round-trip latency and traffic transmission cost; it is reduced to a P-median problem and solved by a Greedy Randomized Adaptive Search Procedure (GRASP). Secondly, we formulate the problem of which clients are served by which server at which rate as the rate allocation problem and optimize it to minimize the total streaming cost subject to the play rate requirement. As a starting work, this paper aims to attract more research on this challenging topic.

Zhijia Chen, Chuang Lin, Hao Yin, Bo Li
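To make the P-median/GRASP step concrete, here is a minimal sketch of a greedy randomized construction for server selection under an assumed cost matrix. Only the construction phase of GRASP is shown (no local search), and the parameters alpha and iters are illustrative, not taken from the paper.

```python
import numpy as np

def grasp_server_selection(cost, p, alpha=0.3, iters=50, seed=0):
    """Greedy randomized construction for a P-median style server-placement step.

    cost[i, j] is an assumed latency/transmission cost from candidate site i to
    client j; p servers are chosen to minimise the total cost of assigning every
    client to its nearest chosen server.
    """
    rng = np.random.default_rng(seed)
    n_sites, n_clients = cost.shape
    best_set, best_cost = None, np.inf
    for _ in range(iters):
        chosen, assigned = [], np.full(n_clients, cost.max())
        while len(chosen) < p:
            remaining = [i for i in range(n_sites) if i not in chosen]
            # marginal cost reduction of adding each remaining candidate site
            gains = np.array([np.sum(assigned - np.minimum(assigned, cost[i]))
                              for i in remaining])
            cutoff = gains.max() - alpha * (gains.max() - gains.min())
            rcl = [remaining[k] for k in range(len(remaining)) if gains[k] >= cutoff]
            pick = int(rng.choice(rcl))          # random pick from the restricted list
            chosen.append(pick)
            assigned = np.minimum(assigned, cost[pick])
        total = assigned.sum()
        if total < best_cost:
            best_set, best_cost = sorted(chosen), total
    return best_set, best_cost
```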
Guaranteed Bandwidth Requirement Mechanism for Multimedia Multicast Traffic in MANETs

The emergence of nomadic multimedia applications has recently generated much interest in mobile ad hoc networks (MANETs) that support diverse Quality of Service (QoS). In existing MANET QoS routing and multicasting protocols, methods of bandwidth calculation and allocation have been proposed to determine routes with guaranteed bandwidth for QoS applications. From our observations, two bandwidth-violation problems can be incurred in the above protocols. In this paper, a heuristic algorithm is proposed to avoid the two problems and determine a feasible bandwidth-satisfied multicast tree. We then integrate the algorithm with the existing multicast protocol ODMRP to support bandwidth-requirement multicast services in MANETs. To evaluate the performance of the proposed algorithm, the minimization problem is formulated as a 0/1 integer linear program (ILP) for theoretical study.

Chia-Cheng Hu, Der-Jiunn Deng, Lih-Gwo Jeng
Seamless Handoff Support for Real-Time Multimedia Applications in Nested Mobile Networks

In this paper, we extend our Hierarchical Care-of Prefix with the binding update tree, i.e., HCoP-B, to support route optimization and seamless handoff for real-time multimedia applications in the nested mobile network simultaneously, which is rare in the literature. As compared to the traditional ROTIO scheme with mathematical analyses and simulations, HCoP-B achieves shorter playback disruption time and buffering time for on-going real-time multimedia applications whenever the mobile subnet in the old nested mobile network hands over to a new one.

Ing-Chau Chang, Chia-Hao Chou
Service Differentiation in IEEE 802.11e HCF Access Method

As the demand for broadband multimedia wireless is increasing, there is a great need for quality of service (QoS). The IEEE 802.11e workgroup proposes the hybrid coordination function (HCF) channel access mechanism, which includes two schemes: enhanced distributed channel access (EDCA) and HCF controlled channel access (HCCA). The EDCA is a contention-based scheme and supports service differentiation through prioritised access to the wireless medium. The HCCA works under the control of a hybrid coordinator (HC) and provides a centralised polling scheme. In this paper, we evaluate and compare the performance of HCCA and EDCA for constant bit rate (CBR) traffic transmission. The results show that both access methods are able to carry the prioritised traffic.

Whe-Dar Lin, Der-Jiunn Deng
Throughput Performance of an IP Differentiated-Services Network for Video Communication

The QoS solutions for video transmission over IP networks presently include integrated services (IntServ) [4], DiffServ [5], and multiprotocol label switching (MPLS) [6]. Among them, DiffServ is regarded as a dominant protocol for its flexibility, scalability, and capability of QoS guarantee. Traditionally, the policing or limiting function was implemented in DiffServ by using the srTCM (single-rate three-color marker) [2] and trTCM (two-rate three-color marker) [3]. The proposed scheme improves the throughput of MPEG-4 video transmission over IP networks. In this scheme, I-frames, P-frames and B-frames are marked with different service priorities according to transmission conditions and their contribution to the perceived picture quality. We used authentic MPEG-4 video traffic traces to compare the performance of the proposed scheme with other existing RFC markers for MPEG video transmission, such as srTCM and trTCM, in a DiffServ network. Results show that the proposed scheme improves the throughput of video transmission over IP networks.

Dharm Singh Jat, Chih-Heng Ke, M. K. Jain, Chen Ruey Shin
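For context on the markers the scheme is compared against, below is a minimal color-blind two-rate three-color marker in the spirit of RFC 2698; the per-frame-type priority mapping that the paper proposes on top of such markers is not reproduced. Rates and burst sizes are caller-supplied assumptions.

```python
import time

class TrTCM:
    """Color-blind two-rate three-color marker (in the spirit of RFC 2698)."""

    def __init__(self, cir, cbs, pir, pbs):
        self.cir, self.cbs = cir, cbs      # committed rate (bytes/s), burst (bytes)
        self.pir, self.pbs = pir, pbs      # peak rate (bytes/s), burst (bytes)
        self.tc, self.tp = cbs, pbs        # both token buckets start full
        self.last = time.monotonic()

    def mark(self, size):
        """Return the color for a packet of `size` bytes arriving now."""
        now = time.monotonic()
        dt, self.last = now - self.last, now
        self.tc = min(self.cbs, self.tc + self.cir * dt)
        self.tp = min(self.pbs, self.tp + self.pir * dt)
        if self.tp < size:
            return "red"                   # exceeds the peak rate
        self.tp -= size
        if self.tc < size:
            return "yellow"                # conforms to peak but exceeds committed rate
        self.tc -= size
        return "green"
```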

Session 6: Advanced Image Processing Techniques

Skeleton-Based Recognition of Chinese Calligraphic Character Image

The large amount of digitized Chinese calligraphic works in existence is a valuable part of the Chinese cultural heritage. However, they can hardly be recognized by optical character recognition (OCR), which performs well on machine-printed characters against a clean background, because calligraphic characters exhibit widely different styles and shape complexity. Approaches to automatic Chinese calligraphic character recognition therefore become more and more important. A novel skeletonization algorithm called MFITS (morphology-fused index table skeletonization) is proposed, together with a skeleton-based Chinese calligraphic character recognition method. The experiments show that MFITS can extract skeletons with only a few deformations and that the skeleton-based Chinese calligraphic character image recognition method performs well.

Kai Yu, Jiangqin Wu, Yueting Zhuang
A Platform Implementation for Real Time Image Processing

Image processing technologies have improved markedly since the rise of modern IA products in the last decade. To satisfy the demand for high-quality images, higher contrast, higher saturation, and higher resolution have become the main research goals. This article focuses on how to implement a platform for real-time image processing that provides a realistic environment for developing techniques to enhance image contrast dynamically in real time. A new method for improving image contrast can be verified by feeding digitized analog NTSC television images to this platform and then performing the contrast-enhancement functions on an FPGA to produce the enhanced result. Unlike the software simulations proposed in past research, this article provides a practical hardware foundation for real-time processing that can serve as a valuable reference for real-time image processing.

Ching-Hsi Lu, Yu-Sheng Wang, Lei Wang, Hong-Yang Hsu
Subword Lexical Chaining for Automatic Story Segmentation in Chinese Broadcast News

We present a subword lexical chaining approach to automatic story segmentation of Chinese broadcast news (BN). Conventional lexical chains link related words with cohesion (e.g., repetition of words), and high concentration points of starting and ending chains are indicative of story boundaries. However, inevitable speech recognition errors in BN transcripts may destroy the cohesiveness of words, resulting in word match failures. We show the robustness of Chinese subwords (characters and syllables) in lexical matching on errorful ASR transcripts. This motivates us to discover story boundaries on chains formed by character and syllable n-gram units. Experimental results on the TDT2 Mandarin corpus show that chaining by character unigrams exhibits the best story segmentation performance, with a relative F-measure improvement of 6.06% over conventional word chaining. Integration of multiple scales (words and subwords) yields further improvement. For example, fusion by voting across different scales achieves an F-measure gain of 9.04% over words.

Lei Xie, Yulian Yang, Jia Zeng
Moving Video Object Edge Detection Using Complex Wavelets

In this paper, an unsupervised moving video object edge detection method based on the dual-tree complex wavelet transform (DT-CWT) domain is proposed, and its performance is compared with that of the discrete wavelet transform (DWT). In the proposed DT-CWT-based approach, an interframe change detection method with adaptive thresholding on six DT-CWT subbands is developed to detect moving video object edges, with the assistance of the edge map generated by the Canny edge detector in the corresponding frame. The performance of the proposed method is demonstrated both numerically and visually, and shows its clear superiority over the DWT-based approach.

Turgay Celik, Kai-Kuang Ma
Detecting Interesting Regions in Photographs – How Metadata Can Help

Photographs, which are taken by human beings with creative thinking, may differ significantly from the images taken by a surveillance camera or a visual sensor on a robot. Human beings intentionally shoot a photograph to express their feelings or photo-realistically record a scene by adjusting two factors: the parameter settings of the camera and the position of the camera relative to the object of interest. Based on these observations, a graph-model-based stochastic method is used to discover the patterns of how people take photos, so that the interesting regions of the images can be determined automatically. Both the visual features of the images and the camera metadata parameters are taken into consideration simultaneously. Experimental evaluation on over 7000 photos taken by more than 200 different models of cameras with a variety of interests has shown the robustness of our techniques.

Zhong Li, Hong Lu, Xiangyang Xue, Jianping Fan
Learning-Based Image Restoration for Compressed Image through Neighboring Embedding

In this paper, we propose a novel learning-based image restoration scheme for compressed images by suppressing compression artifacts and recovering high-frequency components with priors learned from a training set of natural images. Specifically, deblocking is performed to alleviate the blocking artifacts. Moreover, the consistency of the primitives is enhanced by estimating the high-frequency components, which are simply truncated during quantization. Furthermore, with the assumption that small image patches in the enhanced and real high-frequency images form manifolds with similar local geometry in the corresponding image feature spaces, a neighbor-embedding-based mapping strategy is utilized to reconstruct the target high-frequency components. Experimental results demonstrate that the proposed scheme can reproduce higher-quality images in terms of visual quality and PSNR, especially in regions relating to contours.

Lin Ma, Feng Wu, Debin Zhao, Wen Gao, Siwei Ma
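A minimal sketch of the neighbor-embedding mapping mentioned above, in the style of locally linear embedding: a low-frequency patch is expressed as a weighted combination of its k nearest training patches, and the same weights are applied to the paired high-frequency patches. The training dictionaries, patch features, k, and the regularization constant are assumed inputs; the deblocking and primitive-enhancement stages of the paper are not shown.

```python
import numpy as np

def neighbor_embedding_patch(low_patch, train_low, train_high, k=5, reg=1e-4):
    """Map one low-frequency patch to a high-frequency estimate via its k neighbours.

    train_low:  (N, d)  low-frequency training patch features (assumed given)
    train_high: (N, d2) paired high-frequency training patches
    """
    dists = np.linalg.norm(train_low - low_patch, axis=1)
    nn = np.argsort(dists)[:k]
    # LLE-style reconstruction weights that sum to one
    Z = train_low[nn] - low_patch                     # centred neighbours
    G = Z @ Z.T
    G += reg * (np.trace(G) + 1e-3) * np.eye(k)       # regularise for stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()
    # transfer the same weights to the paired high-frequency patches
    return w @ train_high[nn]
```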

Session 7: Video Analysis and Its Applications

Synopsis Alignment: Importing External Text Information for Multi-model Movie Analysis

Text information, which plays an important role in news video concept detection, has been ignored in state-of-the-art movie analysis. This is because movie subtitles are the speech of characters, which does not directly describe the content of the movie and contributes little to movie analysis. In this paper, we import collaboratively edited synopses from professional movie sites for movie analysis, which give detailed descriptive text of a movie. Two aligning methods, subtitle alignment and RoleNet alignment, are proposed to complementarily align the synopsis to the movie and obtain scene-level text information. The experiments show that the proposed methods can effectively align a synopsis to movie scenes and that the imported text information yields a more user-preferred summarization than using audiovisual features alone.

Ming Li, Yang Liu, Yong-Dong Zhang, Shou-Xun Lin
Objects over the World

This paper considers the problem of selecting representative photographs for regions on a worldwide scale. Selecting and generating such representative photographs for representative regions from large-scale collections would help us understand local, region-specific objects from a worldwide perspective. We propose a solution to this problem using a large-scale collection of geo-tagged photographs. Our solution first extracts the most relevant images by clustering and evaluating visual features. Then, based on the geographic information of the images, representative regions are automatically detected. Finally, we select and generate a set of representative images for the representative regions by employing Probabilistic Latent Semantic Analysis (PLSA) modelling. The results show the ability of our approach to generate region-based representative photographs.

Bingyu Qiu, Keiji Yanai
Detect and Recognize Clock Time in Sports Video

Analyzing overlay content in sports video can provide reference information about the game. For example, the digital clock on the scoreboard, which indicates the elapsed time of the game, is very useful for sports game summarization, indexing, and retrieval. However, existing methods only work well when the clock is on a clean background and visible all the time. This paper proposes a more general approach to locate and recognize the digit characters in soccer videos. It is notably effective in reading the clock digits on a transparent overlay, where the digits appear and disappear and are blurred due to compression. Experimental results show that this approach is robust regardless of contrast, font size, font color, and background complexity. It is also simple and fast enough for real-time clock reading while the video is playing.

Fan Bu, Li-Feng Sun, Xi-Feng Ding, Yin-Jun Miao, Shi-Qiang Yang
Detecting Violent Scenes in Movies by Auditory and Visual Cues

To detect violence in movies, we present a three-stage method integrating visual and auditory cues. In our method, shots with potential violent content are first identified according to universal film-making rules. A modified semi-supervised learning technique based on semi-supervised cross feature learning (SCFL) is exploited, since it is capable of combining different types of features and using unlabeled data to improve classification performance. Then, typical violence-related audio effects are further detected for the candidate shots, and the confidences output by the classifiers of various audio events are transformed into a shot-based violence score. Finally, the probabilistic outputs of the first two stages are integrated in a boosting manner to generate the final inference. The experimental results on four typical action movies preliminarily show the effectiveness of our method.

Yu Gong, Weiqiang Wang, Shuqiang Jiang, Qingming Huang, Wen Gao
Personalized MTV Affective Analysis Using User Profile

At present, MTV has become an important and favorite pastime for people. Affective analysis, which can extract the affective states contained in MTVs, could be a promising solution for efficient and intelligent MTV access. One of the most challenging and insufficiently covered problems of affective analysis is that affective understanding is personal and varies among users. Consequently, it is meaningful to develop personalized affective modeling techniques. Because users' feedback and descriptions of affective states provide valuable and relatively reliable clues about their personal affective understanding, it is reasonable to conduct personalized affective modeling by analyzing the affective descriptions recorded in a user profile. Utilizing the user profile, we propose a novel approach combining support vector regression and a psychological affective model to achieve personalized affective analysis. The experimental results, including both a user study and comparisons with current approaches, illustrate the effectiveness and advantages of the proposed method.

Shiliang Zhang, Qingming Huang, Qi Tian, Shuqiang Jiang, Wen Gao

Session 8: Image Detection and Classification

Sequential Simulated Annealing System for Pattern Detection

A sequential system using simulated annealing is proposed for the detection of lines, circles, ellipses, and hyperbolas in images. The sequential detection proceeds type by type and pattern by pattern. The equations of the ellipse and hyperbola are defined under translation and rotation. The distance from all points to all patterns is defined as the error. Using simulated annealing for parameter detection, we can search for the set of parameter vectors with the globally minimal error. We also propose a synchronous simulated annealing detection system for comparison with the sequential system. In experiments on a large number of simulated and real image patterns, the detection results of the sequential system are better than those of the synchronous system.

Kou-Yuan Huang, Ying-Liang Chou
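As one concrete instance of the simulated-annealing parameter search described above, the sketch below fits a single circle (cx, cy, r) to a set of edge points; the cooling schedule and move sizes are arbitrary assumptions, and the paper's sequential type-by-type detection and its ellipse/hyperbola parameterizations are not reproduced.

```python
import numpy as np

def circle_error(points, params):
    """Mean absolute distance from edge points (x, y) to the circle (cx, cy, r)."""
    cx, cy, r = params
    return np.mean(np.abs(np.hypot(points[:, 0] - cx, points[:, 1] - cy) - r))

def anneal_circle(points, t0=20.0, t_min=1e-2, cooling=0.95, moves=50, seed=0):
    """Search (cx, cy, r) by simulated annealing so the circle fits the edge points."""
    rng = np.random.default_rng(seed)
    params = np.array([points[:, 0].mean(), points[:, 1].mean(), points.std()])
    err = circle_error(points, params)
    best, best_err = params.copy(), err
    t = t0
    while t > t_min:
        for _ in range(moves):
            cand = params + rng.normal(scale=t, size=3)
            cand_err = circle_error(points, cand)
            # accept improvements always, worse moves with Boltzmann probability
            if cand_err < err or rng.random() < np.exp((err - cand_err) / t):
                params, err = cand, cand_err
                if err < best_err:
                    best, best_err = params.copy(), err
        t *= cooling
    return best, best_err
```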
A Collaborative Bayesian Image Annotation Framework

The integration of content and context information within an image annotation framework is studied; these refer to the low-level visual features and the co-occurrence of different real-world objects in a probabilistic sense, respectively. Conventional annotation approaches fail to collect and utilize the context information. Therefore, we propose a new framework, termed the Collaborative Bayesian Image Annotation (CBIA) framework. 1) In addition to the content information, the proposed system accumulates past annotation results and/or information actively provided by domain experts, from which the context knowledge is extracted. Hence, part of the system is collaboratively constructed by human users. 2) The above information is utilized through a Bayesian framework. Numerical results based on images collected from the Internet demonstrate better performance resulting from the introduction of context knowledge and information fusion.

Rui Zhang, Kui Wu, Kim-Hui Yap, Ling Guan
A New Hierarchical Particle Filter Based Tracking System for Soccer Game Analysis

In this study, a new hierarchical particle filter based tracking system for soccer game analysis is proposed. The hierarchical particle filter is used to fuse different types of features, such as color, edge, and position information, in a cascaded feature structure. Additionally, the proposed system handles the occlusion problem in an effective manner. Compared with three existing methods, simulation results demonstrate that the proposed system tracks soccer players successfully and accurately.

Shih-Ting Wang, Jin-Jang Leou, Cheng-Shian Lin
Automatic Web Page Classification Using Various Features

A model for automatically classifying uncertain Web pages using multiple features is presented. Since a traditional tree structure can barely classify the avalanche of new Web pages, the proposed approach partially uses the idea of “bag of words”, incorporating classification fusion to describe and categorize Web pages. The proposed approach extracts features of Web pages from various perspectives, such as consulting a Web directory service, analyzing the text features of Web pages' titles and meta-search keywords, and identifying the primary content of Web pages. By fusing the results from these three dedicated classifiers, Web pages are classified into one or more categories with a set of words representing the Web pages. In order to demonstrate the effectiveness of the proposed method, experiments are carried out in which Web pages are classified into four categories using the proposed fusion method. A comparison between the dedicated classifiers and the fusion methods is also presented.

Hao Wen, Liping Fang, Ling Guan
An Efficient Method for Near-Duplicate Video Detection

In order to monitor video streams in real time or search large collections of video documents, several solutions based on near-duplicate video detection have been proposed in the literature. We present in this paper an architecture based on signature-based index structures coupling visual and temporal features, together with an N-gram matching and scoring framework. The techniques we cover are robust and insensitive to general video editing and/or degradation, making them well suited to re-broadcast video search. Through the use of signature-based indexing and N-gram matching and scoring, we identify corresponding query and index contents accurately in order to detect near-duplicate videos, even when these contents constitute only a small section of the videos being compared. Experiments are carried out on large quantities of video data collected from the TRECVID 02, 03 and 04 collections and real-world video broadcasts recorded from two German TV stations. An empirical comparison with two state-of-the-art dynamic programming techniques is encouraging and demonstrates the advantage and feasibility of our method.

Bashar Tahayna, Mohammed Belkhatir

Session 9: Visual and Spatial Analyses

Affective Space Exploration for Impressionism Paintings

In this paper, we explore the affective content of Impressionism paintings. While past analysis of artworks concentrated on artistic concept annotation, such as styles and periods, a more perceptual aspect is to investigate the emotions artists projected into their works. We propose an affective space to fuse all affective factors derived from features. Since the combination of colors is more sensitive and intuitive with respect to emotions, a meta-level feature, color harmony, is introduced to bridge the semantic gap. By considering the correlations among features and emotions, the affective space is explored through multiple-type latent semantic analysis. Experimental results show the effectiveness of the harmonic feature and the affective space via multi-label emotion annotation. Some potential applications based on the affective space are demonstrated as well, including painting emotionalization and an emotion-based slideshow system.

Cheng-Te Li, Man-Kwan Shan
Collaborative Video Scene Annotation Based on Tag Cloud

In this paper, we propose a video scene annotation method based on tag clouds. First, user comments associated with a video are collected from existing video sharing services. Next, a tag cloud is generated from these user comments. The tag cloud is displayed on the video window of the Web browser. When users click on a tag included in the tag cloud while watching the video, the tag gets associated with the time point of the video. Users can share the information on the tags that have already been clicked. We confirmed that the coverage of annotations generated by this method is higher than that of the existing methods, and users are motivated to add tags by using tag-sharing and tag-cloud methods. This method assists in establishing highly accurate advanced video applications.

Daisuke Yamamoto, Tomoki Masuda, Shigeki Ohira, Katashi Nagao
A Spatial-Temporal-Scale Registration Approach for Video Copy Detection

Video copy detection is an active research field with applications in copyright control, business intelligence, and advertisement monitoring. The main issues are transformation-invariant feature extraction and robust registration at the object level. This paper proposes a novel video copy detection approach based on spatial-temporal-scale registration. In detail, we first build interest point trajectories using speeded up robust features (SURF). Then we use an efficient voting-based spatial-temporal-scale registration approach to estimate the optimal transformation parameters and achieve the final video copy detection results by propagation of video segments in both the spatial-temporal and scale directions. To speed up detection, we use locality-sensitive hashing (LSH) to index trajectories for fast queries of candidate trajectories. Compared with existing approaches, our approach can detect many kinds of copy transformations, including cropping, zooming in/out, camcording, and re-encoding. Extensive experiments on 200 hours of videos demonstrate the effectiveness of our approach.

Shi Chen, Tao Wang, Jinqiao Wang, Jianguo Li, Yimin Zhang, Hanqing Lu
Vision-Based Semi-supervised Homecare with Spatial Constraint

Vision-based homecare systems receive increasing research interest owing to their efficiency, portability, and low cost. This paper presents a vision-based semi-supervised homecare system to automatically monitor the exceptional behaviors of self-helpless persons in a home environment. Firstly, our proposed framework tracks the behavior of the surveilled individual using dynamic conditional random field tracker fusion, based on which we extract a motion descriptor by Fourier curve fitting to model behavior routines for exception detection. Secondly, we propose a Spatial Field constraint strategy to assist SVM-based exceptional action decisions with a Bayesian inference model. Finally, a novel semi-supervised learning mechanism is presented to overcome the exhaustive labeling required in previous works. Experiments on a home-environment video dataset with five normal and two exceptional behavior categories show the advantage of the proposed system compared with previous works.

Tianqiang Liu, Hongxun Yao, Rongrong Ji, Yan Liu, Xianming Liu, Xiaoshuai Sun, Pengfei Xu, Zhen Zhang

Session 10: Multimedia Human Computer Interfaces

Interaction Reproducing Model: A Model for Giving Supports Appropriate to User State

This paper presents a novel Interaction Reproducing Model (IRM) for the purpose of adjusting computer user support to match the state of the user, and it consists of a set of Interaction Finite State Machines (I-FSMs). Each I-FSM is trained using actual interaction records, and it represents an ideal interaction pattern for a user state. It can choose an appropriate system action by reproducing the interaction pattern of the I-FSM most similar to the current interaction. We developed a prototype teaching system, and conducted preliminary experiments. The results show that user impressions using our approach were better than when using the system without our approach.

Hideki Aoyama, Motoyuki Ozeki, Yuichi Nakamura
Multi-projector Calibration and Alignment Using Flatness Analysis for Irregular-Shape Surfaces

A multi-projector calibration and alignment method, which makes no assumptions about the projection surface's shape, is presented. Based on surface flatness analysis, the method automatically partitions the projection surface into pieces, more at non-flat regions and fewer at flat ones, as flat regions cause less distortion. Corresponding to these pieces, an image is first cut into sub-images and then projected onto the surface using a texture warping method. The principle that ensures intra-projector distortion correction is proved: if a camera captures an undistorted projected image, the human eye will also observe an undistorted one at that spot. Multi-projector alignment is achieved by projecting images that are rendered according to the camera's extrinsic parameters during the course of coded structured light capturing. Because the camera can capture a wide field of view by moving and rotating itself, the alignment method has built-in support for scalability in the field of view. The whole work of calibration and alignment can be done in a single camera capture process. Results show that the proposed method is efficient and robust.

Qingshu Yuan, Dongming Lu
Embedded Tags and Visual Querying for Face Photo Retrieval

We investigate the utility of automated digital photo retrieval using faceboxes. The sizes and locations of faces in each photo are automatically detected and embedded within the image file. An interactive user interface allows querying for photos visually, in a simple and intuitive manner. We present a user study conducted for evaluating the system on a personal collection of 10,000 digital photos, and report the results. The average search time, including time for sketching and browsing the results, is 31.2 seconds. Usability ratings indicate that the interface is easy to use, and useful as a tool for photo retrieval.

Chaminda de Silva, Toshihiko Yamasaki, Kiyoharu Aizawa
Illumination Invariant Face Detection Using Classifier Fusion

An approach to the problem of illumination variations in face detection that uses classifier fusion is presented. Multiple face detectors are separately trained for different illumination environments and their results are combined using a combination rule. To define the illumination environments, the training samples are clustered based on their illumination using unsupervised training. Different methods of clustering the samples and combining the outputs of the classifiers are examined. Experiments with the AR face database show that the proposed method achieves higher accuracy than the traditional monolithic face detection method.

Alister Cordiner, Philip Ogunbona, Wanqing Li
Nonlinear Characterisation of Fronto-Normal Gait for Human Recognition

We present a novel analysis of multimedia data that is useful in human-computer interfacing. By analyzing video content of humans walking towards a camera, we establish the nonlinear nature of fronto-normal human gait, which motivates the use of the nonlinear dynamical analysis of chaos theory to analyze human gait. In doing so, we obtain features that may be used as a biometric for the automatic identification of humans by computers. We apply this in a multi-biometric experiment to demonstrate its effectiveness.

T. K. M. Lee, Mohammed Belkhatir, P. A. Lee, S. Sanei

Session 11: Multimedia Security and DRM

Steganalytic Measures for the Steganography Using Pixel-Value Differencing and Modulus Function

Steganalytic measures are proposed to defeat the steganographic method using pixel-value differencing (PVD) and the modulus function. The modulus PVD steganography method utilizes the remainder of two consecutive pixels to embed the secret data, which gains more flexibility and is capable of deriving the optimal remainder of the two pixels with the least distortion. However, there exist unavoidable weaknesses: fluctuation of the histogram, asymmetry of the 1 and -1 histogram values, and abnormal increase of the 0 histogram value. The steganalytic measures are designed using these weaknesses of the modulus PVD steganography method. Through experiments, we show that the proposed steganalytic measures successfully defeat the modulus PVD steganography method.

Jeong-Chun Joo, Hae-Yeoun Lee, Cong Nguyen Bui, Won-Young Yoo, Heung-Kyu Lee
Document Forgery Detection with SVM Classifier and Image Quality Measures

This paper presents a detection scheme for fraudulent documents made by printers. A forged document is indistinguishable by the naked eye from a genuine document because of the technological advances in printing methods. Even though we cannot find any visual evidence of forgery, the forged document includes inherent device features. We propose a method to uncover these features. Seventeen image quality measures are applied to discriminate between genuine and fake documents. The results of each measure are used as training and testing parameters of an SVM classifier to detect fake documents. Preliminary experimental results are presented based on fraudulent gift vouchers made by several color printers.

Seung-Jin Ryu, Hae-Yeoun Lee, Il-Weon Cho, Heung-Kyu Lee
NAL Level Encryption for Scalable Video Coding

In this paper, we propose an encryption scheme for Scalable Video Coding (SVC). The proposed video encryption method provides secured SVC content with a selective encryption scheme. The main feature of this scheme is making use of the characteristics of Scalable Video Coding. The encryption procedures are carried out at the Network Abstraction Layer (NAL) level. Of all NAL units, Instantaneous Decoding Refresh (IDR) pictures, the Sequence Parameter Set (SPS), and the Picture Parameter Set (PPS) are encrypted with a stream cipher. The Leak EXtraction (LEX) algorithm is adopted to reduce computational cost. Experiments were performed to verify the proposed methods using the joint scalable video model (JSVM), and the experimental results show that the proposed methods provide high security and low computational cost, as well as support video encryption with a scalability function.

Chunhua Li, Xinxin Zhou, Yuzhuo Zhong
Robust Digital Image Watermarking Based on Principal Component Analysis and Discrete Wavelet Transform

In this paper, a robust image watermarking algorithm based on principal component analysis and the discrete wavelet transform is proposed for copyright protection. First, in the proposed watermark embedding scheme, an orientation histogram is obtained from the directions of the image feature points identified by the scale-invariant feature transform. The highest peak of the orientation histogram, regarded as the dominant orientation of the image, serves as a start index to divide the image into bins covering the 360-degree range of orientation. Moreover, the most important principal components of the image coefficients within each bin are obtained after applying principal component analysis. Finally, the copyright watermark is inserted into the wavelet coefficients of these components in quantization steps. Similarly, the proposed watermark detection scheme follows the above procedure to blindly extract the copyright watermark from a watermarked image. Various attacks are applied to the watermarked images in order to examine the robustness of our algorithm. The experimental results show that the proposed watermarking algorithm can tolerate JPEG compression, median filtering, Gaussian filtering, sharpening, and rotation attacks.

Jen-Sheng Tsai, Win-Bin Huang, Po-Hung Li, Chao-Lieh Chen, Yau-Hwang Kuo
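To illustrate only the quantization-based insertion into wavelet coefficients (the SIFT orientation binning and PCA steps of the algorithm above are not reproduced), here is a minimal sketch assuming the PyWavelets package; the wavelet, the quantization step delta, and the choice of the approximation subband are illustrative assumptions.

```python
import numpy as np
import pywt  # PyWavelets, assumed available

def embed_bits(img, bits, delta=16.0, wavelet="haar"):
    """Quantise approximation-band DWT coefficients to carry watermark bits (QIM)."""
    cA, detail = pywt.dwt2(img.astype(float), wavelet)
    flat = cA.flatten()
    assert len(bits) <= flat.size, "watermark longer than the approximation band"
    for i, b in enumerate(bits):
        offset = 0.0 if b == 0 else delta / 2.0       # two interleaved lattices
        flat[i] = np.round((flat[i] - offset) / delta) * delta + offset
    return pywt.idwt2((flat.reshape(cA.shape), detail), wavelet)

def extract_bits(img, n_bits, delta=16.0, wavelet="haar"):
    """Blind extraction: pick the lattice each coefficient lies closest to."""
    cA, _ = pywt.dwt2(img.astype(float), wavelet)
    c = cA.flatten()[:n_bits]
    d0 = np.abs(c - np.round(c / delta) * delta)
    d1 = np.abs(c - (np.round((c - delta / 2) / delta) * delta + delta / 2))
    return (d1 < d0).astype(int)
```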

Session 12: Advanced Image and Video Processing

Tone Adjustment of Underexposed Images Using Dynamic Range Remapping

We present a new method for automatically adjusting the tonal values of underexposed digital images. We make the most of the dynamic range of digital images and adjust the tonal values through dynamic range remapping, with a specifically defined tone mapping operator. The operator comprises two concatenated terms. The first is a global operator that adjusts the tonal values of the underexposed image with a linear scale transformation as well as a nonuniform intensity reduction function. The second is a local operator used for noise suppression and detail enhancement. With such an operator, the tonal values of underexposed images are faithfully adjusted, while noise is suppressed without introducing noticeable artifacts into the resulting images. Our method runs with high efficiency. Experimental results demonstrate its effectiveness.

Yanwen Guo, Xiaodong Xu
A Novel Video Text Detection and Localization Approach

Text detection and localization in videos are often used for video information indexing and retrieval, as text carries semantic information about the video. In this paper, we propose a novel approach to detect and localize text by integrating multiple video frames and their motion features. For text detection, first, motion feature detection is employed to perform multiple-frame verification. Second, the synthesized motion feature image, produced from motion vectors over consecutive frames, is used to detect the text region in a synthesized image produced by multi-frame integration. Third, corner points are employed to locate candidate text pixel positions, and a region growing algorithm is developed to connect these pixels into text blocks. For text localization, we use the corner points to accurately locate the text region. Experimental results show satisfying performance of the proposed algorithm.

Xiaodong Huang, Huadong Ma, Haidong Yuan
The Influence of Regularization Parameter on Error Bound in Super-Resolution Reconstruction

Regularization is widely used to address the ill-conditioned nature of super-resolution (SR) reconstruction and to improve its performance. The tradeoff between fidelity to the data (favored by small values of the regularization parameter) and the smoothness of the SR result necessitates choosing the regularization parameter to obtain the optimal solution. In this paper, the objective relative error is analyzed to explore the influence of the regularization parameter on SR reconstruction performance. With the optimal regularization parameter, we derive a relative error bound. The analysis is verified by experimental results.

Minmin Shen, Ping Xue, Ci Wang
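For reference, the standard regularized SR formulation behind the fidelity/smoothness tradeoff discussed above can be written as follows (the paper's specific relative error bound is not reproduced here); H models warping, blurring and downsampling of the high-resolution image, C is a smoothness operator, and λ is the regularization parameter.

```latex
\hat{\mathbf{x}}_{\lambda}
  = \arg\min_{\mathbf{x}}\;\|\mathbf{y}-\mathbf{H}\mathbf{x}\|_2^2
    + \lambda\,\|\mathbf{C}\mathbf{x}\|_2^2
  = \bigl(\mathbf{H}^{\top}\mathbf{H}+\lambda\,\mathbf{C}^{\top}\mathbf{C}\bigr)^{-1}
    \mathbf{H}^{\top}\mathbf{y}
```

Small λ favors fidelity to the observed low-resolution data y, while large λ favors the smoothness prior; this is exactly the tradeoff the regularization parameter controls in the analysis above.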
Geometrical Compensation Algorithm of Multiview Image for Arc Multi-camera Arrays

In this paper, we propose a geometrical compensation algorithm for multiview images captured by an arc multi-camera array. Capturing multiview images or multiview video requires more than two cameras arranged in a multi-camera array. Such an array, however, suffers from geometrical errors because it is built manually. These errors include vertical and horizontal mismatches among images and irregular camera rotations, and they become serious obstacles to many three-dimensional (3-D) and multiview image processing applications such as depth map estimation and intermediate view generation. It is therefore necessary to compensate for geometrical errors in multiview images. The proposed algorithm simultaneously adjusts the positions, orientations, and internal characteristics of the cameras arranged on an arc multi-camera array. Experimental results show that the algorithm reduces the vertical mismatch in pixels among images and yields equal horizontal disparities and equal angles between adjacent cameras.

Yun-Suk Kang, Yo-Sung Ho
A Unified Hierarchical Appearance Model for People Re-identification Using Multi-view Vision Sensors

Surveillance of wide areas requires a system of multiple cameras to keep observing people. In such a multi-view system, the appearance of a person captured by one camera usually differs from the appearances captured by other cameras. To correctly identify people, the unique appearance model of each object should be invariant to such changes. In this paper, the appearance model is represented by a hierarchical structure in which each node maintains a color Gaussian mixture model (GMM), and re-identification is performed with Bayesian decision. Experimental results show that our unified appearance model is robust to rotation and scaling variations. Furthermore, it achieves a high accuracy rate (92.7% on average) and high processing speed (above 30 FPS) without a tracking mechanism.

Jau-Hong Kao, Chih-Yang Lin, Wen-How Wang, Yi-Ta Wu
An XML-Based Comic Image Compression

Traditional raster images cannot be fitted to different display sizes dynamically. Although some conversion systems can convert raster data into vector formats, the files they produce are large and costly to render. In this study, we adopt SVG, developed by the W3C, as the target vector format. We use vector contour searching algorithms to remove extra spaces, combine clips of similar slope, and merge similar color regions in order to compress the vector images, converting comic images from raster format to SVG. The factors affecting image quality and compression efficiency are studied. Compared with existing methods, ours performs better in image quality, compression ratio, and rendering speed, and it allows the converted images to be displayed on different platforms.

Ray-I Chang, Yachik Yen, Ting-Yu Hsu

Session 13: Multimedia Database and Retrieval

Compensated Visual Hull with GPU-Based Optimization

We propose an advanced visual hull technique that compensates for outliers using the reliabilities of silhouettes. The proposed method consists of a foreground extraction technique with multiple thresholds based on the Generalized Gaussian Family model and a compensated visual hull algorithm. We show that the proposed technique constructs a compact visual hull even in the presence of segmentation errors and occlusions. The 3D reconstruction and rendering processes are implemented on a graphics processing unit to greatly reduce computation time.

Hansung Kim, Ryuuki Sakamoto, Itaru Kitahara, Tomoji Toriyama, Kiyoshi Kogure
Video News Retrieval Incorporating Relevant Terms Based on Distribution of Document Frequency

This paper presents an approach to video news retrieval within an event by integrating visual and textual features. A set of histogram bins of the key frames in a shot is adopted as the visual feature, while term frequency is used as the textual feature. A term scoring method is proposed to enhance the weights of terms relevant to an event by considering the windowed document frequency distribution: the weight of a term is determined by the mean difference between its usual and unusual term groups, which are quantized by the boxplot method. The first experiment evaluates the performance of the proposed method on generated document frequency distributions, while the second shows that it retrieves the desired relevant terms on real data. We conclude that the proposed method improves the performance of retrieving video news stories within an event using relevant terms.

Jun-Bin Yeh, Chung-Hsien Wu
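
The boxplot grouping mentioned above follows the standard interquartile-range rule; the sketch below is a generic illustration of splitting a term's windowed document-frequency series into "usual" and "unusual" groups and scoring the term by their mean difference. The exact windowing and scoring details used by the authors may differ.

```python
import numpy as np

def boxplot_outlier_split(df_series):
    """Split a windowed document-frequency series into 'usual' and 'unusual'
    groups with the standard boxplot rule (above Q3 + 1.5*IQR is unusual)."""
    q1, q3 = np.percentile(df_series, [25, 75])
    upper = q3 + 1.5 * (q3 - q1)
    return df_series[df_series <= upper], df_series[df_series > upper]

def term_score(df_series):
    """Illustrative score: mean difference between the unusual and usual
    groups, so terms that burst within an event get higher weights."""
    usual, unusual = boxplot_outlier_split(np.asarray(df_series, dtype=float))
    return float(unusual.mean() - usual.mean()) if unusual.size else 0.0

print(term_score([2, 3, 2, 4, 3, 2, 30, 28, 3]))   # bursty term -> large score
print(term_score([2, 3, 2, 4, 3, 2, 3, 2, 3]))     # flat term  -> 0.0
```
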
ROISeer: Region-Based Image Retrieval by Hierarchical Feature Filtering

In this paper, an image retrieval system called ROISeer is proposed. ROISeer uses a region-based approach for feature extraction and integrated region matching based upon color and shape similarity measurements. Unlike most existing query-by-example systems, the proposed system applies hierarchical feature filtering instead of single-feature or multi-feature fusion. Because color is the most significant factor driving the attention of the human visual system (HVS), the image segmentation procedure is applied only to images in which a similar color patch is present. This design significantly reduces the complexity of image segmentation for large-scale image databases. In addition, to enhance the reliability of the system, a region-based shape matching algorithm is utilized, which ensures that the same region is robustly retrieved when the target is rotated, resized, or shifted. Experimental results show that ROISeer provides robust and good performance for region-of-interest (ROI) retrieval in single-category and cross-category image databases.

Zou Lei, Bizhong Wei, Xiaodong Cai, Yuelin Chen
High-Performance Image Annotation and Retrieval for Weakly Labeled Images Using Latent Space Learning

Image annotation and retrieval are among the most promising new internet search technologies and have widespread applications. However, the task is very difficult because of the generic nature of the target images. In this paper, we propose a high speed and high accuracy image annotation and retrieval method for miscellaneous objects and scenes. This method combines the higher-order local auto-correlation (HLAC) features with the probabilistic canonical correlation analysis framework. A distance between images can be defined in the intrinsic feature space for annotation using latent space learning between images and labels. The HLAC features have additive and position invariance properties, which makes them well-suited for images in which the positions and number of objects are arbitrary. The proposed method is shown to be faster and more accurate than previously published methods.

Hideki Nakayama, Tatsuya Harada, Yasuo Kuniyoshi, Nobuyuki Otsu
The Many Facets of Progressive Retrieval for CBIR

Recently, progressive retrieval has been advocated as an alternative to multidimensional indexes and approximate techniques for accelerating similarity search of points in multidimensional spaces. The principle of progressive search is to offer a first subset of the answers to the user during retrieval. If this subset satisfies the user's needs, retrieval stops; otherwise search resumes, and after a number of steps the exact answer set is returned. Such a process is justified by the fact that, in a large number of applications, it is more useful to rapidly obtain approximate answer sets than to wait a long time for the exact one. The contribution of this paper is a first typology of existing techniques for progressive retrieval. We survey a variety of methods designed for image retrieval, although some of them apply to a general database browsing context that goes beyond CBIR. We also include techniques that were not designed for progressive retrieval but can easily be adapted to it.

Nouha Bouteldja, Valerie Gouet-Brunet, Michel Scholl

Special Session 1: Multimedia Management and Authoring

SemanGist: A Local Semantic Image Representation

Although various kinds of image features have been proposed, there is no single optimal feature that can replace all other features in multimedia analysis applications such as image annotation. In this paper, we propose a novel image representation, Semantic Gist (SemanGist), that automatically combines the merits of multiple features. Given a local image patch, SemanGist converts multiple low-level features of the patch into compact prediction scores over a few predefined semantic categories, using a discriminative multi-label boosting algorithm. This local SemanGist output allows semantic spatial context among adjacent patches to be incorporated; for applications like image annotation, this can further reduce annotation errors by enforcing label compatibility. The same boosting algorithm is applied to the SemanGist representation together with low-level features to ensure label compatibility. Experiments on an image annotation task show that SemanGist not only yields a compact representation but also incorporates spatial context at low run-time computational cost.

Dong Wang, Xiaobing Liu, Duanpeng Wang, Jianmin Li, Bo Zhang
Analysis of Human Actions for Video Indexing

Automatically understanding human actions is crucial for efficiently indexing many types of videos, such as sports videos, home videos, and movies. It is challenging, however, because of variations caused by different actors, scales, and views. To accommodate these variations, most methods in the literature sacrifice the discriminability of their action models. In this paper, we address the tradeoff between invariance and discriminability. We first propose a novel set of pixel-wise features that are invariant to actor appearance, scale, and motion direction. Multi-prototype action models are then constructed to achieve view invariance. By moving the most challenging invariance from the feature level to the model level, we maintain the discriminability of the action models. Extensive experiments demonstrate the good performance of the proposed method.

Zhuoyuan Chen, Peng Cui, Lifeng Sun, Shiqiang Yang
AAML Based Avatar Animation with Personalized Expression for Online Chatting System

This paper presents an online chatting system with automatic expression animation and an animated avatar. The system is built upon an XML-based language, AAML (Affective Animation Markup Language), which hierarchically describes affective content for online communication. When the user types a chat sentence, the AAML produces a piece of personalized facial animation, which is immediately sent to the remote client to enrich the chatting process. A distinctive feature of our system is that its structure is hierarchical and open, so users can extend AAML by defining their own tags to express personal affective content. In addition, caricature generation from an input facial photograph is embedded in the framework, so the animation is synthesized from the caricature for an entertaining effect. Successful application of the online chatting system shows that it enhances affective interaction through customized facial animation and an animated avatar that resembles the user. The approach can also be widely applied in online or mobile environments, for example for multimedia message synthesis.

Junfa Liu, Yiqiang Chen, Xingyu Gao, Jinjing Xie, Wen Gao
Exploring Music Video Editing Rules with Dual-Wing Harmonium Model

Automatic music video editing is still a challenging task because little is known about how music and video should be matched to produce attractive effects. Previous work usually matches music and video based on assumptions or empirical knowledge. In this paper, we use a dual-wing harmonium model to learn and represent the underlying music video editing rules from a large dataset of music videos. The editing rules are extracted by clustering the low-dimensional representations of music video clips. In the experiments, we give an intuitive visualization of the discovered editing rules. These rules partially reflect the skills of professional music video editors and can be used to further improve the quality of automatically generated music videos.

Chao Liao, Patricia P. Wang, Yimin Zhang
Web-Scale Image Annotation

In this paper, we describe our experiments using Latent Dirichlet Allocation (LDA) to model images containing both perceptual features and words. To build a large-scale image tagging system, we distribute the computation of LDA parameters using MapReduce. Empirical study shows that our scalable LDA supports image annotation both effectively and efficiently.

Jiakai Liu, Rong Hu, Meihong Wang, Yi Wang, Edward Y. Chang

Special Session 2: Multimedia Personalization

Highlight Ranking for Broadcast Tennis Video Based on Multi-modality Analysis and Relevance Feedback

Most existing work on sports video analysis concentrates on highlight extraction; few efforts have addressed the important issue of organizing the extracted highlights according to user preference. In this paper, we propose a novel approach to rank the highlights extracted from broadcast tennis video based on multi-modality analysis and relevance feedback. First, visual and auditory features are employed to construct mid-level representations of the content of broadcast tennis video. Then, affective features are extracted from the mid-level representations and multiple ranking models are built using a nonlinear regression algorithm. Finally, the ranking models are linearly combined to generate the final highlight ranking. A relevance feedback technique is employed to capture the user's interest in the visual and auditory attention spaces and adjust the ranking to the user's preference. The experimental results are encouraging and demonstrate that our approach is effective.

Guangyu Zhu, Qingming Huang, Yihong Gong
Affective Content Detection by Using Timing Features and Fuzzy Clustering

Emotional factors directly reflect audiences' attention, evaluation, and memory, and movie affective content detection is attracting increasing research effort. Most existing work focuses on developing efficient affective features or applying feasible pattern recognition algorithms, but two important issues are ignored. First, most features used in affective content detection are traditional visual/audio features, whereas the task calls for features directly related to emotions. Second, affective content is a subjective concept that heavily depends on human perception, and it is hard to draw clear boundaries between emotion categories, yet most existing methods use hard pattern recognition algorithms that impose such boundaries. In this paper, we address both issues. First, we employ timing features, which are an important part of a film's power to affect viewers' feelings and emotions, together with audio features, to detect affective content from multiple modalities. Second, fuzzy clustering is used to map affective features to affective content; fuzzy logic provides a mathematical model for vagueness that is close to human perception. Experimental results show that the proposed method is effective and efficient.

Min Xu, Suhuai Luo, Jesse S. Jin
Speaker Clustering Aided by Visual Dialogue Analysis

Speaker clustering aims to automatically group speech segments by speaker. With speaker clustering, we can discover the main cast of long videos and retrieve their relevant video clips for efficient browsing. In this paper, we propose a dialogue-supervised speaker clustering method, which uses visual dialogue analysis results to improve the performance of speaker clustering. Compared with the traditional approach based only on acoustic features, the dialogue-supervised approach achieves significant improvements in clustering results for movies and TV series.

Shuang Zhang, Wei Hu, Tao Wang, Jia Liu, Yimin Zhang
Personalized Multimedia Retrieval in CADAL Digital Library

CADAL is a large digital library containing millions of digital books and large volumes of multimedia resources such as videos and images. To overcome the problem of information overload, we propose a framework for personalized cross-media retrieval in the CADAL digital library and present the details of its algorithms. Cross-media retrieval is a new kind of retrieval in which query examples and search results can be of different modalities. To support personalization, we construct a uniform cross-media correlation graph from three kinds of information: low-level features of media objects, co-existence information between them, and semantic correlations between MMDs mined from large amounts of logs. Moreover, we use an in-session relevance feedback approach to mine hints from positive and negative examples and boost retrieval performance for individual users.

Yin Zhang, Jiangqin Wu, Yueting Zhuang
Personalized 3D Caricature Generation Based on PCA Subspace

3D caricature generation is an increasingly interesting and challenging research topic. This paper presents personalized 3D caricature generation using principal component analysis (PCA) for each 3D caricature component. First, we manually construct a 3D caricature dataset and then build component datasets according to attributes defined for each component. Next, PCA is employed to create a subspace for each 3D caricature component, and diverse, vivid components are generated interactively. Finally, the generated components are combined into a 3D caricature based on the true 3D face. Our approach makes it convenient and efficient to generate many kinds of personalized 3D caricatures: a user simply drags sliders for the principal components of each caricature component.

Xingyu Gao, Yiqiang Chen, Junfa Liu, Jingye Zhou
Ontology-Based Inter-concept Relation Fusion for Concept Detection

Although detectors for individual concepts have been widely studied in the multimedia search area, inter-concept relations have received relatively little attention, especially when a hierarchical concept taxonomy is not manually constructed beforehand. In this paper, we present an ontology-based concept fusion method for building more reliable concept detectors from multiple independent detectors. Specifically, two logical relations among concepts are defined in advance so that an ontology hierarchy can be built explicitly using a decision rule based on a relation strength function. With the ontology hierarchy in place, an effective fusion strategy is explored to construct an improved detector for each concept. Evaluation on the TRECVID 2006 test set shows that the proposed method achieves notable and consistent improvement.

Shikui Wei, Yao Zhao, Zhenfeng Zhu

Special Session 3: Multimedia for E-Learning

Building Video Concordancer Supported English Online Learning Exemplification

Language is acquired through human interaction and the contexts in which it is used. For English-as-a-Foreign-Language (EFL) students, language presentation must attend to the roles played by sound, gesture, and context in genuine communication in order for learners to understand the meaning of the language. Using video clips from contemporary films and videos is an alternative approach that lets students acquire the target language in a real-world context. In this study, an innovative learning-assistance system named the Video Concordancer is proposed. It uses a word-stem algorithm to retrieve specific words and their families from a multimedia corpus, and then applies a mutual information approach to analyze word collocations so that word usage can be presented in a variety of video contexts. This multimedia-based aid enables students in Internet-based English learning settings to find engaging, authentic examples of language in use, giving them opportunities to strengthen their language skills in a culturally relevant way.

Chih-Hsiung Fu, Kun-Te Wang, Shu-Chen Cheng, Ting-Wei Hou
A Replication-Aware CDN-P2P Architecture Based on Two-Step Server Selection and Network Coding

E-learning services, especially multimedia content education, are becoming increasingly popular. Two different technologies have been proposed for distributing content to end users: Content Distribution Networks (CDN) and Peer-to-Peer (P2P) networks. Both have limitations: CDN servers are expensive to deploy and maintain, while in a P2P network anyone can share content regardless of copyright. In this paper, we propose a replication-aware CDN-P2P architecture that integrates both technologies. We propose a two-step, landmark-based selection approach to find the nearest replica-cache server and peer caches from which to download content. Furthermore, recent studies have shown that network coding is beneficial for large-scale P2P content distribution, and we show how to apply network coding to distribute content so that the content provider's rights can be protected.

Hung-Chang Yang, Min-Yi Hsieh, Hsiang-Fu Yu, Li-Ming Tseng
A Study of Listening Diversity and Speaking for English Learning with Mobile Device Supports

This paper proposes a mobile learning system that supports diverse listening and speaking activities and investigates its effect on elementary school English learning. The study aims to help EFL (English as a Foreign Language) students, who usually learn English only through reading and writing in class, by using mobile devices to stimulate their motivation to speak and listen. We designed learning activities with mobile devices to promote students' speaking and to let them listen to the varied speech of their peers, and we investigated the correlations between students' speaking and listening behaviors and their influence on learning. The results show that activities such as listening to diverse speakers, listening frequency, listening involvement, and speaking out are significantly correlated with students' English learning achievement. The designed mobile activities were also found to stimulate students' motivation and facilitate their English speaking and listening. We suggest that English instruction can draw on this study and design mobile-device activities that encourage students to speak and to listen to various high-achieving peers in order to learn English better.

Wu-Yuin Hwang, Sheng-Yi Wu, Jia-Han Su
The Search Engine for Articles and Multimedia in Blogs

With the development of information technology and the Internet, more and more people share multimedia files online, including images, animation, and music. Blogs have become one of the main channels for sharing multimedia and life and study experiences. There is now a huge amount of text and multimedia in blogs, but the quality of blog content is so uneven that it can negatively affect learners who rely on it. This research uses features of articles and multimedia to categorize the multimedia in blogs. The system shows that effectively separating high-quality from low-quality multimedia can improve learning by helping learners obtain high-quality articles and multimedia.

Shu-Chen Cheng, Kao-Pin Huang, Yun-Chung Chen
Reversible Fragile Watermarking for E-Learning Media

Fragile watermarking techniques are widely applied to protect the ownership of digital media, and for learning materials high-quality images can enhance learning efficiency. This study presents a reversible fragile watermarking scheme with two-level blocks for e-Learning media based on histogram shifting. The method can validate the integrity of the protected image while maintaining its high quality. Experimental results reveal that each test image retained high quality after marking. Furthermore, the proposed method can detect and locate the blocks that have been tampered with.

Yu-Chiang Li, Cheng-Jung Tsai, Wei-Cheng Wu
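
Histogram shifting is a standard reversible-embedding primitive; the sketch below is a generic single-level version rather than the paper's two-level block scheme. It assumes the image has an empty histogram bin to the right of its peak bin and that the payload length is known at extraction.

```python
import numpy as np

def hs_embed(img, bits):
    """Generic reversible histogram-shifting embedding. Pixels with values
    strictly between the peak bin P and an empty bin Z are shifted by +1 to
    free bin P+1; pixels equal to P then carry one bit each (stay at P for
    bit 0, move to P+1 for bit 1)."""
    out = img.astype(np.int64).copy()
    hist = np.bincount(out.ravel(), minlength=256)
    peak = int(hist.argmax())                          # assumed < 254 here
    zero = peak + 1 + int(np.argmin(hist[peak + 1:]))  # assumed truly empty
    flat = out.ravel()
    flat[(flat > peak) & (flat < zero)] += 1           # make room at P+1
    carriers = np.flatnonzero(flat == peak)[:len(bits)]
    flat[carriers] += np.asarray(bits[:len(carriers)], dtype=np.int64)
    return flat.reshape(img.shape).astype(np.uint8), peak, zero

def hs_extract(marked, peak, zero, n_bits):
    """Recover n_bits payload bits and restore the original image exactly."""
    flat = marked.astype(np.int64).ravel()
    carriers = np.flatnonzero((flat == peak) | (flat == peak + 1))[:n_bits]
    bits = [int(flat[i] == peak + 1) for i in carriers]
    flat[flat == peak + 1] = peak                      # undo the embedding
    flat[(flat > peak) & (flat <= zero)] -= 1          # undo the shift
    return bits, flat.reshape(marked.shape).astype(np.uint8)

# Round-trip check on a synthetic 8-bit image (values kept below 200 so an
# empty bin is guaranteed to exist).
rng = np.random.default_rng(0)
original = rng.integers(0, 200, size=(64, 64), dtype=np.uint8)
payload = [1, 0, 1, 1, 0, 0, 1]
marked, p, z = hs_embed(original, payload)
recovered, restored = hs_extract(marked, p, z, len(payload))
assert recovered == payload and np.array_equal(restored, original)
```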

Poster Session 1: Multimedia Networking Techniques

A Low Bit Rate Audio Bandwidth Extension Method for Mobile Communication

Mobile communication systems are expected to provide high-quality audio at low bit rates and low computational complexity. This paper proposes a novel audio bandwidth extension (BWE) method for audio coding, which derives the high-frequency synthesis filter by codebook mapping and transmits quantized gain corrections for the high-frequency part. Preliminary tests show that this method provides comparable audio quality with lower bit consumption than AMR-WB+.

Bo Hang, Rui-Min Hu, Xing Li, Yuan Fang, An-Chao Tsai
A New Side Information Generation Scheme for Distributed Video Coding

In this paper, a new side information generation scheme for distributed video coding (DVC) is proposed. The proposed scheme uses classification-based motion estimation and multiple block motion interpolation to generate the side information. Here, classification-based motion estimation tries to find the “true” motion vector of each block, instead of exploiting the highest temporal redundancy between two corresponding blocks in two key frames. Then, multiple block motion interpolation using several motion vectors for each block is performed so that the corresponding matched block in the side information can be interpolated more efficiently. Based on the simulation results obtained in this study, the proposed scheme successfully improves the perceptual quality and the average PSNR (dB) in both side information and reconstructed frames.

Ming-Hui Cheng, Jin-Jang Leou
Robust TCP Vegas in Wireless Networks

As wireless applications have been widely studied in recent years, the behavior of TCP in wireless environments has become an important issue. TCP was designed for wired networks and performs poorly over wireless links, where erratic errors result in very low TCP throughput. This study investigates TCP Vegas in a wireless environment and uses the queue length to determine the cause of packet loss, in order to reduce unnecessary window degradation and improve the utilization of the wireless network.

Rung-Shiang Cheng, Chung-Fan Liu, Yuan-Cheng Lai, Der-Jiunn Deng
A Robust Priority Assignment Mechanism for Video Data over Managed IP Networks

This paper proposes a Flexible Significance Classification (FSC) mechanism for video data over managed IP networks. FSC determines video packet significance in both the temporal and spatial domains. In the temporal domain, FSC evaluates packet significance based on the error propagation that would result if the packet were lost; in the spatial domain, it computes packet significance based on the content complexity within the packet. Compared with a traditional significance classification scheme, simulation results show that the proposed mechanism improves the accuracy of significance determination by up to 13% and the received video quality by up to 0.54 dB in PSNR.

Ya-Ju Yu, Chu-Chuan Lee, Pao-Chi Chang
An Efficient Method to Reduce Peer-to-Peer Streaming Latency

A large amount of multimedia content is now streamed to millions of users through peer-to-peer networks, most of which use an unstructured mesh-based topology that results in long end-to-end delays. Based on this observation, this paper proposes a novel topology construction approach, LayerP2P, which builds a low-latency structured overlay for live streaming by organizing peers into a multi-layer mesh-based topology. Since the maximum number of relay hops in the overlay is bounded by the number of layers, the average end-to-end delay is reduced. At the same time, the overlay remains resilient because peers self-organize in a decentralized way. Experiments on a simulated network of up to 500 peers illustrate the effectiveness of our approach.

Xinggong Zhang, Yan Pang, Zongming Guo
LRM: A Local Band Resource Management System

In conventional best-effort (BE) networks such as the Internet, the bandwidth available for video streaming is not fixed, and service performance depends heavily on the remaining local bandwidth. In this paper, a local band resource management (LRM) system is developed to manage the bandwidth of the local access network and provide Assured Forwarding (AF) service. Test results show that LRM not only manages local bandwidth resources effectively but also greatly improves the QoS of video streaming.

Guang-Qian Kong, Xun Duan, Jian-Shi Li
Maximum Traffic Routing Problems in Wireless Mesh Networks

This paper studies two NP-hard problems, the maximum bandwidth routing problem (MBRP) and the maximum flow routing problem (MFRP) in multi-hop wireless mesh networks (WMNs) to support diverse Quality-of-Service (QoS). First, upper bounds on their optimal values are derived, and a lower bound is derived on the feasible value obtained for the MBRP. Then, heuristic algorithms for the two problems are proposed. With the upper bound and the lower bound, an approximation ratio for the heuristic algorithm of the MBRP is obtained.

Yu-Liang Kuo, Chia-Cheng Hu, Chun-Yuan Chiu
An Efficient Broadcasting Scheme with Low Buffer Demand for Video-on-Demand Applications

Partitioning a video stream into multiple segments and broadcasting each segment on its own channel simultaneously and periodically is well recognized as an efficient approach to broadcasting popular videos. Following this design, the Fibonacci broadcasting (FiB) scheme allows a client with two channels of bandwidth to receive video segments. Extending FiB, this work designs a new scheme (called FiB+) that achieves both a small buffering space and a low bandwidth demand at the client. Extensive simulations demonstrate that FiB+ requires a client buffer about 30% smaller than FiB.

Hsiang-Fu Yu, Chu-Yi Chien, Hung-Chang Yang, Yuan-Chieng Huang

Poster Session 2: Multimedia Systems and Applications

Cross-View Object Tracking by Projective Invariants and Feature Matching

A key technique in multi-camera tracking systems is cross-view object tracking. Conventional solutions are based on Feature Matching (FM) or Field of View (FOV) methods. However, FM is computationally expensive and its results depend heavily on camera parameter settings, which limits its practicality, while FOV-based approaches suffer from delays in detecting newly appearing objects and are unreliable when only consistent labelling is used. In this paper, we propose a novel scheme for cross-view object tracking based on Projective Invariants (PI) and FM. Experimental results show that our method improves on standard PI-based tracking algorithms and, in particular, provides accurate tracking when multiple objects appear close together in the same area.

Ning Ouyang, Le-ping Lin, Zhao Liu
Metadata Retrieval Using RTCP for Multimedia Streaming

Media streaming has become very popular on the Internet, and metadata plays a significant role in media applications. With the growth of multimedia services in IP networks, RTSP and RTP have become de-facto standards for streaming, but these protocols do not themselves include metadata transmission. Instead, metadata delivery in IP networks is handled by other protocols such as HTTP or SOAP over HTTP, or by file delivery protocols such as FLUTE. Using a protocol different from the media streaming one, however, can introduce protocol issues such as TCP-friendliness problems. In this paper, we propose an efficient way of retrieving metadata using RTCP, so that the same protocol is used for both streaming and metadata retrieval. Embedding metadata into RTCP reports also makes it easy to control metadata transmission. Protocol extensions of SDP and RTSP for metadata subscription are also presented.

Seung-woo Kum, Tae Beom Lim, Seok Pil Lee
ICam: Maximizes Viewers’ Attention on Intended Objects

Continuous advances in modern cameras are making good photography easier. This paper proposes a methodology that makes a camera even smarter by rendering the objects of interest to the photographer more salient, thereby maximizing viewers' attention on the intended objects. Since a captured photo can be regarded as a medium for communicating the photographer's intent to potential viewers, the proposed methodology reduces the communication gap between them.

Rajarshi Pal, Pabitra Mitra, Jayanta Mukhopadhyay
Fast Algorithms for Computer Generated Islamic Patterns of 8-ZOHREH and 8-SILI

Interest in computer-generated graphics has grown among researchers in recent years, and many methods exist for applications such as architectural art. In this paper we present two new algorithms for computer-generated Islamic geometrical patterns, 8-ZOHREH and 8-SILI, described and implemented as simple, optimized algorithms. The proposed algorithms rely on basic geometric constructions such as orthogonal drawing, angle bisectors, and symmetry. The first algorithm generates the main region for 8-ZOHREH in eleven steps, and the second generates 8-SILI in twelve steps. Simplicity, robustness, low computational complexity, and high speed are the main features of the proposed algorithms, and their results are demonstrated in the last section.

Peyman Rasouli, Azam Bastanfard, Alireza Rezvanian, Omid Jalilian
A Scene Representation Application Implementing LASeR Using Object-Based Timing Model

In this paper, we present a multimedia scene representation application implementing the MPEG-4 Part 20 Lightweight Application Scene Representation (LASeR) standard. We propose an object-based timing model to ensure that all elements are rendered at the correct time, together with corresponding strategies for handling user actions.

Xiaocong Zhou, Jianping Chen, Tiejun Huang
High Performance Hardware Architecture of Linear Filters for Intelligent Video Processing

Image filtering is an essential process in the field of image processing, and linear image filters with large kernels are especially significant for computer vision or intelligent video processing. In this paper, a high performance hardware architecture of linear image filters, which is designed for embedded system in mobile devices, is proposed and analyzed. This linear filter hardware is capable of dealing with a 15 by 15 kernel by using massively parallel processing elements, and it offers a set of configurable parameters, which generalizes the functionality to handle different kinds of linear filters in most applications.

Tse-Wei Chen, Shao-Yi Chien
Towards a Universal Friendly P2P Media Streaming Application: An Evaluation Framework

Network-oblivious P2P applications produce heavy cross-ISP traffic and seriously disrupt ISP economics. End users care about quality, Internet Content Providers (ICPs) care about service scale, and ISPs focus on operating cost. This paper therefore provides a framework for evaluating P2P streaming application performance from the perspectives of all entities involved, i.e., ISPs, ICPs, and end users. Based on this analytical study, we propose potential solutions toward a friendly and cost-effective P2P streaming application that achieves the ideal of "more users = better performance + lower cost".

Zhijia Chen, Hao Yin, Chuang Lin, Fenlin Wu, Xuening Liu
Search-Based Automatic Web Image Annotation Using Latent Visual and Semantic Analysis

Automatic web image annotation is a practical and effective way to support both web image retrieval and image understanding. In this paper, we propose a search-based automatic web image annotation method using latent visual and semantic analysis. First, the semantic content of web images and the visibility of the words are combined to compute the probability that the initial annotations are present in the images. This set is then extended by latent visual and semantic analysis to find synonyms of the initial annotations. The final rankings are estimated with the help of commercial image search engines by mining content-based correlations. Experiments conducted on real-world web images demonstrate the effectiveness of the proposed approach.

Dingyin Xia, Fei Wu, Yueting Zhuang
An Ultra Large Area Scanner for Ancient Painting and Calligraphy

It is of great significance to digitize ancient paintings and calligraphy, especially for a country with five thousand years of history and rich cultural heritage. Millions of paintings and works of calligraphy are threatened by deterioration and even natural disasters, and the only way to preserve them permanently is to store digital replicas in computers, which many museums and libraries have planned or started to do. A typical acquisition approach is to use a linear-CCD-based large-area table scanner, but this kind of solution has serious drawbacks in both precision and scanning range that prohibit its use in museums and libraries. Our lab has recently developed a new device that overcomes these drawbacks and, we hope, will shed new light on the documentation of ancient paintings and calligraphy.

Xifan Shi, Dongming Lu, Changyu Diao
Exploiting Spatial Locality for Objects Layout in Virtual Environments

The on-disk sequentiality of requested data, i.e., its spatial locality, is critical to disk performance. Unfortunately, the spatial locality of cached data is largely ignored, and only temporal locality is considered in current system buffer cache management. Moreover, an individual object may participate in different relations in different applications. We propose a novel hypergraph scheme to represent the complex relations among objects. Instead of a local measure that depends only on the objects common to patterns, we propose a global measure based on the semantic properties of these patterns over the whole data set. Experiments show the effectiveness of the proposed framework.

Ching-Shun Hsieh, Hui-Ling Lin, Shao-Shin Hung, Shih-Hao Shen, Ching-Hung Yeh

Poster Session 3: Advanced Multimedia Techniques

A Parallax-Map-Based DIBR System to Compromise Visual Quality and Depth Resolution

Depth-image-based rendering (DIBR) is regarded as the most promising solution for 3D TV, but its inherent disadvantage is the exposure of occlusion areas, which leads to poor visual quality in synthesized virtual-view images. In previous studies, a strong asymmetric Gaussian low-pass filter has been used to reduce the number and size of disocclusion areas in synthesized images and thereby improve visual quality, but it also seriously attenuates depth resolution. DIBR should strike a good compromise between improving visual quality and preserving depth resolution. In this paper, the factors that induce poor visual quality and attenuate depth resolution are identified, and their mutual influence is analyzed. Based on this analysis, a parallax-map-based DIBR system that does not smooth depth images is proposed. Experimental results show that the proposed DIBR system improves visual quality while preserving depth resolution well within viewers' normal depth preferences.

Ting-Ching Lin, Hsien-Chao Huang, Yueh-Min Huang
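
For readers unfamiliar with DIBR, virtual views for a parallel (shift-sensor) camera setup are typically synthesized by shifting each reference pixel by a depth-dependent parallax; the textbook relation below is generic and may differ from the exact mapping used in the proposed system.

```latex
x_v \;=\; x_r \;\pm\; \frac{f\, b}{Z(x_r, y_r)}
```

Here f is the focal length, b the baseline between the reference and virtual cameras, Z the depth at the reference pixel, and the sign depends on the viewing direction. Warped pixels that receive no source value form the disocclusion holes discussed above.
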
Luminance Adjustment in Image and Video

Luminance coherence across the frames of a video sequence is an important issue: after adjustment, the video appears clear and its luminance is consistent. Similarly, when intensity is inpainted in an image or video, the result should display a natural new intensity, and contrast may also need to be adjusted so that detail colors and luminance remain consistent. In this work, we use the average intensity of the frames to address luminance differences in video: the mean of the frame average intensities gives a threshold that is used to correct the intensity of each frame. For luminance differences within images, we use histogram-based and tensor-voting-based methods. We present experiments for these approaches and compare the results of the histogram and tensor voting methods.

Wen-Ju Tsai, Timothy K. Shih, Yuan-Yu Chao, Hsing-Ying Zhong
Extracting Facial Features and Face Inpainting

In recent years, facial recognition techniques have been applied in various fields with significant results. When acquiring facial images, however, a face that is occluded or damaged greatly increases error and inaccuracy. The purpose of this paper is therefore not only to extract facial features but also to recover the damaged regions. We use a feature-based method to detect facial feature points and build a face database; in the recovery procedure, the system compares the damaged face with the faces in the database to find the most suitable one and uses it to reconstruct the damaged regions.

Lin Chun Yu, Nick C. Tang, Huang Bo Jun, Timothy K. Shih
A Two-Stage Approach to Highlight Extraction in Sports Video by Using AdaBoost and Multi-modal

In this paper, we propose a novel two-stage approach for highlight extraction in sports video. In the first stage, a preliminary classification is performed on the audio stream to locate highlight candidates, using the AdaBoost algorithm for feature selection and audio classification. In the second stage, we extract visual and temporal features of these candidates and feed them into a linear weighted model; the final highlight segments are determined from the model's output values. The advantages of this method are its low computational complexity and relatively high accuracy. Experimental results on tennis video demonstrate the effectiveness and efficiency of the proposed approach.

Shaojie Cai, Shuqiang Jiang, Qingming Huang
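
The second-stage fusion described above is, in its general form, a linear weighted model. The toy sketch below illustrates that idea only; the actual features, weights, and threshold used by the authors are not specified here, and the names below are hypothetical.

```python
import numpy as np

def highlight_score(features, weights, bias=0.0):
    """Linear weighted fusion of a candidate's features into one score."""
    return float(np.dot(weights, features) + bias)

# Hypothetical per-candidate features: [audio_excitement, replay_flag, duration].
candidates = {
    "clip_017": np.array([0.82, 1.0, 0.40]),
    "clip_042": np.array([0.15, 0.0, 0.90]),
}
weights = np.array([0.6, 0.3, 0.1])   # illustrative weights
threshold = 0.5
highlights = [name for name, f in candidates.items()
              if highlight_score(f, weights) >= threshold]
print(highlights)   # only clip_017 passes in this toy example
```
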
Local Separability Assessment: A Novel Feature Selection Method for Multimedia Applications

Feature selection can reduce feature redundancy and improve classification performance, but most general feature selection methods do not perform well on the high-dimensional, large-scale data sets of multimedia applications. In this paper we propose a novel feature selection method named Local Separability Assessment. We measure the separation level of samples in subregions of the feature space and integrate these measurements to evaluate the separability of features. Our method performs well on large-scale continuous data sets and requires no a priori hypothesis on the data distribution. Experiments on various applications demonstrate its effectiveness.

Kun Tao, Shou-Xun Lin, Yong-Dong Zhang, Sheng Tang
Extraction of Perceptual Hue Feature Set for Color Image/Video Segmentation

In this paper, we present a simple but effective algorithm for extracting a perceptual hue feature set for color image/video segmentation, with emphasis on color textures. Feature extraction has a significant impact on the overall image/video analysis process and plays a critical role in classification-based segmentation. The newly introduced feature set accurately characterizes color textures and is invariant to illumination, translation, and rotation, thanks to a statistical scheme that explores the distribution of six rudimentary colors and the achromatic component at local positions. The feature set provides characteristic information and enables more meaningful segmentation than recently published work.

Ming-Jiun Wang, Gwo Giun Lee, He-Yuan Lin
A Robust Active Appearance Models Search Algorithm

With the aid of a search algorithm, Active Appearance Models (AAMs) can represent non-rigid image objects with shape and texture variations well. However, the performance of the traditional AAM search algorithm (TAAMS) is limited by its assumption that the error function is convex. This paper therefore proposes a robust AAM search algorithm (RAAMS) that combines multi-pose search (MS) for better pose matching with an estimation mechanism for the parameter search direction (EPSD) for more accurate search directions. Moreover, a local-minimum precaution mechanism (PLM) is proposed to prevent the search from being trapped in a local minimum of the error function. Experimental results show that the proposed algorithm reduces the shape error by 36.41% and the texture error by 30.81% between the synthesized instance and the target image.

Yong-Fong Lin, Chih-Wei Tang
An SVM-Based Soccer Video Shot Classification Scheme Using Projection Histograms

In this paper, we propose an SVM-based shot classification scheme that categorizes shots of soccer videos into four classes: global shots, medium shots, close-up shots, and audience shots. The proposed scheme consists of a new adaptive dominant color detection algorithm and a novel feature, the projection histogram. Experiments show that our scheme performs competitively and that the new feature is effective.

Nan Nan, Guizhong Liu, Xueming Qian, Chen Wang
A Robust Feature Extraction Algorithm for Audio Fingerprinting

In this paper, we present a new feature extraction algorithm, referred to as weighted ASF (WASF), for an audio fingerprinting system. The feature is extracted based on the MPEG-7 Audio Spectrum Flatness (ASF) descriptor and properties of the human auditory system (HAS), and it applies several effective filters together with another MPEG-7 descriptor, Audio Signature (AS). The algorithm is tested under several audio distortions, including sampling rate change, noise addition, and speed change, and achieves a discrimination rate of more than 90% for these distortions. The MFCC feature and the MPEG-7 Audio Spectrum Centroid (ASC) descriptor are also considered for comparison.

Jianping Chen, Tiejun Huang
A Novel Original Streaming Video Wavelet Domain Watermarking Algorithm in Copyright Protection

This paper proposes a novel wavelet-domain watermarking algorithm for original streaming video that uses Independent Component Analysis (ICA) for blind signal detection. The algorithm builds a regulation model of the differences between wavelet low-frequency coefficients in the video background extracted by ICA, and embeds the watermark by quantization-modulating the wavelet low-frequency coefficient differences of the 8 pixel blocks in one macroblock (MB). The algorithm is robust against several common video attacks.

Xinghao Jiang, Tanfeng Sun, Jianhua Li
A No-Reference Blocking Artifacts Metric Using Selective Gradient and Plainness Measures

This paper presents a novel no-reference blocking artifacts metric using selective gradient and plainness (BAM_SGP) measures for DCT-coded images. A boundary selection criterion is introduced to distinguish blocking-artifact boundaries from true-edge boundaries, ensuring that the most likely artifact boundaries are included in the measurement. The artifacts are then evaluated by gradient and plainness measures that capture different aspects of blocking-artifact characteristics, and the two measures are fused into a single blocking-artifacts metric. Experiments on the LIVE database and our own test set show that, compared with existing metrics, the proposed metric is more consistent with the Mean Opinion Score (MOS).

Jianhua Chen, Yongbing Zhang, Luhong Liang, Siwei Ma, Ronggang Wang, Wen Gao

Poster Session 4: Multimedia Processing and Analyses

Intelligent Content-Aware Model-Free Low Power Evoked Neural Signal Compression

Neural recording is key to understanding neuron activity, and multi-channel recording is becoming increasingly important. However, current front-end designs handle only spontaneous signals, whose characteristics differ greatly from those of evoked signals. Evoked signals cannot be detected as spikes at the front end, because they cannot be distinguished by existing spike sorting algorithms, so the full waveform must be sent to bio-researchers. A suitable compression algorithm is therefore unavoidable, since full-waveform transmission creates a huge amount of data. We use signal processing techniques to meet the targets of SNR > 25 dB and a compression rate (compressed data / original data) below 25%.

Chen Han Chung, Yu-Chieh Kao, Liang-Gee Chen, Fu-Shan Jaw
Adaptive Subclass Discriminant Analysis Color Space Learning for Visual Tracking

A robust tracking method using a subclass discriminant analysis (SDA) color space is presented. The SDA color space seeks a color subspace for representing pixels that maximizes the distance between foreground and background pixels, even when the target and background have multi-modal color distributions. Furthermore, the SDA color space is adaptively updated using only "confident" target pixels. Experimental results on several challenging videos show the effectiveness of the proposed method.

Zhifei Xu, Pengfei Shi, Xiaoyu Xu
Loitering Detection Using Bayesian Appearance Tracker and List of Visitors

This paper presents a framework for detecting loitering pedestrians in a video surveillance system. When a pedestrian appears in the field of view of the monitoring camera, he or she is tracked by a Bayesian appearance tracker (BAT), which uses Bayesian decision to associate detected pedestrians by their color appearance across consecutive frames. Each pedestrian's appearance is modeled as a multivariate normal distribution and recorded, together with the time stamps at which the pedestrian appears, in a table called the list of visitors (LV). Even if a pedestrian leaves and later returns to the scene, he or she can therefore be recognized and re-identified as a locally or globally loitering suspect using different rules. A 10-minute video containing three loitering pedestrians is used to test the proposed system; they are all successfully detected and distinguished from other passing pedestrians.

Chung-Hsien Huang, Ming-Yu Shih, Yi-Ta Wu, Jau-Hong Kao
Recognizing Human Action from Videos Using Histograms of Visual Words

We propose a new method for human action recognition from video sequences using histograms of visual words. Video sequences are represented by a novel "bag-of-words" representation in which each frame corresponds to a "word". The major difference from previous bag-of-words models in computer vision is that a word in our representation corresponds to a whole frame. The advantage of this representation is that the large-scale features of a frame are better captured, which turns out to be important for recognizing actions. We demonstrate our approach on two publicly available datasets, with results comparable to other state-of-the-art approaches for action recognition.

Yang Wang, Yi Sun
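
A generic bag-of-visual-words pipeline of the kind described above can be sketched as follows. The plain k-means quantizer, descriptor dimensionality, and vocabulary size are assumptions for illustration, not the authors' exact configuration.

```python
import numpy as np

def build_codebook(descriptors, k, iters=20, seed=0):
    """Plain k-means over per-frame descriptors to learn k visual 'words'."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)].copy()
    for _ in range(iters):
        dist = np.linalg.norm(descriptors[:, None, :] - centers[None], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def video_histogram(descriptors, centers):
    """Represent a video by the normalized histogram of its frames' words."""
    dist = np.linalg.norm(descriptors[:, None, :] - centers[None], axis=2)
    words = dist.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy example: 200 frames with 32-dimensional frame descriptors, 10 words.
frames = np.random.default_rng(1).normal(size=(200, 32))
codebook = build_codebook(frames, k=10)
print(video_histogram(frames, codebook))
```

Action classification then reduces to comparing such histograms, for example with a nearest-neighbor or SVM classifier.
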
A Flexible Framework for Audio Semantic Content Detection

Audio semantic analysis is a crucial issue in multimedia applications. In this paper, a hierarchical framework is proposed for high-level semantic content detection in a continuous audio stream. In the proposed framework, basic audio events are modeled with hidden Markov models; based on the resulting key audio event sequence, a neural-network-based approach is proposed to discover the high-level semantic content of the audio context, effectively combining human knowledge and machine learning in the semantic inference. We evaluate the framework on several audio streams, and the experimental results demonstrate that it achieves satisfactory results.

Qi Li, Huadong Ma, Kanyan Zheng
Highly Efficient Face Detection in Color Images

A highly efficient algorithm for image-based face detection is proposed in this paper. Its features include a fast and regular search method, local histogram equalization of face candidates, and a frontal face classifier. Experimental results show that the proposed algorithm works efficiently with different types of images, with an average detection rate of 92% and an execution time of about 90 ms per frame.

Tse-Wei Chen, Chi-Sun Tang, Shao-Yi Chien
A Robust Denoising Filter with Adaptive Edge Preservation

To preserve edge information during noise removal, the bilateral filter was designed as a powerful denoising filter that uses Gaussian functions of both geometric and photometric distance. Unfortunately, impulse (peak) noise is treated as valid high-frequency information and degrades the performance of the bilateral filter. We design a new robust denoising filter that employs image pre-processing and high-frequency edge detection; it preserves details and removes noise simultaneously and performs better than the bilateral filter.

Li-Cheng Chiu, Chiou-Shann Fuh
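
For reference, the bilateral filter that the abstract builds on is conventionally written as

```latex
BF[I]_\mathbf{p} \;=\; \frac{1}{W_\mathbf{p}} \sum_{\mathbf{q} \in S}
G_{\sigma_s}\!\left(\|\mathbf{p}-\mathbf{q}\|\right)\,
G_{\sigma_r}\!\left(|I_\mathbf{p}-I_\mathbf{q}|\right)\, I_\mathbf{q},
\qquad
W_\mathbf{p} \;=\; \sum_{\mathbf{q} \in S}
G_{\sigma_s}\!\left(\|\mathbf{p}-\mathbf{q}\|\right)\,
G_{\sigma_r}\!\left(|I_\mathbf{p}-I_\mathbf{q}|\right)
```

where G_{sigma_s} and G_{sigma_r} are Gaussians on spatial and photometric distance. An isolated impulse ("peak") pixel differs strongly from all of its neighbors, so almost all photometric weight falls on the pixel itself and the impulse survives filtering, which is the weakness the abstract points out.
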
A New Face Recognition Approach to Boosting the Worst-Case Performance

In this paper, we aim to develop a new classifier that improves the worst-case recognition performance for individual persons. Technically, we adopt the idea of LDA and improve the worst individual recognition performance by introducing different weighting coefficients into the LDA optimization that yields the projection matrix. By increasing the weighting coefficients associated with the smallest between-class distance, the pair of classes with the nearest distance exerts a stronger influence on the optimization of the projection matrix. The algorithm is tested on the Extended YaleB and ORL datasets.

Fang Chen, Senjian An, Wanquan Liu
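
One common way to bias LDA toward the hardest class pair is to weight the pairwise between-class scatter; the form below is a standard weighted-pairwise criterion and may differ in detail from the authors' exact weighting scheme.

```latex
S_B^{w} \;=\; \sum_{i=1}^{C-1} \sum_{j=i+1}^{C} w_{ij}\, p_i\, p_j\,
(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^{\top},
\qquad
W^{\ast} \;=\; \arg\max_{W}\;
\frac{\operatorname{tr}\!\left(W^{\top} S_B^{w}\, W\right)}
     {\operatorname{tr}\!\left(W^{\top} S_W\, W\right)}
```

Here p_i and mu_i are the prior and mean of class i and S_W is the within-class scatter; increasing w_ij for the closest pair of class means pushes the projection W to separate the worst-case classes, as described above.
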
A Novel Method for Pencil Drawing Generation in Non-Photo-Realistic Rendering

This paper puts forward a novel method for automatically generating a pencil drawing from a real 2D color image in non-photo-realistic rendering. First, the edges of the color image are detected with the Sobel operator. Next, the image is sharpened by an unsharp mask (USM), and color scaling is used to obtain an image with radial and edge details. The original image is then divided into meaningful regions using an efficient image segmentation method, and the texture direction of each region is determined by the Fourier transform and shape features. To better render illumination and the local texture of a pencil drawing, the line integral convolution (LIC) algorithm is applied and combined with color scaling and a white noise image. Finally, the pencil drawing is created by superimposing the edges, the USM image, and the texture. Experimental results show that our method greatly improves generation efficiency and creates more distinct edges, more natural tone, and more realistic texture than existing methods.

Zhenyu Chen, Jingye Zhou, Xingyu Gao, Longsheng Li, Junfa Liu
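
The Sobel and unsharp-mask steps mentioned above are standard image operations; a minimal sketch of just those two stages (omitting segmentation, Fourier-based texture direction, and LIC) might look like the following, with sigma and amount chosen arbitrarily.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def unsharp_mask(gray, sigma=2.0, amount=1.5):
    """Classic USM: add back a scaled high-frequency residual."""
    gray = gray.astype(np.float64)
    blurred = gaussian_filter(gray, sigma)
    return np.clip(gray + amount * (gray - blurred), 0.0, 1.0)

def sobel_edges(gray):
    """Sobel gradient magnitude, usable as the edge layer of the drawing."""
    gray = gray.astype(np.float64)
    return np.hypot(sobel(gray, axis=0), sobel(gray, axis=1))

# Example on a synthetic grayscale image in [0, 1].
img = np.random.default_rng(0).uniform(size=(128, 128))
edge_layer = sobel_edges(img)
sharpened = unsharp_mask(img)
```
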
Theoretic Analysis of Inter Frame Dependency in Video Coding

In this paper, we present a theoretical analysis of optimal bit allocation in prediction-based video coders. Optimal bit allocation usually minimizes the sum of frame distortions under a total bit rate constraint. We use the Laplacian model to describe the distribution of the DCT coefficients of the residual frame and analyze the dependency between the reference frame and the predicted frame at different bit rates. We find that inter-frame dependency increases as the bit rate decreases, and based on this observation we give some suggestions for rate allocation in practical video coding.

Yue Wang, Jun Sun, Siwei Ma, Wen Gao
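
The Laplacian model referred to above is the standard zero-mean Laplacian density for residual DCT coefficients,

```latex
p(x) \;=\; \frac{\Lambda}{2}\, e^{-\Lambda |x|},
\qquad
\Lambda \;=\; \frac{\sqrt{2}}{\sigma},
```

where sigma^2 is the variance of the residual coefficients. The rate and distortion of a quantized coefficient, and hence the dependency between a reference frame and the frames predicted from it, can then be expressed in terms of Lambda and the quantization step.
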
Backmatter
Metadata
Title
Advances in Multimedia Information Processing - PCM 2008
Editors
Yueh-Min Ray Huang
Changsheng Xu
Kuo-Sheng Cheng
Jar-Ferr Kevin Yang
M. N. S. Swamy
Shipeng Li
Jen-Wen Ding
Copyright Year
2008
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-89796-5
Print ISBN
978-3-540-89795-8
DOI
https://doi.org/10.1007/978-3-540-89796-5
