About this Book

Interactive Media is a new research field and a landmark in multimedia development. The Era of Interactive Media is an edited volume with contributions from world experts working in academia, research institutions and industry.

The Era of Interactive Media focuses mainly on Interactive Media and its various applications. This book also covers multimedia analysis and retrieval; multimedia security rights and management; multimedia compression and optimization; multimedia communication and networking; and multimedia systems and applications.

The Era of Interactive Media is designed for a professional audience composed of practitioners and researchers working in the field of multimedia. Advanced-level students in computer science and electrical engineering will also find this book useful as a secondary text or reference.



Best Papers and Runners-up


Image Re-Emotionalizing

In this work, we develop a novel system for synthesizing user-specified emotional affection onto arbitrary input images. To tackle the subjectivity and complexity of the image affection generation process, we propose a learning framework that discovers emotion-related knowledge, such as image local appearance distributions, from a set of emotion-annotated images. First, emotion-specific generative models are constructed from color features of the image super-pixels within each emotion-specific scene subgroup. Then, a piece-wise linear transformation is defined for aligning the feature distribution of the target image to the statistical model constructed from the given emotion-specific scene subgroup. Finally, the framework incorporates a regularization term enforcing spatial smoothness and edge preservation for the derived transformation, and the optimal solution of the objective function is sought via standard non-linear optimization. Intensive user studies demonstrate that the proposed image emotion synthesis framework can yield effective and natural effects.
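
The distribution alignment step can be illustrated with a single linear segment that remaps a color channel so its mean and standard deviation match an emotion-specific target model (a minimal sketch under simplified assumptions, not the authors' full piece-wise formulation; the function name is ours):

```python
def align_channel(values, target_mean, target_std):
    """Linearly remap a list of color values so their mean and standard
    deviation match a target (emotion-specific) distribution."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = var ** 0.5 or 1.0  # guard against a constant channel
    return [target_mean + (v - mean) * target_std / std for v in values]
```

In the paper's setting one such transform would be fitted per segment of the piece-wise mapping, with the smoothness regularizer tying neighboring segments together.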

Mengdi Xu, Bingbing Ni, Jinhui Tang, Shuicheng Yan

Thesaurus-Assistant Query Expansion for Context-Based Medical Image Retrieval

While medical image retrieval using visual features has poor performance, context-based retrieval emerges as an easier and more effective solution, and medical domain knowledge can be used to boost text-based image retrieval. In this paper, the UMLS Metathesaurus is used to expand queries for retrieving medical images by their context information. The proposed query expansion method works with a phrase-based retrieval model, implemented on the Indri search engine and its structured query language. In the phrase-based retrieval model, the original query and syntactic phrases are used to formulate a structured query. The concepts detected in the original query, together with their hyponyms, are appended to the structured query. Both phrases and medical concepts are identified with the help of the MetaMap program. Our approach was evaluated on the ImageCLEFmed 2010 dataset, which contains more than 77,000 images and their captions from online medical journals. Several representations of phrases and concepts were also compared in experiments. The experimental results show the effectiveness of our approach for context-based medical image retrieval.
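
The expansion idea can be sketched with a toy thesaurus standing in for the UMLS Metathesaurus and an Indri-style structured query string (the `#1(...)` ordered-window and `#combine` operators are real Indri syntax; the thesaurus entries and function name are hypothetical):

```python
# Toy thesaurus standing in for the UMLS Metathesaurus (hypothetical entries).
THESAURUS = {
    "chest x-ray": ["thorax radiograph", "chest radiograph"],
    "fracture": ["broken bone"],
}

def expand_query(query):
    """Expand a free-text query with thesaurus synonyms and wrap
    multi-word phrases in an Indri-style #1(...) ordered-window operator."""
    terms = [query]
    for concept, synonyms in THESAURUS.items():
        if concept in query.lower():
            terms.extend(synonyms)
    parts = ["#1({})".format(t) if " " in t else t for t in terms]
    return "#combine( {} )".format(" ".join(parts))
```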

Hong Wu, Chengbo Tian

Forgery Detection for Surveillance Video

In many courts, surveillance videos are used as important legal evidence. Nevertheless, little research has been concerned with forgery of surveillance videos. In this paper, we present a forgery detection system for surveillance videos. We analyze the characteristics of surveillance videos and then investigate the forgeries that mainly occur in them. To identify both RGB and infrared videos, the Sensor Pattern Noise (SPN) of each video is transformed by a Minimum Average Correlation Energy (MACE) filter. Manipulations of a given video are detected by estimating the scaling factor and calculating the correlation coefficient. Experimental results demonstrate that the proposed scheme is well suited to identifying forgeries of surveillance video.
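
The correlation test at the heart of such SPN-based identification can be sketched as a plain Pearson correlation between two flattened noise residuals (a simplified stand-in for the MACE-filtered comparison the paper describes):

```python
def correlation(a, b):
    """Pearson correlation between two flattened noise patterns;
    values near 1 suggest the same sensor, values near 0 a mismatch
    (and hence a possible forgery)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb) if va and vb else 0.0
```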

Dai-Kyung Hyun, Min-Jeong Lee, Seung-Jin Ryu, Hae-Yeoun Lee, Heung-Kyu Lee

Digital Image Forensics: A Two-Step Approach for Identifying Source and Detecting Forgeries

Digital Image Forensics includes two main domains: source device identification and semantic modification detection. Usually, existing works address one aspect only: either source identification or image manipulation detection. In this article, we investigate a new approach based on sensor noise that operates in a two-step sequence: the first step is global, the second local. During the first step, we analyze noise in order to identify the sensor. We reuse the method proposed by Jessica Fridrich et al., with an improvement that is useful when only a limited number of images is available to compute noise patterns. Then, having identified the sensor, we examine more locally, using quadtree segmentation, the differences between the pattern noise attached to the sensor and the noise extracted from the picture under investigation, in order to detect possible alterations. We assume here that the portion of the image that underwent modifications is relatively small with regard to the surface of the whole picture. Finally, we report tests on the first publicly available database (i.e., the Dresden database), which makes further comparisons of our algorithm with other approaches possible.

Wiem Taktak, Jean-Luc Dugelay

Human Activity Analysis for Geriatric Care in Nursing Homes

As our society ages, it is urgent to develop computer-aided techniques to improve the quality-of-care (QoC) and quality-of-life (QoL) of geriatric patients. In this paper, we focus on automatic human activity analysis in surveillance video recorded in complicated environments at a nursing home. This will enable automatic exploration of the statistical patterns between patients' daily activities and their clinical diagnoses. We also discuss potential future research directions in this area. Experiments demonstrate that the proposed approach is effective for human activity analysis.

Ming-Yu Chen, Alexander Hauptmann, Ashok Bharucha, Howard Wactlar, Yi Yang

Face Detection, Recognition and Synthesis


Multi-Feature Face Recognition Based on 2D-PCA and SVM

Identification and authentication by face recognition mainly use global face features. However, the recognition accuracy is still not high enough. This research aims to increase recognition efficiency by using the global face feature together with four local face features: the left eye, right eye, nose, and mouth. The method is based on geometrical techniques for locating the eyes, nose, and mouth in a frontal face image. We used 115 face images for training and testing; each individual's images are divided into three different images for training and two different images for testing. The Two-Dimensional Principal Component Analysis (2D-PCA) technique is used for feature extraction, and the Support Vector Machine (SVM) method for face recognition. The results show a recognition rate of 97.83%.
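
The 2D-PCA step can be sketched as follows: build the image covariance matrix from the training images, extract its dominant eigenvector by power iteration, and project each image onto that axis to get the feature vector fed to the SVM (a minimal illustration, not the authors' exact pipeline):

```python
def twod_pca_axis(images, iters=100):
    """Top projection axis of 2D-PCA, via power iteration on the image
    covariance matrix G = (1/M) * sum((A - mean)^T (A - mean))."""
    m = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    mean = [[sum(img[r][c] for img in images) / m for c in range(cols)]
            for r in range(rows)]
    G = [[0.0] * cols for _ in range(cols)]  # cols x cols covariance
    for img in images:
        d = [[img[r][c] - mean[r][c] for c in range(cols)] for r in range(rows)]
        for i in range(cols):
            for j in range(cols):
                G[i][j] += sum(d[r][i] * d[r][j] for r in range(rows)) / m
    # Power iteration for the dominant eigenvector of G.
    x = [1.0] * cols
    for _ in range(iters):
        y = [sum(G[i][j] * x[j] for j in range(cols)) for i in range(cols)]
        norm = sum(v * v for v in y) ** 0.5 or 1.0
        x = [v / norm for v in y]
    return x

def project(image, axis):
    """Feature vector Y = A * x, one value per image row."""
    return [sum(row[j] * axis[j] for j in range(len(axis))) for row in image]
```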

Sompong Valuvanathorn, Supot Nitsuwat, Mao Lin Huang

Face Orientation Detection Using Histogram of Optimized Local Binary Pattern

The histogram of optimized local binary pattern (HOOPLBP) is a new and robust orientation descriptor that is claimed to be content-independent. In this paper, we improve the algorithm by disregarding features of unwanted pixels. Furthermore, we find that the minimum range of HOOPLBP is much smaller for human face orientation detection than for other content-based images. We then conduct a series of experiments on human face orientation detection and show that, with a general model, we can detect face orientation at arbitrary angles. The method can thus be further developed to enhance existing face detection algorithms.

Nan Dong, Xiangzhao Zeng, Ling Guan

Fast Eye Detection and Localization Using a Salient Map

Among the facial features, the eyes play the most important role in face recognition and face hallucination. In this paper, an efficient algorithm for eye detection in face images is proposed. The proposed algorithm is robust to illumination variations, size, and orientation of face images. As the eye region always has the most variations in a face image, our algorithm uses a wavelet-based salient map, which can detect and reflect the most visually meaningful regions for eye detection and localization. Our proposed algorithm is non-iterative and computationally simple. Experimental results show that our algorithm can achieve a superior performance compared to other current methods.

Muwei Jian, Kin-Man Lam

Eyeglasses Removal from Facial Image Based on MVLR

Eyeglasses are a common interfering factor in face recognition and analysis. A statistical learning method is presented to remove eyeglasses from an input facial image. First, training samples are collected from facial images with eyeglasses and the corresponding facial images without them. Then, a multi-variable linear regression model is established, based on the assumption of linear correlation between the two sets of samples. Finally, the parameter matrix by which the eyeglasses can be removed from an input facial image is solved. Experimental results demonstrate that the proposed algorithm is efficient and practical, and the method is easy to realize without any auxiliary equipment.

Zhigang Zhang, Yu Peng

Video Coding and Transmission


A Multiple Hexagon Search Algorithm for Motion and Disparity Estimation in Multiview Video Coding

In single-viewpoint video coding, many fast block-matching motion estimation algorithms have been proposed, such as the three-step search, four-step search, diamond search, and hexagon-based search. However, experimental analysis shows that these algorithms are not directly suitable for Multiview Video Coding (MVC). Since they do not consider the increased search range, the larger format of multiview video, or the correlations between inter-view frames, they can easily lead the block-matching search into a local minimum, and the Rate-Distortion (R-D) performance degrades dramatically. In this paper, we propose a novel multiple hexagon search algorithm to address this problem. First, four sub-search windows are constructed around the initial search point. Then, the hexagon-based search algorithm is performed in each of the four sub-search windows. The final result is the search point with the minimum R-D cost among the best points of the four sub-search windows. To trade off computational complexity against R-D performance, two adaptive early termination strategies are proposed. The experimental results show that the proposed algorithm yields quite promising coding performance in terms of both R-D performance and computational complexity. In particular, the proposed algorithm works well on multiview video sequences with various motion and disparity activities.
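
The hexagon-based search performed inside each sub-search window can be sketched as follows (a simplified single-window version using a SAD cost; the four-window construction and the R-D cost are omitted):

```python
def sad(frame, ref, bx, by, dx, dy, bs):
    """Sum of absolute differences between a bs x bs block at (bx, by)
    in the current frame and its (dx, dy)-shifted match in the reference."""
    return sum(abs(frame[by + r][bx + c] - ref[by + dy + r][bx + dx + c])
               for r in range(bs) for c in range(bs))

def hexagon_search(frame, ref, bx, by, bs, rng):
    """Hexagon-based block matching: walk the large hexagon until the
    centre is best, then refine with the small diamond pattern."""
    large = [(0, 0), (2, 0), (-2, 0), (1, 2), (-1, 2), (1, -2), (-1, -2)]
    small = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    def ok(dx, dy):
        return (abs(dx) <= rng and abs(dy) <= rng and
                0 <= by + dy and by + dy + bs <= len(ref) and
                0 <= bx + dx and bx + dx + bs <= len(ref[0]))
    best = (0, 0)
    while True:
        cands = [(best[0] + dx, best[1] + dy) for dx, dy in large]
        cands = [(dx, dy) for dx, dy in cands if ok(dx, dy)]
        nxt = min(cands, key=lambda d: sad(frame, ref, bx, by, d[0], d[1], bs))
        if nxt == best:
            break
        best = nxt
    cands = [(best[0] + dx, best[1] + dy) for dx, dy in small]
    cands = [(dx, dy) for dx, dy in cands if ok(dx, dy)]
    return min(cands, key=lambda d: sad(frame, ref, bx, by, d[0], d[1], bs))
```

In the paper's scheme this search would be run once per sub-search window and the four winners compared by R-D cost.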

Zhaoqing Pan, Sam Kwong, Yun Zhang

Adaptive Motion Skip Mode for AVS 3D Video Coding

Inter-view motion skip mode has been proposed to improve the coding efficiency of multiview video coding (MVC) by reusing the motion information of the referenced views. In this paper, an adaptive motion estimation algorithm for the motion skip mode is proposed for AVS 3D video coding. The proposed algorithm searches for the best motion information, by means of adaptive global disparity estimation and a refined search along the horizontal direction, for the purpose of motion-compensated coding. Moreover, the method is applied to a sampling-aided scheme for AVS 3D video coding, which encodes a reorganized sequence obtained by merging two downsampled videos. The rate-distortion optimization criterion is employed to find the best motion information. Experimental results demonstrate that the proposed algorithm can substantially improve the coding efficiency.

Lianlian Jiang, Yue Wang, Li Zhang, Siwei Ma

Adaptive Search Range Methods for B Pictures Coding

This paper presents adaptive search range strategies at both the frame and macroblock levels for B picture coding in H.264/AVC. First, a basic frame-level search range scaling strategy is proposed based on the linear motion relationship between P and B pictures, which is only suitable for normal- or low-motion situations. Then, an improved frame-level adaptive search range scaling (F-ASRS) algorithm is proposed that takes full advantage of intra mode and motion vector statistics. The F-ASRS algorithm can precisely detect the global motion degree and adjust the search range of the next B picture in advance. The local motion is further studied according to the information of adjacent previously coded blocks, and a macroblock-level adaptive search range (MB-ASR) algorithm is proposed. When F-ASRS is integrated with MB-ASR, the search area of B pictures can be greatly reduced without any coding performance loss.

Zhigang Yang

Replacing Conventional Motion Estimation with Affine Motion Prediction for High-Quality Video Coding

In modern video coding systems, motion estimation and compensation play an indispensable role among the various video compression tools. However, the translation motion model is still extensively employed for motion compensation, making these systems inefficient in handling complex inter-frame motion, including scaling, rotation and various forms of distortion. In this paper, we propose a local affine motion prediction method that improves inter-frame image prediction quality using the conventional motion vectors. Our method incurs no extra overhead: no extra bits have to be sent to the decoder for proper decoding. Experimental results show that our method achieves a maximum average bit rate reduction of 5.75% compared to the conventional inter-frame prediction method using translation-only motion compensation, with no quality degradation. Our proposed algorithm is useful for the current video standard, H.264, and is particularly efficient for large partition-size coding modes. For future video compression standards such as high-efficiency video coding (HEVC), or the possible H.265, which could include partition sizes of 64 × 64 and 32 × 32, the coding gain should be further boosted by making use of our proposed algorithm.

Hoi-Kok Cheung, Wan-Chi Siu

Fast Mode Decision Using Rate-Distortion Cost and Temporal Correlations in H.264/AVC

In this paper, we propose a fast mode decision scheme for the H.264/AVC standard. The encoder uses variable block sizes to select the best mode, which incurs high computational complexity. We therefore propose a fast mode decision algorithm to reduce this complexity. Using the rate-distortion cost and temporal correlations, we propose additional early SKIP and inactive inter/intra mode decision methods for P frames. Experimental results show that the proposed method provides about 60% encoding time savings on average without significant performance loss, compared to the H.264/AVC fast high-complexity mode. Specifically, we can save about 22% and 39% of the time in the intra and inter mode conditions, respectively.
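
The early-SKIP idea can be sketched as a threshold test against the co-located block's cost in the previous frame (the threshold `alpha` and the exact decision rule here are illustrative assumptions, not the paper's tuned criterion):

```python
def choose_mode(skip_cost, prev_best_cost, other_costs, alpha=1.1):
    """Early-SKIP mode decision: if the SKIP mode's R-D cost is already
    below a threshold derived from the co-located block's best cost in
    the previous frame, return SKIP without testing the other modes."""
    if skip_cost <= alpha * prev_best_cost:
        return "SKIP", skip_cost
    # Otherwise fall back to exhaustive comparison of all candidate modes.
    costs = dict(other_costs)
    costs["SKIP"] = skip_cost
    best = min(costs, key=costs.get)
    return best, costs[best]
```

The savings come from the first branch: when it fires, the costly motion estimation for the remaining inter/intra modes is skipped entirely.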

Yo-Sung Ho, Soo-Jin Hwang

Disparity and Motion Activity Based Mode Prediction for Fast Mode Decision in Multiview Video Coding

Multiview video coding (MVC) uses exhaustive variable-size block mode decision in motion and disparity estimation to improve coding efficiency; however, this causes intensive computational complexity. A fast mode decision algorithm for non-anchor pictures is proposed to lower the computational complexity of MVC. In the proposed algorithm, mode decision is terminated early based on the inter-view mode correlation among neighboring views and on motion activity predicted from the motion information of previously checked macroblock modes. Experimental results show that the proposed fast mode decision algorithm achieves from 53.67% to 81.75% complexity reduction, while the peak signal-to-noise ratio degradation is only 0.046 dB on average and the bit rate change ranges from −2.13% to 0.77%, which is negligible.

Dan Mao, Yun Zhang, Qian Chen, Sam Kwong

Multiple Reference Frame Motion Re-estimation for H.264/AVC Frame-Skipping Transcoding with Zonal Search

In this paper, we propose an efficient multiple reference frame motion re-estimation method for H.264/AVC frame-skipping transcoding. In the proposed algorithm, all information in the original H.264/AVC-coded video, including reference frame numbers and motion vectors, is reused when the original reference frame is not skipped, while efficient multiple reference frame motion re-estimation is employed when the reference frame is dropped. The neighboring reference frames around the skipped reference frame are selected for motion re-estimation, and a zonal search method is used. The experimental results reveal that, on average, 87% of the computation time (a speed-up factor of 8) can be saved by the proposed H.264 frame-skipping transcoding algorithm compared with the full decoding/encoding procedure, while the degradation in rate-distortion performance is fairly small.

Jhong-Hau Jiang, Yu-Ming Lee, Yinyi Lin

Frame Layer Rate Control Method for Stereoscopic Video Coding Based on a Novel Rate-Distortion Model

Rate control plays an important role in video coding and transmission. In this paper, a novel rate-distortion model is first proposed to characterize the coding characteristics of stereoscopic video coding, in which the weighted average of the left and right viewpoints, measured with the video quality metric (VQM), is adopted as the distortion metric instead of the Mean Square Error (MSE). A frame-layer rate control method for stereoscopic video coding is then presented based on the proposed R-D model. Experimental results demonstrate that the proposed R-D model can accurately characterize the relationship among coding distortion, coding rate and quantization parameter, and that the proposed rate control method can efficiently keep the output bit rate consistent with the target bit rate while maintaining comparable reconstructed video quality.

Qun Wang, Li Zhuo, Jing Zhang, Xiaoguang Li

Hardware Friendly Oriented Design for Alternative Transform in HEVC

Hardware implementation of video coding is attracting more and more attention these days. For intra prediction, the newly emerging secondary directional transform is applied after the DCT/ICT to exploit additional energy compaction. In this paper, we propose a hardware-friendly design flow based on a simple difference function, which achieves performance similar to the original transform but replaces all inefficient multiplication operations with regular shifts and additions. In particular, for the current secondary transform, the rotational transform (ROT), we obtain a hardware-friendly ROT (HF_ROT) through our method. Analysis shows that, when properly designed, the proposed method achieves a massive reduction in operations together with a regular data flow and simple control signals. Simulation results for the hardware-friendly ROT show that our method achieves performance similar to the original transform while greatly improving hardware implementation efficiency. Moreover, our method can be applied to other matrix-based coding methods.

Lin Sun, Oscar C. Au, Xing Wen, Jiali Li, Wei Dai

A New Just-Noticeable-Distortion Model Combined with the Depth Information and Its Application in Multi-view Video Coding

Traditional video compression methods remove spatial and temporal redundancy based on statistical correlation. However, since the final receptor is human, we can also remove perceptual redundancy to obtain higher compression efficiency by exploiting the properties of the human visual system (HVS). Research has simulated the sensitivity of the HVS to luminance contrast and to spatial and temporal masking effects with the just-noticeable-distortion (JND) model, which describes perceptual redundancy quantitatively. This paper proposes a new model, MJND (JND in Multi-view), which exploits the stereoscopic masking effect of the HVS. The proposed model contains not only the spatial and temporal JND but also the JND in depth. The MJND model is then used for macroblock (MB) quantization adjustment and rate-distortion optimization in multi-view video coding (MVC). Compared with the standard MVC scheme without JND, our model achieves better visual quality at the same bit rate.

Fengzong Lian, Shaohui Liu, Xiaopeng Fan, Debin Zhao, Wen Gao

Audio, Image and Video Quality Assessment


Multi-camera Skype: Enhancing the Quality of Experience of Video Conferencing

We propose a novel approach to real-time control, selection and transmission of the best view of human faces in Skype video conferencing. Our goal is to improve the Quality-of-Experience (QoE) of current video conferencing services by incorporating a real-time multi-camera control and selection mechanism. Traditional 3D viewpoint selection algorithms rely on complex 3D-model computation and are not applicable to real-time applications. We define a new image-based metric, Viewpoint Saliency (VS), for evaluating the quality of views of a human subject, and a centralized multi-camera control mechanism to track and select the best view.

Ying Wang, Prabhu Natarajan, Mohan Kankanhalli

Content Aware Metric for Image Resizing Assessment

The development of mobile multimedia techniques requires image resizing, and an image quality assessment (IQA) metric is needed for objectively assessing image resizing approaches. Most existing IQA approaches are designed for images of the same size and are not suitable for the image resizing application. In this paper, we propose a Content Aware Metric (CAM) to evaluate the similarity between a resized image and the original. Introducing the Structural Similarity (SSIM) metric for retargeting assessment, we first analyze how image resizing influences the different components of SSIM. We then divide the important regions into many important sub-images, represented by the coordinates of their centers, track the location of all the sub-images, and compute the similarity of corresponding sub-images. The CAM is obtained by averaging the distances of all sub-images. The CAM is effective for assessing the quality of Seam Carving, and the experimental results show that our metric is consistent with human observation.
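
The SSIM computation that the metric builds on can be sketched for a pair of equal-size gray patches (using the standard constants K1 = 0.01, K2 = 0.03 with a dynamic range of 255; the sub-image tracking and averaging of the CAM itself are omitted):

```python
def ssim(x, y, c1=6.5025, c2=58.5225):
    """SSIM between two equal-size gray patches given as flattened lists.
    c1 = (0.01 * 255)^2 and c2 = (0.03 * 255)^2 are the usual constants."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((v - mx) ** 2 for v in x) / n
    vy = sum((v - my) ** 2 for v in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx * mx + my * my + c1) * (vx + vy + c2))
```

Identical patches score 1.0; patches with very different luminance score near 0.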

Lifang Wu, Lianchao Cao, Jinqiao Wang, Shuqin Liu

A Comprehensive Approach to Automatic Image Browsing for Small Display Devices

Recently, small displays have become widely used to browse digital images. On a small display device, the content of an image appears very small, and users have to zoom and pan manually to see image detail. Hence, an automatic image browsing solution is desirable for user convenience. In this chapter, a comprehensive and efficient system is proposed for browsing high-resolution images on small display devices by automatically panning and zooming over Regions of Interest (ROIs). The challenge is to provide a better user experience across heterogeneous small display sizes. First, an input image is classified into one of three classes: close-up, landscape, or other. Then the ROIs of the image are extracted. Finally, the ROIs are browsed according to different intuitive, study-based strategies. Our proposed system is evaluated by subjective tests, and experimental results indicate that it is an effective technique for displaying large images on small display devices.

Muhammad Abul Hasan, Min Xu, Xiangjian He

Coarse-to-Fine Dissolve Detection Based on Image Quality Assessment

Although many approaches have been proposed for video shot boundary detection, dissolve detection remains an open issue. During a dissolve, the video frames reveal a "clarity–blur–clarity" visual pattern, and accordingly the image quality reveals a "high–low–high" pattern. Based on this observation, a novel coarse-to-fine dissolve detection approach based on image quality assessment is presented in this paper. First, the normalized variance autofocus function, chosen for its good performance, is employed to calculate the image quality value of each frame, yielding an image quality feature curve. The grooves on the curve, i.e., spans that decrease monotonically to a local minimum and then increase monotonically back to a normal value, are detected using a simple threshold-based method and taken as dissolve candidates. After obtaining these coarse results, refined features are extracted from the dissolve candidates and the final dissolve detection is performed with a support vector machine, using a new dissolve length normalization method. The experimental results show that the proposed method is effective.
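
The two ingredients, the normalized-variance quality value and the groove ("high-low-high") detector, can be sketched as follows (the `drop` threshold is an illustrative assumption, not the paper's tuned value):

```python
def quality(frame):
    """Normalized-variance autofocus value: variance / mean intensity.
    Sharp frames score high; blurred dissolve frames score low."""
    n = len(frame)
    mean = sum(frame) / n
    var = sum((v - mean) ** 2 for v in frame) / n
    return var / mean if mean else 0.0

def find_grooves(curve, drop=0.5):
    """Detect 'high-low-high' grooves on a quality curve: local minima
    whose value falls below `drop` times the surrounding level."""
    grooves = []
    for i in range(1, len(curve) - 1):
        if curve[i] < curve[i - 1] and curve[i] < curve[i + 1] \
                and curve[i] < drop * max(curve[i - 1], curve[i + 1]):
            grooves.append(i)
    return grooves
```

Each detected groove index would then be passed to the SVM refinement stage as a dissolve candidate.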

Weigang Zhang, Chunxi Liu, Qingming Huang, Shuqiang Jiang, Wen Gao

Audio and Image Classification


Better Than MFCC Audio Classification Features

Mel-Frequency Cepstral Coefficients (MFCCs) are generally the features of choice for both audio classification and content-based retrieval due to their proven performance. This paper presents alternative feature sets that not only consistently outperform MFCC features but are also simpler to calculate.
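
Two classic examples of features that are simpler to compute than MFCCs are the zero-crossing rate and short-time energy (shown here as generic illustrations; the paper's specific feature sets may differ):

```python
def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose sign changes;
    a cheap proxy for the dominant frequency content of a frame."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

def short_time_energy(frame):
    """Mean squared amplitude of the frame; separates loud from quiet audio."""
    return sum(v * v for v in frame) / len(frame)
```

Unlike MFCCs, neither feature requires an FFT, a filter bank, or a DCT, which is what makes this family of features cheap to extract.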

Ruben Gonzalez

A Novel 2D Wavelet Level Energy for Breast Lesion Classification on Ultrasound Images

An infiltrative nature is a unique characteristic of breast cancer; in cross-section, it produces a rough lesion contour in ultrasound images. Roughness description is therefore crucial for the clinical diagnosis of breast lesions. Based on boundary tracking, traditional roughness descriptors usually suffer from information loss due to dimension reduction. In this paper, a novel 2-D wavelet-based energy feature is proposed for breast lesion classification in ultrasound images. The approach characterizes the roughness of the breast lesion contour with normalized spatial frequency components. Feature efficacy is evaluated on two breast sonogram datasets, with lesion contours delineated by an experienced physician and by ImageJ, respectively. Experimental results show that the new feature obtains excellent performance and robust resistance to contour variation.

Yueh-Ching Liao, King-Chu Hung, Shu-Mei Guo, Po-Chin Wang, Tsung-Lung Yang

Learning-to-Share Based on Finding Groups for Large Scale Image Classification

With large scale image classification attracting more attention in recent years, many new challenges have sprung up. To tackle the problems of distribution imbalance and divergent visual correlation among multiple classes, this paper proposes a method to learn a group-based sharing model such that visually similar classes are assigned to a discriminative group. This model enables a class to draw support from other classes in the same group, relieving the poor discrimination ability caused by limited available samples. To generate effective groups, the intra-class coherence and the inter-class similarity are computed. A hierarchical model is then learned based on these groups, so that the classes within a group can inherit power from the group's discriminative model. We evaluate our method across 200 categories extracted from a large-scale image dataset. Experimental results show our model has better performance in large scale image classification.

Li Shen, Shuqiang Jiang, Shuhui Wang, Qingming Huang

Vehicle Type Classification Using Data Mining Techniques

In this paper, we propose a novel and accurate vision-based vehicle type classification system. The system builds a classifier by applying a Support Vector Machine to various features of vehicle images. We make three contributions: first, we originally incorporate the color of the license plate into the classification system. Second, the vehicle front is measured accurately based on license plate localization and a background-subtraction technique. Finally, type probabilities for every vehicle image are derived from eigenvectors rather than deciding the vehicle type directly. Instead of calculating eigenvectors from whole-body vehicle images, as in existing methods, our eigenvectors are calculated from vehicle front images. These improvements make our system more applicable and accurate. The experiments demonstrate that our system performs well, with a very promising classification rate under different weather and lighting conditions.

Yu Peng, Jesse S. Jin, Suhuai Luo, Min Xu, Sherlock Au, Zhigang Zhang, Yue Cui

Stereo Image and Video Analysis


Stereo Perception’s Salience Assessment of Stereoscopic Images

3D quality assessment (QA) has been widely studied in recent years; however, depth perception and its influence on stereoscopic image quality have received little attention. To the best of our knowledge, the proposed approach is the first attempt at salience assessment of stereo perception in stereoscopic images. We study how the salience of depth perception influences the quality of a stereoscopic image. Under the assumption that humans acquire different depth perception from stereoscopic images with different disparity, four groups of multiview images with a parallel camera structure are chosen for testing. Comparison of the results of the proposed model with the subjective experiment shows that the model has good consistency with the subjective experiment.

Qi Feng, Fan Xiaopeng, Zhao Debin

Confidence-Based Hierarchical Support Window for Fast Local Stereo Matching

Various cost aggregation methods have been developed for finding correspondences between stereo pairs, but their high complexity remains a problem for practical use. In this paper, we propose a confidence-based hierarchical structure to reduce the complexity of cost aggregation algorithms. Aggregating matching costs for each pixel with the smallest support window, we estimate confidence levels; the confidence values are used to decide which pixels need additional cost aggregation. For pixels of low confidence, we iteratively supplement their matching costs using larger support windows. Our experiments show that this approach reduces computational time and improves the quality of the output disparity images.

Jae-Il Jung, Yo-Sung Ho

Occlusion Detection Using Warping and Cross-Checking Constraints for Stereo Matching

In this paper, we propose an occlusion detection algorithm that estimates occluded pixels automatically. To detect occlusion, we obtain an initial disparity map with an optimization algorithm based on modified constant-space belief propagation (CSBP), which has low complexity. The initial disparity map provides clues for occlusion detection: the warping constraint and the cross-check constraint. From both constraints, we define a potential energy function for occlusion detection and optimize it in an energy minimization framework. Experimental results show that the occlusion detection result of the proposed algorithm is very close to the ground truth.
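
The cross-check constraint can be sketched as a left-right consistency test over 1-D disparity rows (a minimal illustration; the warping constraint and the energy minimization are omitted):

```python
def cross_check(disp_left, disp_right, tol=1):
    """Left-right consistency check on one scanline: pixel x in the left
    disparity map must land on a right-map pixel whose disparity agrees
    within `tol`; otherwise x is marked occluded (True)."""
    w = len(disp_left)
    occluded = []
    for x in range(w):
        d = disp_left[x]
        xr = x - d  # corresponding column in the right view
        if xr < 0 or xr >= w or abs(disp_right[xr] - d) > tol:
            occluded.append(True)
        else:
            occluded.append(False)
    return occluded
```

Pixels flagged here would feed the potential energy function as likely occlusions rather than being taken as a final answer.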

Yo-Sung Ho, Woo-Seok Jang

Joint Multilateral Filtering for Stereo Image Generation Using Depth Camera

In this paper, we propose a stereo view generation algorithm using the Kinect depth camera, which utilizes infrared structured light. After capturing the color image and the corresponding depth map, we first preprocess the depth map and apply joint multilateral filtering to improve depth quality and temporal consistency. The preprocessed depth map is warped to the virtual viewpoint and median filtered to reduce truncation errors. The color image is then back-projected to the virtual viewpoint. To fill the remaining holes caused by disocclusion areas, we apply a background-based image in-painting process, finally obtaining a synthesized image without noticeable visual distortion. Our experimental results confirm that the synthesized images contain no noticeable errors.

Yo-Sung Ho, Sang-Beom Lee
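The joint filtering idea, weighting depth samples by both spatial distance and color similarity so that depth edges snap to color edges, can be sketched as a plain joint bilateral filter. This is a two-term simplification of the paper's multilateral filter; the kernel parameters are illustrative assumptions.

```python
import numpy as np

def joint_bilateral(depth, color, radius=2, sigma_s=2.0, sigma_r=10.0):
    # Smooth `depth` with weights from spatial distance and *color* similarity,
    # so depth is averaged within color-homogeneous regions only.
    h, w = depth.shape
    out = np.zeros_like(depth, dtype=float)
    for y in range(h):
        for x in range(w):
            acc = wsum = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        ws = np.exp(-(dy * dy + dx * dx) / (2 * sigma_s ** 2))
                        wc = np.exp(-((color[ny, nx] - color[y, x]) ** 2) / (2 * sigma_r ** 2))
                        acc += ws * wc * depth[ny, nx]
                        wsum += ws * wc
            out[y, x] = acc / wsum
    return out

# toy scene: a depth discontinuity aligned with a color edge, plus one noisy sample
cols = np.arange(8)
color = np.where(cols < 4, 0.0, 100.0) * np.ones((5, 1))
depth = np.where(cols < 4, 10.0, 50.0) * np.ones((5, 1))
depth[2, 1] = 13.0
smoothed = joint_bilateral(depth, color)
```

The noisy sample is pulled back toward its region's depth, while the depth edge stays sharp because the color term suppresses averaging across the color boundary.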

Object Detection


Justifying the Importance of Color Cues in Object Detection: A Case Study on Pedestrian

Considerable progress has been made on hand-crafted features in object detection, while little effort has been devoted to making use of color cues. In this paper, we study the role of color cues in detection via a representative object, the pedestrian, as its variability of pose and appearance is typical of “general” objects. The efficacy of color spaces is first ranked by empirical comparison among typical ones. Furthermore, a color descriptor, called MDST (Max DisSimilarity of different Templates), is built on the selected color spaces to explore the invariance and discriminative power of color cues. Extensive experiments reveal two facts: one is that the choice of color space has a great influence on performance; the other is that MDST achieves better results than the state-of-the-art color feature for pedestrian detection in terms of both accuracy and speed.

Qingyuan Wang, Junbiao Pang, Lei Qin, Shuqiang Jiang, Qingming Huang

Adaptive Moving Cast Shadow Detection

Moving object detection is an important task in real-time video surveillance. However, in real scenarios, moving cast shadows associated with moving objects may also be detected, making moving cast shadow detection a challenge for video surveillance. In this paper, we propose an adaptive shadow detection method based on a cast shadow model. The method combines ratio edge and ratio brightness, and reduces computational complexity through a cascading algorithm. It calculates the difference in ratio edge between the shadow region and the background, exploiting the invariance of an object's ratio edge under different illumination. Experimental results show that our approach outperforms existing methods.

Guizhi Li, Lei Qin, Qingming Huang
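One way to picture the ratio idea underlying such detectors: a cast shadow darkens the background by a locally constant factor without changing its texture, so the frame-to-background ratio is smooth and below one inside shadow. The toy detector below is a hedged sketch of that principle only, not the authors' cascading algorithm; all thresholds are illustrative.

```python
import numpy as np

def shadow_mask(frame, background, low=0.4, high=0.95, win=1, tol=0.05):
    # A pixel is labelled shadow when frame/background is a darkening factor
    # that stays locally constant (shadows dim the background, texture survives).
    ratio = frame / np.maximum(background, 1e-6)
    h, w = frame.shape
    mask = np.zeros((h, w), dtype=bool)
    for y in range(win, h - win):
        for x in range(win, w - win):
            patch = ratio[y - win:y + win + 1, x - win:x + win + 1]
            if low < patch.mean() < high and patch.std() < tol:
                mask[y, x] = True
    return mask

# toy frame: a uniform background with a cast shadow (uniform 0.6x darkening)
background = np.full((6, 6), 100.0)
frame = background.copy()
frame[1:5, 1:5] *= 0.6
mask = shadow_mask(frame, background)
```

Pixels deep inside the shadow pass both tests, while shadow-boundary pixels fail the local-constancy test, which is where edge information (the ratio edge) carries the signal.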

A Framework for Surveillance Video Fast Browsing Based on Object Flags

Conventional surveillance video coding frameworks are designed to maximize coding efficiency or to improve adaptability. However, how to construct a flexible framework for browsing surveillance video has become as important an issue as improving coding efficiency and bitstream adaptability. This paper proposes a framework for efficient storage and synopsis browsing of surveillance video based on object flags. The main contributions of our work are: (1) the framework provides an applicable video coding approach for video surveillance by combining it with the video synopsis method; (2) our method improves storage efficiency and provides users a fast browsing scheme for surveillance video. Experiments implementing the framework on the H.264/AVC video codec are presented.

Shizheng Wang, Wanxin Xu, Chao Wang, Baoju Wang

Pole Tip Corrosion Detection Using Various Image Processing Techniques

In this paper, three features are proposed for detecting pole tip corrosion in hard disk drives using various image processing techniques. The proposed method is divided into four parts. The first part involves preparing the template image of the pole tip; the second involves selecting a region of interest; the third involves constructing the three features. The first feature is the area around the top shield of the pole tip; the second and third features are the row coordinate and the length along the lower edge around the top shield, respectively. The last part involves measuring the detection efficiency. Experimental results with 647 tested pole tip images show that the method combining the first and second features gives better detection efficiency than the other combinations in terms of specificity, precision, and accuracy.

Suchart Yammen, Somjate Bunchuen, Ussadang Boonsri, Paisarn Muneesawang

Real-Time Cascade Template Matching for Object Instance Detection

Object instance detection finds where a specific object instance is in an image or a video frame. It is a variation of object detection, but is distinguished on two points. First, object detection focuses on a category of object, while object instance detection focuses on a specific object. For instance, object detection may work to find where toothpaste is in an image, while object instance detection will find and locate a specific brand of toothpaste, such as Colgate toothpaste. Second, object instance detection tasks usually have far fewer (positive) training samples than object detection. Therefore, traditional object instance detection methods are mostly based on template matching.

This paper presents a cascade template matching framework for object instance detection. Specifically, we propose a three-stage heterogeneous cascade template matching method. The first stage employs the dominant orientation template (DOT) for scale- and rotation-invariant filtering. The second stage is based on local ternary patterns (LTP) to further filter with texture information. The third stage trains a classifier on PCA appearance features to further reduce false alarms. The cascade template matching (CTM) provides a very low false-alarm rate compared with traditional template matching based methods and SIFT matching based methods. We demonstrate the effectiveness of the proposed method on several instance detection tasks on YouTube videos.

Chengli Xie, Jianguo Li, Tao Wang, Jinqiao Wang, Hanqing Lu

Action Recognition and Surveillance


An Unsupervised Real-Time Tracking and Recognition Framework in Videos

A novel framework for unsupervised face tracking and recognition is built on a Detection-Tracking-Refinement-Recognition (DTRR) approach. The framework proposes a hybrid face detector for real-time face tracking which is robust to occlusions, facial expressions, and posture changes. After posture correction and face alignment, the tracked face is described by the Local Ternary Pattern (LTP) operator. These faces are then clustered into several groups according to the distance between feature vectors. Next, these groups, each containing a series of faces, are further merged using the Scale-Invariant Feature Transform (SIFT) operator. Because SIFT is extremely time-consuming, a multithreaded refinement process is introduced. After refinement, the relevant faces are grouped together, which is of much importance for face recognition in videos. The framework is validated both on several videos collected in unconstrained conditions (8 min each) and on the Honda/UCSD database. These experiments demonstrate that the framework robustly tracks faces and automatically groups a series of faces for a single human subject in an unlabeled video.

Huafeng Wang, Yunhong Wang, Jin Huang, Fan Wang, Zhaoxiang Zhang

Recognizing Realistic Action Using Contextual Feature Group

Although spatial-temporal local features and the bag of visual words (BoW) model have achieved great success and wide adoption in action classification, some problems remain. First, the extracted local features are not stable enough, which may be caused by background motion or camera shake. Second, using local features alone ignores the spatial-temporal relationships among these features, which may decrease classification accuracy. Finally, the distance used in the clustering algorithm of the BoW model does not take semantic context into consideration. To address these problems, we propose a systematic framework for recognizing realistic actions that considers the spatial-temporal relationships between pruned local features and utilizes a new discriminative group distance to incorporate semantic context information. A Support Vector Machine (SVM) with multiple kernels is employed to make use of both local feature and feature group information. The proposed method is evaluated on the KTH dataset and the more realistic YouTube dataset. Experimental results validate our approach, and the recognition performance is promising.

Yituo Ye, Lei Qin, Zhongwei Cheng, Qingming Huang

Mutual Information-Based Emotion Recognition

Emotions that arise in viewers in response to videos play an essential role in content-based indexing and retrieval. However, the emotional gap between low-level features and high-level semantic meanings is not well understood. This paper proposes a general scheme for video emotion identification using mutual information-based feature selection followed by regression. Continuous arousal and valence values are used to measure video affective content in the dimensional arousal-valence space. First, rich audio-visual features are extracted from video clips. Minimum-redundancy maximum-relevance feature selection is then used to select the most representative feature subsets for arousal and valence modelling. Finally, support vector regression is employed to model the arousal and valence estimation functions. As evaluated via tenfold cross-validation, our scheme achieves a mean absolute error of 0.1358 for arousal and 0.1479 for valence, with variances of absolute error of 0.1074 and 0.1175, respectively. These encouraging results demonstrate the effectiveness of the proposed method.

Yue Cui, Suhuai Luo, Qi Tian, Shiliang Zhang, Yu Peng, Lei Jiang, Jesse S. Jin
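The minimum-redundancy maximum-relevance step can be sketched for discretized features: relevance is the mutual information between a candidate feature and the target, redundancy is its mean mutual information with already-selected features, and selection greedily maximizes the difference. This is a generic mRMR sketch, not the authors' configuration.

```python
import numpy as np

def mutual_info(x, y):
    # MI between two discrete arrays via their joint histogram
    xi = np.unique(x, return_inverse=True)[1]
    yi = np.unique(y, return_inverse=True)[1]
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    for a, b in zip(xi, yi):
        joint[a, b] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def mrmr(features, target, k):
    # greedy mRMR: pick the feature maximizing relevance minus mean redundancy
    selected = []
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(features.shape[1]):
            if j in selected:
                continue
            rel = mutual_info(features[:, j], target)
            red = np.mean([mutual_info(features[:, j], features[:, s])
                           for s in selected]) if selected else 0.0
            if rel - red > best_score:
                best, best_score = j, rel - red
        selected.append(best)
    return selected

# toy data: column 1 mirrors the target, column 0 is independent of it
target = np.tile([0, 0, 1, 1], 5)
features = np.stack([np.tile([0, 1], 10), target.copy()], axis=1)
picked = mrmr(features, target, 1)
```

In practice the continuous audio-visual features would first be discretized (for instance by binning) before the histogram-based MI estimate applies.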

Visual Analysis and Retrieval


Partitioned K-Means Clustering for Fast Construction of Unbiased Visual Vocabulary

Bag-of-Words (BoW) model has been widely used for feature representation in multimedia search area, in which a key step is to vector-quantize local image descriptors and generate a visual vocabulary. Popular visual vocabulary construction schemes generally perform a flat or hierarchical clustering operation using a very large training set in their original description space. However, these methods usually suffer from two issues: (1) A large training set is required to construct a large visual vocabulary, making the construction computationally inefficient; (2) The generated visual vocabularies are heavily biased towards the training samples. In this work, we introduce a partitioned k-means clustering (PKM) scheme to efficiently generate a large and unbiased vocabulary using only a small training set. Instead of directly clustering training descriptors in their original space, we first split the original space into a set of subspaces and then perform a separate k-means clustering process in each subspace. Finally, a complete visual vocabulary is built by combining cluster centroids from the different subspaces. Comprehensive experiments demonstrate that the proposed method indeed generates unbiased vocabularies and provides good scalability for building large vocabularies.

Shikui Wei, Xinxiao Wu, Dong Xu
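The subspace-splitting idea can be sketched as follows: clustering each subspace separately yields an implicit vocabulary that is the Cartesian product of the per-subspace centroids, so k centroids per subspace over m subspaces give k^m words from only m small clustering runs. The minimal Lloyd-iteration k-means and the word-index convention below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # minimal Lloyd k-means (illustrative; any k-means would do)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def pkm_vocabulary(X, n_subspaces=2, k=4):
    # cluster each subspace separately; implicit vocabulary size is k ** n_subspaces
    subs = np.array_split(np.arange(X.shape[1]), n_subspaces)
    return [kmeans(X[:, idx], k) for idx in subs], subs

def word_id(x, codebooks, subs):
    # concatenated per-subspace indices form the visual word id
    wid = 0
    for cb, idx in zip(codebooks, subs):
        j = int(np.argmin(((x[idx] - cb) ** 2).sum(axis=1)))
        wid = wid * len(cb) + j
    return wid

rng = np.random.default_rng(0)
X = rng.random((200, 4))
codebooks, subs = pkm_vocabulary(X, n_subspaces=2, k=4)
wid = word_id(X[0], codebooks, subs)
```

With 4 centroids in each of 2 subspaces, the effective vocabulary has 16 words while each clustering run only had to fit 4 centers, which is where the training-set and computation savings come from.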

Component-Normalized Generalized Gradient Vector Flow for Snakes

Snakes, or active contours, have been widely used for image segmentation. An external force for snakes called gradient vector flow (GVF) largely addresses traditional snake problems of initialization sensitivity and poor convergence to concavities, and generalized GVF (GGVF) aims to improve GVF snake convergence to long and thin indentations. In this paper, we find and show that in the case of long and thin even-width indentations, GGVF generally fails to work. We identify the crux of the convergence problem, and accordingly propose a new external force termed component-normalized GGVF (CN-GGVF) to eliminate the problem. CN-GGVF is obtained by normalizing each component of initial GGVF vectors with respect to its own magnitude. Comparisons against GGVF snakes show that the proposed CN-GGVF snakes can capture long and thin indentations regardless of odd or even widths with remarkably faster convergence speeds, and achieve lower computational complexity in vector normalization.

Yao Zhao, Ce Zhu, Lunming Qin, Huihui Bai, Huawei Tian
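The component normalization described in the abstract is a pointwise operation: each component of the external force field is divided by its own magnitude, so every nonzero component becomes +1 or -1 while zero components stay zero. A minimal sketch (the epsilon guard is an assumption to avoid division by zero):

```python
import numpy as np

def component_normalize(u, v, eps=1e-8):
    # divide each component of the (u, v) field by its own magnitude,
    # keeping only the sign pattern of the original GGVF vectors
    return u / (np.abs(u) + eps), v / (np.abs(v) + eps)

u = np.array([[2.0, -3.0, 0.0]])
v = np.array([[0.5, 0.0, -0.25]])
un, vn = component_normalize(u, v)
```

Compared with normalizing by the full vector magnitude, this keeps both components at unit strength independently, which is what lets the force push the snake down narrow even-width indentations.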

An Adaptive and Link-Based Method for Video Scene Clustering and Visualization

In this paper we propose to adaptively cluster video shots into scenes in a PageRank manner and to visualize video content based on the clustered scenes. The clustering method has been compared with state-of-the-art methods, and experimental results demonstrate its effectiveness. In visualization, the importance of the shots in a scene is obtained and incorporated into the visualization parameters. The visualization results of the test videos are presented at both global and detailed levels.

Hong Lu, Kai Chen, Yingbin Zheng, Zhuohong Cai, Xiangyang Xue

An Unsupervised Approach to Multiple Speaker Tracking for Robust Multimedia Retrieval

Tagging multimedia data based on who is speaking at what time is important, especially for intelligent retrieval of recordings of meetings and conferences. In this paper, an unsupervised approach is proposed for tracking more than two speakers in multimedia data recorded from multiple visual sensors and a single audio sensor. The multi-speaker detection and tracking problem is first formulated as a multiple hypothesis testing problem. From this formulation we proceed to derive the problem as a condition on mutual information. The proposed method is then evaluated on multimedia recordings of four speakers captured on a multimedia recording test bed. Experimental results on the CUAVE multimodal corpus are also discussed. The proposed method exhibits reasonably good performance as demonstrated by the detection (ROC) curves, and the results of the analysis based on the mutual information condition are also encouraging. A multiple speaker detection and tracking system implemented using this approach gives reasonable performance in actual meeting room scenarios.

M. Phanikumar, Lalan Kumar, Rajesh M. Hegde

On Effects of Visual Query Complexity

As an effective technique for managing large-scale image collections, content-based image retrieval (CBIR) has received great attention and become a very active research domain in recent years. While assessing system performance is one of the key factors in the related technological advancement, relatively little attention has been paid to modeling and analyzing test queries. This paper documents a study on the problem of determining “visual query complexity” as a measure for predicting image retrieval performance. We propose a quantitative metric for measuring the complexity of image queries for content-based image search engines. A set of experiments is carried out using the IAPR TC-12 Benchmark. The results demonstrate the effectiveness of the measurement, and verify that the retrieval accuracy of a query is inversely associated with the complexity of its visual content.

Jialie Shen, Zhiyong Cheng

Watermarking and Image Processing


Reversible Image Watermarking Using Hybrid Prediction

In this paper, a hybrid prediction algorithm is designed to improve the histogram-shifting-based reversible watermarking method. The algorithm not only uses local information near a pixel, but also utilizes global information from the whole image. As a result, it produces a sharper histogram for watermark embedding. In addition, we enable the use of the sorting idea by introducing an estimation function for the hybrid prediction. Experimental results illustrate that the proposed watermarking method outperforms many recently proposed methods.

Xiang Wang, Qingqi Pei, Xinbo Gao, Zongming Guo
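The paper sharpens the histogram that histogram-shifting watermarking embeds into; the plain (non-predictive) baseline it improves on can be sketched as follows. Pixels in the peak bin each carry one bit, and bins between the peak and an empty bin are shifted by one to make room. This sketch assumes the chosen "zero" bin is truly empty and lies to the right of the peak; real schemes handle both directions and record overhead information.

```python
import numpy as np

def hs_embed(img, bits):
    hist = np.bincount(img.ravel(), minlength=256)
    p = int(np.argmax(hist))                      # peak bin (capacity = hist[p])
    z = p + 1 + int(np.argmin(hist[p + 1:]))      # (assumed-empty) bin right of the peak
    out = img.astype(np.int32).copy()
    out[(out > p) & (out < z)] += 1               # shift to free bin p+1
    it = iter(bits)
    flat = out.ravel()
    for i in range(flat.size):
        if flat[i] == p:
            try:
                flat[i] += next(it)               # bit 1 -> p+1, bit 0 -> stays p
            except StopIteration:
                break
    return out, p, z

def hs_extract(marked, p, z):
    bits = [1 if v == p + 1 else 0 for v in marked.ravel() if v in (p, p + 1)]
    restored = marked.copy()
    restored[(restored > p) & (restored <= z)] -= 1   # undo shift and embedding
    return bits, restored

img = np.array([[5, 5, 5, 5],
                [6, 7, 8, 9],
                [5, 5, 6, 7],
                [9, 9, 9, 5]], dtype=np.int32)
marked, p, z = hs_embed(img, [1, 0, 1])
bits, restored = hs_extract(marked, p, z)
```

The hybrid prediction in the paper replaces the raw pixel histogram with a prediction-error histogram, which is sharper, so the peak bin (and hence the capacity) is larger for the same distortion.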

A Rotation Invariant Descriptor for Robust Video Copy Detection

A large number of videos on the Internet are generated from authorized sources by various kinds of transformations. Many methods have been proposed for robust description of video, leading to satisfactory matching quality on the Content-Based Copy Detection (CBCD) problem. However, the trade-off between efficiency and effectiveness remains a problem among state-of-the-art CBCD approaches. In this paper, we propose a novel frame-level descriptor for video. First, each selected frame is partitioned into rings. Then the Histogram of Oriented Gradients (HOG) and the Relative Mean Intensity (RMI) are calculated as the original features. We finally fuse these two features by summing the HOGs with the RMIs as the corresponding weights. The proposed descriptor is succinct in concept, compact in structure, robust to rotation-like transformations, and fast to compute. Experiments on the CIVR’07 Copy Detection Corpus and the Video Transformation Corpus show improved performance in both matching quality and execution time compared with previous approaches.

Shuqiang Jiang, Li Su, Qingming Huang, Peng Cui, Zhipeng Wu
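The ring-plus-weighting construction can be sketched as follows: pixels are binned into concentric rings (a partition that is stable under rotation), an orientation histogram is accumulated per ring, and each ring's histogram is scaled by its relative mean intensity. Ring count, bin count, and the exact weighting are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def ring_descriptor(frame, n_rings=3, n_bins=8):
    # per-ring gradient-orientation histogram (HOG-like), each ring weighted
    # by its relative mean intensity (RMI-like)
    h, w = frame.shape
    gy, gx = np.gradient(frame.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), np.pi)           # orientation in [0, pi)
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - (h - 1) / 2, xx - (w - 1) / 2)  # distance from frame center
    edges = np.linspace(0, r.max() + 1e-6, n_rings + 1)
    mean_all = frame.mean() + 1e-6
    desc = []
    for i in range(n_rings):
        mask = (r >= edges[i]) & (r < edges[i + 1])
        hist, _ = np.histogram(ori[mask], bins=n_bins, range=(0, np.pi),
                               weights=mag[mask])
        rmi = frame[mask].mean() / mean_all           # relative mean intensity
        desc.append(hist * rmi)
    return np.concatenate(desc)

rng = np.random.default_rng(1)
frame = rng.random((16, 16))
desc = ring_descriptor(frame)
```

The descriptor is one short vector per selected frame (here 3 rings x 8 bins = 24 values), which is what keeps frame-level matching fast.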

Depth-Wise Segmentation of 3D Images Using Dense Depth Maps

Unlike conventional image segmentation problems dealing with surface-wise decomposition, depth-wise segmentation is the problem of slicing an image containing 3D objects in a depth-wise sequence. The proposed depth-wise segmentation technique uses the depth map of a 3D image to slice it into multiple layers. This technique can be used to improve viewing comfort on 3D displays, to compress videos, and to interpolate intermediate views. The technique first finds the edges of the dense depth map using a gradient-based edge detection algorithm. Then, it uses the detected edges to divide the rows of the depth map into line-segments based on their entropy values. Finally, it links the line-segments to form object-layers. Experiments with the depth-wise segmentation technique have shown promising results.

Seyedsaeid Mirkamali, P. Nagabhushan

A Robust and Transparent Watermarking Method Against Block-Based Compression Attacks

In this paper, we present a new transparent and robust watermarking method against block-based compression attacks based on two perceptual models. To resist block-based compression, the main idea is to embed the watermark into regions that are not, or are less, affected by blocking artifacts. These auspicious regions are selected based on a spatial prediction model of the blocking effect. Then, the embedding strength is optimally determined using a JND model. The combination of these two models provides additional gains in robustness and transparency. Experimental results demonstrate that the proposed method achieves good invisibility and robustness against common signal processing attacks, especially JPEG compression.

Phi Bang Nguyen, Azeddine Beghdadi, Marie Luong

A New Signal Processing Method for Video Images: Reproducing the Frequency Spectrum Exceeding the Nyquist Frequency Using a Single Frame of the Video Image

A new non-linear signal processing method is proposed in this paper.

Enhancers are widely used in real-time signal processing hardware to improve image quality. An enhancer does not actually increase the resolution, but improves the resolution as perceived by the human eye. It is almost impossible to create components exceeding the Nyquist frequency using conventional linear signal processing methods.

Super Resolution (SR) is a highly interesting research field, and many ideas and methods have been proposed, some of which have the potential to enhance resolution. However, most of these ideas use several frames of video images, and they have not been widely discussed in the frequency domain.

The new signal processing method in this paper uses just a single frame of the video image and creates components exceeding the Nyquist frequency. The simulation results are discussed with regard to picture quality as well as the frequency domain.

Seiichi Gohshi



A Novel UEP Scheme for Scalable Video Transmission Over MIMO Systems

In this paper, we propose a novel unequal error protection (UEP) scheme to improve system performance for scalable video transmission over MIMO wireless networks, in which antenna selection, modulation, and channel coding are jointly optimized. We formulate the proposed scheme as an optimization problem and propose an efficient heuristic algorithm to solve it. By analyzing the dependency among antenna selection, modulation, and channel coding, we determine the three vectors sequentially: antenna selection is heuristically implemented according to the order of subchannel SNR strength; the best modulation levels are selected by maximizing the good throughput; and, based on the optimal antenna selection and modulation level vectors, the channel coding rate vector is determined to lower the packet error ratio (PER) as much as possible under the bandwidth constraint. Experimental results show that the proposed algorithm achieves nearly the same optimal PSNR performance as exhaustive search. Compared with other transmission schemes, our proposed scheme and algorithm significantly improve system performance.

Chao Zhou, Xinggong Zhang, Zongming Guo

Framework of Contour Based Depth Map Coding System

Like conventional video coding standards such as H.264, Multiview Video Coding (MVC) adopts a block-based prediction strategy to achieve a high compression ratio (as in JMVC). To improve the coding efficiency of MVC systems, Depth Image Based Rendering (DIBR) has been proposed, in which a depth map is introduced to represent the information of “another dimension”. Compared with the texture map (2D), the depth map exhibits much more spatial continuity. In this work, Contour Based Depth map Coding (CBDC) is proposed to take the place of the conventional block-based coding structure. The framework of the whole system is illustrated, and details of the contour coding module are introduced. Experimental results show that the bit cost of the proposed method (with DLP between 8 and 16) is equivalent to that of JMVC with QP between 12 and 16. Although the reconstructed frame loses some texture detail, the PSNR of the synthesized view is competitive.

Minghui Wang, Xun He, Xin Jin, Satoshi Goto

An Audiovisual Wireless Field Guide

This paper describes our work on developing a multimedia wireless field guide platform (WFG) for both fully automatic and assisted identification of different fauna and flora species. Built using a smart-client and server model, it supports both visual and acoustic capture of individual specimen samples, with optional graphical annotation for posing queries alongside more traditional searching and browsing query interfaces. The WFG assists the user to iteratively converge on a correct match by seeking the additional information required to resolve class uncertainty.

Ruben Gonzalez, Yongsheng Gao

CDNs with DASH and iDASH Using Priority Caching

Global Internet traffic shows an upward trend, driven mainly by the increasing demand for video services. In addition, the further spread of the mobile Internet leads to increasing diversification of access data rates and Internet terminals. In this context, Content Delivery Networks (CDNs) are forced to offer content in multiple versions for different resolutions. Moreover, multiple bitrates are needed so that emerging adaptive streaming technologies can adapt to network congestion. This enormous proliferation of multimedia content becomes more and more of a challenge for the efficiency of existing network and caching infrastructure. Dynamic Adaptive Streaming over HTTP (DASH) is an emerging standard which enables adaptation of the media bitrate to varying throughput conditions by offering multiple representations of the same content. The combination of Scalable Video Coding (SVC) with DASH, called improved DASH (iDASH), basically consists of relying on SVC to provide the different representations. This paper shows how prioritized caching strategies can improve the caching performance of (i)DASH services. Results obtained from statistics of a real-world CDN deployment and a simple revenue model show a clear revenue benefit for content providers when priority caching is used, especially in combination with iDASH.

Cornelius Hellge, Yago Sánchez, Thomas Schierl, Thomas Wiegand, Danny De Vleeschauwer, Dohy Hong, Yannick Le Louédec
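The prioritized caching idea can be sketched as an eviction policy: every cached segment carries a priority (for example, SVC base layers or popular representations high, rarely-served enhancement representations low), and eviction removes the lowest-priority, least-recently-used item first. This toy cache is a hedged illustration of the principle, not the deployed CDN logic; keys and priorities are made up.

```python
import collections

class PriorityCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = collections.OrderedDict()   # key -> priority, kept in LRU order

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)          # refresh recency on a hit
            return True
        return False

    def put(self, key, priority):
        if key in self.items:
            self.items.move_to_end(key)
            return
        while len(self.items) >= self.capacity:
            # evict the lowest priority; ties broken toward least recently used
            victim = min(self.items, key=lambda k: self.items[k])
            del self.items[victim]
        self.items[key] = priority

cache = PriorityCache(capacity=2)
cache.put("seg1@base", priority=2)
cache.put("seg1@enh", priority=1)
cache.put("seg2@base", priority=2)   # evicts the low-priority enhancement segment
```

Under such a policy, high-priority representations accumulate in the cache, which is the mechanism behind the revenue gains the paper measures.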

A Travel Planning System Based on Travel Trajectories Extracted from a Large Number of Geotagged Photos on the Web

Due to the recent widespread adoption of camera devices with GPS, the number of geotagged photos on the Web is increasing rapidly. Several image retrieval systems and travel recommendation systems which make use of geotagged images on the Web have been proposed. While most of them handle a large number of geotagged images as a set of location points, in this paper we handle them as sequences of location points. We propose a travel route recommendation system which utilizes actual travel paths extracted from a large number of photos uploaded by many people on the Web.

Kohya Okuyama, Keiji Yanai

A Robust Histogram Region-Based Global Camera Estimation Method for Video Sequences

Global motion estimation (GME) plays an important role in video object segmentation. This paper presents a computationally efficient two-stage affine GME algorithm. The key idea is to create initial matches between histogram-based image segmented regions. An affine motion model is then estimated and refined by iteratively removing incorrect matches. Experiments with different types of video sequences demonstrate the performance of the proposed approach.

Xuesong Le, Ruben Gonzalez
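The estimate-then-prune loop described above can be sketched with a plain least-squares affine fit: fit on the current inliers, drop correspondences whose residual exceeds a threshold, and refit. The threshold and iteration count are illustrative assumptions; the paper's region-matching front end is not reproduced here.

```python
import numpy as np

def fit_affine(src, dst, iters=5, thresh=5.0):
    # P stacks the 2x2 affine matrix (rows 0-1) and the translation (row 2),
    # so dst ~= [src | 1] @ P; incorrect matches are removed iteratively.
    keep = np.ones(len(src), dtype=bool)
    X_all = np.hstack([src, np.ones((len(src), 1))])
    P = None
    for _ in range(iters):
        P, *_ = np.linalg.lstsq(X_all[keep], dst[keep], rcond=None)
        residual = np.linalg.norm(X_all @ P - dst, axis=1)
        keep = residual < thresh
    return P, keep

# synthetic matches under a known affine motion, with two corrupted pairs
rng = np.random.default_rng(2)
src = rng.random((30, 2)) * 100
M_true = np.array([[1.1, 0.1], [-0.1, 0.9]])
t_true = np.array([5.0, -3.0])
dst = src @ M_true + t_true
dst[0] += 50.0
dst[1] -= 40.0
P, inliers = fit_affine(src, dst)
```

After the gross outliers are pruned, the fit on the remaining matches recovers the motion model essentially exactly, which is the behavior the iterative refinement relies on.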