
About this Book

The two-volume set LNCS 9314 and 9315 constitutes the proceedings of the 16th Pacific-Rim Conference on Multimedia, PCM 2015, held in Gwangju, South Korea, in September 2015.

A total of 138 full and 32 short papers presented in these proceedings were carefully reviewed and selected from 224 submissions. The papers were organized in topical sections named: image and audio processing; multimedia content analysis; multimedia applications and services; video coding and processing; multimedia representation learning; visual understanding and recognition on big data; coding and reconstruction of multimedia data with spatial-temporal information; 3D image/video processing and applications; video/image quality assessment and processing; social media computing; human action recognition in social robotics and video surveillance; recent advances in image/video processing; new media representation and transmission technologies for emerging UHD services.



Image and Audio Processing


Internal Generative Mechanism Based Otsu Multilevel Thresholding Segmentation for Medical Brain Images

Recent brain theories indicate that perceiving an image visually is an active inference procedure of the brain using the Internal Generative Mechanism (IGM). Inspired by this theory, an IGM based Otsu multilevel thresholding algorithm for medical images is proposed in this paper, in which the Otsu thresholding technique is applied to both the original image and the predicted version obtained by simulating the IGM on the original image. A regrouping measure is designed to refine the segmentation result. The proposed method takes into account both the predicted visual information generated by the complicated Human Visual System (HVS) and the image details. Experiments on medical MR-T2 brain images are conducted to demonstrate the effectiveness of the proposed method. The experimental results indicate that the IGM based Otsu multilevel thresholding is superior to other multilevel thresholding methods.
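As a reference point for the thresholding criterion this abstract builds on, below is a minimal single-threshold Otsu sketch in Python (numpy only). This is an illustrative sketch, not the authors' method: their IGM-based variant runs the criterion on both the original and the IGM-predicted image, extends it to multiple thresholds, and refines the result with a regrouping measure.

```python
import numpy as np

def otsu_threshold(image, levels=256):
    """Single-threshold Otsu: pick the threshold maximizing between-class variance."""
    hist, _ = np.histogram(image, bins=levels, range=(0, levels))
    p = hist / hist.sum()                       # gray-level probabilities
    bins = np.arange(levels)
    best_t, best_var = 0, -1.0
    for t in range(1, levels):
        w0, w1 = p[:t].sum(), p[t:].sum()       # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (bins[:t] * p[:t]).sum() / w0     # class means
        mu1 = (bins[t:] * p[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t
```

Multilevel Otsu generalizes the same between-class-variance criterion to several thresholds, typically via exhaustive or heuristic search over threshold tuples.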

Yuncong Feng, Xuanjing Shen, Haipeng Chen, Xiaoli Zhang

Efficient Face Image Deblurring via Robust Face Salient Landmark Detection

Recent years have witnessed great progress in image deblurring. However, as an important application case, the deblurring of face images has not been well studied. Most existing face deblurring methods rely on exemplar set construction and candidate matching, which not only incur high computational cost but are also vulnerable to complex or exaggerated face variations. To address these problems, we propose a novel face deblurring method that integrates the classical L0 deblurring approach with face landmark detection. A carefully tailored landmark detector is used to detect the main face contours. The detected contours are then used as salient edges to guide the blind image deconvolution. Extensive experimental results demonstrate that the proposed method can better handle various complex face poses while greatly reducing computation time, as compared with state-of-the-art approaches.

Yinghao Huang, Hongxun Yao, Sicheng Zhao, Yanhao Zhang

Non-uniform Deblur Using Gyro Sensor and Long/Short Exposure Image Pair

This paper proposes a deblurring algorithm that uses an IMU sensor and a long/short exposure-time image pair. First, we derive an initial blur kernel from the gyro data of the IMU sensor. Second, we refine the blur kernel by applying the Lucas-Kanade algorithm to the long/short exposure-time image pair. Using residual deconvolution based on the non-uniform blur kernel, we synthesize the final image. Experimental results show that the proposed algorithm is superior to state-of-the-art methods in terms of subjective/objective visual quality.

Seung Ji Seo, Ho-hyoung Ryu, Dongyun Choi, Byung Cheol Song

Object Searching with Combination of Template Matching

Object searching is the identification of an object in an image or video. There are several approaches to object detection, including template matching in computer vision. Template matching uses a small image, or template, to find matching regions in a larger image. In this paper, we propose a robust object searching method based on adaptive combination template matching. We apply a partition search to resize the target image properly. During this process, each template is efficiently matched to the sub-images using either the normalized sum of squared differences or zero-mean normalized cross-correlation, depending on the class of the object location (corresponding, neighbor, or previous location). Finally, the template image is updated appropriately by an adaptive template algorithm. Experimental results show that the proposed method outperforms existing approaches in object searching.
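For reference, zero-mean normalized cross-correlation, one of the two matching scores the abstract names, can be sketched as a plain exhaustive search (numpy only; the paper's adaptive partition search, score switching and template update are omitted):

```python
import numpy as np

def zncc(patch, template):
    """Zero-mean normalized cross-correlation between two equally sized patches."""
    a = patch - patch.mean()
    b = template - template.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def match_template(image, template):
    """Slide the template over every position and return the best-scoring location."""
    th, tw = template.shape
    ih, iw = image.shape
    best, best_pos = -2.0, (0, 0)
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            s = zncc(image[y:y + th, x:x + tw], template)
            if s > best:
                best, best_pos = s, (y, x)
    return best_pos, best
```

A perfect match scores 1.0; restricting the search to sub-images around a predicted location, as the paper proposes, reduces the quadratic scan cost.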

Wisarut Chantara, Yo-Sung Ho

Multimedia Content Analysis


Two-Step Greedy Subspace Clustering

Greedy subspace clustering methods provide an efficient way to cluster large-scale multimedia datasets. However, these methods do not guarantee a global optimum, and their clustering performance depends mainly on their initializations. To alleviate this initialization problem, this paper proposes a two-step greedy strategy that explores proper neighbors spanning an initial subspace. First, for each data point, we seek a sparse representation with respect to its nearest neighbors. The data points corresponding to nonzero entries in the learned representation form an initial subspace, which potentially rejects bad or redundant data points. Second, the subspace is updated by adding an orthogonal basis involving the newly added data points. Experimental results on real-world applications demonstrate that our method can significantly improve the clustering accuracy of greedy subspace clustering methods without sacrificing much computational time.
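The first step, sparse coding of a point against its candidate neighbors, could be sketched with orthogonal matching pursuit. The solver choice is an assumption for illustration; the abstract does not name one:

```python
import numpy as np

def omp(D, y, k):
    """Orthogonal matching pursuit: k-sparse code of y over the columns of D."""
    residual, support = y.copy(), []
    for _ in range(k):
        corr = np.abs(D.T @ residual)
        corr[support] = -np.inf              # never reselect a chosen atom
        support.append(int(np.argmax(corr)))
        Ds = D[:, support]
        coef, *_ = np.linalg.lstsq(Ds, y, rcond=None)  # refit on the support
        residual = y - Ds @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x
```

The neighbors with nonzero coefficients would then seed the initial subspace, which the second step grows by orthogonalizing newly added points against the current basis.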

Lingxiao Song, Man Zhang, Zhenan Sun, Jian Liang, Ran He

Iterative Collection Annotation for Sketch Recognition

Sketch recognition is an important issue in human-computer interaction, especially in sketch-based interfaces. To provide a scalable and flexible tool for user-driven sketch recognition, this paper proposes an iterative sketch collection annotation method for classifier training that interleaves online metric learning, semi-supervised clustering and user intervention. It discovers the categories of the collections iteratively by combining online metric learning with semi-supervised clustering, and puts user intervention into the loop of each iteration. The features of our method lie in three aspects. First, the unlabeled collections are annotated with less effort in a group-by-group manner. Second, users can annotate the collections flexibly and freely to define sketch recognition personally for different applications. Finally, a scalable collection can be annotated efficiently by combining dynamic processing with online learning. The experimental results prove the effectiveness of our method.

Kai Liu, Zhengxing Sun, Mofei Song, Bo Li, Ye Tian

Supervised Dictionary Learning Based on Relationship Between Edges and Levels

Categories of images are often arranged in a hierarchical structure based on their semantic meanings. Many existing approaches demonstrate that the hierarchical category structure can bolster the learning process for classification, but most of them are designed for a flat category structure and hence may not be appropriate for dealing with complex category structures and large numbers of categories. In this paper, given the hierarchical category structure, we propose to jointly learn a shared discriminative dictionary and corresponding level classifiers for visual categorization by making use of the relationships between edges and between levels. Specifically, we use the graph-guided fused-lasso penalty to embed the relationships between edges into the dictionary learning process. Besides, our approach learns not only the classifier for the basic-class level but also the classifier for the super-class level, embedding the relationships between levels into the learning process. Experimental results on the Caltech256 dataset and its subset show that the proposed approach yields promising performance improvements over some state-of-the-art methods.

Qiang Guo, Yahong Han

Adaptive Margin Nearest Neighbor for Person Re-Identification

Person re-identification is a challenging issue due to large visual appearance changes caused by variations in viewpoint, lighting, background clutter and occlusion among different cameras. Recently, Mahalanobis metric learning methods, which aim to find a global, linear transformation of the feature space between cameras [1-4], are widely used in person re-identification. In order to maximize the inter-class variation, general Mahalanobis metric learning methods usually push impostors (i.e., all negative samples that are nearer than the target neighbors) a fixed threshold distance away, treating all impostors equally without considering their diversity. However, for person re-identification, the discrepancies among impostors are useful for refining the ranking list. Motivated by this observation, we propose an Adaptive Margin Nearest Neighbor (AMNN) method for person re-identification. AMNN treats each sample's impostors unequally by pushing them adaptive, variable margins away. Extensive comparative experiments conducted on two standard datasets confirm the superiority of the proposed method.

Lei Yao, Jun Chen, Yi Yu, Zheng Wang, Wenxin Huang, Mang Ye, Ruimin Hu

Compressed-Domain Based Camera Motion Estimation for Realtime Action Recognition

Camera motions seriously affect the accuracy of action recognition. Traditional methods address this issue through estimating and compensating camera motions based on optical flow in pixel-domain. But the high computational complexity of optical flow hinders these methods from applying to realtime scenarios. In this paper, we advance an efficient camera motion estimation and compensation method for realtime action recognition by exploiting motion vectors in video compressed-domain (a.k.a. compressed-domain global motion estimation, CGME). Taking advantage of geometric symmetry and differential theory of motion vectors, we estimate the parameters of camera affine transformation. These parameters are then used to compensate the initial motion vectors to retain crucial object motions. Finally, we extract video features for action recognition based on compensated motion vectors. Experimental results show that our method improves the speed of camera motion estimation by over 100 times with a minor reduction of about 4% in recognition accuracy compared with iDT.
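The affine camera-motion model estimated from block motion vectors can be sketched as a least-squares fit followed by compensation. This is a generic illustration of the model, not the paper's geometric-symmetry/differential estimator:

```python
import numpy as np

def fit_affine(points, vectors):
    """Least-squares fit of a 2D affine motion model v = A p + t (six parameters)
    to block motion vectors sampled at block centers."""
    n = len(points)
    M = np.zeros((2 * n, 6))
    b = np.asarray(vectors, float).reshape(-1)
    for i, (x, y) in enumerate(points):
        M[2 * i] = [x, y, 1, 0, 0, 0]        # row for the horizontal component
        M[2 * i + 1] = [0, 0, 0, x, y, 1]    # row for the vertical component
    params, *_ = np.linalg.lstsq(M, b, rcond=None)
    return params                             # [a11, a12, tx, a21, a22, ty]

def compensate(points, vectors, params):
    """Subtract the predicted camera motion to retain object motion."""
    a11, a12, tx, a21, a22, ty = params
    pred = np.array([[a11 * x + a12 * y + tx, a21 * x + a22 * y + ty]
                     for x, y in points])
    return np.asarray(vectors, float) - pred
```

After compensation, residual motion vectors that remain large correspond to object motion and feed the action-recognition features.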

Huafeng Chen, Jun Chen, Hongyang Li, Zengmin Xu, Ruimin Hu

Image and Audio Processing


On the Security of Image Manipulation Forensics

In this paper, we present a unified understanding of formal performance evaluation for image manipulation forensics techniques. Within a hypothesis testing model, security is quantified as the difficulty of defeating an existing forensics system and making it generate either of two types of forensic errors, i.e., missed detection and false alarm. We point out that security against false alarm risk, which is rarely addressed in the current literature, is equally significant for evaluating the performance of manipulation forensics techniques. With a case study on a resampling-based composition forensics detector, both qualitative analyses and experimental results verify the correctness and rationality of our understanding of manipulation forensics security.

Gang Cao, Yongbin Wang, Yao Zhao, Rongrong Ni, Chunyu Lin

A Sparse Representation-Based Label Pruning for Image Inpainting Using Global Optimization

This paper presents a new label pruning method based on sparse representation for image inpainting. Here, a label denotes a small rectangular patch used to fill the missing regions. Global optimization-based image inpainting requires heavy computational cost due to the large number of labels, so it is necessary to effectively prune redundant labels; at the same time, inappropriate label pruning could degrade the inpainting quality. In this paper, we adopt a sparse representation of labels to obtain a few reliable labels, and use it to prune the redundant ones. Sparsely represented labels, as well as non-zero sparse labels with high similarity to the target region, are used as reliable labels in global optimization-based image inpainting. Experimental results show that the proposed method achieves both computational efficiency and structural consistency.

Hak Gu Kim, Yong Man Ro

Interactive RGB-D Image Segmentation Using Hierarchical Graph Cut and Geodesic Distance

In this paper, we propose a novel interactive image segmentation method for RGB-D images using hierarchical Graph Cut. Considering the characteristics of RGB channels and depth channel in RGB-D image, we utilize Euclidean distance on RGB space and geodesic distance on 3D space to measure how likely a pixel belongs to foreground or background in color and depth respectively, and integrate the color cue and depth cue into a unified Graph Cut framework to obtain the optimal segmentation result. Moreover, to overcome the low efficiency problem of Graph Cut in handling high resolution images, we accelerate the proposed method with hierarchical strategy. The experimental results show that our method outperforms the state-of-the-art methods with high efficiency.

Ling Ge, Ran Ju, Tongwei Ren, Gangshan Wu

Face Alignment with Two-Layer Shape Regression

We present a novel approach to the problem of face alignment with a two-layer shape regression framework. Traditional regression-based methods [4, 6, 7] regress all landmarks in a single shape without considering the differences between landmarks in biological properties and texture, which can lead to a suboptimal prediction. Unlike previous regression-based approaches, we do not regress all landmarks holistically without any discrimination. We categorize the geometric constraints into two types: inter-component constraints and intra-component constraints. Corresponding to these two shape constraints, we design a two-layer shape regression framework that can be integrated with regression-based methods. We define “key points” of components to describe inter-component constraints and then determine the sub-shapes. We verify our two-layer shape regression framework on two widely used face alignment datasets (LFPW [10] and Helen [11]), and experimental results prove its improvements in accuracy.

Qilong Zhang, Lei Zhang

3D Panning Based Sound Field Enhancement Method for Ambisonics

When a conventional first-order Ambisonics system uses four loudspeakers in a platonic-solid layout to reconstruct a sound field, the 3D acoustic field effect is limited. A new signal distribution method is proposed to enhance the reproduced field without increasing the number of loudspeakers. First, a platonic solid is extended to obtain new vertices; based on the traditional Ambisonics signal distribution method, the original field signal is distributed to loudspeakers at both the original and new vertices of the platonic solid. Second, the signals of the loudspeakers at the new vertices are redistributed to the loudspeakers at the original vertices by a new 3D panning method; the loudspeakers at the new vertices are then removed, leaving only the original vertices of the platonic solid. The proposed method can improve the quality of the reconstructed sound field without increasing the complexity of the loudspeaker layout in practice. The results are verified through objective and subjective experiments.
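A minimal 3D amplitude-panning step of the kind the abstract describes, redistributing one source direction over three loudspeakers, might look like the following VBAP-style sketch (assumed unit-vector speaker directions; this is not the proposed distribution method itself):

```python
import numpy as np

def vbap_gains(source_dir, speaker_dirs):
    """Solve L g = p for per-loudspeaker gains so that the gain-weighted speaker
    directions reproduce the source direction (three speakers, 3D panning)."""
    L = np.column_stack([d / np.linalg.norm(d) for d in speaker_dirs])
    g = np.linalg.solve(L, source_dir / np.linalg.norm(source_dir))
    return g / np.linalg.norm(g)   # normalize gains to preserve loudness
```

A source lying exactly on one speaker direction yields a gain of 1 for that speaker and 0 for the others; intermediate directions are panned across the triangle.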

Song Wang, Ruimin Hu, Shihong Chen, Xiaochen Wang, Yuhong Yang, Weiping Tu

Multimedia Applications and Services


Multi-target Tracking via Max-Entropy Target Selection and Heterogeneous Camera Fusion

Nowadays, dual-camera systems, which consist of a static camera and a pan-tilt-zoom (PTZ) camera, have become popular in video surveillance, since they can simultaneously offer wide area coverage and highly detailed images of the targets of interest. Unlike most previous multi-target tracking methods, which use no information fusion, we propose a multi-target tracking framework based on information fusion of the heterogeneous cameras. Specifically, a conservative online multi-target tracking method is introduced to generate reliable tracklets in both cameras in real time. A max-entropy target selection strategy is proposed to determine which target should be observed by the PTZ camera at a higher resolution to reduce the ambiguity of multi-target tracking. Finally, the information from the static camera and the PTZ camera is fused into a tracking-by-detection framework for more robust multi-target tracking. The proposed method is tested in an outdoor scene, and the experimental results show that our method significantly improves multi-target tracking performance.

Jingjing Wang, Nenghai Yu

Adaptive Multiple Appearances Model Framework for Long-Term Robust Tracking

Tracking an object over the long term is still a great challenge in computer vision. Appearance modeling is one of the keys to building a good tracker. Much research attention focuses on building an appearance model by employing special features and learning methods, especially online learning. However, one model is not enough to describe all historical appearances of the tracking target during a long-term tracking task, because of viewpoint changes, illumination variation, camera switching, etc. We propose the Adaptive Multiple Appearance Model (AMAM) framework, which maintains not one model but a set of appearance models to solve this problem. Different appearance representations of the tracking target can be employed, grouped in an unsupervised way, and modeled automatically by a Dirichlet Process Mixture Model (DPMM). The tracking result is then selected from candidate targets predicted by trackers based on those appearance models, using voting and a confidence map. Experimental results on multiple public datasets demonstrate better performance compared with state-of-the-art methods.

Shuo Tang, Longfei Zhang, Jiapeng Chi, Zhufan Wang, Gangyi Ding

On-line Sample Generation for In-air Written Chinese Character Recognition Based on Leap Motion Controller

As intelligent devices and human-computer interaction modalities become diverse, in-air writing is becoming popular as a very natural way to interact. Compared with online handwritten Chinese character recognition (OHCCR) based on a touch screen or writing board, research on in-air handwritten Chinese character recognition (IAHCCR) is still in its start-up phase. In this paper, we present an on-line sample generation method that enlarges the number of training instances through automatic synthesis. In our system, the in-air writing trajectory of the fingertip is first captured by a Leap Motion Controller. Then corner points are detected. Finally, the corner points, as well as the sampling points between them, are distorted to generate artificial patterns. Compared with previous sample generation methods, the proposed method focuses on distorting the inner structure of character patterns. We evaluate the proposed method on our in-air handwritten Chinese character dataset IAHCC-UCAS2014, which covers 3755 classes of Chinese characters. The experimental results demonstrate that the proposed approach achieves higher recognition accuracy with lower computational cost.

Ning Xu, Weiqiang Wang, Xiwen Qu

Progressive Image Segmentation Using Online Learning

This article proposes progressive image segmentation, which allows users to segment images according to their preferences without any tedious pre-labeling or training stages. We use an online learning method to train and update the segmentation model progressively. Users can scribble on the image to label initial samples or to correct falsely labeled regions of the result. To efficiently integrate the interaction with the learning and updating process, a three-level representation of images is built. The proposed method has three advantages. First, the segmentation model can be learned online along with the user's manipulation, without any pre-labeling. Second, segmentation matching diverse user preferences can be achieved flexibly, and the more the system is used, the more accurate the segmentation becomes. Finally, the segmentation model can be updated online to meet users' needs. The experimental results demonstrate these advantages.

Jiagao Hu, Zhengxing Sun, Kewei Yang, Yiwen Chen

A Study of Interactive Digital Multimedia Applications

Many communication models for communication arts and numerous interactive multimedia applications for computer science have been discussed over the past decades. However, there has been little work giving an overview of recent integrated research on digital media and emerging trends, such as interactive multimedia experiences, from an interdisciplinary perspective. In this paper, we review and study recent interactive digital multimedia applications that use and apply these emerging trends. We provide a short blueprint for interactive digital multimedia research applying virtual reality, image processing, computer vision, real-time augmented reality, and interactive media to the senses of hearing and vision in virtual environments. An SMCR (Source-Message-Channel-Receiver) model for communicating via all human senses is also explained and linked to some recently presented interactive digital multimedia applications. After that, the senses of hearing and vision are discussed in terms of related technologies. This paper will be of value to new researchers in the integrated emerging field of interactive digital multimedia.

Chutisant Kerdvibulvech

Video Coding and Processing


Particle Filter with Ball Size Adaptive Tracking Window and Ball Feature Likelihood Model for Ball’s 3D Position Tracking in Volleyball Analysis

3D position tracking of the ball plays a crucial role in professional volleyball analysis. In volleyball games, the conditions that limit the performance of ball tracking include the fast, irregular movement of the ball, its small size, the complex background, and occlusion caused by players. This paper proposes a ball size adaptive (BSA) tracking window, a ball feature likelihood model and an anti-occlusion likelihood measurement (AOLM) based on a particle filter to improve accuracy. By adaptively changing the tracking window according to the ball size, it is possible to track the ball as its size changes across video images. The ball feature likelihood, in turn, enables stable tracking even against a complex background. Furthermore, AOLM based on a multiple-camera system solves the occlusion problem, since it can eliminate the low likelihoods caused by occlusion. Experimental results on HDTV video sequences (2014 Inter High School Games of Men's Volleyball) captured by four cameras located at the corners of the court show that the success rate of the ball's 3D position tracking reaches 93.39%.

Xina Cheng, Xizhou Zhuang, Yuan Wang, Masaaki Honda, Takeshi Ikenaga

Block-Based Global and Multiple-Reference Scheme for Surveillance Video Coding

There are often many periodic motions in the background of surveillance videos, caused for example by countdown traffic lights and LED billboards. The conventional motion-compensation scheme and the existing frame-based single background reference scheme cannot eliminate this kind of redundancy efficiently, especially when the cycle time exceeds the maximum GOP size. In this paper, we propose a block-based global and multiple-reference scheme to solve this problem. First, the background is modeled on the basis of co-located blocks rather than frames, which makes adaptive block-level background updating possible. Second, multiple background blocks can be kept for one block location, which makes the scheme suitable for modeling periodic backgrounds. Third, the scheme enables global reference, which further eliminates the extensive redundancies among GOPs in surveillance videos. Experimental results show that the proposed scheme achieves better rate-distortion performance than the existing frame-based single background reference scheme in most cases.

Liming Yin, Ruimin Hu, Shihong Chen, Jing Xiao, Minsheng Ma

Global Object Representation of Scene Surveillance Video Based on Model and Feature Parameters

Scene surveillance video is captured by a stationary camera over a long time in a specific surveillance scene. Due to the regular movement of vehicles with similar structures, models and appearances, surveillance video contains large amounts of redundancy and needs to be coded efficiently for transmission and storage. In this study, we investigate the redundancy generation mechanism of scene surveillance video and identify a new redundancy type, Global Object Redundancy (GOR); it is shown that moving vehicles account for the largest proportion of this redundancy. Second, aiming at global vehicle-object representation and GOR elimination, a global object representation scheme for scene surveillance video based on model and feature parameters is introduced: by establishing a global knowledge dictionary and feature parameter sets, low-bitrate, high-quality compression can be achieved, since only a few individual semantic and feature parameters per vehicle object need to be transferred and coded. Finally, preliminary experiments in a simulation environment show that the object representation scheme can effectively improve the compression of long-term archived surveillance video while maintaining image quality.

Minsheng Ma, Ruimin Hu, Shihong Chen, Jing Xiao, Zhongyuan Wang, Shenming Qu

A Sparse Error Compensation Based Incremental Principal Component Analysis Method for Foreground Detection

Foreground detection is a fundamental task in video processing. Recently, many background subspace estimation based foreground detection methods have been proposed. In this paper, a sparse error compensation based incremental principal component analysis method, which robustly updates background subspace and estimates foreground, is proposed for foreground detection. There are mainly two notable features in our method. First, a sparse error compensation process via a probability sampling procedure is designed for subspace updating, which reduces the interference of undesirable foreground signal. Second, the proposed foreground detection method could operate without an initial background subspace estimation, which enlarges the application scope of our method. Extensive experiments on multiple real video sequences show the superiority of our method.
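The background-subspace residual test underlying such methods can be sketched as follows, assuming an orthonormal basis of the background subspace is already available; the paper's actual contributions, sparse error compensation and probability-sampled subspace updating, are not shown:

```python
import numpy as np

def foreground_mask(frame, basis, mean, thresh):
    """Project a frame onto the background subspace (orthonormal columns of
    `basis`); pixels with large reconstruction residuals are labeled foreground."""
    v = frame.ravel() - mean
    recon = basis @ (basis.T @ v)        # projection onto the subspace
    residual = np.abs(v - recon)
    return (residual > thresh).reshape(frame.shape)
```

In the incremental setting, `basis` and `mean` would be updated frame by frame, with the detected foreground (the sparse error) excluded from the update.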

Ming Qin, Yao Lu, Huijun Di, Tianfei Zhou

Multimedia Representation Learning


Convolutional Neural Networks Features: Principal Pyramidal Convolution

The features extracted from convolutional neural networks (CNNs) are able to capture the discriminative parts of an image and have shown superior performance in visual recognition. Furthermore, it has been verified that CNN activations trained on large and diverse datasets can act as generic features and be transferred to other visual recognition tasks. In this paper, we aim to learn more from an image and present an effective method called Principal Pyramidal Convolution (PPC). The scheme first partitions the image into two levels, extracts CNN activations for each sub-region along with the whole image, and then aggregates them together. The concatenated feature is later reduced to the standard dimension using the Principal Component Analysis (PCA) algorithm, generating the refined CNN feature. When applied to image classification and retrieval tasks, the PPC feature consistently outperforms the conventional CNN feature, regardless of the network from which it is derived. Specifically, PPC achieves a state-of-the-art result on the MIT Indoor67 dataset, utilizing the activations from Places-CNN.
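The aggregation-then-PCA step can be sketched as below, assuming the per-region CNN activations have already been extracted by some network (the extractor itself is outside the sketch):

```python
import numpy as np

def ppc_feature(feats_whole, feats_parts, components):
    """Concatenate whole-image and sub-region CNN activations, then project
    onto the leading principal components to recover a standard dimension.

    feats_whole: (n_images, d) activations of the whole images
    feats_parts: list of (n_images, d) activations, one array per sub-region
    """
    stacked = np.hstack([feats_whole] + feats_parts)   # (n_images, d_total)
    centered = stacked - stacked.mean(axis=0)
    # PCA via SVD of the centered data matrix; rows of vt are components
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:components].T                # (n_images, components)
```

Fitting PCA on the concatenated training features and reusing the same projection for queries keeps the refined feature at the conventional CNN dimension.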

Yanming Guo, Songyang Lao, Yu Liu, Liang Bai, Shi Liu, Michael S. Lew

Gaze Shifting Kernel: Engineering Perceptually-Aware Features for Scene Categorization

In this paper, we propose a novel gaze shifting kernel for scene image categorization, focusing on discovering the mechanism by which humans perceive visually/semantically salient regions in a scene. First, a weakly supervised embedding algorithm projects the local image descriptors (i.e., graphlets) into a pre-specified semantic space. Afterward, each graphlet can be represented by multiple visual features at both low level and high level. As humans typically attend to a small fraction of regions in a scene, a sparsity-constrained graphlet ranking algorithm is proposed to dynamically integrate both the low-level and the high-level visual cues. The top-ranked graphlets are either visually or semantically salient according to human perception. They are linked into a path to simulate human gaze shifting. Finally, we calculate the gaze shifting kernel (GSK) based on the discovered paths from a set of images. Experiments on the USC scene and the ZJU aerial image data sets demonstrate the competitiveness of our GSK, as well as the high consistency of the predicted paths with real human gaze shifting paths.

Luming Zhang, Richang Hong, Meng Wang

Two-Phase Representation Based Classification

In this paper, we propose a two-phase representation based classification called the two-phase linear reconstruction measure based classification (TPLRMC). It is inspired by the fact that the linear reconstruction measure (LRM) gauges the similarities among feature samples by decomposing each feature sample as a linear combination of the other feature samples with L2-norm regularization. Since the linear reconstruction coefficients can fully reveal the feature's neighborhood structure hidden in the data, the similarity measures between the training samples and the query sample are well suited to classifier design. TPLRMC first coarsely seeks the K nearest neighbors of the query sample with LRM, and then finely represents the query sample as a linear combination of the determined K nearest neighbors, again using LRM to perform classification. Experimental results on face databases show that TPLRMC can significantly improve classification performance.
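A simplified sketch of the two-phase scheme follows, using ridge-regularized reconstruction as the LRM and a coefficient-sum decision rule; the paper's exact regularizer and scoring may differ, so treat this as an assumption-laden illustration:

```python
import numpy as np

def lrm_coefficients(X, y, lam=0.1):
    """Represent query y as a regularized linear combination of the columns of X."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def tplrmc_predict(X_train, labels, y, k=5, lam=0.1):
    # Phase 1: coarse neighbor selection by reconstruction-coefficient magnitude.
    w = lrm_coefficients(X_train, y, lam)
    nn = np.argsort(-np.abs(w))[:k]
    # Phase 2: re-represent y over the k selected neighbors; score each class by
    # its aggregated coefficients and pick the largest.
    w2 = lrm_coefficients(X_train[:, nn], y, lam)
    scores = {}
    for c, coef in zip(np.asarray(labels)[nn], w2):
        scores[c] = scores.get(c, 0.0) + coef
    return max(scores, key=scores.get)
```

Here training samples are columns of `X_train`; the two regularized solves mirror the coarse and fine phases described in the abstract.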

Jianping Gou, Yongzhao Zhan, Xiangjun Shen, Qirong Mao, Liangjun Wang

Deep Feature Representation via Multiple Stack Auto-Encoders

Recently, deep architectures, such as stacked auto-encoders (SAEs), have been used to learn features from unlabeled data. However, it is difficult to obtain multi-level visual information from such traditional deep architectures. In this paper, a feature representation method which concatenates Multiple Different Stack Auto-Encoders (MDSAEs) is presented. The proposed method tries to imitate the human visual cortex in recognizing objects from different views. The output of the last hidden layer of each SAE can be regarded as one kind of feature. Several kinds of features are concatenated together to form a final representation according to their weights (the outputs of deeper architectures are assigned higher weights, and vice versa). In this way, the hierarchical structure of the human brain cortex can be simulated. Experimental results for classification on the MNIST and CIFAR-10 datasets demonstrate the superior performance.

Mingfu Xiong, Jun Chen, Zheng Wang, Chao Liang, Qi Zheng, Zhen Han, Kaimin Sun

Beyond HOG: Learning Local Parts for Object Detection

Histogram of Oriented Gradients (HOG) features have laid a solid foundation for object detection in recent years thanks to both their accuracy and speed. However, the expressivity of HOG is limited, because the simple gradient features may ignore important local information about objects, and HOG is data-independent. In this paper, we propose to replace HOG with a parts-based representation, Histogram of Local Parts (HLP), for object detection under the sliding window framework. HLP can capture richer and larger local patterns of objects and is more expressive than HOG. Specifically, we adopt Sparse Nonnegative Matrix Factorization to learn an over-complete parts-based dictionary from data. We can then obtain the HLP representation for a local patch by aggregating the Local Parts coefficients of the pixels in the patch. Like DPM, we can train a supervised model with HLP given the latent positions of the roots and parts of objects. Extensive experiments on the INRIA and PASCAL datasets verify the superiority of HLP over state-of-the-art HOG-based methods for object detection, showing that HLP is more effective than HOG.

Chenjie Huang, Zheng Qin, Kaiping Xu, Guolong Wang, Tao Xu

Regular Poster Session


Tuning Sparsity for Face Hallucination Representation

Due to under-sparsity or over-sparsity, widely used regularization methods such as ridge regression and sparse representation lead to poor hallucination performance in the presence of noise. In addition, the regularized penalty function fails to consider the locality constraint between the observed image and the training images, thus reducing the accuracy and stability of the optimal solution. This paper proposes a locally weighted sparse regularization method that incorporates distance-inducing weights into the penalty function. This method accounts for the heteroskedasticity of the representation coefficients and can be justified theoretically from a Bayesian inference perspective. Further, to address the reduced sparseness of noisy images, a moderately sparse regularization method with a mixture of norms is introduced for noise-robust face hallucination. Various experimental results on public face databases validate the effectiveness of the proposed method.

Zhongyuan Wang, Jing Xiao, Tao Lu, Zhenfeng Shao, Ruimin Hu

Visual Tracking by Assembling Multiple Correlation Filters

In this paper, we present a robust object tracking method that fuses multiple correlation filters into a weighted sum of their classifier vectors. Unlike learning methods that use a sparse sampling mechanism to generate training samples, our method adopts a dense sampling strategy for both training and testing, which is effective yet efficient thanks to the highly structured kernel matrix. As tracking proceeds, a correlation filter pool is built from the filters trained on historical frames, and the weighted sum of these correlation filters serves as the final classifier for locating the object. We introduce a coefficient optimization scheme that balances the test errors of all correlation filters while emphasizing recent frames. We also describe a budget mechanism that prevents unlimited growth of the filter pool by removing the filter whose removal causes the smallest change to the final correlation filter. Experiments comparing our method with three other state-of-the-art algorithms demonstrate the robust and encouraging performance of the proposed algorithm.
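The core of such a tracker can be sketched with a single-channel Fourier-domain correlation filter and a weighted fusion of several filters (a MOSSE-style closed form; the paper's coefficient optimization and budget mechanism are not reproduced here):

```python
import numpy as np

def train_filter(patch, target, lam=1e-2):
    """Closed-form correlation filter in the Fourier domain (MOSSE-style):
    H = (G * conj(F)) / (F * conj(F) + lam), where F, G are the FFTs of
    the training patch and the desired Gaussian response."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(target)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def fused_response(patch, filters, weights):
    """Response of a weighted sum of correlation filters on a test patch."""
    F = np.fft.fft2(patch)
    H = sum(w * h for w, h in zip(weights, filters))
    return np.real(np.fft.ifft2(H * F))

# toy example: desired Gaussian response centred on the object position
ys, xs = np.mgrid[0:32, 0:32]
target = np.exp(-((ys - 16) ** 2 + (xs - 16) ** 2) / 8.0)
patch = np.random.default_rng(0).random((32, 32))
patch[14:19, 14:19] += 2.0          # bright blob at the centre
h = train_filter(patch, target)
resp = fused_response(patch, [h, h], [0.5, 0.5])
peak = np.unravel_index(np.argmax(resp), resp.shape)  # object location
```

In a real tracker the filter pool would hold filters from different historical frames, each with its own optimized weight.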

Tianyu Yang, Zhongchao Shi, Gang Wang

A Unified Tone Mapping Operation for HDR Images Including Both Floating-Point and Integer Data

This paper considers a unified tone mapping operation (TMO) for HDR images, covering not only floating-point data but also long-integer (i.e., longer than 8-bit) data as HDR image representations. A TMO generates a low dynamic range (LDR) image from a high dynamic range (HDR) image by compressing its dynamic range; a unified TMO performs tone mapping for various HDR image formats with a single common operation. An integer TMO that performs unified tone mapping by converting the input HDR image into an intermediate format has previously been proposed and can be executed efficiently with little memory on a low-performance processor. However, that unified TMO considered only floating-point HDR image formats; long-integer formats, which are also HDR image formats, were not covered. This paper extends the unified TMO to long-integer formats, so that a unified TMO for all common HDR image formats can be realized. The proposed method deliberately converts a long-integer number into a floating-point number and treats it as two 8-bit integers corresponding to its exponent part and mantissa part; tone mapping is then applied to the two integers separately. The experimental results show that the proposed method is effective for integer formats in terms of resources such as computational cost and memory cost.
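The exponent/mantissa split at the heart of this approach can be sketched as follows (the exact 8-bit layout and rounding are assumptions for illustration, not the paper's specification):

```python
import math

def split_exp_mantissa(value):
    """Represent a long-integer HDR sample as two 8-bit integers:
    an exponent part and a quantised mantissa part."""
    if value == 0:
        return 0, 0
    m, e = math.frexp(float(value))          # value = m * 2**e, 0.5 <= m < 1
    mant8 = int(round((m - 0.5) * 2 * 255))  # map [0.5, 1) onto 0..255
    return e & 0xFF, mant8

def merge_exp_mantissa(e8, mant8):
    """Approximate inverse of split_exp_mantissa."""
    if e8 == 0 and mant8 == 0:
        return 0
    m = 0.5 + (mant8 / 255.0) / 2
    return m * (2 ** e8)

x = 123456                        # a 17-bit integer HDR sample
e8, m8 = split_exp_mantissa(x)    # two 8-bit numbers
approx = merge_exp_mantissa(e8, m8)
```

Tone curves would then be applied to the exponent and mantissa planes separately; quantising the mantissa to 8 bits keeps the relative error below about 0.2%.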

Toshiyuki Dobashi, Masahiro Iwahashi, Hitoshi Kiya

Implementation of Human Action Recognition System Using Multiple Kinect Sensors

Human action recognition is an important research topic with many potential applications such as video surveillance, human-computer interaction and virtual-reality combat training. However, much research on human action recognition has been performed with single-camera systems, which have low performance due to their vulnerability to partial occlusion. In this paper, we propose a human action recognition system using multiple Kinect sensors to overcome this limitation of conventional single-camera systems. To test the feasibility of the proposed system, we use snapshot and temporal features extracted from three-dimensional (3D) skeleton data sequences and apply a support vector machine (SVM) to classify human actions. The experimental results demonstrate the feasibility of the proposed system.

Beom Kwon, Doyoung Kim, Junghwan Kim, Inwoong Lee, Jongyoo Kim, Heeseok Oh, Haksub Kim, Sanghoon Lee

Simplification of 3D Multichannel Sound System Based on Multizone Soundfield Reproduction

Home sound environments are becoming increasingly important to the entertainment and audio industries. Compared with single-zone soundfield reproduction, 3D spatial multizone soundfield reproduction with few loudspeakers is a more complex and challenging problem. In this paper, we introduce a simplification method based on least-squares sound pressure matching with which two separate zones can be reproduced accurately. For the NHK 22.2 system, fourteen loudspeaker arrangements from 22 down to 8 channels are derived. Simulation results demonstrate favorable performance for two-zone soundfield reproduction, and subjective evaluation shows that the soundfield at two listeners' heads can be reproduced very well down to 10 channels, while 8-channel systems still keep distortion at the ears low. Compared by subjective evaluation with Ando's multichannel conversion method, our proposed method is very close to Ando's in terms of sound localization in the center zone; moreover, sound localization is improved significantly in the off-center zone.
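A minimal sketch of regularised least-squares pressure matching, the technique this abstract builds on: given a matrix of transfer functions from loudspeakers to control points, solve for loudspeaker weights that reproduce a desired pressure field (the free-field Green's function model and all geometry below are illustrative assumptions):

```python
import numpy as np

def pressure_matching_weights(G, p_des, reg=1e-3):
    """Least-squares loudspeaker weights: minimise
    ||G w - p_des||^2 + reg ||w||^2, where G[m, l] is the transfer
    function from loudspeaker l to control point m."""
    A = G.conj().T @ G + reg * np.eye(G.shape[1])
    return np.linalg.solve(A, G.conj().T @ p_des)

# toy free-field model: G ~ exp(-j k r) / r for a random geometry
rng = np.random.default_rng(0)
k = 2 * np.pi * 1000 / 343.0                  # wavenumber at 1 kHz
spk = rng.uniform(-2, 2, size=(12, 3))        # 12 loudspeaker positions
pts = rng.uniform(-0.2, 0.2, size=(20, 3))    # 20 control points (two zones)
r = np.linalg.norm(pts[:, None, :] - spk[None, :, :], axis=2)
G = np.exp(-1j * k * r) / r
p_des = G @ (rng.random(12) + 0j)             # a reproducible desired field
w = pressure_matching_weights(G, p_des, reg=1e-6)
err = np.linalg.norm(G @ w - p_des) / np.linalg.norm(p_des)
```

Simplifying a 22.2 layout then amounts to repeating this solve for candidate subsets of loudspeakers and comparing the residual errors in the two zones.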

Bowei Fang, Xiaochen Wang, Song Wang, Ruimin Hu, Yuhong Yang, Cheng Yang

Multi-channel Object-Based Spatial Parameter Compression Approach for 3D Audio

To improve the spatial precision of three-dimensional (3D) audio, the bit rates of spatial parameters have increased sharply. This paper presents a compression approach that decreases the bit rates of spatial parameters for 3D audio. Based on spatial direction filtering and spatial side-information clustering, a new multi-channel object-based spatial parameter compression approach (MOSPCA) is presented, through which the spatial parameters of different frequency bands within a frame that belong to the same sound source are compressed into one spatial parameter. Experiments show that the compression ratio of the spatial parameters can reach 7:1, compared with 1.4:1 for MPEG Surround and S³AC (spatial squeeze surround audio coding), while transparent spatial perception is maintained.

Cheng Yang, Ruimin Hu, Liuyue Su, Xiaochen Wang, Maosheng Zhang, Shenming Qu

A FPGA Based High-Speed Binocular Active Vision System for Tracking Circle-Shaped Target

With the development of digital image processing technology, computer vision has been widely applied in various areas. Active vision is one of its main research fields and can be used in different scenes, such as airports and ball games. FPGAs (Field Programmable Gate Arrays) are widely used in computer vision for their high speed and their ability to process large amounts of data. In this paper, a novel FPGA-based high-speed binocular active vision system for tracking circle-shaped targets is introduced. Specifically, our active vision system consists of three parts: target tracking, coordinate transformation, and pan-tilt control. The system can process 1000 successive frames per second, and it tracks the target and keeps it at the center of the image for attention.

Zhengyang Du, Hong Lu, Haowei Yuan, Wenqiang Zhang, Chen Chen, Kongye Xie

The Extraction of Powerful and Attractive Video Contents Based on One Class SVM

With the rapid increase of video data, it is difficult for people to quickly find a video they would like to watch. Existing video summarization methods can help viewers, but they mainly condense content from the start to the end of the whole video. Viewers may hardly be interested in scanning such summary videos; they want to see the interesting or exciting content in a shorter time. In this paper, we propose a video summarization approach that extracts powerful and attractive content based on deep learning features and a One-Class SVM (OCSVM). Extensive experiments demonstrate that our approach extracts powerful and attractive content effectively and performs well at generating attractive summary videos, and at the same time we provide a benchmark for powerful-content extraction.

Xingchen Liu, Xiaonan Song, Jianmin Jiang

Blur Detection Using Multi-method Fusion

A new methodology for blur detection based on multi-method fusion is presented in this paper. The research is motivated by the observation that no single method gives the best performance in all situations. We try to discover the underlying complementary performance patterns of several state-of-the-art methods and then use the pattern specific to each image to obtain a better overall result. Specifically, a Conditional Random Field (CRF) framework is adopted for multi-method blur detection that models not only the contribution of each individual blur detection result but also the interrelation between neighbouring pixels. Considering that the fusion depends on the specific image, we single out a subset of images similar to the input image from a training dataset and train the CRF-based fusion model only on this subset instead of the whole training dataset. The proposed multi-method fusion approach is shown to consistently outperform each individual blur detection method on public blur detection benchmarks.

Yinghao Huang, Hongxun Yao, Sicheng Zhao

Motion Vector and Players’ Features Based Particle Filter for Volleyball Players Tracking in 3D Space

Tracking multiple players is key to volleyball analysis. To develop effective tactics for professional events, players' 3D information such as speed and trajectory is needed. Although 3D information can resolve the occlusion-relation problem, complete occlusion and similar appearance between players may still reduce tracking accuracy. This paper therefore proposes a particle filter based on motion vectors and players' features for tracking multiple players in 3D space. For the prediction step, a motion-vector prediction model combined with a Gaussian window model is proposed to predict a player's position after occlusion. For the likelihood estimation step, a 3D distance likelihood model is proposed to avoid mistakenly swapping two players, and a number-detection likelihood model is used to distinguish players. With the proposed algorithm, not only is the occlusion-relation problem solved, but the physical features of players in the real world can also be obtained. Experiments on an official volleyball match video (the final of the 2014 Japan Inter High School Games of Men's Volleyball in the Tokyo Metropolitan Gymnasium) show that our tracking algorithm achieves 91.9% and 92.6% success rates in the first and third sets.

Xizhou Zhuang, Xina Cheng, Shuyi Huang, Masaaki Honda, Takeshi Ikenaga

A Novel Edit Propagation Algorithm via L0 Gradient Minimization

In this paper, we study how to perform edit propagation using L0 gradient minimization. Existing propagation methods take only simple constraints into consideration and neglect image structure information. We propose a new optimization framework based on L0 gradient minimization that can globally satisfy user-specified edits while controlling the number of non-zero gradients. In this process, a modified affinity matrix approximation method that efficiently reduces randomness is introduced. We also introduce a self-adaptive re-parameterization scheme to control the gradient counts based on both the original image and the user inputs. Our approach is demonstrated on image recoloring and tonal value adjustment. Numerous experiments show that our method significantly improves edit propagation via L0 gradient minimization.

Zhenyuan Guo, Haoqian Wang, Kai Li, Yongbing Zhang, Xingzheng Wang, Qionghai Dai

Improved Salient Object Detection Based on Background Priors

Recently, many saliency detection models have used the image boundary as an effective background prior for saliency extraction. However, these models may fail when the salient object overlaps the boundary. In this paper, we propose a novel saliency detection model that computes the contrast between superpixels using background priors and introduces a refinement method to address this problem. First, the SLIC (Simple Linear Iterative Clustering) method is used to segment the input image into superpixels. Then, the feature difference between superpixels is calculated based on color histograms. The initial saliency value of each superpixel is computed as the sum of the feature differences between this superpixel and those on the image boundary. Finally, a saliency map refinement method reassigns the saliency value of each image pixel to obtain the final saliency map. Compared with other state-of-the-art saliency detection methods, the proposed method provides better saliency predictions as measured by precision, recall and F-measure on two widely used datasets.
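The boundary-prior contrast step this abstract describes can be sketched as follows; for simplicity a regular grid stands in for SLIC superpixels, and a grayscale histogram replaces the color histogram (both are illustrative simplifications):

```python
import numpy as np

def boundary_prior_saliency(labels, image, n_bins=8):
    """Initial saliency of each region: the sum of colour-histogram
    differences to the regions touching the image boundary.
    `labels` is a segment map; values in `image` lie in [0, 1]."""
    ids = np.unique(labels)
    hists = {}
    for i in ids:
        h, _ = np.histogram(image[labels == i], bins=n_bins, range=(0, 1))
        hists[i] = h / h.sum()
    boundary = set(np.unique(np.concatenate(
        [labels[0], labels[-1], labels[:, 0], labels[:, -1]])))
    sal = {i: sum(np.abs(hists[i] - hists[b]).sum() for b in boundary)
           for i in ids}
    m = max(sal.values()) or 1.0
    return {i: s / m for i, s in sal.items()}   # normalised to [0, 1]

# toy image: bright square on dark background, 4x4 grid "superpixels"
img = np.zeros((64, 64)); img[24:40, 24:40] = 1.0
lab = (np.arange(64) // 16)[:, None] * 4 + (np.arange(64) // 16)[None, :]
sal = boundary_prior_saliency(lab, img)
```

Regions containing the bright square receive high saliency, while the all-dark boundary regions score zero; the paper's refinement step would then propagate these values back to pixels.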

Tao Xi, Yuming Fang, Weisi Lin, Yabin Zhang

Position-Patch Based Face Hallucination via High-Resolution Reconstructed-Weights Representation

Position-patch based face hallucination methods aim to reconstruct the high-resolution (HR) patch of each low-resolution (LR) input patch independently using the optimal linear combination of training patches at the same position. Most current approaches directly use reconstruction weights learned from the LR training set to generate HR face images, without considering the structural difference between the LR and HR feature spaces. However, it is reasonable to assume that using HR images for weight learning would benefit the reconstruction process, because the HR feature space generally contains much more information. Therefore, in this paper we propose a novel representation scheme, called High-resolution Reconstructed-weights Representation (HRR), that allows us to refine an intermediate HR image into a more accurate one. The HR reconstruction weights can be obtained effectively by solving a least-squares problem. Our evaluations on publicly available face databases demonstrate favorable performance compared with previous position-patch based methods.

Danfeng Wan, Yao Lu, Javaria Ikram, Jianwu Li

Real-Time Rendering of Layered Materials with Linearly Filterable Reflectance Model

This paper proposes a real-time system for rendering layered materials with a linearly filterable reflectance model. The model effectively captures both surface and subsurface reflections and supports smooth transitions across different resolutions. In a preprocessing stage, we build mip-map structures for both surface and subsurface mesostructures by fitting their bumpiness with mixtures of von Mises-Fisher (movMF) distributions. In particular, a movMF convolution algorithm and a movMF reduction algorithm are provided to approximate the visually perceived bumpiness of the subsurface well, with controllable rendering complexity. Both surface and subsurface reflections are then implemented on GPUs with real-time performance. Experimental results show that our approach enables aliasing-free illumination under environment lighting at different scales.

Jie Guo, Jinghui Qian, Jingui Pan

Hybrid Lossless-Lossy Compression for Real-Time Depth-Sensor Streams in 3D Telepresence Applications

We developed and evaluated different schemes for the real-time compression of multiple depth image streams. Our analysis suggests that a hybrid lossless-lossy compression approach provides a good trade-off between quality and compression ratio. Lossless compression based on run-length encoding preserves the information in the highest bits of the depth image pixels, while the lowest 10 bits of each depth value are directly encoded in the Y channel of a YUV image and compressed with an x264 codec. Our experiments show that the proposed method can encode and decode multiple depth image streams in less than 12 ms on average. Depending on the compression level, which can be adjusted at application runtime, we achieve compression ratios of about 4:1 to 20:1. Initial results indicate that the quality of the 3D reconstructions is almost indistinguishable from the original for compression ratios of up to 10:1.
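The bit-plane split driving this hybrid scheme can be sketched as follows (the 16-bit depth format and the 6/10 bit split are assumptions; here the low bits are kept verbatim where the real system would hand them to x264):

```python
def split_depth(depth_values):
    """Split 16-bit depth samples: the high 6 bits go to a losslessly
    coded plane (run-length encoded here); the low 10 bits would be
    written to the Y channel of a YUV frame for the lossy x264 path."""
    high = [(d >> 10) & 0x3F for d in depth_values]
    low = [d & 0x3FF for d in depth_values]
    # run-length encode the high-bit plane, which is piecewise constant
    rle, run = [], 1
    for prev, cur in zip(high, high[1:]):
        if cur == prev:
            run += 1
        else:
            rle.append((prev, run)); run = 1
    rle.append((high[-1], run))
    return rle, low

def merge_depth(rle, low):
    """Reassemble depth samples from the RLE high plane and low-bit plane."""
    high = [v for v, n in rle for _ in range(n)]
    return [(h << 10) | l for h, l in zip(high, low)]

depths = [1000, 1000, 1001, 5000, 5000, 4464]
rle, low = split_depth(depths)
restored = merge_depth(rle, low)
```

Because neighbouring depth pixels usually share their high bits, the RLE plane compresses very well, while quantisation noise from the lossy path is confined to the low 10 bits.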

Yunpeng Liu, Stephan Beck, Renfang Wang, Jin Li, Huixia Xu, Shijie Yao, Xiaopeng Tong, Bernd Froehlich

Marginal Fisher Regression Classification for Face Recognition

This paper presents a novel marginal Fisher regression classification (MFRC) method that incorporates the ideas of marginal Fisher analysis (MFA) and linear regression classification (LRC). The MFRC minimizes the within-class compactness over the between-class separability to find an optimal embedding matrix for the LRC, so that the LRC in that subspace achieves high discrimination for classification. Specifically, the within-class compactness is measured by the sum of LRC distances between each sample and its neighbors within the same class, and the between-class separability is characterized by the sum of LRC distances between margin points and their neighboring points from different classes. The MFRC thus embodies the ideas of the LRC, Fisher analysis and manifold learning. Experiments on the FERET, PIE and AR datasets demonstrate the effectiveness of the MFRC.

Zhong Ji, Yunlong Yu, Yanwei Pang, Yingming Li, Zhongfei Zhang

Temporally Adaptive Quantization Algorithm in Hybrid Video Encoder

In video coders, inter-frame prediction causes distortion to propagate among adjacent frames, and this distortion dependency is a crucial factor for rate control and video coding algorithm optimization. The macroblock tree (MBTree) is a typical temporal quantization control algorithm, in which a quantization offset is applied to each block according to the amount of distortion propagation, measured by the relative propagation cost. An appropriate model relating the offset to the propagation cost is the key to MBTree-like adaptive quantization algorithms. The default model in the MBTree algorithm is designed empirically, with rough accuracy and insufficient generality across different input sources. This paper focuses on this problem: we apply a competitive decision mechanism to explore the optimal model, and then propose an improved model based on rate-distortion optimization. The simulation results verify that the improved MBTree algorithm with the proposed model achieves up to 0.14 dB BD-PSNR improvement and 0.29 dB BD-SSIM improvement. The proposed algorithm achieves better temporal bit allocation and reduces distortion fluctuation in the temporal domain, achieving adaptive quantization control.

Haibing Yin, Zhongxiao Wang, Zhelei Xia, Ye Shen

Semi-automatic Labeling with Active Learning for Multi-label Image Classification

For multi-label image classification, we use active learning to select example-label pairs whose labels are then acquired from experts. The core of active learning is to select the most informative examples to request labels for. Most previous studies on active learning for multi-label classification have two shortcomings: they do not pay enough attention to label correlations, and existing example-label selection methods predict all the remaining labels of the selected example-label pair, which leads to poor classification performance when the number of labels is large. In this paper, we propose a semi-automatic labeling multi-label active learning (SLMAL) algorithm. SLMAL first integrates uncertainty and label informativeness to select example-label pairs to request labels for; it then chooses the most uncertain example-label pair and predicts its partial labels using its nearest neighbor. Our empirical results demonstrate that SLMAL outperforms state-of-the-art active learning methods for multi-label classification: it significantly reduces the labeling workload and improves the performance of the learned classifier.
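The pair-selection idea can be sketched as a score that multiplies prediction uncertainty by a label-informativeness term; the particular scoring functions below are illustrative assumptions, not the paper's exact criteria:

```python
import numpy as np

def select_pair(probs, label_counts):
    """Pick the most informative example-label pair: combine prediction
    uncertainty (closeness of p to 0.5) with a label-informativeness term
    that favours rarely annotated labels.
    probs: (n_examples, n_labels) predicted probabilities;
    label_counts: how many times each label has already been queried."""
    uncertainty = 1.0 - np.abs(probs - 0.5) * 2          # 1 at p=0.5, 0 at 0/1
    label_info = 1.0 / (1.0 + np.asarray(label_counts))  # rarer label -> larger
    score = uncertainty * label_info[None, :]
    return np.unravel_index(np.argmax(score), score.shape)

probs = np.array([[0.95, 0.50, 0.10],
                  [0.52, 0.90, 0.49]])
counts = [5, 5, 0]                 # label 2 has never been queried
ex, lab = select_pair(probs, counts)
```

After the expert answers the selected pair, the semi-automatic step would fill in the example's remaining labels from its nearest labeled neighbor instead of querying them all.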

Jian Wu, Chen Ye, Victor S. Sheng, Yufeng Yao, Pengpeng Zhao, Zhiming Cui

A New Multi-modal Technique for Bib Number/Text Detection in Natural Images

The detection and recognition of racing bib numbers/text printed on paper, cardboard tags or t-shirts in natural images of marathons, races and sports is challenging due to person movement, non-rigid surfaces, distortion from non-uniform illumination, severe occlusions, orientation variations, etc. In this paper, we present a multi-modal technique that combines biometric and textual features to achieve good bib number/text detection. We exploit face and skin features in a new way to identify text candidate regions in input natural images. For each text candidate region, we apply text detection and recognition methods to detect and recognize bib numbers/text, respectively. To validate the usefulness of the proposed multi-modal technique, we conduct text detection and recognition experiments both before and after text-candidate-region detection, in terms of recall, precision and f-measure. Experimental results show that the proposed multi-modal technique outperforms the existing bib number detection method.

Sangheeta Roy, Palaiahnakote Shivakumara, Prabir Mondal, R. Raghavendra, Umapada Pal, Tong Lu

A New Multi-spectral Fusion Method for Degraded Video Text Frame Enhancement

Text detection and recognition in degraded video is complex and challenging due to lighting effects and sensor and motion blurring. This paper presents a new method that derives multi-spectral images from each input video frame by studying non-linear intensity values in the Gray, R, G and B color spaces to increase the contrast of text pixels, resulting in four multi-spectral images. We then propose multiple fusion criteria for the four multi-spectral images to enhance text information in degraded video frames, and apply a median operation to obtain a single image from the results of the multiple fusion criteria, which we name fusion-1. We further apply k-means clustering to the images obtained by the multiple fusion criteria to identify text clusters, which yields binary images; the same median operation then fuses these binary images into a single one, which we name fusion-2. We evaluate the enhanced images at fusion-1 and fusion-2 using quality measures such as Mean Square Error, Peak Signal-to-Noise Ratio and Structural Similarity. Furthermore, the enhanced images are validated through text detection and recognition accuracies on video frames to show the effectiveness of the enhancement.
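The per-pixel median fusion used at both fusion stages, together with the PSNR quality measure used for evaluation, can be sketched as follows (the toy 2x2 frames merely stand in for the four multi-spectral images):

```python
import numpy as np

def median_fusion(images):
    """Fuse several same-sized images (e.g. the four multi-spectral
    images from Gray, R, G, B remappings) by a per-pixel median."""
    return np.median(np.stack(images, axis=0), axis=0)

def psnr(ref, img, peak=255.0):
    """Peak Signal-to-Noise Ratio between a reference and a test image."""
    mse = np.mean((ref - img) ** 2)
    return float('inf') if mse == 0 else 10 * np.log10(peak ** 2 / mse)

# toy 2x2 frames standing in for the four multi-spectral images;
# the outlier value 200 is suppressed by the median
imgs = [np.full((2, 2), v, dtype=float) for v in (10, 20, 200, 30)]
fused = median_fusion(imgs)
```

The median makes the fusion robust to one spectral channel misbehaving on a given pixel, which is why the same operation works for both the grayscale (fusion-1) and binary (fusion-2) stages.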

Yangbing Weng, Palaiahnakote Shivakumara, Tong Lu, Liang Kim Meng, Hon Hock Woon

A Robust Video Text Extraction and Recognition Approach Using OCR Feedback Information

Video text carries important semantic information that provides precise and meaningful clues for video indexing and retrieval. However, most previous approaches perform video text extraction and recognition separately, and the main difficulty, extraction and recognition against complex backgrounds, is not handled well. In this paper, we address this difficulty by combining text extraction and recognition and by using OCR feedback information. Our approach has the following features: (i) an efficient character image segmentation method is proposed that takes most prior knowledge into consideration; (ii) text extraction is performed both on text rows and on segmented single-character images, since text-row-based extraction maintains the color consistency of characters and backgrounds while single characters have simpler backgrounds; the best binary image is then chosen for recognition using OCR feedback; (iii) the K-means algorithm is used for extraction, ensuring that the best extraction result, the binary image with a clear separation of text strokes and background, is included. Finally, extensive experiments and empirical evaluations on several video text images demonstrate the satisfying performance of the proposed approach.

Guangyu Gao, He Zhang, Hongting Chen

Color and Active Infrared Vision: Estimate Infrared Vision of Printed Color Using Bayesian Classifier and K-Nearest Neighbor Regression

The inability of active infrared vision to see physical colors has long been considered a major drawback, or simply ignored until very recently. Looking at this color blindness from another perspective, we propose the idea of a novel medium whose visibility in both the visible and active infrared light spectra can be controlled, enabling vision-based techniques to transform everyday printed media into smart, eco-friendly and sustainable monitor-like interactive displays.

To begin with, this paper examines the most important procedure for the idea's success: estimating how physical colors look when seen by an active infrared camera. Two alternative methods are proposed and evaluated. The first uses a Bayesian classifier to find color-attribute combinations that can precisely classify our sample data. The second relies on simple weighted averaging and k-nearest neighbor regression in two color models, RGB and CIE L*a*b*. The experimental results suggest that the second method is more practical and more consistent at different distances. They also show that the model created in this work is likely able to estimate the infrared appearance of colors printed on different materials.

Thitirat Siriborvornratanakul

Low Bitrates Audio Bandwidth Extension Using a Deep Auto-Encoder

Modern audio coding technologies apply bandwidth extension (BWE) methods to represent audio data efficiently at low bitrates. An established method is the well-known spectral band replication (SBR), which provides very high sound quality with imperceptible artifacts; however, its bitrates and complexity are high. Another widely used method is LPC-based BWE, part of the 3GPP AMR-WB+ codec; although its bitrates and complexity are much lower, the sound quality it provides is unsatisfactory for music. In this paper, a novel bandwidth extension method is proposed that provides sound quality close to eSBR at a bitrate of only 0.8 kbps. The proposed method predicts the fine structure of the high-frequency band from the low-frequency band with a deep auto-encoder, and transmits only the high-frequency envelope as side information. The performance evaluation demonstrates the advantage of the proposed method over the state of the art: compared with eSBR, the bitrate drops by about 63% while the subjective listening quality remains close; compared with LPC-based BWE at the same bitrate, the subjective listening quality is better.

Lin Jiang, Ruimin Hu, Xiaochen Wang, Maosheng Zhang

Part-Aware Segmentation for Fine-Grained Categorization

It is difficult to segment images of fine-grained objects due to their highly varied appearance. Common segmentation methods can hardly separate the part regions of an instance from the background with sufficient accuracy, yet these parts are crucial for fine-grained recognition. Observing that fine-grained objects share the same configuration of parts, we present a novel part-aware segmentation method that obtains a foreground segmentation from a bounding box while preserving semantic parts. We first design a hybrid part localization method that combines parametric and non-parametric models. We then iteratively update the segmentation outputs and the part proposals, which yields better foreground segmentation results. Experiments demonstrate the superiority of the proposed method over state-of-the-art approaches.

Cheng Pang, Hongxun Yao, Zhiyuan Yang, Xiaoshuai Sun, Sicheng Zhao, Yanhao Zhang

Improved Compressed Sensing Based 3D Soft Tissue Surface Reconstruction

This paper presents a 3D soft tissue surface reconstruction method based on improved compressed sensing and radial basis function interpolation for a small number of uniformly sampled data points on a 3D surface. We adopt radial basis function interpolation to obtain the same number of data points as are to be reconstructed and propose an improved compressed sensing method for 3D surface reconstruction: we design a deterministic measurement matrix for signal observation, adopt the discrete cosine transform for sparse representation of the 3D coordinates, and use a weak-selection regularized orthogonal matching pursuit algorithm for reconstruction. Experimental results show that the proposed algorithm improves both the resolution and the accuracy of the surface. The average maximum error is less than 0.9012 mm, which is smooth enough to provide an accurate surface data model for virtual-reality-based surgery systems.

Sijiao Yu, Zhiyong Yuan, Qianqian Tong, Xiangyun Liao, Yaoyi Bai

Constructing Learning Maps for Lecture Videos by Exploring Wikipedia Knowledge

Videos are commonly used as course materials for e-learning, but in most existing systems the lecture videos are presented in a linear manner. Structuring the video corpus has proven to be an effective way for learners to browse it conveniently and design their learning strategies. However, content analysis of lecture videos is difficult due to the low recognition rates of speech and handwritten text and the resulting noisy information. In this paper, we explore the use of external domain knowledge from Wikipedia to construct learning maps for online learners. First, with the external knowledge, we filter the noisy texts extracted from videos to form a more precise and elegant representation of the video content, which helps us construct a more accurate video map representing the domain knowledge of the course. Second, by combining the video information with external academic articles on the domain concepts, we construct a directed map showing the relationships between concepts, which helps online learners design their learning strategies and search for target concepts and related videos. Our experiments demonstrate that external domain knowledge helps organize the lecture video corpus and construct more comprehensive knowledge representations, improving the learning experience of online learners.

Feng Wang, Xiaoyan Li, Wenqiang Lei, Chen Huang, Min Yin, Ting-Chuen Pong

Object Tracking via Combining Discriminative Global and Generative Local Models

In this paper, in order to track objects undergoing rotation and pose changes, we propose a novel algorithm that combines a discriminative global model with a generative local model. First, we exploit wavelet approximation coefficients and the completed local binary pattern (CLBP) to represent the object's global features. With the obtained global appearance descriptor, we use online discriminative metric learning to differentiate the target object from the background. To avoid the drift problem that results from the global discriminative model, a novel generative spatial-geometric local model is introduced: based on SURF features, it quantizes geometric structure information in scale and angle. We then combine the global and local models so that they benefit from each other. Compared with several other tracking algorithms, the experimental results demonstrate that the proposed algorithm tracks the target object reliably, especially under pose changes and rotation.

Liujun Zhao, Qingjie Zhao

Tracking Deformable Target via Multi-cues Active Contours

In this study, we present a novel multi-cue active contour based method for tracking target contours using edge, region, and shape information. To locate the target position, a contour based mean-shift tracker is designed which combines both color and texture information. To reduce the adverse impact of a complicated background and accelerate the curve evolution, we extract a rough target region from the incoming frame using the proposed target appearance model. Moreover, both a discriminative pre-learning based global layer and a voting based local layer are integrated into our appearance model. To obtain detailed target boundaries, we embed edge, region, and shape information into a level-set based multi-cue active contour model (MCAC). Experiments on seven video sequences demonstrate that the proposed method outperforms other competitive contour tracking methods under various tracking environments.

Peng Lv, Qingjie Zhao

Person Re-identification via Attribute Confidence and Saliency

Person re-identification is the problem of recognising and associating persons across different cameras. Existing methods usually rely on visual appearance features to address this issue, but such visual descriptions are sensitive to environmental variation. In comparison, semantic attributes are more robust in complicated environments. Several attribute-based methods have therefore been introduced, but most of them ignore the diversity of different attributes. We characterize this diversity as two-fold: the attribute confidence, which denotes descriptive power, and the attribute saliency, which expresses discriminative power. Specifically, the attribute confidence is determined by the performance of each attribute classifier, and the attribute saliency is defined by the attribute's occurrence frequency, similar to the IDF (Inverse Document Frequency) [1] idea in information retrieval. Each attribute is then assigned an appropriate weight according to its saliency and confidence when calculating similarity distances. Based on the above considerations, a novel person re-identification method is proposed. Experiments conducted on two benchmark datasets validate the effectiveness of the proposed method.

Jun Liu, Chao Liang, Mang Ye, Zheng Wang, Yang Yang, Zhen Han, Kaimin Sun
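The IDF-style weighting described above can be sketched in a few lines. The binary attribute vectors and the classifier confidences below are invented toy values; the paper derives confidences from actual classifier performance.

```python
import numpy as np

# Sketch of attribute saliency (IDF-like: rare attributes score higher)
# combined with attribute confidence into a weighted similarity distance.

def attribute_saliency(gallery):
    """gallery: (n_persons, n_attrs) binary attribute matrix."""
    counts = gallery.sum(axis=0)                  # occurrence per attribute
    return np.log((1.0 + len(gallery)) / (1.0 + counts))

def weighted_distance(a, b, saliency, confidence):
    w = saliency * confidence                     # combine both diversities
    return float(np.sum(w * np.abs(a - b)))

gallery = np.array([[1, 1, 0],
                    [1, 0, 0],
                    [1, 1, 1]])
sal = attribute_saliency(gallery)                 # attr 0 is ubiquitous -> 0
conf = np.array([0.9, 0.8, 0.7])                  # hypothetical accuracies
d = weighted_distance(gallery[0], gallery[1], sal, conf)
```

An attribute every person carries contributes nothing to the distance, mirroring how IDF discounts words that appear in every document.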

Light Field Editing Based on Reparameterization

Edit propagation algorithms are a powerful tool for performing complex edits with a few coarse strokes. However, current methods fail when dealing with light fields, since they do not account for view consistency and cannot handle the large amount of data involved. In this work we propose a new scalable algorithm for light field edit propagation, based on reparametrizing the input light field so that the coherence of the edits in the angular domain is preserved. We then handle the large size and dimensionality of the light field with a downsampling-upsampling approach, in which the edits are propagated in a reduced version of the light field and then upsampled to the original resolution. We demonstrate in several experiments that our method improves angular consistency.

Hongbo Ao, Yongbing Zhang, Adrian Jarabo, Belen Masia, Yebin Liu, Diego Gutierrez, Qionghai Dai
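The downsampling-upsampling strategy can be sketched on a 2D slice instead of a full 4D light field. The propagation rule here is a toy brightness edit and the nearest-neighbor upsampling is a deliberate simplification; the paper's propagation and upsampling operators are more sophisticated.

```python
import numpy as np

# Sketch: compute the edit at low resolution, then lift it back to the
# original resolution (nearest-neighbor repeat as a stand-in upsampler).

def downsample(img, f):
    return img[::f, ::f]

def propagate_edit(img, mask_thresh, gain):
    out = img.astype(float).copy()
    out[img > mask_thresh] *= gain        # toy edit: brighten bright pixels
    return out

def upsample(img, f):
    return np.repeat(np.repeat(img, f, axis=0), f, axis=1)

lf_slice = np.arange(16).reshape(4, 4)
small = downsample(lf_slice, 2)           # edits propagated at low res
edited = propagate_edit(small, 5, 2.0)
full = upsample(edited, 2)                # back to original resolution
```

The payoff is that the expensive propagation runs on a fraction of the samples, which is what makes the approach scale to full light fields.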

Interactive Animating Virtual Characters with the Human Body

This paper presents a novel interactive motion mapping system that maps human motion to virtual characters with different body part sizes, topologies, and geometries. Our method is especially effective for characters whose bodies are disproportionate to the human structure. To achieve this, we propose an improved Embedded Deformation algorithm to control virtual characters in real time. In the preprocessing stage, we construct a deformation subgraph for each body part and then merge the subgraphs into a connected deformation graph; these steps are entirely automatic and need to be performed only once. At runtime, we use the Kinect to track human skeletal joints and iteratively solve for the rotation matrix and translation vector of each deformation graph node. We then update the mesh vertex positions and normals. We demonstrate the flexibility and versatility of our method on a variety of virtual characters.

Hao Jiang, Lei Zhang

Visual Understanding and Recognition on Big Data


Fast Graph Similarity Search via Locality Sensitive Hashing

Similarity search in graph databases has been widely studied in graph query processing in recent years. With the fast accumulation of graph databases, it is worthwhile to develop a fast algorithm to support similarity search in large-scale graph databases. In this paper, we study the k-NN similarity search problem via locality sensitive hashing. We propose a fast graph search algorithm, which first transforms complex graphs into vectorial representations based on prototypes in the database and then accelerates query processing in Euclidean space by employing locality sensitive hashing. Additionally, a general retrieval framework is established in our approach. Experiments on three real datasets show that the presented algorithm achieves high performance in both accuracy and efficiency.

Boyu Zhang, Xianglong Liu, Bo Lang
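The two-stage pipeline above (prototype embedding, then hashing) can be sketched as follows. The degree-histogram distance is a toy stand-in for the paper's graph distance measure, and the random-hyperplane codes are one standard LSH family, used here only to show how near-identical graphs collide into the same bucket.

```python
import numpy as np

# Stage 1: embed each graph as its vector of distances to prototypes.
# Stage 2: hash the vectors with random hyperplanes (sign-based LSH).

rng = np.random.default_rng(0)

def degree_hist(adj, bins=4):
    deg = adj.sum(axis=1)
    return np.bincount(deg.astype(int), minlength=bins)[:bins]

def embed(adj, prototypes):
    h = degree_hist(adj)
    return np.array([np.abs(h - degree_hist(p)).sum() for p in prototypes])

def lsh_code(vec, planes):
    return tuple((planes @ vec > 0).astype(int))   # one bit per hyperplane

g1 = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
g2 = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])   # same structure as g1
protos = [np.zeros((3, 3), dtype=int), g1]
planes = rng.standard_normal((8, len(protos)))
```

At query time, only graphs falling into the same hash bucket are compared exactly, which is where the speedup over a linear scan comes from.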

Text Localization with Hierarchical Multiple Feature Learning

In this paper, we focus on English text localization in natural scene images. We propose a hierarchical localization framework which proceeds from characters to strings to words. Different from existing methods, which either bet on sophisticated hand-crafted features or rely on heavy learning models, our approach favors simple but effective features and learning models. In this study, we introduce two-level character structure features in collaboration with Histogram of Gradient (HOG) and Convolutional Neural Network (CNN) features for character localization. For string localization, a nine-dimensional string feature is proposed for discriminative verification after grouping characters. For the final word localization, we learn an optimal splitting strategy based on interval cues to split strings into words. Experiments on the challenging ICDAR benchmark datasets demonstrate the effectiveness and superiority of our approach.

Yanyun Qu, Li Lin, Weiming Liao, Junran Liu, Yang Wu, Hanzi Wang

Recognizing Human Actions by Sharing Knowledge in Implicit Action Groups

Most current action recognition approaches learn each action category separately. An important observation is that many action categories are correlated and can be clustered into groups, which is usually ignored, decreasing recognition accuracy. In this paper, we employ a multi-task learning framework with group-structured regularization to share knowledge within category groups. First, we represent action data with the Fisher Vector, formed by concatenating gradients with respect to the mean vectors and covariance matrices of a GMM. Intuitively, action categories in the same group tend to have a closer relationship with the same Gaussian components. The proposed method uses the one-vs-one SVM margin to measure the degree of similarity between each pair of categories and obtains the implicit group structure by Affinity Propagation clustering. To encourage categories in the same group to share feature dimensions from the same Gaussian components, and vice versa, the implicit group structure is used as a prior regularization in multi-task learning. Our experiments on the large and realistic HMDB51 dataset show that the proposed method achieves comparable or even higher accuracy with fewer feature dimensions than several state-of-the-art approaches.

RuiShan Liu, YanHua Yang, Cheng Deng
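The group-discovery step can be sketched with simplified stand-ins. The paper measures pairwise similarity with one-vs-one SVM margins and clusters with Affinity Propagation; here a class-mean distance substitutes for the margin, and a threshold-based merging pass substitutes for the clustering, purely to show how implicit groups emerge from pairwise similarities.

```python
import numpy as np

# Toy group discovery: negative class-mean distances act as similarities,
# and categories whose similarity exceeds a threshold are merged.

def pairwise_similarity(class_means):
    d = np.linalg.norm(class_means[:, None] - class_means[None, :], axis=2)
    return -d                                # closer classes = more similar

def group_by_threshold(sim, thresh):
    n = len(sim)
    labels = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > thresh:           # merge the two categories' groups
                old, new = labels[j], labels[i]
                labels = [new if l == old else l for l in labels]
    return labels

means = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])   # two close, one far
groups = group_by_threshold(pairwise_similarity(means), -1.0)
```

The resulting group labels would then parameterize the group-structured regularizer so that categories in the same group are pushed toward shared feature dimensions.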

Human Parsing via Shape Boltzmann Machine Networks

Human parsing is a challenging task because it is difficult to obtain accurate results for each part of the human body. Previous Boltzmann Machine based methods achieve good segmentation results but represent human parts poorly. In this paper, an approach is presented that exploits Shape Boltzmann Machine networks to improve the accuracy of human body parsing. The proposed Curve Correction method refines the final segmentation results. Experimental results show that the proposed method achieves good performance in body parsing, measured by Average Pixel Accuracy (aPA), against state-of-the-art methods on the Penn-Fudan Pedestrians dataset and the Pedestrian Parsing Surveillance Scenes dataset.

Qiurui Wang, Chun Yuan, Feiyue Huang, Chengjie Wang

Depth-Based Stereoscopic Projection Approach for 3D Saliency Detection

With the popularity of 3D displays and the widespread use of depth cameras, 3D saliency detection has become feasible and significant. Different from 2D saliency detection, 3D saliency detection adds a depth channel, so the influence of depth and binocular parallax must be taken into account. In this paper, a new depth-based stereoscopic projection approach is proposed for 3D visual salient region detection. 3D images reconstructed from color and depth images are projected onto the XOZ and YOZ planes along specific directions. We find some distinctive characteristics that help us remove the background and progressive surfaces, where depth increases gradually from near to far, so that salient regions are detected more accurately. A depth saliency map (DSM) is then created and combined with a 2D saliency map to obtain the final 3D saliency map. Our approach performs well in removing progressive surfaces and background, which are difficult to detect in 2D saliency detection.

Hongyun Lin, Chunyu Lin, Yao Zhao, Jimin Xiao, Tammam Tillo
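The XOZ projection can be sketched as a per-column depth occupancy count. The toy depth map below (a progressive ramp with a near object pasted in) is invented for illustration: the ramp fills a diagonal band in the projection while the object forms a compact blob, which is what makes the two separable.

```python
import numpy as np

# Sketch: project a depth map onto the XOZ plane by counting, for each
# image column x, how many pixels fall into each depth bin z.

def project_xoz(depth, n_bins):
    h, w = depth.shape
    proj = np.zeros((n_bins, w), dtype=int)
    bins = np.clip((depth * n_bins).astype(int), 0, n_bins - 1)
    for x in range(w):
        for z in bins[:, x]:
            proj[z, x] += 1
    return proj

depth = np.tile(np.linspace(0.1, 0.9, 8), (8, 1))   # progressive surface
depth[2:5, 3:5] = 0.2                               # near salient object
proj = project_xoz(depth, 10)
```

A symmetric `project_yoz` over rows instead of columns would complete the pair of projections used to build the depth saliency map.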

Coding and Reconstruction of Multimedia Data with Spatial-Temporal Information


Revisiting Single Image Super-Resolution Under Internet Environment: Blur Kernels and Reconstruction Algorithms

Due to limited network bandwidth, blurred and downsampled versions of high-resolution images are inevitably used for transmission over the internet, so single image super-resolution (SISR) algorithms play a vital role in reconstructing the lost spatial information of the low-resolution images. Recently, it has been recognized that the blur kernel is crucial to SISR performance. Since most existing SISR methods assume the blur kernel is known, while in practice the blur kernel is either fixed with the scaling factor or unknown, it is of high value to investigate the relationship between blur kernels and reconstruction algorithms. In this paper, we first propose a fast and effective SISR method based on a mixture of experts and then give an empirical study of the sensitivity of different SISR algorithms to the blur kernel. Specifically, we find that different algorithms have different sensitivities to the blur kernel, and the most suitable blur kernels differ across algorithms. Our findings highlight the importance of blur models for SISR algorithms and may benefit current spatial information coding methods in multimedia processing.

Kai Zhang, Xiaoyu Zhou, Hongzhi Zhang, Wangmeng Zuo
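The degradation model underlying the sensitivity study is y = (x * k) ↓ s: the high-resolution image x is convolved with a blur kernel k, then downsampled by stride s. A sketch under the assumption of an isotropic Gaussian kernel (the kernel family, radius, and sigma values are illustrative choices):

```python
import numpy as np

# Sketch of y = (x * k) downsampled by s, with a separable Gaussian k.
# Varying sigma produces the kernel mismatch the paper studies.

def gaussian_kernel(sigma, radius=3):
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur_downsample(img, sigma, stride):
    k = gaussian_kernel(sigma)
    out = np.apply_along_axis(lambda r: np.convolve(r, k, 'same'), 1, img)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, 'same'), 0, out)
    return out[::stride, ::stride]

hr = np.zeros((16, 16))
hr[8, 8] = 1.0                                  # impulse test image
lr_sharp = blur_downsample(hr, 0.5, 2)          # narrow kernel
lr_soft = blur_downsample(hr, 2.0, 2)           # wide kernel
```

An SISR model trained assuming one sigma and evaluated on images degraded with another sees exactly this kind of mismatch, which is what the empirical study measures.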

Prediction Model of Multi-channel Audio Quality Based on Multiple Linear Regression

Perceived audio quality is an important metric for measuring the perceptual degradation of multi-channel audio signals, especially in coding and rendering systems. Conventional objective quality measures such as PEAQ (Perceptual Evaluation of Audio Quality) are limited in describing both basic audio quality and spatial impression. A novel prediction model is proposed to predict the subjective quality of 5.1-channel audio systems. Two attributes are included in the evaluation: basic quality and surround effects. Multiple Linear Regression (MLR) combined with Principal Component Analysis (PCA) is used to establish the prediction model mapping objective parameters to subjective audio quality. The data set for model training and testing is obtained from formal listening tests under different coding conditions. Preliminary experimental results with 5.1-channel audio show that the proposed model predicts multi-channel audio quality more accurately than the conventional PEAQ method, considering both basic audio quality and surround effects.

Jing Wang, Yi Zhao, Wenzhi Li, Fei Wang, Zesong Fei, Xiang Xie
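The PCA-then-MLR pipeline can be sketched with plain numpy. The objective parameters and subjective scores below are synthetic (low-rank random data standing in for PEAQ-style parameters and listening-test MOS values), so the exact fit at the end is a property of the toy data, not a claim about real listening tests.

```python
import numpy as np

# Sketch: decorrelate objective parameters with PCA, then fit a multiple
# linear regression from the retained components to subjective scores.

rng = np.random.default_rng(1)

def pca_fit_transform(X, n_comp):
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_comp].T, Vt[:n_comp], mu

def mlr_fit(Z, y):
    A = np.hstack([Z, np.ones((len(Z), 1))])     # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

L = rng.standard_normal((40, 3))                 # latent quality factors
X = L @ rng.standard_normal((3, 6))              # correlated objective params
y = 2.0 * L[:, 0] + 0.5                          # synthetic MOS scores
Z, Vt, mu = pca_fit_transform(X, 3)              # rank-3 data: 3 comps suffice
coef = mlr_fit(Z, y)
pred = np.hstack([Z, np.ones((len(Z), 1))]) @ coef
```

On real data the number of retained components would be chosen by explained variance or cross-validation rather than known in advance.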

Physical Properties of Sound Field Based Estimation of Phantom Source in 3D

3D spatial sound effects can be achieved by amplitude panning over several loudspeakers, which can produce the auditory event of a phantom source at an arbitrary location using loudspeakers placed at arbitrary positions in 3D space. Estimating the phantom source means estimating the signal and location of a sound source that produces the same auditory perception as the phantom source created by the loudspeakers. Several methods have been proposed to estimate phantom sources, but they cannot ensure the conservation of sound energy at the listening point, which includes both kinetic energy (particle velocity) and potential energy (sound pressure), and estimation errors result. A new method to estimate the phantom source signal and position is proposed, based on the physical properties (particle velocity and sound pressure) produced by the loudspeakers at the listening point. Moreover, the proposed method is also applicable to arbitrary asymmetric loudspeaker arrangements. Experimental results show that, compared with current methods, the proposed method clearly reduces the estimation distortions of the phantom source location and the signal superposed by the loudspeakers.

Shanfa Ke, Xiaochen Wang, Li Gao, Tingzhao Wu, Yuhong Yang
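The two physical quantities at the listening point can be illustrated with a standard Gerzon-style analysis: the summed gains approximate sound pressure (potential energy), and the gain-weighted sum of loudspeaker direction vectors approximates particle velocity (kinetic energy), whose direction is a common phantom-source estimate. This is an illustration of the conservation idea, not the paper's exact estimator.

```python
import numpy as np

# Sketch: pressure = sum of gains; velocity vector = gain-weighted sum of
# unit direction vectors, normalized by pressure.

def phantom_direction(gains, directions):
    gains = np.asarray(gains, float)
    u = np.asarray(directions, float)      # unit vectors to each loudspeaker
    pressure = gains.sum()
    velocity = (gains[:, None] * u).sum(axis=0) / pressure
    return velocity / np.linalg.norm(velocity), pressure

# Two loudspeakers at +/-45 degrees with equal gains: the phantom source
# should sit straight ahead on the x-axis.
u = [[np.cos(np.pi / 4),  np.sin(np.pi / 4)],
     [np.cos(np.pi / 4), -np.sin(np.pi / 4)]]
direction, pressure = phantom_direction([0.5, 0.5], u)
```

Asymmetric gains or loudspeaker positions simply tilt the velocity vector, which is why this formulation extends naturally to arbitrary arrangements.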

Non-overlapped Multi-source Surveillance Video Coding Using Two-Layer Knowledge Dictionary

In multi-source surveillance videos, a large number of moving objects are captured by different surveillance cameras. Although the regions that the cameras cover seldom overlap, similarities among objects in different videos still result in tremendous global object redundancy. Coding each source independently is inefficient because it ignores the correlation among different videos. Therefore, a novel coding framework for multi-source surveillance videos using a two-layer knowledge dictionary is proposed. By analyzing the characteristics of multi-source surveillance videos over large spatial and temporal scales, a two-layer dictionary is built to exploit the global object redundancy. A dictionary-based coding method is then developed for moving objects: for any object in the multi-source surveillance videos, only a few pose parameters and sparse coefficients are required for object representation and reconstruction. Experiments with two simulated surveillance videos demonstrate that the proposed coding scheme achieves better coding performance than the main profile of HEVC while preserving better visual quality.

Yu Chen, Jing Xiao, Liang Liao, Ruimin Hu

Global Motion Information Based Depth Map Sequence Coding

Depth maps are currently exploited in 3D video coding and computer vision systems. In this paper, a novel depth map sequence coding method assisted by global motion information is proposed. The global motion of the depth camera is synchronously sampled to help the encoder improve depth map coding performance. The approach down-samples the frame rate at the encoder side; at the decoder side, each skipped frame is projected from its neighboring depth frames using the camera's global motion. Because the frame rate of the depth sequence is down-sampled, the coding rate-distortion performance is improved. Experimental results demonstrate that the proposed method enhances coding performance under various camera motion conditions, with gains of up to 2.04 dB.

Fei Cheng, Jimin Xiao, Tammam Tillo, Yao Zhao
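The decoder-side projection step can be sketched with a pinhole model: back-project each depth pixel to 3D, apply the sampled camera motion, and read off the new depth. The intrinsic matrix and the pure forward translation below are toy values, and a full implementation would also resample the moved points onto the new pixel grid rather than keeping the original pixel locations.

```python
import numpy as np

# Sketch: predict a skipped depth frame from a neighbor frame plus the
# camera's global motion (rotation R, translation t).

K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0,   0.0,  1.0]])          # toy pinhole intrinsics

def warp_depth(depth, R, t):
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    pts = np.linalg.inv(K) @ (pix * depth.reshape(-1))   # back-project to 3D
    moved = R @ pts + t[:, None]                          # apply camera motion
    return moved[2].reshape(h, w)                         # new depth values

depth = np.full((64, 64), 5.0)                # flat scene 5 units away
new_z = warp_depth(depth, np.eye(3), np.array([0.0, 0.0, -1.0]))
```

With a flat scene and a one-unit forward translation, every predicted depth value drops by exactly one unit, which is the sanity check the assertions below encode.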
