About this book

This book constitutes the refereed proceedings of the 15th Pacific Rim Conference on Multimedia, PCM 2014, held in Kuching, Malaysia, in December 2014. The 35 revised full papers and 6 short papers presented were carefully reviewed and selected from 84 submissions. The papers cover a wide range of topics in the area of multimedia content analysis, multimedia signal processing and communications, and multimedia applications and services. They have been organized into topical sections on video coding, annotation, image and photo, applications, people, image analysis and processing under extra help, nearest neighbor, neural networks, and audio. Also included are sections with best papers and posters and demonstrations.



Best Paper Session

Region-Based Interactive Ranking Optimization for Person Re-identification

Person re-identification, which aims to identify images of the same person from various cameras placed in different locations, has attracted plenty of attention in the multimedia community. Previous work mainly focuses on feature representation and distance measures, and achieves promising results on some standard databases. However, performance is still not good enough because of appearance changes caused by variations in illumination, pose, viewpoint and occlusion. This paper addresses the problem through result re-ranking by introducing user feedback. In particular, considering that positive samples are scarce and negative samples are globally similar in the person re-identification problem, we propose a region-based interactive ranking optimization method that improves the original query result by labeling locally similar and dissimilar image regions. Experiments conducted on two standard data sets validate the effectiveness of the proposed method, with an average improvement of 10-30% over the original baseline method. The results show that the ranking optimization algorithm is both an effective and an efficient way to improve the original person re-identification result.

Zheng Wang, Ruimin Hu, Chao Liang, Qingming Leng, Kaimin Sun

Adaptive Tag Selection for Image Annotation

Not all tags are relevant to an image, and the number of relevant tags is image-dependent. Although many methods have been proposed for image auto-annotation, the question of how to determine the number of tags to be selected per image remains open. The main challenge is that for a large tag vocabulary, there is often a lack of ground truth data for acquiring optimal cutoff thresholds per tag. In contrast to previous works that pre-specify the number of tags to be selected, we propose in this paper adaptive tag selection. The key insight is to divide the vocabulary into two disjoint subsets, namely a seen set consisting of tags having ground truth available for optimizing their thresholds and a novel set consisting of tags without any ground truth. Such a division allows us to estimate how many tags shall be selected from the novel set according to the tags that have been selected from the seen set. The effectiveness of the proposed method is justified by our participation in the ImageCLEF 2014 image annotation task. On a set of 2,065 test images with ground truth available for 207 tags, the benchmark evaluation shows that compared to the popular top-k strategy, which obtains an F-score of 0.122, adaptive tag selection achieves a higher F-score of 0.223. Moreover, by treating the underlying image annotation system as a black box, the new method can be used as an easy plug-in to boost the performance of existing systems.

Xixi He, Xirong Li, Gang Yang, Jieping Xu, Qin Jin

Twitter Food Photo Mining and Analysis for One Hundred Kinds of Foods

Many people post photos as well as short messages to Twitter every minute from all over the world. By monitoring the Twitter stream, we can obtain various kinds of images with texts. In this paper, as a case study of Twitter image mining for specific kinds of photos, we describe food photo mining from the Twitter stream. To collect food photos from Twitter, we monitor the Twitter stream to find tweets containing both food-related keywords and photos. After downloading the corresponding photos, we apply a “foodness” classifier and 100-class food classifiers to them to verify whether they represent foods or not. In the paper, we report the experimental results of our food photo mining on the Twitter photo data we have collected over two years and four months. As a result, we detected about 470,000 food photos from Twitter. With this data, we performed a spatio-temporal analysis of food photos.

Keiji Yanai, Yoshiyuki Kawano

Improving Color Constancy with Internet Photo Collections

Color constancy is the ability to measure colors of objects independent of the color of the light source. A well-known color constancy method makes use of the specular edge to estimate the illuminant. However, separating the specular edge from the input image is under-constrained, and existing methods require user assistance or handle only simple scenes. This paper presents an iterative weighted Specular-Edge color constancy scheme that leverages a large database of images gathered from the web. Given an input image, we execute an efficient visual search to find the closest visual context in the database and use the visual context as an image-specific prior. This prior is then used to correct the chromaticity of the input image before illumination estimation. Introducing the prior thus provides a good initial guess for the subsequent iterations. Next, a specular-map guided filter is used to improve the precision of specular edge weighting, which in turn increases the accuracy of the illuminant estimate. Finally, we evaluate our scheme on standard databases, and the results show that the proposed scheme outperforms state-of-the-art color constancy methods.

Shuai Fang, Chuanpei Zhou, Yang Cao, Zhengjun Zha

Video Coding

A Background Modeling Scheme Based on High Efficiency Motion Classification for Surveillance Video Coding

Recently, demand for high-efficiency video coding has grown with the explosive network bandwidth and storage requirements of surveillance video applications. In this paper, we propose a background modeling scheme based on high efficiency motion classification. Firstly, pixels at each location are classified into three motion states, namely the static, the gentle motion and the severe motion states, according to the motion vectors of the corresponding current block and neighboring blocks. Then, based on the classification and the pixel differential value, segmentation is performed for the co-located pixels in the training frames, and the mean pixel value of each segment is calculated. Finally, the background modeling frame is obtained by an optimized weighted average of the segmented mean pixel values. Experimental results show that our proposed scheme achieves an average PSNR gain of 0.65dB over the AVS surveillance baseline video encoder, and that it performs best among several high efficiency background modeling methods on sequences with fast motion and large foreground.

Pei Liao, Xiaofeng Huang, Huizhu Jia, Kaijin Wei, Binbin Cai, Guoqing Xiang, Don Xie

An Adaptive Perceptual Quantization Algorithm Based on Block-Level JND for Video Coding

It has been widely demonstrated that integrating efficient perceptual measures into the traditional video coding framework can improve subjective coding performance significantly. In this paper, we propose a novel block-level JND (just-noticeable-distortion) model, which not only adjusts pixel-level JND thresholds with more block characteristics, but also integrates them into a block-level model. The model is applied to adaptive perceptual quantization for video coding. Experimental results show that our model can save bit rates up to 24.5% on average with negligible degradation of the perceptual quality.

Guoqing Xiang, Xiaodong Xie, Huizhu Jia, Xiaofeng Huang, Jie Liu, Kaijin Wei, Yuanchao Bai, Pei Liao, Wen Gao

Improved Prediction Estimation Based H.264 to HEVC Intra Transcoding

High Efficiency Video Coding (HEVC) achieves significant coding efficiency improvement at the cost of much higher computational complexity. In this paper, we propose a prediction estimation based H.264 to HEVC intra transcoder. In HEVC intra coding, most of the complexity comes from the coding unit size decision. The proposed method decides the coding unit size not only by using the intra prediction mode, prediction direction and other information from the H.264 decoder, but also by trying the prediction on the decoded picture with the most dominant prediction directions of H.264. Both kinds of information are fed into the classifier. Support Vector Machine (SVM) classifiers are then trained and applied at different division levels to speed up the coding unit size decision while keeping the accuracy loss as small as possible. Experiments show a speed-up of about 1.42 times over the HEVC HM 10.0 reference software at a rate distortion performance loss of about 0.069dB.

Daxin Guang, Pin Tao, Sichao Song, Lixin Feng, Jiangtao Wen, Shiqiang Yang

Unified VLSI Architecture of Motion Vector and Boundary Strength Parameter Decoder for 8K UHDTV HEVC Decoder

This paper presents a VLSI architecture design of unified motion vector (MV) and boundary strength (BS) parameter decoder (PDec) for 8K UHDTV HEVC decoder. PDec in HEVC is deemed as a highly algorithm-irregular module, which is also challenged by high throughput requirement for UHDTV. To solve these problems, four schemes are proposed. Firstly, the work unifies MV and BS parameter decoders to share on-chip memory and simplify the control logic. Secondly, we propose the CU-adaptive pipeline scheme to efficiently reduce the implementation complexity. Thirdly, on-chip memory is organized to meet the high throughput requirement for spatial neighboring fetching. Finally, optimizations on irregular MV algorithm are adopted for 43.2k area reduction. In 90nm process, our design costs 93.3k logic gates with 23.0kB line buffer. The proposed architecture can support 7680x4320@60fps real-time decoding at 249MHz in the worst case.

Shihao Wang, Dajiang Zhou, Jianbin Zhou, Takeshi Yoshimura, Satoshi Goto


Using Label Propagation to Get Confidence Map for Segmentation

We propose a novel algorithm to segment objects from the existing segmentation results of co-segmentation algorithms [1]. Previous co-segmentation algorithms work well when the main regions of the images contain only the target objects; however, their performance degrades significantly when objects of multiple categories appear in the images. In contrast, our method adopts mask transformation from multiple images and discriminative enhancement from multiple object categories, which can effectively ensure good performance in both scenarios. We propose to use SIFT flow [2] between pre-segmented source images and the target image, and transform the source images’ segmentation masks to fit the target testing image using the flow vectors. Then we use all the transformed masks to vote on the testing image mask and obtain the initial segmentation results. We also propose to use the ratio between the target category and the other categories to eliminate the side effects of other objects that might appear in the initial segmentation. We conduct our experiments on internet images collected by Rubinstein et al. [1]. We also carry out an additional experiment to study the multi-object conjunction cases. Our algorithm is efficient in terms of computational complexity and achieves better performance than the state-of-the-art algorithm.

Haoran Li, Hongxun Yao, Xiaoshuai Sun

Image Region Labeling by Exploring Contextual Information of Visual Spatial and Semantic Concepts

Region labeling is the task of automatically assigning semantic labels to the corresponding image regions. Most previous works focus on exploiting low level visual features, particularly visual spatial contextual information, to address the problem. However, very few works explore high level semantic information of the whole image. In this paper, we propose a new region labeling approach that integrates both visual spatial and semantic contextual information into a unified model. In our method, region labeling is regarded as a multi-class classification problem. For each semantic concept, we train a separate Conditional Random Field (CRF) model. It consists of both the region grid sub-graph and the co-occurring semantic label sub-graph. In our model, the integration of the two kinds of contextual information has a reinforcing effect that improves region labeling. The experiments are conducted on two commonly used benchmark datasets, and the experimental results show that our method achieves the best performance compared with the strong baselines and the state-of-the-art methods.

Kai He, Wen Chan, Guangtang Zhu, Lan Lin, Xiangdong Zhou

Manifold Regularized Multi-view Feature Selection for Web Image Annotation

The features used in many multimedia analysis-based applications are frequently of very high dimension. Feature selection offers several advantages in high-dimensional cases. Recently, multi-task feature selection has attracted much attention, and has been shown to often outperform traditional single-task feature selection. Current multi-task feature selection methods are either supervised or unsupervised. In this paper, we address the semi-supervised multi-task feature selection problem. We first introduce manifold regularization in multi-task feature selection to utilize the limited number of labeled samples and the relatively large amount of unlabeled samples. However, the graph constructed in manifold regularization from a single feature representation (view) may be unreliable. We thus propose to construct the graph using the heterogeneous feature representations from multiple views. The proposed method is called manifold regularized multi-view feature selection (MRMVFS), which can exploit the label information, label relationship, data distribution, as well as correlation among different kinds of features simultaneously to boost the feature selection performance. All this information is integrated into a unified learning framework to estimate the feature selection matrix as well as the adaptive view weights. Experimental results on a real-world web image dataset demonstrate the effectiveness and superiority of the proposed MRMVFS over other state-of-the-art feature selection methods.

Yangxi Li, Xin Shi, Lingling Tong, Yong Luo, Jinhui Tu, Xiaobo Zhu

Semantic Concept Annotation of Consumer Videos at Frame-Level Using Audio

With the increasing use of audio sensors in user generated content (UGC) collection, semantic concept annotation using audio streams has become an important research problem. Huawei initiated a grand challenge at the International Conference on Multimedia & Expo (ICME) 2014: the Huawei Accurate and Fast Mobile Video Annotation Challenge. In this paper, we present our semantic concept annotation system for the Huawei challenge, which uses the audio stream only. The system extracts the audio stream from the video data and low-level acoustic features from the audio stream. A bag-of-features representation is generated from the low-level features and is used as the input feature to train the support vector machine (SVM) concept classifier. The experimental results show that our audio-only concept annotation system can detect semantic concepts significantly better than random guessing. It can also provide important complementary information to a visual-based concept annotation system to boost performance.

Junwei Liang, Qin Jin, Xixi He, Gang Yang, Jieping Xu, Xirong Li

Image and Photo 1

Text Detection in Natural Scene Images with Stroke Width Clustering and Superpixel

Text information in natural scene images is important for various kinds of applications. In this paper a novel method based on stroke width to detect text in unconstrained natural scene images is proposed. Firstly, we use the stroke width transform to generate a rough estimation of stroke width map, then use K-Means clustering and the elbow method to find some specific stroke width values that are both dominant and consistent. Secondly, in order to generate better edge detection and gradient direction results we use these specific stroke width values as the size parameters in the superpixel algorithm to generate smooth and uniform region boundaries. Finally, we try to refine the stroke width map and recover valid edge pixels by applying stroke width regularized constraints on the improved edge detection and gradient direction results computed from these region boundaries. Our method was evaluated on three benchmark datasets: ICDAR 2005, 2011 and 2013, and the experimental results show that it achieves state-of-the-art performance.

Shuang Liu, Yu Zhou, Yongzheng Zhang, Yipeng Wang, Weiyao Lin

Sketch-Based Retrieval Using Content-Aware Hashing

In this paper, we introduce a generic hashing-based approach. It aims to facilitate sketch-based retrieval on large datasets of visual shapes. Unlike previous methods where visual descriptors are extracted from overlapping grids, a content-aware selection scheme is proposed to generate candidate patches instead. Meanwhile, the saliency of each patch is efficiently estimated. Locality-sensitive hashing (LSH) is employed to integrate and capture both the content and saliency of patches, as well as the spatial information of visual shapes. Furthermore, hash codes are indexed so that a query can be processed in sub-linear time. Experiments on three standard datasets in terms of hand drawn shapes, images and 3D models demonstrate the superiority of our approach.

Shuang Liang, Long Zhao, Yichen Wei, Jinyuan Jia
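
The hashing and indexing steps named in the abstract above can be pictured with a generic locality-sensitive hashing sketch. This is not the authors' content-aware scheme: it assumes sign-random-projection LSH over plain descriptor vectors, and the function names, the 4-dimensional descriptors and the patch identifiers are all hypothetical.

```python
import random


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes (assumption: Gaussian normals)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_code(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        int(sum(w * x for w, x in zip(plane, vector)) >= 0.0)
        for plane in hyperplanes
    )


def build_index(descriptors, hyperplanes):
    """Group item ids into buckets keyed by their hash code."""
    index = {}
    for item_id, vector in descriptors.items():
        index.setdefault(lsh_code(vector, hyperplanes), []).append(item_id)
    return index


def query(index, vector, hyperplanes):
    """Look up only the bucket the query falls into (no linear scan)."""
    return index.get(lsh_code(vector, hyperplanes), [])


# Hypothetical patch descriptors; real ones would come from the shapes.
planes = make_hyperplanes(n_bits=8, dim=4)
patches = {
    "patch_a": [0.9, 0.1, 0.0, 0.2],
    "patch_b": [-0.5, 0.8, 0.3, -0.1],
}
index = build_index(patches, planes)
print(query(index, patches["patch_a"], planes))
```

A query touches a single bucket rather than every stored descriptor, which is the property that makes sub-linear lookup possible. A practical system would typically use several hash tables to trade recall against speed.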

Pixel Granularity Template Matching Method for Screen Content Lossless Intra Picture

The state-of-the-art High Efficiency Video Coding (HEVC) standard is designed for natural pictures. Screen content pictures contain many similarities within one picture, which can be used to improve the intra picture compression ratio. We propose a sample rearrangement and template matching method that exploits the similarities in screen content pictures. A 21-pixel template and a high-efficiency multiple hash table are designed. Experimental results show that the proposed method can improve the lossless compression ratio by up to 4.23 times compared with HEVC Range Extensions.

Lixin Feng, Pin Tao, Daxin Guang, Jiangtao Wen, ShiQiang Yang


Music Interaction on Mobile Phones

This paper proposes an interactive musical application for smart phones to increase the possible exposure of music. Users can adjust the expressions of music using gestures on smart phones. Our experimental results demonstrate user satisfaction with this system. The user study also demonstrates that user preference of music pieces could be improved after playing with the system.

Wilber Chao, Kuan-Ting Chen, Yi-Shin Chen

Semantically Enhancing Multimedia Lifelog Events

Lifelogging is the digital recording of our everyday behaviour in order to identify human activities and build applications that support daily life. Lifelogs represent a unique form of personal multimedia content in that they are temporal, synchronised, multi-modal and composed of multiple media. Analysing lifelogs with a view to supporting content-based access, presents many challenges. These include the integration of heterogeneous input streams from different sensors, structuring a lifelog into events, representing events, and interpreting and understanding lifelogs. In this paper we demonstrate the potential of semantic web technologies for analysing lifelogs by automatically augmenting descriptions of lifelog events. We report on experiments and demonstrate how our results yield rich descriptions of multi-modal, multimedia lifelog content, opening up even greater possibilities for managing and using lifelogs.

Peng Wang, Alan Smeaton, Alessandra Mileo

Haar-Like and HOG Fusion Based Object Tracking

Conventional tracking systems adopt only a single feature for the object, which makes robust tracking difficult. Considering the characteristics of both Haar-like and HOG features, we propose a tracking algorithm that fuses the two: Haar-like features capture the structure of the object and HOG features capture its edges. A mixed feature pool is constructed from these two features. The on-line boosting feature selection framework is adopted to select the notable features and update them on line to achieve the optimal selection. Four representative videos are used to test the performance of the proposed algorithm under illumination change, tracking of small targets, complex object motion, interference from similar objects during tracking, and so on. Statistical analysis of the error shows that the tracking system using the fused features outperforms the systems using either of the two features alone.

Chong Xia, Shui-Fa Sun, Peng Chen, Heng Luo, Fang-Min Dong


Noise Face Image Hallucination via Data-Driven Local Eigentransformation

Face hallucination refers to inferring a High-Resolution (HR) face image from an input Low-Resolution (LR) one. It plays a vital role in LR face recognition by both humans and computers. The eigentransformation method based on Principal Component Analysis (PCA), which represents a face image as a linear combination of eigenfaces, has attracted considerable interest because of its simplicity and effectiveness. However, the observed face image lies in a high-dimensional non-linear space, whose statistical properties cannot be captured by the PCA based linear modeling method. To this end, in this paper we propose a Data-driven Local Eigentransformation (DLE) method for face hallucination that exploits the local geometric structure of the data manifold and learns a specific eigentransformation model for each observed image. Experimental results show the effectiveness of the proposed approach for hallucinating face images, especially noisy ones.

Xiaohui Dong, Ruimin Hu, Junjun Jiang, Zhen Han, Liang Chen, Ge Gao

Lip Segmentation Based on Facial Complexion Template

In Traditional Chinese Medicine (TCM), lip diagnosis is an important diagnostic method to judge whether a person is healthy or not. Lip images can reflect the physical conditions of organs in the body. Lip diagnosis has a long history in China, and the lips are analyzed by experienced doctors with the naked eye. This method is neither objective nor efficient, especially when many images have to be handled. Developing an automatic way to segment lips from an image is therefore an important and necessary step. Moreover, lip segmentation can bring improvements in the areas of speech recognition and speaker authentication. To segment lips and facial complexions, many methods have been proposed that are based on color spaces such as RGB, HSV, Lab, etc. Other methods are based on different models such as snakes, geometric models, etc. This paper proposes a lip segmentation method based on a facial complexion template. A facial complexion template can be constructed once the face is detected. We construct the facial complexion template using the Hue and Saturation channels of the color information. By removing the skin whose values are similar to the facial complexion template, an initial lip image is obtained. Finally, by smoothing the lip contour, an optimized lip segmentation result is obtained.

Chenyang Sun, Hong Lu, Wenqiang Zhang, Xiaoxin Qiu, Fufeng Li, Hongkai Zhang
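
The template step in the abstract above (build a hue/saturation template from facial skin, then remove pixels similar to it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the template is taken to be the mean hue and saturation of a few sampled skin pixels, the tolerances `hue_tol` and `sat_tol` are invented values, and the contour smoothing step is omitted.

```python
import colorsys
import math


def hue_sat(rgb):
    """Return (hue, saturation) in [0, 1] for an 8-bit RGB pixel."""
    r, g, b = (channel / 255.0 for channel in rgb)
    h, s, _ = colorsys.rgb_to_hsv(r, g, b)
    return h, s


def build_template(skin_pixels):
    """Assumed template: circular mean hue and mean saturation of skin samples."""
    values = [hue_sat(pixel) for pixel in skin_pixels]
    sin_sum = sum(math.sin(2 * math.pi * h) for h, _ in values)
    cos_sum = sum(math.cos(2 * math.pi * h) for h, _ in values)
    mean_hue = (math.atan2(sin_sum, cos_sum) / (2 * math.pi)) % 1.0
    mean_sat = sum(s for _, s in values) / len(values)
    return mean_hue, mean_sat


def hue_distance(h1, h2):
    """Hue is circular, so 0.98 and 0.02 are only 0.04 apart."""
    d = abs(h1 - h2)
    return min(d, 1.0 - d)


def is_skin(rgb, template, hue_tol=0.05, sat_tol=0.15):
    """A pixel counts as skin if both hue and saturation are near the template."""
    h, s = hue_sat(rgb)
    return hue_distance(h, template[0]) <= hue_tol and abs(s - template[1]) <= sat_tol


def initial_lip_mask(face_region, template):
    """True where the pixel is NOT similar to the skin template."""
    return [[not is_skin(pixel, template) for pixel in row] for row in face_region]


# Hypothetical 2x2 face region: pale skin tones and a darker, redder lip tone.
skin_samples = [(220, 180, 160), (215, 178, 158)]
face_region = [
    [(220, 180, 160), (180, 60, 80)],
    [(180, 60, 80), (215, 178, 158)],
]
template = build_template(skin_samples)
mask = initial_lip_mask(face_region, template)
print(mask)
```

Everything that does not match the skin template ends up in the mask, so the sketch only makes sense inside a detected face region, as the abstract assumes.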

Saliency-Based Deformable Model for Pedestrian Detection

Pedestrian detection, which identifies the category (pedestrian) of an object and gives its position in the image, is an important yet challenging task because of the intra-class variation of pedestrians in clothing and articulation. Previous research mainly focuses on feature extraction and sliding windows, where the former aims to find a robust feature representation while the latter seeks to locate the latent position. However, most sliding window approaches are based on scale transformation and traverse the entire image, which introduces unnecessary computational complexity and false detections. To overcome these difficulties, we propose a novel Saliency-Based Deformable Model (SBDM) method for pedestrian detection. In the SBDM method, we show that, besides the local features, the saliency in the image provides important constraints that are not yet well utilized. A probabilistic framework is proposed to model the relationship between saliency detection and the feature (deformable model) via a Bayesian rule to detect pedestrians in still images.

Xiao Wang, Jun Chen, Wenhua Fang, Chao Liang, Chunjie Zhang, Kaimin Sun, Ruimin Hu

Special Session: Image Analysis and Processing under Extra Help

Age Estimation Based on Convolutional Neural Network

In recent years, face recognition technology has become a hot topic in the field of pattern recognition. The human face is one of the most important human biometric characteristics and contains a lot of important information, such as identity, gender, age, expression, race and so on. Human age is a significant reference for identity discrimination, and age estimation can potentially be applied in human-computer interaction, computer vision and business intelligence. This paper addresses the problem of accurate estimation of human age. An age estimation system is generally composed of aging feature extraction and feature classification. For feature extraction, well-known texture descriptors such as Gabor wavelets and Local Binary Patterns (LBP) have been used. In our method, we use a Convolutional Neural Network (CNN) to extract facial features. We obtain the convolution activation features by building a multilevel CNN model based on abundant training data. For feature classification, we divide the ages into 13 groups and use a Support Vector Machine (SVM) classifier to perform the classification. The experimental results show that the performance of the proposed method is superior to that of the previous methods on our aging database.

Chenjing Yan, Congyan Lang, Tao Wang, Xuetao Du, Chen Zhang

A Curvature Filter and Normal Clustering Based Approach to Detecting Cylinder on 3D Medical Model

In this paper, we propose a cylinder detection approach based on curvature filtering and normal clustering. We first estimate the curvature at the vertices of each triangle and retain the triangles that have the characteristics of a cylinder; the triangles are then clustered by their normals. Next, all the triangles are transformed into a new coordinate system whose Z axis is parallel to the normal. Finally, the cylinders are detected by the Hough transform in the 2D plane. The experimental results show that our proposed algorithm performs well.

Yuan Gao, Lifang Wu, Yuxin Mao, Jinqiao Wang

Image Compositing Based on Hierarchical Weighted Blending

Recent image compositing methods mainly focus on the compositing for normal images, while for shadow ones, these methods may be less effective due to the special structure in the shadow area. Besides, many of these methods may generate problems of serious color distortion or cannot realize seamless blending. In order to improve these problems, we propose a new hierarchical weighted method based on an alpha matte for image composition, especially for those with shadows. In our method, we divide the blending area into different layers according to the alpha matte, and implement a hybrid method combining gradient based method with transformed alpha blending as well as weights in these layers respectively. By conducting a series of experiments, we demonstrate the superiority of our proposed method.

Huihui Wei, Qimin Cheng

Multifold Concept Relationships Metrics

Establishing relationships between concepts from large-scale real-world click data from a commercial search engine is challenging because the click data suffers from noise such as typos and the same concept being expressed by different queries.

In this paper, we propose an approach for automatically establishing the concept relationship. We first define five specific relationships between concepts and leverage them to annotate the images collected from commercial search engine. We then extract some conceptual features in textual and visual domain to train the concept model. The relationship of each pairwise concept will thus be classified into one of the five special relationships. Experimental results demonstrate our proposed approach is more effective than Google Distance.

Wenyi Cao, Richang Hong, Meng Wang, Xiansheng Hua

Poster and Demo Session

A Comparison between Artificial Neural Network and Cascade-Correlation Neural Network in Concept Classification

Deep learning has achieved significant attention recently due to promising results in representing and classifying concepts most prominently in the form of convolutional neural networks (CNN). While CNN has been widely studied and evaluated in computer vision, there are other forms of deep learning algorithms which may be promising. One interesting deep learning approach which has received relatively little attention in visual concept classification is Cascade-Correlation Neural Networks (CCNN). In this paper, we create a visual concept retrieval system which is based on CCNN. Experimental results on the CalTech101 dataset indicate that CCNN outperforms ANN.

Yanming Guo, Liang Bai, Songyang Lao, Song Wu, Michael S. Lew

Location-Based Hierarchical Event Summary for Social Media Photos

In this paper, we propose a system named “LHES” to detect location-based social events on flexible time scales and generate a hierarchical summary for the event. Particularly, we focus on social events that happened at landmarks. Flexible time scales include month, day, hour and minute. For each landmark, our LHES system generates a hierarchical (tree style) summary, in which the root node gives a snapshot of the entire event and child nodes span different time periods (beginning, ending, etc.) of the parent event. To generate such a summary, we use both visual cues (e.g., color, texture) and metadata (e.g., time stamp, image tags, titles and description). Our online demo is available at


Weipeng Zhang, Jia Chen, Jie Shen, Yong Yu

Towards Natural Gestures for Presentation Control Using Microsoft Kinect

Microsoft Kinect was initially developed for gaming and has since been used in a variety of applications. This work addresses some of the challenges of using it for gesture-driven presentation control. We demonstrate the proposed use of a Hidden Markov Model to provide smoother movement, which is a step towards more natural movement.

Boon Yaik Ooi, Chee Siang Wong, Ian KT Tan, Chee Keong Lee

3D-Spatial-Texture Bilateral Filter for Depth-Based 3D Video

In the depth-based 3D video system, filters used in texture and depth images denoising, such as bilateral filter and trilateral filter, are generally designed based on calculating the weighted average of reference pixels located around the pixel to be filtered. In this paper, we propose a 3D-spatial-texture bilateral filter by considering the relationship of two pixels in the 3D space, including geometric closeness in the 3D world coordinate, as well as their corresponding texture/color similarity. Accordingly, the weight is defined with two kernels describing two abovementioned factors respectively, namely, a spatial kernel and a range kernel. The experimental results show that better performance can be achieved by using the proposed filter for both texture and depth image denoising, compared with conventional bilateral filter and trilateral filter.

Xin Wang, Ce Zhu, Jianhua Zheng, Yongbing Lin, Yuhua Zhang
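
The two-kernel weight described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the pinhole back-projection, the intrinsics `fx`, `fy`, `cx`, `cy`, the Gaussian form of both kernels, and the `sigma_s` and `sigma_r` values are all hypothetical choices made for the example.

```python
import math

def backproject(u, v, z, fx=500.0, fy=500.0, cx=0.0, cy=0.0):
    """Map pixel (u, v) with depth z to 3D camera coordinates (assumed pinhole model)."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)

def weight(p, q, tex_p, tex_q, sigma_s=0.1, sigma_r=10.0):
    """Product of a spatial kernel on 3D distance and a range kernel on texture difference."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    spatial = math.exp(-d2 / (2 * sigma_s ** 2))
    rng = math.exp(-((tex_p - tex_q) ** 2) / (2 * sigma_r ** 2))
    return spatial * rng

def filter_pixel(values, depth, texture, u, v, radius=1):
    """Return the weighted average of values around (u, v).

    values  : 2D list being denoised (a depth map or a texture channel)
    depth   : 2D list of depths used for back-projection
    texture : 2D list of intensities used by the range kernel
    """
    h, w = len(values), len(values[0])
    p = backproject(u, v, depth[v][u])
    num = den = 0.0
    for y in range(max(0, v - radius), min(h, v + radius + 1)):
        for x in range(max(0, u - radius), min(w, u + radius + 1)):
            q = backproject(x, y, depth[y][x])
            wgt = weight(p, q, texture[v][u], texture[y][x])
            num += wgt * values[y][x]
            den += wgt
    # The centre pixel always contributes weight 1, so den is never zero.
    return num / den
```

With these assumed parameters, neighbours that are adjacent in the image but far apart in depth receive a near-zero weight, so they barely influence the filtered value.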

A System for Parking Lot Marking Detection

In this paper, we propose a robust parking lot marking detection technique, an important component of intelligent transportation systems and assisted/autonomous driving. Our system learns features of parking lot markings from training data and matches these templates to detected features in the test video at runtime. In the proposed system, maximally stable extremal regions (MSER) are used to detect a set of parking lot marking candidates. Features are then extracted from the detected candidates, and a Support Vector Machine (SVM) is applied to classify the parking lot markings efficiently. With the detected parking lot markings, a parking lot is estimated by fitting two adjacent parking lot markings. The proposed technique is tested on real-world street-view videos captured with an in-car camera. The experimental results show that the proposed technique is robust and capable of detecting parking lots under different lighting conditions, marking sizes, and marking poses.

Bolan Su, Shijian Lu

Nearest Neighbor

Fast Search of Binary Codes with Distinctive Bits

Although the distance between binary codes can be computed quickly in Hamming space, linear search is not practical for large-scale datasets. Therefore, attention has been paid to efficient approximate nearest neighbor search, for which Hierarchical Clustering Trees (HCT) is the state-of-the-art method. However, HCT builds its index with the whole binary codes, which degrades search performance. In this paper, we first propose an algorithm to compress binary codes by extracting distinctive bits according to the standard deviation of each bit. Then, a new index is proposed using the compressed binary codes, based on hierarchical decomposition of binary spaces. Experiments conducted on reference datasets and a dataset of one billion binary codes demonstrate the effectiveness and efficiency of our method.

Yanping Ma, Hongtao Xie, Zhineng Chen, Qiong Dai, Yinfei Huang, Guangrong Ji
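
The bit-selection step can be illustrated with a short sketch. The abstract only says that bits are chosen according to their standard deviation; treating the highest-deviation bits as the distinctive ones is an assumption here, and the function names are hypothetical. The hierarchical index itself is not shown.

```python
def select_distinctive_bits(codes, k):
    """Return the positions of the k bits with the largest standard deviation.

    codes is a list of equal-length lists of 0/1 values. For a binary
    variable with mean p, the standard deviation is sqrt(p * (1 - p)).
    """
    n = len(codes)
    n_bits = len(codes[0])
    stds = []
    for b in range(n_bits):
        p = sum(code[b] for code in codes) / n
        stds.append((p * (1 - p)) ** 0.5)
    ranked = sorted(range(n_bits), key=lambda b: stds[b], reverse=True)
    return sorted(ranked[:k])

def compress(code, bits):
    """Pack the selected bits of one code into an integer."""
    value = 0
    for b in bits:
        value = (value << 1) | code[b]
    return value

def hamming(a, b):
    """Hamming distance between two packed codes."""
    return bin(a ^ b).count("1")
```

A constant bit (always 0 or always 1) has zero deviation and is dropped first, since it cannot help separate any two codes.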

Data-Dependent Locality Sensitive Hashing

Locality sensitive hashing (LSH) is the most popular algorithm for approximate nearest neighbor (ANN) search. Because LSH partitions the vector space uniformly while the distribution of vectors is usually non-uniform, it fits real datasets poorly and has limited performance. In this paper, we propose a new data-dependent LSH algorithm, which has a two-level structure to perform ANN search in high-dimensional spaces. In the first level, we train a number of cluster centers and use them to divide the dataset into many clusters, so that the vectors in each cluster have a near-uniform distribution. In the second level, we construct LSH tables for each cluster. Given a query, we first determine a few clusters that it belongs to with high probability, and then perform ANN search in the corresponding LSH tables. Experimental results on the reference datasets show that the search speed can be increased by 48 times compared to E2LSH, while keeping high search precision.

Hongtao Xie, Zhineng Chen, Yizhi Liu, Jianlong Tan, Li Guo
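
The two-level structure can be sketched roughly as below. This is an illustration under assumptions rather than the authors' algorithm: the cluster centers are taken as given (for instance from k-means), "clusters the query belongs to with high probability" is approximated by the nearest centers, and each cluster gets a single table of Gaussian random-projection hashes. The class name `TwoLevelLSH` and all parameters are hypothetical.

```python
import math
import random
from collections import defaultdict

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class TwoLevelLSH:
    def __init__(self, centers, dim, n_hashes=4, width=4.0, seed=0):
        rng = random.Random(seed)
        self.centers = centers
        self.width = width
        # Level 2: one set of random projections (a, b) per cluster.
        self.proj = [
            [([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, width))
             for _ in range(n_hashes)]
            for _ in centers
        ]
        self.tables = [defaultdict(list) for _ in centers]
        self.data = []

    def _key(self, c, v):
        # h(v) = floor((a . v + b) / width) for each projection of cluster c.
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.width)
            for a, b in self.proj[c]
        )

    def _nearest_clusters(self, v, n):
        order = sorted(range(len(self.centers)),
                       key=lambda c: dist2(v, self.centers[c]))
        return order[:n]

    def add(self, v):
        # Level 1: assign the vector to its nearest cluster center.
        idx = len(self.data)
        self.data.append(v)
        c = self._nearest_clusters(v, 1)[0]
        self.tables[c][self._key(c, v)].append(idx)

    def query(self, q, n_probe=2):
        # Probe the tables of the n_probe nearest clusters, then rank
        # the candidates by exact distance. Returns an index or None.
        cands = []
        for c in self._nearest_clusters(q, n_probe):
            cands.extend(self.tables[c].get(self._key(c, q), []))
        return min(cands, key=lambda i: dist2(q, self.data[i]), default=None)
```

Probing more than one cluster trades some speed for recall when a query lies near a cluster boundary.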

Cosine Distance Metric Learning for Speaker Verification Using Large Margin Nearest Neighbor Method

In this paper, a novel cosine similarity metric learning method based on large margin nearest neighborhood (LMNN) is proposed for an i-vector based speaker verification system. Generally, in an i-vector based speaker verification system, the decision is based on the cosine distance between the test i-vector and the target i-vector. Metric learning methods are employed to reduce the within-class variation and maximize the between-class variation. In the proposed method, a cosine similarity large margin nearest neighborhood (CSLMNN) metric is learned from the development data. The test and target i-vectors are linearly transformed using the learned metric. The objective of learning the metric is to ensure that the k-nearest neighbors that belong to the same speaker are clustered together, while impostors are moved away by a large margin. Experiments conducted on the NIST-2008 and YOHO databases show improved performance compared to a speaker verification system in which no learned metric is used.

Waquar Ahmad, Harish Karnick, Rajesh M. Hegde
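
The scoring step, in which both i-vectors are linearly transformed before the cosine comparison, might look like the sketch below. The matrix `A` stands in for a learned transform; how it is trained (the CSLMNN objective) is not shown, and the decision threshold is a hypothetical value.

```python
import math

def transform(A, x):
    """Apply the linear map A (list of rows) to vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def cosine_score(A, test_ivec, target_ivec):
    """Cosine similarity between the transformed test and target i-vectors."""
    t = transform(A, test_ivec)
    g = transform(A, target_ivec)
    dot = sum(a * b for a, b in zip(t, g))
    norm_t = math.sqrt(sum(a * a for a in t))
    norm_g = math.sqrt(sum(b * b for b in g))
    return dot / (norm_t * norm_g)

def verify(A, test_ivec, target_ivec, threshold=0.5):
    """Accept the claimed identity when the score reaches the threshold (assumed value)."""
    return cosine_score(A, test_ivec, target_ivec) >= threshold
```

With the identity matrix as `A` this reduces to plain cosine scoring, which corresponds to the baseline without a learned metric.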

Neural Networks

Anchor Shot Detection with Deep Neural Network

Anchor Shot Detection (ASD) is a key step for segmenting news videos into stories. However, existing ASD methods are either channel-related or channel-limited, and so cannot satisfy the requirements of effectively managing large-scale broadcast news videos. Considering the variety and diversity of large-scale news videos and channels, in this paper we propose a universal scheme based on a deep neural network for anchor shot detection (DNN_ASD). Firstly, DNN_ASD includes a training procedure in which a deep neural network learns an appropriate anchor shot detector. Secondly, combined with an imbalanced sampling strategy and face-assisted verification, it yields a universal anchor shot detection scheme for large-scale news videos and channels. In addition, the width and depth of the neural network and its transfer ability are discussed empirically. Encouraging experimental results on news videos from 30 TV channels demonstrate the effectiveness of the proposed scheme, as well as its superior transfer ability over traditional ASD methods.

Bailan Feng, Jinfeng Bai, Zhineng Chen, Xiangsheng Huang, Bo Xu

Adaptation of ANN Based Video Stream QoE Prediction Model

Pseudo-Subjective Quality Assessment (PSQA) is an effective way to predict the Quality of Experience (QoE) of video streams. The ANN-based PSQA model gives decent QoE prediction accuracy when it is tested under the same condition as in training. However, the performance of the model under mismatched conditions is little studied, and how to effectively adapt the model from one condition to another is still an open question. In this work, we first evaluate the performance of the ANN-based QoE prediction model under mismatched conditions. Our study shows that the QoE prediction accuracy degrades significantly when the model is applied to conditions different from the training condition. Further, we develop a feature-transformation-based model adaptation method to adapt the model from one condition to another. Experimental results show that the QoE prediction accuracy under mismatched conditions can be improved substantially using as few as five data samples from the new condition for model adaptation.

Jianfeng Deng, Ling Zhang, Jinlong Hu, Dongli He

Image and Photo 2

Leveraging Color Harmony and Spatial Context for Aesthetic Assessment of Photographs

Computer aesthetic assessment of pictures aims to automatically compute the aesthetic values of pictures. It has a wide range of potential real-world applications. We apply color harmony, one of the most important aesthetic standards, and explore the spatial context of features. Based on the framework of [9], we provide a color harmony descriptor that includes a circular region sampling method, and follow the principle of Ordered-Bag-of-Features to explore the spatial context. We conduct experiments on a large-scale public aesthetic assessment dataset. Experimental results demonstrate the effectiveness of the proposed method.

Hong Lu, Jin Lin, Bohong Yang, Yiyi Chang, Yuefei Guo, Xiangyang Xue

A Multi-exposure Fusion Method Based on Locality Properties

A new method is proposed for fusing a multi-exposure sequence of images into a high-quality image based on the locality properties of the sequence. We divide the images into uniform blocks and use variance to represent the information content of each block. The richest information (RI) image is computed by piecing together the blocks with the largest variances. We assume that the images in the sequence are high-dimensional data points lying in the same neighbourhood, and borrow the idea of locally linear embedding (LLE) to fuse a result image that is closest to the RI image. The result is comparable to state-of-the-art tone mapping operators and other exposure fusion methods.

Yuanchao Bai, Huizhu Jia, Hengjin Liu, Guoqing Xiang, Xiaodong Xie, Ming Jiang, Wen Gao
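
The construction of the richest information (RI) image can be sketched as follows, assuming grayscale images stored as 2D lists and a hypothetical square block size. Only the block-selection step is shown; the LLE-inspired fusion that follows it in the abstract is not.

```python
def block_variance(img, top, left, bottom, right):
    """Variance of the pixels in rows top..bottom-1 and columns left..right-1."""
    pixels = [img[y][x] for y in range(top, bottom) for x in range(left, right)]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)

def richest_information_image(images, size):
    """Piece together, block by block, the exposure with the largest variance."""
    h, w = len(images[0]), len(images[0][0])
    ri = [[0] * w for _ in range(h)]
    for top in range(0, h, size):
        for left in range(0, w, size):
            bottom, right = min(top + size, h), min(left + size, w)
            best = max(
                images,
                key=lambda im: block_variance(im, top, left, bottom, right),
            )
            for y in range(top, bottom):
                for x in range(left, right):
                    ri[y][x] = best[y][x]
    return ri
```

Because blocks are chosen independently, the RI image can show seams at block boundaries, which is presumably why it serves as a target for the fusion step rather than as the final output.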

Image Abstraction Using Anisotropic Diffusion Symmetric Nearest Neighbor Filter

Image abstraction is an increasingly important task in various multimedia applications. It involves the artificial transformation of photorealistic images into cartoon-like images. To simplify image content, the bilateral and Kuwahara filters remain popular choices to date. However, these methods often produce undesirable over-blurring effects and are highly susceptible to noise. In this paper, we propose an image abstraction technique that balances region smoothing and edge preservation. The coupling of a classic Symmetric Nearest Neighbor (SNN) filter with anisotropic diffusion within our abstraction framework enables effective suppression of local patch artifacts. Our qualitative and quantitative evaluations demonstrate the significant appeal and advantages of our technique in comparison to standard filters in the literature.

Zoya Shahcheraghi, John See, Alfian Abdul Halin
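
For readers unfamiliar with the classic SNN filter mentioned in this abstract, a minimal grayscale sketch follows. It covers only the basic SNN step: the coupling with anisotropic diffusion that the paper proposes is not shown, and leaving border pixels unchanged is a simplification made for the example.

```python
def snn_filter(img, radius=1):
    """Symmetric Nearest Neighbor filter on a 2D list of gray values.

    Each window is split into pairs of pixels that mirror each other
    through the centre. From every pair, the pixel closer in value to
    the centre is kept, and the output is the mean of the kept pixels.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    # Half of the window offsets; each one pairs with its mirror (-dy, -dx).
    offsets = [
        (dy, dx)
        for dy in range(-radius, radius + 1)
        for dx in range(-radius, radius + 1)
        if (dy, dx) > (0, 0)
    ]
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            centre = img[y][x]
            total = 0.0
            for dy, dx in offsets:
                a = img[y + dy][x + dx]
                b = img[y - dy][x - dx]
                total += a if abs(a - centre) <= abs(b - centre) else b
            out[y][x] = total / len(offsets)
    return out
```

On a sharp step edge, every pair contributes the pixel from the centre's own side, so the edge is preserved while flat regions are averaged.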

Audio

Reduction of Multichannel Sound System Based on Spherical Harmonics

To meet the demand for 3D audio at home, recreating a 3D spatial sound field with few loudspeakers is a critical problem. In this paper, we introduce an N- to (N-1)-channel reduction method based on spherical harmonic decomposition, with which the sound field at the head can be perfectly reproduced. When the loudspeakers are too few to perfectly reproduce the sound field at the head, we ensure low distortion of the sound field at the ears. On this basis, a multichannel reduction algorithm from an N- to an M-channel system is proposed. As an example, reduction of the NHK 22.2 system has been implemented, and eleven loudspeaker arrangements from 22 to 6 channels are derived. Results show that the sound field at the head can be reproduced perfectly down to 10 channels, and that the 8- and 6-channel systems keep distortion at the ears low. In a subjective evaluation against Ando’s multichannel conversion method, our proposed method is better in terms of sound localization.

Shanshan Yang, Xiaochen Wang, Dengshi Li, Ruimin Hu, Weiping Tu

Automatic Multichannel Simplification with Low Impacts on Sound Pressure at Ears

People hope to reproduce the experience of 3D film sound at home with a minimal arrangement of loudspeakers. Although Ando’s conversion can convert an N- to an M-channel sound system while maintaining the sound pressure at the origin, it is time-consuming and expensive to rely on many subjective evaluations to distinguish the spatial sound experience reproduced by different M-loudspeaker arrangements. To solve this problem, we not only keep the sound pressure at the origin unchanged, but also limit the absolute error of the sound pressure at the ears to a given threshold while simplifying from N to (N-1) channels, and propose an automatic simplification algorithm from an N- to an M-channel sound system. As an example, the 22.2 multichannel sound system without its two low-frequency effect channels can be simplified to eight channels automatically, and the total number of loudspeaker arrangements is 23. The subjective evaluation is comparable to that of Ando’s conversion, and the cost of subjective evaluation is saved.

Dengshi Li, Ruimin Hu, Xiaochen Wang, Weiping Tu, Shanshan Yang

Acoustic Beamforming with Maximum SNR Criterion and Efficient Generalized Eigenvector Tracking

A recently proposed adaptive acoustic beamformer based on maximizing the output SNR (the Max-SNR beamformer) has the advantage of requiring no information about transfer functions. A key technology for implementing Max-SNR beamformers is estimating the generalized eigenvector (GEV) of the covariance matrices of the target signal and the noise, which are basically unknown. We develop a novel GEV tracking algorithm with decaying time windows that enables the Max-SNR beamformer to adapt to rapidly moving sources. Simulation results support the analysis.

Toshihisa Tanaka, Mitsuaki Shiono
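
The role of the generalized eigenvector can be illustrated with a toy example. The weights that maximize the output SNR, the ratio of w'Rs w to w'Rn w, are the principal generalized eigenvector of the target covariance Rs and noise covariance Rn. The sketch below finds it by plain power iteration on inv(Rn) Rs for a real-valued two-microphone case with known covariances. It is a batch illustration under those assumptions, not the authors' tracking algorithm with decaying time windows.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def max_snr_weights(Rs, Rn, iters=200):
    """Principal generalized eigenvector of (Rs, Rn) by power iteration.

    Assumes the starting vector [1, 1] is not orthogonal to the solution.
    """
    Rn_inv = inv2(Rn)
    w = [1.0, 1.0]
    for _ in range(iters):
        w = matvec(Rn_inv, matvec(Rs, w))
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
    return w

def output_snr(w, Rs, Rn):
    """Ratio of target power to noise power at the beamformer output."""
    num = sum(wi * x for wi, x in zip(w, matvec(Rs, w)))
    den = sum(wi * x for wi, x in zip(w, matvec(Rn, w)))
    return num / den
```

For example, with Rs = [[1, 1], [1, 1]] and Rn = [[1, 0], [0, 4]], the resulting weights put less gain on the noisier second channel than equal weighting would.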

