
About this book

The two-volume set LNCS 9314 and 9315 constitutes the proceedings of the 16th Pacific-Rim Conference on Multimedia, PCM 2015, held in Gwangju, South Korea, in September 2015.

The total of 138 full and 32 short papers presented in these proceedings was carefully reviewed and selected from 224 submissions. The papers were organized in topical sections named: image and audio processing; multimedia content analysis; multimedia applications and services; video coding and processing; multimedia representation learning; visual understanding and recognition on big data; coding and reconstruction of multimedia data with spatial-temporal information; 3D image/video processing and applications; video/image quality assessment and processing; social media computing; human action recognition in social robotics and video surveillance; recent advances in image/video processing; new media representation and transmission technologies for emerging UHD services.



3D Image/Video Processing and Applications


Motion and Depth Assisted Workload Prediction for Parallel View Synthesis

In this paper, a parallel system together with a real-time workload balancing algorithm is proposed for view synthesis on multi-core platforms. First, a numerical relationship between the number of holes after warping and the workload of view synthesis is derived based on correlation analysis for the texture regions. Then, according to the difference between the adjacent depth maps, a novel model is proposed to predict the synthesis workload accurately. Experimental results show that, with the proposed workload balancing system, the workload difference among cores is markedly reduced and a higher speedup ratio is achieved with negligible quality degradation.

Zhanqi Liu, Xin Jin, Qionghai Dai

Graph Cuts Stereo Matching Based on Patch-Match and Ground Control Points Constraint

Stereo matching methods based on Patch-Match obtain good results on complex texture regions but perform poorly on low texture regions. In this paper, a new method that integrates Patch-Match and graph cuts (GC) is proposed in order to achieve good results in both complex and low texture regions. A label is randomly assigned to each pixel and then optimized through a propagation process. All these labels constitute a label space for each iteration in GC. Also, a Ground Control Points (GCPs) constraint term is added to the GC to overcome the disadvantages of Patch-Match stereo in low texture regions. The proposed method has the advantage of the spatial propagation of Patch-Match and the global property of GC. The method is tested on the Middlebury evaluation system, and its results outperform all the other Patch-Match based methods.

Xiaoshui Huang, Chun Yuan, Jian Zhang

Synthesized Views Distortion Model Based Rate Control in 3D-HEVC

In this paper, we propose a synthesized views distortion model based rate control algorithm for the high efficiency video coding (HEVC) based 3D video compression standard. The major contributions of the paper include the following two aspects. Firstly, we investigate the distortion dependency between the synthesized views and the coded views including texture video and depth maps. Then we propose a synthesized views distortion model for 3D-HEVC, and based on the distortion model an efficient joint bit allocation scheme is proposed. Experimental results show that the proposed rate control algorithm achieves better performance on both the coded texture views and synthesized views. The maximum overall (including all coded texture views and all synthesized views) performance improvement can be up to 14.4 % and the average BD-rate gain is 6.9 %. Moreover, it can accurately control the bitrate to satisfy the total bitrate constraint.

Songchao Tan, Siwei Ma, Shanshe Wang, Wen Gao

Efficient Depth Map Upsampling Method Using Standard Deviation

In this paper, we present an adaptive multi-lateral filtering method to increase depth map resolution. Joint bilateral upsampling (JBU) increases the resolution of a depth image by considering the photometric properties of the corresponding high-resolution color image. JBU uses both a spatial weighting function and a color weighting function evaluated on the data values. However, JBU causes a texture copying problem. Standard deviation is a measure that quantifies the amount of variation or dispersion of a set of data in an image. Therefore, it captures edge information in each kernel. In the proposed method, we reduce the texture copying problem in the upsampled depth map by using adaptive weighting functions that are chosen according to the edge information. Experimental results show that the proposed method outperforms other depth upsampling approaches in terms of bad pixel rate.

Su-Min Hong, Yo-Sung Ho
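The plain JBU step that the abstract above takes as its starting point can be sketched as follows. This is a minimal pure-Python illustration of standard joint bilateral upsampling, not the authors' standard-deviation-adaptive filter; the function name, the `radius`, `sigma_s` and `sigma_r` parameters, and the use of a grayscale guide image are all assumptions made for the example.

```python
import math


def joint_bilateral_upsample(depth_lr, guide_hr, factor, radius=1,
                             sigma_s=1.0, sigma_r=10.0):
    """Upsample a low-res depth map using a high-res grayscale guide image.

    depth_lr: list of rows (h x w).
    guide_hr: list of rows (h*factor x w*factor), hypothetical guide image.
    """
    h, w = len(depth_lr), len(depth_lr[0])
    H, W = len(guide_hr), len(guide_hr[0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            cy, cx = y // factor, x // factor  # nearest low-res sample
            num = den = 0.0
            for qy in range(max(0, cy - radius), min(h, cy + radius + 1)):
                for qx in range(max(0, cx - radius), min(w, cx + radius + 1)):
                    # Spatial weight, measured in low-res pixel units.
                    ds2 = (qy - y / factor) ** 2 + (qx - x / factor) ** 2
                    ws = math.exp(-ds2 / (2 * sigma_s ** 2))
                    # Range (color) weight, taken from the high-res guide.
                    gy = min(H - 1, qy * factor)
                    gx = min(W - 1, qx * factor)
                    dr = guide_hr[y][x] - guide_hr[gy][gx]
                    wr = math.exp(-dr * dr / (2 * sigma_r ** 2))
                    num += ws * wr * depth_lr[qy][qx]
                    den += ws * wr
            out[y][x] = num / den
    return out


# A 1x2 depth map upsampled by 2, guided by an image with one sharp edge.
depth = [[10.0, 50.0]]
guide = [[0, 0, 255, 255],
         [0, 0, 255, 255]]
up = joint_bilateral_upsample(depth, guide, factor=2)
```

Because the range weight follows the guide image, depth values do not bleed across the intensity edge. The same mechanism is what copies guide texture into flat depth regions, which is the texture copying problem the abstract refers to.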

Orthogonal and Smooth Subspace Based on Sparse Coding for Image Classification

Many real-world problems deal with high-dimensional data, such as images, videos, text, web documents and so on. In fact, the classification algorithms used to process these high-dimensional data often suffer from low accuracy and high computational complexity. Therefore, we propose a framework for transforming images from a high-dimensional image space to a low-dimensional target image space, based on learning an orthogonal smooth subspace for the SIFT sparse codes (SC-OSS). It is a two-stage framework for subspace learning. Firstly, sparse coding followed by spatial pyramid max pooling is used to get the image representation. Then, the image descriptor is mapped into an orthonormal and smooth subspace to classify images in low dimension. The proposed algorithm adds orthogonality and a Laplacian smoothing penalty to constrain the projective function coefficient to be orthogonal and spatially smooth. The experimental results on public datasets show that the proposed algorithm outperforms other subspace methods.

Fushuang Dai, Yao Zhao, Dongxia Chang, Chunyu Lin

Video/Image Quality Assessment and Processing


Sparse Representation Based Image Quality Assessment with Adaptive Sub-dictionary Selection

This paper presents a sparse representation based image Quality metric with Adaptive Sub-Dictionaries (QASD). An overcomplete dictionary is first learned using natural images. A reference image block is represented using the overcomplete dictionary, and the used basis vectors are employed to form an undercomplete sub-dictionary. Then the corresponding distorted image block is represented using all the basis vectors in the sub-dictionary. The sparse coefficients are used to generate two feature maps, based on which a local quality map is generated. With the consideration that sparse features are insensitive to weak distortions and image quality is affected by various factors, image gradient, color and luminance are integrated as auxiliary features. Finally, a sparse-feature-based weighting map is proposed to conduct the pooling, producing an overall quality score. Experiments on public image databases demonstrate the advantages of the proposed method.

Leida Li, Hao Cai, Yabin Zhang, Jiansheng Qian

Single Image Super-Resolution via Iterative Collaborative Representation

We propose a new model called iterative collaborative representation (ICR) for image super-resolution (SR). Most popular SR approaches extract low-resolution (LR) features from the given LR image directly to recover its corresponding high-resolution (HR) features. However, they neglect to utilize the reconstructed HR image for further image SR enhancement. Based on this observation, we extract features from the reconstructed HR image to progressively upscale the LR image in an iterative way. In the learning phase, we use the reconstructed and the original HR images as inputs to train the mapping models. These mapping models are then used to upscale the original LR images. In the reconstruction phase, the mapping models and the LR features extracted from the LR and reconstructed images are used to conduct image SR in each iteration. Experimental results on standard images demonstrate that our ICR obtains state-of-the-art SR performance quantitatively and visually, surpassing recently published leading SR methods.

Yulun Zhang, Yongbing Zhang, Jian Zhang, Haoqian Wang, Qionghai Dai

Influence of Spatial Resolution on State-of-the-Art Saliency Models

Visual attention has been widely investigated and applied in recent decades. Various computational models have been proposed to model visual attention, but most research is conducted under the assumption that the given images have only a few limited spatial resolutions. Spatial resolution is an important feature of an image. Image resolution may have some influence on visual attention, and it may also affect the effectiveness of visual attention models. The influence of spatial resolution on saliency models has not been systematically investigated before. In this paper, we discuss two problems related to image resolution and saliency: (1) Most saliency models contain a down-sampling function which changes the resolution of the original images to lower the computational complexity and keep the formalization of the algorithm. In the first part, we discuss the influence of the down-sampling ratio on the effectiveness of classic saliency models. (2) In the second part, we investigate the effectiveness of saliency models on images of various resolutions. A dataset which provides images and corresponding eye movement data in various spatial resolutions is used in this part. We apply the default rescaling parameters and keep them unchanged. Then we analyze the performance of classic models on 8 resolution levels. In summary, we systematically investigate and analyze problems concerning spatial resolution in the research of saliency modeling. The results of this work can provide a guide to the use of classic models on images of different resolutions, and they are helpful for computational complexity optimization.

Zhaohui Che, Guangtao Zhai, Xiongkuo Min

Depth Map Upsampling via Progressive Manner Based on Probability Maximization

Depth maps generated by modern depth cameras, such as Kinect or Time of Flight cameras, usually have low resolution and are polluted by noise. To address this problem, a novel progressive depth upsampling method is proposed in this paper. Based on the assumption that an HR depth value can be generated from a distribution determined by the values in its neighborhood, we formulate depth upsampling as a probability maximization problem. Accordingly, we give a progressive solution, where the result of the current iteration is fed into the next to further refine the upsampled depth map. Taking advantage of both the local probability distribution assumption and the result generated in the previous iteration, the proposed method is able to improve the quality of the upsampled depth while eliminating noise. We have conducted various experiments, which show an impressive improvement in both subjective and objective evaluations compared with state-of-the-art methods.

Rongqun Lin, Yongbing Zhang, Haoqian Wang, Xingzheng Wang, Qionghai Dai

Perceptual Quality Improvement for Synthesis Imaging of Chinese Spectral Radioheliograph

The Chinese Spectral Radioheliograph can generate images of the Sun with good spatial resolution. It employs the aperture synthesis principle to image the Sun with plentiful solar radio activities. However, due to the limitations of the hardware, specifically the limited number of antennas, the recorded signal is extremely sparse in practice, which results in unsatisfactory solar radio image quality. In this paper, we study the image reconstruction of the Chinese Spectral Radioheliograph (CSRH) with the aid of the compressed sensing (CS) technique. In our proposed method, we adopt a dictionary technique to represent solar radio images sparsely. The experimental results indicate that the proposed algorithm markedly improves both the PSNR and the subjective image quality of CSRH synthesis imaging.

Long Xu, Lin Ma, Zhuo Chen, Yihua Yan, Jinjian Wu

Social Media Computing


Real-Life Voice Activity Detection Based on Audio-Visual Alignment

Voice activity detection (VAD) is a technology to identify whether the persons in multimedia are speaking. Most research efforts have focused on utilizing both audio and visual information to implement voice activity detection, which outperforms the audio-only or visual-only approaches proposed earlier. However, current methods rely on supervised classifiers using new features that consist of audio and visual information. In this paper, we propose a novel method to detect voice activity by audio-visual alignment. Since a temporal order relationship holds for voice activity over the whole audio and visual information, we use the Needleman-Wunsch algorithm to align the two different sequences. Compared to existing VAD algorithms, our experimental results indicate that the proposed approach presents better results, and the accuracy rate reaches about 85% in real-life environments.

Jin Wang, Chao Liang, Xiaochen Wang, Zhongyuan Wang
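The alignment step named in the abstract above can be illustrated with a textbook Needleman-Wunsch global alignment. The sketch below is only that: the scoring values (`match`, `mismatch`, `gap`) and the per-frame 0/1 activity labels for the audio and visual streams are hypothetical choices for the example, not the features or parameters used by the authors.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b; return (score, aligned pairs)."""
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    i, j, pairs = n, m, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))  # gap in b
            i -= 1
        else:
            pairs.append((None, b[j - 1]))  # gap in a
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Hypothetical per-frame labels: 1 = activity detected, 0 = none.
audio_activity = [0, 1, 1, 0, 0]
lip_activity = [0, 1, 1, 0]
nw_score, nw_pairs = needleman_wunsch(audio_activity, lip_activity)
```

A `None` in a pair marks a frame of one stream that has no counterpart in the other, which is how the alignment absorbs streams of different lengths.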

Emotion Recognition from EEG Signals by Leveraging Stimulus Videos

This paper proposes a new emotion recognition method from electroencephalogram (EEG) signals by leveraging video stimulus as privileged information, which is only required during training. A Restricted Boltzmann Machine (RBM) is adopted to model the intrinsic relations between stimulus videos and users’ EEG response, and to generate new EEG features. Then, the support vector machine is used to recognize users’ emotion states from the generated EEG features. Experiments on two benchmark databases demonstrate that stimulus videos as the privileged information can help EEG signals construct better feature space, and RBM can model the high-order dependencies between stimulus videos and users’ EEG response successfully. Our proposed emotion recognition method leveraging video stimulus as privileged information outperforms the recognition method only from EEG signals.

Zhen Gao, Shangfei Wang

Twitter Event Photo Detection Using both Geotagged Tweets and Non-geotagged Photo Tweets

In this paper, we propose a system to detect event photos using geotagged tweets and non-geotagged photo tweets. In our previous work, only “geotagged photo tweets” were used for event photo detection, and their ratio to the total number of tweets was very limited. In the proposed system, we use geotagged tweets without photos for event detection, and non-geotagged photo tweets for event photo detection in addition to geotagged photo tweets. As a result, we have detected about ten times as many photo events with higher accuracy compared to the previous work.

Kaneko Takamu, Nga Do Hang, Keiji Yanai

Weather-Adaptive Distance Metric for Landmark Image Classification

Visual appearance of landmark photos changes significantly in different weather conditions. In this work, we obtain weather information from a weather forecast website based on a landmark photo’s geotag and capture time. With weather information, we adaptively adjust the weightings for combining distances obtained from different features and thus propose a weather-adaptive distance measure for landmark photo classification. We verify the effectiveness of this idea, and accomplish one of the early attempts to develop a landmark photo classification system that is robust to weather changes.

Ding-Shiuan Ding, Wei-Ta Chu

Power of Tags: Predicting Popularity of Social Media in Geo-Spatial and Temporal Contexts

Generating multimedia content and sharing it in social networks has become one of our daily-life activities. Although a lot of people care about the quality of the content itself, much less attention is paid to the text annotations. In our previous work, we have shown that the popularity of the content in social media is strongly affected by its annotated tags, and we have proposed a TF-IDF-like algorithm to analyze which tags are potentially more important for gaining popularity. In this paper, we extend the idea to show how the important tags vary geo-spatially and how the importance ranking of the tags evolves over time.

Toshihiko Yamasaki, Jiani Hu, Kiyoharu Aizawa, Tao Mei

Human Action Recognition in Social Robotics and Video Surveillance


Recognition of Human Group Activity for Video Analytics

Human activity recognition is an important and challenging task for video content analysis and understanding. Individual activity recognition has been well studied recently. However, recognizing the activities of a human group with more than three people having complex interactions is still a formidable challenge. In this paper, a novel human group activity recognition method is proposed to deal with complex situations where there are multiple sub-groups. To characterize the inherent intra-subgroup and inter-subgroup interactions with a varying number of participants, this paper proposes three types of group-activity descriptors using the motion trajectory and appearance information of people. Experimental results on a public human group activity dataset demonstrate the effectiveness of the proposed method.

Jaeyong Ju, Cheoljong Yang, Sebastian Scherer, Hanseok Ko

An Incremental SRC Method for Face Recognition

Face recognition has been studied for decades and is widely used in our daily life. However, in practical applications, not only occlusion, pose and expression variations, but also the increasing training cost caused by the increasing number of training samples are problems we need to solve. In this paper, we present a novel incremental SRC method aimed at solving these practical face recognition problems. On one hand, we divide the face into several components, select the components that are greatly affected by face variations, and abandon them; the remaining parts are used to rebuild the global face, which contributes to the final result. On the other hand, inspired by the strategy of “Divide and Rule”, we divide the training samples into multiple groups and train each group separately. Therefore, when a new training sample is added, we only need to update the model of the group to which the new sample is added, which can greatly decrease the retraining cost.

Numerous experiments are conducted on the AR and ORL face databases. Experimental results show that our method outperforms the state-of-the-art linear representation algorithms. In the practical situation of a single training sample, our method shows a greater advantage than other methods.

Junjian Ye, Ruoyu Yang

A Survey on Media Interaction in Social Robotics

Social robots have attracted increasing research interest in the academic and industrial communities. Emerging media technologies have greatly inspired human-robot interaction approaches, which aim to tackle important challenges in practical applications. This paper presents a survey of recent works on media interaction in social robotics. We first introduce the state-of-the-art social robots and the related concepts. Then, we review the visual interaction approaches through various human actions such as facial expression, hand gesture and body motion, which have been widely considered effective ways of media interaction with robots. Furthermore, we summarize the event detection approaches which are crucial for robots to understand the environment and human intentions. While the emphasis is on vision-based interaction approaches, the multimodal interaction works are also briefly summarized for practitioners.

Lu Yang, Hong Cheng, Jiasheng Hao, Yanli Ji, Yiqun Kuang

Recognizing 3D Continuous Letter Trajectory Gesture Using Dynamic Time Warping

Letter trajectory gesture recognition is widely used in Human Computer Interaction. Many approaches for letter trajectory gesture recognition have been proposed in the past several years. Most of the traditional approaches detect letters based on the beginning/end points provided by the user. This causes low writing speed and an uncomfortable writing experience. Moreover, traditional Dynamic Time Warping cannot classify letters which have similar trajectories. In this paper, we combine Dynamic Time Warping with structured points of letters to overcome those problems. The main contribution of this paper is that we introduce the structured points information of letters into the Time Warping process to detect letters from hand trajectories. Based on this, we can successfully recognize letters with weak inter-class features and from a continuous trajectory without beginning and end points given by the user. Furthermore, we can handle the self-contained trajectory based on the complexity of letters. We evaluate this system on our gesture dataset, and the results show that the proposed approach can significantly outperform the traditional begin-end gesture approach.

Jingren Tang, Hong Cheng, Lu Yang
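The traditional Dynamic Time Warping baseline that the abstract above builds on can be sketched as a nearest-template classifier over 3D point sequences. This is a minimal illustration of classic DTW only; it does not include the structured-point extension proposed by the authors, and the template trajectories and labels are made up for the example.

```python
import math


def dtw_distance(s, t):
    """Classic DTW cost between two sequences of 3D points (tuples)."""
    n, m = len(s), len(t)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(s[i - 1], t[j - 1])  # Euclidean point distance
            d[i][j] = cost + min(d[i - 1][j],      # s advances alone
                                 d[i][j - 1],      # t advances alone
                                 d[i - 1][j - 1])  # both advance
    return d[n][m]


def classify(trajectory, templates):
    """Return the label of the template with the smallest DTW cost."""
    return min(templates, key=lambda label: dtw_distance(trajectory, templates[label]))


# Hypothetical letter templates and a slower, slightly noisy stroke.
templates = {
    "I": [(0, 0, 0), (0, 1, 0), (0, 2, 0)],
    "L": [(0, 2, 0), (0, 0, 0), (1, 0, 0)],
}
stroke = [(0, 0, 0), (0, 0.5, 0), (0, 1, 0), (0, 2.1, 0)]
label = classify(stroke, templates)
```

Because the warping path may repeat points, a stroke drawn at a different speed still matches its template closely. Two letters whose trajectories are nearly the same get nearly the same cost, which is the weakness the abstract points out.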

Rapid 3D Face Modeling from Video

In this paper, an efficient technique is developed to construct a textured 3D face model from a video containing a face rotating from frontal to profile. After two manual clicks on a profile to tell the system where the eye corner and the bottom of the chin are, the system automatically generates a realistic looking 3D face model. The proposed method consists of three components. Firstly, based on the facial feature points extracted from frontal and profile images, an individual 3D geometric face model is generated by deforming the generic model with an improved radial basis function. Then the model is refined by using improved […]-Subdivision. Secondly, the multi-resolution technique and a weighted smoothing algorithm are combined to synthesize an individual facial texture image. Finally, a realistic 3D face model is built by mapping the individual texture to the individual 3D geometric model. The accuracy and robustness of the method are demonstrated with a set of experiments.

Hong Song, Jie Lv, Yanming Wang

Recent Advances in Image/Video Processing


Score Level Fusion of Multibiometrics Using Local Phase Array

Local phase array for biometric recognition has demonstrated efficient performance in face, palmprint and finger knuckle recognition. If the matching score for each trait is calculated by one matcher using local phase array, the size of the system can be reduced and simple score level fusion can be used to achieve good performance for person authentication. In this paper, we consider the score level fusion of face, iris, palmprint, and finger knuckle whose matching scores are calculated using local phase array. Through a set of experiments using public databases, we demonstrate the effectiveness of local phase array for multibiometric recognition compared with the combination of the state-of-the-art recognition algorithms for each trait.

Luis Rafael Marval Pérez, Shoichiro Aoyama, Koichi Ito, Takafumi Aoki

Histogram-Based Near-Lossless Data Hiding and Its Application to Image Compression

This paper proposes a near-lossless data hiding (DH) method for images that can improve the image compression efficiency. The proposed method first quantizes an image in accordance with a user-given maximum allowed error. It then embeds data into the quantized image based on histogram shifting (HS). Even though this method uses HS-based DH, which requires memorizing the shifted bins for data extraction, the method, under some conditions, takes data out from the marked image by just applying re-quantization, as in least significant bitplane (LSB) substitution-based DH. So the proposed method is based on the unification of HS- and LSB substitution-based DH. In the method, lossless compression of the marked image can achieve better compression efficiency than lossy compression of the original image. Experimental results show the effectiveness of the proposed method.

Masaaki Fujiyoshi, Hitoshi Kiya
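To make the remark about memorizing shifted bins concrete, here is a minimal sketch of classic histogram-shifting data hiding on a flat list of 8-bit pixel values. It shows only the conventional HS scheme, in which the extractor must be given the peak and zero bins; it is not the near-lossless method proposed in the abstract above, and the function names are hypothetical.

```python
from collections import Counter


def hs_embed(pixels, bits):
    """Embed bits into pixels by shifting the histogram right of its peak."""
    hist = Counter(pixels)
    peak = max(hist, key=hist.get)
    # First empty bin above the peak; the bins in between get shifted by +1.
    zero = next((v for v in range(peak + 1, 256) if hist[v] == 0), None)
    if zero is None:
        raise ValueError("no empty bin above the peak")
    if len(bits) > hist[peak]:
        raise ValueError("payload exceeds capacity")
    payload = iter(bits)
    marked = []
    for v in pixels:
        if peak < v < zero:
            marked.append(v + 1)                 # make room next to the peak
        elif v == peak:
            marked.append(v + next(payload, 0))  # bit 1 moves to peak + 1
        else:
            marked.append(v)
    return marked, peak, zero


def hs_extract(marked, peak, zero, n_bits):
    """Recover the bits and the original pixels; needs the side information."""
    bits, restored = [], []
    for v in marked:
        if v == peak:
            bits.append(0)
            restored.append(peak)
        elif v == peak + 1:
            bits.append(1)
            restored.append(peak)
        elif peak + 1 < v <= zero:
            restored.append(v - 1)               # undo the shift
        else:
            restored.append(v)
    return bits[:n_bits], restored


original = [3, 3, 3, 4, 5, 3, 7, 2]
message = [1, 0, 1, 1]
marked, peak, zero = hs_embed(original, message)
recovered_bits, recovered_pixels = hs_extract(marked, peak, zero, len(message))
```

The pair `(peak, zero)` is the side information that has to be stored or transmitted for extraction in this conventional scheme.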

Hierarchical Learning for Large-Scale Image Classification via CNN and Maximum Confidence Path

We propose a framework to integrate large scale image data visualization with image classification. A convolutional neural network (CNN) is used to learn the feature vector for an image. A fast algorithm is developed for inter-class similarity measurement. Spectral clustering is implemented to construct a hierarchical visual tree. Instead of flat classification, a hierarchical classification is designed according to the visual tree, which is transformed into a path search problem. The path with the maximum joint probability is the final solution. Experimental results on the ILSVRC2010 dataset demonstrate that our method achieves the highest top-1 and top-5 classification accuracy in comparison with 6 state-of-the-art methods.

Chang Lu, Yanyun Qu, Cuiting Shi, Jianping Fan, Yang Wu, Hanzi Wang
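The idea of treating hierarchical classification as a path search can be sketched with a small exhaustive search over a tree. This is an illustrative assumption about the setup, not the authors' algorithm: the tree is a hypothetical dictionary mapping each node to its children and their conditional probabilities, and the joint probability of a path is taken to be the product of the probabilities along it.

```python
def max_confidence_path(tree, node, prob=1.0):
    """Return (joint probability, path) of the best root-to-leaf path."""
    children = tree.get(node, {})
    if not children:
        return prob, [node]
    best_prob, best_path = -1.0, []
    for child, p in children.items():
        cand_prob, cand_path = max_confidence_path(tree, child, prob * p)
        if cand_prob > best_prob:
            best_prob, best_path = cand_prob, [node] + cand_path
    return best_prob, best_path


# Hypothetical visual tree with per-node classifier outputs for one image.
visual_tree = {
    "root": {"animal": 0.6, "vehicle": 0.4},
    "animal": {"cat": 0.5, "dog": 0.5},
    "vehicle": {"car": 0.9, "bus": 0.1},
}
path_prob, path = max_confidence_path(visual_tree, "root")
```

In this toy tree, a greedy top-down choice would follow "animal" (0.6) and end with a joint probability of 0.3, whereas the search over full paths selects "vehicle" then "car" with 0.36, which shows why the whole path is scored rather than each level in isolation.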

Single Camera-Based Depth Estimation and Improved Continuously Adaptive Mean Shift Algorithm for Tracking Occluded Objects

This paper presents a novel object tracking algorithm that can efficiently overcome the object occlusion problem by combining depth and color probability distribution information. The proposed algorithm consists of (i) a depth estimation step using a color shift model (CSM)-based single camera, and (ii) a step that combines depth and color probability distributions using the continuously adaptive mean shift (CAMSHIFT) algorithm, which is an adaptive version of the existing mean shift algorithm. In spite of its optimum object segmentation ability, the CAMSHIFT algorithm may fail in tracking if multiple occluded objects have similar colors. In order to overcome this limitation, the proposed algorithm combines depth and color probability distribution information. The experimental results show that the proposed algorithm runs in real time and successfully tracks occluded objects that cannot be tracked by the traditional CAMSHIFT algorithm, and that its depth estimation accuracy is about 97.5 %.

Jaehyun Im, Jaehoon Jung, Joonki Paik

A Flexible Programmable Camera Control and Data Acquisition Hardware Platform

There are a number of standard video sequences produced by some organizations and companies, which simulate different scenes and conditions in order to properly test video coding methods. With the rapid development of sensing and imaging technologies, more information can be employed to assist video coding. However, the existing sequences cannot satisfy the recently proposed video coding methods. Moreover, to capture 3D video, multiple cameras, including texture and depth cameras, are used together. Thus, a synchronization mechanism is needed to control different cameras. In addition, some adverse conditions, such as light source flicker, affect the quality of the video. In order to acquire proper video sequences, we propose a flexible control platform for texture and depth cameras, which addresses the above issues. We have applied the platform to test several new methods proposed recently. With the assistance of this platform, we have achieved considerable positive results.

Fei Cheng, Jimin Xiao, Tammam Tillo, Yao Zhao

New Media Representation and Transmission Technologies for Emerging UHD Services


Comparison of Real-time Streaming Performance Between UDP and TCP Based Delivery Over LTE

Video traffic has become dominant in mobile networks as users increasingly use mobile networks for instant and convenient video services. Currently, TCP and UDP are the major transport layer protocols for video streaming services. This paper compares the performance of these protocols over LTE in terms of Maximum Available Bitrate (MAB). The results provide a key clue for selecting between DASH (Dynamic Adaptive Streaming over HTTP) and MMT (MPEG Media Transport).

Sookyung Park, Kyeongwon Kim, Doug Young Suh

Video Streaming for Multi-cloud Game

As the game industry develops, the required hardware performance rises steadily. High-end games cannot be enjoyed on an outmoded computer or smart mobile device. As a way to solve this problem, studies on cloud gaming are being actively conducted. However, the latency of cloud gaming is greater than that of a regular game, and it interferes with smooth game play. Multi-user games are also unfair when latency differs between users. Thus we propose a new cloud gaming system. It reduces the processing time in the cloud using distributed processing and equalizes the latency of each user through appropriate distribution. In addition, it copes with loss by using FEC. Accordingly, the proposed system enables fair, good-quality gaming on a thin client with lower-performance hardware.

Yoonseok Heo, Taeseop Kim, Doug Young Suh

Performance Analysis of Scaler SoC for 4K Video Signal

This paper considers some issues in implementing a scaler SoC. Because the size and power of the SoC are constrained, the number of bits used to represent the coefficients of the scaler kernel and the LPF should be limited. In addition, the interpolation positions should be located at quantized phases. We analyze the effects of the various constraints on the performance of the scaling system. The simulation results provide guidance for implementing the scaler SoC.

Soon-Jin Lee, Jong-Ki Han

Deblocking Filter for Depth Videos in 3D Video Coding Extension of HEVC

This paper presents a modified deblocking filter for depth video coding in the 3D video coding extension of High Efficiency Video Coding (3D-HEVC). The conventional 3D-HEVC employs a deblocking filter and sample adaptive offset (SAO) in the loop filter, in which both tools are applied to color video coding only. Nevertheless, the deblocking filter can smooth out blocking artifacts existing in coded depth videos, thereby improving the coding efficiency. In this paper, we modify the original deblocking filter of HEVC and apply it to depth video coding. The goal is to enhance the depth video coding efficiency. The modified filter is executed when a set of conditions regarding the boundary strength are satisfied. In addition, the impulse response is altered for more smoothing between block boundaries. Experimental results show a 5.2 % BD-rate reduction in depth video coding in comparison to the conventional 3D-HEVC.

Yunseok Song, Yo-Sung Ho

Sparcity-Induced Structured Transform in Intra Video Coding for Screen Contents

In this paper, we propose a novel transform method based on learning a sparse model of the residue in intra video coding. The proposed method considers the generation of transformed coefficients that are locally grouped in a block as well as the sparsity of the coefficients, which can improve the coding efficiency for screen content videos such as computer-synthesized videos. The proposed method trains the transform online using the residue, to be applied to the current frame. Experiments demonstrate that the proposed method improves the coding gain over the HEVC Range Extension standard.

Je-Won Kang

SS Poster


High-Speed Periodic Motion Reconstruction Using an Off-the-shelf Camera with Compensation for Rolling Shutter Effect

In recent years, high-speed signal reconstruction with sub-Nyquist sampling has attracted the attention of researchers in the signal processing field. Nonetheless, such methods have been limited either by the need to utilize multiple cameras or by their reliance on newly designed imaging hardware. In this paper, we propose a high-speed periodic motion reconstruction method based on randomly delaying the camera exposure. This allows it to utilize a conventional off-the-shelf camera. In addition, the proposed method compensates for the rolling shutter effect, which is inevitable if the camera’s image sensor is made of complementary metal-oxide semiconductor (CMOS), while reconstructing the high-speed periodic motion. Exhaustive and comparative experiments have been conducted to validate the proposed method, which showed promising performance in terms of reconstruction error and effective compensation of the rolling shutter effect.

Jeong-Jik Seo, Wissam J. Baddar, Hyung-Il Kim, Yong Man Ro

Robust Feature Extraction for Shift and Direction Invariant Action Recognition

We propose a novel feature based on optical flow for action recognition. The feature is quite simple and has a much lower computational load than existing features for action recognition algorithms. It is invariant to the scale, time duration, and direction of an action. Since raw optical flow is noisy on the background, several methods for noise reduction are presented. Firstly, we bundle a fixed number of frames into a block and take the median value of the optical flow (median flow). Secondly, we normalize the histogram according to the total magnitude. Lastly, we apply low-pass filtering in the frequency domain. Converting from the time domain to the frequency domain with the Fourier transform makes the feature invariant to time shifts of the action. While constructing the histogram of optical flow, we align the direction of an action so that we obtain a direction-invariant action representation. Experiments on a benchmark action dataset (KTH) and on our own smart-class dataset show that the proposed method performs comparably to state-of-the-art approaches and is applicable to real environments.

Younghan Jeon, Tushar Sandhan, Jin Young Choi
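The time-shift invariance claimed in the abstract above can be illustrated with a minimal sketch. It assumes the per-block median-flow magnitudes have already been collected into a plain list, and it uses a naive discrete Fourier transform so that only the standard library is needed. The function names and the `keep` cutoff are hypothetical and not taken from the paper, and the invariance shown holds exactly only for circular shifts.

```python
import cmath
import math


def dft_magnitude(signal):
    """Magnitude of the discrete Fourier transform of a real sequence."""
    n = len(signal)
    mags = []
    for k in range(n):
        acc = 0j
        for t, x in enumerate(signal):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags


def low_pass(mags, keep):
    """Keep only the `keep` lowest-frequency magnitudes (hypothetical cutoff)."""
    return mags[:keep]


def circular_shift(signal, s):
    """Shift a sequence to the right by s samples, wrapping around."""
    s %= len(signal)
    return signal[-s:] + signal[:-s]


# The same action starting three blocks later yields the same feature.
flow = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
feature_a = low_pass(dft_magnitude(flow), keep=4)
feature_b = low_pass(dft_magnitude(circular_shift(flow, 3)), keep=4)
```

A time shift only changes the phase of each Fourier coefficient, so discarding the phase removes the dependence on when the action starts.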

Real-Time Human Action Recognition Using CNN Over Temporal Images for Static Video Surveillance Cameras

This paper proposes a real-time human action recognition approach for static video surveillance systems. This approach predicts human actions using temporal images and convolutional neural networks (CNN). CNN is a type of deep learning model that can automatically learn features from training videos. Although state-of-the-art methods have shown high accuracy, they consume a lot of computational resources. Another problem is that many methods assume exact knowledge of human positions. Moreover, most current methods build complex handcrafted features for specific classifiers. Therefore, these kinds of methods are difficult to apply in real-world applications. In this paper, a novel CNN model based on temporal images and a hierarchical action structure is developed for real-time human action recognition. The hierarchical action structure includes three levels: action layer, motion layer, and posture layer. The top layer represents subtle actions; the bottom layer represents posture. Each layer contains one CNN, which means that this model has three CNNs working together; layers are combined to represent many different kinds of action with a large degree of freedom. The developed approach was implemented and achieved superior performance on the ICVL action dataset; the algorithm can run at around 20 frames per second.

Cheng-Bin Jin, Shengzhe Li, Trung Dung Do, Hakil Kim

Scalable Tamper Detection and Localization Scheme for JPEG2000 Codestreams

We propose an efficient tamper detection scheme for JPEG2000 codestreams. The proposed scheme embeds information while maintaining the scalability function of JPEG2000 and can detect tampered layers or tampered resolution levels for each decoded image quality level. Marker codes that delimit a JPEG2000 codestream must not be generated in the body data that possesses the image information. We can prevent new marker codes from being generated when embedding information into the JPEG2000 codestream. The experimental results show that our scheme can preserve the high quality of the embedded images.

Takeshi Ogasawara, Shoko Imaizumi, Naokazu Aoki

Developing a Visual Stopping Criterion for Image Mosaicing Using Invariant Color Histograms

For over a decade, image mosaicing techniques have been widely used in various applications, e.g., generating wide field-of-view images or 2D optical maps in remote sensing and medical imaging. In general, image mosaicing combines a sequence of images into a single image referred to as a mosaic image. The process is roughly divided into iterative image registration and blending. Unfortunately, the computational cost of iterative image registration increases exponentially for a large number of images. As a result, mosaicing of a large-scale scene is often prohibitive for real-time applications. In this paper, we introduce an effective visual criterion to reduce the number of image mosaicing iterations while retaining the visual quality of the mosaic. We analyze the change in invariant color histograms of the mosaic image over iterations and use it to determine a termination condition. In experimental evaluations on four different datasets, we significantly improve the computational efficiency of the mosaicing algorithm.

Armagan Elibol, Hyunjung Shim
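A histogram-based termination condition of the kind described in the abstract above might look like the following sketch. It is an illustration under stated assumptions, not the authors' criterion: a plain normalized histogram stands in for the invariant color histogram, the change is measured with an L1 distance, and the function names and the `threshold` value are hypothetical.

```python
def normalized_histogram(values, bins, lo=0.0, hi=1.0):
    """Histogram of values in [lo, hi], normalized to sum to 1."""
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * bins)
        idx = max(0, min(idx, bins - 1))  # clamp out-of-range values
        counts[idx] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def histogram_change(h_prev, h_curr):
    """L1 distance between two normalized histograms (0 means identical)."""
    return sum(abs(a - b) for a, b in zip(h_prev, h_curr))


def should_stop(h_prev, h_curr, threshold=0.01):
    """Stop iterating once the mosaic histogram has stabilized."""
    return histogram_change(h_prev, h_curr) < threshold


# Hypothetical pixel intensities of the mosaic after two successive iterations.
h1 = normalized_histogram([0.1, 0.1, 0.9, 0.9], bins=2)
h2 = normalized_histogram([0.1, 0.9], bins=2)
h3 = normalized_histogram([0.1, 0.1, 0.1, 0.9], bins=2)
```

Here `h1` and `h2` have the same shape, so iteration would stop, whereas the change from `h1` to `h3` is large enough to continue.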

Intelligent Reconstruction and Assembling of Pipeline from Point Cloud Data in Smart Plant 3D

The laser-scanned data of existing industrial pipeline plants are not only extremely large but also intricately entwined like a net. Users must identify the 3D points corresponding to each pipeline to be modelled in these huge laser-scanned data sets. To identify those points accurately, users need some knowledge of the direction and design of the pipelines. In addition, manually identifying each pipeline in large and intricate scanned data is a nearly infeasible, time-consuming, and cumbersome process. To simplify the reconstruction process, an intelligent method for reconstructing and assembling pipelines from point cloud data in Smart Plant 3D (SP3D) is proposed. The presented results show that the proposed method contributes to the automation of 3D pipeline modelling.

Pavitra Holi, Seong Sill Park, Ashok Kumar Patil, G. Ajay Kumar, Young Ho Chai

A Rotational Invariant Non-local Mean

Image restoration and noise reduction are used to improve image quality, and robust, high-performance algorithms continue to be developed for denoising problems in image processing. Every approach has its own limitations and is more practicable under specific conditions. This paper considers rotational transformations within the non-local means denoising algorithm. The non-local means algorithm uses repetitive patterns in the image for noise reduction. Therefore, affine transformations extend the search space of the problem and lead to higher-quality results.

Rassulzhan Poltayev, Byung-Woo Hong
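The idea of enlarging the non-local means search space with rotations can be sketched as follows. This is a simplified illustration that assumes only rotations by multiples of 90 degrees on small square patches; the paper's actual transformation set and weighting are not given in the abstract. The function names and the filtering parameter `h` are hypothetical.

```python
import math


def rotate90(patch):
    """Rotate a square patch (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def squared_distance(p, q):
    """Sum of squared pixel differences between two equally sized patches."""
    return sum((a - b) ** 2 for rp, rq in zip(p, q) for a, b in zip(rp, rq))


def rotation_invariant_distance(p, q):
    """Smallest squared distance between p and the four rotations of q."""
    best = squared_distance(p, q)
    rotated = q
    for _ in range(3):
        rotated = rotate90(rotated)
        best = min(best, squared_distance(p, rotated))
    return best


def nlm_weight(p, q, h):
    """Non-local means weight exp(-d / h^2) with the rotation-aware distance."""
    return math.exp(-rotation_invariant_distance(p, q) / (h * h))


patch = [[1, 2], [3, 4]]
rotated_copy = rotate90(patch)
```

With the plain distance, `patch` and `rotated_copy` look dissimilar; with the rotation-aware distance they match exactly, so the rotated copy contributes with full weight to the denoised estimate.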

Adaptive Layered Video Transmission with Channel Characteristics

In wireless video transmission, layered video transmission combined with layered video coding can gracefully accommodate the receivers' heterogeneity. In this paper, we propose a new layered video transmission scheme based on the wireless channel characteristics of the orthogonal frequency division multiplexing (OFDM) physical (PHY) layer, leading to an adaptive layered video transmission (ALAVIT). In our scheme, scalable video coding (SVC) is exploited to generate the layered video bit-streams, and the resulting base layer (BL) and enhancement layer (EL) bits are modulated differently to obtain their individual symbols. According to the estimated channel characteristics described by the H parameters of a connected wireless receiver, subcarriers with good channel quality and more power are allocated to BL symbols to protect these important bits. Compared with state-of-the-art PHY layer techniques such as s-mod and MixCast, our ALAVIT scheme provides better performance.

Fan Zhang, Anhong Wang, Xiaoli Ma, Bing Zeng
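The allocation rule in the abstract above, giving the best subcarriers to base-layer symbols, can be sketched in a few lines. The sketch assumes one estimated channel gain per subcarrier and ignores power allocation and modulation; the function name and arguments are hypothetical.

```python
def allocate_subcarriers(channel_gains, num_bl_symbols):
    """Assign the strongest subcarriers to base-layer (BL) symbols.

    channel_gains: estimated channel magnitude per subcarrier index.
    Returns (bl_indices, el_indices), each sorted by subcarrier index.
    """
    if num_bl_symbols > len(channel_gains):
        raise ValueError("more BL symbols than subcarriers")
    # Rank subcarrier indices from best to worst channel quality.
    order = sorted(range(len(channel_gains)),
                   key=lambda i: channel_gains[i], reverse=True)
    bl = sorted(order[:num_bl_symbols])
    el = sorted(order[num_bl_symbols:])
    return bl, el


gains = [0.2, 0.9, 0.5, 0.7]
bl_indices, el_indices = allocate_subcarriers(gains, 2)
```

In this example subcarriers 1 and 3 carry BL symbols and the remaining two carry enhancement-layer symbols.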

An Accurate and Efficient Nonlinear Depth Quantization Scheme

As is known, depth information exists as floating-point distance data when first captured by a depth sensor. For storage and transmission, it must be quantized into several depth layers. Generally, the efficiency of depth quantization and the accuracy of view synthesis are mutually contradictory. In fact, because 3D-warping involves a rounding calculation during view synthesis, depth changes within a certain range do not cause a different warped position. This phenomenon provides a good way to quantize depth data more efficiently. However, the 3D-warping rounding calculation can also introduce additional view synthesis distortion if the warped-interval and the image-resolution-interval are misaligned. Hence, to achieve efficient depth quantization without introducing additional view synthesis distortion, an accurate and efficient nonlinear-depth quantization scheme (AE-NDQ) is presented, in which the alignment between the warped-interval and the image-resolution-interval is taken into consideration during depth quantization. Experimental results show that, compared with the efficient nonlinear-depth-quantization (E-NDQ), AE-NDQ needs almost the same number of bits to represent the depth layers but is more accurate in view synthesis. Compared with the traditional 8-bit nonlinear-depth-quantization (NDQ), AE-NDQ needs fewer bits to represent the depth layers while achieving the same accuracy of the synthesized view.

Jian Jin, Yao Zhao, Chunyu Lin, Anhong Wang
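The rounding effect that the abstract above exploits can be made concrete with a small sketch. It assumes a rectified, parallel camera setup in which disparity equals focal length times baseline divided by depth, and a particular sign convention for the warp; neither is stated in the abstract. The function names and the numeric values are hypothetical.

```python
import math


def disparity(depth, focal_px, baseline):
    """Disparity in pixels for a parallel camera pair (assumed model)."""
    return focal_px * baseline / depth


def warped_column(x, depth, focal_px, baseline):
    """Integer target column after 3D warping with round-to-nearest."""
    return math.floor(x - disparity(depth, focal_px, baseline) + 0.5)


def depth_interval(d, focal_px, baseline):
    """Approximate (near, far) depth bounds that round to integer disparity d."""
    fb = focal_px * baseline
    return fb / (d + 0.5), fb / (d - 0.5)


focal_px, baseline = 1000.0, 0.1
col_a = warped_column(100, 2.00, focal_px, baseline)
col_b = warped_column(100, 2.01, focal_px, baseline)
col_c = warped_column(100, 2.03, focal_px, baseline)
near, far = depth_interval(50, focal_px, baseline)
```

Depths 2.00 and 2.01 land on the same column, so a quantizer may merge them into one depth layer without changing the synthesized view, while depth 2.03 falls outside the interval and lands one column further.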

Synthesis-Aware Region-Based 3D Video Coding

In a depth-image-based rendering (DIBR)-based 3D video system, the original 3D video is commonly compressed from the viewpoint of the video itself, with little attention paid to its contribution to the synthesized virtual view. This paper first proposes a method to divide the original video into different regions according to their contribution to the synthesized view and then proposes to measure the regions using compressive sensing with different measurement rates, leading to a synthesis-aware region-based 3D video coding approach. Experimental results show that our approach achieves better synthesized quality under the same equivalent measurement rate. Our approach is suitable for applications in which the virtual view is more important than the original views.

Zhiwei Xing, Anhong Wang, Jian Jin, Yingchun Wu

A Paradigm for Dynamic Adaptive Streaming over HTTP for Multi-view Video

HTTP-based delivery for Video on Demand (VoD) has been gaining popularity in recent years. With the recently proposed Dynamic Adaptive Streaming over HTTP (DASH), video clients may dynamically adapt the requested video quality and bitrate to match their current download rate. To avoid playback interruption, DASH clients attempt to keep the buffer occupancy above a certain minimum level. This mechanism works well for single-view video streaming. In multi-view video streaming over DASH, the user initiates view switching, and only one view of the multi-view content is played by a DASH client at a given time. For such applications, how to exploit the buffered video data during the view switching process is an open problem. In this paper, we propose two fast and efficient view switching approaches in the paradigm of DASH systems, which fully exploit the already buffered video data. The advantages of the proposed approaches are twofold. First, the view switching delay is short. Second, the rate-distortion performance during the view switching period is high, i.e., less requested data is needed to achieve comparable video playback quality. The experimental results demonstrate the effectiveness of the proposed method.

Jimin Xiao, Miska M. Hannuksela, Tammam Tillo, Moncef Gabbouj
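The single-view adaptation mechanism summarized in the abstract above, matching bitrate to download rate while protecting a minimum buffer level, might be sketched as below. This is a generic illustration of DASH-style rate selection, not the view switching approaches proposed in the paper. The function name, arguments, and bitrate ladder are hypothetical.

```python
def choose_bitrate(available_kbps, throughput_kbps, buffer_s, min_buffer_s):
    """Pick a representation bitrate for the next segment request."""
    ladder = sorted(available_kbps)
    # Buffer below the safety level: fall back to the lowest bitrate.
    if buffer_s < min_buffer_s:
        return ladder[0]
    # Otherwise take the highest bitrate the measured throughput sustains.
    feasible = [b for b in ladder if b <= throughput_kbps]
    return feasible[-1] if feasible else ladder[0]


ladder = [500, 1000, 2000]
normal = choose_bitrate(ladder, throughput_kbps=1500, buffer_s=10, min_buffer_s=4)
draining = choose_bitrate(ladder, throughput_kbps=1500, buffer_s=2, min_buffer_s=4)
```

With a healthy buffer the client requests 1000 kbps; once the buffer drops below the minimum level it requests 500 kbps to avoid a playback interruption.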

Adaptive Model for Background Extraction Using Depth Map

Depth maps have attracted great attention in image and video processing in recent years. A depth map provides one more dimension of information about an image besides color (intensity). Depth is independent of color, which is an advantage when extracting background covered by objects with irregular repetitive motions, e.g. rotation. A new algorithm for background extraction using Gaussian Mixture Models (GMM) combined with depth maps is presented. A per-pixel mixture model and a single Gaussian model are used to model the recent observations in color and depth space, respectively. We also incorporate a color-depth consistency check mechanism into the algorithm to improve accuracy. Our results show much greater robustness than the prior state-of-the-art method in handling challenging scenes.

Boyuan Sun, Tammam Tillo, Ming Xu

An Efficient Partition Scheme for Depth-Based Block Partitioning in 3D-HEVC

In the development of a 3D video extension of the High Efficiency Video Coding (HEVC) standard, namely 3D-HEVC, Depth-based Block Partitioning (DBBP) is employed to code texture videos in dependent views by utilizing coded depth information of an independent view. With DBBP, a proper partition mode is determined from the coded depth information, which divides the current texture block into two regions and thereafter allows for fine-grained motion compensation of foreground and background separately. In DBBP, the original partition method consists of two steps, i.e. threshold calculation and matched filtering based on down-sampling, which is relatively complex and redundant with the segment mask generation process. Accordingly, a simple yet more effective partition scheme for DBBP coding, based on the available binary segment mask, is proposed in this paper. While reducing computational complexity significantly, the proposed method also demonstrates bitrate savings for all the dependent texture views and synthesized views under the common test conditions (CTC) configuration specified for 3D-HEVC.

Yuhua Zhang, Ce Zhu, Yongbing Lin, Jianhua Zheng, Yong Wang

Image Classification with Local Linear Decoding and Global Multi-feature Fusion

Recent years have witnessed a surge of interest in image classification. The combination of deep neural networks with feature extraction has improved image classification performance dramatically. To improve the performance further, this paper proposes an image classification algorithm based on a deep neural network consisting of a linear decoder and a softmax regression model. First, we learn features of small image patches with the linear decoder; second, we convolve and pool the large images with the learned features to obtain the pooled convolved features; third, we use the softmax regression model to learn from these features for image classification. Experimental results are encouraging and demonstrate the validity and superiority of our method.

Zhang Hong, Wu Ping

Hashing with Inductive Supervised Learning

Recent years have witnessed the effectiveness and efficiency of learning-based hashing methods, which generate short binary codes preserving the Euclidean similarity in the original high-dimensional space. However, because of their complexity and out-of-sample problems, most methods are not appropriate for embedding large-scale datasets. In this paper, we propose a new supervised hashing method to generate class-specific hash codes, which uses an inductive process based on the Inductive Manifold Hashing (IMH) model and leverages supervised information in hash code generation to address these difficulties and boost the hashing quality. Experiments show that this method achieves excellent image classification and retrieval performance on a large-scale multimedia dataset with very short binary codes.

Mingxing Zhang, Fumin Shen, Hanwang Zhang, Ning Xie, Wankou Yang

Graph Based Visualization of Large Scale Microblog Data

Visualization is an important but difficult way to make sense of large-scale datasets. In this paper, we propose a graph-based method to visualize microblog data. In our scheme, the graph is constructed using the content similarities between data items, which is more robust than the widely used data relationships. Given a target dataset, we first adopt a duplicate removal strategy to reduce the size of the data, and a subset is randomly sampled for visualization. Then a multilevel graph layout with a heat map is applied to generate an interactive interface that allows users to pan and scale the layout. In this way, different granularities of summarization information can be immediately presented to users when a certain area is specified in the interface; meanwhile, more detailed knowledge of the selected area can be shown in nearly real time by leveraging a hash-based microblog retrieval approach. Experiments are conducted on a Brand-Social-Net dataset that contains 3,000,000 microblogs, and the experimental results show that, with our visualization method, some meaningful patterns in the dataset can be found easily.

Yue Guan, Kaidi Meng, Haojie Li

Boosting Accuracy of Attribute Prediction via SVD and NMF of Instance-Attribute Matrix

Attribute-based methods for image classification have received much attention in recent years due to the high-level or human-specified nature of attributes. Given a new image, attribute-based methods can predict its category by exploiting the attribute representation of the given image. However, the foundation of attribute-based methods is predicting attributes precisely, which is still a difficult problem in real-world applications. Therefore, in this paper, we propose an Attribute Prediction boosting framework with Matrix Factorization techniques (APMF) to boost the accuracy of attribute prediction. APMF explores the potential relationships between instances and attributes by utilizing the singular value decomposition (SVD) and non-negative matrix factorization (NMF). A series of experiments shows that our APMF achieves better attribute prediction accuracy than state-of-the-art methods.

Donghui Li, Zhuo Su, Hanhui Li, Xiaonan Luo

Fatigue Detection Based on Fast Facial Feature Analysis

A non-intrusive fatigue detection method based on fast facial feature analysis is proposed in this paper. Firstly, the facial landmarks are obtained by the supervised descent method, which automatically tracks the faces and fits the facial appearance very fast and accurately. It covers facial landmarks over a wide range of human head rotations. Then the aspect ratios of eyes and mouth are computed with the coordinates of the detected facial feature points. We interpolate and smooth those aspect ratios by a forgetting factor to deal with the occasionally missing detection of facial features. Thirdly, the degrees of eye closure and mouth opening are evaluated with two Gaussian based membership functions. Finally, the driver fatigue state is inferred by several IF-Then logical relationships by evaluating the duration of eye closure and mouth opening. Experiments are conducted on 41 videos to show the effectiveness of the proposed method.

Ruijiao Zheng, Chunna Tian, Haiyang Li, Minglangjun Li, Wei Wei
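The middle steps of the pipeline in the abstract above (aspect ratio, smoothing with a forgetting factor, Gaussian membership, duration rule) can be sketched as follows. The sketch assumes landmark detection has already produced eye width and height per frame; all function names, the membership parameters, and the frame-count threshold are hypothetical, since the abstract does not give the authors' values.

```python
import math


def aspect_ratio(width, height):
    """Height-to-width ratio of an eye or mouth bounding shape."""
    return height / width


def smooth(previous, measurement, forgetting=0.8):
    """Exponential smoothing; a missed detection (None) keeps the old value."""
    if measurement is None:
        return previous
    if previous is None:
        return measurement
    return forgetting * previous + (1 - forgetting) * measurement


def gaussian_membership(x, center, sigma):
    """Degree in [0, 1] to which x belongs to the fuzzy set around center."""
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))


def is_fatigued(eye_ratios, closed_center=0.1, sigma=0.05, min_frames=3):
    """IF the eye stays 'closed' for min_frames consecutive frames THEN fatigued."""
    run = 0
    for ratio in eye_ratios:
        if gaussian_membership(ratio, closed_center, sigma) > 0.5:
            run += 1
            if run >= min_frames:
                return True
        else:
            run = 0
    return False


blink = is_fatigued([0.3, 0.1, 0.3, 0.1, 0.3])
long_closure = is_fatigued([0.3, 0.1, 0.1, 0.1, 0.3])
```

A short blink does not trigger the rule, while a closure lasting three consecutive frames does.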

A Packet-Layer Model with Content Characteristics for Video Quality Assessment of IPTV

Because it is lightweight and requires no access to the media signal, the packet-layer video quality assessment model is highly preferable and widely used in non-intrusive, in-service network applications. In this paper, a novel packet-layer model is proposed to monitor the video quality of Internet protocol television (IPTV). Apart from predicting the coding distortion caused by compression, the model features a novel loss-related scheme to predict the transmission distortion introduced by packet loss, based on the structural decomposition of the video sequence, the development of a temporal sensitivity function (TSF) simulating human visual perception, and the scalable incorporation of content characteristics. Experimental results demonstrate the performance improvement over existing models in cross-validation on various databases.

Qian Zhang, Lin Ma, Fan Zhang, Long Xu

Frame Rate and Perceptual Quality for HD Video

The frame rate (FR) of a video plays an important role in perceptual video quality. Most studies of the effect of FR on video quality have focused on low frame rates, e.g. less than 30 frames per second (fps), at low resolutions such as CIF or QCIF. As video frame rates and resolutions advance, we reconsider this issue and investigate the relationship between frame rate and perceptual video quality at high frame rates and high resolution. In this paper, we discuss the impact of frame rate on the perceptual quality of High Definition (HD) video, considering high frame rates (up to 120 fps). Firstly, we design and conduct a subjective experiment to construct the video dataset, which includes video sequences at different frame rates and the corresponding mean opinion scores (MOS) that represent the perceptual video quality. Based on the MOS results, we analyze how perceptual video quality changes as the frame rate varies among different video sequences and report some meaningful findings. The video dataset will be made publicly available. We believe that this study will enrich video quality assessment and benefit the development of the high frame rate, high definition video business.

Yutao Liu, Guangtao Zhai, Debin Zhao, Xianming Liu

No-Reference Image Quality Assessment Based on Singular Value Decomposition Without Learning

Recent no-reference image quality assessment (NR-IQA) methods take advantage of machine learning techniques. However, machine learning approaches need a large number of human-scored images and cause database dependency. In this paper, we propose a simple NR-IQA method that can estimate the quality of distorted images without learning while producing performance comparable to learning-based approaches. We employ singular value decomposition (SVD) since we have observed that singular values are commonly affected by various distortions. In detail, the decreasing rate of the singular values is highly correlated with the degree of distortion regardless of its type. Based on this observation, our approach uses the decreasing rate of singular values to model a simple and reliable NR-IQA method. Experimental results show that the proposed method has reasonably high correlation with human scores while offering simplicity and database independence.

Jonghee Kim, Hyunjun Eun, Changick Kim

An Improved Brain MRI Segmentation Method Based on Scale-Space Theory and Expectation Maximization Algorithm

Expectation Maximization (EM) algorithm is an unsupervised clustering algorithm, but initialization information especially the number of clusters is crucial to its performance. In this paper, a new MRI segmentation method based on scale-space theory and EM algorithm has been proposed. Firstly, gray level density of a brain MRI is estimated; secondly, the corresponding fingerprints which include initialization information for EM using scale-space theory are obtained; lastly, segmentation results are achieved by the initialized EM. During the initialization phase, restrictions of clustering component weights decrease the influence of noise or singular points. Brain MRI segmentation results indicate that our method can determine more reliable initialization information and achieve more accurate segmented tissues than other initialization methods.

Yuqing Song, Xiang Bao, Zhe Liu, Deqi Yuan, Minshan Song
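The final stage described in the abstract above, running EM from a given initialization, can be sketched for one-dimensional gray levels. The sketch takes the initial means, variances, and weights as inputs and does not implement the scale-space fingerprint analysis that the paper uses to obtain them. The function names, the variance floor `min_var`, and the toy data are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_1d(data, means, variances, weights, iterations=20, min_var=1e-6):
    """EM for a 1-D Gaussian mixture, started from a supplied initialization."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(iterations):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                 for j in range(k)]
            total = sum(p) or 1e-300
            resp.append([v / total for v in p])
        # M-step: re-estimate mean, variance, and weight per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # floor avoids a collapsed component
            weights[j] = nj / len(data)
    return means, variances, weights


gray_levels = [0.0, 0.2, -0.2, 5.0, 5.2, 4.8]
fitted_means, fitted_vars, fitted_weights = em_1d(
    gray_levels, means=[1.0, 4.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

The number of components is fixed by the length of the initial lists, which is why a reliable initialization, including the number of clusters, matters so much for the result.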

User-Driven Sports Video Customization System for Mobile Devices

In this paper, we implement a user-driven sports video customization system that aims to provide interesting video clips for mobile users according to their personal preferences. In this system, we use web-casting text to detect events in sports video and generate rich content descriptions. In particular, the video clock time on the scoreboard is recognized in order to align the events from the web-casting text with the sports video clips. The proposed extended hidden Markov model (extended-HMM) is shown to recognize the clock time precisely. To save mobile web traffic, an optional function based on the proposed event-based video encoding approach is embedded in the system. Compared with the traditional encoding approach, this approach provides a bitrate saving of about 34% while the quality of the frames that users are interested in remains the same. Both quantitative and qualitative experiments have been conducted to demonstrate the effectiveness of the proposed approaches.

Jian Qin, Jun Chen, Zheng Wang, Jiyang Zhang, Xinyuan Yu, Chunjie Zhang, Qi Zheng

Auditory Spatial Localization Studies with Different Stimuli

Many localization studies have tested the ability of auditory spatial localization in humans. However, broadband noise sources, such as Gaussian white noise and pink noise, were usually chosen as stimuli, and their spatial distribution was sparse. In this paper, an intuitive, systematic subjective evaluation method is proposed. Subjects used a laser pointer to indicate the perceived direction accurately. In addition to Gaussian white noise stimuli, the auditory localization performance was also tested with 1 kHz pure-tone stimuli ranging from −45° to +45° in the horizontal plane. In Experiment 1, the stimulus is Gaussian white noise distributed with a spacing of 10° to verify that the method is accurate and suitable for localization research. In Experiments 2 and 3, the speakers are placed closer to each other, and the stimuli are Gaussian white noise and a 1 kHz pure tone, respectively. All experimental results are presented and compared with other studies.

Tao Zhang, Shuting Sun, Chunjie Zhang

Multichannel Simplification Based on Deviation of Loudspeaker Positions

People hope to achieve a good impression of three-dimensional (3D) spatial sound with fewer loudspeakers at home. The existing method reduces the number of loudspeakers based on the minimum area enclosed by the loudspeakers while maintaining the sound pressure at the origin. However, it does not consider the distortion of the sound field within specified regions, even though people always use two ears to listen. In this paper, we exploit the fact that the distortion is affected by the deviation of positions among loudspeakers, and the selection of loudspeaker positions is redefined to obtain the weighting coefficients of each simplified multichannel system. For each simplified multichannel system, simulation results indicate that the distortion within the head region generated by the proposed method is no greater than that generated by the existing method. Subjective evaluation shows that the proposed method is slightly better in terms of sound localization.

Dengshi Li, Ruimin Hu, Xiaochen Wang, Shanshan Yang, Weiping Tu

Real-Time Understanding of Abnormal Crowd Behavior on Social Robots

Perceiving crowd behavior is very important for social cloud robots, which serve as guides at transportation junctions. In this paper, we propose a real-time algorithm based on background modeling to detect collective motions in complex scenes. The proposed algorithm not only avoids unstable foreground extraction but also has low computational complexity. To detect abnormal crowd escape, we refer to the definition of the moving energy of patches and use the energy histogram of the patches to effectively and accurately represent the crowd distribution information in crowd scenes. We have applied the proposed algorithm to real surveillance videos that contain aggregation and dispersion events. The experimental results show that the proposed algorithm significantly outperforms the state-of-the-art approach.

Dekun Hu, Binghao Meng, Shengyi Fan, Hong Cheng, Lu Yang, Yanli Ji

Sparse Representation Based Approach for RGB-D Hand Gesture Recognition

In this paper, we present a new algorithm for RGB-D hand gesture recognition using multi-attribute sparse representation enforced with group constraints. Firstly, the hand region is segmented from the background according to the depth information. Then, we process all gesture-performing hand region images with PCA to reduce the feature dimension. To obtain a more accurate and discriminative representation, a multi-attribute sparse representation is employed for hand gesture recognition from different view angles. The multiple attributes of a gesture image can be represented by individual binary matrices that indicate the group properties of each gesture. Then, these attribute matrices are incorporated into the formulation of the norm minimization in the sparse coding framework. Finally, the effectiveness and robustness of the proposed method are demonstrated through experiments on a public RGB-D hand gesture dataset.

Te-Feng Su, Chin-Yun Fan, Meng-Hsuan Lin, Shang-Hong Lai

Eye Gaze Correction for Video Conferencing Using Kinect v2

In video conferencing, eye gaze correction is beneficial for effective communication. Nowadays, video conferencing at home using a laptop is straightforward. In this paper, we propose an eye gaze correction method with a low-cost, simple setup using Kinect v2. Our method detects an ellipse that connects edge points of the face after identifying several feature points within the face using the Kinect v2 SDK. Then, we apply a 3D affine transform that allows eye gaze correction using camera space points acquired from depth information. Thus, in the preprocessing step, an ellipse model should be extracted when the user gazes at the camera and at the display, respectively. We also fill the holes caused by the affine transform using color inpainting. As a result, we produce a natural eye gaze-corrected image in real time.

Eunsang Ko, Woo-Seok Jang, Yo-Sung Ho

Temporally Consistence Depth Estimation from Stereo Video Sequences

In this paper, we propose a complexity reduction method for stereo matching. For complexity reduction in video sequences, we start by generating initial disparity information. Initial disparity information provides a restricted disparity search range when performing local stereo matching. As initial disparity information, we use four different kinds of input images, which fall into two main categories. In the first category, the results of an iterative stereo matching method, a motion prediction stereo matching method, and a global-matching-based stereo matching method are used as the initial value. In the second category, an image captured by a depth camera is used. By using these four types of disparity information, we can reduce the time consumed when performing local stereo matching on consecutive image sequences. The experimental results demonstrate the efficiency of the proposed method. With the proposed local stereo matching method, we can finish all procedures within a few seconds while preserving the quality of the disparity images.

Ji-Hun Mun, Yo-Sung Ho
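How an initial disparity restricts the search range of local stereo matching can be illustrated on a single scanline. This is a minimal sketch assuming rectified images and a sum-of-absolute-differences (SAD) cost; the abstract does not state which cost the authors use. The function names, `radius`, and `half_window` are hypothetical.

```python
def sad(left, right, x, d, half_window):
    """Sum of absolute differences between windows at x (left) and x - d (right)."""
    total = 0
    for k in range(-half_window, half_window + 1):
        total += abs(left[x + k] - right[x + k - d])
    return total


def restricted_search(left, right, x, initial_d, radius, half_window=1):
    """Best disparity within [initial_d - radius, initial_d + radius]."""
    lo = max(0, initial_d - radius)
    hi = initial_d + radius
    best_d, best_cost = None, None
    for d in range(lo, hi + 1):
        # Skip candidates whose window would leave the scanline.
        if x - half_window - d < 0 or x + half_window >= len(left):
            continue
        cost = sad(left, right, x, d, half_window)
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Toy scanlines: the left view is the right view shifted by two pixels.
right_line = [10, 20, 30, 40, 50, 60, 70, 80]
left_line = [0, 0, 10, 20, 30, 40, 50, 60]
found = restricted_search(left_line, right_line, x=5, initial_d=1, radius=2)
```

Only the candidates around the initial disparity are evaluated instead of the full disparity range, which is where the time saving comes from.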

A New Low-Complexity Error Concealment Method for Stereo Video Communication

The emerging popularity of three-dimensional content has driven advances in stereo video coding and transmission techniques. Under existing video compression standards, highly compressed stereo video bitstreams are susceptible to transmission errors. As a consequence of the unavoidable spatial, temporal, and inter-view error propagation, the display quality is severely degraded at the receiver side. In this paper, to deal with whole-frame loss of the right view in H.264-compressed stereo video bitstream transmission, a highly efficient perceptual error concealment algorithm with a fast concealment strategy is proposed. Firstly, the fast error concealment strategy is presented, and an efficient image quality index, gradient magnitude similarity deviation (GMSD), is extended to temporal GMSD (TGMSD) and inter-view GMSD (VGMSD) to evaluate the perceptual quality of stereo images. Then, according to the temporal correlation of video sequences, the macroblock (MB) prediction modes of the previous right-view frame are used as the MB prediction modes of the lost frame. Next, MBs in the frame preceding the lost frame are matched in both the temporal and inter-view domains to obtain pixel-based TGMSD and VGMSD maps. Finally, for each MB in the lost right-view frame, its TGMSD and VGMSD values are calculated and compared to obtain the MB prediction mode, after which either motion compensation or disparity compensation can be quickly selected and used to recover the lost right-view frame. Experimental results show that, compared with traditional error concealment algorithms, the proposed algorithm achieves superior subjective and objective quality, and compared with error concealment algorithms based on structural similarity, its error concealment time is reduced by about 40 % with almost the same perceptual quality.

Kesen Yan, Mei Yu, Zongju Peng, Feng Shao, Gangyi Jiang
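
The GMSD index that the abstract above builds on can be sketched as follows. This is a minimal illustration of the general GMSD formula (standard deviation of the per-pixel gradient magnitude similarity), not the authors' implementation. The constant `c = 170.0`, the Prewitt gradient, and the rule in `choose_compensation` (take the domain with the smaller deviation) are assumptions made for this example; the abstract only states that the TGMSD and VGMSD values are compared.

```python
import numpy as np

def gradient_magnitude(img):
    """Gradient magnitude from 3x3 Prewitt filters on an edge-padded image."""
    p = np.pad(np.asarray(img, dtype=np.float64), 1, mode="edge")
    # Horizontal response: right column minus left column of each 3x3 window.
    gx = (p[:-2, 2:] + p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - p[1:-1, :-2] - p[2:, :-2]) / 3.0
    # Vertical response: bottom row minus top row of each 3x3 window.
    gy = (p[2:, :-2] + p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - p[:-2, 1:-1] - p[:-2, 2:]) / 3.0
    return np.sqrt(gx ** 2 + gy ** 2)

def gmsd(ref, dist, c=170.0):
    """Standard deviation of the gradient magnitude similarity map.

    0.0 means the gradient structure is identical; larger means more distortion.
    The stabilising constant c is an assumed value for 8-bit images.
    """
    m_r = gradient_magnitude(ref)
    m_d = gradient_magnitude(dist)
    gms = (2.0 * m_r * m_d + c) / (m_r ** 2 + m_d ** 2 + c)
    return float(np.std(gms))

def choose_compensation(tgmsd, vgmsd):
    """Hypothetical decision rule: prefer the domain with the smaller deviation."""
    return "motion" if tgmsd <= vgmsd else "disparity"
```

Under these assumptions, a macroblock whose temporal match has the lower deviation would be concealed by motion compensation, and otherwise by disparity compensation.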

Hole Filling Algorithm Using Spatial-Temporal Background Depth Map for View Synthesis in Free View Point Television

This paper presents a new hole filling algorithm based on background modeling and texture synthesis. Depth information of hole areas is computed using a local background estimation method. To exploit the correlation between frames in the temporal domain, a background modeling method is used to extract reliable background reference scenes. The holes are then filled by combining the temporal background information and the estimated depth information. A modified exemplar-based inpainting method is used to fill the remaining hole pixels. Experimental results demonstrate the capability of the proposed algorithm.

Huu Noi Doan, Beomsu Kim, Min-Cheol Hong

Pattern Feature Detection for Camera Calibration Using Circular Sample

Camera calibration is the process of finding camera parameters. Camera parameters consist of intrinsic and extrinsic components, and they are essential for handling the three-dimensional (3-D) geometry of the cameras and the 3-D scene. However, camera calibration becomes a tedious process as the number of cameras and images increases, because the exact feature points must be indicated by hand. To eliminate the inconvenience of this manual operation, we propose a new pattern feature detection algorithm. The proposed method employs the Harris corner detector to find candidate pattern feature points in the images. Among them, we extract valid pattern feature points by using a circular sample. Test results show that this algorithm provides camera parameters comparable to those obtained by hand with the Matlab calibration toolbox while eliminating the burden of manual operation.

Dong-Won Shin, Yo-Sung Ho

Temporal Consistency Enhancement for Digital Holographic Video

Holography is an imaging technique to reconstruct wavefront information of the light scattered by real objects or a scene, allowing an observer to perceive three-dimensional (3D) images with the unassisted eye. Such 3D holographic images result from reproducing the intensity and phase of light by diffraction. This paper presents a temporal consistency enhancement for digital holographic video. The proposed temporal consistency enhancement method improves compression efficiency and visual quality by reducing the flickering artifact.

Kwan-Jung Oh, Hyon-Gon Choo, Jinwoong Kim

Efficient Disparity Map Generation Using Stereo and Time-of-Flight Depth Cameras

Three-dimensional (3D) content creation has received a lot of attention due to the numerous successes of 3D entertainment. Accurate estimation of depth information is necessary for efficient 3D content creation. In this paper, we propose a disparity map estimation method based on stereo correspondence. The proposed system utilizes depth and stereo camera sets. While the stereo set carries out disparity estimation, the depth camera information is projected to the left and right camera positions using 3D warping, and upsampling is performed in accordance with the image size. The upsampled depth is used to obtain disparity data at the left and right positions. Finally, the disparity data from each depth sensor are combined. The experimental results demonstrate that our method produces more accurate disparity maps than conventional approaches that use stereo cameras and a single depth sensor.

Woo-Seok Jang, Yo-Sung Ho
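
The step in which upsampled depth is used to obtain disparity data can be illustrated with the standard rectified-stereo relation d = f * B / Z. The sketch below is an assumption-laden illustration, not the authors' pipeline: it skips the 3D warping step, uses nearest-neighbour upsampling, and the names `upsample_nearest`, `depth_to_disparity`, `focal_px`, and `baseline_m` are hypothetical.

```python
import numpy as np

def upsample_nearest(depth, factor):
    """Nearest-neighbour upsampling of a depth map by an integer factor."""
    return np.repeat(np.repeat(depth, factor, axis=0), factor, axis=1)

def depth_to_disparity(depth_m, focal_px, baseline_m):
    """Convert metric depth to disparity in pixels for a rectified stereo pair.

    Pixels with non-positive depth are treated as invalid and mapped to 0.
    """
    depth_m = np.asarray(depth_m, dtype=np.float64)
    disparity = np.zeros_like(depth_m)
    valid = depth_m > 0
    disparity[valid] = focal_px * baseline_m / depth_m[valid]
    return disparity
```

For example, with an assumed focal length of 1000 pixels and a 0.1 m baseline, a point at 2 m depth maps to a disparity of about 50 pixels.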

Super-Resolution of Depth Map Exploiting Planar Surfaces

A depth map, with its per-pixel depth values, represents the relative distance between objects in the scene and the capturing depth camera. Hence, it has been widely used in 3D applications and in the Depth Image-Based Rendering (DIBR) technique to provide an immersive 3D and free-viewpoint experience to viewers. Depth maps can be generated using software- or hardware-driven techniques. However, most generated depth maps suffer from a combination of the following shortcomings: noise, holes, and limited spatial resolution. To tackle the limited spatial resolution of Time-of-Flight depth images, in this paper we present a planar-surface-based depth map super-resolution approach, which interpolates depth images by exploiting the equation of each detected planar surface. With the aid of these equations, the image regions are categorized into three groups: planar surfaces, non-planar surfaces, and edges. For the first category, the analytical equations of the planar surfaces are used to super-resolve them; a traditional interpolation method is used for the non-planar surfaces; and a combination of the two approaches is used to up-sample edges. Both quantitative and qualitative experimental results demonstrate the effectiveness and robustness of our approach compared with the benchmark methods.

Tammam Tilo, Zhi Jin, Fei Cheng
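
The idea of super-resolving a planar region from its analytical equation can be sketched as a least-squares plane fit followed by evaluation on a denser grid. This is a minimal sketch that assumes the whole input patch is a single planar surface of the form z = a*x + b*y + c; plane detection, the non-planar case, and edge handling described in the abstract are not reproduced, and the function names are hypothetical.

```python
import numpy as np

def fit_plane(depth):
    """Least-squares fit of z = a*x + b*y + c to a depth patch; returns (a, b, c)."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    design = np.column_stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    target = np.asarray(depth, dtype=np.float64).ravel()
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs

def upsample_planar(depth, factor):
    """Evaluate the fitted plane on a grid that is `factor` times denser."""
    a, b, c = fit_plane(depth)
    h, w = depth.shape
    # High-resolution pixel (i, j) corresponds to low-resolution coordinate (i/factor, j/factor).
    ys, xs = np.mgrid[0:h * factor, 0:w * factor] / factor
    return a * xs + b * ys + c
```

Because the plane equation is evaluated directly, new samples inside a planar region do not inherit the staircase or blur that a generic interpolation kernel can introduce.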

Hierarchical Interpolation-Based Disocclusion Region Recovery for Two-View to N-View Conversion System

In this paper, we propose a novel disocclusion region recovery approach for two-view to n-view conversion systems. Although view synthesis has been studied extensively for decades, reliable disocclusion region recovery, an indispensable step in synthesizing realistic content for a virtual view, is still an open research problem. The most common approach to predicting these unknown pixels is inpainting, which fills the disocclusion region with information from matched exemplars within a self-defined search domain. Although inpainting is widely used to fill in the missing values produced during synthesis, its result quality is sensitive to the filling priority and is unstable when recovering large disocclusion regions. Therefore, we propose a hierarchical interpolation-based approach that calculates the missing information in a coarse-to-fine manner, together with joint bilateral upsampling to enlarge the estimate from a small dimension to a higher resolution. The proposed hierarchical interpolation-based scheme is more robust in restoring the values of missing regions and also induces fewer artifacts. Through experiments on several benchmark video datasets, we demonstrate the superior quality of the virtual views synthesized with the proposed recovery algorithm over the traditional inpainting-based method.

Wun-Ting Lin, Chen-Ting Yeh, Shang-Hong Lai

UEP Network Coding for SVC Streaming

In this paper, we propose an optimized unequal error protection (UEP) network coding scheme for scalable video coding (SVC). First, we introduce packet-level UEP network coding schemes that encode packets with different selection probabilities according to the priorities of the video layers. Second, we explain how to prioritize the packet selection probability using the distortion degree, which accounts for the layers as well as the group of pictures (GOP). Finally, experimental results demonstrate the performance of the proposed UEP network coding compared to existing network coding methods in terms of decoding probability and peak signal-to-noise ratio (PSNR).

Seongyeon Kim, Yong-woo Lee, Jitae Shin
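
Packet-level coding with layer-dependent selection probabilities can be illustrated with a simple XOR-based (binary field) encoder. This is only a sketch under stated assumptions: the abstract does not specify the field, the probability values, or how the distortion degree is computed, so `encode_packet`, the `select_prob` mapping, and the use of XOR combination are hypothetical choices for illustration.

```python
import random

def encode_packet(packets, layers, select_prob, rng):
    """Build one coded packet by XOR-ing a random subset of source packets.

    packets     : list of equal-length bytes objects
    layers      : layer index of each packet (0 = base layer)
    select_prob : mapping from layer index to selection probability
    rng         : random.Random instance, passed in for reproducibility
    Returns the indices of the combined packets and the coded payload.
    """
    size = len(packets[0])
    coded = bytearray(size)
    chosen = []
    for idx, (payload, layer) in enumerate(zip(packets, layers)):
        # Higher-priority layers get a larger probability of being included.
        if rng.random() < select_prob[layer]:
            chosen.append(idx)
            for i in range(size):
                coded[i] ^= payload[i]
    return chosen, bytes(coded)
```

Giving the base layer a higher selection probability than the enhancement layers means that more coded packets carry base-layer information, which is the unequal protection effect the abstract describes.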

Overview on MPEG MMT Technology and Its Application to Hybrid Media Delivery over Heterogeneous Networks

MPEG has recently developed a new standard, MPEG Media Transport (MMT), for next-generation hybrid media delivery services over IP networks, in view of the emerging convergence of digital broadcast and broadband services. MMT is intended to overcome the current limitations of available media streaming standards by providing a streaming format that is transport- and file-format-friendly, cross-layer optimized between the video and transport layers, error-resilient for MPEG streams, and convertible between transport mechanisms, with content adaptation to different networks. In this paper, we give an overview of the MPEG MMT technology and describe its application to hybrid media delivery over heterogeneous networks.

Tae-Jun Jung, Hong-rae Lee, Kwang-deok Seo

A Framework for Extracting Sports Video Highlights Using Social Media

Summarizing lengthy sports videos into compact highlights has many applications and plays an essential role in effective media dissemination and delivery. Extracting highlights correctly and effectively is a great challenge, and extensive research effort has been devoted to this problem in recent years. In practice, sports video highlights are extracted either manually or by video content analysis schemes. The former approach is not cost-effective and naturally raises scalability concerns, while the latter approach suffers from high computational complexity. In this paper, we address the sports video summarization problem from a novel angle: we employ real-time text streams from social media, e.g., opinion comment posts, to detect events and their semantics in live sports videos. The main idea is that the volume of comment posts over time can be treated as a time series, and a variation in the time series, such as a spike, may reveal an event in a game and can therefore be used to identify the important moments in the game. By aligning the events with the sports video over time, automatic summarization of sports video may become feasible. This paper describes the implementation of this idea and reports our experience of summarizing 2014 World Cup video. We also evaluate our technique against human-generated summaries and find that its results are quite similar to the human-generated ones, which demonstrates the superiority of our technique.

Yao-Chung Fan, Huan Chen, Wei-An Chen
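
The idea of treating comment volume as a time series and flagging spikes can be sketched with a trailing-window threshold. The detector below is an assumed, generic rule (a count exceeding the trailing mean by `k` standard deviations), not the method reported in the paper; `detect_spikes`, `window`, and `k` are hypothetical names and defaults.

```python
from statistics import mean, pstdev

def detect_spikes(counts, window=5, k=3.0):
    """Return the indices where the comment count jumps above its recent level.

    counts : comment posts per time bin (e.g. per minute)
    window : number of preceding bins used as the baseline
    k      : how many standard deviations above the baseline counts as a spike
    """
    spikes = []
    for t in range(window, len(counts)):
        history = counts[t - window:t]
        mu = mean(history)
        # Floor the deviation at 1.0 so a perfectly flat baseline does not
        # turn every small fluctuation into a spike.
        sigma = max(pstdev(history), 1.0)
        if counts[t] > mu + k * sigma:
            spikes.append(t)
    return spikes
```

Each returned index marks a time bin that could then be aligned with the video timeline to select a candidate highlight segment.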


Further Information