
2016 | Book

Image and Video Technology

7th Pacific-Rim Symposium, PSIVT 2015, Auckland, New Zealand, November 25-27, 2015, Revised Selected Papers


About this book

This book constitutes the thoroughly refereed post-conference proceedings of the 7th Pacific Rim Symposium on Image and Video Technology, PSIVT 2015, held in Auckland, New Zealand, in November 2015.

A total of 61 revised papers were carefully reviewed and selected from 133 submissions. The papers are organized in topical sections on color and motion, image/video coding and transmission, computational photography and arts, computer vision and applications, image segmentation and classification, video surveillance, biomedical image processing and analysis, object and pattern recognition, computer vision and pattern recognition, image/video processing and analysis, and pattern recognition.

Table of Contents

Frontmatter

Color and Motion

Frontmatter
Color Conversion for Color Blindness Employing Multilayer Neural Network with Perceptual Model

In this paper, we propose a novel digital image color conversion algorithm for color blindness using a multilayer neural network. The symptoms of “color blindness” are due to an innate lack or deficit of the cone cells that recognize colors, and people with color blindness have difficulty discriminating combinations of specific colors. These people require a color conversion of the presented image so that the image becomes a perceptible color representation. In the proposed method, we design a multilayer neural network composed of three building blocks: layers for image color conversion, layers for a perceptual model of color blindness, and layers for color discrimination. In the proposed framework, the neural network learns the relationship between image data and the discrimination performance of colors in the image, and the color conversion rule is trained as part of the network. To validate the effectiveness of the proposed method, it is applied to several images with various color combinations.

Hideaki Orii, Hideaki Kawano, Noriaki Suetake, Hiroshi Maeda
Synthesis of Oil-Style Paintings

Non-photorealistic rendering is an important research topic in computer graphics, where painterly (or stroke-based) rendering has received intensive attention from researchers in recent years. The goal of this paper is to design a fully automatic algorithm that is able to turn a photograph into an oil-style painting. Different from existing approaches that use real brush-stroke images as templates, our brush strokes are created in a random manner according to the characteristics of the local image region. For determining the direction of a brush stroke, we also propose a new method based on template matching to evaluate the major orientation of edge features within a local image window. Moreover, a novel method of deciding stroke locations is proposed, which is simple yet effective. All these features together significantly reduce the undesirable systematic impression, which appears to be a common artifact of painterly rendering.

Fay Huang, Bo-Hui Wu, Bo-Ru Huang
Multi-frame Feature Integration for Multi-camera Visual Odometry

State-of-the-art ego-motion estimation approaches in the context of visual odometry (VO) rely either on Kalman filters or bundle adjustment. Recently proposed multi-frame feature integration (MFI [1]) techniques aim at finding a compromise between accuracy and computational efficiency. In this paper we generalise an MFI algorithm towards the full use of multi-camera-based visual odometry for achieving more consistent ego-motion estimation in a parallel, scalable manner. A series of experiments indicated that the generalised integration technique contributes an improvement of more than 70 % over our direct VO implementation, and further improves the monocular MFI technique by more than 20 %.

Hsiang-Jen Chien, Haokun Geng, Chia-Yen Chen, Reinhard Klette
A Robust Identification Scheme for JPEG XR Images with Various Compression Ratios

A robust scheme for identifying JPEG XR coded images is proposed in this paper. The aim is to identify images that are generated from the same original image under various compression ratios. The proposed scheme is robust against differences in compression ratio, and does not produce false-negative matches at any compression ratio. A new property of the positive and negative signs of lapped biorthogonal transform coefficients is exploited to robustly identify the images. The experimental results show that the proposed scheme is effective not only for still images but also for video sequences, in terms of false positive, false negative, and true positive query matches.

Hiroyuki Kobayashi, Shoko Imaizumi, Hitoshi Kiya
Challenge to Scalability of Face Recognition Using Universal Eigenface

This paper addresses the scalability problem of face recognition using the weight equations in a universal eigenface. Since the weight equations are linear, the optimal solution can be generated even when the number of registered faces exceeds the dimensionality of the universal eigenface. Based on the characteristics of underdetermined linear systems, this paper shows that effective preliminary elimination is possible with little loss by using parallel underdetermined systems. Finally, this paper proposes a preliminary elimination followed by small-scale face recognition to achieve scalable face recognition.

Hisayoshi Chugan, Tsuyoshi Fukuda, Takeshi Shakunaga
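The key step above — solving the weight equations when registered faces outnumber the eigenface dimensions — amounts to solving an underdetermined linear system. A minimal NumPy sketch under that assumption (the matrix names and sizes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: d-dimensional universal eigenface space and
# n registered faces with n > d, so the weight equations A @ x = b
# are underdetermined (more unknowns than equations).
d, n = 20, 50
A = rng.standard_normal((d, n))   # weight matrix of registered faces
b = rng.standard_normal(d)        # eigenface weights of the query face

# Minimum-norm solution of the underdetermined system.
x = np.linalg.pinv(A) @ b

# The solution reproduces the query weights exactly.
print(np.allclose(A @ x, b))  # True
```

Because the system is consistent for any query, the pseudoinverse picks the minimum-norm solution among the infinitely many exact ones.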
From Optimised Inpainting with Linear PDEs Towards Competitive Image Compression Codecs

For inpainting with linear partial differential equations (PDEs) such as homogeneous or biharmonic diffusion, sophisticated data optimisation strategies have been found recently. These allow high-quality reconstructions from sparse known data. While they have been explicitly developed with compression in mind, they have not entered actual codecs so far: Storing these optimised data efficiently is a nontrivial task. Since this step is essential for any competitive codec, we propose two new compression frameworks for linear PDEs: Efficient storage of pixel locations obtained from an optimal control approach, and a stochastic strategy for a locally adaptive, tree-based grid. Surprisingly, our experiments show that homogeneous diffusion inpainting can surpass its often favoured biharmonic counterpart in compression. Last but not least, we demonstrate that our linear approach is able to beat both JPEG2000 and the nonlinear state-of-the-art in PDE-based image compression.

Pascal Peter, Sebastian Hoffmann, Frank Nedwed, Laurent Hoeltgen, Joachim Weickert
A Study on Size Optimization of Scanned Textual Documents

This paper reports a study on the compression achieved on document images in different image formats, including PNG, GIF, PBM (zipped), JBIG and JBIG2. It also examines the issue of perceptual quality of bi-level document images in these formats. It analyzes the impact of a common pre-processing step, namely adaptive thresholding, on compression ratio and perceptual image quality in the different image formats. We conclude that adaptive thresholding improves the compression ratio for common image formats, like PNG and GIF, and makes them comparable to JBIG/JBIG2 encoding; it also significantly improves perceptual image quality. We also observe that this simple pre-processing step prevents perceptual information loss in JBIG/JBIG2 encoding in certain situations.

Nidhi Saraswat, Hiranmay Ghosh
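The adaptive thresholding referred to above can be sketched as a Bradley-style local-mean binarization using an integral image; the window size and threshold ratio below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def adaptive_threshold(img, win=15, ratio=0.85):
    """Binarize img: a pixel becomes ink (0) if it is darker than
    ratio * local mean over a win x win window (Bradley-style)."""
    h, w = img.shape
    # Integral image with a leading zero row/column for easy box sums.
    ii = np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    r = win // 2
    ys, xs = np.mgrid[0:h, 0:w]
    y0, y1 = np.clip(ys - r, 0, h), np.clip(ys + r + 1, 0, h)
    x0, x1 = np.clip(xs - r, 0, w), np.clip(xs + r + 1, 0, w)
    area = (y1 - y0) * (x1 - x0)
    mean = (ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]) / area
    return np.where(img < ratio * mean, 0, 255).astype(np.uint8)

img = np.full((20, 20), 200.0)
img[10, 10] = 10.0                 # one dark "ink" pixel on light paper
bw = adaptive_threshold(img)
print(bw[10, 10], bw[0, 0])        # 0 255
```

The resulting clean bi-level image is what the lossless formats in the study then compress.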
Combination of Mean Shift of Colour Signature and Optical Flow for Tracking During Foreground and Background Occlusion

This paper proposes a multiple-hypothesis tracker for multiple-object tracking with a moving camera. The proposed model makes use of the stability of sparse optical flow along with the invariance of colour under size and pose variation, by merging the colour property of objects into optical-flow tracking. To evaluate the algorithm, five different videos are selected from broadcast horse races, where each video represents different challenges present in the object-tracking literature. A comparison of the proposed method with colour-based mean-shift tracking demonstrates a significant improvement in the accuracy and stability of object tracking.

M. Hedayati, M. J. Cree, J. Scott
Rendered Benchmark Data Set for Evaluation of Occlusion-Handling Strategies of a Parts-Based Car Detector

Despite extensive efforts, state-of-the-art detection approaches show a strong degradation of performance with increasing levels of occlusion. A fundamental problem for the development and analysis of occlusion-handling strategies is that occlusion information cannot be labeled accurately enough in real-world video streams. In this paper we present a rendered car-detection benchmark with controlled levels of occlusion and use it to extensively evaluate an existing visibility-based occlusion-handling strategy for a parts-based detection approach. Thereby we determine the limitations and the optimal parameter settings of this framework. Based on these findings we then propose an improved strategy which is especially helpful for strongly occluded views.

Marvin Struwe, Stephan Hasler, Ute Bauer-Wersing
Moving Object Detection Using Energy Model and Particle Filter for Dynamic Scene

We propose an algorithm that uses an energy model with a smoothness assumption to identify a moving object by using optical flow, and uses a particle filter with proposed observation and dynamic models to track the object. The algorithm is based on the assumption that the dominant motion is background flow and that foreground flow can be separated from the background flow. The energy model provides a good initial labelling of the foreground object, and minimizes the number of noise pixels that are included in the bounding box. The tracking part uses HOG-3 as the observation model, and optical flow as the dynamic model. This combination of models improves the accuracy of the tracking results. In experiments on challenging data sets that have no initial labels, the algorithm achieved meaningful accuracy compared to a state-of-the-art technique that needs initial labels.

Wooyeol Jun, Jeongmok Ha, Hong Jeong
Logarithmically Improved Property Regression for Crowd Counting

Crowd counting based on video camera recordings faces two major problems, namely inter-occlusion among the people, and perspective scaling. Though the former issue has been adequately addressed using different regression- and model-based schemes, the latter remains an open problem. This paper proposes a novel scene-independent solution to perspective scaling and shows that it supports promising results. A property matrix, combining both a grey-level co-occurrence matrix and segmentation properties, is first obtained, which is subsequently weighted using logarithmic relationships between pixel distances and foreground regions. We apply Gaussian process regression, using a compounded kernel, to acquire an estimate of the crowd count. We show that results are comparable to those obtained with more complex and costly techniques.

Usman Khan, Reinhard Klette
Lesioned-Part Identification by Classifying Entire-Body Gait Motions

This paper proposes a physical motion evaluation system based on human pose sequences estimated by a depth sensor. While most similar systems measure and evaluate the motion of only a part of interest (e.g., the knee), the proposed system comprehensively evaluates the motion of the entire body. The proposed system is designed for observing human motion in daily life in order to find signs of aging and physical disability. For daily use, in this paper, we focus on walking motions. Walking motions with a variety of physical disabilities are recorded and modeled for classification purposes. This classification is achieved with a set of pose features extracted from walking motion sequences. In experiments, the proposed features extracted from the entire body allowed us to identify where a subject was injured with 81.1 % accuracy. The superiority of the entire-body features was also validated in estimating the degree of lesion, in contrast to local features extracted from only a body part of interest (77.1 % vs. 65 %).

Tsuyoshi Higashiguchi, Toma Shimoyama, Norimichi Ukita, Masayuki Kanbara, Norihiro Hagita
Variable-Length Segment Copy for Compressing Index Map of Palette Coding in Screen Content Coding

With emerging applications such as screen mirroring and remote play, screen content coding (SCC) has recently played an important role in video coding. Since the characteristics of screen content differ from those of natural content, palette coding is adopted in the current draft standard of HEVC-SCC. The basic idea of palette coding is to represent the colors of a coding unit (CU) by the indices of selected representative colors. This paper shows that the produced index maps exhibit considerably high spatial correlation. To utilize the spatial correlation among indices, a general 2-D search method is first proposed for index-map compression. To reduce memory access and implementation complexity, three simplified search schemes are proposed to balance coding performance and complexity. The experimental results show that the three simplified methods achieve 0.6 %, 0.5 % and 0.9 % BD-rate savings respectively, compared to the HM-13.0 + RExt-6.0 test model.

Yao-Jen Chang, Ching-Chieh Lin, Chao-Hsiung Hung, Jih-Sheng Tu, Chun-Lung Lin, Pei-Hsuan Tsai
Automatic Construction of Action Datasets Using Web Videos with Density-Based Cluster Analysis and Outlier Detection

In this paper, we introduce a fully automatic approach to construct action datasets from noisy Web video search results. The idea is based on combining cluster structure analysis and density-based outlier detection. For a specific action concept, first, we download its top Web search videos and segment them into video shots. We then organize these shots into subsets using density-based hierarchical clustering. For each set, we rank its shots by their outlier degrees, determined as their isolatedness with respect to their surroundings. Finally, we collect the top-ranked shots as training data for the action concept. We demonstrate that with action models trained on our data, we can obtain promising precision rates in the task of action classification while offering the advantage of fully automatic, scalable learning. Experimental results on UCF11, a challenging action dataset, show the effectiveness of our method.

Nga Hang Do, Keiji Yanai
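The "isolatedness with respect to their surroundings" used for ranking above can be illustrated with a simple k-nearest-neighbour distance score; the paper's density-based hierarchical clustering is not reproduced here, and the data are synthetic stand-ins for shot descriptors:

```python
import numpy as np

def knn_outlier_degree(X, k=3):
    """Outlier degree of each point as its mean distance to its
    k nearest neighbours: isolated points receive high scores."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)           # ignore self-distance
    knn = np.sort(D, axis=1)[:, :k]       # k smallest per row
    return knn.mean(axis=1)

# A tight cluster of "relevant" shots plus one isolated noisy shot.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), [[5.0, 5.0]]])
scores = knn_outlier_degree(X)
print(int(np.argmax(scores)))  # 20  (the planted outlier)
```

Ranking shots by this score and keeping the low-scoring ones mirrors the "collect top-ranked shots" selection, with the ranking direction flipped as needed.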

Image/Video Coding and Transmission

Frontmatter
Fast Coding Strategy for HEVC by Motion Features and Saliency Applied on Difference Between Successive Image Blocks

Introducing a number of innovative and powerful coding tools, the High Efficiency Video Coding (HEVC) standard promises double the compression efficiency of its predecessor H.264 with similar perceptual quality. The increased computational complexity is an important issue for the video coding research community as well. This paper attempts to reduce this complexity of HEVC by efficient selection of appropriate block-partitioning modes based on motion features and saliency applied to the difference between successive image blocks. As this difference gives us the explicit visible motion and salient information, we develop a cost function combining the motion features and the image-difference salient feature. The combined features are then converted into an area-of-interest (AOI) based binary pattern for the current block. This pattern is compared with a previously defined codebook of binary pattern templates to select a subset of modes. Motion estimation (ME) and motion compensation (MC) are performed only on the selected subset of modes, without exhaustive exploration of all modes available in HEVC. The experimental results reveal a 42 % reduction in the encoding time of the HEVC encoder with similar subjective and objective image quality.

Pallab Kanti Podder, Manoranjan Paul, Manzur Murshed
Neighboring Sample Prediction Coding for HEVC Screen Content Coding

High Efficiency Video Coding (HEVC) Screen Content Coding (SCC) is being standardized for screen-captured content. Because many areas are composed of texts and lines featuring non-smooth textures, traditional intra prediction is not suitable for those areas. Therefore, this paper proposes neighboring sample prediction coding (NSPC) for HEVC SCC, which represents the samples of a coding unit (CU) by the indexes of samples selected from the neighboring CUs. A unified sample selection scheme (USSS) based on a neighboring sample list is also proposed to determine the priority ordering of the selected neighboring samples. An index coding with a single sample-index skipping method is also designed for coding the converted indexes. Different NSPC signaling methods are evaluated under the common test conditions for HEVC SCC, and the results show that NSPC improves the Bjontegaard-Delta bitrate saving of Screen Content Coding Test Model 2.0 by up to 1.0 %.

Yao-Jen Chang, Ching-Chieh Lin, Jih-Sheng Tu, Chun-Lung Lin, Chao-Hsiung Hung, Pei-Hsuan Tsai

Computational Photography and Arts

Frontmatter
Aesthetic Interactive Hue Manipulation for Natural Scene Images

One of the most common ways to adjust the aesthetic quality of a natural scene image is by editing its color. However, while lightness and saturation manipulation is well understood in computational terms, hue manipulation is poorly understood and non-trivial to carry out. As a result, modifications made to hue with existing photo-editing tools often impart an unnatural look to the image. In this paper, we discuss a framework for hue manipulation and its contribution to the aesthetics of an image, inspired by the use of hue in the visual arts. Our framework is designed to work with a segmentation approach based on superpixels that can segment the image into regions of similar color at varying scales. This allows the user to quickly select semantically similar regions. These local regions also provide the basic region of operation for three proposed hue operations: (1) hue spread; (2) hue compression; and (3) hue shift. We show that, when combined with superpixel segmentation, these operations are capable of increasing, decreasing and changing regions of local hue contrast in a manner that produces natural or artistic looking images. We demonstrate our framework on a variety of input images and discuss its evaluation based on the feedback of several expert users.

Jinze Yu, Martin Constable, Junyan Wang, Kap Luk Chan, Michael S. Brown
Cross-View Action Recognition by Projection-Based Augmentation

A challenging issue in cross-view action recognition is the difference between the training viewpoint and the testing viewpoint. Existing research deals with this problem by transferring knowledge, i.e., finding a viewpoint-independent latent space in which action descriptors from different viewpoints are directly comparable. In this paper, we propose a novel approach that tackles the problem by exploiting the discrimination in action execution across various viewpoints. We take advantage of depth data to augment viewpoints from an initial camera viewpoint. In our framework, local motion features and dedicated classifiers are built from the augmented viewpoints. We conduct experiments on the benchmark Northwestern-UCLA Multiview Action 3D (N-UCLA3D) dataset. The experimental results indicate that our proposed method outperforms the state of the art on the benchmark. In addition, we show the important role of viewpoints in improving the performance of action recognition.

Chien-Quang Le, Thanh Duc Ngo, Duy-Dinh Le, Shin’ichi Satoh, Duc Anh Duong
Star-Effect Simulation for Photography Using Self-calibrated Stereo Vision

Star effects are an important design factor for night photos. Progress in imaging technologies has made it possible to take night photos free-hand. With such camera settings, star effects are not achievable. We present a star-effect simulation method based on self-calibrated stereo vision. Given an uncalibrated stereo pair (i.e. a base image and a match image), which can be just two photos taken with a mobile phone from about the same pose, we follow a standard routine: extract a family of feature-point pairs, calibrate the stereo pair using these pairs, and obtain depth information by stereo matching. We detect highlight regions in the base image, estimate the luminance according to the available depth information, and, finally, render star patterns with an input texture. Experiments show that our results are similar to real-world star-effect photos, and that they are more natural than the results of existing commercial applications. The paper reports for the first time on research into automatically simulating photo-realistic star effects.

Dongwei Liu, Haokun Geng, Reinhard Klette

Computer Vision and Applications

Frontmatter
A Robust Stereo Vision with Confidence Measure Based on Tree Agreement

We present an improved non-learning-based confidence measure based on tree agreement for the stereo matching problem. To use confidence information for accurate matching, we propose an improved method to assemble a cost aggregation table with the confidence measure. The proposed confidence measure and cost aggregation method were evaluated using the KITTI and HCI datasets. Compared to other non-learning-based confidence measures, the proposed confidence measure showed the best ability to detect wrongly estimated pixels. Using the estimated confidence measure, we showed that the proposed cost aggregation method improves disparity-map quality compared to previous methods. The proposed algorithm estimates disparity relatively accurately even in some very challenging outdoor scenes.

Jeongmok Ha, Hong Jeong
Semantics-Preserving Warping for Stereoscopic Image Retargeting

Due to the availability and popularity of stereoscopic displays in recent years, research into stereo image retargeting is receiving considerable attention. In this paper, we extend the tearable image warping method to stereo image retargeting. Our method retargets both the left and right images of the stereo pair simultaneously to preserve scene consistency, and minimizes distortion using a global optimization algorithm. It is also able to preserve the stereoscopic properties of the resulting stereo image. Experimental results show that our approach can preserve the global image context better than stereoscopic cropping, preserve structural details better than stereoscopic seam carving, and protect objects better than stereoscopic traditional warping. Moreover, compared to scene warping, our approach can guarantee semantic connectedness.

Chun-Hau Tan, Md Baharul Islam, Lai-Kuan Wong, Kok-Lim Low
Improved Poisson Surface Reconstruction with Various Passive Visual Cues from Multiple Camera Views

Poisson surface reconstruction with an octree is widely used as the last step to retrieve surface data from a point cloud. When the point cloud is generated by triangulation of point correspondences in multiple images, the noisy positions of the 3D points and the inaccurate estimation of the normal vectors impact the quality of the reconstructed surface. In this work, mesh optimization using multiple visual cues is applied to improve the output of Poisson surface reconstruction. Usually, active cues like shading and focusing require an elaborate experimental setup, whereas passive cues like silhouettes and photometric properties can be acquired more easily from the raw images. The experimental results show that adaptive integration of multiple passive visual cues delivers surface mesh data of high quality. Moreover, the optimization algorithm is easy to parallelize, as each vertex moves independently, which makes it appealing for real-time 3D reconstruction systems.

Ningqing Qian, Sohaib Kiani, Bahareh Shakibajahromi
Prediction of Vibrations as a Measure of Terrain Traversability in Outdoor Structured and Natural Environments

Terrain recognition is an important task that a mobile robot has to accomplish autonomously to navigate hazardous territories safely with no additional human monitoring. For this, sensory information should be employed to construct a good model to estimate the degree of traversability of upcoming terrains. In this paper, a regression-based method is proposed to estimate mobile robot vibration from terrain images as a description of terrain traversability. Texture attributes, obtained from evaluation of the fractal dimension to describe the terrains, were combined with appropriate acceleration features for function approximation using Gaussian process regression (GP). Results showed the effectiveness of the method in predicting motion data for different terrain configurations in structured and rough environments.

Mohammed Abdessamad Bekhti, Yuichi Kobayashi
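The Gaussian process regression step can be sketched in a few lines with an RBF kernel on a hypothetical 1-D feature; the paper's fractal-dimension texture features and acceleration features are not shown, and the length scale and noise level below are illustrative:

```python
import numpy as np

def gp_predict(X, y, Xs, length=1.0, sigma_n=1e-6):
    """Gaussian process regression with an RBF kernel:
    posterior mean at test inputs Xs given training data (X, y)."""
    def rbf(A, B):
        d2 = (A[:, None] - B[None, :]) ** 2
        return np.exp(-0.5 * d2 / length**2)
    K = rbf(X, X) + sigma_n * np.eye(len(X))   # noisy training kernel
    alpha = np.linalg.solve(K, y)
    return rbf(Xs, X) @ alpha                  # posterior mean

# Hypothetical stand-in for "terrain feature -> vibration level".
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.sin(X)
mu = gp_predict(X, y, np.array([1.0, 1.5]))
print(abs(mu[0] - np.sin(1.0)) < 1e-3)  # True: training target reproduced
```

With a near-zero noise term the posterior mean interpolates the training targets; the paper's compounded-kernel variants follow the same template with a different kernel function.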
Echo State Network for 3D Motion Pattern Indexing: A Case Study on Tennis Forehands

Open-skill sports such as tennis involve a large number of swing-execution techniques. This study presents a novel approach to event detection and motion-pattern indexing of forehand swings captured by multiple fixed-location cameras and represented as a 3D motion data set of multiple time series sampled at 50 Hz. The achieved results, utilising an Echo State Network (ESN), demonstrate 100 % recognition of tennis forehands from previously unseen test data without ball-impact information. In contrast to traditional, heuristic and feature-extraction-based algorithmic approaches in exergames and augmented coaching technologies, the proposed ESN paradigm represents a viable and generic approach for future work in temporal and spatial detection and automated analysis of regions of interest in human motion data processing.

Boris Bačić

Image Segmentation and Classification

Frontmatter
Multispectral Image Denoising Using Optimized Vector NLM Filter

In this paper, we present a Stein's Unbiased Risk Estimator (SURE) approach for the non-local means filter to denoise multispectral images. We extend this filter to the vector case in order to take advantage of the additional spectral information brought by the multispectral imaging system. Experimental results show that the proposed optimized vector non-local means filter (OVNLM) achieves good denoising performance compared to several other approaches.

Ahmed Ben Said, Sebti Foufou
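For reference, a basic single-channel non-local means filter looks as follows; the paper's vector extension and SURE-based parameter optimization are beyond this sketch, and the patch, search and h values are illustrative:

```python
import numpy as np

def nlm(img, patch=1, search=3, h=10.0):
    """Basic (single-channel) non-local means: each pixel becomes a
    weighted average over a search window, with weights given by the
    similarity of the patches surrounding the two pixels."""
    pad = patch + search
    p = np.pad(img.astype(float), pad, mode="reflect")
    out = np.zeros(img.shape, dtype=float)
    H, W = img.shape
    for y in range(H):
        for x in range(W):
            cy, cx = y + pad, x + pad
            ref = p[cy-patch:cy+patch+1, cx-patch:cx+patch+1]
            wsum = vsum = 0.0
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    ny, nx = cy + dy, cx + dx
                    cand = p[ny-patch:ny+patch+1, nx-patch:nx+patch+1]
                    w = np.exp(-np.sum((ref - cand) ** 2) / h**2)
                    wsum += w
                    vsum += w * p[ny, nx]
            out[y, x] = vsum / wsum
    return out

# A flat image is left unchanged: all patch weights are equal.
print(np.allclose(nlm(np.full((8, 8), 5.0)), 5.0))  # True
```

The vector (multispectral) case replaces the scalar patch distance with a distance between patches of spectral vectors, and SURE is used to tune h without access to the clean image.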
Scene-Based Non-uniformity Correction with Readout Noise Compensation

Thermal cameras cannot be calibrated as easily as RGB cameras, since their noise characteristics change over time; thus scene-based non-uniformity correction (SBNUC) has been developed. We present a method to boost the convergence of these algorithms by removing the readout noise from the image before it is processed. The readout noise can be estimated by capturing a series of pictures with varying exposure times, fitting a line for each pixel, and thereby estimating the bias of the pixel. When this is subtracted from the image, a noticeable portion of the noise is compensated. We compare the results of two common SBNUC algorithms with and without this compensation. The mean average error improves by several orders of magnitude, which allows faster convergence with smaller step sizes. The readout noise compensation (RNC) can be used to improve the performance of any SBNUC approach.

Martin Bürker, Hendrik P. A. Lensch
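The bias-estimation step described above (fit a line per pixel over exposure time, take the intercept as the readout bias) can be sketched on synthetic data; noise is omitted here for clarity, so the fit recovers the bias exactly:

```python
import numpy as np

# Synthetic stack: one frame per exposure time, with each pixel
# modelled as value = bias + gain * exposure.
rng = np.random.default_rng(2)
h, w = 4, 5
bias = rng.uniform(5, 15, (h, w))      # per-pixel readout bias
gain = rng.uniform(0.5, 2.0, (h, w))   # per-pixel response
t = np.array([1.0, 2.0, 4.0, 8.0, 16.0])         # exposure times
stack = bias[None] + gain[None] * t[:, None, None]

# Fit a line per pixel over exposure time; the intercept is the
# estimated readout bias, which is then subtracted from each frame.
A = np.stack([t, np.ones_like(t)], axis=1)       # design matrix
coef, *_ = np.linalg.lstsq(A, stack.reshape(len(t), -1), rcond=None)
est_bias = coef[1].reshape(h, w)

print(np.allclose(est_bias, bias))  # True
corrected = stack - est_bias[None]  # frames with readout bias removed
```

With real sensor noise the per-pixel fit yields a least-squares estimate of the bias rather than the exact value, which is then subtracted before the SBNUC iterations.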
A Color Quantization Based on Vector Error Diffusion and Particle Swarm Optimization Considering Human Visibility

In this paper, we propose a new color quantization method for generating color-reduced images. The proposed method employs a vector error diffusion (VED) method and particle swarm optimization (PSO). The VED method, based on Floyd-Steinberg dithering, is used to display the color-reduced image. Furthermore, the color palette used in the VED method is optimized by PSO, which generates an effective color palette by evaluating the human visibility of the color-reduced image on the display. The validity and effectiveness of the proposed method are confirmed by experiments.

Ryosuke Kubota, Hakaru Tamukoh, Hideaki Kawano, Noriaki Suetake, Byungki Cha, Takashi Aso
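A minimal sketch of vector error diffusion with Floyd-Steinberg weights and a fixed palette; in the paper the palette itself is optimized by PSO under a visibility criterion, which is not shown here:

```python
import numpy as np

def vector_error_diffusion(img, palette):
    """Floyd-Steinberg error diffusion with a colour palette: each
    pixel snaps to the nearest palette colour, and the quantisation
    error (a colour vector) is diffused to unprocessed neighbours."""
    img = img.astype(float).copy()
    h, w, _ = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            idx = np.argmin(((palette - old) ** 2).sum(axis=1))
            out[y, x] = palette[idx]
            err = old - palette[idx]
            # Floyd-Steinberg weights: 7/16, 3/16, 5/16, 1/16
            if x + 1 < w:
                img[y, x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    img[y + 1, x - 1] += err * 3 / 16
                img[y + 1, x] += err * 5 / 16
                if x + 1 < w:
                    img[y + 1, x + 1] += err * 1 / 16
    return out

palette = np.array([[0, 0, 0], [255, 255, 255], [255, 0, 0]], float)
img = np.full((6, 6, 3), 128.0)        # mid grey, not in the palette
q = vector_error_diffusion(img, palette)
d = ((q[:, :, None, :] - palette[None, None]) ** 2).sum(-1)
print(np.all(d.min(-1) == 0))  # True: every pixel is a palette colour
```

PSO then searches over candidate palettes, scoring each by how good the dithered result looks, so the diffusion routine above is the inner loop of the fitness evaluation.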
Fast Interactive Image Segmentation Using Bipartite Graph Based Random Walk with Restart

Although random walk with restart (RWR) has been successfully used in interactive image segmentation, the traditional implementation of RWR does not scale to large images. As images are usually stored on local disk prior to user interaction, we can preprocess them to save user time. In this paper, we perform an offline precomputation that over-segments the input image into superpixels at different scales and then aggregates superpixels and pixels into one bipartite graph which fuses high-level and low-level information. Given user scribbles, we perform a real-time RWR on the bipartite graph by applying an approximate method which maps the RWR from the pixel level to the superpixel level. As the number of superpixels is far smaller than the number of pixels in the image, our method reduces user time significantly. The experimental results demonstrate that our method achieves results similar to the original RWR while clearly outperforming it in speed.

Yunfan Du, Fei Li, Rujie Liu
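The RWR steady state on a small hand-made graph can be computed in closed form; the bipartite superpixel-pixel construction and the approximate pixel-to-superpixel mapping are not reproduced here, and the restart probability and seed node are illustrative:

```python
import numpy as np

# Toy graph: adjacency matrix, normalised to a column-stochastic
# transition matrix W (each column sums to 1).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=0)

c = 0.15                                # restart probability
e = np.array([1.0, 0.0, 0.0, 0.0])      # seed (user-scribble node)

# Steady state of r = (1 - c) * W @ r + c * e, solved in closed form.
r = c * np.linalg.solve(np.eye(4) - (1 - c) * W, e)
print(np.allclose(r.sum(), 1.0))  # True: r is a probability vector
```

Each entry of r is the relevance of a node to the scribbled seed; in segmentation, pixels are assigned the label whose seed gives them the highest relevance.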
Adaptive Window Strategy for High-Speed and Robust KLT Feature Tracker

The Kanade-Lucas-Tomasi tracking (KLT) algorithm is widely used for local tracking of features. As it employs a translation model to find the feature tracks, KLT is not robust in the presence of distortions around the feature resulting in high inaccuracies in the tracks. In this paper we show that the window size in KLT must vary to adapt to the presence of distortions around each feature point in order to increase the number of useful tracks and minimize noisy ones. We propose an adaptive window size strategy for KLT that uses the KLT iterations as an indicator of the quality of the tracks to determine near-optimal window sizes, thereby significantly improving its robustness to distortions. Our evaluations with a well-known tracking dataset show that the proposed adaptive strategy outperforms the conventional fixed-window KLT in terms of robustness. In addition, compared to the well-known affine KLT, our method achieves comparable robustness at an average runtime speedup of 7x.

Nirmala Ramakrishnan, Thambipillai Srikanthan, Siew Kei Lam, Gauri Ravindra Tulsulkar
Enhanced Phase Correlation for Reliable and Robust Estimation of Multiple Motion Distributions

Phase correlation is one of the classic methods for sparse motion or displacement estimation. It is renowned in the literature for high precision and insensitivity against illumination variations. We propose several important enhancements to the phase correlation (PhC) method which render it more robust against those situations where a motion measurement is not possible (low structure, too much noise, too different image content in the corresponding measurement windows). This allows the method to perform self-diagnosis in adverse situations. Furthermore, we extend the PhC method by a robust scheme for detecting and classifying the presence of multiple motions and estimating their uncertainties. Experimental results on the Middlebury Stereo Dataset and on the KITTI Optical Flow Dataset show the potential offered by the enhanced method in contrast to the PhC implementation of OpenCV.

Matthias Ochs, Henry Bradler, Rudolf Mester
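The classic phase-correlation core — locating the peak of the normalised cross-power spectrum — can be sketched as follows; the paper's self-diagnosis and multi-motion extensions are not shown:

```python
import numpy as np

def phase_correlate(f, g):
    """Recover the integer translation (dy, dx) with
    g = np.roll(f, (dy, dx)) from the peak of the normalised
    cross-power spectrum."""
    F, G = np.fft.fft2(f), np.fft.fft2(g)
    R = np.conj(F) * G
    R /= np.abs(R) + 1e-12            # keep only the phase
    corr = np.fft.ifft2(R).real       # ideally a delta at the shift
    return tuple(int(v) for v in
                 np.unravel_index(np.argmax(corr), corr.shape))

rng = np.random.default_rng(3)
f = rng.random((32, 32))
g = np.roll(f, (5, 9), axis=(0, 1))   # shift down 5, right 9
print(phase_correlate(f, g))          # (5, 9)
```

A sharp, dominant peak indicates a reliable single motion; flat or multi-peaked correlation surfaces are exactly the adverse cases the paper's enhancements are designed to detect and classify.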
Robust Visual Voice Activity Detection Using Long Short-Term Memory Recurrent Neural Network

Many traditional visual voice activity detection systems utilize features extracted from mouth-region images, which are sensitive to noisy observations in the visual domain. In addition, the hyperparameters of the feature-extraction process, which modulate the desired compromise between robustness, efficiency, and accuracy of the algorithm, are difficult to determine. Therefore, a visual voice activity detection algorithm is proposed which only utilizes simple lip-shape information as features and a Long Short-Term Memory recurrent neural network (LSTM-RNN) as a classifier. Face detection is performed by a structural SVM based on histogram of oriented gradients (HOG) features. The detected face template is used to initialize a kernelized correlation filter tracker. Facial landmark coordinates are then extracted from the tracked face. A centroid distance function is applied to the geometrically normalized landmarks surrounding the outer and inner lip contours. Finally, discriminative (LSTM-RNN) and generative (Hidden Markov Model) methods are used to model the temporal lip-shape sequences during speech and non-speech intervals, and their classification performances are compared. Experimental results show that the proposed algorithm using an LSTM-RNN can achieve a classification rate of 98 % in labeling speech and non-speech periods. It is robust and efficient for real-time applications.

Zaw Htet Aung, Panrasee Ritthipravat
Wing-Surface Reconstruction of a Lanner-Falcon in Free Flapping Flight with Multiple Cameras

This paper presents a way to reconstruct the upper and lower surfaces of the curved and textured wing of a Lanner falcon in flapping flight. A stereo camera system was used to take images of a free-flying bird in a wind tunnel. Using two cameras allows corresponding surface points to be found in both views, from which three-dimensional coordinates can be triangulated. To identify corresponding image points, a disparity map is calculated with the help of a Semi-Global Block Matching algorithm. The complexity is reduced by rectifying the camera images on the basis of the epipolar geometry. The analysis shows that the surface structure of a Lanner falcon and its motion during a specified time series can be reconstructed with sufficient accuracy.

Martin Heinold, Christian J. Kähler
Underwater Active Oneshot Scan with Static Wave Pattern and Bundle Adjustment

Structured Light Systems (SLS) are widely used for various purposes. Recently, a strong demand to apply SLS to underwater applications has emerged. When an SLS is used in an air medium, the stereo correspondence problem can be solved efficiently by epipolar geometry due to the co-planarity of a 3D point and its corresponding 2D points on the camera/projector planes. However, in underwater environments, the camera and projector are usually set in special housings and refraction occurs at the interfaces between water/glass and glass/air, invalidating the conditions for epipolar geometry and strongly affecting the correspondence search process. In this paper, we tackle the problem of underwater 3D shape acquisition with SLS. We propose a method to perform 3D reconstruction by calibrating the system at multiple depths as if it were in air. Since refraction cannot be completely described by a polynomial approximation of the distortion model, a grid-based SLS method is used to solve the problem. Finally, we propose a bundle adjustment method to refine the final result. We tested our method with an underwater SLS prototype, consisting of a custom-made diffractive optical element (DOE) laser and underwater housings, showing the validity of the proposed approach.

Hiroki Morinaga, Hirohisa Baba, Marco Visentini-Scarzanella, Hiroshi Kawasaki, Ryo Furukawa, Ryusuke Sagawa
Using Image Features and Eye Tracking Device to Predict Human Emotions Towards Abstract Images

Nowadays, emotional semantic image retrieval systems enable users to access images in a database according to emotional concepts. This leads to the affective image classification task, which has recently attracted researchers' attention. However, different users may experience different emotions depending on where in the image they are gazing. This paper presents an improved prediction method that takes into account the user's eye movements as implicit feedback while they are looking at the image. Our experimental results show that using both eye movement information and image features together to determine users' emotions gives more accurate predictions than using image features alone.

Kitsuchart Pasupa, Panawee Chatkamjuncharoen, Chotiros Wuttilertdeshar, Masanori Sugimoto

Video Surveillance

Frontmatter
Personal Authentication Based on 3D Configuration of Micro-feature Points on Facial Surface

This paper proposes a personal authentication method based on the 3D configuration of micro-feature points such as moles and freckles, plus common feature points such as the corners of the eyes, the edges of the mouth, and the nostrils. The basic idea behind the proposed method is the assumption that such a 3D configuration is unique to each individual. To compare two configurations of feature points effectively, the concept of a 3D shape subspace in a high-dimensional vector space is introduced. With this idea, the task of comparing sets of feature points is converted into that of measuring the structural similarity between the corresponding shape subspaces. The validity of the proposed method is demonstrated through experiments with feature points from actual face images. In addition, the performance limit of the method is explored using sets of artificially generated feature points.

Takao Yoshinuma, Hideitsu Hino, Kazuhiro Fukui
6-DOF Direct Homography Tracking with Extended Kalman Filter

This paper presents a robust direct homography tracking method that takes advantage of the known intrinsic parameters of the camera to estimate its pose at real scale, to speed up the convergence, and to drastically increase the robustness of the tracking. Indeed, our new formulation for direct homography tracking allows us to explicitly solve for the 6 Degrees Of Freedom (DOF) rigid transformation between the plane and the camera. Furthermore, it simplifies the integration of the Extended Kalman Filter (EKF), which allows us to increase the computational speed and deal with large motions. For the sake of robustness, our approach also includes a pyramidal optimization using an Enhanced Correlation Coefficient (ECC) based objective function. The experiments show the high efficiency of our approach against state-of-the-art methods and under challenging conditions.

Hyowon Ha, François Rameau, In So Kweon
Tracking a Human Fast and Reliably Against Occlusion and Human-Crossing

Tracking a human using computer vision techniques is essential in automatic surveillance tasks. Not only accuracy and speed but also how occlusion and human-crossing are handled are the challenges for a reliable tracking framework. Among many trackers, the Kernelized Correlation Filter (KCF) has become state of the art, partly because of its high speed, although its performance in dealing with diverse situations requires some improvement. We present a new tracking method whereby the reliability is greatly enhanced while maintaining speed by integrating a Kalman filter with the KCF. The tracker works as follows. After the KCF estimates the target's position based on the prediction by the Kalman filter, the estimated value is given to the updating step of the Kalman filter. During the KCF learning phase, the kernel model is updated using the corrected state. Evaluation results using standard tracking databases suggest that the present tracker outperforms the standard KCF, MOSSE, and MIL trackers. In particular, it is the only tracker that can deal very well with occlusion and human-crossing, which are crucial requirements for high-end surveillance.
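The predict-then-update loop described above can be sketched with a minimal constant-velocity Kalman filter. This is an illustrative skeleton, not the paper's implementation; `kcf_estimate` is a hypothetical stand-in for the KCF response-peak search:

```python
import numpy as np

class KalmanTracker:
    """Constant-velocity Kalman filter over state [x, y, vx, vy]."""
    def __init__(self, x0, y0, dt=1.0):
        self.x = np.array([x0, y0, 0.0, 0.0])
        self.P = np.eye(4) * 10.0                       # state covariance
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = dt                # motion model
        self.H = np.eye(2, 4)                           # observe position only
        self.Q = np.eye(4) * 1e-2                       # process noise
        self.R = np.eye(2) * 1.0                        # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        y = np.asarray(z, float) - self.H @ self.x      # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)        # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

def kcf_estimate(predicted, true_pos):
    # Hypothetical placeholder: a real KCF would search around `predicted`
    # and return the correlation response peak.
    return true_pos

kt = KalmanTracker(0.0, 0.0)
for t in range(1, 6):                                   # target moves +2 px/frame in x
    pred = kt.predict()                                 # Kalman prediction
    z = kcf_estimate(pred, (2.0 * t, 0.0))              # KCF refines it
    kt.update(z)                                        # refined value fed back
print(np.round(kt.x[:2], 1))
```

During occlusion, the KCF measurement would be rejected and the filter would coast on its prediction, which is what lets the combined tracker survive human-crossing.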

Xuan-Phung Huynh, In-Ho Choi, Yong-Guk Kim

Biomedical Image Processing and Analysis

Frontmatter
Automatic BI-RADS Classification of Mammograms

Mammograms provide a significant amount of information, which allows the classification of breast tissue into one of four breast density categories. The higher the category score, the greater the amount of dense (fibroglandular) tissue in the breast. These categories were proposed to give an indication of the sensitivity of mammography, but it is also widely acknowledged that breast density is associated with the risk of developing cancer. Thus, accurate and reproducible measures for classifying breast density are important for breast cancer screening and risk assessment. We present our VolparaTM algorithm to automatically estimate the volumetric breast density (VBD) from mammograms. VBD is the percentage of fibroglandular tissue in the breast and is a physiological measure of breast composition. Volpara uses a physics model together with image information derived from a mammogram to report the breast density. In this paper, we compare Volpara's VBD with various statistical texture measures across 1179 mammograms. This comparison shows that Volpara has the best performance in categorising breast density with respect to radiologists' readings.

Nabeel Khan, Kaier Wang, Ariane Chan, Ralph Highnam
Analyzing Muscle Activity and Force with Skin Shape Captured by Non-contact Visual Sensor

Estimating physical information by vision, as humans do, is useful for applications with physical interaction in the real world. For example, observing muscle bulging indicates how much force a person exerts with the muscle to interact with an object or environment. Since human skin deforms due to muscle activity, skin deformation is expected to provide information for analyzing human motion. This paper demonstrates that biomechanical information can be derived from skin shape by analyzing the relationship between skin deformation, force produced by muscles, and muscle activity. We first obtained a dataset acquired simultaneously by a range sensor, a force sensor, and electromyograph (EMG) sensors. Since recent range sensors based on non-contact visual measurement acquire the accurate and dense shape of an object at a high frame rate, the deforming skin can be observed. The deformation is calculated by finding the correspondence between a template shape and each range scan, and the relationship between skin deformation and the other data is learned. In this paper, the following problems are considered: (1) estimating force from skin shape, (2) estimating muscle activity from skin shape, and (3) synthesizing skin shape from muscle activity. In the experiments, the database learned from the sensor data is shown to be usable for the above problems, and the skin shape gives useful information to explain the muscle activity.

Ryusuke Sagawa, Yusuke Yoshiyasu, Alexander Alspach, Ko Ayusawa, Katsu Yamane, Adrian Hilton
Regression as a Tool to Measure Segmentation Quality and Preliminary Indicator of Diseased Lungs

Segmentation of the lung from HRCT thorax images was studied, and an automatic method of determining the segmentation area is proposed. High segmentation quality is considered achieved when the segmented area from the proposed algorithm is almost identical to the area obtained from manual tracings by a lung expert (ground truth). High correlation between the two types of segmented areas showed that regression may be used as a tool to measure segmentation quality. Supplementary information may also be obtained from the regression plot: the prediction interval may be used as a possible indicator of disease, whilst outliers may indicate low segmentation quality or a possible severity of the disease.

Norliza Mohd. Noor, Omar Mohd. Rijal, Joel Chia Ming Than, Rosminah M. Kassim, Ashari Yunus
An Image Registration Method with Radial Feature Points Sampling: Application to Follow-Up CT Scans of a Solitary Pulmonary Nodule

In order to support radiologists' follow-up of two CT scans captured in the past and in the present, we aimed to develop a system that displays both a region of interest (ROI) in one image selected by a radiologist and the corresponding ROI in the other image. In this paper, we propose a registration method for the system. A typical registration method identifies several pairs of matched feature points (i.e., matching pairs) between two images within the range of a predefined distance from the ROI's center point (i.e., the interest point) to correct the positional shift of an organ caused by heartbeat and breathing. However, low registration accuracy is often observed because of a biased distribution or a small number of matching pairs, depending on the sampling range. We developed a novel registration method that radially and evenly searches for several nearest matching pairs around the interest point and then estimates a translation vector at the interest point as a weighted average of these nearest pairs, using a weighting factor based on the distance from the interest point. This method is based on the assumption that the transformation of an interest point is consistent with the transformation of a nearby point, since the lung is a continuum. The results of a comparative evaluation of the existing method and the proposed method on 15 cases showed that the accuracy of the proposed method was higher than that of the existing method in 13 of 15 cases. We analyzed the association between the accuracy and the sampling range and found that the accuracy of the proposed method was similar to the best performance of the existing method with an ideal range for sampling the matching pairs. Finally, we showed evidence that the new method was reasonably consistent in terms of giving the best performance.
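The distance-weighted averaging of nearest matching pairs described above can be sketched as follows. This is a simplified illustration (inverse-distance weights over the k nearest pairs; the paper's radial, evenly distributed sector sampling is omitted):

```python
import numpy as np

def interpolate_translation(interest_pt, match_pts, match_vecs, k=4, eps=1e-6):
    """Estimate the translation at an interest point as the inverse-distance
    weighted average of the translation vectors of its k nearest matching
    pairs, assuming nearby lung tissue moves together (the lung is a continuum)."""
    d = np.linalg.norm(match_pts - interest_pt, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + eps)          # closer pairs get larger weights
    w /= w.sum()
    return (w[:, None] * match_vecs[idx]).sum(axis=0)

# Three nearby pairs translated by (1, 1) and one distant outlier pair.
pts = np.array([[0., 0.], [10., 0.], [0., 10.], [50., 50.]])
vecs = np.array([[1., 1.], [1., 1.], [1., 1.], [9., 9.]])
t = interpolate_translation(np.array([1., 1.]), pts, vecs, k=3)
print(t)  # close to [1, 1]: the distant pair is ignored
```

The estimated vector then maps the ROI's center in one scan to its counterpart in the other.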

Masaki Ishihara, Yuji Matsuda, Masahiko Sugimura, Susumu Endo, Hiroaki Takebe, Takayuki Baba, Yusuke Uehara

Object and Pattern Recognition

Frontmatter
Time Consistent Estimation of End-Effectors from RGB-D Data

End-effectors are usually located at the free end of a kinematic chain, and each contains rich structural information about the entity. Hence, estimating stable end-effectors of different entities enables robust tracking as well as a generic representation. In this paper, we present a system for end-effector estimation from RGB-D stream data. Instead of relying on a specific pose or configuration for initialization, we exploit time coherence without making any assumptions about prior knowledge. This makes the estimation process more robust within a predict-update framework. Qualitative and quantitative experiments against the reference method show promising results.

Xiao Lin, Josep R. Casas, Montse Pardás
Volume-Based Semantic Labeling with Signed Distance Functions

Research works on the two topics of Semantic Segmentation and SLAM (Simultaneous Localization and Mapping) have been following separate tracks. Here, we link them quite tightly by delineating a category label fusion technique that allows for embedding semantic information into the dense map created by a volume-based SLAM algorithm such as KinectFusion. Accordingly, our approach is the first to provide a semantically labeled dense reconstruction of the environment from a stream of RGB-D images. We validate our proposal using a publicly available semantically annotated RGB-D dataset and (a) employing ground truth labels, (b) corrupting such annotations with synthetic noise, (c) deploying a state of the art semantic segmentation algorithm based on Convolutional Neural Networks.

Tommaso Cavallari, Luigi Di Stefano
Simultaneous Camera, Light Position and Radiant Intensity Distribution Calibration

We propose a practical method for calibrating the position and the Radiant Intensity Distribution (RID) of light sources from images of Lambertian planes. In contrast with existing techniques that rely on the presence of specularities, we prove a novel geometric property of the brightness of Lambertian planes that allows the illuminant parameters to be calibrated robustly, without the detrimental effects of view-dependent reflectance and with a large decrease in complexity. We further show closed-form solutions for the position and RID of common types of light sources. The proposed method can be seamlessly integrated within the camera calibration pipeline, and its validity against the state of the art is shown on both synthetic and real data.

Marco Visentini-Scarzanella, Hiroshi Kawasaki
A General Vocabulary Based Approach for Fine-Grained Object Recognition

In this paper, we deal with the classification problem of visually similar objects, also known as fine-grained recognition. We consider both rigid and non-rigid types of objects. We investigate the classification performance of different combinations of bag-of-visual-words models to find a generalized set of visual words for different types of fine-grained classification. We combine the feature sets using a multi-class multiple learning algorithm. We evaluate the models on two datasets: in the non-rigid, deformable object category, the Oxford 102-class flower dataset is chosen, and a 17-class make and model recognition car dataset is selected in the rigid category. Results show that our combination of vocabulary sets provides reasonable accuracies of 81.05 % and 96.76 % on the flower and car datasets, respectively.

Shubhra Aich, Chil-Woo Lee
A Triangle Mesh Reconstruction Method Taking into Account Silhouette Images

In this paper, we propose a novel approach to reconstruct triangle meshes from point sets by taking the silhouette of the target object into consideration. Recently, many approaches have been proposed for complete 3D reconstruction of moving objects. For example, motion capture techniques are used to acquire 3D data of human motion. However, they need markers attached to the joints, which limits the capturing environments and the amount of data that can be acquired. In contrast, to obtain dense data of a 3D object, a multi-view stereo scanning system is one of the most powerful methods. It utilizes images taken from several directions and enables the reconstruction of dense 3D point sets using epipolar geometry. However, it is still a challenging problem to reconstruct a 3D triangle mesh from such point sets due to spurious points originating from mismatches between images. We propose a novel approach to obtain a more accurate triangle mesh reconstruction than the previous method. We take advantage of the silhouette images acquired in the process of reconstructing the 3D point sets, which allows noise removal and hole filling. Finally, we demonstrate that the proposed method can generate details of the surface that the previous method loses when only a small number of points is available.

Michihiro Mikamo, Yoshinori Oki, Marco Visentini-Scarzanella, Hiroshi Kawasaki, Ryo Furukawa, Ryusuke Sagawa
All-Focus Image Fusion and Depth Image Estimation Based on Iterative Splitting Technique for Multi-focus Images

This paper concerns the processing of multi-focus images, which are captured by adjusting the position of the imaging plane step by step so that objects at different depths have their best focus in different images. Our goal is to synthesize an all-focus image and estimate the corresponding depth image for this multi-focus image set. In contrast to traditional pixel- or block-based techniques, our focus measures are computed based on irregular regions that are iteratively refined/split to adapt to varying image content. At first, an initial all-focus image is obtained and then segmented to get initial region definitions. The regional Focus Evaluation Curve (FEC) along the focal-length axis and a regional label histogram are then analyzed to determine whether a region should be subject to further splitting. After convergence, the final region definitions are used to perform WTA (winner-take-all) selection of best-focus image pixels from the image set. The depth image then corresponds to the label image by which the best-focus pixels are chosen. Experiments show that our adaptive region-based algorithm has performance (in synthesis quality, depth map, and speed) superior to other prior works and commercial software that adopt a pixel-weighting strategy.

Wen-Nung Lie, Chia-Che Ho
Stereo Matching Techniques for High Dynamic Range Image Pairs

We investigate stereo matching techniques for high dynamic range (HDR) image pairs. This is an emerging topic in computer vision and multimedia applications due to the availability of HDR image capture devices, and disparity computation will eventually need to take stereo HDR input. In this work, three state-of-the-art stereo matching algorithms are modified and used to test the advantages of HDR stereo matching. By performing HDR bit-plane slicing, it is found that only about 16 bits per channel are required for the HDR image format. We propose a 16-bit unsigned integer format to store the HDR image, which allows the available stereo matching algorithms to be adopted for disparity computation. Experiments and performance evaluation are carried out using the Middlebury stereo datasets.
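The idea of packing HDR radiance into 16-bit unsigned integers can be illustrated with a simple log-encoding sketch. This is an assumed encoding for illustration, not necessarily the format proposed in the paper:

```python
import numpy as np

def hdr_to_uint16(hdr, eps=1e-6):
    """Log-encode floating-point HDR radiance and quantize to 0..65535."""
    log_img = np.log(hdr + eps)
    lo, hi = log_img.min(), log_img.max()
    scaled = (log_img - lo) / (hi - lo)              # normalize to [0, 1]
    return np.round(scaled * 65535).astype(np.uint16), (lo, hi)

def uint16_to_hdr(img16, lo_hi, eps=1e-6):
    """Invert the encoding given the stored log range."""
    lo, hi = lo_hi
    log_img = img16.astype(np.float64) / 65535 * (hi - lo) + lo
    return np.exp(log_img) - eps

# Six orders of magnitude of radiance survive the 16-bit round trip
# with small relative error, consistent with the bit-plane finding.
hdr = np.array([[0.01, 1.0], [100.0, 10000.0]])
packed, log_range = hdr_to_uint16(hdr)
restored = uint16_to_hdr(packed, log_range)
print(np.max(np.abs(restored - hdr) / hdr))  # small relative error
```

Integer storage like this is what lets conventional stereo matchers, written for integer pixel values, consume HDR input directly.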

Huei-Yung Lin, Chung-Chieh Kao
Discriminative Properties in Directional Distributions for Image Pattern Recognition

We clarify mathematical properties for the accurate and robust application of the histogram of oriented gradients method. This method extracts image features from the distribution of gradients within a shifting bounding box. We show that this aggregated distribution over local regions extracts the low-frequency components of an image. Furthermore, we show that the normalisation of histograms in this method is a nonlinear mapping. Moreover, we show that a combination of the dominant directional distribution and the Wasserstein distance recognises images of particular objects as accurately as the histogram of oriented gradients method.

Hayato Itoh, Atsushi Imiya, Tomoya Sakai
Deep Boltzmann Machines for i-Vector Based Audio-Visual Person Identification

We propose an approach using DBM-DNNs for i-vector based audio-visual person identification. The unsupervised training of two Deep Boltzmann Machines, DBM$$_{\text {speech}}$$ and DBM$$_\text {face}$$, is performed using unlabeled audio and visual data from a set of background subjects. The DBMs are then used to initialize two corresponding DNNs for classification, referred to as the DBM-DNN$$_{\text {speech}}$$ and DBM-DNN$$_{\text {face}}$$ in this paper. The DBM-DNNs are discriminatively fine-tuned using back-propagation on a set of training data and evaluated on a set of test data from the target subjects. We compared their performance with the cosine distance (cosDist) and the state-of-the-art DBN-DNN classifier. We also tested three different configurations of the DBM-DNNs and show that DBM-DNNs with two hidden layers and 800 units in each hidden layer achieved the best identification performance for 400-dimensional i-vectors as input. Our experiments were carried out on the challenging MOBIO dataset.

Mohammad Rafiqul Alam, Mohammed Bennamoun, Roberto Togneri, Ferdous Sohel
Improved DSIFT Descriptor Based Copy-Rotate-Move Forgery Detection

In recent years, there has been a dramatic increase in the number of images captured by users. This is due to the wide availability of digital cameras and mobile phones which are able to capture and transmit images. Simultaneously, image-editing applications have become more usable, and a casual user can easily improve the quality of an image or change its content. The most common type of image modification is cloning, or copy-move forgery (CMF), which is easy to implement and difficult to detect. In most cases, it is hard to detect CMF with the naked eye and many possible manipulations (attacks) can be used to make the doctored image more realistic. In CMF, the forger copies part(s) of the image and pastes them back into the same image. One possible transformation is rotation, where an object is copied, rotated and pasted. Rotation-invariant features need to be used to detect Copy-Rotate-Move (CRM) forgery. In this paper we present three contributions. First, a new technique to detect CMF is developed, using Dense Scale-Invariant Feature Transform (DSIFT). Second, a new improved DSIFT descriptor is implemented which is more robust to rotation than Zernike moments. Third, a new method to remove false matching is proposed. Extensive experiments have been conducted to train, evaluate and test the algorithms, the new feature vector and the suggested method to remove false matching. We show that the proposed method can detect forgery in images with blurring, brightness change, colour reduction, JPEG compression, variations in contrast and added noise.

Ali Retha Hasoon Khayeat, Xianfang Sun, Paul L. Rosin
Local Clustering Patterns in Polar Coordinate for Face Recognition

Facial recognition is an important issue with various practical applications in visual surveillance systems. In this paper, we propose a novel local pattern descriptor for face recognition, called the Local Clustering Pattern (LCP), which operates in the polar coordinate system at low computational cost. Local derivative variations in multiple directions are considered and integrated over pairwise combinations of directions. To generate the discriminative local pattern, the features of local derivative variations are transformed into the polar coordinate system by generating the characteristics of distance (r) and angle ($$\theta $$). LCP is an ensemble of several decisions from the clustering algorithm for each pixel in the polar coordinate system (P.C.S.). Unlike existing local pattern descriptors, such as the local binary pattern (LBP) [1, 8], local derivative pattern (LDP) [11], and local tetra pattern (LTrP) [7], LCP generates a discriminative local clustering pattern in a low-order derivative space at low computational cost, which is stable in the process of face recognition. The performance of the proposed method is compared with LBP, LDP, and LTrP on the Extended Yale B [4, 5] and CAS-PEAL [3] databases.

Chih-Wei Lin, Kuan-Yin Lu

Computer Vision and Pattern Recognition

Frontmatter
Deep Convolutional Neural Network in Deformable Part Models for Face Detection

Deformable Part Models and Convolutional Neural Networks are state-of-the-art approaches in object detection. While Deformable Part Models make use of the general structure between parts and root models, Convolutional Neural Networks use all the information in the input to create meaningful features. Both types of characteristics are necessary for face detection. Inspired by this observation, first, we propose an extension of DPM that adaptively integrates a CNN for face detection, called DeepFace DPM, and propose a new combined model for face representation. Second, a new way of calculating non-maximum suppression is introduced to boost detection accuracy. We use the Face Detection Data Set and Benchmark to evaluate the merit of our method. Experimental results show that our method surpasses the best result of existing methods for face detection on the standard dataset, with an 87.06 % true positive rate at 1000 false positives. Our method sheds light on face detection, which is commonly regarded as a saturated area.

Dinh-Luan Nguyen, Vinh-Tiep Nguyen, Minh-Triet Tran, Atsuo Yoshitaka
Multimodal Gesture Recognition Using Multi-stream Recurrent Neural Network

In this paper, we present a novel method for multimodal gesture recognition based on neural networks. Our multi-stream recurrent neural network (MRNN) is a completely data-driven model that can be trained from end to end without domain-specific hand engineering. The MRNN extends recurrent neural networks with Long Short-Term Memory cells (LSTM-RNNs) that facilitate the handling of variable-length gestures. We propose a recurrent approach for fusing multiple temporal modalities using multiple streams of LSTM-RNNs. In addition, we propose alternative fusion architectures and empirically evaluate the performance and robustness of these fusion strategies. Experimental results demonstrate that the proposed MRNN outperforms other state-of-the-art methods on the Sheffield Kinect Gesture (SKIG) dataset and is significantly more robust to noisy inputs.

Noriki Nishida, Hideki Nakayama

Image/Video Processing and Analysis

Frontmatter
A Spatially Constrained Asymmetric Gaussian Mixture Model for Image Segmentation

Gaussian mixture models with spatial constraints play an important role in image segmentation. Nevertheless, most methods suffer from one or more challenges such as limited robustness to outliers, over-smoothed segmentations, and a lack of flexibility to fit different shapes of observed data. To address the above issues, in this paper, we propose a spatially constrained asymmetric Gaussian mixture model for image segmentation. The asymmetric distribution is utilized to fit different shapes of observed data. Our asymmetric model can then be constructed based on the posterior and prior probabilities of within-cluster and between-cluster relations. Moreover, we introduce two pseudo-likelihood quantities which respectively couple the neighboring priors of within-cluster and between-cluster relations based on the Kullback-Leibler divergence. Finally, we derive an expectation maximization algorithm to iteratively maximize an approximation of the lower bound of the data log-likelihood. Experimental results on synthetic and real images demonstrate the superior performance of the proposed algorithm compared with state-of-the-art segmentation approaches.

Zexuan Ji, Jinyao Liu, Hengdong Yuan, Yubo Huang, Quansen Sun
Object Recognition in Baggage Inspection Using Adaptive Sparse Representations of X-ray Images

In recent years, X-ray screening systems have been used to safeguard environments in which access control is of paramount importance. Security checkpoints have been placed at the entrances to many public places to detect prohibited items such as handguns and explosives. Human operators complete these tasks because automated recognition in baggage inspection is far from perfect. Research and development on X-ray testing is, however, ongoing into new approaches that can be used to aid human operators. This paper attempts to make a contribution to the field of object recognition by proposing a new approach called Adaptive Sparse Representation (XASR+). It consists of two stages: learning and testing. In the learning stage, for each object of training dataset, several random patches are extracted from its X-ray images in order to construct representative dictionaries. A stop-list is used to remove very common words of the dictionaries. In the testing stage, random test patches of the query image are extracted, and for each test patch a dictionary is built concatenating the ‘best’ representative dictionary of each object. Using this adapted dictionary, each test patch is classified following the Sparse Representation Classification (SRC) methodology. Finally, the query image is classified by patch voting. Thus, our approach is able to deal with less constrained conditions including some contrast variability, pose, intra-class variability, size of the image and focal distance. We tested the effectiveness of our method for the detection of four different objects. In our experiments, the recognition rate was more than 95 % in each class, and more than 85 % if the object is occluded less than 15 %. Results show that XASR+ deals well with unconstrained conditions, outperforming various representative methods in the literature.

Domingo Mery, Erick Svec, Marco Arias
Real-Time Lane Estimation Using Deep Features and Extra Trees Regression

In this paper, we present a robust real-time lane estimation algorithm by adopting a learning framework using a convolutional neural network and extra trees. By utilising the learning framework, the proposed algorithm predicts the ego-lane location in a given image even under conditions of lane marker occlusion or absence. In the algorithm, the convolutional neural network is trained to extract robust features from the road images, while the extra trees regression model is trained to predict the ego-lane location from the extracted road features. The extra trees are trained with input-output pairs of road features and ego-lane image points, where the ego-lane image points correspond to the Bezier spline control points used to define the left and right lane markers of the ego-lane. We validate our proposed algorithm using the publicly available Caltech dataset and an acquired dataset. A comparative analysis with baseline algorithms shows that our algorithm achieves better lane estimation accuracy, besides being robust to the occlusion and absence of lane markers. We report a computational time of 45 ms per frame. Finally, we report a detailed parameter analysis of our proposed algorithm.

Vijay John, Zheng Liu, Chunzhao Guo, Seiichi Mita, Kiyosumi Kidono
Contrast Based Hierarchical Spatial-Temporal Saliency for Video

Predicting human attention in video requires exploiting the temporal knowledge included in the video. We propose a novel hierarchical spatial-temporal saliency model for video based on the center-surround framework, using both static and temporal features. Saliency cues are analyzed through a hierarchical segmentation model and fused across multiple levels, yielding the spatial-temporal saliency map. An adaptive temporal window using motion information is also developed to combine the saliency values of consecutive frames in order to keep temporal consistency across frames. Performance evaluation on several popular benchmark datasets validates that our method outperforms existing state-of-the-art methods.

Trung-Nghia Le, Akihiro Sugimoto

Pattern Recognition

Frontmatter
Binary Descriptor Based on Heat Diffusion for Non-rigid Shape Analysis

This paper presents an efficient feature point descriptor for non-rigid shape analysis. The descriptor is developed based on the properties of the heat diffusion process on a shape. We use, for the first time, the Heat Kernel Signature at a particular time scale to define a scalar field on the manifold. Then, motivated by the successful use of a local reference frame for rigid shape analysis, we construct a repeatable local polar coordinate system, which is invariant under isometric deformations. Finally, a binary descriptor is derived by comparing the intensities of the neighboring points of each feature point. We show that the descriptor is highly discriminative and can be computed simply using ‘intensity comparisons’ on a shape. Furthermore, its similarity can be evaluated using the Hamming distance, which is very efficient to compute compared with the commonly used $$L_{2}$$ norm. Our experiments demonstrate superior performance compared to existing techniques on the standard TOSCA benchmark.
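The efficiency claim rests on a standard property of binary descriptors: the Hamming distance between two packed bit strings is just an XOR followed by a population count, which is far cheaper than a floating-point $$L_{2}$$ norm. A minimal sketch (illustrative, not the paper's code):

```python
import numpy as np

def hamming_distance(a, b):
    """Hamming distance between two packed binary descriptors (uint8 arrays):
    XOR the bytes, then count the set bits."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

# Two toy 8-bit descriptors built from intensity comparisons
d1 = np.packbits(np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8))
d2 = np.packbits(np.array([1, 1, 1, 0, 0, 0, 1, 1], dtype=np.uint8))
# hamming_distance(d1, d2) -> 3 (bits differ at positions 1, 3 and 7)
```

On real hardware the popcount maps to a single instruction per machine word, which is why binary descriptors support very fast matching.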

Xupeng Wang, Ferdous Sohel, Mohammed Bennamoun, Hang Lei
Table Detection from Slide Images

In this paper we propose a solution to detect tables in slide images. Presentation slides are a document type of growing importance, but the layout differences between slides and traditional documents make many existing table detection methods less effective on slides. The proposed solution works with both high-resolution slide images from digital files and low-resolution slide screenshots from videos. Taking OCR (Optical Character Recognition) as the initial step, a heuristic analysis of the page layout considers not only the table structure but also the textual content. The evaluation shows that the proposed solution achieves an approximate accuracy of 80 %. It clearly surpasses the open-source academic solution Tesseract and also outperforms the commercial software ABBYY FineReader, which is widely regarded as one of the best table detection tools.
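One common building block for OCR-based heuristic layout analysis, of the kind the abstract describes, is grouping recognised word boxes into candidate rows by vertical alignment; runs of rows with several aligned cells then suggest a table region. The sketch below is a hypothetical simplification under that assumption, not the paper's actual heuristic:

```python
def group_rows(boxes, y_tol=10):
    """Toy layout heuristic: group OCR word boxes (x, y, w, h) into candidate
    table rows by clustering on vertical position. A run of rows that each
    contain two or more horizontally separated boxes hints at a table."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):  # sort top-to-bottom
        if rows and abs(rows[-1][0][1] - box[1]) <= y_tol:
            rows[-1].append(box)  # same row: y within tolerance
        else:
            rows.append([box])    # start a new row
    return rows
```

A real system would additionally check column alignment across rows and, as the abstract notes, inspect the textual content of the cells.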

Xiaoyin Che, Haojin Yang, Christoph Meinel
Face Search in Encrypted Domain

Visual information in images and videos is usually encrypted for security applications. Straightforward manipulation of the encrypted data, without requiring any decryption, has the advantage of speed over performing those operations in the spatial, temporal, frequency, or compressed domain. In this paper, we investigate encrypted image search. More specifically, given a face image as the target object, we search for it amongst encrypted images. We accomplish the search using a novel method that extracts features and locates the face object region within a given encrypted image. We evaluate the search results using precision and recall as well as the F-measure. Our experiments reveal a trade-off between the quality of search and the quality of encryption, namely, stronger encryption leads to poorer search results.
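The evaluation metrics named in the abstract are standard retrieval measures; for reference, a minimal computation from true-positive, false-positive and false-negative counts (our own sketch, not the authors' evaluation code):

```python
def f_measure(tp, fp, fn):
    """Precision, recall and F1 from retrieval counts.
    Precision = tp / (tp + fp); recall = tp / (tp + fn);
    F1 is their harmonic mean."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0, precision, recall
    f1 = 2 * precision * recall / (precision + recall)
    return f1, precision, recall
```

The trade-off the abstract reports would show up here as precision and recall (and hence F1) dropping as the encryption strength increases.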

Wei Qi Yan, Mohan S. Kankanhalli
Backmatter
Metadata
Title
Image and Video Technology
Edited by
Thomas Bräunl
Brendan McCane
Mariano Rivera
Xinguo Yu
Copyright Year
2016
Electronic ISBN
978-3-319-29451-3
Print ISBN
978-3-319-29450-6
DOI
https://doi.org/10.1007/978-3-319-29451-3