
2015 | Book

Computer Vision - ACCV 2014 Workshops

Singapore, Singapore, November 1-2, 2014, Revised Selected Papers, Part I


About this Book

The three-volume set, consisting of LNCS 9008, 9009, and 9010, contains carefully reviewed and selected papers presented at 15 workshops held in conjunction with the 12th Asian Conference on Computer Vision, ACCV 2014, in Singapore, in November 2014. The 153 full papers presented were selected from numerous submissions. LNCS 9008 contains the papers selected for the Workshop on Human Gait and Action Analysis in the Wild, the Second International Workshop on Big Data in 3D Computer Vision, the Workshop on Deep Learning on Visual Data, the Workshop on Scene Understanding for Autonomous Systems and the Workshop on Robust Local Descriptors for Computer Vision. LNCS 9009 contains the papers selected for the Workshop on Emerging Topics on Image Restoration and Enhancement, the First International Workshop on Robust Reading, the Second Workshop on User-Centred Computer Vision, the International Workshop on Video Segmentation in Computer Vision, the Workshop: My Car Has Eyes: Intelligent Vehicle with Vision Technology, the Third Workshop on E-Heritage and the Workshop on Computer Vision for Affective Computing. LNCS 9010 contains the papers selected for the Workshop on Feature and Similarity for Computer Vision, the Third International Workshop on Intelligent Mobile and Egocentric Vision and the Workshop on Human Identification for Surveillance.

Table of Contents

Frontmatter

Human Gait and Action Analysis in the Wild: Challenges and Applications

Frontmatter
A New Gait-Based Identification Method Using Local Gauss Maps

We propose a new descriptor for human identification based on gait. The current and most prevailing trend in gait representation revolves around encoding body shapes as silhouettes averaged over gait cycles. Our method, however, captures geometric properties of the silhouette boundaries. Namely, we evaluate contour curvatures locally using Gauss maps. This results in an improved shape representation compared to average silhouettes. In addition, our approach does not require prior training. We thoroughly demonstrate the superiority of our method in gait-based human identification over state-of-the-art approaches. We use the OU-ISIR Large Population dataset, with over 4000 subjects captured at different viewing angles, to provide statistically reliable results.

Hazem El-Alfy, Ikuhisa Mitsugami, Yasushi Yagi
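
To make the Gauss-map idea concrete, here is a minimal sketch of estimating contour curvature from the spread of boundary normal directions; the discrete-normal construction and the window size are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def contour_normals(contour):
    """Unit normals along a closed 2-D contour given as an (N, 2) array."""
    # Tangents via central differences on the closed curve.
    t = np.roll(contour, -1, axis=0) - np.roll(contour, 1, axis=0)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    # Rotating each tangent by 90 degrees gives the normal direction.
    return np.stack([t[:, 1], -t[:, 0]], axis=1)

def local_gauss_map_curvature(contour, half_window=5):
    """Curvature proxy: the angle swept on the unit circle (the image of the
    local Gauss map) by the normals inside a sliding window."""
    normals = contour_normals(contour)
    angles = np.arctan2(normals[:, 1], normals[:, 0])
    n = len(angles)
    curvature = np.empty(n)
    for i in range(n):
        idx = np.arange(i - half_window, i + half_window + 1) % n
        local = np.unwrap(angles[idx])
        curvature[i] = abs(local[-1] - local[0]) / (2 * half_window)
    return curvature
```
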
Hand Detection and Tracking in Videos for Fine-Grained Action Recognition

In this paper, we develop an effective method of detecting and tracking hands in uncontrolled videos based on multiple cues including hand shape, skin color, upper body position, and flow information. We apply our hand detection results to perform fine-grained human action recognition. We demonstrate that motion features extracted from hand areas can help classify actions even when they look similar and are associated with visually similar objects. We validate our method of detecting and tracking hands on the VideoPose2.0 dataset and apply our method of classifying actions to the playing-instrument group of the UCF-101 dataset. Experimental results show the effectiveness of our approach.

Nga H. Do, Keiji Yanai
Enhancing Person Re-identification by Integrating Gait Biometric

This paper proposes a method to enhance person re-identification by integrating gait biometrics. The framework consists of hierarchical feature extraction and matching methods. Because the appearance feature alone is not discriminative in some cases, the feature in this work combines the appearance feature and the gait feature to capture shape and temporal information. To address the view-angle change problem and to measure similarity, metric learning to rank is adopted: data are mapped into a metric space so that distances between people can be measured accurately. Two fusion strategies are then proposed. Score-level fusion computes distances for the appearance feature and the gait feature separately and combines them into the final distance between samples. Feature-level fusion first concatenates the two types of features and then computes distances with the fused feature. Finally, our method is tested on the CASIA gait dataset. Experiments show that gait is an effective biometric to integrate with appearance features to enhance person re-identification.

Zheng Liu, Zhaoxiang Zhang, Qiang Wu, Yunhong Wang
Real Time Gait Recognition System Based on Kinect Skeleton Feature

Gait recognition is a biometric technique that identifies people by their walking pose. Conventionally, gait features are extracted from ordinary video data; more recently, researchers have taken advantage of Kinect to obtain depth information or joint positions for recognition. This paper focuses on the lengths of bones (the static feature) and the angles of joints (the dynamic feature) derived from Kinect skeleton information. After preprocessing, the two kinds of feature templates are stored in a database established for the system. For the static feature, distances are computed with the Euclidean metric; for the dynamic feature, with the dynamic time warping (DTW) algorithm. The static and dynamic distances are then fused, and a nearest neighbor (NN) classifier performs the final classification, yielding a real-time recognition system with good recognition results.

Shuming Jiang, Yufei Wang, Yuanyuan Zhang, Jiande Sun
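
The distance computations described above can be sketched as follows; the fusion weight and the sequence format are illustrative assumptions, and the paper's exact normalization may differ.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two sequences (1-D or (T, d))."""
    a, b = np.asarray(seq_a, dtype=float), np.asarray(seq_b, dtype=float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.atleast_1d(a[i - 1] - b[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def fused_distance(static_p, static_g, dynamic_p, dynamic_g, weight=0.5):
    """Score-level fusion: Euclidean distance on bone lengths (static) plus
    DTW distance on joint-angle sequences (dynamic). `weight` is a free
    parameter for illustration, not a value from the paper."""
    d_static = np.linalg.norm(np.asarray(static_p, float) - np.asarray(static_g, float))
    d_dynamic = dtw_distance(dynamic_p, dynamic_g)
    return weight * d_static + (1.0 - weight) * d_dynamic
```
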
2-D Structure-Based Gait Recognition in Video Using Incremental GMM-HMM

Gait analysis is a feasible approach for human identification in intelligent video surveillance. However, the effectiveness of the dominant silhouette-based approaches is severely affected by clothing, bags, hair style, and the like. In this paper, we propose a useful 2-D structural feature, named the skeleton-based feature; effective improvements for human pose estimation in walking environments; and a recognition framework based on GMM-HMM with incremental learning, which can greatly improve the availability of gait traits in intelligent video surveillance. Our skeleton-based feature represents the torso with a 15-DOF model, which is effective in eliminating the interference of clothing, bags, hair style, and the like. In addition, to imitate the natural way of human walking, a Hidden Markov Model (HMM) representing the gait dynamics of human walking incrementally evolves from an average human walking model that represents the average motion process of human walking. Our work makes gait recognition more robust to noise. Experiments on widely adopted databases show that the proposed method achieves excellent performance.

Rui Pu, Yunhong Wang
Unsupervised Temporal Ensemble Alignment for Rapid Annotation

This paper presents a novel framework for the unsupervised alignment of an ensemble of temporal sequences. This approach draws inspiration from the axiom that an ensemble of temporal signals stemming from the same source/class should have lower rank when “aligned” rather than “misaligned”. Our approach shares similarities with recent state-of-the-art methods for unsupervised image ensemble alignment (e.g., RASL), which break the problem into a set of image alignment problems with well-known solutions (i.e., the Lucas-Kanade algorithm). Similarly, we propose a strategy for decomposing the problem of temporal ensemble alignment into a similar set of independent sequence problems, which we claim can be solved reliably through Dynamic Time Warping (DTW). We demonstrate the utility of our method on the Cohn-Kanade+ dataset by aligning expression onset across multiple sequences, which allows us to automate the rapid discovery of event annotations.

Ashton Fagg, Sridha Sridharan, Simon Lucey
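
The low-rank axiom can be made concrete with a simple rank surrogate over a stacked ensemble; a full method would search over per-sequence warpings that minimize this quantity. The tolerance threshold below is an assumption for illustration.

```python
import numpy as np

def effective_rank(sequences, tol=1e-2):
    """Count singular values above a fraction of the largest one for an
    ensemble of equal-length sequences stacked as rows. A well-aligned
    ensemble of same-class signals should yield a lower count."""
    X = np.asarray(sequences, dtype=float)
    s = np.linalg.svd(X, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

# An alignment search would choose per-sequence DTW warpings that
# minimize effective_rank (or, as in RASL-style methods, the nuclear norm).
```
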
Motion Boundary Trajectory for Human Action Recognition

In this paper, we propose a novel approach to extracting local descriptors of a video based on two ideas: first, using motion boundaries between objects; and second, using the motion boundary trajectories extracted from videos, together with other local descriptors computed in their neighbourhood (histogram of oriented gradients, histogram of optical flow, and motion boundary histogram), as local descriptors for video representation. The motion boundary approach captures more of the information between moving objects, part of which might otherwise be masked by camera movements. We compare the performance of the proposed motion boundary trajectory approach with other state-of-the-art approaches, e.g., the trajectory-based approach, on a number of human action benchmark datasets (YouTube, UCF sports, Olympic Sports, HMDB51, Hollywood2, and UCF50), and find that the proposed approach gives improved recognition results.

Sio-Long Lo, Ah-Chung Tsoi
Action Recognition Using Hybrid Feature Descriptor and VLAD Video Encoding

Human action recognition in video has found widespread application in many fields. However, this task still faces many challenges due to intra-class diversity and inter-class overlaps among different action categories. The key to action recognition lies in extracting features comprehensive enough to cover the action, together with a compact and discriminative video encoding representation. Based on this observation, in this paper we propose a hybrid feature descriptor that combines a static descriptor and a motion descriptor to cover more action information inside video clips. We also adopt the VLAD encoding method to encapsulate more structural information within the distribution of feature vectors. The recognition performance of our framework is evaluated on three benchmark datasets: KTH, Weizmann, and YouTube. The experimental results demonstrate that the hybrid descriptor, combined with VLAD encoding, outperforms traditional descriptors by a large margin.

Dong Xing, Xianzhong Wang, Hongtao Lu
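
For reference, a minimal sketch of standard VLAD encoding (nearest-codeword residual aggregation with power and L2 normalization); the codebook is assumed to come from k-means, and the normalization details follow common practice rather than this specific paper.

```python
import numpy as np

def vlad_encode(descriptors, codebook):
    """VLAD: sum residuals of local descriptors to their nearest codeword,
    then apply power- and L2-normalization to the flattened result."""
    k, d = codebook.shape
    # Hard-assign each descriptor (row) to its nearest cluster center.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assign = np.argmin(dists, axis=1)
    v = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assign == i]
        if len(members):
            v[i] = (members - codebook[i]).sum(axis=0)
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))   # power (signed square-root) normalization
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```
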
Human Action Recognition Based on Oriented Motion Salient Regions

Motion is the most informative cue for human action recognition. Regions with high motion saliency indicate where actions occur and contain the visual information most relevant to actions. In this paper, we propose a novel approach for human action recognition based on oriented motion salient regions (OMSRs). Firstly, we apply a bank of 3D Gabor filters and an opponent inhibition operator to detect the OMSRs of videos, each of which corresponds to a specific motion direction. Then, a new low-level feature, named the oriented motion salient descriptor (OMSD), is proposed to describe the obtained OMSRs through statistics of the texture in the regions. Next, we utilize the obtained OMSDs to explore the oriented characteristics of action classes and generate a set of class-specific oriented attributes (CSOAs) for each class. These CSOAs provide a compact and discriminative middle-level representation for human actions. Finally, an SVM classifier is utilized for human action classification, and a new compatibility function is devised for measuring how well a given action matches the CSOAs of a certain class. We test the proposed approach on four public datasets and the experimental results validate the effectiveness of our approach.

Baoxin Wu, Shuang Yang, Chunfeng Yuan, Weiming Hu, Fangshi Wang
3D Activity Recognition Using Motion History and Binary Shape Templates

This paper presents our work on activity recognition in 3D depth images. We propose a global descriptor that is accurate, compact and easy to compute as compared to the state-of-the-art for characterizing depth sequences. The activity enactment video is divided into temporally overlapping blocks. Each block (a set of image frames) is used to generate Motion History Templates (MHTs) and Binary Shape Templates (BSTs) over three different views - front, side and top. The three views are obtained by projecting each video frame onto three mutually orthogonal Cartesian planes. MHTs are assembled by stacking the difference of consecutive frame projections in a weighted manner, separately for each view. Histograms of oriented gradients are computed and concatenated to represent the motion content. Shape information is obtained through a similar gradient analysis over the BSTs. These templates are built by overlaying all the body silhouettes in a block, separately for each view. To effectively trace shape growth, BSTs are built additively along the blocks.

Consequently, the complete ensemble of gradient features carries both 3D shape and motion information to effectively model the dynamics of an articulated body movement. Experimental results on 4 standard depth databases (MSR 3D Hand Gesture, MSR Action, Action-Pairs, and UT-Kinect) prove the efficacy as well as the generality of our compact descriptor. Further, we successfully demonstrate the robustness of our approach to (impulsive) noise and occlusion errors that commonly affect depth data.

Saumya Jetley, Fabio Cuzzolin
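
A minimal sketch of the weighted accumulation behind a Motion History Template for one view; the exponential decay factor is an illustrative assumption, since the abstract does not give the exact weighting. HOG features would then be computed over the returned map.

```python
import numpy as np

def motion_history_template(projections, decay=0.9):
    """One view's MHT: weighted stack of absolute differences between
    consecutive frame projections, so that recent motion dominates.
    `decay` is illustrative; the paper's exact weighting may differ."""
    mht = np.zeros(np.asarray(projections[0]).shape, dtype=float)
    for prev, curr in zip(projections[:-1], projections[1:]):
        diff = np.abs(np.asarray(curr, dtype=float) - np.asarray(prev, dtype=float))
        mht = decay * mht + diff
    return mht  # HOG features are then computed over this map
```
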
Gait Recognition Based Online Person Identification in a Camera Network

In this paper, we propose a novel online multi-camera framework for person identification based on gait recognition using Grassmann Discriminant Analysis. We propose an online method wherein the gait spaces of individuals are created as they are tracked. The gait space is view invariant and the recognition process is carried out in a distributed manner. We assume that only a fixed, known set of people are allowed to enter the area under observation. During the training phase, multi-view data for each individual is collected from each camera in the network and their global gait space is created and stored. During the test phase, as an unknown individual is observed by the network of cameras, simultaneously or sequentially, his/her gait space is created. Grassmann manifold theory is applied for classifying the individual: the gait space of an individual is a point on a Grassmann manifold, and the distance between two gait spaces is the distance between two points on the manifold. Person identification is therefore carried out on-the-fly, based on the uniqueness of gait, using Grassmann discriminant analysis.

Ayesha Choudhary, Santanu Chaudhury
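
The Grassmann-manifold distance used above can be sketched via principal angles; representing a gait space by the leading left singular vectors of the feature matrix is one common construction and an assumption here, not necessarily the authors' exact choice.

```python
import numpy as np

def gait_space(features, k=10):
    """One common way to turn a T x D matrix of per-frame gait features into
    a point on a Grassmann manifold: the span of the leading k left singular
    vectors (assumes k <= min(D, T))."""
    U, _, _ = np.linalg.svd(np.asarray(features, dtype=float).T,
                            full_matrices=False)
    return U[:, :k]

def grassmann_distance(A, B):
    """Geodesic distance between two subspaces with orthonormal columns:
    the singular values of A^T B are the cosines of the principal angles."""
    cosines = np.linalg.svd(A.T @ B, compute_uv=False)
    theta = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float(np.linalg.norm(theta))
```
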
Gesture Recognition Performance Score: A New Metric to Evaluate Gesture Recognition Systems

In spite of the many choices available for gesture recognition algorithms, the selection of a proper algorithm for a specific application remains a difficult task. The available algorithms have different strengths and weaknesses, making the matching between algorithms and applications complex. Accurate evaluation of the performance of a gesture recognition algorithm is a cumbersome task, and evaluation by recognition accuracy alone is not sufficient to predict its successful real-world implementation. We developed a novel Gesture Recognition Performance Score ($GRPS$) for ranking gesture recognition algorithms and for predicting the success of these algorithms in real-world scenarios. The $GRPS$ is calculated by considering different attributes of the algorithm, the evaluation methodology adopted, and the quality of the dataset used for testing. The $GRPS$ calculation is illustrated and applied on a set of vision-based hand/arm gesture recognition algorithms reported in the last 15 years, and a ranking of hand gesture recognition algorithms based on $GRPS$ is provided. The paper also presents an evaluation metric, the Gesture Dataset Score ($GDS$), to quantify the quality of gesture databases. The $GRPS$ calculator and results are made publicly available (http://software.ihpc.a-star.edu.sg/grps/).

Pramod Kumar Pisharady, Martin Saerbeck

Second International Workshop on Big Data in 3D Computer Vision

Frontmatter
Object Recognition in 3D Point Cloud of Urban Street Scene

In this paper we present a novel street scene semantic recognition framework that takes advantage of 3D point clouds captured by a high-definition LiDAR laser scanner. An important problem in object recognition is the need for sufficient labeled training data to learn robust classifiers. We show how to significantly reduce the need for manually labeled training data by reducing scene complexity through unsupervised ground and building segmentation. Our system first automatically segments the ground point cloud; since the ground connects almost all other objects, we then use a connected-component-based algorithm to oversegment the remaining points. Next, building facades are detected using binary range image processing. The remaining point cloud is grouped into voxels, which are then transformed into supervoxels. Local 3D features extracted from the supervoxels are classified by trained boosted decision trees and labeled with semantic classes, e.g., tree, pedestrian, car, etc. The proposed method is evaluated both quantitatively and qualitatively on a challenging fixed-position Terrestrial Laser Scanning (TLS) Velodyne data set and two Mobile Laser Scanning (MLS) databases, Paris-rue-Madame and NAVTEQ True. Robust scene parsing results are reported.

Pouria Babahajiani, Lixin Fan, Moncef Gabbouj
Completed Dense Scene Flow in RGB-D Space

Conventional scene flow, containing only translational vectors, cannot properly model 3D motion with rotation. Moreover, the accuracy of 3D motion estimation is restricted by several challenges such as large displacement, noise, and missing data (caused by sensing techniques or occlusion). Existing solutions fall into two categories: local approaches and global approaches. However, local approaches cannot generate a smooth motion field, and global approaches have difficulty handling large-displacement motion. In this paper, a completed dense scene flow framework is proposed, which models both rotation and translation for general motion estimation. It combines a local method and a global method, exploiting their complementary characteristics to handle large-displacement motion and enforce smoothness, respectively. The proposed framework operates in the RGB-D image space, where computational efficiency is further improved. In a quantitative evaluation on the Middlebury dataset, our method outperforms other published methods. The improved performance is further confirmed on real data acquired by a Kinect sensor.

Yucheng Wang, Jian Zhang, Zicheng Liu, Qiang Wu, Philip Chou, Zhengyou Zhang, Yunde Jia
Online Learning of Binary Feature Indexing for Real-Time SLAM Relocalization

In this paper, we propose an indexing method for approximate nearest neighbor search of binary features. Unlike the popular Locality Sensitive Hashing (LSH), the proposed method constructs the hash keys by an online learning process instead of pure randomness. In the learning process, the hash keys are constructed with the aim of obtaining uniform hash buckets and high collision rates, which makes the method more efficient for approximate nearest neighbor search than LSH. By distributing the online learning across the simultaneous localization and mapping (SLAM) process, we successfully apply the method to SLAM relocalization. Experiments show that camera poses can be recovered in real time even when there are tens of thousands of landmarks in the map.

Youji Feng, Yihong Wu, Lixin Fan
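
A hedged sketch of the indexing idea: standard LSH for binary descriptors hashes on subsets of bit positions, and the paper replaces the random choice of subsets with an online learning process that balances buckets. The sketch below uses random subsets as a placeholder for the learned keys.

```python
import numpy as np
from collections import defaultdict

class BinaryFeatureIndex:
    """LSH-style index for binary descriptors (0/1 arrays): each table hashes
    on a subset of bit positions. Here the subsets are random, standing in
    for the bit selections learned online in the paper."""
    def __init__(self, n_bits=256, key_bits=16, n_tables=8, seed=0):
        rng = np.random.default_rng(seed)
        self.keys = [rng.choice(n_bits, size=key_bits, replace=False)
                     for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def add(self, desc, label):
        for bits, table in zip(self.keys, self.tables):
            table[tuple(desc[bits])].append(label)

    def candidates(self, desc):
        """Union of bucket contents over all tables; candidates would then
        be re-ranked by exact Hamming distance."""
        out = set()
        for bits, table in zip(self.keys, self.tables):
            out.update(table.get(tuple(desc[bits]), []))
        return out
```
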
Depth-Based Real-Time Hand Tracking with Occlusion Handling Using Kalman Filter and DAM-Shift

In this paper, we propose real-time hand tracking with a depth camera using a Kalman filter and an improved DAM-Shift (depth-based adaptive mean shift) algorithm for occlusion handling. DAM-Shift is a useful algorithm for hand tracking, but it has difficulty tracking when occlusion occurs. To detect the hand region, we use a classifier that combines boosting with a cascade structure. To verify occlusion, we predict the center position of the hand region in real time using the Kalman filter and calculate the major axis using the central moment of the preceding depth image. Using these factors, we measure hand thickness in real time through a projection and obtain the thickness threshold using a 2nd linear model. If the hand region is partially occluded, we cut away the useless region. Experimental results show that the proposed approach outperforms the existing method.

Kisang Kim, Hyung-Il Choi
Evaluation of Depth-Based Super Resolution on Compressed Mixed Resolution 3D Video

The MVC+D standard specifies coding of Multiview Video plus Depth (MVD) data for enabling advanced 3D video applications. MVC+D requires that all views be coded with the H.264/MVC encoder at equal spatial resolution. To improve compression efficiency, it is possible to use mixed resolution coding, in which some texture views are coded at reduced spatial resolution. In this paper we evaluate the performance of Depth-Based Super Resolution (DBSR) on compressed mixed resolution MVD data. Experimental results show that for sequences with accurate depth data, the objective coding performance increases. Even though some sequences with poor depth quality show a slight decrease in coding performance by the objective metric, subjective evaluation shows that the perceived quality of the DBSR method is equal to the symmetric resolution case. We also show that the depth re-projection consistency check step of DBSR can be replaced with a simpler consistency check method. In this way, DBSR computational complexity is reduced by 26 %, with a 0.2 % average bitrate reduction (dBR) for coded views and a 0.1 % average bitrate increase for synthesized views. We show that the proposed scheme outperforms the anchor MVC+D coding scheme by 7.2 % dBR on average for total coded bitrate and by 10.9 % dBR on average for synthesized views.

Michal Joachimiak, Payman Aflaki, Miska M. Hannuksela, Moncef Gabbouj
Global Volumetric Image Registration Using Local Linear Property of Image Manifold

We propose a three-dimensional global image registration method for a sparse dictionary. To achieve robust and accurate registration based on template matching, a large number of transformed images is prepared and stored in the dictionary. To reduce the spatial complexity of this image dictionary, we introduce a method of generating a new template image from a collection of images stored in the dictionary. This generated template allows us to achieve accurate image registration even if the population of the image dictionary is relatively small and the template has a small pattern perturbation. To further reduce the complexity, we perform the matching in a low-dimensional Euclidean space obtained by random projection.

Hayato Itoh, Atsushi Imiya, Tomoya Sakai
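
The final matching step relies on random projection; a minimal Johnson-Lindenstrauss-style sketch follows, with the Gaussian projection matrix and scaling chosen as common defaults rather than the paper's specification.

```python
import numpy as np

def random_projection(X, out_dim, seed=0):
    """Project row vectors into a low-dimensional Euclidean space with a
    Gaussian random matrix; pairwise distances are approximately preserved
    (Johnson-Lindenstrauss), so template matching can run on the projections."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], out_dim)) / np.sqrt(out_dim)
    return X @ R
```
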
A Comparative Study of GPU-Accelerated Multi-view Sequential Reconstruction Triangulation Methods for Large-Scale Scenes

The angular error-based triangulation method and the parallax path method are both high-performance methods for large-scale multi-view sequential reconstruction that can be parallelized on the GPU. We map parallax paths to the GPU and test its performance and accuracy as a triangulation method for the first time. To this end, we compare it with the angular method on the GPU for both performance and accuracy. Furthermore, we improve the recovery of path scales and perform more extensive analysis and testing compared with the original parallax paths method. Although parallax paths requires sequential and piecewise-planar camera positions, in such scenarios, we can achieve a speedup of up to 14x over angular triangulation, while maintaining comparable accuracy.

Jason Mak, Mauricio Hess-Flores, Shawn Recker, John D. Owens, Kenneth I. Joy
Indoor Objects and Outdoor Urban Scenes Recognition by 3D Visual Primitives

Object detection, recognition, and pose estimation in 3D images have gained momentum due to the availability of 3D sensors (RGB-D) and the increase of large-scale 3D data, such as city maps. The most popular approach is to extract and match 3D shape descriptors that encode local scene structure but omit visual appearance. Visual appearance can be problematic due to imaging distortions, but the assumption that local shape structures are sufficient to recognise objects and scenes is largely invalid in practice, since objects may have similar shape but different texture (e.g., grocery packages). In this work, we propose an alternative appearance-driven approach which first extracts 2D primitives justified by Marr’s primal sketch; these are “accumulated” over multiple views, and the most stable ones are “promoted” to 3D visual primitives. The promoted 3D primitives represent both structure and appearance. For recognition, we propose a fast and effective correspondence matching scheme using random sampling. For quantitative evaluation we construct a semi-synthetic benchmark dataset using a public 3D model dataset of 119 kitchen objects, and another benchmark of challenging street-view images from 4 different cities. In the experiments, our method uses only a stereo view for training. As a result, on the kitchen objects dataset our method achieved an almost perfect recognition rate for a $\pm 10^\circ$ camera viewpoint change and nearly 80 % for $\pm 20^\circ$; on the street-view benchmarks it achieved 75 % accuracy for 160 street-view image pairs, 80 % for 96 image pairs, and 92 % for 48 image pairs.

Junsheng Fu, Joni-Kristian Kämäräinen, Anders Glent Buch, Norbert Krüger
3D Reconstruction of Planar Surface Patches: A Direct Solution

We propose a novel solution for reconstructing planar surface patches. The theoretical foundation relies on variational calculus, which yields a closed-form solution for the normal and distance of a 3D planar surface patch when an affine transformation is known between the corresponding image region pairs. Although we apply the proposed method to projective cameras, the theoretical derivation itself is not restricted to perspective projection. The method is quantitatively evaluated on a large set of synthetic data as well as on real images of urban scenes, where planar surface reconstruction is often needed. Experimental results confirm that the method provides good reconstructions in real time.

Jozsef Molnar, Rui Huang, Zoltan Kato

Deep Learning on Visual Data

Frontmatter
Hybrid CNN-HMM Model for Street View House Number Recognition

We present an integrated model for using deep neural networks to solve the street view house number recognition problem. Rather than following the traditional approach of first performing segmentation and then recognizing isolated digits, we formulate the task as a sequence recognition problem under a probabilistic treatment. Our model leverages a deep Convolutional Neural Network (CNN) to represent the highly variable appearance of digits in natural images, while a hidden Markov model (HMM) deals with the dynamics of the sequence. They are combined to form the hybrid CNN-HMM architecture, with which both training and recognition can be performed at the word level. There is no explicit segmentation operation at all, which saves considerable effort in designing sophisticated segmentation algorithms or performing fine-grained character labeling. To the best of our knowledge, this is the first time a hybrid CNN-HMM model has been used directly on whole scene text images. Experiments show that the deep CNN dramatically boosts performance compared with a shallow Gaussian Mixture Model (GMM)-HMM, and we obtained competitive results on the Street View House Numbers (SVHN) dataset.

Qiang Guo, Dan Tu, Jun Lei, Guohui Li
View and Illumination Invariant Object Classification Based on 3D Color Histogram Using Convolutional Neural Networks

Object classification is an important step in visual recognition and semantic analysis of visual content. In this paper, we propose a method for classifying objects that is invariant to illumination color, illumination direction, and viewpoint, based on a 3D color histogram. The 3D color histogram of an image is represented as a 2D image that captures the color composition while preserving the neighborhood information of the color bins, realizing the visual cues necessary for object classification. The ability of a convolutional neural network (CNN) to learn invariant visual patterns is then exploited for object classification. The efficacy of the proposed method is demonstrated on the Amsterdam Library of Object Images (ALOI) dataset, captured under various illumination conditions and angles of view.

Earnest Paul Ijjina, C. Krishna Mohan
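
One plausible reading of "a 3D color histogram represented as a 2D image" is to unfold the histogram's slices into a montage, keeping neighbouring color bins adjacent; the unfolding order and bin count below are illustrative assumptions, not the paper's exact layout.

```python
import numpy as np

def color_histogram_image(rgb, bins=16):
    """Build a bins^3 RGB histogram and unfold it into a 2-D image: each
    slice along the blue axis becomes a tile in a horizontal montage, so
    neighbouring colour bins stay spatially adjacent."""
    pixels = rgb.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    total = hist.sum()
    if total > 0:
        hist /= total                     # normalize to a distribution
    return np.hstack([hist[:, :, b] for b in range(bins)])  # bins x bins^2
```
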
Human Action Recognition Using Action Bank Features and Convolutional Neural Networks

With the advancement of technology and the availability of multimedia content, human action recognition has become a major area of research in computer vision that contributes to the semantic analysis of videos. The representation and matching of spatio-temporal information in videos is a major factor affecting the design and performance of existing convolutional neural network approaches for human action recognition. In this paper, in contrast to the traditional approach of using raw video as input, we derive attributes from action bank features to represent and match spatio-temporal information effectively. The derived features are arranged in a square matrix and used as input to the convolutional neural network for action recognition. The effectiveness of the proposed approach is demonstrated on the KTH and UCF Sports datasets.

Earnest Paul Ijjina, C. Krishna Mohan
Deep Learning in the EEG Diagnosis of Alzheimer’s Disease

EEG (electroencephalography) has many advantages over other methods for the analysis of Alzheimer’s disease, such as the ability to aid diagnosis at an early stage. Traditional EEG analysis requires considerable manual work, such as calculating the coherence between different pairs of electrodes. In our work, we apply a deep learning network to the analysis of EEG data of Alzheimer’s disease in order to take full advantage of unsupervised feature learning. We studied EEG-based deep learning on 15 clinically diagnosed Alzheimer’s disease patients and 15 healthy people, with 16 electrodes per person. The time-domain EEG data of each electrode is cut into 40 data units according to the data size in a period. We first train the deep learning network with 25 data units per electrode and then test with 15 data units to obtain the accuracy on each electrode. Finally, we combine the learning results of the 16 electrodes and train an SVM on them to obtain the final result, reporting a 92 % accuracy after combining the 16 electrodes of each person. To improve the deep learning model with upcoming new data, we use incremental learning to make full use of the existing data while decreasing memory and computing costs by replacing existing data with new data. We report a 0.5 % improvement in accuracy with incremental learning.

Yilu Zhao, Lianghua He
Pedestrian Detection with Deep Convolutional Neural Network

The problem of pedestrian detection in images and video frames has been extensively investigated in the past decade. However, the low performance in complex scenes shows that it remains an open problem. In this paper, we propose to cascade simple Aggregated Channel Features (ACF) and rich Deep Convolutional Neural Network (DCNN) features for efficient and effective pedestrian detection in complex scenes. The ACF-based detector is used to generate candidate pedestrian windows, and the rich DCNN features are used for fine classification. Experiments show that the proposed approach achieves leading performance on the INRIA dataset and performance comparable to the state of the art on the Caltech and ETH datasets.

Xiaogang Chen, Pengxu Wei, Wei Ke, Qixiang Ye, Jianbin Jiao

Workshop on Scene Understanding for Autonomous Systems

Frontmatter
Surface Prediction for a Single Image of Urban Scenes

In this paper we present a novel method for recovering a three-dimensional scene from a single image of a man-made environment. We use image segmentation and perspective cues such as parallel lines in space. The algorithm models a scene as a composition of surfaces (or planes) associated with their vanishing points. The main idea is to exploit already obtained planes to recover neighboring surfaces. Unlike previous approaches, which use one base plane on which reconstructed objects are placed, we show that our method recovers objects that lie on different levels of a scene. Furthermore, we show that our technique improves the results of other methods. For evaluation we have manually labeled two publicly available datasets. On those datasets we demonstrate the ability of our algorithm to recover scene surfaces in different conditions and show several examples of plausible scene reconstruction.

Foat Akhmadeev
Scene Parsing and Fusion-Based Continuous Traversable Region Formation

Determining the categories of different parts of a scene and generating a continuous traversable region map in the physical coordinate system are crucial for autonomous vehicle navigation. This paper presents our efforts in these two aspects for an autonomous vehicle operating in an open terrain environment. Driven by ideas proposed in our Cognitive Architecture, we have designed novel strategies for the top-down facilitation process to explicitly interpret spatial relationships between objects in the scene, and have incorporated a visual attention mechanism into the image-based scene parsing module, which processes images fast enough for real-time vehicle navigation applications. To alleviate the challenges of using sparse 3D occupancy grids for path planning, we propose an approach that interpolates the category of occupancy grids not hit by the 3D LIDAR, with reference to the aligned image-based scene parsing result, so that a continuous $2\frac{1}{2}$D traversable region map can be formed.

Xuhong Xiao, Gee Wah Ng, Yuan Sin Tan, Yeo Ye Chuan
Combining Multiple Shape Matching Techniques with Application to Place Recognition Task

Many methods have been proposed to solve the problem of shape matching, where the task is to determine the similarity between given shapes. In this paper, we propose a novel method that combines many shape matching methods using procedural knowledge to increase the precision of the shape matching process in retrieval problems such as the place recognition task. The idea of our approach is to assign to each template shape the matching method that classifies that template best. A new incoming shape is then compared against all templates using their assigned methods. The proposed method increases classification accuracy and decreases time complexity in comparison to generic classifier combination methods.

Karel Košnar, Vojtěch Vonásek, Miroslav Kulich, Libor Přeučil
A Model-Based Approach for Fast Vehicle Detection in Continuously Streamed Urban LIDAR Point Clouds

Detection of vehicles in crowded 3-D urban scenes is a challenging problem in many computer vision related research fields, such as robot perception, autonomous driving, self-localization, and mapping. In this paper we present a model-based approach to solve the recognition problem from 3-D range data. In particular, we aim to detect and recognize vehicles in continuously streamed LIDAR point cloud sequences from a rotating multi-beam laser scanner. The end-to-end pipeline of our framework, working on the raw streams of 3-D urban laser data, consists of three steps: (1) producing distinct groups of points that represent different urban objects; (2) extracting reliable 3-D shape descriptors specifically designed for vehicles, considering the need for fast processing; and (3) executing binary classification on the extracted descriptors to perform vehicle detection. The extraction of our efficient shape descriptors provides a significant speedup and increased detection accuracy compared to a PCA-based 3-D bounding box fitting method used as the baseline.

Attila Börcs, Balázs Nagy, Milán Baticz, Csaba Benedek
Large-Scale Indoor/Outdoor Image Classification via Expert Decision Fusion (EDF)

In this work, we propose an Expert Decision Fusion (EDF) system to tackle the large-scale indoor/outdoor image classification problem using two key ideas, namely, data grouping and decision stacking. By data grouping, we partition the entire data space into multiple disjoint sub-spaces so that a more accurate prediction model can be trained in each sub-space. After data grouping, the EDF system integrates soft decisions from multiple classifiers (called experts here) through stacking so that multiple experts can compensate each other’s weakness. The EDF system offers more accurate and robust classification performance since it can handle data diversity effectively while benefiting from data abundance in large-scale datasets. The advantages of data grouping and decision stacking are explained and demonstrated in detail. We conduct experiments on the SUN dataset and show that the EDF system outperforms all existing methods by a significant margin with a correct classification rate of 91 %.

Chen Chen, Yuzhuo Ren, C.-C. Jay Kuo
Search Guided Saliency

We propose a new type of saliency inspired by findings from visual search studies: searching difficulty is correlated with the target-distractor contrast, the distractor homogeneity, and the target uniqueness. By treating an image pixel as the target and the surrounding pixels as distractors, a search guided saliency model is designed in accordance with these findings. In particular, three saliency measures corresponding to the three searching factors are computed simultaneously and integrated using a series of contextual histograms. The proposed model has been evaluated on three public datasets, and experiments show superior prediction of human fixations compared to state-of-the-art models.

Shijian Lu, Byung-Uck Kim, Nicolas Lomenie, Joo-Hwee Lim, Jianfei Cai
Salient Object Detection via Saliency Spread

Salient object detection aims to localize the most attractive objects within an image. For this goal, accurately determining the saliency values of image regions and keeping the saliency of objects of interest consistent are two key challenges. To tackle these issues, we first propose an adaptive combination method that incorporates texture with the dominant color, enriching the informativeness and discrimination of features, and then propose saliency spread to encourage image regions belonging to the same object to produce equal saliency values. In particular, saliency spread propagates the saliency values of the most salient regions to similar regions, where similarity measures the degree to which different regions belong to the same object. Experimental results on the benchmark MSRA-1000 database show that our proposed method produces more consistent saliency maps, which is beneficial for accurately segmenting salient objects, and is quite competitive with advanced methods from the previous literature.

Dao Xiang, Zilei Wang
Biologically Inspired Composite Vision System for Multiple Depth-of-field Vehicle Tracking and Speed Detection

This paper presents a new vision-based traffic monitoring system, inspired by the visual structure found in raptors, that provides multiple depth-of-field vision information for vehicle tracking and speed detection. The novelty of this design is the use of multiple depth-of-field information to track expressway vehicles over a longer range and thus provide accurate speed information for overspeed vehicle detection. A novel speed calculation algorithm was designed for the composite vision information acquired by the system. The calculated vehicle speeds were found to conform to the real-world driving speeds.

Lin Lin, Bharath Ramesh, Cheng Xiang
Robust Maximum Margin Correlation Tracking

The recent decade has seen great interest in the use of discriminative classifiers for tracking. Most trackers, however, focus on correct classification between the target and the background. Although this achieves good generalization performance, the highest score of the classifier may not correspond to the correct location of the object, producing localization error. In this paper, we propose an online Maximum Margin Correlation Tracker (MMCT) which combines the design principles of the Support Vector Machine (SVM) and the adaptive Correlation Filter (CF). In principle, the SVM binary classifier is designed to offer good generalization rather than accurate localization. In contrast, the CF can provide accurate target localization, but it is not explicitly designed to offer good generalization. By incorporating the SVM with the CF, MMCT demonstrates good generalization as well as accurate localization. And because the appearance can be learned in the Fourier domain, the computational burden is reduced significantly. Extensive experiments on public benchmark sequences demonstrate the superior performance of MMCT over many state-of-the-art tracking algorithms.

Han Wang, Yancheng Bai, Ming Tang
Scene Classification by Feature Co-occurrence Matrix

Classifying scenes (such as mountains and forests) is not an easy task owing to their variability, ambiguity, and the wide range of illumination and scale conditions that may apply. The bag-of-features (BoF) model has achieved impressive performance on many famous databases (such as the 15 scene dataset). A main drawback of the BoF model is that it disregards all information about the spatial layout of the features, which leads to limited descriptive ability. In this paper, we use a co-occurrence matrix to encode the spatial relations between local features, and demonstrate that the feature co-occurrence matrix (FCM) is a potentially discriminative characteristic for scene classification. We propose three FCM-based image representations for scene classification. The experimental results show that, under equal protocol, the proposed method outperforms the BoF model and the Spatial Pyramid (SP) model and achieves a performance comparable to the state of the art.

Haitao Lang, Yuyang Xi, Jianying Hu, Liang Du, Haibin Ling
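
A minimal sketch of a feature co-occurrence matrix over quantized local features (visual words); the pairing criterion (a fixed radius) and the normalization are assumptions for illustration, since the abstract does not define the three FCM variants exactly.

```python
import numpy as np

def feature_cooccurrence(labels, positions, radius):
    """Co-occurrence matrix of visual-word labels: C[i, j] counts ordered
    pairs of local features with words i and j lying within `radius` pixels
    of each other; normalized to sum to one."""
    labels = np.asarray(labels)
    positions = np.asarray(positions, dtype=float)
    k = int(labels.max()) + 1
    C = np.zeros((k, k))
    for a in range(len(labels)):
        d = np.linalg.norm(positions - positions[a], axis=1)
        for b in np.nonzero((d > 0) & (d <= radius))[0]:
            C[labels[a], labels[b]] += 1
    total = C.sum()
    return C / total if total > 0 else C
```
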

RoLoD: Robust Local Descriptors for Computer Vision

Frontmatter
Local Associated Features for Pedestrian Detection

Local features are usually used to describe pedestrian appearance, but most existing pedestrian detection methods do not make full use of context cues, such as the associative relationships between different local positions. This paper proposes two novel kinds of local associated features, the gradient orientation associated feature (GOAF) and the local difference of ACF (ACF-LD), to exploit context information. In our work, pedestrian samples are enlarged to contain some background regions besides the human body, and GOAF, ACF, and ACF-LD are combined to describe the pedestrian sample. GOAF is constructed by encoding gradient orientation features from two different positions, at different distances and in different directions, into a single value. For ACF-LD, the sample is divided into several sub-regions, and the ACF difference matrices between these areas are computed to exploit the associated information between the pedestrian and the surrounding background. The proposed local associated features provide complementary information for detection tasks. Finally, these features are fused with ACF to form a candidate feature pool, and AdaBoost is used to select features and train a cascaded classifier of depth-two decision trees. Experimental results on two public datasets show that the proposed framework achieves promising results compared with the state of the art.

Song Shao, Hong Liu, Xiangdong Wang, Yueliang Qian
Incorporating Two First Order Moments into LBP-Based Operator for Texture Categorization

Among the techniques for texture modelling and recognition, local binary patterns and their variants have received much interest in recent years thanks to their low computational cost and high discriminative power. We propose a new texture description approach whose principle is to extend the LBP representation from the local gray level to the regional distribution level. The region is represented by a pre-defined structuring element, while the distribution is approximated using the first two statistical moments. Experimental results on four large texture databases, including Outex, KTH-TIPS 2b, CUReT, and UIUC, show that our approach significantly improves the performance of texture representation and classification with respect to comparable methods.

Thanh Phuong Nguyen, Antoine Manzanera
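
A sketch of the two ingredients described above: local first- and second-moment maps over a structuring window, on which a plain 8-neighbour LBP can then be computed. The window shape (a square uniform filter) is an assumption; the paper uses pre-defined structuring elements.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_moment_maps(image, size=3):
    """First two statistical moments over a sliding window: a local-mean map
    and a local-standard-deviation map. The LBP operator is then applied to
    these maps instead of the raw gray levels."""
    img = np.asarray(image, dtype=float)
    mean = uniform_filter(img, size)
    var = uniform_filter(img ** 2, size) - mean ** 2
    return mean, np.sqrt(np.clip(var, 0.0, None))

def lbp_image(values):
    """Plain 8-neighbour LBP codes for the interior pixels of a 2-D map."""
    c = values[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros(c.shape, dtype=np.uint8)
    h, w = values.shape
    for bit, (dy, dx) in enumerate(shifts):
        neighbour = values[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (neighbour >= c).astype(np.uint8) << np.uint8(bit)
    return code

# Usage sketch: mean_map, std_map = local_moment_maps(img)
#               hist = np.bincount(lbp_image(mean_map).ravel(), minlength=256)
```
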
Log-Gabor Weber Descriptor for Face Recognition

It is well recognized that image representation is the most fundamental task in face recognition; effective and efficient image features not only exhibit small intraclass variation and large interclass separation but are also robust to the impact of pose, illumination, expression, and occlusion. This paper proposes a new local image descriptor for face recognition, named the Log-Gabor Weber descriptor (LGWD). LGWD is based on the image's Log-Gabor wavelet representation and Weber local binary pattern (WLBP) features. The main motivation of LGWD is to enhance the multi-scale, multi-orientation Log-Gabor magnitude and phase features by applying the WLBP coding method. Histograms extracted from the encoded magnitude and phase images are finally concatenated into one to form the image description. Experimental results on the ORL, Yale, and UMIST face databases verify the representation ability of our proposed descriptor.

Jing Li, Nong Sang, Changxin Gao
Robust Line Matching Based on Ray-Point-Ray Structure Descriptor

In this paper, we propose a novel two-view line matching method that converts the problem of matching line segments extracted from two uncalibrated images into one of matching the introduced Ray-Point-Ray (RPR) structures. The method first recovers the partial connectivity of line segments by fully exploiting the gradient map. To match line segments efficiently, we introduce the RPR structure, consisting of a joint point and two rays (line segments) connected to the point. Two sets of RPRs are constructed from the connected line segments extracted from the two images. These RPRs are then described with the proposed SIFT-like descriptor for efficient initial matching to recover the fundamental matrix. Based on the initial RPR matches and the recovered fundamental matrix, we propose a two-stage match propagation scheme to refine and find more RPR matches. The first stage propagates matches among the initially formed RPRs, while the second stage propagates matches among newly formed RPRs constructed by intersecting unmatched line segments with matched ones. In both stages, candidate matches are evaluated by comprehensively considering their descriptors, the epipolar line constraint, and topological consistency with neighboring point matches. Experimental results demonstrate the good performance of the proposed method as well as its superiority to state-of-the-art methods.

Kai Li, Jian Yao, Xiaohu Lu
Local-to-Global Signature Descriptor for 3D Object Recognition

In this paper, we present a novel 3D descriptor that bridges the gap between global and local approaches. While local descriptors have proved to be a more attractive choice for object recognition within cluttered scenes, they remain less discriminating precisely because of the limited scope of the local neighborhood. Global descriptors, on the other hand, can better capture relationships between distant points, but are generally affected by occlusions and clutter. We therefore propose the Local-to-Global Signature (LGS) descriptor, which relies on surface point classification together with signature-based features to overcome the drawbacks of both local and global approaches. As our tests demonstrate, the proposed LGS captures the structure of objects more robustly while remaining robust to clutter and occlusion and avoiding sensitive low-level features such as point normals. Tests performed on four different datasets demonstrate the robustness of the proposed LGS descriptor compared to three state-of-the-art descriptors: SHOT, Spin Images, and FPFH. In general, LGS outperformed all three, on some datasets with a 50–70 % increase in recall.

Isma Hadji, Guilherme N. DeSouza
Evaluation of Descriptors and Distance Measures on Benchmarks and First-Person-View Videos for Face Identification

Face identification (FI) has made a significant amount of progress in the last three decades. Its application is now moving towards wearable devices (like Google Glass) and mobile devices, leading to the problem of FI on first-person-view (FPV) or egocentric videos for scenarios like business networking and memory assistance. In the existing literature, performance analyses of image descriptors on FPV data are little known. In this paper, we evaluate four popular image descriptors: local binary patterns (LBP), scale invariant feature transform (SIFT), local phase quantization (LPQ), and binarized statistical image features (BSIF); and ten distance measures: Euclidean, Cosine, Chi square, Spearman, Cityblock, Minkowski, Correlation, Hamming, Jaccard, and Chebychev, with first nearest neighbor (1-NN) and support vector machines (SVM) as classifiers for the FI task, on benchmark databases (FERET, AR, GT) and an FPV database collected using wearable devices like Google Glass (GG). Comparative analysis on these databases shows the superiority of BSIF with the Cosine, Chi square, and Cityblock distance measures using a 1-NN classifier over the other descriptors and distance measures, and even over some current state-of-the-art benchmark database results.

Bappaditya Mandal, Wang Zhikai, Liyuan Li, Ashraf A. Kassim
Local Feature Based Multiple Object Instance Identification Using Scale and Rotation Invariant Implicit Shape Model

In this paper, we propose a Scale and Rotation Invariant Implicit Shape Model (SRIISM) and develop a local feature matching based system using the model to accurately locate and identify large numbers of object instances in an image. Due to repeated instances and cluttered backgrounds, conventional methods for multiple object instance identification suffer from poor identification results. In the proposed SRIISM, we model the joint distribution of object centers, scale, and orientation computed from local feature matches in Hough voting, which is not only invariant to scale changes and rotation of objects, but also robust to false feature matches. In the identification system using SRIISM, we apply a fast 4D bin search method in Hough space with complexity $O(n)$, where $n$ is the number of feature matches, in order to segment and locate each instance. Furthermore, we apply maximum likelihood estimation (MLE) for accurate object pose detection. In the evaluation, we created datasets simulating various industrial applications such as pick-and-place and inventory management. Experiment results on the datasets show that our method outperforms conventional methods in both accuracy (5 %–30 % gain) and speed (2x speed up).

Ruihan Bao, Kyota Higa, Kota Iwamoto
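
The $O(n)$ Hough-voting step can be sketched as quantizing each match's predicted center, log-scale, and rotation into a 4-D bin and counting votes; the bin widths below are illustrative assumptions, and the paper additionally models a joint distribution and refines poses with MLE.

```python
import numpy as np
from collections import Counter

def hough_votes(matches, bin_xy=20.0, bin_log_scale=0.2, bin_rot=np.pi / 12):
    """Quantize each feature match's predicted object centre, relative
    log-scale and rotation into a 4-D bin and count votes; peaks indicate
    object instances. One pass over the n matches, hence O(n)."""
    votes = Counter()
    for cx, cy, scale, rot in matches:   # per-match pose prediction
        key = (int(cx // bin_xy), int(cy // bin_xy),
               int(np.floor(np.log(scale) / bin_log_scale)),
               int(np.floor(rot / bin_rot)))
        votes[key] += 1
    return votes.most_common()
```
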
Efficient Detection for Spatially Local Coding

In this paper, we present an efficient detector for the Spatially Local Coding (SLC) object model. SLC is a recent, high performing object classifier that has yet to be applied in a detection (object localization) setting. SLC uses features that jointly code for both appearance and location, making it difficult to apply the existing approaches to efficient detection. We design an approximate Hough transform for the SLC model that uses a cascade of thresholds followed by gradient descent to achieve efficiency as well as accurate localization. We evaluate the resulting detector on the Daimler Monocular Pedestrian dataset.

Sancho McCann, David G. Lowe
Performance Evaluation of Local Descriptors for Affine Invariant Region Detector

Local feature descriptors are widely used in many computer vision applications. Over the past couple of decades, several local feature descriptors have been proposed that are robust to challenging conditions. Since they show different characteristics in different environments, it is necessary to evaluate their performance in an intensive and consistent manner. However, there has been no relevant work addressing this problem, especially for affine invariant region detectors, which are popular in object recognition and classification. In this paper, we present a rigorous performance evaluation of local descriptors for the affine invariant region detector, in which the MSER (maximally stable extremal regions) detector is employed. We intensively evaluate local patch based descriptors as well as binary descriptors, including SIFT (scale invariant feature transform), SURF (speeded up robust features), BRIEF (binary robust independent elementary features), FREAK (fast retina keypoint), the Shape descriptor, and LIOP (local intensity order pattern). Intensive evaluation on a standard dataset shows that LIOP outperforms the other descriptors in terms of the precision and recall metrics.

Man Hee Lee, In Kyu Park
Unsupervised Footwear Impression Analysis and Retrieval from Crime Scene Data

Footwear impressions are one of the most frequently secured types of evidence at crime scenes, and for the investigation of crime series they are among the major investigative leads. In this paper, we introduce an unsupervised footwear retrieval algorithm that is able to cope with unconstrained noise conditions and is invariant to rigid transformations. A main challenge for automated impression analysis is the separation of the actual shoe sole information from the structured background noise. We approach this issue through the analysis of periodic patterns: given unconstrained noise conditions, the redundancy within periodic patterns makes them the most reliable information source in the image. This work makes four main contributions. First, we robustly measure local periodicity by fitting a periodic pattern model to the image. Second, based on the model, we normalize the orientation of the image and compute the window size for a local Fourier transformation, thereby avoiding distortions of the frequency spectrum through other structures or boundary artefacts. Third, we segment the pattern through robust point-wise classification, making use of the property that the amplitudes of the frequency spectrum are constant for each position in a periodic pattern. Finally, the similarity between footwear impressions is measured by comparing the Fourier representations of the periodic patterns. We demonstrate robustness against severe noise distortions as well as rigid transformations on a database of real crime scene impressions. Moreover, we make our database available to the public, enabling standardized benchmarking for the first time.

Adam Kortylewski, Thomas Albrecht, Thomas Vetter
Reliable Point Correspondences in Scenes Dominated by Highly Reflective and Largely Homogeneous Surfaces

Common Structure from Motion (SfM) tasks require reliable point correspondences in images taken from different views in order to estimate model parameters that describe the 3D scene geometry, for example when estimating the fundamental matrix from point correspondences using RANSAC. The amount of noise in the point correspondences drastically affects the estimation algorithm, and the number of iterations needed for convergence grows exponentially with the level of noise. In scenes dominated by highly reflective and largely homogeneous surfaces, such as vehicle panels and buildings with a lot of glass, existing approaches give a very high proportion of spurious point correspondences; as a result, the number of iterations required by subsequent model estimation algorithms becomes intractable. We propose a novel method that uses descriptors evaluated along points on image edges to obtain a sufficiently high proportion of correct point correspondences. We show experimentally that our method gives better results in recovering the epipolar geometry in such scenes compared to common baseline methods on stereo images taken over considerably wide baselines.

Srimal Jayawardena, Stephen Gould, Hongdong Li, Marcus Hutter, Richard Hartley
ORB in 5 ms: An Efficient SIMD Friendly Implementation

One of the key challenges in computer vision applications today is the ability to reliably detect features in real time. The most prominent feature extraction methods are Speeded Up Robust Features (SURF), Scale Invariant Feature Transform (SIFT), and Oriented FAST and Rotated BRIEF (ORB), which have proved to yield reliable features for applications such as object recognition and tracking. In this paper, we propose an efficient single instruction multiple data (SIMD) friendly implementation of ORB. This solution shows that ORB feature extraction can be effectively implemented in about 5.5 ms on a vector SIMD engine such as the Embedded Vision Engine (EVE) of Texas Instruments (TI). We also show that our implementation is reliable with the help of a repeatability test.

Prashanth Viswanath, Pramod Swami, Kumar Desappan, Anshu Jain, Anoop Pathayapurakkal
Hierarchical Local Binary Pattern for Branch Retinal Vein Occlusion Recognition

Branch retinal vein occlusion (BRVO) is one of the most common retinal vascular diseases of the elderly and can dramatically impair vision if not diagnosed and treated in a timely manner. Automatic recognition of BRVO could significantly reduce an ophthalmologist’s workload, make diagnosis more efficient, and save patients time and costs. In this paper, we propose, for the first time to the best of our knowledge, automatic recognition of BRVO using fundus images. In particular, we propose the Hierarchical Local Binary Pattern (HLBP) to represent the visual content of a fundus image for classification. HLBP comprises Local Binary Pattern (LBP) features computed in a hierarchical fashion with max-pooling. To evaluate the performance of HLBP, we establish a BRVO dataset for experiments and compare HLBP with several state-of-the-art feature representation methods on it. Experimental results demonstrate the superior performance of our proposed method for BRVO recognition.

Zenghai Chen, Hui Zhang, Zheru Chi, Hong Fu
Extended Keypoint Description and the Corresponding Improvements in Image Retrieval

The paper evaluates an alternative approach to BoW-based image retrieval in large databases. The major improvements are in the re-ranking step (verification of candidates returned by BoW). We propose a novel keypoint description which allows verification based only on individual keypoint matching (no spatial consistency over groups of matched keypoints is tested). Standard Harris-Affine and Hessian-Affine keypoint detectors with the SIFT descriptor are used. The proposed description assigns to each keypoint several words representing the photometry and geometry of the keypoint in the context of neighbouring image fragments. The words are Cartesian products of typical SIFT-based words, so that huge vocabularies can be built. Preliminary experiments on several popular datasets show significant improvements in the pre-retrieval phase combined with a dramatically lower complexity of the re-ranking process. Because of this, the proposed methodology is particularly recommended for retrieval in very large datasets.

Andrzej Śluzek
An Efficient Face Recognition Scheme Using Local Zernike Moments (LZM) Patterns

In this paper, we introduce a novel face recognition scheme using Local Zernike Moments (LZM). In this scheme, we follow two different approaches to construct a feature vector. In the first approach, we use Phase Magnitude Histograms (PMHs) on the complex components of LZM. In the second approach, we generate Local Zernike Xor Patterns (LZXP) by encoding the phase components, and create gray-level histograms on the LZXP maps. For both methods, we first divide images into sub-regions and then construct the feature vectors by concatenating the histograms calculated in each of these sub-regions. Since the dimensionality of feature vectors constructed in this way can be very high, we use a block-based dimensionality reduction method, which also yields higher performance. We evaluate our method on the FERET database and achieve significant results.

Emrah Basaran, Muhittin Gokmen
Face Detection Based on Multi-block Quad Binary Pattern

A novel local texture descriptor, called the multi-block quad binary pattern (MB-QBP), is proposed in this paper. To demonstrate its effectiveness for local feature representation and its potential usage in computer vision applications, the proposed MB-QBP is applied to face detection. Compared with the multi-block local binary pattern (MB-LBP), MB-QBP has more features, enabling a better training process to refine the classifier; consequently, the over-fitting problem is much smaller in the MB-QBP-based classifier. Extensive simulation results on test images from the BioID and CMU+MIT databases clearly show that the proposed MB-QBP-based face detector outperforms the MB-LBP-based approach by about 6 % in correct detection rate under the same training conditions.

Zhubei Ge, Canhui Cai, Huanqiang Zeng, Jianqing Zhu, Kai-Kuang Ma
Backmatter
Metadata
Title
Computer Vision - ACCV 2014 Workshops
Edited by
C.V. Jawahar
Shiguang Shan
Copyright Year
2015
Electronic ISBN
978-3-319-16628-5
Print ISBN
978-3-319-16627-8
DOI
https://doi.org/10.1007/978-3-319-16628-5