2015 | Book

Computer Vision - ACCV 2014 Workshops

Singapore, Singapore, November 1-2, 2014, Revised Selected Papers, Part III

About this Book

The three-volume set, consisting of LNCS 9008, 9009, and 9010, contains carefully reviewed and selected papers presented at 15 workshops held in conjunction with the 12th Asian Conference on Computer Vision, ACCV 2014, in Singapore, in November 2014. The 153 full papers presented were selected from numerous submissions. LNCS 9008 contains the papers selected for the Workshop on Human Gait and Action Analysis in the Wild, the Second International Workshop on Big Data in 3D Computer Vision, the Workshop on Deep Learning on Visual Data, the Workshop on Scene Understanding for Autonomous Systems, and the Workshop on Robust Local Descriptors for Computer Vision. LNCS 9009 contains the papers selected for the Workshop on Emerging Topics on Image Restoration and Enhancement, the First International Workshop on Robust Reading, the Second Workshop on User-Centred Computer Vision, the International Workshop on Video Segmentation in Computer Vision, the Workshop: My Car Has Eyes: Intelligent Vehicle with Vision Technology, the Third Workshop on E-Heritage, and the Workshop on Computer Vision for Affective Computing. LNCS 9010 contains the papers selected for the Workshop on Feature and Similarity Learning for Computer Vision, the Third International Workshop on Intelligent Mobile and Egocentric Vision, and the Workshop on Human Identification for Surveillance.

Table of Contents

Frontmatter

Feature and Similarity Learning for Computer Vision (FSLCV)

Frontmatter
Discovering Multi-relational Latent Attributes by Visual Similarity Networks

The key problems in visual object classification are: learning discriminative features to distinguish between two or more visually similar categories (e.g. dogs and cats), modeling the variation of visual appearance within instances of the same class (e.g. Dalmatian and Chihuahua in the same category of dogs), and tolerating imaging distortions (e.g. 3D pose). These correspond to within-class and between-class variance in machine learning terminology, but in recent works these additional pieces of information, referred to as latent dependency, have been shown to be beneficial for the learning process. The latent attribute space was recently proposed and verified to capture the latent dependent correlation between classes. Attributes can be annotated manually, but it is more appealing to extract them in an unsupervised manner. Clustering is one of the popular unsupervised approaches, and the recent literature introduces similarity measures that help to discover visual attributes by clustering. However, the latent attribute structure in real life is multi-relational, e.g. two different sports cars in different poses vs. a sports car and a family car in the same pose: which attribute should dominate similarity? Instead of clustering, a network (graph) containing multiple connections is a natural way to represent such multi-relational attributes between images. In this light, we introduce an unsupervised framework for network construction based on pairwise visual similarities and experimentally demonstrate that the constructed network can be used to automatically discover multiple discrete (e.g. sub-classes) and continuous (pose change) latent attributes. Illustrative examples on public benchmark datasets verify the effectiveness of capturing multiple relations between images in an unsupervised manner with the proposed network.

Fatemeh Shokrollahi Yancheshmeh, Joni-Kristian Kämäräinen, Ke Chen
Blur-Robust Face Recognition via Transformation Learning

This paper introduces a new method for recognizing faces degraded by blur using transformation learning on the image feature. The basic idea is to transform both the sharp images and the blurred images into the same feature subspace by the method of multidimensional scaling. Unlike methods that seek blur-invariant descriptors, our method learns a transformation which both preserves the manifold structure of the original sharp images and, at the same time, enhances the class separability, making it applicable to a wide range of descriptors. Furthermore, we combine our method with a subspace-based point spread function (PSF) estimation method to handle cases of unknown blur degree by applying the feature transformation corresponding to the best-matched PSF, where the transformation for each PSF is learned in the training stage. Experimental results on the FERET database show that the proposed method achieves comparable performance to state-of-the-art blur-invariant face recognition methods such as LPQ and FADEIN.

Jun Li, Chi Zhang, Jiani Hu, Weihong Deng
Spectral Shape Decomposition by Using a Constrained NMF Algorithm

In this paper, the shape decomposition problem is addressed as the solution of an appropriately constrained Nonnegative Matrix Factorization (NMF) problem. Inspired by an idealization of the visibility matrix having a block diagonal form, special requirements are taken into account while formulating the NMF problem. Starting from a contaminated observation matrix, the objective is to reveal its low-rank, almost block diagonal form. Although the proposed technique is applied to shapes from the MPEG7 database, it can be extended to 3D objects. The preliminary results we have obtained are very promising.

Foteini Fotopoulou, Emmanouil Z. Psarakis
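The generic, unconstrained core of NMF can be sketched with the classic multiplicative update rules (a minimal stand-in, not the paper's constrained variant); the toy block-diagonal matrix below mimics the idealized visibility matrix the abstract describes:

```python
import numpy as np

def nmf(V, rank, iters=500, eps=1e-9, seed=0):
    """Factorize a nonnegative matrix V as W @ H via multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # coefficient update
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # basis update
    return W, H

# toy stand-in for an idealized block-diagonal visibility matrix (rank 2)
V = np.kron(np.eye(2), np.ones((3, 3)))
W, H = nmf(V, rank=2)
err = np.linalg.norm(V - W @ H)
```

The updates keep both factors nonnegative by construction; the paper's method adds constraints that push the product toward an almost block diagonal form.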
A Simple Stochastic Algorithm for Structural Features Learning

A conceptually very simple unsupervised algorithm for learning structure in the form of a hierarchical probabilistic model is described in this paper. The proposed probabilistic model can easily work with any type of image primitives, such as edge segments, non-max-suppressed filter set responses, texels, distinct image regions, SIFT features, etc., and is even capable of modelling non-rigid and/or visually variable objects. The model has a recursive form and consists of sets of simple and gradually growing sub-models that are shared and learned individually in layers. The proposed probabilistic framework enables exact computation of the probability of presence of a certain model, regardless of which layer it actually is on. All these learned models constitute a rich set of independent structure elements of variable complexity that can be used as features in various recognition tasks.

Jan Mačák, Ondřej Drbohlav
Inter-Concept Distance Measurement with Adaptively Weighted Multiple Visual Features

Most of the existing methods for measuring the inter-concept distance (ICD) between two concepts from their image instances use only a single kind of visual feature extracted from each instance. However, a single kind of feature is not enough for appropriately measuring ICDs due to a wide variety of perspectives for similarity evaluation (e.g., color, shape, size, hardness, heaviness, and functions); the relationships between different concept pairs are more appropriately modeled from different perspectives provided by multiple kinds of features. In this paper, we propose extracting two or more kinds of visual features from each image instance and measuring ICDs using these multiple features. Moreover, we present a method for adaptively weighting the visual features on the basis of their appropriateness for each concept pair. Experiments demonstrated that the proposed method outperformed a method using only a single kind of visual feature and one combining multiple kinds of features with a fixed weight.

Kazuaki Nakamura, Noboru Babaguchi
Extended Supervised Descent Method for Robust Face Alignment

Supervised Descent Method (SDM) is a highly efficient and accurate approach for facial landmark locating/face alignment. It learns a sequence of descent directions that minimize the difference between the estimated shape and the ground truth in HOG feature space during training, and utilizes them in testing to predict the shape increment iteratively. In this paper, we propose to modify SDM in three respects: (1) multi-scale HOG features are applied orderly as a coarse-to-fine feature detector; (2) global-to-local constraints of the facial features are considered orderly in the regression cascade; (3) rigid regularization is applied to obtain more stable prediction results. Extensive experimental results demonstrate that each of the three modifications improves the accuracy and robustness of the traditional SDM method. Furthermore, enhanced by the three-fold improvements, the extended SDM compares favorably with other state-of-the-art methods on several challenging face datasets, including LFPW, HELEN and 300 Faces in-the-Wild.

Liu Liu, Jiani Hu, Shuo Zhang, Weihong Deng
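One stage of supervised descent reduces to a linear least-squares regression from features to shape increments. The sketch below uses random stand-ins for the HOG features and ground-truth shape increments (all names and dimensions are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-ins: N training faces, d-dim HOG-like features at the current
# shape estimates, and the true shape increments to regress onto
N, d, p = 200, 32, 10
X = rng.standard_normal((N, d))
R_true = rng.standard_normal((d, p))
Y = X @ R_true + 0.01 * rng.standard_normal((N, p))  # ground-truth increments

# one SDM stage: solve min_R ||Y - [X 1] R||^2 in closed form
Xa = np.hstack([X, np.ones((N, 1))])            # append bias column
R, *_ = np.linalg.lstsq(Xa, Y, rcond=None)      # learned descent direction

pred = Xa @ R                                   # predicted shape increments
resid = np.linalg.norm(pred - Y) / np.linalg.norm(Y)
```

At test time, each cascade stage re-extracts features at the current shape and applies its own learned R, so the prediction is refined iteratively.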
Local Similarity Based Linear Discriminant Analysis for Face Recognition with Single Sample per Person

Fisher linear discriminant analysis (LDA) is one of the most popular projection techniques for feature extraction and has been widely applied in face recognition. However, it cannot be used when encountering the single sample per person (SSPP) problem, because the intra-class variations cannot be evaluated. In this paper, we propose a novel method coined local similarity based linear discriminant analysis (LS_LDA) to solve this problem. Motivated by the “divide-and-conquer” strategy, we first divide the face into local blocks and classify each local block, and then integrate all the classification results to make the final decision. To make LDA feasible for the SSPP problem, we further divide each block into overlapped patches and assume that these patches are from the same class. Experimental results on two popular databases show that our method not only generalizes well to the SSPP problem but also has strong robustness to expression, illumination, occlusion and time variation.

Fan Liu, Ye Bi, Yan Cui, Zhenmin Tang
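The divide-and-conquer pipeline — classify local blocks independently, then vote — can be illustrated without the LDA step. The sketch below substitutes plain per-block nearest-neighbour matching for the learned projection, so it only demonstrates the block-and-vote structure, not LS_LDA itself:

```python
import numpy as np

def blocks(img, bh, bw):
    """Split an image into non-overlapping bh x bw local blocks."""
    h, w = img.shape
    return [img[y:y + bh, x:x + bw].ravel()
            for y in range(0, h, bh) for x in range(0, w, bw)]

def classify(probe, gallery, bh=4, bw=4):
    """Per-block nearest neighbour, then majority vote across blocks."""
    pb = blocks(probe, bh, bw)
    gb = [blocks(g, bh, bw) for g in gallery]
    votes = [int(np.argmin([np.linalg.norm(p - g[b]) for g in gb]))
             for b, p in enumerate(pb)]
    return int(np.bincount(votes).argmax())

rng = np.random.default_rng(2)
g0 = rng.random((8, 8))                            # one gallery image per person
g1 = rng.random((8, 8))
probe = g1 + 0.05 * rng.standard_normal((8, 8))    # noisy probe of person 1
pred = classify(probe, [g0, g1])
```

Because each block votes independently, a corrupted region (e.g. an occluded patch) can be outvoted by the remaining blocks, which is the intuition behind the method's robustness claims.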
Person Re-identification Using Clustering Ensemble Prototypes

This paper presents an appearance-based model to deal with the person re-identification problem. In a crowded scene, it is usually observed that the appearances of most people are similar with regard to their attire. In such situations it is difficult to distinguish an individual from a group of similar-looking individuals, which yields ambiguity in recognition for re-identification. Proper organization of individuals based on appearance characteristics allows the target individual to be recognized by comparison with a particular group of similar-looking individuals, and grouping individuals according to their appearance is a crucial task for person re-identification. In this work we focus on an unsupervised clustering ensemble approach for discovering prototypes, where each prototype represents a set of similar gallery image instances. The formation of each prototype depends upon the appearance characteristics of the gallery instances. A k-NN classifier is employed to assign a prototype to a given probe image. The similarity measure is then computed between the probe and the subset of gallery images that shares the same prototype with the probe, thus reducing the number of comparisons. Re-identification performance on benchmark datasets is presented using cumulative matching characteristic (CMC) curves.

Aparajita Nanda, Pankaj K. Sa
Automatic Lung Tumor Detection Based on GLCM Features

For the diagnosis of lung tumors, CT scanning of the lungs is one of the most common imaging modalities. Manually identifying tumors from hundreds of CT image slices for any patient may prove to be a tedious and time-consuming task for radiologists. Therefore, to assist physicians, we propose an automatic lung tumor detection method based on textural features. The lung parenchyma region is segmented as a pre-processing step because the tumors reside within this region. This reduces the search space over which we look for tumors, thereby increasing computational speed, and also reduces the chance of false identification of tumors. For tumor classification, we use GLCM-based textural features. A sliding window is used to search over the lung parenchyma region and extract the features, and the chi-square distance measure is used to classify the tumor. The performance of the GLCM features for tumor classification is evaluated against histogram features.

Mir Rayat Imtiaz Hossain, Imran Ahmed, Md. Hasanul Kabir
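A minimal version of GLCM texture features and the chi-square comparison might look as follows; the feature set (contrast, energy, homogeneity), the single displacement, and the toy patches are simplified stand-ins for the paper's sliding-window pipeline:

```python
import numpy as np

def glcm_features(img, levels=4, dx=1, dy=0):
    """Contrast, energy and homogeneity from a normalized co-occurrence matrix.
    `img` is assumed to be already quantized to integer values < `levels`."""
    P = np.zeros((levels, levels))
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            P[img[y, x], img[y + dy, x + dx]] += 1
    P /= P.sum()
    i, j = np.indices(P.shape)
    contrast = np.sum(P * (i - j) ** 2)
    energy = np.sum(P ** 2)
    homogeneity = np.sum(P / (1.0 + np.abs(i - j)))
    return np.array([contrast, energy, homogeneity])

def chi_square(f, g, eps=1e-12):
    """Chi-square distance between two feature vectors."""
    return 0.5 * np.sum((f - g) ** 2 / (f + g + eps))

smooth = np.zeros((8, 8), dtype=int)        # uniform patch: zero contrast
noisy = np.arange(64).reshape(8, 8) % 4     # rapidly varying patch
d = chi_square(glcm_features(smooth), glcm_features(noisy))
```

A uniform patch yields zero contrast and maximal energy, so the chi-square distance between the two patches is strictly positive, which is what lets a classifier separate tumor from non-tumor texture.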
A Flexible Semi-supervised Feature Extraction Method for Image Classification

This paper proposes a novel discriminant semi-supervised feature extraction method for generic classification and recognition tasks. The paper has two main contributions. First, we propose a flexible linear semi-supervised feature extraction method that seeks a non-linear subspace that is close to a linear one. The proposed method is based on a criterion that simultaneously exploits the discrimination information provided by the labeled samples, maintains the graph-based smoothness associated with all samples, regularizes the complexity of the linear transform, and minimizes the discrepancy between the unknown linear regression and the unknown non-linear projection. Second, we provide extensive experiments on four benchmark databases in order to study the performance of the proposed method. These experiments demonstrate much improvement over the state-of-the-art algorithms that are either based on label propagation or semi-supervised graph-based embedding.

Fadi Dornaika, Youssof El Traboulsi
Metric Tensor and Christoffel Symbols Based 3D Object Categorization

In this paper we address the problem of 3D object categorization. We model a 3D object as a piecewise smooth Riemannian manifold and propose the metric tensor and Christoffel symbols as a novel set of features. The proposed set of features captures the local and global geometry of 3D objects by exploiting the uniqueness and compatibility of the features. The metric tensor represents a geometrical signature of the 3D object in a Riemannian manifold. To capture global geometry we propose to use a combination of the metric tensor and Christoffel symbols, as Christoffel symbols measure the deviations in the metric tensor. The categorization of 3D objects is carried out using a polynomial kernel SVM classifier. The effectiveness of the proposed framework is demonstrated on 3D objects obtained from different datasets, achieving comparable results.

Syed Altaf Ganihar, Shreyas Joshi, Shankar Setty, Uma Mudenagudi
Feature Learning for the Image Retrieval Task

In this paper we propose a generic framework for the optimization of image feature encoders for image retrieval. Our approach uses a triplet-based objective that compares, for a given query image, the similarity scores of an image with a matching and a non-matching image, penalizing triplets that give a higher score to the non-matching image. We use stochastic gradient descent to address the resulting problem and provide the required gradient expressions for generic encoder parameters, applying the resulting algorithm to learn the power normalization parameters commonly used to condition image features. We also propose a modification to codebook-based feature encoders that consists of weighting the local descriptors as a function of their distance to the assigned codeword before aggregating them as part of the encoding process. Using the VLAD feature encoder, we show experimentally that our proposed optimized power normalization method and local descriptor weighting method yield improvements on a standard dataset.

Aakanksha Rana, Joaquin Zepeda, Patrick Perez
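The triplet objective above can be sketched with a toy linear encoder: penalize a triplet whenever the non-matching image scores within a margin of the match, and descend on the encoder parameters. The vectors, margin and learning rate below are made up for illustration; this is not the authors' VLAD-based system:

```python
import numpy as np

d, margin, lr = 8, 0.5, 0.01
q = np.eye(d)[0]              # query feature (toy one-hot vector)
p = np.eye(d)[1]              # matching image, orthogonal to q under W = I
n = q.copy()                  # non-matching image that initially scores higher

W = np.eye(d)                 # encoder parameters to be tuned

def triplet_loss(W):
    s_pos = (W @ q) @ (W @ p)         # similarity score with the match
    s_neg = (W @ q) @ (W @ n)         # similarity score with the non-match
    return max(0.0, margin + s_neg - s_pos)

l0 = triplet_loss(W)
for _ in range(200):
    if triplet_loss(W) == 0.0:
        break                         # hinge inactive -> zero gradient
    # gradient of (margin + s_neg - s_pos) w.r.t. W while the hinge is active
    G = W @ (np.outer(q, n) + np.outer(n, q) - np.outer(q, p) - np.outer(p, q))
    W -= lr * G
l1 = triplet_loss(W)
```

The hinge makes the objective piecewise smooth: satisfied triplets contribute nothing, so stochastic gradient descent spends its effort only on violating triplets.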
Evaluation of Smile Detection Methods with Images in Real-World Scenarios

Discriminative methods such as SVM have proven extremely effective in pattern recognition problems. We present a systematic study on smile detection with different SVM classifiers. We experimented with a linear SVM classifier, an RBF kernel SVM classifier and a recently proposed local linear SVM (LL-SVM) classifier. In this paper, we focus on smile detection in face images captured in real-world scenarios, such as those in the GENKI4K database. Illumination normalization, alignment and feature representation methods are also taken into consideration. Compared with the commonly used pixel-based representation, we find that local-feature-based methods achieve not only higher detection performance but also better robustness against misalignment. Almost all the illumination normalization methods have no effect on the detection accuracy. Among all the SVM classifiers, the novel LL-SVM is verified to strike a balance between accuracy and efficiency. Among all the features, including pixel intensity, Gabor, LBP and HOG features, we find that HOG features are the most appropriate for detecting smiling faces, and, combined with an RBF kernel SVM, they achieve an accuracy of 93.25% on the GENKI4K database.

Zhoucong Cui, Shuo Zhang, Jiani Hu, Weihong Deng
Everything is in the Face? Represent Faces with Object Bank

Object Bank (OB) [1] has been recently proposed as an object-level image representation for high-level visual recognition. OB represents an image by its responses to many pre-trained object filters. While OB has been validated in general image recognition tasks, it might seem ridiculous to represent a face with OB. However, in this paper, we study this anti-intuitive potential and show how amazingly well OB can represent faces, which seems a proof of the saying that “Everything is in the face”. With the OB representation, we achieve results better than many low-level features and even competitive with state-of-the-art methods on the LFW dataset under the unsupervised setting. We then show how we can achieve state-of-the-art results by combining OB with a low-level feature (e.g. Gabor).

Xin Liu, Shiguang Shan, Shaoxin Li, Alexander G. Hauptmann
Quasi Cosine Similarity Metric Learning

It is vital to select an appropriate distance metric for many learning algorithms. Cosine distance is an efficient metric for measuring the similarity of descriptors in classification tasks. However, cosine similarity metric learning (CSML) [3] is not widely used due to the complexity of its formulation and its time consumption. In this paper, Quasi Cosine Similarity Metric Learning (QCSML) is proposed to simplify it. Normalization and Lagrange multipliers are employed to convert the cosine distance into a simple formulation, which is convex and whose derivative is easy to calculate. The complexity of the QCSML algorithm is $$O(t\times p\times d)$$, where $$t$$, $$p$$ and $$d$$ are the number of iterations, the dimensionality of the descriptors and the number of compressed features, while the complexity of CSML is $$O(r\times b\times g\times s\times d\times m)$$, where, following [3], $$r$$ is the number of iterations used to optimize the projection matrix, $$b$$ is the number of values tested in the cross-validation process, $$g$$ is the number of steps in the Conjugate Gradient method, $$s$$ is the number of training data, and $$d$$ and $$m$$ are the dimensions of the projection matrix. The experimental results of our method on UCI datasets for the classification task and on the LFW dataset for the face verification problem are better than the state-of-the-art methods. For the classification task, the proposed approach is evaluated on the Iris, Ionosphere and Wine datasets, and both the classification accuracy and the time consumption are much better than those of the compared methods. Moreover, our approach obtains 92.33% accuracy for face verification on the unrestricted setting of the LFW dataset, which outperforms state-of-the-art algorithms.

Xiang Wu, Zhi-Guo Shi, Lei Liu
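The quantity both CSML and QCSML operate on — cosine similarity of descriptors under a compressing projection — can be sketched as follows. The projection here is a random stand-in rather than a learned one; note that evaluating (or differentiating) the similarity for one pair costs O(p·d), consistent with the O(t×p×d) total over t iterations quoted in the abstract:

```python
import numpy as np

def cosine_sim(x, y, W):
    """Cosine similarity of two descriptors after projection by W (p x d)."""
    u, v = W @ x, W @ y
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(3)
d, p = 64, 16                                  # descriptor and compressed dims
W = rng.standard_normal((p, d)) / np.sqrt(d)   # random stand-in for learned W
x = rng.standard_normal(d)
same = x + 0.05 * rng.standard_normal(d)       # near-duplicate descriptor
other = rng.standard_normal(d)                 # unrelated descriptor

s_same = cosine_sim(x, same, W)
s_other = cosine_sim(x, other, W)
```

Metric learning then amounts to choosing W so that same-class pairs score higher than different-class pairs, which a random projection only achieves by accident.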
Hand Gesture Recognition Based on the Parallel Edge Finger Feature and Angular Projection

In this paper, a novel high-level hand feature extraction method is proposed with the aid of the finger parallel edge feature and angular projection. The finger is modelled as a cylindrical object, and the finger images can be extracted as salient hand edge images by convolution with a specific operator. The hand center, hand orientation and wrist location are then determined via analysis of the finger image and hand silhouette. The angular projection of the finger images, with its origin at the wrist, is calculated, and the five fingers are located by analyzing the angular projection. The proposed algorithm can detect extended fingers as well as flexed fingers. It is robust to hand rotation, side movements of the fingers and disturbance from the arm. Experimental results demonstrate that the proposed method can directly estimate simple hand poses in real time.

Yimin Zhou, Guolai Jiang, Guoqing Xu, Yaorong Lin
Image Retrieval by Using Non-subsampled Shearlet Transform and Krawtchouk Moment Invariants

In this paper, we use the non-subsampled shearlet transform (NSST) and Krawtchouk Moment Invariants (KMI) to realize image retrieval based on texture and shape features. Shearlets are a new sparse representation tool for multidimensional functions, providing a simple and efficient mathematical framework. We decompose the images by NSST. The directional subband coefficients are modeled by a Generalized Gaussian Distribution (GGD), and the distribution parameters are used to build texture feature vectors, which are compared using the Kullback–Leibler distance (KLD). Meanwhile, low-order KMI are employed to extract shape features, which are compared using the Euclidean distance (ED). Finally, image retrieval is achieved based on a weighted distance measurement. Experimental results show that the proposed retrieval system obtains the highest retrieval rate compared with methods based on DWT, Contourlet, NSCT and DT-CWT.

Cheng Wan, Yiquan Wu
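The KLD between two zero-mean GGDs has a well-known closed form (the Do–Vetterli expression), which is what makes GGD texture signatures cheap to compare. A stdlib-only sketch, with alpha the scale and beta the shape parameter (toy parameter values):

```python
from math import gamma, log

def ggd_kld(a1, b1, a2, b2):
    """Closed-form Kullback-Leibler distance KLD(p1 || p2) between two
    zero-mean generalized Gaussians with scales a1, a2 and shapes b1, b2."""
    return (log((b1 * a2 * gamma(1.0 / b2)) / (b2 * a1 * gamma(1.0 / b1)))
            + (a1 / a2) ** b2 * gamma((b2 + 1.0) / b1) / gamma(1.0 / b1)
            - 1.0 / b1)

d_self = ggd_kld(1.0, 2.0, 1.0, 2.0)   # identical distributions -> 0
d_diff = ggd_kld(1.0, 2.0, 2.0, 0.8)   # clearly different distributions
```

In a retrieval system, each subband contributes one such distance, and the per-subband KLDs are summed (here, combined with the shape-feature distance via the weighted measurement).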
Curve Matching from the View of Manifold for Sign Language Recognition

Sign language recognition is a challenging task due to the complex action variations and the large vocabulary set. Generally, sign language conveys meaning through multichannel information such as trajectory, hand posture and facial expression simultaneously. Obviously, the trajectories of sign words play an important role in sign language recognition. Although multichannel features are helpful for sign representation, this paper focuses only on the trajectory aspect. A method of curve matching based on manifold analysis is proposed to recognize isolated sign language words from 3D trajectories captured by Kinect. From the manifold point of view, the main structure of the curve is found by its intrinsic linear segments, which are characterized by geometric features. The matching between curves is then transformed into the matching between two sets of sequential linear segments. The performance of the proposed curve matching strategy is evaluated on two different sign language datasets. Our method achieves top-1 recognition rates of 78.3% and 61.4% on a dataset of 370 daily words and a large dataset containing 1000 vocabulary items, respectively.

Yushun Lin, Xiujuan Chai, Yu Zhou, Xilin Chen
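Decomposing a trajectory into near-linear pieces can be approximated by a recursive farthest-point split (a Douglas–Peucker-style stand-in, not necessarily the authors' manifold analysis); the L-shaped toy trajectory below splits cleanly at its corner:

```python
import numpy as np

def point_line_dist(p, a, b):
    """Distance from point p to the infinite line through a and b."""
    ab = b - a
    t = np.dot(p - a, ab) / np.dot(ab, ab)
    return np.linalg.norm(p - (a + t * ab))

def split_linear(curve, tol):
    """Recursively split a 3D trajectory into near-linear pieces."""
    if len(curve) <= 2:
        return [curve]
    a, b = curve[0], curve[-1]
    d = [point_line_dist(p, a, b) for p in curve[1:-1]]
    k = int(np.argmax(d)) + 1              # farthest point from the chord
    if d[k - 1] <= tol:
        return [curve]                     # already near-linear
    return split_linear(curve[:k + 1], tol) + split_linear(curve[k:], tol)

# an L-shaped toy trajectory: two linear strokes joined at a corner
leg1 = np.stack([np.linspace(0, 1, 10), np.zeros(10), np.zeros(10)], axis=1)
leg2 = np.stack([np.ones(10), np.linspace(0, 1, 10), np.zeros(10)], axis=1)
curve = np.vstack([leg1, leg2[1:]])
segments = split_linear(curve, tol=0.05)
```

Each resulting segment can then be summarized by geometric features (length, direction, turning angle), turning curve matching into matching two sequences of segment descriptors.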
Learning Partially Shared Dictionaries for Domain Adaptation

Real world applicability of many computer vision solutions is constrained by the mismatch between the training and test domains. This mismatch might arise because of factors such as change in pose, lighting conditions, quality of imaging devices, intra-class variations inherent in object categories etc. In this work, we present a dictionary learning based approach to tackle the problem of domain mismatch. In our approach, we jointly learn dictionaries for the source and the target domains. The dictionaries are partially shared, i.e. some elements are common across both the dictionaries. These shared elements can represent the information which is common across both the domains. The dictionaries also have some elements to represent the domain specific information. Using these dictionaries, we separate the domain specific information and the information which is common across the domains. We use the latter for training cross-domain classifiers i.e., we build classifiers that work well on a new target domain while using labeled examples only in the source domain. We conduct cross-domain object recognition experiments on popular benchmark datasets and show improvement in results over the existing state of art domain adaptation approaches.

Viresh Ranjan, Gaurav Harit, C. V. Jawahar
Learning Discriminative Hidden Structural Parts for Visual Tracking

Part-based visual tracking has been attractive in recent years due to its robustness to occlusion and non-rigid motion. However, how to automatically generate discriminative structural parts and consider their interactions jointly to construct a more robust tracker still remains unsolved. This paper proposes a discriminative structural part learning method that integrates structure information to address the visual tracking problem. Particularly, the state (e.g. position, width and height) of each part is regarded as a hidden variable and inferred automatically by considering the inner structure information of the target and the appearance difference between the target and the background. The inner structure information, which considers the relationships between neighboring parts, is integrated using a graph model based on a dynamically constructed pairwise Markov Random Field. Finally, we adopt the Metropolis-Hastings algorithm, integrated with an online Support Vector Machine, to complete the hidden variable inference task. Experimental results on various challenging sequences demonstrate the favorable performance of the proposed tracker over state-of-the-art ones.

Longyin Wen, Zhaowei Cai, Dawei Du, Zhen Lei, Stan Z. Li
Image Based Visibility Estimation During Day and Night

Meteorological visibility estimation is an important task, for example in road traffic control and aviation safety, but its reliable automation is difficult. Conventional light scattering measurements are limited to a small space, and the extrapolated values are often erroneous. Current meteorological visibility estimates relying on a single camera work only with data captured in daylight. We propose a new method based on feature vectors that are projections of the scene images with lighting normalization. The proposed method was combined with high dynamic range imaging to improve night-time image quality. A visibility classification accuracy (F1) of 85.5% was achieved for data containing both day and night images. The results show that the approach can compete with commercial visibility measurement devices.

Sami Varjo, Jari Hannuksela
Symmetric Feature Extraction for Pose Neutralization

This paper proposes a method to neutralize the pose of facial databases. Efficient use of feature extractors and their properties leads to pose neutralization. The feature extractors discussed here are transforms such as the Discrete Cosine Transform (DCT) and the Discrete Fourier Transform (DFT), and the symmetric behavior of these transforms is the basis of the proposed method. A modulo-based approach to extracting the features was found to provide better results than conventional techniques for pose neutralization. Experiments conducted on various benchmark facial databases, mainly the pose-variant FERET and FEI, show the promising performance of the proposed method in neutralizing pose.

S. G. Charan
3D Laplacian Pyramid Signature

We introduce a simple and effective point descriptor, called 3D Laplacian Pyramid Signature (3DLPS), by extending and adapting the Laplacian Pyramid defined in 2D images to 3D shapes. The signature is represented as a high-dimensional feature vector recording the magnitudes of mean curvatures, which are captured through sequentially applying Laplacian of Gaussian (LOG) operators on each vertex of 3D shapes. We show that 3DLPS organizes the intrinsic geometry information concisely, while possessing high sensitivity and specificity. Compared with existing point signatures, 3DLPS is robust and easy to compute, yet captures enough information embedded in the shape. We describe how 3DLPS may potentially benefit the applications involved in shape analysis, and especially demonstrate how to incorporate it in point correspondence detection, best view selection and automatic mesh segmentation. Experiments across a collection of shapes have verified its effectiveness.

Kaimo Hu, Yi Fang
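The per-vertex signature idea — band-pass responses of curvature at several scales — can be sketched on a toy graph, using differences of successive smoothings (a difference-of-Gaussians-style stand-in for the LoG operators) applied to a per-vertex curvature signal:

```python
import numpy as np

def smooth(sig, adj, lam=0.5):
    """One step of graph Laplacian smoothing of a per-vertex signal."""
    avg = np.array([sig[nbrs].mean() for nbrs in adj])
    return (1 - lam) * sig + lam * avg

def lp_signature(curv, adj, levels=4):
    """3DLPS-style vector per vertex: magnitudes of differences between
    successive smoothings, one entry per scale."""
    sigs, cur = [], curv
    for _ in range(levels):
        nxt = smooth(cur, adj)
        sigs.append(np.abs(cur - nxt))   # band-pass magnitude at this scale
        cur = nxt
    return np.stack(sigs, axis=1)        # one row per vertex

# toy "mesh": a cycle of 6 vertices with a curvature spike at vertex 0
adj = [np.array([(i - 1) % 6, (i + 1) % 6]) for i in range(6)]
curv = np.zeros(6)
curv[0] = 1.0
S = lp_signature(curv, adj)
```

The spike vertex dominates the finest scale and fades through coarser ones, so the stacked vector is distinctive for geometrically salient vertices, which is what makes such signatures usable for correspondence detection.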
Commonality Preserving Multiple Instance Clustering Based on Diverse Density

Image-set clustering is the problem of decomposing a given image set into disjoint subsets satisfying specified criteria. For single-vector image representations, a proximity or similarity criterion is widely applied, i.e., proximal or similar images form a cluster. The recent trend in image description, however, is local-feature based, i.e., an image is described by multiple local features, e.g., SIFT, SURF, and so on. In this description, which criterion should be employed for clustering? As an answer to this question, this paper presents an image-set clustering method based on commonality, that is, images preserving strong commonality (coherent local features) form a cluster. Under this criterion, image variations that do not affect the common features are harmless. In the case of face images, hair-style changes and partial occlusions by glasses may not affect the cluster formation. We define four commonality measures based on Diverse Density, which are used in agglomerative clustering. Through comparative experiments, we confirmed that two of our methods perform better than the other methods examined in the experiments.

Takayuki Fukui, Toshikazu Wada

Third International Workshop on Intelligent Mobile and Egocentric Vision (IMEV2014)

Frontmatter
Interactive RGB-D SLAM on Mobile Devices

In this paper we present a new RGB-D SLAM system specifically designed for mobile platforms. Though the basic approach has already been proposed, many relevant changes are required to suit a user-centered mobile environment. In particular, our implementation tackles the strict memory constraints and limited computational power of a typical tablet device, thus delivering interactive usability without hindering effectiveness. Real-time 3D reconstruction is achieved by projecting measurements from aligned RGB-D keyframes, so as to provide the user with instant feedback. We quantitatively analyze the accuracy vs. speed trade-off of diverse variants of the proposed pipeline, estimate the amount of memory required to run the application, and also provide qualitative results dealing with reconstructions of indoor environments.

Nicholas Brunetto, Nicola Fioraio, Luigi Di Stefano
3D Reconstruction with Automatic Foreground Segmentation from Multi-view Images Acquired from a Mobile Device

We propose a novel foreground object segmentation algorithm for a silhouette-based 3D reconstruction system. Our system requires several multi-view images as input to reconstruct a complete 3D model. The proposed foreground segmentation algorithm is based on graph-cut optimization, with an energy function developed under a planar background assumption. We parallelize parts of our program with GPU programming. The 3D reconstruction system consists of camera calibration, foreground segmentation, visual hull reconstruction, surface reconstruction, and texture mapping, and the reconstruction process is accelerated with a GPU implementation. In the experimental results, we demonstrate the improved accuracy obtained by using the proposed segmentation method and show the reconstructed 3D models computed from several image sets.

Ping-Cheng Kuo, Chao-An Chen, Hsing-Chun Chang, Te-Feng Su, Shang-Hong Lai
Accelerating Local Feature Extraction Using Two Stage Feature Selection and Partial Gradient Computation

In this paper, we present a fast local feature extraction method, which is our contribution to the ongoing MPEG standardization of the compact descriptor for visual search (CDVS). To reduce the time complexity of feature extraction, two-stage feature selection, which is based on the feature selection method of the CDVS Test Model (TM), and partial gradient computation are introduced. The proposed method is examined on SIFT and compared to SIFT and SURF extractors with the previous feature selection method. In addition, the proposed method is compared to various feature extraction methods of the current CDVS TM 11 in the CDVS evaluation framework. Experimental results show that the proposed method significantly reduces the time complexity while maintaining the matching and retrieval performance of previous work. For its efficiency, the proposed method has been integrated into the CDVS TM since the $$107^{\text {th}}$$ MPEG meeting. This method will also be useful for feature extraction on mobile devices, where computational resources are limited.

Keundong Lee, Seungjae Lee, Weon-Geun Oh
Hybrid Feature and Template Based Tracking for Augmented Reality Application

Visual tracking is the core technology that enables vision-based augmented reality applications. Recent contributions in visual tracking are dominated by template-based tracking approaches such as ESM, due to their accuracy in estimating the camera pose. However, it has been shown that template-based tracking is less robust against large inter-frame displacements and image variations than feature-based tracking. Therefore, we propose to combine feature-based and template-based tracking into a hybrid tracking model to improve the overall tracking performance. The feature-based tracking is performed prior to the template-based tracking. The feature-based tracking estimates pose changes between frames using the tracked feature points. The template-based tracking is then used to refine the estimated pose. As a result, the hybrid tracking approach is robust against large inter-frame displacements and image variations, while also accurately estimating the camera pose. Furthermore, we show that the pose adjustment performed by the feature-based tracking reduces the number of iterations necessary for ESM to refine the estimated pose.

Gede Putra Kusuma Negara, Fong Wee Teck, Li Yiqun
A Mobile Augmented Reality Framework for Post-stroke Patient Rehabilitation

In this paper, we put forward a novel framework based on mobile augmented reality (AR) to enhance post-stroke patients' participation in the rehabilitation process. The exercises performed in rehabilitation centers are monotonous and thus require maximum effort and time from both the patients and the occupational therapists. We propose to combine these tedious activities with interactive mobile augmented reality technologies. We call this framework Cogni-Care. In this paper, we introduce the underlying architecture of the system, which eases the work of the stakeholders involved in the process of stroke recovery. We also present two exercises to improve fine motor skills, the AR-Ball exercise and the AR-Maze exercise, as examples, and perform an initial usability study.

Sujay Babruwad, Rahul Avaghan, Uma Mudenagudi
Estimation of 3-D Foot Parameters Using Hand-Held RGB-D Camera

Most people choose shoes mainly based on their foot sizes. However, a foot size only reflects the foot length and does not consider the foot width. Therefore, some people use both the width and length of their feet to select shoes, but those two parameters cannot fully characterize the 3-D shape of a foot and are certainly not enough for selecting a pair of comfortable shoes. In general, the ball girth is also required for shoe selection in addition to the width and the length of a foot. In this paper, we propose a foot measurement system which consists of a low-cost Intel Creative Senz3D RGB-D camera, an A4-size reference pattern, and a desktop computer. The reference pattern is used to provide video-rate camera pose estimation, so the acquired 3-D data can be converted into a common reference coordinate system to form a complete set of foot surface data. We also propose a markerless ball-girth estimation method which uses the lengths of two toe gaps to infer the joint locations of the big/little toes and the metatarsals. Results from real experiments show that the proposed method is accurate enough to provide the three major foot parameters for shoe selection.

Yang-Sheng Chen, Yu-Chun Chen, Peng-Yuan Kao, Sheng-Wen Shih, Yi-Ping Hung
A Wearable Face Recognition System on Google Glass for Assisting Social Interactions

In this paper, we present a wearable face recognition (FR) system on Google Glass (GG) to assist users in social interactions. FR is the first step towards face-to-face social interactions. We propose a wearable system on GG which acts as a social interaction assistant; the application includes face detection, eye localization, face recognition, and a user interface for personal information display. To be useful in natural social interaction scenarios, the system should be robust to changes in face pose, scale, and lighting conditions. OpenCV face detection is implemented on GG. We exploit both the OpenCV and ISG (Integration of Sketch and Graph patterns) eye detectors to locate a pair of eyes on the face; the former is stable for frontal-view faces and the latter performs better for oblique-view faces. We extend the eigenfeature regularization and extraction (ERE) face recognition approach by introducing subclass discriminant analysis (SDA) to perform within-subclass discriminant analysis for face feature extraction. The new approach improves the accuracy of FR over varying face pose, expression, and lighting conditions. A simple user interface (UI) is designed to present relevant personal information about the recognized person to assist in the social interaction. A standalone system on GG and a Client-Server (CS) system connecting GG with a smartphone via Bluetooth are implemented, for different levels of privacy protection. The performance on a database created using GG is evaluated and comparisons with baseline approaches are performed. Extensive experimental studies show that our proposed system on GG performs real-time FR better than other methods.

Bappaditya Mandal, Shue-Ching Chia, Liyuan Li, Vijay Chandrasekhar, Cheston Tan, Joo-Hwee Lim
Lifelog Scene Change Detection Using Cascades of Audio and Video Detectors

The advent of affordable wearable devices with a video camera has established a new form of social data, lifelogs, where people's lives are captured on video. The enormous amount of lifelog data and the need for on-site processing demand new, fast video processing methods. In this work, we experimentally investigate seven hours of lifelogs and report novel findings: (1) audio cues are exceptionally strong for lifelog processing; (2) cascades of audio and video detectors improve accuracy and enable fast (super frame rate) processing speed. We first construct strong detectors using state-of-the-art audio and visual features: Mel-frequency cepstral coefficients (MFCC), colour (RGB) histograms, and local patch descriptors (SIFT). In the second stage, we construct a cascade of the trained detectors and optimise the cascade parameters. Separating the detector and cascade optimisation stages simplifies training and results in a fast and accurate processing pipeline.

Katariina Mahkonen, Joni-Kristian Kämäräinen, Tuomas Virtanen
Activity Recognition in Egocentric Life-Logging Videos

With the increasing availability of wearable cameras, research on first-person view videos (egocentric videos) has received much attention recently. While some effort has been devoted to collecting various egocentric video datasets, there has not been a focused effort in assembling one that could capture the diversity and complexity of activities related to life-logging, which is expected to be an important application for egocentric videos. In this work, we first conduct a comprehensive survey of existing egocentric video datasets. We observe that existing datasets do not emphasize activities relevant to the life-logging scenario. We build an egocentric video dataset dubbed LENA (Life-logging EgoceNtric Activities) (http://people.sutd.edu.sg/~1000892/dataset) which includes egocentric videos of 13 fine-grained activity categories, recorded under diverse situations and environments using the Google Glass. Activities in LENA can also be grouped into 5 top-level categories to meet the various needs of activity analysis research. We evaluate state-of-the-art activity recognition on LENA in detail and also analyze the performance of popular descriptors in egocentric activity recognition.

Sibo Song, Vijay Chandrasekhar, Ngai-Man Cheung, Sanath Narayan, Liyuan Li, Joo-Hwee Lim
3D Line Segment Based Model Generation by RGB-D Camera for Camera Pose Estimation

In this paper, we propose a novel method for generating a 3D line-segment-based model from an image sequence taken with an RGB-D camera. Constructing a 3D geometrical representation is essential for model-based camera pose estimation, which can be performed by corresponding 2D features in images with 3D features of the captured scene. While point features are mostly used for conventional camera pose estimation, we aim to use line segment features to improve the performance of the camera pose estimation. In this method, using the RGB and depth images of two consecutive frames, 2D line segments from the current frame and 3D line segments from the previous frame are put into correspondence. The 2D-3D line segment correspondences provide the camera pose of the current frame. All 2D line segments are finally back-projected into the world coordinate system based on the estimated camera pose to generate the 3D line-segment-based model of the target scene. In experiments, we confirmed that the proposed method can successfully generate line-segment-based models, while 3D models based on point features often fail to adequately represent the target scene.

Yusuke Nakayama, Hideo Saito, Masayoshi Shimizu, Nobuyasu Yamaguchi
Integrated Vehicle and Lane Detection with Distance Estimation

In this paper, we propose an integrated system that combines vehicle detection, lane detection, and vehicle distance estimation in a collaborative manner. Adaptive search windows for vehicles provide constraints on the width between lanes. By exploiting these constraints, the search space for lane detection can be efficiently reduced. We employ local patch constraints for lane detection to improve its reliability. Moreover, it is challenging to estimate vehicle distance in real time from images/videos captured from a monocular camera. In our approach, we utilize lane markers with the associated 3D constraints to estimate the camera pose and the distances to frontal vehicles. Experimental results on real videos show that the proposed system is robust and accurate in terms of vehicle and lane detection and vehicle distance estimation.

Yu-Chun Chen, Te-Feng Su, Shang-Hong Lai
Collaborative Mobile 3D Reconstruction of Urban Scenes

Reconstruction of the surrounding 3D world is of particular interest for mapping, civil applications, and entertainment. The wide availability of smartphones with cameras and wireless networking capabilities makes collecting 2D images of a particular scene easy. In contrast to the client-server architecture adopted by most mobile services, we propose an architecture where data, computations, and results can be shared in a collaborative manner among the participating devices without centralization. Camera calibration and pose estimation parameters are determined using classical image-based methods. The reconstruction is based on interactively selected arbitrary planar regions, which is especially suitable for objects having large (near-)planar surfaces often found in urban scenes (e.g. building facades, windows, etc.). The perspective distortion of a planar region in two views makes it possible to compute the normal and distance of the region w.r.t. the world coordinate system. Thus a fairly precise 3D model can be built by reconstructing a set of planar regions with different orientations. We also show how visualization, data sharing, and communication can be solved. The applicability of the method is demonstrated by reconstructing real urban scenes.

Attila Tanács, András Majdik, Levente Hajder, József Molnár, Zsolt Sánta, Zoltan Kato

Workshop on Human Identification for Surveillance (HIS)

Frontmatter
Gaussian Descriptor Based on Local Features for Person Re-identification

This paper proposes a novel image representation for person re-identification. Since one person is assumed to wear the same clothes in different images, the color information of person images is very important for distinguishing one person from the others. Motivated by this, we propose a simple but effective representation named Gaussian descriptor based on Local Features (GaLF). Compared with traditional color features, such as histograms, GaLF can not only represent the color information of person images, but also use texture and spatial structure as a supplement. Specifically, there are three stages in extracting GaLF. First, pedestrian parsing and lightness constancy methods are applied to eliminate the influence of illumination and background. Then, a very simple 7-d feature is extracted at each pixel in the person image. Finally, the local features in each body part region are represented by the mean vector and covariance matrix of a Gaussian model. After obtaining the GaLF representation, the similarity between two person images is measured by the distance between two sets of Gaussian models based on the product of Lie groups. To show the effectiveness of the proposed representation, this paper conducts experiments on two person re-identification tasks (VIPeR and i-LIDS), on which it improves the current state-of-the-art performance.

Bingpeng Ma, Qian Li, Hong Chang
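As an illustration of the Gaussian summary step described in the GaLF abstract above, the following minimal NumPy sketch fits a mean vector and covariance matrix to a set of per-pixel features (the 7-d feature itself is defined in the paper; the function name and toy data here are ours):

```python
import numpy as np

def gaussian_part_descriptor(features):
    """Summarize one body-part region by a Gaussian model:
    the mean vector and covariance matrix of its per-pixel
    local features (an (N, 7) array in the GaLF setting)."""
    mu = features.mean(axis=0)
    centered = features - mu
    cov = centered.T @ centered / max(len(features) - 1, 1)
    return mu, cov

# toy example: 100 random "pixels" standing in for one body part
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 7))
mu, cov = gaussian_part_descriptor(feats)
```

One (mu, cov) pair per body part then plays the role of the region's descriptor; the paper compares such Gaussians via a Lie-group-based distance rather than a Euclidean one.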
Privacy Preserving Multi-target Tracking

Automated people tracking is important for a wide range of applications. However, typical surveillance cameras are controversial in their use, mainly due to the harsh intrusion of the tracked individuals’ privacy. In this paper, we explore a privacy-preserving alternative for multi-target tracking. A network of infrared sensors attached to the ceiling acts as a low-resolution, monochromatic camera in an indoor environment. Using only this low-level information about the presence of a target, we are able to reconstruct entire trajectories of several people. Inspired by the recent success of offline approaches to multi-target tracking, we apply an energy minimization technique to the novel setting of infrared motion sensors. To cope with the very weak data term from the infrared sensor network we track in a continuous state space with soft, implicit data association. Our experimental evaluation on both synthetic and real-world data shows that our principled method clearly outperforms previous techniques.

Anton Milan, Stefan Roth, Konrad Schindler, Mineichi Kudo
Full-Body Human Pose Estimation from Monocular Video Sequence via Multi-dimensional Boosting Regression

In this work, we propose a scheme to estimate two-dimensional full-body human poses in a monocular video sequence. For each frame in the video, we detect the human region using a support vector machine, and estimate the full-body human pose in the detected region using multi-dimensional boosting regression. For the human pose estimation, we design a joints relationship tree, corresponding to the full hierarchical structure of joints in a human body. Further, we construct a complete set of spatial and temporal feature descriptors for each frame. Utilizing the well-designed joints relationship tree and feature descriptors, we learn a hierarchy of regressors in the training stage and employ the learned regressors to determine all the joints' positions in the testing stage. As experimentally demonstrated, the proposed scheme achieves outstanding estimation performance.

Yonghui Du, Yan Huang, Jingliang Peng
Improve Pedestrian Attribute Classification by Weighted Interactions from Other Attributes

Recent works have shown that visual attributes are useful in a number of applications, such as object classification, recognition, and retrieval. However, predicting attributes in images with large variations still remains a challenging problem. Several approaches have been proposed for visual attribute classification; however, most of them assume independence among attributes. In fact, to predict one attribute, it is often useful to consider other related attributes. For example, a pedestrian with long hair and a skirt usually implies the female attribute. Motivated by this, we propose a novel pedestrian attribute classification method which exploits interactions among different attributes. Firstly, each attribute classifier is trained independently. Secondly, for each attribute, we also use the decision scores of the other attribute classifiers to learn an attribute interaction regressor. Finally, prediction of one attribute is achieved by a weighted combination of the independent decision score and the interaction score from the other attributes. The proposed method is able to balance the independent decision score and the interactions of other attributes to yield more robust classification results. Experimental results on the Attributed Pedestrians in Surveillance (APiS 1.0) [1] database validate the effectiveness of the proposed approach for pedestrian attribute classification.

Jianqing Zhu, Shengcai Liao, Zhen Lei, Stan Z. Li
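The weighted score combination described in the abstract above can be sketched as follows (a hypothetical minimal form: the interaction weights and the balance parameter `alpha` are placeholders here, whereas the paper learns them from data):

```python
import numpy as np

def fuse_attribute_scores(own_score, other_scores, weights, alpha=0.5):
    """Combine an attribute's independent classifier score with an
    interaction score formed from the other attributes' scores."""
    interaction = float(np.dot(weights, other_scores))
    return alpha * own_score + (1.0 - alpha) * interaction

# e.g. fusing a "female" score with "long hair" and "skirt" scores
fused = fuse_attribute_scores(0.8, np.array([0.6, 0.4]),
                              weights=np.array([0.5, 0.5]))
```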
Face Recognition with Image Misalignment via Structure Constraint Coding

Face recognition (FR) via sparse representation has been widely studied in the past several years. Recently, many sparse-representation-based face recognition methods handling simultaneous misalignment have been proposed and have shown interesting results. In this paper, we present a novel method called structure constraint coding (SCC) for face recognition with image misalignment. Unlike those sparse-representation-based methods, our method performs image alignment and image representation simultaneously via structure-constraint-based regression. Here, we use the nuclear norm as a structure constraint criterion to characterize the error image. Compared with the sparse-representation-based methods, SCC is more robust for dealing with illumination variations and structural noise (especially block occlusion). Experimental results on public face databases verify the effectiveness of our method.

Ying Tai, Jianjun Qian, Jian Yang, Zhong Jin
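The nuclear-norm criterion mentioned above is simply the sum of the singular values of the error image; a minimal sketch (showing only the norm itself, not the full SCC regression, with names of our choosing):

```python
import numpy as np

def nuclear_norm(error_image):
    """Sum of singular values. Structural noise such as block
    occlusion tends to produce a low-rank error image, which this
    norm penalizes less than a pixel-wise sparsity criterion."""
    return float(np.linalg.svd(error_image, compute_uv=False).sum())

# a rank-1 "error image": its nuclear norm equals its single
# nonzero singular value, here |[1, 0]| * |[3, 4]| = 5
E = np.outer([1.0, 0.0], [3.0, 4.0])
```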
People Re-identification Based on Bags of Semantic Features

People re-identification has attracted a lot of attention recently. As an important part of disjoint-camera-based surveillance systems, it faces many problems. Various factors such as illumination conditions, camera viewpoints, and occlusion make people re-identification a difficult task. In this paper, we exploit the performance of bags of semantic features for people re-identification. Semantic features are mid-level features that can be directly described by words, such as hair length, skin tone, race, clothes colors, and so on. Although semantic features are not as discriminative as the local features used in existing methods, they are more invariant. Therefore, good performance on people re-identification can be expected by combining a set of semantic features. Experiments are carried out on the VIPeR dataset. A comparison with some state-of-the-art works is provided, and the proposed method shows better performance.

Zhi Zhou, Yue Wang, Eam Khwang Teoh
Tracking Pedestrians Across Multiple Cameras via Partial Relaxation of Spatio-Temporal Constraint and Utilization of Route Cue

We tackle multiple-people tracking across multiple non-overlapping surveillance cameras installed in a wide area. Existing methods attempt to track people across cameras by utilizing appearance features and spatio-temporal cues to re-identify people across adjacent cameras. However, in relatively wide public areas like a shopping mall, since many people may walk and stay arbitrarily, the spatio-temporal constraint is so strict that it rejects correct matchings, which results in matching errors. Additionally, appearance features can be severely influenced by illumination conditions and camera viewpoints, making it difficult to match tracklets by appearance features. These two issues cause fragmentation of tracking trajectories across cameras. We deal with the former issue by selectively relaxing the spatio-temporal constraint and the latter by introducing a route cue. We show results on data captured by cameras in a shopping mall, and demonstrate that the accuracy of across-camera tracking can be significantly increased under the considered settings.

Toru Kokura, Yasutomo Kawanishi, Masayuki Mukunoki, Michihiko Minoh
Discovering Person Identity via Large-Scale Observations

Person identification has been a well-studied problem over the last two decades. In a typical automated person identification scenario, the system always contains prior knowledge of the person of interest, either a person-based model or a reference mugshot. However, the challenge of automated person identification increases by multiple folds if this prior information is not available. In today's world, rich and large quantities of information are easily attainable through the Internet or closed-loop surveillance networks. This provides an opportunity to employ an automated approach to perform person identification with minimum prior knowledge, provided that there is a sufficient amount of observations. In this paper, we propose a dominant-set-based person identification framework to learn the identity of a person through large-scale observations, where each observation contains instances from various modalities. Through experiments on two challenging face datasets we show the potential of the proposed approach. We also explore the conditions required to obtain satisfactory performance and discuss potential future research directions.

Yongkang Wong, Lekha Chaisorn, Mohan S. Kankanhalli
Robust Ear Recognition Using Gradient Ordinal Relationship Pattern

Reliable personal recognition based on ear biometrics is in high demand due to its vast applications in automated surveillance, law enforcement, etc. In this paper, a robust ear recognition system is proposed using a gradient ordinal relationship pattern. A reference-point-based normalization is proposed, along with a novel ear transformation over the normalized ear, to obtain robust ear representations. Ear samples are enhanced using a local enhancement technique. A dissimilarity measure is then proposed that can be used for matching ear samples. Two publicly available ear databases, IITD and UND-E, are used for the performance analysis. The proposed system has shown very promising results and significant improvement over existing state-of-the-art ear recognition systems. The proposed system is robust against small illumination variations and affine transformations, by virtue of the ear transformation and tracking-based matching respectively.

Aditya Nigam, Phalguni Gupta
Gait-Assisted Person Re-identification in Wide Area Surveillance

Gait has been shown to be an effective feature for person recognition and could be well suited to the problem of multi-frame person re-identification (Re-ID). However, person Re-ID poses a very unique set of challenges, with changes in view angles and environments across cameras. Thus, the feature needs to be highly discriminative as well as robust to drastic variations to be effective for Re-ID. In this paper, we study the applicability of gait to person Re-ID when combined with color features. The combined-features-based Re-ID is tested for short-period Re-ID on a dataset we collected using 9 cameras and 40 IDs. Additionally, we also investigate the potential of gait features alone for Re-ID under real-world surveillance conditions. This allows us to understand the potential of gait for long-period Re-ID as well as under scenarios where color features cannot be leveraged. Both the combined and gait-only features based Re-ID are tested on the publicly available SAIVT SoftBio dataset. We select two popular gait features, namely Gait Energy Images (GEI) and Frame Difference Energy Images (FDEI), for Re-ID and propose a sparsified-representation-based gait recognition method.

Apurva Bedagkar-Gala, Shishir K. Shah
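Of the two gait features named above, the Gait Energy Image has a particularly compact definition: the pixel-wise average of aligned, size-normalized binary silhouettes over a gait cycle. A minimal sketch, assuming silhouette extraction and alignment have already been done (names are ours):

```python
import numpy as np

def gait_energy_image(silhouettes):
    """GEI: average of aligned binary silhouette frames. Pixels that
    are foreground in most frames appear bright, so the image encodes
    both body shape and the frequency of limb motion."""
    return np.asarray(silhouettes, dtype=float).mean(axis=0)

# two tiny 2x2 "silhouette" frames as a toy gait cycle
gei = gait_energy_image([[[1, 0], [0, 0]],
                         [[1, 1], [0, 0]]])
```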
Cross Dataset Person Re-identification

Until now, most existing research on person re-identification has aimed at improving the recognition rate in the single-dataset setting, where the training data and testing data are from the same source. Although such methods have obtained high recognition rates in experiments, they usually perform poorly in practical applications. In this paper, we focus on cross-dataset person re-identification, which makes more sense in the real world. We present a deep learning framework based on convolutional neural networks to learn the person representation instead of using existing hand-crafted features, and the cosine metric is used to calculate similarity. Three different datasets, Shinpuhkan2014dataset, CUHK and CASPR, are chosen as the training sets, and we evaluate the performance of the learned person representations on VIPeR. For the training set Shinpuhkan2014dataset, we also evaluate the performance on PRID and iLIDS. Experiments show that our method outperforms existing cross-dataset methods significantly and even approaches the performance of some methods in the single-dataset setting.

Yang Hu, Dong Yi, Shengcai Liao, Zhen Lei, Stan Z. Li
Spatio-Temporal Consistency for Head Detection in High-Density Scenes

In this paper we address the problem of detecting reliably a subset of pedestrian targets (heads) in a high-density crowd exhibiting extreme clutter and homogeneity, with the purpose of obtaining tracking initializations. We investigate the solution provided by discriminative learning where we require that the detections in the image space be localized over most of the target area and temporally stable. The results of our tests show that discriminative learning strategies provide valuable cues about the target localization which may be combined with other complementary strategies in order to bootstrap tracking algorithms in these challenging environments.

Emanuel Aldea, Davide Marastoni, Khurom H. Kiyani
Multi-target Tracking with Sparse Group Features and Position Using Discrete-Continuous Optimization

Multi-target tracking of pedestrians is a challenging task due to uncertainty about targets, caused mainly by similarity between pedestrians, occlusion over a relatively long time and a cluttered background. A usual scheme for tackling multi-target tracking is to divide it into two sub-problems: data association and trajectory estimation. A reasonable approach is based on joint optimization of a discrete model for data association and a continuous model for trajectory estimation in a Markov Random Field framework. Nonetheless, usual solutions of the data association problem are based only on location information, while the visual information in the images is ignored. Visual features can be useful for associating detections with true targets more reliably, because the targets usually have discriminative features. In this work, we propose a combination of position and visual feature information in a discrete data association model. Moreover, we propose the use of group Lasso regularization in order to improve the identification of particular pedestrians, given that the discriminative regions are associated with particular visual blocks in the image. We find promising results for our approach in terms of precision and robustness when compared with a state-of-the-art method in standard datasets for multi-target pedestrian tracking.

Billy Peralta, Alvaro Soto
Hybrid Focal Stereo Networks for Pattern Analysis in Homogeneous Scenes

In this paper we address the problem of multiple camera calibration in the presence of a homogeneous scene, and without the possibility of employing calibration object based methods. The proposed solution exploits salient features present in a larger field of view, but instead of employing active vision we replace the cameras with stereo rigs featuring a long focal analysis camera, as well as a short focal registration camera. Thus, we are able to propose an accurate solution which does not require intrinsic variation models as in the case of zooming cameras. Moreover, the availability of the two views simultaneously in each rig allows for pose re-estimation between rigs as often as necessary. The algorithm has been successfully validated in an indoor setting, as well as on a difficult scene featuring a highly dense pilgrim crowd in Makkah.

Emanuel Aldea, Khurom H. Kiyani
Backmatter
Metadata
Title
Computer Vision - ACCV 2014 Workshops
Edited by
C. V. Jawahar
Shiguang Shan
Copyright year
2015
Electronic ISBN
978-3-319-16634-6
Print ISBN
978-3-319-16633-9
DOI
https://doi.org/10.1007/978-3-319-16634-6