
About this book

This book constitutes the refereed proceedings of the 10th International Conference on Computer Vision Systems, ICVS 2015, held in Copenhagen, Denmark, in July 2015. The 48 papers presented were carefully reviewed and selected from 92 submissions. The papers are organized in topical sections on biological and cognitive vision; hardware-implemented and real-time vision systems; high-level vision; learning and adaptation; robot vision; and vision systems applications.



Biological and Cognitive Vision


Comparison of Statistical Features for Medical Colour Image Classification

Analysis of cells and tissues allows the evaluation and diagnosis of a vast number of diseases. Today this analysis is still performed manually, which has numerous drawbacks; in particular, the accuracy of the results depends heavily on the operator's skills. By contrast, automated analysis by computer is performed quickly, requires only one image of the sample and provides precise results. In this work we investigate different texture descriptors extracted from colour medical images. We compare and combine these features in order to identify the feature set able to properly classify medical images presenting different classification problems. The tested feature sets are based on a generalization of some existing grey-scale approaches for feature extraction to colour images. The generalization has been applied to the calculation of the Grey-Level Co-Occurrence Matrix, Grey-Level Difference Matrix and Grey-Level Run-Length Matrix. Furthermore, we calculate the Grey-Level Run-Length Matrix starting from the Grey-Level Difference Matrix. The performance of the resulting feature sets has been compared using the Support Vector Machine model. To validate our method we have used three different databases, HistologyDS, Pap-smear and Lymphoma, which present different medical problems and therefore represent different classification problems. The experimental results show that the features extracted from the generalized Grey-Level Co-Occurrence Matrix perform better than the other feature sets, and also that a combination of features selected from all the feature subsets always leads to better performance.

Cecilia Di Ruberto, Giuseppe Fodde, Lorenzo Putzu
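As a point of reference for the abstract above, the following is a minimal sketch of the classical grey-scale Grey-Level Co-Occurrence Matrix and one texture feature derived from it (contrast). It does not reproduce the authors' colour generalization; the function names `glcm` and `contrast`, the offset parameters and the 4-level toy image are assumptions chosen for illustration.

```python
def glcm(image, dx=1, dy=0, levels=4):
    """Count how often grey level i has grey level j at offset (dx, dy)."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


def contrast(matrix):
    """Contrast feature: squared level difference weighted by pair frequency."""
    total = sum(sum(row) for row in matrix)
    weighted = sum((i - j) ** 2 * count
                   for i, row in enumerate(matrix)
                   for j, count in enumerate(row))
    return weighted / total


# Hypothetical 4x4 image with grey levels 0..3.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
matrix = glcm(image)           # horizontal neighbour, one pixel to the right
texture_contrast = contrast(matrix)
```

Features such as this one would then be fed to a classifier; the abstract names a Support Vector Machine for that step.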

Improving FREAK Descriptor for Image Classification

In this paper we propose a new set of bio-inspired descriptors for image classification based on low-level processing performed by the retina. Taking as a starting point a descriptor called FREAK (Fast Retina Keypoint), we extend it by mimicking the center-surround organization of ganglion receptive fields. To test our approach we compared the performance of the original FREAK and our proposal on the 15 scene categories database. The results show that our approach outperforms the original FREAK for the scene classification task.

Cristina Hilario Gomez, Kartheek Medathati, Pierre Kornprobst, Vittorio Murino, Diego Sona

Arabic-Latin Offline Signature Recognition Based on Shape Context Descriptor

Offline signature recognition is a very difficult task due to normal variability in signatures and the unavailability of dynamic information regarding the pen path. In this paper, a technique for signature recognition is proposed based on shape context, which summarizes the global signature features in a rich local descriptor. The proposed system reaches 100 % accuracy but has some scalability problems as a result of the correspondence problem between the queried signature and all the data set signatures. To address the scalability problem of using shape context for signature matching, the proposed method speeds up the matching stage by representing the shape context features as a feature vector and then applies a clustering algorithm to assign signatures to their corresponding classes.

Ahmed M. Omar, Nagia M. Ghanem, Mohamed A. Ismail, Sahar M. Ghanem

Saliency-Guided Object Candidates Based on Gestalt Principles

We present a new method for generating general object candidates for cluttered RGB-D scenes. Starting from an over-segmentation of the image, we build a graph representation and define an object candidate as a subgraph that has maximal internal similarity as well as minimal external similarity. These candidates are created by successively adding segments to a seed segment in a saliency-guided way. Finally, the resulting object candidates are ranked based on Gestalt principles. We show that the proposed algorithm clearly outperforms three other recent methods for object discovery on the challenging Kitchen dataset.

Thomas Werner, Germán Martín-García, Simone Frintrop

Person Re-identification Based on Multi-directional Saliency Metric Learning

To address the problem of inconsistent saliency between matched patches in person re-identification, a multi-directional salience similarity evaluation based on metric learning is proposed. A distribution analysis of salience consistency between the patches is performed, and the similarity between matched patches is established by weighted fusion of multi-directional salience. The weight of saliency in each direction is obtained using metric learning based on Structural SVM Ranking. This improves the discriminative power and accuracy of re-identification. Compared with similar algorithms, the method achieves a higher re-identification rate with a more comprehensive similarity measure.

Zhonghua Huo, Ying Chen, Chunjian Hua

Sleep Pose Recognition in an ICU Using Multimodal Data and Environmental Feedback

Clinical evidence suggests that sleep pose analysis can shed light onto patient recovery rates and responses to therapies. In this work, we introduce a formulation that combines features from multimodal data to classify human sleep poses in an Intensive Care Unit (ICU) environment. As opposed to the current methods that combine data from multiple sensors to generate a single feature, we extract features independently. We then use these features to estimate candidate labels and infer a pose. Our method uses modality trusts – each modality’s classification ability – to handle variable scene conditions and to deal with sensor malfunctions. Specifically, we exploit shape and appearance features extracted from three sensor modalities: RGB, depth, and pressure. Classification results indicate that our method achieves 100 % accuracy (outperforming previous techniques by 6 %) in bright and clear (ideal) scenes, 70 % in poorly illuminated scenes, and 90 % in occluded ones.

Carlos Torres, Scott D. Hammond, Jeffrey C. Fried, B. S. Manjunath

Hardware-Implemented and Real-Time Vision Systems


A Flexible High-Resolution Real-Time Low-Power Stereo Vision Engine

Stereo Vision has been a focus of research for decades. In the meantime, many real-time stereo vision systems are available on low-power platforms. Several products using stereo vision exist on the market. So far, all of them are based on image sizes up to 1 MP. They either use a local correlation-like stereo engine or perform some variant of Semi-Global Matching (SGM).

However, many modern cameras deliver 2 MP images (full High Definition) at framerates beyond 20 Hz. In this contribution we propose a stereo vision engine tailored for automotive and mobile applications, that is able to process 2 MP images in real-time. Note that also the disparity range has to be increased when maintaining the same field of view with higher resolution. We implement the SGM algorithm with search space reduction techniques on a reconfigurable hardware platform, yielding a low power consumption of under 1 W. The algorithm runs at 22 Hz processing 2 MP image pairs and computing disparity maps with up to 255 disparities. The conducted evaluations on the KITTI Dataset and on a challenging bad weather dataset show that full depth resolution is obtained for small disparities and robustness of the method is maintained at a fraction of the resources of a regular SGM engine.

Stefan K. Gehrig, Reto Stalder, Nicolai Schneider

Real Time Vision System for Obstacle Detection and Localization on FPGA

Obstacle detection is a mandatory function for a robot navigating in an indoor environment, especially when interaction with humans takes place in a cluttered environment. Commonly used vision-based solutions like SLAM (Simultaneous Localization and Mapping) or optical flow tend to be computation intensive and require powerful computation resources to meet low speed real-time constraints. Solutions using LIDAR (Light Detection And Ranging) sensors are more robust but not cost effective. This paper presents a real-time hardware architecture for vision-based obstacle detection and localization based on IPM (Inverse Perspective Mapping) for obstacle detection, and Otsu’s method plus Bresenham’s algorithm for obstacle segmentation and localization under the hypothesis of a flat ground. Compared to existing implementations, the proposed architecture combines cost effectiveness, high frame rate with low latency and low power consumption, and requires no prior knowledge of the scene.

Ali Alhamwi, Bertrand Vandeportaele, Jonathan Piat
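Bresenham's algorithm, named in the abstract above, rasterises a line segment using only integer additions and comparisons, which is one reason it maps well to hardware. The sketch below is the standard all-octant software form; how the authors apply it inside their FPGA pipeline is not stated in the abstract, and the function name `bresenham` is an assumption made for illustration.

```python
def bresenham(x0, y0, x1, y1):
    """Return the integer grid points on the segment from (x0, y0) to (x1, y1)."""
    points = []
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1   # step direction along x
    sy = 1 if y0 < y1 else -1   # step direction along y
    err = dx + dy               # combined error term, integers only
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:            # error favours a step in x
            err += dy
            x0 += sx
        if e2 <= dx:            # error favours a step in y
            err += dx
            y0 += sy
    return points


# Example: walk a ray across a grid from (0, 0) to (5, 2).
ray = bresenham(0, 0, 5, 2)
```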

Bayesian Formulation of Gradient Orientation Matching

Gradient orientations are a common feature used in many computer vision algorithms. They are a good feature when the gradient magnitudes are high, but can be very noisy when the magnitudes are low. This means that some gradient orientations are matched with more confidence than others. By estimating this uncertainty, more weight can be put on the confident matches than on those with higher uncertainty. To enable this, we derive the probability distribution of gradient orientations based on a signal-to-noise ratio defined as the gradient magnitude divided by the standard deviation of the Gaussian noise. The noise level is reasonably invariant over time, while the magnitude has to be measured for every frame. Using this probability distribution we formulate the matching of gradient orientations as a Bayesian classification problem.

A common application where this is useful is feature point matching. Another application is background/foreground segmentation. This paper will use the latter application as an example, but is focused on the general formulation. It is shown how the theory can be used to implement a very fast background/foreground segmentation algorithm that is capable of handling complex lighting variations.

Håkan Ardö, Linus Svärm

Can Speedup Assist Accuracy? An On-Board GPU-Accelerated Image Georeference Method for UAVs

This paper presents a georeferenced map extraction method for Medium-Altitude Long-Endurance UAVs. The adopted technique of projecting world points to an image plane is a perfect candidate for a GPU implementation. The achieved high frame rate leads to a plethora of measurements even in the case of a low-power mobile processing unit. These measurements can later be combined in order to refine the output and create a more accurate result.

Loukas Bampis, Evangelos G. Karakasis, Angelos Amanatiadis, Antonios Gasteratos

High-Level Vision


Surface Reconstruction from Intensity Image Using Illumination Model Based Morphable Modeling

We present a new method for reconstructing the depth of a known object from a single still image using the deformed underneath sign matrix of a similar object. Existing Shape from Shading (SFS) methods try to establish a relationship between the intensity values of a still image and the surface normals of the corresponding depth, but most of them resort to error-minimization-based approaches. Given that these reconstruction approaches are fundamentally ill-posed, they have limited success for surfaces like a human face. Photometric Stereo (PS) or Structure from Motion (SfM) based methods extend SFS by adding additional information or constraints about the target. Our goal is identical to that of SFS; however, we tackle the problem by building a relationship between the gradient of depth and the intensity value at the corresponding image location of the same object. This formula is simplified and approximated for handling different materials and lighting conditions, and the underneath sign matrix is obtained by resizing and deforming the Region of Interest (ROI) with respect to its counterpart in a similar object. The target object is then reconstructed from its still image. In addition, delicate details of the surface are rebuilt using a Gabor Wavelet Network (GWN) on different ROIs. Finally, for merging the patches together, a Self-Organizing Maps (SOM) based method is used to retrieve and smooth the boundary parts of ROIs. Compared with state-of-the-art SFS based methods, the proposed method yields promising results on both widely used benchmark datasets and images in the wild.

Zhi Yang, Varun Chandola

Learning Appearance Features for Pain Detection Using the UNBC-McMaster Shoulder Pain Expression Archive Database

We propose a supervised approach to solve the task of automatic pain detection from facial expressions. A pain detection algorithm should be both robust to face pose and the identity of the face (identity bias). In order to achieve invariance to face pose, we use an Active Appearance Model (AAM) to warp all face images into frontal pose. The main contribution of our paper is a discriminative feature extractor that addresses identity bias by learning appearance features that separate pain-related factors from other factors, such as those related to the identity of the face. The system achieves state-of-the-art performance on the UNBC-McMaster Shoulder Pain Expression Archive Database.

Henrik Pedersen

How Good Is Kernel Descriptor on Depth Motion Map for Action Recognition

This paper presents a new method for action recognition using depth data. Each depth sequence is represented by depth motion maps from three projection views (front, side and top) to exploit different aspects of the motion. However, unlike state-of-the-art works that extract local binary patterns or histograms of oriented gradients, we describe an action based on a gradient kernel descriptor. The proposed method is evaluated on two benchmark datasets (MSRAction3D and MSRGestures3D) and obtains performance that is very competitive with the best state-of-the-art methods. Our best recognition rate is 91.57 % on MSRAction3D and 100 % on MSRGestures3D, whereas [ ] achieved 93.77 % and 94.60 % respectively.

Thanh-Hai Tran, Van-Toi Nguyen

An Informative Logistic Regression for Cross-Domain Image Classification

Cross-domain image classification is a challenging problem in numerous practical applications and has attracted a lot of interest from the research and industry communities. It differs from traditional closed-set image classification due to the variance between the training and testing datasets. Although the semantics of the image categories are the same, the image variance between testing and training often results in a significant loss of performance. To solve the problem, most previous works resort to data pre-processing approaches, such as minimizing the difference between the distributions of the training and testing datasets. In this paper, we propose a novel informative feature preserving classifier for cross-domain image classification. We introduce the idea of maximizing the variance of unlabeled training data into an L1 based logistic regression model, so that the informative features can be preserved during model training, which consequently leads to a performance improvement in testing. Experiments conducted on commonly used benchmarks for cross-domain image classification show that our method significantly outperforms the state of the art.

Guangtang Zhu, Hanfang Yang, Lan Lin, Guichun Zhou, Xiangdong Zhou

Robust Facial Feature Localization using Data-Driven Semi-supervised Learning Approach

In this paper, we present a novel method for localizing facial feature points with generalization ability, based on a data-driven semi-supervised learning approach. Even though a powerful facial feature detector can be built using a large amount of human-annotated training data, the collection process is time-consuming and very often impractical due to the high cost and error-prone nature of manual annotation. The proposed method takes advantage of data-driven semi-supervised learning that optimizes a hybrid detector by interacting with a hierarchical data model to suppress and regularize noisy outliers. Competitive performance compared to other state-of-the-art techniques is shown on the benchmark datasets Bosphorus and BioID.

Yoon Young Kim, Sung Jin Hong, Ji Hye Rhee, Mi Young Nam, Phill Kyu Rhee

Quantitative Analysis of Surface Reconstruction Accuracy Achievable with the TSDF Representation

In recent years, KinectFusion and related algorithms have facilitated significant advances in real-time simultaneous localization and mapping (SLAM) with depth-sensing cameras. Nearly all of these algorithms represent the observed area with the truncated signed distance function (TSDF). The reconstruction accuracy achievable with the representation is crucial for camera pose estimation and object reconstruction. Therefore, we evaluate this reconstruction accuracy in an optimal context, i.e. assuming error-free camera pose estimation and depth measurement. For this purpose we use a synthetic dataset of depth image sequences and corresponding camera pose ground truth and compare the reconstructed point clouds with the ground truth meshes. We investigate several influencing factors, especially the TSDF resolution, and show that the TSDF is a very powerful representation even for low resolutions.

Diana Werner, Philipp Werner, Ayoub Al-Hamadi
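To make the TSDF representation mentioned in the abstract above concrete, here is a minimal one-dimensional sketch: voxels along a single camera ray store a truncated signed distance, depth measurements are fused with a running average (the common KinectFusion-style update), and the surface is recovered as the interpolated zero crossing. The voxel spacing, truncation distance, depth values and function names are assumptions for illustration and are not taken from the paper.

```python
def tsdf_update(tsdf, weights, voxel_depths, measured_depth, truncation):
    """Fuse one depth measurement into the voxels along a single ray."""
    for i, z in enumerate(voxel_depths):
        sdf = measured_depth - z          # positive in front of the surface
        if sdf < -truncation:
            continue                      # far behind the surface: not observed
        value = min(1.0, sdf / truncation)
        # Running average with unit weight per measurement.
        tsdf[i] = (tsdf[i] * weights[i] + value) / (weights[i] + 1)
        weights[i] += 1


def zero_crossing(tsdf, weights, voxel_depths):
    """Locate the surface by linear interpolation between neighbouring voxels."""
    for i in range(len(tsdf) - 1):
        a, b = tsdf[i], tsdf[i + 1]
        if weights[i] > 0 and weights[i + 1] > 0 and a > 0 >= b:
            t = a / (a - b)
            return voxel_depths[i] + t * (voxel_depths[i + 1] - voxel_depths[i])
    return None


# Coarse grid (0.5 units between voxels) and two hypothetical depth readings.
voxel_depths = [0.0, 0.5, 1.0, 1.5, 2.0]
tsdf = [0.0] * len(voxel_depths)
weights = [0] * len(voxel_depths)
for depth in (1.2, 1.3):
    tsdf_update(tsdf, weights, voxel_depths, depth, truncation=1.0)
surface = zero_crossing(tsdf, weights, voxel_depths)
```

In this toy case the recovered surface lies between two voxel centres, which illustrates how interpolating the stored distances can locate a surface more finely than the voxel spacing.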

Learning and Adaptation


An Online Adaptive Fuzzy Clustering and Its Application for Background Suppression

Background suppression in video sequences has attracted growing attention and is one of the hot topics in almost every video processing task. An online fuzzy clustering method for automatic background suppression is presented in this paper. First, in classical fuzzy clustering methods, we have to wait until all data have been generated before the learning process begins. This is impractical because, in real background suppression applications, the video length is unknown and the video frames are generated dynamically in a streaming environment and arrive one at a time. Our method is able to adapt and change through complex scenes in a true online fashion. Secondly, unlike previous works on background suppression, where the information of the detected background is ignored, we propose a new way to incorporate this information. Finally, to estimate the model parameters, the scoring method is adopted to minimize the fuzzy objective function with the Kullback-Leibler divergence information. Experiments on real datasets are presented. The performance of the proposed model is compared to that of other background modeling techniques, demonstrating the robustness and accuracy of our method.

Thanh Minh Nguyen, Q. M. Jonathan Wu, Dibyendu Mukherjee

Object Detection and Terrain Classification in Agricultural Fields Using 3D Lidar Data

Autonomous navigation and operation of agricultural vehicles is a challenging task due to the rather unstructured environment. An uneven terrain consisting of ground and vegetation combined with the risk of non-traversable obstacles necessitates a strong focus on safety and reliability. This paper presents an object detection and terrain classification approach for classifying individual points from 3D point clouds acquired using single multi-beam lidar scans. Using a support vector machine (SVM) classifier, individual 3D points are categorized as either ground, vegetation, or object based on features extracted from local neighborhoods. Experiments performed at a local working farm show that the proposed method has a combined classification accuracy of


, detecting points belonging to objects such as humans, animals, cars, and buildings with


accuracy, while classifying vegetation with an accuracy of



Mikkel Kragh, Rasmus N. Jørgensen, Henrik Pedersen

Enhanced Residual Orientation for Improving Fingerprint Quality

Fingerprints possess unique, hard-to-lose and reliable characteristics, and in recent years they have been widely applied in biometrics. However, in fingerprint identification, blurred images often occur owing to uneven pressing force and result in recognition errors. This study proposes an innovative fingerprint quality improvement algorithm that enhances the contrast of the fingerprint image and reduces blur. By employing the D4 discrete wavelet transformation, images are transformed from the spatial domain into four frequency-domain sub-bands. Then interactive compensation is performed on each band through the multi-resolution characteristic of the wavelet transformation and singular value decomposition. Finally, compensated images are reconstructed through the inverse wavelet transformation. After going through our developed fuzzy fingerprint detection system, the fuzzy extent of compensated images can be effectively improved for later backend identification. This study employed the NIST-4 and FVC fingerprint databases. The experimental results showed that our method effectively reduces blur in fingerprints.

Jing-Wein Wang, Ngoc Tuyen Le, Tzu-Hsiung Chen

Learning Human Priors for Task-Constrained Grasping

An autonomous agent using manmade objects must understand how the task conditions grasp placement. In this paper we formulate task-based robotic grasping as a feature learning problem. Using a human demonstrator to provide examples of grasps associated with a specific task, we learn a representation such that similarity in task is reflected by similarity in feature. The learned representation discards parts of the sensory input that are redundant for the task, allowing the agent to ground and reason about the features relevant to the task. Synthesized grasps for an observed task on previously unseen objects can then be filtered and ordered by matching to learned instances, without the need for an analytically formulated metric. We show on a real robot how our approach is able to utilize the learned representation to synthesize and perform valid task-specific grasps on novel objects.

Martin Hjelm, Carl Henrik Ek, Renaud Detry, Danica Kragic

Region-of-Interest Retrieval in Large Image Datasets with Voronoi VLAD

We investigate the problem of visual-query based retrieval from large image datasets when the visual queries comprise arbitrary regions of interest (ROI) rather than entire images. Our proposal is a compact image descriptor that combines the vector of locally aggregated descriptors (VLAD) of Jegou et al. with a multi-level, Voronoi-based spatial partitioning of each dataset image; it is termed the Voronoi VLAD (VVLAD). The proposed multi-level Voronoi partitioning uses a spatial hierarchical K-means over interest-point locations, and computes a VLAD over each cell. In order to reduce the matching complexity when handling very large datasets, we propose the following modifications. First, we utilize the tree structure of the spatial hierarchical K-means to perform a top-to-bottom pruning for local similarity maxima, rather than exhaustively matching against all cells (Fast-VVLAD). Second, we propose to aggregate VLADs of adjacent Voronoi cells in order to reduce the overall VVLAD storage requirement per image. Finally, we propose a new image similarity score for Fast-VVLAD that combines relevant information from all partition levels into a single measure for similarity. For a range of ROI queries in two standard datasets, Fast-VVLAD achieves comparable or higher mean Average Precision against the state-of-the-art Multi-VLAD framework while offering more than two-fold acceleration.

Aaron Chadha, Yiannis Andreopoulos
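The VLAD building block referred to in the abstract above can be sketched as follows: each local descriptor is assigned to its nearest codebook centroid, the residuals are summed per centroid, and the concatenated result is L2-normalised. This shows plain VLAD over one set of descriptors only; the Voronoi partitioning, pruning and similarity score of VVLAD are not reproduced. The toy codebook, descriptors and the function name `vlad` are assumptions for illustration.

```python
import math


def vlad(descriptors, centroids):
    """Aggregate residuals to the nearest centroid and L2-normalise the result."""
    k, d = len(centroids), len(centroids[0])
    residuals = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Hard assignment to the closest centroid (squared Euclidean distance).
        nearest = min(
            range(k),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])),
        )
        for t in range(d):
            residuals[nearest][t] += x[t] - centroids[nearest][t]
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Hypothetical 2-D codebook with two visual words and three local descriptors.
centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
vector = vlad(descriptors, centroids)
```

In a per-cell scheme such as the one the abstract describes, a function like this would be applied to the descriptors falling inside each spatial cell.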

Geostatistics for Context-Aware Image Classification

Context information is fundamental for image understanding. Many algorithms add context information by including semantic relations among objects such as neighboring tendencies, relative sizes and positions. To achieve context inclusion, popular context-aware classification methods rely on probabilistic graphical models such as Markov Random Fields (MRF) or Conditional Random Fields (CRF). However, recent studies showed that MRF/CRF approaches do not perform better than a simple smoothing on the labeling results.

The need for more context awareness has motivated the use of different methods in which the semantic relations between objects are further enforced. We found that in particular application scenarios, where some specific assumptions can be made, the use of context relationships is far more effective.

We propose a new method, called


, to compute the labels of mosaic images with context label agreement. Our method trains a transition probability model to enforce properties such as class size and proportions. The method draws inspiration from Geostatistics, usually used to model spatial uncertainties. We tested the proposed method in two different ocean seabed classification contexts, obtaining state-of-the-art results.

Felipe Codevilla, Silvia S. C. Botelho, Nelson Duarte, Samuel Purkis, A. S. M. Shihavuddin, Rafael Garcia, Nuno Gracias

Robot Vision


Querying 3D Data by Adjacency Graphs

The need for robots to search the 3D data they have saved is becoming more apparent. We present an approach for finding structures in 3D models such as those built by robots of their environment. The method extracts geometric primitives from point cloud data. An attributed graph over these primitives forms our representation of the surface structures. Recurring substructures are found with frequent graph mining techniques. We investigate if a model invariant to changes in size and reflection using only the geometric information of and between primitives can be discriminative enough for practical use. Experiments confirm that it can be used to support queries of 3D models.

Nils Bore, Patric Jensfelt, John Folkesson

Towards a Robust System Helping Underwater Archaeologists Through the Acquisition of Geo-referenced Optical and Acoustic Data

In the framework of the ARROWS project (September 2012 - August 2015), a venture funded by the European Commission, several modular Autonomous Underwater Vehicles (AUV) have been developed for the main purposes of mapping, diagnosing, cleaning, and securing underwater and coastal archaeological sites. These AUVs consist of modular mobile robots, designed and manufactured according to specific suggestions formulated by a pool of archaeologists with long-standing experience in the field of Underwater Cultural Heritage preservation. The vehicles are typically equipped with acoustic modems to communicate during the dive and with different payload devices to sense the environment. The selected sensors are appealing choices for the oceanographic engineer since they provide complementary information about the surrounding environment. The main topics discussed in this paper concern (i) performing a systematic mapping of the marine seafloor, (ii) processing the output maps to detect and classify potential archaeological targets and finally (iii) developing dissemination systems with the purpose of creating virtual scenes as a photorealistic and informative representation of the surveyed underwater sites.

Benedetto Allotta, Riccardo Costanzi, Massimo Magrini, Niccoló Monni, Davide Moroni, Maria Antonietta Pascali, Marco Reggiannini, Alessandro Ridolfi, Ovidio Salvetti, Marco Tampucci

3D Object Pose Refinement in Range Images

Estimating the pose of objects from range data is a problem of considerable practical importance for many vision applications. This paper presents an approach for accurate and efficient 3D pose estimation from 2.5D range images. Initialized with an approximate pose estimate, the proposed approach refines it so that it accurately accounts for an acquired range image. This is achieved by using a hypothesize-and-test scheme that combines Particle Swarm Optimization (PSO) and graphics-based rendering to minimize a cost function of object pose that quantifies the misalignment between the acquired and a hypothesized, rendered range image. Extensive experimental results demonstrate the superior performance of the approach compared to the Iterative Closest Point (ICP) algorithm that is commonly used for pose refinement.

Xenophon Zabulis, Manolis Lourakis, Panagiotis Koutlemanis

Revisiting Robust Visual Tracking Using Pixel-Wise Posteriors

In this paper we present an in-depth evaluation of a recently published tracking algorithm [ ] which intelligently couples rigid registration and color-based segmentation using level sets. The original method did not arouse the interest it deserved in the community, most likely due to challenges in reimplementation and the lack of a quantitative evaluation. Therefore, we reimplemented this baseline approach, evaluated it on state-of-the-art datasets (VOT and OOT) and compared it to alternative segmentation-based tracking algorithms. We believe this is a valuable contribution as such a comparison is missing in the literature. The impressive results help to promote segmentation-based tracking algorithms, which are currently under-represented in the visual tracking benchmarks. Furthermore, we present various extensions to the color model, which improve the performance in challenging situations such as confusion between foreground and background. Last, but not least, we discuss implementation details to speed up the computation by using only a sparse set of pixels for the propagation of the contour, which results in a tracking speed of up to 200 Hz for typical object sizes using a single core of a standard 2.3 GHz CPU.

Falk Schubert, Daniele Casaburo, Dirk Dickmanns, Vasileios Belagiannis

Teach it Yourself - Fast Modeling of Industrial Objects for 6D Pose Estimation

In this paper, we present a vision system that allows a human to create new 3D models of novel industrial parts by placing the part in two different positions in the scene. The two-shot modeling framework generates models with a precision that allows the model to be used for 6D pose estimation without loss in pose accuracy. We quantitatively show that our modeling framework reconstructs noisy but adequate object models with a mean RMS error of 2.7 mm, a mean standard deviation of 0.025 mm and a completeness of 70.3 % over all 14 reconstructed models, compared to the ground truth CAD models. In addition, the models are applied in a pose estimation application, evaluated with 37 different scenes with 61 unique object poses. The pose estimation results show a mean translation error of 4.97 mm and a mean rotation error of 3.38 degrees.

Thomas Sølund, Thiusius Rajeeth Savarimuthu, Anders Glent Buch, Anders Billesø Beck, Norbert Krüger, Henrik Aanæs

Shape Dependency of ICP Pose Uncertainties in the Context of Pose Estimation Systems

The iterative closest point (ICP) algorithm is used to fine tune the alignment of two point clouds in many pose estimation algorithms. The uncertainty in these pose estimation algorithms is thus mainly dependent on the pose uncertainty in ICP.

This paper investigates the uncertainties in the ICP algorithm by the use of Monte Carlo simulation. A new descriptor based on object shape and a pose error descriptor are introduced. Results show that it is reasonable to approximate the pose errors by multivariate Gaussian distributions, and that there is a linear relationship between the parameters of the Gaussian distributions and the shape descriptor. As a consequence the shape descriptor potentially provides a computationally cheap way to approximate pose uncertainties.

Thorbjørn Mosekjær Iversen, Anders Glent Buch, Norbert Krüger, Dirk Kraft

D²CO: Fast and Robust Registration of 3D Textureless Objects Using the Directional Chamfer Distance

This paper introduces a robust and efficient vision-based method for object detection and 3D pose estimation that exploits a novel edge-based registration algorithm we call Direct Directional Chamfer Optimization (D²CO). Our approach is able to handle textureless and partially occluded objects and does not require any off-line object learning step. Depth edges and visible patterns extracted from the 3D CAD model of the object are matched against edges detected in the current grey level image by means of a 3D distance transform represented by an image tensor, which encodes the minimum distance to an edge point in a joint direction/location space. D²CO refines the object position by employing a non-linear optimization procedure, where the cost being minimized is extracted directly from the 3D image tensor. Unlike other popular registration algorithms such as ICP, which require constantly updating the correspondences between points, our approach does not require any iterative re-association step: the data association is implicitly optimized while inferring the object position. This enables D²CO to obtain a considerable gain in speed over other registration algorithms while presenting a wider basin of convergence. We tested our system with a set of challenging untextured objects in the presence of occlusions and cluttered background, showing accurate results and often outperforming other state-of-the-art methods.
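The image tensor described above can be sketched by brute force. This is only an illustration of a directional chamfer cost in a joint location/direction space, not the D²CO implementation, which the abstract says uses a 3D distance transform. The weighting `lam`, the number of direction bins and the function name are assumptions.

```python
import math


def directional_chamfer_tensor(edges, width, height, n_dirs=4, lam=1.0):
    """Brute-force tensor[k][y][x]: minimum cost from pixel (x, y) with
    quantized direction k to any edge point (ex, ey, etheta).

    Cost = Euclidean distance + lam * angular difference (assumed form).
    Edge directions are undirected, so angles live in [0, pi).
    """
    def ang_diff(a, b):
        d = abs(a - b) % math.pi
        return min(d, math.pi - d)

    tensor = [[[0.0] * width for _ in range(height)] for _ in range(n_dirs)]
    for k in range(n_dirs):
        phi = k * math.pi / n_dirs  # direction represented by bin k
        for y in range(height):
            for x in range(width):
                tensor[k][y][x] = min(
                    math.hypot(x - ex, y - ey) + lam * ang_diff(phi, et)
                    for ex, ey, et in edges
                )
    return tensor
```

A registration cost for a pose hypothesis would then be the sum of tensor look-ups at the projected model edge points, which needs no explicit correspondence update between iterations.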

Marco Imperoli, Alberto Pretto

Comparative Evaluation of 3D Pose Estimation of Industrial Objects in RGB Pointclouds

3D pose estimation is a crucial element for enabling robots to work in industrial environments and perform tasks like bin-picking or depalletizing. Even though various pose estimation algorithms exist, they usually deal with common daily objects in lab environments. Coping with real-world industrial objects is a much harder challenge for most pose estimation techniques due to the difficult material and structural properties of those objects. A comparative evaluation of pose estimation algorithms with regard to these object characteristics has yet to be done. This paper aims to provide a description and evaluation of selected state-of-the-art pose estimation techniques to investigate their object-related performance in terms of time and accuracy. The evaluation shows that there is indeed no general algorithm that solves the task for all objects, but it outlines the issues that real-world applications have to deal with and the strengths and weaknesses of the different pose estimation approaches.

Bjarne Großmann, Mennatullah Siam, Volker Krüger

Object Detection Using a Combination of Multiple 3D Feature Descriptors

This paper presents an approach for object pose estimation using a combination of multiple feature descriptors. We propose to use a combination of three feature descriptors, capturing both surface and edge information. Those descriptors individually perform well for different object classes. We use scenes from an established RGB-D dataset and our own recorded scenes to justify the claim that by combining multiple features, we in general achieve better performance. We present quantitative results for descriptor matching and object detection for both datasets.

Lilita Kiforenko, Anders Glent Buch, Norbert Krüger

Differential Optical Flow Estimation Under Monocular Epipolar Line Constraint

In this paper, a new method is presented that uses the epipolar constraint for the estimation of optical flow. We derive the formulation needed to express the epipolar constraint in terms of the optical flow components, forcing the components to transform points from the first frame to the next consecutive frame such that the points lie on their corresponding epipolar lines. In this work, no smoothness term is utilized and the performance of the proposed method is evaluated based only on data terms. We conducted different evaluations using two different point matching methods (SIFT and Lucas-Kanade) and used them in two different fundamental matrix estimation methods required to calculate the epipolar line coefficients. It is demonstrated that the epipolar constraint yields noticeable improvements in almost all cases.
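For orientation, the standard way to write an epipolar constraint in terms of flow components is shown below. The symbols are assumptions chosen for illustration (fundamental matrix $F$, pixel $(x, y)$, flow $(u, v)$, line coefficients $a, b, c$), and the paper's exact formulation may differ.

```latex
% Epipolar line in the second frame for a pixel (x, y) in the first frame
\begin{pmatrix} a \\ b \\ c \end{pmatrix}
  = F \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}

% The displaced point (x + u, y + v) must lie on that line
a\,(x + u) + b\,(y + v) + c = 0

% Rearranged: one linear equation in the flow components (u, v)
a\,u + b\,v + (a\,x + b\,y + c) = 0
```

The last line is linear in $(u, v)$, so it can be combined with a brightness-constancy data term without a smoothness term, which matches the setting the abstract describes.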

Mahmoud A. Mohamed, M. Hossein Mirabdollah, Bärbel Mertsching

General Object Tip Detection and Pose Estimation for Robot Manipulation

Robot manipulation tasks like inserting screws and pegs into a hole or automatic screwing require precise tip pose estimation. We propose a novel method to detect and estimate the tip of elongated objects. We demonstrate that our method can estimate tip pose to millimeter-level accuracy. We adopt a probabilistic, appearance-based object detection framework to detect pegs and bits for electric screw drivers. Screws are difficult to detect with feature- or appearance-based methods due to their reflective characteristics. To overcome this we propose a novel adaptation of RANSAC with a parallel-line model. Subsequently, we employ image moments to detect the tip and its pose. We show that the proposed method allows a robot to perform object insertion with only two pairs of orthogonal views, without visual servoing.
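As a rough illustration of how image moments give the orientation and an extreme point of an elongated object, the following sketch works on a binary mask. It is not the authors' method: the choice of the tip as the pixel farthest along the positive major-axis direction is an assumption, and `orientation_and_tip` is a hypothetical name.

```python
import math


def orientation_and_tip(mask):
    """Return (centroid, angle, tip) for a binary mask given as rows of 0/1.

    The angle of the major axis comes from second-order central moments.
    The tip is taken (by assumption) as the foreground pixel with the
    largest projection onto the major axis.
    """
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    m00 = len(pts)
    cx = sum(x for x, _ in pts) / m00
    cy = sum(y for _, y in pts) / m00
    mu20 = sum((x - cx) ** 2 for x, _ in pts) / m00
    mu02 = sum((y - cy) ** 2 for _, y in pts) / m00
    mu11 = sum((x - cx) * (y - cy) for x, y in pts) / m00
    theta = 0.5 * math.atan2(2 * mu11, mu20 - mu02)
    ax, ay = math.cos(theta), math.sin(theta)
    tip = max(pts, key=lambda p: (p[0] - cx) * ax + (p[1] - cy) * ay)
    return (cx, cy), theta, tip
```

Moments alone leave a 180 degree ambiguity about which end is the tip, so a real system would need an extra cue (for example, the known gripper side) to resolve it.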

Dadhichi Shukla, Özgür Erkent, Justus Piater

Visual Estimation of Attentive Cues in HRI: The Case of Torso and Head Pose

Capturing visual human-centered information is a fundamental input source for effective and successful human-robot interaction (HRI) in dynamic multi-party social settings. Torso and head pose, as forms of nonverbal communication, support the derivation of people’s focus of attention, a key variable in the analysis of human behaviour in HRI paradigms encompassing social aspects. Towards this goal, we have developed a model-based approach for torso and head pose estimation to overcome key limitations in free-form interaction scenarios and issues of partial intra- and inter-person occlusions. The proposed approach builds upon the concept of Top View Re-projection (TVR) to uniformly treat the respective body parts, modelled as cylinders. For each body part, a number of pose hypotheses are sampled from its configuration space. Each pose hypothesis is evaluated against a scoring function, and the hypothesis with the best score yields the assumed pose and the location of the joints. A refinement step on head pose is applied, based on tracking facial patch deformations, to compute the horizontal off-plane rotation. The overall approach forms one of the core components of a vision system integrated in a robotic platform that supports socially appropriate, multi-party, multimodal interaction in a bartending scenario. Results in the robot’s environment during real HRI experiments with a varying number of users attest to the effectiveness of our approach.

Markos Sigalas, Maria Pateraki, Panos Trahanias

Vision Systems Applications


Efficient Media Retrieval from Non-Cooperative Queries

Text is ubiquitous in the artificial world and easily attainable when it comes to book title and author names. Using the images from the book cover set from the Stanford Mobile Visual Search dataset and additional book covers and metadata from, we construct a large scale book cover retrieval dataset, complete with 100 K distractor covers and title and author strings for each.

Because our query images are poorly conditioned for clean text extraction, we propose a method for extracting noisy and erroneous OCR readings and matching them against clean author and book title strings in a standard document look-up problem setup. Finally, we demonstrate how to use this text matching as a feature in conjunction with popular retrieval features such as VLAD, using a simple learning setup, to achieve significant improvements in retrieval accuracy over that of either VLAD or the text alone.

Kevin Shih, Wei Di, Vignesh Jagadeesh, Robinson Piramuthu

Quantifying the Effect of a Colored Glove in the 3D Tracking of a Human Hand

Research in vision-based 3D hand tracking targets primarily the scenario in which a bare hand performs unconstrained motion in front of a camera system. Nevertheless, in several important application domains, augmenting the hand with color information so as to facilitate the tracking process constitutes an acceptable alternative. With this observation in mind, in this work we propose a modification of a state-of-the-art method for markerless 3D hand tracking that takes advantage of the richer observations resulting from a colored glove. We do so by modifying the 3D hand model employed in the aforementioned hypothesize-and-test method as well as the objective function that is minimized in its optimization step. Quantitative and qualitative results obtained from a comparative evaluation of the baseline method and the proposed approach confirm that the latter achieves a remarkable increase in tracking accuracy and robustness and, at the same time, drastically reduces the associated computational costs.

Konstantinos Roditakis, Antonis A. Argyros

An Efficient Eye Tracking Using POMDP for Robust Human Computer Interaction

We propose an adaptive eye tracking system for robust human-computer interaction under dynamically changing environments, based on the partially observable Markov decision process (POMDP). In our system, real-time eye tracking optimization is tackled using a flexible world-context model based POMDP approach that requires less data and time for adaptation than hard world-context model approaches. The challenge is to divide the huge belief space into world-context models, and to search for optimal control parameters in the current world-context model under real-time constraints. The offline learning determines multiple world-context models based on image-quality analysis over the joint space of transition, observation and reward distributions, and an approximate world-context model is balanced with the online learning over a localized horizon. The online learning is formulated as dynamic parameter control with incomplete information under real-time constraints, and is solved by the real-time Q-learning approach. Extensive experiments conducted using realistic videos have provided us with very encouraging results.

Ji Hye Rhee, Won Jun Sung, Mi Young Nam, Hyeran Byun, Phill Kyu Rhee

A Vision-Based System for Movement Analysis in Medical Applications: The Example of Parkinson Disease

We present a vision-based approach for analyzing a Parkinson patient’s movements during rehabilitation treatments. We describe therapeutic movements using relevant quantitative measurements, which can be applied both for diagnosis and monitoring of the disease progress.

Since our long-term goal is to develop an affordable and portable system, suitable for home usage, we use the Kinect device for data acquisition. All recorded exercises are approved by neurologists and therapists and designed to examine the presence of characteristic symptoms caused by neurological disorders. In this study, we focus on Parkinson’s patients in the early stages of the disease.

Our approach highlights relevant rehabilitation measurements and makes it possible to determine which ones are more informative for separating healthy from non-healthy subjects. Finally, we propose the symmetry ratio, well known in motor control, as a novel feature that can be extracted from rehabilitation exercises and used in decision-making (diagnosis support) and monitoring procedures.

Sofija Spasojević, José Santos-Victor, Tihomir Ilić, Slađan Milanović, Veljko Potkonjak, Aleksandar Rodić

Estimating the Number of Clusters with Database for Texture Segmentation Using Gabor Filter

This paper presents a novel solution to the problem of texture-based image segmentation using Gabor filters. Gabor filters work well for texture segmentation, but one problem remains: determining the number of clusters. Several studies estimate the number of clusters with statistical approaches such as the gap statistic. However, applying those methods to texture segmentation is problematic in terms of accuracy and time complexity. To overcome these limits, this paper proposes a novel method to estimate the optimal number of clusters for texture segmentation by using a training dataset and several assumptions that are appropriate for image segmentation. We evaluate the proposed method on a dataset consisting of texture images, limiting the possible number of clusters to between 2 and 5. We also evaluate the proposed method on real images containing various textures, such as rock strata.

Minkyu Kim, Jeong-Mook Lim, Heesook Shin, Changmok Oh, Hyun-Tae Jeong

Robust Marker-Based Tracking for Measuring Crowd Dynamics

We present a system to conduct laboratory experiments with thousands of pedestrians. Each participant is equipped with an individual marker to enable us to perform precise tracking and identification. We propose a novel rotation invariant marker design which guarantees a minimal Hamming distance between all used codes. This increases the robustness of pedestrian identification. We present an algorithm to detect these markers, and to track them through a camera network. With our system we are able to capture the movement of the participants in great detail, resulting in precise trajectories for thousands of pedestrians. The acquired data is of great interest in the field of pedestrian dynamics. It can also potentially help to improve multi-target tracking approaches, by allowing better insights into the behaviour of crowds.
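To make the idea of a guaranteed minimal Hamming distance under rotation concrete, here is a small sketch. It does not reproduce the paper's marker design. It treats a marker as a cyclic bit string, which is an assumption, and greedily selects codes whose distance to every already chosen code stays at or above a threshold for all cyclic rotations. All names are hypothetical.

```python
from itertools import product


def rotations(code):
    """All cyclic rotations of a bit string."""
    return [code[i:] + code[:i] for i in range(len(code))]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def rotation_invariant_distance(a, b):
    """Smallest Hamming distance between a and any rotation of b."""
    return min(hamming(a, r) for r in rotations(b))


def select_codes(n_bits, min_dist):
    """Greedily pick codes that keep min_dist to all chosen codes,
    regardless of how the marker is rotated."""
    chosen = []
    for bits in product("01", repeat=n_bits):
        cand = "".join(bits)
        if all(rotation_invariant_distance(c, cand) >= min_dist for c in chosen):
            chosen.append(cand)
    return chosen
```

With a minimum distance of `d`, up to `(d - 1) // 2` misread bits can still be corrected to the right identity, which is how a larger minimal distance translates into more robust identification.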

Wolfgang Mehner, Maik Boltes, Markus Mathias, Bastian Leibe

Including 3D-textures in a Computer Vision System to Analyze Quality Traits of Loin

Texture analysis by co-occurrences on magnetic resonance imaging (MRI) provides a non-invasive and non-destructive method for studying the distribution of several texture features inside meat products. Traditional methods are based on 2D image sequences, which limit the distribution of texture to a single plane. This implies a loss of information when texture features are studied from different orientations. In this paper, a new 3D algorithm is proposed and included in a computer vision system to study the distribution of textures in 3D images of Iberian loin from different orientations. A semantic interpretation of the textural composition in each orientation is also obtained.
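A minimal sketch of a co-occurrence count over a 3D volume may help show why orientation matters. It is not the algorithm proposed in the paper: the volume layout (`volume[z][y][x]` with integer grey levels), the offset convention and the use of the contrast feature are assumptions for illustration.

```python
from collections import Counter


def cooccurrence_3d(volume, offset):
    """Count grey-level pairs (g1, g2) for voxels separated by offset.

    volume is indexed as volume[z][y][x]; offset is (dz, dy, dx).
    """
    dz, dy, dx = offset
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    counts = Counter()
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                z2, y2, x2 = z + dz, y + dy, x + dx
                if 0 <= z2 < depth and 0 <= y2 < height and 0 <= x2 < width:
                    counts[(volume[z][y][x], volume[z2][y2][x2])] += 1
    return counts


def contrast(counts):
    """Contrast feature: mean squared grey-level difference over all pairs."""
    total = sum(counts.values())
    return sum(c * (a - b) ** 2 for (a, b), c in counts.items()) / total
```

For a volume whose slices differ but are uniform within each slice, the contrast is non-zero along the slice axis and zero within a slice, which is exactly the information a single-plane 2D analysis cannot see.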

M. Mar Ávila, Daniel Caballero, M. Luisa Durán, Andrés Caro, Trinidad Pérez-Palacios, Teresa Antequera

Adaptive Neuro-Fuzzy Controller for Multi-object Tracker

Sensitivity to scene conditions such as contrast and illumination intensity is one of the factors that significantly affect the performance of object trackers. To overcome this issue, tracker parameters need to be adapted based on changes in contextual information. In this paper, we propose an intelligent mechanism to adapt the tracker parameters in a real-time and online fashion. When a frame is processed by the tracker, a controller extracts the contextual information, based on which it adapts the tracker parameters for successive frames. The proposed controller relies on a learned neuro-fuzzy inference system to find satisfactory tracker parameter values. The proposed approach is trained on nine publicly available benchmark video data sets and tested on three unrelated video data sets. The performance comparison indicates a clear improvement in tracking performance over a tracker with static parameter values, as well as over other state-of-the-art trackers.

Duc Phu Chau, K. Subramanian, François Brémond

Human Action Recognition Using Dominant Motion Pattern

The proposed method addresses the problem of human action recognition in realistic videos. The content of such videos is influenced by irregular background motion and camera shake. We construct the human pose descriptors using a modified version of optical flow, which we call hybrid motion optical flow (HMOF). We quantize the HMOF into different labels. The orientations of the HMOF vectors are corrected using probabilistic relaxation labelling, where the HMOF vectors with locally maximum magnitude are retained. Sequences of 2D points, called tracks, representing the motion of the person are constructed. We select the top dominant tracks of the sequence based on a cost function. The dominant tracks are further processed to represent the feature descriptor of a given action.

Snehasis Mukherjee, Apurbaa Mallik, Dipti Prasad Mukherjee

Human Action Recognition Using Dominant Pose Duplet

We propose a Bag-of-Words (BoW) based technique for human action recognition in videos containing challenges like illumination changes, background changes and camera shake. We build the pose descriptors corresponding to the actions based on the gradient-weighted optical flow (GWOF) measure, to minimize the noise related to camera shake. The pose descriptors are clustered and stored in a dictionary of poses. We further generate a reduced dictionary, whose words are termed pose duplets. The pose duplets are constructed by a graphical approach that considers the probability of two poses occurring sequentially during an action. Here, the poses of the initial dictionary are considered as the nodes of a weighted directed graph called the duplet graph. The weight of each edge of the duplet graph is calculated based on the probability of the destination node of the edge appearing after the source node of the edge. The concatenation of the source and destination pose vectors is called a pose duplet. We rank the pose duplets according to the weight of the edge between them. We form the reduced dictionary from the pose duplets with high edge weights (called dominant pose duplets). We construct the action descriptors for each action using the dominant pose duplets and recognize the actions. The efficacy of the proposed approach is tested on standard datasets.
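The duplet graph construction can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: poses are represented by their dictionary indices, the edge weight is estimated as the relative frequency with which the destination pose directly follows the source pose, and `dominant_duplets` is a hypothetical name.

```python
from collections import Counter


def dominant_duplets(sequences, top_k):
    """Rank ordered pose pairs (source, destination) by edge weight.

    sequences: list of pose-label sequences, one per video.
    Edge weight (assumed): count(source -> destination) / count(source -> any).
    Returns the top_k ((source, destination), weight) entries.
    """
    pair_counts = Counter()
    source_counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            source_counts[a] += 1
    weights = {
        (a, b): c / source_counts[a] for (a, b), c in pair_counts.items()
    }
    # Sort by descending weight; ties are broken by the pair itself.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

Each selected pair would then be turned into a duplet descriptor by concatenating the two pose vectors from the initial dictionary, as the abstract describes.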

Snehasis Mukherjee

Online Re-calibration for Robust 3D Measurement Using Single Camera- PantoInspect Train Monitoring System

Vision-based inspection systems measure defects accurately with the help of a checkerboard calibration (CBC) method. However, the 3D measurements of such systems are prone to errors caused by physical misalignment of the object of interest and by noisy image data. The PantoInspect Train Monitoring System (PTMS) is one such system; it inspects defects on pantographs mounted on top of electric trains. In PTMS, the measurement errors can compromise railway safety. Although this problem can be solved by re-calibrating the cameras, the process involves manual intervention, leading to long servicing times.

Therefore, in this paper, we propose Feature Based Calibration (FBC) in place of CBC, to meet an obvious need for online re-calibration that enhances the usability of the system. FBC involves feature extraction, pose estimation, back-projection of defect points and estimation of 3D measurements. We explore four state-of-the-art pose estimation algorithms in FBC using very few feature points.

This paper evaluates and discusses the performance of FBC and its robustness against practical problems, in comparison to CBC. As a result, we identify the best FBC algorithm type and operational scheme for PTMS. In conclusion, we show that, by adopting FBC in PTMS and other related 3D systems, better performance and robustness can be achieved compared to CBC.

Deepak Dwarakanath, Carsten Griwodz, Pål Halvorsen, Jacob Lildballe

Image Saliency Applied to Infrared Images for Unmanned Maritime Monitoring

This paper presents a method to detect boats and life rafts in long wave infrared (LWIR) images captured by an aerial platform. The method applies the concept of image saliency to highlight distinct areas in the images. However, saliency algorithms always highlight salient points in the image, even in the absence of targets. We propose a statistical method based on the saliency algorithm output to distinguish frames with targets from frames without. To evaluate the detection algorithm, we equipped a fixed wing unmanned aerial vehicle with an LWIR camera and gathered a dataset with more than 44000 frames, containing several boats and a life raft. The proposed detection strategy demonstrates good performance, in particular a low rate of false positives and low computational complexity.

Gonçalo Cruz, Alexandre Bernardino

CBIR Service for Object Identification

This paper proposes an architecture for an exact object detection system. The implementation as well as the communication between individual system components is detailed in the paper. Well-known methods for feature detection and extraction are used, and a fast and precise method for feature comparison is presented.

The proposed system was evaluated by training on the dataset and querying the dataset. With 12 Workers, the response time for querying the dataset consisting of [...] images was just below 20 seconds. With the same number of Workers, the system trained a dataset of this size in about an hour.

Josef Hák, Martin Klíma, Mikuláš Krupička, Václav Čadek

Soil Surface Roughness Using Cumulated Gaussian Curvature

Optimal use of farming machinery is important for efficiency and sustainability. Continuous automated control of the machine settings throughout the tillage operation requires sensory feedback estimating the seedbed quality. In this paper we use a laser range scanner to capture high resolution maps of soil aggregates in a laboratory setting as well as full soil surface maps in a field test. Gaussian curvature is used to estimate the size of single aggregates under controlled circumstances. Additionally, a method is proposed, which cumulates the Gaussian curvature of full soil surface maps to estimate the degree of tillage.
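The following sketch shows one way to compute Gaussian curvature on a regular height map and to cumulate it over a surface. It uses the standard formula for a surface z(x, y), K = (z_xx z_yy - z_xy²) / (1 + z_x² + z_y²)², with central finite differences. How the paper cumulates the curvature is not stated in the abstract, so summing absolute values over the interior grid cells is an assumption, as are the function names.

```python
def gaussian_curvature(z, h=1.0):
    """Gaussian curvature of a height map z[row][col] with grid spacing h.

    Border cells are left at 0.0 because central differences need neighbours.
    """
    rows, cols = len(z), len(z[0])
    K = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            zx = (z[i][j + 1] - z[i][j - 1]) / (2 * h)
            zy = (z[i + 1][j] - z[i - 1][j]) / (2 * h)
            zxx = (z[i][j + 1] - 2 * z[i][j] + z[i][j - 1]) / h ** 2
            zyy = (z[i + 1][j] - 2 * z[i][j] + z[i - 1][j]) / h ** 2
            zxy = (z[i + 1][j + 1] - z[i + 1][j - 1]
                   - z[i - 1][j + 1] + z[i - 1][j - 1]) / (4 * h ** 2)
            K[i][j] = (zxx * zyy - zxy ** 2) / (1 + zx ** 2 + zy ** 2) ** 2
    return K


def cumulated_curvature(z, h=1.0):
    """Sum of |K| times cell area over the map (assumed cumulation rule)."""
    return sum(abs(k) for row in gaussian_curvature(z, h) for k in row) * h * h
```

A flat or uniformly tilted surface gives a cumulated value near zero, while a surface covered in rounded aggregates gives a larger value, which is the intuition behind using it as a roughness indicator.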

Thomas Jensen, Lars J. Munkholm, Ole Green, Henrik Karstoft

