About this book

The field of robotic vision has advanced dramatically in recent years with the development of new range sensors. Tremendous progress has been made, with significant impact on areas such as robotic navigation, scene/environment understanding, and visual learning. This edited book provides a solid and diversified reference source for some of the most important recent advancements in the field of robotic vision. The book starts with articles that describe new techniques for understanding scenes from 2D/3D data, such as estimation of planar structures, recognition of multiple objects in a scene using different kinds of features as well as their spatial and semantic relationships, generation of 3D object models, and an approach to recognizing partially occluded objects. Novel techniques are then introduced to improve 3D perception accuracy with other sensors such as a gyroscope, to improve positioning accuracy with a visual-servoing-based alignment strategy for microassembly, and to increase object recognition reliability using related manipulation motion models. For autonomous robot navigation, different vision-based localization and tracking strategies and algorithms are discussed. New approaches using probabilistic analysis for robot navigation, online learning of vision-based robot control, and 3D motion estimation via intensity differences from a monocular camera are described. This collection will be beneficial to graduate students, researchers, and professionals working in the area of robotic vision.

Table of Contents


Multi-modal Manhattan World Structure Estimation for Domestic Robots

Spatial structures typically dealt with by robots in domestic environments conform to Manhattan orientations; in other words, much of the 3D point cloud conforms to one of three principal planar orientations. Analysis of such planar spatial structures is therefore significant in robotic environments. This process has become a fundamental component of diverse robot vision systems since the introduction of low-cost RGB-D cameras, such as the Kinect, ASUS, and PrimeSense devices, that have been widely mounted on indoor robots. These commercial structured-light/time-of-flight depth cameras are capable of providing high-quality 3D reconstruction in real time. A number of techniques can be applied to determine multi-plane structure in 3D scenes, but most require prior knowledge of the modality of the planes or of the inlier scale of the data points in order to successfully discriminate between different planar structures. In this paper, we present a novel approach to estimating multi-plane structures without prior knowledge, based on the Jensen-Shannon Divergence (JSD), a similarity measure used to represent the pairwise relationship between data points. Our JSD-based model incorporates information about whether a pairwise relationship exists within a model's inlier set, as well as the pairwise geometrical relationship between data points.
Tests on datasets comprising noisy inliers and a large percentage of outliers demonstrate that the proposed solution can efficiently estimate multiple models without prior information. Experimental results also demonstrate successful discrimination of multiple planar structures in both real and synthetic scenes, and practical tests with a robot vision system confirm the validity of the proposed approach. Furthermore, we show that our model is not restricted to linear kernel models such as planes, but can also be used to fit data with non-linear kernel models.
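As a concrete illustration of the similarity measure underlying the approach, the Jensen-Shannon Divergence between two discrete distributions can be computed as follows (a minimal NumPy sketch; the function name and smoothing constant are ours, not the authors'):

```python
import numpy as np

def jensen_shannon_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions.
    With log base 2 the value lies in [0, 1]."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)                      # the mixture distribution
    def kl(a, b):
        return np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jensen_shannon_divergence([0.5, 0.5], [0.5, 0.5]))  # -> 0.0
print(jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0]))  # ≈ 1.0
```

Unlike the KL divergence, the JSD is symmetric and bounded, which is what makes it usable as a pairwise similarity between data relationships.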
Kai Zhou, Karthik Mahesh Varadarajan, Michael Zillich, Markus Vincze

RMSD: A 3D Real-Time Mid-level Scene Description System

This paper introduces a system for real-time visual 3D scene description. A scene is described by planar patches and conical objects (cylinders, cones, and spheres). The system makes use of the sensor's natural point ordering, dimensionality reduction, and fast incremental model updates (in O(1)) to first build 2D geometric features. These features approximate the original data and form candidate sets of possible 3D object models. The candidate sets are used by a region-growing algorithm to extract all targeted 3D objects. This two-step approach (raw data to 2D features to 3D objects) is able to process 30 frames per second of Kinect depth data, which allows real-time tracking and feature-based robot mapping on 3D range data.
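The O(1) incremental model update can be illustrated with the simplest 2D geometric feature, a least-squares line maintained through running sums: each new point updates a handful of accumulators in constant time (an illustrative sketch, not the system's actual code):

```python
class IncrementalLineFit:
    """Least-squares line fit y = a*x + b with O(1) per-point updates,
    maintained via running sums instead of refitting all points."""
    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def add(self, x, y):
        # constant-time update of the sufficient statistics
        self.n += 1
        self.sx += x; self.sy += y
        self.sxx += x * x; self.sxy += x * y

    def params(self):
        denom = self.n * self.sxx - self.sx ** 2
        a = (self.n * self.sxy - self.sx * self.sy) / denom
        b = (self.sy - a * self.sx) / self.n
        return a, b

fit = IncrementalLineFit()
for x in range(10):
    fit.add(x, 2.0 * x + 1.0)   # points on the line y = 2x + 1
print(fit.params())              # -> (2.0, 1.0)
```

The same idea (sufficient statistics updated in constant time) extends to the circles and other 2D primitives from which the system's 3D candidates are grown.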
Kristiyan Georgiev, Rolf Lakaemper

Semantic and Spatial Content Fusion for Scene Recognition

In the field of scene recognition, it is usually insufficient to use only one visual feature, regardless of how discriminative that feature is. The spatial locations and semantic relationships of local features therefore need to be captured together with the scene's contextual information. In this paper we propose a novel framework that projects the image's contextual feature space, together with the semantic space of local features, into a mapping function. This embedding is performed on the basis of a subset of training images denoted the exemplar set, composed of images that describe a scene category's attributes better than the other images do. The proposed framework learns a weighted combination of local semantic topics as well as global and spatial information, where the weights represent each feature's contribution to each scene category. An empirical study was performed on two of the most challenging scene datasets, 15-Scene Categories and 67-Indoor Scenes, achieving promising results of 89.47% and 45.0%, respectively.
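The idea of scoring a scene category by a weighted combination of cues can be sketched as follows; the cue names, weights, and similarity values below are hypothetical, and in the chapter the weights are learned per category rather than fixed:

```python
import numpy as np

def scene_score(feature_scores, weights):
    """Score a scene category as a weighted combination of per-feature
    similarities (local semantic topics, global context, spatial layout)."""
    return float(np.dot(weights, feature_scores))

# Hypothetical per-cue similarities of one image to two candidate categories:
scores = {"kitchen": [0.8, 0.6, 0.7], "office": [0.4, 0.7, 0.3]}
weights = np.array([0.5, 0.2, 0.3])      # learned contribution of each cue
best = max(scores, key=lambda c: scene_score(scores[c], weights))
print(best)  # -> kitchen
```

The point of learning the weights per category is that, for example, spatial layout may matter more for corridors than for forests; a single fixed weighting would blur that distinction.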
Elahe Farahzadeh, Tat-Jen Cham, Wanqing Li

Improving RGB-D Scene Reconstruction Using Rolling Shutter Rectification

Scene reconstruction, i.e. the process of creating a 3D representation (mesh) of some real world scene, has recently become easier with the advent of cheap RGB-D sensors (e.g. the Microsoft Kinect).
Many such sensors use rolling shutter cameras, which produce geometrically distorted images when they are moving. To mitigate these rolling shutter distortions, we propose a method that uses an attached gyroscope to rectify the depth scans. We also present a simple scheme to calibrate the relative pose and time synchronization between the gyroscope and a rolling shutter RGB-D sensor.
For scene reconstruction we use the Kinect Fusion algorithm to produce meshes. We create meshes from both raw and rectified depth scans, and these are then compared to a ground truth mesh. The types of motion we investigate are: pan, tilt and wobble (shaking) motions.
As our method relies on gyroscope readings, the amount of computations required is negligible compared to the cost of running Kinect Fusion.
This chapter is an extension of a paper at the IEEE Workshop on Robot Vision [10]. Compared to that paper, we have improved the rectification to also correct for lens distortion, and we use a coarse-to-fine search to find the time shift more quickly. We have extended our experiments to also investigate the effects of lens distortion, and to use more accurate ground truth. The experiments demonstrate that correction of rolling shutter effects yields a larger improvement of the 3D model than correction of lens distortion.
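The core of gyro-based rolling shutter rectification can be sketched as follows: each image row is exposed at a slightly different time, so the rotation accumulated since the first row (integrated from the gyroscope) is undone per row. This is a simplified sketch assuming constant angular velocity over the frame and an already-calibrated gyro-to-camera pose, not the authors' implementation:

```python
import numpy as np

def row_rotation(omega, t):
    """Rotation accumulated over time t at constant angular velocity omega,
    via Rodrigues' formula on the axis-angle vector omega * t."""
    theta = np.asarray(omega, dtype=float) * t
    angle = np.linalg.norm(theta)
    if angle < 1e-12:
        return np.eye(3)
    k = theta / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def rectify_point(point, row, num_rows, readout_time, omega):
    """Rotate a 3D point observed in a given image row back into the frame
    of the first row, undoing the rotation measured by the gyro meanwhile."""
    t = readout_time * row / num_rows    # exposure time offset of this row
    return row_rotation(omega, t).T @ point

# A point seen in the last row during a 1 rad/s pan with a 30 ms readout:
p = np.array([0.1, 0.2, 1.0])
distorted = row_rotation([0.0, 1.0, 0.0], 0.03) @ p
print(rectify_point(distorted, 480, 480, 0.03, [0.0, 1.0, 0.0]))  # ≈ p
```

Because only gyroscope integration and one small rotation per row are involved, the cost is negligible next to the reconstruction pipeline, matching the chapter's observation about Kinect Fusion.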
Hannes Ovrén, Per-Erik Forssén, David Törnqvist

Modeling Paired Objects and Their Interaction

Object categorization and human action recognition are two important capabilities for an intelligent robot. Traditionally, they have been treated separately. Recently, more researchers have started to model object features, object affordances, and human actions jointly. Most of these works build a relational model between single-object features and human actions, or between object affordances and human actions, and use the model to improve object recognition accuracy [16, 21, 12].
Yu Sun, Yun Lin

Probabilistic Active Recognition of Multiple Objects Using Hough-Based Geometric Matching Features

3D object recognition is an important task for mobile platforms that must interact dynamically in human environments. This computer vision task also plays a fundamental role in automated surveillance, Simultaneous Localization and Mapping (SLAM) applications for robots, and video retrieval. The recognition of objects in realistic circumstances, where multiple objects may appear together with significant occlusions and clutter from distractor objects, is a complicated and challenging problem. In such situations, multiple viewpoints are particularly necessary for recognition [17], as single viewpoints may be of poor quality and may not contain sufficient information to reliably recognise or verify all objects' identities unambiguously.
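The benefit of multiple viewpoints can be illustrated with a sequential Bayesian update over object hypotheses. This naive-Bayes sketch assumes conditionally independent views; it stands in for, but does not reproduce, the chapter's Hough-based matching features:

```python
import numpy as np

def fuse_views(prior, likelihoods):
    """Sequential Bayesian fusion of per-view likelihoods over K object
    hypotheses: multiply in each view's evidence and renormalize."""
    belief = np.asarray(prior, dtype=float)
    for lik in likelihoods:
        belief = belief * np.asarray(lik, dtype=float)
        belief /= belief.sum()           # posterior after this view
    return belief

# Two individually ambiguous views: the first cannot separate objects 0 and 1,
# the second cannot separate objects 0 and 2. Combined, object 0 wins because
# it is the only hypothesis plausible in both views.
prior = [1/3, 1/3, 1/3]
views = [[0.6, 0.5, 0.1], [0.6, 0.1, 0.5]]
posterior = fuse_views(prior, views)
print(posterior.argmax())  # -> 0
```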
Natasha Govender, Philip Torr, Mogomotsi Keaikitse, Fred Nicolls, Jonathan Warrell

Incremental Light Bundle Adjustment: Probabilistic Analysis and Application to Robotic Navigation

This paper focuses on incremental light bundle adjustment (iLBA), a recently introduced [13] structureless bundle adjustment method that reduces computational complexity by algebraically eliminating the camera-observed 3D points and using incremental smoothing to efficiently optimize only the camera poses. We consider the probability distribution that corresponds to the iLBA cost function and analyze how well it represents the true density of the camera poses given the image measurements. The latter can be calculated exactly in bundle adjustment (BA) by marginalizing out the 3D points from the joint distribution of camera poses and 3D points. We present a theoretical analysis of the differences in the way that light bundle adjustment and bundle adjustment use measurement information. Using indoor and outdoor datasets, we show that the first two moments of the iLBA and true probability distributions are very similar in practice. Moreover, we present an extension of iLBA to robotic navigation, considering information fusion between a high-rate IMU and a monocular camera sensor while avoiding explicit estimation of 3D points. We evaluate the performance of this method in a realistic synthetic aerial scenario and show that iLBA and incremental BA achieve comparable navigation state estimation accuracy, while computation time is significantly reduced in the former case.
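For contrast with iLBA's algebraic elimination, the exact marginalization of 3D points available in standard BA is the Schur complement of the point block in the normal equations. A small numerical sketch, with a generic symmetric positive-definite matrix standing in for a real BA system:

```python
import numpy as np

def marginalize_points(H, b, n_cam):
    """Eliminate the 3D-point block from the normal equations H x = b via the
    Schur complement, leaving a reduced system in the camera variables only."""
    Hcc, Hcp = H[:n_cam, :n_cam], H[:n_cam, n_cam:]
    Hpc, Hpp = H[n_cam:, :n_cam], H[n_cam:, n_cam:]
    Hpp_inv = np.linalg.inv(Hpp)          # block-diagonal and cheap in real BA
    H_red = Hcc - Hcp @ Hpp_inv @ Hpc
    b_red = b[:n_cam] - Hcp @ Hpp_inv @ b[n_cam:]
    return H_red, b_red

# The reduced system yields exactly the camera part of the full solution.
rng = np.random.default_rng(1)
A = rng.normal(size=(10, 10))
H = A @ A.T + 10.0 * np.eye(10)           # a well-conditioned SPD test matrix
b = rng.normal(size=10)
H_red, b_red = marginalize_points(H, b, n_cam=4)
x_cam = np.linalg.solve(H_red, b_red)
print(np.allclose(x_cam, np.linalg.solve(H, b)[:4]))  # -> True
```

This exactness is what makes BA's marginal over camera poses the "true density" against which the paper compares the iLBA cost function.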
Vadim Indelman, Frank Dellaert

Online Learning of Vision-Based Robot Control during Autonomous Operation

Online learning of vision-based robot control requires appropriate activation strategies during operation. In this chapter we present such a learning approach, with applications to two areas of vision-based robot control. In the first setting, self-evaluation is possible for the learning system, and the system autonomously switches to learning mode to produce the necessary training data by exploration. The second application is a setting where external information is required to determine the correctness of an action; here an operator provides training data when required, leading to an automatic mode switch to online learning from demonstration. In experiments for the first setting, the system autonomously learns the inverse kinematics of a robotic arm. We propose improvements that produce more informative training data than random exploration, which reduces training time and limits learning to the regions where the learnt mapping is actually used; the learnt region is extended autonomously on demand. In experiments for the second setting, we present an autonomous driving system that learns a mapping from visual input to control signals and is trained by manually steering the robot. After the initial training period, the system seamlessly continues autonomously. Manual control can be taken back at any time to provide additional training.
Kristoffer Öfjäll, Michael Felsberg

3D Space Automated Aligning Task Performed by a Microassembly System Based on Multi-channel Microscope Vision Systems

In this chapter, a microassembly system based on a 3-channel microscope vision system is proposed that achieves automatic alignment of millimeter-sized complex micro parts. The design of the system is described first. Two different alignment strategies are then presented: a position-based method and an image-Jacobian-based method. A coarse-to-fine alignment strategy combined with an active zooming algorithm is also proposed: in the coarse alignment stage, the alignment process is conducted at the maximum field of view (FOV), while in the fine alignment stage the microscope is set to maximum magnification to ensure the highest assembly accuracy. For the image-based control, the image Jacobian relating the microscope vision systems to the movement of the micro part is derived from the microscope vision model. It is proved that the image Jacobian is constant when controlling position movement. During active movement of the micro part, the image Jacobian is self-calibrated online. A PD controller is adopted to control the micro part's movement quickly and effectively. Experiments verify the effectiveness of the proposed algorithms.
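The role of a constant image Jacobian in the alignment loop can be sketched with a simple proportional control step (a simplified stand-in for the chapter's PD controller; the Jacobian values below are hypothetical calibration numbers, not from the system):

```python
import numpy as np

def visual_servo_step(J, error_px, kp=0.5):
    """One proportional visual-servoing step: map the observed image-space
    error back to a stage motion through the image Jacobian J, where
    d(pixels) = J @ d(mm)."""
    return -kp * np.linalg.pinv(J) @ error_px

# Hypothetical constant Jacobian: 100 px of image motion per mm of stage motion.
J = np.diag([100.0, 100.0])
pos = np.array([0.30, -0.20])            # stage offset from the target (mm)
for _ in range(20):
    error_px = J @ pos                   # image-space alignment error
    pos = pos + visual_servo_step(J, error_px)
print(np.round(pos, 6))                  # converges toward [0, 0]
```

The constancy of the Jacobian for position movement is what makes this loop practical: it can be calibrated once (or self-calibrated online during active movement) instead of being re-estimated at every step.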
Zhengtao Zhang, De Xu, Juan Zhang

Intensity-Difference Based Monocular Visual Odometry for Planetary Rovers

A monocular visual odometry algorithm is presented that estimates the rover's 3D motion by maximizing the conditional probability of the intensity differences between two consecutive images, captured by a monocular video camera before and after the rover's motion. The camera is assumed to be rigidly attached to the rover. The intensity differences are measured only at observation points, i.e., points with high linear intensity gradients. The algorithm represents an alternative to traditional stereo visual odometry algorithms, in which the rover's 3D motion is estimated by maximizing the conditional probability of the 3D correspondences between two sets of 3D feature point positions obtained from consecutive stereo image pairs captured before and after the rover's motion. Experimental results with synthetic and real image sequences yielded highly accurate and reliable estimates. The algorithm also appears to be an excellent candidate for mobile robot missions where space, weight, and power supply are severely limited.
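The selection of observation points, i.e. pixels with high linear intensity gradients, can be sketched as follows (an illustrative NumPy version; the threshold and the gradient operator are our choices, not the chapter's):

```python
import numpy as np

def observation_points(image, threshold):
    """Select observation points: pixels whose intensity-gradient magnitude
    exceeds a threshold. Only these pixels contribute intensity differences
    to the motion estimate."""
    gy, gx = np.gradient(image.astype(float))   # row- and column-direction gradients
    mag = np.hypot(gx, gy)
    rows, cols = np.nonzero(mag > threshold)
    return list(zip(rows.tolist(), cols.tolist()))

# A vertical step edge: only pixels adjacent to the edge qualify.
img = np.zeros((5, 5))
img[:, 3:] = 10.0
pts = observation_points(img, threshold=1.0)
print(pts)   # points in the two columns bordering the edge
```

Restricting the measurement to high-gradient pixels is what keeps the method cheap: flat regions carry almost no motion information in their intensity differences, so they are simply skipped.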
Geovanni Martinez
