2008 | Book

Computer Vision Systems

6th International Conference, ICVS 2008, Santorini, Greece, May 12-15, 2008, Proceedings

Editors: Antonios Gasteratos, Markus Vincze, John K. Tsotsos

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science

About this book

In the past few years, with the advances in microelectronics and digital technology, cameras became a widespread medium. This, along with the enduring increase in computing power, boosted the development of computer vision systems. The International Conference on Computer Vision Systems (ICVS) covers the advances in this area. This is to say that ICVS is not and should not be yet another computer vision conference. The field of computer vision is fully covered by many well-established and famous conferences, and ICVS differs from these by covering the systems point of view. ICVS 2008 was the 6th International Conference dedicated to advanced research on computer vision systems. The conference, continuing a series of successful events in Las Palmas, Vancouver, Graz, New York and Bielefeld, was held in 2008 on Santorini. In all, 128 papers entered the review process, and each was reviewed by three independent reviewers using the double-blind review method. Of these, 53 papers were accepted (23 as oral and 30 as poster presentations). There were also two invited talks by P. Anandan and by Heinrich H. Bülthoff. The presented papers cover all aspects of computer vision systems, namely: cognitive vision, monitor and surveillance, computer vision architectures, calibration and registration, object recognition and tracking, learning, human-machine interaction and cross-modal systems.

Table of Contents

Frontmatter

Cognitive Vision

Frontmatter
Visual Search in Static and Dynamic Scenes Using Fine-Grain Top-Down Visual Attention

Artificial visual attention is one of the key methodologies inspired by nature that can lead to robust and efficient visual search by machine vision systems. A novel approach is proposed for modeling top-down visual attention, in which separate saliency maps for the two attention pathways are suggested. The maps for the bottom-up pathway are built using unbiased rarity criteria, while the top-down maps are created using fine-grain feature similarity with the search target, as suggested by the literature on natural vision. The model has shown robustness and efficiency in visual search experiments on natural and artificial visual input under both static and dynamic scenarios.

Muhammad Zaheer Aziz, Bärbel Mertsching
Integration of Visual and Shape Attributes for Object Action Complexes

Our work is oriented towards the idea of developing cognitive capabilities in artificial systems through Object Action Complexes (OACs) [7]. The theory claims that objects and actions are inseparably intertwined. Categories of objects are built not by visual appearance only, as is very common in computer vision, but by the actions an agent can perform with them and by the perceivable attributes. The core of the OAC concept is linking objects, constituted from a set of attributes that can be manifold in type (e.g., color, shape, mass, material), to actions. This pairing of attributes and actions provides the basis for categories. The work presented here is embedded in the development of an extensible system for providing and evolving attributes, beginning with attributes extractable from visual data.

Kai Huebner, Mårten Björkman, Babak Rasolzadeh, Martina Schmidt, Danica Kragic
3D Action Recognition and Long-Term Prediction of Human Motion

In this contribution we introduce a novel method for 3D trajectory-based recognition of and discrimination between different working actions, combined with long-term motion prediction. The 3D pose of the human hand-forearm limb is tracked over time with a multi-hypothesis Kalman filter framework, using the Multiocular Contracting Curve Density (MOCCD) algorithm as a 3D pose estimation method. A novel trajectory classification approach is introduced which relies on the Levenshtein Distance on Trajectories (LDT) as a measure for the similarity between trajectories. Experimental investigations are performed on 10 real-world test sequences acquired from different viewpoints in a working environment. The system performs simultaneous recognition of a working action and cognitive long-term motion prediction. Trajectory recognition rates around 90% are achieved, requiring only a small number of training sequences. The proposed prediction approach yields significantly more reliable results than a Kalman filter based reference approach.

Markus Hahn, Lars Krüger, Christian Wöhler
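
A minimal sketch of the trajectory-matching idea in Python: the Levenshtein Distance on Trajectories is not detailed in the abstract, so the symbol alphabet below (dominant-axis coding of 3D displacements) is an assumption for illustration; only the edit-distance core is the standard algorithm.

import numpy as np

def quantize(traj):
    # Map each 3D displacement to a dominant-axis symbol in {0..5}
    # (assumed coding; the paper's actual alphabet is not given).
    d = np.diff(np.asarray(traj, dtype=float), axis=0)
    axis = np.abs(d).argmax(axis=1)
    sign = d[np.arange(len(d)), axis] < 0
    return [int(2 * a + s) for a, s in zip(axis, sign)]

def levenshtein(a, b):
    # Classic edit distance between two symbol sequences.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Two hand trajectories are compared via their quantized symbol strings;
# a distance normalized by sequence length can serve as a similarity score.
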
Tracking of Human Hands and Faces through Probabilistic Fusion of Multiple Visual Cues

This paper presents a new approach for real-time detection and tracking of human hands and faces in image sequences. The proposed method builds upon our previous research on color-based tracking and extends it towards building a system capable of distinguishing between human hands, faces and other skin-colored regions in the image background. To achieve these goals, the proposed approach allows the utilization of additional information cues, including motion information given by means of a background subtraction algorithm, and top-down information regarding the formed image segments, such as their spatial location, velocity and shape. All information cues are combined under a probabilistic framework which furnishes the proposed approach with the ability to cope with uncertainty due to noise. The proposed approach runs in real time on a standard personal computer. The presented experimental results confirm the effectiveness of the proposed methodology and its advantages over previous approaches.

Haris Baltzakis, Antonis Argyros, Manolis Lourakis, Panos Trahanias
Enhancing Robustness of a Saliency-Based Attention System for Driver Assistance

Biologically motivated attention systems prefilter the visual environment for scene elements that pop out most or that best match the current system task. However, the robustness of biological attention systems is difficult to achieve, given, e.g., the high variability of scene content, changes in illumination, and scene dynamics. Most computational attention models do not show real-time capability or are tested only in controlled indoor environments. So far, no approach has been used in the highly dynamic real-world car domain. Dealing with such scenarios requires a strong system adaptation capability with respect to changes in the environment. Here, we focus on five conceptual issues crucial for closing the gap between artificial and natural attention systems operating in the real world. We show the feasibility of our approach on vision data from the car domain. The described attention system is part of a biologically motivated advanced driver assistance system running in real time.

Thomas Michalke, Jannik Fritsch, Christian Goerick
Covert Attention with a Spiking Neural Network

We propose an implementation of covert attention mechanisms with spiking neurons. Spiking neural models describe the activity of a neuron with precise spike timing rather than firing rate. We investigate the benefits offered by such a temporal code for low-level vision and early attentional processes. This paper describes a spiking neural network which achieves saliency extraction and a stable attentional focus on a moving stimulus. Experimental results obtained using real visual scenes illustrate the robustness and speed of this approach.

Sylvain Chevallier, Philippe Tarroux
Salient Region Detection and Segmentation

Detection of salient image regions is useful for applications like image segmentation, adaptive compression, and region-based image retrieval. In this paper we present a novel method to determine salient regions in images using low-level features of luminance and color. The method is fast, easy to implement and generates high quality saliency maps of the same size and resolution as the input image. We demonstrate the use of the algorithm in the segmentation of semantically meaningful whole objects from digital images.

Radhakrishna Achanta, Francisco Estrada, Patricia Wils, Sabine Süsstrunk
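
The abstract names the ingredients (low-level luminance and color at the input resolution) but not the exact algorithm; the sketch below shows one plausible single-scale center-surround contrast on a multi-channel feature stack, with window sizes chosen arbitrarily for illustration, not taken from the paper.

import numpy as np
from scipy.ndimage import uniform_filter

def saliency(features, center=3, surround=31):
    # features: H x W x C array (e.g. L, a, b channels).
    # Saliency = per-pixel Euclidean distance between the means of a
    # small "center" window and a large "surround" window.
    f = features.astype(float)
    c = np.stack([uniform_filter(f[..., k], center)
                  for k in range(f.shape[-1])], axis=-1)
    s = np.stack([uniform_filter(f[..., k], surround)
                  for k in range(f.shape[-1])], axis=-1)
    sal = np.sqrt(((c - s) ** 2).sum(axis=-1))
    return (sal - sal.min()) / (np.ptp(sal) + 1e-9)  # map to [0, 1]

# Thresholding the map (e.g. at its mean) gives candidate salient regions
# that can seed a segmentation of whole objects.
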

Monitor and Surveillance

Frontmatter
The SAFEE On-Board Threat Detection System

Under the framework of the European Union funded SAFEE project, this paper gives an overview of a novel monitoring and scene analysis system developed for use onboard aircraft in spatially constrained environments. The techniques discussed herein aim to warn on-board crew about pre-determined indicators of threat intent (such as running or shouting in the cabin), as elicited from industry and security experts. The subject matter experts believe that such activities are strong indicators of the beginnings of undesirable chains of events or scenarios which should not be allowed to develop aboard aircraft. This project aims to detect these scenarios and provide advice to the crew. These events may involve unruly passengers or be indicative of the precursors to terrorist threats. With a state-of-the-art tracking system using homography intersections of motion images, and probability-based Petri nets for scene understanding, the SAFEE behavioural analysis system automatically assesses the output from multiple intelligent sensors and creates recommendations that are presented to the crew using an integrated airborne user interface. Evaluation of the system is conducted within a full-size aircraft mockup, and experimental results are presented, showing that the SAFEE system is well suited to monitoring people in confined environments, and that meaningful and instructive output regarding human actions can be derived from the sensor network within the cabin.

Nicholas L. Carter, James M. Ferryman
Region of Interest Generation in Dynamic Environments Using Local Entropy Fields

This paper presents a novel technique to generate regions of interest in image sequences containing independent motions. The technique uses a novel motion segmentation method that segments optical flow using a field of local entropies. Local entropy values are computed for each optical flow vector and are collected as input for a two-state Markov Random Field that is used to discriminate the motion boundaries. Local entropy values are highly informative cues about the amount of information contained in a vector's neighborhood: high values represent significant motion differences, while low values express uniform motion. For each cluster a motion model is fitted and used to create a multiple-hypothesis prediction for the following frame. Experiments have been performed on standard and outdoor datasets in order to show the validity of the proposed technique.

Luciano Spinello, Roland Siegwart
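
A minimal sketch of a local entropy field over optical flow, assuming the entropy is taken over quantized flow orientations in a square window (the paper's exact definition, and its two-state MRF step, are not reproduced here):

import numpy as np
from scipy.ndimage import uniform_filter

def flow_entropy(u, v, bins=8, win=7):
    # u, v: H x W optical-flow components.
    ang = np.arctan2(v, u)                                 # [-pi, pi]
    sym = ((ang + np.pi) / (2 * np.pi) * bins).astype(int) % bins
    H = np.zeros(np.shape(u), dtype=float)
    for k in range(bins):
        p = uniform_filter((sym == k).astype(float), win)  # local frequency
        H -= p * np.log2(np.maximum(p, 1e-12))             # Shannon entropy
    return H  # high near motion boundaries, low inside uniform motions
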
Real-Time Face Tracking for Attention Aware Adaptive Games

This paper presents a real-time face tracking and head pose estimation system which is included in an attention-aware game framework. This fast tracking system enables the detection of the player's attentional state using a simple attention model. This state is then used to adapt the game's unfolding in order to enhance the user's experience (in the case of an adventure game) and improve the game's attentional attractiveness (in the case of a pedagogical game).

Matthieu Perreira Da Silva, Vincent Courboulay, Armelle Prigent, Pascal Estraillier
Rek-Means: A k-Means Based Clustering Algorithm

In this paper we present a new clustering method based on k-means that has been implemented in a video surveillance system. Rek-means does not require the number of clusters to search for to be specified in advance, and is more precise than k-means in clustering data coming from multiple Gaussian distributions with different covariances, while maintaining real-time performance. Experiments on real and synthetic datasets are presented to measure the effectiveness and performance of the proposed method.

Domenico Daniele Bloisi, Luca Iocchi
Smoke Detection in Video Surveillance: A MoG Model in the Wavelet Domain

The paper presents a new fast and robust technique for smoke detection in video surveillance images. The approach aims at detecting the onset or presence of smoke by analyzing the color and texture features of moving objects, segmented with background subtraction. The proposal embodies several novelties: first, the temporal behavior of the smoke is modeled by a Mixture of Gaussians (MoG) of the energy variation in the wavelet domain. The MoG takes into account the image energy variation due to either external luminance changes or smoke propagation, and allows distinguishing these from energy variation due to the presence of real moving objects such as people and vehicles. Second, this textural analysis is enriched by a color analysis based on the blending function. Third, a Bayesian model is defined in which the texture and color features, detected at block level, contribute to the likelihood, while a global evaluation of the entire image models the prior probability contribution. The resulting approach is very flexible and can be adopted in conjunction with any video surveillance system based on a dynamic background model. Several tests on tens of different contexts, both outdoor and indoor, prove its robustness and precision.

Simone Calderara, Paolo Piccinini, Rita Cucchiara
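
The abstract combines two standard pieces, block-wise wavelet energy and an online Mixture of Gaussians; here is a generic sketch of both, with a one-level Haar transform written out and a Stauffer-Grimson-style update. The learning rate and matching threshold are assumed values, and the paper's blending-function color analysis is omitted.

import numpy as np

def haar_detail_energy(block):
    # One-level 2D Haar transform of an even-sized block;
    # returns the energy of the three detail subbands.
    b = block.astype(float)
    lo = (b[:, 0::2] + b[:, 1::2]) / 2.0
    hi = (b[:, 0::2] - b[:, 1::2]) / 2.0
    lh = (lo[0::2, :] - lo[1::2, :]) / 2.0
    hl = (hi[0::2, :] + hi[1::2, :]) / 2.0
    hh = (hi[0::2, :] - hi[1::2, :]) / 2.0
    return (lh ** 2).sum() + (hl ** 2).sum() + (hh ** 2).sum()

def mog_update(x, w, mu, var, lr=0.05, match=2.5):
    # Online MoG update for a scalar signal x (e.g. a block's energy
    # variation over time); w, mu, var are per-component arrays.
    d = np.abs(x - mu) / np.sqrt(var)
    k = d.argmin()
    w = w * (1 - lr)
    if d[k] < match:                       # x matches component k
        w[k] += lr
        mu[k] += lr * (x - mu[k])
        var[k] += lr * ((x - mu[k]) ** 2 - var[k])
    else:                                  # replace the weakest component
        j = w.argmin()
        mu[j], var[j], w[j] = x, var.mean(), lr
    return w / w.sum(), mu, var
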

Computer Vision Architectures

Frontmatter
Feature Extraction and Classification by Genetic Programming

This paper explores the use of genetic programming for constructing vision systems. A two-stage approach is used, with separate evolution of the feature extraction and classification stages. The strategy taken for the classifier is to evolve a set of partial solutions, each of which works for a single class. It is found that this approach is significantly faster than conventional genetic programming, and frequently results in a better classifier. The effectiveness of the approach is explored on three classification problems.

Olly Oechsle, Adrian F. Clark
GPU-Based Multigrid: Real-Time Performance in High Resolution Nonlinear Image Processing

Multigrid methods provide fast solvers for a wide variety of problems encountered in computer vision. Recent graphics hardware is ideally suited for the implementation of such methods, but this potential has not yet been fully realized. Typically, work in that area focuses on linear systems only, or on implementation of numerical solvers that are not as efficient as multigrid methods. We demonstrate that nonlinear multigrid methods can be used to great effect on modern graphics hardware. Specifically, we implement two applications: a nonlinear denoising filter and a solver for variational optical flow. We show that performing these computations on graphics hardware is between one and two orders of magnitude faster than comparable CPU-based implementations.

Harald Grossauer, Peter Thoman
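
The paper targets nonlinear multigrid on GPUs; as a reference for the control flow only, here is a CPU V-cycle for the 1D Poisson problem -u'' = f with weighted-Jacobi smoothing, a linear toy rather than the paper's solver:

import numpy as np

def smooth(u, f, h, iters=3, w=2.0 / 3.0):
    # Weighted-Jacobi relaxation for -u'' = f with Dirichlet boundaries.
    for _ in range(iters):
        u[1:-1] = (1 - w) * u[1:-1] + \
                  w * 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1])
    return u

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def v_cycle(u, f, h):
    u = smooth(u, f, h)                        # pre-smoothing
    if len(u) <= 3:
        return smooth(u, f, h, iters=20)       # coarsest grid
    r = residual(u, f, h)[::2].copy()          # restrict the residual
    e = v_cycle(np.zeros_like(r), r, 2 * h)    # coarse-grid correction
    u[::2] += e                                # prolongate: copy ...
    u[1:-1:2] += 0.5 * (e[:-1] + e[1:])        # ... and interpolate
    return smooth(u, f, h)                     # post-smoothing

n = 129
x = np.linspace(0.0, 1.0, n)
f = np.pi ** 2 * np.sin(np.pi * x)             # exact solution: sin(pi x)
u = np.zeros(n)
for _ in range(10):
    u = v_cycle(u, f, 1.0 / (n - 1))
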
Attention Modulation Using Short- and Long-Term Knowledge

A fast and reliable visual search is crucial for representing visual scenes. The modulation of bottom-up attention plays an important role here. Knowledge about target features is often used to bias the bottom-up pathway. In this paper we propose a system which not only makes use of knowledge about the target features, but also uses already acquired knowledge about objects in the current scene to speed up the visual search. The main ingredients are a relational short-term memory in combination with a semantic relational long-term memory and an adjustable bottom-up saliency. The focus of this work is to investigate mechanisms for using the system's memory efficiently. We show a proof-of-concept implementation working in a real-world environment and performing visual search tasks. It becomes clear that using the relational semantic memory in combination with spatial and feature modulation of the bottom-up path is beneficial for speeding up such search tasks.

Sven Rebhan, Florian Röhrbein, Julian Eggert, Edgar Körner
PCA Based 3D Shape Reconstruction of Human Foot Using Multiple Viewpoint Cameras

This article describes a multiple-camera-based method to reconstruct the 3D shape of a human foot. From a feet database, an initial 3D model of the foot, represented by a cloud of points, is built. In addition, some shape parameters, which characterize any foot at more than 92%, are defined using Principal Component Analysis. Then, the 3D model is adapted to the foot of interest captured in multiple images, based on "active shape models" methods, by applying some constraints (edge points' distance or color variance, for example). We focus here on the experimental part, where we demonstrate the efficiency of the proposed method on a plastic foot model and on real human feet with various shapes. We compare different ways to texture the foot and conclude that using projectors can drastically improve the reconstruction's accuracy. Based on the experimental results, we finally propose some improvements regarding system integration.

Edmée Amstutz, Tomoaki Teshima, Makoto Kimura, Masaaki Mochimaru, Hideo Saito
An On-Line Interactive Self-adaptive Image Classification Framework

In this paper we present a novel image classification framework, which is able to automatically re-configure and adapt its feature-driven classifiers and improve its performance based on user interaction during on-line processing mode. Special emphasis is placed on the generic applicability of the framework to arbitrary surface inspection systems. The basic components of the framework include: recognition of regions of interest (objects), adaptive feature extraction, dealing with hierarchical information in classification, initial batch training with redundancy deletion and feature selection components, on-line adaptation and refinement of the classifiers based on operators’ feedback, and resolving contradictory inputs from several operators by ensembling outputs from different individual classifiers. The paper presents an outline on each of these components and concludes with a thorough discussion of basic and improved off-line and on-line classification results for artificial data sets and real-world images recorded during a CD imprint production process.

Davy Sannen, Marnix Nuttin, Jim Smith, Muhammad Atif Tahir, Praminda Caleb-Solly, Edwin Lughofer, Christian Eitzinger
Communication-Aware Face Detection Using NoC Architecture

Face detection is an essential first step towards many advanced computer vision, biometric recognition and multimedia applications, such as face tracking, face recognition, and video surveillance. In this paper, we propose an FPGA hardware design with a NoC (Network-on-Chip) architecture based on an AdaBoost face detection algorithm. The AdaBoost-based method is the state-of-the-art face detection algorithm in terms of speed and detection rates, and the NoC provides an architecture with high communication capability. The design is verified on a Xilinx Virtex-II Pro FPGA platform. Simulation results show a speed of 40 frames per second, an improvement over the software implementation. The NoC architecture provides scalability, so that our proposed face detection method can be sped up by adding multiple classifier modules.

Hung-Chih Lai, Radu Marculescu, Marios Savvides, Tsuhan Chen

Calibration and Registration

Frontmatter
A System for Geometrically Constrained Single View Reconstruction

This paper presents an overview of a system for recovering 3D models corresponding to scenes for which only a single perspective image is available. The system encompasses a versatile set of semi-automatic single view reconstruction techniques and couples them with limited interactive user input in order to reconstruct textured 3D graphical models corresponding to the imaged input scenes. Such 3D models can serve as the digital content for supporting interactive multimedia and virtual reality applications. Furthermore, they can support novel applications in areas such as video games, 3D photography, visual metrology, computer-assisted study of art and crime scene reconstruction, etc.

Manolis I. A. Lourakis
Monocular Omnidirectional Visual Odometry for Outdoor Ground Vehicles

This paper describes an algorithm for visually computing the ego-motion of a vehicle relative to the road under the assumption of planar motion. The algorithm uses only images taken by a single omnidirectional camera mounted on the roof of the vehicle. The front ends of the system are two different trackers. The first one is a homography-based tracker that detects and matches robust scale invariant features that most likely belong to the ground plane. The second one uses an appearance based approach and gives high resolution estimates of the rotation of the vehicle. This 2D pose estimation method has been successfully applied to videos from an automotive platform. We give an example of camera trajectory estimated purely from omnidirectional images over a distance of 400 meters. For performance evaluation, the estimated path is superimposed onto an aerial image. In the end, we use image mosaicing to obtain a textured 2D reconstruction of the estimated path.

Davide Scaramuzza, Roland Siegwart
Eyes and Cameras Calibration for 3D World Gaze Detection

Gaze tracking is a promising research area with applications that range from advanced human-machine interaction systems to the study, modeling, and use of human attention processes in cognitive vision. In this paper we propose a novel approach for the calibration and use of a head-mounted dual-eye gaze tracker. Key aspects are a robust pupil tracking algorithm based on prediction from the infrared-LED Purkinje image position, and a new gaze localization method based on trifocal geometry considerations.

Stefano Marra, Fiora Pirri
Evaluating Multiview Reconstruction

We survey the state of evaluation in current multiview reconstruction algorithms, with a particular focus on uncalibrated reconstruction from video sequences. We introduce a new evaluation framework, with high-quality ground truth, as a vehicle for accelerating research in the area. Our source code is also freely available under the GNU General Public License (GPL), a first for complete end-to-end reconstruction systems.

Keir Mierle, W. James Maclean

Object Recognition and Tracking

Frontmatter
Detecting and Recognizing Abandoned Objects in Crowded Environments

In this paper we present a framework for detecting and recognizing abandoned objects in crowded environments. The two main components of the framework include background change detection and object recognition. Moving blocks are detected using dynamic thresholding of spatiotemporal texture changes. The background change detection is based on analyzing wavelet transform coefficients of non-overlapping and non-moving 3D texture blocks. Detected changed background becomes the region of interest which is scanned to recognize various objects under surveillance such as abandoned luggage. The object recognition is based on model histogram ratios of image gradient magnitude patches. Supervised learning of the objects is performed by support vector machine. Experimental results are demonstrated using various benchmark video sequences (PETS, CAVIAR, i-Lids) and an object category dataset (CalTech256).

Roland Miezianko, Dragoljub Pokrajac
Diagnostic System for Intestinal Motility Disfunctions Using Video Capsule Endoscopy

Wireless Video Capsule Endoscopy is a clinical technique consisting of the analysis of images from the intestine which are provided by an ingestible device with a camera attached to it. In this paper we propose an automatic system to diagnose severe intestinal motility dysfunctions using video endoscopy data. The system is based on the application of computer vision techniques within a machine learning framework in order to characterize diverse motility events from video sequences. We present experimental results that demonstrate the effectiveness of the proposed system and compare them with the ground truth provided by the gastroenterologists.

Santi Seguí, Laura Igual, Fernando Vilariño, Petia Radeva, Carolina Malagelada, Fernando Azpiroz, Jordi Vitrià
An Approach for Tracking the 3D Object Pose Using Two Object Points

In this paper, a novel and simple approach for tracking the object pose (position and orientation) using two object points, when the object is rotated about one of the axes of the reference coordinate system, is presented. The object rotation angle can be tracked over a range of 180° for object rotations around each axis of the reference coordinate system from an initial object pose. The two considered object points are arbitrary points of the object which can be uniquely identified in stereo images. Since the approach requires only two object points, it is advantageous for robotic applications where very few feature points can be obtained because of a lack of pattern information on the objects. The paper also presents results for the pose estimation of a meal tray in a rehabilitation robotics environment.

Sai Krishna Vuppala, Axel Gräser
Adaptive Motion-Based Gesture Recognition Interface for Mobile Phones

In this paper, we introduce a new vision based interaction technique for mobile phones. The user operates the interface by simply moving a finger in front of a camera. During these movements the finger is tracked using a method that embeds the Kalman filter and Expectation Maximization (EM) algorithms. Finger movements are interpreted as gestures using Hidden Markov Models (HMMs). This involves first creating a generic model of the gesture and then utilizing unsupervised Maximum a Posteriori (MAP) adaptation to improve the recognition rate for a specific user. Experiments conducted on a recognition task involving simple control commands clearly demonstrate the performance of our approach.

Jari Hannuksela, Mark Barnard, Pekka Sangi, Janne Heikkilä
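
The tracking core named in the abstract is a Kalman filter; a minimal constant-velocity sketch for a 2D finger position follows. The noise covariances are assumed values, and the paper's EM embedding and HMM gesture layer are omitted.

import numpy as np

dt = 1.0                                   # one frame
F = np.array([[1, 0, dt, 0],               # state: (x, y, vx, vy)
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],                # we observe position only
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2                       # process noise (assumed)
R = np.eye(2) * 1.0                        # measurement noise (assumed)

def kalman_step(x, P, z):
    x, P = F @ x, F @ P @ F.T + Q          # predict
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + K @ (z - H @ x)                # correct with measurement z
    P = (np.eye(4) - K @ H) @ P
    return x, P

# Per frame: x, P = kalman_step(x, P, detected_finger_position)
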
Weighted Dissociated Dipoles: An Extended Visual Feature Set

The complexity of any learning task depends both on the learning method and on finding a good data representation. In the concrete case of object recognition in computer vision, the representation of the images is one of the most important decisions in the design step. As a starting point, in this work we use the representation based on Haar-like filters, a biologically inspired feature set based on local intensity differences, which has been successfully applied to different object recognition tasks, such as pedestrian or face recognition problems. From this commonly used representation, we jump to dissociated dipoles, another biologically plausible representation which also includes non-local comparisons. After analyzing the benefits of both representations, we present a more general representation which brings together the good properties of the Haar-like and dissociated-dipole representations. Since these feature sets cannot be used with the classical AdaBoost approach due to computational limitations, an evolutionary learning algorithm is used to test them on different state-of-the-art object recognition problems. In addition, an extended statistical study of these results is performed in order to verify the relevance of these huge feature spaces.

Xavier Baró, Jordi Vitrià
Scene Classification Based on Multi-resolution Orientation Histogram of Gabor Features

This paper presents a scene classification method based on multi-resolution orientation histograms. In recent years, several scene classification methods have been proposed, because scene category information serves as context for object detection and recognition. Recent studies use local parts without topological information. However, middle-sized features with rough topological information are more effective for scene classification. For this purpose, we use orientation histograms with rough topological information. Since we do not know the appropriate subregion size for computing an orientation histogram, various subregion sizes are prepared, and a multi-resolution orientation histogram is developed. A Support Vector Machine is used to classify the scene category. To improve the accuracy, the similarity between orientation histograms on the same subregion is used effectively. The proposed method is evaluated with the same database and protocol as recent studies. We confirm that the proposed method outperforms recent scene classification methods.

Kazuhiro Hotta
Automatic Object Detection on Aerial Images Using Local Descriptors and Image Synthesis

The presented work aims at defining techniques for the detection and localisation of objects, such as aircraft against cluttered backgrounds, in aerial or satellite images. A boosting algorithm is used to select discriminating features, and a descriptor robust to background and target texture variations is introduced. Several classical descriptors have been studied and compared to the new descriptor, the HDHR, which is based on the assumption that targets and backgrounds have different textures. Image synthesis is then used to generate large amounts of learning data: AdaBoost thus has access to sufficiently representative data to take into account the variability of real operational scenes. The observed results prove that a vision system can be trained on adapted simulated data and yet be efficient on real images.

Xavier Perrotton, Marc Sturzel, Michel Roux
CEDD: Color and Edge Directivity Descriptor: A Compact Descriptor for Image Indexing and Retrieval

This paper deals with a new low-level feature that is extracted from images and can be used for indexing and retrieval. This feature is called the "Color and Edge Directivity Descriptor" (CEDD) and incorporates color and texture information in a histogram. The CEDD size is limited to 54 bytes per image, rendering this descriptor suitable for use in large image databases. One of the most important attributes of the CEDD is the low computational power needed for its extraction, in comparison with the needs of most MPEG-7 descriptors. The objective measure called ANMRR is used to evaluate the performance of the proposed feature. An online demo that implements the proposed feature in an image retrieval system is available at: http://orpheus.ee.duth.gr/image_retrieval.

Savvas A. Chatzichristofis, Yiannis S. Boutalis
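
The 54-byte figure follows from CEDD's standard layout of 144 histogram bins (6 texture areas x 24 colors), each quantized to 3 bits: 144 x 3 = 432 bits = 54 bytes. A sketch of the bit packing (the packing order is an assumption for illustration):

def pack_3bit(bins):
    # bins: 144 values already quantized to 0..7 (3 bits each).
    assert len(bins) == 144 and all(0 <= b < 8 for b in bins)
    acc = 0
    for b in reversed(bins):          # first bin ends up in the low bits
        acc = (acc << 3) | b
    return acc.to_bytes(54, "big")    # 432 bits fit exactly in 54 bytes

def unpack_3bit(blob):
    acc = int.from_bytes(blob, "big")
    out = []
    for _ in range(144):
        out.append(acc & 7)
        acc >>= 3
    return out
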
Ranking Corner Points by the Angular Difference between Dominant Edges

In this paper a variant of the Harris corner point detector is introduced. The new algorithm uses a covariance operator to compute the angular difference between dominant edges. A new cornerness strength function is then proposed by weighting the log-Harris cornerness function by the angular difference between dominant edges. An important advantage of the proposed corner detector is its ability to reduce false corner responses in image regions where partial derivatives have similar values. In addition, we show qualitatively that ranking corner points with the new cornerness strength function agrees better with the intuitive notion of a corner than the original Harris function. To demonstrate its performance, the new approach is applied to synthetic and real images. The results show that the proposed algorithm ranks the meaningful detected features better and at the same time reduces false positive detections compared to the original Harris algorithm.

Rafael Lemuz-López, Miguel Arias Estrada
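
A sketch of the two ingredients: standard Harris cornerness plus one plausible reading of "angular difference between dominant edges", taken here as the two strongest modes of a magnitude-weighted orientation histogram. The paper's covariance operator may differ from this proxy.

import numpy as np
from scipy.ndimage import sobel, gaussian_filter

def harris(img, sigma=1.5, k=0.04):
    ix, iy = sobel(img.astype(float), 1), sobel(img.astype(float), 0)
    a = gaussian_filter(ix * ix, sigma)
    b = gaussian_filter(ix * iy, sigma)
    c = gaussian_filter(iy * iy, sigma)
    return a * c - b * b - k * (a + c) ** 2    # det(M) - k * trace(M)^2

def dominant_edge_angle(img, y, x, win=5, bins=36):
    # Angular difference between the two strongest edge orientations
    # in a (2*win+1)^2 patch around (y, x), in [0, pi/2].
    g = img.astype(float)
    ix = sobel(g, 1)[y - win:y + win + 1, x - win:x + win + 1]
    iy = sobel(g, 0)[y - win:y + win + 1, x - win:x + win + 1]
    mag = np.hypot(ix, iy).ravel()
    ang = np.arctan2(iy, ix).ravel() % np.pi   # edges are mod 180 degrees
    h, edges = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    i, j = np.argsort(h)[-2:]                  # two dominant modes
    diff = abs(edges[i] - edges[j])
    return min(diff, np.pi - diff)

# Weighting log-Harris scores by this angle suppresses responses where
# the local gradients share a single orientation.
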
Skeletonization Based on Metrical Neighborhood Sequences

The skeleton is a shape descriptor which summarizes the general form of objects. It can be expressed in terms of the fundamental morphological operations. The limitation of that characterization is that its construction is based on digital disks, which cannot provide a good approximation to Euclidean disks. In this paper we define a new type of skeleton, based on neighborhood sequences, that is much closer to the Euclidean skeleton. A novel method for the quantitative comparison of skeletonization algorithms is also proposed.

Attila Fazekas, Kálmán Palágyi, György Kovács, Gábor Németh
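
The neighborhood-sequence device can be illustrated directly: alternating 4- and 8-neighborhood propagation yields an octagonal distance whose disks approximate Euclidean disks better than either metric alone. A plain-Python distance transform follows; the paper's skeleton construction on top of it is not reproduced.

N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def ns_distance(mask):
    # mask: 2D list of 0/1; returns each 1-cell's distance from the
    # nearest 0-cell under the neighborhood sequence N4, N8, N4, ...
    h, w = len(mask), len(mask[0])
    INF = float("inf")
    dist = [[0 if mask[y][x] == 0 else INF for x in range(w)]
            for y in range(h)]
    frontier = [(y, x) for y in range(h) for x in range(w)
                if mask[y][x] == 0]
    d = 0
    while frontier:
        nbrs = N4 if d % 2 == 0 else N8    # alternate the neighborhood
        d += 1
        nxt = []
        for y, x in frontier:
            for dy, dx in nbrs:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] == INF:
                    dist[ny][nx] = d
                    nxt.append((ny, nx))
        frontier = nxt
    return dist
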
Bottom-Up and Top-Down Object Matching Using Asynchronous Agents and a Contrario Principles

We experiment with a vision architecture for object matching based on a hierarchy of independent agents running asynchronously in parallel. Agents communicate through bidirectional signals, enabling the mix of top-down and bottom-up influences. Following the so-called a contrario principle, each signal is given a strength according to the statistical relevance of its associated visual data. By handling the most important signals first, the system focuses on the most promising hypotheses and provides relevant results as soon as possible. Compared to an equivalent feed-forward and sequential algorithm, our architecture is shown to be capable of handling more visual data and thus reach higher detection rates in less time.

Nicolas Burrus, Thierry M. Bernard, Jean-Michel Jolion
A Tale of Two Object Recognition Methods for Mobile Robots

Object recognition is a key feature for building robots capable of moving and performing tasks in human environments. However, current object recognition research largely ignores the problems that the mobile robot context introduces. This work addresses the problem of applying these techniques to mobile robotics in a typical household scenario. We select two state-of-the-art object recognition methods suitable for adaptation to mobile robots and evaluate them on a challenging dataset of typical household objects that caters to these requirements. The advantages and drawbacks found for each method are highlighted, and some ideas for extending them are proposed. Evaluation is done by comparing the number of detected objects and false positives for both approaches.

Arnau Ramisa, Shrihari Vasudevan, Davide Scaramuzza, Ramón López de Mántaras, Roland Siegwart
A Segmentation Approach in Novel Real Time 3D Plant Recognition System

One of the most invasive and persistent kinds of weed in agriculture is Rumex obtusifolius L., also called "broad-leaved dock". The plant originates from Europe and northern Asia, but it has also been reported to occur in wide parts of North America. Eradication of this plant is labour-intensive, and hence there is interest in automatic weed control devices. Some vision systems have been proposed that allow plants in the meadow to be localized and mapped. However, these systems were designed and implemented for off-line processing. This paper presents a segmentation approach that allows for real-time recognition and application of herbicides onto the plant leaves. Instead of processing gray-scale or colour images, our approach relies on 3D point cloud analysis and processing. 3D data processing has several advantages over 2D image processing approaches when it comes to the extraction and recognition of plants in their natural environment.

Dejan Šeatović
Face Recognition Using a Color PCA Framework

This paper delves into the problem of face recognition using color as an important cue for improving recognition accuracy. To perform recognition of color images, we use the characteristics of a 3D color tensor to generate a subspace, which in turn can be used to recognize a new probe image. To test the accuracy of our methodology, we computed the recognition rate across two color face databases and also compared our results against a multi-class neural network model. We observe that the use of the color subspace improves recognition accuracy over the standard gray-scale 2D-PCA approach [17] and the 2-layer feed-forward neural network model with 15 hidden nodes. Additionally, due to the computational efficiency of this algorithm, the entire system can be deployed with a considerably short turnaround time between the training and testing stages.

Mani Thomas, Senthil Kumar, Chandra Kambhamettu
Online Learning for Bootstrapping of Object Recognition and Localization in a Biologically Motivated Architecture

We present a modular architecture for recognition and localization of objects in a scene that is motivated from coupling the ventral (“what”) and dorsal (“where”) pathways of human visual processing. Our main target is to demonstrate how online learning can be used to bootstrap the representation from nonspecific cues like stereo depth towards object-specific representations for recognition and detection. We show the realization of the system learning objects in a complex real-world environment and investigate its performance.

Heiko Wersing, Stephan Kirstein, Bernd Schneiders, Ute Bauer-Wersing, Edgar Körner
Vein Segmentation in Infrared Images Using Compound Enhancing and Crisp Clustering

In this paper an efficient, fully automatic method for finger vein pattern extraction is presented, using the second-order local structure of infrared images. In a sequence of processing steps, the vein structure is normalized and enhanced, also eliminating the fingerprint lines using wavelet decomposition methods. A compound filter which handles the second-order local structure and exploits the multidirectional matched-filter response in the direction of the smallest curvature is used in order to enrich the vein patterns. Edge suppression decreases the number of edges misclassified as veins in the forthcoming crisp clustering step. In a postprocessing module, a morphological majority filter is applied to the segmented image to smooth the contours and remove small isolated regions, and a reconstruction process reduces the outliers in the finger vein pattern. The proposed method was evaluated on a small database of infrared images, giving excellent detection accuracy for vein patterns.

Marios Vlachos, Evangelos Dermatas
Multiscale Laplacian Operators for Feature Extraction on Irregularly Distributed 3-D Range Data

Multiscale feature extraction in image data has been investigated for many years. More recently, the problem of processing images containing irregularly distributed data has become prominent. We present a multiscale Laplacian approach that can be applied directly to irregularly distributed data; in particular, we focus on irregularly distributed 3D range data. Our results illustrate that the approach works well over a range of irregular distributions and that the use of Laplacian operators on range data is much less susceptible to noise than the equivalent operators used on intensity data.

Shanmugalingam Suganthan, Sonya Coleman, Bryan Scotney

Learning

Frontmatter
A System That Learns to Tag Videos by Watching Youtube

We present a system that automatically tags videos, i.e. detects high-level semantic concepts like objects or actions in them. To do so, our system does not rely on datasets manually annotated for research purposes. Instead, we propose to use videos from online portals like youtube.com as a novel source of training data, whereas tags provided by users during upload serve as ground truth annotations. This allows our system to learn autonomously by automatically downloading its training set.

The key contribution of this work is a number of large-scale quantitative experiments on real-world online videos, in which we investigate the influence of the individual system components, and how well our tagger generalizes to novel content. Our key results are: (1) Fair tagging results can be obtained by a late fusion of several kinds of visual features. (2) Using more than one keyframe per shot is helpful. (3) To generalize to different video content (e.g., another video portal), the system can be adapted by expanding its training set.

Adrian Ulges, Christian Schulze, Daniel Keysers, Thomas M. Breuel
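
Key result (1) above is a late fusion of per-feature classifiers; a minimal sketch, assuming each visual feature yields per-concept probability scores and that the fusion is a weighted mean (the paper's exact rule is not given in the abstract):

import numpy as np

def late_fusion(scores, weights=None):
    # scores: dict mapping feature name -> per-concept probability array.
    names = sorted(scores)
    w = np.ones(len(names)) if weights is None else np.asarray(weights, float)
    stacked = np.stack([np.asarray(scores[n], float) for n in names])
    fused = (w[:, None] * stacked).sum(axis=0) / w.sum()
    return fused / fused.sum()

# e.g. color and texture classifiers disagreeing over two tags:
p = late_fusion({"color": [0.7, 0.3], "texture": [0.4, 0.6]})
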
Geo-located Image Grouping Using Latent Descriptions

Image categorization is undoubtedly one of the most challenging problems faced in Computer Vision. The related literature offers plenty of methods dedicated to specific classes of images; further, commercial systems are also starting to be advertised in the market. Nowadays, additional data can be associated with images, enriching their semantic interpretation beyond pure appearance. This is the case for geo-location data, which contain information about the geographical place where an image was captured. These data allow, if not require, a different management of the images, for instance for the purpose of easy retrieval and visualization from a geo-referenced image repository. This paper constitutes a first step in this direction, presenting a method for geo-referenced image categorization. The solution presented here sits within the wide literature on statistical latent descriptions, of which probabilistic Latent Semantic Analysis (pLSA) is one of the best-known representatives. In particular, we extend the pLSA paradigm, introducing a latent variable modelling the geographical area in which an image was captured. In this way, we are able to describe the entire image dataset, effectively grouping proximal images with similar appearance. Experiments on categorization have been carried out employing a well-known geographical image repository: the results are very promising, opening new interesting challenges and applications in this research field.

Marco Cristani, Alessandro Perina, Vittorio Murino
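
For reference, here is the vanilla pLSA that the paper extends (its geographical latent variable is an extra factor not shown here): EM in NumPy over an images-by-visual-words count matrix.

import numpy as np

def plsa(counts, Z, iters=50, seed=0):
    # counts: D x W matrix n(d, w); returns P(z|d) and P(w|z).
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    pz_d = rng.random((D, Z)); pz_d /= pz_d.sum(1, keepdims=True)
    pw_z = rng.random((Z, W)); pw_z /= pw_z.sum(1, keepdims=True)
    for _ in range(iters):
        # E-step: P(z|d,w) proportional to P(z|d) * P(w|z)
        joint = pz_d[:, :, None] * pw_z[None, :, :]     # D x Z x W
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both factors from expected counts
        nz = counts[:, None, :] * joint                 # D x Z x W
        pw_z = nz.sum(axis=0)
        pw_z /= pw_z.sum(axis=1, keepdims=True)
        pz_d = nz.sum(axis=2)
        pz_d /= pz_d.sum(axis=1, keepdims=True)
    return pz_d, pw_z
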
Functional Object Class Detection Based on Learned Affordance Cues

Current approaches to visual object class detection mainly focus on the recognition of basic-level categories, such as cars, motorbikes, mugs and bottles. Although these approaches have demonstrated impressive performance in terms of recognition, their restriction to these categories seems inadequate in the context of embodied, cognitive agents. Here, distinguishing objects according to functional aspects based on object affordances is important in order to enable manipulation of, and interaction between, physical objects and the cognitive agent.

In this paper, we propose a system for the detection of functional object classes, based on a representation of visually distinct hints on object affordances (affordance cues). It spans the complete range from tutor-driven acquisition of affordance cues, through learning of corresponding object models, to detecting novel instances of functional object classes in real images.

Michael Stark, Philipp Lies, Michael Zillich, Jeremy Wyatt, Bernt Schiele
Increasing Classification Robustness with Adaptive Features

In machine vision, features are the basis for almost any kind of high-level postprocessing, such as classification. A new method is developed that uses the inherent flexibility of feature calculation to optimize the features for a certain classification task. By tuning the parameters of the feature calculation, the accuracy of a subsequent classification can be significantly improved and the decision boundaries can be simplified. The focus of the method is on surface inspection problems and the features and classifiers used for these applications.

Christian Eitzinger, Manfred Gmainer, Wolfgang Heidl, Edwin Lughofer
Learning Visual Quality Inspection from Multiple Humans Using Ensembles of Classifiers

Visual quality inspection systems nowadays require the highest possible flexibility. Therefore, the reality that multiple human operators may be training the system has to be taken into account. This paper provides an analysis of this problem and presents a framework which is able to learn from multiple humans. This approach has important advantages over systems which are unable to do so, such as a consistent level of quality of the products, the ability to give operator-specific feedback, the ability to capture the knowledge of every operator separately and an easier training of the system.

The level of contradiction between the decisions of the operators is assessed for data obtained from a real-world industrial system for visual quality inspection of the printing of labels on CDs, which was labelled separately by five different operators. The results of the experiments show that the system is able to resolve many of the contradictions which are present in the data. Furthermore, it is shown that in several cases the system even performs better than a classifier which is trained on the data provided by the supervisor itself.

Davy Sannen, Hendrik Van Brussel, Marnix Nuttin
Learning Contextual Variations for Video Segmentation

This paper deals with video segmentation in vision systems. We focus on the maintenance of background models in long-term videos of changing environments, which is still a real challenge in video surveillance. We propose an original weakly supervised method for learning contextual variations in videos. Our approach uses a clustering algorithm to automatically identify different contexts based on image content analysis. Then, state-of-the-art video segmentation algorithms (e.g., codebook, MoG) are trained on each cluster. The goal is to achieve a dynamic selection of background models. We have evaluated our approach on a long video sequence (24 hours). The presented results show the segmentation improvement of our approach compared to codebook and MoG.

Vincent Martin, Monique Thonnat
Learning to Detect Aircraft at Low Resolutions

An application of the Viola-Jones object detector to the problem of aircraft detection is presented. This approach is based on machine learning rather than the morphological filtering mainly used in previous work. Aircraft detection using computer vision methods is a challenging problem, since target aircraft can vary from subpixel size to a few pixels in size and the background can be heavily cluttered. Such a system can be part of a collision avoidance system to warn pilots of potential collisions. Initial results suggest that this (static) approach on a frame-by-frame basis achieves a detection rate of about 80% and a false positive rate comparable with other approaches that use morphological filtering followed by a tracking stage. The system was evaluated on over 15000 frames extracted from real video sequences recorded by NASA and has the potential for real-time performance.

Stavros Petridis, Christopher Geyer, Sanjiv Singh
A Novel Feature Selection Based Semi-supervised Method for Image Classification

Automated surface inspection of products as part of a manufacturing quality control process involves the application of image processing routines to segment regions of interest (ROIs) or objects which correspond to potential defects on the product or part. In this type of application, it is not known in advance how many ROIs may be segmented from images, and so classification algorithms mainly make use of only image-level features, ignoring important object-level information. In this paper, we investigate how to preprocess high-dimensional object-level features through an unsupervised learning system and present the outputs of that system as additional image-level features to the supervised learning system. Novel semi-supervised approaches based on K-Means/Tabu Search (TS) and SOM/Genetic Algorithm (GA), with C4.5 as the supervised classifier, are proposed in this paper. The proposed algorithms are then applied to a real-world CD/DVD inspection system. Results indicate an increase in performance in terms of classification accuracy when compared with various existing approaches.

Muhammad Atif Tahir, James E. Smith, Praminda Caleb-Solly
Sub-class Error-Correcting Output Codes

A common way to model multi-class classification problems is by means of Error-Correcting Output Codes (ECOC). One of the main requirements of the ECOC design is that the base classifier is capable of splitting each sub-group of classes in each binary problem. In this paper, we present a novel strategy to model multi-class classification problems using sub-class information in the ECOC framework. Complex problems are solved by splitting the original set of classes into sub-classes and embedding the binary problems in a problem-dependent ECOC design. Experimental results on a set of UCI data sets and on a real multi-class traffic sign categorization problem show that the proposed splitting procedure yields a better performance when the class overlap or the distribution of the training objects conceals the decision boundaries from the base classifier.

Sergio Escalera, Oriol Pujol, Petia Radeva
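
The ECOC machinery in one picture: classes get codewords over a set of binary problems, and decoding assigns the class whose codeword is nearest to the observed binary predictions. A toy ternary example follows; the paper's contribution, splitting classes into sub-classes before coding, happens upstream of this step.

import numpy as np

# Rows = classes, columns = binary problems; 0 marks classes a given
# binary problem ignores (ternary one-vs-one style coding).
M = np.array([[+1, +1,  0],
              [-1,  0, +1],
              [ 0, -1, -1]])

def ecoc_decode(pred):
    # Hamming-style distance that skips zero entries of each codeword.
    pred = np.asarray(pred)
    dist = ((M != 0) & (M != pred)).sum(axis=1)
    return int(dist.argmin())

print(ecoc_decode([+1, +1, -1]))   # -> class 0
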

Human Machine Interaction

Frontmatter
Spatio-temporal 3D Pose Estimation of Objects in Stereo Images

In this contribution we describe a vision system for model-based 3D detection and spatio-temporal pose estimation of objects in cluttered scenes. As low-level features, our approach requires 3D depth points along with information about their motion and the direction of the local intensity gradient. We extract these features by spacetime stereo based on local image intensity modelling. After applying a graph-based clustering approach to obtain an initial separation between the background and the object, a 3D model is adapted to the 3D point cloud using an ICP-like optimisation technique, yielding the translational, rotational, and internal degrees of freedom of the object. We introduce an extended constraint-line approach which allows the temporal derivatives of the translational and rotational pose parameters to be estimated directly from the spacetime stereo data. Our system is evaluated in the scenario of person-independent "tracking by detection" of the hand-forearm limb moving in a non-uniform manner through a cluttered scene. The temporal derivatives of the current pose parameters are used for initialisation in the subsequent image. Typical accuracies of the estimation of pose differences between subsequent images are 1-3 mm for the translational motion, which is comparable to the pixel resolution, and 1-3 degrees for the rotational motion.

Björn Barrois, Christian Wöhler
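
The "ICP-like optimisation" rests on a closed-form rigid alignment step; a sketch of that core (Kabsch/SVD) plus one nearest-neighbour sweep, ignoring the paper's internal degrees of freedom and spacetime constraints:

import numpy as np

def align_rigid(P, Q):
    # Least-squares rotation R and translation t mapping point set P
    # onto Q (both N x 3, rows already in correspondence).
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)               # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T                      # proper rotation, det = +1
    return R, cq - R @ cp

def icp_iteration(model, scene):
    # One sweep: match each model point to its nearest scene point,
    # re-align, and return the transformed model points.
    d = ((model[:, None, :] - scene[None, :, :]) ** 2).sum(axis=-1)
    R, t = align_rigid(model, scene[d.argmin(axis=1)])
    return model @ R.T + t
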
Automatic Initialization for Facial Analysis in Interactive Robotics

The human face plays an important role in communication, as it allows us to discern different interaction partners and provides non-verbal feedback. In this paper, we present a soft real-time vision system that enables an interactive robot to analyze the faces of interaction partners, not only to identify them, but also to recognize their respective facial expressions as a dialog-controlling non-verbal cue. In order to ensure applicability in real-world environments, a robust detection scheme is presented which detects faces and basic facial features such as the position of the mouth, nose, and eyes. Based on these detected features, facial parameters are extracted using active appearance models (AAMs) and conveyed to support vector machine (SVM) classifiers to identify both persons and facial expressions. This paper focuses on four different initialization methods for determining the initial shape for the AAM algorithm and their respective performance in two different classification tasks, with respect to both the facial expression DaFEx database and real-world data obtained from a robot's point of view.

Ahmad Rabie, Christian Lang, Marc Hanheide, Modesto Castrillón-Santana, Gerhard Sagerer
Face Recognition Across Pose Using View Based Active Appearance Models (VBAAMs) on CMU Multi-PIE Dataset

In this paper we address the challenge of performing face recognition on a probe set of non-frontal images by performing automatic pose correction using Active Appearance Models (AAMs) and matching against an enrollment gallery of frontal images. Active Appearance Models are used to register and fit the model to extract 79 facial fiducial points, which are then used to partition the face into a wire-mesh of triangular polygons used to warp the facial image to a frontal facial mesh pose. We extend this to View-Based Active Appearance Models (VBAAMs), which are able to represent a preset number of poses better than a single AAM synthesized to handle all possible pose variations. We demonstrate that our approach is able to achieve high performance on the new, larger CMU Multi-PIE dataset, using 249 different people with 15 different pose angles and 20 different illumination variations under 2 different expressions (a total of 149400 images). We show that our proposed pose correction approach can improve the recognition performance of many baseline algorithms, such as PCA, LDA, and Kernel Discriminant Analysis (KDA), on the CMU Multi-PIE dataset.

Jingu Heo, Marios Savvides

Cross Modal Systems

Frontmatter
Object Category Detection Using Audio-Visual Cues

Categorization is one of the fundamental building blocks of cognitive systems. Object categorization has traditionally been addressed in the vision domain, even though cognitive agents are intrinsically multimodal. Indeed, biological systems combine several modalities in order to achieve robust categorization. In this paper we propose a multimodal approach to object category detection, using audio and visual information. The auditory channel is modeled on biologically motivated spectral features via a discriminative classifier. The visual channel is modeled by a state-of-the-art part-based model. Multimodality is achieved using two fusion schemes, one high-level and the other low-level. Experiments on six different object categories, under increasingly difficult conditions, show the strengths and weaknesses of the two approaches, and clearly underline the open challenges for multimodal category detection.

Jie Luo, Barbara Caputo, Alon Zweig, Jörg-Hendrik Bach, Jörn Anemüller
Multimodal Interaction Abilities for a Robot Companion

Among the cognitive abilities a robot companion must be endowed with, human perception and speech understanding are both fundamental in the context of multimodal human-robot interaction. In order to provide a mobile robot with visual perception of its user and the means to handle verbal and multimodal communication, we have developed and integrated two components. In this paper we focus on an interactively distributed multiple-object tracker dedicated to two-handed gestures and head location in 3D. Its relevance is highlighted by on- and off-line evaluations on data acquired by the robot. Implementation and preliminary experiments on a household robot companion, including speech recognition and understanding as well as basic fusion with gesture, are then demonstrated. The latter illustrate how vision can assist speech by specifying location references and object/person IDs in verbal statements in order to interpret natural deictic commands given by human beings. Extensions of our work are finally discussed.

Brice Burger, Isabelle Ferrané, Frédéric Lerasle
Backmatter
Metadata
Title
Computer Vision Systems
Editors
Antonios Gasteratos
Markus Vincze
John K. Tsotsos
Copyright Year
2008
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-79547-6
Print ISBN
978-3-540-79546-9
DOI
https://doi.org/10.1007/978-3-540-79547-6
