
About this Book

The two volume set, consisting of LNCS 7728 and 7729, contains the carefully reviewed and selected papers presented at the nine workshops that were held in conjunction with the 11th Asian Conference on Computer Vision, ACCV 2012, in Daejeon, South Korea, in November 2012. From a total of 310 papers submitted, 78 were selected for presentation. LNCS 7728 contains the papers selected for the International Workshop on Computer Vision with Local Binary Pattern Variants, the Workshop on Computational Photography and Low-Level Vision, the Workshop on Developer-Centered Computer Vision, and the Workshop on Background Models Challenge. LNCS 7729 contains the papers selected for the Workshop on e-Heritage, the Workshop on Color Depth Fusion in Computer Vision, the Workshop on Face Analysis, the Workshop on Detection and Tracking in Challenging Environments, and the International Workshop on Intelligent Mobile Vision.

Table of Contents

Frontmatter

Workshop on e-Heritage

Historical Document Binarization Based on Phase Information of Images

In this paper, phase congruency features are used to develop a binarization method for degraded documents and manuscripts. Gaussian and median filtering are also used to improve the final binarized output: the Gaussian filter further enhances the output, while the median filter removes noise. To detect bleed-through degradation, a feature map based on regional minima is proposed and used. The proposed binarization method produces output binary images with high recall values and competitive precision values. Promising experimental results were obtained on the DIBCO’09, H-DIBCO’10 and DIBCO’11 datasets, demonstrating the robustness of the proposed binarization method against a large number of different types of degradation.

Hossein Ziaei Nafchi, Reza Farrahi Moghaddam, Mohamed Cheriet
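
A minimal sketch of the filtering stage in Python, assuming the phase-congruency response map has already been computed and is passed in as an array (the threshold and filter sizes here are illustrative assumptions, not the paper's tuned values):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter

def postprocess_binarization(pc_map, threshold=0.5):
    """Smooth a phase-congruency response map with a Gaussian filter,
    binarize it, then remove speckle with a median filter."""
    smoothed = gaussian_filter(pc_map, sigma=1.0)       # enhance the response map
    binary = (smoothed > threshold).astype(np.uint8)    # text = 1, background = 0
    return median_filter(binary, size=3)                # suppress isolated noise pixels
```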

Can Modern Technologies Defeat Nazi Censorship?

Censorship of parts of written text was and is a common practice in totalitarian regimes. It is used to destroy information not approved by the political power. Recovering the censored text is of interest for historical studies of the text. This paper raises the question of whether a censored postcard from 1942 can be made legible by applying multispectral imaging in combination with laser cleaning. In the fields of art conservation (e.g. color measurements), investigation (e.g. analysis of underdrawings in paintings), and historical document analysis, multispectral imaging techniques have been applied successfully to reveal information hidden from the human eye.

The basic principle of laser cleaning is to transfer laser pulse energy to a contamination layer by an absorption process that leads to heating and evaporation of the layer. Partial laser cleaning of postcards is possible; dirt on the surface can be removed and the obscured pictures and writings made visible again. We applied both techniques to the postcard. The text could not be restored since the original ink seems to have suffered severe chemical damage.

Simone Pentzien, Ira Rabin, Oliver Hahn, Jörg Krüger, Florian Kleber, Fabian Hollaus, Markus Diem, Robert Sablatnig

Coarse-to-Fine Correspondence Search for Classifying Ancient Coins

In this paper, we build upon the idea of using robust dense correspondence estimation for exemplar-based image classification and adapt it to the problem of ancient coin classification. We thus account for the lack of available training data and demonstrate that the matching costs are a powerful dissimilarity metric for coin classification with training set sizes of one or two images per class. This is accomplished by using a flexible dense correspondence search which is highly insensitive to local spatial differences between coins of the same class and to different coin rotations between images. Additionally, we introduce a coarse-to-fine classification scheme to decrease the runtime, which would otherwise be linear in the number of classes in the training set. For evaluation, a new dataset representing 60 coin classes of the Roman Republican period is used. The proposed system achieves a classification rate of 83.3% and a runtime improvement of 93% through the coarse-to-fine classification.

Sebastian Zambanini, Martin Kampel

Archiving Mural Paintings Using an Ontology Based Approach

In this paper, we propose an archiving scheme for heritage mural paintings. The mural paintings typically depict stories from folk-lore, mythology and history. These narratives provide content-based correlations between different pieces of art. Our e-heritage scheme for archiving the mural paintings is based on an ontology which captures the background knowledge of these narratives. Media features and patterns derived from the mural content are used to enrich the ontology with multimedia data. We have used the multimedia web ontology language as our ontology representation scheme, as it allows perceptual modelling of domain concepts in terms of their media properties, as well as reasoning with uncertainties. Besides the mural content and its knowledge, the ontology also helps encode other aspects of the mural paintings like their painting style, color, physical location, time-period, etc., which are important parameters of their preservation. We propose a framework to provide cross-modal semantic linkage between semantically annotated content of a repository of Indian mural paintings, and a collection of labelled text documents of their narratives. This framework, based on a multimedia ontology of the domain, helps preserve the cultural heritage encoded in these artefacts.

Anupama Mallik, Santanu Chaudhury, Shipra Madan, T. B. Dinesh, Uma V. Chandru

Robust Image Deblurring Using Hyper Laplacian Model

In recent years, many image deblurring algorithms have been proposed, most of which assume that the noise in the deblurring process follows a Gaussian distribution. However, non-Gaussian noise is often unavoidable in practice, in both non-blind and blind image deblurring, due to errors in the input kernel and outliers in the blurry image. Without properly handling these outliers, the images recovered by previous methods suffer severe artifacts. In this paper, we deal with two kinds of non-Gaussian noise in the image deblurring process, inaccurate kernels and compressed blurry images, and find that modeling the noise with a Laplacian distribution gives more robust results in these cases. Based on this observation, new non-blind and blind image deblurring algorithms are proposed to restore the clear image. To obtain a more robust deblurred result, we also use gradients of the image in 8 directions to estimate the blur kernel. The new minimization problem can be efficiently solved by Iteratively Reweighted Least Squares (IRLS), and experimental results on both synthesized and real-world images show the efficiency and robustness of our algorithm.

Yuquan Xu, Xiyuan Hu, Silong Peng
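
The IRLS step can be illustrated with a small 1-D sketch: a Laplacian (L1) data term is minimized by repeatedly solving a reweighted least-squares problem. The explicit convolution matrix and the small damping term are illustrative assumptions, not the paper's full formulation:

```python
import numpy as np
from scipy.linalg import toeplitz

def irls_deconvolve(y, kernel, n_iter=20, eps=1e-6, damp=1e-3):
    """1-D IRLS sketch for a Laplacian data term: minimize sum_i |(Kx - y)_i|
    by solving weighted least squares, reweighting each residual by 1/|r_i|."""
    n = len(y)
    col = np.zeros(n)
    col[:len(kernel)] = kernel
    K = toeplitz(col, np.zeros(n))                 # truncated convolution matrix
    x = y.copy()
    for _ in range(n_iter):
        r = K @ x - y
        w = 1.0 / np.maximum(np.abs(r), eps)       # L1 reweighting
        KtW = K.T * w                              # equals K.T @ diag(w)
        x = np.linalg.solve(KtW @ K + damp * np.eye(n), KtW @ y)
    return x
```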

SVD Based Automatic Detection of Target Regions for Image Inpainting

We are often required to retouch images in order to improve their visual appearance by removing visual discontinuities such as breaks and damaged regions. Such retouching may be achieved by inpainting. Current techniques for image inpainting require the user to manually select the target regions to be inpainted. Very few techniques for automatically detecting the target regions for inpainting, i.e. for detecting actual damage or alteration in a given photograph, have been reported in the literature. In this paper, we propose a novel Singular Value Decomposition (SVD) based technique for automatic detection of the damaged regions in the photographed object or scene, for the purpose of digitally restoring them to their entirety using inpainting. Results on an exhaustive set of images suggest that the mask generated using the proposed technique can be suitably used for inpainting to digitally restore the given images.

Milind G. Padalkar, Mukesh A. Zaveri, Manjunath V. Joshi
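
To convey the intuition, here is a sketch (under our own simplifying assumptions, not the authors' exact procedure) of how an SVD residual can expose regions that break an image's dominant structure:

```python
import numpy as np

def svd_residual_map(img, rank=10):
    """Reconstruct an image from its top-`rank` singular components and
    return the absolute residual; damage-like regions that deviate from
    the dominant global structure yield large residuals."""
    U, s, Vt = np.linalg.svd(img.astype(float), full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # rank-limited approximation
    return np.abs(img - low_rank)                     # candidate damage mask source
```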

Presentation of Japanese Cultural Event Using Virtual Reality

With the development of computer graphics and virtual reality technologies, extensive research has been carried out on the digital archiving of cultural assets. In this paper, we introduce our work on presenting a traditional Japanese cultural event, the Yamahoko Parade of the Kyoto Gion Festival, using the latest technologies such as 3D CG modeling, motion capture, high-quality sound recording, a vibration system, an immersive virtual environment, and real-time interaction. This work is part of the digital museum project, which aims to preserve and present the culture and traditions of the city of Kyoto.

Liang Li, Woong Choi, Kozaburo Hachimura, Takanobu Nishiura, Keiji Yano

High-Resolution and Multi-spectral Capturing for Digital Archiving of Large 3D Woven Cultural Artifacts

We propose high-resolution, multi-spectral capturing for the digital archiving of large 3D woven cultural artifacts. In the field of digital archiving, it is important to measure, model, and represent the shape, color, and texture of a cultural artifact at high definition, capturing not only its physical appearance but also its haptic impression. Many of the decorative hangings on the Fune-hoko in the “Gion Festival in Kyoto” are very large and stuffed with cotton, so they have a very pronounced 3D shape. Therefore, high-resolution multi-spectral capturing and large-scale 3D measurement are necessary for the digital archiving of large 3D woven cultural artifacts. We captured high-resolution images with a low-cost, two-shot 6-band image capturing system and modeled the woven cultural artifacts in 3D. This paper describes a wheel-rail 3D measurement system, a capturing system with a multi-band camera, and the 3D modeling of large woven cultural artifacts, and shows a high-resolution 3D model with multi-band images.

Wataru Wakita, Masaru Tsuchida, Shiro Tanaka, Takahito Kawanishi, Kunio Kashino, Junji Yamato, Hiromi T. Tanaka

ACCV Workshop on Color Depth Fusion in Computer Vision

Exploring High-Level Plane Primitives for Indoor 3D Reconstruction with a Hand-held RGB-D Camera

Given a hand-held RGB-D camera (e.g. Kinect), methods such as Structure from Motion (SfM) and Iterative Closest Point (ICP) perform poorly when reconstructing indoor scenes with few image features or little geometric structure. In this paper, we propose to extract high-level primitives, namely planes, from an RGB-D camera, in addition to low-level image features (e.g. SIFT), to better constrain the problem and improve indoor 3D reconstruction. Our work makes two major contributions: first, for frame-to-frame matching, we propose a new scheme which takes into account both low-level appearance feature correspondences in the RGB image and high-level plane correspondences in the depth image. Second, in the global bundle adjustment step, we formulate a novel error measurement that takes into account not only the traditional 3D point re-projection errors but also the planar surface alignment errors. We demonstrate with real datasets that our method with plane constraints achieves more accurate and more appealing results compared with other state-of-the-art scene reconstruction algorithms in the aforementioned challenging indoor scenarios.

Mingsong Dou, Li Guan, Jan-Michael Frahm, Henry Fuchs

A Fusion Framework of Stereo Vision and Kinect for High-Quality Dense Depth Maps

We present a fusion framework of stereo vision and Kinect for high-quality dense depth maps. The fusion problem is formulated as maximum a posteriori estimation of a Markov random field using the Bayes rule. We design a global energy function with a novel data term, which provides a reasonable, straightforward and scalable way to fuse stereo vision and the depth data from Kinect. In particular, visibility and pixelwise noise of the Kinect depth data are taken into account in our fusion approach. Experimental results demonstrate the effectiveness and accuracy of the proposed framework.

Yucheng Wang, Yunde Jia
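
In generic form (a sketch only; the paper's specific data term is not reproduced here), such a MAP estimate of the depth field D minimizes an MRF energy that combines a stereo matching cost, a visibility-gated Kinect consistency term, and a pairwise smoothness prior:

```latex
\hat{D} = \arg\min_{D} \sum_{p} \Big( C_{\mathrm{stereo}}(p, D_p)
        + \lambda \, v_p \, \rho\!\left(D_p - Z_p\right) \Big)
        + \sum_{(p,q) \in \mathcal{N}} V\!\left(D_p, D_q\right)
```

where Z_p is the Kinect depth at pixel p, v_p ∈ {0, 1} flags its visibility, ρ(·) is a robust penalty standing in for pixelwise sensor noise, and N is the pixel neighbourhood system.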

Robust Fall Detection by Combining 3D Data and Fuzzy Logic

Falls are a major risk for the elderly, and immediate help is needed when they occur. Elderly people, especially those suffering from dementia, are often unable to react to emergency situations properly, so falls need to be detected automatically. We present an overview of different classes of fall detection approaches and introduce a vision-based approach. We propose the use of a Kinect to obtain 3D data, combined with fuzzy logic, for robust fall detection, and show that our approach outperforms current state-of-the-art algorithms. Our approach is evaluated on 72 video sequences, containing 40 falls and 32 activities of daily living.

Rainer Planinc, Martin Kampel

KinectAvatar: Fully Automatic Body Capture Using a Single Kinect

We present a novel scanning system for capturing a full 3D human body model using just a single depth camera and no auxiliary equipment. We claim that data captured from a single Kinect is sufficient to produce a good quality full 3D human model. In this setting, the challenges we face are the sensor’s low resolution with random noise and the subject’s non-rigid movement when capturing the data. To overcome these challenges, we develop an improved super-resolution algorithm that takes color constraints into account. We then align the super-resolved scans using a combination of automatic rigid and non-rigid registration. As the system is of low price and obtains impressive results in several minutes, full 3D human body scanning technology can now become more accessible to everyday users at home.

Yan Cui, Will Chang, Tobias Nöll, Didier Stricker

Essential Body-Joint and Atomic Action Detection for Human Activity Recognition Using Longest Common Subsequence Algorithm

We present an effective algorithm to detect essential body-joints and their corresponding atomic actions from a series of human activity data for efficient human activity recognition/classification. Our human activity data is captured by an RGB-D camera (a Kinect), where human skeletons are detected and provided by the Kinect SDK. Unique in our approach is the novel encoding that effectively converts skeleton data into a symbolic sequence representation, which allows us to detect the essential atomic actions of different human activities through longest common subsequence extraction. Our experimental results show that, through atomic action detection, we can recognize human activity that consists of complicated actions. In addition, since our approach is “simple”, our human activity recognition algorithm can be performed in real-time.

Sou-Young Jin, Ho-Jin Choi
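
The longest-common-subsequence extraction at the core of the approach is a standard dynamic program; a minimal sketch over symbolic action sequences (the encoding itself is the paper's contribution, so plain strings are assumed here):

```python
def lcs(a, b):
    """Return one longest common subsequence of symbol sequences a and b
    via the classic O(len(a) * len(b)) dynamic program."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    out, i, j = [], m, n                      # backtrack to recover one LCS
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

# Hypothetical symbolic activity sequences:
# lcs(["raise-arm", "bend", "step"], ["bend", "raise-arm", "step"])
# -> ["raise-arm", "step"]
```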

Exploiting Depth and Intensity Information for Head Pose Estimation with Random Forests and Tensor Models

Real-time accurate head pose estimation is required for several applications. Methods based on 2D images might not provide accurate and robust head pose measurements due to large head pose variations and illumination changes. Robust and accurate head pose estimation can be achieved by integrating intensity and depth information. In this paper we introduce a head pose estimation system that employs random forests and tensor regression algorithms. The former allow the modeling of large head pose variations using large sets of training data, while the latter allow the estimation of more accurate head pose parameters. The combination of the above mentioned methods results in more robust and accurate predictions for large head pose variations. We also study the fusion of different sources of information (intensity and depth images) to determine how their combination affects the performance of a head pose estimation system. The efficiency of the proposed framework is tested on the Biwi Kinect Head Pose dataset, where it is shown that the proposed methodology outperforms typical random forests.

Sertan Kaymak, Ioannis Patras

Dynamic Hand Shape Manifold Embedding and Tracking from Depth Maps

Hand shapes vary with viewpoint and hand rotation. In addition, the high degree of freedom of hand configurations makes it difficult to track hand shape variations. This paper presents a new manifold embedding method that models hand shape variations across different hand configurations and across different views due to hand rotation. Instead of traditional silhouette images, the hand shapes are modeled using depth map images, which provide rich shape information invariant to illumination changes. These depth map images vary with viewing direction, similar to shape silhouettes. Sample data along view circles are collected for all the hand configuration variations. A new manifold embedding method using a 4D torus, modeling the product of three circular manifolds, is proposed for a low-dimensional representation of hand configuration and hand rotation. After learning a nonlinear mapping from the proposed embedding space to depth map images, we can track arbitrary shape variations with hand rotation using a particle filter on the embedding manifold. The experimental results on both synthetic and real data show accurate estimation of hand rotation through the estimation of the view parameters, and of hand configuration from key hand poses and hand configuration phases.

Chan-Su Lee, Sung Yong Chun, Shin Won Park

View-Invariant Object Detection by Matching 3D Contours

We propose an approach for view-invariant object detection directly in 3D with the following properties: (i) The detection is based on matching of 3D contours to 3D object models. (ii) The matching is constrained with qualitative spatial relations such as above/below, left/right, and front/back. (iii) In order to ensure that any matching solution satisfies these constraints, we formulate the matching problem as finding maximum weight subgraphs with hard constraints, and utilize a novel inference framework to solve this problem. Given a single view of an RGB-D camera, we obtain 3D contours by “back-projecting” 2D contours extracted in the depth map. As our experimental results demonstrate, the proposed approach significantly outperforms state-of-the-art 2D approaches, in particular the latent SVM object detector, as well as recently proposed approaches for object detection in RGB-D data.

Tianyang Ma, Meng Yi, Longin Jan Latecki

Human Detection with Occlusion Handling by Over-Segmentation and Clustering on Foreground Regions

Two-dimensional image-based human detection methods have been widely used in surveillance systems. However, detecting humans in the presence of occlusion is still a challenge for such image-based systems. In this paper, a human detection method is proposed that handles occlusions by using depth data obtained from 3D imaging, such as that easily acquired from the Microsoft Kinect depth sensor. In a surveillance setting, background subtraction on the depth data can be used to extract foreground regions that may correspond to humans. The proposed method analyzes the 3D data of the foreground regions using a “split-merge” approach: over-segmentation and clustering are performed on foreground regions, followed by height validation. Experimental results demonstrate that the proposed method outperforms two state-of-the-art human detection methods.

Li Wang, Kap Luk Chan, Gang Wang

Spin Image Revisited: Fast Candidate Selection Using Outlier Forest Search

Spin-images have been widely used for surface registration and object detection from range images because they are scale, rotation, and pose invariant. The computational complexity, however, is linear in the number of spin images in the model data set, because valid candidates are chosen according to the similarity distribution between the input spin image and all spin images in the data set. In this paper we present a fast method for valid candidate selection, as well as an approximate estimate of the similarity distribution, using outlier search in partitioned vocabulary trees. The sampled spin images in each tree are used for approximate density estimation, and best-matched candidates are then collected in the trees according to the statistics of the density. In contrast to previous approaches that attempt to build compact representations of the spin images, the proposed method reduces the search space using hierarchical clusters of the spin images, such that the computational complexity is drastically reduced from O(K·N) to O(K·log N), where K and N are the size of the spin-image features and the model data sets, respectively. As demonstrated in the experimental results with a consumer depth camera, the proposed method is tens of times faster than the conventional method while the registration accuracy is preserved.

Young-Woon Cha, Hwasup Lim, Seong-Oh Lee, Hyoung-Gon Kim, Sang Chul Ahn
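
The O(K·log N) behaviour comes from descending a hierarchical cluster tree instead of scanning all N model spin images. A schematic descent (the tree construction, outlier statistics, and node layout here are illustrative assumptions):

```python
import numpy as np

def descend(tree, query):
    """Walk a vocabulary tree from root to leaf, choosing at each level the
    child whose centroid is closest to the query spin image; the visited
    path has O(log N) length for N leaves, versus O(N) for a linear scan."""
    node = tree
    while node["children"]:                                # internal node
        node = min(node["children"],
                   key=lambda c: np.linalg.norm(query - c["centroid"]))
    return node["items"]         # candidate model spin images at the leaf
```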

MRF Guided Anisotropic Depth Diffusion for Kinect Range Image Enhancement

Projected texture based 3D sensing modalities are being increasingly used for a variety of 3D computer vision applications. However, these sensing modalities, exemplified by the Microsoft Kinect sensor, suffer from severe drawbacks that hamper the quality of the range estimates output by the sensor. It is well known that the quality of reconstruction of the projected texture for range estimation is a function of the material properties of objects in the image. Objects colored black, yellow or deep red often do not reflect the texture in a manner suitable for the detector to estimate range values. Furthermore, shiny or highly reflective objects can scatter the projected texture patterns. Skewed surface orientations, occlusions, object self-shadows and intra-object mutual shadows, transparency and other factors also create problems for projected texture reconstruction. To alleviate these concerns, depth interpolation techniques have been used in the past; these techniques, however, lose depth structures crucial for segmentation and detection processes. We instead present a novel MRF-based color-depth fusion algorithm which uses information from the RGB sensor of the Kinect and couples it with the depth content to produce fine-structure, high-fidelity depth maps. This algorithm can be implemented in hardware on the Kinect device, thereby improving the depth resolution and fidelity of the sensor while eliminating range errors and shadows.

Karthik Mahesh Varadarajan, Markus Vincze

Workshop on Face Analysis: The Intersection of Computer Vision and Human Perception

A Priori-Driven PCA

Principal Component Analysis (PCA) is a multivariate statistical dimensionality reduction method that has been applied successfully in many pattern recognition problems. In the research area of face analysis particularly, PCA has been used not only as a pre-processing step to produce accurate analytical models for automated face recognition systems, but also as a conceptual framework for human face coding. Despite the well-known attractive properties of PCA, the traditional approach does not incorporate high-level semantics from human reasoning which may steer its subspace computation. In this paper, we propose a method that allows PCA to incorporate such semantics explicitly. It allows an automatic selective treatment of the variables that compose the patterns of interest, performing data feature extraction and dimensionality reduction whenever some high-level information in the form of labeled data is available. The method relies on spatial weights calculated, in this work, by separating hyperplanes. Several experiments using 2D frontal face images and different data sets have been carried out to illustrate the usefulness of the method for dimensionality reduction, interpretation, classification and reconstruction of face images.

Carlos Thomaz, Gilson Giraldi, Joaquim Costa, Duncan Gillies

The Face Speaks: Contextual and Temporal Sensitivity to Backchannel Responses

It is often assumed that one person in a conversation is active (the speaker) and the rest passive (the listeners). Conversational analysis has shown, however, that listeners take an active part in the conversation, providing feedback signals that can control conversational flow. The face plays a vital role in these backchannel responses. A deeper understanding of facial backchannel signals is crucial for many applications in social signal processing, including automatic modeling and analysis of conversations, or in the development of life-like, effective conversational agents. Here, we present results from two experiments testing the sensitivity to the context and the timing of backchannel responses. We utilised sequences from a newly recorded database of 5-minute, two-person conversations. Experiment 1 tested how well participants would be able to match backchannel sequences to their corresponding speaker sequence. On average, participants performed well above chance. Experiment 2 tested how sensitive participants would be to temporal misalignments of the backchannel sequence. Interestingly, participants were able to estimate the correct temporal alignment for the sequence pairs. Taken together, our results show that human conversational skills are highly tuned both towards context and temporal alignment, showing the need for accurate modeling of conversations in social signal processing.

Andrew J. Aubrey, Douglas W. Cunningham, David Marshall, Paul L. Rosin, AhYoung Shin, Christian Wallraven

Virtual View Generation Using Clustering Based Local View Transition Model

This paper presents an approach for realistic virtual view generation using appearance clustering based local view transition model, with its target application on cross-pose face recognition. Previously, the traditional global pattern based view transition model (VTM) method was extended to its local version called LVTM, which learns the linear transformation of pixel values between frontal and non-frontal image pairs using partial image in a small region for each location, rather than transforming the entire image pattern. In this paper, we show that the accuracy of the appearance transition model and the recognition rate can be further improved by better exploiting the inherent linear relationship between frontal-nonfrontal face image patch pairs. For each specific location, instead of learning a common transformation as in the LVTM, the corresponding local patches are first clustered based on appearance similarity distance metric and then the transition models are learned separately for each cluster. In the testing stage, each local patch for the input non-frontal probe image is transformed using the learned local view transition model corresponding to the most visually similar cluster. The experimental results on a real-world face dataset demonstrated the superiority of the proposed method in terms of recognition rate.

Xi Li, Tomokazu Takahashi, Daisuke Deguchi, Ichiro Ide, Hiroshi Murase

3D Facial Expression Synthesis from a Single Image Using a Model Set

In this paper, we present a system for synthesizing 3D human face models containing different expressions from a single facial image. Given a frontal image of the target face with neutral expression, we first detect several key points denoting the shape of the face with an Active Shape Model (ASM). Then we apply RBF-based scattered data interpolation to reconstruct a 3D target face using a neutral-expression 3D face model as reference. By analyzing a series of 3D expression face models, we automatically segment the 3D reference model into regions, each corresponding to a facial organ. From the expression set we construct a motion model for each facial action with respect to the target face in a locally consistent manner. Finally, the reconstructed 3D target face model with neutral expression and the facial action motion models are combined to generate 3D target faces with various expressions. Our work makes three contributions: (1) We employ a set of registered 3D facial expression models as input, which enables us to generate more complex and visually realistic expressions than other parameter-based approaches and 2D image-based methods. (2) On the basis of clustering-based segmentation, we develop a localized linear expression model, which makes it possible to generate different facial expressions both locally and globally, thus enlarging the space of synthesized output and overcoming the limitation imposed by the limited scale of the input expression model set. (3) A local space transform procedure is included so that the output expression can fit distinct facial shapes despite the scarcity of variation of facial shapes (fat or thin) in the input model set.

Zhixin Shu, Lei Huang, Changping Liu

Face Hallucination on Personal Photo Albums

This paper presents a new approach to generate a high-quality facial image from a low-resolution facial image, based on a large set of facial images belonging to the same person but varying in pose and expression. The input images are taken by low-end cameras or from a long distance; the facial poses and expressions are neither consistent nor aligned. First, using a low-resolution facial image as a query, a set of high-resolution images with similar pose and expression is retrieved from the image examples by the proposed similarity measurement, based on the shape and texture information of the query image. The selected images are then aligned with the query image and used as candidates for face hallucination. A Markov random field (MRF) model based on newly proposed color and edge constraints is introduced to find an optimal solution for the hallucinated image. In the experiments, hallucinated images with high textural detail, four to eight times larger than the original low-resolution images, were generated by the proposed approach. The high-resolution outputs of our method are significantly improved in quality compared to other image super-resolution methods. Moreover, we also show that our approach is able to handle underexposed and noisy images.

Yuan Ren Loke, Ping Tan, Ashraf A. Kassim

Techniques for Mimicry and Identity Blending Using Morph Space PCA

We describe a face modelling tool allowing image representation in a high-dimensional morph space, compression to a small number of coefficients using PCA [1], and expression transfer between face models by projection of the source morph description (a parameterisation of complex facial motion) into the target morph space. This technique allows creation of an identity-blended avatar model whose high degree of realism enables diverse applications in visual psychophysics, stimulus generation for perceptual experiments, animation and affective computing.

Fintan Nagle, Harry Griffin, Alan Johnston, Peter McOwan

Facial and Vocal Cues in Perceptions of Trustworthiness

The goal of the present research was to study the relative role of facial and acoustic cues in the formation of trustworthiness impressions. Furthermore, we investigated the relationship between perceived trustworthiness and perceivers’ confidence in their judgments. 25 young adults watched a number of short clips in which the video and audio channel were digitally aligned to form five different combinations of actors’ face and voice trustworthiness levels (neutral face + neutral voice, neutral face + trustworthy voice, neutral face + non-trustworthy voice, trustworthy face + neutral voice, and non-trustworthy face + neutral voice). Participants provided subjective ratings of the trustworthiness of the actor in each video, and indicated their level of confidence in each of those ratings. Results revealed a main effect of face-voice channel combination on trustworthiness ratings, and no significant effect of channel combination on confidence ratings. We conclude that there is a clear superiority effect of facial over acoustic cues in the formation of trustworthiness impressions, propose a method for future investigation of the judgment-confidence link, and outline the practical implications of the experiment.

Elena Tsankova, Andrew J. Aubrey, Eva Krumhuber, Guido Möllering, Arvid Kappas, David Marshall, Paul L. Rosin

ACCV Workshop on Detection and Tracking in Challenging Environments (DTCE)

Disagreement-Based Multi-system Tracking

In this paper, we tackle the tracking problem from a fusion angle and propose a disagreement-based approach. While most existing fusion-based tracking algorithms work on different features or parts, our approach can be built on top of nearly any existing tracking system by exploiting their disagreements. In contrast to assuming multi-view features or different training samples, we utilize existing well-developed tracking algorithms, which themselves exhibit intrinsic variations due to their design differences. We present encouraging experimental results as well as theoretical justification of our approach. On a set of benchmark videos, large improvements (20%-40%) over state-of-the-art techniques have been observed.

Quannan Li, Xinggang Wang, Wei Wang, Yuan Jiang, Zhi-Hua Zhou, Zhuowen Tu

Monocular Pedestrian Tracking from a Moving Vehicle

Tracking pedestrians from a moving vehicle equipped with a monocular camera is still considered a challenging problem in both computer vision and robotics. In this paper, we address this problem in a particle filter framework which incorporates different cues from the detector, the dynamic model and target-specific tracking. In order to eliminate the effect of ego-motion when predicting the movement of pedestrians, we train one dynamic model for each driving behavior (moving forward, turning left/right) given a set of training trajectories. The learnt dynamic model is then utilized to predict the future movement of the pedestrian during tracking. We demonstrate that our system works robustly on a challenging dataset with strong illumination changes.

Zipei Fan, Zeliang Wang, Jinshi Cui, Franck Davoine, Huijing Zhao, Hongbin Zha

Tracking the Untrackable: How to Track When Your Object Is Featureless

We propose a novel approach to tracking objects by low-level line correspondences. In our implementation we show that this approach is usable even when tracking objects that lack texture, exploiting situations where feature-based trackers fail due to the aperture problem. Furthermore, we suggest an approach to failure detection and recovery to maintain long-term stability. This is achieved by remembering configurations which lead to good pose estimates and using them later for tracking corrections.

We carried out experiments on several sequences of different types. The proposed tracker proves itself as competitive or superior to state-of-the-art trackers in both standard and low-textured scenes.

Karel Lebeda, Jiri Matas, Richard Bowden

A Robust Particle Tracker via Markov Chain Monte Carlo Posterior Sampling

Particle filters have grown to be a standard framework for visual tracking. This paper proposes a robust particle tracker based on the Markov Chain Monte Carlo method, aiming at the thorny problems in visual tracking induced by object appearance changes, occlusion, background clutter, and abrupt motion. In this algorithm, we derive the posterior probability density function based on a second-order Markov assumption; the posterior probability density is the joint density of the previous two states. Additionally, a Markov chain of a certain length is used to approximate the posterior density, which improves the searching ability of the proposed tracker. We compare our approach with several alternative tracking algorithms, and the experimental results demonstrate that our tracker is superior to the others in dealing with various types of challenging scenarios.

Fasheng Wang, Mingyu Lu, Liran Shen
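
A bare-bones Metropolis-Hastings sketch of the posterior sampling (the second-order prediction follows the abstract; the observation model `likelihood` is a hypothetical stand-in for the paper's appearance likelihood):

```python
import numpy as np

def mh_track(likelihood, x_prev2, x_prev, n_samples=300, step=2.0, rng=None):
    """Sample states around a constant-velocity prediction (second-order
    Markov assumption) and accept random-walk proposals by the likelihood
    ratio; return the posterior-mean state estimate."""
    rng = rng or np.random.default_rng()
    x = 2 * x_prev - x_prev2          # prediction from the two previous states
    p_x = likelihood(x)
    samples = []
    for _ in range(n_samples):
        x_new = x + rng.normal(0.0, step, size=x.shape)   # symmetric proposal
        p_new = likelihood(x_new)
        if rng.random() < p_new / max(p_x, 1e-12):        # MH acceptance test
            x, p_x = x_new, p_new
        samples.append(x)
    return np.mean(samples, axis=0)
```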

A Framework for Inter-camera Association of Multi-target Trajectories by Invariant Target Models

We propose a novel framework for associating multi-target trajectories across multiple non-overlapping views (cameras) by constructing an invariant model for each observed target. Ideally, these models represent the targets in a unique manner. The models are constructed by generating synthetic images that simulate how targets would be seen from different viewpoints. Our framework does not require any training or other supervised phases. We do not make use of the spatiotemporal coordinates of trajectories, i.e., our framework seamlessly works with both overlapping and non-overlapping fields of view (FOVs), as well as widely separated ones. Moreover, contrary to many other related works, we do not try to estimate the relationship between cameras, which tends to be error prone in environments like airports or supermarkets where targets wander about different areas, stop at times, or turn back to their starting location. We show the results obtained by our framework on a rather challenging dataset. We also propose a black-box approach based on a Support Vector Machine (SVM) for fusing multiple pertinent algorithms, and demonstrate the added value of our framework with respect to some basic techniques.

Shahar Daliyot, Nathan S. Netanyahu

Colour Descriptors for Tracking in Spatial Augmented Reality

Augmented reality is an emerging research field that aims at the composition of real and virtual imagery by means of a camera and a display device. Spatial augmented reality employs data projectors to augment the real world. In this setting, traditional tracking methods fall short due to the interference caused by the projector. Recent works rely on a calibration process to model the projector and assume continuity in the movement of the tracked object. In this paper we present a tracking-by-detection system that does not require such a procedure and makes use of natural features represented by SIFT descriptors. We evaluate a set of photometric invariants, previously shown to improve the performance of object recognition, added to the descriptor to reduce the influence of the projector. We evaluate the descriptors based on precision-recall under projector distortion, and the total system based on its tracking performance. Results show that tracking is significantly more precise using one of the invariants.

Thijs Kooi, Francois de Sorbier, Hideo Saito

Covariance Descriptor Multiple Object Tracking and Re-identification with Colorspace Evaluation

This paper addresses the multi-target tracking problem with the help of a matching method in which moving objects are detected in each frame, tracked when possible, and matched by similarity of covariance matrices when difficulties arise. Three contributions are proposed. First, a compact vector based on color invariants and Local Binary Patterns Variance is compared to more classical feature vectors. Second, to accelerate object re-identification, we propose a more efficient arrangement of the covariance matrices. Finally, a multiple-target algorithm with special attention to occlusion handling, merging and separation of the targets is analyzed. Our experiments show the relevance of the method, illustrating the trade-off that has to be made between distinctiveness, invariance and compactness of the features.

Andrés Romero, Michéle Gouiffés, Lionel Lacassagne

Iterative Hypothesis Testing for Multi-object Tracking with Noisy/Missing Appearance Features

This paper assumes prior detections of multiple targets at each time instant, and uses a graph-based approach to connect those detections across time, based on their position and appearance estimates. In contrast to most earlier works in the field, our framework has been designed to exploit appearance features even when they are only sporadically available, or affected by non-stationary noise, along the sequence of detections. This is done by implementing an iterative hypothesis testing strategy to progressively aggregate the detections into short trajectories, named tracklets. Specifically, each iteration considers a node, named a key-node, and investigates how to link this key-node with other nodes in its neighbourhood, under the assumption that the target appearance is defined by the key-node appearance estimate. This is done through shortest path computation in a temporal neighbourhood of the key-node. The approach is conservative in that it only aggregates the shortest paths that are sufficiently better compared to alternative paths. It is also multi-scale in that the size of the investigated neighbourhood is increased proportionally to the number of detections already aggregated into the key-node. The multi-scale and iterative nature of the process makes it both computationally efficient and effective. Experimental validations are performed extensively on a 15-minute-long real-life basketball dataset captured by 7 cameras, and also on the PETS’09 dataset.

Amit Kumar K.C., Damien Delannay, Laurent Jacques, Christophe De Vleeschouwer

Novel Adaptive Eye Detection and Tracking for Challenging Lighting Conditions

The paper develops a novel technique that significantly improves the performance of Haar-like feature-based object detectors in terms of speed, detection rate under difficult lighting conditions, and number of false positives. The method is implemented and validated for driver monitoring under very dark, very bright, and normal conditions. The framework includes a fast adaptive detector designed to cope with rapid lighting variations, as well as an implementation of a Kalman filter for reducing the search region and indirectly supporting eye monitoring and tracking. The proposed methodology works effectively under low-light conditions without using infrared illumination or any other extra lighting support. Experimental results, a performance evaluation, and a comparison of a standard Haar-like detector with the proposed adaptive eye detector show noticeable improvements.

Mahdi Rezaei, Reinhard Klette

Obstacles Extraction Using a Moving Camera

A method for automatic obstacle detection is proposed that employs a camera mounted on a vehicle. Although various obstacle detection methods have already been reported, they normally detect only moving objects such as pedestrians and bicycles. In this paper, a method is proposed for detecting obstacles on a road, whether moving or static, by employing background modeling and road region classification. Background modeling is often used to detect moving objects when a camera is static; here we apply it to the moving-camera case to obtain foreground images. We then extract the road region using an SVM and carry out region classification within it. Using the result of the region classification, everything that is not an obstacle can be removed from the foreground images. The experiments show that the proposed method is able to extract the shapes of both static and moving obstacles in a frontal view from a car.

Shaohua Qian, Joo Kooi Tan, Hyoungseop Kim, Seiji Ishikawa, Takashi Morie

Scene Text Detection and Tracking for a Camera-Equipped Wearable Reading Assistant for the Blind

Visually impaired people suffer daily from their inability to read textual information. One of the most anticipated blind-assistive devices is a system equipped with a wearable camera capable of finding textual information in natural scenes and translating it into sound through a speech synthesizer. To avoid duplicate readings, the device should be able to recognize text areas with the same content and group them to obtain a single result. Scene text detection and tracking methods attract a lot of interest for these purposes. However, this field is still challenging, and methods for scene text detection and tracking are yet to be perfected. This paper proposes a scene text tracking system capable of finding text regions and tracking them in video frames captured by a wearable camera. By combining a text detection method with a feature point tracker, we obtain a robust text tracker which produces far fewer false-positive text images at 2.9 times the speed of the conventional method.

Faustin Pégeot, Hideaki Goto

Object Tracking across Non-overlapping Cameras Using Adaptive Models

In this paper, we propose a novel approach to track multiple objects across non-overlapping cameras, which aims at giving each object a unique label during its appearance in the whole multi-camera system. We formulate the multiclass object recognition problem as a binary classification problem based on an AdaBoost classifier. As illumination, viewpoint, and camera characteristics vary with camera pairs, appearance changes of objects across different camera pairs generally follow different patterns. Based on this fact, we use a categorical variable indicating the entry/exit cameras as a feature to deal with the different patterns of appearance change across cameras. For each labeled object, an adaptive model describing the intraclass similarity is computed and integrated into a sequence-based matching framework, on which the final matching decisions are made. Multiple experiments are performed on different datasets, and the results demonstrate the effectiveness of the proposed method.

Xiaotang Chen, Kaiqi Huang, Tieniu Tan

Combining Fast Extracted Edge Descriptors and Feature Sharing for Rapid Object Detection

We focus on the feature sharing problem for object detection in cluttered scenes. The contributions are two-fold. First, a novel kind of edge/contour descriptor is presented, and these descriptors serve as the basic features for sharing. Compared with HOGs (histograms of oriented gradients), the descriptors show approximately equivalent efficiency at much lower computational cost. Second, to exploit feature sharing techniques for object detection, a mathematical representation of shared features for “sliding-window” based object detection methods is given. With the newly defined shared features, a learning framework based on the Real AdaBoost algorithm and a reusing framework based on look-up tables are proposed. Experimental results show the efficiency of both the proposed features and the feature sharing method.

Yali Li, Fei He, Wenhao Lu, Shengjin Wang

Motion Segmentation by Velocity Clustering with Estimation of Subspace Dimension

The performance of clustering based motion segmentation methods depends on the dimension of the subspace where the point trajectories are projected. This paper presents a strategy for estimating the best subspace dimension using a novel clustering error measure. For each obtained segmentation, the proposed measure estimates the average least square error between the point trajectories and synthetic trajectories generated based on the motion models from the segmentation. The second contribution of this paper is the use of the velocity vector instead of the traditional trajectory vector for segmentation. The evaluation on the Hopkins 155 video benchmark database shows that the proposed method is competitive with current state-of-the-art methods both in terms of overall performance and computational speed.

Liangjing Ding, Adrian Barbu, Anke Meyer-Baese
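
The move from trajectories to velocities is a one-line preprocessing step: differentiate each trajectory along the time axis before clustering. A sketch (k-means is used here as an illustrative stand-in for the paper's clustering machinery):

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_by_velocity(trajectories, n_motions, seed=0):
    """trajectories: (P, F, 2) array of P tracked points over F frames.
    Cluster frame-to-frame velocity vectors rather than raw positions."""
    velocities = np.diff(trajectories, axis=1)        # (P, F-1, 2) velocities
    feats = velocities.reshape(len(trajectories), -1) # one feature row per point
    km = KMeans(n_clusters=n_motions, random_state=seed, n_init=10)
    return km.fit_predict(feats)                      # motion label per trajectory
```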

International Workshop on Intelligent Mobile Vision (IMV)

Beyond Spatial Pyramid Matching: Spatial Soft Voting for Image Classification

Recently, spatial partitioning approaches such as spatial pyramid matching (SPM) have commonly been used in image classification to collect the global and local features of images. They divide the input image into small sub-regions (typically in a hierarchical manner) and generate a feature vector for each of them. Although the codes for the descriptors are assigned softly in modern image feature representation techniques, each code must fall into only a single sub-region when forming the feature vector. In other words, soft code assignment is used in the descriptor space, but the codes are still “hard” voted from the viewpoint of the image space. This paper proposes a spatial soft voting method, in which the existence of the codes is expressed by a Gaussian function and the maps of the existence are sampled to form a feature vector. The generated feature vectors are “soft” both in the descriptor space and in the image space. In addition, the extra computational cost compared to SPM is negligibly small. The concept of spatial soft voting is general and can be applied to most hard spatial partitioning approaches.

Toshihiko Yamasaki, Tsuhan Chen
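
A minimal sketch of the spatial soft vote, assuming soft code assignments for each descriptor are already available (the grid size, image size and σ are illustrative parameters):

```python
import numpy as np

def spatial_soft_vote(positions, code_weights, grid=(4, 4),
                      img_size=(256, 256), sigma=32.0):
    """positions: (N, 2) descriptor locations; code_weights: (N, K) soft code
    assignments. Each code votes into every spatial cell with a Gaussian
    weight on its distance to the cell centre, instead of falling into a
    single sub-region as in SPM."""
    gy, gx = grid
    cy = (np.arange(gy) + 0.5) * img_size[0] / gy
    cx = (np.arange(gx) + 0.5) * img_size[1] / gx
    centres = np.stack(np.meshgrid(cy, cx, indexing="ij"), -1).reshape(-1, 2)
    d2 = ((positions[:, None, :] - centres[None]) ** 2).sum(-1)   # (N, C)
    w = np.exp(-d2 / (2 * sigma ** 2))                            # spatial weights
    return (w.T @ code_weights).ravel()                           # (C * K,) feature
```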

Efficient Geometric Re-ranking for Mobile Visual Search

The state-of-the-art mobile visual search approaches are based on the bag-of-visual-words (BoW) model. As the BoW representation ignores geometric relationships among local features, a full geometric constraint such as RANSAC is usually used as a post-processing step to re-rank the matched images, which has been shown to greatly improve precision but at high computational cost. In this paper we present a novel and efficient geometric re-ranking method. Our basic idea is that truly matching local features should not only lie in a similar spatial context, but also have a consistent spatial relationship; we therefore simultaneously introduce context similarity and spatial similarity to describe geometric consistency. By incorporating these two geometric constraints, co-occurring visual words in the same spatial context can be regarded as a “visual phrase” with significantly greater discriminative power than a single visual word. To evaluate our approach, we perform experiments on the Star5k and ImageNet100k datasets. The comparison with the BoW method and the soft-assignment method highlights the effectiveness of our approach in both accuracy and speed.

Junwu Luo, Bo Lang

Intelligent Photographing Interface with On-Device Aesthetic Quality Assessment

This paper proposes an efficient method for instant photo aesthetic quality assessment that can be implemented on common portable devices. The classification performance reaches 0.89 on a benchmark photo database. We also port our method onto a mid-range tablet computer, where it executes instantly with acceptable efficiency. Moreover, an aesthetic information display that presents the aesthetic evaluation results to users is introduced.

Kuo-Yen Lo, Keng-Hao Liu, Chu-Song Chen

Camera Pose Estimation of a Smartphone at a Field without Interest Points

Augmented Reality (AR) systems on mobile phones have recently attracted attention because smartphones have become increasingly popular. For an AR system, we have to know the camera pose of the smartphone. A sensor-based method is one of the most popular ways to estimate the camera pose, but it cannot estimate an accurate pose. A vision-based method is another way to estimate the camera pose, but it is not suitable for a scene with few interest points, such as a sports field. In this paper, we propose a novel camera pose estimation method for scenes without interest points, combining a sensor-based and a vision-based approach. In our proposed method, we use acceleration and magnetic sensors to roughly estimate the camera pose, then search for the accurate pose by matching the captured image with a set of reference images. Our experiments show that our proposed method is accurate and fast enough for a real-time AR system.

Ruiko Miyano, Takuya Inoue, Takuya Minagawa, Yuko Uematsu, Hideo Saito

Hierarchical Scan-Line Dynamic Programming for Optical Flow Using Semi-Global Matching

Dense and robust optical flow estimation is still a major challenge in low-level computer vision. In recent years, mainly variational methods contributed to the progress in this field. One reason for their success is their suitability to be embedded into hierarchical schemes, which makes them capable of handling large pixel displacements. Matching-based regularization techniques, like dynamic programming or belief propagation concepts, can also lead to accurate optical flow fields. However, results are limited to short- or mid-scale optical flow vectors, because these techniques are usually not combined with coarse-to-fine strategies. This paper introduces fSGM, a novel algorithm that is based on scan-line dynamic programming. It uses the cost integration strategy of semi-global matching, a concept well known in the area of stereo matching. The major novelty of fSGM is that it embeds the scan-line dynamic programming approach into a hierarchical scheme, which allows it to handle large pixel displacements with an accuracy comparable to variational methods. We prove the exceptional performance of fSGM by comparing it to current state-of-the-art methods on the KITTI Vision Benchmark Suite.

Simon Hermann, Reinhard Klette
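
For reference, the scan-line cost aggregation that fSGM inherits from semi-global matching follows the standard recurrence, written here for a generic label d (a flow vector in fSGM), with P_1 and P_2 the usual small-jump and large-jump penalties:

```latex
L_r(\mathbf{p}, d) = C(\mathbf{p}, d)
  + \min\Big( L_r(\mathbf{p} - \mathbf{r}, d),\;
              L_r(\mathbf{p} - \mathbf{r}, d \pm 1) + P_1,\;
              \min_{k} L_r(\mathbf{p} - \mathbf{r}, k) + P_2 \Big)
  - \min_{k} L_r(\mathbf{p} - \mathbf{r}, k)
```

The final cost for each pixel-label pair is the sum of L_r over all scan-line directions r.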

Novel Multi-view Synthesis from a Stereo Image Pair for 3D Display on Mobile Phone

In this paper we present a novel view synthesis method for mobile platforms. A disparity-based view interpolation is proposed to synthesize a virtual view from a left and right image pair. This makes full use of the available information in both images to give accurate interpolation results, and greatly decreases the number of pixels in the disocclusion region, thus reducing the errors introduced by patch-based image inpainting. Two boundary refinement schemes, considering gradient and color coherence and applying directional filters on boundaries, are proposed to improve the synthesized results. Experimental results show that the proposed method is effective and suitable for the mobile environment.

Chen-Hao Wei, Chen-Kuo Chiang, Yu-Wei Sun, Mei-Huei Lin, Shang-Hong Lai

An Accurate Method for Line Detection and Manhattan Frame Estimation

We address the problem of estimating the rotation of a camera relative to the canonical frame of an urban scene, from a single image. Solutions generally rely on the so-called ‘Manhattan World’ assumption [1] that the major structures in the scene conform to three orthogonal principal directions. This can be expressed as a generative model in which the dense gradient map of the image is explained by a mixture of the three principal directions and a background process [2]. It has recently been shown that using sparse oriented edges rather than the dense gradient map leads to substantial gains in both accuracy and speed [3]. Here we explore whether further gains can be made by basing inference on even sparser extended lines. Standard Houghing techniques suffer from quantization errors and noise that make line extraction unreliable. Here we introduce a probabilistic line extraction technique that eliminates these problems through two innovations. First, we accurately propagate edge uncertainty from the image to the Hough map through a bivariate normal kernel that uses natural image statistics, resulting in a non-stationary ‘soft-voting’ technique. Second, we eliminate multiple responses to the same line by updating the Hough map dynamically as each line is extracted. We evaluate the method on a standard benchmark dataset [3], showing that the resulting line representation supports reliable estimation of the Manhattan frame, bettering the accuracy of previous edge-based methods by a factor of 2 and the gradient-based Manhattan World method by a factor of 5.

Ron Tal, James H. Elder

Hierarchical Stereo Matching Based on Image Bit-Plane Slicing

We propose a new stereo matching framework based on image bit-plane slicing. A pair of image sequences with various intensity quantization levels, constructed by taking different bit-rates of the images, is used for hierarchical stereo matching. The basic idea is to use the low bit-rate image pairs to compute rough disparity maps. The hierarchical matching strategy is then performed iteratively to update the low-confidence disparities with the information provided by extra image bit-planes. Since the disparity computation is carried out on a need-to-know basis, the proposed technique is suitable for remote processing of images acquired by a mobile camera. Our method provides a hierarchical matching framework and can be combined with existing stereo matching algorithms. Experiments on the Middlebury datasets show that our technique gives good results compared to conventional full bit-rate matching.

Huei-Yung Lin, Pin-Zhi Lin
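
Bit-plane slicing itself is a one-liner per plane; a sketch of how the progressively richer quantization levels of the hierarchy can be produced from an 8-bit image:

```python
import numpy as np

def bit_plane_stack(img, bits=8):
    """Return the image quantized to its top 1, 2, ..., `bits` bit-planes,
    coarsest first, as input for hierarchical coarse-to-fine matching."""
    levels = []
    for b in range(1, bits + 1):
        mask = np.uint8((0xFF << (8 - b)) & 0xFF)   # keep b most significant bits
        levels.append(img & mask)
    return levels
```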

Backmatter
