
2011 | Book

Computer Vision – ACCV 2010 Workshops

ACCV 2010 International Workshops, Queenstown, New Zealand, November 8-9, 2010, Revised Selected Papers, Part I

Edited by: Reinhard Koch, Fay Huang

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

About this book

The two-volume set LNCS 6468-6469 contains the carefully selected and reviewed papers presented at the eight workshops that were held in conjunction with the 10th Asian Conference on Computer Vision, in Queenstown, New Zealand, in November 2010.

From a total of 167 submissions to all workshops, 89 papers were selected for publication. The contributions are grouped according to the main workshop topics, which were: computational photography and aesthetics; computer vision in vehicle technology: from Earth to Mars; electronic cultural heritage; subspace-based methods; video event categorization, tagging and retrieval; visual surveillance; gaze sensing and interactions; application of computer vision for mixed and augmented reality.

Table of Contents

Frontmatter

Workshop on Visual Surveillance

Second-Order Polynomial Models for Background Subtraction

This paper is aimed at investigating background subtraction based on second-order polynomial models. Recently, preliminary results suggested that quadratic models hold the potential to yield superior performance in handling common disturbance factors, such as noise, sudden illumination changes and variations of camera parameters, with respect to state-of-the-art background subtraction methods. Therefore, based on the formalization of background subtraction as Bayesian regression of a second-order polynomial model, we propose here a thorough theoretical analysis aimed at identifying a family of suitable models and deriving the closed-form solutions of the associated regression problems. In addition, we present a detailed quantitative experimental evaluation aimed at comparing the different background subtraction algorithms resulting from theoretical analysis, so as to highlight those more favorable in terms of accuracy, speed and speed-accuracy tradeoff.

Alessandro Lanza, Federico Tombari, Luigi Di Stefano
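
To make the idea concrete, here is a minimal sketch (not the authors' Bayesian formulation) that fits a second-order polynomial mapping from a background neighbourhood to the current frame by least squares and flags pixels with large residuals; patch size and threshold are illustrative assumptions:

```python
import numpy as np

def quadratic_bg_subtract(bg, frame, patch=7, thresh=10.0):
    """Flag foreground pixels that a quadratic background model cannot explain.

    For each neighbourhood, fit frame ~ a*bg**2 + b*bg + c by least squares
    and test the residual at the centre pixel (illustrative sketch only).
    """
    h, w = bg.shape
    r = patch // 2
    mask = np.zeros((h, w), dtype=bool)
    for y in range(r, h - r):
        for x in range(r, w - r):
            b = bg[y - r:y + r + 1, x - r:x + r + 1].ravel().astype(float)
            f = frame[y - r:y + r + 1, x - r:x + r + 1].ravel().astype(float)
            A = np.stack([b ** 2, b, np.ones_like(b)], axis=1)
            coef, *_ = np.linalg.lstsq(A, f, rcond=None)
            centre = len(f) // 2
            mask[y, x] = abs(f[centre] - A[centre] @ coef) > thresh
    return mask
```
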
Adaptive Background Modeling for Paused Object Regions

Background modeling has been widely researched to detect moving objects from image sequences. Most approaches suffer from a false-negative problem caused by stopped objects. When a moving object stops in the observed scene, it will gradually be trained as background, since the observed pixel values are directly used for updating the background model. In this paper, we propose 1) a method to inhibit background training, and 2) a method to update the original background region occluded by a stopped object. We use a probabilistic approach and a predictive approach to background modeling to solve these problems. The main contribution of this paper is that we can keep paused objects from being trained into the background.

Atsushi Shimada, Satoshi Yoshinaga, Rin-ichiro Taniguchi
Determining Spatial Motion Directly from Normal Flow Field: A Comprehensive Treatment

Determining, from video, the motion of the imaged scene relative to the camera is important for various robotics tasks, including visual control and autonomous navigation. The difficulty of the problem lies mainly in the fact that the flow pattern directly observable in the video is generally not the full flow field induced by the motion, but only partial information about it, known as the normal flow field. A few methods, collectively referred to as direct methods, have been proposed to determine the spatial motion from merely the normal flow field, without ever interpolating the full flows. However, such methods generally have difficulty addressing the case of general motion. This work proposes a new direct method that uses two constraints to determine motion: one related to the direction component of the normal flow field, and the other to the magnitude component. The first constraint presents itself as a system of linear inequalities that bind the motion parameters; the second uses the fact that the rotation magnitude is global to all image positions to constrain the motion parameters further. A two-stage iterative process in a coarse-to-fine framework is used to exploit the two constraints. Experimental results on benchmark data show that the new treatment can tackle even the case of general motion.

Tak-Wai Hui, Ronald Chung
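
For readers unfamiliar with the normal flow field: it is the component of optical flow along the image gradient, and follows directly from the brightness-constancy equation. A minimal sketch of the standard computation (not this paper's two-constraint method):

```python
import numpy as np

def normal_flow(I0, I1, eps=1e-6):
    """Normal flow from two consecutive grayscale frames.

    Brightness constancy gives Ix*u + Iy*v + It = 0; only the flow component
    along the gradient is observable: n = -It * grad(I) / |grad(I)|^2.
    """
    Iy, Ix = np.gradient(I0.astype(float))    # spatial derivatives
    It = I1.astype(float) - I0.astype(float)  # temporal derivative
    mag2 = Ix ** 2 + Iy ** 2 + eps
    return -It * Ix / mag2, -It * Iy / mag2   # (nx, ny) per pixel
```
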
Background Subtraction for PTZ Cameras Performing a Guard Tour and Application to Cameras with Very Low Frame Rate

Pan-Tilt-Zoom cameras have the ability to cover wide areas with an adapted resolution. Since the logical downside of high resolution is a limited field of view, a guard tour can be used to monitor a large scene of interest. However, this greatly increases the time between frames associated with a specific location, a constraint that makes most background subtraction algorithms ineffective. In this article we propose a background subtraction algorithm suited to cameras with very low frame rates. Its main strength is its robustness to sudden illumination changes. The background model, which describes a wide scene of interest as a collection of images, can thus be successfully maintained. The algorithm is compared with the state of the art, and a discussion of its properties follows.

C. Guillot, M. Taron, Patrick Sayd, Q. C. Pham, C. Tilmant, J. M. Lavest
Bayesian Loop for Synergistic Change Detection and Tracking

In this paper we investigate Bayesian visual tracking based on change detection. Although in many proposals change detection is key for tracking, little attention has been paid to sound modeling of the interaction between the change detector and the tracker. In this work, we develop a principled framework whereby both processes can virtuously influence each other according to a Bayesian loop: change detection provides a completely specified observation likelihood to the tracker and the tracker provides an informative prior to the change detector.

Samuele Salti, Alessandro Lanza, Luigi Di Stefano
Real Time Motion Changes for New Event Detection and Recognition

An original approach for real-time detection of changes in motion is presented, for detecting and recognizing events. Current video change detection focuses on shot changes, based on appearance rather than motion. Changes in motion are detected in pixels that are found to be active, and this motion is input to sequential change detection, which detects changes in real time. Statistical modeling of the motion data shows that the Laplace distribution provides the most accurate fit. This leads to reliable detection of changes in motion for videos where shot change detection is shown to fail. Once a change is detected, the event is recognized based on motion statistics, size, and the density of active pixels. Experiments show that the proposed method finds meaningful changes and achieves reliable recognition.

Konstantinos Avgerinakis, Alexia Briassouli, Ioannis Kompatsiaris
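
A hedged sketch of the kind of sequential change detection the abstract describes, using a CUSUM statistic with Laplace likelihoods; the scale parameters b0, b1 and threshold h are illustrative assumptions, not values from the paper:

```python
import numpy as np

def cusum_laplace(x, b0, b1, mu=0.0, h=20.0):
    """Sequential (CUSUM) change detection with Laplace data models.

    Accumulates the log-likelihood ratio of Laplace(mu, b1) against
    Laplace(mu, b0) over the samples x and returns the first index where
    it crosses h, or None if no change is detected.
    """
    llr = np.log(b0 / b1) + np.abs(np.asarray(x) - mu) * (1.0 / b0 - 1.0 / b1)
    s = 0.0
    for t, l in enumerate(llr):
        s = max(0.0, s + l)   # resetting at 0 keeps the test one-sided
        if s > h:
            return t
    return None
```
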
Improving Detector of Viola and Jones through SVM

The boosted cascade proposed by Viola and Jones is applied to many object detection problems. In their cascade, the confidence value of each stage can only be used in the current stage, so inter-stage information is not utilized to enhance classification performance. In this paper, we present a new cascade structure with added SVM stages which employ the confidence values of multiple preceding AdaBoost stages as input. Specifically, a rejection hyperplane and a promotion hyperplane are learned for each added SVM stage. During the detection process, negative detection windows are discarded earlier by the rejection SVM hyperplane, and positive windows with high confidence values are boosted by the promotion hyperplane to bypass the next stage of the cascade. In order to construct the two distinct hyperplanes, different cost coefficients for the training samples are chosen in SVM learning. Experimental results on the UIUC data set demonstrate that the proposed method achieves high detection accuracy and better efficiency.

Zhenchao Xu, Li Song, Jia Wang, Yi Xu
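
One plausible way to obtain two distinct hyperplanes with different cost coefficients, as the abstract describes, is class-weighted SVM training; the data below is synthetic and the weights are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic stand-ins for the confidence values of three preceding
# AdaBoost stages (positives around +1, negatives around -1).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1, 1, (200, 3)), rng.normal(-1, 1, (200, 3))])
y = np.array([1] * 200 + [0] * 200)

# Rejection hyperplane: losing a positive is costly, so windows it
# rejects can be discarded early with little risk.
reject_svm = LinearSVC(class_weight={1: 100.0, 0: 1.0}).fit(X, y)

# Promotion hyperplane: passing a negative is costly, so windows it
# accepts can safely bypass the next AdaBoost stage.
promote_svm = LinearSVC(class_weight={0: 100.0, 1: 1.0}).fit(X, y)
```
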
Multi-camera People Localization and Height Estimation Using Multiple Birth-and-Death Dynamics

This paper presents a novel tool for localizing people in a multi-camera environment using calibrated cameras. Additionally, we estimate the height of each person in the scene. Currently, the presented method uses human body silhouettes as input, but it can be easily modified to process other widely used object (e.g. head, leg, body) detection results. In the first step we project all the pixels of the silhouettes to the ground plane and to other parallel planes at different heights. Then we extract our features, which are based on the physical properties of the 2-D image formation. The final configuration results (location and height) are obtained by an iterative stochastic optimization process, namely the multiple birth-and-death dynamics framework.

Ákos Utasi, Csaba Benedek
Unsupervised Video Surveillance

This paper addresses the problem of automatically learning common behaviors from long-time observations of a scene of interest, with the purpose of classifying actions and, possibly, detecting anomalies. Unsupervised learning is used as an effective way to extract information from the scene with very limited intervention from the user. The method we propose is rather general, but fits very naturally a video-surveillance scenario, where the same environment is observed for a long time, usually from a distance. The experimental analysis is based on thousands of dynamic events acquired over three weeks of observation by a single-camera video-surveillance system installed in our department.

Nicoletta Noceti, Francesca Odone
Multicamera Video Summarization from Optimal Reconstruction

We propose a principled approach to video summarization using optimal reconstruction as a metric to guide the creation of the summary output. The spatio-temporal video patches included in the summary are viewed as observations about the local motion of the original input video and are chosen to minimize the reconstruction error of the missing observations under a set of learned predictive models. The method is demonstrated using fixed-viewpoint video sequences and shown to generalize to multiple camera systems with disjoint views, which can share activity already summarized in one view to inform the summary of another. The results show that this approach can significantly reduce or even eliminate the inclusion of patches in the summary that contain activities from the video that are already expected based on other summary patches, leading to a more concise output.

Carter De Leo, B. S. Manjunath
Noisy Motion Vector Elimination by Bi-directional Vector-Based Zero Comparison

Network cameras are becoming increasingly popular as surveillance devices. They compress the captured live video data into Motion JPEG and/or MPEG standard formats and transmit them over the IP network. MPEG-coded videos contain motion vectors that are useful information for video analysis. However, the motion vectors occurring in homogeneous, low-textured, and line regions tend to be unstable and noisy. To address this problem, noisy motion vector elimination using vector-based zero comparison and global motion estimation was previously proposed. In this paper, we extend the existing elimination method by introducing a novel bi-directional vector-based zero comparison to enhance the accuracy of noisy motion vector elimination, and we propose an efficient algorithm for zero comparison. We demonstrate the effectiveness of the proposed method through several experiments using actual video data acquired by an MPEG video camera.

Takanori Yokoyama, Toshinori Watanabe
Spatio-Temporal Optimization for Foreground/Background Segmentation

We introduce a procedure for calibrated multi-camera setups in which persons observed in a realistic and therefore difficult environment are determined as foreground in image sequences via a fully automatic, purely data-driven segmentation.

In order to gain an optimal separation of fore- and background for each frame in terms of Expectation Maximization (EM), an algorithm is proposed which utilizes a combination of geometrical constraints of the scene and, additionally, temporal constraints for an optimization over the entire sequence to estimate the background. This background information is then used to determine accurate silhouettes of the foreground.

We demonstrate the effectiveness of our approach based on a qualitative data analysis and compare it to other state of the art approaches.

Tobias Feldmann
Error Decreasing of Background Subtraction Process by Modeling the Foreground

Background subtraction is often one of the first tasks involved in video surveillance applications. Classical methods use a statistical background model and compute a distance between each part (pixel or block) of the current frame and the model to detect moving targets. Segmentation is then obtained by thresholding this distance. This commonly used approach suffers from two main drawbacks. First, the segmentation is done blindly, without considering the foreground appearance. Secondly, the threshold value is often specified empirically, according to a visual quality evaluation; this means both that the value is scene-dependent and that its setting is not automated using an objective criterion.

In order to address these drawbacks, we introduce in this article a foreground model to improve the segmentation process. Several segmentation strategies are proposed and compared both theoretically and experimentally. Thanks to a theoretical error estimation, an optimal segmentation threshold can be deduced to control segmentation behaviour, for example to hold a specifically targeted false-alarm rate. This approach improves segmentation results in video surveillance applications in difficult situations such as non-stationary backgrounds.

Christophe Gabard, Laurent Lucat, Catherine Achard, C. Guillot, Patrick Sayd
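
The paper's point about deducing a threshold from a targeted false-alarm rate can be illustrated for the simplest case of a per-pixel Gaussian background model (an assumption of this sketch, not necessarily the authors' model):

```python
from scipy.stats import norm

def far_threshold(sigma, far=0.01):
    """Distance threshold that holds a targeted per-pixel false-alarm rate.

    If the background distance is d = |I - mu| with I ~ N(mu, sigma^2),
    then P(d > T) = far  gives  T = sigma * Phi^{-1}(1 - far / 2).
    """
    return sigma * norm.ppf(1.0 - far / 2.0)

# e.g. far_threshold(sigma=4.0, far=0.01) -> about 10.3 grey levels
```
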
Object Flow: Learning Object Displacement

Modelling the dynamic behaviour of moving objects is one of the basic tasks in computer vision. In this paper, we introduce the Object Flow, for estimating both the displacement and the direction of an object-of-interest. Compared to detection and tracking techniques, our approach obtains the object displacement directly, similar to optical flow, while ignoring other irrelevant movements in the scene. Hence, Object Flow has the ability to continuously focus on a specific object and calculate its motion field. The resulting motion representation is useful for a variety of visual applications (e.g., scene description, object tracking, action recognition) and it cannot be directly obtained using the existing methods.

Constantinos Lalos, Helmut Grabner, Luc Van Gool, Theodora Varvarigou
HOG-Based Descriptors on Rotation Invariant Human Detection

In the past decade, there have been many proposed techniques for human detection. Dalal and Triggs suggested Histogram of Oriented Gradient (HOG) features combined with a linear SVM to handle the task. Since then, many variations of HOG-based detection have been introduced. They are, nevertheless, based on the assumption that the human must be in an upright pose, due to the limitation in geometrical variation. HOG-based human detection obviously fails in monitoring human activities in daily life such as sleeping, lying down, falling, and squatting. This paper focuses on exploring various features based on HOG for rotation-invariant human detection. The results show that a square-shaped window can cover more poses but causes a drop in performance. Moreover, some rotation-invariant techniques used in image retrieval outperform other techniques in human classification on the upright pose and perform very well on various poses. This could help relax the generally used assumption of an upright pose.

Panachit Kittipanya-ngam, Eng How Lung
Fast and Accurate Pedestrian Detection Using a Cascade of Multiple Features

We propose a fast and accurate pedestrian detection framework based on cascaded classifiers with two complementary features. Our pipeline starts with a cascade of weak classifiers using Haar-like features, followed by a linear SVM classifier relying on Co-occurrence Histograms of Oriented Gradients (CoHOG). CoHOG descriptors have a strong classification capability but are extremely high dimensional. On the other hand, Haar features are computationally efficient but not highly discriminative for extremely varying texture and shape information, such as pedestrians with different clothing and stances. Therefore, the combination of both classifiers enables fast and accurate pedestrian detection. Additionally, we propose reducing the CoHOG descriptor dimensionality using Principal Component Analysis. The experimental results on the DaimlerChrysler benchmark dataset show that we can achieve accuracy very close to that of the CoHOG-only classifier at less than 1/1000 of its computational cost.

Alaa Leithy, Mohamed N. Moustafa, Ayman Wahba
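
A minimal sketch of the second stage described above: PCA-compressed high-dimensional descriptors feeding a linear SVM. The descriptors here are synthetic stand-ins for CoHOG, and all dimensions are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-ins for very high-dimensional CoHOG descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4000))
y = rng.integers(0, 2, size=400)

# Second cascade stage: compress with PCA, then classify linearly.
stage2 = make_pipeline(PCA(n_components=100), LinearSVC()).fit(X, y)
scores = stage2.decision_function(X[:5])  # signed margins for 5 windows
```
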
Interactive Motion Analysis for Video Surveillance and Long Term Scene Monitoring

In video surveillance and long term scene monitoring applications, handling slow-moving or stopped objects is a challenging problem for motion analysis and tracking. We present a new framework using two feedback mechanisms which allow interactions between tracking and background subtraction (BGS) to improve tracking accuracy, particularly in the cases of slow-moving and stopped objects. A publish-subscribe modular system that provides the framework for communication between components is described. The robustness and efficiency of the proposed method are tested on our real-time video surveillance system. Quantitative performance evaluation is performed on a variety of sequences, including standard datasets. With the two feedback mechanisms enabled together, significant improvement in tracking performance is demonstrated, particularly in handling slow-moving and stopped objects.

Andrew W. Senior, YingLi Tian, Max Lu
Frontal Face Generation from Multiple Low-Resolution Non-frontal Faces for Face Recognition

We propose a method of frontal face generation from multiple low-resolution non-frontal faces for face recognition. The proposed method achieves an image-based face pose transformation by using the information obtained from multiple input face images, without considering three-dimensional face structure. To achieve this, we employ a patch-wise image transformation strategy that calculates small image patches in the output frontal face from patches in the multiple input non-frontal faces by using a face image dataset. The dataset contains faces of a large number of individuals other than the input one. Two kinds of experiments were conducted using frontal face images actually transformed from low-resolution non-frontal face images. The experimental results demonstrate that increasing the number of input images improves the RMSEs and the recognition rates for low-resolution face images.

Yuki Kono, Tomokazu Takahashi, Daisuke Deguchi, Ichiro Ide, Hiroshi Murase
Probabilistic Index Histogram for Robust Object Tracking

Color histograms are widely used for visual tracking due to their robustness against object deformations. However, traditional histogram representations often suffer from partial occlusion, background clutter and other appearance corruptions. In this paper, we propose a probabilistic index histogram to improve the discriminative power of the histogram representation. With this modeling, an input frame is translated into an index map whose entries indicate indexes to separate bins. Based on the index map, we introduce spatial information and the bin-ratio dissimilarity into histogram comparison. The proposed probabilistic indexing technique, together with the two robust measurements, greatly increases the discriminative power of the histogram representation. Both qualitative and quantitative evaluations show the robustness of the proposed approach against partial occlusion and noisy, cluttered backgrounds.

Wei Li, Xiaoqin Zhang, Nianhua Xie, Weiming Hu, Wenhan Luo, Haibin Ling
Mobile Surveillance by 3D-Outlier Analysis

We present a novel online method to model independent foreground motion by using solely traditional structure and motion (S+M) algorithms. On the one hand, the visible static scene can be reconstructed and on the other hand, the position and orientation (pose) of the observer (mobile camera) are estimated. Additionally, we use 3D-outlier analysis for foreground motion detection and tracking. First, we cluster the available 3D-information such that, with high probability, each cluster corresponds to a moving object. Next, we establish a purely geometry-based object representation that can be used to reliably estimate each object’s pose. Finally, we extend the purely geometry-based object representation and add local descriptors to solve the loop closing problem for the underlying S+M algorithm. Experimental results on single and multi-object video data demonstrate the viability of this method. Major results include the computation of a stable representation of moving foreground objects, basic recognition possibilities due to descriptors, and motion trajectories that can be used for motion analysis of objects. Our novel multibody structure and motion (MSaM) approach runs online and can be used to control active surveillance systems in terms of dynamic scenes, observer pose, and observer-to-object pose estimation, or to enrich available information in existing appearance- and shape-based object categorization.

Peter Holzer, Axel Pinz
Person Re-identification Based on Global Color Context

In this paper, we present a new solution to the problem of person re-identification. Person re-identification means matching observations of the same person across different times and possibly different cameras. Appearance-based person re-identification must deal with several challenges, such as variations in illumination conditions, poses and occlusions. Our proposed method is inspired by the notion of self-similarity, an attractive property in visual recognition. Instead of comparing image descriptors between two images directly, self-similarity measures how similar they are to a neighborhood of themselves. The self-similarities of image patterns within the image are modeled in two different ways in the proposed Global Color Context (GCC) method. The spatial distributions of self-similarities w.r.t. color words are combined to characterize the appearance of pedestrians. Promising results are obtained on the public ETHZ database, compared with state-of-the-art performances.

Yinghao Cai, Matti Pietikäinen
Visual Object Tracking via One-Class SVM

In this paper, we propose a new visual object tracking approach via one-class SVM (OC-SVM), inspired by the fact that the OC-SVM's support vectors can form a hyper-sphere whose center can be regarded as a robust object estimate from samples. In the tracking approach, a set of tracking samples is constructed in a predefined search window of a video frame. A threshold strategy is then proposed to select examples from the tracking sample set. The selected examples are used to train an OC-SVM model, which estimates a hyper-sphere encircling most of the examples. Finally, we take the center of the hyper-sphere as the tracked object in the current frame. Extensive experiments demonstrate the effectiveness and robustness of the proposed approach in complex backgrounds.

Li Li, Zhenjun Han, Qixiang Ye, Jianbin Jiao
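
A rough sketch of the core step with scikit-learn's OneClassSVM; taking the mean of the support vectors as the sphere's center is a crude illustrative approximation, and the sample features are synthetic:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-ins for feature vectors of patches sampled in the
# search window of the current frame.
rng = np.random.default_rng(1)
samples = rng.normal(0.0, 1.0, size=(100, 16))

# Fit the OC-SVM; its support vectors outline a hyper-sphere that
# encircles most of the examples.
ocsvm = OneClassSVM(kernel="rbf", nu=0.2, gamma="scale").fit(samples)

# Crude input-space estimate of the sphere's center: the mean of the
# support vectors; this plays the role of the tracked-object estimate.
center = ocsvm.support_vectors_.mean(axis=0)
```
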
Attenuated Sequential Importance Resampling (A-SIR) Algorithm for Object Tracking

This paper presents a newly developed attenuating resampling algorithm for particle filtering that can be applied to object tracking. In any filtering algorithm adopting the concept of particles, especially in visual tracking, resampling is a vital process that determines the algorithm's performance and accuracy in the implementation step. Resampling is usually a linear function of the weights of the particles, which decides the number of copies of each particle. If we use many particles to prevent sample impoverishment, however, the system becomes computationally too expensive. For better real-time performance with high accuracy, we introduce a steep Attenuated Sequential Importance Resampling (A-SIR) algorithm that requires fewer highly weighted particles, by introducing a nonlinear function into the resampling method. Using the proposed algorithm, we have obtained very impressive results for visual tracking with only a few particles instead of many. Dynamic parameter setting increases the steepness of resampling and reduces computational time without degrading performance. Since resampling is not dependent on any particular application, the A-SIR analysis is appropriate for any type of particle filtering algorithm that adopts a resampling procedure. We show that the A-SIR algorithm can improve the performance of a complex visual tracking algorithm using only a few particles compared with a traditional SIR-based particle filter.

Md. Zahidul Islam, Chi-Min Oh, Chil-Woo Lee
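
The nonlinear resampling idea can be sketched by steepening the weight distribution before multinomial resampling; the exponent alpha stands in for the paper's attenuation function and is an assumption of this sketch:

```python
import numpy as np

def attenuated_resample(particles, weights, alpha=2.0, rng=None):
    """Resampling with a steepened (nonlinear) weight transform.

    Raising the normalized weights to alpha > 1 copies highly weighted
    particles more aggressively, so fewer particles are needed; alpha
    is a stand-in for the paper's attenuation function.
    """
    rng = rng or np.random.default_rng()
    w = np.asarray(weights, dtype=float) ** alpha
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return np.asarray(particles)[idx]
```
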
An Appearance-Based Approach to Assistive Identity Inference Using LBP and Colour Histograms

Robust identity inference is one of the biggest challenges in current visual surveillance systems. Although the face is an important biometric for generic identity inference, it is not always accessible in video-based surveillance systems, due to the poor quality of the video or ineffective viewpoints in which the captured face is not clearly visible. Hence, there is an increasing need to take advantage of additional features to improve the accuracy and reliability of these systems. Appearance and clothing are potentially suitable for visually identifying and tracking suspects. In this research we present a novel approach for recognition of upper-body clothing, using local binary patterns (LBP) and colour information, as an assistive tool for identity inference.

Sareh Abolahrari Shirazi, Farhad Dadgostar, Brian C. Lovell
Vehicle Class Recognition Using Multiple Video Cameras

We present an approach to 3D vehicle class recognition (SUV, mini-van, sedan, or pickup truck) with one or more fixed video cameras in arbitrary positions with respect to a road. The vehicle motion is assumed to be straight. We propose an efficient method of Structure from Motion (SfM) for camera calibration and 3D reconstruction. 3D geometry such as vehicle and cabin length, width, and height, and functions of these, are computed and become features for use in a classifier. Classification is done by a minimum-probability-of-error recognizer. Finally, when additional video clips taken elsewhere are available, we design classifiers based on two or more video clips, and this results in a significant reduction in classification error.

Dongjin Han, Jae Hwang, Hern-soo Hahn, David B. Cooper
Efficient Head Tracking Using an Integral Histogram Constructing Based on Sparse Matrix Technology

In this paper, integral histogram construction based on sparse matrix technology is applied to a particle filter for efficient head tracking; this can significantly enhance the speed of particle filters with large numbers of particles. Also, by exploiting the integral histogram construction, a novel proposal based on orientation histogram matching is introduced for head tracking, using a circular-shift orientation histogram matching that is robust to in-plane rotation. The proposed head tracker is validated on S. Birchfield's image sequences.

Jia-Tao Qiu, Yu-Shan Li, Xiu-Qin Chu

Workshop on Video Event Categorization, Tagging and Retrieval (VECTaR)

Analyzing Diving: A Dataset for Judging Action Quality

This work presents a unique new dataset and objectives for action analysis. The data presents 3 key challenges: tracking, classification, and judging action quality. The last of these, to our knowledge, has not yet been attempted in the vision literature as applied to sports where technique is scored.

This work performs an initial analysis of the dataset with classification experiments, confirming that temporal information is more useful than holistic bag-of-features style analysis in distinguishing dives. Our investigation lays the groundwork of effective tools for working with this type of sports data for future investigations into judging the quality of actions.

Kamil Wnuk, Stefano Soatto
Appearance-Based Smile Intensity Estimation by Cascaded Support Vector Machines

Facial expression recognition is one of the most challenging research areas in the image recognition field and has been studied actively for a long time. In particular, we think that the smile is an important facial expression for communication between human beings, and also between humans and machines. Therefore, if we can detect smiles and estimate their intensity at low computational cost and high accuracy, it will raise the possibility of many new applications in the future. In this paper, we focus on the smile among facial expressions and study feature extraction methods to detect a smile and estimate its intensity from facial appearance information alone (facial parts detection is not required). We use Local Intensity Histograms (LIH), Center-Symmetric Local Binary Patterns (CS-LBP), or concatenated LIH and CS-LBP features to train a Support Vector Machine (SVM) for smile detection. Moreover, we construct the SVM smile detector with a cascaded structure, both to preserve performance and to reduce the computational cost, and estimate the smile intensity by posterior probability. As a consequence, we achieved both low computational cost and high performance on practical images, and we also implemented the proposed methods in a PC demonstration system.

Keiji Shimada, Tetsu Matsukawa, Yoshihiro Noguchi, Takio Kurita
Detecting Frequent Patterns in Video Using Partly Locality Sensitive Hashing

Frequent patterns in video are useful clues to learn previously unknown events in an unsupervised way. This paper presents a novel method for detecting relatively long variable-length frequent patterns in video efficiently. The major contribution of the paper is that Partly Locality Sensitive Hashing (PLSH) is proposed as a sparse sampling method to detect frequent patterns faster than the conventional method with LSH. The proposed method was evaluated by detecting frequent everyday whole body motions in video.

Koichi Ogawara, Yasufumi Tanabe, Ryo Kurazume, Tsutomu Hasegawa
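
For orientation, a generic random-projection LSH index (not the paper's PLSH variant, which sparsely samples the hash input) looks like this; items landing in the same bucket are candidate near-neighbours:

```python
import numpy as np

class LSHIndex:
    """Generic random-projection LSH (illustrative, not the paper's PLSH)."""

    def __init__(self, dim, n_bits=16, seed=0):
        self.planes = np.random.default_rng(seed).normal(size=(n_bits, dim))
        self.table = {}

    def _key(self, v):
        # The sign pattern of the projections is the hash key.
        return tuple((self.planes @ v > 0).astype(int))

    def add(self, i, v):
        self.table.setdefault(self._key(v), []).append(i)

    def query(self, v):
        # Items in the same bucket are candidate near-neighbours.
        return self.table.get(self._key(v), [])
```
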
Foot Contact Detection for Sprint Training

We introduce a new algorithm to automatically identify the time and pixel location of foot contact events in high speed video of sprinters. We use this information to autonomously synchronise and overlay multiple recorded performances to provide feedback to athletes and coaches during their training sessions.

The algorithm exploits the variation in speed of different parts of the body during sprinting. We use an array of foreground accumulators to identify short-term static pixels and a temporal analysis of the associated static regions to identify foot contacts.

We evaluated the technique using 13 videos of three sprinters. It successfully identified 55 of the 56 contacts, with a mean localisation error of 1.39±1.05 pixels. Some videos were also seen to produce additional, spurious contacts. We present heuristics to help identify the true contacts.

Robert Harle, Jonathan Cameron, Joan Lasenby
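
The foreground-accumulator idea can be sketched as follows: pixels that remain foreground across consecutive frames (such as a planted foot) accumulate evidence, while moving or background pixels decay. The increment, decay, and cap values are illustrative, not the paper's:

```python
import numpy as np

def update_static_accumulator(acc, fg_mask, inc=1, dec=2, cap=30):
    """One frame of a foreground accumulator for short-term static pixels.

    Pixels that stay foreground over many frames climb toward cap;
    moving or background pixels decay toward zero. A contact candidate
    is a connected region whose values sit near cap.
    """
    acc = np.where(fg_mask, acc + inc, acc - dec)
    return np.clip(acc, 0, cap)
```
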
Interpreting Dynamic Meanings by Integrating Gesture and Posture Recognition System

Integration of information from different systems supports enhanced functionality; however, it requires rigorous, pre-determined criteria for the fusion. This paper proposes a novel approach for determining the integration criteria using a particle filter for the fusion of hand gesture and posture recognition systems at the decision level. For decision-level fusion, the integration framework requires the classification of hand gesture and posture symbols, in which an HMM is used to classify the alphabets and numbers from the hand gesture recognition system, whereas ASL finger-spelling signs (alphabets and numbers) are classified by the posture recognition system using an SVM. These classification results are input to the integration framework to compute contribution-weights. For this purpose, the Condensation algorithm approximates the optimal a-posteriori probability using the a-priori probability and a Gaussian-based likelihood function, thus making the weights independent of classification ambiguities. Considering recognition as a problem of regular grammar, we have developed our production rules based on a context-free grammar (CFG) for a restaurant scenario. On the basis of the contribution-weights, we map the recognized outcomes onto the CFG rules and infer meaningful expressions. Experiments are conducted on 500 different combinations of restaurant orders, with an overall inference accuracy of 98.3%, which demonstrates the significance of the proposed approach.

Omer Rashid Ahmed, Ayoub Al-Hamadi, Bernd Michaelis
Learning from Mistakes: Object Movement Classification by the Boosted Features

This paper proposes a robust object movement detection method via a classifier trained on mis-detection samples. The mis-detections are related to the environment, such as reflections on a display or small movements of a curtain, so learning the patterns of mis-detections will improve detection precision. The mis-detections are expected to share several features, but manually selecting optimal features and thresholds is difficult. In order to acquire an optimal classifier automatically, we employ an ensemble learning framework. The experiments show that the method can detect object movements sufficiently well by constructing the classifier automatically within the proposed framework.

Shigeyuki Odashima, Tomomasa Sato, Taketoshi Mori
Modeling Multi-Object Activities in Phase Space

Modeling and recognition of complex activities involving multiple, interacting objects in video is a significant problem in computer vision. In this paper, we examine activities using relative distances in phase space via pairwise analysis of all objects. This allows us to characterize simple interactions directly by modeling multi-object activities with the Multiple Objects, Pairwise Analysis (MOPA) feature vector, which is based upon physical models of multiple interactions in phase space. In this initial formulation, we model paired motion as a damped oscillator in phase space. Experimental validation of the theory is provided on the standard VIVID and UCR Videoweb datasets capturing a variety of problem settings.

Ricky J. Sethi, Amit K. Roy-Chowdhury
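
As a concrete reading of "relative distances in phase space via pairwise analysis": each object pair can be embedded as its separation and the separation's time derivative, which is where a damped-oscillator model of interaction would live. A minimal sketch under that assumption:

```python
import numpy as np

def pair_phase_space(traj_a, traj_b, dt=1.0):
    """Embed a pair of tracked objects in (separation, separation-rate) space.

    traj_a, traj_b: (T, 2) arrays of image positions over T frames.
    Returns a (T, 2) phase-space trajectory; under a damped-oscillator
    view, interacting pairs spiral toward an equilibrium separation.
    """
    r = np.linalg.norm(np.asarray(traj_a) - np.asarray(traj_b), axis=1)
    dr = np.gradient(r, dt)
    return np.stack([r, dr], axis=1)
```
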
Sparse Motion Segmentation Using Multiple Six-Point Consistencies

We present a method for segmenting an arbitrary number of moving objects in image sequences using the geometry of 6 points in 2D to infer motion consistency. The method has been evaluated on the Hopkins 155 database and surpasses current state-of-the-art methods such as SSC, both in terms of overall performance on two and three motions and in terms of maximum errors. The method works by finding initial clusters in the spatial domain, and then classifying each remaining point as belonging to the cluster that minimizes a motion consistency score. In contrast to most other motion segmentation methods, which are based on an affine camera model, the proposed method is fully projective.

Vasileios Zografos, Klas Nordberg, Liam Ellis
Systematic Evaluation of Spatio-Temporal Features on Comparative Video Challenges

In the last decade, there has been great interest in the evaluation of local visual features in the domain of images. The aim is to provide researchers guidance when selecting the best approaches for new applications and data-sets. Most of the state-of-the-art features have been extended to the temporal domain to allow for video retrieval and categorization, using techniques similar to those used for images. However, there is no comprehensive evaluation of these. We provide the first comparative evaluation based on isolated and well-defined alterations of video data. We select the three most promising approaches, namely the Harris3D, Hessian3D, and Gabor detectors and the HOG/HOF, SURF3D, and HOG3D descriptors. For the evaluation of the detectors, we measure their repeatability on the challenges, treating the videos as 3D volumes. To evaluate the robustness of spatio-temporal descriptors, we propose a principled classification pipeline where the increasingly altered videos build a set of queries. This allows for an in-depth analysis of local detectors and descriptors and their combinations.

Julian Stöttinger, Bogdan Tudor Goras, Thomas Pöntiz, Allan Hanbury, Nicu Sebe, Theo Gevers
Two-Probabilistic Latent Semantic Model for Image Annotation and Retrieval

A novel latent variable modeling technique for image annotation and retrieval is proposed. This model is useful for annotating images with relevant semantic meanings as well as for retrieving images which satisfy the user's query with specific text or images. A two-step latent variable framework is proposed to support the multi-functionality of the retrieval and annotation system. Furthermore, existing image annotation models and the proposed one are compared in terms of annotation performance. Images from standard databases are used in the comparison in order to identify the best model for automatic image annotation, using precision-recall measurements. Local features, or visual words, of each image in the database are extracted using the Scale-Invariant Feature Transform (SIFT) and clustering techniques. Each image is then represented by a Bag-of-Features (BoF), which is a histogram of visual words. Semantic meanings can then be related to each BoF using a latent variable for annotation purposes. Subsequently, for image retrieval, each image query is also related to semantic meanings. Finally, image retrieval results are obtained by matching the semantic meanings of the query with those of the images in the database using a second latent variable.

Nattachai Watcharapinchai, Supavadee Aramvith, Supakorn Siddhichai
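
The Bag-of-Features representation described above is standard and easy to sketch: cluster pooled local descriptors into a codebook, then histogram each image's descriptors over the visual words. The descriptors here are synthetic stand-ins for SIFT, and the codebook size is arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Descriptors pooled over the whole corpus (synthetic stand-ins for
# 128-D SIFT); the cluster centres become the visual-word codebook.
train_desc = rng.normal(size=(5000, 128))
codebook = KMeans(n_clusters=200, n_init=4, random_state=0).fit(train_desc)

# One image's descriptors -> normalized visual-word histogram (its BoF).
image_desc = rng.normal(size=(300, 128))
words = codebook.predict(image_desc)
bof = np.bincount(words, minlength=200).astype(float)
bof /= bof.sum()
```
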
Using Conditional Random Field for Crowd Behavior Analysis

The behaviors of individuals in crowded places pose unique and difficult challenges. In this paper, a novel framework is proposed to investigate crowd behaviors and to localize anomalous behaviors. The novelty of the proposed approach lies in three aspects. First, we introduce block-clips by sectioning video segments into non-overlapping patches to marginalize the arbitrarily complicated dense flow field. Second, the flow field is treated as a 2-D distribution of samples in block-clips, which is parameterized using mixtures of Gaussians, keeping generality intact. The parameters of each Gaussian model, particularly the mean values, are transformed into a sequence of Gaussian mean densities for each block-clip, namely a sequence of latent-words. A bank of Conditional Random Field models is employed, one for each block-clip, which is learned from the sequence of latent-words and classifies each block-clip as normal or abnormal. Experiments are conducted on two challenging benchmark datasets, PETS 2009 and University of Minnesota, and the results show that our method achieves higher accuracy in behavior detection and can effectively localize specific and overall anomalies. In addition, a comparative analysis with similar approaches demonstrates the superior performance of our approach.

Saira Saleem Pathan, Ayoub Al-Hamadi, Bernd Michaelis

Workshop on Gaze Sensing and Interactions

Understanding Interactions and Guiding Visual Surveillance by Tracking Attention

The central tenet of this paper is that by determining where people are looking, other tasks involved with understanding and interrogating a scene are simplified. To this end we describe a fully automatic method to determine a person’s attention based on real-time visual tracking of their head and a coarse classification of their head pose. We estimate the head pose, or coarse gaze, using randomised ferns with decision branches based on both histograms of gradient orientations and colour based features. We use the coarse gaze for three applications to demonstrate its value: (i) we show how by building static and temporally varying maps of areas where people look we are able to identify interesting regions; (ii) we show how by determining the gaze of people in the scene we can more effectively control a multi-camera surveillance system to acquire faces for identification; (iii) we show how by identifying where people are looking we can more effectively classify human interactions.

Ian Reid, Ben Benfold, Alonso Patron, Eric Sommerlade
Algorithm for Discriminating Aggregate Gaze Points: Comparison with Salient Regions-Of-Interest

A novel method for distinguishing classes of viewers from their aggregated eye movements is described. The probabilistic framework accumulates uniformly sampled gaze as Gaussian point spread functions (heatmaps), and measures the distance of unclassified scanpaths to a previously classified set (or sets). A similarity measure is then computed over the scanpath durations. The approach is used to compare human observers' gaze over video to regions of interest (ROIs) automatically predicted by a computational saliency model. Results show consistent discrimination between human and artificial ROIs, regardless of which of two differing instructions was given to human observers (free or tasked viewing).

Thomas J. Grindinger, Vidya N. Murali, Stephen Tetreault, Andrew T. Duchowski, Stan T. Birchfield, Pilar Orero
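
Accumulating gaze as Gaussian point spread functions, as the abstract describes, can be sketched in a few lines; the bandwidth sigma is an illustrative choice:

```python
import numpy as np

def gaze_heatmap(points, shape, sigma=25.0):
    """Accumulate gaze points into a Gaussian point-spread heatmap.

    points: iterable of (x, y) pixel coordinates; shape: (height, width).
    Returns a heatmap normalized to [0, 1] (assumes at least one point).
    """
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    heat = np.zeros(shape)
    for px, py in points:
        heat += np.exp(-((xx - px) ** 2 + (yy - py) ** 2) / (2 * sigma ** 2))
    return heat / heat.max()
```
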
Gaze Estimation Using Regression Analysis and AAMs Parameters Selected Based on Information Criterion

One of the most crucial techniques associated with Computer Vision is technology that deals with the automatic estimation of gaze orientation. In this paper, a method is proposed to estimate horizontal gaze orientation from a monocular camera image using the parameters of Active Appearance Models (AAM) selected based on several model selection methods. The proposed method can estimate horizontal gaze orientation more precisely than the conventional method (Ishikawa’s method) because of the following two unique points: simultaneous estimation of horizontal head pose and gaze orientation, and the most suitable model formula for regression selected based on each model selection method. The validity of the proposed method was confirmed by experimental results.

Manabu Takatani, Yasuo Ariki, Tetsuya Takiguchi
Estimating Human Body and Head Orientation Change to Detect Visual Attention Direction

This paper presents a method to estimate changes in human body and head orientation around the yaw axis from low-resolution data. Body orientation is calculated by using the Shape Context algorithm to match the outline of the upper body with predefined shape templates within ranges of 22.5 degrees. Then, motion flow vectors of SIFT features around the head region are utilized to estimate the change in head orientation. The body orientation change and head orientation change can be added to the initial orientation to compute the person's new visual focus of attention. Experimental results are presented to prove the effectiveness of the proposed method. Successful estimations, which are supported by a user study, were obtained from low-resolution data under various head pose articulations.

Ovgu Ozturk, Toshihiko Yamasaki, Kiyoharu Aizawa
Can Saliency Map Models Predict Human Egocentric Visual Attention?

The validity of using conventional saliency map models to predict human attention was investigated for video captured with an egocentric camera. Since conventional visual saliency models do not take into account visual motion caused by camera motion, high visual saliency may be erroneously assigned to regions that are not actually visually salient. To evaluate the validity of using saliency map models for egocentric vision, an experiment was carried out to examine the correlation between visual saliency maps and measured gaze points for egocentric vision. The results show that conventional saliency map models can predict visually salient regions better than chance for egocentric vision and that the accuracy decreases significantly with an increase in visual motion induced by egomotion, which is presumably compensated for in the human visual system. This latter finding indicates that a visual saliency model is needed that can better predict human visual attention from egocentric videos.

Kentaro Yamada, Yusuke Sugano, Takahiro Okabe, Yoichi Sato, Akihiro Sugimoto, Kazuo Hiraki
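
One standard way to quantify how well a saliency map predicts measured gaze points (the kind of comparison this study performs, though not necessarily its exact metric) is the Normalized Scanpath Saliency:

```python
import numpy as np

def nss(saliency, fixations):
    """Normalized Scanpath Saliency of a map against measured fixations.

    Standardize the map to zero mean and unit variance, then average its
    values at the fixation locations; values well above 0 indicate
    better-than-chance gaze prediction.
    """
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-9)
    return float(np.mean([s[y, x] for x, y in fixations]))
```
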
An Empirical Framework to Control Human Attention by Robot

Human attention control simply means shifting someone's attention from one direction to another. To shift someone's attention, gaining attention and meeting gaze are the two most important prerequisites. If a person would like to communicate with another, the person's gaze should meet the receiver's gaze, and they should make eye contact. However, it is difficult to establish eye contact non-linguistically when the two people are not facing each other. Therefore, the sender should perform some actions to capture the receiver's attention so that they can meet face-to-face and establish eye contact. In this paper, we focus on what is the best action for a robot to attract human attention, and on how the human and robot display gazing behavior toward each other for eye contact. In our system, the robot may direct its gaze toward a particular direction after making eye contact, and the human will read the robot's gaze. As a result, s/he will shift his/her attention in the direction indicated by the robot's gaze. Experimental results show that the robot's head motions can attract human attention, and that the robot's blinking when their gazes meet can make the human feel that s/he has made eye contact with the robot.

Mohammed Moshiul Hoque, Tomami Onuki, Emi Tsuburaya, Yoshinori Kobayashi, Yoshinori Kuno, Takayuki Sato, Sachiko Kodama
Improvement and Evaluation of Real-Time Tone Mapping for High Dynamic Range Images Using Gaze Information

Using gaze information in designing tone-mapping operators has many potential advantages over traditional global tone-mapping operators. In this paper, we evaluate a recently proposed real-time tone-mapping operator based on gaze information and show that it is highly dependent on the input scene. We propose an important modification to the evaluated method to relieve this dependency and to enhance the appearance of the resultant images using a smaller processing area. Experimental results show that our method outperforms the evaluated technique.

Takuya Yamauchi, Toshiaki Mikami, Osama Ouda, Toshiya Nakaguchi, Norimichi Tsumura
Evaluation of the Impetuses of Scan Path in Real Scene Searching

Modern computer vision systems usually scan the image over positions and scales to detect a predefined object, whereas the human visual system performs this task in a more intuitive and efficient manner by selecting only a few regions to fixate on. A comprehensive understanding of human search will benefit computer vision systems in search modeling. In this paper, we investigate the contributions of the sources that affect the human eye's scan path while observers perform a search task in real scenes. The examined sources include saliency, task guidance, and oculomotor bias. Both their influence on each consecutive pair of fixations and on the entire scan path are evaluated. The experimental results suggest that the influences of task guidance and oculomotor bias are comparable, and that of saliency is rather low. They also show that we could use these sources to predict not only where humans look in the image but also the order in which they visit those locations.

Chen Chi, Laiyun Qing, Jun Miao, Xilin Chen
Backmatter
Metadata
Title
Computer Vision – ACCV 2010 Workshops
Edited by
Reinhard Koch
Fay Huang
Copyright year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-22822-3
Print ISBN
978-3-642-22821-6
DOI
https://doi.org/10.1007/978-3-642-22822-3