
2016 | Book

Computer Vision – ECCV 2016 Workshops

Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part I


About this Book

The three-volume set LNCS 9913, LNCS 9914, and LNCS 9915 comprises the refereed proceedings of the Workshops that took place in conjunction with the 14th European Conference on Computer Vision, ECCV 2016, held in Amsterdam, The Netherlands, in October 2016.

27 workshops were selected from 44 workshop proposals for inclusion in the proceedings. These address the following themes: Datasets and Performance Analysis in Early Vision; Visual Analysis of Sketches; Biological and Artificial Vision; Brave New Ideas for Motion Representations; Joint Imagenet and MS Coco Visual Recognition Challenge; Geometry Meets Deep Learning; Action and Anticipation for Visual Learning; Computer Vision for Road Scene Understanding and Autonomous Driving; Challenge on Automatic Personality Analysis; BioImage Computing; Benchmarking Multi-Target Tracking: MOTChallenge; Assistive Computer Vision and Robotics; Transferring and Adapting Source Knowledge in Computer Vision; Recovering 6D Object Pose; Robust Reading; 3D Face Alignment in the Wild and Challenge; Egocentric Perception, Interaction and Computing; Local Features: State of the Art, Open Problems and Performance Evaluation; Crowd Understanding; Video Segmentation; The Visual Object Tracking Challenge Workshop; Web-scale Vision and Social Media; Computer Vision for Audio-visual Media; Computer VISion for ART Analysis; Virtual/Augmented Reality for Visual Artificial Intelligence; Joint Workshop on Storytelling with Images and Videos and Large Scale Movie Description and Understanding Challenge.

Table of Contents

Frontmatter

W02 – Visual Analysis of Sketches

Frontmatter
Face Recognition from Multiple Stylistic Sketches: Scenarios, Datasets, and Evaluation

Matching a face sketch against mug shots, which plays an important role in law enforcement and security, is an interesting and challenging topic in the face recognition community. Although great progress has been made in recent years, existing studies mainly focus on face recognition from a single sketch. In this paper, we present a fundamental study of face recognition from multiple stylistic sketches. Three specific scenarios with corresponding datasets are carefully introduced to mimic real-world situations: (1) recognition from multiple hand-drawn sketches; (2) recognition from a hand-drawn sketch and composite sketches; (3) recognition from multiple composite sketches. We further provide evaluation protocols and several benchmarks for these proposed scenarios. Finally, we discuss the many challenges and possible future directions that are worth further investigation. All the materials will be made publicly available online (at http://chunleipeng.com/FRMSketches.html) for comparison and further study of this problem.

Chunlei Peng, Nannan Wang, Xinbo Gao, Jie Li
Instance-Level Coupled Subspace Learning for Fine-Grained Sketch-Based Image Retrieval

Fine-grained sketch-based image retrieval (FG-SBIR) is a newly emerged topic in computer vision. The problem is challenging because, in addition to bridging the sketch-photo domain gap, it also asks for instance-level discrimination within object categories. Most prior approaches focused on feature engineering and fine-grained ranking, yet neglected an important and central problem: how to establish a fine-grained cross-domain feature space in which to conduct retrieval. In this paper, for the first time we formulate a cross-domain framework specifically designed for the task of FG-SBIR that simultaneously conducts instance-level retrieval and attribute prediction. Unlike conventional photo-text cross-domain frameworks that perform transfer on category-level data, our joint multi-view space uniquely learns from instance-level pair-wise annotations of sketches and photos. More specifically, we propose a joint view selection and attribute subspace learning algorithm to learn domain projection matrices for photos and sketches, respectively. It follows that visual attributes can be extracted from such matrices through projection to build a coupled semantic space in which to conduct retrieval. Experimental results on two recently released fine-grained photo-sketch datasets show that the proposed method is able to perform at a level close to that of deep models, while removing the need for extensive manual annotations.

Peng Xu, Qiyue Yin, Yonggang Qi, Yi-Zhe Song, Zhanyu Ma, Liang Wang, Jun Guo
IIIT-CFW: A Benchmark Database of Cartoon Faces in the Wild

In this paper, we introduce the cartoon faces in the wild (IIIT-CFW) database and associated problems. This database contains 8,928 annotated images of cartoon faces of 100 public figures. It will be useful for conducting research on the spectrum of problems associated with cartoon understanding. Note that, to our knowledge, no such realistic and large database of cartoon faces is available in the literature.

Ashutosh Mishra, Shyam Nandan Rai, Anand Mishra, C. V. Jawahar
Identifying Emotions Aroused from Paintings

Understanding the emotional appeal of paintings is a significant research problem related to affective image classification. The problem is challenging in part due to the scarcity of manually classified paintings. Our work proposes to apply statistical models trained on photographs to infer the emotional appeal of paintings. Directly applying models learned on photographs to paintings cannot provide accurate classification results, because visual features extracted from paintings and natural photographs have different characteristics. This work presents an adaptive learning algorithm that leverages labeled photographs and unlabeled paintings to infer the visual appeal of paintings. In particular, we iteratively adapt the feature distribution in photographs to fit paintings and maximize the joint likelihood of labeled and unlabeled data. We evaluate our approach through two emotional classification tasks: distinguishing positive from negative emotions, and differentiating reactive emotions from non-reactive ones. Experimental results show the potential of our approach.

Xin Lu, Neela Sawant, Michelle G. Newman, Reginald B. Adams Jr., James Z. Wang, Jia Li
Fast Face Sketch Synthesis via KD-Tree Search

Automatic face sketch synthesis has been widely applied in digital entertainment and law enforcement. Currently, most sketch synthesis algorithms focus on generating face portraits of good quality while ignoring time consumption. Existing methods have high time complexity due to the dense patch matching performed in the neighbor selection process. In this paper, we propose a simple yet effective fast face sketch synthesis method based on the K-dimensional tree (KD-Tree). The proposed method employs the idea of divide-and-conquer (i.e. piece-wise linear modeling) to learn the complex nonlinear mapping between facial photos and sketches. In the training phase, all training images are divided into regions and every region is divided into small patches; a KD-Tree is then built over the training photo patches in each region. In the test phase, the test photo is first divided into patches in the same way as in the training phase. KD-Tree search is conducted for K nearest neighbor selection by matching the test photo patches in each region against the constructed KD-Tree of training photo patches in the same region. The KD-Tree builds an index structure that greatly reduces the time required for neighbor selection. Compared with synthesis methods using the classical greedy search strategy (i.e. KNN), the proposed method is much less time consuming while achieving comparable synthesis performance. Experiments on the public CUHK face sketch (CUFS) database illustrate the effectiveness of the proposed method. In addition, the proposed neighbor selection strategy can be further extended to other synthesis algorithms.

Yuqian Zhang, Nannan Wang, Shengchuan Zhang, Jie Li, Xinbo Gao
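
To make the neighbor-selection step above concrete, the following minimal Python sketch builds a KD-Tree over training photo patches with SciPy and synthesizes a sketch by averaging the sketch patches of the K nearest photo patches. Patch size, stride, K and the plain averaging are illustrative assumptions rather than the authors' exact pipeline.

# Minimal sketch of KD-Tree based neighbor selection for patch synthesis.
# Patch size, stride, K and the simple averaging are illustrative choices.
import numpy as np
from scipy.spatial import cKDTree

def extract_patches(img, size=10, stride=5):
    """Slide a window over a grayscale image and return flattened patches."""
    patches, coords = [], []
    h, w = img.shape
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            patches.append(img[y:y + size, x:x + size].ravel())
            coords.append((y, x))
    return np.asarray(patches, dtype=np.float32), coords

def synthesize_sketch(test_photo, train_photos, train_sketches, k=5, size=10):
    """Reconstruct a sketch by averaging the sketch patches whose photo
    patches are the K nearest neighbors of each test photo patch."""
    photo_patches, sketch_patches = [], []
    for p, s in zip(train_photos, train_sketches):
        pp, _ = extract_patches(p, size)
        sp, _ = extract_patches(s, size)
        photo_patches.append(pp)
        sketch_patches.append(sp)
    photo_patches = np.vstack(photo_patches)
    sketch_patches = np.vstack(sketch_patches)

    tree = cKDTree(photo_patches)            # index built once over training patches
    out = np.zeros_like(test_photo, dtype=np.float32)
    weight = np.zeros_like(out)
    test_patches, coords = extract_patches(test_photo, size)
    _, idx = tree.query(test_patches, k=k)   # fast K-NN instead of greedy search
    for (y, x), nbrs in zip(coords, idx):
        avg = sketch_patches[nbrs].mean(axis=0).reshape(size, size)
        out[y:y + size, x:x + size] += avg
        weight[y:y + size, x:x + size] += 1.0
    return out / np.maximum(weight, 1.0)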

W08 – Computer Vision for Road Scene Understanding and Autonomous Driving

Frontmatter
Monocular Visual-IMU Odometry: A Comparative Evaluation of the Detector-Descriptor Based Methods

Visual odometry has been used in many fields, especially in robotics and intelligent vehicles. Since local descriptors are robust to background clutter, occlusion and other content variations, they have been receiving more and more attention in detector-descriptor based visual odometry. To our knowledge, however, there is no extensive comparative evaluation investigating the performance of detector-descriptor based methods in the scenario of monocular visual-IMU (Inertial Measurement Unit) odometry. In this paper, we therefore perform such an evaluation under a unified framework. We select five typical routes from the challenging KITTI dataset by taking into account the length and shape of the routes and the impact of independent motions due to other vehicles and pedestrians. On these five routes, we conduct five experiments to assess the performance of different combinations of salient point detectors and local descriptors in various road scenes. The results obtained in this study potentially provide a series of guidelines for the selection of salient point detectors and local descriptors.

Xingshuai Dong, Xinghui Dong, Junyu Dong
Road Segmentation for Classification of Road Weather Conditions

Using vehicle cameras to automatically assess road weather conditions requires that the road surface first be identified and segmented from the imagery. This is a challenging problem for uncalibrated cameras such as removable dash cams or cell phone cameras, where the location of the road in the image may vary considerably from image to image. Here we show that combining a spatial prior with vanishing point and horizon estimators can generate improved road surface segmentation and consequently better road weather classification performance. The resulting system attains an accuracy of 86 % for binary classification (bare vs. snow/ice-covered) and 80 % for 3 classes (dry vs. wet vs. snow/ice-covered) on a challenging dataset.

Emilio J. Almazan, Yiming Qian, James H. Elder
Recognizing Text-Based Traffic Guide Panels with Cascaded Localization Network

In this paper, we introduce a new top-down framework for automatic localization and recognition of text-based traffic guide panels (http://tinyurl.com/wiki-guide-signs) captured by car-mounted cameras from natural scene images. The proposed framework involves two contributions. First, a novel Cascaded Localization Network (CLN) joining two customized convolutional nets is proposed to detect the guide panels and the scene text on them in a coarse-to-fine manner. In this network, the popular character-wise text saliency detection is replaced with string-wise text region detection, which avoids numerous bottom-up processing steps such as character clustering and segmentation. Text information contained within detected text regions is then interpreted by a deep recurrent model without character segmentation required. Second, a temporal fusion of text region proposals across consecutive frames is introduced to significantly reduce the redundant computation in neighboring frames. A new challenging Traffic Guide Panel dataset is collected to train and evaluate the proposed framework, instead of the unsuited symbol-based traffic sign datasets. Experimental results demonstrate that our proposed framework outperforms multiple recently published text spotting frameworks in real highway scenarios.

Xuejian Rong, Chucai Yi, Yingli Tian
The Automatic Blind Spot Camera: A Vision-Based Active Alarm System

In this paper we present a vision-based active safety system targeting the blind spot zone of trucks. Each year, blind spot accidents are responsible for numerous fatalities and severe injuries. Existing commercial systems do not seem able to cope with this problem completely. Therefore, we propose a vision-based safety system relying solely on the blind spot camera images. Our system is able to detect all vulnerable road users (VRUs) in the blind spot zone, and automatically generates an alarm towards the truck driver. This is an inherently challenging task. Indeed, such an active safety system implicitly requires extremely high accuracy at very low latency. These two demands are contradictory, and thus very difficult to unite. However, our real-life experiments show that our proposed active alarm system achieves excellent accuracy results while meeting these stringent requirements.

Kristof Van Beeck, Toon Goedemé
Extracting Driving Behavior: Global Metric Localization from Dashcam Videos in the Wild

Given the advance of portable cameras, many vehicles are equipped with always-on cameras on their dashboards (referred to as dashcams). We aim to utilize these dashcam videos harvested in the wild to extract the driving behavior—global metric localization of 3D vehicle trajectories (Fig. 1). We propose a robust approach to (1) extract a relative vehicle 3D trajectory from a dashcam video, (2) create a global metric 3D map using geo-localized Google StreetView RGBD panoramic images, and (3) align the relative vehicle 3D trajectory to the 3D map to achieve global metric localization. We conduct an experiment on 50 dashcam videos captured in 11 cities under various traffic conditions. For each video, we uniformly sample at least 15 control frames per road segment to manually annotate the ground truth 3D locations of the vehicle. On control frames, the extracted 3D locations are compared with these manually labeled ground truths to calculate the distance in meters. Our proposed method achieves an average error of 2.05 m, and 85.5 % of the estimates have an error of no more than 5 m. Our method significantly outperforms other vision-based baseline methods and is a more accurate alternative to the most widely used consumer-level Global Positioning System (GPS).

Shao-Pin Chang, Jui-Ting Chien, Fu-En Wang, Shang-Da Yang, Hwann-Tzong Chen, Min Sun
From On-Road to Off: Transfer Learning Within a Deep Convolutional Neural Network for Segmentation and Classification of Off-Road Scenes

Real-time road-scene understanding is a challenging computer vision task with recent advances in convolutional neural networks (CNN) achieving results that notably surpass prior traditional feature driven approaches. Here, we take an existing CNN architecture, pre-trained for urban road-scene understanding, and retrain it towards the task of classifying off-road scenes, assessing the network performance within the training cycle. Within the paradigm of transfer learning we analyse the effects on CNN classification, by training and assessing varying levels of prior training on varying sub-sets of our off-road training data. For each of these configurations, we evaluate the network at multiple points during its training cycle, allowing us to analyse in depth exactly how the training process is affected by these variations. Finally, we compare this CNN to a more traditional approach using a feature-driven Support Vector Machine (SVM) classifier and demonstrate state-of-the-art results in this particularly challenging problem of off-road scene understanding.

Christopher J. Holder, Toby P. Breckon, Xiong Wei
Joint Optical Flow and Temporally Consistent Semantic Segmentation

The importance and demands of visual scene understanding have been steadily increasing along with the active development of autonomous systems. Consequently, there has been a large amount of research dedicated to semantic segmentation and dense motion estimation. In this paper, we propose a method for jointly estimating optical flow and temporally consistent semantic segmentation, which closely connects these two problem domains and leverages each other. Semantic segmentation provides information on plausible physical motion to its associated pixels, and accurate pixel-level temporal correspondences enhance the accuracy of semantic segmentation in the temporal domain. We demonstrate the benefits of our approach on the KITTI benchmark, where we observe performance gains for flow and segmentation. We achieve state-of-the-art optical flow results, and outperform all published algorithms by a large margin on challenging, but crucial dynamic objects.

Junhwa Hur, Stefan Roth
Fusing Convolutional Neural Networks with a Restoration Network for Increasing Accuracy and Stability

In this paper, we propose a ConvNet for restoring images. Our ConvNet differs from state-of-the-art denoising networks in that it is deeper and, instead of restoring the image directly, it generates a pattern which is added to the noisy image to restore the clean image. Our experiments show that the Lipschitz constant of the proposed network is less than 1 and that it is able to remove very strong as well as very slight noise. This ability is mainly due to the shortcut connection in our network. We compare the proposed network with another denoising ConvNet and illustrate that the network without a shortcut connection performs poorly on low-magnitude noise. Moreover, we show that attaching the restoration ConvNet to a classification network increases the classification accuracy. Finally, our empirical analysis reveals that attaching a restoration network to a classification ConvNet can significantly increase its stability against noise.

Hamed H. Aghdam, Elnaz J. Heravi, Domenec Puig
Global Scale Integral Volumes

Integral volumes are an important image representation technique, useful in many computer vision applications. Processing integral volumes for large-scale 3D datasets is challenging due to high memory requirements. The difficulties lie in efficiently computing, storing, querying and updating the integral volume values. In this work, we address the above problems and present a novel solution for efficiently processing integral volumes for large-scale 3D datasets. We propose an octree-based method where the worst-case complexity for querying the integral volume of an arbitrary region is O(log n), where n is the number of nodes in the octree. We evaluate our proposed method on multi-resolution LiDAR point cloud data. Our work can serve as a tool to quickly extract features from large-scale 3D datasets, which can be beneficial for computer vision applications.

Sounak Bhattacharya, Lixin Fan, Pouria Babahajiani, Moncef Gabbouj
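
For context on what an integral volume buys you, the sketch below shows the dense 3D baseline: cumulative sums computed once, then any axis-aligned box sum in constant time via inclusion-exclusion. The paper's contribution is an octree structure that replaces this memory-hungry dense table with O(log n) queries on sparse large-scale data; that structure is not reproduced here.

# Dense 3-D integral volume baseline (not the paper's octree structure):
# precompute cumulative sums once, then any axis-aligned box sum costs O(1)
# via 3-D inclusion-exclusion over eight corners.
import numpy as np

def integral_volume(vol):
    """Cumulative sum along each axis, zero-padded for easy indexing."""
    iv = vol.cumsum(0).cumsum(1).cumsum(2)
    return np.pad(iv, ((1, 0), (1, 0), (1, 0)))

def box_sum(iv, z0, y0, x0, z1, y1, x1):
    """Sum of vol[z0:z1, y0:y1, x0:x1] by inclusion-exclusion on 8 corners."""
    return (iv[z1, y1, x1] - iv[z0, y1, x1] - iv[z1, y0, x1] - iv[z1, y1, x0]
            + iv[z0, y0, x1] + iv[z0, y1, x0] + iv[z1, y0, x0] - iv[z0, y0, x0])

vol = np.random.rand(64, 64, 64)
iv = integral_volume(vol)
assert np.isclose(box_sum(iv, 4, 8, 2, 20, 30, 10), vol[4:20, 8:30, 2:10].sum())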
Aerial Scene Understanding Using Deep Wavelet Scattering Network and Conditional Random Field

This paper presents a fast and robust architecture for scene understanding in aerial images recorded from an Unmanned Aerial Vehicle. The architecture uses a Deep Wavelet Scattering Network to extract translation- and rotation-invariant features that are then used by a Conditional Random Field to perform scene segmentation. Experiments are conducted using the proposed framework on two annotated datasets of 1277 and 300 aerial images, introduced in the paper. Overall pixel accuracies of 81 % and 78 % are achieved on the two datasets, respectively. A comparison with another similar framework is also presented.

Sandeep Nadella, Amarjot Singh, S. N. Omkar

W10 – BioImage Computing

Frontmatter
Single-Image Insect Pose Estimation by Graph Based Geometric Models and Random Forests

We propose a new method for detailed insect pose estimation, which aims to detect landmarks such as the tips of an insect's antennae and mouthparts from a single image. In this paper, we formulate this problem as inferring a mapping from the appearance of an insect to its corresponding pose. We present a unified framework that jointly learns a mapping from the local appearance (image patch) and the global anatomical structure (silhouette) of an insect to its corresponding pose. Our main contribution is a data-driven approach to learning the geometric prior for modeling various insect appearances. Combined with the discriminative power of the Random Forest (RF) model, our method achieves high precision of landmark localization. This approach is evaluated using three challenging datasets of insects which we make publicly available. Experiments show that it achieves an improvement over the traditional RF regression method, and precision comparable to that of human annotators.

Minmin Shen, Le Duan, Oliver Deussen
Feature Augmented Deep Neural Networks for Segmentation of Cells

In this work, we use a fully convolutional neural network for microscopy cell image segmentation. Rather than designing the network from scratch, we modify an existing network to suit our dataset. We show that improved cell segmentation can be obtained by augmenting the raw images with specialized feature maps, such as eigenvalues of the Hessian and wavelet-filtered images, for training our network. We also show modality transfer learning, by training a network on phase contrast images and testing on fluorescent images. Finally, we show that our network is able to segment irregularly shaped cells. We evaluate the performance of our methods on three datasets consisting of phase contrast, fluorescent and bright-field images.

Sajith Kecheril Sadanandan, Petter Ranefall, Carolina Wählby
3-D Density Kernel Estimation for Counting in Microscopy Image Volumes Using 3-D Image Filters and Random Decision Trees

We describe a means through which cells can be accurately counted in 3-D microscopy image data, using only weakly annotated images as input training material. We update an existing 2-D density kernel estimation approach into 3-D and we introduce novel 3-D features which encapsulate the 3-D neighbourhood surrounding each voxel. The proposed 3-D density kernel estimation (DKE-3-D) method, which utilises an ensemble of random decision trees, is computationally efficient and achieves state-of-the-art performance. DKE-3-D avoids the problem of discrete object identification and segmentation, common to many existing 3-D counting techniques, and we show that it outperforms other methods when quantification of densely packed and heterogeneous objects is desired. In this article we successfully apply the technique to two simulated and to two experimentally derived datasets and show that DKE-3-D has great potential in the biomedical sciences and any field where volumetric datasets are used.

Dominic Waithe, Martin Hailstone, Mukesh Kumar Lalwani, Richard Parton, Lu Yang, Roger Patient, Christian Eggeling, Ilan Davis
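
A conceptual sketch of density-kernel counting with a random-forest regressor is given below: each voxel gets a feature vector, the regression target is a Gaussian density placed at annotated dot locations, and the object count is the integral of the predicted density. The voxel features here are trivial placeholders, not the 3D neighbourhood features proposed in the paper.

# Conceptual sketch of density-based counting: regress a per-voxel density
# from voxel features with a random forest, then sum the density to count.
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
volume = rng.random((32, 32, 32))
dots = np.zeros_like(volume)
dots[16, 16, 16] = dots[8, 20, 10] = 1.0            # two annotated cell centres
density = gaussian_filter(dots, sigma=2)             # training target per voxel

# placeholder voxel features: raw intensity plus a smoothed version
feats = np.stack([volume.ravel(),
                  gaussian_filter(volume, 1).ravel()], axis=1)
model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(feats, density.ravel())

predicted_count = model.predict(feats).sum()          # integrate the density map
print(round(predicted_count, 2), "objects (ground truth: 2)")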
Dendritic Spine Shape Analysis: A Clustering Perspective

Functional properties of neurons are strongly coupled with their morphology. Changes in neuronal activity alter the morphological characteristics of dendritic spines. The first step towards understanding the structure-function relationship is to group spines into the main spine classes reported in the literature. Shape analysis of dendritic spines can help neuroscientists understand the underlying relationships. Due to the unavailability of reliable automated tools, this analysis is currently performed manually, which is a time-intensive and subjective task. Several studies on spine shape classification have been reported in the literature; however, there is an ongoing debate on whether distinct spine shape classes exist or whether spines should be modeled through a continuum of shape variations. Another challenge is the subjectivity and bias introduced by the supervised nature of classification approaches. In this paper, we aim to address these issues by presenting a clustering perspective. In this context, clustering may serve both the confirmation of known patterns and the discovery of new ones. We perform cluster analysis on two-photon microscopic images of spines using morphological, shape, and appearance based features and gain insights into the spine shape analysis problem. We use histograms of oriented gradients (HOG), disjunctive normal shape models (DNSM), morphological features, and intensity profile based features for cluster analysis. We use x-means to perform cluster analysis, which selects the number of clusters automatically using the Bayesian information criterion (BIC). For all features, this analysis produces 4 clusters and we observe the formation of at least one cluster consisting of spines which are difficult to assign to a known class. This observation supports the argument for intermediate shape types.

Muhammad Usman Ghani, Ertunç Erdil, Sümeyra Demir Kanık, Ali Özgür Argunşah, Anna Felicity Hobbiss, Inbal Israely, Devrim Ünay, Tolga Taşdizen, Müjdat Çetin
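
As a stand-in for the x-means step (scikit-learn ships no x-means), the sketch below selects the number of clusters by the BIC of Gaussian mixture models, which follows the same pick-k-by-BIC idea used in the abstract above; the feature matrix is a random placeholder for HOG/DNSM descriptors.

# Illustrative BIC-driven model selection; not the paper's x-means code.
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_by_bic(features, k_range=range(2, 9), seed=0):
    """Fit GMMs for several k and keep the one with the lowest BIC."""
    best_model, best_bic = None, np.inf
    for k in k_range:
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(features)
        bic = gmm.bic(features)
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model, best_model.predict(features)

# `features` stands in for HOG or DNSM descriptors, one row per spine image
features = np.random.rand(200, 64)
model, labels = cluster_by_bic(features)
print(model.n_components, np.bincount(labels))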
Cell Counting by Regression Using Convolutional Neural Network

The ability to accurately quantitate specific populations of cells is important for precision diagnostics in laboratory medicine. For example, the quantitation of positive tumor cells can be used clinically to determine the need for chemotherapy in a cancer patient. In this paper, we describe a supervised learning framework with a Convolutional Neural Network (CNN) and cast the cell counting task as a regression problem, where the global cell count is taken as the annotation to supervise training, instead of following the classification or detection framework. To further decrease the prediction error of counting, we tune several cutting-edge CNN architectures (e.g. Deep Residual Network) into the regression model. As the final output, not only is the cell count estimated for an image, but its spatial density map is also provided. The proposed method is evaluated against three state-of-the-art approaches on three cell image datasets and obtains superior performance.

Yao Xue, Nilanjan Ray, Judith Hugh, Gilbert Bigras
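
A minimal PyTorch sketch of the count-regression setup is shown below: an image goes in, a scalar count comes out, and the global count supervises training through an MSE loss. The toy network is far shallower than the residual architectures tuned in the paper and only illustrates the framing.

# Toy count-regression CNN: image in, scalar count out, MSE against the
# annotated global count. Architecture and sizes are illustrative only.
import torch
import torch.nn as nn

class CountRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)   # scalar cell count

    def forward(self, x):
        return self.head(self.features(x).flatten(1)).squeeze(1)

model = CountRegressor()
images = torch.rand(4, 1, 128, 128)          # a batch of grayscale patches
counts = torch.tensor([12., 7., 30., 0.])    # global counts as supervision
loss = nn.functional.mse_loss(model(images), counts)
loss.backward()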
Measuring Process Dynamics and Nuclear Migration for Clones of Neural Progenitor Cells

Neural stem and progenitor cells (NPCs) generate processes that extend from the cell body in a dynamic manner. The NPC nucleus migrates along these processes with patterns believed to be tightly coupled to mechanisms of cell cycle regulation and cell fate determination. Here, we describe a new segmentation and tracking approach that allows NPC processes and nuclei to be reliably tracked across multiple rounds of cell division in phase-contrast microscopy images. Results are presented for mouse adult and embryonic NPCs from hundreds of clones, or lineage trees, containing tens of thousands of cells and millions of segmentations. New visualization approaches allow the NPC nuclear and process features to be effectively visualized for an entire clone. Significant differences in process and nuclear dynamics were found among type A and type C adult NPCs, and also between embryonic NPCs cultured from the anterior and posterior cerebral cortex.

Edgar Cardenas De La Hoz, Mark R. Winter, Maria Apostolopoulou, Sally Temple, Andrew R. Cohen
Histopathology Image Categorization with Discriminative Dimension Reduction of Fisher Vectors

In this paper, we present a histopathology image categorization method based on Fisher vector descriptors. While the Fisher vector has been broadly successful for general computer vision and has recently been applied to microscopy image analysis, its feature dimension is very high and this can affect classification performance, especially when only a small number of training images is available. To address this issue, we design a dimension reduction algorithm in a discriminative learning model with similarity and representation constraints. In addition, to obtain the image-level Fisher vectors, we incorporate two types of local descriptors based on the standard texture feature and unsupervised feature learning. We use three publicly available datasets for experiments. Our evaluation shows that our overall approach achieves consistent performance improvement over existing approaches, our proposed discriminative dimension reduction algorithm outperforms common dimension reduction techniques, and different local descriptors have varying effects on different datasets.

Yang Song, Qing Li, Heng Huang, Dagan Feng, Mei Chen, Weidong Cai
Automatic Detection and Segmentation of Exosomes in Transmission Electron Microscopy

Exosomes are nanosized, cell-derived vesicles that appear in different biological fluids. They attract a growing interest of the research community due to their important role in intercellular communication. An easy to use and reliable method for their quantification and characterization at the single-vesicle level is tremendously needed to help evaluate exosomal preparations in research as well as clinical studies. In this paper, we present a morphological method for automatic detection and segmentation of exosomes in transmission electron microscopy images. The exosome segmentation is carried out using morphological seeded watershed on the gradient magnitude image, with the seeds established by applying a series of hysteresis thresholdings, followed by morphological filtering and cluster splitting. We tested the method on a diverse image dataset, yielding a detection performance of slightly over 80 %.

Karel Štěpka, Martin Maška, Jakub Jozef Pálenik, Vendula Pospíchalová, Anna Kotrbová, Ladislav Ilkovics, Dobromila Klemová, Aleš Hampl, Vítězslav Bryja, Pavel Matula
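
The seeded-watershed idea can be sketched with scikit-image as below: seeds from a hysteresis threshold, watershed on the gradient magnitude, with a small morphological clean-up in between. Threshold values and the clean-up are placeholder choices, and the cluster-splitting step from the paper is omitted.

# Rough seeded-watershed sketch: hysteresis-threshold seeds, Sobel gradient
# as the watershed relief. Thresholds are illustrative, not tuned values.
import numpy as np
from scipy import ndimage as ndi
from skimage import filters, segmentation, morphology

def segment_vesicles(image, low=0.2, high=0.5):
    seeds = filters.apply_hysteresis_threshold(image, low, high)
    seeds = morphology.remove_small_objects(seeds, min_size=20)
    markers, _ = ndi.label(seeds)
    gradient = filters.sobel(image)                 # watershed relief
    labels = segmentation.watershed(gradient, markers, mask=image > low)
    return labels

image = np.random.rand(256, 256)                    # stand-in for a TEM image
labels = segment_vesicles(image)
print(labels.max(), "candidate regions")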
Poisson Point Processes for Solving Stochastic Inverse Problems in Fluorescence Microscopy

Despite revolutionary developments in fluorescence based optical microscopy imaging, the quality of the images remains fundamentally limited by diffraction and noise. Hence, deconvolution methods are often applied to obtain better estimates of the biological structures than the measured images are providing prima facie, by reducing blur and noise as much as possible through image postprocessing. However, conventional deconvolution methods typically focus on accurately modeling the point-spread function of the microscope, and put less emphasis on properly modeling the noise sources. Here we propose a new approach to enhancing fluorescence microscopy images by formulating deconvolution as a stochastic inverse problem. We solve the problem using Poisson point processes and establish a connection between the classical Shepp-Vardi algorithm and probability hypothesis density filtering. Results of preliminary experiments on image data from various biological applications indicate that the proposed method compares favorably with existing approaches in jointly performing deblurring and denoising.

Ihor Smal, Erik Meijering
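
The classical Shepp-Vardi EM iteration referenced above coincides, for image deconvolution, with Richardson-Lucy updates; a minimal FFT-based sketch is given below with a toy PSF. The paper's Poisson point process formulation goes well beyond this baseline, which is shown only to fix ideas.

# Richardson-Lucy / Shepp-Vardi EM deconvolution baseline (toy PSF, no noise).
import numpy as np
from scipy.signal import fftconvolve

def richardson_lucy(observed, psf, n_iter=30, eps=1e-12):
    """Multiplicative EM updates: estimate <- estimate * ((obs / (est*psf)) * psf^T)."""
    estimate = np.full_like(observed, observed.mean())
    psf_mirror = psf[::-1, ::-1]
    for _ in range(n_iter):
        blurred = fftconvolve(estimate, psf, mode='same')
        ratio = observed / (blurred + eps)
        estimate *= fftconvolve(ratio, psf_mirror, mode='same')
    return estimate

psf = np.outer(np.hanning(9), np.hanning(9))
psf /= psf.sum()                                     # toy point-spread function
truth = np.zeros((64, 64)); truth[20:24, 30:34] = 1.0
observed = fftconvolve(truth, psf, mode='same')
restored = richardson_lucy(observed, psf)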
Deep Convolutional Neural Networks for Human Embryonic Cell Counting

We address the problem of counting cells in time-lapse microscopy images of developing human embryos. Cell counting is considered an important step in analyzing biological phenomena such as embryo viability. Traditional approaches to counting cells rely on hand-crafted features and cannot fully take advantage of the growth in dataset sizes. In this paper, we propose a framework to automatically count the number of cells in developing human embryos. The framework employs a deep convolutional neural network model trained to count cells from raw microscopy images. We demonstrate the effectiveness of our approach on a dataset of 265 human embryos. The results show that the proposed framework provides robust estimates of the number of cells in a developing embryo up to the 5-cell stage (i.e., 48 h post fertilization).

Aisha Khan, Stephen Gould, Mathieu Salzmann

W15 – Robust Reading

Frontmatter
Robust Text Detection with Vertically-Regressed Proposal Network

Methods for general object detection, such as R-CNN [4] and Fast R-CNN [3], have been successfully applied to text detection, as in [7]. However, there is difficulty in directly using RPN [10], a leading object detection method, for text detection. This is due to the difference between text and general objects. On one hand, text regions have variable lengths, and thus networks must be designed to have large receptive field sizes. On the other hand, positive text regions cannot be measured in the same way as for general objects during training. In this paper, we introduce a novel vertically-regressed proposal network (VRPN), which allows text regions to be matched by multiple neighboring small anchors. Meanwhile, training regions are selected according to how much they overlap with ground-truth boxes vertically, and the location of positive regions is regressed only in the vertical direction. Experiments on the dataset provided by ICDAR 2015 Challenge 1 demonstrate the effectiveness of our methods.

Donglai Xiang, Qiang Guo, Yan Xia
Scene Text Detection with Adaptive Line Clustering

We propose a scene text detection system which can maintain a high recall while achieving a fair precision. In our method, no character candidate is eliminated based on character-level features. A weighted directed graph is constructed and the minimum average cost path algorithm is adopted to extract line candidates. After assigning three line-level probability values to each line, the final decisions are made according to the line candidate clustering of the current image. The proposed system has been evaluated on the ICDAR 2013 dataset. Compared with other published methods, it has achieved better performances.

Xinxu Qiao, He Zhu, Weiping Li
From Text Detection to Text Segmentation: A Unified Evaluation Scheme

Current text segmentation evaluation protocols are often incapable of properly handling different scenarios (broken/merged/partial characters). This leads to scores that incorrectly reflect the segmentation accuracy. In this article we propose a new evaluation scheme that overcomes most of the existing drawbacks by extending the EvaLTex protocol (initially designed to evaluate text detection at region level). This new unified platform has numerous advantages: it is able to evaluate a text understanding system at every detection stage and granularity level (paragraph/line/word and now character) by using the same metrics and matching rules; it is robust to all segmentation scenarios; it provides a qualitative and quantitative evaluation and a visual score representation that captures the whole behavior of a segmentation algorithm. Experimental results on nine segmentation algorithms using different evaluation frameworks are also provided to emphasize the interest of our method.

Stefania Calarasanu, Jonathan Fabrizio, Séverine Dubuisson
Dynamic Lexicon Generation for Natural Scene Images

Many scene text understanding methods approach the end-to-end recognition problem from a word-spotting perspective and benefit greatly from using small per-image lexicons. Such customized lexicons are normally assumed to be given, and their source is rarely discussed. In this paper we propose a method that generates contextualized lexicons for scene images using only visual information. For this, we exploit the correlation between visual and textual information in a dataset consisting of images and textual content associated with them. Using the topic modeling framework to discover a set of latent topics in such a dataset allows us to re-rank a fixed dictionary in a way that prioritizes the words that are more likely to appear in a given image. Moreover, we train a CNN that is able to reproduce those word rankings but using only the raw image pixels as input. We demonstrate that the quality of the automatically obtained custom lexicons is superior to a generic frequency-based baseline.

Yash Patel, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas
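
The dictionary re-ranking idea can be illustrated with scikit-learn's LDA as below: learn topics from the text associated with images, infer a topic mixture for a query document, and rank vocabulary words by their topic-weighted probability. The toy corpus is invented, and the CNN that predicts the ranking directly from pixels is not shown.

# Toy topic-model lexicon re-ranking with scikit-learn; corpus and topic
# count are placeholders, not the paper's data or settings.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["train station platform departure times",
        "menu restaurant pizza pasta prices",
        "airport gate boarding departure flight"]
vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

def rank_lexicon(query_text):
    theta = lda.transform(vec.transform([query_text]))[0]        # topic mixture
    word_topic = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    scores = theta @ word_topic                                    # word relevance
    vocab = np.array(vec.get_feature_names_out())
    return vocab[np.argsort(-scores)]

print(rank_lexicon("flight departure board")[:5])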
End-to-End Interpretation of the French Street Name Signs Dataset

We introduce the French Street Name Signs (FSNS) Dataset consisting of more than a million images of street name signs cropped from Google Street View images of France. Each image contains several views of the same street name sign. Every image has normalized, title-case-folded ground-truth text as it would appear on a map. We believe that the FSNS dataset is large and complex enough to train a deep network of significant complexity to solve the street name extraction problem “end-to-end” or to explore the design trade-offs between a single complex engineered network and multiple sub-networks designed and trained to solve sub-problems. We present such an “end-to-end” network/graph for TensorFlow and its results on the FSNS dataset.

Raymond Smith, Chunhui Gu, Dar-Shyang Lee, Huiyi Hu, Ranjith Unnikrishnan, Julian Ibarz, Sacha Arnoud, Sophia Lin
Efficient Exploration of Text Regions in Natural Scene Images Using Adaptive Image Sampling

An adaptive image sampling framework is proposed for identifying text regions in natural scene images. Only a small fraction of the pixels actually correspond to text regions, so it is desirable to eliminate non-text regions at the early stages of text detection. First, the image is sampled row-by-row at a specific rate and each row is tested for containing text using a 1D adaptation of the Maximally Stable Extremal Regions (MSER) algorithm. The surrounding rows of the image are recursively sampled at finer rates to fully contain the text. The adaptive sampling process is performed in the vertical dimension as well for the identified regions. The final output is a binary mask which can be used for text detection and/or recognition purposes. Experiments on the ICDAR'03 dataset show that the proposed approach is up to 7x faster than the MSER baseline on a single CPU core with comparable text localization scores. The approach is inherently parallelizable for further speed improvements.

Ismet Zeki Yalniz, Douglas Gray, R. Manmatha
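
The coarse-to-fine row sampling strategy, stripped of the 1D MSER machinery, looks roughly like the sketch below; row_has_text is a placeholder heuristic standing in for the paper's 1D MSER test, and only the adaptive refinement logic is illustrated.

# Coarse-to-fine row sampling sketch; the text test is a dummy heuristic.
import numpy as np

def row_has_text(row):
    """Placeholder for the paper's 1-D MSER based test."""
    return row.std() > 40              # illustrative heuristic only

def sample_text_rows(image, coarse_step=16, fine_step=2):
    h = image.shape[0]
    mask = np.zeros(h, dtype=bool)
    hits = [y for y in range(0, h, coarse_step) if row_has_text(image[y])]
    for y in hits:                     # refine around each coarse hit
        lo, hi = max(0, y - coarse_step), min(h, y + coarse_step)
        for yy in range(lo, hi, fine_step):
            if row_has_text(image[yy]):
                mask[yy] = True
    return mask

image = (np.random.rand(480, 640) * 255).astype(np.uint8)
print(sample_text_rows(image).sum(), "candidate rows")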
Downtown Osaka Scene Text Dataset

This paper presents a new scene text dataset named the Downtown Osaka Scene Text Dataset (in short, DOST dataset). The dataset consists of sequential images captured in shopping streets in downtown Osaka with an omnidirectional camera. Unlike most existing datasets, which consist of intentionally captured scene images, the DOST dataset consists of uncontrolled scene images; the use of an omnidirectional camera enabled us to capture videos (sequential images) of the whole scene surrounding the camera. Since the dataset preserves the real scenes containing text as they were, they are, in other words, scene texts in the wild. The DOST dataset contains 32,147 manually ground-truthed sequential images. They contain 935,601 text regions, of which 797,919 are legible and 137,682 illegible. The legible regions contain 2,808,340 characters. The dataset is evaluated using two existing scene text detection methods and one powerful commercial end-to-end scene text recognition method to assess its difficulty and quality in comparison with existing datasets.

Masakazu Iwamura, Takahiro Matsuda, Naoyuki Morimoto, Hitomi Sato, Yuki Ikeda, Koichi Kise

W17 – Egocentric Perception, Interaction and Computing

Frontmatter
DeepDiary: Automatically Captioning Lifelogging Image Streams

Lifelogging cameras capture everyday life from a first-person perspective, but generate so much data that it is hard for users to browse and organize their image collections effectively. In this paper, we propose to use automatic image captioning algorithms to generate textual representations of these collections. We develop and explore novel techniques based on deep learning to generate captions for both individual images and image streams, using temporal consistency constraints to create summaries that are both more compact and less noisy. We evaluate our techniques with quantitative and qualitative results, and apply captioning to an image retrieval application for finding potentially private images. Our results suggest that our automatic captioning algorithms, while imperfect, may work well enough to help users manage lifelogging photo collections.

Chenyou Fan, David J. Crandall
Temporal Segmentation of Egocentric Videos to Highlight Personal Locations of Interest

With the increasing availability of wearable cameras, the acquisition of egocentric videos is becoming common in many scenarios. However, the absence of explicit structure in such videos (e.g., video chapters) makes their exploitation difficult. We propose to segment unstructured egocentric videos to highlight the presence of personal locations of interest specified by the end-user. Given the large variability of the visual content acquired by such devices, it is necessary to design explicit rejection mechanisms able to detect negatives (i.e., frames not related to any considered location) learning only from positive ones at training time. To challenge the problem, we collected a dataset of egocentric videos containing 10 personal locations of interest. We propose a method to segment egocentric videos performing discrimination among the personal locations of interest, rejection of negative frames, and enforcing temporal coherence between neighboring predictions.

Antonino Furnari, Giovanni Maria Farinella, Sebastiano Battiato
Face-Off: A Face Reconstruction Technique for Virtual Reality (VR) Scenarios

Virtual Reality (VR) headsets occlude a significant portion of the human face. The real human face is required in many VR applications, for example video teleconferencing. This paper proposes a wearable-camera-based solution to reconstruct the real face of a person wearing a VR headset. Our solution is built on asymmetrical principal component analysis (aPCA). A user-specific training model is built using aPCA with full face, lips and eye region information. During the testing phase, the lower face region and partial eye information are used to reconstruct the wearer's face. The online testing session consists of two phases: (i) a calibration phase and (ii) a reconstruction phase. In the former, a small calibration step is performed to align the test information with the training data, while the latter uses half-face information to reconstruct the full face using the aPCA-trained data. The proposed approach is validated with qualitative and quantitative analysis.

M. S. L. Khan, Shafiq Ur Réhman, Ulrik Söderström, Alaa Halawani, Haibo Li
GPU Accelerated Left/Right Hand-Segmentation in First Person Vision

Wearable cameras allow users to record their daily activities from a user-centered (First Person Vision) perspective. Due to their favourable location, they frequently capture the hands of the user, and may thus represent a promising user-machine interaction tool for different applications. Existing First Person Vision methods treat the hands as a background/foreground segmentation problem and ignore two important issues: (i) each pixel is sequentially classified, creating a long processing queue; (ii) hands are not a single “skin-like” moving element but a pair of interacting entities (left and right hand). This paper proposes a GPU-accelerated implementation of a left/right hand segmentation algorithm. The GPU implementation exploits the nature of the pixel-by-pixel classification strategy. The left-right identification is carried out by a competitive likelihood test based on the position and the angle of the segmented pixels.

Alejandro Betancourt, Lucio Marcenaro, Emilia Barakova, Matthias Rauterberg, Carlo Regazzoni
Egocentric Vision for Visual Market Basket Analysis

This paper introduces a new application scenario for egocentric vision: Visual Market Basket Analysis (VMBA). The main goal in the proposed application domain is the understanding of customer behaviour in retail stores from videos acquired with cameras mounted on shopping carts (which we call narrative carts). To properly study the problem and to set the first VMBA challenge, we introduce the VMBA15 dataset. The dataset is composed of 15 different egocentric videos acquired with narrative carts while users shop in a retail store. The frames of each video have been labelled by considering 8 possible behaviours of the carts. The considered cart behaviours reflect the behaviour of the customers from the beginning (cart picking) to the end (cart releasing) of their shopping in the store. The inferred information related to the time the carts spend stopped within the store, or to the stops at cash desks, could be coupled with classic Market Basket Analysis information (i.e., receipts) to help retailers better manage spaces and marketing strategies. To benchmark the proposed problem on the introduced dataset we have considered classic visual and audio descriptors to represent video frames at each instant. Classification has been performed exploiting the Directed Acyclic Graph SVM learning architecture. Experiments point out that an accuracy of more than 93 % can be obtained on the 8 considered classes.

Vito Santarcangelo, Giovanni Maria Farinella, Sebastiano Battiato
SEMBED: Semantic Embedding of Egocentric Action Videos

We present SEMBED, an approach for embedding an egocentric object interaction video in a semantic-visual graph to estimate the probability distribution over its potential semantic labels. When object interactions are annotated using unbounded choice of verbs, we embrace the wealth and ambiguity of these labels by capturing the semantic relationships as well as the visual similarities over motion and appearance features. We show how SEMBED can interpret a challenging dataset of 1225 freely annotated egocentric videos, outperforming SVM classification by more than 5 %.

Michael Wray, Davide Moltisanti, Walterio Mayol-Cuevas, Dima Damen
Interactive Feature Growing for Accurate Object Detection in Megapixel Images

Automatic object detection in megapixel images is quite inaccurate and a time- and memory-expensive task, even with feature detectors and descriptors like SIFT, SURF, ORB, and KAZE. In this paper we propose an interactive feature growing process, which draws on the efficiency of the users' visual system. The performance of the visual system in search tasks is not affected by pixel density, so the users' gazes are used to boost feature extraction for object detection. Experimental tests of the interactive feature growing process show an increase of processing speed by 50 % for object detection in 20 megapixel scenes at an object detection rate of 95 %. Based on this method, we discuss the prospects of interactive features, possible use cases and further developments.

Julius Schöning, Patrick Faion, Gunther Heidemann
Towards Semantic Fast-Forward and Stabilized Egocentric Videos

The emergence of low-cost personal mobile devices and wearable cameras and the increasing storage capacity of video-sharing websites have pushed forward a growing interest in first-person videos. Since most recorded videos are long-running streams with unedited content, they are tedious and unpleasant to watch. State-of-the-art fast-forward methods face the challenge of balancing the smoothness of the video and the emphasis on relevant frames given a speed-up rate. In this work, we present a methodology capable of summarizing and stabilizing egocentric videos by extracting semantic information from the frames. This paper also describes a dataset collection with several semantically labeled videos and introduces a new smoothness evaluation metric for egocentric videos that is used to test our method.

Michel Melo Silva, Washington Luis Souza Ramos, Joao Pedro Klock Ferreira, Mario Fernando Montenegro Campos, Erickson Rangel Nascimento
A3D: A Device for Studying Gaze in 3D

A wearable device for capturing 3D gaze information in indoor and outdoor environments is proposed. The hardware and software architecture of the device provides a quasi-real-time estimate of 2.5D points of regard (POR) and then lifts these estimates to 3D by projecting them into the 3D reconstructed scene. The estimation procedure does not need any external device, and can be used both indoors and outdoors while the subject wearing it is moving, though some smoothness constraints on the motion are required. To ensure great flexibility with respect to depth, a novel calibration method is proposed, which provides eye-scene calibration that explicitly takes depth information into account, thus ensuring a quite accurate estimation of the PORs. The experimental evaluation demonstrates that both 2.5D and 3D POR are accurately estimated.

Mahmoud Qodseya, Marta Sanzari, Valsamis Ntouskos, Fiora Pirri
Context Change Detection for an Ultra-Low Power Low-Resolution Ego-Vision Imager

With the increasing popularity of wearable cameras, such as GoPro or Narrative Clip, research on continuous activity monitoring from egocentric cameras has received a lot of attention. Research in hardware and software is devoted to finding new efficient, stable and long-running solutions; however, devices are too power-hungry for truly always-on operation, and are aggressively duty-cycled to achieve acceptable lifetimes. In this paper we present a wearable system for context change detection based on an egocentric camera with ultra-low power consumption that can collect data 24/7. Although the resolution of the captured images is low, experimental results in real scenarios demonstrate how our approach, based on Siamese Neural Networks, can achieve visual context awareness. In particular, we compare our solution with hand-crafted features and with a state-of-the-art technique, and propose a novel and challenging dataset composed of roughly 30,000 low-resolution images.

Francesco Paci, Lorenzo Baraldi, Giuseppe Serra, Rita Cucchiara, Luca Benini

W22 – Web–scale Vision and Social Media

Frontmatter
Label-Based Automatic Alignment of Video with Narrative Sentences

In this paper we consider videos (e.g. Hollywood movies) and their accompanying natural language descriptions in the form of narrative sentences (e.g. movie scripts without timestamps). We propose a method for temporally aligning the video frames with the sentences using both visual and textual information, which provides automatic timestamps for each narrative sentence. We compute the similarity between both types of information using vectorial descriptors and propose to cast this alignment task as a matching problem that we solve via dynamic programming. Our approach is simple to implement, highly efficient and does not require the presence of frequent dialogues, subtitles, and character face recognition. Experiments on various movies demonstrate that our method can successfully align the movie script sentences with the video frames of movies.

Pelin Dogan, Markus Gross, Jean-Charles Bazin
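
The alignment step can be illustrated with a generic DTW-style dynamic program, as below: given a frame-by-sentence similarity matrix, find the best monotonic assignment of frames to sentences. The recurrence and the random similarity matrix are illustrative; the authors' cost model and descriptors are not reproduced.

# Generic monotonic alignment via dynamic programming over a similarity matrix.
import numpy as np

def align(sim):
    """sim[t, s]: similarity between frame t and sentence s.
    Returns, for each frame, the index of the aligned sentence."""
    T, S = sim.shape
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    dp[0, 0] = sim[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1, s]                              # same sentence continues
            move = dp[t - 1, s - 1] if s > 0 else -np.inf    # advance to next sentence
            if move > stay:
                dp[t, s], back[t, s] = move + sim[t, s], s - 1
            else:
                dp[t, s], back[t, s] = stay + sim[t, s], s
    path, s = [], S - 1                                      # end with the last sentence
    for t in range(T - 1, -1, -1):
        path.append(s)
        s = back[t, s]
    return path[::-1]

sim = np.random.rand(20, 4)                                  # 20 frames, 4 sentences
print(align(sim))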
Towards Category Based Large-Scale Image Retrieval Using Transductive Support Vector Machines

In this study, we use transductive learning and binary hierarchical trees to create compact binary hashing codes for large-scale image retrieval applications. We create multiple hierarchical trees based on the separability of the visual object classes by random selection, and the transductive support vector machine (TSVM) classifier is used to separate both the labeled and unlabeled data samples at each node of the binary hierarchical trees (BHTs). The separating hyperplanes returned by TSVM are then used to create binary codes. We propose a novel TSVM method that is more robust to noisy labels by replacing the classical Hinge loss with the robust Ramp loss. A stochastic gradient based solver is used to learn the TSVM classifier to ensure that the method scales well with large-scale datasets. The proposed method improves over the Euclidean distance metric, achieves results comparable to the state of the art on the CIFAR10 and MNIST datasets, and significantly outperforms state-of-the-art hashing methods on the NUS-WIDE dataset.

Hakan Cevikalp, Merve Elmas, Savas Ozkan
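
Once the separating hyperplanes are available, turning them into binary codes is the easy part; the toy sketch below takes one bit per hyperplane as the sign of the projection. The TSVM/Ramp-loss training that produces the hyperplanes is omitted, and the random hyperplanes here are placeholders.

# One bit per hyperplane: sign of the projection onto (w, b).
import numpy as np

def binary_codes(X, hyperplanes):
    """hyperplanes: list of (w, b); returns an (n_samples, n_bits) 0/1 matrix."""
    W = np.stack([w for w, _ in hyperplanes])
    b = np.array([b for _, b in hyperplanes])
    return (X @ W.T + b > 0).astype(np.uint8)

X = np.random.randn(5, 128)                                   # 5 image descriptors
planes = [(np.random.randn(128), 0.0) for _ in range(16)]     # 16-bit codes
print(binary_codes(X, planes))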
Solving Multi-codebook Quantization in the GPU

We focus on the problem of vector compression using multi-codebook quantization (MCQ). MCQ is a generalization of k-means where the centroids arise from the combinatorial sums of entries in multiple codebooks, and has become a critical component of large-scale, state-of-the-art approximate nearest neighbour search systems. MCQ is often addressed in an iterative manner, where learning the codebooks can be solved exactly via least-squares, but finding the optimal codes results in a large number of combinatorial NP-Hard problems. Recently, we have demonstrated that an algorithm based on stochastic local search for this problem outperforms all previous approaches. In this paper we introduce a GPU implementation of our method, which achieves a 30× speedup over a single-threaded CPU implementation. Our code is publicly available (https://github.com/jltmtz/local-search-quantization).

Julieta Martinez, Holger H. Hoos, James J. Little
Learning Joint Representations of Videos and Sentences with Web Image Search

Our objective is video retrieval based on natural language queries. In addition, we consider the analogous problem of retrieving sentences or generating descriptions given an input video. Recent work has addressed the problem by embedding visual and textual inputs into a common space where semantic similarities correlate to distances. We also adopt the embedding approach, and make the following contributions: First, we utilize web image search in sentence embedding process to disambiguate fine-grained visual concepts. Second, we propose embedding models for sentence, image, and video inputs whose parameters are learned simultaneously. Finally, we show how the proposed model can be applied to description generation. Overall, we observe a clear improvement over the state-of-the-art methods in the video and sentence retrieval tasks. In description generation, the performance level is comparable to the current state-of-the-art, although our embeddings were trained for the retrieval tasks.

Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkilä, Naokazu Yokoya
Depth2Action: Exploring Embedded Depth for Large-Scale Action Recognition

This paper performs the first investigation into depth for large-scale human action recognition in video where the depth cues are estimated from the videos themselves. We develop a new framework called depth2action and experiment thoroughly into how best to incorporate the depth information. We introduce spatio-temporal depth normalization (STDN) to enforce temporal consistency in our estimated depth sequences. We also propose modified depth motion maps (MDMM) to capture the subtle temporal changes in depth. These two components significantly improve the action recognition performance. We evaluate our depth2action framework on three large-scale action recognition video benchmarks. Our model achieves state-of-the-art performance when combined with appearance and motion information thus demonstrating that depth2action is indeed complementary to existing approaches.

Yi Zhu, Shawn Newsam
Cross-Dimensional Weighting for Aggregated Deep Convolutional Features

We propose a simple and straightforward way of creating powerful image representations via cross-dimensional weighting and aggregation of deep convolutional neural network layer outputs. We first present a generalized framework that encompasses a broad family of approaches and includes cross-dimensional pooling and weighting steps. We then propose specific non-parametric schemes for both spatial- and channel-wise weighting that boost the effect of highly active spatial responses and at the same time regulate burstiness effects. We experiment on different public datasets for image search and show that our approach outperforms the current state-of-the-art for approaches based on pre-trained networks. We also provide an easy-to-use, open source implementation that reproduces our results.

Yannis Kalantidis, Clayton Mellina, Simon Osindero
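
The cross-dimensional weighting scheme can be sketched over a C×H×W convolutional feature map as below: a spatial weight from per-location total activation, a channel weight favouring rarely firing channels, then weighted sum-pooling into a global descriptor. The exact normalizations in the paper differ; this only shows the structure.

# Simplified cross-dimensional weighting and aggregation of a conv feature map.
import numpy as np

def weighted_aggregate(fmap, eps=1e-8):
    C, H, W = fmap.shape
    # spatial weight: aggregate activation at each location, L2-normalized
    spatial = fmap.sum(axis=0)
    spatial = spatial / (np.linalg.norm(spatial) + eps)
    # channel weight: boost channels with sparse (rare) responses
    nonzero = (fmap > 0).reshape(C, -1).mean(axis=1)
    channel = np.log((nonzero.sum() + eps) / (nonzero + eps))
    # cross-dimensional weighting, then sum-pool to a global descriptor
    weighted = fmap * spatial[None, :, :]
    desc = channel * weighted.reshape(C, -1).sum(axis=1)
    return desc / (np.linalg.norm(desc) + eps)

fmap = np.maximum(np.random.randn(512, 14, 14), 0)   # e.g. a ReLU conv layer output
print(weighted_aggregate(fmap).shape)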
LOH and Behold: Web-Scale Visual Search, Recommendation and Clustering Using Locally Optimized Hashing

We propose a novel hashing-based matching scheme, called Locally Optimized Hashing (LOH), based on a state-of-the-art quantization algorithm that can be used for efficient, large-scale search, recommendation, clustering, and deduplication. We show that matching with LOH only requires set intersections and summations to compute and so is easily implemented in generic distributed computing systems. We further show application of LOH to: (a) large-scale search tasks where performance is on par with other state-of-the-art hashing approaches; (b) large-scale recommendation where queries consisting of thousands of images can be used to generate accurate recommendations from collections of hundreds of millions of images; and (c) efficient clustering with a graph-based algorithm that can be scaled to massive collections in a distributed environment or can be used for deduplication for small collections, like search results, performing better than traditional hashing approaches while only requiring a few milliseconds to run. In this paper we experiment on datasets of up to 100 million images, but in practice our system can scale to larger collections and can be used for other types of data that have a vector representation in a Euclidean space.

Yannis Kalantidis, Lyndon Kennedy, Huy Nguyen, Clayton Mellina, David A. Shamma

W24 – Computer Vision for Art Analysis

Frontmatter
The Art of Detection

The objective of this work is to recognize object categories in paintings, such as cars, cows and cathedrals. We achieve this by training classifiers from natural images of the objects. We make the following contributions: (i) we measure the extent of the domain shift problem for image-level classifiers trained on natural images vs paintings, for a variety of CNN architectures; (ii) we demonstrate that classification-by-detection (i.e. learning classifiers for regions rather than the entire image) recognizes (and locates) a wide range of small objects in paintings that are not picked up by image-level classifiers, and combining these two methods improves performance; and (iii) we develop a system that learns a region-level classifier on-the-fly for an object category of a user’s choosing, which is then applied to over 60 million object regions across 210,000 paintings to retrieve localised instances of that category.

Elliot J. Crowley, Andrew Zisserman
A Streamlined Photometric Stereo Framework for Cultural Heritage

In this paper, we propose a streamlined framework for robust 3D acquisition of cultural heritage using both photometric stereo and photogrammetric information. An uncalibrated photometric stereo setup is augmented by a synchronized secondary witness camera co-located with a point light source. By recovering the witness camera’s position for each exposure with photogrammetry techniques, we estimate the precise 3D location of the light source relative to the photometric stereo camera. We show a significant improvement in both light source position estimation and normal map recovery compared to previous uncalibrated photometric stereo techniques. In addition, with the proposed configuration we benefit from improved surface shape recovery by jointly incorporating corrected photometric stereo surface normals and a sparse 3D point cloud from photogrammetry.
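
Once the light geometry is known, normal recovery reduces to a per-pixel least-squares problem. The sketch below shows the classical calibrated Lambertian case with distant lights as a simplified stand-in (the paper instead estimates near point-light positions via the witness camera); all inputs here are synthetic.

```python
import numpy as np

def photometric_stereo(images, light_dirs):
    """Recover per-pixel surface normals under a Lambertian model.

    images:     (K, H, W) intensities under K known lights.
    light_dirs: (K, 3) unit light directions (distant-light assumption).
    Solves I = L @ (albedo * n) per pixel in the least-squares sense.
    """
    K, H, W = images.shape
    I = images.reshape(K, -1)                              # (K, H*W)
    G, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)     # (3, H*W)
    albedo = np.linalg.norm(G, axis=0)
    normals = G / (albedo + 1e-8)
    return normals.reshape(3, H, W), albedo.reshape(H, W)

# Toy example with 6 random lights and synthetic images.
rng = np.random.default_rng(4)
L = rng.normal(size=(6, 3)); L /= np.linalg.norm(L, axis=1, keepdims=True)
imgs = np.maximum(rng.normal(size=(6, 32, 32)), 0)
n, rho = photometric_stereo(imgs, L)
print("normal map shape:", n.shape, "albedo shape:", rho.shape)
```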

Chia-Kai Yeh, Nathan Matsuda, Xiang Huang, Fengqiang Li, Marc Walton, Oliver Cossairt
Visual Link Retrieval in a Database of Paintings

This paper examines how far state-of-the-art machine vision algorithms can be used to retrieve common visual patterns shared by series of paintings. The search for such visual patterns, which is central to art-history research, is challenging because of the diversity of similarity criteria that could relevantly demonstrate genealogical links. We design a methodology and a tool to efficiently annotate clusters of similar paintings and test various algorithms in a retrieval task. We show that a pre-trained convolutional neural network can perform better at this task than other machine vision methods aimed at photograph analysis. We also show that retrieval performance can be significantly improved by fine-tuning a network specifically for this task.

Benoit Seguin, Carlotta Striolo, Isabella diLenardo, Frederic Kaplan
Hot Tiles: A Heat Diffusion Based Descriptor for Automatic Tile Panel Assembly

We revisit the problem of forming a coherent image by assembling independent pieces, also known as the jigsaw puzzle. Specifically, we are interested in assembling tile panels, a relevant task for art historians, who currently face many disassembled panels. Existing jigsaw-solving algorithms rely strongly on texture alignment to decide locally whether two pieces belong together and build the complete jigsaw from these local decisions. However, pieces in tile panels are handmade and independently painted, with poorly aligned patterns, and in this scenario existing algorithms degrade severely. Here we introduce a new heat diffusion based affinity measure to mitigate the misalignment between two abutting pieces. We also introduce a global optimization approach to minimize the impact of wrong local decisions. We present experiments on Portuguese tile panels, where our affinity measure performs considerably better than the state of the art and we can assemble large parts of a panel.
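
A hedged toy illustration of a heat-diffusion-style edge affinity follows: both abutting edge strips are smoothed by a few explicit heat-equation steps before comparison, so small paint misalignments are penalised less than in a raw pixel match. The actual affinity measure in the paper is more involved; the 1-D strips and parameters here are assumptions.

```python
import numpy as np

def diffuse(strip, steps=20, dt=0.2):
    """Run a few explicit 1-D heat-equation steps along an edge strip."""
    u = strip.astype(float).copy()
    for _ in range(steps):
        lap = np.roll(u, 1) - 2 * u + np.roll(u, -1)
        u += dt * lap
    return u

def edge_affinity(right_edge_a, left_edge_b):
    """Affinity between two abutting edges after diffusion (higher is better)."""
    da, db = diffuse(right_edge_a), diffuse(left_edge_b)
    return -np.mean((da - db) ** 2)

rng = np.random.default_rng(5)
edge = rng.random(100)
shifted = np.roll(edge, 3) + 0.02 * rng.normal(size=100)  # misaligned copy
unrelated = rng.random(100)
print("matching pair :", edge_affinity(edge, shifted))
print("random pair   :", edge_affinity(edge, unrelated))
```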

Susana Brandão, Manuel Marques
Novel Methods for Analysis and Visualization of Saccade Trajectories

Visualization of eye-tracking data is mainly based on fixations. However, saccade trajectories and their characteristics may contain more information than fixation positions alone. Artists, for example, can influence the way our eyes traverse a picture by employing composition methods, and repetitive saccade trajectories and the sequence of eye movements seem to correlate with this composition. In this work, we propose two novel methods to visualize saccade patterns during static stimulus viewing. The first approach, the so-called saccade heatmap, utilizes a modified Gaussian density distribution to highlight frequent gaze paths. The second approach is based on clustering and assigns identical labels to similar saccades in order to filter for the most relevant gaze paths. We demonstrate and discuss the strengths and weaknesses of both approaches using examples of the free viewing of paintings and compare them to other state-of-the-art visualization techniques.
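
As a rough sketch of the first approach (not the authors' exact density model), the code below splats Gaussian density along sampled points of each saccade segment so that frequently traversed paths accumulate heat; the grid size, sigma and example saccades are made-up parameters.

```python
import numpy as np

def saccade_heatmap(saccades, shape=(200, 200), sigma=5.0, samples=50):
    """Accumulate Gaussian density along each saccade (start, end) segment."""
    H, W = shape
    yy, xx = np.mgrid[0:H, 0:W]
    heat = np.zeros(shape)
    for (x0, y0), (x1, y1) in saccades:
        for t in np.linspace(0.0, 1.0, samples):
            px, py = x0 + t * (x1 - x0), y0 + t * (y1 - y0)
            heat += np.exp(-((xx - px) ** 2 + (yy - py) ** 2) / (2 * sigma ** 2))
    return heat / heat.max()

# Two repeated diagonal saccades and one outlier.
saccades = [((20, 20), (180, 160)), ((22, 18), (178, 162)), ((10, 190), (60, 150))]
hm = saccade_heatmap(saccades)
print("heatmap peak at:", np.unravel_index(np.argmax(hm), hm.shape))
```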

Thomas Kübler, Wolfgang Fuhl, Raphael Rosenberg, Wolfgang Rosenstiel, Enkelejda Kasneci
Adversarial Training for Sketch Retrieval

Generative Adversarial Networks (GANs) are able to learn excellent representations for unlabelled data which can be applied to image generation and scene classification. These representations have not, however, yet been applied to retrieval. In this paper, we show that the representations learned by GANs can indeed be used for retrieval. We consider heritage documents that contain unlabelled Merchant Marks, sketch-like symbols that are similar to hieroglyphs. We introduce a novel GAN architecture with design features that make it suitable for sketch retrieval. The performance of this sketch-GAN is compared to a modified version of the original GAN architecture with respect to simple invariance properties. Experiments suggest that sketch-GANs learn representations that are suitable for retrieval and that also have increased stability to rotation, scale and translation compared to the standard GAN architecture.

Antonia Creswell, Anil Anthony Bharath
Convolutional Sketch Inversion

In this paper, we use deep neural networks for inverting face sketches to synthesize photorealistic face images. We first construct a semi-simulated dataset containing a very large number of computer-generated face sketches with different styles and corresponding face images by expanding existing unconstrained face data sets. We then train models achieving state-of-the-art results on both computer-generated sketches and hand-drawn sketches by leveraging recent advances in deep learning such as batch normalization, deep residual learning, perceptual losses and stochastic optimization in combination with our new dataset. We finally demonstrate potential applications of our models in fine arts and forensic arts. In contrast to existing patch-based approaches, our deep-neural-network-based approach can be used for synthesizing photorealistic face images by inverting face sketches in the wild.

Yağmur Güçlütürk, Umut Güçlü, Rob van Lier, Marcel A. J. van Gerven
Detecting People in Artwork with CNNs

CNNs have massively improved performance in object detection in photographs. However, research into object detection in artwork remains limited. We show state-of-the-art performance on a challenging dataset, People-Art, which contains people from photos, cartoons and 41 different artwork movements. We achieve this high performance by fine-tuning a CNN for this task, and in doing so demonstrate that CNNs trained on photos overfit to photos: only the first three or four layers transfer from photos to artwork. Although the CNN’s performance is the highest yet, it remains below 60 % AP, suggesting that further work is needed on the cross-depiction problem.

Nicholas Westlake, Hongping Cai, Peter Hall
Transferring Neural Representations for Low-Dimensional Indexing of Maya Hieroglyphic Art

We analyze the performance of deep neural architectures for extracting shape representations of binary images and for generating low-dimensional representations of them. In particular, we focus on indexing binary images exhibiting compounds of Maya hieroglyphic signs, referred to as glyph-blocks, which constitute a very challenging art dataset given their visual complexity and large stylistic variety. More precisely, we demonstrate empirically that intermediate outputs of convolutional neural networks can be used as representations for complex shapes, even when their parameters are trained on gray-scale images, and that these representations can be more robust than traditional handcrafted features. We also show that it is possible to compress such representations down to only three dimensions without harming much of their discriminative structure, such that effective visualizations of Maya hieroglyphs can be rendered for subsequent epigraphic analysis.
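
A minimal sketch of the compression step is shown below, assuming plain PCA as the dimensionality-reduction choice and random stand-in descriptors; the paper works with neural representations and may use a different reduction method.

```python
import numpy as np

def pca_compress(X, n_components=3):
    """Project row-vector descriptors X (N, D) onto their top principal axes."""
    Xc = X - X.mean(axis=0, keepdims=True)
    # SVD of the centred data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(6)
glyph_features = rng.normal(size=(500, 4096))   # stand-in for CNN descriptors
coords3d = pca_compress(glyph_features)
print("3-D coordinates for visualisation:", coords3d.shape)
```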

Edgar Roman-Rangel, Gulcan Can, Stephane Marchand-Maillet, Rui Hu, Carlos Pallán Gayol, Guido Krempel, Jakub Spotak, Jean-Marc Odobez, Daniel Gatica-Perez
Dynamic Narratives for Heritage Tour

We present a dynamic story generation approach for egocentric videos from heritage sites. Given a short video clip of a ‘heritage tour’, our method selects a series of short descriptions from a collection of pre-curated text and creates a larger narrative. Unlike previous work, these narratives are not merely monotonic, static outputs of simple retrieval; we propose a method to generate dynamic narratives of the tour on the fly. The selected series of text messages is optimised jointly over length, relevance, cohesion and information. This results in ‘tour guide’-like narratives that are seasoned and adapted to the participant’s choice of tour path. We use visual and GPS cues simultaneously for precise localization on the heritage site, which is conceptually formulated as a graph. The efficacy of the approach is demonstrated on a heritage site, Golconda Fort, situated in Hyderabad, India; we validate our approach on two hours of data collected over multiple runs across the site.
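
A hypothetical, much-simplified stand-in for the selection step is sketched below: snippets are chosen greedily under a length budget by a weighted relevance/novelty score. The paper optimises length, relevance, cohesion and information jointly; the scores, weights and greedy strategy here are illustrative assumptions only.

```python
import numpy as np

def select_snippets(relevance, novelty, lengths, budget, alpha=0.7):
    """Greedily pick snippets maximising alpha*relevance + (1-alpha)*novelty per word."""
    chosen, used = [], 0
    scores = alpha * np.asarray(relevance) + (1 - alpha) * np.asarray(novelty)
    order = np.argsort(-scores / np.asarray(lengths))   # best value per word first
    for i in order:
        if used + lengths[i] <= budget:
            chosen.append(int(i))
            used += lengths[i]
    return chosen, used

relevance = [0.9, 0.4, 0.8, 0.6]     # match to the current tour location
novelty   = [0.5, 0.9, 0.2, 0.7]     # penalise repeating earlier narration
lengths   = [40, 25, 60, 30]         # words per snippet
print(select_snippets(relevance, novelty, lengths, budget=80))
```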

Anurag Ghosh, Yash Patel, Mohak Sukhwani, C. V. Jawahar
Convolutional Neural Networks as a Computational Model for the Underlying Processes of Aesthetics Perception

Understanding the underlying processes of aesthetic perception is one of the ultimate goals in empirical aesthetics. While deep learning and convolutional neural networks (CNNs) have already arrived in the area of aesthetic rating of art and photographs, few attempts have been made to apply CNNs as the underlying model for aesthetic perception. The information processing architecture of CNNs shows a strong match with the visual processing pipeline in the human visual system. Thus, it seems reasonable to exploit such models to gain better insight into the universal processes that drive aesthetic perception. This work presents first results supporting this claim by analyzing already known common statistical properties of visual art, such as sparsity and self-similarity, with the help of CNNs. We report observed differences in the responses of individual layers between art and non-art images, both in forward and backward (simulation) processing, that might open new directions of research in empirical aesthetics.

Joachim Denzler, Erik Rodner, Marcel Simon
Pose and Pathosformel in Aby Warburg’s Bilderatlas

We look at Aby Warburg’s concept of Pathosformel, the repeatable formula for the expression of emotion, through the depiction of human pose in art. Using crowdsourcing, we annotate 2D human pose in one-third of the panels of Warburg’s atlas of art and perform exploratory data analysis. Concentrating only on the relative angles of limbs, we find meaningful clusters of related poses, explore their structure using a hierarchical model, and describe a novel method for visualising the salient characteristics of each cluster. We find characteristic pose-clusters which correspond to Pathosformeln and investigate their historical distribution; at the same time, we find that morphologically similar poses can represent wildly different emotions. We hypothesise that this ambiguity comes from the static nature of our encoding, and conclude with some remarks about static and dynamic representations of human pose in art.
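
To make the pose encoding concrete, here is a hedged sketch that computes relative angles between adjacent limb segments from 2-D keypoints and clusters them with a small k-means; the skeleton topology, number of clusters and random poses are illustrative assumptions, not the study's annotation scheme.

```python
import numpy as np

# Hypothetical skeleton: pairs of limb segments whose relative angle we encode.
LIMB_PAIRS = [((0, 1), (1, 2)),     # e.g. upper arm vs forearm (right)
              ((3, 4), (4, 5)),     # left arm
              ((6, 7), (7, 8)),     # right leg
              ((9, 10), (10, 11))]  # left leg

def relative_angles(keypoints):
    """Angles between adjacent limb segments: invariant to translation, scale and global rotation."""
    feats = []
    for (a, b), (c, d) in LIMB_PAIRS:
        v1, v2 = keypoints[b] - keypoints[a], keypoints[d] - keypoints[c]
        cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        feats.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.array(feats)

def kmeans(X, k=4, iters=50, seed=0):
    """Tiny k-means on row vectors X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

rng = np.random.default_rng(7)
poses = rng.random((300, 12, 2))                    # 300 stand-in 2-D pose annotations
features = np.stack([relative_angles(p) for p in poses])
labels, _ = kmeans(features, k=4)
print("pose-cluster sizes:", np.bincount(labels))
```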

Leonardo Impett, Sabine Süsstrunk
A New Database and Protocol for Image Reuse Detection

The use of visual elements of an existing image in the creation of new ones is a commonly observed phenomenon in digital artworks. This practice, referred to as image reuse, is not easy to detect even with the human eye, and less so using computational methods. In this paper, we study automatic image reuse detection in digital artworks as an image retrieval problem. First, we introduce a new digital art database (BODAIR) that consists of digital artworks that reuse stock images. Then, we evaluate a set of existing image descriptors for image reuse detection, providing a baseline for the detection of image reuse in digital artworks. Finally, we propose an image retrieval method tailored to reuse detection, combining saliency maps with the image descriptors.

Furkan Isikdogan, İlhan Adıyaman, Alkım Almila Akdağ Salah, Albert Ali Salah
Backmatter
Metadata
Title: Computer Vision – ECCV 2016 Workshops
Edited by: Gang Hua, Hervé Jégou
Copyright Year: 2016
Electronic ISBN: 978-3-319-46604-0
Print ISBN: 978-3-319-46603-3
DOI: https://doi.org/10.1007/978-3-319-46604-0