Computer Vision – ACCV 2018
14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part V
- 2019
- Book
- Edited by
- C.V. Jawahar
- Hongdong Li
- Greg Mori
- Prof. Konrad Schindler
- Book series
- Lecture Notes in Computer Science
- Publisher
- Springer International Publishing
About this book
The six-volume set LNCS 11361-11366 constitutes the proceedings of the 14th Asian Conference on Computer Vision, ACCV 2018, held in Perth, Australia, in December 2018. The total of 274 contributions was carefully reviewed and selected from 979 submissions during two rounds of reviewing and improvement. The papers focus on motion and tracking; segmentation and grouping; image-based modeling; deep learning; object recognition; object detection and categorization; vision and language; video analysis and event recognition; face and gesture analysis; statistical methods and learning; performance evaluation; medical image analysis; document analysis; optimization methods; RGBD and depth camera processing; robotic vision; and applications of computer vision.
Table of contents
Frontmatter
Oral Session O5: 3D Vision and Pose
Frontmatter
Creatures Great and SMAL: Recovering the Shape and Motion of Animals from Video
Benjamin Biggs, Thomas Roddick, Andrew Fitzgibbon, Roberto Cipolla
Abstract: We present a system to recover the 3D shape and motion of a wide variety of quadrupeds from video. The system comprises a machine learning front-end which predicts candidate 2D joint positions, a discrete optimization which finds kinematically plausible joint correspondences, and an energy minimization stage which fits a detailed 3D model to the image. In order to overcome the limited availability of motion capture training data from animals, and the difficulty of generating realistic synthetic training images, the system is designed to work on silhouette data. The joint candidate predictor is trained on synthetically generated silhouette images, and at test time, deep learning methods or standard video segmentation tools are used to extract silhouettes from real data. The system is tested on animal videos from several species, and shows accurate reconstructions of 3D shape and pose.
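A note on the energy-minimization stage: as a rough, hypothetical illustration (not the paper's actual energy), the Python sketch below fits pose parameters by minimizing the 2D reprojection error of posed model joints against detected joint candidates, plus a simple quadratic pose prior. Here project and model_joints are assumed stand-ins for a camera model and a posed animal shape model.

```python
import numpy as np
from scipy.optimize import minimize

def fitting_energy(theta, joints_2d, project, model_joints, w_prior=1e-2):
    """Squared pixel error between projected model joints and detected
    2D joint candidates, plus a quadratic prior on the pose parameters."""
    joints_3d = model_joints(theta)             # (J, 3) posed model joints
    residual = project(joints_3d) - joints_2d   # (J, 2) pixel-space errors
    return np.sum(residual ** 2) + w_prior * np.sum(theta ** 2)

# Hypothetical usage: refine an initial pose estimate theta0.
# theta_fit = minimize(fitting_energy, theta0,
#                      args=(joints_2d, project, model_joints)).x
```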
EdgeStereo: A Context Integrated Residual Pyramid Network for Stereo Matching
Xiao Song, Xu Zhao, Hanwen Hu, Liangji Fang
Abstract: Recent convolutional neural networks, especially end-to-end disparity estimation models, achieve remarkable performance on the stereo matching task. However, existing methods, even with complicated cascade structures, may fail in regions with no texture, at boundaries, and on tiny details. Focusing on these problems, we propose a multi-task network, EdgeStereo, that is composed of a backbone disparity network and an edge sub-network. Given a binocular image pair, our model enables end-to-end prediction of both the disparity map and the edge map. Basically, we design a context pyramid to encode multi-scale context information in the disparity branch, followed by a compact residual pyramid for cascaded refinement. To further preserve subtle details, our EdgeStereo model integrates edge cues by feature embedding and edge-aware smoothness loss regularization. Comparative results demonstrate that stereo matching and edge detection can help each other in the unified model. Furthermore, our method achieves state-of-the-art performance on both the KITTI Stereo and Scene Flow benchmarks, which proves the effectiveness of our design.
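Edge-aware smoothness regularization of the kind described above typically penalizes disparity gradients everywhere except across edges. A minimal PyTorch sketch, assuming an edge-probability map produced by the edge sub-network (the paper's exact weighting may differ):

```python
import torch

def edge_aware_smoothness(disp, edge, alpha=1.0):
    """Penalize disparity gradients, down-weighted where edges occur.

    disp: (B, 1, H, W) predicted disparity map
    edge: (B, 1, H, W) edge probability in [0, 1]
    """
    dx = torch.abs(disp[:, :, :, 1:] - disp[:, :, :, :-1])
    dy = torch.abs(disp[:, :, 1:, :] - disp[:, :, :-1, :])
    wx = torch.exp(-alpha * edge[:, :, :, 1:])  # low weight on edges
    wy = torch.exp(-alpha * edge[:, :, 1:, :])
    return (wx * dx).mean() + (wy * dy).mean()
```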
Rectification from Radially-Distorted Scales
James Pritts, Zuzana Kukelova, Viktor Larsson, Ondřej Chum
Abstract: This paper introduces the first minimal solvers that jointly estimate lens distortion and affine rectification from repetitions of rigidly-transformed coplanar local features. The proposed solvers incorporate lens distortion into the camera model and extend accurate rectification to wide-angle images that contain nearly any type of coplanar repeated content. We demonstrate a principled approach to generating stable minimal solvers by the Gröbner basis method, which is accomplished by sampling feasible monomial bases to maximize numerical stability. Synthetic and real-image experiments confirm that the solvers give accurate rectifications from noisy measurements if used in a RANSAC-based estimator. The proposed solvers demonstrate superior robustness to noise compared to the state of the art. The solvers work on scenes without straight lines and, in general, relax strong assumptions about scene content made by the state of the art. Accurate rectifications on imagery taken with narrow focal length to fisheye lenses demonstrate the wide applicability of the proposed method. The method is automatic, and the code is published at https://github.com/prittjam/repeats.
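A common way to fold radial lens distortion into the camera model in this line of work is the one-parameter division model; the sketch below shows that model as an illustrative assumption (the solvers' exact parameterization is given in the paper and code):

```python
import numpy as np

def undistort_division(x_d, lam):
    """One-parameter division model: map a distorted image point x_d
    (centered on the distortion center) to its undistorted position,
    x_u = x_d / (1 + lam * ||x_d||^2).
    """
    x_d = np.asarray(x_d, dtype=float)
    r2 = np.sum(x_d ** 2, axis=-1, keepdims=True)
    return x_d / (1.0 + lam * r2)
```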
Self-supervised Learning of Depth and Camera Motion from 360 Videos
Fu-En Wang, Hou-Ning Hu, Hsien-Tzu Cheng, Juan-Ting Lin, Shang-Ta Yang, Meng-Li Shih, Hung-Kuo Chu, Min Sun
Abstract: As 360° cameras become prevalent in many autonomous systems (e.g., self-driving cars and drones), efficient 360° perception becomes more and more important. We propose a novel self-supervised learning approach for predicting the omnidirectional depth and camera motion from a 360° video. In particular, starting from SfMLearner, which is designed for cameras with a normal field-of-view, we introduce three key features to process 360° images efficiently. Firstly, we convert each image from equirectangular projection to cubic projection in order to avoid image distortion; in each network layer, we use Cube Padding (CP), which pads intermediate features from adjacent faces, to avoid image boundaries. Secondly, we propose a novel “spherical” photometric consistency constraint on the whole viewing sphere; in this way, no pixel is projected outside the image boundary, as typically happens in images with a normal field-of-view. Finally, rather than naively estimating six independent camera motions (i.e., naively applying SfMLearner to each face of a cube), we propose a novel camera pose consistency loss to ensure that the estimated camera motions reach consensus. To train and evaluate our approach, we collect a new PanoSUNCG dataset containing a large number of 360° videos with ground-truth depth and camera motion. Our approach achieves state-of-the-art depth prediction and camera motion estimation on PanoSUNCG, with faster inference speed compared to equirectangular processing. On real-world indoor videos, our approach also achieves qualitatively reasonable depth prediction using a model pre-trained on PanoSUNCG.
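The camera pose consistency loss can be illustrated with a simple penalty that pulls the six per-face motion estimates toward their mean. This is a hypothetical sketch, not the paper's exact formulation, and it assumes the per-face motions have already been expressed in a common reference frame:

```python
import torch

def pose_consistency_loss(poses):
    """Encourage the six per-face camera motion estimates to agree.

    poses: (B, 6, 6) tensor: six cube faces, each with a 6-DoF motion
    (3 rotation + 3 translation parameters), already rotated into a
    shared reference frame.
    """
    mean_pose = poses.mean(dim=1, keepdim=True)  # (B, 1, 6) consensus
    return torch.mean((poses - mean_pose) ** 2)
```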
Domain Transfer for 3D Pose Estimation from Color Images Without Manual Annotations
Mahdi Rad, Markus Oberweger, Vincent Lepetit
Abstract: We introduce a novel learning method for 3D pose estimation from color images. While acquiring annotations for color images is a difficult task, our approach circumvents this problem by learning a mapping from paired color and depth images captured with an RGB-D camera. We jointly learn the pose from synthetic depth images, which are easy to generate, and learn to align these synthetic depth images with the real depth images. We show our approach on the tasks of 3D hand pose estimation and 3D object pose estimation, both from color images only. Our method achieves performance comparable to state-of-the-art methods on popular benchmark datasets, without requiring any annotations for the color images.
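One simple way to realize such an alignment, shown here as a hedged sketch rather than the paper's exact formulation, is to train a small mapping network so that features of real depth images land close to the features of their paired synthetic renderings:

```python
import torch

def feature_alignment_loss(f_real, f_synth, mapper):
    """Align real depth-image features with their synthetic counterparts
    so a pose network trained on synthetic depth transfers to real data.

    f_real, f_synth: (B, D) embeddings of paired real/synthetic images
    mapper: a small network mapping real features into synthetic space
    """
    return torch.mean((mapper(f_real) - f_synth) ** 2)
```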
Partially Occluded Hands: A Challenging New Dataset for Single-Image Hand Pose Estimation
Battushig Myanganbayar, Cristina Mata, Gil Dekel, Boris Katz, Guy Ben-Yosef, Andrei Barbu
Abstract: Recognizing the pose of hands matters most when hands are interacting with other objects. To understand how well both machines and humans perform on single-image 2D hand-pose reconstruction from RGB images, we collected a challenging dataset of hands interacting with 148 objects. We used a novel methodology that provides the same hand in the same pose both with the object present and occluding the hand, and without the object occluding the hand. Additionally, we collected a wide range of grasps for each object, designing the data collection methodology to ensure this diversity. Using this dataset, we measured the performance of two state-of-the-art hand-pose recognition methods, showing that both are extremely brittle when faced with even light occlusion from an object. This is not evident in previous datasets, because they often avoid hand-object occlusions and because they are collected from videos where hands are often between objects and mostly unoccluded. We annotated a subset of the dataset and used it to show that humans are robust with respect to occlusion, and also to characterize human hand perception, the space of grasps that seem to be considered, and the accuracy of reconstructing occluded portions of hands. We expect that such data will be of interest both to the vision community, for developing more robust hand-pose algorithms, and to the robotic grasp planning community, for learning such grasps. The dataset is available at occludedhands.com.
Cross Pixel Optical-Flow Similarity for Self-supervised Learning
Aravindh Mahendran, James Thewlis, Andrea Vedaldi
Abstract: We propose a novel method for learning convolutional neural image representations without manual supervision. We use motion cues, in the form of optical flow, to supervise representations of static images. The obvious approach of training a network to predict flow from a single image can be needlessly difficult due to intrinsic ambiguities in this prediction task. We instead propose a much simpler learning goal: embed pixels such that the similarity between their embeddings matches that between their optical-flow vectors. At test time, the learned deep network can be used without access to video or flow information and transferred to tasks such as image classification, detection, and segmentation. Our method, which significantly simplifies previous attempts at using motion for self-supervision, achieves state-of-the-art results in self-supervision using motion cues, and is overall state of the art in self-supervised pre-training for semantic image segmentation, as demonstrated on standard benchmarks.
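The learning goal can be made concrete as follows (a minimal sketch; the paper's exact similarity kernel and normalization may differ): for a set of sampled pixels, form a target distribution over pixel pairs from optical-flow similarity and train the embeddings to reproduce it.

```python
import torch
import torch.nn.functional as F

def cross_pixel_similarity_loss(emb, flow, temp=0.1):
    """Match pairwise embedding similarities to optical-flow similarities.

    emb:  (N, D) embeddings of N sampled pixels
    flow: (N, 2) optical-flow vectors at the same pixels
    """
    emb = F.normalize(emb, dim=1)
    sim_emb = emb @ emb.t() / temp               # embedding similarities
    sim_flow = -torch.cdist(flow, flow) / temp   # closer flow -> more similar
    p_flow = F.softmax(sim_flow, dim=1)          # target distributions
    log_p_emb = F.log_softmax(sim_emb, dim=1)
    return F.kl_div(log_p_emb, p_flow, reduction="batchmean")
```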
- Title
- Computer Vision – ACCV 2018
- Edited by
C.V. Jawahar
Hongdong Li
Greg Mori
Prof. Konrad Schindler
- Copyright Year
- 2019
- Electronic ISBN
- 978-3-030-20873-8
- Print ISBN
- 978-3-030-20872-1
- DOI
- https://doi.org/10.1007/978-3-030-20873-8