
Open Access 01-12-2018 | Express Paper

Structure from motion using dense CNN features with keypoint relocalization

Authors: Aji Resindra Widya, Akihiko Torii, Masatoshi Okutomi

Published in: IPSJ Transactions on Computer Vision and Applications | Issue 1/2018


Abstract

Structure from motion (SfM) using imagery that involves extreme appearance changes remains a challenging task due to the loss of feature repeatability. Using feature correspondences obtained by matching densely extracted convolutional neural network (CNN) features significantly improves the SfM reconstruction capability. However, the reconstruction accuracy is limited by the spatial resolution of the extracted CNN features, which does not reach pixel-level accuracy in the existing approach. Providing dense feature matches with precise keypoint positions is not trivial because of the memory and computational burden of dense features. To achieve accurate SfM reconstruction with highly repeatable dense features, we propose an SfM pipeline that uses dense CNN features with keypoint relocalization, which efficiently and accurately provides pixel-level feature correspondences. We then demonstrate on the Aachen Day-Night dataset that the proposed SfM using dense CNN features with keypoint relocalization outperforms a state-of-the-art SfM (COLMAP using RootSIFT) by a large margin.

1 Introduction

Structure from motion (SfM) has become a practical tool for 3D reconstruction using only images, thanks to off-the-shelf software [1–3] and open-source libraries [4–10]. These tools provide impressive 3D models, especially when targets are captured from many viewpoints with large overlaps. The state-of-the-art SfM pipelines, in general, start with extracting local features [11–17] and matching them across images, followed by pose estimation, triangulation, and bundle adjustment [18–20]. The performance of local features and their matching is therefore crucial for 3D reconstruction by SfM.
In the last decade, the performance of local features, namely SIFT [11] and its variants [16, 21–24], has been validated on 3D reconstruction as well as many other tasks [25–27]. Local features give promising matches for well-textured surfaces and objects but drop significantly in performance when matching weakly textured objects [28], repeated patterns [29], extreme viewpoint changes [21, 30, 31], and illumination changes [32, 33] because of degraded repeatability of feature point (keypoint) extraction [21, 31]. This problem can be mitigated by using features densely detected on a regular grid [34, 35], but their merit has only been demonstrated in image retrieval [32, 36] or image classification tasks [26, 34] that use the features for global image representation and do not require the one-to-one feature correspondences needed in SfM.
Only recently has SfM with densely detected features been presented in [37]. DenseSfM [37] uses convolutional neural network (CNN) features as densely detected features, i.e., it extracts convolutional layers of a deep neural network [38] and converts them into feature descriptors of keypoints on a grid pattern (Section 3.1). As the main focus of [37] is camera localization, neither the SfM architecture, including dense CNN feature description and matching, nor its 3D reconstruction performance is studied in detail.

1.1 Contribution

In this work, we first review the details of the SfM pipeline with dense CNN feature extraction and matching. We then propose a keypoint relocalization that uses the structure of the convolutional layers (Section 3.2) to overcome keypoint inaccuracy on the grid resolution and the computational burden of dense feature matching. Finally, the performance of SfM with dense CNN features using the proposed keypoint relocalization is evaluated on the Aachen Day-Night [37] dataset and additionally on the Strecha [39] dataset.

2 Related work

2.1 SfM and VisualSLAM

The state-of-the-art SfM is divided into a few mainstream pipelines: incremental (or sequential) [4, 6, 40], global [8, 9, 41], and hybrid [10, 42].
VisualSLAM approaches, namely LSD-SLAM [43] and DTAM [44], repeat camera pose estimation based on selected keyframes and (semi-)dense reconstruction using pixel-level correspondences in real time. These methods are designed to work with video streams, i.e., short-baseline camera motion, but not with general wide-baseline camera motion.
Recently, Sattler et al. [37] introduced the CNN-based DenseSfM that adopts densely detected and described features. However, their SfM uses the fixed poses and intrinsic parameters of reference images when evaluating query image localization, and they do not address the keypoint inaccuracy of CNN features. It therefore remains an open challenge.

2.2 Feature points

The de facto standard local feature, SIFT [11], is capable of matching images under viewpoint and illumination changes thanks to scale- and rotation-invariant keypoint patches described by histograms of oriented gradients. ASIFT [21] and its variants [30, 31] explicitly generate synthesized views in order to improve the repeatability of keypoint detection and description under extreme viewpoint changes.
An alternative approach to improving feature matching between images across extreme appearance changes is to use densely sampled features. Densely detected features are often used in multi-view stereo [45] with DAISY [46], or in image retrieval and classification [35, 47] with Dense SIFT [34]. However, dense features have received little attention for one-to-one feature correspondence search under unknown camera poses due to the loss of scale and rotation invariance, inaccurate keypoint localization, and computational burden.

2.3 CNN features

Fischer et al. [48] reported that, given feature positions, descriptors extracted from CNN layers have better matchability than SIFT [11]. More recently, Schonberger et al. [49] showed that CNN-based learned local features such as LIFT [17], Deep-Desc [50], and ConvOpt [51] have higher recall than SIFT [11] but still cannot outperform SIFT variants, e.g., DSP-SIFT [16] and SIFT-PCA [52].
These studies motivate us to adopt a CNN architecture for extracting features from images and matching them for SfM, as it efficiently outputs multi-resolution features and has the potential to be improved by better training or architectures.

3 The pipeline: SfM using dense CNN features with keypoint relocalization

Our SfM using densely detected features mimics the state-of-the-art incremental SfM pipeline, which consists of feature extraction (Section 3.1), feature matching (Sections 3.2 to 3.4), and incremental reconstruction (Section 3.5). Figure 1 overviews the pipeline. In this section, we describe each component while stating the differences from sparse keypoint-based approaches.

3.1 Dense feature extraction

Our method first densely extracts feature descriptors and their locations from the input image. In the same spirit as [53, 54], we feed images into a modern CNN architecture [38, 55, 56] and use the convolutional layers, i.e., with the fully connected and softmax layers cropped out, as descriptors of keypoints densely detected on a regular grid. In the following, we choose VGG-16 [38] as the base network architecture and tailor the description to it, but it can be replaced with other networks with marginal modification.
As illustrated in Fig. 2, VGG-16 [38] is composed of five max-pooling layers and 16 weight layers. We extract the max-pooling layers as dense features. As can be seen in Fig. 2, even the conv1 max-pooling layer does not have the same resolution as the input image. We therefore also extract conv1_2, one layer before the conv1 max-pooling layer, which has pixel-level accuracy.
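As a concrete illustration, the sketch below collects dense, grid-aligned features from VGG-16 in the manner described above. It is a minimal example that uses torchvision's VGG-16 as a stand-in for the MatConvNet model used in the paper (Section 4); the function name, the use of PyTorch, and storing the conv1_2 activation before its ReLU are our own assumptions, not the authors' implementation.

```python
# Minimal sketch: collect VGG-16 activations after conv1_2 and after each
# max-pooling layer as dense descriptors on a regular grid.
import torch
import torch.nn as nn
import torchvision.models as models

def extract_dense_features(image):
    """image: (1, 3, H, W) float tensor, already resized and normalized."""
    vgg = models.vgg16(pretrained=True).features.eval()
    feats = {}
    x = image
    n_conv, n_pool = 0, 0
    with torch.no_grad():
        for layer in vgg:
            x = layer(x)
            if isinstance(layer, nn.Conv2d):
                n_conv += 1
                if n_conv == 2:                       # conv1_2: same resolution as the input
                    feats["conv1_2"] = x.clone()
            elif isinstance(layer, nn.MaxPool2d):
                n_pool += 1
                feats["pool%d" % n_pool] = x.clone()  # conv1..conv5 max-pooling outputs
    return feats
```

Each cell of a returned feature map is treated as a keypoint on a regular grid; its image coordinates follow from the cumulative stride of the corresponding layer.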

3.2 Tentative matching

Given the multi-level feature point locations and descriptors, tentative matching uses an upper max-pooling layer (lower spatial resolution) to establish initial correspondences. This is motivated by the fact that the upper max-pooling layer has a larger receptive field and encodes more semantic information [48, 57, 58], which potentially gives high matchability across appearance changes. The lower spatial resolution is also advantageous for computational efficiency.
For a pair of images, CNN descriptors are tentatively matched by searching their nearest neighbors (L2 distances) and refined by keeping only mutually nearest neighbors. Note that the standard ratio test [11] removes too many feature matches because neighboring features on a regularly sampled grid tend to be similar to each other.
We perform feature descriptor matching for all pairs of images, or for images shortlisted by image retrieval, e.g., NetVLAD [53].
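A minimal sketch of the mutual nearest-neighbor test is given below, assuming NumPy arrays of descriptors; the paper's implementation relies on the Yael library [61], so the function is only illustrative.

```python
# Minimal sketch: mutual nearest-neighbor matching of dense CNN descriptors
# from a coarse layer (e.g., the conv5 max-pooling output) of two images.
import numpy as np

def mutual_nn_matches(desc1, desc2):
    """desc1: (N1, D), desc2: (N2, D) descriptor arrays; returns (M, 2) index pairs."""
    # pairwise squared L2 distances
    d2 = (np.sum(desc1 ** 2, axis=1)[:, None]
          + np.sum(desc2 ** 2, axis=1)[None, :]
          - 2.0 * desc1 @ desc2.T)
    nn12 = np.argmin(d2, axis=1)        # best match in image 2 for each feature in image 1
    nn21 = np.argmin(d2, axis=0)        # best match in image 1 for each feature in image 2
    idx1 = np.arange(desc1.shape[0])
    keep = nn21[nn12[idx1]] == idx1     # keep only mutually nearest pairs (no ratio test)
    return np.stack([idx1[keep], nn12[idx1[keep]]], axis=1)
```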

3.3 Keypoint relocalization

The tentative matching using the upper max-pooling layers, e.g., conv5, generates distinctive correspondences, but the accuracy of the keypoint positions is limited by their spatial resolution. This inaccuracy can be mitigated by coarse-to-fine matching from the matched max-pooling layer up to the conv1_2 layer, utilizing the intermediate max-pooling layers in between. For example, the matched keypoints found on the conv3 layer are transferred to the conv2 layer (higher spatial resolution), and new correspondences are searched only in the area constrained by the transferred keypoints. This can be repeated until reaching the conv1_2 layer. However, this naive coarse-to-fine matching generates too many keypoints, which may cause problems in computation and memory usage in the incremental SfM step, especially in bundle adjustment.
To generate dense feature matches with pixel-level accuracy while preserving their quantity, we propose the following keypoint relocalization. For each feature point at the current layer, we retrieve the descriptors on the lower layer (higher spatial resolution) in the corresponding K×K pixels (see footnote 1). The feature point is relocalized at the pixel position that has the largest descriptor norm (L2 norm) in the K×K pixels. This relocalization is repeated until it reaches the conv1_2 layer, which has the same resolution as the input image (see also Fig. 3).
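As an illustration of a single relocalization step, the sketch below moves one keypoint from a coarse grid to the next finer layer. The window size K=2 follows footnote 1; the stride of 2 between adjacent layers and the function interface are our assumptions rather than the authors' code.

```python
# Minimal sketch: relocalize one keypoint to the next finer layer by picking
# the position with the largest descriptor L2 norm inside the K x K window.
import numpy as np

def relocalize(y, x, finer_feat, stride=2, K=2):
    """(y, x): keypoint on the coarse grid; finer_feat: (C, H, W) finer-layer activations."""
    C, H, W = finer_feat.shape
    y0, x0 = y * stride, x * stride                        # top-left of the corresponding window
    window = finer_feat[:, y0:min(y0 + K, H), x0:min(x0 + K, W)]
    norms = np.linalg.norm(window.reshape(C, -1), axis=0)  # L2 norm of each descriptor
    dy, dx = np.unravel_index(np.argmax(norms), window.shape[1:])
    return y0 + dy, x0 + dx                                # keypoint on the finer grid

# Repeating this from the matched layer (e.g., conv4) through conv3, conv2,
# and finally conv1_2 yields keypoints at input-image resolution.
```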

3.4 Feature verification using RANSAC with multiple homographies

Using all the relocalized feature points, we next remove outliers from the set of tentative matches by Homography-RANSAC. We use vanilla RANSAC rather than the state-of-the-art spatial verification [59], taking into account the spatial density of the feature correspondences. To detect inlier matches lying on several planes, Homography-RANSAC is repeated while excluding the inlier matches of the best hypothesis. The RANSAC inlier/outlier threshold is set loosely to allow features off the planes.
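The sketch below illustrates the repeated Homography-RANSAC using OpenCV's findHomography; the 10-pixel inlier threshold and the limit of five homographies follow Section 4, whereas the minimum inlier count to accept a plane is an assumed parameter of this sketch.

```python
# Minimal sketch: verify tentative matches with repeated Homography-RANSAC,
# removing the inliers of each estimated homography before the next round.
import numpy as np
import cv2

def multi_homography_inliers(pts1, pts2, max_planes=5, thresh=10.0, min_inliers=15):
    """pts1, pts2: (N, 2) relocalized keypoint coordinates of tentative matches."""
    pts1 = pts1.astype(np.float32)
    pts2 = pts2.astype(np.float32)
    remaining = np.arange(len(pts1))
    inliers = []
    for _ in range(max_planes):
        if len(remaining) < 4:                       # a homography needs at least 4 matches
            break
        H, mask = cv2.findHomography(pts1[remaining], pts2[remaining],
                                     cv2.RANSAC, ransacReprojThreshold=thresh)
        if H is None or mask.sum() < min_inliers:
            break
        mask = mask.ravel().astype(bool)
        inliers.append(remaining[mask])              # inliers on this plane
        remaining = remaining[~mask]                 # exclude them for the next round
    return np.concatenate(inliers) if inliers else np.array([], dtype=int)
```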

3.5 3D reconstruction

Having all the relocalized keypoints filtered by RANSAC, we can export them to any available pipeline that performs pose estimation, point triangulation, and bundle adjustment.
Dense matching may produce many confusing feature matches on scenes with many repetitive structures, e.g., windows, doors, and pillars. In such cases, we keep only the N best matching image pairs for each image in the dataset, ranked by the number of inlier matches from the multiple Homography-RANSAC.

4 Experiments

We implement feature detection, description, and matching (Sections 3.1 to 3.4) in MATLAB with third-party libraries (MatConvNet [60] and the Yael library [61]). Dense CNN features are extracted using the VGG-16 network [38]. Using the conv4 and conv3 max-pooling layers, feature matches are computed by the coarse-to-fine matching followed by multiple Homography-RANSAC, which finds at most five homographies supported by an inlier threshold of 10 pixels. For scenes with many repetitive structures, the best N pairs of every image according to multiple Homography-RANSAC are imported to COLMAP [6] with the fixed intrinsic parameter option; otherwise, we use all image pairs.
In preliminary experiments, we tested other layers having the same spatial resolution, e.g., the conv4_3 and conv3_3 layers, in the coarse-to-fine matching, but observed no improvement in 3D reconstruction. As a max-pooling layer has half the depth dimension of the other layers at the same spatial resolution, we chose the max-pooling layers as the dense features for efficiency.
In the following, we evaluate the reconstruction performance on the Aachen Day-Night [37] and Strecha [39] datasets. We compare our SfM using dense CNN features with keypoint relocalization to the baseline COLMAP with DoG+RootSIFT features [6]. In addition, we also compare our SfM to SfM using dense CNN features without keypoint relocalization [37]. All experiments are run on a computer equipped with a 3.20-GHz Intel Core i7-6900K CPU with 16 threads and a 12-GB GeForce GTX 1080Ti.

4.1 Results on Aachen Day-Night dataset

The Aachen Day-Night dataset [37] is aimed at evaluating SfM and visual localization under large illumination changes such as day and night. It includes 98 subsets of images. Each subset consists of 20 day-time images and one night-time image, together with their reference camera poses and 3D points (see footnote 2).
For each subset, we run SfM and evaluate the estimated camera pose of the night image as follows. First, the reconstructed SfM model is registered to the reference camera poses by a similarity transform obtained from the camera positions of the day-time images. We then evaluate the estimated camera pose of the night image by measuring the positional error (L2 distance) and the angular error \(\arccos\left(\frac{\operatorname{trace}(\boldsymbol{R}_{\mathrm{ref}}\boldsymbol{R}_{\mathrm{night}}^{T})-1}{2}\right)\).
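For reference, a minimal sketch of these two error measures is given below, assuming the registered night-camera center and rotation are available as NumPy arrays (variable names are ours):

```python
# Minimal sketch: positional and angular error of an estimated night-image
# pose against the reference pose, following the error measures in the text.
import numpy as np

def pose_errors(C_ref, C_night, R_ref, R_night):
    """C_*: (3,) camera centers; R_*: (3, 3) rotation matrices."""
    pos_err = np.linalg.norm(C_ref - C_night)                       # L2 distance
    cos_angle = (np.trace(R_ref @ R_night.T) - 1.0) / 2.0
    ang_err = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))  # angular error in degrees
    return pos_err, ang_err
```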
Table 1 shows the number of reconstructed cameras. The proposed SfM with keypoint relocalization (conv1_2) reconstructs 96 night images, twice as many as the baseline COLMAP with DoG+RootSIFT [6]. This result validates the benefit of densely detected features, which can provide correspondences across large illumination changes because their loss in keypoint detection repeatability is smaller than that of a standard DoG detector. On the other hand, both the sparse and dense feature methods work well for reconstructing day images. The difference between with and without keypoint relocalization can be seen more clearly in the next evaluation.
Table 1
Number of cameras reconstructed on the Aachen dataset

         DoG+RootSIFT [6]   DenseCNN w/o reloc   DenseCNN w/ reloc (ours)
Night    48                 95                   96
Day      1910               1924                 1944

The proposed method reconstructs the most cameras for both day and night images.
Figure 4 shows the percentage of night images reconstructed (y-axis) within given positional and angular error thresholds (x-axis). Similarly, Table 2 shows the reconstruction percentages of night images for varying distance error thresholds with the angular error threshold fixed at 10°. As can be seen from both evaluations, the proposed SfM using dense CNN features with keypoint relocalization outperforms the baseline DoG+RootSIFT [6] by a large margin. The improvement from the proposed keypoint relocalization is significant when the evaluation accounts for pose accuracy. Notice that the SfM using dense CNN features without keypoint relocalization [37] performs worse than the baseline DoG+RootSIFT [6] at small thresholds, e.g., below 3.5 m positional and 2° angular error. This indicates that the proposed keypoint relocalization produces features at more stable and accurate positions and provides better inlier matches for COLMAP reconstruction, which results in higher-quality 3D reconstruction.
Table 2
Evaluation of reconstructed camera poses (both position and orientation)

          DoG+RootSIFT [6]   DenseCNN w/o reloc   DenseCNN w/ reloc (ours)
0.5 m     15.31              5.10                 18.37
1.0 m     25.61              14.29                33.67
5.0 m     36.73              45.92                69.39
10.0 m    35.71              61.22                81.63
20.0 m    39.80              69.39                82.65

The numbers show the percentage of reconstructed night images within the given positional error thresholds and an angular error fixed at 10°.
Figure 5 shows a qualitative comparison between our method and the baseline DoG+RootSIFT [6].

4.2 Results on Strecha dataset

We additionally evaluate our SfM using dense CNN features with the proposed keypoint relocalization on all six subsets of the Strecha dataset [39], a standard benchmark for SfM and MVS. Positional and angular errors between the reconstructed cameras and the ground-truth poses are evaluated. In our SfM, we take only the feature matches from the best N=5 image pairs for each image to suppress artifacts from confusing image pairs.
The mean positional and angular errors of our SfM are 0.59 m and 2.27°. Although these errors are worse than those of the state-of-the-art COLMAP with DoG+RootSIFT [6], which are 0.17 m and 0.90°, this quantitative evaluation demonstrates that our SfM does not overfit to specific challenging tasks but works reasonably well in standard (easy) situations.

5 Conclusion

We presented a new SfM using dense features extracted from a CNN, with a proposed keypoint relocalization that improves the accuracy of feature positions sampled on a regular grid. The advantage of our SfM has been demonstrated on the Aachen Day-Night dataset, which includes images with large illumination changes. The result on the Strecha dataset also shows that our SfM works on standard datasets and does not overfit to a particular task, although it is less accurate than the state-of-the-art SfM with local features. We hope the proposed SfM becomes a milestone for 3D reconstruction, particularly in challenging situations.

Acknowledgements

This work was partly supported by JSPS KAKENHI Grant Numbers 17H00744, 15H05313, and 16KK0002, and by the Indonesia Endowment Fund for Education.

Availability of data and materials

The code will be made publicly available upon acceptance.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Footnotes
1. We use K=2 throughout the experiments.
2. Although the poses are carefully obtained with manual verification, they are called "reference poses" rather than ground truth.
Literature
4. Fuhrmann S, Langguth F, Goesele M (2014) MVE – A multi-view reconstruction environment. In: GCH, 11–18. Eurographics Association, Aire-la-Ville.
5. Sweeney C, Hollerer T, Turk M (2015) Theia: A fast and scalable structure-from-motion library. In: Proc. ACMM, 693–696. ACM, New York.
6. Schonberger JL, Frahm JM (2016) Structure-from-motion revisited. In: Proc. CVPR, 4104–4113. IEEE.
7. Schönberger JL, Zheng E, Frahm JM, Pollefeys M (2016) Pixelwise view selection for unstructured multi-view stereo. In: Proc. ECCV, 501–518. Springer, Cham.
8. Wilson K, Snavely N (2014) Robust global translations with 1DSfM. In: Proc. ECCV, 61–75. Springer, Cham.
9. Moulon P, Monasse P, Perrot R, Marlet R (2016) OpenMVG: Open multiple view geometry. In: International Workshop on Reproducible Research in Pattern Recognition, 60–74. Springer, Cham.
10. Cui H, Gao X, Shen S, Hu Z (2017) HSfM: Hybrid structure-from-motion. In: Proc. CVPR, 2393–2402. IEEE.
11. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. IJCV 60(2):91–110.
12. Mikolajczyk K, Schmid C (2004) Scale & affine invariant interest point detectors. IJCV 60(1):63–86.
13. Kadir T, Zisserman A, Brady M (2004) An affine invariant salient region detector. In: Proc. ECCV, 228–241. Springer, Cham.
14. Tuytelaars T, Van Gool L (2004) Matching widely separated views based on affine invariant regions. IJCV 59(1):61–85.
15. Arandjelović R, Zisserman A (2012) Three things everyone should know to improve object retrieval. In: Proc. CVPR, 2911–2918. IEEE, Providence.
16. Dong J, Soatto S (2015) Domain-size pooling in local descriptors: DSP-SIFT. In: Proc. CVPR, 5097–5106. IEEE, Boston.
17. Yi KM, Trulls E, Lepetit V, Fua P (2016) LIFT: Learned invariant feature transform. In: Proc. ECCV, 467–483. Springer, Cham.
18. Snavely N, Seitz SM, Szeliski R (2008) Modeling the world from internet photo collections. IJCV 80(2):189–210.
19. Agarwal S, Furukawa Y, Snavely N, Curless B, Seitz SM, Szeliski R (2010) Reconstructing Rome. Computer 43(6):40–47.
20. Agarwal S, Furukawa Y, Snavely N, Simon I, Curless B, Seitz SM, Szeliski R (2011) Building Rome in a day. Commun ACM 54(10):105–112.
21. Morel JM, Yu G (2009) ASIFT: A new framework for fully affine invariant image comparison. SIAM J Imaging Sci 2(2):438–469.
22. Ke Y, Sukthankar R (2004) PCA-SIFT: A more distinctive representation for local image descriptors. In: Proc. CVPR, vol 2. IEEE, Washington.
23. Abdel-Hakim AE, Farag AA (2006) CSIFT: A SIFT descriptor with color invariant characteristics. In: Proc. CVPR, vol 2, 1978–1983. IEEE, New York.
24. Bay H, Tuytelaars T, Van Gool L (2006) SURF: Speeded up robust features. In: Proc. ECCV, 404–417. Springer, Cham.
25. Csurka G, Dance C, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Proc. ECCV, vol 1, 1–2. Springer, Cham.
26. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Proc. CVPR, vol 2, 2169–2178. IEEE, New York.
27. Chong W, Blei D, Li FF (2009) Simultaneous image classification and annotation. In: Proc. CVPR, 1903–1910. IEEE, Miami.
28. Hinterstoisser S, Cagniart C, Ilic S, Sturm P, Navab N, Fua P, Lepetit V (2012) Gradient response maps for real-time detection of textureless objects. IEEE PAMI 34(5):876–888.
29. Torii A, Sivic J, Pajdla T, Okutomi M (2013) Visual place recognition with repetitive structures. In: Proc. CVPR, 883–890. IEEE, Portland.
30. Mishkin D, Matas J, Perdoch M (2015) MODS: Fast and robust method for two-view matching. CVIU 141:81–93.
31. Taira H, Torii A, Okutomi M (2016) Robust feature matching by learning descriptor covariance with viewpoint synthesis. In: Proc. ICPR, 1953–1958. IEEE, Cancun.
32. Torii A, Arandjelović R, Sivic J, Okutomi M, Pajdla T (2015) 24/7 place recognition by view synthesis. In: Proc. CVPR, 1808–1817. IEEE, Boston.
33. Radenovic F, Schonberger JL, Ji D, Frahm JM, Chum O, Matas J (2016) From dusk till dawn: Modeling in the dark. In: Proc. CVPR, 5488–5496. IEEE, Las Vegas.
34. Bosch A, Zisserman A, Munoz X (2007) Image classification using random forests and ferns. In: Proc. ICCV, 1–8. IEEE, Rio de Janeiro.
35. Liu C, Yuen J, Torralba A (2016) SIFT flow: Dense correspondence across scenes and its applications. In: Dense Image Correspondences for Computer Vision, 15–49. Springer, Cham.
36. Zhao WL, Jégou H, Gravier G (2013) Oriented pooling for dense and non-dense rotation-invariant features. In: Proc. BMVC. BMVA.
37. Sattler T, Maddern W, Toft C, Torii A, Hammarstrand L, Stenborg E, Safari D, Sivic J, Pajdla T, Pollefeys M, Kahl F, Okutomi M (2017) Benchmarking 6DOF outdoor visual localization in changing conditions. arXiv preprint arXiv:1707.09092.
38. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
39. Strecha C, Von Hansen W, Van Gool L, Fua P, Thoennessen U (2008) On benchmarking camera calibration and multi-view stereo for high resolution imagery. In: Proc. CVPR, 1–8. IEEE, Anchorage.
40. Wu C (2013) Towards linear-time incremental structure from motion. In: Proc. 3DV, 127–134. IEEE, Seattle.
41. Cui Z, Tan P (2015) Global structure-from-motion by similarity averaging. In: Proc. ICCV, 864–872. IEEE, Santiago.
42. Magerand L, Del Bue A (2017) Practical projective structure from motion (P2SfM). In: Proc. ICCV, 39–47. IEEE, Venice.
43. Engel J, Schöps T, Cremers D (2014) LSD-SLAM: Large-scale direct monocular SLAM. In: Proc. ECCV, 834–849. Springer, Cham.
44. Newcombe RA, Lovegrove SJ, Davison AJ (2011) DTAM: Dense tracking and mapping in real-time. In: Proc. ICCV, 2320–2327. IEEE, Barcelona.
45. Furukawa Y, Hernández C, et al. (2015) Multi-view stereo: A tutorial. Found Trends Comput Graph Vis 9(1-2):1–148.
46. Tola E, Lepetit V, Fua P (2010) DAISY: An efficient dense descriptor applied to wide-baseline stereo. IEEE PAMI 32(5):815–830.
47. Tuytelaars T (2010) Dense interest points. In: Proc. CVPR, 2281–2288. IEEE, San Francisco.
48. Fischer P, Dosovitskiy A, Brox T (2014) Descriptor matching with convolutional neural networks: A comparison to SIFT. arXiv preprint arXiv:1405.5769.
49. Schonberger JL, Hardmeier H, Sattler T, Pollefeys M (2017) Comparative evaluation of hand-crafted and learned local features. In: Proc. CVPR, 6959–6968. IEEE, Honolulu.
50. Simo-Serra E, Trulls E, Ferraz L, Kokkinos I, Fua P, Moreno-Noguer F (2015) Discriminative learning of deep convolutional feature point descriptors. In: Proc. ICCV, 118–126. IEEE, Santiago.
51. Simonyan K, Vedaldi A, Zisserman A (2014) Learning local feature descriptors using convex optimisation. IEEE PAMI 36(8):1573–1585.
52. Bursuc A, Tolias G, Jégou H (2015) Kernel local descriptors with implicit rotation matching. In: Proc. ACMM, 595–598. ACM, New York.
53. Arandjelovic R, Gronat P, Torii A, Pajdla T, Sivic J (2016) NetVLAD: CNN architecture for weakly supervised place recognition. In: Proc. CVPR, 5297–5307. IEEE, Las Vegas.
54. Radenović F, Tolias G, Chum O (2016) CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In: Proc. ECCV. Springer, Cham.
55. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A, et al. (2015) Going deeper with convolutions. In: Proc. CVPR. IEEE, Boston.
56. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proc. CVPR, 770–778. IEEE, Las Vegas.
57. Berkes P, Wiskott L (2006) On the analysis and interpretation of inhomogeneous quadratic forms as receptive fields. Neural Comput 18(8):1868–1895.
58. Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: Proc. ECCV, 818–833. Springer, Cham.
59. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching. In: Proc. CVPR. IEEE, Minneapolis.
60. Vedaldi A, Lenc K (2015) MatConvNet – Convolutional neural networks for MATLAB. In: Proc. ACMM. ACM, New York.
Metadata
Title
Structure from motion using dense CNN features with keypoint relocalization
Authors
Aji Resindra Widya
Akihiko Torii
Masatoshi Okutomi
Publication date
01-12-2018
Publisher
Springer Berlin Heidelberg
DOI
https://doi.org/10.1186/s41074-018-0042-y
