Introduction
- We propose a 3D U-Net-shaped network and are the first to use depthwise separable convolution for 3D cost volume regularization, which substantially improves model efficiency while maintaining performance.
- We propose a 3D-Attention module that strengthens cost volume regularization so that it fully aggregates the valuable information in the cost volume and alleviates feature mismatching.
- We propose an effective and efficient feature transfer module that upsamples the low-resolution (LR) depth map into a high-resolution (HR) depth map, yielding higher-quality reconstruction.
- Extensive experiments on two benchmarks show that our method achieves reconstruction results comparable to, or better than, state-of-the-art methods at much lower computational cost. For instance, compared with MVSNet, our model reduces memory consumption by 49% while improving accuracy by 20%.
Related work
Traditional MVS reconstruction
Deep learning-based MVS
Depthwise separable convolutions
Attention mechanism
Methodology
Pipeline description
3D depthwise separable convolution (3D-DSC)
Convolution | Computation |
---|---|
3D-CNN | \( M \times K^3 \times C \times {\hat{C}} \) |
Depthwise-Conv | \( M \times K^3 \times C \) |
Pointwise-Conv | \( M \times C \times {\hat{C}} \) |
3D-DSC | \( M \times K^3 \times C + M \times C \times {\hat{C}} \) |
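The costs in the table above can be compared numerically. The following is a minimal sketch, assuming \(M\) denotes the number of output positions of the cost volume, \(K\) the kernel size, and \(C\), \(\hat{C}\) the input and output channel counts; the layer dimensions used here are hypothetical and chosen only for illustration.

```python
def conv3d_cost(m, k, c_in, c_out):
    """Standard 3D convolution: M * K^3 * C * C_hat."""
    return m * k**3 * c_in * c_out


def depthwise_cost(m, k, c_in):
    """Depthwise step: one K^3 filter per input channel."""
    return m * k**3 * c_in


def pointwise_cost(m, c_in, c_out):
    """Pointwise step: a 1x1x1 convolution that mixes channels."""
    return m * c_in * c_out


def dsc3d_cost(m, k, c_in, c_out):
    """3D-DSC: depthwise step followed by pointwise step."""
    return depthwise_cost(m, k, c_in) + pointwise_cost(m, c_in, c_out)


def dsc_ratio(k, c_out):
    """Closed-form ratio of 3D-DSC cost to standard 3D conv cost: 1/C_hat + 1/K^3."""
    return 1 / c_out + 1 / k**3


# Hypothetical layer (illustration only): 48 x 128 x 160 volume, 3x3x3 kernel, 8 -> 16 channels.
m = 48 * 128 * 160
k, c_in, c_out = 3, 8, 16

ratio = dsc3d_cost(m, k, c_in, c_out) / conv3d_cost(m, k, c_in, c_out)
print(f"3D-DSC needs {ratio:.1%} of the standard 3D convolution cost")
```

Dividing the last row of the table by the first gives \(1/\hat{C} + 1/K^3\), which does not depend on \(M\) or \(C\); for a \(3\times 3\times 3\) kernel the saving is therefore bounded by the \(1/27\) term once \(\hat{C}\) is large.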
3D-attention module (3DA)
Feature transfer module
Other modules
Informative feature extraction network
Input image size: \(3 \times H \times W\) | ||
---|---|---|
Name | Layer description | Output size |
Encoder | ||
\( \text {conv0}\) | \(3\times 3\) conv, stride 1 | \(8 \times H \times W\) |
\( \text {conv1}\) | \(3\times 3\) conv, stride 1 | \(8 \times H \times W\) |
\( \text {conv2}\) | \(5\times 5\) conv, stride 2 | \(16 \times \frac{1}{2}H \times \frac{1}{2}W\) |
\( \text {conv3}\) | \(3\times 3\) conv, stride 1 | \(16 \times \frac{1}{2}H \times \frac{1}{2}W\) |
\( \text {conv4}\) | \(3\times 3\) conv, stride 1 | \(16 \times \frac{1}{2}H \times \frac{1}{2}W\) |
\( \text {conv5}\) | \(5\times 5\) conv, stride 2 | \(32 \times \frac{1}{4}H \times \frac{1}{4}W\) |
\( \text {conv6}\) | \(3\times 3\) conv, stride 1 | \(32 \times \frac{1}{4}H \times \frac{1}{4}W\) |
\( \text {conv7}\) | \(3\times 3\) conv, stride 1 | \(32 \times \frac{1}{4}H \times \frac{1}{4}W\) |
Decoder | ||
\( \text {conv8}\) | \(3\times 3\) transposed conv, stride 2 | \(16 \times \frac{1}{2}H \times \frac{1}{2}W\) |
\( \text {conv9}\) | \(3\times 3\) conv, stride 1 | \(16 \times \frac{1}{2}H \times \frac{1}{2}W\) |
\( \text {conv10}\) | \(3\times 3\) conv, stride 1 | \(16 \times \frac{1}{2}H \times \frac{1}{2}W\) |
sp | Add conv4 & conv10 features | \(16 \times \frac{1}{2}H \times \frac{1}{2}W\) |
\( \text {conv11}\) | \(3\times 3\) transposed conv, stride 2 | \(8 \times H \times W\) |
\( \text {conv12}\) | \(3\times 3\) conv, stride 1 | \(8 \times H \times W\) |
\( \text {conv13}\) | \(3\times 3\) conv, stride 1 | \(8 \times H \times W\) |
sp | Add conv1 & conv13 features | \(8 \times H \times W\) |
Adjuster | ||
\( \text {conv14}\) | \(5\times 5\) conv, stride 2 | \(16 \times \frac{1}{2}H \times \frac{1}{2}W\) |
\( \text {conv15}\) | \(3\times 3\) conv, stride 1 | \(16 \times \frac{1}{2}H \times \frac{1}{2}W\) |
\( \text {conv16}\) | \(5\times 5\) conv, stride 2 | \(32 \times \frac{1}{4}H \times \frac{1}{4}W\) |
\( \text {conv17}\) | \(3\times 3\) conv, stride 1 (no BN & ReLU) | \(32 \times \frac{1}{4}H \times \frac{1}{4}W\) |
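The output sizes in the table above can be traced layer by layer. The sketch below is illustrative only: it assumes each convolution uses padding of `kernel // 2` and that each stride-2 transposed convolution exactly doubles the spatial size (for example, padding 1 with output padding 1), neither of which is stated in the table. The skip additions ("sp") leave shapes unchanged and are omitted, and the 512 × 640 input is a hypothetical example.

```python
def conv_out(size, kernel, stride):
    """Spatial size after a convolution, assuming padding = kernel // 2."""
    pad = kernel // 2
    return (size + 2 * pad - kernel) // stride + 1


def deconv_out(size, stride):
    """Spatial size after a transposed convolution, assumed to scale exactly by the stride."""
    return size * stride


# (name, kind, kernel, stride, output channels), transcribed from the table.
LAYERS = [
    ("conv0", "conv", 3, 1, 8), ("conv1", "conv", 3, 1, 8),
    ("conv2", "conv", 5, 2, 16), ("conv3", "conv", 3, 1, 16), ("conv4", "conv", 3, 1, 16),
    ("conv5", "conv", 5, 2, 32), ("conv6", "conv", 3, 1, 32), ("conv7", "conv", 3, 1, 32),
    ("conv8", "deconv", 3, 2, 16), ("conv9", "conv", 3, 1, 16), ("conv10", "conv", 3, 1, 16),
    ("conv11", "deconv", 3, 2, 8), ("conv12", "conv", 3, 1, 8), ("conv13", "conv", 3, 1, 8),
    ("conv14", "conv", 5, 2, 16), ("conv15", "conv", 3, 1, 16),
    ("conv16", "conv", 5, 2, 32), ("conv17", "conv", 3, 1, 32),
]


def trace(h, w):
    """Return {layer name: (channels, height, width)} for an input of size 3 x h x w."""
    shapes = {}
    for name, kind, kernel, stride, channels in LAYERS:
        if kind == "conv":
            h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)
        else:
            h, w = deconv_out(h, stride), deconv_out(w, stride)
        shapes[name] = (channels, h, w)
    return shapes


shapes = trace(512, 640)  # hypothetical input resolution
for name, shape in shapes.items():
    print(f"{name:7s} {shape}")
```

Under these assumptions the encoder ends at one quarter resolution, the decoder returns to full resolution, and the adjuster brings the features back to \(32 \times \frac{1}{4}H \times \frac{1}{4}W\), matching the last row of the table.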
Cost volume construction
Depth map refinement
Training loss
Experiments
Experimental settings
Dataset
Methods | Acc. (mm) \(\downarrow \) | Comp. (mm) \(\downarrow \) | Overall (mm) \(\downarrow \) |
---|---|---|---|
Camp [55] | 0.835 | 0.554 | 0.695 |
Furu [8] | 0.613 | 0.941 | 0.777 |
Tola [25] | 0.342 | 1.19 | 0.766 |
Gipuma [23] | 0.283 | 0.873 | 0.578 |
MVSNet [10] | 0.396 | 0.527 | 0.462 |
R-MVSNet [15] | 0.383 | 0.452 | 0.417 |
P-MVSNet [11] | 0.406 | 0.434 | 0.420 |
MVSCRF [56] | 0.371 | 0.426 | 0.398 |
PointMVSNet [32] | 0.342 | 0.411 | 0.376 |
Fast-MVSNet [13] | 0.336 | 0.403 | 0.370 |
Cascade-MVSNet [12] | 0.325 | 0.385 | 0.355 |
CVP-MVSNet [49] | 0.296 | 0.406 | 0.351 |
PVA-MVSNet [53] | 0.372 | 0.350 | 0.361 |
Vis-Net [57] | 0.369 | 0.361 | 0.365 |
MVSNet++ [54] | 0.407 | 0.345 | 0.376 |
UCS-Net [14] | 0.338 | 0.349 | 0.344 |
\(D^{2}\)HC-RMVSNet [16] | 0.395 | 0.378 | 0.386 |
DeepFusion [31] | 0.357 | 0.502 | 0.429 |
AA-RMVSNet [58] | 0.376 | 0.339 | 0.357 |
PatchmatchNet [59] | 0.427 | 0.277 | 0.352 |
Ours | 0.316 | 0.372 | 0.344 |
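The Overall column in the table above is consistent with the usual DTU convention of averaging accuracy and completeness. The following sketch checks this on a few rows copied from the table; the 0.0005 mm tolerance is an assumption that accounts for the three-decimal rounding of the published values.

```python
# (method, accuracy in mm, completeness in mm, reported overall in mm), copied from the table.
ROWS = [
    ("MVSNet", 0.396, 0.527, 0.462),
    ("R-MVSNet", 0.383, 0.452, 0.417),
    ("Cascade-MVSNet", 0.325, 0.385, 0.355),
    ("PatchmatchNet", 0.427, 0.277, 0.352),
    ("Ours", 0.316, 0.372, 0.344),
]

TOLERANCE = 5.1e-4  # half a unit in the third decimal place, plus floating-point slack


def overall(acc, comp):
    """Overall score as the mean of accuracy and completeness (lower is better)."""
    return (acc + comp) / 2


mismatches = [
    name for name, acc, comp, reported in ROWS
    if abs(overall(acc, comp) - reported) > TOLERANCE
]
print("rows inconsistent with the mean:", mismatches)
```

Because Overall weights the two errors equally, a method can rank well by excelling on one of them: PatchmatchNet has the best completeness in the table but a weaker accuracy, whereas the proposed method is more balanced.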
Implementation details
Evaluation metrics
Evaluation on DTU dataset
Comparison of model performance
Methods | W, H | Parameters | Memory (GB) | Time (s) |
---|---|---|---|---|
MVSNet [10] | 1152, 864 | 1084304 | 10.8 | 1.21 |
R-MVSNet [15] | 1600, 1184 | 799365 | 6.7 | 2.35 |
PointMVSNet [32] | 1600, 1152 | 698936 | 8.7 | 5.44 |
Fast-MVSNet [13] | 1280, 960 | 455472 | 5.3 | 0.6 |
Cascade-MVSNet [12] | 1152, 864 | 934304 | 5.3 | 0.49 |
CVP-MVSNet [49] | 1600, 1152 | 551585 | 8.7 | 1.72 |
PVA-MVSNet [53] | 1600, 1184 | 338129 | 17.3 | 0.95 |
UCS-Net [14] | 1152, 864 | 938496 | 5.4 | 0.76 |
\(D^{2}\)HC-RMVSNet [16] | 1600, 1200 | 338257 | 6.6 | 29.15 |
Ours | 1600, 1152 | 253585 | 5.5 | 0.74 |
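The 49% memory and 20% accuracy figures quoted in the introduction can be reproduced from the reported numbers. This is a minimal arithmetic check that uses the memory column of the table above and the DTU accuracy values (0.396 mm for MVSNet, 0.316 mm for the proposed method); the variable names are illustrative.

```python
def relative_reduction(baseline, ours):
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


# Memory in GB, from the efficiency table.
mvsnet_memory_gb, ours_memory_gb = 10.8, 5.5
# Accuracy error in mm (lower is better), from the DTU results.
mvsnet_acc_mm, ours_acc_mm = 0.396, 0.316

memory_reduction = relative_reduction(mvsnet_memory_gb, ours_memory_gb)
accuracy_gain = relative_reduction(mvsnet_acc_mm, ours_acc_mm)

print(f"memory reduction: {memory_reduction:.1%}")
print(f"accuracy error reduction: {accuracy_gain:.1%}")
```

Note that the two rows are reported at different input resolutions (1152, 864 for MVSNet and 1600, 1152 for the proposed method), so the memory saving is obtained while processing a larger input.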
#Source | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|
VRAM (MB) | 3810 | 4302 | 4966 | 5539 | 6289 | 7044 | 7732 | 8546 |
TIME (s) | 0.45 | 0.55 | 0.64 | 0.74 | 0.87 | 0.99 | 1.11 | 1.23 |
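The table above suggests that memory and runtime grow roughly linearly with the number of source views. A least-squares line fitted to the reported values gives a quick estimate of the per-view cost; this is a sketch for reading the table, not part of the method, and the helper name is illustrative.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Values copied from the table: number of source views, VRAM in MB, time in seconds.
views = [2, 3, 4, 5, 6, 7, 8, 9]
vram_mb = [3810, 4302, 4966, 5539, 6289, 7044, 7732, 8546]
time_s = [0.45, 0.55, 0.64, 0.74, 0.87, 0.99, 1.11, 1.23]

vram_slope, vram_base = least_squares(views, vram_mb)
time_slope, time_base = least_squares(views, time_s)

print(f"VRAM: about {vram_slope:.0f} MB per additional source view")
print(f"Time: about {time_slope:.2f} s per additional source view")
```

On these values the fit gives roughly 680 MB and 0.11 s per additional source view, so the cost of adding views is predictable when choosing a configuration for a given GPU.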
Comparison of model efficiency
Method | Acc. (mm) \(\downarrow \) | Comp. (mm) \(\downarrow \) | Overall (mm) \(\downarrow \) | Parameters \(\downarrow \) | Memory (MB) \(\downarrow \) | Time (s) \(\downarrow \) |
---|---|---|---|---|---|---|
Baseline (+ 3D CNNs) | 0.391 | 0.482 | 0.437 | 892713 | 9034 | 0.523 |
Baseline + DSC | 0.398 | 0.470 | 0.434 | 169024 | 3408 | 0.274 |
Baseline + DSC + 3DA | 0.358 | 0.453 | 0.406 | 170785 | 3422 | 0.324 |
Baseline + DSC + IFEN | 0.376 | 0.467 | 0.422 | 208432 | 3488 | 0.338 |
Baseline + DSC + FTM | 0.364 | 0.441 | 0.403 | 214177 | 3572 | 0.335 |
Baseline + DSC + 3DA + IFEN + FTM | 0.316 | 0.372 | 0.344 | 253585 | 3766 | 0.354 |
Ablation experiments
Methods | Family \(\uparrow \) | Francis \(\uparrow \) | Horse \(\uparrow \) | Lighthouse \(\uparrow \) | M60 \(\uparrow \) | Panther \(\uparrow \) | Playground \(\uparrow \) | Train \(\uparrow \) | Intermediate mean \(\uparrow \) |
---|---|---|---|---|---|---|---|---|---|
Colmap [22] | 50.41 | 22.25 | 25.63 | 56.43 | 44.83 | 46.97 | 48.53 | 42.04 | 42.14 |
Pix4D [60] | 64.45 | 31.91 | 26.43 | 54.41 | 50.58 | 35.37 | 47.78 | 34.96 | 43.24 |
58.86 | 32.59 | 26.25 | 43.12 | 44.73 | 46.85 | 45.97 | 35.27 | 41.71 | |
MVSNet [10] | 55.99 | 28.55 | 25.07 | 50.79 | 53.96 | 50.86 | 47.90 | 34.69 | 43.48 |
R-MVSNet [15] | 69.96 | 46.65 | 32.59 | 42.95 | 51.88 | 48.80 | 52.00 | 42.38 | 48.40 |
MVSCRF [56] | 59.83 | 30.60 | 29.93 | 51.15 | 50.61 | 51.45 | 52.60 | 39.68 | 45.73 |
PointMVSNet [32] | 61.79 | 41.15 | 34.20 | 50.79 | 51.97 | 50.85 | 52.38 | 43.06 | 48.27 |
CIDER [63] | 56.79 | 32.39 | 29.89 | 54.67 | 53.46 | 53.51 | 50.48 | 42.85 | 46.76 |
Fast-MVSNet [13] | 65.18 | 39.59 | 34.98 | 47.81 | 49.16 | 46.20 | 53.27 | 42.91 | 47.39 |
PatchmatchNet [59] | 66.99 | 52.64 | 43.24 | 54.87 | 52.87 | 49.54 | 54.21 | 50.81 | 53.15 |
DSC-MVSNet | 68.06 | 47.43 | 41.60 | 54.96 | 56.73 | 53.86 | 53.46 | 51.71 | 53.48 |