1 Introduction
2 Related Work
3 LucidTracker
3.1 Architecture
3.2 Training Modalities
4 Lucid Data Dreaming
5 Single Object Segmentation Results
5.1 Experimental Setup
| Method | # Training images | Flow \(\mathcal{F}\) | \(\text{DAVIS}_{\text{16}}\) mIoU | YouTubeObjs mIoU | \(\text{SegTrack}_{\text{v2}}\) mIoU |
|---|---|---|---|---|---|
| Box oracle (Khoreva et al. 2016) | 0 | ✗ | 45.1 | 55.3 | 56.1 |
| Grabcut oracle (Khoreva et al. 2016) | 0 | ✗ | 67.3 | 67.6 | 74.2 |
| *Ignores 1st frame annotation* | | | | | |
| Saliency | 0 | ✗ | 32.7 | 40.7 | 22.2 |
| NLC (Faktor and Irani 2014) | 0 | ✓ | 64.1 | – | – |
| TRS (Xiao and Lee 2016) | 0 | ✓ | – | – | 69.1 |
| MP-Net (Tokmakov et al. 2016) | ~ 22.5k | ✓ | 69.7 | – | – |
| Flow saliency | 0 | ✓ | 70.7 | 36.3 | 35.9 |
| FusionSeg (Jain et al. 2017) | ~ 95k | ✓ | 71.5 | 67.9 | – |
| LVO (Tokmakov et al. 2017) | ~ 35k | ✓ | 75.9 | – | 57.3 |
| PDB (Song et al. 2018) | ~ 18k | ✗ | 77.2 | – | – |
| *Uses 1st frame annotation* | | | | | |
| Mask warping | 0 | ✓ | 32.1 | 43.2 | 42.0 |
| FCP (Perazzi et al. 2015) | 0 | ✓ | 63.1 | – | – |
| BVS (Maerki et al. 2016) | 0 | ✗ | 66.5 | 59.7 | 58.4 |
| N15 (Nagaraja et al. 2015) | 0 | ✓ | – | – | 69.6 |
| ObjFlow (Tsai et al. 2016) | 0 | ✓ | 71.1 | 70.1 | 67.5 |
| STV (Wang and Shen 2017) | 0 | ✓ | 73.6 | – | – |
| VPN (Jampani et al. 2016) | ~ 2.3k | ✗ | 75.0 | – | – |
| OSVOS (Caelles et al. 2017) | ~ 2.3k | ✗ | 79.8 | 72.5 | 65.4 |
| MaskTrack (Khoreva et al. 2016) | ~ 11k | ✓ | 80.3 | 72.6 | 70.3 |
| PReMVOS (Luiten and Voigtlaender 2018) | ~ 145k | ✓ | 84.9 | – | – |
| OnAVOS (Voigtlaender and Leibe 2017b) | ~ 120k | ✗ | 86.1 | – | – |
| VideoGCRF (Chandra et al. 2018) | ~ 120k | ✗ | 86.5 | – | – |
| \(\text{LucidTracker}\) | 24–126 | ✓ | 86.6 | 77.3 | 78.0 |
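The comparison above reports mIoU, i.e. the per-sequence mean of the Jaccard index \(J\) between predicted and ground-truth masks. As a reference point, a minimal NumPy sketch of the per-frame measure (function name and the empty-mask convention are our own choices, not the official benchmark toolkit):

```python
import numpy as np

def region_similarity(pred, gt):
    """Jaccard index J: intersection-over-union of two binary masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # both masks empty: define J = 1 by convention
        return 1.0
    return np.logical_and(pred, gt).sum() / union

# mIoU over a video is the mean of the per-frame J values.
```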
5.2 Key Results
| Variant | ImgNet pre-train. | Per-dataset training | Per-video fine-tun. | \(\text{DAVIS}_{\text{16}}\) mIoU | YouTubeObjs mIoU | \(\text{SegTrack}_{\text{v2}}\) mIoU |
|---|---|---|---|---|---|---|
| LucidTracker\(^{-}\) | ✓ | ✓ | ✓ | 83.7 | 76.2 | 76.8 |
| (no ImgNet) | ✗ | ✓ | ✓ | 82.0 | 74.3 | 71.2 |
| No per-video tuning | ✓ | ✓ | ✗ | 82.7 | 72.3 | 71.9 |
| | ✗ | ✓ | ✗ | 78.4 | 69.7 | 68.2 |
| Only per-video tuning | ✓ | ✗ | ✓ | 79.4 | – | 70.4 |
| | ✗ | ✗ | ✓ | 80.5 | – | 66.8 |
5.3 Ablation Studies
5.3.1 Effect of Training Modalities
| Variant | \(\mathcal{I}\) | \(\mathcal{F}\) | warp. \(w\) | \(\text{DAVIS}_{\text{16}}\) mIoU | YouTubeObjs mIoU | \(\text{SegTrack}_{\text{v2}}\) mIoU |
|---|---|---|---|---|---|---|
| \(\text{LucidTracker}\) | ✓ | ✓ | ✓ | 86.6 | 77.3 | 78.0 |
| LucidTracker\(^{-}\) | ✓ | ✓ | ✓ | 83.7 | 76.2 | 76.8 |
| No warping | ✓ | ✓ | ✗ | 82.0 | 74.6 | 70.5 |
| No OF | ✓ | ✗ | ✗ | 78.0 | 74.7 | 61.8 |
| OF only | ✗ | ✓ | ✓ | 74.5 | 43.1 | 55.8 |
5.3.2 Effect of Optical Flow

| Variant | Optical flow | \(\text{DAVIS}_{\text{16}}\) mIoU | YouTubeObjs mIoU | \(\text{SegTrack}_{\text{v2}}\) mIoU |
|---|---|---|---|---|
| LucidTracker\(^{-}\) | FlowNet2.0 | 83.7 | 76.2 | 76.8 |
| | EpicFlow | 80.2 | 71.3 | 67.0 |
| | No flow | 78.0 | 74.7 | 61.8 |
| No ImageNet pre-training | FlowNet2.0 | 82.0 | 74.3 | 71.2 |
| | EpicFlow | 80.0 | 72.3 | 68.8 |
| | No flow | 76.7 | 71.4 | 63.0 |

5.3.3 Effect of CRF Tuning

| Method | CRF parameters | \(\text{DAVIS}_{\text{16}}\) mIoU | YouTubeObjs mIoU | \(\text{SegTrack}_{\text{v2}}\) mIoU |
|---|---|---|---|---|
| \(\text{LucidTracker}^{-}\) | – | 83.7 | 76.2 | 76.8 |
| \(\text{LucidTracker}\) | Default | 84.2 | 75.5 | 72.2 |
| \(\text{LucidTracker}\) | Tuned per-dataset | 84.8 | 76.2 | 77.6 |
5.4 Additional Experiments

5.4.1 Generalization Across Videos

| Training set | # Training videos | # Frames per video | mIoU |
|---|---|---|---|
| Includes 1st frames from test set | 1 | 1 | 78.3 |
| | 2 | 1 | 75.4 |
| | 15 | 1 | 68.7 |
| | 30 | 1 | 65.4 |
| | 30 | 2 | 74.3 |
| Excludes 1st frames from test set | 2 | 1 | 11.6 |
| | 15 | 1 | 36.4 |
| | 30 | 1 | 41.7 |
| | 30 | 2 | 48.4 |

5.4.2 Generalization Across Datasets

| Training set | \(\text{DAVIS}_{\text{16}}\) mIoU | YouTubeObjs mIoU | \(\text{SegTrack}_{\text{v2}}\) mIoU | Mean |
|---|---|---|---|---|
| \(\text{DAVIS}_{\text{16}}\) | 80.9 | 50.9 | 46.9 | 59.6 |
| YouTubeObjs | 67.0 | 71.5 | 52.0 | 63.5 |
| \(\text{SegTrack}_{\text{v2}}\) | 56.0 | 52.2 | 66.4 | 58.2 |
| Best | 80.9 | 71.5 | 66.4 | 72.9 |
| Second best | 67.0 | 52.2 | 52.0 | 57.1 |
| All-in-one | 71.9 | 70.7 | 60.8 | 67.8 |

5.4.3 Experimenting with the Convnet Architecture

| Architecture | ImgNet pre-train. | Per-dataset training | Per-video fine-tun. | \(\text{DAVIS}_{\text{16}}\) mIoU |
|---|---|---|---|---|
| Two streams | ✓ | ✓ | ✗ | 80.9 |
| One stream | ✓ | ✓ | ✗ | 80.3 |
Results on \(\text{DAVIS}_{\text{16}}\): region (\(J\)), boundary (\(F\)), and temporal stability (\(T\)) measures.

| Method | # Training images | Flow \(\mathcal{F}\) | \(J\) Mean \(\uparrow\) | \(J\) Recall \(\uparrow\) | \(J\) Decay \(\downarrow\) | \(F\) Mean \(\uparrow\) | \(F\) Recall \(\uparrow\) | \(F\) Decay \(\downarrow\) | \(T\) Mean \(\downarrow\) |
|---|---|---|---|---|---|---|---|---|---|
| Box oracle (Khoreva et al. 2016) | 0 | ✗ | 45.1 | 39.7 | \(-\) 0.7 | 21.4 | 6.7 | 1.8 | 1.0 |
| Grabcut oracle (Khoreva et al. 2016) | 0 | ✗ | 67.3 | 76.9 | 1.5 | 65.8 | 77.2 | 2.9 | 34.0 |
| *Ignores 1st frame annotation* | | | | | | | | | |
| Saliency | 0 | ✗ | 32.7 | 22.6 | \(-\) 0.2 | 26.9 | 10.3 | 0.9 | 32.8 |
| NLC (Faktor and Irani 2014) | 0 | ✓ | 64.1 | 73.1 | 8.6 | 59.3 | 65.8 | 8.6 | 35.8 |
| MP-Net (Tokmakov et al. 2016) | ~ 22.5k | ✓ | 69.7 | 82.9 | 5.6 | 66.3 | 78.3 | 6.7 | 68.6 |
| Flow saliency | 0 | ✓ | 70.7 | 83.2 | 6.7 | 69.7 | 82.9 | 7.9 | 48.2 |
| FusionSeg (Jain et al. 2017) | ~ 95k | ✓ | 71.5 | – | – | – | – | – | – |
| LVO (Tokmakov et al. 2017) | ~ 35k | ✓ | 75.9 | 89.1 | 0.0 | 72.1 | 83.4 | 1.3 | 26.5 |
| PDB (Song et al. 2018) | ~ 18k | ✗ | 77.2 | 90.1 | 0.9 | 74.5 | 84.4 | \(-\) 0.2 | 29.1 |
| *Uses 1st frame annotation* | | | | | | | | | |
| Mask warping | 0 | ✓ | 32.1 | 25.5 | 31.7 | 36.3 | 23.0 | 32.8 | 8.4 |
| FCP (Perazzi et al. 2015) | 0 | ✓ | 63.1 | 77.8 | 3.1 | 54.6 | 60.4 | 3.9 | 28.5 |
| BVS (Maerki et al. 2016) | 0 | ✗ | 66.5 | 76.4 | 26.0 | 65.6 | 77.4 | 23.6 | 31.6 |
| ObjFlow (Tsai et al. 2016) | 0 | ✓ | 71.1 | 80.0 | 22.7 | 67.9 | 78.0 | 24.0 | 22.1 |
| STV (Wang and Shen 2017) | 0 | ✓ | 73.6 | – | – | 72.0 | – | – | – |
| VPN (Jampani et al. 2016) | ~ 2.3k | ✗ | 75.0 | – | – | 72.4 | – | – | 29.5 |
| OSVOS (Caelles et al. 2017) | ~ 2.3k | ✗ | 79.8 | 93.6 | 14.9 | 80.6 | 92.6 | 15.0 | 37.6 |
| MaskTrack (Khoreva et al. 2016) | ~ 11k | ✓ | 80.3 | 93.5 | 8.9 | 75.8 | 88.2 | 9.5 | 18.3 |
| PReMVOS (Luiten and Voigtlaender 2018) | ~ 145k | ✓ | 84.9 | 96.1 | 8.8 | 88.6 | 94.7 | 9.8 | 19.7 |
| OnAVOS (Voigtlaender and Leibe 2017b) | ~ 120k | ✗ | 86.1 | 96.1 | 5.2 | 84.9 | 89.7 | 5.8 | 19.0 |
| VideoGCRF (Chandra et al. 2018) | ~ 120k | ✗ | 86.5 | – | – | – | – | – | – |
| \(\text{LucidTracker}\) | 50 | ✓ | 86.6 | 97.3 | 5.3 | 84.8 | 93.1 | 7.5 | 15.9 |
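The boundary measure \(F\) above is the harmonic mean of boundary precision and recall. The sketch below (our own simplification) assumes exact pixel overlap between binary boundary maps; the official DAVIS evaluation additionally extracts contours and matches them within a small spatial tolerance:

```python
import numpy as np

def boundary_f_measure(pred_boundary, gt_boundary):
    """Contour accuracy F: harmonic mean of boundary precision and recall.

    Simplified: counts exact-pixel matches between binary boundary maps,
    without the contour-matching tolerance used by the DAVIS toolkit.
    """
    pred_b = np.asarray(pred_boundary, bool)
    gt_b = np.asarray(gt_boundary, bool)
    matched = np.logical_and(pred_b, gt_b).sum()
    precision = matched / pred_b.sum() if pred_b.sum() else 0.0
    recall = matched / gt_b.sum() if gt_b.sum() else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```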
| Attribute | BVS (Maerki et al. 2016) | ObjFlow (Tsai et al. 2016) | OSVOS (Caelles et al. 2017) | MaskTrack (Khoreva et al. 2016) | \(\text{LucidTracker}\) |
|---|---|---|---|---|---|
| Appearance change | 0.46 | 0.54 | 0.81 | 0.76 | 0.84 |
| Background clutter | 0.63 | 0.68 | 0.83 | 0.79 | 0.86 |
| Camera-shake | 0.62 | 0.72 | 0.78 | 0.78 | 0.88 |
| Deformation | 0.70 | 0.77 | 0.79 | 0.78 | 0.87 |
| Dynamic background | 0.60 | 0.67 | 0.74 | 0.76 | 0.82 |
| Edge ambiguity | 0.58 | 0.65 | 0.77 | 0.74 | 0.82 |
| Fast-motion | 0.53 | 0.55 | 0.76 | 0.75 | 0.85 |
| Heterogeneous object | 0.63 | 0.66 | 0.75 | 0.79 | 0.85 |
| Interacting objects | 0.63 | 0.68 | 0.75 | 0.77 | 0.85 |
| Low resolution | 0.59 | 0.58 | 0.77 | 0.77 | 0.84 |
| Motion blur | 0.58 | 0.60 | 0.74 | 0.74 | 0.83 |
| Occlusion | 0.68 | 0.66 | 0.77 | 0.77 | 0.84 |
| Out-of-view | 0.43 | 0.53 | 0.72 | 0.71 | 0.84 |
| Scale variation | 0.49 | 0.56 | 0.74 | 0.73 | 0.81 |
| Shape complexity | 0.67 | 0.69 | 0.71 | 0.75 | 0.82 |
5.5 Error Analysis
6 Multiple Object Segmentation Results
6.1 Experimental Setup
Results on the \(\text{DAVIS}_{\text{17}}\) test-dev set.

| Method | Rank | Global mean \(\uparrow\) | \(J\) Mean \(\uparrow\) | \(J\) Recall \(\uparrow\) | \(J\) Decay \(\downarrow\) | \(F\) Mean \(\uparrow\) | \(F\) Recall \(\uparrow\) | \(F\) Decay \(\downarrow\) |
|---|---|---|---|---|---|---|---|---|
| sidc | 10 | 45.8 | 43.9 | 51.5 | 34.3 | 47.8 | 53.6 | 36.9 |
| YXLKJ | 9 | 49.6 | 46.1 | 49.1 | 22.7 | 53.0 | 56.5 | 22.3 |
| haamooon (Shaban et al. 2017) | 8 | 51.3 | 48.8 | 56.9 | 12.2 | 53.8 | 61.3 | 11.8 |
| Fromandtozh (Zhao 2017) | 7 | 55.2 | 52.4 | 58.4 | 18.1 | 57.9 | 66.1 | 20.0 |
| ilanv (Sharir et al. 2017) | 6 | 55.8 | 51.9 | 55.7 | 17.6 | 59.8 | 65.8 | 18.9 |
| voigtlaender (Voigtlaender and Leibe 2017a) | 5 | 56.5 | 53.4 | 57.8 | 19.9 | 59.6 | 65.4 | 19.0 |
| lalalafine123 | 4 | 57.4 | 54.5 | 61.3 | 24.4 | 60.2 | 68.8 | 24.6 |
| wangzhe | 3 | 57.7 | 55.6 | 63.2 | 31.7 | 59.8 | 66.7 | 37.1 |
| lixx (Li et al. 2017) | 2 | 66.1 | 64.4 | 73.5 | 24.5 | 67.8 | 75.6 | 27.1 |
| \(\text{LucidTracker}\) | 1 | 66.6 | 63.4 | 73.9 | 19.5 | 69.9 | 80.1 | 19.4 |
Results on the \(\text{DAVIS}_{\text{17}}\) test-challenge set.

| Method | Rank | Global mean \(\uparrow\) | \(J\) Mean \(\uparrow\) | \(J\) Recall \(\uparrow\) | \(J\) Decay \(\downarrow\) | \(F\) Mean \(\uparrow\) | \(F\) Recall \(\uparrow\) | \(F\) Decay \(\downarrow\) |
|---|---|---|---|---|---|---|---|---|
| zwrq0 | 10 | 53.6 | 50.5 | 54.9 | 28.0 | 56.7 | 63.5 | 30.4 |
| Fromandtozh (Zhao 2017) | 9 | 53.9 | 50.7 | 54.9 | 32.5 | 57.1 | 63.2 | 33.7 |
| wasidennis | 8 | 54.8 | 51.6 | 56.3 | 26.8 | 57.9 | 64.8 | 28.8 |
| YXLKJ | 7 | 55.8 | 53.8 | 60.1 | 37.7 | 57.8 | 62.1 | 42.9 |
| cjc (Cheng et al. 2017) | 6 | 56.9 | 53.6 | 59.5 | 25.3 | 60.2 | 67.9 | 27.6 |
| lalalafine123 | 6 | 56.9 | 54.8 | 60.7 | 34.4 | 59.1 | 66.7 | 36.1 |
| voigtlaender (Voigtlaender and Leibe 2017a) | 5 | 57.7 | 54.8 | 60.8 | 31.0 | 60.5 | 67.2 | 34.7 |
| haamooon (Shaban et al. 2017) | 4 | 61.5 | 59.8 | 71.0 | 21.9 | 63.2 | 74.6 | 23.7 |
| vantam299 (Le et al. 2017) | 3 | 63.8 | 61.5 | 68.6 | 17.1 | 66.2 | 79.0 | 17.6 |
| \(\text{LucidTracker}\) | 2 | 67.8 | 65.1 | 72.5 | 27.7 | 70.6 | 79.8 | 30.2 |
| lixx (Li et al. 2017) | 1 | 69.9 | 67.9 | 74.6 | 22.5 | 71.9 | 79.1 | 24.1 |
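The "Global mean" used for ranking on \(\text{DAVIS}_{\text{17}}\) is simply the average of the \(J\) and \(F\) means. A one-line sketch (function name is our own; the reported table values are rounded to one decimal, so recomputed averages can differ by 0.1):

```python
def davis17_global_mean(j_mean, f_mean):
    """DAVIS 2017 ranking measure: average of region (J) and boundary (F) means."""
    return (j_mean + f_mean) / 2

# e.g. lixx on test-dev: J = 64.4, F = 67.8  ->  global mean 66.1, as reported.
```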
6.2 Key Results
Ablation on \(\text{DAVIS}_{\text{17}}\), test-dev and test-challenge sets.

| Variant | \(\mathcal{I}\) | \(\mathcal{F}\) | \(\mathcal{S}\) | Ensemble | CRF tuning | Temp. coherency | Global (dev) | mIoU (dev) | mF (dev) | Global (chal.) | mIoU (chal.) | mF (chal.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| \(\text{LucidTracker}\) (ensemble) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 66.6 | 63.4 | 69.9 | 67.8 | 65.1 | 70.6 |
| | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | 65.2 | 61.5 | 69.0 | 67.0 | 64.3 | 69.7 |
| | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | 64.7 | 60.5 | 68.9 | 66.5 | 63.2 | 69.8 |
| | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | 64.9 | 61.3 | 68.4 | – | – | – |
| | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | 64.2 | 60.1 | 68.3 | – | – | – |
| \(\text{LucidTracker}\) | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | 62.9 | 59.1 | 66.6 | – | – | – |
| \(\mathcal{I}+\mathcal{F}+\mathcal{S}\) | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | 62.0 | 57.7 | 62.2 | 64.0 | 60.7 | 67.3 |
| \(\mathcal{I}+\mathcal{F}\) | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | 61.3 | 56.8 | 65.8 | – | – | – |
| \(\mathcal{I}+\mathcal{S}\) | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | 61.1 | 56.9 | 65.3 | – | – | – |
| \(\mathcal{I}\) | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | 59.8 | 63.1 | 63.9 | – | – | – |