1 Introduction
2 Deep learning
2.1 Basic deep learning methods
2.1.1 Convolutional neural networks (CNN)
2.1.2 Recurrent neural networks (RNN)
2.1.3 Restricted Boltzmann machine (RBM)
2.1.4 Auto-encoders (AE)
2.2 Deep learning for high dimensional data
2.2.1 Increase in physical dimensions
2.2.2 Increase in modalities
3 Traditional methods
3.1 Object surface features
3.1.1 Global features
3.1.2 Local features
Method | Year | Comments |
---|---|---|
SI | 1998 | Most cited surface descriptor |
PFH [228] | 2008 | Captures multiple characteristics |
FPFH [227] | 2009 | Improved computational efficiency of PFH |
2.5D SIFT [166] | 2009 | SIFT for depth images |
HKS [268] | 2009 | Invariant to non-rigid transformations |
mesh-HOG [329] | 2009 | Extension of HOG [48] descriptor for triangular meshes |
3D-SURF [136] | 2010 | Extension of SURF [12] descriptor for triangular meshes |
SI-HKS [29] | 2010 | Scale invariant extension of HKS |
SHOT [281] | 2010 | Signatures of histograms, balance between descriptiveness and robustness |
CSHOT [282] | 2011 | Extension of SHOT descriptor to incorporate texture information |
WKS [7] | 2011 | Invariant to non-rigid transformations, scale invariant, outperforms HKS |
TriSI [89] | 2013 | Rotation, scale invariant and robust extension of SI descriptor |
RoPS [87] | 2013 | Unique and repeatable LRF, robust to noise and mesh resolution |
3DLBP [178] | 2015 | Generalization of LBP to three dimensions |
3DBRIEF [178] | 2015 | Generalization of BRIEF to three dimensions |
3DORB [178] | 2015 | Generalization of ORB to three dimensions |
LFSH [319] | 2016 | Combines depth map, point distribution and deviation angle between normals |
TOLDI [320] | 2017 | LRF robust to noise, resolution, clutter and occlusion; multi-view depth map descriptor |
RSM [210] | 2018 | Uses multi-view silhouette instead of depth map. Outperforms RoPS |
BroPH [336] | 2018 | Binary descriptor, combines depth map and spatial distribution |
MVD [83] | 2019 | Extremely low dimensional. Performs similar to SoA descriptors in object recognition |
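Several of the local descriptors above (e.g. PFH and LFSH) are built from histograms of geometric relations between a keypoint and its neighbours, such as the deviation angle between surface normals. A minimal illustrative sketch of that core step in NumPy follows; the bin count and the normalisation are arbitrary choices for illustration, not taken from any specific descriptor in the table:

```python
import numpy as np

def normal_deviation_histogram(normals, center_idx, neighbor_idx, n_bins=8):
    """Histogram of angles between a keypoint normal and its neighbours' normals.

    normals      : (N, 3) array of unit surface normals
    center_idx   : index of the keypoint
    neighbor_idx : indices of the neighbouring points
    Returns an n_bins-dimensional descriptor, normalised to sum to 1.
    """
    n_c = normals[center_idx]
    # cos(angle) between the keypoint normal and each neighbour normal,
    # clipped to guard against floating-point values just outside [-1, 1]
    cosines = np.clip(normals[neighbor_idx] @ n_c, -1.0, 1.0)
    angles = np.arccos(cosines)                       # angles in [0, pi]
    hist, _ = np.histogram(angles, bins=n_bins, range=(0.0, np.pi))
    return hist / max(hist.sum(), 1)                  # normalise to a distribution
```

Full descriptors such as PFH additionally encode pairwise angles in a local Darboux frame and pool over all point pairs in the support region; this sketch only shows the histogram-of-angles idea.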
Method | Data type | Dimensionality | Comments |
---|---|---|---|
Scovanner et al. [239] | Video | 3 | First 3D SIFT |
Cheung and Hamarneh [39] | 3D MRI and 4D CT | n | Detector and nD |
Allaire et al. [5] | 3D CT, MRI, CBCT | 3 | Detector; accounts for tilt; 3D DoG |
Ni et al. [192] | 3D Ultrasound | 3 | Ultrasound specific noise filter and smoothing |
3.2 Volume features
3.3 Spatiotemporal features
3.3.1 STIP detectors
Method | Comments | Year |
---|---|---|
Harris3D [144] | First STIP detector | 2003 |
– | Limit camera motion detections | 2004 |
Cuboids [52] | More dense point detection | 2005 |
Bregonzio et al. [24] | Limit false detections and detect slow movements | 2009 |
Oikonomopoulos et al. [197] | Information based saliency | 2005 |
Wong et al. [311] | Use of local and global information | 2007 |
V-FAST [325] | Efficient computation | 2010 |
Chakraborty et al. [34] | Limit background detections | 2012 |
Li et al. [158] | Unified motion and appearance | 2018 |
3.3.2 STIP descriptors
3.3.3 3D space
3.3.4 Trajectories
Dataset | Data type | # Images | # Objects | # Object Cat. | 6DoF pose |
---|---|---|---|---|---|
PSB [247] | Polygonal surface geometry | – | 1814/6670 | 161/1271 | – |
ModelNet [314] | CAD | – | 151,128 | 660 | – |
ShapeNet [35] | CAD | – | 3M/220K | – /3135 | – |
ShapeNetCore [35] | CAD | – | 51,300 | 55 | – |
ShapeNetSem [35] | CAD | – | 12K | 270 | – |
YCB [31] | RGB-D | 600 | 75 | – | No |
Rutgers APC [216] | RGB-D | 10K | 24 | 24 | Yes |
SUD [41] | RGB-D | 23M | \(>\,10\)K | 44 | No |
4 Datasets and benchmarks
4.1 Object understanding
4.2 Scene understanding
Dataset (reference) | RGB-D video | Per-pixel annotation | traj. GT | RGB texture | # scenes | # layouts | # object classes | 3D Models avail. |
---|---|---|---|---|---|---|---|---|
B3DO [119] | No | Key frames | No | Real | 75 | – | \(>\,50\) | No |
NYUv2 [251] | Yes | Key frames | No | Real | 464 | 464 | 894 | No |
SUN 3D [316] | Yes | 3D point cloud \(+\) Video | No | Real | 254 | 415 | – | Yes |
SUN RGB-D [258] | No | Key frames | No | Real | – | – | \(\sim \,800\) | No |
sceneNN [113] | Yes | Video | Yes | Real | 100 | 100 | \(\ge \,63\) | Yes |
SceneNet [94] | No | Key frames | No | Non-photorealistic | 57 | 1000 | – | Yes |
SceneNet RGB-D [180] | Yes | Video | Yes | Photorealistic | 57 | 16,895 | 255 | Yes |
SUN-CG [259] | Yes | Video | Yes | Non-photorealistic | 45,622 | 45,622 | 84 | Yes |
ScanNet [47] | Yes | 3D \(+\) Video | ? | Real | 1513 | ? | \(\ge \,20\) | Yes |
4.3 Video understanding
Dataset | #Videos | #Clips | #Classes | Multi-label | Trimmed | Manually annotated |
---|---|---|---|---|---|---|
HMDB51 [141] | 3312 | 6766 | 51 | No | Yes | Yes |
UCF101 [261] | 2500 | 13,320 | 101 | No | Yes | Yes |
Sports 1M [129] | 1M | – | 487 | No | No | No |
ActivityNet [30] | 19,994 | 28,108 | 203 | No | Both | Yes |
FCVID [123] | 91,223 | 91,223 | 239 | No | No | Yes |
YFCC100M [280] | 0.8M | – | – | – | No | – |
YouTube-8M [1] | \(\sim \,8\)M | – | 4800 | Yes | No | No |
Kinetics [130] | 306,245 | 306,245 | 400 | No | Yes | Yes |
Okutama-Action [11] | 43 | 43 | 12 | Yes | Yes | Yes |
Something-Something [81] | 108,499 | 108,499 | 174 | No | Yes | Yes |
Moments in time [185] | 1M | 1M | 339 | No | Yes | Yes |
4.4 Other datasets
5 Research areas
5.1 Object classification and recognition
5.1.1 RGB-D object recognition
Method | Category | Instance |
---|---|---|
linSVM [142] | \(81.9 \pm 2.8\) | 73.9 |
kSVM [142] | \(83.8 \pm 3.5\) | 74.8 |
RF [142] | \(79.6 \pm 4\) | 73.1 |
KDE [20] | \(86.2 \pm 2.1\) | 84.5 |
HKDE [18] | \(84.1 \pm 2.2\) | 82.4 |
Upgraded HMP [21] | \(87.5 \pm 2.9\) | 92.8 |
CNN-RNN [257] | \(86.8 \pm 3.3\) | – |
Fus-CNN [57] | \(91.3 \pm 1.4\) | – |
5.1.2 3D object classification
Method | Type | MN10 | MN40 |
---|---|---|---|
3D ShapeNets [314] | 3D | 83.54 | 77.32 |
MV-CNN [266] | 2D proj. | – | 90.1 |
VoxNet [179] | 3D | 92.0 | 83.0 |
DeepPano [245] | 2D proj. | 88.66 | 82.54 |
MVCNN-MultiRes [207] | 2D proj. | – | 91.4 |
MO-AniProbing [207] | 3D | – | 89.9 |
ORION [241] | 3D | 93.9 | 89.4 |
FusionNet [100] | Both | 93.11 | 90.8 |
VRN [26] | 3D | 93.61 | 91.33 |
VRN-ensemble [26] | 3D | 97.14 | 95.54 |
Wang et al. [295] | 2D proj. | – | 93.8 |
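The volumetric ("3D") methods in the table above, such as VoxNet and ORION, consume a fixed-size binary occupancy grid rather than a raw point cloud or mesh. A minimal sketch of that voxelisation step, assuming an axis-aligned bounding cube and an illustrative grid resolution (the 32³ default mirrors common practice but is not mandated by any of the cited papers):

```python
import numpy as np

def voxelize(points, grid=32):
    """Convert an (N, 3) point cloud into a binary occupancy grid.

    Points are shifted and scaled to fit an axis-aligned cube of
    grid^3 cells; a cell is 1 if at least one point falls inside it.
    Assumes the cloud has nonzero extent along at least one axis.
    """
    points = np.asarray(points, dtype=float)
    lo = points.min(axis=0)
    extent = np.ptp(points, axis=0).max()   # largest side; preserves aspect ratio
    idx = ((points - lo) / extent * (grid - 1)).astype(int)
    vol = np.zeros((grid, grid, grid), dtype=np.uint8)
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return vol
```

In practice the published networks also handle orientation (ORION explicitly predicts it) and may use density or TSDF values instead of binary occupancy; this sketch shows only the simplest variant.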
Method | Year | Shallow/deep | NYUv1 pixacc | NYUv2 (4 cl.) pixacc | NYUv2 (4 cl.) clacc | NYUv2 (40 cl.) fwavacc | NYUv2 (40 cl.) avacc | NYUv2 (40 cl.) pixacc | NYUv2 (40 cl.) clacc |
---|---|---|---|---|---|---|---|---|---|
SIFT+MRF [250] | 2011 | Shallow | \(56.6 \pm 2.9\) | – | – | – | – | – | – |
Silberman et al. [251] | 2012 | Shallow | – | 58.6 | – | – | – | – | – |
KDES [215] | 2012 | Shallow | *\(76.1 \pm 0.9\) | – | – | – | – | – | – |
Gupta et al. [91] | 2013 | Shallow | – | – | – | 45.1 | 26.1 | 57.9 | *28.4 |
Hermans et al. [102] | 2014 | Shallow | 59.5 | 69.0 | – | – | – | – | – |
– | 2014 | Shallow | – | *72.3 | *71.9 | – | – | – | – |
Khan et al. [133] | 2014 | Shallow | – | 69.2 | 65.6 | – | – | – | – |
Gupta et al. [90] | 2015 | Shallow | – | – | – | 45.9 | 26.8 | 58.3 | – |
Deng et al. [51] | 2015 | Shallow | – | – | – | *48.5 | *31.5 | *63.8 | – |
Stückler et al. [265] | 2015 | Shallow | – | 70.9 | 67.0 | – | – | – | – |
Couprie et al. [46] | 2013 | Deep | – | 64.5 | 63.5 | – | – | – | – |
R-CNN [92] | 2014 | Deep | – | – | – | 47.0 | 28.6 | 60.3 | 35.1 |
FCN [167] | 2015 | Deep | – | – | – | 49.5 | 34.0 | 65.4 | 46.1 |
Eigen and Fergus [56] | 2015 | Deep | – | 83.2 | – | 51.4 | 34.1 | 65.6 | 45.1 |
Wang et al. [303] | 2016 | Deep | 78.8 | – | 74.7 | – | – | – | 47.3 |
RDF-152 [202] | 2017 | Deep | – | – | – | – | 50.1 | 76.0 | 62.8 |
3DGNN [208] | 2017 | Deep | – | – | – | – | 43.1 | – | 59.5 |
5.2 Semantic segmentation
5.3 Human action classification
5.3.1 Traditional methods
5.3.2 Deep learning
Method | Year | +IDT | RGB | Flow | UCF-101 | HMDB-51 |
---|---|---|---|---|---|---|
IDT [300] | 2013 | – | – | – | 86.4 | 61.7 |
Two-Stream [253] | 2014 | No | Yes | Yes | 88.0 | 59.4 |
Karpathy et al. [129], Sports 1M pre-train | 2014 | No | Yes | No | 65.2 | – |
TDD [304] | 2015 | No | Yes | Yes | 90.3 | 63.2 |
C3D ensemble [284], Sports 1M pre-train | 2015 | No | Yes | No | 85.2 | – |
Very deep two-stream [305] | 2015 | No | Yes | Yes | 91.4 | – |
Two-stream fusion [65] | 2016 | No | Yes | Yes | 92.5 | 65.4 |
LTC [289], Kinetics pre-train | 2017 | No | Yes | Yes | 91.7 | 64.8 |
Two-stream I3D [33], Kinetics pre-train | 2017 | No | Yes | Yes | 97.9 | 80.2 |
R(2+1)D [285], Kinetics+Sports 1M pre-train | 2018 | No | Yes | Yes | 97.3 | 78.7 |
TDD \(+\) IDT [304] | 2015 | Yes | Yes | Yes | 91.5 | 65.9 |
C3D ensemble \(+\) IDT [284], Sports 1M pre-train | 2015 | Yes | Yes | No | 90.1 | – |
Dynamic Image Networks \(+\) IDT [16] | 2016 | Yes | Yes | No | 89.1 | 65.2 |
Two-stream fusion \(+\) IDT [65] | 2016 | Yes | Yes | Yes | 93.5 | 69.2 |
LTC \(+\) IDT [289], Kinetics pre-train | 2017 | Yes | Yes | Yes | 92.7 | 67.2 |
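The two-stream methods in the table combine an appearance (RGB) stream and a motion (optical flow) stream, in the simplest case by late fusion of per-class scores; the "+IDT" rows fuse a third, hand-crafted score in the same fashion. A minimal sketch of weighted late fusion in NumPy (the weights are illustrative defaults, not values taken from the cited papers):

```python
import numpy as np

def late_fusion(rgb_scores, flow_scores, w_rgb=0.5, w_flow=0.5):
    """Weighted average of per-class scores from the two streams.

    rgb_scores, flow_scores : arrays of shape (..., n_classes),
    e.g. softmax outputs averaged over a clip's sampled frames.
    Returns the predicted class index (per clip, if batched).
    """
    fused = w_rgb * np.asarray(rgb_scores) + w_flow * np.asarray(flow_scores)
    return fused.argmax(axis=-1)
```

More elaborate variants (e.g. two-stream fusion [65]) fuse mid-network feature maps instead of final scores, which the table shows to be consistently stronger.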
5.4 Other areas
6 Discussion
6.1 Major challenges
- Deep learning on high-dimensional data is very computationally and memory intensive, limiting the capabilities of the applied approaches.
- Deep learning approaches lack invariance to many transformations, such as scale and rotation, which are usually handled by very computationally expensive workarounds.
- Many competing strategies exist for handling high-dimensional data, and it is still not clear which approaches are better suited to which types of data and, more importantly, why.
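The memory cost in the first challenge is easy to see with a back-of-the-envelope calculation: the activations of a single 3D convolution layer scale linearly with the temporal dimension, so a 16-frame clip costs 16 times the activation memory of a single frame at the same spatial resolution. A quick sketch with hypothetical layer sizes (batch 32, 64 channels, 112x112 spatial maps; none of these numbers come from a specific cited network):

```python
def conv3d_activation_bytes(batch, channels, frames, height, width, bytes_per_val=4):
    """Bytes needed to store one float32 activation tensor of shape
    (batch, channels, frames, height, width)."""
    return batch * channels * frames * height * width * bytes_per_val

# A 2D feature map for one frame vs. a 3D feature map over a 16-frame clip:
per_frame = conv3d_activation_bytes(32, 64, 1, 112, 112)    # ~98 MiB
per_clip = conv3d_activation_bytes(32, 64, 16, 112, 112)    # ~1.5 GiB
ratio = per_clip // per_frame                               # grows with clip length
```

This is one activation tensor of one layer; a full network stores many such tensors for backpropagation, which is why 3D and multi-modal networks are typically trained with short clips, small batches, or reduced resolution.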