2018 | OriginalPaper | Chapter

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

Authors : Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, Kevin Murphy

Published in: Computer Vision – ECCV 2018

Publisher: Springer International Publishing

Abstract

Despite the steady progress in video analysis driven by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification. Three main challenges remain: spatial (image) feature representation, temporal information representation, and model/computation complexity. It was recently shown by Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained on ImageNet, could be a promising way for spatial and temporal representation learning. However, in terms of model/computation complexity, 3D CNNs are much more expensive than 2D CNNs and prone to overfitting. We seek a balance between speed and accuracy by building an effective and efficient video classification system through a systematic exploration of critical network design choices. In particular, we show that it is possible to replace many of the 3D convolutions with low-cost 2D convolutions. Rather surprisingly, the best result (in both speed and accuracy) is achieved when the 3D convolutions at the bottom of the network are replaced, suggesting that temporal representation learning on high-level “semantic” features is more useful. Our conclusion generalizes to datasets with very different properties. When combined with several other cost-effective designs, including separable spatial/temporal convolution and feature gating, our system produces very competitive results on several action classification benchmarks (Kinetics, Something-something, UCF101 and HMDB), as well as two action detection (localization) benchmarks (JHMDB and UCF101-24).
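
The abstract names two cost-effective designs, separable spatial/temporal convolution and feature gating. The sketch below is a minimal, illustrative PyTorch rendering of both ideas, not the authors' implementation; the module names, channel widths, and the exact form of the gating function are assumptions made for this example.

import torch
import torch.nn as nn


class SepConv3d(nn.Module):
    """Replace a full k x k x k 3D convolution with a 1 x k x k spatial conv followed by a k x 1 x 1 temporal conv."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        p = k // 2
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, k, k), padding=(0, p, p), bias=False)
        self.temporal = nn.Conv3d(out_ch, out_ch, (k, 1, 1), padding=(p, 0, 0), bias=False)
        self.bn1 = nn.BatchNorm3d(out_ch)
        self.bn2 = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                      # x: (N, C, T, H, W)
        x = self.relu(self.bn1(self.spatial(x)))
        return self.relu(self.bn2(self.temporal(x)))


class FeatureGating(nn.Module):
    """Channel-wise self-gating: gates derived from globally pooled spatiotemporal features."""

    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):                      # x: (N, C, T, H, W)
        w = x.mean(dim=(2, 3, 4))              # global spatiotemporal average pool -> (N, C)
        w = torch.sigmoid(self.fc(w))          # per-channel gates in (0, 1)
        return x * w.view(x.size(0), -1, 1, 1, 1)


if __name__ == "__main__":
    clip = torch.randn(2, 64, 8, 56, 56)       # batch of 8-frame feature maps
    block = nn.Sequential(SepConv3d(64, 128), FeatureGating(128))
    print(block(clip).shape)                   # torch.Size([2, 128, 8, 56, 56])

Factorizing a k x k x k kernel into a spatial plus a temporal kernel cuts the weight count per channel pair roughly from k^3 to k^2 + k (27 vs. 12 for k = 3), which is the source of the speed-accuracy trade-off discussed above.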

Footnotes
1
The original “Mini-Kinetics” dataset used in [6] contains videos that are no longer available. We created the new Mini-Kinetics-200 dataset in collaboration with the original authors.
 
2
To reduce the memory and time requirements, and to keep the training protocol identical to I3D (in terms of the number of clips we use for training in each batch, etc.), we retain two max-pooling layers with temporal stride 2 between Inception modules. Hence, strictly speaking, I2D is not a pure 2D model. However, it is very similar to a single-frame 2D classification model.
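
A minimal sketch, assuming a (2, 1, 1) pooling kernel, of the temporal-stride-2 max-pooling mentioned in this footnote: it halves the number of frames between Inception modules while leaving the spatial resolution untouched, which is what keeps memory and time costs comparable to I3D.

import torch
import torch.nn as nn

# Assumed kernel and stride; the point is downsampling along time only.
temporal_pool = nn.MaxPool3d(kernel_size=(2, 1, 1), stride=(2, 1, 1))

x = torch.randn(1, 192, 32, 28, 28)    # (N, C, T, H, W): 32 frames in
print(temporal_pool(x).shape)          # torch.Size([1, 192, 16, 28, 28]): 16 frames out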
 
3
There are 4 branches in an Inception block, but only two of them have 3x3 convolutions (the other two being pointwise 1x1 convolutions), as shown in Fig. 3. As such, when I3D inflates the convolutions to 3D, only some of the features contain temporal information. However, by using separable temporal convolution, we can add temporal information to all 4 branches. This improves the performance from \(78.4\%\) to \(78.9\%\) on Mini-Kinetics-200. In the following sections, whenever we refer to an S3D model, we mean S3D with this configuration.
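
An illustrative sketch of this footnote's point: a simplified Inception-style block in which a separable k x 1 x 1 temporal convolution is appended to every branch, including the pointwise and pooling branches. The branch widths and layer names are assumptions for the example, not the published S3D configuration.

import torch
import torch.nn as nn


def temporal_conv(ch, k=3):
    # k x 1 x 1 convolution: mixes information across frames only
    return nn.Conv3d(ch, ch, (k, 1, 1), padding=(k // 2, 0, 0), bias=False)


class SepInceptionBlockSketch(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv3d(in_ch, 64, 1), temporal_conv(64))
        self.b2 = nn.Sequential(nn.Conv3d(in_ch, 96, 1),
                                nn.Conv3d(96, 128, (1, 3, 3), padding=(0, 1, 1)),
                                temporal_conv(128))
        self.b3 = nn.Sequential(nn.Conv3d(in_ch, 16, 1),
                                nn.Conv3d(16, 32, (1, 3, 3), padding=(0, 1, 1)),
                                temporal_conv(32))
        self.b4 = nn.Sequential(nn.MaxPool3d((1, 3, 3), stride=1, padding=(0, 1, 1)),
                                nn.Conv3d(in_ch, 32, 1), temporal_conv(32))

    def forward(self, x):  # x: (N, C, T, H, W)
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)


x = torch.randn(1, 192, 8, 28, 28)
print(SepInceptionBlockSketch(192)(x).shape)  # torch.Size([1, 256, 8, 28, 28])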
 
4
The labels are as follows. 0: Dropping [something], 1: Moving [something] from right to left, 2: Moving [something] from left to right, 3: Picking [something], 4: Putting [something], 5: Poking [something], 6: Tearing [something], 7: Pouring [something], 8: Holding [something], 9: Showing [something].
 
Literature
1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
2. Szegedy, C., et al.: Going deeper with convolutions. In: CVPR (2015)
3. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
5. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR (2014)
6. Kay, W., et al.: The Kinetics human action video dataset. In: CVPR (2017)
7. Goyal, R., et al.: The something something video database for learning and evaluating visual common sense. In: ICCV (2017)
8. Caba Heilbron, F., Escorcia, V., Ghanem, B., Niebles, J.C.: ActivityNet: a large-scale video benchmark for human activity understanding. In: CVPR (2015)
9. Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: crowdsourcing data collection for activity understanding. In: ECCV (2016)
10. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: CVPR (2017)
11. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: CVPR (2018)
12. Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: CVPR (2017)
13.
14. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
15. Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3D residual networks. In: ICCV (2017)
16. Sun, L., Jia, K., Yeung, D.Y., Shi, B.E.: Human action recognition using factorized spatio-temporal convolutional networks. In: ICCV (2015)
17. Girdhar, R., Ramanan, D.: Attentional pooling for action recognition. In: NIPS (2017)
18. Dauphin, Y.N., Fan, A., Auli, M., Grangier, D.: Language modeling with gated convolutional networks. In: ICML (2017)
19. Perez, E., Strub, F., de Vries, H., Dumoulin, V., Courville, A.: FiLM: visual reasoning with a general conditioning layer. In: AAAI (2018)
20. Elfwing, S., Uchibe, E., Doya, K.: Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw. (2018)
22. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: CVPR (2018)
24. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS (2014)
25. Bilen, H., Fernando, B., Gavves, E., Vedaldi, A., Gould, S.: Dynamic image networks for action recognition. In: CVPR (2016)
26. Bilen, H., Fernando, B., Gavves, E., Vedaldi, A.: Action recognition with dynamic image networks. IEEE PAMI (2017)
27. Feichtenhofer, C., Pinz, A., Wildes, R.P.: Temporal residual networks for dynamic scene recognition. In: CVPR (2017)
29. Ng, J.Y., Hausknecht, M.J., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: deep networks for video classification. In: CVPR (2015)
30. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: CVPR (2016)
31. Feichtenhofer, C., Pinz, A., Wildes, R.: Spatiotemporal multiplier networks for video action recognition. In: CVPR (2017)
32. Feichtenhofer, C., Pinz, A., Wildes, R.P.: Spatiotemporal residual networks for video action recognition. In: NIPS (2016)
33. Zolfaghari, M., Oliveira, G.L., Sedaghat, N., Brox, T.: Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In: ICCV (2017)
34. Donahue, J., et al.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR (2015)
35. Wang, X., Farhadi, A., Gupta, A.: Actions transformations. In: CVPR (2016)
36. Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: towards good practices for deep action recognition. In: ECCV (2016)
37. Pickup, L.C., et al.: Seeing the arrow of time. In: CVPR (2014)
38. Maaten, L.V.D., Hinton, G.: Visualizing data using t-SNE. JMLR (2008)
40. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime TV-L1 optical flow. Pattern Recognit. (2007)
41. Bian, Y., et al.: Revisiting the effectiveness of off-the-shelf temporal modeling approaches for large-scale video classification (2017). arXiv:1708.03805
42. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR (2018)
43. Wang, L., Li, W., Li, W., Gool, L.V.: Appearance-and-relation networks for video classification. In: CVPR (2018)
44. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: ICCV (2011)
45. Soomro, K., Zamir, A., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. Technical Report CRCV-TR-12-01 (2012)
46.
47. Tran, D., Ray, J., Shou, Z., Chang, S., Paluri, M.: ConvNet architecture search for spatiotemporal feature learning (2017). arXiv:1708.05038
49. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)
50. Gu, C., et al.: AVA: a video dataset of spatio-temporally localized atomic visual actions. In: CVPR (2018)
51. Jhuang, H., Gall, J., Zuffi, S., Schmid, C., Black, M.: Towards understanding action recognition. In: ICCV (2013)
52. Saha, S., Singh, G., Cuzzolin, F.: AMTnet: action-micro-tube regression by end-to-end trainable deep architecture. In: ICCV (2017)
53. Gkioxari, G., Malik, J.: Finding action tubes. In: CVPR (2015)
54. Huang, J., et al.: Speed/accuracy trade-offs for modern convolutional object detectors. In: CVPR (2017)
55. Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Learning to track for spatio-temporal action localization. In: ICCV (2015)
56. Kalogeiton, V., Weinzaepfel, P., Ferrari, V., Schmid, C.: Action tubelet detector for spatio-temporal action localization. In: ICCV (2017)
Metadata
Title
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
Authors
Saining Xie
Chen Sun
Jonathan Huang
Zhuowen Tu
Kevin Murphy
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-030-01267-0_19
