2016 | OriginalPaper | Chapter

Leaving Some Stones Unturned: Dynamic Feature Prioritization for Activity Detection in Streaming Video

Authors : Yu-Chuan Su, Kristen Grauman

Published in: Computer Vision – ECCV 2016

Publisher: Springer International Publishing


Abstract

Current approaches for activity recognition often ignore constraints on computational resources: (1) they rely on extensive feature computation to obtain rich descriptors on all frames, and (2) they assume batch-mode access to the entire test video at once. We propose a new active approach to activity recognition that prioritizes “what to compute when” in order to make timely predictions. The main idea is to learn a policy that dynamically schedules the sequence of features to compute on selected frames of a given test video. In contrast to traditional static feature selection, our approach continually re-prioritizes computation based on the accumulated history of observations and accounts for the transience of those observations in ongoing video. We develop variants to handle both the batch and streaming settings. On two challenging datasets, our method provides significantly better accuracy than alternative techniques for a wide range of computational budgets.
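The scheduling idea in the abstract can be sketched as a budgeted loop: at each step a policy inspects the history of observations so far, picks the next (frame, feature) action, and the classifier predicts from the partial evidence collected. The sketch below is purely illustrative; `random_policy`, `mean_intensity`, and `threshold_classifier` are hypothetical stand-ins, not the authors' learned policy or features.

```python
import random

def detect_activity(video_frames, policy, classifier, budget):
    """Budgeted, policy-driven feature scheduling (toy sketch).

    `policy`, `classifier`, and the feature extractors are stand-ins
    for the learned components described in the paper.
    """
    history = []  # accumulated (action, observation) pairs
    for _ in range(budget):
        # The policy decides "what to compute when" from the history so far.
        frame_idx, feature_fn = policy(history, len(video_frames))
        obs = feature_fn(video_frames[frame_idx])
        history.append(((frame_idx, feature_fn.__name__), obs))
    # Predict from whatever partial evidence the budget allowed.
    return classifier(history)

# Illustrative stand-in components:
def mean_intensity(frame):
    """A deliberately cheap 'feature': mean pixel value."""
    return sum(frame) / len(frame)

def random_policy(history, n_frames):
    """Placeholder for the learned policy: pick a random frame."""
    return random.randrange(n_frames), mean_intensity

def threshold_classifier(history):
    """Placeholder classifier over the accumulated observations."""
    score = sum(obs for _, obs in history) / max(len(history), 1)
    return "active" if score > 0.5 else "idle"

video = [[0.9, 0.8], [0.7, 0.6], [0.1, 0.2]]
label = detect_activity(video, random_policy, threshold_classifier, budget=2)
```

In the paper the random stand-in is replaced by a policy trained to re-prioritize computation as observations accumulate, which is what yields the accuracy-per-budget gains reported.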


Footnotes
1
For clarity of presentation, in the following we present our method assuming a trimmed input video; Sect. 3.2 explains adjustments for untrimmed inputs.
 
2
Note that \(a^{(k)}\) identifies an action selected at step k in the episode, whereas \(a_{m}\) is one of the M discrete action choices in \(\mathcal {A}\).
 
3
Some object detectors share features across object categories, e.g., R-CNN [53], in which case it may be practical to simplify the action to select only the video volume and apply all object classes. We use the DPM detector [54], which has the advantage of near real-time detection [42] using a single thread, whereas R-CNN relies heavily on parallel computation and hardware acceleration [55].
 
4
Following [4], we retain the 75 objects, out of all 15,000, found most responsive to the activities. Because the provided detections are frame-level, we split volumes only in the temporal dimension for \(l_m\) on UCF.
 
5
Object-Pref [4] is not applicable in the streaming case because it lacks an object response prior for the actions that depends on the buffer location.
 
6
ADL is less amenable to full-frame CNN descriptors, due to domain shift of egocentric video and the nature of the composite, object-driven activities.
 
Literature
1.
Aggarwal, J.K., Ryoo, M.S.: Human activity analysis: a review. ACM Comput. Surv. 43(3), April 2011
2.
Wang, H., Schmid, C.: Action recognition with improved trajectories. In: ICCV (2013)
3.
Pirsiavash, H., Ramanan, D.: Detecting activities of daily living in first-person camera views. In: CVPR (2012)
4.
Jain, M., van Gemert, J.C., Snoek, C.G.M.: What do 15,000 object categories tell us about classifying and localizing actions? In: CVPR (2015)
5.
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS (2014)
6.
Xu, Z., Yang, Y., Hauptman, A.: A discriminative CNN video representation for event detection. In: CVPR (2015)
7.
Zha, S., Luisier, F., Andrews, W., Srivastava, N., Salakhutdinov, R.: Exploiting image-trained CNN architectures for unconstrained video classification. In: BMVC (2015)
8.
Han, D., Bo, L., Sminchisescu, C.: Selection and context for action recognition. In: ICCV (2009)
9.
Butko, N., Movellan, J.: Optimal scanning for faster object detection. In: CVPR (2009)
10.
Vijayanarasimhan, S., Kapoor, A.: Visual recognition and detection under bounded computational resources. In: CVPR (2010)
11.
Karayev, S., Baumgartner, T., Fritz, M., Darrell, T.: Timely object recognition. In: NIPS (2012)
12.
Dulac-Arnold, G., Denoyer, L., Thome, N., Cord, M.: Sequentially generated instance-dependent image representations for classification. In: ICLR (2014)
13.
Karayev, S., Fritz, M., Darrell, T.: Anytime recognition of objects and scenes. In: CVPR (2014)
14.
Gonzalez-Garcia, A., Vezhnevets, A., Ferrari, V.: An active search strategy for efficient object class detection. In: CVPR (2015)
15.
Gao, T., Koller, D.: Active classification based on value of classifier. In: NIPS (2011)
16.
Yu, X., Fermuller, C., Teo, C.L., Yang, Y., Aloimonos, Y.: Active scene recognition with vision and language. In: CVPR (2011)
17.
Yeung, S., Russakovsky, O., Mori, G., Fei-Fei, L.: End-to-end learning of action detection from frame glimpses in videos. In: CVPR (2016)
18.
Hoai, M., la Torre, F.D.: Max-margin early event detectors. In: CVPR (2012)
19.
Ryoo, M.: Human activity prediction: early recognition of ongoing activities from streaming videos. In: ICCV (2011)
20.
Davis, J., Tyagi, A.: Minimal-latency human action recognition using reliable-inference. Image Vis. Comput. 24, 455–472 (2006)
21.
Yao, B., Jiang, X., Khosla, A., Lin, A., Guibas, L., Fei-Fei, L.: Human action recognition by learning bases of action attributes and parts. In: ICCV (2011)
22.
Rohrbach, M., Regneri, M., Andriluka, M., Amin, S., Pinkal, M., Schiele, B.: Script data for attribute-based recognition of composite activities. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7572, pp. 144–157. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33718-5_11
23.
Li, Y., Ye, Z., Rehg, J.M.: Delving into egocentric actions. In: CVPR (2015)
24.
Ma, M., Fan, H., Kitani, K.M.: Going deeper into first-person activity recognition (2016)
26.
Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: ActivityNet: a large-scale video benchmark for human activity understanding. In: CVPR (2015)
27.
Ke, Y., Sukthankar, R., Hebert, M.: Efficient visual event detection using volumetric features. In: ICCV (2005)
28.
Duchenne, O., Laptev, I., Sivic, J., Bach, F., Ponce, J.: Automatic annotation of human actions in video. In: ICCV (2009)
29.
30.
Medioni, G., Nevatia, R., Cohen, I.: Event detection and analysis from video streams. Trans. Pattern Anal. Mach. Intell. 23(8), 873–889 (2001)
31.
Yao, A., Gall, J., van Gool, L.: A hough transform-based voting framework for action recognition. In: CVPR (2010)
32.
Kläser, A., Marszałek, M., Schmid, C., Zisserman, A.: Human focused action localization in video. In: International Workshop on Sign, Gesture, Activity (2010)
33.
Lan, T., Wang, Y., Mori, G.: Discriminative figure-centric models for joint action localization and recognition. In: ICCV (2011)
34.
Yu, G., Yuan, J.: Fast action proposals for human action detection and search. In: CVPR (2015)
35.
Jain, M., van Gemert, J., Jegou, H., Bouthemy, P., Snoek, C.: Action localization with tubelets from motion. In: CVPR (2015)
36.
Gkioxari, G., Malik, J.: Finding action tubes. In: CVPR (2015)
37.
van Gemert, J.C., Jain, M., Gati, E., Snoek, C.G.M.: APT: action localization proposals from dense trajectories. In: BMVC (2015)
38.
Chen, C.Y., Grauman, K.: Efficient activity detection with max-subgraph search. In: CVPR (2012)
39.
Yu, G., Yuan, J., Liu, Z.: Unsupervised random forest indexing for fast action search. In: CVPR (2011)
40.
Su, Y.C., Grauman, K.: Leaving some stones unturned: dynamic feature prioritization for activity detection in streaming video. Technical report, Comp. Sci. Dept., Univ. Texas, Austin, AI-15-05 (2015)
41.
Sadeghi, M.A., Forsyth, D.: 30Hz object detection with DPM V5. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 65–79. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10590-1_5
42.
Yan, J., Lei, Z., Wen, L., Li, S.: The fastest deformable part model for object detection. In: CVPR (2014)
43.
Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: CVPR (2001)
44.
Pedersoli, M., Vedaldi, A., Gonzalez, J.: A coarse-to-fine approach for fast deformable object detection. In: CVPR (2011)
45.
Weiss, D.J., Taskar, B.: Learning adaptive value of information for structured prediction. In: NIPS (2013)
46.
Wang, J., Bolukbasi, T., Trapeznikov, K., Saligrama, V.: Model selection by linear programming. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8690, pp. 647–662. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10605-2_42
47.
Karasev, V., Ravichandran, A., Soatto, S.: Active frame, location, and detector selection for automated and manual video annotation. In: CVPR (2014)
48.
Chen, D., Bilgic, M., Getoor, L., Jacobs, D.: Dynamic processing allocation in video. IEEE Trans. Pattern Anal. Mach. Intell. 33(11), 2174–2187 (2011)
49.
Amer, M.R., Xie, D., Zhao, M., Todorovic, S., Zhu, S.-C.: Cost-sensitive top-down/bottom-up inference for multiscale activity recognition. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 187–200. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33765-9_14
50.
Amer, M., Todorovic, S., Fern, A., Zhu, S.C.: Monte Carlo tree search for scheduling activity recognition. In: ICCV (2013)
51.
Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Pearson, Upper Saddle River (2010)
52.
Gupta, A., Davis, L.S.: Objects in action: an approach for combining action understanding and object perception. In: CVPR (2007)
53.
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)
54.
Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part based models. PAMI 32(9), 1627–1645 (2010)
55.
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)
56.
Nvidia: GPU-based deep learning inference: a performance and power analysis. Whitepaper (2015)
57.
Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
58.
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
60.
Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P.: Block-coordinate Frank-Wolfe optimization for structural SVMs. In: ICML (2013)
61.
McCandless, T., Grauman, K.: Object-centric spatio-temporal pyramids for egocentric activity recognition. In: BMVC (2013)
Metadata
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-46478-7_48