2016 | Original Paper | Book Chapter

Detecting Engagement in Egocentric Video

Authors: Yu-Chuan Su, Kristen Grauman

Published in: Computer Vision – ECCV 2016

Publisher: Springer International Publishing

Abstract

In a wearable camera video, we see what the camera wearer sees. While this makes it easy to know roughly what he chose to look at, it does not immediately reveal when he was engaged with his environment. Specifically, at what moments did his focus linger, as he paused to gather more information about something he saw? Knowing this answer would benefit various applications in video summarization and augmented reality, yet prior work focuses solely on the “what” question (estimating saliency, gaze) without considering the “when” (engagement). We propose a learning-based approach that uses long-term egomotion cues to detect engagement, specifically in browsing scenarios where one frequently takes in new visual information (e.g., shopping, touring). We introduce a large, richly annotated dataset for ego-engagement that is the first of its kind. Our approach outperforms a wide array of existing methods. We show engagement can be detected well independent of both scene appearance and the camera wearer’s identity.
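The abstract describes the approach only at a high level: long-term egomotion cues feed a learned frame-level engagement detector. Purely as an illustrative sketch of that idea, and not the authors’ actual pipeline, the snippet below assumes dense optical flow has already been computed for each frame, summarizes it as magnitude-weighted direction histograms, stacks histograms over a long temporal window, and trains a scikit-learn logistic-regression classifier on binary per-frame engagement labels. Every function name, the window length, and the choice of classifier are assumptions made here for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def motion_histogram(flow, bins=8):
    """Summarize one frame's dense optical flow (H x W x 2 array of
    per-pixel displacements) as a magnitude-weighted direction histogram."""
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx)  # flow direction in [-pi, pi]
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)


def long_term_features(frame_hists, window=31):
    """Stack each frame's motion histogram with its temporal neighbors so
    the classifier sees egomotion over a long window, not a single frame."""
    half = window // 2
    padded = np.pad(frame_hists, ((half, half), (0, 0)), mode="edge")
    return np.stack([padded[t:t + window].ravel()
                     for t in range(len(frame_hists))])


def train_engagement_detector(flows, labels, window=31):
    """flows: list of per-frame optical-flow fields for the training videos;
    labels: 0/1 engagement annotation per frame. Returns a frame-level
    classifier (a stand-in for whatever model the paper actually uses)."""
    hists = np.stack([motion_histogram(f) for f in flows])
    X = long_term_features(hists, window)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    return clf
```

At test time, the same features would be computed for a new video and the classifier’s per-frame probabilities could be thresholded into engagement intervals; the paper’s actual features, model, and post-processing may differ.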


Footnotes
1
Throughout, we will use the term “recorder” to refer to the photographer or the first-person camera-wearer; we use the term “viewer” to refer to a third party who is observing the data captured by some other recorder.
 
2
For a portion of the video, we also ask the original recorders to label all frames for their own video; this requires substantial tedious effort, hence to get the full labeled set in a scalable manner we apply crowdsourcing.
 
3
We randomly generate interval predictions 10 times based on the prior of interval length and temporal distribution and report the average.
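As a hedged sketch of how such a random baseline could be implemented, the snippet below assumes the interval-length prior and the temporal (start-position) prior are given as discrete probability vectors and that the number of intervals per trial is fixed; these inputs, names, and the per-trial masks to be averaged downstream are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def random_interval_baseline(video_len, length_prior, start_prior,
                             n_intervals, trials=10, seed=0):
    """Sample engagement intervals from empirical priors over interval
    length (in frames) and start position, repeated over several trials.
    Returns one boolean frame mask per trial; a metric computed on each
    mask would then be averaged over the trials."""
    rng = np.random.default_rng(seed)
    masks = []
    for _ in range(trials):
        mask = np.zeros(video_len, dtype=bool)
        for _ in range(n_intervals):
            # length_prior[k] = P(interval is k + 1 frames long)
            length = rng.choice(len(length_prior), p=length_prior) + 1
            # start_prior[t] = P(interval starts at frame t), len == video_len
            start = rng.choice(video_len, p=start_prior)
            # Slicing truncates automatically at the end of the video.
            mask[start:start + length] = True
        masks.append(mask)
    return masks
```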
 
Metadata
Title
Detecting Engagement in Egocentric Video
Authors
Yu-Chuan Su
Kristen Grauman
Copyright year
2016
DOI
https://doi.org/10.1007/978-3-319-46454-1_28