2017 | Original Paper | Book Chapter

7. Two-Stream CNNs for Gesture-Based Verification and Identification: Learning User Style

Authors: Jonathan Wu, Jiawei Chen, Prakash Ishwar, Janusz Konrad

Published in: Deep Learning for Biometrics

Publisher: Springer International Publishing

Abstract

A gesture is a short body motion that contains both static (nonrenewable) anatomical information and dynamic (renewable) behavioral information. Unlike traditional biometrics such as face, fingerprint, and iris, which cannot be easily changed, gestures can be modified if compromised. We consider two types of gestures: full-body gestures, such as a wave of the arms, and hand gestures, such as a subtle curl of the fingers and palm, as captured by a depth sensor (Kinect v1 and v2 in our case). Most prior work in this area evaluates gestures in the context of a “password,” where each user has a single, chosen gesture motion. In contrast, we aim to learn a user’s gesture “style” from a set of training gestures. This affords user convenience, since an exact motion need not be reproduced for recognition. To learn gesture style, we use two-stream convolutional neural networks, a deep learning framework that leverages both the spatial (depth) and temporal (optical flow) information of a video sequence. First, we evaluate how well our approach generalizes at test time to gestures from users not seen during training. Then, we study the importance of dynamics by suppressing the use of dynamic information in training and testing. Finally, we assess the capacity of these techniques to learn representations of gestures that are invariant across users (gesture recognition), or representations of users that are invariant across gestures (user style in verification and identification), by visualizing the two-dimensional t-Distributed Stochastic Neighbor Embedding (t-SNE) of neural network features. We find that our approach outperforms state-of-the-art methods in identification and verification on two biometrics-oriented gesture datasets for full-body and in-air hand gestures.
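The two-stream design pairs a spatial stream, which sees individual depth frames, with a temporal stream, which sees stacks of optical-flow fields, and fuses their outputs. The sketch below is illustrative only, not the chapter's exact architecture: the tiny CNN trunk, layer sizes, flow-stack length, and averaging fusion are all assumptions, written in PyTorch for concreteness.

```python
import torch
import torch.nn as nn

class Stream(nn.Module):
    """A small CNN trunk standing in for the full network of one stream."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),   # global pooling -> 64-dim feature
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

class TwoStreamNet(nn.Module):
    """Spatial stream: one depth frame (1 channel). Temporal stream: a stack
    of optical-flow fields (x and y components for each of `flow_frames`)."""
    def __init__(self, num_classes, flow_frames=10):
        super().__init__()
        self.spatial = Stream(in_channels=1, num_classes=num_classes)
        self.temporal = Stream(in_channels=2 * flow_frames, num_classes=num_classes)

    def forward(self, depth, flow):
        # Score-level fusion: average the per-class scores of the two streams.
        return 0.5 * (self.spatial(depth) + self.temporal(flow))

# Toy forward pass: a batch of 4 gesture samples, 40 enrolled users.
net = TwoStreamNet(num_classes=40)
scores = net(torch.randn(4, 1, 224, 224), torch.randn(4, 20, 224, 224))
print(scores.shape)  # torch.Size([4, 40])
```

For the t-SNE visualizations, penultimate-layer features would be embedded in two dimensions; a minimal scikit-learn sketch, where `feats` (one row of network activations per gesture sample) is an assumed input:

```python
from sklearn.manifold import TSNE

# feats: (n_samples, n_features) NumPy array of penultimate-layer activations.
emb2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)
```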

Footnotes
1
Verification is also called authentication. (A minimal sketch contrasting verification with identification follows these footnotes.)
 
2
Of the five gesture classes in BodyLogin, four are shared across users and one is user-defined. Consequently, in leave-persons-out gesture recognition, the fifth gesture class has no training samples of its type and is expected to act as a “reject”/“not gestures 1–4” category.
 
3
Because MSRAction3D is a gesture-centric dataset with few samples per user, we report neither verification results nor leave-gesture-out identification experiments. (A sketch of the evaluation splits follows these footnotes.)
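To make footnote 1's distinction concrete: identification asks which enrolled user produced a gesture (a closed-set argmax over match scores), while verification (authentication) asks whether a gesture matches a claimed identity (one score against a threshold). The minimal sketch below uses cosine similarity between feature vectors purely as an illustrative matcher; `gallery`, `probe`, and the threshold value are assumptions, not the chapter's method.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(probe, gallery):
    """Closed-set identification: return the enrolled user with the best score."""
    return max(gallery, key=lambda user: cosine(probe, gallery[user]))

def verify(probe, claimed_user, gallery, threshold=0.8):
    """Verification (authentication): accept iff the claimed user's score clears the threshold."""
    return cosine(probe, gallery[claimed_user]) >= threshold

# Toy usage with random 64-dimensional features for three enrolled users.
rng = np.random.default_rng(0)
gallery = {u: rng.standard_normal(64) for u in ("alice", "bob", "carol")}
probe = gallery["bob"] + 0.1 * rng.standard_normal(64)
print(identify(probe, gallery))       # bob
print(verify(probe, "bob", gallery))  # True
```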
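Footnotes 2 and 3 refer to two evaluation protocols that can be written down directly: leave-persons-out holds out whole users, while leave-gesture-out holds out a whole gesture class. A hedged sketch, assuming each sample is a dict with "user" and "gesture" keys (an assumed layout, not the datasets' actual format):

```python
def leave_persons_out(samples, test_users):
    """Hold out whole users: no test user is seen during training,
    probing generalization across people (as in footnote 2)."""
    train = [s for s in samples if s["user"] not in test_users]
    test = [s for s in samples if s["user"] in test_users]
    return train, test

def leave_gesture_out(samples, test_gesture):
    """Hold out a whole gesture class: the test gesture is never seen during
    training, probing user recognition across gestures (as in footnote 3)."""
    train = [s for s in samples if s["gesture"] != test_gesture]
    test = [s for s in samples if s["gesture"] == test_gesture]
    return train, test
```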
 
Metadata
Title
Two-Stream CNNs for Gesture-Based Verification and Identification: Learning User Style
Authors
Jonathan Wu
Jiawei Chen
Prakash Ishwar
Janusz Konrad
Copyright Year
2017
DOI
https://doi.org/10.1007/978-3-319-61657-5_7