Published in: International Journal of Computer Vision 1/2015

1 March 2015

Continuous Action Recognition Based on Sequence Alignment

Authors: Kaustubh Kulkarni, Georgios Evangelidis, Jan Cech, Radu Horaud

Abstract

Continuous action recognition is more challenging than isolated recognition because classification and segmentation must be carried out simultaneously. We build on the well-known dynamic time warping framework and devise a novel visual alignment technique, namely dynamic frame warping (DFW), which performs isolated recognition based on a per-frame representation of videos and on aligning a test sequence with a model sequence. Moreover, we propose two extensions that enable recognition to be performed concomitantly with segmentation, namely one-pass DFW and two-pass DFW. These two methods have their roots in the domain of continuous speech recognition and, to the best of our knowledge, their extension to continuous visual action recognition has been overlooked. We test and illustrate the proposed techniques with a recently released dataset (RAVEL) and with two public-domain datasets widely used in action recognition (Hollywood-1 and Hollywood-2). We also compare the performance of the proposed isolated and continuous recognition algorithms with that of several recently published methods.
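
As a rough illustration of the classical dynamic time warping idea that the abstract builds on, the sketch below aligns a test sequence of per-frame feature vectors with each model sequence and picks the label with the lowest alignment cost. It is not the authors' DFW: the Euclidean frame distance, the step pattern, and the names `frame_distance`, `dtw_cost`, and `classify` are assumptions made for this example only.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two per-frame feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_cost(test, model):
    """Cumulative cost of the best monotonic alignment of two frame sequences."""
    n, m = len(test), len(model)
    inf = float("inf")
    # cost[i][j] holds the best cost of aligning test[:i] with model[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(test[i - 1], model[j - 1])
            # Standard step pattern: vertical, horizontal, or diagonal move.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify(test, models):
    """Return the label whose model sequence aligns with the lowest cost."""
    return min(models, key=lambda label: dtw_cost(test, models[label]))


# Toy one-dimensional "frames"; the labels and values are hypothetical.
models = {
    "wave": [[0.0], [1.0], [0.0]],
    "point": [[2.0], [2.0], [2.0]],
}
test = [[0.0], [0.0], [1.0], [1.0], [0.0]]
label = classify(test, models)
```

The test sequence here is a time-stretched copy of the "wave" model, so warping absorbs the difference in length. This sketch covers only isolated recognition of a pre-segmented sequence; the continuous setting described in the abstract additionally has to find the action boundaries.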

Metadata
Title
Continuous Action Recognition Based on Sequence Alignment
Authors
Kaustubh Kulkarni
Georgios Evangelidis
Jan Cech
Radu Horaud
Publication date
1 March 2015
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 1/2015
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-014-0758-9
