
Exploring probabilistic localized video representation for human action recognition

Multimedia Tools and Applications

Abstract

In recent years, bag-of-words (BoW) video representations have achieved promising results in human action recognition. By vector-quantizing local spatio-temporal (ST) features, the BoW representation gains simplicity and efficiency, but it also has limitations. First, discretizing the feature space inevitably introduces ambiguity and information loss into the video representation. Second, there is no universal codebook for the BoW representation: the codebook must be rebuilt whenever the video corpus changes. To tackle these issues, this paper explores a localized, continuous, and probabilistic video representation. Specifically, the proposed representation encodes the visual and motion information of the ensemble of local ST features in a video as a distribution estimated by a generative probabilistic model. Furthermore, the probabilistic representation naturally gives rise to an information-theoretic distance metric between videos. This makes the representation readily applicable to most discriminative classifiers, such as nearest-neighbor schemes and kernel-based classifiers. Experiments on two datasets, KTH and UCF Sports, show that the proposed approach delivers promising results.
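The idea of representing a video as a distribution and comparing videos with an information-theoretic distance can be illustrated with a minimal sketch. The abstract does not name the generative model or the distance, so everything here is an assumption made for illustration. A single full-covariance Gaussian stands in for the generative model, the symmetric Kullback-Leibler divergence stands in for the distance, and the function names (`fit_gaussian`, `symmetric_kl`, `kl_kernel`, `nearest_neighbor_label`) are hypothetical. Each video is assumed to arrive as an n x d array of local ST descriptors with n >= 2 and d >= 2.

```python
import numpy as np


def fit_gaussian(features, reg=1e-6):
    """Summarize one video's local descriptors (n x d array) as (mean, covariance).

    Assumption: a single Gaussian stands in for the paper's generative model.
    A small ridge term keeps the covariance invertible.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + reg * np.eye(features.shape[1])
    return mean, cov


def kl_gaussian(p, q):
    """Closed-form KL(p || q) between two multivariate Gaussians."""
    mu_p, cov_p = p
    mu_q, cov_q = q
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (
        np.trace(cov_q_inv @ cov_p)
        + diff @ cov_q_inv @ diff
        - d
        + logdet_q
        - logdet_p
    )


def symmetric_kl(p, q):
    """Symmetrized KL divergence, used here as the distance between videos."""
    return kl_gaussian(p, q) + kl_gaussian(q, p)


def kl_kernel(p, q, a=1.0):
    """Exponentiated divergence, one way to feed the distance to a kernel classifier.

    Assumption: the scale `a` is a free parameter chosen by the user.
    """
    return np.exp(-a * symmetric_kl(p, q))


def nearest_neighbor_label(query, models, labels):
    """Return the label of the training video whose model is closest to the query."""
    distances = [symmetric_kl(query, m) for m in models]
    return labels[int(np.argmin(distances))]
```

In this sketch no codebook is built: each video is modeled independently, so adding videos to the corpus does not require re-estimating anything for the existing ones. The same distance feeds both a nearest-neighbor rule and a kernel.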



Acknowledgments

This work was supported by the National Basic Research Program of China (973 Program, 2007CB311105), the National Natural Science Foundation of China (60873165), and the Co-building Program of the Beijing Municipal Education Commission.

Author information

Corresponding author

Correspondence to Yan Song.

About this article

Cite this article

Song, Y., Tang, S., Zheng, YT. et al. Exploring probabilistic localized video representation for human action recognition. Multimed Tools Appl 58, 663–685 (2012). https://doi.org/10.1007/s11042-011-0748-7

  • DOI: https://doi.org/10.1007/s11042-011-0748-7
