Top

International Journal of Computer Vision

Published in:

01-12-2013

Image Classification with the Fisher Vector: Theory and Practice

Authors: Jorge Sánchez, Florent Perronnin, Thomas Mensink, Jakob Verbeek

Published in: International Journal of Computer Vision | Issue 3/2013

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

A standard approach to describe an image for classification and retrieval purposes is to extract a set of local patch descriptors, encode them into a high dimensional vector and pool them into an image-level signature. The most common patch encoding strategy consists in quantizing the local descriptors into a finite set of prototypical elements. This leads to the popular Bag-of-Visual words representation. In this work, we propose to use the Fisher Kernel framework as an alternative patch encoding strategy: we describe patches by their deviation from an “universal” generative Gaussian mixture model. This representation, which we call Fisher vector has many advantages: it is efficient to compute, it leads to excellent results even with efficient linear classifiers, and it can be compressed with a minimal loss of accuracy using product quantization. We report experimental results on five standard datasets—PASCAL VOC 2007, Caltech 256, SUN 397, ILSVRC 2010 and ImageNet10K—with up to 9M images and 10K classes, showing that the FV framework is a state-of-the-art patch encoding technique.

previous article Coloring Action Recognition in Still Images

next article An Improved Hierarchical Dirichlet Process-Hidden Markov Model and Its Application to Trajectory Modeling and Retrieval

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Available only for authorised users

http://www.image-net.org

http://www.flickr.com/groups

Normalizing by any \(\ell _p\)-norm would cancel-out the effect of \(\omega \). Perronnin et al. (2010c) chose the \(\ell _2\)-norm because it is the natural norm associated with the dot-product. In Sect. 3.2 we experiment with different \(\ell _p\)-norms.

See Appendix A.2 in the extended version of Jaakkola and Haussler (1998) which is available at: http://people.csail.mit.edu/tommi/papers/gendisc.ps

Xiao et al. (2010) also report results with one training sample per class. However, a single sample does not provide any way to perform cross-validation which is the reason why we do not report results in this setting.

See http://people.csail.mit.edu/jxiao/SUN/

Actually, any continuous distribution can be approximated with arbitrary precision by a GMM with isotropic covariance matrices.

Note that since \(q\) draws values in a finite set, we could replace the \(\int \nolimits _q\) by \(\sum _q\) in the following equations but we will keep the integral notation for simplicity.

While it is standard practice to report per-class accuracy on this dataset (see Deng et al. 2010; Sánchez and Perronnin 2011), Krizhevsky et al. (2012); Le et al. (2012) report a per-image accuracy. This results in a more optimistic number since those classes which are over-represented in the test data also have more training samples and therefore have (on average) a higher accuracy than those classes which are under-represented. This was clarified through a personal correspondence with the first authors of Krizhevsky et al. (2012) and Le et al. (2012).

Available at: http://htk.eng.cam.ac.uk/.

Amari, S., & Nagaoka, H. (2000). Methods of information geometry, translations of mathematical monographs (Vol. 191). Oxford: Oxford University Press.

Berg, A., Deng, J., & Fei-Fei, L. (2010). ILSVRC 2010. Retrieved from http://www.image-net.org/challenges/LSVRC/2010/index.

Bergamo, A., & Torresani, L. (2012). Meta-class features for large-scale object categorization on a budget. In CVPR.

Bishop, C. (1995). Training with noise is equivalent to tikhonov regularization. In Neural computation (Vol 7).

Bo, L., & Sminchisescu, C. (2009). Efficient match kernels between sets of features for visual recognition. In NIPS.

Bo, L., Ren, X., & Fox, D. (2012). Multipath sparse coding using hierarchical matching pursuit. In NIPS workshop on deep learning.

Boiman, O., Shechtman, E., & Irani, M. (2008). In defense of nearest-neighbor based image classification. In CVPR.

Bottou, L. (2011). Stochastic gradient descent. Retrieved from http://leon.bottou.org/projects/sgd.

Bottou, L., & Bousquet, O. (2007). The tradeoffs of large scale learning. In NIPS.

Boureau, Y. L., Bach, F., LeCun, Y., & Ponce, J. (2010). Learning mid-level features for recognition. In CVPR.

Boureau, Y. L., LeRoux, N., Bach, F., Ponce, J., & LeCun, Y. (2011). Ask the locals: Multi-way local pooling for image recognition. In ICCV.

Burrascano, P. (1991). A norm selection criterion for the generalized delta rule. IEEE Transactions on Neural Networks, 2(1), 125–30.CrossRef

Chatfield, K., Lempitsky, V., Vedaldi, A., & Zisserman, A. (2011). The devil is in the details: An evaluation of recent feature encoding methods. In BMVC.

Cinbis, G., Verbeek, J., & Schmid, C. (2012). Image categorization using Fisher kernels of non-iid image models. In CVPR.

Clinchant, S., Csurka, G., Perronnin, F., & Renders, J. M. (2007). XRCEs participation to imageval. In ImageEval workshop at CVIR.

Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In ECCV SLCV workshop.

Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In CVPR.

Deng, J., Berg, A., Li, K., & Fei-Fei, L. (2010). What does classifying more than 10,000 image categories tell us?. In ECCV.

Everingham, M., Gool, L.V., Williams, C., Winn, J. & Zisserman, A. (2007). The PASCAL visual object classes challenge 2007 (VOC2007) results.

Everingham, M., Gool, L.V., Williams, C., Winn, J., Zisserman, A. (2008). The PASCAL visual object classes challenge 2008 (VOC2008) results.

Everingham, M., van Gool, L., Williams, C., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.

Farquhar, J., Szedmak, S., Meng, H., & Shawe-Taylor, J. (2005). Improving “bag-of-keypoints” image categorisation. Technical report. Southampton: University of Southampton.

Feng, J., Ni, B., Tian, Q., & Yan, S. (2011). Geometric \(\ell _p\)-norm feature pooling for image classification. In CVPR.

Gehler, P., & Nowozin, S. (2009). On feature combination for multiclass object classification. In ICCV.

Gray, R., & Neuhoff, D. (1998). Quantization. IEEE Transactions on Information Theory, 44(6), 2724–2742.MathSciNetCrossRef

Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset.. California Institute of Technology. Retrieved from http://authors.library.caltech.edu/7694.

Guillaumin, M., Verbeek, J., & Schmid, C. (2010). Multimodal semi-supervised learning for image classification. In CVPR.

Harzallah, H., Jurie, F., & Schmid, C. (2009). Combining efficient object localization and image classification. In ICCV.

Haussler, D. (1999). Convolution kernels on discrete structures. Technical report. Santa Cruz: UCSC.

Jaakkola, T., & Haussler, D. (1998). Exploiting generative models in discriminative classifiers. In NIPS.

Jégou, H., Douze, M., & Schmid, C. (2009). On the burstiness of visual elements. In CVPR.

Jégou, H., Douze, M., Schmid, C., & Pérez, P. (2010). Aggregating local descriptors into a compact image representation. In CVPR.

Jégou, H., Douze, M., & Schmid, C. (2011). Product quantization for nearest neighbor search. In IEEE PAMI.

Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., & Schmid, C. (2012). Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1704–1716.CrossRef

Krapac, J., Verbeek, J., & Jurie, F. (2011). Modeling spatial layout with fisher vectors for image categorization. In ICCV.

Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). Image classification with deep convolutional neural networks. In NIPS.

Kulkarni, N., & Li, B. (2011). Discriminative affine sparse codes for image classification. In CVPR.

Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.

Le, Q., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G., et al. (2012). Building high-level features using large scale unsupervised learning. In ICML.

Lin, Y., Lv, F., Zhu, S., Yu, K., Yang, M., & Cour, T. (2011). Large-scale image classification: Fast feature extraction and svm training. In CVPR.

Liu, Y., & Perronnin, F. (2008). A similarity measure between unordered vector sets with application to image categorization. In CVPR.

Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.CrossRef

Lyu, S. (2005). Mercer kernels for object recognition with local features. In CVPR.

Maji, S., & Berg, A. (2009). Max-margin additive classifiers for detection. In ICCV.

Maji, S., Berg, A., & Malik, J. (2008). Classification using intersection kernel support vector machines is efficient. In CVPR.

Mensink, T., Verbeek, J., Csurka, G., & Perronnin, F. (2012). Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV.

Perronnin, F., & Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In CVPR.

Perronnin, F., Dance, C., Csurka, G., & Bressan, M. (2006). Adapted vocabularies for generic visual categorization. In ECCV.

Perronnin, F., Liu, Y., Sánchez, J., & Poirier, H. (2010a). Large-scale image retrieval with compressed Fisher vectors. In CVPR.

Perronnin, F., Sánchez, J., & Liu, Y. (2010b). Large-scale image categorization with explicit data embedding. In CVPR.

Perronnin, F., Sánchez, J., & Mensink, T. (2010c). Improving the Fisher kernel for large-scale image classification. In ECCV.

Perronnin, F., Akata, Z., Harchaoui, Z., & Schmid, C. (2012). Towards good practice in large-scale learning for image classification. In CVPR.

Sabin, M., & Gray, R. (1984). Product code vector quantizers for waveform and voice coding. IEEE Transactions on Acoustics, Speech and Signal Processing, 32(3), 474–488.CrossRef

Sánchez, J., & Perronnin, F. (2011). High-dimensional signature compression for large-scale image classification. In CVPR.

Sánchez, J., Perronnin, F., & de Campos, T. (2012). Modeling the spatial layout of images beyond spatial pyramids. Pattern Recognition Letters, 33(16), 2216–2223.CrossRef

Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: Primal estimate sub-gradient solver for SVM. In ICML.

Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In ICCV.

Smith, N., & Gales, M. (2001). Speech recognition using SVMs. In NIPS.

Song, D., & Gupta, A. K. (1997). Lp-norm uniform distribution. Proceedings of American Mathematical Society, 125, 595–601.MathSciNetMATHCrossRef

Spruill, M. (2007). Asymptotic distribution of coordinates on high dimensional spheres. In Electronic communications in probability (Vol. 12).

Sreekanth, V., Vedaldi, A., Jawahar, C., & Zisserman, A. (2010). Generalized rbf feature maps for efficient detection. In BMVC.

Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). Statistical analysis of finite mixture distributions. New York: John Wiley.MATH

Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. In CVPR.

Uijlings, J., Smeulders, A., & Scha, R. (2009). What is the spatial extent of an object? In CVPR.

van de Sande, K., Gevers, T., & Snoek, C. (2010). Evaluating color descriptors for object and scene recognition. IEEE PAMI, 32(9), 1582–1596.CrossRef

VanGemert, J., Veenman, C., Smeulders, A., & Geusebroek, J. (2010). Visual word ambiguity. In IEEE TPAMI.

Vedaldi, A., & Zisserman, A. (2010). Efficient additive kernels via explicit feature maps. In CVPR.

Vedaldi, A., & Zisserman, A. (2012). Sparse kernel approximations for efficient classification and detection. In CVPR.

Wallraven, C., Caputo, B., & Graf, A. (2003). Recognition with local features: the kernel recipe. In ICCV.

Wang, G., Hoiem, D., & Forsyth, D. (2009). Learning image similarity from flickr groups using stochastic intersection kernel machines. In ICCV.

Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., & Gong, Y. (2010). Locality-constrained linear coding for image classification. In CVPR.

Winn, J., Criminisi, A., & Minka, T. (2005). Object categorization by learned visual dictionary. In ICCV.

Xiao, J., Hays, J., Ehinger, K., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In CVPR.

Yan, S., Zhou, X., Liu, M., Hasegawa-Johnson, M., & Huang, T. (2008). Regression from patch-kernel. In CVPR.

Yang, J., Li, Y., Tian, Y., Duan, L., & Gao, W. (2009). Group sensitive multiple kernel learning for object categorization. In ICCV.

Yang, J., Yu, K., Gong, Y., & Huang, T. (2009b). Linear spatial pyramid matching using sparse coding for image classification. In CVPR.

Young, S., Evermann, G., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, S., Valtchev V. & Woodland P. (2002). The HTK book (version 3.2.1). Cambridge: Cambridge University Engineering Department.

Zhang, J., Marszalek, M., Lazebnik, S., & Schmid, C. (2007). Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2), 123–138.CrossRef

Zhou, Z., Yu, K., Zhang, T., & Huang, T. (2010). Image classification using super-vector coding of local image descriptors. In ECCV.

Title: Image Classification with the Fisher Vector: Theory and Practice
Authors: Jorge Sánchez
Florent Perronnin
Thomas Mensink
Jakob Verbeek
Publication date: 01-12-2013
Publisher: Springer US
Published in: International Journal of Computer Vision / Issue 3/2013
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI: https://doi.org/10.1007/s11263-013-0636-x

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Other articles of this Issue 3/2013

An Improved Hierarchical Dirichlet Process-Hidden Markov Model and Its Application to Trajectory Modeling and Retrieval

Camera Spectral Sensitivity and White Balance Estimation from Sky Images

Variational Recursive Joint Estimation of Dense Scene Structure and Camera Motion from Monocular High Speed Traffic Sequences

Coloring Action Recognition in Still Images

Premium Partner