Published in: Neural Processing Letters 3/2021

3 April 2021

A Comprehensive Study on VLAD

Authors: Xin Li, Lei Zhang, Zhiping Jian, Liyun Zuo



Abstract

Recently, the vector of locally aggregated descriptors (VLAD) has proven highly effective in diverse computer vision tasks, including image retrieval, scene classification, and action recognition. Its success stems from its powerful representation ability and computational efficiency. However, its theoretical foundation remains unclear, as does its connection to basic yet important algorithms, e.g., the bag-of-words model and match kernels. It is also unclear how its performance is affected by parameter configurations, e.g., normalization and pooling, which are widely used in state-of-the-art algorithms based on local features. In this paper, with the aim of achieving the full capacity of VLAD, we conduct a comprehensive and in-depth study from both theoretical and experimental perspectives. As a theoretical contribution, we provide a new formulation of VLAD via match kernels, which connects VLAD with existing important encoding methods based on local features. As a contribution to the practical use of VLAD, we comprehensively investigate the roles and effects of two widely used operations in local feature encoding: normalization and pooling. To the best of our knowledge, our work provides the first comprehensive study on VLAD, which will not only enable a full understanding of it but also provide important guidance for state-of-the-art algorithms based on local features. We have conducted extensive experiments on three benchmark datasets, Scene-15, Caltech 101, and PPMI, covering both image classification and action recognition.
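
To make the terms in the abstract concrete, the following is a minimal NumPy sketch of the commonly used hard-assignment VLAD encoding, with optional pooling and normalization steps, plus a numerical check of the well-known fact that the inner product of two unnormalized sum-pooled VLAD vectors equals a sum of residual dot products over descriptor pairs assigned to the same centroid. It is not the formulation derived in the paper. The function names (assign, vlad, match_kernel), the parameters (pooling, power, l2), and the random data are illustrative assumptions.

```python
import numpy as np

def assign(descriptors, centroids):
    """Hard assignment: index of the nearest centroid for each descriptor."""
    diff = descriptors[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=2).argmin(axis=1)

def vlad(descriptors, centroids, pooling="sum", power=None, l2=True):
    """Encode a set of local descriptors (n x d) against a codebook (k x d).

    pooling: "sum" or "mean" aggregation of residuals per centroid.
    power:   if given, apply signed power normalization sign(v) * |v|**power.
    l2:      if True, L2-normalize the final k*d vector.
    """
    k, d = centroids.shape
    nearest = assign(descriptors, centroids)
    out = np.zeros((k, d))
    for j in range(k):
        members = descriptors[nearest == j]
        if len(members) == 0:
            continue  # empty cell keeps a zero block
        residuals = members - centroids[j]
        out[j] = residuals.sum(axis=0) if pooling == "sum" else residuals.mean(axis=0)
    v = out.ravel()
    if power is not None:
        v = np.sign(v) * np.abs(v) ** power
    if l2:
        norm = np.linalg.norm(v)
        if norm > 0:
            v = v / norm
    return v

def match_kernel(x, y, centroids):
    """Sum of residual dot products over pairs sharing the same centroid."""
    ax, ay = assign(x, centroids), assign(y, centroids)
    total = 0.0
    for j in range(len(centroids)):
        rx = x[ax == j] - centroids[j]
        ry = y[ay == j] - centroids[j]
        total += (rx @ ry.T).sum()
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centroids = rng.normal(size=(4, 8))   # hypothetical codebook, k=4, d=8
    x = rng.normal(size=(50, 8))          # hypothetical descriptors, image X
    y = rng.normal(size=(60, 8))          # hypothetical descriptors, image Y
    vx = vlad(x, centroids, l2=False)
    vy = vlad(y, centroids, l2=False)
    # Unnormalized sum-pooled VLAD inner product equals the match kernel.
    assert np.isclose(vx @ vy, match_kernel(x, y, centroids))
```

In practice the codebook would typically come from k-means on training descriptors. Switching pooling to "mean" or setting power (for example 0.5) changes the vector and breaks the exact identity above, which is one way to see why normalization and pooling choices affect the comparison between images.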


Metadata
Title
A Comprehensive Study on VLAD
Authors
Xin Li
Lei Zhang
Zhiping Jian
Liyun Zuo
Publication date
3 April 2021
Publisher
Springer US
Published in
Neural Processing Letters / Issue 3/2021
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-021-10502-0
