Elsevier

Pattern Recognition Letters

Volume 33, Issue 16, 1 December 2012, Pages 2216-2223
Modeling the spatial layout of images beyond spatial pyramids

https://doi.org/10.1016/j.patrec.2012.07.019

Abstract

Several state-of-the-art image representations consist of averaging local statistics computed from patch-level descriptors. It has been shown by Boureau et al. that such average statistics suffer from two sources of variance. The first one comes from the fact that a finite set of local statistics is averaged. The second one is due to the variation in the proportion of object-dependent information between different images of the same class. For the problem of object classification, these sources of variance negatively affect accuracy since they increase the overlap between class-conditional probabilities.

Our goal is to include information about the spatial layout of images in image signatures based on average statistics. We show that the traditional approach to including the spatial layout, the spatial pyramid (SP), increases the first source of variance while only weakly reducing the second one. We therefore propose two complementary approaches to account for the spatial layout which are compatible with our goal of variance reduction. The first one models the spatial layout in an image-independent manner (as is the case for the SP) while the second one adapts to the image content. A significant benefit of these approaches with respect to the SP is that they do not increase the dimensionality of the image signature. We demonstrate the benefits of our approach on PASCAL VOC 2007, 2008 and 2009.

Highlights

► State-of-the-art image representations consist in averaging local statistics.
► Our goal is to include information about spatial layout in image signatures.
► We propose and evaluate two complementary approaches.
► The first is image-independent and the second adapts to the image content.
► These methods present a significant benefit over spatial pyramid-based methods.

Introduction

One of the most successful approaches to describing the content of images is the bag-of-features (BOF). It consists of computing and aggregating statistics derived from local patch descriptors such as SIFT (Lowe, 2004). The most popular variant of the BOF framework is certainly the bag-of-visual-words (BOV), which characterizes an image as a histogram of quantized local descriptors (Csurka et al., 2004, Sivic and Zisserman, 2003). In a nutshell, a codebook of prototypical descriptors is learned with k-means and each local descriptor is assigned to its closest centroid. The resulting per-codeword counts are then averaged over the image.
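The hard-assignment BOV described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `bov_histogram` is hypothetical, and the codebook here is a random stand-in for centroids that would normally be learned with k-means.

```python
import numpy as np

def bov_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword and average the counts.

    descriptors: (T, D) array of local descriptors from one image.
    codebook:    (K, D) array of centroids (assumed to be learned beforehand).
    Returns a length-K histogram that sums to 1.
    """
    # Squared Euclidean distance between every descriptor and every centroid.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                      # hard assignment
    counts = np.bincount(nearest, minlength=codebook.shape[0])
    return counts / len(descriptors)                 # average over the image

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 16))       # stand-in for k-means centroids
descriptors = rng.normal(size=(200, 16))  # stand-in for SIFT-like descriptors
hist = bov_histogram(descriptors, codebook)
```

Dividing by the number of descriptors is what makes the signature an average of patch-level statistics rather than a raw count.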

The BOV has been extended in several ways. For instance, the hard quantization can be replaced by a soft quantization to model the assignment uncertainty (Farquhar et al., 2005, van Gemert et al., 2008) or by other coding strategies such as sparse coding (Yang et al., 2009, Yang et al., 2010, Boureau et al., 2010). Also, average pooling can be replaced by max pooling (Yang et al., 2009, Yang et al., 2010, Boureau et al., 2010, Boureau et al., 2010). Another extension is to include higher-order statistics. Indeed, while the BOV is only concerned with the number of descriptors assigned to each codeword, the Fisher vector (FV) (Perronnin and Dance, 2007, Perronnin et al., 2010) as well as the related vector of locally aggregated descriptors (VLAD) (Jégou et al., 2010) and super-vector coding (SVC) (Zhou et al., 2010) also model the distribution of descriptors assigned to each codeword.

Obviously, discarding all information about the location of patches incurs a loss of information. The dominant approach to including spatial information in the BOF framework is the spatial pyramid (SP). Inspired by the pyramid match kernel of Grauman and Darrell (2005), Lazebnik et al. (2006) proposed to partition an image into a set of regions in a coarse-to-fine manner. Each region is described independently and the region-level histograms are then concatenated into an image-level histogram. The SP thus accounts for the fact that different regions can contain different visual information.
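To make the concatenation step concrete, the following sketch builds a two-level pyramid (the full image plus a 2 × 2 grid) from already-quantized patches. It is an illustrative simplification: the function name `spatial_pyramid`, the grid choice, and the unweighted concatenation are assumptions, and any per-level weighting is omitted.

```python
import numpy as np

def spatial_pyramid(words, xy, n_words, grids=(1, 2)):
    """Concatenate per-region BOV histograms over a coarse-to-fine partition.

    words: (T,) integer codeword index of each patch.
    xy:    (T, 2) patch coordinates normalized to [0, 1).
    grids: grid sizes per level; g means a g x g partition of the image.
    """
    parts = []
    for g in grids:
        # Grid cell of each patch, clipped so that a coordinate of 1.0 stays in range.
        cell = np.minimum((xy * g).astype(int), g - 1)
        region = cell[:, 1] * g + cell[:, 0]
        for r in range(g * g):
            mask = region == r
            h = np.bincount(words[mask], minlength=n_words).astype(float)
            if mask.any():
                h /= mask.sum()   # average over the patches of this region only
            parts.append(h)
    return np.concatenate(parts)

rng = np.random.default_rng(0)
n_words = 8
words = rng.integers(0, n_words, size=200)
xy = rng.uniform(0.0, 1.0, size=(200, 2))
signature = spatial_pyramid(words, xy, n_words)
```

With these two levels the signature has 5 × `n_words` entries, and each region-level histogram is averaged over fewer patches than the image-level one.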

Several extensions of the SP have been proposed. Marszalek et al. suggested a different partitioning strategy (Marszalek et al., 2007). Their system combined the full image with a 1 × 3 (top, middle and bottom) and a 2 × 2 (four quadrants) partitioning. Viitaniemi and Laaksonen proposed to assign patches to multiple regions in a soft manner (Viitaniemi and Laaksonen, 2009). The SP has also been extended beyond the BOV, for instance to the FV (Perronnin et al., 2010) or the SVC (Zhou et al., 2010). We note that all previous methods rely on a pre-defined partitioning of the image which is independent of its content. Uijlings et al. proposed a bi-partite image-dependent partitioning in terms of object/non-object (Uijlings et al., 2009). Two BOV histograms are computed per image: an object BOV and a context BOV. While the authors report a very significant increase in classification accuracy on PASCAL VOC 2007, their method relies on knowledge of the object bounding boxes, which is unrealistic for most scenarios of practical value. We emphasize that the simple SP of Lazebnik et al. is still by far the most prevalent approach to account for spatial information in BOF-based methods. Recently, Krapac et al. (2011) proposed to include a location prior per visual word and to derive a Fisher kernel from this model. They report results similar to those of the SP but with a more compact representation. In Koniusz et al. (2011), the authors propose to include spatial and angular information directly at the descriptor level. They used soft-BOV and sparse coding-based signatures, reporting promising results compared to the SP.

Our goal is to propose alternatives to the SP for object classification. We focus on the FV which is simple to implement, computationally efficient and which was shown to yield excellent results in a recent evaluation (Chatfield et al., 2011). However, our work could be extended to other BOF-based techniques in a straightforward manner.

We build on the insights of Boureau et al., 2010, Boureau et al., 2010. If we have a two-class classification problem, linear classification requires the distributions of FVs for these two classes to be well-separated. However, there are two sources of variance which make the distributions of FVs overlap. The first one is due to the fact that the FV is computed from a finite set of descriptors. The second one comes from the fact that the proportion of object-dependent information may vary between two images of the same class. Reducing these sources of variance would increase the linear separability and therefore the classification accuracy. In this paper, we propose two different and complementary ways to include the spatial information into the image signature which target these two sources of variance.

The remainder of the article is organized as follows. In the next section, we briefly review the FV coding method. In Section 3 we consider the variance due to the finite sampling of descriptors. We extend the analysis of Boureau et al., 2010, Boureau et al., 2010 to the case of correlated samples. We show that, because the SP reduces the size of the region over which statistics are averaged, it negatively impacts the variance of the distribution of FVs. We therefore propose a novel approach to include the spatial information by augmenting the descriptors with their location. In Section 4 we analyze the second source of variance specifically in the case of the FV. We show that we could partially compensate for this source of variance if we had access to the object bounding boxes. However, as opposed to Uijlings et al. (2009), we propose a practical solution to this problem based on the objectness measure of Alexe et al. (2010). In Section 5, we provide experimental results on PASCAL VOC 2007, 2008 and 2009 showing the validity and the complementarity of the two proposed techniques. A major benefit is that, as opposed to the SP, they do not increase the feature dimensionality, thus making classifier learning more efficient.
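The idea of augmenting descriptors with their location can be illustrated as follows. This is only a sketch under stated assumptions: the excerpt does not specify which location features or normalization the authors use, so the choice of appending coordinates scaled by the image width and height, and the name `augment_with_location`, are hypothetical.

```python
import numpy as np

def augment_with_location(descriptors, xy, width, height):
    """Append normalized patch coordinates to each local descriptor.

    descriptors: (T, D) local descriptors.
    xy:          (T, 2) patch centers in pixels, as (x, y).
    Returns a (T, D + 2) array; the normalization is an assumed choice.
    """
    norm_xy = xy / np.array([width, height], dtype=float)
    return np.hstack([descriptors, norm_xy])

rng = np.random.default_rng(0)
desc = rng.normal(size=(100, 64))                       # stand-in descriptors
xy = rng.uniform([0, 0], [640, 480], size=(100, 2))     # stand-in patch centers
aug = augment_with_location(desc, xy, 640, 480)
```

The descriptor grows from D to D + 2 dimensions, and all patches of the image are still pooled into a single signature, in contrast to a pyramid, which computes one signature per region and concatenates them.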

Section snippets

The Fisher vector

We only provide a brief introduction to the FV coding method. More details can be found in Perronnin and Dance, 2007, Perronnin et al., 2010. Let $X = \{x_t, t = 1 \ldots T\}$ be the set of $T$ local descriptors extracted from an image. Let $u_\lambda : \mathbb{R}^D \to \mathbb{R}_+$ be a probability density function with parameters $\lambda$ which models the generation process of the local descriptors for any image. The Fisher vector $\mathcal{G}_\lambda^X$ is defined as:
$$\mathcal{G}_\lambda^X = L_\lambda G_\lambda^X.$$
$L_\lambda$ is the Cholesky decomposition of the inverse of the Fisher information matrix $F_\lambda$ of $u_\lambda$, i.e.

Average pooling and feature augmentation

The FV, as given by Eqs. (1), (2), can be viewed as an average of patch-level statistics. Indeed, we can rewrite:
$$\mathcal{G}_\lambda^X = \frac{1}{T} \sum_{t=1}^{T} z_t$$
with:
$$z_t = g(x_t) = L_\lambda \nabla_\lambda \log u_\lambda(x_t).$$
If we assume the samples $x_t$ to have been generated by a class-conditional distribution $p_c$ (where the variable $c$ indexes the class) and to be iid, then (6) can be seen as the sample estimate of a class-conditional expectation:
$$\lim_{T \to \infty} \mathcal{G}_\lambda^X = E_{x \sim p_c}[g(x)].$$
As noted in Boureau et al. (2010), there is an intrinsic variance in this estimation process
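The sample-estimate view can be checked numerically on a toy model. The sketch below assumes a one-dimensional Gaussian density with known standard deviation, whose only parameter is the mean; in that case the Fisher information is 1/σ², so the normalized patch-level statistic is g(x) = (x − μ)/σ. The toy model, the values of `mu`, `sigma` and `mu_c`, and the helper names are illustrative assumptions, not the model used in the paper.

```python
import numpy as np

mu, sigma = 0.0, 2.0   # parameters of the generic model (assumed toy values)
mu_c = 1.0             # mean of the class-conditional distribution (assumed)

def g(x):
    """Normalized score for the mean of a 1-D Gaussian: sigma * (x - mu) / sigma**2."""
    return (x - mu) / sigma

def fisher_vector(x):
    """Average of the patch-level statistics over the sample."""
    return g(x).mean()

rng = np.random.default_rng(0)
expected = (mu_c - mu) / sigma   # class-conditional expectation of g(x)

def fv_variance(T, trials=2000):
    """Empirical variance of the FV over repeated draws of T iid samples."""
    samples = rng.normal(mu_c, sigma, size=(trials, T))
    return g(samples).mean(axis=1).var()

fv_large = fisher_vector(rng.normal(mu_c, sigma, size=100_000))
var_small_T = fv_variance(10)
var_large_T = fv_variance(1000)
```

In this iid setting the variance of the average shrinks as T grows, which illustrates why averaging over fewer descriptors yields a noisier signature.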

Within-class variance and objectness

In the previous section, we showed that the FV can be understood as the sample estimate of a class-conditional expectation and that there is an intrinsic variance in this estimation process which is caused by sampling from a finite pool of descriptors. We now show that there is a second source of variance which is inherent to the model, and we propose another approach that takes the spatial layout into account to remedy this issue.

Experiments

We first present the experimental setup. We then provide more details about the computation of the average correlation introduced in Section 3. We finally report our results.

Conclusions

We addressed the problem of representing the spatial layout of images with two different and complementary approaches. Both originated from a theoretically well-founded analysis. We showed the benefits of our approach on three of the challenging PASCAL VOC benchmarks: a higher accuracy without increasing the image signature dimensionality. Although our focus was on FVs, the generality of the approach makes it applicable to other BOF-based representations.

References (32)

  • deCampos, T.E., et al., 2012. Images as sets of locally weighted features, Comput. Vision Image Understand.
  • Alexe, B., Deselaers, T., Ferrari, V., 2010. What is an object?, In: Proc. Conf. on Computer Vision and Pattern...
  • Bottou, L., Stochastic Gradient Descent Package, <http://leon.bottou.org/projects/sgd> accessed in August...
  • Boureau, Y.-L., Bach, F., LeCun, Y., Ponce, J., 2010. Learning mid-level features for recognition, In: Proc. Conf. on...
  • Boureau, Y.-L., Ponce, J., LeCun, Y., 2010. A theoretical analysis of feature pooling in visual recognition, In: Proc....
  • Chatfield, K., Lempitsky, V., Vedaldi, A., Zisserman, A., 2011. The devil is in the details: an evaluation of recent...
  • Cheng, M.M., Zhang, G.X., Mitra, N.J., Huang, X., Hu, S.M., 2011. Global contrast based salient region detection, In:...
  • Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C., 2004. Visual categorization with bags of keypoints, In: ECCV...
  • Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A., 2007. The PASCAL Visual Object Classes...
  • Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A., 2008. The PASCAL Visual Object Classes...
  • Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A., 2009. The PASCAL Visual Object Classes Challenge...
  • Farquhar, J., Szedmak, S., Meng, H., Shawe-Taylor, J., 2005. Improving “bag-of-keypoints” image categorisation, Tech....
  • Gong, Y., Huang, T., Lv, F., Wang, J., Wu, C., Xu, W., Yang, J., Yu, K., Zhang, T., Zhou, X., 2009. Image...
  • Grauman, K., Darrell, T., 2005. The pyramid match kernel: Discriminative classification with sets of image features,...
  • Jégou, H., Douze, M., Schmid, C., Pérez, P., 2010. Aggregating local descriptors into a compact image representation,...
  • Koniusz, P., Mikolajczyk, K., 2011. Spatial coordinate coding to reduce histogram representations, dominant angle and...

1 Most of this work was done while J. Sánchez was at CIII, Universidad Tecnológica Nacional, Facultad Regional Córdoba, X5000HUA Córdoba, Argentina. He was partially supported by a grant from CONICET, Argentina. Tel.: +54 351 4334051 int. 309.

2 Tel.: +33 476 61 50 17; fax: +33 476 61 50 99.

3 T. deCampos received support from the British EPSRC through grant EP/F069421/1, ACASVA.