Modeling the spatial layout of images beyond spatial pyramids

doi:10.1016/j.patrec.2012.07.019

Pattern Recognition Letters

Volume 33, Issue 16, 1 December 2012, Pages 2216-2223

Abstract

Several state-of-the-art image representations consist in averaging local statistics computed from patch-level descriptors. It has been shown by Boureau et al. that such average statistics suffer from two sources of variance. The first one comes from the fact that only a finite set of local statistics is averaged. The second one is due to the variation in the proportion of object-dependent information between different images of the same class. For the problem of object classification, these sources of variance negatively affect the accuracy since they increase the overlap between class-conditional probabilities.

Our goal is to include information about the spatial layout of images in image signatures based on average statistics. We show that the traditional approach to including the spatial layout – the spatial pyramid (SP) – increases the first source of variance while only weakly reducing the second one. We therefore propose two complementary approaches to account for the spatial layout which are compatible with our goal of variance reduction. The first one models the spatial layout in an image-independent manner (as is the case for the SP) while the second one adapts to the image content. A significant benefit of these approaches with respect to the SP is that they do not increase the dimensionality of the image signature. We show the benefits of our approach on PASCAL VOC 2007, 2008 and 2009.

Highlights

► State-of-the-art image representations consist in averaging local statistics.
► Our goal is to include information about spatial layout in image signatures.
► We propose and evaluate two complementary approaches.
► The first is image-independent and the second adapts to the image content.
► These methods present a significant benefit over spatial pyramid-based methods.

Introduction

One of the most successful approaches to describe the content of images is the bag-of-features (BOF). It consists in computing and aggregating statistics derived from local patch descriptors such as the SIFT (Lowe, 2004). The most popular variant of the BOF framework is certainly the bag-of-visual-words (BOV) which characterizes an image as a histogram of quantized local descriptors (Csurka et al., 2004, Sivic and Zisserman, 2003). In a nutshell, a codebook of prototypical descriptors is learned with k-means and each local descriptor is assigned to its closest centroid. These counts are then averaged over the image.
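The BOV pipeline described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a toy 2-D codebook written by hand; the function name `bov_histogram` and the data are hypothetical, and a real system would learn the codebook with k-means on descriptors such as SIFT.

```python
import numpy as np

def bov_histogram(descriptors, codebook):
    """Hard-assign each descriptor to its nearest codeword and average the counts."""
    # Squared Euclidean distance between every descriptor and every codeword: (T, K).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                       # index of the closest centroid
    counts = np.bincount(nearest, minlength=codebook.shape[0])
    return counts / len(descriptors)                  # average over the image

# Toy data (assumption): 3 codewords and 4 descriptors in 2-D.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
descriptors = np.array([[0.1, 0.0], [0.9, 0.1], [1.1, -0.1], [0.0, 0.8]])
hist = bov_histogram(descriptors, codebook)
print(hist)  # one descriptor near word 0, two near word 1, one near word 2
```

The resulting histogram has one bin per codeword and carries no information about where in the image each patch was located.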

The BOV has been extended in several ways. For instance, the hard quantization can be replaced by a soft quantization to model the assignment uncertainty (Farquhar et al., 2005, van Gemert et al., 2008) or by other coding strategies such as sparse coding (Yang et al., 2009, Yang et al., 2010, Boureau et al., 2010). Also, the average pooling can be replaced by max pooling (Yang et al., 2009, Yang et al., 2010, Boureau et al., 2010a,b). Another extension is to include higher order statistics. Indeed, while the BOV is only concerned with the number of descriptors assigned to each codeword, the Fisher vector (FV) (Perronnin and Dance, 2007, Perronnin et al., 2010) as well as the related vector of locally aggregated descriptors (VLAD) (Jégou et al., 2010) and super-vector coding (SVC) (Zhou et al., 2010) also model the distribution of descriptors assigned to each codeword.

Obviously, discarding all information about the location of patches incurs a loss of information. The dominant approach to include spatial information in the BOF framework is the spatial pyramid (SP). Inspired by the pyramid match kernel of Grauman and Darrell (2005), Lazebnik et al. proposed to partition an image into a set of regions in a coarse-to-fine manner (Lazebnik et al., 2006). Each region is described independently and the region-level histograms are then concatenated into an image-level histogram. The SP makes it possible to account for the fact that different regions can contain different visual information.
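The region-wise description and concatenation can be illustrated as follows. This is a simplified sketch, assuming NumPy, patch centres normalized to [0, 1) and a 1 × 1 plus 2 × 2 partitioning; the function name `spatial_pyramid` is hypothetical and the per-level weighting used in some SP variants is left out.

```python
import numpy as np

def spatial_pyramid(words, xy, num_words, grids=((1, 1), (2, 2))):
    """Concatenate one visual-word histogram per region.

    words: (T,) codeword index of each patch.
    xy:    (T, 2) patch centres (x, y), normalized to [0, 1).
    """
    parts = []
    for rows, cols in grids:
        col = np.minimum((xy[:, 0] * cols).astype(int), cols - 1)
        row = np.minimum((xy[:, 1] * rows).astype(int), rows - 1)
        cell = row * cols + col                      # region index of each patch
        for c in range(rows * cols):
            in_cell = words[cell == c]               # only the patches of this region
            h = np.bincount(in_cell, minlength=num_words).astype(float)
            parts.append(h / max(len(in_cell), 1))   # average over the region
    return np.concatenate(parts)

# Toy data (assumption): 4 patches, 3 visual words, one patch per quadrant.
words = np.array([0, 1, 1, 2])
xy = np.array([[0.1, 0.1], [0.9, 0.2], [0.2, 0.8], [0.7, 0.7]])
sp = spatial_pyramid(words, xy, num_words=3)
print(sp.shape)  # 5 regions x 3 words
```

Two properties are visible even in this toy case: the signature length is multiplied by the number of regions, and each region-level histogram is averaged over fewer patches than the image-level one.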

Several extensions of the SP have been proposed. Marszalek et al. suggested a different partitioning strategy (Marszalek et al., 2007). Their system combined the full image with a 1 × 3 (top, middle and bottom) and a 2 × 2 (four quadrants) partitioning. Viitaniemi and Laaksonen proposed to assign patches to multiple regions in a soft manner (Viitaniemi and Laaksonen, 2009). The SP has also been extended beyond the BOV, for instance to the FV (Perronnin et al., 2010) or the SVC (Zhou et al., 2010). We note that all of these methods rely on a pre-defined partitioning of the image which is independent of its content. In contrast, Uijlings et al. proposed a bi-partite image-dependent partitioning in terms of object/non-object (Uijlings et al., 2009). Two BOV histograms are computed per image: an object BOV and a context BOV. While the authors report a very significant increase of the classification accuracy on PASCAL VOC 2007, their method relies on the knowledge of the object bounding boxes, which is unrealistic for most scenarios of practical value. We emphasize that the simple SP of Lazebnik et al. is still by far the most prevalent approach to account for spatial information in BOF-based methods. Recently, Krapac et al. (2011) proposed to include a location prior per visual word and to derive a Fisher kernel from this model. They report results similar to those of the SP but with a more compact representation. In Koniusz et al. (2011), the authors propose to include spatial and angular information directly at the descriptor level. They used soft-BOV and sparse coding-based signatures, reporting promising results compared to the SP.⁴

Our goal is to propose alternatives to the SP for object classification. We focus on the FV which is simple to implement, computationally efficient and which was shown to yield excellent results in a recent evaluation (Chatfield et al., 2011). However, our work could be extended to other BOF-based techniques in a straightforward manner.

We build on the insights of Boureau et al. (2010a,b). If we have a two-class classification problem, linear classification requires the distributions of FVs for these two classes to be well-separated. However, there are two sources of variance which make the distributions of FVs overlap. The first one is due to the fact that the FV is computed from a finite set of descriptors. The second one comes from the fact that the proportion of object-dependent information may vary between two images of the same class. Reducing these sources of variance would increase the linear separability and therefore the classification accuracy. In this paper, we propose two different and complementary ways to include the spatial information into the image signature which target these two sources of variance.

The remainder of the article is organized as follows. In the next section, we briefly review the FV coding method. In Section 3 we consider the variance due to the finite sampling of descriptors. We extend the analysis of Boureau et al. (2010a,b) to the case of correlated samples. We show that, because the SP reduces the size of the region over which statistics are averaged, it increases the variance of the distribution of FVs. We therefore propose a novel approach to include the spatial information by augmenting the descriptors with their location. In Section 4 we analyze the second source of variance specifically in the case of the FV. We show that we could partially compensate for this source of variance if we had access to the object bounding boxes. However, as opposed to Uijlings et al. (2009), we propose a practical solution to this problem based on the objectness measure of Alexe et al. (2010). In Section 5, we provide experimental results on PASCAL VOC 2007, 2008 and 2009 showing the validity and the complementarity of the two proposed techniques. A major benefit is that, as opposed to the SP, they do not increase the feature dimensionality, thus making the classifier learning more efficient.
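The idea of augmenting descriptors with their location can be sketched as below. This is only an illustration under stated assumptions: the function name `augment_with_location`, the centring of coordinates at the image centre, and the numbers K, D and the region count are all hypothetical choices, not details taken from the text above.

```python
import numpy as np

def augment_with_location(descriptors, xy, width, height):
    """Append normalized patch coordinates to each local descriptor.

    descriptors: (T, D) local descriptors.
    xy:          (T, 2) patch centres (x, y) in pixels.
    """
    # Assumption: coordinates are scaled by the image size and centred on zero.
    loc = np.asarray(xy, dtype=float) / np.array([width, height]) - 0.5
    return np.hstack([descriptors, loc])             # (T, D + 2)

# Toy data (assumption): 4 descriptors of dimension 64 in a 640 x 480 image.
descriptors = np.zeros((4, 64))
xy = np.array([[0, 0], [320, 240], [640, 480], [160, 120]])
augmented = augment_with_location(descriptors, xy, width=640, height=480)
print(augmented.shape)

# Illustrative size comparison for a signature that grows like K x descriptor size:
K, D, regions = 256, 64, 8                           # 8 regions = 1 + 3 + 4
print(K * (D + 2), regions * K * D)                  # augmentation vs. region-wise concatenation
```

The descriptor grows by two dimensions, whereas concatenating region-level signatures multiplies the signature size by the number of regions.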

The Fisher vector

We only provide a brief introduction to the FV coding method. More details can be found in Perronnin and Dance (2007) and Perronnin et al. (2010). Let $X = \{x_{t}, t = 1 \dots T\}$ be the set of $T$ local descriptors extracted from an image. Let $u_{λ} : \mathbb{R}^{D} \to \mathbb{R}_{+}$ be a probability density function with parameters $λ$ which models the generation process of the local descriptors for any image. The Fisher vector $\mathcal{G}_{λ}^{X}$ is defined as: $\mathcal{G}_{λ}^{X} = L_{λ} G_{λ}^{X} ,$ where $G_{λ}^{X}$ denotes the average gradient of the log-likelihood of the descriptors with respect to $λ$. $L_{λ}$ is the Cholesky decomposition of the inverse of the Fisher information matrix $F_{λ}$ of $u_{λ}$ , i.e.
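As a concrete illustration of this definition, the sketch below computes the part of a Fisher vector that corresponds to the gradient with respect to the means of a Gaussian mixture. Several choices here are assumptions not stated in the text above: that $u_{λ}$ is a GMM with diagonal covariances, that only the mean parameters are used, and that the Fisher information is approximated in closed form so that the normalization reduces to a division by the square root of the mixture weight. The function name `fisher_vector_means` is hypothetical.

```python
import numpy as np

def fisher_vector_means(X, w, mu, sigma):
    """Mean-gradient part of a Fisher vector under a diagonal-covariance GMM.

    X: (T, D) descriptors; w: (K,) mixture weights;
    mu, sigma: (K, D) component means and standard deviations.
    """
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]      # (T, K, D)
    # Log of w_k * N(x_t; mu_k, sigma_k), up to a constant shared by all components.
    log_p = -0.5 * (diff ** 2).sum(axis=2) - np.log(sigma).sum(axis=1) + np.log(w)
    log_p -= log_p.max(axis=1, keepdims=True)                        # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                        # posteriors, (T, K)
    # Normalized per-patch gradient statistics, (T, K, D).
    z = gamma[:, :, None] * diff / np.sqrt(w)[None, :, None]
    return z.mean(axis=0).ravel()                                    # average over patches

# Toy check (assumption): one standard Gaussian, so the result is the descriptor mean.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
fv = fisher_vector_means(X, w=np.array([1.0]),
                         mu=np.zeros((1, 2)), sigma=np.ones((1, 2)))
print(fv)
```

The final step is a plain average of per-patch quantities, which is the property the rest of the analysis relies on.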

Average pooling and feature augmentation

The FV, as given by Eqs. (1) and (2), can be viewed as an average of patch-level statistics. Indeed, we can rewrite it as: $\mathcal{G}_{λ}^{X} = \frac{1}{T} \sum_{t = 1}^{T} z_{t}$ with: $z_{t} \equiv g (x_{t}) \equiv L_{λ} \nabla_{λ} \log u_{λ} (x_{t}) .$ If we assume the samples $x_{t}$ to have been generated by a class-conditional distribution $p_{c}$ (where the variable $c$ indexes the class) and to be iid, then Eq. (6) can be seen as the sample estimate of a class-conditional expectation: $\lim_{T \to \infty} \mathcal{G}_{λ}^{X} = \mathbb{E}_{x \sim p_{c}} [g (x)] .$ As noted in Boureau et al. (2010), there is an intrinsic variance in this estimation process
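The effect of the number of averaged samples on this variance can be checked numerically. The sketch below is a generic illustration rather than the derivation used in this work: it assumes scalar, unit-variance statistics with the same pairwise correlation `rho` between any two samples, for which the variance of the average of T samples is (1 + (T − 1) rho) / T. The function names are hypothetical.

```python
import numpy as np

def mean_variance_theory(T, rho, sigma2=1.0):
    """Variance of the average of T equicorrelated samples of variance sigma2."""
    return sigma2 * (1.0 + (T - 1) * rho) / T

def mean_variance_simulated(T, rho, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity for unit-variance Gaussian samples."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((trials, 1))        # component common to all samples
    own = rng.standard_normal((trials, T))           # independent component
    x = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own
    return x.mean(axis=1).var()

for T in (16, 64):
    for rho in (0.0, 0.1):
        print(T, rho, mean_variance_theory(T, rho), mean_variance_simulated(T, rho))
```

With independent samples the variance decays as 1/T, so averaging over a quarter of the patches multiplies it by four. With correlated samples the decay is slower and the variance cannot fall below rho times the sample variance.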

Within-class variance and objectness

In the previous section, we showed that the FV can be understood as the sample estimate of a class-conditional expectation and that there is an intrinsic variance in this estimation process which is caused by sampling from a finite pool of descriptors. We now show that there is a second source of variance which is inherent to the model, and we propose another approach to take into account the spatial layout in order to remedy this issue.
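The second source of variance can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the method of this section: it supposes that each patch comes with an objectness score in [0, 1] and that this score is used as a weight when pooling patch-level statistics. The function names `weighted_pool` and `toy_image`, the statistics of +1 and −1, and the scores 0.9 and 0.1 are all hypothetical.

```python
import numpy as np

def weighted_pool(z, weights):
    """Weighted average of patch-level statistics z (T, E) with non-negative weights (T,)."""
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * z).sum(axis=0) / weights.sum()

def toy_image(num_object, num_background):
    """Object patches carry statistic +1, background patches -1 (toy assumption)."""
    z = np.concatenate([np.ones((num_object, 1)), -np.ones((num_background, 1))])
    objectness = np.concatenate([np.full(num_object, 0.9),
                                 np.full(num_background, 0.1)])
    return z, objectness

# Three images of the same class whose object covers 20%, 50% and 80% of the patches.
plain, weighted = [], []
for n_obj in (20, 50, 80):
    z, objectness = toy_image(n_obj, 100 - n_obj)
    plain.append(z.mean(axis=0)[0])
    weighted.append(weighted_pool(z, objectness)[0])
print(plain, weighted)
```

In this toy setting the plain average moves from −0.6 to 0.6 as the object proportion changes, while the weighted average stays within a narrower range, which is the kind of within-class variation the section is concerned with.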

Experiments

We first present the experimental setup. We then provide more details about the computation of the average correlation introduced in Section 3. We finally report our results.

Conclusions

We addressed the problem of representing the spatial layout of images with two different and complementary approaches. Both originated from a theoretically well-founded analysis. We showed the benefits of our approach on three of the challenging PASCAL VOC benchmarks: a higher accuracy without increasing the image signature dimensionality. Although our focus was on FVs, the generality of the approach makes it applicable to other BOF-based representations.

References (32)

deCampos, T.E., et al., 2012. Images as sets of locally weighted features, Comput. Vision Image Understand.
Alexe, B., Deselaers, T., Ferrari, V., 2010. What is an object?, In: Proc. Conf. on Computer Vision and Pattern...
Bottou, L., Stochastic Gradient Descent Package, <http://leon.bottou.org/projects/sgd> accessed in August...
Boureau, Y.-L., Bach, F., LeCun, Y., Ponce, J., 2010a. Learning mid-level features for recognition, In: Proc. Conf. on...
Boureau, Y.-L., Ponce, J., LeCun, Y., 2010b. A theoretical analysis of feature pooling in visual recognition, In: Proc....
Chatfield, K., Lempitsky, V., Vedaldi, A., Zisserman, A., 2011. The devil is in the details: an evaluation of recent...
Cheng, M.M., Zhang, G.X., Mitra, N.J., Huang, X., Hu, S.M., 2011. Global contrast based salient region detection, In:...
Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C., 2004. Visual categorization with bags of keypoints, In: ECCV...
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A., 2007. The PASCAL Visual Object Classes...
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A., 2008. The PASCAL Visual Object Classes...
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A., 2009. The PASCAL Visual Object Classes Challenge...
Farquhar, J., Szedmak, S., Meng, H., Shawe-Taylor, J., 2005. Improving “bag-of-keypoints” image categorisation, Tech....
Gong, Y., Huang, T., Lv, F., Wang, J., Wu, C., Xu, W., Yang, J., Yu, K., Zhang, T., Zhou, X., 2009. Image...
Grauman, K., Darrell, T., 2005. The pyramid match kernel: Discriminative classification with sets of image features,...
Jégou, H., Douze, M., Schmid, C., Pérez, P., 2010. Aggregating local descriptors into a compact image representation,...
Koniusz, P., Mikolajczyk, K., 2011. Spatial coordinate coding to reduce histogram representations, dominant angle and...

¹: Most of this work was done while J. Sánchez was at CIII, Universidad Tecnológica Nacional, Facultad Regional Córdoba, X5000HUA Córdoba, Argentina. He was partially supported by a grant from CONICET, Argentina. Tel.: +54 351 4334051 int. 309.

²: Tel.: +33 476 61 50 17; fax: +33 476 61 50 99.

³: T. deCampos received support from the British EPSRC through grant EP/F069421/1, ACASVA.
