Similarity caching in large-scale image retrieval

https://doi.org/10.1016/j.ipm.2010.12.006Get rights and content

Abstract

Feature-rich data, such as audio-video recordings, digital images, and results of scientific experiments, nowadays constitute the largest fraction of the massive data sets produced daily in the e-society. Content-based similarity search systems working on such data collections are rapidly growing in importance. Unfortunately, similarity search is in general very expensive and hardly scalable.

In this paper we study the case of content-based image retrieval (CBIR) systems, and focus on the problem of increasing the throughput of a large-scale CBIR system that indexes a very large collection of digital images. By analyzing the query log of a real CBIR system available on the Web, we characterize the behavior of users who experience a novel search paradigm, where content-based similarity queries and text-based ones can easily be interleaved. We show that locality and self-similarity is present even in the stream of queries submitted to such a CBIR system. According to these results, we propose an effective way to exploit this locality, by means of a similarity caching system, which stores the results of recently/frequently submitted queries and associated results. Unlike traditional caching, the proposed cache can manage not only exact hits, but also approximate ones that are solved by similarity with respect to the result sets of past queries present in the cache. We evaluate extensively the proposed solution by using the real query stream recorded in the log and a collection of 100 millions of digital photographs. The high hit ratios and small average approximation error figures obtained demonstrate the effectiveness of the approach.

Research highlights

► We propose similarity-based caching techniques to increase the throughput of a large-scale CBIR. ► We exploit knowledge from a query log of a real large-scale CBIR system. ► Our experiments are conducted on a very large collection containing 100 million images.

Introduction

Content-based image retrieval (CBIR) (Datta, Joshi, Li, & Wang, 2007) systems support a search paradigms that is becoming day by day more important. Given an image, used as a query by example, users can ask the CBIR system for identifying a given number of images that are the most similar to the query by comparing their visual contents. More specifically, this content-based similarity search is realized in terms of k nearest neighbors (kNN) queries that exploit an appropriate distance measure, defined over the vectors of visual features extracted from each image.

Indeed, such search paradigm fits not only digital images, but also other data types, such as audio-videos, bio-medical data, financial data, and many other scientific data. They constitute the largest fraction of data produced daily in the e-society. All these dominant data sets can be represented in feature spaces that are inherently multi-dimensional. Providing an effective and efficient way to search such multi-dimensional data sets is very challenging. This is not only due to the curse of dimensionality of the visual feature spaces used to represent images, but also to the colossal size of multimedia data over which such content-based search paradigm could be applied. Just to mention some numbers about the hugeness of such data sets, we can refer to a report by an EMC-sponsored research team of International Data Corporation (IDC). This report states that about 281 exabytes (281 billion gigabytes) of digital data were available in the world in 2007, and that by 2011 the aggregate amount of digital data will reach about 1.8 zettabytes (1800 exabytes) (John et al., 2008). Already in 2008, more than 80% of the size of digital contents was related to images: pictures, surveillance videos, TV streams, and so forth. Moreover, if we consider the Web, we know that about 15 billions of images and 220 millions of new photos per week were uploaded to Facebook in May 2009, and about 4 billions of images and 100 millions of new photos per month were uploaded to Flickr in October 2009.

In this paper we focus on the general issue of making CBIR systems scalable, and tackle this goal by proposing a novel similarity cache that intercepts the kNN content-based similarity queries directed to a CBIR system. We show that our cache is able to satisfy a large number of these requests, with exact or approximate results. Since a hit on our cache requires a time much shorter than a kNN query on the whole index, the overall query throughput of the CBIR system is remarkably improved.

Our similarity cache differs from a traditional query result cache (e.g., the one exploited by Web search engines (Fagni, Perego, Silvestri, & Orlando, 2006)), and is based on the mathematical foundations of metric spaces (Chávez, Navarro, Baeza-Yates, & Marroquín, 2001). More specifically, the measurable similarity between the query and the cached objects is exploited for increasing the cache hit rate by allowing approximate cache hits. Besides the traditional exact answers, which occur when the same image query has been submitted in the past and its result set has not yet evicted from the cache, our cache can also return approximate answers. To this end, we exploit the cached visual features of the images belonging to the result sets of previous queries, and use them to identify the associated images that are closest to the query. Obviously, the effectiveness of our similarity cache approach cannot be evaluated in terms of the usual hit ratio only. So in our tests we also considered the quality of the approximate results returned by the cache to show the effectiveness of our approach.

It is worth remarking that, given a query q, when the result set returned by the cache is the same as the one retrieved in the past by the back-end for the same query q, we consider it as an exact hit. Obviously, the CBIR back-end can in turn exploit any query answering approach and return either exact or approximate kNN results sets. In the last case also a cache exact hit actually returns an approximate kNN result. Our similarity cache is thus a general approach to improve CBIR throughput since it can improve the response time of any exact or approximate CBIR back-end.

In Falchi, Lucchese, Orlando, Perego, and Rabitti (2009) we investigated an important characteristic of our similarity cache for CBIR systems: its robustness with respect to the near-duplicate images. A popular image has in fact thousands of slightly different versions in the Web. In case of near-duplicate queries, that are variants of previously cached queries, our similarity cache can return high-quality answers, namely approximate hits, without querying the CBIR back-end.

The experimental setting exploited for testing and evaluating our similarity cache is unique in the panorama of multimedia search. We refer to a real CBIR system, which indexes five MPEG-7 visual descriptors (Scalable Color, Color Structure, Color Layout, Edge Histogram, Homogeneous Texture) of a collection of 100 millions of Flickr photos. The set of visual descriptors associated with this image collection is publicly available for research purposes (CoPhIR: Content-based Photo Image Retrieval Test-Collection (Bolettieri et al., 2009)1). The CBIR is MUFIN Image Search,2 where query images can be either internal or external images – i.e. images belonging to the indexed CoPhIR collection or uploaded by the user.

We analyzed a log of the queries submitted to MUFIN CBIR system between March 20th and October 28th, 2009, with the aim of devising common patterns of user behaviors that can be exploited to better understand user desiderata, improve the design of a large-scale CBIR system and its caching system, and tune its performance. To the best of our knowledge, no similar analysis was conducted before on such kind of query logs. Even if this query log is not huge if compared to the size of commonly studied query logs of Web Search engines, it can be considered representative of Web user behavior. The log in fact records 65,063 queries issued during 5004 sessions by 3390 distinct users. We experimentally show that the distribution of topic popularity existing for text queries (Fagni et al., 2006, Silvestri, 2010) is confirmed to some extent also for content-based image queries. In particular a high level of locality and of self-similarity can be found in the results of the queries submitted by the same and different users during their search sessions, so that the distribution of the popularity of such results is highly skewed. Moreover, we mined from the log of the MUFIN system providing the novel content-based similarity search paradigm, the most common behaviors of users, and measured their preferences for using similarity and/or text queries.

Moreover, we exploited the large experimental setting illustrated above, the MUFFIN query log and the CoPhIR collection, in order to evaluate carefully our similarity caching strategy. To this end we collected, for each similarity query in the log, the Top-200 exact closest results present in the CoPhIR collection, and used them as a sort of ground truth to estimate the quality of the approximate results returned by our cache, and thus to evaluate the actual benefits of using the cache in terms of improved query throughput and approximation error. It is worth mentioning that the preliminary conclusions reported in Falchi et al., 2008, Falchi et al., 2009, where we exploited a synthetic query log and a smaller image database, were not only confirmed but made remarkably stronger: our caching strategy behaves much better with a larger database and a real-life query log.

The rest of the paper is organized as follows. Section 2 reviews related work, while Section 3 discusses the results of the analysis conducted on the query log recording the content-based similarity queries issued by Web users to the MUFIN Image Search system. In Section 4 the issues related to caching the results of similarity search queries are outlined, and the framework for evaluating the efficiency and effectiveness is presented. Section 5 describes experimental settings, and reports on the results of the experiments conducted. Finally, Section 6 draws some conclusions.

Section snippets

Related work

There are three main research topics that are related to our work. The first is concerned with WSE query log analysis and the methods for caching query results. Another is concerned with approximate search in metric spaces and the metrics for measuring effectiveness of such methods. Finally, independent of our work, another notion of similarity cache has been recently introduced. The problem as been designed and studied in a completely different context, in particular as a method to make faster

The CBIR system query log

Although there exist several CBIR prototypes,3 no log of content-based similarity queries is publicly available. Recently, both Google and Microsoft Bing introduced some limited possibilities for content-based search within their popular image search services. However, no study on the use of such services is available yet.

We had access to the query log recorded by the MUFIN image search system (Novak

Caching content-based query results

Let C be our similarity cache hosted on the front-end of a large-scale CBIR system, which stores recently or/and frequently submitted queries and associated results (see Fig. 2). In order to compute the similarity between the cache content and the query, our method needs to keep in cache not only the identifiers of the results returned for a cached query, but also the features (e.g., the MPEG-7 visual descriptors) associated with them. As a consequence, look-up, update, eviction, and insertion

Results assessment

To the best of our knowledge, this is the first rigorous evaluation in the area of multimedia content-based search involving a real-life query log, generated by the search activity of a large number of users, over a large collection of objects. As discussed in Section 3, our evaluation is based on a query log of the MUFIN Image Search system, which answers to content-based similarity queries over the CoPhIR collection.

The distance between two images is computed as a weighted sum of the

Conclusions

Starting from the analysis of a query log recording the behavior of Web users experimenting a uncommon paradigm of image search based on image content similarity, we have discussed and evaluated a model for caching query results based on the concept of similarity. Our goal was to show that similarity caching is a viable and effective way to improve efficiency of a large-scale CBIR system processing a realistic workload.

We studied a large log recording the queries issued to the MUFIN Image

Acknowledgments

We would like to thank Pavel Zezula and his group at the Masaryk University (Brno, Czech Republic) for making available to us the query log of the MUFIN Image Search system. This work would have not been possible without their contribution. We acknowledge the partial support of S-CUBE (EU-FP7-215483), ASSETS (CIP-ICT-PSP-250527), and VISITO Tuscany (POR-FESR-63748) projects.

References (40)

  • M. Batko et al.

    Scalability comparison of peer-to-peer similarity search structures

    Future Generation Computer System

    (2008)
  • B. Bustos et al.

    Pivot selection techniques for proximity searching in metric spaces

    Pattern Recognition Letters

    (2003)
  • E.P. Markatos

    On caching search engine query results

    Computer Communications

    (2001)
  • G. Amato et al.

    Approximate similarity search in metric spaces using inverted files

  • S. Arya et al.

    An optimal algorithm for approximate nearest neighbor searching fixed dimensions

    Journal of ACM

    (1998)
  • R.A. Baeza-Yates et al.

    Design trade-offs for search engine caching

    TWEB

    (2008)
  • Batko, M., Falchi, F., Lucchese, C., Novak, D., Perego, R., Rabitti, F., Sedmidubsky, J., Zezula, P. (2009). Building a...
  • C. Böhm et al.

    Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases

    ACM Computing Surveys

    (2001)
  • P. Bolettieri et al.

    Cophir: A test collection for content-based image retrieval

    CoRR, abs/0905.4627

    (2009)
  • T. Bozkaya et al.

    Indexing large metric spaces for similarity search queries

    ACM Transactions on Computer Systems

    (1999)
  • E. Chávez et al.

    Searching in metric spaces

    ACM Computing Surveys (CSUR)

    (2001)
  • E. Chavez Gonzalez et al.

    Effective proximity retrieval by ordering permutations

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2008)
  • Chierichetti, F., Kumar, R., Vassilvitskii, S. (2009). Similarity caching. In PODS (pp....
  • Ciaccia, P., Patella, M., Zezula, P. (1997). M-Tree: An efficient access method for similarity search in metric spaces....
  • R. Datta et al.

    Image retrieval: Ideas, influences, and trends of the new age

    ACM Computing Surveys

    (2007)
  • Esuli, A. (2009). Pp-index: Using permutation prefixes for efficient and scalable approximate similarity search. In...
  • T. Fagni et al.

    Boosting the performance of web search engines: Caching and prefetching query results by exploiting historical usage data

    ACM Transactions on Computer Systems

    (2006)
  • Falchi, F., Lucchese, C., Orlando, S., Perego, R., Rabitti, F. (2008). A metric cache for similarity search. In...
  • Falchi, F., Lucchese, C., Orlando, S., Perego, R., Rabitti, F. (2009). Caching content-based queries for robust and...
  • Ferhatosmanoglu, H., Tuncel, E., Agrawal, D., El Abbadi, A. (2001). Approximate nearest neighbor searching in...
  • Cited by (22)

    • Modeling user preferences in content-based image retrieval: A novel attempt to bridge the semantic gap

      2015, Neurocomputing
      Citation Excerpt :

      In our experience, the most important point is the database size and, close to this, the evaluation cost. Many of the published experiments work with small databases (around 1000 images), or medium-size ones (up to 100,000 images) with the highly relevant exception of [9] which evaluates its algorithm in a 100-million image database using a similarity caching system. The main differences between the previously cited works and the current work in each of the aforementioned issues are as follows:

    • Ascent Similarity Caching With Approximate Indexes

      2023, IEEE/ACM Transactions on Networking
    • GRADES: Gradient Descent for Similarity Caching

      2023, IEEE/ACM Transactions on Networking
    View all citing articles on Scopus
    View full text