Scalability comparison of Peer-to-Peer similarity search structures

doi:10.1016/j.future.2007.07.012

Future Generation Computer Systems

Volume 24, Issue 8, October 2008, Pages 834-848

https://doi.org/10.1016/j.future.2007.07.012 Get rights and content

Abstract

Due to the increasing complexity of current digital data, similarity search has become a fundamental computational task in many applications. Unfortunately, its costs are still high and grow linearly on single server structures, which prevents them from efficient application on large data volumes. In this paper, we shortly describe four recent scalable distributed techniques for similarity search and study their performance in executing queries on three different datasets. Though all the methods employ parallelism to speed up query execution, different advantages for different objectives have been identified by experiments. The reported results would be helpful for choosing the best implementations for specific applications. They can also be used for designing new and better indexing structures in the future.

Introduction

Efficient lookup for specific keywords in a dictionary containing hundreds of thousands of words or locating specific records in millions of bank accounts are quite easy tasks for present-day computers. Since records in such domains can be sorted, and every record either fully satisfies the search condition or it does not at all, hashing or tree-like structures can be applied as indexes. High scalability of such technologies is guaranteed by logarithmically bounded search time with respect to the size of the file.

However, to find images of sport cars, time series with similar development, or groups of customers with common buying patterns in respective data collections, the traditional technologies simply fail. Here, the required comparison is gradual, rather than binary, because, once a reference (query) pattern is given, each instance in a search file and the pattern are in certain relation measured by a user-defined dissimilarity function. Such a searching is more and more important for a variety of current complex digital data collections and is generally designated as the similarity search.

Although many similarity search approaches have been proposed, the most generic one considers the mathematical metric space as a suitable abstraction of similarity [1]. The simple but powerful concept of the metric space consists of a domain of objects and a distance function that measures proximity of pairs of objects. It can be applied not only to multi-dimensional vector spaces, but also to different forms of string objects, as well as to sets or groups of various nature, etc. Although many index structures have been proposed, the similarity search is inherently expensive. Besides, the linear increase of the search time prevents them from application to huge files that have become common and the predictions expect a continuous rapid growth.

Very recently, we have proposed four scalable distributed similarity search structures for metric data. The first two structures adopt the basic ball and generalized hyperplane partitioning principles [2] and they are called the ${VPT}^{*}$ and the ${GHT}^{*}$ , respectively [3]. The other two apply transformation strategies — the metric similarity search problem is transformed into a series of range queries executed on existing distributed keyword structures, namely the CAN [4] and the Chord [5]. By analogy, we call them the MCAN [6] and the M-Chord [7]. Each of the structures is able to execute similarity queries for any metric dataset, and they all exploit parallelism for query execution. However, due to the completely different underlying principles, an important question arises: What are the advantages and disadvantages of the individual approaches in terms of search costs and scalability for different real-life search problems?

In this paper, we report on implementations of the ${VPT}^{*}$ , ${GHT}^{*}$ , MCAN and M-Chord systems over the same infrastructure of peer computers. We have conducted numerous experiments on three different datasets and present the most important findings. We focus on scalability with respect to the size of the query, the size of the dataset, and the number of queries executed simultaneously. The results reported in this paper have been presented on the INFOSCALE ’06 conference [8].

The rest of the paper is organized as follows. The necessary background and related work are reported in Section 2. A brief specification of our four indexing techniques is available in Section 3, while the assumptions and results of our experiments can be found in Section 4. The paper concludes in Section 5.

Section snippets

Background and related work

In this section, we provide a theoretical background for the similarity search. Then we mention the distributed paradigm adopted by all presented solutions. We also give a brief related work survey in the end of this section.

Distributed metric approaches

This section contains short descriptions of four different distributed structures for indexing and similarity search in the metric data. The first two, ${GHT}^{*}$ and ${VPT}^{*}$ , are native metric index structures whereas the other two, MCAN and M-Chord, transform the metric search issue into a different problem and take advantage of some existing solutions. Each description consists of a main idea of the particular approach, a basic architecture of the system, and a schema of algorithms for the $Range$

Evaluation of performance scalability

In this section, we provide comparison of the presented approaches by confronting results of extensive experiments. For each data structure, the tests have been conducted on the same datasets and in the same test environment. Moreover, all the structures have been implemented over the very same infrastructure sharing a lower-level code. Due to these facts, we consider results of the experiments fairly comparable.

When designing the experiments, we have focused mainly on various aspects of the

Conclusions

In this paper, we have studied performance of four different distributed index structures for metric spaces, namely the ${GHT}^{*}$ , the ${VPT}^{*}$ , the MCAN and the M-Chord. We have focused on their scalability of executing similarity queries from three different points of view: (1) changing query radii, (2) growing volume of searched data, and (3) accumulating number of concurrent queries. We have conducted a vast bulk of experiments and reported the most interesting findings in special sections of this

Acknowledgements

This research has been partially supported by the national research project No. 1ET100300419 and partially by the Czech Science Foundation grant No. 102/05/H050.

References (33)

J.K. Uhlmann
Satisfying general proximity/similarity queries with metric trees
Information Processing Letters
(1991)
B. Bustos et al.
Pivot selection techniques for proximity searching in metric spaces
Pattern Recognition Letters
(2003)
S.B. Needleman et al.
A general method applicable to the search for similarities in the amino acid sequence of two proteins
Journal of Molecular Biology
(1970)
C. Chen et al.
An adaptive grid implementation of DNA sequence alignment
Future Generation Computer Systems
(2005)
P. Zezula et al.
M. Batko et al.
Similarity grid for searching in metric spaces
S. Ratnasamy et al.
A scalable content addressable network
I. Stoica et al.
Chord: A scalable Peer-to-Peer lookup service for internet applications
F. Falchi, C. Gennaro, P. Zezula, A content-addressable network for similarity search in metric spaces, in: Proceedings...
D. Novak et al.
M-Chord: A scalable distributed similarity search structure

M. Batko et al.

On scalability of the similarity search in the world of peers

W. Litwin et al.

LH*–A scalable, distributed data structure

ACM Transactions on Database Systems

(1996)

G.R. Hjaltason et al.

Index-driven similarity search in metric spaces

ACM Transactions on Database Systems

(2003)

E. Chávez et al.

Searching in metric spaces

ACM Computing Surveys

(2001)

V. Dohnal et al.

D-index: Distance searching index for metric data sets

Multimedia Tools and Applications

(2003)

K. Aberer

P-Grid: A self-organizing access structure for P2P information systems

Lecture Notes in Computer Science

(2001)

Cited by (24)

Modelling efficient novelty-based search result diversification in metric spaces
2013, Journal of Discrete Algorithms
Citation Excerpt :
Metric spaces have been widely studied to perform sequentially searches for objects which are similar to a given query object (e.g. [39,42,43]). Additionally, some works have used metric spaces in other settings like distributed environments [11,31,48] and peer-to-peer (P2P) systems [7,15,37]. Additional work on permutation-based algorithms was described by Skala [54], who evaluated the number of possible permutations that can occur in a given space.
Novelty-based diversification provides a way to tackle ambiguous queries by re-ranking a set of retrieved documents. Current approaches are typically greedy, requiring $O (n^{2})$ document–document comparisons in order to diversify a ranking of n documents. In this article, we introduce a new approach for novelty-based search result diversification to reduce the overhead incurred by document–document comparisons. To this end, we model novelty promotion as a similarity search in a metric space, exploiting the properties of this space to efficiently identify novel documents. We investigate three different approaches: pivoting-based, clustering-based, and permutation-based. In the first two, a novel document is one that lies outside the range of a pivot or outside a cluster. In the latter, a novel document is one that has a different signature (i.e., the documentʼs relative distance to a distinguished set of fixed objects called permutants) compared to previously selected documents. Thorough experiments using two TREC test collections for diversity evaluation, as well as a large sample of the query stream of a commercial search engine show that our approaches perform at least as effectively as well-known novelty-based diversification approaches in the literature, while dramatically improving their efficiency.
Similarity caching in large-scale image retrieval
2012, Information Processing and Management
Citation Excerpt :
Due to its generally expensive cost. Recent literature has focus on distributed data structures (Batko, Novak, Falchi, & Zezula, 2008; Batko et al., 2009) and approximate search (Amato & Savino, 2008; Esuli, 2009; Gennaro, Amato, Bolettieri, & Savino, 2010). Several approximate techniques have been studied to improve efficiency at the price of obtaining approximate result sets.
Feature-rich data, such as audio-video recordings, digital images, and results of scientific experiments, nowadays constitute the largest fraction of the massive data sets produced daily in the e-society. Content-based similarity search systems working on such data collections are rapidly growing in importance. Unfortunately, similarity search is in general very expensive and hardly scalable.
In this paper we study the case of content-based image retrieval (CBIR) systems, and focus on the problem of increasing the throughput of a large-scale CBIR system that indexes a very large collection of digital images. By analyzing the query log of a real CBIR system available on the Web, we characterize the behavior of users who experience a novel search paradigm, where content-based similarity queries and text-based ones can easily be interleaved. We show that locality and self-similarity is present even in the stream of queries submitted to such a CBIR system. According to these results, we propose an effective way to exploit this locality, by means of a similarity caching system, which stores the results of recently/frequently submitted queries and associated results. Unlike traditional caching, the proposed cache can manage not only exact hits, but also approximate ones that are solved by similarity with respect to the result sets of past queries present in the cache. We evaluate extensively the proposed solution by using the real query stream recorded in the log and a collection of 100 millions of digital photographs. The high hit ratios and small average approximation error figures obtained demonstrate the effectiveness of the approach.
Large-scale similarity data management with distributed Metric Index
2012, Information Processing and Management
Citation Excerpt :
Several distributed data structures for metric-based searching were proposed: GHT∗ (Batko, Gennaro, & Zezula, 2005), MCAN (Falchi, Gennaro, & Zezula, 2007), or M-Chord (Novak & Zezula, 2006). A detailed comparison of these techniques (Batko, Novak, Falchi, & Zezula, 2006, 2008) indicates that the M-Chord slightly outperforms the others under most conditions. There is also a distributed approximate algorithm for the M-Chord (Novak, Batko, & Zezula, 2008).
Metric space is a universal and versatile model of similarity that can be applied in various areas of non-text information retrieval. However, a general, efficient and scalable solution for metric data management is still a resisting research challenge. In this work, we try to make an important step towards such management system that would be able to scale to data collections of billions of objects. We propose a distributed index structure for similarity data management called the Metric Index (M-Index) which can answer queries in precise and approximate manner. This technique can take advantage of any distributed hash table that supports interval queries and utilize it as an underlying index. We have performed numerous experiments to test various settings of the M-Index structure and we have proved its usability by developing a full-featured publicly-available Web application.
Special Section: Scalable information systems
2009, Future Generation Computer Systems
Distance browsing in distributed multimedia databases
2009, Future Generation Computer Systems
Citation Excerpt :
All the presented performance characteristics of query processing have been taken as an average over 100 queries with randomly chosen query objects. To study scalability with respect to the number of objects, we limited the number of objects each node can maintain (the same has been done in [2,8,9,3,12,4]). When a node exceeds its space limit it splits by sending a subset of its objects to a free node that takes the responsibility for a part of the original region.
The state of the art of searching for non-text data (e.g., images) is to use extracted metadata annotations or text, which might be available as a related information. However, supporting real content-based audiovisual search, based on similarity search on features, is significantly more expensive than searching for text. Moreover, such search exhibits linear scalability with respect to the dataset size, so parallel query execution is needed.
In this paper, we present a Distributed Incremental Nearest Neighbor algorithm (DINN) for finding closest objects in an incremental fashion over data distributed among computer nodes, each able to perform its local Incremental Nearest Neighbor (local-INN) algorithm. We prove that our algorithm is optimum with respect to both the number of involved nodes and the number of local-INN invocations. An implementation of our DINN algorithm, on a real P2P system called MCAN, was used for conducting an extensive experimental evaluation on a real-life dataset.
The proposed algorithm is being used in two running projects: SAPIR and NeP4B.
An efficient peer-to-peer indexing tree structure for multidimensional data
2009, Future Generation Computer Systems
Citation Excerpt :
The first one is the dimension-reducing based method. [23,13,4] using Space Filling Curve to transform the multidimensional data into one-dimensional data, [24,25] make use of pivot points to do data transformation and [14] compares four kinds of designs which make use of dimension reducing techniques or pivot points. After that, a classic index will be used to index the data.
As one of the most important technologies for implementing large-scale distributed systems, peer-to-peer (P2P) computing has attracted much attention in both research and industrial communities, for its advantages such as high availability, high performance, and high flexibility to the dynamics of networks. However, multidimensional data indexing remains as a big challenge to P2P computing, because of the inefficiency in search and network maintenance caused by the complicated existing index structures, which greatly limits the scalability of applications and dimensionality of the data to be indexed.
We propose SDI (Swift tree structure for multidimensional Data Indexing), a swift index scheme with a simple tree structure for multidimensional data indexing in large-scale distributed systems. While keeping the query efficiency in $O (log N)$ in terms of routing hops, SDI has extremely low maintenance costs which is proved through theoretical analysis. Furthermore, SDI overcomes the root-bottleneck problem existing in most other tree-based distributed indexing systems. Extensive empirical study verifies the superiority of SDI in both query and maintenance performance.

View all citing articles on Scopus

Michal Batko is a research and development worker at the Faculty of Informatics, Masaryk University, Brno, Czech Republic where he obtained a Ph.D. Degree in computer science. His research activities are concentrated on the efficient searching in large distributed environments with emphasis on the scalability problem. He is focused especially on the problems of distributed similarity search using the metric space approach. As a software developer, he coordinates the development of an extensive similarity searching framework used for creating prototypes of several indexing techniques. Recently, he participated in the EU project called Search on Audio-visual content using Peer-to-Peer Information Retrieval.

David Novak graduated from the Faculty of Informatics, Masaryk University, Brno, Czech Republic with both Bachelor’s and Master’s Degree in computer science. At the moment, he is a doctoral student at the same faculty and he is working on his thesis on Similarity Search in the Distributed Environment, which is at the center of his professional interests. He participates in two EU projects — DELOS Network of Excellence on Digital Libraries and, recently, project SAPIR, which focuses on searching in multimedia data in the Peer-to-Peer environment.

Fabrizio Falchi is a joint Ph.D. student in Information Engineering at University of Pisa (Italy), and in Informatics at the Faculty of Informatics of Masaryk University of Brno (Czech Republic). In 2003 he received a Research Fellowship from the Networked Multimedia Information System Laboratory of the Information Science and Technologies Institute of the Italian CNR in Pisa, where, from 2006, he is a collaborator. His research interests include similarity search, distributed indexes, multimedia content management systems, content-based image retrieval, Peer-to-Peer systems.

Pavel Zezula is a professor of computer science at the Faculty of Informatics, Masaryk University, Brno, Czech Republic. His professional interests concentrate on storage structures and algorithms for content-based retrieval in non-traditional digital data types and formats, such as the similarity search and the exploration of XML structured data collections. He has a long history of co-operation with the CNR in Pisa, Italy, and has participated in numerous EU projects — the most recent project, SAPIR, explores the search on audio-visual content using Peer-to-Peer information retrieval. He has been a Program Committee Member of several prestigious international conferences (EDBT, VLDB, ACM SIGMOD, and CIKM). His research results appear in major journals such as ACM TOIS, ACM TODS and VLDB Journal, and in the proceedings of leading conferences such as ACM PODS, VLDB, and ACM SIGIR.

View full text

Scalability comparison of Peer-to-Peer similarity search structures

Abstract

Introduction

Section snippets

Background and related work

Distributed metric approaches

Evaluation of performance scalability

Conclusions

Acknowledgements

Information Processing Letters

Pattern Recognition Letters

Journal of Molecular Biology

Future Generation Computer Systems

Similarity grid for searching in metric spaces

A scalable content addressable network

Chord: A scalable Peer-to-Peer lookup service for internet applications

M-Chord: A scalable distributed similarity search structure

On scalability of the similarity search in the world of peers

LH*–A scalable, distributed data structure

ACM Transactions on Database Systems

Index-driven similarity search in metric spaces

ACM Transactions on Database Systems

Searching in metric spaces

ACM Computing Surveys

D-index: Distance searching index for metric data sets

Multimedia Tools and Applications

P-Grid: A self-organizing access structure for P2P information systems

Lecture Notes in Computer Science