Scalability comparison of Peer-to-Peer similarity search structures

https://doi.org/10.1016/j.future.2007.07.012Get rights and content

Abstract

Due to the increasing complexity of current digital data, similarity search has become a fundamental computational task in many applications. Unfortunately, its costs are still high and grow linearly on single server structures, which prevents them from efficient application on large data volumes. In this paper, we shortly describe four recent scalable distributed techniques for similarity search and study their performance in executing queries on three different datasets. Though all the methods employ parallelism to speed up query execution, different advantages for different objectives have been identified by experiments. The reported results would be helpful for choosing the best implementations for specific applications. They can also be used for designing new and better indexing structures in the future.

Introduction

Efficient lookup for specific keywords in a dictionary containing hundreds of thousands of words or locating specific records in millions of bank accounts are quite easy tasks for present-day computers. Since records in such domains can be sorted, and every record either fully satisfies the search condition or it does not at all, hashing or tree-like structures can be applied as indexes. High scalability of such technologies is guaranteed by logarithmically bounded search time with respect to the size of the file.

However, to find images of sport cars, time series with similar development, or groups of customers with common buying patterns in respective data collections, the traditional technologies simply fail. Here, the required comparison is gradual, rather than binary, because, once a reference (query) pattern is given, each instance in a search file and the pattern are in certain relation measured by a user-defined dissimilarity function. Such a searching is more and more important for a variety of current complex digital data collections and is generally designated as the similarity search.

Although many similarity search approaches have been proposed, the most generic one considers the mathematical metric space as a suitable abstraction of similarity [1]. The simple but powerful concept of the metric space consists of a domain of objects and a distance function that measures proximity of pairs of objects. It can be applied not only to multi-dimensional vector spaces, but also to different forms of string objects, as well as to sets or groups of various nature, etc. Although many index structures have been proposed, the similarity search is inherently expensive. Besides, the linear increase of the search time prevents them from application to huge files that have become common and the predictions expect a continuous rapid growth.

Very recently, we have proposed four scalable distributed similarity search structures for metric data. The first two structures adopt the basic ball and generalized hyperplane partitioning principles [2] and they are called the VPT and the GHT, respectively [3]. The other two apply transformation strategies — the metric similarity search problem is transformed into a series of range queries executed on existing distributed keyword structures, namely the CAN [4] and the Chord [5]. By analogy, we call them the MCAN [6] and the M-Chord [7]. Each of the structures is able to execute similarity queries for any metric dataset, and they all exploit parallelism for query execution. However, due to the completely different underlying principles, an important question arises: What are the advantages and disadvantages of the individual approaches in terms of search costs and scalability for different real-life search problems?

In this paper, we report on implementations of the VPT, GHT, MCAN and M-Chord systems over the same infrastructure of peer computers. We have conducted numerous experiments on three different datasets and present the most important findings. We focus on scalability with respect to the size of the query, the size of the dataset, and the number of queries executed simultaneously. The results reported in this paper have been presented on the INFOSCALE ’06 conference [8].

The rest of the paper is organized as follows. The necessary background and related work are reported in Section 2. A brief specification of our four indexing techniques is available in Section 3, while the assumptions and results of our experiments can be found in Section 4. The paper concludes in Section 5.

Section snippets

Background and related work

In this section, we provide a theoretical background for the similarity search. Then we mention the distributed paradigm adopted by all presented solutions. We also give a brief related work survey in the end of this section.

Distributed metric approaches

This section contains short descriptions of four different distributed structures for indexing and similarity search in the metric data. The first two, GHT and VPT, are native metric index structures whereas the other two, MCAN and M-Chord, transform the metric search issue into a different problem and take advantage of some existing solutions. Each description consists of a main idea of the particular approach, a basic architecture of the system, and a schema of algorithms for the Range

Evaluation of performance scalability

In this section, we provide comparison of the presented approaches by confronting results of extensive experiments. For each data structure, the tests have been conducted on the same datasets and in the same test environment. Moreover, all the structures have been implemented over the very same infrastructure sharing a lower-level code. Due to these facts, we consider results of the experiments fairly comparable.

When designing the experiments, we have focused mainly on various aspects of the

Conclusions

In this paper, we have studied performance of four different distributed index structures for metric spaces, namely the GHT, the VPT, the MCAN and the M-Chord. We have focused on their scalability of executing similarity queries from three different points of view: (1) changing query radii, (2) growing volume of searched data, and (3) accumulating number of concurrent queries. We have conducted a vast bulk of experiments and reported the most interesting findings in special sections of this

Acknowledgements

This research has been partially supported by the national research project No. 1ET100300419 and partially by the Czech Science Foundation grant No. 102/05/H050.

Michal Batko is a research and development worker at the Faculty of Informatics, Masaryk University, Brno, Czech Republic where he obtained a Ph.D. Degree in computer science. His research activities are concentrated on the efficient searching in large distributed environments with emphasis on the scalability problem. He is focused especially on the problems of distributed similarity search using the metric space approach. As a software developer, he coordinates the development of an extensive

References (33)

  • M. Batko et al.

    On scalability of the similarity search in the world of peers

  • W. Litwin et al.

    LH*–A scalable, distributed data structure

    ACM Transactions on Database Systems

    (1996)
  • G.R. Hjaltason et al.

    Index-driven similarity search in metric spaces

    ACM Transactions on Database Systems

    (2003)
  • E. Chávez et al.

    Searching in metric spaces

    ACM Computing Surveys

    (2001)
  • V. Dohnal et al.

    D-index: Distance searching index for metric data sets

    Multimedia Tools and Applications

    (2003)
  • K. Aberer

    P-Grid: A self-organizing access structure for P2P information systems

    Lecture Notes in Computer Science

    (2001)
  • Cited by (24)

    • Modelling efficient novelty-based search result diversification in metric spaces

      2013, Journal of Discrete Algorithms
      Citation Excerpt :

      Metric spaces have been widely studied to perform sequentially searches for objects which are similar to a given query object (e.g. [39,42,43]). Additionally, some works have used metric spaces in other settings like distributed environments [11,31,48] and peer-to-peer (P2P) systems [7,15,37]. Additional work on permutation-based algorithms was described by Skala [54], who evaluated the number of possible permutations that can occur in a given space.

    • Similarity caching in large-scale image retrieval

      2012, Information Processing and Management
      Citation Excerpt :

      Due to its generally expensive cost. Recent literature has focus on distributed data structures (Batko, Novak, Falchi, & Zezula, 2008; Batko et al., 2009) and approximate search (Amato & Savino, 2008; Esuli, 2009; Gennaro, Amato, Bolettieri, & Savino, 2010). Several approximate techniques have been studied to improve efficiency at the price of obtaining approximate result sets.

    • Large-scale similarity data management with distributed Metric Index

      2012, Information Processing and Management
      Citation Excerpt :

      Several distributed data structures for metric-based searching were proposed: GHT∗ (Batko, Gennaro, & Zezula, 2005), MCAN (Falchi, Gennaro, & Zezula, 2007), or M-Chord (Novak & Zezula, 2006). A detailed comparison of these techniques (Batko, Novak, Falchi, & Zezula, 2006, 2008) indicates that the M-Chord slightly outperforms the others under most conditions. There is also a distributed approximate algorithm for the M-Chord (Novak, Batko, & Zezula, 2008).

    • Special Section: Scalable information systems

      2009, Future Generation Computer Systems
    • Distance browsing in distributed multimedia databases

      2009, Future Generation Computer Systems
      Citation Excerpt :

      All the presented performance characteristics of query processing have been taken as an average over 100 queries with randomly chosen query objects. To study scalability with respect to the number of objects, we limited the number of objects each node can maintain (the same has been done in [2,8,9,3,12,4]). When a node exceeds its space limit it splits by sending a subset of its objects to a free node that takes the responsibility for a part of the original region.

    • An efficient peer-to-peer indexing tree structure for multidimensional data

      2009, Future Generation Computer Systems
      Citation Excerpt :

      The first one is the dimension-reducing based method. [23,13,4] using Space Filling Curve to transform the multidimensional data into one-dimensional data, [24,25] make use of pivot points to do data transformation and [14] compares four kinds of designs which make use of dimension reducing techniques or pivot points. After that, a classic index will be used to index the data.

    View all citing articles on Scopus

    Michal Batko is a research and development worker at the Faculty of Informatics, Masaryk University, Brno, Czech Republic where he obtained a Ph.D. Degree in computer science. His research activities are concentrated on the efficient searching in large distributed environments with emphasis on the scalability problem. He is focused especially on the problems of distributed similarity search using the metric space approach. As a software developer, he coordinates the development of an extensive similarity searching framework used for creating prototypes of several indexing techniques. Recently, he participated in the EU project called Search on Audio-visual content using Peer-to-Peer Information Retrieval.

    David Novak graduated from the Faculty of Informatics, Masaryk University, Brno, Czech Republic with both Bachelor’s and Master’s Degree in computer science. At the moment, he is a doctoral student at the same faculty and he is working on his thesis on Similarity Search in the Distributed Environment, which is at the center of his professional interests. He participates in two EU projects — DELOS Network of Excellence on Digital Libraries and, recently, project SAPIR, which focuses on searching in multimedia data in the Peer-to-Peer environment.

    Fabrizio Falchi is a joint Ph.D. student in Information Engineering at University of Pisa (Italy), and in Informatics at the Faculty of Informatics of Masaryk University of Brno (Czech Republic). In 2003 he received a Research Fellowship from the Networked Multimedia Information System Laboratory of the Information Science and Technologies Institute of the Italian CNR in Pisa, where, from 2006, he is a collaborator. His research interests include similarity search, distributed indexes, multimedia content management systems, content-based image retrieval, Peer-to-Peer systems.

    Pavel Zezula is a professor of computer science at the Faculty of Informatics, Masaryk University, Brno, Czech Republic. His professional interests concentrate on storage structures and algorithms for content-based retrieval in non-traditional digital data types and formats, such as the similarity search and the exploration of XML structured data collections. He has a long history of co-operation with the CNR in Pisa, Italy, and has participated in numerous EU projects — the most recent project, SAPIR, explores the search on audio-visual content using Peer-to-Peer information retrieval. He has been a Program Committee Member of several prestigious international conferences (EDBT, VLDB, ACM SIGMOD, and CIKM). His research results appear in major journals such as ACM TOIS, ACM TODS and VLDB Journal, and in the proceedings of leading conferences such as ACM PODS, VLDB, and ACM SIGIR.

    View full text