skip to main content
10.1145/1031171.1031236acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
Article

SWAM: a family of access methods for similarity-search in peer-to-peer data networks

Published:13 November 2004Publication History

ABSTRACT

Peer-to-peer Data Networks (PDNs) are large-scale, self-organizing, distributed query processing systems. Familiar examples of PDN are peer-to-peer file-sharing networks, which support exact-match search queries to locate user-requested files. In this paper, we formalize the more general problem of <i>similarity-search</i> in PDNs, and propose a <i>family</i> of distributed access methods, termed <i>Small-World Access Methods (SWAM)</i>, for efficient execution of various similarity-search queries, namely exact-match, range, and k-nearest-neighbor queries. Unlike its predecessors, i.e., LH* and DHTs, SWAM does not control the assignment of data objects to PDN nodes; each node autonomously stores its own data. Besides, SWAM supports all similarity-search queries on multiple attributes. SWAM guarantees that the query object will be found (if it exists in the network) in average time logarithmically proportional to the network size. Moreover, once the query object is found, all the similar objects would be in its proximate network neighborhood and hence enabling efficient range and k-nearest-neighbor queries.

As a specific instance of SWAM, we propose <i>SWAM-V</i>, a Voronoi-based SWAM that indexes PDNs with multi-attribute data objects. For a PDN with <i>N</i> nodes SWAM-V has query time, communication cost, and computation cost of <i>O</i>(log <i>N</i>) for exact-match queries, and <i>O</i>(log <i>N</i> + <b>s</b><i>N</i>) and <i>O</i>(log <i>N</i> + <b>k</b>) for range queries (with selectivity <b>s</b>) and <b>k</b>NN queries, respectively. Our experiments show that SWAM-V consistently outperforms a similarity-search enabled version of CAN in query time and communication cost by a factor of 2 to 3.

References

  1. K. Aberer, P. Cudré-Mauroux, A. Datta, Z. Despotovic, M. Hauswirth, M. Punceva, and R. Schmidt. P-grid: A self-organizing structured p2p system. SIGMOD Record, 32(2), 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. F. Banaei-Kashani and C. Shahabi. Efficient flooding in power-law networks. In Proceedings of Twenty-Second ACM Symposium on Principles of Distributed Computing (PODC'03), July 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. F. Banaei-Kashani and C. Shahabi. Searchable querical data networks. In Proceedings of the International Workshop on Databases, Information Systems and Peer-to-Peer Computing in conjunction with VLDB'03, September 2003.Google ScholarGoogle Scholar
  4. M. Bawa, G. Manku, and P. Raghavan. SETS: Search enhanced by topic-segmentation. In Proceedings of the 26th Annual International Conference on Research and Development in Informaion Retrieval (SIGIR'03), August 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. B. Bollobás. Random Graphs. Academic Press, New York, 1985.Google ScholarGoogle Scholar
  6. S. Brin. Near neighbor search in large metric spaces. In Proceedings of the 21th International Conferenceon Very Large Data Bases (VLDB'95), September 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. E. Chavez, G. Navarro, R. A. Baeza-Yates, and J. L. Marroquin. Searching in metric spaces. ACM Computing Surveys, 33(3), 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Doan, P. Domingos, and A. Halevy. Reconciling schemas of disparate data sources: A machine-learning approach. In Proceedings of ACM International Conference on Management of Data (SIGMOD'01), November 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Freenet. Freenet project, 2004. http://freenet.sourceforge.net/.Google ScholarGoogle Scholar
  10. V. Gaede and O. Günther. Multidimensional access methods. ACM Computing Surveys, 30(2), 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A. Gupta, D. Agrawal, and A. El Abbadi. Approximate range selection queries in peer-to-peer systems. In Proceedings of the First Biennial Conference on Innovative Data Systems Research, January 2003.Google ScholarGoogle Scholar
  12. R. Huebsch, N. Lanham, B. Loo, J. Hellerstein, S. Shenker, and I. Stoica. Querying the inernet withPIER. In Proceedings of 29th International Conference on Very Large Data Bases (VLDB'03), September 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. KaZaA. Sharman networks, 2004. http://www.kazaa.com/.Google ScholarGoogle Scholar
  14. J. Kleinberg. The small-world phenomenon: an algorithmic perspective. In Proceedings of the 32nd ACM Symposium on Theory of Computing, May 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. W. Litwin, M. Neimat, and D. Schneider. LH*:A scalable, distributed data structure. ACM Transactions on Database Systems, 21(4), 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. G. Navarro. Searching in metric spaces by spatial approximation. The Very Large Databases Journal (VLDBJ), 11(1), 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. A. Okabe, B. Boots, K. Sugihara, and S. Chiu. Spatial Tessellations: Concepts and Applications of Voronoi Diagrams. John Wiley, 2nd edition, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Proceedings of ACM SIGCOMM '01, August 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. R. Seidel. Exact upper bounds for the number of faces in d-dimensional Voronoi diagrams, DIMACS Series, volume~4. American Mathematical Society, 1991.Google ScholarGoogle Scholar
  20. SETI@home, 2004. http://setiathome.ssl.berkeley.edu//.Google ScholarGoogle Scholar
  21. I. Stoica, R. Morris, D. Karger, M. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of ACM SIGCOMM '01, August 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. D. Watts and S. Strogatz. Collective dynamics of small world networks. Nature, (393):440--442, 1998.Google ScholarGoogle Scholar
  23. B. Yang and H. Garcia-Molina. Designing a super-peer network. In Proceedings of the 19th International Conference on Data Engineering (ICDE'03), March 2003.Google ScholarGoogle Scholar

Index Terms

  1. SWAM: a family of access methods for similarity-search in peer-to-peer data networks

                Recommendations

                Reviews

                Maytham Hassan Safar

                Banaei-Kashani and Shahabi describe querical data networks (QDNs), which are large-scale, self-organizing, distributed query processing systems. As an example, they present a peer-to-peer network. The data is distributed among the nodes, and each node knows its data, plus the data of some of the nodes that it is connected to. To keep the network scalable, there are only a few connections made from a node to other nodes. The authors provide a method to perform typical similarity queries on peer-to-peer networks. In addition, they provide a mechanism to support different query types, such as range, and k-nn queries. The authors' solution relies on already existing data partitioning techniques in the value space, where each data item is considered to exist only once. The space of all the data items in the network is decomposed once hierarchically, and once using Voronoi diagram partitioning techniques. Then, a grid of connections is created that consists of some random connections between nodes, and connections with other nodes that are considered as neighbors of a node by the decomposition method. The two components are overlaid, and the resulting grid forms a connection network of the "virtual" nodes of data items. Finally, a reverse mapping, from each data item in this virtual space to the actual storage nodes, is performed, and the resulting network is presented. Queries are shown to be exactly answerable using the resulting network. The authors fail to describe the applications that would use their technique with examples of real queries. Instead, only abstract configurations are mentioned in their experimental results. The curse of dimensionality issue, which implies that, in high-dimensional spaces, there is no useful locality to exploit, is not discussed. This locality is crucial to the proposed method. The authors should more carefully investigate what happens to the communication cost as the dimensionality of their space increases. The authors assume exactly one item per node, and provided an argument showing how the general case can be transformed to this restricted case. By doing this transformation, and making the data size proportional to the number of nodes in the system, one loses track of data dependence. The authors imply that the complexity of their method is linear in the data set size, regardless of the number of dimensions of the data, which is a strange-sounding result. Overall, this is a good work, except for the experimental setup. Online Computing Reviews Service

                Access critical reviews of Computing literature here

                Become a reviewer for Computing Reviews.

                Comments

                Login options

                Check if you have access through your login credentials or your institution to get full access on this article.

                Sign in
                • Published in

                  cover image ACM Conferences
                  CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management
                  November 2004
                  678 pages
                  ISBN:1581138741
                  DOI:10.1145/1031171

                  Copyright © 2004 ACM

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States

                  Publication History

                  • Published: 13 November 2004

                  Permissions

                  Request permissions about this article.

                  Request Permissions

                  Check for updates

                  Qualifiers

                  • Article

                  Acceptance Rates

                  Overall Acceptance Rate1,861of8,427submissions,22%

                  Upcoming Conference

                PDF Format

                View or Download as a PDF file.

                PDF

                eReader

                View online with eReader.

                eReader