skip to main content
research-article

A term-based inverted index partitioning model for efficient distributed query processing

Published:30 September 2013Publication History
Skip Abstract Section

Abstract

In a shared-nothing, distributed text retrieval system, queries are processed over an inverted index that is partitioned among a number of index servers. In practice, the index is either document-based or term-based partitioned. This choice is made depending on the properties of the underlying hardware infrastructure, query traffic distribution, and some performance and availability constraints. In query processing on retrieval systems that adopt a term-based index partitioning strategy, the high communication overhead due to the transfer of large amounts of data from the index servers forms a major performance bottleneck, deteriorating the scalability of the entire distributed retrieval system. In this work, to alleviate this problem, we propose a novel inverted index partitioning model that relies on hypergraph partitioning. In the proposed model, concurrently accessed index entries are assigned to the same index servers, based on the inverted index access patterns extracted from the past query logs. The model aims to minimize the communication overhead that will be incurred by future queries while maintaining the computational load balance among the index servers. We evaluate the performance of the proposed model through extensive experiments using a real-life text collection and a search query sample. Our results show that considerable performance gains can be achieved relative to the term-based index partitioning strategies previously proposed in literature. In most cases, however, the performance remains inferior to that attained by document-based partitioning.

References

  1. Alpert, C. J. and Kahng, A. B. 1995. Recent directions in netlist partitioning: a survey. Integration VLSI J. 19, 1--2, 1--81. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Aykanat, C., Cambazoglu, B. B., and Uçar, B. 2008. Multi-level direct K-way hypergraph partitioning with multiple constraints and fixed vertices. J. Parallel Distrib. Comput. 68, 5, 609--625. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Badue, C., Ribeiro-Neto, B., Baeza-Yates, R., and Ziviani, N. 2001. Distributed query processing using partitioned inverted files. In Proceedings of the 8th International Symposium on String Processing and Information Retrieval. 10--20.Google ScholarGoogle Scholar
  4. Badue, C. S., Baeza-Yates, R., Ribeiro-Neto, B., Ziviani, A., and Ziviani, N. 2007. Analyzing imbalance among homogeneous index servers in a web search system. Inf. Process. Manage. 43, 3, 592--608. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Barroso, L. A., Dean, J., and Hölzle, U. 2003. Web search for a planet: the Google cluster architecture. IEEE Micro 23, 2, 22--28. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Berge, C. 1985. Graphs and Hypergraphs. Elsevier Science Ltd. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Cambazoglu, B. B. 2006. Models and algorithms for parallel text retrieval. Ph.D. dissertation. Department of Computer Engineering, Bilkent University.Google ScholarGoogle Scholar
  8. Cambazoglu, B. B. and Aykanat, C. 2006. A term-based inverted index organization for communication-efficient parallel query processing. In Proceedings of the IFIP International Conference on Network and Parallel Computing. 104--109.Google ScholarGoogle Scholar
  9. Cambazoglu, B. B. and Baeza-Yates, R. 2011. Scalability challenges in web search engines. In Advanced Topics in Information Retrieval, Information Retrieval Series, vol. 33, Springer, 27--50.Google ScholarGoogle Scholar
  10. Cambazoglu, B. B., Catal, A., and Aykanat, C. 2006. Effect of inverted index partitioning schemes on performance of query processing in parallel text retrieval systems. In Proceedings of the 21st International Conference on Computer and Information Sciences. 717--725. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Cambazoglu, B. B., Junqueira, F. P., Plachouras, V., Banachowski, S., Cui, B., Lim, S., and Bridge, B. 2010. A refreshing perspective of search engine caching. In Proceedings of the 19th International Conference on World Wide Web. 181--190. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Catalyurek, U. and Aykanat, C. 1999. Hypergraph-partitioning-based decomposition for parallel sparsematrix vector multiplication. IEEE Trans. Parallel Distrib. Systems 10, 7, 673--693. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Gan, Q. and Suel, T. 2009. Improved techniques for result caching in web search engines. In Proceedings of the 18th International Conference on World Wide Web. 431--440. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Jeong, B.-S. and Omiecinski, E. 1995. Inverted file partitioning schemes in multiple disk systems. IEEE Trans. Parallel Distrib. Systems 6, 2, 142--153. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Jonassen, S. and Bratsberg, S. E. 2009. Impact of the query model and system settings on performance of distributed inverted indexes. In Proceedings of the Norsk Informatikkonferance. 143--154.Google ScholarGoogle Scholar
  16. Jonassen, S. and Bratsberg, S. E. 2010. A combined semi-pipelined query processing architecture for distributed full-text retrieval. In Proceedings of the 11th International Conference on Web Information Systems Engineering. 587--601. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Jonassen, S. and Bratsberg, S. E. 2012a. Improving the performance of pipelined query processing with skipping. In Proceedings of the 13th International Conference on Web Information Systems Engineering. 1--15. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Jonassen, S. and Bratsberg, S. E. 2012b. Intra-query concurrent pipelined processing for distributed full-text retrieval. In Proceedings of the 34th European Conference on Advances in Information Retrieval. 413--425. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Karypis, G. and Kumar, V. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20, 1, 359--392. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Kucukyilmaz, T., Turk, A., and Aykanat, C. 2012. A parallel framework for in-memory construction of term-partitioned inverted indexes. Comput. J. 55, 11, 1317--1330. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Li, J., Loo, B., Hellerstein, J., Kaashoek, F., Karger, D., and Morris, R. 2003. On the feasibility of peer-to-peer web indexing and search. In Proceedings of the 2nd International Workshop on Peer-to-Peer Systems. 207--215.Google ScholarGoogle Scholar
  22. Lucchese, C., Orlando, S., Perego, R., and Silvestri, F. 2007. Mining query logs to optimize index partitioning in parallel web search engines. In Proceedings of the 2nd International Conference on Scalable Information Systems. 43:1--43:9. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Ma, Y.-C., Chen, T.-F., and Chung, C.-P. 2002. Posting file partitioning and parallel information retrieval. J. Syst. Soft. 63, 2, 113--127.Google ScholarGoogle ScholarCross RefCross Ref
  24. Ma, Y.-C., Chung, C.-P., and Chen, T.-F. 2011. Load and storage balanced posting file partitioning for parallel information retrieval. J. Syst. Soft. 84, 5, 864--884. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. MacFarlane, A., McCann, J. A., and Robertson, S. E. 2000. Parallel search using partitioned inverted files. In Proceedings of the 7th International Symposium on String Processing and Information Retrieval. 209--220. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Moffat, A., Webber, W., and Zobel, J. 2006. Load balancing for term-distributed parallel retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 348--355. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Moffat, A., Webber, W., Zobel, J., and Baeza-Yates, R. 2007. A pipelined architecture for distributed text query evaluation. Inf. Retriev. 10, 3, 205--231. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Ribeiro-Neto, B. A. and Barbosa, R. A. 1998. Query performance for tightly coupled distributed digital libraries. In Proceedings of the 3rd ACM Conference on Digital Libraries. 182--190. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Ribeiro-Neto, B. A., Kitajima, J. P., Navarro, G., Sant'Ana, C. R. G., and Ziviani, N. 1998. Parallel generation of inverted files for distributed text collections. In Proceedings of the 18th International Conference of the Chilean Society of Computer Science. 149--157. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Selvitopi, R. O., Turk, A., and Aykanat, C. 2012. Replicated partitioning for undirected hypergraphs. J. Parallel and Distrib. Comput. 72, 4, 547--563. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Suel, T., Mathur, C., Wu, J., Zhang, J., Delis, A., Kharrazi, M., Long, X., and Shanmugasundaram, K. 2003. ODISSEA: A peer-to-peer architecture for scalable web search and information retrieval. In Proceedings of the International Workshop on the Web and Databases.Google ScholarGoogle Scholar
  32. Tomasic, A. and Garcia-Molina, H. 1993. Performance of inverted indices in shared-nothing distributed text document information retrieval systems. In Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems. 8--17. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Webber, W. 2007. Design and evaluation of a pipelined distributed information retrieval architecture. Master's thesis. University of Melbourne.Google ScholarGoogle Scholar
  34. Yan, H., Ding, S., and Suel, T. 2009. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web. 401--410. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Zhang, J. and Suel, T. 2005. Efficient query evaluation on large textual collections in a peer-to-peer environment. In Proceedings of the 5th IEEE International Conference on Peer-to-Peer Computing. 225--233. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Zhang, J. and Suel, T. 2007. Optimized inverted list assignment in distributed search engine architectures. In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium. 1--10.Google ScholarGoogle Scholar
  37. Zobel, J. and Moffat, A. 2006. Inverted files for text search engines. ACM Comput. Surv. 38, 2, Article 6. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. A term-based inverted index partitioning model for efficient distributed query processing

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on the Web
      ACM Transactions on the Web  Volume 7, Issue 3
      September 2013
      149 pages
      ISSN:1559-1131
      EISSN:1559-114X
      DOI:10.1145/2516633
      Issue’s Table of Contents

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 30 September 2013
      • Accepted: 1 March 2013
      • Revised: 1 August 2012
      • Received: 1 January 2012
      Published in tweb Volume 7, Issue 3

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader