research-article

A term-based inverted index partitioning model for efficient distributed query processing

Authors:
B. Barla Cambazoglu

Yahoo Labs

Yahoo Labs
View Profile

,
Enver Kayaaslan

Yahoo Labs

Yahoo Labs
View Profile

,
Simon Jonassen

Yahoo Labs

Yahoo Labs
View Profile

,
Cevdet Aykanat

Bilkent University

Bilkent University
View Profile

Authors Info & Claims

ACM Transactions on the Web Volume 7 Issue 3Article No.: 15pp 1–23https://doi.org/10.1145/2516633.2516637

Published:30 September 2013Publication History

ACM Transactions on the Web

Abstract

In a shared-nothing, distributed text retrieval system, queries are processed over an inverted index that is partitioned among a number of index servers. In practice, the index is either document-based or term-based partitioned. This choice is made depending on the properties of the underlying hardware infrastructure, query traffic distribution, and some performance and availability constraints. In query processing on retrieval systems that adopt a term-based index partitioning strategy, the high communication overhead due to the transfer of large amounts of data from the index servers forms a major performance bottleneck, deteriorating the scalability of the entire distributed retrieval system. In this work, to alleviate this problem, we propose a novel inverted index partitioning model that relies on hypergraph partitioning. In the proposed model, concurrently accessed index entries are assigned to the same index servers, based on the inverted index access patterns extracted from the past query logs. The model aims to minimize the communication overhead that will be incurred by future queries while maintaining the computational load balance among the index servers. We evaluate the performance of the proposed model through extensive experiments using a real-life text collection and a search query sample. Our results show that considerable performance gains can be achieved relative to the term-based index partitioning strategies previously proposed in literature. In most cases, however, the performance remains inferior to that attained by document-based partitioning.

References

Alpert, C. J. and Kahng, A. B. 1995. Recent directions in netlist partitioning: a survey. Integration VLSI J. 19, 1--2, 1--81. Google ScholarDigital Library
Aykanat, C., Cambazoglu, B. B., and Uçar, B. 2008. Multi-level direct K-way hypergraph partitioning with multiple constraints and fixed vertices. J. Parallel Distrib. Comput. 68, 5, 609--625. Google ScholarDigital Library
Badue, C., Ribeiro-Neto, B., Baeza-Yates, R., and Ziviani, N. 2001. Distributed query processing using partitioned inverted files. In Proceedings of the 8th International Symposium on String Processing and Information Retrieval. 10--20.Google Scholar
Badue, C. S., Baeza-Yates, R., Ribeiro-Neto, B., Ziviani, A., and Ziviani, N. 2007. Analyzing imbalance among homogeneous index servers in a web search system. Inf. Process. Manage. 43, 3, 592--608. Google ScholarDigital Library
Barroso, L. A., Dean, J., and Hölzle, U. 2003. Web search for a planet: the Google cluster architecture. IEEE Micro 23, 2, 22--28. Google ScholarDigital Library
Berge, C. 1985. Graphs and Hypergraphs. Elsevier Science Ltd. Google ScholarDigital Library
Cambazoglu, B. B. 2006. Models and algorithms for parallel text retrieval. Ph.D. dissertation. Department of Computer Engineering, Bilkent University.Google Scholar
Cambazoglu, B. B. and Aykanat, C. 2006. A term-based inverted index organization for communication-efficient parallel query processing. In Proceedings of the IFIP International Conference on Network and Parallel Computing. 104--109.Google Scholar
Cambazoglu, B. B. and Baeza-Yates, R. 2011. Scalability challenges in web search engines. In Advanced Topics in Information Retrieval, Information Retrieval Series, vol. 33, Springer, 27--50.Google Scholar
Cambazoglu, B. B., Catal, A., and Aykanat, C. 2006. Effect of inverted index partitioning schemes on performance of query processing in parallel text retrieval systems. In Proceedings of the 21st International Conference on Computer and Information Sciences. 717--725. Google ScholarDigital Library
Cambazoglu, B. B., Junqueira, F. P., Plachouras, V., Banachowski, S., Cui, B., Lim, S., and Bridge, B. 2010. A refreshing perspective of search engine caching. In Proceedings of the 19th International Conference on World Wide Web. 181--190. Google ScholarDigital Library
Catalyurek, U. and Aykanat, C. 1999. Hypergraph-partitioning-based decomposition for parallel sparsematrix vector multiplication. IEEE Trans. Parallel Distrib. Systems 10, 7, 673--693. Google ScholarDigital Library
Gan, Q. and Suel, T. 2009. Improved techniques for result caching in web search engines. In Proceedings of the 18th International Conference on World Wide Web. 431--440. Google ScholarDigital Library
Jeong, B.-S. and Omiecinski, E. 1995. Inverted file partitioning schemes in multiple disk systems. IEEE Trans. Parallel Distrib. Systems 6, 2, 142--153. Google ScholarDigital Library
Jonassen, S. and Bratsberg, S. E. 2009. Impact of the query model and system settings on performance of distributed inverted indexes. In Proceedings of the Norsk Informatikkonferance. 143--154.Google Scholar
Jonassen, S. and Bratsberg, S. E. 2010. A combined semi-pipelined query processing architecture for distributed full-text retrieval. In Proceedings of the 11th International Conference on Web Information Systems Engineering. 587--601. Google ScholarDigital Library
Jonassen, S. and Bratsberg, S. E. 2012a. Improving the performance of pipelined query processing with skipping. In Proceedings of the 13th International Conference on Web Information Systems Engineering. 1--15. Google ScholarDigital Library
Jonassen, S. and Bratsberg, S. E. 2012b. Intra-query concurrent pipelined processing for distributed full-text retrieval. In Proceedings of the 34th European Conference on Advances in Information Retrieval. 413--425. Google ScholarDigital Library
Karypis, G. and Kumar, V. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20, 1, 359--392. Google ScholarDigital Library
Kucukyilmaz, T., Turk, A., and Aykanat, C. 2012. A parallel framework for in-memory construction of term-partitioned inverted indexes. Comput. J. 55, 11, 1317--1330. Google ScholarDigital Library
Li, J., Loo, B., Hellerstein, J., Kaashoek, F., Karger, D., and Morris, R. 2003. On the feasibility of peer-to-peer web indexing and search. In Proceedings of the 2nd International Workshop on Peer-to-Peer Systems. 207--215.Google Scholar
Lucchese, C., Orlando, S., Perego, R., and Silvestri, F. 2007. Mining query logs to optimize index partitioning in parallel web search engines. In Proceedings of the 2nd International Conference on Scalable Information Systems. 43:1--43:9. Google ScholarDigital Library
Ma, Y.-C., Chen, T.-F., and Chung, C.-P. 2002. Posting file partitioning and parallel information retrieval. J. Syst. Soft. 63, 2, 113--127.Google ScholarCross Ref
Ma, Y.-C., Chung, C.-P., and Chen, T.-F. 2011. Load and storage balanced posting file partitioning for parallel information retrieval. J. Syst. Soft. 84, 5, 864--884. Google ScholarDigital Library
MacFarlane, A., McCann, J. A., and Robertson, S. E. 2000. Parallel search using partitioned inverted files. In Proceedings of the 7th International Symposium on String Processing and Information Retrieval. 209--220. Google ScholarDigital Library
Moffat, A., Webber, W., and Zobel, J. 2006. Load balancing for term-distributed parallel retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 348--355. Google ScholarDigital Library
Moffat, A., Webber, W., Zobel, J., and Baeza-Yates, R. 2007. A pipelined architecture for distributed text query evaluation. Inf. Retriev. 10, 3, 205--231. Google ScholarDigital Library
Ribeiro-Neto, B. A. and Barbosa, R. A. 1998. Query performance for tightly coupled distributed digital libraries. In Proceedings of the 3rd ACM Conference on Digital Libraries. 182--190. Google ScholarDigital Library
Ribeiro-Neto, B. A., Kitajima, J. P., Navarro, G., Sant'Ana, C. R. G., and Ziviani, N. 1998. Parallel generation of inverted files for distributed text collections. In Proceedings of the 18th International Conference of the Chilean Society of Computer Science. 149--157. Google ScholarDigital Library
Selvitopi, R. O., Turk, A., and Aykanat, C. 2012. Replicated partitioning for undirected hypergraphs. J. Parallel and Distrib. Comput. 72, 4, 547--563. Google ScholarDigital Library
Suel, T., Mathur, C., Wu, J., Zhang, J., Delis, A., Kharrazi, M., Long, X., and Shanmugasundaram, K. 2003. ODISSEA: A peer-to-peer architecture for scalable web search and information retrieval. In Proceedings of the International Workshop on the Web and Databases.Google Scholar
Tomasic, A. and Garcia-Molina, H. 1993. Performance of inverted indices in shared-nothing distributed text document information retrieval systems. In Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems. 8--17. Google ScholarDigital Library
Webber, W. 2007. Design and evaluation of a pipelined distributed information retrieval architecture. Master's thesis. University of Melbourne.Google Scholar
Yan, H., Ding, S., and Suel, T. 2009. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web. 401--410. Google ScholarDigital Library
Zhang, J. and Suel, T. 2005. Efficient query evaluation on large textual collections in a peer-to-peer environment. In Proceedings of the 5th IEEE International Conference on Peer-to-Peer Computing. 225--233. Google ScholarDigital Library
Zhang, J. and Suel, T. 2007. Optimized inverted list assignment in distributed search engine architectures. In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium. 1--10.Google Scholar
Zobel, J. and Moffat, A. 2006. Inverted files for text search engines. ACM Comput. Surv. 38, 2, Article 6. Google ScholarDigital Library

Index Terms

A term-based inverted index partitioning model for efficient distributed query processing
1. Information systems
  1. Information retrieval

Recommendations

The Effect of Index Partitioning Schemes on the Performance of Distributed Query Processing

An indexing scheme called partitioned global indexes (PGI) for a locally distributed database system is presented. The scheme builds a global index for the entire relation and partitions the index across the sites. A strategy for processing such an ...
Read More
Scalable nearest neighbor query processing based on Inverted Grid Index

With the increasing availability of Location-Based Services (LBS) and mobile internet, the amount of spatial data is growing larger. It poses new requirements and challenges for distributed index and query processing on large scale spatial data. A ...
Read More
Inverted Grid-Based kNN Query Processing with MapReduce
CHINAGRID '12: Proceedings of the 2012 Seventh ChinaGrid Annual Conference

With the increasing availability of LBS (Location Based Services) and mobile internet, the amount of spatial data is growing larger and larger. It poses new requirements and challenges towards cloud environments, such as how to accomplish efficient ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on the Web Volume 7, Issue 3
September 2013
149 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/2516633
Issue’s Table of Contents

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 30 September 2013
- Accepted: 1 March 2013
- Revised: 1 August 2012
- Received: 1 January 2012
Published in tweb Volume 7, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Web search engine
distributed query processing
hypergraph partitioning
term-based index partitioning
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 21
  Total Citations
  View Citations
- 444
  Total Downloads
- Downloads (Last 12 months)14
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

A term-based inverted index partitioning model for efficient distributed query processing

ACM Transactions on the Web

Abstract

References

Cited By

Index Terms

Recommendations

The Effect of Index Partitioning Schemes on the Performance of Distributed Query Processing

Scalable nearest neighbor query processing based on Inverted Grid Index

Inverted Grid-Based kNN Query Processing with MapReduce

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

A term-based inverted index partitioning model for efficient distributed query processing

ACM Transactions on the Web

Abstract

References

Cited By

Index Terms

Recommendations

The Effect of Index Partitioning Schemes on the Performance of Distributed Query Processing

Scalable nearest neighbor query processing based on Inverted Grid Index

Inverted Grid-Based kNN Query Processing with MapReduce

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media