Abstract
In a shared-nothing, distributed text retrieval system, queries are processed over an inverted index that is partitioned among a number of index servers. In practice, the index is either document-based or term-based partitioned. This choice is made depending on the properties of the underlying hardware infrastructure, query traffic distribution, and some performance and availability constraints. In query processing on retrieval systems that adopt a term-based index partitioning strategy, the high communication overhead due to the transfer of large amounts of data from the index servers forms a major performance bottleneck, deteriorating the scalability of the entire distributed retrieval system. In this work, to alleviate this problem, we propose a novel inverted index partitioning model that relies on hypergraph partitioning. In the proposed model, concurrently accessed index entries are assigned to the same index servers, based on the inverted index access patterns extracted from the past query logs. The model aims to minimize the communication overhead that will be incurred by future queries while maintaining the computational load balance among the index servers. We evaluate the performance of the proposed model through extensive experiments using a real-life text collection and a search query sample. Our results show that considerable performance gains can be achieved relative to the term-based index partitioning strategies previously proposed in literature. In most cases, however, the performance remains inferior to that attained by document-based partitioning.
- Alpert, C. J. and Kahng, A. B. 1995. Recent directions in netlist partitioning: a survey. Integration VLSI J. 19, 1--2, 1--81. Google ScholarDigital Library
- Aykanat, C., Cambazoglu, B. B., and Uçar, B. 2008. Multi-level direct K-way hypergraph partitioning with multiple constraints and fixed vertices. J. Parallel Distrib. Comput. 68, 5, 609--625. Google ScholarDigital Library
- Badue, C., Ribeiro-Neto, B., Baeza-Yates, R., and Ziviani, N. 2001. Distributed query processing using partitioned inverted files. In Proceedings of the 8th International Symposium on String Processing and Information Retrieval. 10--20.Google Scholar
- Badue, C. S., Baeza-Yates, R., Ribeiro-Neto, B., Ziviani, A., and Ziviani, N. 2007. Analyzing imbalance among homogeneous index servers in a web search system. Inf. Process. Manage. 43, 3, 592--608. Google ScholarDigital Library
- Barroso, L. A., Dean, J., and Hölzle, U. 2003. Web search for a planet: the Google cluster architecture. IEEE Micro 23, 2, 22--28. Google ScholarDigital Library
- Berge, C. 1985. Graphs and Hypergraphs. Elsevier Science Ltd. Google ScholarDigital Library
- Cambazoglu, B. B. 2006. Models and algorithms for parallel text retrieval. Ph.D. dissertation. Department of Computer Engineering, Bilkent University.Google Scholar
- Cambazoglu, B. B. and Aykanat, C. 2006. A term-based inverted index organization for communication-efficient parallel query processing. In Proceedings of the IFIP International Conference on Network and Parallel Computing. 104--109.Google Scholar
- Cambazoglu, B. B. and Baeza-Yates, R. 2011. Scalability challenges in web search engines. In Advanced Topics in Information Retrieval, Information Retrieval Series, vol. 33, Springer, 27--50.Google Scholar
- Cambazoglu, B. B., Catal, A., and Aykanat, C. 2006. Effect of inverted index partitioning schemes on performance of query processing in parallel text retrieval systems. In Proceedings of the 21st International Conference on Computer and Information Sciences. 717--725. Google ScholarDigital Library
- Cambazoglu, B. B., Junqueira, F. P., Plachouras, V., Banachowski, S., Cui, B., Lim, S., and Bridge, B. 2010. A refreshing perspective of search engine caching. In Proceedings of the 19th International Conference on World Wide Web. 181--190. Google ScholarDigital Library
- Catalyurek, U. and Aykanat, C. 1999. Hypergraph-partitioning-based decomposition for parallel sparsematrix vector multiplication. IEEE Trans. Parallel Distrib. Systems 10, 7, 673--693. Google ScholarDigital Library
- Gan, Q. and Suel, T. 2009. Improved techniques for result caching in web search engines. In Proceedings of the 18th International Conference on World Wide Web. 431--440. Google ScholarDigital Library
- Jeong, B.-S. and Omiecinski, E. 1995. Inverted file partitioning schemes in multiple disk systems. IEEE Trans. Parallel Distrib. Systems 6, 2, 142--153. Google ScholarDigital Library
- Jonassen, S. and Bratsberg, S. E. 2009. Impact of the query model and system settings on performance of distributed inverted indexes. In Proceedings of the Norsk Informatikkonferance. 143--154.Google Scholar
- Jonassen, S. and Bratsberg, S. E. 2010. A combined semi-pipelined query processing architecture for distributed full-text retrieval. In Proceedings of the 11th International Conference on Web Information Systems Engineering. 587--601. Google ScholarDigital Library
- Jonassen, S. and Bratsberg, S. E. 2012a. Improving the performance of pipelined query processing with skipping. In Proceedings of the 13th International Conference on Web Information Systems Engineering. 1--15. Google ScholarDigital Library
- Jonassen, S. and Bratsberg, S. E. 2012b. Intra-query concurrent pipelined processing for distributed full-text retrieval. In Proceedings of the 34th European Conference on Advances in Information Retrieval. 413--425. Google ScholarDigital Library
- Karypis, G. and Kumar, V. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20, 1, 359--392. Google ScholarDigital Library
- Kucukyilmaz, T., Turk, A., and Aykanat, C. 2012. A parallel framework for in-memory construction of term-partitioned inverted indexes. Comput. J. 55, 11, 1317--1330. Google ScholarDigital Library
- Li, J., Loo, B., Hellerstein, J., Kaashoek, F., Karger, D., and Morris, R. 2003. On the feasibility of peer-to-peer web indexing and search. In Proceedings of the 2nd International Workshop on Peer-to-Peer Systems. 207--215.Google Scholar
- Lucchese, C., Orlando, S., Perego, R., and Silvestri, F. 2007. Mining query logs to optimize index partitioning in parallel web search engines. In Proceedings of the 2nd International Conference on Scalable Information Systems. 43:1--43:9. Google ScholarDigital Library
- Ma, Y.-C., Chen, T.-F., and Chung, C.-P. 2002. Posting file partitioning and parallel information retrieval. J. Syst. Soft. 63, 2, 113--127.Google ScholarCross Ref
- Ma, Y.-C., Chung, C.-P., and Chen, T.-F. 2011. Load and storage balanced posting file partitioning for parallel information retrieval. J. Syst. Soft. 84, 5, 864--884. Google ScholarDigital Library
- MacFarlane, A., McCann, J. A., and Robertson, S. E. 2000. Parallel search using partitioned inverted files. In Proceedings of the 7th International Symposium on String Processing and Information Retrieval. 209--220. Google ScholarDigital Library
- Moffat, A., Webber, W., and Zobel, J. 2006. Load balancing for term-distributed parallel retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 348--355. Google ScholarDigital Library
- Moffat, A., Webber, W., Zobel, J., and Baeza-Yates, R. 2007. A pipelined architecture for distributed text query evaluation. Inf. Retriev. 10, 3, 205--231. Google ScholarDigital Library
- Ribeiro-Neto, B. A. and Barbosa, R. A. 1998. Query performance for tightly coupled distributed digital libraries. In Proceedings of the 3rd ACM Conference on Digital Libraries. 182--190. Google ScholarDigital Library
- Ribeiro-Neto, B. A., Kitajima, J. P., Navarro, G., Sant'Ana, C. R. G., and Ziviani, N. 1998. Parallel generation of inverted files for distributed text collections. In Proceedings of the 18th International Conference of the Chilean Society of Computer Science. 149--157. Google ScholarDigital Library
- Selvitopi, R. O., Turk, A., and Aykanat, C. 2012. Replicated partitioning for undirected hypergraphs. J. Parallel and Distrib. Comput. 72, 4, 547--563. Google ScholarDigital Library
- Suel, T., Mathur, C., Wu, J., Zhang, J., Delis, A., Kharrazi, M., Long, X., and Shanmugasundaram, K. 2003. ODISSEA: A peer-to-peer architecture for scalable web search and information retrieval. In Proceedings of the International Workshop on the Web and Databases.Google Scholar
- Tomasic, A. and Garcia-Molina, H. 1993. Performance of inverted indices in shared-nothing distributed text document information retrieval systems. In Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems. 8--17. Google ScholarDigital Library
- Webber, W. 2007. Design and evaluation of a pipelined distributed information retrieval architecture. Master's thesis. University of Melbourne.Google Scholar
- Yan, H., Ding, S., and Suel, T. 2009. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web. 401--410. Google ScholarDigital Library
- Zhang, J. and Suel, T. 2005. Efficient query evaluation on large textual collections in a peer-to-peer environment. In Proceedings of the 5th IEEE International Conference on Peer-to-Peer Computing. 225--233. Google ScholarDigital Library
- Zhang, J. and Suel, T. 2007. Optimized inverted list assignment in distributed search engine architectures. In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium. 1--10.Google Scholar
- Zobel, J. and Moffat, A. 2006. Inverted files for text search engines. ACM Comput. Surv. 38, 2, Article 6. Google ScholarDigital Library
Index Terms
- A term-based inverted index partitioning model for efficient distributed query processing
Recommendations
The Effect of Index Partitioning Schemes on the Performance of Distributed Query Processing
An indexing scheme called partitioned global indexes (PGI) for a locally distributed database system is presented. The scheme builds a global index for the entire relation and partitions the index across the sites. A strategy for processing such an ...
Scalable nearest neighbor query processing based on Inverted Grid Index
With the increasing availability of Location-Based Services (LBS) and mobile internet, the amount of spatial data is growing larger. It poses new requirements and challenges for distributed index and query processing on large scale spatial data. A ...
Inverted Grid-Based kNN Query Processing with MapReduce
CHINAGRID '12: Proceedings of the 2012 Seventh ChinaGrid Annual ConferenceWith the increasing availability of LBS (Location Based Services) and mobile internet, the amount of spatial data is growing larger and larger. It poses new requirements and challenges towards cloud environments, such as how to accomplish efficient ...
Comments