Abstract
A lot of research has studied how to optimize inverted index structures in search engines through suitable reassignment of document identifiers. This approach was originally proposed to allow for better compression of the index, but subsequent work showed that it can also result in significant speed-ups for conjunctive queries and even certain types of disjunctive top-k algorithms. However, we do not have a good understanding of why this happens, and how we could directly optimize an index for query processing speed. As a result, existing techniques attempt to optimize for size, and treat speed increases as a welcome side-effect.
In this paper, we take an initial but important step towards understanding and modeling speed increases due to document reordering. We define the problem of minimizing the cost of queries given an inverted index and a query distribution, relate it to work on adaptive set intersection, and show that it is fundamentally different from that of minimizing compressed index size. We then propose a heuristic algorithm for finding a document reordering that minimizes query processing costs under suitable cost models. Our experiments show significant increases in the speed of intersections over state-of-the-art reordering techniques.
- R. R. Amossen and R. Pagh. A new data layout for set intersection on gpus. In 2011 IEEE International Parallel Distributed Processing Symposium, pages 698--708, 2011. Google ScholarDigital Library
- N. Ao, F. Zhang, D. Wu, D. Stones, G. Wang, X. Liu, J. Liu, and S. Lin. Efficient parallel lists intersection and index compression algorithms using graphics processing units. PVLDB, 4(8):470--481, 2011. Google ScholarDigital Library
- R. A. Baeza-Yates. A fast set intersection algorithm for sorted sequences. In Combinatorial Pattern Matching, pages 400--408, 2004.Google ScholarCross Ref
- R. A. Baeza-Yates. Experimental analysis of a fast intersection algorithm for sorted sequences. In String Processing and Information Retrieval, pages 13--24, 2005. Google ScholarDigital Library
- J. Barbay and C. Kenyon. Alternation and redundancy analysis of the intersection problem. ACM Trans. Algorithms, 4(1):4:1--4:18, Mar. 2008. Google ScholarDigital Library
- J. Barbay, A. López-Ortiz, and T. Lu. Faster adaptive set intersections for text searching. In Proceedings of the 5th International Conference on Experimental Algorithms, pages 146--157, 2006. Google ScholarDigital Library
- J. Barbay, A. López-Ortiz, and T. Lu. An experimental investigation of set intersection algorithms for text searching. J. Exp. Algorithmics, 14:7:3.7--7:3.24, June 2010. Google ScholarDigital Library
- J. L. Bentley and A. C. Yao. An almost optimal algorithm for unbounded searching. Information Processing Letters, 5:82--87, 1976.Google ScholarCross Ref
- P. Bille, A. Pagh, and R. Pagh. Fast evaluation of union-intersection expressions. In International Symposium on Algorithms and Computation, pages 739--750, 2007. Google ScholarDigital Library
- R. Blanco and A. Barreiro. Document identifier reassignment through dimensionality reduction. In Proceedings of the 29th European Conference on IR Research, pages 375--387, 2005. Google ScholarDigital Library
- D. Blandford and G. Blelloch. Index compression through document reordering. In Proceedings of the Data Compression Conference, page 342, 2002. Google ScholarDigital Library
- G. E. Blelloch and M. Reid-Miller. Fast set operations using treaps. In ACM Symposium on Parallel Algorithms and Architectures, pages 16--26, 1998. Google ScholarDigital Library
- A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, pages 426--434, 2003. Google ScholarDigital Library
- M. R. Brown and R. E. Tarjan. A fast merging algorithm. J. ACM, 26:211--226, 1979. Google ScholarDigital Library
- M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin. Earlybird: Real-time search at twitter. In 2012 IEEE 28th International Conference on Data Engineering, pages 1360--1369, 2012. Google ScholarDigital Library
- K. Chakrabarti, S. Chaudhuri, and V. Ganti. Interval-based pruning for top-k processing over compressed lists. In 2011 IEEE 27th International Conference on Data Engineering, pages 709--720, 2011. Google ScholarDigital Library
- J. Culpepper and A. Moffat. Efficient set intersection for inverted indexing. ACM Trans. Inf. Syst., 29(1):1:1--1:25, 2010. Google ScholarDigital Library
- E. D. Demaine, A. López-Ortiz, and J. Munro. Adaptive set intersections, unions, and differences. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 743--752, 2000. Google ScholarDigital Library
- E. D. Demaine, A. López-Ortiz, and J. Munro. Experiments on adaptive set intersections for text retrieval systems. In Revised Papers from the Third International Workshop on Algorithm Engineering and Experimentation, pages 91--104, 2001. Google ScholarDigital Library
- L. Dhulipala, I. Kabiljo, B. Karrer, G. Ottaviano, S. Pupyrev, and A. Shalita. Compressing graphs and indexes with recursive graph bisection. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1535--1544, 2016. Google ScholarDigital Library
- C. Dimopoulos, S. Nepomnyachiy, and T. Suel. Optimizing top-k document retrieval strategies for block-max indexes. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 113--122, 2013. Google ScholarDigital Library
- B. Ding and A. C. Konig. Fast set intersection in memory. PVLDB, 4(4):255--266, 2011. Google ScholarDigital Library
- S. Ding, J. Attenberg, and T. Suel. Scalable techniques for document identifier assignment in inverted indexes. In Proceedings of the 19th International Conference on World Wide Web, pages 311--320, 2010. Google ScholarDigital Library
- S. Ding and T. Suel. Faster top-k document retrieval using block-max indexes. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 993--1002, 2011. Google ScholarDigital Library
- B. Goodwin, M. Hopcroft, D. Luu, A. Clemmer, M. Curmei, S. Elnikety, and Y. He. Bitfunnel: Revisiting signatures for search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 605--614, 2017. Google ScholarDigital Library
- F. K. Hwang and S. Lin. Optimal merging of 2 elements with n elements. Acta Inf., 1(2):145--158, June 1971. Google ScholarDigital Library
- F. K. Hwang and S. Lin. A simple algorithm for merging two disjoint linearly-ordered sets. SIAM J. Comput, 1(1):31--39, June 1972.Google ScholarDigital Library
- A. Kane and F. Tompa. Skewed partial bitvectors for list intersection. In ACM SIGIR Conference on Research and Development in Information Retrieval, pages 263--272, 2014. Google ScholarDigital Library
- A. Kane and F. W. Tompa. Distribution by document size. In Workshop on Large-scale and Distributed Systems for Information Retrieval, 2014. Google ScholarDigital Library
- S. Khot and O. Regev. Vertex cover might be hard to approximate to within 2-ϵ J. Comput. Syst. Sci., 74(3):335--349, May 2008. Google ScholarDigital Library
- D. Lemire, L. Boytsov, and N. Kurz. Simd compression and the intersection of sorted integers. Softw. Pract. Exper., 46(6):723--749, June 2016. Google ScholarDigital Library
- A. Mallia, G. Ottaviano, E. Porciani, N. Tonellotto, and R. Venturini. Faster blockmax wand with variable-sized blocks. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 625--634, 2017. Google ScholarDigital Library
- A. Moffat and L. Stuiver. Binary interpolative coding for effective index compression. Information Retrieval, 3(1):25--47, Jul 2000. Google ScholarDigital Library
- G. Ottaviano and R. Venturini. Partitioned elias-fano indexes. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 273--282, 2014. Google ScholarDigital Library
- V. Ramaswamy, R. Konow, A. Trotman, J. Degenhardt, and N. Whyte. Document reordering is good, especially for e-commerce. In Proceedings of the SIGIR 2017 Workshop On eCommerce, 2017.Google Scholar
- C. Rosset, D. Jose, G. Ghosh, B. Mitra, and S. Tiwary. Optimizing query evaluations using reinforcement learning for web search. In The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1193--1196, 2018. Google ScholarDigital Library
- P. Sanders and F. Transier. Intersection in integer inverted indices. In Proceedings of the Meeting on Algorithm Engineering and Experiments, pages 71--83, 2007. Google ScholarDigital Library
- B. Schlegel, T. Dresden, T. Willhalm, and W. Lehner. Fast sorted-set intersection using simd instructions. In In International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage at VLDB, 2016.Google Scholar
- W. Shieh, T. Chen, J. Shann, and C. Chung. Inverted file compression through document identifier reassignment. Inf. Process. Management, 39(1):117--131, Jan. 2003. Google ScholarDigital Library
- F. Silvestri. Sorting out the document identifier assignment problem. In Proceedings of the 29th European Conference on IR Research, pages 101--112, 2007. Google ScholarDigital Library
- F. Silvestri, S. Orlando, and R. Perego. Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 305--312, 2004. Google ScholarDigital Library
- F. Silvestri and R. Venturini. Vsencoding: Efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 1219--1228, 2010. Google ScholarDigital Library
- S. Tatikonda, F. Junqueira, B. B. Cambazoglu, and V. Plachouras. On efficient posting list intersection with multicore processors. In ACM SIGIR Conference on Research and Development in Information Retrieval, pages 738--739, 2009. Google ScholarDigital Library
- D. Tsirogiannis, S. Guha, and N. Koudas. Improving the performance of list intersection. PVLDB, 2(1):838--849, 2009. Google ScholarDigital Library
- H. Turtle and J. Flood. Query evaluation: Strategies and optimizations. Inf. Process. Management, 31(6):831--850, Nov. 1995. Google ScholarDigital Library
- L. Wang, J. Lin, and D. Metzler. A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 105--114, 2011. Google ScholarDigital Library
- D. Wu, F. Zhang, N. Ao, F. Wang, X. Liu, and G. Wang. A batched gpu algorithm for set intersection. In 2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks, pages 752--756, 2009. Google ScholarDigital Library
- H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web, pages 401--410, 2009. Google ScholarDigital Library
- J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), July 2006. Google ScholarDigital Library
Recommendations
Faster Index Reordering with Bipartite Graph Partitioning
SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information RetrievalWe revisit the Bipartite Graph Partitioning approach to document reordering (Dhulipala et al., KDD 2016), and consider a range of algorithmic and heuristic refinements that lead to faster computation of index-minimizing document orderings. Our final ...
Faster top-k document retrieval using block-max indexes
SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information RetrievalLarge search engines process thousands of queries per second over billions of documents, making query processing a major performance bottleneck. An important class of optimization techniques called early termination achieves faster query processing by ...
Index Compression through Document Reordering
DCC '02: Proceedings of the Data Compression ConferenceAn important concern in the design of search engines is the construction of an inverted index. An inverted index, also called a concordance, contains a list of documents (or posting list) for every possible search term. These posting lists are usually ...
Comments