skip to main content
research-article

Document reordering for faster intersection

Published:01 January 2019Publication History
Skip Abstract Section

Abstract

A lot of research has studied how to optimize inverted index structures in search engines through suitable reassignment of document identifiers. This approach was originally proposed to allow for better compression of the index, but subsequent work showed that it can also result in significant speed-ups for conjunctive queries and even certain types of disjunctive top-k algorithms. However, we do not have a good understanding of why this happens, and how we could directly optimize an index for query processing speed. As a result, existing techniques attempt to optimize for size, and treat speed increases as a welcome side-effect.

In this paper, we take an initial but important step towards understanding and modeling speed increases due to document reordering. We define the problem of minimizing the cost of queries given an inverted index and a query distribution, relate it to work on adaptive set intersection, and show that it is fundamentally different from that of minimizing compressed index size. We then propose a heuristic algorithm for finding a document reordering that minimizes query processing costs under suitable cost models. Our experiments show significant increases in the speed of intersections over state-of-the-art reordering techniques.

References

  1. R. R. Amossen and R. Pagh. A new data layout for set intersection on gpus. In 2011 IEEE International Parallel Distributed Processing Symposium, pages 698--708, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. N. Ao, F. Zhang, D. Wu, D. Stones, G. Wang, X. Liu, J. Liu, and S. Lin. Efficient parallel lists intersection and index compression algorithms using graphics processing units. PVLDB, 4(8):470--481, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. A. Baeza-Yates. A fast set intersection algorithm for sorted sequences. In Combinatorial Pattern Matching, pages 400--408, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  4. R. A. Baeza-Yates. Experimental analysis of a fast intersection algorithm for sorted sequences. In String Processing and Information Retrieval, pages 13--24, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. J. Barbay and C. Kenyon. Alternation and redundancy analysis of the intersection problem. ACM Trans. Algorithms, 4(1):4:1--4:18, Mar. 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. J. Barbay, A. López-Ortiz, and T. Lu. Faster adaptive set intersections for text searching. In Proceedings of the 5th International Conference on Experimental Algorithms, pages 146--157, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. J. Barbay, A. López-Ortiz, and T. Lu. An experimental investigation of set intersection algorithms for text searching. J. Exp. Algorithmics, 14:7:3.7--7:3.24, June 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J. L. Bentley and A. C. Yao. An almost optimal algorithm for unbounded searching. Information Processing Letters, 5:82--87, 1976.Google ScholarGoogle ScholarCross RefCross Ref
  9. P. Bille, A. Pagh, and R. Pagh. Fast evaluation of union-intersection expressions. In International Symposium on Algorithms and Computation, pages 739--750, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. R. Blanco and A. Barreiro. Document identifier reassignment through dimensionality reduction. In Proceedings of the 29th European Conference on IR Research, pages 375--387, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. D. Blandford and G. Blelloch. Index compression through document reordering. In Proceedings of the Data Compression Conference, page 342, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. G. E. Blelloch and M. Reid-Miller. Fast set operations using treaps. In ACM Symposium on Parallel Algorithms and Architectures, pages 16--26, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, pages 426--434, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. R. Brown and R. E. Tarjan. A fast merging algorithm. J. ACM, 26:211--226, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin. Earlybird: Real-time search at twitter. In 2012 IEEE 28th International Conference on Data Engineering, pages 1360--1369, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. K. Chakrabarti, S. Chaudhuri, and V. Ganti. Interval-based pruning for top-k processing over compressed lists. In 2011 IEEE 27th International Conference on Data Engineering, pages 709--720, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. J. Culpepper and A. Moffat. Efficient set intersection for inverted indexing. ACM Trans. Inf. Syst., 29(1):1:1--1:25, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. E. D. Demaine, A. López-Ortiz, and J. Munro. Adaptive set intersections, unions, and differences. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 743--752, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. E. D. Demaine, A. López-Ortiz, and J. Munro. Experiments on adaptive set intersections for text retrieval systems. In Revised Papers from the Third International Workshop on Algorithm Engineering and Experimentation, pages 91--104, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. L. Dhulipala, I. Kabiljo, B. Karrer, G. Ottaviano, S. Pupyrev, and A. Shalita. Compressing graphs and indexes with recursive graph bisection. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1535--1544, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. C. Dimopoulos, S. Nepomnyachiy, and T. Suel. Optimizing top-k document retrieval strategies for block-max indexes. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 113--122, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. B. Ding and A. C. Konig. Fast set intersection in memory. PVLDB, 4(4):255--266, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. S. Ding, J. Attenberg, and T. Suel. Scalable techniques for document identifier assignment in inverted indexes. In Proceedings of the 19th International Conference on World Wide Web, pages 311--320, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. S. Ding and T. Suel. Faster top-k document retrieval using block-max indexes. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 993--1002, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. B. Goodwin, M. Hopcroft, D. Luu, A. Clemmer, M. Curmei, S. Elnikety, and Y. He. Bitfunnel: Revisiting signatures for search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 605--614, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. F. K. Hwang and S. Lin. Optimal merging of 2 elements with n elements. Acta Inf., 1(2):145--158, June 1971. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. F. K. Hwang and S. Lin. A simple algorithm for merging two disjoint linearly-ordered sets. SIAM J. Comput, 1(1):31--39, June 1972.Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. A. Kane and F. Tompa. Skewed partial bitvectors for list intersection. In ACM SIGIR Conference on Research and Development in Information Retrieval, pages 263--272, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. A. Kane and F. W. Tompa. Distribution by document size. In Workshop on Large-scale and Distributed Systems for Information Retrieval, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. S. Khot and O. Regev. Vertex cover might be hard to approximate to within 2-ϵ J. Comput. Syst. Sci., 74(3):335--349, May 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. D. Lemire, L. Boytsov, and N. Kurz. Simd compression and the intersection of sorted integers. Softw. Pract. Exper., 46(6):723--749, June 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. A. Mallia, G. Ottaviano, E. Porciani, N. Tonellotto, and R. Venturini. Faster blockmax wand with variable-sized blocks. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 625--634, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. A. Moffat and L. Stuiver. Binary interpolative coding for effective index compression. Information Retrieval, 3(1):25--47, Jul 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. G. Ottaviano and R. Venturini. Partitioned elias-fano indexes. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 273--282, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. V. Ramaswamy, R. Konow, A. Trotman, J. Degenhardt, and N. Whyte. Document reordering is good, especially for e-commerce. In Proceedings of the SIGIR 2017 Workshop On eCommerce, 2017.Google ScholarGoogle Scholar
  36. C. Rosset, D. Jose, G. Ghosh, B. Mitra, and S. Tiwary. Optimizing query evaluations using reinforcement learning for web search. In The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1193--1196, 2018. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. P. Sanders and F. Transier. Intersection in integer inverted indices. In Proceedings of the Meeting on Algorithm Engineering and Experiments, pages 71--83, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. B. Schlegel, T. Dresden, T. Willhalm, and W. Lehner. Fast sorted-set intersection using simd instructions. In In International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage at VLDB, 2016.Google ScholarGoogle Scholar
  39. W. Shieh, T. Chen, J. Shann, and C. Chung. Inverted file compression through document identifier reassignment. Inf. Process. Management, 39(1):117--131, Jan. 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. F. Silvestri. Sorting out the document identifier assignment problem. In Proceedings of the 29th European Conference on IR Research, pages 101--112, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. F. Silvestri, S. Orlando, and R. Perego. Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 305--312, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. F. Silvestri and R. Venturini. Vsencoding: Efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 1219--1228, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. S. Tatikonda, F. Junqueira, B. B. Cambazoglu, and V. Plachouras. On efficient posting list intersection with multicore processors. In ACM SIGIR Conference on Research and Development in Information Retrieval, pages 738--739, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. D. Tsirogiannis, S. Guha, and N. Koudas. Improving the performance of list intersection. PVLDB, 2(1):838--849, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. H. Turtle and J. Flood. Query evaluation: Strategies and optimizations. Inf. Process. Management, 31(6):831--850, Nov. 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. L. Wang, J. Lin, and D. Metzler. A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 105--114, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. D. Wu, F. Zhang, N. Ao, F. Wang, X. Liu, and G. Wang. A batched gpu algorithm for set intersection. In 2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks, pages 752--756, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web, pages 401--410, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), July 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 12, Issue 5
    January 2019
    163 pages
    ISSN:2150-8097
    Issue’s Table of Contents

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 January 2019
    Published in pvldb Volume 12, Issue 5

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader