nach oben

Discover Computing

Erschienen in:

25.01.2017 | Information Retrieval Efficiency

The role of index compression in score-at-a-time query evaluation

verfasst von: Jimmy Lin, Andrew Trotman

Erschienen in: Discover Computing | Ausgabe 3/2017

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

This paper explores the performance of top k document retrieval with score-at-a-time query evaluation on impact-ordered indexes in main memory. To better understand execution efficiency in the context of modern processor architectures, we examine the role of index compression on query evaluation latency. Experiments include compressing postings with variable byte encoding, Simple-8b, variants of the QMX compression scheme, as well as a condition that is less often considered—no compression. Across four web test collections, we find that the highest query evaluation speed is achieved by simply leaving the postings lists uncompressed, although the performance advantage over a state-of-the-art compression scheme is relatively small and the index is considerably larger. We explain this finding in terms of the design of modern processor architectures: Index segments with high impact scores are usually short and inherently benefit from cache locality. Index segments with lower impact scores may be quite long, but modern architectures have sufficient memory bandwidth (coupled with prefetching) to “keep up” with the processor. Our results highlight the importance of “architecture affinity” when designing high-performance search engines.

Vorheriger Artikel Performance improvements for search systems using an integrated cache of lists + intersections

Nächster Artikel Efficient distributed selective search

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

As a detail, note that in this treatment query weights are often ignored, which is justified for a few reasons: typical search queries do not have repeated terms (usually the source of query weights); in fact, although BM25 does include weighting for query frequency, in most implementation that weight is ignored.

The artifacts in the graph come from boilerplate terms that occur in subsets of the collection.

https://github.com/lintool/JASS.

Note that means can still be significantly different even though confidence intervals overlap (Wolfe and Hanley 2002), as is the case of uncompressed vs. QMX-D4 for ClueWeb12-B13.

Amati, G., & van Rijsbergen, C. J. (2002). Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems, 20(4), 357–389.CrossRef

Anh, V. N., & Moffat, A. (2005a). Inverted index compression using word-aligned binary codes. Information Retrieval, 8(1), 151–166.CrossRef

Anh, V.N., & Moffat, A. (2005b). Simplified similarity scoring using term ranks. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2005) (pp. 226–233). Salvador.

Anh, V. N., & Moffat, A. (2006). Pruned query evaluation using pre-computed impacts. In Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2006) (pp. 372–379). Seattle.

Anh, V. N., & Moffat, A. (2010). Inverted index compression using word-aligned binary codes. Software: Practice and Experience, 40(2), 131–147.

Anh, V. N., de Kretser, O., & Moffat, A. (2001). Vector-space ranking with effective early termination. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2001) (pp. 35–42). New Orleans.

Asadi, N., & Lin, J. (2013). Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures. In Proceedings of the 36th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2013) (pp. 997–1000). Dublin.

Brewer, E. A. (2005). Combining systems and databases: A search engine retrospective. Readings in database systems. Cambridge: MIT Press.

Büttcher, S., & Clarke, C.L.A. (2007). Index compression is good, especially for random access. In Proceedings of the sixteenth international conference on information and knowledge management (CIKM 2007) (pp. 761–770). Lisbon.

Cambazoglu, B.B., Zaragoza, H., Chapelle, O., Chen, J., Liao, C., Zheng, Z., & Degenhardt, J. (2010). Early exit optimizations for additive machine learned ranking systems. In Proceedings of the third ACM international conference on web search and data mining (WSDM 2010) (pp. 411–420). New York.

Crane, M., Trotman, A., & O’Keefe, R. (2013). Maintaining discriminatory power in quantized indexes. In Proceedings of 22nd international conference on information and knowledge management (CIKM 2013) (pp. 1221–1224). San Francisco.

Dean, J. (2009). Challenges in building large-scale information retrieval systems. In Keynote presentation at the second ACM international conference on web search and data mining (WSDM 2009). Barcelona.

Dean, J., & Barroso, L. A. (2013). The tail at scale. Communications of the ACM, 56(2), 74–80.CrossRef

Ding, S., & Suel, T. (2011). Faster top-k document retrieval using block-max indexes. In Proceedings of the 34rd annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2011) (pp. 993–1002). Beijing.

Fontoura, M., Josifovski, V., Liu, J., Venkatesan, S., Zhu, X., & Zien, J. (2011). Evaluation strategies for top-k queries over memory-resident inverted indexes. In Proceedings of the 37th international conference on very large data bases (VLDB 2011) (pp. 1213–1224). Seattle.

Jacob, B. (2009). The memory system: you can’t avoid it, you can’t ignore it, you can’t fake it. San Rafael: Morgan & Claypool Publishers.

Jia, X.F., Trotman, A., & O’Keefe, R. (2010). Efficient accumulator initialisation. In Proceedings of the 15th Australasian document computing symposium (ADCS 2010).

Lemire, D., & Boytsov, L. (2015). Decoding billions of integers per second through vectorization. Software: Practice and Experience, 45(1), 1–29.

Lemire, D., Boytsov, L., & Kurz, N. (2014). SIMD compression and the intersection of sorted integers. arXiv:1401.6399v9.

Lin, J., & Trotman, A. (2015). Anytime ranking for impact-ordered indexes. In Proceedings of the ACM international conference on the theory of information retrieval (ICTIR 2015) (pp. 301–304). Northampton.

Lin, J., Crane, M., Trotman, A., Callan, J., Chattopadhyaya, I., Foley, J., et al. (2016). Toward reproducible baselines: The open-source IR reproducibility challenge. In Proceedings of the 38th European conference on information retrieval (ECIR 2016) (pp. 408–420). Padua.

Moffat, A., & Zobel, J. (1996). Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4), 349–379.CrossRef

Patterson, D. A. (2004). Latency lags bandwidth. Communications of the ACM, 47(10), 71–75.CrossRef

Persin, M., Zobel, J., & Sacks-Davis, R. (1996). Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47(10), 749–764.CrossRef

Ponte, J. M., & Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 1998) (pp. 275–281). Melbourne.

Robertson, S. E., Walker, S., Hancock-Beaulieu, M., Gatford, M., & Payne, A. (1995). Okapi at TREC-4. In Proceedings of the fourth text retrieval conference (TREC-4) (pp. 73–96). Gaithersburg.

Rossi, C., de Moura, E. S., Carvalho, A. L., & da Silva, A. S. (2013). Fast document-at-a-time query processing using two-tier indexes. In Proceedings of the 36th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2013) (pp. 183–192). Dublin.

Salton, G. (1971). The SMART retrieval system-experiments in automatic document processing. Englewood Cliffs: Prentice-Hall.

Scholer, F., Williams, H. E., Yiannis, J., & Zobel, J. (2002). Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2002) (pp. 222–229). Tampere.

Silvestri, F., & Venturini, R. (2010). VSEncoding: Efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of 19th international conference on information and knowledge management (CIKM 2010) (pp. 1219–1228). Toronto.

Smucker, M.D., Allan, J., & Carterette, B. (2007). A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the sixteenth international conference on information and knowledge management (CIKM 2007) (pp. 623–632). Lisbon.

Stepanov, A. A., Gangolli, A. R., Rose, D. E., Ernst, R. J., & Oberoi, P. S. (2011). SIMD-based decoding of posting lists. In Proceedings of 20th international conference on information and knowledge management (CIKM 2011) (pp. 317–326). Glasgow.

Strohman, T., & Croft, W. B. (2007). Efficient document retrieval in main memory. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2007) (pp. 175–182). Amsterdam.

Strohman, T., Turtle, H., & Croft, W.B. (2005). Optimization strategies for complex queries. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2005) (pp. 219–225). Salvador.

Thompson, M., & Barker, M. (2010). LMAX: How to do 100k TPS at less than 1ms latency. In Presentation at QCon San Francisco 2010

Tonellotto, N., Macdonald, C., & Ounis, I. (2013). Efficient and effective retrieval using selective pruning. In Proceedings of the sixth ACM international conference on web search and data mining (WSDM 2013) (pp. 63–72). Rome.

Trotman, A. (2003). Compressing inverted files. Information Retrieval, 6(1), 5–19.CrossRef

Trotman, A. (2014). Compression, SIMD, and postings lists. In Proceedings of the 2014 Australasian document computing symposium (ADCS ’14) (pp. 50–57). Melbourne.

Trotman, A., Jia, X.F., & Crane, M. (2012). Towards an efficient and effective search engine. In Proceedings of the SIGIR 2012 workshop on open source information retrieval. Portland.

Turtle, H., & Flood, J. (1995). Query evaluation: Strategies and optimizations. Information Processing and Management, 31(6), 831–850.CrossRef

Wang, L., Lin, J., & Metzler, D. (2011). A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2011) (pp. 105–114). Beijing.

Williams, H. E., & Zobel, J. (1999). Compressing integers for fast file access. The Computer Journal, 42(3), 193–201.CrossRef

Witten, I. H., Moffat, A., & Bell, T. C. (1999). Managing gigabytes: Compressing and indexing documents and images. San Francisco: Morgan Kaufmann Publishing.MATH

Wolfe, R., & Hanley, J. (2002). If we’re so different, why do we keep overlapping? CMAJ, 166(1), 65–66.

Wu, H., & Fang, H. (2014). Document prioritization for scalable query processing. In Proceedings of 23rd international conference on information and knowledge management (CIKM 2014) (pp. 1609–1618). Shanghai.

Yan, H., Ding, S., & Suel, T. (2009). Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th international eorld wide web conference (WWW 2009) (pp. 401–410). Madrid.

Zhang, J., Long, X., & Suel, T. (2008). Performance of compressed inverted list caching in search engines. In Proceedings of the 17th international world wide web conference (WWW 2008) (pp. 387–396). Beijing.

Zukowski, M., Héman, S., Nes, N., & Boncz, P. (2006). Super-scalar RAM-CPU cache compression. In Proceedings of the IEEE 22nd international conference on data engineering (ICDE 2006). Atlanta.

Titel: The role of index compression in score-at-a-time query evaluation
verfasst von: Jimmy Lin
Andrew Trotman
Publikationsdatum: 25.01.2017
Verlag: Springer Netherlands
Erschienen in: Discover Computing / Ausgabe 3/2017
Print ISSN: 2948-2984
Elektronische ISSN: 2948-2992
DOI: https://doi.org/10.1007/s10791-016-9291-5

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Weitere Artikel der Ausgabe 3/2017

Waves: a fast multi-tier top-k query processing algorithm

Efficiency in information retrieval: introduction to special issue

Performance improvements for search systems using an integrated cache of lists + intersections

Efficient distributed selective search

Document retrieval on repetitive string collections

Premium Partner