Skip to main content
Erschienen in: Discover Computing 3/2017

25.01.2017 | Information Retrieval Efficiency

The role of index compression in score-at-a-time query evaluation

verfasst von: Jimmy Lin, Andrew Trotman

Erschienen in: Discover Computing | Ausgabe 3/2017

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

This paper explores the performance of top k document retrieval with score-at-a-time query evaluation on impact-ordered indexes in main memory. To better understand execution efficiency in the context of modern processor architectures, we examine the role of index compression on query evaluation latency. Experiments include compressing postings with variable byte encoding, Simple-8b, variants of the QMX compression scheme, as well as a condition that is less often considered—no compression. Across four web test collections, we find that the highest query evaluation speed is achieved by simply leaving the postings lists uncompressed, although the performance advantage over a state-of-the-art compression scheme is relatively small and the index is considerably larger. We explain this finding in terms of the design of modern processor architectures: Index segments with high impact scores are usually short and inherently benefit from cache locality. Index segments with lower impact scores may be quite long, but modern architectures have sufficient memory bandwidth (coupled with prefetching) to “keep up” with the processor. Our results highlight the importance of “architecture affinity” when designing high-performance search engines.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
As a detail, note that in this treatment query weights are often ignored, which is justified for a few reasons: typical search queries do not have repeated terms (usually the source of query weights); in fact, although BM25 does include weighting for query frequency, in most implementation that weight is ignored.
 
2
The artifacts in the graph come from boilerplate terms that occur in subsets of the collection.
 
4
Note that means can still be significantly different even though confidence intervals overlap (Wolfe and Hanley 2002), as is the case of uncompressed vs. QMX-D4 for ClueWeb12-B13.
 
Literatur
Zurück zum Zitat Amati, G., & van Rijsbergen, C. J. (2002). Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems, 20(4), 357–389.CrossRef Amati, G., & van Rijsbergen, C. J. (2002). Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems, 20(4), 357–389.CrossRef
Zurück zum Zitat Anh, V. N., & Moffat, A. (2005a). Inverted index compression using word-aligned binary codes. Information Retrieval, 8(1), 151–166.CrossRef Anh, V. N., & Moffat, A. (2005a). Inverted index compression using word-aligned binary codes. Information Retrieval, 8(1), 151–166.CrossRef
Zurück zum Zitat Anh, V.N., & Moffat, A. (2005b). Simplified similarity scoring using term ranks. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2005) (pp. 226–233). Salvador. Anh, V.N., & Moffat, A. (2005b). Simplified similarity scoring using term ranks. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2005) (pp. 226–233). Salvador.
Zurück zum Zitat Anh, V. N., & Moffat, A. (2006). Pruned query evaluation using pre-computed impacts. In Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2006) (pp. 372–379). Seattle. Anh, V. N., & Moffat, A. (2006). Pruned query evaluation using pre-computed impacts. In Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2006) (pp. 372–379). Seattle.
Zurück zum Zitat Anh, V. N., & Moffat, A. (2010). Inverted index compression using word-aligned binary codes. Software: Practice and Experience, 40(2), 131–147. Anh, V. N., & Moffat, A. (2010). Inverted index compression using word-aligned binary codes. Software: Practice and Experience, 40(2), 131–147.
Zurück zum Zitat Anh, V. N., de Kretser, O., & Moffat, A. (2001). Vector-space ranking with effective early termination. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2001) (pp. 35–42). New Orleans. Anh, V. N., de Kretser, O., & Moffat, A. (2001). Vector-space ranking with effective early termination. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2001) (pp. 35–42). New Orleans.
Zurück zum Zitat Asadi, N., & Lin, J. (2013). Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures. In Proceedings of the 36th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2013) (pp. 997–1000). Dublin. Asadi, N., & Lin, J. (2013). Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures. In Proceedings of the 36th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2013) (pp. 997–1000). Dublin.
Zurück zum Zitat Brewer, E. A. (2005). Combining systems and databases: A search engine retrospective. Readings in database systems. Cambridge: MIT Press. Brewer, E. A. (2005). Combining systems and databases: A search engine retrospective. Readings in database systems. Cambridge: MIT Press.
Zurück zum Zitat Büttcher, S., & Clarke, C.L.A. (2007). Index compression is good, especially for random access. In Proceedings of the sixteenth international conference on information and knowledge management (CIKM 2007) (pp. 761–770). Lisbon. Büttcher, S., & Clarke, C.L.A. (2007). Index compression is good, especially for random access. In Proceedings of the sixteenth international conference on information and knowledge management (CIKM 2007) (pp. 761–770). Lisbon.
Zurück zum Zitat Cambazoglu, B.B., Zaragoza, H., Chapelle, O., Chen, J., Liao, C., Zheng, Z., & Degenhardt, J. (2010). Early exit optimizations for additive machine learned ranking systems. In Proceedings of the third ACM international conference on web search and data mining (WSDM 2010) (pp. 411–420). New York. Cambazoglu, B.B., Zaragoza, H., Chapelle, O., Chen, J., Liao, C., Zheng, Z., & Degenhardt, J. (2010). Early exit optimizations for additive machine learned ranking systems. In Proceedings of the third ACM international conference on web search and data mining (WSDM 2010) (pp. 411–420). New York.
Zurück zum Zitat Crane, M., Trotman, A., & O’Keefe, R. (2013). Maintaining discriminatory power in quantized indexes. In Proceedings of 22nd international conference on information and knowledge management (CIKM 2013) (pp. 1221–1224). San Francisco. Crane, M., Trotman, A., & O’Keefe, R. (2013). Maintaining discriminatory power in quantized indexes. In Proceedings of 22nd international conference on information and knowledge management (CIKM 2013) (pp. 1221–1224). San Francisco.
Zurück zum Zitat Dean, J. (2009). Challenges in building large-scale information retrieval systems. In Keynote presentation at the second ACM international conference on web search and data mining (WSDM 2009). Barcelona. Dean, J. (2009). Challenges in building large-scale information retrieval systems. In Keynote presentation at the second ACM international conference on web search and data mining (WSDM 2009). Barcelona.
Zurück zum Zitat Dean, J., & Barroso, L. A. (2013). The tail at scale. Communications of the ACM, 56(2), 74–80.CrossRef Dean, J., & Barroso, L. A. (2013). The tail at scale. Communications of the ACM, 56(2), 74–80.CrossRef
Zurück zum Zitat Ding, S., & Suel, T. (2011). Faster top-k document retrieval using block-max indexes. In Proceedings of the 34rd annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2011) (pp. 993–1002). Beijing. Ding, S., & Suel, T. (2011). Faster top-k document retrieval using block-max indexes. In Proceedings of the 34rd annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2011) (pp. 993–1002). Beijing.
Zurück zum Zitat Fontoura, M., Josifovski, V., Liu, J., Venkatesan, S., Zhu, X., & Zien, J. (2011). Evaluation strategies for top-k queries over memory-resident inverted indexes. In Proceedings of the 37th international conference on very large data bases (VLDB 2011) (pp. 1213–1224). Seattle. Fontoura, M., Josifovski, V., Liu, J., Venkatesan, S., Zhu, X., & Zien, J. (2011). Evaluation strategies for top-k queries over memory-resident inverted indexes. In Proceedings of the 37th international conference on very large data bases (VLDB 2011) (pp. 1213–1224). Seattle.
Zurück zum Zitat Jacob, B. (2009). The memory system: you can’t avoid it, you can’t ignore it, you can’t fake it. San Rafael: Morgan & Claypool Publishers. Jacob, B. (2009). The memory system: you can’t avoid it, you can’t ignore it, you can’t fake it. San Rafael: Morgan & Claypool Publishers.
Zurück zum Zitat Jia, X.F., Trotman, A., & O’Keefe, R. (2010). Efficient accumulator initialisation. In Proceedings of the 15th Australasian document computing symposium (ADCS 2010). Jia, X.F., Trotman, A., & O’Keefe, R. (2010). Efficient accumulator initialisation. In Proceedings of the 15th Australasian document computing symposium (ADCS 2010).
Zurück zum Zitat Lemire, D., & Boytsov, L. (2015). Decoding billions of integers per second through vectorization. Software: Practice and Experience, 45(1), 1–29. Lemire, D., & Boytsov, L. (2015). Decoding billions of integers per second through vectorization. Software: Practice and Experience, 45(1), 1–29.
Zurück zum Zitat Lin, J., & Trotman, A. (2015). Anytime ranking for impact-ordered indexes. In Proceedings of the ACM international conference on the theory of information retrieval (ICTIR 2015) (pp. 301–304). Northampton. Lin, J., & Trotman, A. (2015). Anytime ranking for impact-ordered indexes. In Proceedings of the ACM international conference on the theory of information retrieval (ICTIR 2015) (pp. 301–304). Northampton.
Zurück zum Zitat Lin, J., Crane, M., Trotman, A., Callan, J., Chattopadhyaya, I., Foley, J., et al. (2016). Toward reproducible baselines: The open-source IR reproducibility challenge. In Proceedings of the 38th European conference on information retrieval (ECIR 2016) (pp. 408–420). Padua. Lin, J., Crane, M., Trotman, A., Callan, J., Chattopadhyaya, I., Foley, J., et al. (2016). Toward reproducible baselines: The open-source IR reproducibility challenge. In Proceedings of the 38th European conference on information retrieval (ECIR 2016) (pp. 408–420). Padua.
Zurück zum Zitat Moffat, A., & Zobel, J. (1996). Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4), 349–379.CrossRef Moffat, A., & Zobel, J. (1996). Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4), 349–379.CrossRef
Zurück zum Zitat Patterson, D. A. (2004). Latency lags bandwidth. Communications of the ACM, 47(10), 71–75.CrossRef Patterson, D. A. (2004). Latency lags bandwidth. Communications of the ACM, 47(10), 71–75.CrossRef
Zurück zum Zitat Persin, M., Zobel, J., & Sacks-Davis, R. (1996). Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47(10), 749–764.CrossRef Persin, M., Zobel, J., & Sacks-Davis, R. (1996). Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47(10), 749–764.CrossRef
Zurück zum Zitat Ponte, J. M., & Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 1998) (pp. 275–281). Melbourne. Ponte, J. M., & Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 1998) (pp. 275–281). Melbourne.
Zurück zum Zitat Robertson, S. E., Walker, S., Hancock-Beaulieu, M., Gatford, M., & Payne, A. (1995). Okapi at TREC-4. In Proceedings of the fourth text retrieval conference (TREC-4) (pp. 73–96). Gaithersburg. Robertson, S. E., Walker, S., Hancock-Beaulieu, M., Gatford, M., & Payne, A. (1995). Okapi at TREC-4. In Proceedings of the fourth text retrieval conference (TREC-4) (pp. 73–96). Gaithersburg.
Zurück zum Zitat Rossi, C., de Moura, E. S., Carvalho, A. L., & da Silva, A. S. (2013). Fast document-at-a-time query processing using two-tier indexes. In Proceedings of the 36th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2013) (pp. 183–192). Dublin. Rossi, C., de Moura, E. S., Carvalho, A. L., & da Silva, A. S. (2013). Fast document-at-a-time query processing using two-tier indexes. In Proceedings of the 36th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2013) (pp. 183–192). Dublin.
Zurück zum Zitat Salton, G. (1971). The SMART retrieval system-experiments in automatic document processing. Englewood Cliffs: Prentice-Hall. Salton, G. (1971). The SMART retrieval system-experiments in automatic document processing. Englewood Cliffs: Prentice-Hall.
Zurück zum Zitat Scholer, F., Williams, H. E., Yiannis, J., & Zobel, J. (2002). Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2002) (pp. 222–229). Tampere. Scholer, F., Williams, H. E., Yiannis, J., & Zobel, J. (2002). Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2002) (pp. 222–229). Tampere.
Zurück zum Zitat Silvestri, F., & Venturini, R. (2010). VSEncoding: Efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of 19th international conference on information and knowledge management (CIKM 2010) (pp. 1219–1228). Toronto. Silvestri, F., & Venturini, R. (2010). VSEncoding: Efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of 19th international conference on information and knowledge management (CIKM 2010) (pp. 1219–1228). Toronto.
Zurück zum Zitat Smucker, M.D., Allan, J., & Carterette, B. (2007). A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the sixteenth international conference on information and knowledge management (CIKM 2007) (pp. 623–632). Lisbon. Smucker, M.D., Allan, J., & Carterette, B. (2007). A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the sixteenth international conference on information and knowledge management (CIKM 2007) (pp. 623–632). Lisbon.
Zurück zum Zitat Stepanov, A. A., Gangolli, A. R., Rose, D. E., Ernst, R. J., & Oberoi, P. S. (2011). SIMD-based decoding of posting lists. In Proceedings of 20th international conference on information and knowledge management (CIKM 2011) (pp. 317–326). Glasgow. Stepanov, A. A., Gangolli, A. R., Rose, D. E., Ernst, R. J., & Oberoi, P. S. (2011). SIMD-based decoding of posting lists. In Proceedings of 20th international conference on information and knowledge management (CIKM 2011) (pp. 317–326). Glasgow.
Zurück zum Zitat Strohman, T., & Croft, W. B. (2007). Efficient document retrieval in main memory. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2007) (pp. 175–182). Amsterdam. Strohman, T., & Croft, W. B. (2007). Efficient document retrieval in main memory. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2007) (pp. 175–182). Amsterdam.
Zurück zum Zitat Strohman, T., Turtle, H., & Croft, W.B. (2005). Optimization strategies for complex queries. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2005) (pp. 219–225). Salvador. Strohman, T., Turtle, H., & Croft, W.B. (2005). Optimization strategies for complex queries. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2005) (pp. 219–225). Salvador.
Zurück zum Zitat Thompson, M., & Barker, M. (2010). LMAX: How to do 100k TPS at less than 1ms latency. In Presentation at QCon San Francisco 2010 Thompson, M., & Barker, M. (2010). LMAX: How to do 100k TPS at less than 1ms latency. In Presentation at QCon San Francisco 2010
Zurück zum Zitat Tonellotto, N., Macdonald, C., & Ounis, I. (2013). Efficient and effective retrieval using selective pruning. In Proceedings of the sixth ACM international conference on web search and data mining (WSDM 2013) (pp. 63–72). Rome. Tonellotto, N., Macdonald, C., & Ounis, I. (2013). Efficient and effective retrieval using selective pruning. In Proceedings of the sixth ACM international conference on web search and data mining (WSDM 2013) (pp. 63–72). Rome.
Zurück zum Zitat Trotman, A. (2003). Compressing inverted files. Information Retrieval, 6(1), 5–19.CrossRef Trotman, A. (2003). Compressing inverted files. Information Retrieval, 6(1), 5–19.CrossRef
Zurück zum Zitat Trotman, A. (2014). Compression, SIMD, and postings lists. In Proceedings of the 2014 Australasian document computing symposium (ADCS ’14) (pp. 50–57). Melbourne. Trotman, A. (2014). Compression, SIMD, and postings lists. In Proceedings of the 2014 Australasian document computing symposium (ADCS ’14) (pp. 50–57). Melbourne.
Zurück zum Zitat Trotman, A., Jia, X.F., & Crane, M. (2012). Towards an efficient and effective search engine. In Proceedings of the SIGIR 2012 workshop on open source information retrieval. Portland. Trotman, A., Jia, X.F., & Crane, M. (2012). Towards an efficient and effective search engine. In Proceedings of the SIGIR 2012 workshop on open source information retrieval. Portland.
Zurück zum Zitat Turtle, H., & Flood, J. (1995). Query evaluation: Strategies and optimizations. Information Processing and Management, 31(6), 831–850.CrossRef Turtle, H., & Flood, J. (1995). Query evaluation: Strategies and optimizations. Information Processing and Management, 31(6), 831–850.CrossRef
Zurück zum Zitat Wang, L., Lin, J., & Metzler, D. (2011). A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2011) (pp. 105–114). Beijing. Wang, L., Lin, J., & Metzler, D. (2011). A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2011) (pp. 105–114). Beijing.
Zurück zum Zitat Williams, H. E., & Zobel, J. (1999). Compressing integers for fast file access. The Computer Journal, 42(3), 193–201.CrossRef Williams, H. E., & Zobel, J. (1999). Compressing integers for fast file access. The Computer Journal, 42(3), 193–201.CrossRef
Zurück zum Zitat Witten, I. H., Moffat, A., & Bell, T. C. (1999). Managing gigabytes: Compressing and indexing documents and images. San Francisco: Morgan Kaufmann Publishing.MATH Witten, I. H., Moffat, A., & Bell, T. C. (1999). Managing gigabytes: Compressing and indexing documents and images. San Francisco: Morgan Kaufmann Publishing.MATH
Zurück zum Zitat Wolfe, R., & Hanley, J. (2002). If we’re so different, why do we keep overlapping? CMAJ, 166(1), 65–66. Wolfe, R., & Hanley, J. (2002). If we’re so different, why do we keep overlapping? CMAJ, 166(1), 65–66.
Zurück zum Zitat Wu, H., & Fang, H. (2014). Document prioritization for scalable query processing. In Proceedings of 23rd international conference on information and knowledge management (CIKM 2014) (pp. 1609–1618). Shanghai. Wu, H., & Fang, H. (2014). Document prioritization for scalable query processing. In Proceedings of 23rd international conference on information and knowledge management (CIKM 2014) (pp. 1609–1618). Shanghai.
Zurück zum Zitat Yan, H., Ding, S., & Suel, T. (2009). Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th international eorld wide web conference (WWW 2009) (pp. 401–410). Madrid. Yan, H., Ding, S., & Suel, T. (2009). Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th international eorld wide web conference (WWW 2009) (pp. 401–410). Madrid.
Zurück zum Zitat Zhang, J., Long, X., & Suel, T. (2008). Performance of compressed inverted list caching in search engines. In Proceedings of the 17th international world wide web conference (WWW 2008) (pp. 387–396). Beijing. Zhang, J., Long, X., & Suel, T. (2008). Performance of compressed inverted list caching in search engines. In Proceedings of the 17th international world wide web conference (WWW 2008) (pp. 387–396). Beijing.
Zurück zum Zitat Zukowski, M., Héman, S., Nes, N., & Boncz, P. (2006). Super-scalar RAM-CPU cache compression. In Proceedings of the IEEE 22nd international conference on data engineering (ICDE 2006). Atlanta. Zukowski, M., Héman, S., Nes, N., & Boncz, P. (2006). Super-scalar RAM-CPU cache compression. In Proceedings of the IEEE 22nd international conference on data engineering (ICDE 2006). Atlanta.
Metadaten
Titel
The role of index compression in score-at-a-time query evaluation
verfasst von
Jimmy Lin
Andrew Trotman
Publikationsdatum
25.01.2017
Verlag
Springer Netherlands
Erschienen in
Discover Computing / Ausgabe 3/2017
Print ISSN: 2948-2984
Elektronische ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-016-9291-5

Weitere Artikel der Ausgabe 3/2017

Discover Computing 3/2017 Zur Ausgabe

Information Retrieval Efficiency

Efficient distributed selective search

Premium Partner