Skip to main content
Top

2015 | OriginalPaper | Chapter

Compact Indexes for Flexible Top-\(k\) Retrieval

Authors : Simon Gog, Matthias Petri

Published in: Combinatorial Pattern Matching

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

We design and engineer a self-index based retrieval system capable of rank-safe evaluation of top-\(k\) queries. The framework generalizes the GREEDY approach of Culpepper et al. (ESA 2010) to handle multi-term queries, including over phrases. We propose two techniques which significantly reduce the ranking time for a wide range of popular Information Retrieval (IR) relevance measures, such as \({\textsc {TF}\times \textsc {IDF}} \) and \({\textsc {BM25}} \). First, we reorder elements in the document array according to document weight. Second, we introduce the repetition array, which generalizes Sadakane’s (JDA 2007) document frequency structure to document subsets. Combining document and repetition array, we achieve attractive functionality-space trade-offs. We provide an extensive evaluation of our system on terabyte-sized IR collections.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Boldi, P., Vigna, S.: MG4J at TREC 2005. In: Proceedings of the TREC (2005) Boldi, P., Vigna, S.: MG4J at TREC 2005. In: Proceedings of the TREC (2005)
2.
go back to reference Broder, A.Z., Carmel, D., Herscovici, H., Soffer, A., Zien, J.: Efficient query evaluation using a two-level retrieval process. In: Proceedings of the CIKM, pp. 426–434 (2003) Broder, A.Z., Carmel, D., Herscovici, H., Soffer, A., Zien, J.: Efficient query evaluation using a two-level retrieval process. In: Proceedings of the CIKM, pp. 426–434 (2003)
3.
go back to reference Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Comput. Linguist. 16(1), 22–29 (1990) Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Comput. Linguist. 16(1), 22–29 (1990)
4.
go back to reference Culpepper, J.S., Navarro, G., Puglisi, S.J., Turpin, A.: Top-k ranked document search in general text databases. In: de Berg, M., Meyer, U. (eds.) ESA 2010, Part II. LNCS, vol. 6347, pp. 194–205. Springer, Heidelberg (2010) CrossRef Culpepper, J.S., Navarro, G., Puglisi, S.J., Turpin, A.: Top-k ranked document search in general text databases. In: de Berg, M., Meyer, U. (eds.) ESA 2010, Part II. LNCS, vol. 6347, pp. 194–205. Springer, Heidelberg (2010) CrossRef
5.
go back to reference Culpepper, J.S., Petri, M., Scholer, F.: Efficient in-memory top-k document retrieval. In: Proceedings of the SIGIR, pp. 225–234 (2012) Culpepper, J.S., Petri, M., Scholer, F.: Efficient in-memory top-k document retrieval. In: Proceedings of the SIGIR, pp. 225–234 (2012)
6.
go back to reference Gagie, T., Navarro, G., Puglisi, S.J.: New algorithms on wavelet trees and applications to information retrieval. Theoret. Comput. Sci. 426, 25–41 (2012)MathSciNetCrossRef Gagie, T., Navarro, G., Puglisi, S.J.: New algorithms on wavelet trees and applications to information retrieval. Theoret. Comput. Sci. 426, 25–41 (2012)MathSciNetCrossRef
7.
go back to reference Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Heidelberg (2014) Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Heidelberg (2014)
8.
go back to reference Gog, S., Navarro, G.: Improved single-term top-k document retrieval. In: Proceedings of the ALENEX, pp. 24–32 (2015) Gog, S., Navarro, G.: Improved single-term top-k document retrieval. In: Proceedings of the ALENEX, pp. 24–32 (2015)
9.
go back to reference Gog, S., Moffat, A., Petri, M.: On identifying phrases using collection statistics. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds.) ECIR 2015. LNCS, vol. 9022, pp. 278–283. Springer, Heidelberg (2015) CrossRef Gog, S., Moffat, A., Petri, M.: On identifying phrases using collection statistics. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds.) ECIR 2015. LNCS, vol. 9022, pp. 278–283. Springer, Heidelberg (2015) CrossRef
10.
go back to reference Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the SODA, pp. 841–850 (2003) Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the SODA, pp. 841–850 (2003)
12.
go back to reference Hon, W.K., Shah, R., Thankachan, S.V., Vitter, J.S.: Space-efficient frameworks for top-k string retrieval. J. ACM 61(2), 1–36 (2014)MathSciNetCrossRef Hon, W.K., Shah, R., Thankachan, S.V., Vitter, J.S.: Space-efficient frameworks for top-k string retrieval. J. ACM 61(2), 1–36 (2014)MathSciNetCrossRef
13.
go back to reference Konow, R., Navarro, G.: Faster compact top-k document retrieval. In: Proceedings of the DCC, pp. 351–360 (2013) Konow, R., Navarro, G.: Faster compact top-k document retrieval. In: Proceedings of the DCC, pp. 351–360 (2013)
14.
go back to reference Larsen, K.G., Munro, J.I., Nielsen, J.S., Thankachan, S.V.: On hardness of several string indexing problems. In: Kulikov, A.S., Kuznetsov, S.O., Pevzner, P. (eds.) CPM 2014. LNCS, vol. 8486, pp. 242–251. Springer, Heidelberg (2014) Larsen, K.G., Munro, J.I., Nielsen, J.S., Thankachan, S.V.: On hardness of several string indexing problems. In: Kulikov, A.S., Kuznetsov, S.O., Pevzner, P. (eds.) CPM 2014. LNCS, vol. 8486, pp. 242–251. Springer, Heidelberg (2014)
15.
go back to reference Lemire, D., Boytsov, L.: Decoding billions of integers per second through vectorization. Soft. Prac. Exp. 45, 1–29 (2013)CrossRef Lemire, D., Boytsov, L.: Decoding billions of integers per second through vectorization. Soft. Prac. Exp. 45, 1–29 (2013)CrossRef
16.
go back to reference Navarro, G.: Spaces, trees and colors: the algorithmic landscape of document retrieval on sequences. ACM Comp. Surv. 46(4), 1–47 (2014)CrossRef Navarro, G.: Spaces, trees and colors: the algorithmic landscape of document retrieval on sequences. ACM Comp. Surv. 46(4), 1–47 (2014)CrossRef
17.
go back to reference Navarro, G., Nekrich, Y.: Top- k document retrieval in optimal time and linear space. In: Proceedings of the SODA, pp. 1066–1077 (2012) Navarro, G., Nekrich, Y.: Top- k document retrieval in optimal time and linear space. In: Proceedings of the SODA, pp. 1066–1077 (2012)
18.
go back to reference Navarro, G., Providel, E.: Fast, small, simple rank/select on bitmaps. In: Klasing, R. (ed.) SEA 2012. LNCS, vol. 7276, pp. 295–306. Springer, Heidelberg (2012) CrossRef Navarro, G., Providel, E.: Fast, small, simple rank/select on bitmaps. In: Klasing, R. (ed.) SEA 2012. LNCS, vol. 7276, pp. 295–306. Springer, Heidelberg (2012) CrossRef
19.
go back to reference Navarro, G., Puglisi, S.J., Valenzuela, D.: General document retrieval in compact space. J. Experimental Alg. 19(2), 1–46 (2014)MathSciNet Navarro, G., Puglisi, S.J., Valenzuela, D.: General document retrieval in compact space. J. Experimental Alg. 19(2), 1–46 (2014)MathSciNet
20.
go back to reference Navarro, G., Thankachan, S.V.: New space/time tradeoffs for top-k document retrieval on sequences. Theor. Comput. Sci. 542, 83–97 (2014)MATHMathSciNetCrossRef Navarro, G., Thankachan, S.V.: New space/time tradeoffs for top-k document retrieval on sequences. Theor. Comput. Sci. 542, 83–97 (2014)MATHMathSciNetCrossRef
21.
go back to reference Ottaviano, G., Venturini, R.: Partitioned Elias-Fano indexes. In: Proceedings of the SIGIR, pp. 273–282 (2014) Ottaviano, G., Venturini, R.: Partitioned Elias-Fano indexes. In: Proceedings of the SIGIR, pp. 273–282 (2014)
22.
go back to reference Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding \(k\)-ary trees and multisets. In: Proceedings of the SODA, pp. 233–242 (2002) Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding \(k\)-ary trees and multisets. In: Proceedings of the SODA, pp. 233–242 (2002)
26.
go back to reference Strohman, T., Metzler, D., Turtle, H., Croft, W.B.: Indri: a language model-based search engine for complex queries. In: Proceedings of the International Conference on Intelligent Analysis (2005) Strohman, T., Metzler, D., Turtle, H., Croft, W.B.: Indri: a language model-based search engine for complex queries. In: Proceedings of the International Conference on Intelligent Analysis (2005)
27.
go back to reference Vigna, S.: Quasi-succinct indices. In: Proceedings of the WSDM, pp. 83–92 (2013) Vigna, S.: Quasi-succinct indices. In: Proceedings of the WSDM, pp. 83–92 (2013)
28.
go back to reference Yan, H., Ding, S., Suel, T.: Inverted index compression and query processing with optimized document ordering. In: Proceedings of the WWW, pp. 401–410 (2009) Yan, H., Ding, S., Suel, T.: Inverted index compression and query processing with optimized document ordering. In: Proceedings of the WWW, pp. 401–410 (2009)
29.
go back to reference Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comp. Surv. 38(2), 1–56 (2006)CrossRef Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comp. Surv. 38(2), 1–56 (2006)CrossRef
Metadata
Title
Compact Indexes for Flexible Top- Retrieval
Authors
Simon Gog
Matthias Petri
Copyright Year
2015
DOI
https://doi.org/10.1007/978-3-319-19929-0_18

Premium Partner