Skip to main content
Top

2016 | OriginalPaper | Chapter

Compressing and Decoding Term Statistics Time Series

Authors : Jinfeng Rao, Xing Niu, Jimmy Lin

Published in: Advances in Information Retrieval

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

There is growing recognition that temporality plays an important role in information retrieval, particularly for timestamped document collections such as tweets. This paper examines the problem of compressing and decoding term statistics time series, or counts of terms within a particular time window across a large document collection. Such data are large—essentially the cross product of the vocabulary and the number of time intervals—but are also sparse, which makes them amenable to compression. We explore various integer compression techniques, starting with a number of coding schemes that are well-known in the information retrieval literature, and build toward a novel compression approach based on Huffman codes over blocks of term counts. We show that our Huffman-based methods are able to substantially reduce storage requirements compared to state-of-the-art compression techniques while still maintaining good decoding performance.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
1
We set aside compression speed since we are working with retrospective collections.
 
Literature
1.
go back to reference Anh, V.N., Moffat, A.: Inverted index compression using word-aligned binary codes. Inf. Retrieval 8(1), 151–166 (2005)CrossRef Anh, V.N., Moffat, A.: Inverted index compression using word-aligned binary codes. Inf. Retrieval 8(1), 151–166 (2005)CrossRef
2.
go back to reference Busch, M., Gade, K., Larson, B., Lok, P., Luckenbill, S., Lin, J.: Earlybird: real-time search at Twitter. In: ICDE (2012) Busch, M., Gade, K., Larson, B., Lok, P., Luckenbill, S., Lin, J.: Earlybird: real-time search at Twitter. In: ICDE (2012)
3.
go back to reference Elsas, J.L., Dumais, S.T.: Leveraging temporal dynamics of document content in relevance ranking. In: WSDM (2010) Elsas, J.L., Dumais, S.T.: Leveraging temporal dynamics of document content in relevance ranking. In: WSDM (2010)
4.
go back to reference Huffman, D.A., et al.: A method for the construction of minimum redundancy codes. Proc. IRE 40(9), 1098–1101 (1952)CrossRefMATH Huffman, D.A., et al.: A method for the construction of minimum redundancy codes. Proc. IRE 40(9), 1098–1101 (1952)CrossRefMATH
5.
go back to reference Jones, R., Diaz, F.: Temporal profiles of queries. ACM TOIS 25, Article no. 14 (2007) Jones, R., Diaz, F.: Temporal profiles of queries. ACM TOIS 25, Article no. 14 (2007)
6.
go back to reference Lin, J., Mishne, G.: A study of “churn” in tweets and real-time search queries. In: ICWSM (2012) Lin, J., Mishne, G.: A study of “churn” in tweets and real-time search queries. In: ICWSM (2012)
7.
go back to reference Mishne, G., Dalton, J., Li, Z., Sharma, A., Lin, J.: Fast data in the era of big data: Twitter’s real-time related query suggestion architecture. In: SIGMOD (2013) Mishne, G., Dalton, J., Li, Z., Sharma, A., Lin, J.: Fast data in the era of big data: Twitter’s real-time related query suggestion architecture. In: SIGMOD (2013)
8.
go back to reference Williams, H.E., Zobel, J.: Compressing integers for fast file access. Comput. J. 42(3), 193–201 (1999)CrossRef Williams, H.E., Zobel, J.: Compressing integers for fast file access. Comput. J. 42(3), 193–201 (1999)CrossRef
9.
go back to reference Zhang, J., Long, X., Suel, T.: Performance of compressed inverted list caching in search engines. In: WWW (2008) Zhang, J., Long, X., Suel, T.: Performance of compressed inverted list caching in search engines. In: WWW (2008)
Metadata
Title
Compressing and Decoding Term Statistics Time Series
Authors
Jinfeng Rao
Xing Niu
Jimmy Lin
Copyright Year
2016
Publisher
Springer International Publishing
DOI
https://doi.org/10.1007/978-3-319-30671-1_52