Article

A time machine for text search

Authors:
Klaus Berberich

Max-Planck Institute for Informatics, Saarbruecken, Germany

Max-Planck Institute for Informatics, Saarbruecken, Germany
View Profile

,
Srikanta Bedathur

Max-Planck Institute for Informatics, Saarbruecken, Germany

Max-Planck Institute for Informatics, Saarbruecken, Germany
View Profile

,
Thomas Neumann

Max-Planck Institute for Informatics, Saarbruecken, Germany

Max-Planck Institute for Informatics, Saarbruecken, Germany
View Profile

,
Gerhard Weikum

Max-Planck Institute for Informatics, Saarbruecken, Germany

Max-Planck Institute for Informatics, Saarbruecken, Germany
View Profile

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrievalJuly 2007Pages 519–526https://doi.org/10.1145/1277741.1277831

Published:23 July 2007Publication History

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval

Pages 519–526

ABSTRACT

Text search over temporally versioned document collections such as web archives has received little attention as a research problem. As a consequence, there is no scalable and principled solution to search such a collection as of a specified time. In this work, we address this shortcoming and propose an efficient solution for time-travel text search by extending the inverted file index to make it ready for temporal search. We introduce approximate temporal coalescing as a tunable method to reduce the index size without significantly affecting the quality of results. In order to further improve the performance of time-travel queries, we introduce two principled techniques to trade off index size for its performance. These techniques can be formulated as optimization problems that can be solved to near-optimality. Finally, our approach is evaluated in a comprehensive series of experiments on two large-scale real-world datasets. Results unequivocally show that our methods make it possible to build an efficient "time machine" scalable to large versioned text collections.

References

V. N. Anh and A. Moffat. Pruned Query Evaluation Using Pre-Computed Impacts. In SIGIR, 2006. Google ScholarDigital Library
V. N. Anh and A. Moffat. Pruning Strategies for Mixed-Mode Querying. In CIKM, 2006. Google ScholarDigital Library
P. G. Anick and R. A. Flynn. Versioning a Full-Text Information Retrieval System. In SIGIR, 1992. Google ScholarDigital Library
R. A. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999. Google ScholarDigital Library
K. Berberich, S. Bedathur, T. Neumann, and G. Weikum. A Time Machine for Text search. Technical Report MPI-I-2007-5-002, Max-Planck Institute for Informatics, 2007.Google ScholarDigital Library
M. H. Böhlen, R. T. Snodgrass, and M. D. Soo. Coalescing in Temporal Databases. In VLDB, 1996. Google ScholarDigital Library
P. Boldi, M. Santini, and S. Vigna. Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations. In WAW, 2004.Google ScholarCross Ref
A. Z. Broder, N. Eiron, M. Fontoura, M. Herscovici, R. Lempel, J. McPherson, R. Qi, and E. J. Shekita. Indexing Shared Content in Information Retrieval Systems. In EDBT, 2006. Google ScholarDigital Library
C. Buckley and A. F. Lewit. Optimization of Inverted Vector Searches. In SIGIR, 1985. Google ScholarDigital Library
M. Burrows and A. L. Hisgen. Method and Apparatus for Generating and Searching Range-Based Index of Word Locations. U.S. Patent 5,915,251, 1999.Google Scholar
S. Büttcher and C. L. A. Clarke. A Document-Centric Approach to Static Index Pruning in Text Retrieval Systems. In CIKM, 2006.Google ScholarDigital Library
D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. S. Maarek, and A. Soffer. Static Index Pruning for Information Retrieval Systems. In SIGIR, 2001. Google ScholarDigital Library
http://www.europarchive.org.Google Scholar
R. Fagin, R. Kumar, and D. Sivakumar. Comparing Top k Lists. SIAM J. Discrete Math., 17(1):134--160, 2003. Google ScholarDigital Library
R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. J. Comput. Syst. Sci., 66(4):614--656, 2003. Google ScholarDigital Library
S. Guha, K. Shim, and J. Woo. REHIST: Relative Error Histogram Construction Algorithms. In VLDB, 2004. Google ScholarDigital Library
M. Hersovici, R. Lempel, and S. Yogev. Efficient Indexing of Versioned Document Sequences. In ECIR, 2007. Google ScholarDigital Library
http://www.archive.org.Google Scholar
Y. E. Ioannidis and V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In SIGMOD, 1995. Google ScholarDigital Library
H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal Histograms with Quality Guarantees. In VLDB, 1998. Google ScholarDigital Library
E. J. Keogh, S. Chu, D. Hart, and M. J. Pazzani. An Online Algorithm for Segmenting Time Series. In ICDM, 2001. Google ScholarDigital Library
S. Kirkpatrick, D. G. Jr., and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671--680, 1983.Google ScholarCross Ref
J. Kleinberg and E. Tardos. Algorithm Design. Addison-Wesley, 2005. Google ScholarDigital Library
U. Manber. Introduction to Algorithms: A Creative Approach. Addison-Wesley, 1989. Google ScholarDigital Library
K. Nørvåg and A. O. N. Nybø. DyST: Dynamic and Scalable Temporal Text Indexing. In TIME, 2006.Google Scholar
J. M. Ponte and W. B. Croft. A Language Modeling Approach to Information Retrieval. In SIGIR, 1998. Google ScholarDigital Library
S. E. Robertson and S. Walker. Okapi/Keenbow at TREC-8. In TREC, 1999.Google Scholar
B. Salzberg and V. J. Tsotras. Comparison of Access Methods for Time-Evolving Data. ACM Comput. Surv., 31(2):158--221, 1999. Google ScholarDigital Library
M. Stack. Full Text Search of Web Archive Collections. In IWAW, 2006.Google Scholar
E. Terzi and P. Tsaparas. Efficient Algorithms for Sequence Segmentation. In SIAM-DM, 2006.Google ScholarCross Ref
M. Theobald, G. Weikum, and R. Schenkel. Top-k Query Evaluation with Probabilistic Guarantees. In VLDB, 2004. Google ScholarDigital Library
http://www.wikipedia.org.Google Scholar
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann publishers Inc., 1999. Google ScholarDigital Library
J. Zhang and T. Suel. Efficient Search in Large Textual Collections with Redundancy. In WWW, 2007. Google ScholarDigital Library
J. Zobel and A. Moffat. Inverted Files for Text Search Engines. ACM Comput. Surv., 38(2):6, 2006. 526. Google ScholarDigital Library

Index Terms

A time machine for text search
1. Information systems
  1. Information retrieval

Recommendations

Index maintenance for time-travel text search
SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Time-travel text search enriches standard text search by temporal predicates, so that users of web archives can easily retrieve document versions that are considered relevant to a given keyword query and existed during a given time interval. Different ...
Read More
Lifespan-based Partitioning of Index Structures for Time-travel Text Search
CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management

Time-travel text search over a temporally evolving document collection is useful in various applications. Supporting a wide range of query classes demanded by these applications require different index layouts optimized for their respective query access ...
Read More
Temporal index sharding for space-time efficiency in archive search
SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Time-travel queries that couple temporal constraints with keyword queries are useful in searching large-scale archives of time-evolving content such as the web archives or wikis. Typical approaches for efficient evaluation of these queries involve ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
July 2007
946 pages
ISBN:9781595935977
DOI:10.1145/1277741
General Chairs:
Wessel Kraaij
TNO, The Netherlands
,
Arjen P. de Vries
CWI, The Netherlands
,
Program Chairs:
Charles L. A. Clarke
University of Waterloo, Canada
,
Norbert Fuhr
University of Duisburg-Essen, Germany
,
Noriko Kando
National Institute of Informatics, Japan
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 July 2007
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
temporal text indexing
time-travel text search
web archives
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate792of3,983submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 66
  Total Citations
  View Citations
- 1,230
  Total Downloads
- Downloads (Last 12 months)11
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

A time machine for text search

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval

ABSTRACT

References

Cited By

Index Terms

Recommendations

Index maintenance for time-travel text search

Lifespan-based Partitioning of Index Structures for Time-travel Text Search

Temporal index sharding for space-time efficiency in archive search