ABSTRACT
Text search over temporally versioned document collections such as web archives has received little attention as a research problem. As a consequence, there is no scalable and principled solution to search such a collection as of a specified time. In this work, we address this shortcoming and propose an efficient solution for time-travel text search by extending the inverted file index to make it ready for temporal search. We introduce approximate temporal coalescing as a tunable method to reduce the index size without significantly affecting the quality of results. In order to further improve the performance of time-travel queries, we introduce two principled techniques to trade off index size for its performance. These techniques can be formulated as optimization problems that can be solved to near-optimality. Finally, our approach is evaluated in a comprehensive series of experiments on two large-scale real-world datasets. Results unequivocally show that our methods make it possible to build an efficient "time machine" scalable to large versioned text collections.
- V. N. Anh and A. Moffat. Pruned Query Evaluation Using Pre-Computed Impacts. In SIGIR, 2006. Google ScholarDigital Library
- V. N. Anh and A. Moffat. Pruning Strategies for Mixed-Mode Querying. In CIKM, 2006. Google ScholarDigital Library
- P. G. Anick and R. A. Flynn. Versioning a Full-Text Information Retrieval System. In SIGIR, 1992. Google ScholarDigital Library
- R. A. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999. Google ScholarDigital Library
- K. Berberich, S. Bedathur, T. Neumann, and G. Weikum. A Time Machine for Text search. Technical Report MPI-I-2007-5-002, Max-Planck Institute for Informatics, 2007.Google ScholarDigital Library
- M. H. Böhlen, R. T. Snodgrass, and M. D. Soo. Coalescing in Temporal Databases. In VLDB, 1996. Google ScholarDigital Library
- P. Boldi, M. Santini, and S. Vigna. Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations. In WAW, 2004.Google ScholarCross Ref
- A. Z. Broder, N. Eiron, M. Fontoura, M. Herscovici, R. Lempel, J. McPherson, R. Qi, and E. J. Shekita. Indexing Shared Content in Information Retrieval Systems. In EDBT, 2006. Google ScholarDigital Library
- C. Buckley and A. F. Lewit. Optimization of Inverted Vector Searches. In SIGIR, 1985. Google ScholarDigital Library
- M. Burrows and A. L. Hisgen. Method and Apparatus for Generating and Searching Range-Based Index of Word Locations. U.S. Patent 5,915,251, 1999.Google Scholar
- S. Büttcher and C. L. A. Clarke. A Document-Centric Approach to Static Index Pruning in Text Retrieval Systems. In CIKM, 2006.Google ScholarDigital Library
- D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. S. Maarek, and A. Soffer. Static Index Pruning for Information Retrieval Systems. In SIGIR, 2001. Google ScholarDigital Library
- http://www.europarchive.org.Google Scholar
- R. Fagin, R. Kumar, and D. Sivakumar. Comparing Top k Lists. SIAM J. Discrete Math., 17(1):134--160, 2003. Google ScholarDigital Library
- R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. J. Comput. Syst. Sci., 66(4):614--656, 2003. Google ScholarDigital Library
- S. Guha, K. Shim, and J. Woo. REHIST: Relative Error Histogram Construction Algorithms. In VLDB, 2004. Google ScholarDigital Library
- M. Hersovici, R. Lempel, and S. Yogev. Efficient Indexing of Versioned Document Sequences. In ECIR, 2007. Google ScholarDigital Library
- http://www.archive.org.Google Scholar
- Y. E. Ioannidis and V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In SIGMOD, 1995. Google ScholarDigital Library
- H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal Histograms with Quality Guarantees. In VLDB, 1998. Google ScholarDigital Library
- E. J. Keogh, S. Chu, D. Hart, and M. J. Pazzani. An Online Algorithm for Segmenting Time Series. In ICDM, 2001. Google ScholarDigital Library
- S. Kirkpatrick, D. G. Jr., and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671--680, 1983.Google ScholarCross Ref
- J. Kleinberg and E. Tardos. Algorithm Design. Addison-Wesley, 2005. Google ScholarDigital Library
- U. Manber. Introduction to Algorithms: A Creative Approach. Addison-Wesley, 1989. Google ScholarDigital Library
- K. Nørvåg and A. O. N. Nybø. DyST: Dynamic and Scalable Temporal Text Indexing. In TIME, 2006.Google Scholar
- J. M. Ponte and W. B. Croft. A Language Modeling Approach to Information Retrieval. In SIGIR, 1998. Google ScholarDigital Library
- S. E. Robertson and S. Walker. Okapi/Keenbow at TREC-8. In TREC, 1999.Google Scholar
- B. Salzberg and V. J. Tsotras. Comparison of Access Methods for Time-Evolving Data. ACM Comput. Surv., 31(2):158--221, 1999. Google ScholarDigital Library
- M. Stack. Full Text Search of Web Archive Collections. In IWAW, 2006.Google Scholar
- E. Terzi and P. Tsaparas. Efficient Algorithms for Sequence Segmentation. In SIAM-DM, 2006.Google ScholarCross Ref
- M. Theobald, G. Weikum, and R. Schenkel. Top-k Query Evaluation with Probabilistic Guarantees. In VLDB, 2004. Google ScholarDigital Library
- http://www.wikipedia.org.Google Scholar
- I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann publishers Inc., 1999. Google ScholarDigital Library
- J. Zhang and T. Suel. Efficient Search in Large Textual Collections with Redundancy. In WWW, 2007. Google ScholarDigital Library
- J. Zobel and A. Moffat. Inverted Files for Text Search Engines. ACM Comput. Surv., 38(2):6, 2006. 526. Google ScholarDigital Library
Index Terms
- A time machine for text search
Recommendations
Index maintenance for time-travel text search
SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrievalTime-travel text search enriches standard text search by temporal predicates, so that users of web archives can easily retrieve document versions that are considered relevant to a given keyword query and existed during a given time interval. Different ...
Lifespan-based Partitioning of Index Structures for Time-travel Text Search
CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge ManagementTime-travel text search over a temporally evolving document collection is useful in various applications. Supporting a wide range of query classes demanded by these applications require different index layouts optimized for their respective query access ...
Temporal index sharding for space-time efficiency in archive search
SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information RetrievalTime-travel queries that couple temporal constraints with keyword queries are useful in searching large-scale archives of time-evolving content such as the web archives or wikis. Typical approaches for efficient evaluation of these queries involve ...
Comments