skip to main content
10.1145/1277741.1277831acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
Article

A time machine for text search

Authors Info & Claims
Published:23 July 2007Publication History

ABSTRACT

Text search over temporally versioned document collections such as web archives has received little attention as a research problem. As a consequence, there is no scalable and principled solution to search such a collection as of a specified time. In this work, we address this shortcoming and propose an efficient solution for time-travel text search by extending the inverted file index to make it ready for temporal search. We introduce approximate temporal coalescing as a tunable method to reduce the index size without significantly affecting the quality of results. In order to further improve the performance of time-travel queries, we introduce two principled techniques to trade off index size for its performance. These techniques can be formulated as optimization problems that can be solved to near-optimality. Finally, our approach is evaluated in a comprehensive series of experiments on two large-scale real-world datasets. Results unequivocally show that our methods make it possible to build an efficient "time machine" scalable to large versioned text collections.

References

  1. V. N. Anh and A. Moffat. Pruned Query Evaluation Using Pre-Computed Impacts. In SIGIR, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. V. N. Anh and A. Moffat. Pruning Strategies for Mixed-Mode Querying. In CIKM, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. P. G. Anick and R. A. Flynn. Versioning a Full-Text Information Retrieval System. In SIGIR, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. R. A. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. K. Berberich, S. Bedathur, T. Neumann, and G. Weikum. A Time Machine for Text search. Technical Report MPI-I-2007-5-002, Max-Planck Institute for Informatics, 2007.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. M. H. Böhlen, R. T. Snodgrass, and M. D. Soo. Coalescing in Temporal Databases. In VLDB, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. P. Boldi, M. Santini, and S. Vigna. Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations. In WAW, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  8. A. Z. Broder, N. Eiron, M. Fontoura, M. Herscovici, R. Lempel, J. McPherson, R. Qi, and E. J. Shekita. Indexing Shared Content in Information Retrieval Systems. In EDBT, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. C. Buckley and A. F. Lewit. Optimization of Inverted Vector Searches. In SIGIR, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. M. Burrows and A. L. Hisgen. Method and Apparatus for Generating and Searching Range-Based Index of Word Locations. U.S. Patent 5,915,251, 1999.Google ScholarGoogle Scholar
  11. S. Büttcher and C. L. A. Clarke. A Document-Centric Approach to Static Index Pruning in Text Retrieval Systems. In CIKM, 2006.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. S. Maarek, and A. Soffer. Static Index Pruning for Information Retrieval Systems. In SIGIR, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. http://www.europarchive.org.Google ScholarGoogle Scholar
  14. R. Fagin, R. Kumar, and D. Sivakumar. Comparing Top k Lists. SIAM J. Discrete Math., 17(1):134--160, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. J. Comput. Syst. Sci., 66(4):614--656, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. S. Guha, K. Shim, and J. Woo. REHIST: Relative Error Histogram Construction Algorithms. In VLDB, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. Hersovici, R. Lempel, and S. Yogev. Efficient Indexing of Versioned Document Sequences. In ECIR, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. http://www.archive.org.Google ScholarGoogle Scholar
  19. Y. E. Ioannidis and V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In SIGMOD, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal Histograms with Quality Guarantees. In VLDB, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. E. J. Keogh, S. Chu, D. Hart, and M. J. Pazzani. An Online Algorithm for Segmenting Time Series. In ICDM, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. S. Kirkpatrick, D. G. Jr., and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671--680, 1983.Google ScholarGoogle ScholarCross RefCross Ref
  23. J. Kleinberg and E. Tardos. Algorithm Design. Addison-Wesley, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. U. Manber. Introduction to Algorithms: A Creative Approach. Addison-Wesley, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. K. Nørvåg and A. O. N. Nybø. DyST: Dynamic and Scalable Temporal Text Indexing. In TIME, 2006.Google ScholarGoogle Scholar
  26. J. M. Ponte and W. B. Croft. A Language Modeling Approach to Information Retrieval. In SIGIR, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. S. E. Robertson and S. Walker. Okapi/Keenbow at TREC-8. In TREC, 1999.Google ScholarGoogle Scholar
  28. B. Salzberg and V. J. Tsotras. Comparison of Access Methods for Time-Evolving Data. ACM Comput. Surv., 31(2):158--221, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. M. Stack. Full Text Search of Web Archive Collections. In IWAW, 2006.Google ScholarGoogle Scholar
  30. E. Terzi and P. Tsaparas. Efficient Algorithms for Sequence Segmentation. In SIAM-DM, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  31. M. Theobald, G. Weikum, and R. Schenkel. Top-k Query Evaluation with Probabilistic Guarantees. In VLDB, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. http://www.wikipedia.org.Google ScholarGoogle Scholar
  33. I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann publishers Inc., 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. J. Zhang and T. Suel. Efficient Search in Large Textual Collections with Redundancy. In WWW, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. J. Zobel and A. Moffat. Inverted Files for Text Search Engines. ACM Comput. Surv., 38(2):6, 2006. 526. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. A time machine for text search

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
            July 2007
            946 pages
            ISBN:9781595935977
            DOI:10.1145/1277741

            Copyright © 2007 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 23 July 2007

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • Article

            Acceptance Rates

            Overall Acceptance Rate792of3,983submissions,20%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader