ABSTRACT
Inverted index structures are the mainstay of modern text retrieval systems. They can be constructed quickly using off-line merge-based methods, and provide efficient support for a variety of querying modes. In this paper we examine the task of on-line index construction -- that is, how to build an inverted index when the underlying data must be continuously queryable, and the documents must be indexed and available for search as soon they are inserted. When straightforward approaches are used, document insertions become increasingly expensive as the size of the database grows. This paper describes a mechanism based on controlled partitioning that can be adapted to suit different balances of insertion and querying operations, and is faster and scales better than previous methods. Using experiments on 100GB of web data we demonstrate the efficiency of our methods in practice, showing that they dramatically reduce the cost of on-line index construction.
- V. N. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Information Retrieval, 8 (1): 151--166, January 2005. Source code available from www.cs.mu.oz.au/~alistair/carry/. Google ScholarDigital Library
- A. Biliris. An efficient database storage structure for large dynamic objects. In F. Golshani, editor, Proc. IEEE Int. Conf. on Data Engineering, pages 301--308, Tempe, Arizona, February 1992. IEEE Computer Society. ISBN 0-8186-2545-7. Google ScholarDigital Library
- A. Biliris. The performance of three database storage structures for managing large objects. In M. Stonebraker, editor, Proc. ACM-SIGMOD Int. Conf. on the Management of Data, pages 276--285, San Diego, California, June 1992. Google ScholarDigital Library
- E. W. Brown, J. P. Callan, and W. B. Croft. Fast incremental indexing for full-text information retrieval. In J. B. Bocca, M. Jarke, and C. Zaniolo, editors, Proc. Int. Conf. on Very Large Databases, pages 192--202, Santiago, Chile, September 1994. Google ScholarDigital Library
- S. Büttcher and C. L. A. Clarke. Indexing time vs. query time trade-offs in dynamic information retrieval systems. In N. Fuhr, H.-J. Schek, and A. Chowdhury, editors, Proc. ACM CIKM Int. Conf. on Information and Knowledge Management, Bremen, Germany, 2005. Google ScholarDigital Library
- M. J. Carey, D. J. DeWitt, J. E. Richardson, and E. J. Shekita. Object and file management in the EXODUS extensible database system. In W. W. Chu, G. Gardarin, S. Ohsuga, and Y. Kambayashi, editors, Proc. Int. Conf. on Very Large Databases, pages 91--100, Kyoto, Japan, August 1986. Morgan Kaufmann. ISBN 0-934613-18-4. Google ScholarDigital Library
- M. J. Carey, D. J. DeWitt, J. E. Richardson, and E. J. Shekita. Storage management for objects in EXODUS. In W. Kim and F. H. Lochovsky, editors, Object-Oriented Concepts, Databases, and Applications, pages 341--369. Addison-Wesley Longman, New York, 1989. Google ScholarDigital Library
- C. L. A. Clarke and G. V. Cormack. Dynamic inverted indexes for a distributed full-text retrieval system. Technical Report MT-95-01, Department of Computer Science, University of Waterloo, Waterloo, Canada, 1995. MultiText Project Technical Report.Google Scholar
- D. R. Cutting and J. O. Pedersen. Optimizations for dynamic inverted index maintenance. In J.-Luc Vidick, editor, Proc. ACM-SIGIR Int. Conf. on Research and Development in Information Retrieval, pages 405--411, Brussels, Belgium, September 1990. ACM. ISBN 0-89791-408-2. Google ScholarDigital Library
- D. Hawking, N. Craswell, and P. Thistlewaite. Overview of TREC-7 very large collection track. In E. M. Voorhees and D. K. Harman, editors, The Eighth Tert REtrieval Conference (TREC-8), pages 91--104, Gaithersburg, MD, 1999. National Institute of Standards and Technology Special Publication 500-246.Google Scholar
- S. Heinz and J. Zobel. Efficient single-pass index construction for text databases. Jour. of the American Society for Information Science and Technology, 54 (8): 713--729, 2003. Google ScholarDigital Library
- T. J. Lehman and B. G. Lindsay. The Starburst long field manager. In P. M. G. Apers and G. Wiederhold, editors, Proc. Int. Conf. on Very Large Databases, pages 375--383, Amsterdam, The Netherlands, August 1989. ISBN 1-55860-101-5. Google ScholarDigital Library
- N. Lester, J. Zobel, and H.E. Williams. In-place versus re-build versus re-merge: Index maintenance strategies for text retrieval systems. In V. Estivill-Castro, editor, Proc. Australasian Computer Science Conf., pages 15--22, January 2004. Google ScholarDigital Library
- A. Moffat and T. A. H. Bell. In situ generation of compressed inverted files. Journal of the American Society of Information Science, 46 (7): 537--550, 1995. Google ScholarDigital Library
- F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In K. Järvelin, M. Beaulieu, R. Baeza-Yates, and S. H. Myaeng, editors, Proc. ACM-SIGIR Int. Conf. on Research and Development in Information Retrieval, pages 222--229, Tampere, Finland, August 2002. Google ScholarDigital Library
- W.-Y. Shieh and C.-P. Chung. A statistics-based approach to incrementally update inverted files. In H. R. Arabnia, editor, Proc. Int. Conf. on Information and Knowledge Engineering, pages 38--43, Las Vegas, Nevada, June 2003. CSREA Press.Google Scholar
- K. Shoens, A. Tomasic, and H. García-Molina. Synthetic workload performance analysis of incremental updates. In W. B. Croft and C. J. van Rijsbergen, editors, Proc. International ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 329--338, Dublin, Ireland, July 1994. Google ScholarDigital Library
- A. Spink, D. Wolfram, B. J. Jansen, and T. Saracevic. Searching the web: The public and their queries. Jour. of the American Society for Information Science, 52 (3): 226--234, 2001. Google ScholarDigital Library
- A. Tomasic, H. Garcia-Molina, and K. Shoens. Incremental updates of inverted lists for text document retrieval. In Proc. ACM-SIGMOD Int. Conf. on the Management of Data, pages 289--300, Minneapolis, Minnesota, May 1994. ACM. Google ScholarDigital Library
- H.E. Williams and J. Zobel. Searchable words on the web. International Journal of Digital Libraries, 5 (2): 99--105, 2005.Google ScholarDigital Library
- I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, California, second edition, 1999. Google ScholarDigital Library
Index Terms
- Fast on-line index construction by geometric partitioning
Recommendations
On-line index maintenance using horizontal partitioning
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge managementIn this paper, we propose a new merge-based index maintenance strategy for Information Retrieval systems. The new model is based on partitioning of the inverted index across the terms in it. We exploit the query log to partition the on-disk inverted ...
Fast construction of the HYB index
As shown in a series of recent works, the HYB index is an alternative to the inverted index (INV) that enables very fast prefix searches, which in turn is the basis for fast processing of many other types of advanced queries, including autocompletion, ...
Index tuning for query-log based on-line index maintenance
CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge managementThe existing query-log based on-line index maintenance approaches rely on frequency distribution of terms in the static query-log. Though these approaches are proved to be efficient, but in real world, the frequency distribution of the terms changes ...
Comments