ABSTRACT
Query languages that exploit the structure of XML documents already exist, but the systems built to query XML data approach their sources from a database perspective. This paper instead examines an XML collection from the viewpoint of Information Retrieval (IR): we treat XML documents as text documents with additional tags and adapt existing IR techniques to support more sophisticated search over them. We employ a class of queries that supports path expressions and propose an efficient index that extends the inverted file structure to XML documents by combining the inverted file with a path index, so that the XML structure is integrated into the inverted file itself. The resulting structure is a lexicographical index that can be used to evaluate queries involving path expressions. We also discuss a ranking scheme based on both term distribution and document structure, and present some remarks on performance.
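To make the idea of combining an inverted file with a path index more concrete, the following is a minimal sketch and not the index structure proposed in the paper, whose details the abstract does not give. It assumes that each posting is keyed by the element path in which a term occurs, and that a query pairs a term with a path suffix. The names `build_path_index` and `search`, the path-suffix matching rule, and the two sample documents are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_path_index(docs):
    """Map each term to {element path: {doc_id: term frequency}}."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def walk(elem, prefix, doc_id):
        path = f"{prefix}/{elem.tag}"
        # Index the text that appears directly inside this element.
        for term in (elem.text or "").lower().split():
            index[term][path][doc_id] += 1
        for child in elem:
            walk(child, path, doc_id)
            # Text following a child still belongs to the current element.
            for term in (child.tail or "").lower().split():
                index[term][path][doc_id] += 1

    for doc_id, xml_text in docs.items():
        walk(ET.fromstring(xml_text), "", doc_id)
    return index


def search(index, term, path_suffix=""):
    """Return {doc_id: frequency} for a term under paths ending in path_suffix."""
    hits = defaultdict(int)
    for path, postings in index.get(term.lower(), {}).items():
        if path.endswith(path_suffix):  # assumed, simplified path matching
            for doc_id, freq in postings.items():
                hits[doc_id] += freq
    return dict(hits)


# Hypothetical sample collection, for illustration only.
docs = {
    "d1": "<article><title>XML retrieval</title><body>ranking XML text</body></article>",
    "d2": "<article><title>Database systems</title><body>XML storage</body></article>",
}
index = build_path_index(docs)
title_hits = search(index, "xml", "/article/title")  # term restricted to a path
any_hits = search(index, "xml")                      # plain keyword query
```

In this sketch a plain keyword query and a path-restricted query run against the same structure, which is the kind of integration the abstract describes. The per-path term frequencies could in principle feed a ranking function that weights both term distribution and document structure, but that step is left out here because the abstract does not specify the scheme.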