DOI: 10.1145/508791.508919
Article

Structured information retrieval in XML documents

Published: 11 March 2002

ABSTRACT

Query languages that take advantage of XML document structure already exist. However, the systems developed to query XML data approach XML sources from a database perspective. This paper examines an XML collection from the viewpoint of Information Retrieval (IR). We view the XML documents as a collection of text documents with additional tags, and we attempt to adapt existing IR techniques to support more sophisticated search over XML documents. We employ a class of queries that support path expressions and suggest an efficient index that extends the inverted file structure to search XML documents. The index integrates the XML structure into the inverted file by combining the inverted file with a path index. The resulting structure is a lexicographical index that can be used to evaluate queries involving path expressions. The paper also discusses a ranking scheme based on both term distribution and document structure, and presents some remarks on performance.
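
To make the idea of an inverted file combined with a path index more concrete, here is a minimal Python sketch. It is not the structure proposed in the paper, whose details the abstract does not give. It assumes each posting records the document and the element path where a term occurs, and that a path expression is matched as a plain suffix of that path. The names `build_index`, `search`, `DOCS`, and `INDEX` are hypothetical.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def build_index(docs):
    """Map each term to {(doc_id, element_path): term frequency}."""
    index = defaultdict(lambda: defaultdict(int))

    def walk(elem, prefix, doc_id):
        path = f"{prefix}/{elem.tag}"
        # Index the text that sits directly inside this element.
        for term in (elem.text or "").lower().split():
            index[term][(doc_id, path)] += 1
        for child in elem:
            walk(child, path, doc_id)
            # Text following a child still belongs to the current element.
            for term in (child.tail or "").lower().split():
                index[term][(doc_id, path)] += 1

    for doc_id, xml_text in docs.items():
        walk(ET.fromstring(xml_text), "", doc_id)
    return index


def search(index, term, path_suffix=""):
    """Return {doc_id: frequency} for postings whose path ends with path_suffix."""
    hits = defaultdict(int)
    for (doc_id, path), tf in index.get(term.lower(), {}).items():
        if path.endswith(path_suffix):
            hits[doc_id] += tf
    return dict(hits)


# Two tiny illustrative documents (made up for this sketch).
DOCS = {
    "d1": "<article><title>XML retrieval</title><body>indexing XML text</body></article>",
    "d2": "<article><title>Databases</title><body>XML query languages</body></article>",
}
INDEX = build_index(DOCS)

print(search(INDEX, "xml"))            # term anywhere in the document
print(search(INDEX, "xml", "/title"))  # term restricted to title elements
```

Passing a suffix with a leading slash, such as `/title`, avoids accidental matches against longer tag names. A ranking scheme of the kind the abstract mentions would additionally weight these frequencies by term distribution and by the element in which the term occurs; that part is not sketched here.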

          Published in

          SAC '02: Proceedings of the 2002 ACM symposium on Applied computing
          March 2002
          1200 pages
          ISBN: 1581134452
          DOI: 10.1145/508791

          Copyright © 2002 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
