skip to main content
10.1145/502585.502613acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
Article

Structural inference for semistructured data

Published:05 October 2001Publication History

ABSTRACT

Semistructured data presents many challenges, mainly due to its lack of a strict schema. These challenges are further magnified when large amounts of data are gathered from heterogeneous sources. We address this by investigation and development of methods to automatically infer structural information from example data. Using XML as a reference format, we approach the schema generation problem by application of inductive inference theory. In doing so, we review and extend results relating to the search spaces of grammatical inferences. We then adapt a method for evaluating the result of an inference process from computational linguistics. Further, we combine several inference algorithms, including both new techniques introduced by us and those from previous work. Comprehensive experimentation reveals our new hybrid method, based upon recently developed optimisation techniques, to be the most effective.

References

  1. 1.H Ahonen. Generating Grammars for Structured Documents Using Grammatical Inference Methods. Report A-1996-4, Department of Computer Science, University of Finland, 1996.Google ScholarGoogle Scholar
  2. 2.A W Biermann and J A Feldman. On the synthesis of finite-state machines fmm samples of their behaviour. IEEE 'Transactions on Computers, 21:591-597, 1972.Google ScholarGoogle Scholar
  3. 3.R C Carrasco and J Oncina (editors). Grammatical Inference and Applications. Proceedings of the Second International Colloquium on Grammatical Inference (ICGI-94), Lecture Notes in Artificial Intelligence 862, Springer-Verlag 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. 4.R C Carrasco and J Oncina. Learning Stochastic Regular Grammars by Means of a State Merging Method. In R C Carrasco and J Oncina (editors) {3}. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. 5.J Chen. Grammar Generation and Query Processing for Text Databases. Research proposal, University of Waterloo, January 1991.Google ScholarGoogle Scholar
  6. 6.M Dorigo, V Maniezzo and A Coloni. Positive Feedback as a Search Strategy. Technical Report 91-016, Dipartmento di Elettronica, Politecnico di M&no, Italy, 1991.Google ScholarGoogle Scholar
  7. 7.P Fankhauser and Y Xu. Markitup! An Incremental Approach to Document Structure Recognition. Electronic Publishing - Origination, Dissemination and Design, 6(4):447-456, 1994.Google ScholarGoogle Scholar
  8. 8.M Garofalakis, A Gionis, R Rastogi, S Seshadri and K Shim. XTRACT: A System for Extmcting Document Type Descriptors from XML Documents. In Proceedings of SIGMOD 2000, pages 165-176, Dallas, TX, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. 9.M P Georgeff and C S Wallace. A General Selection Criterion for Inductive Inference. In T O'Shea (editor), ECAI-84: Advances in Artificial Intelligence, pages 473-481. Dordretch: Elsevier, 1984.Google ScholarGoogle Scholar
  10. 10.R Goldman and J Widom. Dataguides: Enabling Query Formulation and Optimization in Semistructured Databaes. In Proceedings of the Twenty-Third International Conference on Very Large Databases, pages 436-445, August 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. 11.J McHugh, S Abiteboul, R Goldman, D Quass and J Widom. Lore: A database management, system for semistructured data. SIGMOD Record, 26(3):54-66, September 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. 12.S Nestorov, S Abiteboul and R Motwani. Inferring Structure in Semistructured Data. In Proceedings of the Workshop on Management of Semistructured Data (in Conjunction with PODS/SIGMOD), May 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. 13.L Pitt. Inductive Inference, DFA's and Computational Complexity. In J Siekmann (editor), Proceedings of the International Workshop AI1 '89, Lecture Notes in Artificial Intelligence 397, pages 1844, Springer-Verlag 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. 14.A V Raman and J D Patrick. The Sk-strings method for inferring PFSA. In Proceedings of the 14th International Conference on Machine Learning, ICML'97, 1997.Google ScholarGoogle Scholar
  15. 15.A V Raman. An Information Theoretic Approach to Language Relatedness. PhD Thesis, Massey University, 1997.Google ScholarGoogle Scholar
  16. 16.Y Sakakibara. Recent Advances in Grammatical Inference. Theoretical Computer Science Volume 185, Number 1, October 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. 17.K Shafer. Creating DTDs via the GB-engine and l+ed. Available at http://www.oclc.org/fred/, 1995.Google ScholarGoogle Scholar
  18. 18.C S Wallace and D M Boulton. An Information Measure for Classification. Computer Journal 11:185-194, 1968.Google ScholarGoogle ScholarCross RefCross Ref
  19. 19.M D Young-Lai. Application of a Stochastic Grammatical Inference Method to Text Structure. Master's thesis, Computer Science Department, University of Waterloo, 1996.Google ScholarGoogle Scholar

Index Terms

  1. Structural inference for semistructured data

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          CIKM '01: Proceedings of the tenth international conference on Information and knowledge management
          October 2001
          616 pages
          ISBN:1581134363
          DOI:10.1145/502585

          Copyright © 2001 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 5 October 2001

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • Article

          Acceptance Rates

          Overall Acceptance Rate1,861of8,427submissions,22%

          Upcoming Conference

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader