ABSTRACT
Semistructured data presents many challenges, mainly due to its lack of a strict schema. These challenges are further magnified when large amounts of data are gathered from heterogeneous sources. We address this by investigation and development of methods to automatically infer structural information from example data. Using XML as a reference format, we approach the schema generation problem by application of inductive inference theory. In doing so, we review and extend results relating to the search spaces of grammatical inferences. We then adapt a method for evaluating the result of an inference process from computational linguistics. Further, we combine several inference algorithms, including both new techniques introduced by us and those from previous work. Comprehensive experimentation reveals our new hybrid method, based upon recently developed optimisation techniques, to be the most effective.
- 1.H Ahonen. Generating Grammars for Structured Documents Using Grammatical Inference Methods. Report A-1996-4, Department of Computer Science, University of Finland, 1996.Google Scholar
- 2.A W Biermann and J A Feldman. On the synthesis of finite-state machines fmm samples of their behaviour. IEEE 'Transactions on Computers, 21:591-597, 1972.Google Scholar
- 3.R C Carrasco and J Oncina (editors). Grammatical Inference and Applications. Proceedings of the Second International Colloquium on Grammatical Inference (ICGI-94), Lecture Notes in Artificial Intelligence 862, Springer-Verlag 1994. Google ScholarDigital Library
- 4.R C Carrasco and J Oncina. Learning Stochastic Regular Grammars by Means of a State Merging Method. In R C Carrasco and J Oncina (editors) {3}. Google ScholarDigital Library
- 5.J Chen. Grammar Generation and Query Processing for Text Databases. Research proposal, University of Waterloo, January 1991.Google Scholar
- 6.M Dorigo, V Maniezzo and A Coloni. Positive Feedback as a Search Strategy. Technical Report 91-016, Dipartmento di Elettronica, Politecnico di M&no, Italy, 1991.Google Scholar
- 7.P Fankhauser and Y Xu. Markitup! An Incremental Approach to Document Structure Recognition. Electronic Publishing - Origination, Dissemination and Design, 6(4):447-456, 1994.Google Scholar
- 8.M Garofalakis, A Gionis, R Rastogi, S Seshadri and K Shim. XTRACT: A System for Extmcting Document Type Descriptors from XML Documents. In Proceedings of SIGMOD 2000, pages 165-176, Dallas, TX, 2000. Google ScholarDigital Library
- 9.M P Georgeff and C S Wallace. A General Selection Criterion for Inductive Inference. In T O'Shea (editor), ECAI-84: Advances in Artificial Intelligence, pages 473-481. Dordretch: Elsevier, 1984.Google Scholar
- 10.R Goldman and J Widom. Dataguides: Enabling Query Formulation and Optimization in Semistructured Databaes. In Proceedings of the Twenty-Third International Conference on Very Large Databases, pages 436-445, August 1997. Google ScholarDigital Library
- 11.J McHugh, S Abiteboul, R Goldman, D Quass and J Widom. Lore: A database management, system for semistructured data. SIGMOD Record, 26(3):54-66, September 1997. Google ScholarDigital Library
- 12.S Nestorov, S Abiteboul and R Motwani. Inferring Structure in Semistructured Data. In Proceedings of the Workshop on Management of Semistructured Data (in Conjunction with PODS/SIGMOD), May 1997. Google ScholarDigital Library
- 13.L Pitt. Inductive Inference, DFA's and Computational Complexity. In J Siekmann (editor), Proceedings of the International Workshop AI1 '89, Lecture Notes in Artificial Intelligence 397, pages 1844, Springer-Verlag 1989. Google ScholarDigital Library
- 14.A V Raman and J D Patrick. The Sk-strings method for inferring PFSA. In Proceedings of the 14th International Conference on Machine Learning, ICML'97, 1997.Google Scholar
- 15.A V Raman. An Information Theoretic Approach to Language Relatedness. PhD Thesis, Massey University, 1997.Google Scholar
- 16.Y Sakakibara. Recent Advances in Grammatical Inference. Theoretical Computer Science Volume 185, Number 1, October 1997. Google ScholarDigital Library
- 17.K Shafer. Creating DTDs via the GB-engine and l+ed. Available at http://www.oclc.org/fred/, 1995.Google Scholar
- 18.C S Wallace and D M Boulton. An Information Measure for Classification. Computer Journal 11:185-194, 1968.Google ScholarCross Ref
- 19.M D Young-Lai. Application of a Stochastic Grammatical Inference Method to Text Structure. Master's thesis, Computer Science Department, University of Waterloo, 1996.Google Scholar
Index Terms
- Structural inference for semistructured data
Recommendations
Semistructured data and XML
Information organization and databasesXML poses a new set of challenges for semistructured data research. The Extensible Markup Language, XML, is a new recommendation from World Wide Web Consortium that will become a universal data exchange format for the Web. XML shares many common ...
Validating semistructured data using OWL
WAIM '06: Proceedings of the 7th international conference on Advances in Web-Age Information ManagementSemistructured data has become prevalent in both web applications and database systems. This rapid growth in use makes the design of good semistructured data essential. Formal semantics and automated reasoning tools enable us to reveal the ...
Semistructured data: the TSIMMIS experience
ADBIS'97: Proceedings of the First East-European conference on Advances in Databases and Information systemsIn this paper we discuss themanagement of semi-structured data, i.e., data that has irregular or dynamically changing structure. We describe components of the Stanford TSIMMIS Project that help extract semi-structured data from Web pages, that allow the ...
Comments