Structural inference for semistructured data

Authors:
Jason Sankey, University of Sydney, Sydney, Australia
Raymond K. Wong, University of New South Wales, Sydney, Australia
CIKM '01: Proceedings of the tenth international conference on Information and knowledge management, October 2001, Pages 159–166
https://doi.org/10.1145/502585.502613

Published: 05 October 2001

ABSTRACT

Semistructured data presents many challenges, mainly because it lacks a strict schema. These challenges are magnified when large amounts of data are gathered from heterogeneous sources. We address them by investigating and developing methods that automatically infer structural information from example data. Using XML as a reference format, we approach the schema generation problem by applying inductive inference theory. In doing so, we review and extend results relating to the search spaces of grammatical inference. We then adapt a method from computational linguistics for evaluating the result of an inference process. Further, we combine several inference algorithms, including both new techniques that we introduce and techniques from previous work. Comprehensive experimentation shows our new hybrid method, based upon recently developed optimisation techniques, to be the most effective.
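
To make the schema generation problem concrete, the following is a minimal sketch of a common starting point for grammatical inference over XML: collect the child-element sequences seen under each element name, then build a prefix tree acceptor over them. It is an illustration only and is not taken from the paper. The function names (`child_sequences`, `build_prefix_tree`, `accepts`) and the sample documents are hypothetical, and the paper's own algorithms, including its hybrid method, are not reproduced here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def child_sequences(xml_documents):
    """Map each element name to the child-tag sequences observed under it."""
    samples = defaultdict(list)
    for text in xml_documents:
        root = ET.fromstring(text)
        for element in root.iter():
            samples[element.tag].append(tuple(child.tag for child in element))
    return samples


def build_prefix_tree(sequences):
    """Build a prefix tree acceptor; states are ints and state 0 is the start.

    Returns (transitions, accept_counts): transitions[state][tag] is the next
    state, and accept_counts[state] counts the samples that ended in state.
    """
    transitions = {0: {}}
    accept_counts = defaultdict(int)
    for sequence in sequences:
        state = 0
        for tag in sequence:
            if tag not in transitions[state]:
                new_state = len(transitions)
                transitions[state][tag] = new_state
                transitions[new_state] = {}
            state = transitions[state][tag]
        accept_counts[state] += 1
    return transitions, dict(accept_counts)


def accepts(transitions, accept_counts, sequence):
    """Return True if the acceptor recognises exactly this tag sequence."""
    state = 0
    for tag in sequence:
        if tag not in transitions[state]:
            return False
        state = transitions[state][tag]
    return state in accept_counts


# Hypothetical example documents (assumption, not data from the paper).
DOCS = [
    "<book><title/><author/><author/></book>",
    "<book><title/><author/></book>",
]

samples = child_sequences(DOCS)
pta, accepting = build_prefix_tree(samples["book"])
```

The acceptor built this way recognises only the sequences it was shown, so a `book` with three `author` children would be rejected. Generalising beyond the examples, for instance by merging states, is the part that an inference algorithm has to supply.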

Published in
CIKM '01: Proceedings of the tenth international conference on Information and knowledge management
October 2001
616 pages
ISBN: 1581134363
DOI: 10.1145/502585
Editors: Henrique Paques (Georgia Institute of Technology), Ling Liu (Georgia Institute of Technology), David Grossman (Illinois Institute of Technology)
General Chair: Calton Pu (Georgia Institute of Technology)
Copyright © 2001 ACM
Publisher
Association for Computing Machinery
New York, NY, United States