NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents

Author:
Brad Adelberg

Northwestern University, Computer Science Department


SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data, June 1998, Pages 283–294, https://doi.org/10.1145/276304.276330

Published: 01 June 1998


ABSTRACT

Often interesting structured or semistructured data is not in database systems but in HTML pages, text files, or on paper. The data in these formats is not usable by standard query processing engines and hence users need a way of extracting data from these sources into a DBMS or of writing wrappers around the sources. This paper describes NoDoSE, the Northwestern Document Structure Extractor, which is an interactive tool for semi-automatically determining the structure of such documents and then extracting their data. Using a GUI, the user hierarchically decomposes the file, outlining its interesting regions and then describing their semantics. This task is expedited by a mining component that attempts to infer the grammar of the file from the information the user has input so far. Once the format of a document has been determined, its data can be extracted into a number of useful forms. This paper describes both the NoDoSE architecture, which can be used as a test bed for structure mining algorithms in general, and the mining algorithms that have been developed by the author. The prototype, which is written in Java, is described and experiences parsing a variety of documents are reported.
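The workflow described above (the user outlines regions, a mining step generalizes from them, and the data is then extracted) can be illustrated with a minimal sketch. This is not NoDoSE's actual algorithm or API: the `Region` class, the function names, and the delimiter-guessing heuristic are all assumptions made for illustration, and the sample document is invented. The sketch assumes one record per line and a constant delimiter between fields.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Region:
    """A labeled span [start, end) of the document, with optional child regions.

    Hypothetical structure; it stands in for a region the user outlines in a GUI.
    """
    label: str
    start: int
    end: int
    children: List["Region"] = field(default_factory=list)


def infer_field_delimiter(text: str, example: Region) -> str:
    """Guess the delimiter as the text between the first two user-marked fields."""
    first, second = example.children[0], example.children[1]
    return text[first.end:second.start]


def extract_records(text: str, example: Region) -> List[Dict[str, str]]:
    """Apply the structure of one user-marked record to every line of the file."""
    delimiter = infer_field_delimiter(text, example)
    labels = [child.label for child in example.children]
    records = []
    for line in text.splitlines():
        values = line.split(delimiter)
        # Skip lines that do not fit the inferred format.
        if len(values) == len(labels):
            records.append(dict(zip(labels, values)))
    return records


# Invented sample input: one record per line, three fields per record.
document = "Adelberg | 1998 | SIGMOD\nKushmerick | 1997 | IJCAI\n"

# The user outlines the first record and its three fields (hypothetical labels).
example = Region("record", 0, 24, [
    Region("author", 0, 8),
    Region("year", 11, 15),
    Region("venue", 18, 24),
])

for record in extract_records(document, example):
    print(record)
```

A single marked example is enough here only because the sample format is rigid; a real structure-mining component would need to handle nesting, optional fields, and conflicting examples.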



Published in
SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data
June 1998
599 pages
ISBN: 0897919955
DOI: 10.1145/276304
Chairmen: Laura Haas (IBM Almaden Research Center, San Jose, CA), Pamela Drew (Boeing Co.)
Editors: Ashutosh Tiwary (Boeing Co.; and Univ. of Washington, Seattle), Michael Franklin (Univ. of Maryland, College Park)

ACM SIGMOD Record Volume 27, Issue 2
June 1998
595 pages
ISSN: 0163-5808
DOI: 10.1145/276305
Chairmen: Laura Haas (IBM Almaden Research Center, San Jose, CA), Pamela Drew (Boeing Co.)
Editor: Ashutosh Tiwary (Boeing Co.; and Univ. of Washington, Seattle)
Copyright © 1998 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%