ABSTRACT
Interesting structured or semistructured data often resides not in database systems but in HTML pages, text files, or on paper. Data in these formats cannot be used by standard query processing engines, so users need a way either to extract it from these sources into a DBMS or to write wrappers around the sources. This paper describes NoDoSE, the Northwestern Document Structure Extractor, an interactive tool for semi-automatically determining the structure of such documents and then extracting their data. Using a GUI, the user hierarchically decomposes the file, outlining its interesting regions and then describing their semantics. This task is expedited by a mining component that attempts to infer the grammar of the file from the information the user has input so far. Once the format of a document has been determined, its data can be extracted into a number of useful forms. This paper describes both the NoDoSE architecture, which can be used as a test bed for structure mining algorithms in general, and the mining algorithms that have been developed by the author. The prototype, which is written in Java, is described, and experiences parsing a variety of documents are reported.
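The hierarchical decomposition described in the abstract can be pictured as a tree of labeled regions over the document text, from which data is then extracted. The following is a minimal Python sketch under stated assumptions, not NoDoSE's actual Java implementation: the names `Region`, `mark`, and `extract`, the four region kinds, and the sample text are all hypothetical, and the mining component that infers the grammar is not modeled (every region here is marked by hand, as a user would in the GUI).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    """A labeled span [start, end) of the document text (hypothetical model)."""
    label: str
    start: int
    end: int
    kind: str = "string"  # one of "string", "int", "record", "list"
    children: List["Region"] = field(default_factory=list)

    def add(self, child: "Region") -> "Region":
        # A child region must lie entirely inside its parent.
        if not (self.start <= child.start <= child.end <= self.end):
            raise ValueError(f"region {child.label!r} is not nested in {self.label!r}")
        self.children.append(child)
        return child


def mark(text: str, parent: Region, label: str, snippet: str, kind: str = "string") -> Region:
    """Stand-in for the user outlining a region: locate `snippet` inside `parent`."""
    start = text.index(snippet, parent.start, parent.end)
    return parent.add(Region(label, start, start + len(snippet), kind))


def extract(region: Region, text: str):
    """Walk the region tree and turn the outlined text into nested Python data."""
    if region.kind == "list":
        return [extract(child, text) for child in region.children]
    if region.kind == "record":
        return {child.label: extract(child, text) for child in region.children}
    raw = text[region.start:region.end].strip()
    return int(raw) if region.kind == "int" else raw


# Sample document (made up for illustration): one person per line.
text = "Smith, 42\nJones, 37\n"

root = Region("people", 0, len(text), kind="list")
for line in ("Smith, 42", "Jones, 37"):
    record = mark(text, root, "person", line, kind="record")
    name, age = line.split(", ")
    mark(text, record, "name", name)
    mark(text, record, "age", age, kind="int")

result = extract(root, text)
print(result)
```

In this sketch the tree plays the role of the document's inferred format: once it exists, `extract` can emit the data in whatever form is convenient (here, nested lists and dictionaries). A mining step would aim to propose the second record's regions automatically after the user has outlined the first.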
Index Terms
- NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents