2002 | OriginalPaper | Buchkapitel
Machine Learning of Generalized Document Templates for Data Extraction
verfasst von : Janusz Wnek
Erschienen in: Document Analysis Systems V
Verlag: Springer Berlin Heidelberg
Enthalten in: Professional Book Archive
Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.
Wählen Sie Textabschnitte aus um mit Künstlicher Intelligenz passenden Patente zu finden. powered by
Markieren Sie Textabschnitte, um KI-gestützt weitere passende Inhalte zu finden. powered by
The purpose of this research is to reverse engineer the process of encoding data in structured documents and subsequently automate the process of extracting it. We assume a broad category of structured documents for processing that goes beyond form processing. In fact, the documents may have flexible layouts and consist of multiple and varying numbers of pages. The data extraction method (DataX) employs general templates generated by the Inductive Template Generator (InTeGen). The InTeGen method utilizes inductive learning from examples of documents with identified data elements. Both methods achieve high automation with minimal user’s input.