2005 | OriginalPaper | Chapter
Data Cleansing and Preparation for Moving Toward Electronic Library Repository
Author: Asanee Kawtrakul
Published in: Digital Libraries: Implementing Strategies and Sharing Experiences
Publisher: Springer Berlin Heidelberg
Manually annotated metadata usually contain typing errors; however, correcting such metadata by hand can be costly and time-consuming. This paper proposes a framework that eases metadata correction through a system that uses OCR and NLP techniques to automatically extract metadata from document images. The system first converts the images into text using OCR and then extracts metadata from the OCR results. The extracted metadata are then compared with the data in the existing repository to locate erroneous entries. These entries are displayed to users, who correct them using supporting information. Although human judgment is still required to correct the errors, this step is needed only for the erroneous entries. Experimental results on 3,712 thesis abstracts show that the proposed solution can automatically extract the relevant information with 91.41% accuracy.
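The comparison step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how extracted metadata are matched against the repository, so the string normalization, the use of `difflib.SequenceMatcher` as a similarity measure, the function name `find_error_entries`, the field names, and the sample records are all assumptions made for illustration. OCR and metadata extraction are assumed to have already produced the `extracted` dictionary.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    # Collapse whitespace and case so trivial differences are not flagged.
    return " ".join(value.lower().split())


def find_error_entries(extracted, repository):
    """Return (record_id, field, repository_value, extracted_value, similarity)
    for every field whose repository value differs from the extracted value."""
    flagged = []
    for record_id, repo_fields in repository.items():
        ocr_fields = extracted.get(record_id, {})
        for field, repo_value in repo_fields.items():
            ocr_value = ocr_fields.get(field)
            if ocr_value is None:
                continue  # nothing was extracted for this field; cannot compare
            a, b = normalize(repo_value), normalize(ocr_value)
            if a != b:
                # The similarity score is supporting information for the reviewer.
                similarity = SequenceMatcher(None, a, b).ratio()
                flagged.append((record_id, field, repo_value, ocr_value, similarity))
    return flagged


# Hypothetical sample data (not from the paper): the repository copy of T001
# has a mistyped title, while the OCR-derived copy is correct.
repository = {
    "T001": {"title": "Data Cleansing for Digital Libaries", "year": "2004"},
    "T002": {"title": "Thai Noun Phrase Analysis", "year": "2003"},
}
extracted = {
    "T001": {"title": "Data Cleansing for Digital Libraries", "year": "2004"},
    "T002": {"title": "Thai Noun Phrase Analysis", "year": "2003"},
}

flagged = find_error_entries(extracted, repository)
for record_id, field, repo_value, ocr_value, similarity in flagged:
    print(f"{record_id}.{field}: repository={repo_value!r} extracted={ocr_value!r}")
```

Only the flagged entries would be shown to a human reviewer, which mirrors the abstract's point that manual effort is needed only for erroneous entries. Because OCR output can itself be wrong, a mismatch in this sketch marks an entry for review rather than overwriting the repository value.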