ABSTRACT
Representative and comprehensive datasets are a prerequisite for any research activity, from studying specific types of problems, through training algorithms, to evaluating the results of actual implementations. This paper describes an invaluable resource that is the result of a large-scale effort undertaken in the EU-funded project IMPACT. It discusses the challenges faced during the creation of this collection of printed historical documents, as well as its significant benefits and potential. The dataset contains over 600,000 document images that originate from major European libraries and are representative of both their respective holdings and their digitisation plans for the near to medium term. It is unique in the substantial amount of high-quality ground truth, available for approximately 45,000 pages, which captures detailed layout, reading order and text content. The dataset is publicly available through the IMPACT Centre of Competence (www.digitisation.eu).