DOI: 10.1145/2501115.2501130
research-article

The IMPACT dataset of historical document images

Published: 24 August 2013

ABSTRACT

Representative and comprehensive datasets are a prerequisite for any research activity, from studying specific types of problems through training of algorithms to evaluating results of actual implementations. This paper describes an invaluable resource which is the result of a large-scale effort undertaken in the EU-funded project IMPACT. It describes a number of challenges faced during the creation phase as well as the significant benefits and potential of this collection of printed historical documents. The dataset contains over 600,000 document images that originate from major European libraries and are representative of both their respective holdings and digitisation plans for the near to medium term. It is truly unique with regard to the very substantial amount of high-quality ground truth, which is available for approximately 45,000 pages and captures detailed layout, reading order and text content. The dataset is publicly available through the IMPACT Centre of Competence (www.digitisation.eu).

Published in

HIP '13: Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing
August 2013
141 pages
ISBN: 9781450321150
DOI: 10.1145/2501115

Copyright © 2013 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

HIP '13 paper acceptance rate: 18 of 31 submissions, 58%. Overall acceptance rate: 52 of 90 submissions, 58%.
