research-article

The IMPACT dataset of historical document images

Authors:
Christos Papadopoulos, University of Salford, Greater Manchester, United Kingdom
Stefan Pletschacher, University of Salford, Greater Manchester, United Kingdom
Christian Clausner, University of Salford, Greater Manchester, United Kingdom
Apostolos Antonacopoulos, University of Salford, Greater Manchester, United Kingdom

HIP '13: Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, August 2013, Pages 123–130, https://doi.org/10.1145/2501115.2501130

Published: 24 August 2013

ABSTRACT

Representative and comprehensive datasets are a prerequisite for any research activity, from studying specific types of problems, through training algorithms, to evaluating the results of actual implementations. This paper describes an invaluable resource that is the result of a large-scale effort undertaken in the EU-funded project IMPACT. It describes a number of challenges faced during the creation phase, as well as the significant benefits and potential of this collection of printed historical documents. The dataset contains over 600,000 document images that originate from major European libraries and are representative of both their respective holdings and their digitisation plans for the near to medium term. It is unique in the very substantial amount of high-quality ground truth, available for approximately 45,000 pages, which captures detailed layout, reading order and text content. The dataset is publicly available through the IMPACT Centre of Competence (www.digitisation.eu).

Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Graphics recognition and interpretation
2. Information systems
  1. Information systems applications
    1. Multimedia information systems
      1. Multimedia databases

Published in
HIP '13: Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing
August 2013
141 pages
ISBN: 9781450321150
DOI: 10.1145/2501115
Editors:
Volker Märgner, Technische Universität Braunschweig, Germany
Volkmar Frinken, Computer Vision Center, Barcelona, Spain
Bill Barrett, Brigham Young University
R. Manmatha, UMass Amherst
Program Chairs:
Volkmar Frinken, Computer Vision Center, Barcelona, Spain
Bill Barrett, Brigham Young University
R. Manmatha, UMass Amherst
Volker Märgner, Technische Universität Braunschweig, Germany
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
dataset production
ground truth production
historical documents
Acceptance Rates
HIP '13 paper acceptance rate: 18 of 31 submissions, 58%
Overall acceptance rate: 52 of 90 submissions, 58%