Published in: International Journal on Document Analysis and Recognition (IJDAR) 3/2023

30 January 2023 | Special Issue Paper

Large-scale genealogical information extraction from handwritten Quebec parish records

Written by: Solène Tarride, Martin Maarand, Mélodie Boillet, James McGrath, Eugénie Capel, Hélène Vézina, Christopher Kermorvant


Abstract

This paper presents a complete workflow for extracting information from handwritten Quebec parish registers. The acts in these documents contain individual and family information that is highly valuable for genetic, demographic and social studies of the Quebec population. Given an image of a parish record, our workflow identifies the acts and extracts personal information. The workflow is divided into successive steps: page classification, text line detection, handwritten text recognition, named entity recognition, and act detection and classification. For each of these steps, different machine learning models are compared. Once the information is extracted, validation rules designed by experts are applied to standardize it and ensure its consistency with the type of act (birth, marriage or death). This validation step can reject records that are considered invalid or merged. The full workflow has been used to process over two million pages of Quebec parish registers from the 19th and 20th centuries. On a sample comprising 65% of the registers, 3.2 million acts were recognized. Verification of the birth and death acts from this sample shows that 74% of them are considered complete and valid. These records will be integrated into the BALSAC database and linked together to recreate family and genealogical relations at large scale.
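The validation step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not list the expert rules, so everything below is an assumption made for illustration: the `Act` class, the `REQUIRED_FIELDS` table, the field names, and the `normalize` and `validate` helpers are hypothetical and are not the authors' implementation. The sketch only shows the general idea of checking extracted entities against the type of act and rejecting incomplete records.

```python
from dataclasses import dataclass, field

# Hypothetical required fields per act type. The paper's actual expert rules
# are not given in the abstract; these sets are illustrative assumptions.
REQUIRED_FIELDS = {
    "birth": {"subject_name", "father_name", "mother_name", "date"},
    "marriage": {"groom_name", "bride_name", "date"},
    "death": {"subject_name", "date"},
}


@dataclass
class Act:
    """One detected act with its type and extracted entities (hypothetical)."""
    act_type: str
    entities: dict = field(default_factory=dict)


def normalize(value: str) -> str:
    """Collapse runs of whitespace and title-case a recognized string."""
    return " ".join(value.split()).title()


def validate(act: Act) -> tuple:
    """Return (is_valid, missing), where missing is a sorted list of the
    required field names that are absent or blank for this act type."""
    required = REQUIRED_FIELDS.get(act.act_type)
    if required is None:
        # Unknown act types are rejected outright in this sketch.
        return False, ["unknown act type"]
    present = {name for name, value in act.entities.items()
               if value and value.strip()}
    missing = sorted(required - present)
    return not missing, missing
```

With these assumed rules, a death act carrying a name and a date passes, while a birth act without parent names is rejected along with the list of missing fields. A real rule set would likely also check dates, name dictionaries and signs of merged acts, none of which are modeled here.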


Metadata
Title
Large-scale genealogical information extraction from handwritten Quebec parish records
Written by
Solène Tarride
Martin Maarand
Mélodie Boillet
James McGrath
Eugénie Capel
Hélène Vézina
Christopher Kermorvant
Publication date
30 January 2023
Publisher
Springer Berlin Heidelberg
Published in
International Journal on Document Analysis and Recognition (IJDAR) / Issue 3/2023
Print ISSN: 1433-2833
Electronic ISSN: 1433-2825
DOI
https://doi.org/10.1007/s10032-023-00427-w
