2021 | OriginalPaper | Chapter

RASAM – A Dataset for the Recognition and Analysis of Scripts in Arabic Maghrebi

Authors : Chahan Vidal-Gorène, Noëmie Lucas, Clément Salah, Aliénor Decours-Perez, Boris Dupin

Published in: Document Analysis and Recognition – ICDAR 2021 Workshops

Publisher: Springer International Publishing

Abstract

Arabic scripts raise numerous issues in text recognition and layout analysis. To overcome these, several datasets and methods have been proposed in recent years. Although these focus on common scripts and layouts, many Arabic writings and written traditions remain under-resourced. We therefore propose a new dataset comprising 300 images representative of the handwritten production of the Arabic Maghrebi scripts. This dataset is the result of a collaborative effort undertaken in the first quarter of 2021, and it offers several levels of annotation and transcription. The article sheds light on the specificities of these writings and manuscripts and highlights the challenges they pose for recognition. The collaborative tools used to create the dataset are assessed, and the dataset itself is evaluated with state-of-the-art methods in layout analysis. The word-based text recognition method tested on these writings achieves an average character error rate (CER) of 4.8%. The pipeline described provides practical feedback on the quick creation of data and the training of effective HTR systems for Arabic scripts and non-Latin scripts in general.
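For readers unfamiliar with the metric, the CER quoted above is conventionally the character-level edit distance between a reference transcription and the recognized text, divided by the length of the reference. The sketch below illustrates that conventional definition only; it is not the authors' evaluation code. The function names `levenshtein` and `cer` are hypothetical, and counting one Unicode code point as one character is an assumption (an evaluation of vocalized Arabic might normalize or group diacritics differently).

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `hyp` into `ref` (standard dynamic programming)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length.
    Assumption: each Unicode code point counts as one character."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)
```

Under this definition, a CER of 4.8% corresponds to roughly five character edits per hundred reference characters.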
Footnotes
1
The history and origins of these scripts have been the subject of a significant open scholarly debate [3, 4]. The most recent works, in particular those of U. Bongianino, have foregrounded the different itineraries (from books to Qurans, from al-Andalus to the Maghreb) followed by these writings between the 10th and the 13th century [4].
 
2
Characteristics are taken from U. Bongianino [4]; theoretical realizations are taken from the article of N. Van de Boogert upon which U. Bongianino draws [13].
 
4
The BULAC holds the second-largest collection of Arabic manuscripts in France (2,458 identified documentary units). The BULAC collections contain a substantial proportion of manuscripts copied in Maghrebi script. 150 Arabic manuscripts are available online on the website of the BINA Digital library.
 
5
MS.ARA.1977, Collections patrimoniales numérisées de la BULAC.
 
6
Muḥammad b. Mubārak al-Barāšī is also the copyist of the second text. No copyist is mentioned for the third text: the paleographical characteristics of the pages lead us to assume that it is the work of another hand.
 
7
MS.ARA.609, Collections patrimoniales numérisées de la BULAC.
 
8
See bibliographic record on CALAMES.
 
10
The numbering class is not kept in v1.1 of the dataset, for which we observe a 9% gain on average in the identification of the catchword and table classes.
 
11
With better polygons (dataset v1.1), the CER decreases more quickly (16.6 for batch 1, then 15.87, 13.67, 11.52, and finally 6.67 for the last batch).
 
Literature
1.
Abdelhaleem, A., Droby, A., Asi, A., Kassis, M., Asam, R.A., El-sanaa, J.: WAHD: a database for writer identification of Arabic historical documents. In: 2017 1st International Workshop on Arabic Script Analysis and Recognition, pp. 64–68 (2017)
3.
Ben Azzouza, N.: Les corans de l’occident musulman médiéval : état des recherches et nouvelles perspectives. Perspectives 2, 104–130 (2017)
4.
Bongianino, U.: The origins and developments of Maghribī rounds scripts, Arabic Paleography in the Islamic West (4th/10th-6th/12th centuries). Ph.D. thesis, University of Oxford (2017)
5.
Camps, J.B., Vidal-Gorène, C., Vernet, M.: Handling heavily abbreviated manuscripts: HTR engines vs text normalisation approaches (2021). Accepted for IWCP workshop of ICDAR 2021
6.
Clausner, C., Antonacopoulos, A., Mcgregor, N., Wilson-Nunn, D.: ICFHR 2018 competition on recognition of historical Arabic scientific manuscripts - RASM2018. In: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 471–476 (2018)
8.
Diem, M., Kleber, F., Sablatnig, R., Gatos, B.: cBAD: ICDAR2019 competition on baseline detection. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1494–1498 (2019)
9.
Kassis, M., Abdalhaleem, A., Droby, A., Alaasam, R., El-Sana, J.: VML-HD: the historical Arabic documents dataset for recognition systems. In: 2017 1st International Workshop on Arabic Script Analysis and Recognition, pp. 11–14 (2017)
10.
Kiessling, B., Ezra, D.S.B., Miller, M.T.: BADAM: a public dataset for baseline detection in Arabic-script manuscripts. In: Proceedings of the 5th International Workshop on Historical Document Imaging and Processing. HIP 2019, pp. 13–18. Association for Computing Machinery (2019)
11.
Milo, T., Martínez, A.G.: A new strategy for Arabic OCR: archigraphemes, letter blocks, script grammar, and shape synthesis. In: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage. DATeCH2019, pp. 93–96. Association for Computing Machinery, New York (2019)
12.
Pantke, W., Dennhardt, M., Fecker, D., Märgner, V., Fingscheidt, T.: An historical handwritten Arabic dataset for segmentation-free word spotting - HADARA80P. In: 2014 14th International Conference on Frontiers in Handwriting Recognition, pp. 15–20 (2014)
13.
Van Den Boogert, N.: Some notes on Maghribi script. Manuscripts Middle East 4, 30–43 (1989)
14.
Vidal-Gorène, C., Dupin, B., Decours-Perez, A., Riccioli, T.: A modular and automated annotation platform for handwritings: evaluation on under-resourced languages (2021). Accepted for ICDAR 2021 Main Conference
Metadata
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-86198-8_19
