Out of vocabulary word detection and recovery in Arabic handwritten text recognition
Introduction
Handwritten Text Recognition is an active research area in the field of pattern recognition, which aims at converting a text from an image format to an electronic one. The text recognition engine therefore remains the main component of a document processing system; in fact, the success of any document processing system involves a highly precise text recognition system. Several systems are commonly trained and used for handwritten and printed text recognition tasks, and various approaches have been proposed to deal with handwritten documents in large-vocabulary recognition tasks. Specifically, the most widely used methods are based either on hidden Markov models (HMMs) [82] or on recurrent neural networks (RNNs).
These systems rely on internal representations produced using the sliding window approach, in which features are extracted from vertical frames of the line image and fed to a trainable classifier. This method transforms the problem into a sequence-to-sequence transduction one, while eventually encoding the two-dimensional nature of the image using convolutional neural networks [74] or defining the relevant features by hand [34], [81].
While the issue of learning features has been a topic of interest for decades, substantial progress has been achieved with the development of deep learning methods during the last few years. In particular, deep learning methods allow building systems that can handle both the 2D aspect of the input image and the sequential aspect of the prediction. Multidimensional long short-term memory recurrent neural networks (MDLSTM-RNNs) associated with the Connectionist Temporal Classification (CTC) [28] yield low error rates and have become the state-of-the-art model for handwriting recognition [75], [77], [78], [79]. More recently, attention-based models have been applied to recognize handwritten text while avoiding the paragraph-to-line segmentation problem [76].
Traditional handwriting recognition research relies on linguistic resources including static word lexicons [17], referred to as In-Vocabulary words (IV words) in this research study [15], [25], [26], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [60], [61]. Over the last three decades, several research works have taken into account the presence of words that do not belong to the used word lexicon [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [62]. In our work, we refer to these words as Out-Of-Vocabulary words (OOV words). OOV words represent an important source of error in word spotting [14], speech and handwritten text recognition systems, and thus several research works have been proposed to address this issue.
In the field of Automatic Speech Recognition (ASR), there has been significant work on OOV word detection and recovery. The methods addressing this problem can be grouped into two categories: OOV detection based approaches and vocabulary selection based ones. The OOV detection based approaches [64], [65], [66] proceed by detecting the OOV words and/or locating the OOV regions in the ASR hypothesis, followed by a search process to match the phoneme sequence that constitutes the OOV word. These methods mainly involve hybrid language models, thanks to their ability to model both in-vocabulary words and sub-word units. However, OOV detection methods rely heavily on features taken from the speech recognition hypothesis, such as posterior scores. Such features are not entirely reliable, as they may reflect the correspondence between the word hypothesis and the input signal rather than the presence of an OOV word. Besides, a hybrid LM may require a careful selection of sub-word units, which can sometimes lead to increased error rates [67]. Vocabulary selection based approaches propose a relevant vocabulary for speech recognition based on additional text data. Approaches of this second category have been proposed to minimize the OOV rate for a domain-specific corpus [68], [69]. Moreover, they are more dynamic [70], as they can suggest context-specific vocabulary.
For handwriting recognition, the methods addressing the OOV problem employ the same techniques used in ASR systems and can also be subdivided into two categories. One-step approaches [1], [2], [3], [4], [9], [10], [11], [12] try to recover OOV words during the recognition process by increasing the vocabulary size, which generally increases the computational complexity and the confusability in the data. An alternative approach is to use sub-word units, either to estimate a full sub-word LM or to generate a hybrid LM that incorporates both words and sub-words. The performance of sub-word modeling approaches will, however, depend on the language model design and, most importantly, on the properties of the training corpus compared to those of the test corpus. In addition, they can produce words that do not belong to the language, and consequently their recognition performance drops drastically. The second category is based on two processing steps: OOV detection and OOV recovery [5], [6], [7], [8], [13].
A detailed study of the existing handwriting recognition research works shows that most word and text recognition systems integrate only OOV word recovery, without any preliminary detection step. To our knowledge, only one existing system handles OOV word detection in handwritten Latin script [5]. Such detection is essentially based on comparing the confidence scores of the recognized words with a heuristic threshold whose value is determined through several experiments. In this framework, the underlying hypothesis is that OOV words will, in most cases, have lower confidence scores than IV words.
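The threshold-based detection idea described above can be sketched in a few lines. This is a minimal illustration, not the implementation of [5]; the function name, the word/score pairs and the threshold value are all hypothetical, and in practice the threshold is tuned experimentally as the paragraph notes.

```python
# Illustrative sketch of confidence-threshold OOV detection: a recognized
# word is flagged as a likely OOV when its confidence score falls below
# an empirically tuned threshold (all names and values are hypothetical).

def detect_oov_by_confidence(hypotheses, threshold=0.5):
    """hypotheses: list of (word, confidence) pairs from the recognizer.
    Returns the words flagged as likely OOV."""
    return [word for word, score in hypotheses if score < threshold]

hyps = [("kitab", 0.91), ("qalam", 0.34), ("madrasa", 0.78)]
print(detect_oov_by_confidence(hyps, threshold=0.5))  # → ['qalam']
```

As the critique in this paper points out, a low score may simply indicate a poorly written IV word rather than a true OOV, which is why the authors also explore detection cues that do not rely on confidence scores.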
Considering OOV word recovery, most of the proposed handwritten text recognition systems rely on so-called sub-lexical units. These systems enrich the word lexicon by decomposing the words into different sub-word lexical units. These units can be letters or syllables for several scripts. For the Arabic script in particular, they can be Parts of Arabic Words (PAWs) or morphemes. The PAWs result from the natural segmentation of words due to the presence of letters that do not connect to their successors within a word. A morpheme can be a prefix, added at the beginning of the word, a suffix, added at the end of the word, or a stem. The morphemes result from the morphological structure of the Arabic vocabulary [16], [18], [59]. Other systems rely on a hybrid lexicon combining the different sub-word lexical units and words [1], [2], [3].
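The PAW segmentation described above follows directly from Arabic orthography: a new PAW starts after any letter that never connects to the letter that follows it. A minimal sketch, using the standard set of non-connecting Arabic letters (alef and its hamza variants, dal, dhal, ra, zay, waw and standalone hamza); the function name is illustrative:

```python
# Sketch of splitting an Arabic word into PAWs (Parts of Arabic Words).
# A PAW boundary occurs after any letter that does not connect to its
# successor in cursive Arabic writing.
NON_CONNECTING = set("اأإآدذرزوؤء")

def split_paws(word):
    paws, current = [], ""
    for ch in word:
        current += ch
        if ch in NON_CONNECTING:
            paws.append(current)
            current = ""
    if current:
        paws.append(current)
    return paws

print(split_paws("مدرسة"))  # → ['مد', 'ر', 'سة']
print(split_paws("كتاب"))   # → ['كتا', 'ب']
```

Decomposing a lexicon this way multiplies the coverage of the sub-word inventory, since unseen words can be recomposed from PAWs observed in training.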
To increase the OOV word recovery rates, several systems have used, in addition to statistical language models, text corpora freely available on the web [19], [20], [21], [22], [23], [24]. Such corpora are used to enrich the initial word lexicons with new words and to build new statistical language models, whose states, transitions and transition probabilities are estimated from the ground truth texts of the training parts of the used image databases and from the freely available text corpora.
Since this paper proposes a new Arabic Handwritten Text Recognition system, called the AHTR system, for the detection and recovery of OOV words in Arabic text images, the following review is limited to a detailed critical analysis of the research works proposed in [1], [2], [3], which deal only with the recovery of these words.
In [1], hybrid LMs consisting of words and PAWs were used to recover OOV words. The authors decomposed the less frequent words of the training corpus into PAWs in order to give newer words the opportunity to appear. The recognition engine relies on the hybridization of HMMs and Multi-Dimensional Long Short-Term Memory networks (MDLSTMs), which directly exploit the pixel values of text line images in four different scan directions. CTC is used during the training step. To generate the letter sequence hypotheses, the Viterbi algorithm, combined with Weighted Finite State Transducers (WFSTs) [29], is applied.
In [2], Arabic word morphological decomposition is adopted in the handwriting recognition system. Unlike the PAW decomposition, the morphology-based one uses the internal structure of the Arabic word (i.e. prefix, stem and suffix). This technique decreases the number of out-of-vocabulary words by including new words generated by the morphological decomposition process, which allowed a 1% improvement in the system performance. In addition to this model, the authors exploit a text corpus collected from freely available newspapers and forums to guide the recognition stage. This study therefore confirms that such text corpus exploitation brings a significant increase in the OOV word recovery rates and hence a significant decrease in the word error rate. The optical model is constructed using Hidden Markov Models (HMMs) [27]. Each text line image is represented by a feature vector sequence extracted from a sliding window of size 9 × 30 with a one-pixel overlap.
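The sliding-window framing used by such HMM systems can be illustrated as follows. This is a sketch under the assumption that the 9 × 30 window is 9 pixels wide on a 30-pixel-high line image and that a one-pixel overlap means a shift of 8 pixels; the function name is hypothetical and feature extraction per frame is omitted:

```python
import numpy as np

# Illustrative sliding-window framing of a text line image: each frame is
# a vertical strip from which a feature vector would then be extracted.
def frame_image(line_image, width=9, overlap=1):
    shift = width - overlap  # one-pixel overlap -> shift of 8
    h, w = line_image.shape
    return [line_image[:, x:x + width]
            for x in range(0, w - width + 1, shift)]

frames = frame_image(np.zeros((30, 100)))  # 30-pixel-high, 100-pixel-wide line
print(len(frames), frames[0].shape)        # → 12 (30, 9)
```

Framing turns the 2D image into a left-to-right sequence of observations, which is what makes HMM decoding applicable.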
More recently, the vocabulary augmentation was performed by decomposing the lexicon into morphemes and PAWs using a hybrid morphological decomposition [3]. Although theoretically interesting, this method results in a nominal improvement when compared to the PAWs or morphemes modeling.
References [1], [2], [3] analyze and compare different aspects of sub-word (PAWs and/or morphemes and words) LMs to handle the OOV issue. However, some relevant problems persist. These approaches recognize OOV words by concatenating sub-word hypotheses. The concatenated sub-words can, however, yield an incorrect word that does not belong to the Arabic language, since no word lexicon is provided either during the recognition step or in the post-processing stage. Moreover, estimating statistical language models on a decomposed text corpus can introduce a statistical bias that affects individual vocabulary items. Unlike the presented methods for OOV recovery in Arabic scripts, the study proposed in [5] precedes the OOV word recovery by a preliminary detection step. This detection relies on the confidence score feature. However, this proposal suffers from the shortcoming that such a measure indicates whether a word hypothesis is correct or not, but not whether it is an OOV word. A more rigorous investigation of the existing methods is given in Table 1.
Different from the AHTR systems proposed in [1], [2], [3], which focus only on OOV word recovery, the AHTR system proposed in this paper starts with the detection of the OOV words and then proceeds with their recovery. In addition, while the AHTR systems proposed in [1], [2], [3] rely directly on handcrafted features, ours relies on learned features deduced automatically through a deep multi-dimensional network architecture. This architecture consists of MDLSTM and Convolutional Neural Network (CNN) layers, arranged alternately with max-pooling layers.
Contrary to [5], which proposes an OOV detection method based on confidence scores, we suggest different OOV word detection methods and demonstrate that sub-word lexical unit (PAW and morpheme) modeling can give cues for improving detection.
In addition, and contrary to the AHTR system proposed in [2], which extends the used word lexicon with the most frequent words from a text corpus freely available on the web, a dynamic lexicon is built in this paper by selecting words from the text corpus based on their string similarity to the detected OOV words.
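The similarity-based selection step can be sketched as follows. This is an illustration, not the paper's implementation: the similarity measure (`difflib` ratio), the function name and the threshold are all assumptions standing in for whatever string similarity the authors use.

```python
from difflib import SequenceMatcher

# Sketch of dynamic-lexicon construction: for each OOV hypothesis flagged
# by the detection step, keep only the corpus words whose string
# similarity to the hypothesis exceeds a threshold.
def build_dynamic_lexicon(oov_hypotheses, corpus_words, min_ratio=0.6):
    lexicon = set()
    for hyp in oov_hypotheses:
        for word in corpus_words:
            if SequenceMatcher(None, hyp, word).ratio() >= min_ratio:
                lexicon.add(word)
    return lexicon

print(build_dynamic_lexicon(["recogntion"], ["recognition", "banana"]))
```

Restricting the added entries to near matches of the detected OOV hypotheses keeps the lexicon small, avoiding the confusability increase that comes with simply adding the most frequent corpus words.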
The first contribution of this paper concerns the OOV detection module, where three different methods are proposed. The first method is based on the word confidence scores of the Word Lexicon Driven (WLD) recognition method. The second method relies on the differences between the word hypotheses of the Word Lexicon Driven (WLD), PAW Lexicon Driven (PLD) and Morpheme Lexicon Driven (MLD) recognition methods. The third method uses the word confidence scores of the three sub-word modeling approaches. We demonstrate that sub-word modeling can give cues for improving the detection and that the best detection method is the second one, which is not based on confidence scores. The second contribution is the use of a dynamic lexicon that extends the initial lexicon in order to cope with the detected OOV words. It includes a selection step in which the best word candidates from the external resource are kept. Finally, the proposed OOV detection and recovery methods are generic and independent of the recognition engine. The obtained results reveal that the proposed method achieves state-of-the-art results on the KHATT dataset and significantly improves the recognition performance over the use of reduced and large static dictionaries and over the combination of different sub-word modeling approaches.
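The disagreement idea behind the second detection method can be sketched compactly. This is a simplified illustration with hypothetical names: it assumes the three hypotheses have already been aligned word by word, which in practice requires an alignment step.

```python
# Sketch of disagreement-based OOV detection: a word position where the
# word-driven (WLD) hypothesis differs from both the PAW-driven (PLD) and
# morpheme-driven (MLD) ones is flagged as a likely OOV, since the
# sub-word models are free to compose words outside the word lexicon.
def detect_oov_by_disagreement(wld, pld, mld):
    flags = []
    for w, p, m in zip(wld, pld, mld):
        flags.append(w != p and w != m)
    return flags

print(detect_oov_by_disagreement(["a", "b", "c"],
                                 ["a", "x", "c"],
                                 ["a", "y", "c"]))
# → [False, True, False]
```

Intuitively, an IV word is reachable by all three lexicons and tends to be decoded consistently, whereas a true OOV forces the word-driven decoder onto a wrong IV word while the sub-word decoders produce something else.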
In the remainder of this paper, we first describe the proposed AHTR system and especially the methods suggested for the detection and recovery of OOV words. Second, we detail the used text line image database and word lexicon. Third, we present and discuss the experimental results. Finally, we draw the main conclusions and suggest some future works.
Proposed Arabic handwritten text recognition system
The fundamental objective of this paper is to propose an original method for OOV word detection and recovery in handwriting recognition. To achieve this objective, three different lexicon-driven recognition methods are used: the first is Word Lexicon Driven (WLD), the second is PAW Lexicon Driven (PLD), and the third is Morpheme Lexicon Driven (MLD). For the first recognition method, the text line hypothesis construction is carried out relying on a Word Statistical
Arabic databases description
Performances of the proposed OOV detection and recovery methods are evaluated on two benchmarking Arabic databases, namely KHATT and AHTID/MW databases.
Experimental results
The experimental results of the proposed AHTR system are presented in terms of the word error rate (WER). We used the Levenshtein edit distance between the recognized text and the reference one. The edit distance is computed as the number of editing operations (insertions, substitutions and deletions) required to transform a source string into a target string.
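The WER computation described above can be sketched as follows: a standard dynamic-programming Levenshtein distance applied at the word level, normalized by the reference length (function names are illustrative).

```python
# Levenshtein edit distance via dynamic programming: minimum number of
# insertions, deletions and substitutions to turn src into tgt.
def levenshtein(src, tgt):
    prev = list(range(len(tgt) + 1))
    for i, s in enumerate(src, 1):
        curr = [i]
        for j, t in enumerate(tgt, 1):
            cost = 0 if s == t else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# WER: word-level edit distance normalized by the reference word count.
def wer(recognized, reference):
    ref = reference.split()
    return levenshtein(recognized.split(), ref) / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 1 deletion / 4 words → 0.25
```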
For the evaluation of the MDLSTM network, we used the RETURNN framework [42]. All the experiments
Conclusion
In this paper, we proposed a novel OOV word detection and recovery method, which exploits the modeling of different lexical entities such as words, PAWs and morphemes. The proposed OOV detection and recovery method is generic and independent of the letter recognition engine. It was validated using two different recognition architectures based on learned features and handcrafted ones. The obtained experimental results reveal that the proposed method significantly improves the recognition
Sana Khamekhem Jemni graduated in Computer Science Engineering from the National Engineering School of Sfax (ENIS). Her current research interests concern handwritten document image analysis and recognition and Arabic OCR.
References (82)
- et al., Phrase-based correction model for improving handwriting recognition accuracies, Pattern Recognit. (2009)
- et al., Weighted finite-state transducers in speech recognition, Comput. Speech Language (2002)
- et al., Minimum Bayes risk decoding and system combination based on a recursion for edit distance, Comput. Speech Language (2011)
- et al., Open-vocabulary recognition of machine-printed Arabic text using hidden Markov models, Pattern Recognit. (2016)
- et al., Off-line handwritten word recognition using multi-stream hidden Markov models, Pattern Recognit. Lett. (2010)
- et al., Combination of context-dependent bidirectional long short-term memory classifiers for robust offline handwriting recognition, Pattern Recognit. Lett. (2017)
- et al., Hybrid word/part-of-Arabic-word language models for Arabic text document recognition
- et al., Open vocabulary Arabic handwriting recognition using morphological decomposition
- Arabic word decomposition techniques for offline Arabic text transcription
- et al., A syllable based model for handwriting recognition
- Handwritten word recognition using web resources and recurrent neural networks, Int. J. Document Anal. Recognit.
- BLSTM-based handwritten text recognition using web resources
- Using the web to create dynamic dictionaries in handwritten out-of-vocabulary word recognition
- Using the world wide web for the recognition of out of vocabulary handwritten words
- Open-lexicon language modeling combining word and character levels
- An open vocabulary OCR system with hybrid word-subword language models
- Over-generative finite state transducer n-gram for out-of-vocabulary word recognition
- Open vocabulary handwriting recognition using combined word-level and character-level language models
- Handling out-of-vocabulary words and recognition errors based on word linguistic context for handwritten sentence recognition
- Querying out-of-vocabulary words in lexicon-based keyword spotting, Neural Comput. Appl.
- Tahlil Sarfi Lil Arabia
- A compression technique for Arabic dictionaries: the affix analysis
- The DIINAR.1-« معالي» Arabic lexical resource, an outline of contents and methodology
- Arabic morphological analysis techniques: a comprehensive survey, J. Am. Soc. Inf. Sci. Technol.
- An omnifont open-vocabulary OCR system for English and Arabic, IEEE Trans. Pattern Anal. Mach. Intell.
- Unlimited vocabulary script recognition using character N-grams
- The A2iA Arabic handwritten text recognition system at the OpenHaRT2013 evaluation
- Handwriting recognition with multigrams
- On the influence of vocabulary size and language models in unconstrained handwritten text recognition
- Large vocabulary off-line handwriting recognition: a survey, Pattern Anal. Appl.
- Creating word-level language models for large-vocabulary handwriting recognition, Int. J. Document Anal. Recognit.
- Markov models for offline handwriting recognition: a survey, Int. J. Document Anal. Recognit.
- Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks
- Handwriting recognition with large multidimensional long short-term memory recurrent neural networks
- Dropout improves recurrent neural networks for handwriting recognition
- Histograms of oriented gradients for human detection
- Local gradient histogram features for word spotting in unconstrained handwritten documents
- Offline Arabic handwriting recognition using BLSTMs combination
- A novel connectionist system for unconstrained handwriting recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- Offline handwriting recognition with multidimensional recurrent neural networks
- A post-processing system to yield reduced word error rates: recognizer output voting error reduction (ROVER)
Yousri Kessentini graduated in Computer Science Engineering from the National Engineering School of Sfax (ENIS) in 2003 and received his Ph.D. degree in the field of pattern recognition from the University of Rouen, France, in 2009. He was a postdoctoral researcher at the ITESOFT company and the LITIS laboratory from 2011 to 2013. He is currently an Assistant Professor at the Digital Research Center of Sfax and the head of the DeepVision research team. His main research areas concern deep learning, document processing, data fusion, and computer vision. He was certified as an official NVIDIA Deep Learning Institute instructor and ambassador in 2018. He has coordinated several research projects in partnership with industry. He is the author and co-author of several papers and has been a reviewer for international conferences and journals. He is also a member of several scientific associations, including GRCE and IAPR.
Slim Kanoun received the M.S. degree in Computer Science from the Paul Sabatier University of Toulouse, France, in 1998, and the Ph.D. degree, also in Computer Science, from the University of Rouen, Saint-Étienne-du-Rouvray, in 2002. He is now an associate professor in Computer Science at the University of Sfax, Tunisia. His research area is pattern recognition, especially document image analysis, recognition, and Arabic OCR.