Pattern Recognition Letters

Volume 35, 1 January 2014, Pages 149-156

Handwriting word recognition using windowed Bernoulli HMMs

https://doi.org/10.1016/j.patrec.2012.09.002

Abstract

Hidden Markov Models (HMMs) are now widely used for off-line handwriting recognition in many languages. As in speech recognition, they are usually built from shared, embedded HMMs at symbol level, where the state-conditional probability density functions in each HMM are modeled with Gaussian mixtures. In contrast to speech recognition, however, it is unclear which kind of features should be used and, indeed, very different feature sets are in use today. Among them, we have recently proposed to use columns of raw, binary image pixels, which are directly fed into embedded Bernoulli (mixture) HMMs, that is, embedded HMMs in which the emission probabilities are modeled with Bernoulli mixtures. The idea is to by-pass feature extraction and to ensure that no discriminative information is filtered out during feature extraction, which in some sense is integrated into the recognition model. In this work, column bit vectors are extended by means of a sliding window of adequate width to better capture image context at each horizontal position of the word image. Using these windowed Bernoulli mixture HMMs, good results are reported on the well-known IAM and RIMES databases of Latin script and, in particular, state-of-the-art results are provided on the IfN/ENIT database of Arabic handwritten words.

Highlights

► Binarized handwritten text images are directly fed into Bernoulli HMMs.
► We extend conventional BHMMs by means of a sliding window with repositioning.
► Windowed BHMMs are tested on the IAM-words, RIMES and IfN/ENIT databases.
► Windowed BHMMs clearly outperform conventional BHMMs.
► We obtain state-of-the-art results on IfN/ENIT and good results on IAM and RIMES.

Introduction

Hidden Markov Models (HMMs) are now widely used for off-line handwriting recognition in many languages and, in particular, in languages with Latin and Arabic scripts (Dehghan et al., 2001, Günter and Bunke, 2004, Märgner and El Abed, 2007, Märgner and El Abed, 2009, Grosicki and El Abed, 2009). Following the conventional approach in speech recognition (Rabiner and Juang, 1993), HMMs at global (line or word) level are built from shared, embedded HMMs at character (subword) level, which are usually simple in terms of number of states and topology. In the common case of real-valued feature vectors, state-conditional probability (density) functions are modeled as Gaussian mixtures since, as with finite mixture models in general, their complexity can be easily adjusted to the available training data by simply varying the number of components.

After decades of research in speech recognition, the use of certain real-valued speech features together with embedded Gaussian (mixture) HMMs is a de-facto standard (Rabiner and Juang, 1993). However, in the case of handwriting recognition, there is no such standard and, indeed, very different sets of features are in use today. In Giménez and Juan (2009) we proposed to by-pass feature extraction and to directly feed columns of raw, binary pixels into embedded Bernoulli (mixture) HMMs (BHMMs), that is, embedded HMMs in which the emission probabilities are modeled with Bernoulli mixtures. The basic idea is to ensure that no discriminative information is filtered out during feature extraction, which in some sense is integrated into the recognition model. In Giménez et al. (2010), we improved our basic approach by using a sliding window of adequate width to better capture image context at each horizontal position of the text image. This improvement, which we refer to as windowed BHMMs, achieved very competitive results on the well-known IfN/ENIT database of Arabic town names (Pechwitz et al., 2002).

Although windowed BHMMs achieved good results on IfN/ENIT, it was clear to us that text distortions are more difficult to model with wide windows than with narrow (e.g. one-column) windows. In order to circumvent this difficulty, we have considered new, adaptive window sampling techniques, as opposed to the conventional, direct strategy by which the sampling window center is applied at a constant height of the text image and moved horizontally one pixel at a time. More precisely, these adaptive techniques can be seen as an application of the direct strategy followed by a repositioning step, by which the sampling window is repositioned to align its center with the center of gravity of the sampled image. This repositioning step can be done horizontally, vertically or in both directions. Although vertical repositioning was expected to have more influence on recognition results than horizontal repositioning, we decided to study both separately, and also in conjunction, so as to confirm this expectation.
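To make the repositioning step concrete, the sketch below (our own illustration in Python with NumPy; the function names and the cropping and padding conventions are hypothetical, not taken from the paper) first samples a window the direct way and then re-samples it centered on the center of gravity of the ink it contains:

```python
import numpy as np

def extract_window(img, t, W, mode="vertical"):
    """Direct window sampling at position t followed by a repositioning
    step that aligns the window center with the center of gravity of the
    sampled ink. Illustrative sketch; not the paper's exact implementation.

    img  : (H, T) binary image (1 = ink), rows = height, columns = positions
    mode : "vertical", "horizontal" or "both"
    """
    H, T = img.shape
    half = W // 2

    def crop(cy, cx):
        # Background-padded H x W crop centered at image row cy, column cx.
        win = np.zeros((H, W), dtype=img.dtype)
        for j in range(W):
            x = cx - half + j
            if not (0 <= x < T):
                continue
            for i in range(H):
                y = cy - H // 2 + i
                if 0 <= y < H:
                    win[i, j] = img[y, x]
        return win

    win = crop(H // 2, t)  # conventional, direct strategy
    ys, xs = np.nonzero(win)
    if ys.size == 0:       # empty window: nothing to reposition
        return win
    cog_y = int(round(ys.mean()))             # ink center of gravity (image rows)
    cog_x = t - half + int(round(xs.mean()))  # ... mapped to image columns
    cy = cog_y if mode in ("vertical", "both") else H // 2
    cx = cog_x if mode in ("horizontal", "both") else t
    return crop(cy, cx)    # repositioned sample
```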

In this paper, the repositioning techniques described above are introduced and extensively tested on different, well-known databases for off-line handwriting recognition. In particular, we provide new, state-of-the-art results on the IfN/ENIT database, which clearly outperform our previous results without repositioning (Giménez et al., 2010). Indeed, the first tests of our windowed BHMM system with vertical repositioning on IfN/ENIT were carried out at the ICFHR 2010 Arabic handwriting recognition competition, where our system ranked first (Märgner and El Abed, 2010). Moreover, the test sets used in this competition were also used in a new competition at ICDAR 2011, and none of the participants improved on the results achieved by our system at ICFHR 2010 (Märgner and El Abed, 2011). Apart from state-of-the-art results on IfN/ENIT, we also provide new empirical results on the IAM database of English words (Marti and Bunke, 2002) and the RIMES database of French words (Grosicki et al., 2009). Our windowed BHMM system with vertical repositioning achieves good results on both databases.

In what follows, we briefly review Bernoulli mixtures (Section 2), BHMMs (Section 3), maximum likelihood parameter estimation (Section 4) and windowed BHMMs with repositioning techniques (Section 5). Empirical results are then reported in Section 6 and concluding remarks are given in Section 7.

Section snippets

Bernoulli mixture

Let $\boldsymbol{o}$ be a $D$-dimensional feature vector. A finite mixture is a probability (density) function of the form
$$P(\boldsymbol{o}\mid\Theta)=\sum_{k=1}^{K}\pi_k\,P(\boldsymbol{o}\mid k,\Theta_k),$$
where $K$ is the number of mixture components, $\pi_k$ is the $k$th component coefficient, and $P(\boldsymbol{o}\mid k,\Theta_k)$ is the $k$th component-conditional probability (density) function. The mixture is controlled by a parameter vector $\Theta$ comprising the mixture coefficients and a parameter vector for the components, $\Theta_k$. It can be seen as a generative model that first selects the $k$th component with probability $\pi_k$ and then generates $\boldsymbol{o}$ in accordance with $P(\boldsymbol{o}\mid k,\Theta_k)$.
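In a Bernoulli mixture, each component-conditional function is a product of independent pixel-wise Bernoulli probabilities. As a minimal sketch (in Python with NumPy; the function name and toy values are ours, not the paper's), the mixture probability of a single binary column vector can be evaluated as follows:

```python
import numpy as np

def bernoulli_mixture_prob(o, pi, p):
    """P(o | Theta) for a binary vector o under a Bernoulli mixture.

    o  : (D,) binary feature vector (0/1 entries)
    pi : (K,) mixture coefficients (sum to one)
    p  : (K, D) Bernoulli prototypes, p[k, d] = P(o_d = 1 | k)
    """
    # Component-conditionals: prod_d p_kd^o_d (1 - p_kd)^(1 - o_d)
    comp = np.prod(p ** o * (1.0 - p) ** (1 - o), axis=1)  # shape (K,)
    return float(pi @ comp)

# Toy usage: a 5-pixel binary column and a 2-component mixture.
o = np.array([1, 1, 0, 0, 1])
pi = np.array([0.6, 0.4])
p = np.array([[0.9, 0.8, 0.1, 0.2, 0.7],
              [0.3, 0.4, 0.6, 0.5, 0.2]])
print(bernoulli_mixture_prob(o, pi, p))
```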

Bernoulli HMM

Let $O=(\boldsymbol{o}_1,\ldots,\boldsymbol{o}_T)$ be a sequence of feature vectors. An HMM is a probability (density) function of the form
$$P(O\mid\Theta)=\sum_{q_1,\ldots,q_T}\prod_{t=0}^{T}a_{q_t q_{t+1}}\prod_{t=1}^{T}b_{q_t}(\boldsymbol{o}_t),$$
where the sum is over all possible paths (state sequences) $q_0,\ldots,q_{T+1}$ such that $q_0=I$ (special initial or start state), $q_{T+1}=F$ (special final or stop state), and $q_1,\ldots,q_T\in\{1,\ldots,M\}$, with $M$ the number of regular (non-special) states of the HMM. For any regular states $i$ and $j$, $a_{ij}$ denotes the transition probability from $i$ to $j$, while $b_j$ denotes the observation probability (density) function in state $j$; in a BHMM, each $b_j$ is modeled as a Bernoulli mixture.
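The sum over paths can be evaluated efficiently with the standard forward algorithm in $O(TM^2)$ time. The sketch below is our own illustration, not the paper's implementation: for brevity it assumes a single Bernoulli prototype per state instead of a full mixture, and all names are hypothetical. It works in the log domain for numerical stability:

```python
import numpy as np

def bhmm_log_likelihood(O, a_init, A, a_fin, P):
    """log P(O | Theta) via the forward algorithm, in the log domain.

    O      : (T, D) sequence of binary column vectors
    a_init : (M,)   transitions from the start state I to each regular state
    A      : (M, M) transitions among the M regular states
    a_fin  : (M,)   transitions from each regular state to the final state F
    P      : (M, D) one Bernoulli prototype per state (a mixture per state
             would replace the emission computation below)
    """
    # Log emission probabilities log b_j(o_t), shape (T, M).
    logB = O @ np.log(P).T + (1 - O) @ np.log(1 - P).T
    alpha = np.log(a_init) + logB[0]
    for t in range(1, len(O)):
        m = alpha.max()  # log-sum-exp over predecessor states
        alpha = m + np.log(np.exp(alpha - m) @ A) + logB[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m) @ a_fin)
```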

Maximum likelihood parameter estimation

Maximum likelihood estimation (MLE) of the parameters governing an embedded BHMM does not differ significantly from the conventional Gaussian case, and it is also efficiently performed using the well-known EM (Baum-Welch) re-estimation formulae (Rabiner and Juang, 1993, Young et al., 1995). Let $(O_1,S_1),\ldots,(O_N,S_N)$ be a collection of $N$ training samples in which the $n$th observation has length $T_n$, $O_n=(\boldsymbol{o}_{n1},\ldots,\boldsymbol{o}_{nT_n})$, and corresponds to a sequence of $L_n$ symbols ($L_n\leq T_n$), $S_n=(s_{n1},\ldots,s_{nL_n})$. At iteration $r$…
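The snippet breaks off before the re-estimation formulae, but the underlying E/M pattern can be illustrated on a plain (non-embedded) Bernoulli mixture. The sketch below is our own simplified illustration rather than the paper's formulae; in the embedded BHMM case each count is additionally weighted by state-occupancy probabilities obtained with the forward-backward procedure:

```python
import numpy as np

def em_bernoulli_mixture(X, K, iters=50, eps=1e-6, seed=0):
    """EM re-estimation for a plain (non-embedded) Bernoulli mixture.

    X : (N, D) binary training vectors; K : number of components.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                  # uniform initial coefficients
    p = rng.uniform(0.25, 0.75, size=(K, D))  # random initial prototypes
    for _ in range(iters):
        # E-step: responsibilities z[n, k] proportional to pi_k * P(x_n | k).
        logw = np.log(pi) + X @ np.log(p).T + (1 - X) @ np.log(1 - p).T
        logw -= logw.max(axis=1, keepdims=True)
        z = np.exp(logw)
        z /= z.sum(axis=1, keepdims=True)
        # M-step: weighted relative frequencies (clipped away from 0 and 1).
        Nk = z.sum(axis=0)
        pi = Nk / N
        p = np.clip((z.T @ X) / Nk[:, None], eps, 1 - eps)
    return pi, p
```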

Windowed BHMMs

Given a binary image normalized in height to $H$ pixels, we may think of a feature vector $\boldsymbol{o}_t$ as its column at position $t$ or, more generally, as a concatenation of the columns in a window of $W$ columns in width centered at position $t$. This generalization affects neither the definition of the BHMM nor its MLE, yet it can be very helpful to better capture the image context at each horizontal position of the image. As an example, Fig. 2 shows a binary image of 4 columns and 5 rows, which…
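A minimal sketch of this plain windowing step (our own illustration; names are hypothetical), assuming an $H\times T$ binary image stored as a NumPy array, with background (zero) padding beyond the image borders:

```python
import numpy as np

def windowed_features(img, W):
    """Concatenate, at each horizontal position t, the W columns of the
    window centered there into a single H*W-dimensional bit vector.
    Plain windowing only; repositioning is applied on top of this.

    img : (H, T) binary image; returns a (T, H * W) array.
    """
    H, T = img.shape
    half = W // 2
    padded = np.zeros((H, T + W - 1), dtype=img.dtype)
    padded[:, half:half + T] = img
    # Feature vector at t: the window's columns stacked one after another.
    return np.stack([padded[:, t:t + W].T.flatten() for t in range(T)])
```

For instance, for a binary image of 4 columns and 5 rows as in Fig. 2, $W=3$ yields four 15-dimensional feature vectors, while $W=1$ recovers the original column bit vectors.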

Experiments

Our windowed BHMMs and the repositioning techniques described above were tested on three well-known databases of handwritten words: the IfN/ENIT database (Pechwitz et al., 2002), IAM words (Marti and Bunke, 2002) and RIMES (Grosicki et al., 2009). In what follows, we describe experiments and results on each database separately.

Concluding remarks

Windowed Bernoulli mixture HMMs (BHMMs) for handwriting word recognition have been described and improved by the introduction of window repositioning techniques. In particular, we have considered three techniques of window repositioning after window extraction: vertical, horizontal, and both. They differ only in the direction in which extracted windows are shifted to align mass and window centers (vertically, horizontally, or in both directions). In this work, these repositioning…

Acknowledgments

This work was supported by the EC (FEDER/FSE), the Spanish MICINN (MIPRCV “Consolider Ingenio 2010”, iTrans2 TIN2009-14511, MITTRAL TIN2009-14633-C03-01, erudito.com TSI-020110-2009-439, and an AECID 2010/11 grant).

References (19)

  • Dehghan, M., et al., 2001. Handwritten Farsi (Arabic) word recognition: a holistic approach using discrete HMM. Pattern Recognition.
  • Günter, S., et al., 2004. HMM-based handwritten word recognition: on the optimization of the number of states, training iterations and Gaussian components. Pattern Recognition.
  • Bianne-Bernard, A.L., et al., 2011. Dynamic and contextual information in HMM modeling for handwritten word recognition. IEEE Trans. Pattern Anal. Machine Intell.
  • Dreuw, P., Heigold, G., Ney, H., 2009. Confidence-based discriminative training for model adaptation in offline Arabic…
  • Giménez, A., Juan, A., 2009. Embedded Bernoulli mixture HMMs for handwritten word recognition. In: ICDAR’09, Barcelona,…
  • Giménez, A., Khoury, I., Juan, A., 2010. Windowed Bernoulli mixture HMMs for Arabic handwritten word recognition. In:…
  • Grosicki, E., El Abed, H., 2009. ICDAR 2009 handwriting recognition competition. In: ICDAR’09, Barcelona, Spain, pp.…
  • Grosicki, E., El Abed, H., 2011. ICDAR 2011 – French handwriting recognition competition. In: ICDAR’11, Beijing, China,…
  • Grosicki, E., Carré, M., Brodin, J.M., Geoffrois, E., 2009. Results of the RIMES evaluation campaign for handwritten…
