Batch-adaptive rejection threshold estimation with application to OCR post-processing
Introduction
In many OCR and text recognition systems, a post-processing step is applied to verify or correct the errors produced by the classifier, because of its beneficial impact on system accuracy. In many cases, a language-model-based correction technique can be applied to find the best transformation of an OCR hypothesis into a string compatible with a given set of linguistic constraints.
Very different techniques have been employed to post-process OCR hypotheses (usually strings of characters). Some examples can be found in Hull and Srihari (1982), Tong and Evans (1996), Perez-Cortes et al. (2000), Kolak and Resnik (2005), and Llobet et al. (2010). Most of them provide, or can easily be modified to provide, an estimate of the effort needed to make the output of the OCR classifier comply with the language constraints. This estimate is often called the transformation cost. Sometimes this cost is replaced by an inversely related measure, called correction confidence or reliability index, which reflects the likelihood that the OCR hypothesis agrees with the model.
By using a threshold on the transformation cost to reject the least reliable hypotheses, a variable level of (expected) accuracy can be imposed on the output of the fully automatic recognition process. Allowing the user to specify an acceptable level of expected error, instead of having to deal with a threshold on an unfamiliar, task-dependent scale, is a clear advantage for the operator. This capability has important implications in many practical cases: the complete process is often required to meet a maximum acceptable proportion of erroneous transcriptions within the set of non-rejected strings in a batch of documents (known as the false acceptance rate or simply the error rate) to ensure a quality of service; and the rejected strings are usually sent to a high-cost manual data-entry process that should be kept as limited as possible. All these considerations suggest that a trade-off exists, in which threshold selection plays an important role and has a significant impact on the practical and economic performance of the system.
To decide on the acceptance or rejection of a single string, a simple threshold is relatively straightforward to estimate; but to keep the error rate of a whole batch of documents under control, some strings that would otherwise be rejected might be accepted, and vice versa, depending on the remaining strings in the batch. If the measurements suggest that the quality of the set is good (or bad), the threshold can be raised (or lowered) accordingly.
The optimization of this process is not straightforward. In an OCR system, the number of rejections is not predictable, because it depends on the particular task at hand and is highly sensitive to factors such as handwriting style, scanning process, image quality, field registration, etc., as well as the characteristics of the language model used (e.g., its perplexity). For example, for a given threshold, the number of rejected strings in two sets of documents can be very different if the first set is carefully written and the second is poorly written. Thus, if we take a sample of OCR strings (observations) and compute their transformation costs using a given post-processing method, such as the one described in Section 3, the resulting distribution can vary across language models, as can be seen in Fig. 1 (left), as well as across samples of a single language, as shown in Fig. 1 (right). This means that choosing a consistent rejection threshold on the transformation cost is difficult, since the number of accepted/rejected strings for a given threshold value will vary depending on the characteristics of both the language model and the sample processed. In fact, a slight variation of the threshold value can lead to unpredictable changes in the ratio of accepted to rejected strings, as well as in the error rate, as shown in Fig. 2. Therefore, in most cases, different thresholds should be applied to different samples to obtain the same error rate within the set of accepted strings.
Traditional analytic tools such as the Error-Reject Trade-off (Chow, 1970), the Receiver Operating Characteristic (ROC) (Fawcett, 2006), and Precision-Recall (Rijsbergen, 1979) (and their variants) have provided useful information for analyzing and comparing classifier performance, based on measurements obtained from a training sample. In particular, predictions of the relationship between the rejection threshold and several indexes, such as error rate, precision, accuracy, number of rejections, and false positive and false negative rates, can be derived from these tools. Nevertheless, the accuracy of those predictions is strongly conditioned by the distribution of the confidence values found in the training sample used. In other words, predicting the error rate of a new sample entails the assumption that a similar confidence distribution will be found in it. However, in the task described here, this assumption can be unacceptable, as explained above: the number of symbol errors of an OCR classifier can vary widely across samples depending on many factors, and consequently the distribution of the post-processing transformation costs can vary too. Therefore, in this case, applying a fixed threshold to different samples will not guarantee that a pre-specified error rate is met.
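As an illustration of how such predictions are derived from a training sample, the following sketch (our own, not from the paper) computes the classic error–reject trade-off: the error rate among accepted strings as the rejection threshold sweeps over the observed transformation costs.

```python
def error_reject_curve(costs, is_error):
    """Error rate among accepted strings for each candidate threshold.

    costs    -- transformation cost of each supervised OCR hypothesis
    is_error -- 1 if that hypothesis was transcribed incorrectly, else 0
    Returns a list of (threshold, error_rate_among_accepted) pairs.
    """
    pairs = sorted(zip(costs, is_error))
    curve = []
    errors = 0
    for accepted, (cost, err) in enumerate(pairs, start=1):
        errors += err
        # Accepting all strings with cost <= `cost` yields this error rate.
        curve.append((cost, errors / accepted))
    return curve
```

Such a curve is only as reliable as the assumption that future samples share the training sample's cost distribution, which is precisely what fails in this task.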
In this work, we consider the hypothesis that a probability distribution of erroneous transcriptions can be estimated from the transformation costs produced by the application of a language model, and that it can be used to predict the error rate of a set of strings of the language, regardless of that set's cost distribution. Thus, in Section 4, given a set of transformation costs corresponding to a supervised sample of OCR hypotheses, the error probabilities associated with each cost (the Error vs. Cost distribution) are obtained. In Section 5, we propose an approach to adaptive rejection thresholding, in which the Error vs. Cost distribution of a language model is used to find the rejection threshold to apply to a whole batch of strings in order to meet a given target error rate. We tested this approach in two scenarios:
- The transformation costs are obtained from a real sample of OCR hypotheses and used to compute the Error vs. Cost distribution. Performance evaluation in this scenario is presented in Section 6.
- The Error vs. Cost distribution of a new language is automatically estimated, using a method based on the generation of synthetic OCR errors from the positive sample used to build the language model. No supervision is needed in this case, avoiding the time-consuming process of optical recognition and manual labeling of a large number of strings of the language. This is particularly important in practice when new language models are needed frequently for short batches of documents (even if they are subsets or special variants of previous language models), or in tasks where the labeling process is not possible or convenient. It is described and evaluated in Section 7.
Section snippets
Quality control
Industrial quality control is commonly approached from two points of view. The first takes into account the quality acceptable to the recipient of the product or service and the quality level bearable by the producer. These qualities correspond to what are known in the literature as the producer's risk and the consumer's risk. Inspection plans for sample-based statistical quality control are designed from the Operating Characteristic Curve and these risks. In Paladini (2000), an expert
Post-processing algorithm and language models used
A technique based on weighted finite-state transducers (WFSTs) combining language, hypothesis and error models has been used to post-process the OCR hypotheses (Llobet et al., 2010). It is based on a finite-state transducer built from a formal grammar that encodes the strings in the lexicon or language sample. In this case, a k-Testable Language in the Strict Sense is inferred from a set of available language strings. The transducer is composed with two other WFSTs, representing the error model
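As an illustrative stand-in for the WFST composition (this is not the implementation of Llobet et al., 2010, and a finite lexicon replaces the inferred k-testable language), the transformation cost of an OCR hypothesis can be sketched as its minimum edit distance to any string accepted by the language model:

```python
def edit_distance(a, b):
    """Levenshtein distance with unit insertion/deletion/substitution costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def transformation_cost(hypothesis, lexicon):
    """Cheapest correction of an OCR hypothesis into a lexicon string."""
    best = min(lexicon, key=lambda w: edit_distance(hypothesis, w))
    return best, edit_distance(hypothesis, best)
```

A real error model would replace the unit costs with weights learned from OCR confusions, and the WFST composition avoids enumerating the lexicon explicitly.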
Modeling the Error vs. Cost distribution
Given a language model and a set of transformation costs obtained by applying a post-processing algorithm to a supervised sample of OCR hypotheses (for which ground-truth transcriptions are available), a smoothed error rate curve as a function of the cost c can be computed using the equation

H(c) = e_w(c) / n_w(c),

where w is a smoothing (rectangular) window size parameter and e_w(c) is the number of strings erroneously transcribed into incorrect strings (belonging to the language, but different to those
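A minimal sketch of this windowed estimate, assuming the curve is simply the fraction of erroneous strings among all strings whose cost falls inside a rectangular window of width w (function and variable names are ours, not the paper's notation):

```python
from bisect import bisect_left, bisect_right

def error_vs_cost(costs, is_error, w):
    """Return a function H(c): smoothed error rate around cost c,
    estimated over the window [c - w/2, c + w/2]."""
    pairs = sorted(zip(costs, is_error))
    cs = [c for c, _ in pairs]
    # Prefix sums of error counts, so each window query is O(log n).
    prefix = [0]
    for _, err in pairs:
        prefix.append(prefix[-1] + err)

    def H(c):
        lo = bisect_left(cs, c - w / 2)
        hi = bisect_right(cs, c + w / 2)
        n = hi - lo
        return (prefix[hi] - prefix[lo]) / n if n else 0.0

    return H
```

Returning 0.0 for an empty window is an arbitrary choice here; in practice the window width w would be tuned so that every queried cost is covered by training observations.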
Batch-adaptive rejection threshold estimation
The error vs. cost function represented by H makes it possible to obtain a direct estimate of the probability that an OCR hypothesis is wrong, based on its transformation cost. So, for a single string, we can simply look up H and reject the string if the estimated error probability exceeds an error threshold.

But if the goal is to control the error rate of a set of observations, i.e., a sample or batch of strings rather than a single OCR string, then different rejection thresholds are typically needed
Performance evaluation
The experiments were designed to evaluate the accuracy of the proposed adaptive rejection threshold estimation procedure. The ability of the system to approximate a pre-specified error rate (target error) on real samples with different degrees of representativeness with respect to the "training set" has been characterized.
The supervised samples obtained from a real OCR workflow, as detailed in Section 3, were used in the experiments. In order to assess the performance of the proposed
Approach for new language models
In practice, when a new language model is defined in the system, a sample of OCR hypotheses may not be available to build the EC curve for that language. In this section, we propose using literal strings from the positive sample of the new language to build a synthetic sample into which OCR-like errors are artificially introduced using a particular error-generation model. In this way, the EC curve of the new language can be estimated from that synthetic sample.
An OCR error-generation model is
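A toy error-generation model of this kind might look like the following; the confusion table and the error rates are illustrative assumptions, not the paper's trained model, which would be estimated from real OCR output.

```python
import random

# Hypothetical table of visually confusable characters (assumption:
# a real model would learn per-character confusion probabilities).
CONFUSIONS = {"0": "O", "O": "0", "1": "l", "l": "1", "5": "S"}

def corrupt(text, p_sub=0.05, p_del=0.01, p_ins=0.01, rng=random):
    """Inject synthetic OCR-like errors into a ground-truth string."""
    out = []
    for ch in text:
        r = rng.random()
        if r < p_del:
            continue                               # character dropped
        if r < p_del + p_sub:
            out.append(CONFUSIONS.get(ch, ch))     # confusion substitution
        else:
            out.append(ch)
        if rng.random() < p_ins:
            out.append(ch)                         # spurious duplication
    return "".join(out)
```

Running the post-processing algorithm on many such corrupted strings yields synthetic (cost, error) pairs from which an EC curve can be estimated without any manual labeling.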
Conclusions
A method, together with experimental results, for the automatic estimation of a rejection threshold on the confidence index (or transformation cost) of a sample of OCR hypotheses (a batch) post-processed with a language model has been presented. The goal is that an operator can set a target error rate for a whole sample instead of having to specify a threshold on an arbitrary scale.
The expert knowledge in this task comes naturally in the form of a target error rate which is negotiated between
References (31)
- A logical framework for the correction of spelling errors in electronic documents. Information Processing and Management (1987).
- Rejection strategies for offline handwritten text line recognition. Pattern Recognition Letters (2006).
- Phrase-based correction model for improving handwriting recognition accuracies. Pattern Recognition (2009).
- An introduction to ROC analysis. Pattern Recognition Letters (2006).
- Reject option with multiple thresholds. Pattern Recognition (2000).
- A review of data mining applications for quality improvement in manufacturing industry. Expert Systems with Applications (2011).
- Confidence-based classifier design. Pattern Recognition (2006).
- An expert system approach to quality control. Expert Systems with Applications (2000).
- Effective balancing error and user effort in interactive handwriting recognition. Pattern Recognition Letters (2014).
- Determination of optimal inspection sequence within misclassification error bound. Expert Systems with Applications (2011).
- Efficient error-correcting Viterbi parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Rejection threshold estimation for an unknown language model in an OCR task.
- On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory.
- Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.