Batch-adaptive rejection threshold estimation with application to OCR post-processing
Introduction
In many OCR and text recognition systems, a post-processing step is applied to verify or correct the errors produced by the classifier, because of its beneficial impact on system accuracy. In many cases, a language-model-based correction technique can be applied to find the best transformation of an OCR hypothesis into a string compatible with a given set of linguistic constraints.
Very different techniques have been employed to post-process OCR hypotheses (usually strings of characters). Some examples can be found in Hull and Srihari (1982), Tong and Evans (1996), Perez-Cortes et al. (2000), Kolak and Resnik (2005), and Llobet et al. (2010). Most of them provide, or can easily be modified to provide, an estimate of the effort needed to make the output of the OCR classifier comply with the language constraints. This estimate is often called the transformation cost. Sometimes this cost is replaced by an inversely related measure, called correction confidence or reliability index, which reflects the likelihood that the OCR hypothesis agrees with the model.
By using a threshold on the transformation cost to reject the least reliable hypotheses, a variable level of (expected) accuracy can be imposed on the output of the fully automatic recognition process. Allowing the user to specify an acceptable level of expected error, instead of having to deal with a threshold on an unfamiliar, task-dependent scale, is a clear advantage for the operator. This capability has important implications in many practical cases: the complete process is often required to meet a maximum acceptable proportion of erroneous transcriptions within the set of non-rejected strings in a batch of documents (known as the false acceptance rate or simply the error rate) to ensure a quality of service; and the rejected strings are usually sent to a high-cost manual data-entry process that should be kept as limited as possible. All these considerations suggest that a trade-off exists, in which threshold selection plays an important role and has a significant impact on the practical and economic performance of the system.
To decide on the acceptance or rejection of a single string, a simple threshold is relatively straightforward to estimate; but to keep the error rate of a whole batch of documents under control, some strings that would otherwise be rejected might be accepted, and vice versa, depending on the remaining strings in the batch. If the measurements suggest that the quality of the set is good (or bad), the threshold can be raised (or lowered) accordingly.
The optimization of this process is not straightforward. In an OCR system, the number of rejections is not predictable, because it depends on the particular task at hand and is highly sensitive to factors such as handwriting style, scanning process, image quality, field registration, etc., as well as the characteristics of the language model used (e.g., its perplexity). For example, for a given threshold, the number of rejected strings in two sets of documents can be very different if the first set is carefully written and the second is poorly written. Thus, if we take a sample of OCR strings (observations) and compute their transformation costs using a given post-processing method, such as the one described in Section 3, the resulting distribution can vary across language models, as can be seen in Fig. 1 (left), as well as across samples of a single language, as shown in Fig. 1 (right). This means that choosing a consistent rejection threshold on the transformation cost is difficult, since the number of accepted/rejected strings for a given threshold value will vary depending on the characteristics of both the language model and the sample processed. In fact, a slight variation of the threshold value can lead to unpredictable changes in the ratio of accepted to rejected strings, as well as in the error rate, as shown in Fig. 2. Therefore, in most cases, different thresholds should be applied to different samples to obtain the same error rate within the set of accepted strings.
Traditional analytic tools such as the Error-Reject Trade-off (Chow, 1970), the Receiver Operating Characteristic (ROC) (Fawcett, 2006), and Precision-Recall (Rijsbergen, 1979) (and their variants) have provided useful information for analyzing and comparing classifier performance, based on measurements obtained from a training sample. In particular, predictions of the relationship between the rejection threshold and several indexes, such as error rate, precision, accuracy, number of rejections, and false positive and false negative rates, can be derived from these tools. Nevertheless, the accuracy of those predictions is strongly conditioned by the distribution of the confidence values found in the training sample used. In other words, predicting the error rate of a new sample entails the assumption that a similar confidence distribution will be found in it. However, in the task described here, this assumption can be unacceptable, as explained above: the number of symbol errors of an OCR classifier can vary widely across samples depending on many factors, and consequently the distribution of the post-processing transformation costs can vary too. Therefore, in this case, applying a fixed threshold to different samples will not guarantee that a pre-specified error rate is met.
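As an illustration of how such predictions are derived from a training sample, the following sketch (our own, not from the paper) computes the classic error–reject trade-off: the error rate among accepted strings as the rejection threshold sweeps over the observed transformation costs.

```python
def error_reject_curve(costs, is_error):
    """Error rate among accepted strings for each candidate threshold.

    costs    -- transformation cost of each supervised OCR hypothesis
    is_error -- 1 if that hypothesis was transcribed incorrectly, else 0
    Returns a list of (threshold, error_rate_among_accepted) pairs.
    """
    pairs = sorted(zip(costs, is_error))
    curve = []
    errors = 0
    for accepted, (cost, err) in enumerate(pairs, start=1):
        errors += err
        # Accepting all strings with cost <= `cost` yields this error rate.
        curve.append((cost, errors / accepted))
    return curve
```

Such a curve is only as reliable as the assumption that future samples share the training sample's cost distribution, which is precisely what fails in this task.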
In this work, we consider the hypothesis that a probability distribution of erroneous transcriptions can be estimated from the transformation costs produced by the application of a language model, and that it can be used to predict the error rate of a set of strings of the language, regardless of that set's cost distribution. Thus, in Section 4, given a set of transformation costs corresponding to a supervised sample of OCR hypotheses, the error probabilities associated with each cost (the Error vs. Cost distribution) are obtained. In Section 5, we propose an approach to adaptive rejection thresholding, in which the Error vs. Cost distribution of a language model is used to find the rejection threshold to apply to a whole batch of strings in order to meet a given target error rate. We tested this approach in two scenarios:
- The transformation costs are obtained from a real sample of OCR hypotheses and used to compute the Error vs. Cost distribution. Performance evaluation in this scenario is presented in Section 6.
- The Error vs. Cost distribution of a new language is automatically estimated, using a method based on the generation of synthetic OCR errors from the positive sample used to build the language model. No supervision is needed in this case, avoiding the time-consuming process of optical recognition and manual labeling of a large number of strings of the language. This is particularly important in practice when new language models are needed frequently for short batches of documents (even if they are subsets or special variants of previous language models), or in tasks where the labeling process is not possible or convenient. It is described and evaluated in Section 7.
Section snippets
Quality control
Industrial quality control is commonly approached from two points of view. The first takes into account the quality acceptable to the recipient of the product or service and the quality level bearable by the producer. These qualities correspond to what are known in the literature as the producer's risk and the consumer's risk. Inspection plans for sample-based statistical quality control are designed from the Operating Characteristic Curve and these risks. In Paladini (2000), an expert
Post-processing algorithm and language models used
A technique based on weighted finite-state transducers (WFSTs) combining language, hypothesis and error models has been used to post-process the OCR hypotheses (Llobet et al., 2010). It is based on a finite-state transducer built from a formal grammar that encodes the strings in the lexicon or language sample. In this case, a k-Testable Language in the Strict Sense is inferred from a set of available language strings. The transducer is composed with two other WFSTs, representing the error model
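As an illustrative stand-in for the WFST composition (this is not the implementation of Llobet et al., 2010, and a finite lexicon replaces the inferred k-testable language), the transformation cost of an OCR hypothesis can be sketched as its minimum edit distance to any string accepted by the language model:

```python
def edit_distance(a, b):
    """Levenshtein distance with unit insertion/deletion/substitution costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def transformation_cost(hypothesis, lexicon):
    """Cheapest correction of an OCR hypothesis into a lexicon string."""
    best = min(lexicon, key=lambda w: edit_distance(hypothesis, w))
    return best, edit_distance(hypothesis, best)
```

A real error model would replace the unit costs with weights learned from OCR confusions, and the WFST composition avoids enumerating the lexicon explicitly.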
Modeling the Error vs. Cost distribution
Given a language model and a set of transformation costs obtained by applying a post-processing algorithm to a supervised sample of OCR hypotheses (for which ground-truth transcriptions are available), a smoothed error rate curve as a function of the cost c can be computed using the equation

H(c) = e_w(c) / n_w(c),

where w is a smoothing (rectangular) window size parameter and e_w(c) is the number of strings erroneously transcribed into incorrect strings (belonging to the language, but different to those
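A minimal sketch of this windowed estimate, assuming the curve is simply the fraction of erroneous strings among all strings whose cost falls inside a rectangular window of width w (function and variable names are ours, not the paper's notation):

```python
from bisect import bisect_left, bisect_right

def error_vs_cost(costs, is_error, w):
    """Return a function H(c): smoothed error rate around cost c,
    estimated over the window [c - w/2, c + w/2]."""
    pairs = sorted(zip(costs, is_error))
    cs = [c for c, _ in pairs]
    # Prefix sums of error counts, so each window query is O(log n).
    prefix = [0]
    for _, err in pairs:
        prefix.append(prefix[-1] + err)

    def H(c):
        lo = bisect_left(cs, c - w / 2)
        hi = bisect_right(cs, c + w / 2)
        n = hi - lo
        return (prefix[hi] - prefix[lo]) / n if n else 0.0

    return H
```

Returning 0.0 for an empty window is an arbitrary choice here; in practice the window width w would be tuned so that every queried cost is covered by training observations.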
Batch-adaptive rejection threshold estimation
The error vs. cost function represented by H makes it possible to obtain a direct estimate of the probability that an OCR hypothesis is wrong, based on its transformation cost. So, for a single string, we can simply look up H and reject the string if the estimated error probability exceeds an error threshold.

But if the goal is to control the error rate of a set of observations, i.e., a sample or batch of strings rather than a single OCR string, then different rejection thresholds are typically needed
Performance evaluation
The experiments were designed to evaluate the accuracy of the proposed adaptive rejection threshold estimation procedure. The ability of the system to approximate a pre-specified error rate (target error) on real samples with different degrees of representativeness with respect to the "training set" has been characterized.
The supervised samples obtained from a real OCR workflow, as detailed in Section 3, were used in the experiments. In order to assess the performance of the proposed
Approach for new language models
In practice, when a new language model is defined in the system, a sample of OCR hypotheses may not be available to build the EC curve for that language. In this section, we propose using literal strings from the positive sample of the new language to build a synthetic sample into which OCR-like errors are artificially introduced using a particular error-generation model. In this way, the EC curve of the new language can be estimated from that synthetic sample.
An OCR error-generation model is
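A toy error-generation model of this kind might look like the following; the confusion table and the error rates are illustrative assumptions, not the paper's trained model, which would be estimated from real OCR output.

```python
import random

# Hypothetical table of visually confusable characters (assumption:
# a real model would learn per-character confusion probabilities).
CONFUSIONS = {"0": "O", "O": "0", "1": "l", "l": "1", "5": "S"}

def corrupt(text, p_sub=0.05, p_del=0.01, p_ins=0.01, rng=random):
    """Inject synthetic OCR-like errors into a ground-truth string."""
    out = []
    for ch in text:
        r = rng.random()
        if r < p_del:
            continue                               # character dropped
        if r < p_del + p_sub:
            out.append(CONFUSIONS.get(ch, ch))     # confusion substitution
        else:
            out.append(ch)
        if rng.random() < p_ins:
            out.append(ch)                         # spurious duplication
    return "".join(out)
```

Running the post-processing algorithm on many such corrupted strings yields synthetic (cost, error) pairs from which an EC curve can be estimated without any manual labeling.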
Conclusions
A method, together with experimental results, for the automatic estimation of a rejection threshold on the confidence index (or transformation cost) of a sample of OCR hypotheses (a batch) post-processed with a language model has been presented. The goal is that an operator can set a target error rate for a whole sample instead of having to specify a threshold on an arbitrary scale.
The expert knowledge in this task comes naturally in the form of a target error rate which is negotiated between
References (31)
- A logical framework for the correction of spelling errors in electronic documents. Information Processing and Management (1987).
- Rejection strategies for offline handwritten text line recognition. Pattern Recognition Letters (2006).
- Phrase-based correction model for improving handwriting recognition accuracies. Pattern Recognition (2009).
- An introduction to ROC analysis. Pattern Recognition Letters (2006).
- Reject option with multiple thresholds. Pattern Recognition (2000).
- A review of data mining applications for quality improvement in manufacturing industry. Expert Systems with Applications (2011).
- Confidence-based classifier design. Pattern Recognition (2006).
- An expert system approach to quality control. Expert Systems with Applications (2000).
- Effective balancing error and user effort in interactive handwriting recognition. Pattern Recognition Letters (2014).
- Determination of optimal inspection sequence within misclassification error bound. Expert Systems with Applications (2011).
- Efficient error-correcting Viterbi parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Rejection threshold estimation for an unknown language model in an OCR task.
- On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory.
- Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.