
2002 | Book

Document Analysis Systems V

5th International Workshop, DAS 2002 Princeton, NJ, USA, August 19–21, 2002 Proceedings

Edited by: Daniel Lopresti, Jianying Hu, Ramanujan Kashi

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

OCR Features and Systems

Relating Statistical Image Differences and Degradation Features

Document images are degraded through bilevel processes such as scanning, printing, and photocopying. The resulting image degradations can be categorized based either on observable degradation features or on degradation model parameters. The degradation features can be related mathematically to the model parameters. In this paper we statistically compare pairs of populations of degraded character images created with different model parameters. As the model parameters vary, the changes in the probability that the characters come from different populations correlate with the relationship between the observable degradation features and the model parameters. The paper also shows which features have the largest impact on the image.

Elisa Barney Smith, Xiaohui Qiu
Script Identification in Printed Bilingual Documents

Identification of the script in multi-lingual documents is essential for many language-dependent applications such as machine translation and optical character recognition. Techniques for script identification generally require large areas of text so that sufficient information is available. This assumption does not hold in the Indian context, as words of two different scripts are interspersed in most documents. In this paper, techniques to identify the script of a word are discussed. Two different approaches have been proposed and tested. The first method structures words into three distinct spatial zones and utilizes the information on the spatial spread of a word in the upper and lower zones, together with the character density, in order to identify the script. The second technique analyzes the directional energy distribution of a word using Gabor filters with suitable frequencies and orientations. Words with various font styles and sizes have been used for testing the proposed algorithms, and the results obtained are quite encouraging.

D. Dhanya, A. G. Ramakrishnan
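
The second, texture-based approach lends itself to a compact illustration. The sketch below, which assumes OpenCV and NumPy, computes normalized directional Gabor energies for a word image; the kernel size, wavelengths, and orientations are illustrative choices, not the filter bank actually used by the authors.

```python
import cv2
import numpy as np

def gabor_energy_features(word_img, orientations=(0, 45, 90, 135), wavelengths=(4, 8)):
    """Normalized directional energies of a word image under a small Gabor filter bank."""
    img = word_img.astype(np.float32) / 255.0
    energies = []
    for theta_deg in orientations:
        for lam in wavelengths:
            # arguments: ksize, sigma, theta, lambda, gamma, psi
            kernel = cv2.getGaborKernel((31, 31), 0.56 * lam, np.deg2rad(theta_deg),
                                        lam, 0.5, 0)
            response = cv2.filter2D(img, cv2.CV_32F, kernel)
            energies.append(float(np.sum(response ** 2)))   # directional energy
    total = sum(energies) or 1.0
    return [e / total for e in energies]   # scale-free feature vector for a classifier
```

The normalized energy vector can then be fed to any standard classifier that separates the two scripts.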
Optimal Feature Extraction for Bilingual OCR

Feature extraction in bilingual OCR is handicapped by the increase in the number of classes or characters to be handled. This is evident in the case of Indian languages, whose alphabet set is large. It is expected that the complexity of the feature extraction process increases with the number of classes. Though the best set of features cannot be determined through any quantitative measure, the characteristics of the scripts can help decide on the feature extraction procedure. This paper describes a hierarchical feature extraction scheme for the recognition of printed bilingual (Tamil and Roman) text. The scheme divides the combined alphabet set of both scripts into subsets by the extraction of certain spatial and structural features. Three features, viz. geometric moments, DCT-based features, and wavelet transform based features, are extracted from the grouped symbols, and a linear transformation is performed on them for efficient representation in the feature space. The transformation is obtained by the maximization of certain criterion functions. Three techniques, principal component analysis, maximization of Fisher's ratio, and maximization of a divergence measure, have been employed to estimate the transformation matrix. It has been observed that the proposed hierarchical scheme allows for easier handling of the alphabets, and there is an appreciable rise in the recognition accuracy as a result of the transformations.

D. Dhanya, A. G. Ramakrishnan
Machine Recognition of Printed Kannada Text

This paper presents the design of a full-fledged OCR system for printed Kannada text. The machine recognition of Kannada characters is difficult due to the similarity in the shapes of different characters, script complexity, and non-uniqueness in the representation of diacritics. The document image is subjected to line segmentation, word segmentation, and zone detection. From the zonal information, base characters, vowel modifiers, and consonant conjuncts are separated. A knowledge-based approach is employed for recognizing the base characters. Various features are employed for recognizing the characters. These include the coefficients of the Discrete Cosine Transform, the Discrete Wavelet Transform, and the Karhunen-Loève Transform. These features are fed to different classifiers. Structural features are used at subsequent levels to discriminate confused characters. The use of structural features increases the recognition rate from 93% to 98%. Apart from the classical pattern classification technique of nearest neighbour, Artificial Neural Network (ANN) based classifiers such as Back Propagation and Radial Basis Function (RBF) Networks have also been studied. The ANN classifiers are trained in supervised mode using the transform features. The highest recognition rate of 99% is obtained with RBF using second-level approximation coefficients of Haar wavelets as features on presegmented base characters.

B. Vijay Kumar, A. G. Ramakrishnan
An Integrated System for the Analysis and the Recognition of Characters in Ancient Documents

This paper describes an integrated system for processing and analyzing highly degraded ancient printed documents. For each page, the system reduces noise by wavelet-based filtering, extracts and segments the text lines into characters by fast adaptive thresholding, and performs OCR with a feed-forward back-propagation multilayer neural network. The recognition probability is used as a discriminant parameter for determining the automatic activation of a feedback process, leading back to a block that refines the segmentation. This block acts only on the small portions of the text where the recognition was not reliable, and makes use of blind deconvolution and MRF-based segmentation techniques. The experimental results highlight the good performance of the whole system in the analysis of even strongly degraded texts.

Stefano Vezzosi, Luigi Bedini, Anna Tonazzini
A Complete Tamil Optical Character Recognition System

Document image processing and Optical Character Recognition (OCR) have been a frontline research area in the field of human-machine interfaces for the last few decades. Recognition of Indian language characters has been a topic of interest for quite some time. Earlier contributions were reported in [1] and [2]. More recent work is reported in [3] and [9]. The need for efficient and robust algorithms and systems for recognition is being felt in India, especially in the post and telegraph department, where OCR can assist the staff in sorting mail. Character recognition can also form a part of applications like intelligent scanning machines, text-to-speech converters, and automatic language-to-language translators.

K. G. Aparna, A. G. Ramakrishnan
Distinguishing between Handwritten and Machine Printed Text in Bank Cheque Images

In the current literature on textual element identification in bank cheque images, many of the strategies put forward are strongly dependent on document layout. This means searching for and employing contextual information as a pointer to a search region on the image. However, handwritten and machine-printed characters do not depend on the document in which they are inserted. The characteristics of handwritten and machine-printed elements can be described in a generic, document-independent way. Based on these observations, this paper presents a new approach to identifying textual elements from a set of local features, enabling the category of a textual element to be established without needing to observe its environment. The use of local features may allow a more generic and rich classification process, enabling it in some cases to be used over different sorts of documents. Based on this assumption, our tests used bank cheque images from Brazil, the USA, Canada, and France. The preliminary results show the efficiency and the potential of this approach.

José Eduardo Bastos Dos Santos, Bernard Dubuisson, Flávio Bortolozzi
Multi-expert Seal Imprint Verification System for Bankcheck Processing

A difficult problem encountered in automatic seal imprint verification is that the system is required to achieve an extremely low error rate despite the variety of seal imprint quality. To address this problem, we propose a multi-expert seal imprint verification system, which combines two different verification algorithms. The first verification algorithm is based on local and global features of the seal imprint. The second uses a special correlation method based on a global approach. The two algorithms are combined by a voting strategy. Experimental results showed that the combination of the two algorithms significantly improves the verification performance with respect to both the false-acceptance error rate and the false-rejection error rate.

Katsuhiko Ueda, Ken’ichi Matsuo
Automatic Reading of Traffic Tickets

This work presents a prototype system that extracts and recognizes handwritten information on a traffic ticket and thereafter feeds it into a database of registered cars for further processing. Each extracted item consists either of isolated handwritten Arabic digits or a tick mark "x". The ticket form is designed in such a way as to facilitate the extraction process. For each input, the output of the recognition module is a probabilistic value that indicates the system's confidence in the correct pattern class. If the probabilistic output is below a predetermined threshold, the system requests assistance from the user to identify the input pattern. This feature is necessary in order to avoid feeding wrong information into the database, such as associating the traffic ticket with the wrong registered car.

Nabeel Murshed

Handwriting Recognition

A Stochastic Model Combining Discrete Symbols and Continuous Attributes and Its Application to Handwriting Recognition

This paper introduces a new stochastic framework for modeling sequences of features that are combinations of discrete symbols and continuous attributes. Unlike traditional hidden Markov models, the new model emits observations on transitions instead of states. In this framework, a feature is first labeled with a symbol, and then a set of feature-dependent continuous attributes is associated with it to give more details of the feature. This two-level hierarchy is modeled by symbol observation probabilities, which are discrete, and attribute observation probabilities, which are continuous. The model is rigorously defined and the algorithms for its training and decoding are presented. This framework has been applied to off-line handwritten word recognition using high-level structural features and proves its effectiveness in experiments.

Hanhong Xue, Venu Govindaraju
Top-Down Likelihood Word Image Generation Model for Holistic Word Recognition

This paper describes a new top-down word image generation model for word recognition. This model can generate a word image with a likelihood based on linguistic knowledge, segmentation, and character images. In the recognition process, the model first generates the word image that best approximates the input image for each entry in a dictionary of possible words. Next, the model calculates the distance value between the input image and each generated word image. Thus, the proposed method is a type of holistic word recognition method. The effectiveness of the proposed method was evaluated in an experiment using type-written museum archive card images. The evaluation shows the difference between a non-holistic method and the proposed method. Small errors accumulate in non-holistic methods because they cover not the whole word image but only partial images extracted by segmentation, and they cannot eliminate black pixels intruding into the recognition window from neighboring characters. In the proposed method, we can expect that no such errors will accumulate. Results show that a recognition rate of 99.8% was obtained, compared with only 89.4% for a recently published comparator algorithm.

Eiki Ishidera, Simon M. Lucas, Andrew C. Downton
The Segmentation and Identification of Handwriting in Noisy Document Images

In this paper we present an approach to the problem of segmenting and identifying handwritten annotations in noisy document images. In many types of documents, such as correspondence, it is not uncommon for handwritten annotations to be added as part of a note, correction, clarification, or instruction, or for a signature to appear as an authentication mark. It is important to be able to segment and identify such handwriting so we can 1) locate, interpret and retrieve it efficiently in large document databases, and 2) use different algorithms for printed/handwritten text recognition and signature verification. Our approach consists of two processes: 1) a segmentation process, which divides the text into regions at an appropriate level (character, word, or zone), and 2) a classification process, which identifies the segmented regions as handwritten. To determine the approximate region size at which classification can be reliably performed, we conducted experiments at the character, word and zone levels. We found that reliable results can be achieved at the word level, with a classification accuracy of 97.3%. The identified handwritten text is further grouped into zones and verified to reduce false alarms. Experiments show our approach is promising and robust.

Yefeng Zheng, Huiping Li, David Doermann
The Impact of Large Training Sets on the Recognition Rate of Off-line Japanese Kanji Character Classifiers

Though it is commonly agreed that increasing the training set size leads to improved recognition rates, the deficit of publicly available Japanese character pattern databases has prevented us from verifying this assumption empirically for large data sets. Whereas the typical number of training samples has usually been between 100 and 200 patterns per category until now, newly collected databases and increased computing power allow us to experiment with a much higher number of samples per category. In this paper, we experiment with off-line classifiers trained with up to 1550 patterns per category for 3036 categories. We show that this bigger training set size indeed leads to improved recognition rates compared to the smaller training sets normally used.

Ondrej Velek, Masaki Nakagawa
Automatic Completion of Korean Words for Open Vocabulary Pen Interface

An automatic completion method for general Korean words is proposed. A word model that describes the frequency of usage is used to generate word candidates from a given prefix. In experiments, several different models for Korean words were tested for prediction performance. The results show that the best model can reduce the number of characters that must be written by about 38% with 5 candidates.

Sungho Ryu, Jin-Hyung Kim
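
A minimal sketch of the prefix-completion idea follows, in plain Python; it simply ranks whole corpus words by frequency rather than implementing the paper's statistical word model for Korean, and the toy word counts are invented for the example.

```python
from collections import defaultdict

class PrefixCompleter:
    """Suggest the most frequent completions for a given prefix."""

    def __init__(self, word_counts):
        # index every prefix of every word, keeping (frequency, word) pairs
        self.by_prefix = defaultdict(list)
        for word, count in word_counts.items():
            for i in range(1, len(word) + 1):
                self.by_prefix[word[:i]].append((count, word))
        for completions in self.by_prefix.values():
            completions.sort(reverse=True)          # most frequent first

    def suggest(self, prefix, k=5):
        return [word for _, word in self.by_prefix.get(prefix, [])[:k]]

# After writing "do", the user picks from k candidates instead of typing on.
completer = PrefixCompleter({"document": 40, "documents": 25, "domain": 10, "dog": 3})
print(completer.suggest("do", k=5))
```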
Using Stroke-Number-Characteristics for Improving Efficiency of Combined Online and Offline Japanese Character Classifiers

We propose a new technique for normalizing the likelihoods of multiple classifiers prior to their combination. During the combination process we utilize information about how well each classifier correctly recognizes a character with a given stroke number. We first show that this stroke-number-dependent efficiency differs between a common on-line and an off-line recognizer. We then demonstrate, with elementary combination rules such as the sum rule and the max rule, that using this information increases the recognition rate.

Ondrej Velek, Masaki Nakagawa
Closing Gaps of Discontinuous Lines: A New Criterion for Choosing the Best Prolongation

Polylines that are or should be continuous can have gaps in them, either because of scanning or digital processing or because they are depicted with discontinuous symbology like dots or dashes. This paper presents a new criterion for finding the most likely prolongation of discontinuous polylines.

Eugene Bodansky, Alexander Gribov

Classifiers and Learning

Classifier Adaptation with Non-representative Training Data

We propose an adaptive methodology to tune the decision boundaries of a classifier trained on non-representative data to the statistics of the test data in order to improve accuracy. Specifically, for machine-printed and handprinted digit recognition we demonstrate that adapting the class means alone can provide considerable gains in recognition. On machine-printed digits we adapt to the typeface, on hand-print to the writer. We recognize the digits with a Gaussian quadratic classifier when the style of the test set is represented by a subset of the training set, and also when it is not represented in the training set. We compare unsupervised adaptation and style-constrained classification on isogenous test sets of five machine-printed and two hand-printed NIST data sets. Both estimating the means and imposing style constraints reduce the error rate in almost every case, and neither ever results in significant loss. They are comparable under the first scenario (specialization), but adaptation is better under the second (new style). Adaptation is beneficial when the test set is large enough (even with only ten samples of each class by one writer in a 100-dimensional feature space), but style-conscious classification is the only option with fields of only two or three digits.

Sriharsha Veeramachaneni, George Nagy
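
As a rough illustration of adapting class means to unlabeled test data, the sketch below uses a simple nearest-mean rule and NumPy; the authors work with a Gaussian quadratic classifier, so this only conveys the general idea, not their formulation.

```python
import numpy as np

def adapt_class_means(train_means, test_X, n_iters=3):
    """Unsupervised adaptation of class means to a test set.

    Each iteration labels the test samples with the current nearest-mean rule,
    then re-estimates every class mean from the samples assigned to it.
    """
    means = np.array(train_means, dtype=float)
    test_X = np.asarray(test_X, dtype=float)
    for _ in range(n_iters):
        # assign each test sample to the nearest current class mean
        dists = np.linalg.norm(test_X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # re-estimate means; keep the old mean if a class received no samples
        for c in range(len(means)):
            assigned = test_X[labels == c]
            if len(assigned) > 0:
                means[c] = assigned.mean(axis=0)
    return means
```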
A Learning Pseudo Bayes Discriminant Method Based on Difference Distribution of Feature Vectors

We developed a learning pseudo Bayes discriminant method that dynamically adapts a pseudo Bayes discriminant function to the font and image degradation conditions present in a text. In this method, the characteristics of character pattern deformations are expressed as a statistic of a difference distribution, and the information represented by the difference distribution is integrated into the pseudo Bayes discriminant function. Integrating the difference distribution into the pseudo Bayes discriminant function results in the covariance matrix of each category being adjusted based on the difference distribution. We evaluated the proposed method on multi-font texts and degraded texts such as compressed color images and faxed copies. We found that the recognition accuracy of our method on the evaluated texts was much higher than that of conventional methods.

Hiroaki Takebe, Koji Kurokawa, Yutaka Katsuyama, Satoshi Naoi
Increasing the Number of Classifiers in Multi-classifier Systems: A Complementarity-Based Analysis

Complementarity among classifiers is a crucial aspect of classifier combination. A combined classifier is significantly superior to the individual classifiers only if they strongly complement each other. In this paper a complementarity-based analysis of sets of classifiers is proposed for investigating the behaviour of multi-classifier systems as new classifiers are added to the set. The experimental results confirm the theoretical evidence and allow the prediction of the performance of a multi-classifier system as the number of classifiers increases.

L. Bovino, G. Dimauro, S. Impedovo, G. Pirlo, A. Salzo
Discovering Rules for Dynamic Configuration of Multi-classifier Systems

This paper addresses the problem of the dynamic configuration of multi-classifier systems. For this purpose, the performance of combination methods for abstract-level classifiers is predicted under different working conditions, and sets of rules are discovered and used for the dynamic configuration of multi-classifier systems. The experimental tests have been carried out in the field of handwritten numeral recognition. The results demonstrate the validity of the proposed approach.

G. Dimauro, S. Impedovo, M. G. Lucchese, G. Pirlo, A. Salzo
Multiple Classifier Combination for Character Recognition: Revisiting the Majority Voting System and Its Variations

In recent years, strategies based on the combination of multiple classifiers have generated great interest in the character recognition research community. A huge number of complex and sophisticated decision combination strategies have been explored by researchers. However, it has recently been realized that the comparatively simple Majority Voting System and its variations can achieve very robust and often comparable, if not better, performance than many of these complex systems. In this paper, various Majority Voting Systems and their variations are reviewed, and a comparative study of some of these methods is presented for a typical character recognition task.

A. F. R. Rahman, H. Alam, M. C. Fairhurst
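
A minimal sketch of (weighted) majority voting in plain Python; the weighting option stands in for the variations discussed, and the example labels are invented.

```python
from collections import Counter

def majority_vote(predictions, weights=None):
    """Combine class labels from several classifiers by (weighted) majority vote.

    `predictions` holds one label per classifier for a single test pattern;
    `weights` optionally gives each classifier a vote strength.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    tally = Counter()
    for label, w in zip(predictions, weights):
        tally[label] += w
    return tally.most_common(1)[0][0]

# Three of five classifiers agree on '8', so the ensemble outputs '8'.
print(majority_vote(['8', '8', '3', '8', '9']))
```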

Layout Analysis

Correcting for Variable Skew

The proliferation of inexpensive sheet-feed scanners, particularly in fax machines, has led to a need to correct for the uneven paper feed rates during digitization if the images produced by these scanners are to be further analyzed. We develop a technique for detecting and compensating for this type of image distortion.

A. Lawrence Spitz
Two Geometric Algorithms for Layout Analysis

This paper presents geometric algorithms for solving two key problems in layout analysis: finding a cover of the background whitespace of a document in terms of maximal empty rectangles, and finding constrained maximum likelihood matches of geometric text line models in the presence of geometric obstacles. The algorithms are considerably easier to implement than prior methods, they return globally optimal solutions, and they require no heuristics. The paper also introduces an evaluation function that reliably identifies maximal empty rectangles corresponding to column boundaries. Combining this evaluation function with the two geometric algorithms results in an easy-to-implement layout analysis system. Reliability of the system is demonstrated on documents from the UW3 database.

Thomas M. Breuel
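
The whitespace-cover idea of finding maximal empty rectangles by branch and bound can be sketched compactly. The plain-Python version below returns only the single largest obstacle-free rectangle and omits the column-boundary evaluation function described in the paper; rectangles are (x0, y0, x1, y1) tuples.

```python
import heapq

def largest_empty_rectangle(bound, obstacles):
    """Branch-and-bound search for the largest obstacle-free rectangle inside `bound`."""
    def area(r):
        return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

    def overlaps(a, b):
        return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

    # max-heap keyed on the area of the candidate region (an upper bound on any
    # empty rectangle contained in it)
    heap = [(-area(bound), bound, tuple(obstacles))]
    while heap:
        _, rect, obs = heapq.heappop(heap)
        inside = [o for o in obs if overlaps(o, rect)]
        if not inside:
            return rect          # no obstacle intersects: globally largest empty rectangle
        px0, py0, px1, py1 = inside[0]           # split around one obstacle
        x0, y0, x1, y1 = rect
        for sub in ((x0, y0, px0, y1), (px1, y0, x1, y1),
                    (x0, y0, x1, py0), (x0, py1, x1, y1)):
            if area(sub) > 0:
                heapq.heappush(heap, (-area(sub), sub, tuple(inside)))
    return None

# Toy page: two "text columns" as obstacles; the gap between them is returned.
print(largest_empty_rectangle((0, 0, 100, 100), [(10, 10, 40, 90), (60, 10, 90, 90)]))
```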
Text/Graphics Separation Revisited

Text/graphics separation aims at segmenting the document into two layers: a layer assumed to contain text and a layer containing graphical objects. In this paper, we present a consolidation of a method proposed by Fletcher and Kasturi, with a number of improvements to make it more suitable for graphics-rich documents. We discuss the right choice of thresholds for this method, and their stability. We also propose a post-processing step for retrieving text components touching the graphics, through local segmentation of the distance skeleton.

Karl Tombre, Salvatore Tabbone, Loïc Pélissier, Bart Lamiroy, Philippe Dosch
A Study on the Document Zone Content Classification Problem

A document can be divided into zones on the basis of its content. For example, a zone can be either text or non-text. Given the segmented document zones, correctly determining the zone content type is very important for the subsequent processes within any document image understanding system. This paper describes an algorithm for determining the zone type of a given zone within an input document image. In our zone classification algorithm, zones are represented as feature vectors. Each feature vector consists of a set of 25 measurements of pre-defined properties. A probabilistic model, a decision tree, is used to classify each zone on the basis of its feature vector. Two methods are used to optimize the decision tree classifier to eliminate the data over-fitting problem. To enrich our probabilistic model, we incorporate context constraints for certain zones within their neighboring zones. We also model zone class context constraints as a Hidden Markov Model and use the Viterbi algorithm to obtain optimal classification results. The training, pruning and testing data set for the algorithm includes 1,600 images drawn from the UWCDROM-III document image database. With a total of 24,177 zones within the data set, the cross-validation method was used in the performance evaluation of the classifier. The classifier is able to classify each given scientific and technical document zone into one of nine classes: two text classes (of font size 4-18 pt and font size 19-32 pt), math, table, halftone, map/drawing, ruling, logo, and others. A zone content classification performance evaluation protocol is proposed. Using this protocol, our algorithm's accuracy is 98.45% with a mean false alarm rate of 0.50%.

Yalin Wang, Ihsin T. Phillips, Robert M. Haralick
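
A sketch of the decision-tree stage, assuming scikit-learn; the 25-dimensional zone feature vectors are taken as given, and cost-complexity pruning stands in for the paper's own over-fitting countermeasures.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# X: one 25-dimensional feature vector per zone (run-length, spatial, texture
# measurements, etc.); y: zone class labels such as "text", "table", "halftone".
# Feature extraction is omitted here; this only illustrates the classifier stage.
def train_zone_classifier(X, y):
    # cost-complexity pruning (ccp_alpha) is one generic way to limit over-fitting
    clf = DecisionTreeClassifier(ccp_alpha=1e-3, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)      # cross-validated accuracy
    clf.fit(X, y)
    return clf, scores.mean()
```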
Logical Labeling of Document Images Using Layout Graph Matching with Adaptive Learning

Logical structure analysis of document images is an important problem in document image understanding. In this paper, we propose a graph matching approach to label logical components on a document page. Our system is able to learn a model for a document class, use this model to label document images through graph matching, and adaptively improve the model with error feedback. We tested our method on journal/proceedings article title pages. The experimental results show promising accuracy and confirm the ability of adaptive learning.

Jian Liang, David Doermann
A Ground-Truthing Tool for Layout Analysis Performance Evaluation

There is a significant need for performance evaluation of Layout Analysis methods. The greatest stumbling block is the lack of sufficient ground truth. In particular, there is currently no ground-truth for the evaluation of the performance of page segmentation methods dealing with complex-shaped regions and documents with non-uniformly oriented regions. This paper describes a new, flexible, ground-truthing tool. It is fast and easy to use as it performs page segmentation to obtain a first description of regions. The ground-truthing system allows for the editing (merging, splitting and shape alteration) of each of the region outlines obtained from page segmentation. The resulting ground-truth regions are described in terms of isothetic polygons to ensure flexibility and wide applicability. The system also provides for the labelling of each of the ground truth regions according to the type of their content and their logical function. The former can be used to evaluate page classification, while the latter can be used in assessing logical layout structure extraction.

A. Antonacopoulos, H. Meng
Simple Layout Segmentation of Gray-Scale Document Images

A simple yet effective layout segmentation method for gray-scale document images is proposed in this paper. First, n x n blocks are roughly labeled as background, line, text, image, graphics or mixed class. Blocks in the mixed class are split into four sub-blocks, and the process repeats until no mixed class is found. By exploiting a Savitzky-Golay derivative filter in the classification, the computation of features is kept to a minimum. Next, the boundaries of each object are refined. The experimental results are satisfactory for use as a pre-processing step prior to OCR.

A. Suvichakorn, S. Watcharabusaracum, W. Sinthupinyo
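
The use of a derivative filter on block profiles can be illustrated briefly. The sketch below assumes SciPy and NumPy; the window length, polynomial order, and returned statistics are illustrative choices rather than the paper's settings, and the block is assumed to be at least 7 pixels on a side.

```python
import numpy as np
from scipy.signal import savgol_filter

def block_profile_features(block):
    """Cheap features for labelling an n x n gray-level block.

    The smoothed first derivative of the row/column intensity profiles (via a
    Savitzky-Golay filter) responds strongly to text-like transitions and stays
    nearly flat on plain background.
    """
    rows = block.mean(axis=1)
    cols = block.mean(axis=0)
    d_rows = savgol_filter(rows, window_length=7, polyorder=2, deriv=1)
    d_cols = savgol_filter(cols, window_length=7, polyorder=2, deriv=1)
    return {
        "mean_intensity": float(block.mean()),
        "row_activity": float(np.abs(d_rows).mean()),
        "col_activity": float(np.abs(d_cols).mean()),
    }
```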

Tables and Forms

Detecting Tables in HTML Documents

The table is a commonly used presentation scheme for describing relational information. Table understanding on the web has many potential applications, including web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. Although tables in HTML documents are generally marked as <table> elements, a <table> element does not necessarily indicate the presence of a genuine relational table. Thus the important first step in table understanding in the web domain is the detection of genuine tables. In our earlier work we designed a basic rule-based algorithm to detect genuine tables in major news and corporate home pages as part of a web content filtering system. In this paper we investigate a machine learning based approach that is trainable and thus can be automatically generalized to include any domain. Various features reflecting the layout as well as the content characteristics of tables are explored. The system is tested on a large database which consists of 1,393 HTML files collected from hundreds of different websites in various domains and contains over 10,000 leaf <table> elements. Experiments were conducted using the cross-validation method. The machine learning based approach outperformed the rule-based system and achieved an F-measure of 95.88%.

Yalin Wang, Jianying Hu
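
A sketch of the leaf-table feature extraction step, assuming BeautifulSoup; the specific features computed here only approximate the layout and content characteristics explored in the paper.

```python
from bs4 import BeautifulSoup

def leaf_table_features(html):
    """Simple layout/content features for each leaf <table> element."""
    soup = BeautifulSoup(html, "html.parser")
    features = []
    for table in soup.find_all("table"):
        if table.find("table") is not None:          # skip tables that only nest others
            continue
        rows = table.find_all("tr")
        cells = table.find_all(["td", "th"])
        cell_lengths = [len(c.get_text(strip=True)) for c in cells] or [0]
        features.append({
            "n_rows": len(rows),
            "n_cells": len(cells),
            "mean_cell_text_len": sum(cell_lengths) / len(cell_lengths),
            "has_header_cells": any(c.name == "th" for c in cells),
            "n_links": len(table.find_all("a")),     # layout tables are often link-heavy
        })
    return features
```

Each feature dictionary can then be passed to a trained classifier that decides whether the element is a genuine relational table.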
Document-Form Identification Using Constellation Matching of Keywords Abstracted by Character Recognition

A document-form identification method based on constellation matching of targets is proposed. Mathematical analysis shows that the method achieves a high identification rate by preparing plural targets. The method consists of two parts: (i) extraction of targets, such as important keywords in a document, by template matching between recognised characters and word strings in a keyword dictionary, and (ii) analysis of the positional or semantic relationship between the targets by point-pattern matching between these targets and word location information in the keyword dictionary. All characters in the document are recognised by means of a conventional character-recognition method. An automatic keyword-determination method, which is necessary for building the keyword dictionary beforehand, is also proposed. This method selects the most suitable keywords from a general word dictionary by measuring the uniqueness of keywords and the stability of their recognition. Experiments using 671 sample documents with 107 different forms in total confirmed that (i) the keyword-determination method can determine sets of keywords automatically for 92.5% of the 107 different forms and (ii) the form-identification method can correctly identify 97.1% of the 671 document samples at a rejection rate of 2.9%.

Hiroshi Sako, Naohiro Furukawa, Masakazu Fujio, Shigeru Watanabe
Table Detection via Probability Optimization

In this paper, we define the table detection problem as a probability optimization problem. We begin, as in our previous algorithm, by finding and validating detected table candidates. We then compute a set of probability measurements for each of the table entities. The computation of the probability measurements takes into consideration tables, table text separators and neighboring text blocks. Then, an iterative updating method is used to optimize the page segmentation probability to obtain the final result. The new algorithm shows a great improvement over our previous algorithm. The training and testing data set for the algorithm includes 1,125 document pages containing 518 table entities and a total of 10,934 cell entities. Compared with our previous work, it raised the accuracy rates to 95.67% from 90.32%, and to 97.05% from 92.04%.

Yalin Wang, Ihsin T. Phillips, Robert M. Haralick
Complex Table Form Analysis Using Graph Grammar

Various kinds of complex table forms are used for many purposes, e.g. application forms. This paper presents a graph grammar based approach to complex table form structure analysis. In our study, field types are classified into four categories, i.e. blank, insertion, indication, and explanation, and four kinds of indication patterns are defined between indication and blank or insertion. Then, two-dimensional relations between horizontally and vertically adjacent fields are described by a graph representation, and their reduction procedures are defined as production rules. We have designed 56 meta rules, from which 6745 rules are generated for complex table form analysis. Experimental results have shown that 31 kinds of different table forms are successfully analyzed using two types of meta grammar.

Akira Amano, Naoki Asada
Detection Approaches for Table Semantics in Text

When linking information presented in documents as tables with data held in databases, it is important to determine as much information as possible about the table and its content. Such an integrated use of Web-based data requires information about its organization and meaning, i.e. the semantics of the table. This paper describes approaches that can be used to detect and extract the semantics of a table held in text. Our objective is to detect and extract table semantics that are buried in the text. For this goal to be achieved, a domain ontology that covers the semantics of the terms used in the table must be available to the information system. The overall aim is to link this tabular information in an interoperable environment containing databases and other structured information.

Saleh Alrashed, W. A. Gray
A Theoretical Foundation and a Method for Document Table Structure Extraction and Decomposition

The algorithm described in this paper is designed to detect potential table regions in the document, to decide whether a potential table region is, in fact, a table, and, when it is, to analyze the table structure. The decision and analysis phases of the algorithm and the resulting system are based primarily on a precise definition of table, and it is such a definition that is discussed in this paper. An adequate definition need not be complete in the sense of encompassing all possible structures that might be deemed to be tables, but it should encompass most such structures, it should include essential features of tables, and it should exclude features never or very rarely possessed by tables.

Howard Wasserman, Keitaro Yukawa, Bon Sy, Kui-Lam Kwok, Ihsin Tsaiyun Phillips

Text Extraction

Fuzzy Segmentation of Characters in Web Images Based on Human Colour Perception

This paper describes a new approach for the segmentation of characters in images on Web pages. In common with the authors’ previous work in this subject, this approach attempts to emulate the ability of humans to differentiate between colours. In this case, pixels of similar colour are first grouped using a colour distance defined in a perceptually uniform colour space (as opposed to the commonly used RGB). The resulting colour connected components are then grouped to form larger (character-like) regions with the aid of a fuzzy propinquity measure. This measure expresses the likelihood for merging two components based on two features. The first feature is the colour distance in the L*a*b* colour space. The second feature expresses the topological relationship of two components. The results of the method indicate a better performance than the previous method devised by the authors and comparable (possibly better) performance to other existing methods.

A. Antonacopoulos, D. Karatzas
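
Computing perceptual colour distance in L*a*b* rather than RGB is the core of the first feature. A small sketch follows, assuming scikit-image and NumPy; the distance threshold is an illustrative value, not the authors' fuzzy propinquity measure.

```python
import numpy as np
from skimage import color

def lab_distance_map(rgb_image, seed_rgb):
    """Perceptual distance of every pixel to a seed colour, computed in L*a*b*.

    rgb_image: RGB image as floats in [0, 1] or uint8 (scikit-image rescales).
    seed_rgb:  an (R, G, B) triple in 0-255.
    Euclidean distance in L*a*b* tracks perceived colour difference far better
    than a distance measured directly in RGB.
    """
    lab = color.rgb2lab(rgb_image)                                    # H x W x 3
    seed_lab = color.rgb2lab(np.array(seed_rgb, dtype=float).reshape(1, 1, 3) / 255.0)
    dist = np.linalg.norm(lab - seed_lab.reshape(1, 1, 3), axis=2)
    return dist, dist < 20.0       # distance map and a crude similar-colour mask
```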
Word and Sentence Extraction Using Irregular Pyramid

This paper presents the results of our continued work on a further enhancement of our previously proposed algorithm. Moving beyond the extraction of word groups, and based on the same irregular pyramid structure, the new algorithm groups the extracted words into sentences. The uniqueness of the algorithm is in its ability to process text with a wide variation in size, font, orientation and layout on the same document image. No assumption is made about any specific document type. The algorithm is based on the irregular pyramid structure with the application of four fundamental concepts. The first is the inclusion of background information. The second is the concept of closeness, where text information within a group is close together, in terms of spatial distance, as compared to other text areas. The third is the "majority win" strategy, which is more suitable under a greatly varying environment than a constant threshold value. The final concept is uniformity and continuity among words belonging to the same sentence.

Poh Kok Loo, Chew Lim Tan
Word Searching in Document Images Using Word Portion Matching

An approach with the capability of searching for a word portion in document images is proposed in this paper, to facilitate the detection and location of user-specified query words. A feature string is synthesized according to the character sequence in the user-specified word, and each word image extracted from the documents is represented by a feature string. Then, an inexact string matching technique is utilized to measure the similarity between the two feature strings, based on which we can estimate how relevant the document word image is to the user-specified word and decide whether a portion of it is the same as the user-specified word. Experimental results on real document images show that it is a promising approach, capable of detecting and locating document words that entirely or partially match the user-specified word.

Yue Lu, Chew Lim Tan
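
The inexact matching of a query feature string against a word feature string can be illustrated with a standard approximate-substring dynamic program in plain Python; ordinary character strings stand in here for the synthesized feature strings, and the unit cost scheme is the generic one, not necessarily the paper's.

```python
def approximate_substring_distance(query, text):
    """Smallest edit distance between `query` and any portion of `text`.

    The first row is all zeros, so a match may start anywhere in `text`; the
    answer is the minimum of the last row, so it may end anywhere too.
    """
    m, n = len(query), len(text)
    prev = [0] * (n + 1)                       # free start at every text position
    for i in range(1, m + 1):
        curr = [i] + [0] * n                   # skipping i query symbols costs i
        for j in range(1, n + 1):
            cost = 0 if query[i - 1] == text[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # skip a query symbol
                          curr[j - 1] + 1,     # absorb an extra text symbol
                          prev[j - 1] + cost)  # match or substitute
        prev = curr
    return min(prev)

# Distance 0 means some portion of the word matches the query exactly.
print(approximate_substring_distance("analy", "documentanalysis"))
```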
Scene Text Extraction in Complex Images

Text extraction and recognition from still and moving images have many important applications. However, when the source image is an ordinary natural scene, text extraction becomes very complicated and difficult. In this paper, we suggest text extraction methods based on color and gray information. The method using the color image proceeds by color reduction, color clustering, and text region extraction and verification. The method using the gray-level image proceeds by edge detection, long line removal, repetitive run-length smearing (RLS), and text region extraction and verification. Combining the two approaches improves the extraction accuracy in both simple and complex images. Estimating the skew and perspective of the extracted text regions is also considered.

Hye-Ran Byun, Myung-Cheol Roh, Kil-Cheon Kim, Yeong-Woo Choi, Seong-Whan Lee
Text Extraction in Digital News Video Using Morphology

In this paper, a new method is presented to extract both superimposed and embedded scene text in digital news videos. The algorithm is summarized in the following three steps: preprocessing, extracting candidate regions, and filtering candidate regions. In the first, preprocessing step, a color image is converted into a gray-level image and a modified local adaptive thresholding is applied to the contrast-stretched image. In the second step, various morphological operations and a geo-correction method are applied to remove non-text components while retaining the text components. In the third, filtering step, non-text components are removed based on the characteristics of each candidate component, such as the number of pixels and the bounding box of each connected component. Acceptable results have been obtained using the proposed method on 300 domestic news images, with a recognition rate of 93.6%. The proposed method also performs well on various other kinds of images, such as foreign news and film videos.

Hyeran Byun, Inyoung Jang, Yeongwoo Choi

Indexing and Retrieval

Retrieval by Layout Similarity of Documents Represented with MXY Trees

Document image retrieval can be carried out either by processing the converted text (obtained with OCR) or by measuring the layout similarity of images. We describe a system for document image retrieval based on layout similarity. The layout is described by means of a tree-based representation: the Modified X-Y tree. Each page in the database is represented by a feature vector containing both global features of the page and a vectorial representation of its layout that is derived from the corresponding MXY tree. Occurrences of tree patterns are handled similarly to index terms in Information Retrieval in order to compute the similarity. When retrieving relevant documents, the images in the collection are sorted on the basis of a measure that combines two values describing the similarity of global features and of the occurrences of tree patterns. The system is applied to the retrieval of documents belonging to digital libraries. Tests of the system are made on a data set of more than 600 pages belonging to a journal of the 19th century, and on a collection of monographs printed in the same century and containing more than 600 pages.

Francesca Cesarini, Simone Marinai, Giovanni Soda
Automatic Indexing of Newspaper Microfilm Images

This paper describes a proposed document analysis system that aims at the automatic indexing of digitized images of old newspaper microfilms. This is done by extracting news headlines from the microfilm images. The headlines are then converted to machine-readable text by OCR to serve as indices to the respective news articles. A major challenge is the poor image quality of the microfilm, as most images are usually inadequately illuminated and considerably dirty. To overcome this problem we propose a new, effective method for separating characters from the noisy background, since conventional threshold selection techniques are inadequate for these kinds of images. A Run Length Smearing Algorithm (RLSA) is then applied for the headline extraction. Experimental results confirm the validity of the approach.

Qing Hong Liu, Chew Lim Tan
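
A compact NumPy sketch of horizontal run-length smearing, the operation applied here for headline extraction; the gap threshold is left as a parameter since the abstract does not state the value used.

```python
import numpy as np

def rlsa_horizontal(binary, threshold):
    """Horizontal Run Length Smearing: fill white (0) gaps shorter than or equal
    to `threshold` pixels between black (1) pixels on each row.

    binary: 2-D array with 1 for ink and 0 for background (modified copy returned).
    """
    out = binary.copy()
    for row in out:
        run_start = None           # start index of the current white run
        seen_black = False
        for x, v in enumerate(row):
            if v == 1:
                if seen_black and run_start is not None and x - run_start <= threshold:
                    row[run_start:x] = 1       # smear the short gap
                run_start = None
                seen_black = True
            elif run_start is None:
                run_start = x
    return out
```

Applying the same operation to the transposed image gives the vertical smearing often combined with it.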
Improving Document Retrieval by Automatic Query Expansion Using Collaborative Learning of Term-Based Concepts

Query expansion methods have been studied for a long time, with debatable success in many instances. In this paper, a new approach is presented based on using term concepts learned from other queries. Two important issues with query expansion are addressed: the selection and the weighting of additional search terms. In contrast to other methods, the regarded query is expanded by adding those terms which are most similar to the concept of individual query terms, rather than selecting terms that are similar to the complete query or that are directly similar to the query terms. Experiments have shown that this kind of query expansion results in notable improvements of the retrieval effectiveness, measured in recall/precision, in comparison to the standard vector space model and to pseudo relevance feedback. This approach can be used to improve the retrieval of documents in Digital Libraries, in Document Management Systems, on the WWW, etc.

Stefan Klink, Armin Hust, Markus Junker, Andreas Dengel
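
A rough sketch of per-term expansion, assuming NumPy; cosine neighbourhoods over a term-by-feature matrix stand in for the term concepts learned collaboratively from other queries, so this only conveys the selection step, not the paper's weighting scheme.

```python
import numpy as np

def expand_query(query_terms, term_vectors, vocabulary, per_term=2):
    """Expand each query term with its most similar terms (cosine similarity).

    term_vectors: term-by-feature matrix (e.g. co-occurrence counts), one row per
    vocabulary entry; vocabulary: list of terms aligned with the rows.
    """
    norms = np.linalg.norm(term_vectors, axis=1, keepdims=True)
    unit = term_vectors / np.maximum(norms, 1e-12)
    index = {t: i for i, t in enumerate(vocabulary)}
    expanded = list(query_terms)
    for term in query_terms:
        if term not in index:
            continue
        sims = unit @ unit[index[term]]                      # cosine to every term
        neighbours = [vocabulary[j] for j in np.argsort(-sims)
                      if vocabulary[j] != term][:per_term]
        expanded.extend(t for t in neighbours if t not in expanded)
    return expanded
```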
Spotting Where to Read on Pages - Retrieval of Relevant Parts from Page Images

This paper presents a new method of document image retrieval that is capable of spotting parts of page images relevant to a user’s query. This enables us to improve the usability of retrieval, since a user can find where to read on retrieved pages. The effectiveness of retrieval can also be improved because the method is little influenced by irrelevant parts on pages. The method is based on the assumption that parts of page images which densely contain keywords in a query are relevant to it. The characteristics of the proposed method are as follows: (1) Two-dimensional density distributions of keywords are calculated for ranking parts of page images, (2) The method relies only on the distribution of characters so as not to be affected by the errors of layout analysis. Based on the experimental results of retrieving Japanese newspaper articles, we have shown that the proposed method is superior to a method without the function of dealing with parts, and sometimes equivalent to a method of electronic document retrieval that works on error-free text.

Koichi Kise, Masaaki Tsujino, Keinosuke Matsumoto
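
The keyword-density idea can be sketched as a smoothed impulse map, assuming SciPy and NumPy; the bounding boxes of detected keyword occurrences and the smoothing bandwidth are assumptions of this illustration rather than the paper's exact formulation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def keyword_density_map(page_shape, keyword_boxes, sigma=40.0):
    """Two-dimensional keyword density over a page image.

    Each detected query-keyword occurrence contributes a unit impulse at its
    bounding-box centre; Gaussian smoothing turns the impulses into a density
    whose peaks indicate the page regions worth reading.
    """
    density = np.zeros(page_shape, dtype=float)
    for x0, y0, x1, y1 in keyword_boxes:
        cy, cx = int((y0 + y1) / 2), int((x0 + x1) / 2)
        if 0 <= cy < page_shape[0] and 0 <= cx < page_shape[1]:
            density[cy, cx] += 1.0
    return gaussian_filter(density, sigma=sigma)
```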
Mining Documents for Complex Semantic Relations by the Use of Context Classification

Causal relations symbolize one of the most important document organization and knowledge representation principles. Consequently, the identification of cause-effect chains for later evaluation represents a valuable document analysis task. This work introduces a prototype implementation of a causal relation management and evaluation system which functions as a framework for mining documents for causal relations. The central part describes a new approach of classifying passages of documents as relevant considering the causal relations under inspection. The "Context Classification by Distance-Weighted Relevance Feedback" method combines passage retrieval and relevance feedback techniques and extends both of them with regard to the local contextual nature of causal relations. A wide range of parameter settings is evaluated in various experiments and the results are discussed on the basis of recall-precision figures. It is shown that the trained context classifier represents a good means for identifying relevant passages not only for already seen causal relations but also for new ones.

Andreas Schmidt, Markus Junker
Hairetes: A Search Engine for OCR Documents

In this paper, we report on the architecture and preliminary implementation of our search engine, Hairetes. This engine is based on an extended concept of Retrieval by General Logical Imaging (RbGLI). In this extension, word similarity measures are computed by EMIM and Bayes’ theorem.

Kazem Taghva, Jeffrey Coombs
Text Verification in an Automated System for the Extraction of Bibliographic Data

An essential stage in any text extraction system is the manual verification of the printed material converted by OCR. This proves to be the most labor-intensive step in the process. In a system built and deployed at the National Library of Medicine to automatically extract bibliographic data from scanned biomedical journals, alternative means were considered to validate the text. This paper describes two approaches and gives preliminary performance data.

George R. Thoma, Glenn Ford, Daniel Le, Zhirong Li

Document Engineering

smartFIX: A Requirements-Driven System for Document Analysis and Understanding

Although the internet offers a wide-spread platform for information interchange, day-to-day work in large companies still means the processing of tens of thousands of printed documents every day. This paper presents smartFIX, a document analysis and understanding system developed by the DFKI spin-off INSIDERS. It permits the processing of documents ranging from fixed-format forms to unstructured letters of any format. Apart from the architecture, the main components and system characteristics, we also show some results of applying smartFIX to medical bills and prescriptions.

Andreas R. Dengel, Bertin Klein
Machine Learning of Generalized Document Templates for Data Extraction

The purpose of this research is to reverse engineer the process of encoding data in structured documents and subsequently automate the process of extracting it. We assume a broad category of structured documents for processing that goes beyond form processing. In fact, the documents may have flexible layouts and consist of multiple and varying numbers of pages. The data extraction method (DataX) employs general templates generated by the Inductive Template Generator (InTeGen). The InTeGen method utilizes inductive learning from examples of documents with identified data elements. Both methods achieve high automation with minimal user input.

Janusz Wnek
Configuration REcognition Model for Complex Reverse Engineering Methods: 2(CREM)

This paper describes 2(CREM), a recognition method to be applied on documents with complex structures allowing incremental learning in an interactive environment. The classification is driven by a model, which contains a static as well as a dynamic part and evolves by use. The first prototype of 2(CREM) has been tested on four different phases of newspaper image analysis: line segment recognition, frame recognition, line merging into blocks, and logical labeling. Some promising experimental results are reported.

Karim Hadjar, Oliver Hitz, Lyse Robadey, Rolf Ingold
Electronic Document Publishing Using DjVu

Online access to complex compound documents with client-side search and browsing capability is one of the key requirements of effective content management. "DjVu" (Déjà Vu) is a highly efficient document image compression methodology, a file format, and a delivery platform that, when considered together, have been shown to effectively address these issues [1]. Originally developed for scanned color documents, the DjVu technology was recently expanded to electronic documents. The small file sizes and very efficient document browsing make DjVu a compelling alternative to document interchange formats such as PostScript or PDF. In addition, DjVu offers a uniform viewing experience for electronic or scanned original documents, on any platform, over any connection speed, which is ideal for digital libraries and electronic publishing. This paper describes the basics of DjVu encoding, with emphasis on the particular challenges posed by electronic sources. The DjVu Virtual Printer Driver we implemented as a "Universal DjVu Converter" is then introduced. Basic performance statistics are given, and enterprise workflow applications of this technology are highlighted.

Artem Mikheev, Luc Vincent, Mike Hawrylycz, Léon Bottou
DAN: An Automatic Segmentation and Classification Engine for Paper Documents

Recognition of paper documents is fundamental for office automation, becoming every day a more powerful tool in those fields where information is still on paper. Document recognition starts from data acquisition, from both journals and entire books, in order to transform them into digital objects. We present DAN (Document Analysis on Network), a new system for document recognition that follows Open Source methodologies and uses XML descriptions for document segmentation and classification, which turns out to be beneficial in terms of classification precision and general-purpose availability.

L. Cinque, S. Levialdi, A. Malizia, F. De Rosa
Document Reverse Engineering: From Paper to XML

Since XML has the advantage of embedding logical structure information into documents, it is widely used as the universal format for structured documents on the Web. This makes it attractive to automatically convert paper-based documents with a logical hierarchy into XML representations. Document image analysis and understanding [1] consists of two phases: geometric and logical structure analysis. Because the two phases take different kinds of data as input, it may not be desirable to apply the same method to both. Targeting technical journal documents with multiple pages, we present a hybridization of knowledge-based and syntactic methods for the geometric and logical structure analysis of document images.

Kyong-Ho Lee, Yoon-Chul Choy, Sung-Bae Cho, Xiao Tang, Victor McCrary

New Applications

Human Interactive Proofs and Document Image Analysis

The recently initiated and rapidly developing research field of ‘human interactive proofs’ (HIPs) and its implications for the document image analysis (DIA) research field are described. Over the last five years, efforts to defend Web services against abuse by programs (‘bots’) have led to a new family of security protocols able to distinguish between human and machine users. AltaVista pioneered this technology in 1997 [Bro01, LBBB01]. By the summer of 2000, Yahoo! and PayPal were using similar methods. In the Fall of 2000, Prof. Manuel Blum of Carnegie-Mellon University and his team, stimulated by Udi Manber of Yahoo!, were studying these and related problems [BAL00]. Soon thereafter a collaboration between the University of California at Berkeley and the Palo Alto Research Center (PARC) built a tool based on systematically generated image degradations [CBF01]. In January 2002, Prof. Blum and the present authors ran the first workshop (at PARC) on HIPs, defined broadly as a class of challenge/response protocols which allow a human to authenticate herself as a member of a given group - e.g. human (vs. machine), herself (vs. anyone else), an adult (vs. a child), etc. All commercial uses of HIPs known to us exploit the gap in ability between human and machine vision systems in reading images of machine printed text. Many technical issues that have been systematically studied by the DIA community are relevant to the HIP research program. This paper describes the evolution of HIP R&D, applications of HIPs now and on the horizon, highlights of the first HIP workshop, and proposals for a DIA research agenda to advance the state of the art of HIPs.

Henry S. Baird, Kris Popat
Data GroundTruth, Complexity, and Evaluation Measures for Color Document Analysis

Publications on color document image analysis present results on small, non-publicly available datasets. We propose in this paper a well-defined and groundtruthed color dataset consisting of over 1000 pages, with associated tools for evaluation. The color data groundtruthing and evaluation tools are based on a well-defined document model, complexity measures to assess the inherent difficulty of analyzing a page, and well-founded evaluation measures. Together they form a suitable basis for evaluating diverse applications in color document analysis.

Leon Todoran, Marcel Worring, Arnold W. M. Smeulders
Exploiting WWW Resources in Experimental Document Analysis Research

Many large collections of document images are now becoming available online as part of digital library initiatives, fueled by the explosive growth of the World Wide Web. In this paper, we examine protocols and system-related issues that arise in attempting to make use of these new resources, both as a target application (building better search engines) and as a way of overcoming the problem of acquiring ground truth to support experimental document analysis research. We also report on our experiences running two simple tests involving data drawn from one such collection. The potential synergies between document analysis and digital libraries could lead to substantial benefits for both communities.

Daniel Lopresti
An Automated Tachograph Chart Analysis System

This paper describes a new system that analyses tachograph charts. These circular charts are legal records of information on the different types of driver activity (driving, other duty, standby and rest) and vehicle data (speed and distance travelled). As each driver of each passenger and goods vehicle over a certain capacity must use a chart for every 24-hour period, there is a significant need for automated analysis (currently, tachograph charts are analysed manually). The system starts by determining the shape parameters of the chart (location of the centre and radius). The position of the start of the 24-hour period (radius from centre to 24-hour tick mark) is then estimated. Finally the driver activity trace (recorded in a circular manner) is extracted, converted into a linear representation and recognised. Results from the evaluation of the system against professionally prepared ground truth indicate at least 94% accuracy in reading the driving time, even on difficult (scratched and marked) charts.

A. Antonacopoulos, D. P. Kennedy
A Multimodal System for Accessing Driving Directions

The focus of this paper is a system that repurposes a web document through a spoken language interface to provide both visual and audio driving directions. The spoken dialog interface is used to obtain the source and destination addresses from a user. The web document retrieved by querying the user on the addresses is parsed to extract the maps and the associated text. Further, the system automatically generates two sets of web documents. One of these sets is used to render the maps on a hand-held device, and the other set is used for the spoken dialog interface through a traditional phone. The system's user interface allows navigation through both speech and pen stylus input. The system is built on a PhoneBrowser architecture that allows the user to browse the web by speech control over an ordinary telephone.

Ashutosh Morde, Ramanujan S. Kashi, Michael K. Brown, Deborah Grove, James L. Flanagan
Backmatter
Metadata
Title
Document Analysis Systems V
Edited by
Daniel Lopresti
Jianying Hu
Ramanujan Kashi
Copyright year
2002
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-45869-2
Print ISBN
978-3-540-44068-0
DOI
https://doi.org/10.1007/3-540-45869-7