Using codebooks of fragmented connected-component contours in forensic and historic writer identification

doi:10.1016/j.patrec.2006.08.005

Pattern Recognition Letters

Volume 28, Issue 6, 15 April 2007, Pages 719-727

https://doi.org/10.1016/j.patrec.2006.08.005

Abstract

Recent advances in ‘off-line’ writer identification allow for new applications in handwritten text retrieval from archives of scanned historical documents. This paper describes new algorithms for forensic or historical writer identification, using the contours of fragmented connected-components in free-style handwriting. The writer is considered to be characterized by a stochastic pattern generator, producing a family of character fragments (fraglets). Using a codebook of such fraglets from an independent training set, the probability distribution of fraglet contours was computed for an independent test set. Results revealed a high sensitivity of the fraglet histogram in identifying individual writers on the basis of a paragraph of text. Large-scale experiments on the optimal size of Kohonen maps of fraglet contours were performed, showing usable classification rates within a non-critical range of Kohonen map dimensions. The proposed automatic approach bridges the gap between image-statistics approaches and purely knowledge-based manual character-based methods.

Introduction

Writer identification on the basis of optically scanned handwritten samples enjoys a renewed interest (Srihari et al., 2002, Franke and Köppen, 2001, Said et al., 2000, Marti et al., 2001). The goal is to find in a large database a sample of a known writer (author) on the basis of an unknown or questioned handwritten document sample. The target performance for forensic writer-identification systems is a near-100% recall of the correct writer in a hit list of 100 writers, computed from a database on the order of 10⁴ samples, the size of search sets in current European forensic databases. Another application which enjoys increased interest is writer verification. Here, the goal is to develop systems which are able to decide whether two handwritten samples are from the same writer or not. In the domain of cultural heritage, writer identification and verification are becoming a realistic tool in information retrieval methods. Additionally, interesting new applications are emerging in this domain. Because the writing style of an individual author evolves over time, attempts are currently being made at dating handwritten samples of a writer whose style evolution may be present in a large scanned archive of samples with a known date of writing (Bensefia et al., 2003). Examples are the scanned collections of manuscripts and letter correspondence by authors such as Zola and Flaubert (Bensefia et al., 2003). The manuscripts in such collections are often annotated in a handwritten script whose author may not be the same person as the main, original author. Here too, automatic writer identification may act as a useful tool for humanities researchers. Fig. 1 shows a sample from an administrative Dutch collection, with handwriting of one particular scribe.

Clearly, these new applications necessitate the development of powerful shape descriptors of free-style handwriting which are designed to capture individual style information. The problem is complex at a number of levels: (1) the degree of variability and variation of script; (2) the problem of foreground/background segmentation in highly textured and smudged documents; (3) the limited amount of text in unknown samples; (4) the differences in scanning technologies and image preprocessing. As a consequence, in forensic practice, a combination of statistical and knowledge-based techniques is used (Franke and Köppen, 2001). We have developed an ontology and XML format (WandaXML) for the systematic processing of forensic handwritten samples (Wanda, 2004). Elements of systematic style categorization can be entered in such a system to aid in boosting the performance of the pattern classification algorithms. It is to be expected that applications in historical writer identification and verification will similarly require a hybrid approach. In this paper, however, we will mainly focus on recent progress at the level of feature extraction in automatic, image-based (i.e., off-line) methods for writer identification.

Recently, we have proposed the use of connected-component contours (CO³s) and their occurrence histogram, i.e., discrete PDF, as a writer-identification feature (Schomaker and Bulacu, 2004) in upper-case Western handwriting. In this approach, a codebook of CO³s was constructed with a Kohonen self-organized map on the basis of a sufficiently large sample set of upper-case script. The writer is assumed to act as a stochastic generator of ink-blob shapes, such that the probability distribution of shape usage is characteristic of each writer. The performance of this approach is very promising, especially if it is used in conjunction with a complementary feature set which is based on edge-directional histograms which cover yet another aspect of writing style (Bulacu et al., 2003). Fig. 2 shows a number of connected-component contours. Table 1 shows the raw identification rates in a set of 150 writers, on the basis of a paragraph, comparing a basic edge-directional histogram feature (f0) and the proposed contour-based method (f1). Fig. 3 shows an example of an application of the method to upper-case script samples. Comparisons with other methods have been reported (Schomaker and Bulacu, 2004) and the proposed method appears to perform very well.
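The codebook-and-histogram idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that every contour has already been converted to a fixed-length numeric vector (the conversion itself is not shown), and the function names `train_som` and `codebook_histogram`, the learning-rate schedule and the Gaussian neighbourhood are illustrative choices only.

```python
import numpy as np


def train_som(samples, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a small Kohonen map; returns a (rows * cols, d) codebook.

    Hypothetical helper: the schedule and neighbourhood are assumptions.
    """
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    # Initialise every node with a randomly chosen training sample.
    picks = rng.integers(0, len(samples), size=rows * cols)
    codebook = samples[picks].copy()
    # Grid coordinates of each node, used by the neighbourhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)],
                    dtype=float)
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)    # shrinking neighbourhood
        for x in samples[rng.permutation(len(samples))]:
            # Best-matching unit: the node closest to the sample.
            bmu = np.argmin(((codebook - x) ** 2).sum(axis=1))
            grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            influence = np.exp(-grid_d2 / (2.0 * radius ** 2))
            codebook += lr * influence[:, None] * (x - codebook)
    return codebook


def codebook_histogram(contours, codebook):
    """Discrete PDF of codebook usage for one handwriting sample."""
    contours = np.asarray(contours, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance from every contour to every codebook entry.
    d2 = ((contours[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return counts / counts.sum()
```

In this sketch the codebook would be trained once on contours from an independent set of writers, and `codebook_histogram` would then be computed per sample, so that each writer is described by one normalised usage vector.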

In spite of these promising results, a problem remains. Large collections of handwritten samples usually contain a mixture of upper case, isolated hand print, connected-cursive and mixed-style script. Therefore, it would be most convenient if the CO³ codebook approach could be generalized from upper-case style to free-style handwriting. However, isolated connected components (ink blobs) in upper-case handwriting are large in number but limited in complexity when compared to connected components which are present in cursive and mixed-style scripts. For cursive-script images, the construction of a CO³ codebook by a Kohonen self-organizing map would amount to the storage of complete word and syllable patterns. This is undesirable from the point of view of writer identification, since the text content is a confounding factor. It seems clear that a robust segmentation into small ink objects is needed, yielding a compound writing-style characterization similar to the successful case of the upper-case CO³ PDF as a writer feature.

Thus, the main goal of the current paper is to test whether a heuristic fragmentation of connected components in cursive and mixed-style script allows for the construction of a PDF of fragmented connected-component contours (FCO³) such that reliable writer identification is possible in free-style script, with performance similar to that measured for upper-case script samples. Furthermore, we will explore the codebook size parameter and the sensitivity of the approach to the number of reference writers in the comparison set, given a sample of unknown writer identity. Finally, we will also address the issue of small script samples and propose a method to improve writer-identification reliability.

It is useful to make a distinction between four factors which cause variability in handwriting (Schomaker, 1998, Schomaker and Bulacu, 2004): affine transforms; neuro-biomechanical variability; sequencing variability and allographic variation. The fourth factor, allographic variation, refers to the phenomenon of writer-specific character shapes, which produces most of the problems in automatic script recognition but at the same time provides useful information for automatic writer identification. In this paper, we will show how writer-specific allographic shape variation present in handwritten Western script allows for effective writer identification. A more thorough description of the rationale behind the approach is given in (Schomaker and Bulacu, 2004) (see Fig. 4).

It is assumed that each writer produces a recognizable set of allographs, due to schooling and personal preferences. This implies that a histogram of used allographs would characterize each writer, and given a sufficient number of allographs in a text, such a histogram of allographic usage could function as a feature vector in writer identification. However, there exists no exhaustive and world-wide accepted list of allographs in Western handwriting. The problem, then, is to automatically generate a codebook which sufficiently captures allographic information in samples of handwriting, given a histogram of the usage of its elements. Since automatic segmentation into characters is an unsolved problem, we would need, additionally, a reliable method to segment handwritten samples to yield components for such a codebook. It was demonstrated that the use of the shape of connected components of upper-case Western handwriting (i.e., not using allographs but the contours of their constituting connected components) as the basis for codebook construction can yield high writer-identification performance. On the basis of these results in writer identification on upper-case handwriting, the natural step is to explore the possibilities of the approach in free, connected-cursive styles. Here, the connected components may encompass several characters or syllables. Therefore, a fragmentation of the ink trace would be necessary, yielding broken connected components (fraglets), the ensemble of which still captures the shape details of the allographs emitted by the writer. Fortunately, there are several heuristics which might deliver the proper fragmentation of connected components. An example of a possible method (“SegM”, segment on Y-minima) is based on segmentation at each vertical lower-contour minimum which is one ink-trace width away from a corresponding vertical minimum in the upper contour of the connected component under scrutiny. A similar method of segmentation is known to be useful in the text recognition of connected-cursive script (Bensefia et al., 2003, El-Yacoubi et al., 1999). In our case, for each vertical minimum in the lower contour, the nearest minimum in the upper contour is searched. If the path between these minima has a length in the order of the ink-trace width and covers a minimum amount of black (ink) pixels, a cut is generated in the trace such that the connected component may be fragmented (Fig. 5). The resulting fraglets will usually be of character size or smaller. Sometimes a fraglet will contain more than one letter. Other methods are possible, such as fragmentation at points of strong directional change (Franke et al., 2002). However, in this study we will focus on a fragmentation based on spatial minima to find out whether the resulting sub-allographic fraglets might be as usable for writer identification on the basis of free-style handwriting as the unbroken connected-components are in the case of upper-case script (Schomaker and Bulacu, 2004).
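A rough Python sketch of a minima-based fragmentation is given here for illustration only. It is a strong simplification and not the SegM procedure itself. It assumes the connected component is a binary NumPy array (True for ink), approximates the upper and lower contours by the top-most and bottom-most ink pixel in each column, and cuts with a straight vertical slice at columns where the upper contour dips into a valley and the ink is roughly one trace width thick. The path search between a lower-contour minimum and the nearest upper-contour minimum described in the text is not reproduced. The names `column_profiles`, `cut_columns` and `split_fraglets` and the `tolerance` parameter are hypothetical.

```python
import numpy as np


def column_profiles(ink):
    """Top-most and bottom-most ink row per column (-1 for empty columns)."""
    ink = np.asarray(ink, dtype=bool)
    height = ink.shape[0]
    has_ink = ink.any(axis=0)
    top = np.where(has_ink, np.argmax(ink, axis=0), -1)
    bottom = np.where(has_ink, height - 1 - np.argmax(ink[::-1], axis=0), -1)
    return top, bottom


def cut_columns(ink, trace_width, tolerance=1.5):
    """Columns where a thin connecting stroke suggests a cut (simplified)."""
    top, bottom = column_profiles(ink)
    cuts = []
    for x in range(1, len(top) - 1):
        if top[x - 1] < 0 or top[x] < 0 or top[x + 1] < 0:
            continue  # skip columns next to gaps
        # Valley of the upper contour: larger row index means lower on the page.
        is_valley = (top[x] >= top[x - 1] and top[x] >= top[x + 1]
                     and (top[x] > top[x - 1] or top[x] > top[x + 1]))
        thickness = bottom[x] - top[x] + 1
        if is_valley and thickness <= tolerance * trace_width:
            cuts.append(x)
    return cuts


def split_fraglets(ink, cuts):
    """Split the component image into vertical slices at the cut columns."""
    return np.split(np.asarray(ink, dtype=bool), cuts, axis=1)
```

On a toy image of two tall strokes joined by a thin V-shaped ligature, `cut_columns` would flag the bottom of the ligature and `split_fraglets` would return one slice per stroke. Real script needs the contour-following and ink-coverage checks described in the text, which this sketch omits.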

Section snippets

Data

The Firemaker¹ set is a database of handwritten pages of 250 writers, four pages per writer: Page1 contains a Copied text in natural writing style; Page2 contains copied Upper-case text; Page3 contains copied Forged text (“please write as if to impersonate another person”), while Page4 contains a self-generated description of a cartoon image in Free writing style. The text content

Results

As regards nearest-neighbor search, we will report the results on the Hamming distance only: use of the Chi-square distance function (Schomaker and Bulacu, 2004) produced similar results, while Euclidean, Bhattacharyya and Minkowski₃ distances performed much worse. Fig. 7 shows the Top-1 writer-identification performance as a function of Kohonen self-organized map dimensions. Each point represents between 7 h (2 × 2 network) and 122 h (50 × 50 network) of training time. However, training is an infrequent processing
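The nearest-neighbor search over codebook histograms can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code. The snippet above does not define the distance functions, so the "Hamming" distance is read here as the sum of absolute bin differences and the Chi-square distance as the sum of (p - q)² / (p + q) over bins. Both readings, the function names and the `top_n` parameter are assumptions.

```python
import numpy as np


def abs_diff_distance(p, q):
    """Sum of absolute bin differences (assumed reading of 'Hamming')."""
    return float(np.abs(np.asarray(p) - np.asarray(q)).sum())


def chi_square_distance(p, q, eps=1e-12):
    """Chi-square distance between two histograms (assumed form)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # eps avoids division by zero for bins that are empty in both histograms.
    return float(((p - q) ** 2 / (p + q + eps)).sum())


def hit_list(query, references, distance, top_n=10):
    """Rank reference writers by distance to the query histogram."""
    scored = sorted((distance(query, hist), writer)
                    for writer, hist in references.items())
    return [writer for _, writer in scored[:top_n]]
```

With `top_n=1` the function gives the Top-1 decision. A longer list corresponds to the hit list that a forensic examiner would inspect manually.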

Discussion

Results indicate that the use of fragmented connected-component contour shapes in writer identification on the basis of mixed-style script yields valuable results. We think that the reason for this resides in the fact that writing style is largely determined by allographic shape variations. Small style elements which are present within a character are the result of the writer’s physiological makeup as well as education and personal preference. Experiences on style variation in on-line handwriting

Conclusion

We have presented an overview of recently developed methods which use a connected-component contour codebook for the characterization of a writer of mixed-style Western letters. The use of the fragmented connected-component contour (FCO³) codebook and its histogram of usage has a number of advantages. No detailed manual measuring on text details is necessary, representing an advantage over interactive methods in forensic feature determination. This convenience can be exploited in the case of

References (20)

L.R.B. Schomaker. Using stroke- or character-based self-organizing maps in the recognition of on-line, connected-cursive script. Pattern Recognition (1993).
A. Bensefia et al. Information retrieval-based writer identification.
M. Bulacu et al. Writer identification using edge-based directional features.
Bulacu, M., Schomaker, L., 2003. Writer style from oriented edge fragments. In: Proc. 10th Internat. Conf. on Computer...
G. Davis et al. Wavelet-based image coding: an overview. Applied and Computational Control, Signals, and Circuits (1998).
A. El-Yacoubi et al. An HMM-based approach for off-line unconstrained handwritten word modeling and recognition. IEEE Trans. Pattern Anal. Machine Intell. (1999).
K. Franke et al. A computer-based system to support forensic studies on handwritten documents. Internat. J. Document Anal. Recognition (2001).
K. Franke et al. Static signature verification employing a Kosko-Neuro-Fuzzy approach.
Guyon, I., Schomaker, L., Plamondon, R., Liberman, R., Janet, S., 1994. Unipen project of on-line data exchange and...
T. Kohonen. Self-Organization and Associative Memory (1988).

There are more references available in the full text version of this article.

