Elsevier

Pattern Recognition Letters

Volume 28, Issue 6, 15 April 2007, Pages 719-727

Using codebooks of fragmented connected-component contours in forensic and historic writer identification

https://doi.org/10.1016/j.patrec.2006.08.005

Abstract

Recent advances in ‘off-line’ writer identification allow for new applications in handwritten text retrieval from archives of scanned historical documents. This paper describes new algorithms for forensic or historical writer identification, using the contours of fragmented connected-components in free-style handwriting. The writer is considered to be characterized by a stochastic pattern generator, producing a family of character fragments (fraglets). Using a codebook of such fraglets from an independent training set, the probability distribution of fraglet contours was computed for an independent test set. Results revealed a high sensitivity of the fraglet histogram in identifying individual writers on the basis of a paragraph of text. Large-scale experiments on the optimal size of Kohonen maps of fraglet contours were performed, showing usable classification rates within a non-critical range of Kohonen map dimensions. The proposed automatic approach bridges the gap between image-statistics approaches and purely knowledge-based manual character-based methods.

Introduction

Writer identification on the basis of optically scanned handwritten samples enjoys a renewed interest (Srihari et al., 2002; Franke and Köppen, 2001; Said et al., 2000; Marti et al., 2001). The goal is to find in a large database a sample of a known writer (author) on the basis of an unknown or questioned handwritten document sample. The target performance for forensic writer-identification systems is a near-100% recall of the correct writer in a hit list of 100 writers, computed from a database on the order of 10^4 samples, the size of search sets in current European forensic databases. Another application which enjoys increased interest is writer verification. Here, the goal is to develop systems which are able to decide whether two handwritten samples are from the same writer or not. In the domain of cultural heritage, writer identification and verification are becoming a realistic tool in information retrieval methods. Additionally, interesting new applications are emerging in this domain. Because the writing style of an individual author evolves over time, attempts are currently being made at dating handwritten samples of a writer whose style evolution may be present in a large scanned archive of samples with a known date of writing (Bensefia et al., 2003). Examples are the scanned collections of manuscripts and letter correspondence by authors such as Zola and Flaubert (Bensefia et al., 2003). The manuscripts in such collections are often annotated in a handwritten script whose author may not be the same person as the main, original author. Here, too, automatic writer identification may act as a useful tool for humanities researchers. Fig. 1 shows a sample from an administrative Dutch collection, with handwriting of one particular scribe.

Clearly, these new applications necessitate the development of powerful shape descriptors of free-style handwriting which are designed to capture individual style information. The problem is complex at a number of levels: (1) the degree of variability and variation of script; (2) the problem of foreground/background segmentation in highly textured and smudged documents; (3) the limited amount of text in unknown samples; (4) the differences in scanning technologies and image preprocessing. As a consequence, in forensic practice, a combination of statistical and knowledge-based techniques is used (Franke and Köppen, 2001). We have developed an ontology and XML format (WandaXML) for the systematic processing of forensic handwritten samples (Wanda, 2004). Elements of systematic style categorization can be entered in such a system to aid in boosting the performance of the pattern classification algorithms. It is to be expected that applications in historical writer identification and verification will similarly require a hybrid approach. In this paper, however, we will mainly focus on recent progress at the level of feature extraction in automatic, image-based (i.e., off-line) methods for writer identification.

Recently, we have proposed the use of connected-component contours (CO3s) and their occurrence histogram, i.e., discrete PDF, as a writer-identification feature (Schomaker and Bulacu, 2004) in upper-case Western handwriting. In this approach, a codebook of CO3s was constructed with a Kohonen self-organized map on the basis of a sufficiently large sample set of upper-case script. The writer is assumed to act as a stochastic generator of ink-blob shapes, such that the probability distribution of shape usage is characteristic of each writer. The performance of this approach is very promising, especially if it is used in conjunction with a complementary feature set which is based on edge-directional histograms which cover yet another aspect of writing style (Bulacu et al., 2003). Fig. 2 shows a number of connected-component contours. Table 1 shows the raw identification rates in a set of 150 writers, on the basis of a paragraph, comparing a basic edge-directional histogram feature (f0) and the proposed contour-based method (f1). Fig. 3 shows an example of an application of the method to upper-case script samples. Comparisons with other methods have been reported (Schomaker and Bulacu, 2004) and the proposed method appears to perform very well.
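The occurrence histogram described above can be illustrated with a minimal sketch. It assumes that each contour has already been resampled to a fixed-length coordinate vector and that the codebook is simply a list of prototype vectors of the same length (for example, the trained nodes of a Kohonen map). The function names are hypothetical and are not taken from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_prototype(vector, codebook):
    """Index of the codebook entry closest to the given contour vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def codebook_histogram(vectors, codebook):
    """Normalized occurrence histogram (discrete PDF) of codebook usage."""
    counts = [0] * len(codebook)
    for vector in vectors:
        counts[nearest_prototype(vector, codebook)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [count / total for count in counts]


# Toy example: three prototypes and four contour vectors (illustrative values).
toy_codebook = [(0, 0), (10, 0), (0, 10)]
toy_vectors = [(1, 1), (9, 1), (0, 9), (1, 0)]
toy_pdf = codebook_histogram(toy_vectors, toy_codebook)
```

In this toy example two of the four vectors fall nearest to the first prototype, so `toy_pdf` is `[0.5, 0.25, 0.25]`. In a setup like the one described, one such histogram per writer sample would serve as the feature vector.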

In spite of these promising results, a problem remains. Large collections of handwritten samples usually contain a mixture of upper case, isolated hand print, connected-cursive and mixed-style script. Therefore, it would be most convenient if the CO3 codebook approach could be generalized from upper-case style to free-style handwriting. However, isolated connected components (ink blobs) in upper-case handwriting are large in number but limited in complexity when compared to connected components which are present in cursive and mixed-style scripts. For cursive-script images, the construction of a CO3 codebook by a Kohonen self-organizing map would amount to the storage of complete word and syllable patterns. This is undesirable from the point of view of writer identification, since the text content is a confounding factor. It seems clear that a robust segmentation into small ink objects is needed, yielding a compound writing-style characterization similar to the successful case of the upper-case CO3 PDF as a writer feature.

Thus, the main goal of the current paper is to test whether a heuristic fragmentation of connected components in cursive and mixed-style script will allow for the construction of a PDF of fragmented connected-component contours (FCO3) such that, in free-style script, reliable writer identification is possible with a performance similar to that measured in the case of upper-case script samples. Furthermore, we will explore the codebook size parameter and the sensitivity of the approach to the number of reference writers in the comparison set, given a sample of unknown writer identity. Finally, we will also address the issue of small script samples and propose a method to improve writer-identification reliability.

It is useful to make a distinction between four factors which cause variability in handwriting (Schomaker, 1998, Schomaker and Bulacu, 2004): affine transforms; neuro-biomechanical variability; sequencing variability and allographic variation. The fourth factor, allographic variation, refers to the phenomenon of writer-specific character shapes, which produces most of the problems in automatic script recognition but at the same time provides useful information for automatic writer identification. In this paper, we will show how writer-specific allographic shape variation present in handwritten Western script allows for effective writer identification. A more thorough description of the rationale behind the approach is given in (Schomaker and Bulacu, 2004) (see Fig. 4).

It is assumed that each writer produces a recognizable set of allographs, due to schooling and personal preferences. This implies that a histogram of used allographs would characterize each writer, and given a sufficient number of allographs in a text, such a histogram of allographic usage could function as a feature vector in writer identification. However, there exists no exhaustive and world-wide accepted list of allographs in Western handwriting. The problem then, is to generate automatically a codebook, which sufficiently captures allographic information in samples of handwriting, given a histogram of the usage of its elements. Since automatic segmentation into characters is an unsolved problem, we would need, additionally, a reliable method to segment handwritten samples to yield components for such a codebook. It was demonstrated that the use of the shape of connected components of upper-case Western handwriting (i.e., not using allographs but the contours of their constituting connected components) as the basis for codebook construction can yield high writer-identification performance. On the basis of these results in writer identification on upper-case handwriting, the natural step is to explore the possibilities of the approach in free, connected-cursive styles. Here, the connected components may encompass several characters or syllables. Therefore, a fragmentation of the ink trace would be necessary, yielding broken connected components (fraglets), the ensemble of which still captures the shape details of the allographs emitted by the writer. Fortunately there are several heuristics which might deliver the proper fragmentation of connected components. An example of a possible method (“SegM”, segment on Y-minima) is based on segmentation at each vertical lower-contour minimum which is one ink-trace width away from a corresponding vertical minimum in the upper contour of the connected component under scrutiny. 
A similar method of segmentation is known to be useful in the text recognition of connected-cursive script (Bensefia et al., 2003, El-Yacoubi et al., 1999). In our case, for each vertical minimum in the lower contour, the nearest minimum in the upper contour is searched. If the path between these minima has a length in the order of the ink-trace width and covers a minimum amount of black (ink) pixels, a cut is generated in the trace such that the connected component may be fragmented (Fig. 5). The resulting fraglets will usually be of character size or smaller. Sometimes a fraglet will contain more than one letter. Other methods are possible, such as fragmentation at points of strong directional change (Franke et al., 2002). However, in this study we will focus on a fragmentation based on spatial minima to find out whether the resulting sub-allographic fraglets might be as usable for writer identification on the basis of free-style handwriting as the unbroken connected-components are in the case of upper-case script (Schomaker and Bulacu, 2004).
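The minima-based fragmentation heuristic can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the connected component is given as a cropped binary grid of strings with `#` for ink, the upper and lower contours are approximated by the topmost and bottommost ink pixel per column (overhangs are ignored), the path between two minima is taken to be a straight line, and the check on the minimum amount of ink along the path is omitted. The function names and the `tolerance` parameter are hypothetical.

```python
from math import hypot


def column_profiles(component):
    """Per-column upper and lower contour heights, with y measured upward."""
    height = len(component)
    width = len(component[0])
    upper, lower = [], []
    for x in range(width):
        ink_rows = [y for y in range(height) if component[y][x] == "#"]
        if not ink_rows:
            raise ValueError("every column of a cropped component must contain ink")
        upper.append(height - 1 - min(ink_rows))
        lower.append(height - 1 - max(ink_rows))
    return upper, lower


def local_minima(profile):
    """Interior columns where the profile stops descending (left edge of a plateau)."""
    return [x for x in range(1, len(profile) - 1)
            if profile[x] < profile[x - 1] and profile[x] <= profile[x + 1]]


def candidate_cuts(component, ink_width, tolerance=1.5):
    """Pairs (x_lower, x_upper) of contour minima close enough to cut between."""
    upper, lower = column_profiles(component)
    upper_minima = local_minima(upper)

    def path_length(x_low, x_up):
        # Straight-line length through the ink, counting the end pixel itself.
        return hypot(x_up - x_low, upper[x_up] - lower[x_low] + 1)

    cuts = []
    for x_low in local_minima(lower):
        if not upper_minima:
            break
        x_up = min(upper_minima, key=lambda x: path_length(x_low, x))
        if path_length(x_low, x_up) <= tolerance * ink_width:
            cuts.append((x_low, x_up))
    return cuts


# Toy U-shaped stroke, two pixels thick at the sides and thin at the bottom.
toy_component = [
    "#.....#",
    "#.....#",
    "##...##",
    ".##.##.",
    "..###..",
]
toy_cuts = candidate_cuts(toy_component, ink_width=2)
```

For the toy stroke, the lower contour reaches its minimum at column 2 and the upper contour at column 3, and the path between them is shorter than the allowed length, so a single cut candidate `(2, 3)` is reported. A production version would trace the true contour and verify the ink coverage along the cut path, as the text describes.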

Data

The Firemaker set is a database of handwritten pages of 250 writers, four pages per writer: Page1 contains a Copied text in natural writing style; Page2 contains copied Upper-case text; Page3 contains copied Forged text (“please write as if to impersonate another person”) while Page4 contains a self-generated description of a cartoon image in Free writing style. The text content

Results

As regards nearest-neighbor search, we report results for the Hamming distance only: the Chi-square distance function (Schomaker and Bulacu, 2004) produced similar results, while the Euclidean, Bhattacharyya and Minkowski3 distances performed much worse. Fig. 7 shows the Top-1 writer-identification performance as a function of the Kohonen self-organized map dimensions. Each point represents between 7 h (2 × 2 network) and 122 h (50 × 50 network) of training time. However, training is an infrequent processing
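A nearest-neighbor search over fraglet histograms might look like the sketch below. The snippet above does not define its distance functions, so two assumptions are made here and should be checked against the full text: "Hamming distance" between two histograms is taken to be the sum of absolute bin differences, and the Chi-square distance uses the common symmetric form with the bin sum in the denominator. The function names and the dictionary layout of the reference set are hypothetical.

```python
def hamming_distance(p, q):
    """Assumed definition: sum of absolute differences between histogram bins."""
    return sum(abs(a - b) for a, b in zip(p, q))


def chi_square_distance(p, q):
    """Assumed symmetric form: sum of (a - b)^2 / (a + b) over non-empty bins."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def rank_writers(query, references, distance=hamming_distance):
    """Writer ids sorted from most to least similar to the query histogram."""
    return sorted(references, key=lambda writer: distance(query, references[writer]))


# Toy reference set: one histogram per known writer (illustrative values).
toy_references = {
    "w1": [0.5, 0.25, 0.25],
    "w2": [0.1, 0.1, 0.8],
    "w3": [0.4, 0.4, 0.2],
}
toy_query = [0.5, 0.3, 0.2]
toy_hit_list = rank_writers(toy_query, toy_references)
```

The first entry of the hit list corresponds to a Top-1 decision, and truncating the list to its first k entries gives a Top-k hit list of the kind used in forensic practice.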

Discussion

Results indicate that the use of fragmented connected-component contour shapes in writer identification on the basis of mixed-style script yields valuable results. We think that the reason for this resides in the fact that writing style is largely determined by allographic shape variations. Small style elements which are present within a character are the result of the writer’s physiological make-up as well as education and personal preference. Experiences on style variation in on-line handwriting

Conclusion

We have presented an overview of recently developed methods which use a connected-component contour codebook for the characterization of a writer of mixed-style Western letters. The use of the fragmented connected-component contour (FCO3) codebook and its histogram of usage has a number of advantages. No detailed manual measurement of text details is necessary, which is an advantage over interactive methods in forensic feature determination. This convenience can be exploited in the case of

References (20)

  • Schomaker, L.R.B., 1993. Using stroke- or character-based self-organizing maps in the recognition of on-line, connected-cursive script. Pattern Recognition.
  • Bensefia, A., et al. Information retrieval-based writer identification.
  • Bulacu, M., et al. Writer identification using edge-based directional features.
  • Bulacu, M., Schomaker, L., 2003. Writer style from oriented edge fragments. In: Proc. 10th Internat. Conf. on Computer...
  • Davis, G., et al., 1998. Wavelet-based image coding: an overview. Applied and Computational Control, Signals, and Circuits.
  • El-Yacoubi, A., et al., 1999. An HMM-based approach for off-line unconstrained handwritten word modeling and recognition. IEEE Trans. Pattern Anal. Machine Intell.
  • Franke, K., et al., 2001. A computer-based system to support forensic studies on handwritten documents. Internat. J. Document Anal. Recognition.
  • Franke, K., et al. Static signature verification employing a Kosko-Neuro-Fuzzy approach.
  • Guyon, I., Schomaker, L., Plamondon, R., Liberman, R., Janet, S., 1994. Unipen project of on-line data exchange and...
  • Kohonen, T., 1988. Self-Organization and Associative Memory.
There are more references available in the full text version of this article.
