Elsevier

Pattern Recognition

Volume 40, Issue 6, June 2007, Pages 1825-1839
Text line extraction from multi-skewed handwritten documents

https://doi.org/10.1016/j.patcog.2006.10.002

Abstract

A novel text line extraction technique is presented for multi-skewed document images of handwritten English or Bengali text. It assumes that hypothetical water flows, from both the left and right sides of the image frame, are obstructed by the characters of text lines. The stripes of area left unwetted on the image frame are finally labelled for extraction of text lines. The success rates of the technique, as observed experimentally, are 90.34% and 91.44% for handwritten Bengali and English document images, respectively. The work may contribute significantly to the development of applications related to optical character recognition of Bengali/English text.

Introduction

Text line extraction from optically scanned document images is one of the major problems of optical character recognition (OCR) of printed/handwritten text. The appearance of skewed lines in the text complicates the problem, and it is compounded further if the lines in a text image are skewed with different orientations. Such lines are called multi-skewed lines. Multi-skewed lines are common in images of both printed and handwritten texts for various reasons.

Lines in a text image get skewed mainly for two reasons. Firstly, a few degrees of misalignment of the document with respect to the scanner or copier bed is unavoidable at the time of scanning. This makes all text lines in the document image uniformly skewed, as illustrated with a sample text image in Fig. 1(a). Secondly, text lines in the original document may be skewed differently because of either an individual's handwriting style or some special design choice. Images of such documents always consist of multi-skewed lines. Figs. 1(b)–(d) and 2(b) show four sample images of multi-skewed lines. In one of these images, shown in Fig. 1(b), the skewness of each text line is different from that of the others. For the two images shown in Fig. 1(c)–(d), the skewness of one part of each individual text line differs from that of other parts of the same line. The text line extraction technique to be presented here can deal with all the sorts of skewness described above.

The problem of text line extraction from optically scanned document images is simple under ideal conditions. In such a situation, document images contain unskewed text lines, i.e., all the text lines therein are parallel to some edge of the image frame. Text lines can be easily extracted from these images just by identifying the valleys of the horizontal pixel density histograms of the text lines, as shown in Fig. 2(a). But this technique fails for document images with skewed lines, i.e., for all practical situations. One such document image is shown in Fig. 2(b). A straightforward solution in such a situation is to perform skew correction first and then extract lines with the horizontal pixel density histograms of the document images. But this does not work with complex cases of skewness of text lines. Special techniques are necessary to deal with these cases. In some of these techniques, skewed text lines are first extracted from document images, after which correcting the skew of the extracted lines becomes a trivial problem.
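The histogram-valley approach for the ideal, unskewed case can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name, the list-of-lists image representation (1 for ink, 0 for background) and the `min_ink` valley threshold are assumptions made for illustration only.

```python
def extract_lines_by_projection(image, min_ink=1):
    """Return (top, bottom) row ranges of text lines in an unskewed binary image.

    image:   list of rows, each a list of 0 (background) / 1 (ink) pixels.
    min_ink: hypothetical threshold; rows with fewer ink pixels are valleys.
    """
    # Horizontal pixel density histogram: number of ink pixels in each row.
    histogram = [sum(row) for row in image]

    lines = []
    start = None
    for y, count in enumerate(histogram):
        if count >= min_ink and start is None:
            start = y                      # entering a text line
        elif count < min_ink and start is not None:
            lines.append((start, y - 1))   # a valley ends the current line
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(image) - 1))
    return lines


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
]
print(extract_lines_by_projection(page))  # prints [(1, 2), (5, 6)]
```

Because the histogram is accumulated along whole rows, a skewed line spreads its ink over many rows and fills the valleys, which is why the approach breaks down outside the ideal case.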

Solutions so far devised for the text line extraction problem can be grouped into three categories. Each category targets a certain kind of skewness of the text lines in document images for a particular script, either handwritten or printed. The first category deals with uniformly skewed text lines in an image of a document page, similar to the one shown in Fig. 1(a). The second category deals with the kind of skewness shown in Fig. 1(b). The third category deals with complex skewness of text lines, similar to that shown in Fig. 1(c), (d).

For text images of printed Roman script, most skew correction and line extraction techniques deal with a single skew angle for an entire document page [1], [2], [3]. For handwritten text of Roman script, the techniques described in Refs. [4], [5], [6] all determine a skew angle from a page of text lines on the basis of the base lines of the text words before skew correction. To extract text lines and words from document images of handwritten English text lines, the work described in Ref. [7] uses horizontal and vertical histogram values of the images.

A work is described in Ref. [8] for performing skew correction on document images of printed Bengali and Devnagri scripts in which all text lines are equally skewed. For each image of such a document page, it computes a skew angle so that the document image can subsequently be rotated by that angle in the appropriate direction for the necessary skew correction. The technique is based on identification of digital straight line segments (DSLs) along the ‘shirorekha’ or ‘Matra’ of the scripts and then computation of the skew angle of the document image from the inclinations of the DSLs. The shirorekha is an important feature of Bengali text. It is also known as the ‘Matra’ or head line. A ‘Matra’ is a horizontal line touching the upper parts of most of the characters of Bengali script. ‘Matras’ of consecutive characters in a word are joined to form a common ‘Matra’ of the word.

The concept of skew angle computation on the basis of DSLs has been extended in Ref. [9] to deal with images of printed Bengali text lines with different orientations. In this work, the upper envelope of each printed text line is used for identifying DSL segments. DSL segments so identified are then clustered on the basis of normal distances of all DSL segments from the longest one. All the DSL segments, which are grouped into a single cluster, represent the parts of a single text line, to be extracted from a document image of differently skewed text lines.

A work presented in Ref. [10] involves cut text minimization (CTM) for segmentation of text lines from handwritten English text documents. In doing so, an optimization technique is applied which varies the cutting angle and start location to minimize the text pixels cut while tracking between two text lines.

As a part of a digitization project of cultural heritage manuscripts, a production system [11] is applied for text line segmentation in handwritten textual documents of English script. In this work, a textual document is considered as a set of objects with spatial relations between them. These objects represent text lines and connected components in the document. Since a graph is a natural choice for representing relations between objects, the global database of the production system is represented with a graph under the work. The nodes of this graph represent connected components and the text lines of the document, and the edges of the graph represent adjacency relations between objects. Each edge in the graph is weighted by the gap measure between the two objects in the document image.

The technique described in Ref. [12] can deal with complex skewnesses of printed text lines of Roman script. It can accept printed pages of non-rectangular layouts with various skewnesses of text lines. The technique performs thinning on backgrounds of text images. The background being so thinned produces loops around various textual portions of the input document image. Irrelevant loops are removed by using some predetermined distance and width thresholds. These thresholds are computed on the basis of the average line width.

To deal with complex types of skewness of printed Bengali text lines, as shown in Fig. 1(c), (d), the water reservoir principle is applied in Refs. [13], [14]. Left, right, top and bottom reservoirs are used for detection of both isolated and touching word components in text images. The word components thus detected are finally clustered into groups such that each group contains all word components belonging to a single text line. The techniques described above are applicable only to images of printed texts of Indian scripts, Bengali or Devnagri. Such a technique mostly considers the upper envelope of a text line in determining its skewness. This is possible because text lines of Bengali or Devnagri script feature prominent head lines or ‘Matras’. The upper envelope in the Roman script does not necessarily represent the slope of the text lines. Therefore, the upper envelope based skew correction and line extraction techniques [9], [13], [14] cannot be applied to Roman script due to the absence of prominent head lines.

A technique called ‘extended linear segment linking’ (ELSL) is described in Ref. [15] to extract both multi-skewed and curved text lines from images of printed documents. It is also applicable to document images containing both horizontal and vertical text lines on the same page. Such text lines are found in documents in the Japanese language. In the ELSL technique, a document is split into small sub-regions and the local orientations of text lines in all the sub-regions are then estimated. Consecutive sub-regions with the same orientation are finally connected to extract the text line therein. The technique is applicable to text documents with characters of pre-fixed maximum and minimum sizes.

From the above discussions, it is clear that there is a need for developing a general technique for extraction of text lines from multi-skewed document images, prepared with Roman or Bengali script, either handwritten or printed, as demanded by some specific applications. Keeping this in mind, the present technique has been developed.

The ideas that have motivated the work are as follows. All text lines in a document are separated from each other by uniform or non-uniform spacings, depending on the nature of the skewness of the lines. To access these lines, all line spacings in the document are to be labelled first. Each of the unlabelled stripes of text left in the document image is then to be labelled distinctly to identify the different text lines in the document. The technique so conceived can work irrespective of the nature of the skewness of text lines in the document image.
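The second step of this idea, giving each unlabelled stripe a distinct label once the line spacings are marked, can be sketched in Python as a plain connected-component labelling. This is an illustrative sketch only, not the authors' implementation: the function name, the boolean `spacing` mask, the breadth-first flood fill and the choice of 4-connectivity are all assumptions, and how the spacing mask is obtained is outside the scope of the sketch.

```python
from collections import deque


def label_stripes(spacing):
    """Give each connected region of non-spacing pixels a distinct label.

    spacing: list of rows of booleans; True marks a pixel already labelled
             as line spacing (how it was labelled is not modelled here).
    Returns (labels, count): labels holds 0 for spacing pixels and
    1..count for the stripes, numbered in scan order.
    """
    height, width = len(spacing), len(spacing[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if spacing[y][x] or labels[y][x]:
                continue                   # spacing, or already in a stripe
            count += 1                     # start a new stripe here
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:                   # flood fill with 4-connectivity
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not spacing[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# '#' marks labelled spacing, '.' marks unlabelled stripe pixels.
rows = [
    "........",
    "####....",
    "....####",
    "........",
    "########",
    "........",
]
mask = [[ch == "#" for ch in row] for row in rows]
labels, count = label_stripes(mask)
print(count)  # prints 3
```

In the toy mask, the first band of spacing runs diagonally, so the two stripes it separates overlap in their row ranges and could not be split by a horizontal cut, yet they still receive different labels.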

Section snippets

The present work

Two things are needed to implement the above idea. Firstly, a technique is required to label all line spacings in the document irrespective of their degree of uniformity. Secondly, a technique is required to identify separately all unlabelled stripes left after the line spacings have been labelled.

To develop a technique for labelling all line spacings in the document image, the present work hypothetically assumes a flow of water in a particular direction across the image frame in

Results and discussion

To conduct experiments with the technique described here, various samples of English and Bengali documents have been collected from different sources. The documents are digitized using a flatbed scanner at a resolution of 300 dpi. The digitized documents so prepared are finally binarized simply through thresholding.
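Binarization by simple thresholding can be sketched as follows. The source does not state the threshold used, so the value 128, the function name and the convention that dark pixels are ink (mapped to 1) are assumptions for illustration.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0..255 values) to 0/1 pixels.

    Pixels darker than the (assumed) threshold are treated as ink (1);
    all others are treated as background (0).
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


scan = [
    [255, 30],
    [127, 128],
]
print(binarize(scan))  # prints [[0, 1], [1, 0]]
```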

The performances of the present technique are tested separately with samples of printed and handwritten text lines. Two data sets, one with printed text lines and the other with

Acknowledgments

The authors are thankful to the CMATER and the SRUVM project, C.S.E. Department, Jadavpur University, for providing necessary infrastructural facilities during the progress of the work.

About the Author—SUBHADIP BASU received his B.E. degree in Computer Science and Engineering from Kuvempu University, Karnataka, India, in 1999. He received his Ph.D. (Eng.) degree thereafter from Jadavpur University (J.U.) in 2006. He joined J.U. as a lecturer in 2006. His areas of current research interest are OCR of handwritten text, gesture recognition, real-time image processing.

References (18)

  • O. Okun et al., Robust document skew detection based on line extraction

  • O. Okun et al., Large-scale experiments with skew detection techniques

  • D.X. Le et al., Document skew angle detection algorithm

  • P. Slavik et al., Equivalence of different methods for slant and skew corrections in word recognition applications, IEEE Trans. Pattern Anal. Mach. Intell. (2001)

  • S. Madhavanath et al., Chaincode contour processing for handwritten word recognition, IEEE Trans. Pattern Anal. Mach. Intell. (1999)

  • R.M. Bozinovic et al., Off-line cursive script word recognition, IEEE Trans. Pattern Anal. Mach. Intell. (1989)

  • A.W. Senior et al., An off-line cursive handwriting recognition system, IEEE Trans. Pattern Anal. Mach. Intell. (1998)

  • B.B. Chaudhuri et al., Skew angle detection of digitized Indian script documents, IEEE Trans. Pattern Anal. Mach. Intell. (1997)

  • U. Pal et al., Multi-skew detection of Indian script documents

About the Author—CHITRITA CHAUDHURI received her B.E.Tel.E. and M.E.Tel.E. degrees from Jadavpur University, in 1980 and 1982, respectively. She joined J.U. as a lecturer in 2001 and is currently working there as a Reader. Her areas of current research interest are pattern recognition, image processing, multimedia techniques, and data mining.

About the Author—MAHANTAPAS KUNDU received his B.E.E, M.E.Tel.E and Ph.D. (Eng.) degrees from Jadavpur University, in 1983, 1985 and 1995, respectively. Prof. Kundu has been a faculty member of J.U. since 1988. His areas of current research interest include pattern recognition, image processing, multimedia database, and artificial intelligence.

About the Author—MITA NASIPURI received her B.E.Tel.E., M.E.Tel.E., and Ph.D. (Eng.) degrees from Jadavpur University, in 1979, 1981 and 1990, respectively. Prof. Nasipuri has been a faculty member of J.U. since 1987. Her current research interest includes image processing, pattern recognition, and multimedia systems. She is a senior member of the IEEE, U.S.A., Fellow of I.E. (India) and W.B.A.S.T., Kolkata, India.

About the Author—DIPAK KUMAR BASU received his B.E.Tel.E., M.E.Tel., and Ph.D. (Eng.) degrees from Jadavpur University, in 1964, 1966 and 1969, respectively. Prof. Basu has been a faculty member of J.U. since 1968. His current fields of research interest include pattern recognition, image processing, and multimedia systems. He is a senior member of the IEEE, U.S.A., Fellow of I.E. (India) and W.B.A.S.T., Kolkata, India and a former Fellow, Alexander von Humboldt Foundation, Germany.
