
About this book

This book constitutes the thoroughly refereed post-workshop proceedings of the 5th International Workshop on Camera-Based Document Analysis and Recognition, CBDAR 2013, held in Washington, DC, USA, in August 2013. The 14 revised full papers presented were carefully selected during two rounds of reviewing and improvement from numerous original submissions. Intended to give a snapshot of the state-of-the-art research in the field of camera-based document analysis and recognition, the papers are organized in topical sections on text detection and recognition in scene images and camera-based systems.



Text Detection and Recognition in Scene Images


Spatially Prioritized and Persistent Text Detection and Decoding

We show how to exploit temporal and spatial coherence to achieve efficient and effective text detection and decoding for a sensor suite moving through an environment in which text occurs at a variety of locations, scales and orientations with respect to the observer. Our method uses simultaneous localization and mapping (SLAM) to extract planar “tiles” representing scene surfaces. Multiple observations of each tile, captured from different observer poses, are aligned using homography transformations. Text is detected using Discrete Cosine Transform (DCT) and Maximally Stable Extremal Regions (MSER), and decoded by an Optical Character Recognition (OCR) engine. The decoded characters are then clustered into character blocks to obtain an MLE word configuration. This paper’s contributions include: (1) spatiotemporal fusion of tile observations via SLAM, prior to inspection, thereby improving the quality of the input data; and (2) combination of multiple noisy text observations into a single higher-confidence estimate of environmental text.
Hsueh-Cheng Wang, Yafim Landa, Maurice Fallon, Seth Teller

A Hierarchical Visual Saliency Model for Character Detection in Natural Scenes

Visual saliency models have been introduced to the field of character recognition for detecting characters in natural scenes. Researchers believe that characters have different visual properties from their non-character neighbors, which make them salient. Under this assumption, characters should respond well to computational models of visual saliency. However, in some situations, characters belonging to scene text might not be as salient as one might expect. For instance, a signboard is usually very salient, but the characters on the signboard might not necessarily be so salient globally. In order to analyze this hypothesis in more depth, we first give a view of how much these background regions, such as signboards, affect the task of saliency-based character detection in natural scenes. Then we propose a hierarchical-saliency method for detecting characters in natural scenes. Experiments on a dataset with over 3,000 images containing scene text show that when using saliency alone for scene text detection, our proposed hierarchical method is able to capture a larger percentage of text pixels as compared to the conventional single-pass algorithm.
Renwu Gao, Faisal Shafait, Seiichi Uchida, Yaokai Feng

A Robust Approach to Extraction of Texts from Camera Captured Images

Here, we present our recent study of a robust but simple approach to the extraction of text from camera-captured images. In the proposed approach, we first identify pixels which are highly specular. Connected components of this set of specular pixels are obtained. Pixels belonging to each such component are separately binarized using the well-known Otsu’s approach. We next apply smoothing to the whole image before obtaining its Canny edge representation. The bounding rectangle of each connected component of the Canny edge image is obtained, and multiple components with pairwise overlapping bounding boxes are merged. Otsu’s thresholding technique is applied separately to the different parts of the input image defined by the resulting bounding boxes. Although Otsu’s thresholding approach does not generally provide acceptable performance on camera-captured images, we observed its suitability when applied separately to each region as described above. The binarized specular components obtained at the initial stage replace the corresponding regions of the latter binarized image. Finally, a set of postprocessing operations is used to remove certain non-text components of the binarized image.
Sudipto Banerjee, Koustav Mullick, Ujjwal Bhattacharya
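
The region-wise thresholding step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `otsu_threshold` and `binarize_regions`, the box format `(top, left, bottom, right)`, and the assumption that text is darker than its local background are all illustrative choices, and the specular-pixel handling, smoothing, Canny edge detection and box merging are omitted.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of the gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue            # lower class still empty
        if w1 == 0:
            break               # upper class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize_regions(image, boxes):
    """Apply Otsu's threshold separately inside each (top, left, bottom, right) box.

    Assumption: text is darker than its local background, so pixels at or
    below the regional threshold become 1 and everything else stays 0.
    """
    out = [[0] * len(row) for row in image]
    for top, left, bottom, right in boxes:
        region = [image[y][x] for y in range(top, bottom) for x in range(left, right)]
        t = otsu_threshold(region)
        for y in range(top, bottom):
            for x in range(left, right):
                out[y][x] = 1 if image[y][x] <= t else 0
    return out


# Toy image: a well-lit left half and a dim, low-contrast right half.
image = [[10, 200, 100, 120],
         [10, 200, 100, 120]]
boxes = [(0, 0, 2, 2), (0, 2, 2, 4)]
binary = binarize_regions(image, boxes)
```

On this toy image a single threshold computed over all pixels separates only the darkest value, whereas each box gets a threshold suited to its own contrast. This is the kind of behavior the abstract attributes to applying Otsu's method per region.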

Scene Text Detection via Integrated Discrimination of Component Appearance and Consensus

In this paper, we propose an approach to scene text detection that leverages both the appearance and consensus of connected components. Component appearance is modeled with an SVM-based dictionary classifier, and component consensus is represented with color and spatial layout features. Responses of the dictionary classifier are integrated with the consensus features into a discriminative model, where the importance of features is determined with a text-level training procedure. In text detection, hypotheses are generated on component pairs and an iterative extension procedure is used to aggregate hypotheses into text objects. In the detection procedure, the discriminative model is used to perform classification as well as control the extension. Experiments show that the proposed approach reaches the state of the art in both detection accuracy and computational efficiency, and in particular, it performs best when dealing with low-resolution text in cluttered backgrounds.
Qixiang Ye, David Doermann

Accuracy Improvement of Viewpoint-Free Scene Character Recognition by Rotation Angle Estimation

This paper addresses the problem of detecting characters in natural scene images. Correctly discriminating characters from non-characters is also a very challenging problem. In this paper, we propose a new character/non-character discrimination technique that uses the rotation angle of characters to improve character detection accuracy in natural scene images. In particular, we individually recognize characters and estimate the rotation angle of those characters by our previously reported method, and use the rotation angle for character/non-character discrimination. In a character recognition experiment on 50 alphanumeric natural scene images, we confirmed an improvement in precision and \(F\)-measure of 9.37 % and 4.73 %, respectively, compared with the previously reported performance.
Kanta Kuramoto, Wataru Ohyama, Tetsushi Wakabayashi, Fumitaka Kimura

Sign Detection Based Text Localization in Mobile Device Captured Scene Images

Sign text is one of the most commonly seen text types in scene images. In this paper, we present a new sign text localization method for scene images captured by mobile devices. The candidate characters are first localized by detecting closed boundaries in the image. Then, based on the properties of signboards, the convex regions that contain enough candidate characters are extracted and marked as sign regions. After removing the false positives using the proposed layer analysis, the candidate characters inside the detected sign regions are output as sign text. A sign text database with 241 images captured by a mobile device was used to evaluate our method. The experimental results demonstrate the validity of the proposed method.
Jing Zhang, Rangachar Kasturi

Font Distribution Observation by Network-Based Analysis

Off-the-shelf Optical Character Recognition (OCR) engines deliver mediocre performance on the decorative characters that often appear in natural scenes, for example on signboards. A reasonable way towards so-called camera-based OCR is to collect a large-scale font set and analyze the distribution of font samples in order to realize a character recognition engine that is tolerant to font shape variations. This paper is concerned with network-based font distribution analysis. A Minimum Spanning Tree (MST) is employed to construct a font network with respect to the Chamfer distance. After clustering, a centrality criterion, namely closeness centrality, eccentricity centrality or betweenness centrality, is introduced for extracting typical font samples. The network structure allows us to observe the font shape transition between any two samples, which is useful for creating new fonts and recognizing unseen decorative characters. Moreover, unlike Principal Component Analysis (PCA), the font network achieves distribution visualization by measuring the dissimilarity between samples rather than through the lossy processing of dimensionality reduction. Compared with the K-means algorithm, network-based clustering is able to preserve small font clusters, which generally consist of samples with special appearances. Experiments demonstrate that the proposed network-based analysis is an effective way to grasp the font distribution, and thus provides helpful information for decorative character recognition.
Chihiro Nakamoto, Rong Huang, Sota Koizumi, Ryosuke Ishida, Yaokai Feng, Seiichi Uchida
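
The idea of building an MST over pairwise distances and picking a typical sample by closeness centrality can be sketched as follows. This is only an illustration under stated assumptions: the distance matrix is a made-up stand-in for Chamfer distances between font images, the clustering step is omitted, and the helper names `minimum_spanning_tree` and `closeness_centrality` are hypothetical rather than taken from the paper.

```python
import heapq


def minimum_spanning_tree(dist):
    """Prim's algorithm on a symmetric distance matrix.

    Returns the tree as an adjacency dict: node -> {neighbor: edge weight}.
    """
    n = len(dist)
    adj = {i: {} for i in range(n)}
    visited = [False] * n
    visited[0] = True
    heap = [(dist[0][j], 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    while heap:
        w, u, v = heapq.heappop(heap)
        if visited[v]:
            continue
        visited[v] = True
        adj[u][v] = w
        adj[v][u] = w
        for j in range(n):
            if not visited[j]:
                heapq.heappush(heap, (dist[v][j], v, j))
    return adj


def closeness_centrality(adj):
    """Closeness of each node: (n - 1) / sum of path lengths to all other nodes.

    Because the graph is a tree, each pair of nodes is joined by exactly one
    path, so a simple traversal that remembers the parent is sufficient.
    """
    n = len(adj)
    scores = {}
    for source in adj:
        total = 0.0
        stack = [(source, None, 0.0)]
        while stack:
            node, parent, d = stack.pop()
            total += d
            for neighbor, w in adj[node].items():
                if neighbor != parent:
                    stack.append((neighbor, node, d + w))
        scores[source] = (n - 1) / total if total > 0 else 0.0
    return scores


# Stand-in dissimilarities: four samples placed at 0, 1, 2 and 10 on a line.
positions = [0, 1, 2, 10]
dist = [[abs(a - b) for b in positions] for a in positions]

tree = minimum_spanning_tree(dist)
scores = closeness_centrality(tree)
typical = max(scores, key=scores.get)  # sample with the highest closeness
```

In this toy network the outlying sample at position 10 stays attached to the tree by a single long edge and receives the lowest closeness. That loosely mirrors the abstract's remark that the network preserves unusual samples while still exposing typical ones.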

Camera-Based Systems


Dewarping Book Page Spreads Captured with a Mobile Phone Camera

Capturing book images is more convenient with a mobile phone camera than with more specialized flat-bed scanners or 3D capture devices. We built an application for the iPhone 4S that captures a sequence of hi-res (8 MP) images of a page spread as the user sweeps the device across the book. To do the 3D dewarping, we implemented two algorithms: optical flow (OF) and structure from motion (SfM). Making further use of the image sequence, we examined the potential of multi-frame OCR. Preliminary evaluation on a small set of data shows that OF and SfM had comparable OCR performance for both single-frame and multi-frame techniques, and that multi-frame was substantially better than single-frame. The computation time was much less for OF than for SfM.
Chelhwon Kim, Patrick Chiu, Surendar Chandra

A Dataset for Quality Assessment of Camera Captured Document Images

With the proliferation of cameras on mobile devices, there is an increased desire to image document pages as an alternative to scanning. However, the quality of captured document images is often lower than that of their scanned equivalents due to hardware limitations and stability issues. In this context, automatic assessment of the quality of captured images is useful for many applications. Although there has been a lot of work on developing computational methods and creating standard datasets for natural scene image quality assessment, until recently quality estimation of camera-captured document images has not been given much attention. One traditional quality indicator for document images is the Optical Character Recognition (OCR) accuracy. In this work, we present a dataset of camera-captured document images containing varying levels of focal blur introduced manually during capture. For each image we obtained the character-level OCR accuracy. Our dataset can be used to evaluate methods for predicting the OCR quality of captured documents as well as enhancements. In order to make the dataset publicly and freely available, originals were selected from two existing datasets, the University of Washington dataset and the Tobacco Database. We present a case study with three recent methods for predicting the OCR quality of images on our dataset.
Jayant Kumar, Peng Ye, David Doermann

A Morphology-Based Border Noise Removal Method for Camera-Captured Label Images

Printed labels are widely used in daily life to track items, especially in logistics management. If item information on a label could be recognized automatically, the efficiency of logistics would be greatly improved. However, some particular properties of label images make them difficult for off-the-shelf optical character recognition (OCR) systems to recognize directly. To prepare the label images for OCR, border noise removal is an important step. With only the text region remaining, the resulting image is easier for OCR to read. In this paper, we propose a simple and effective approach to remove border noise in textile label images. Border noise in those label images is more complex than that in conventional document images. Our solution consists of four parts: label boundary detection, label blank region extraction, hole filling and border noise deletion. The experiment shows that the proposed method yields satisfactory performance.
Mengyang Liu, Chongshou Li, Wenbin Zhu, Andrew Lim

Robust Binarization of Stereo and Monocular Document Images Using Percentile Filter

Camera-captured documents can be a difficult case for standard binarization algorithms. These algorithms are specifically tailored to the requirements of scanned documents, which in general have uniform illumination and high resolution with negligible geometric artifacts. In contrast, camera-captured images are generally low resolution, contain non-uniform illumination and also possess geometric artifacts. The most important artifact is defocused or blurred text, which results from the limited depth of field of general-purpose hand-held capturing devices. These artifacts can be reduced by controlled capture with a single camera, but they are inevitable in the case of stereo document images, even with an orthoparallel camera setup.
Existing methods for binarization require the parameters to be tuned separately for the left and the right images of a stereo pair. In this paper, an approach for binarization based on local adaptive background estimation using a percentile filter is presented. The presented approach works reasonably well under the same set of parameters for both left and right images. It also shows competitive results for monocular images in comparison with standard binarization methods.
Muhammad Zeshan Afzal, Martin Krämer, Syed Saqib Bukhari, Mohammad Reza Yousefi, Faisal Shafait, Thomas M. Breuel
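
A generic form of background estimation with a percentile filter, followed by a comparison of each pixel against its local background, might look like the sketch below. The abstract does not give the paper's exact formulation, so the window radius, the percentile, the `ratio` decision rule and the function names are all assumptions made for illustration.

```python
def percentile_filter(image, radius, percentile):
    """Replace each pixel by the given percentile of its (2*radius+1)^2 window.

    With bright paper and dark ink, a high percentile approximates the local
    background (paper) intensity. Windows are clipped at the image border.
    """
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = sorted(
                image[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            )
            idx = min(len(window) - 1, int(percentile / 100 * len(window)))
            out[y][x] = window[idx]
    return out


def binarize(image, radius=1, percentile=80, ratio=0.7):
    """Mark a pixel as foreground (1) if it is clearly darker than its
    estimated local background. `ratio` is an assumed, illustrative parameter.
    """
    background = percentile_filter(image, radius, percentile)
    return [
        [1 if image[y][x] < ratio * background[y][x] else 0
         for x in range(len(image[0]))]
        for y in range(len(image))
    ]


# Toy page with illumination fading from left (paper = 200) to right (paper = 95).
# Ink is 100 on the bright side and 45 on the dim side, so the dim paper is
# darker than the bright ink and no single global threshold separates them.
row = [200, 100, 200, 170, 140, 115, 95, 45, 95]
image = [row[:] for _ in range(3)]
binary = binarize(image, radius=1, percentile=80, ratio=0.7)
```

Because the decision is relative to a locally estimated background, the same parameters handle both the bright and the dim part of the toy page. This is in the spirit of the abstract's claim that one parameter set can serve both images of a stereo pair.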

Hyperspectral Document Imaging: Challenges and Perspectives

Hyperspectral imaging provides measurement of a scene in contiguous bands across the electromagnetic spectrum. It is an effective sensing technology with vast applications in agriculture, archeology, surveillance, medicine and forensics. Traditional document imaging has been centered around monochromatic or trichromatic (RGB) sensing, often through a scanning device. Cameras have emerged in the last decade as an alternative to scanners for capturing document images. However, the focus has remained on mono-/tri-chromatic imaging. In this paper, we explore the new paradigm of hyperspectral imaging for document capture. We outline and discuss the key components of a hyperspectral document imaging system, which offers new challenges and perspectives. We discuss the issues of filter transmittance and spatial/spectral non-uniformity of the illumination and propose possible solutions via pre- and post-processing. As a sample application, the proposed imaging system is applied to the task of writing ink mismatch detection in documents on a newly collected database (UWA Writing Ink Hyperspectral Image Database, http://www.csse.uwa.edu.au/%7Eajmal/databases.html). The results demonstrate the strength of hyperspectral imaging in capturing minute differences in the spectra of different inks that are very hard to distinguish using traditional RGB imaging.
Zohaib Khan, Faisal Shafait, Ajmal Mian

Mobile Phone Camera-Based Video Scanning of Paper Documents

Mobile phone camera-based document video scanning is an interesting research problem which has entered a new era with the emergence of widely used, processing-capable smartphones equipped with motion sensors. We present our ongoing research on a mobile phone camera-based document image mosaic reconstruction method for video scanning of paper documents. In this work, we have optimized the classic keypoint feature descriptor-based image registration method by employing the accelerometer and gyroscope sensor data. Experimental results are evaluated using optical character recognition (OCR) on the reconstructed mosaic from mobile phone camera-based video scanning of paper documents.
Muhammad Muzzamil Luqman, Petra Gomez-Krämer, Jean-Marc Ogier

Real-life Activity Recognition – Focus on Recognizing Reading Activities

As the field of physical activity recognition matures, we can build more and more robust pervasive systems and slowly move towards tracking knowledge acquisition tasks. We are especially interested in one particular cognitive task, namely reading (the decoding of letters, words and sentences into information). Reading is a ubiquitous activity that many people even perform in transit, such as while on the bus or while walking. Tracking reading and other high-level user actions gives us more insight into the knowledge life of the users, enabling a whole range of novel applications. Yet, how can we extract high-level information about human activities (e.g. reading) and complex real-world situations from heterogeneous ensembles of simple, often unreliable sensors embedded in commodity devices?
The paper focuses on how to use body-worn devices for activity recognition and how to combine them with infrastructure sensing in general. In the second part, we take lessons from the physical activity recognition field and see how we can leverage them to track knowledge acquisition tasks (in particular, recognizing reading activities). We discuss challenges and opportunities.
Kai Kunze

