Pattern Recognition

Volume 51, March 2016, Pages 125-134
Multilingual scene character recognition with co-occurrence of histogram of oriented gradients

https://doi.org/10.1016/j.patcog.2015.07.009

Highlights

  • Introduced two powerful features, Co-HOG and ConvCo-HOG, for scene character recognition.

  • Designed a new offset-based strategy for dimension reduction of the above features.

  • Developed two new scene character datasets for Chinese and Bengali scripts.

  • Extensive experiments on five datasets covering three scripts show the effectiveness of the approach.

Abstract

Automatic machine reading of text in scenes is largely restricted by poor character recognition accuracy. In this paper, we extend the Histogram of Oriented Gradients (HOG) and propose two new feature descriptors, Co-occurrence HOG (Co-HOG) and Convolutional Co-HOG (ConvCo-HOG), for accurate recognition of scene texts in different languages. Compared with HOG, which counts the orientation frequency of each single pixel, Co-HOG encodes more spatial contextual information by capturing the co-occurrence of orientation pairs of neighboring pixels. Additionally, ConvCo-HOG exhaustively extracts Co-HOG features from every possible image patch within a character image for more spatial information. The two features have been evaluated extensively on five scene character datasets covering three different languages: three sets in English, one in Chinese and one in Bengali. Experiments show that the proposed techniques provide superior scene character recognition accuracy and are capable of recognizing scene texts of different scripts and languages.

Introduction

Text is a fundamental tool for information recording and communication in our daily life. To help humans process text in images, Optical Character Recognition (OCR) has been investigated for a few decades with great success, largely for document text scanned by a document scanner. With the recent advance of sensor technology, an increasing amount of text has been captured by digital cameras or mobile phone cameras under weakly controlled environments. Automatic reading of text in camera-captured images is very useful for a wide range of applications such as autonomous vehicle navigation, textual image retrieval, machine translation, and living aids for visually impaired persons. On the other hand, most existing OCR systems were designed for scanned document images, where text is usually well formatted and captured under a well-controlled environment. When applied to text in scene images, the recognition performance of these existing OCR systems is often not satisfactory because such text can appear in arbitrary sizes, colors, fonts, orientations, lighting, and backgrounds, as illustrated in Fig. 1.

Recently, the Histogram of Oriented Gradients (HOG) [1] has been widely investigated for scene text recognition. As studied in [2], [3], [4], HOG outperforms almost all other features due to its robustness to illumination variation and its invariance to local geometric and photometric transformations. Another important reason is that HOG is capable of encoding and matching the strong gradients in characters. In fact, a HOG-based approach obtained the best performance on both the ICDAR2003 and SVT datasets when combined with a deep neural network trained with a huge amount of training data (up to 40 million) [5]. On the other hand, HOG captures only the frequency of gradient orientations in each block and thus misses the spatial context of neighboring pixels. For example, two image patches with similar HOG features may look very different when their pixel locations are rearranged.
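
To make this limitation concrete, below is a minimal, illustrative sketch of a per-block orientation histogram; the function name and the 8-bin quantization are our assumptions, and the full HOG of [1] additionally weights votes by gradient magnitude and normalizes across cells. Since the descriptor is a pure frequency count, rearranging the pixel positions within the block leaves it unchanged.

```python
import numpy as np

def hog_block(gray_block, n_bins=8):
    """Simplified per-block HOG: a frequency count of quantized gradient
    orientations, with no record of where each orientation occurred."""
    gy, gx = np.gradient(gray_block.astype(np.float64))   # image gradients
    theta = np.mod(np.arctan2(gy, gx), 2 * np.pi)         # orientation in [0, 2*pi)
    bins = (theta / (2 * np.pi) * n_bins).astype(int) % n_bins
    return np.bincount(bins.ravel(), minlength=n_bins)    # orientation histogram
```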

Therefore, we propose an extension of HOG, namely Co-occurrence HOG (Co-HOG) [6], for the recognition of texts in scenes. Different from HOG, Co-HOG encodes the gradient orientations of neighboring pixel pairs and accordingly captures more spatial and contextual information, making it better able to describe character shape precisely and effectively. In addition, we further design a Convolutional Co-HOG (ConvCo-HOG) feature by exhaustively extracting Co-HOG from every possible image patch. ConvCo-HOG is more robust and discriminative because it captures the co-occurring dual-edge characteristics of text strokes by exhaustively exploring every image patch within a character image.
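
The sketch below, under the same simplifications as the HOG snippet above, illustrates the two ideas; the bin count, offset set, and patch size are illustrative assumptions rather than the paper's exact parameters. Co-HOG accumulates a co-occurrence matrix of orientation pairs for each offset, and ConvCo-HOG concatenates Co-HOG descriptors computed at every patch position within the character image.

```python
import numpy as np

def cohog_block(gray_block, n_bins=8, offsets=((0, 1), (1, 0), (1, 1), (1, -1))):
    """Co-occurrence of oriented gradients: for each offset, count how often
    orientation i at a pixel co-occurs with orientation j at its neighbour."""
    gy, gx = np.gradient(gray_block.astype(np.float64))
    theta = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    bins = (theta / (2 * np.pi) * n_bins).astype(int) % n_bins
    h, w = bins.shape
    feat = []
    for dy, dx in offsets:
        co = np.zeros((n_bins, n_bins))
        for y in range(h):
            for x in range(w):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    co[bins[y, x], bins[ny, nx]] += 1      # orientation-pair vote
        feat.append(co.ravel())                            # n_bins**2 values per offset
    return np.concatenate(feat)

def convcohog(gray_img, patch=8):
    """Co-HOG extracted from every possible patch position and concatenated
    (assumes the image is at least patch x patch pixels)."""
    h, w = gray_img.shape
    feats = [cohog_block(gray_img[y:y + patch, x:x + patch])
             for y in range(h - patch + 1)
             for x in range(w - patch + 1)]
    return np.concatenate(feats)
```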

Co-HOG and ConvCo-HOG were presented in our earlier studies [7], [8]. In contrast to those two earlier studies, however, here we present them under a unified framework with more detailed descriptions and a comparative study. Additionally, two new scene character image datasets are created, one in Chinese and the other in Bengali; a few samples from these two datasets are shown in Fig. 1(b) and (c), respectively. To the best of our knowledge, these two datasets are the first of their kind in the literature and should be very useful for benchmarking future multilingual scene text recognition technologies. Furthermore, the Co-HOG and ConvCo-HOG based techniques are evaluated extensively on the two new scene character image datasets as well as on three publicly available datasets of English texts. Last but not least, a new offset metric has been designed for feature dimension reduction.

The rest of the paper is organized as follows. Section 2 discusses related work and Section 3 describes our proposed Co-HOG and ConvCo-HOG features. The proposed scene text recognition technique is then presented in Section 4. The two proposed multilingual datasets and the experimental results are presented in Sections 5 and 6, respectively. Concluding remarks and future work are finally given in Section 7.

Section snippets

Related work

Quite a number of scene text recognition techniques have been reported in the literature; they can be broadly classified into two categories. The first is text segmentation based, which first segments (binarizes) text regions from scene images and then exploits traditional OCR for scene text recognition. The second bypasses the binarization and traditional OCR processes by designing new visual features and training new classifiers.

Co-HOG feature

HOG describes the key characteristics of an image block by capturing the statistics of oriented gradients of image pixels within the image block. It is widely used for object detection and has achieved great success. On the other hand, HOG captures only the orientations of isolated pixels and ignores spatial information with respect to their neighboring pixels. The proposed Co-HOG instead captures rich spatial information by counting frequency of co-occurrence of oriented gradients between

Scene text recognition

Character images segmented from scene texts can be recognized by training a suitable classifier with either the Co-HOG or the ConvCo-HOG feature descriptor described in the last section. Linear Support Vector Machines (SVMs) are perhaps the most prominent machine learning technique for high-dimensional data since they are fast and capable of providing state-of-the-art classification accuracy [33] in similar situations. Thus, in the present study, we have chosen linear SVM

Multilingual scene character dataset

Most existing scene text recognition works focus on texts in English or Latin script. However, a significant amount of text in natural scene images is written/printed in non-English languages such as Chinese and Bengali. On the other hand, Chinese, English and Bengali belong to different writing-system types, viz. logosyllabary, alphabet and abugida, respectively. A few standard datasets of English scene character images such as ICDAR 2003 and Chars74K are available for training/testing of English scene text

Experimental setup

In the experiment, we resize each character image to 24×24 pixels and then divide it into 4×4 blocks before feature extraction. After extracting the Co-HOG and ConvCo-HOG features, a linear SVM classifier is trained with LIBLINEAR and evaluated on the test datasets.
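
As a rough illustration of this setup (a sketch under the stated assumptions, not the authors' actual code), the snippet below reuses the cohog_block function sketched earlier, resizes each grayscale character image to 24×24 pixels, splits it into a 4×4 grid of 6×6 blocks, and trains a linear SVM. scikit-learn's LinearSVC is used as a stand-in since it is backed by LIBLINEAR, and the training/test lists are hypothetical.

```python
import numpy as np
from skimage.transform import resize   # image resizing
from sklearn.svm import LinearSVC      # linear SVM backed by LIBLINEAR

def character_feature(gray_img, size=24, grid=4):
    """Resize to size x size, split into a grid x grid block layout, and
    concatenate per-block Co-HOG descriptors (cohog_block sketched above)."""
    img = resize(gray_img, (size, size), anti_aliasing=True)
    step = size // grid                 # 6x6-pixel blocks for size=24, grid=4
    feats = [cohog_block(img[y:y + step, x:x + step])
             for y in range(0, size, step)
             for x in range(0, size, step)]
    return np.concatenate(feats)

# Hypothetical usage with lists of grayscale character crops and integer labels:
# X_train = np.stack([character_feature(im) for im in train_images])
# clf = LinearSVC().fit(X_train, train_labels)
# X_test = np.stack([character_feature(im) for im in test_images])
# print("accuracy:", clf.score(X_test, test_labels))
```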

The training data for the ICDAR2003 and SVT datasets are composed of the Chars74K dataset [23] and the training part of ICDAR2003, consistent with the setup in [7]. For the IIIT 5K-word and ISI_Bengali_Character datasets, we use their respective

Conclusion

In this paper, we propose two new feature descriptors, called Co-HOG and ConvCo-HOG, for efficient recognition of scene characters. We also present two new scene character datasets, one for Chinese and the other for Bengali, towards the advancement of scene character recognition research for multiple scripts.

The proposed Co-HOG feature descriptor is designed to capture the local spatial information by counting the frequency of co-occurrence of gradient orientation of neighboring pixel

Conflict of interest

None declared.

Acknowledgments

We acknowledge Srikanta Mondal, Prakriti Banik, Suman Mondal and Sanjib Palui for their help in the creation of Bengali segmented scene character database.

References (42)

  • N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Computer Vision and Pattern Recognition...
  • K. Wang, S. Belongie, Word spotting in the wild, in: European Conference on Computer Vision (ECCV), 2010, pp....
  • K. Wang, B. Babenko, S. Belongie, End-to-end scene text recognition, in: International Conference on Computer Vision...
  • A. Mishra, K. Alahari, C. Jawahar, Top-down and bottom-up cues for scene text recognition, in: Computer Vision and...
  • A. Bissacco, M. Cummins, Y. Netzer, H. Neven, PhotoOCR: reading text in uncontrolled conditions, in: International...
  • T. Watanabe, S. Ito, K. Yokoi, Co-occurrence histograms of oriented gradients for human detection, in: Information and...
  • S. Tian, S. Lu, B. Su, C.L. Tan, Scene text recognition using co-occurrence of histogram of oriented gradients, in:...
  • B. Su, S. Lu, S. Tian, J.H. Lim, C.L. Tan, Character recognition in natural scenes using convolutional co-occurrence...
  • S.M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, ICDAR 2003 robust reading competitions, in:...
  • A. Mishra, K. Alahari, C. Jawahar, Scene text recognition using higher order language priors, in: British Machine...
  • P. Stathis, E. Kavallieratou, N. Papamarkos, An evaluation survey of binarization algorithms on historical documents,...
  • X. Chen, A. Yuille, Detecting and reading text in natural scenes, in: Computer Vision and Pattern Recognition (CVPR),...
  • W. Niblack, An Introduction to Digital Image Processing, 1985.
  • K. Kita, T. Wakahara, Binarization of color characters in scene images using k-means clustering and support vector...
  • U. Bhattacharya, S.K. Parui, S. Mondal, Devanagari and Bangla text extraction from natural scene images, in:...
  • A. Mishra, K. Alahari, C. Jawahar, An MRF model for binarization of natural scene text, in: International Conference on...
  • L. Neumann, J. Matas, Real-time scene text localization and recognition, in: Computer Vision and Pattern Recognition...
  • J.L. Feild, E.G. Learned-Miller, Improving open-vocabulary scene text recognition, in: International Conference on...
  • B. Epshtein, E. Ofek, Y. Wexler, Detecting text in natural scenes with stroke width transform, in: Computer Vision and...
  • P. Shivakumara, T.Q. Phan, S. Bhowmick, C.L. Tan, U. Pal, A novel ring radius transform for video character...
  • Y. Zhou, J. Feild, E. Learned-Miller, R. Wang, Scene text segmentation via inverse rendering, in: International...

    Shangxuan Tian is a Ph.D. candidate in the School of Computing, National University of Singapore. He received his B.S. degree in School of Computer Science and Technology from Northwestern Polytechnical University, Xi’an, China. His current research interests include document image analysis and text extraction from scene images.

    Ujjwal Bhattacharya received B.Sc. (Hons.), M.Sc. and M.Phil. degrees in Pure Mathematics from the Calcutta University in the years 1986, 1989 and 1991 respectively. He completed a Post-graduate Diploma in Computer Applications in 1990. He joined the Indian Statistical Institute, Kolkata, in 1991 as a research scholar. He received the National Scholarship (1980, 1982), the ISCA Young Scientist Award (1995) and the Amiya K. Pujari Award for best paper (2006). The topic of his Ph.D. thesis was offline handwritten Bangla character recognition. He is a member of the IUPRAI, which is affiliated to IAPR. He joined the Computer Vision and Pattern Recognition Unit of the Indian Statistical Institute as a faculty member in 1999 and currently holds an Associate Professor (equiv.) position there. His research interests include online/offline handwriting recognition, analysis of camera-captured scene texts, historical document analysis and machine learning. He was one of the Guest Editors of the forthcoming Special Issue of the Pattern Recognition Letters journal on Frontiers of Handwriting Processing.

    Shijian Lu is currently a Research Scientist in the Institute for Infocomm Research, A*STAR, Singapore. He received his Ph.D. degree in Electrical and Computer Engineering in 2005 from the National University of Singapore, Singapore. His research interests include document image analysis, medical image analysis, and bio-inspired computer vision. He has published up to 80 research papers in these areas. He has served in the program committees of many international conferences such as International Conference on Document Analysis and Recognition (ICDAR), International Conference on Pattern Recognition (ICPR), International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC).

    Bolan Su is currently a Research Scientist in the Institute for Infocomm Research, A*STAR, Singapore. He received his B.Sc. degree in Computer Science in 2008 from the Fudan University, Shanghai, China, and his Ph.D. degree in Computer Science in 2012 from the National University of Singapore, Singapore. His research interests include document image analysis, medical image analysis and computer vision.

    Qingqing Wang is a Ph.D. candidate in the Department of Computer Science and Technology, East China Normal University, China. She received her B.S. degree in Computer Science in 2013 from East China Normal University. Her current research interests include digital image processing, pattern recognition, and artificial intelligence.

    Xiaohua Wei received his B.S. degree from Guangxi Normal University, China, in 2011. He is currently pursuing his Ph.D. degree in the Department of Computer Science and Technology, East China Normal University, China. His current research interests include document image recognition and analysis.

    Yue Lu is a Professor of the Department of Computer Science and Technology, East China Normal University. He received the B.S. degree in Wireless Technology and the M.S. degree in Telecommunications and Electronic System from Zhejiang University in 1990 and 1993, respectively, and the Ph.D. degree in Pattern Recognition and Intelligent Systems from Shanghai Jiao Tong University in 2000. From 1993 to 2000, he was an Engineer with the Third Research Institute of Posts and Telecommunications Ministry of China. Before joining East China Normal University in 2004, he was a Research Fellow with the Department of Computer Science, National University of Singapore. He serves as the Vice Dean of the School of Information Science and Technology with East China Normal University, the Vice President of the Shanghai Research Institute of China Post Group, and the Director of the Shanghai Key Laboratory of Multidimensional Information Processing. His research interests include document image recognition and retrieval, natural language processing, biometrics, and intelligent system development. He has contributed to more than 100 reviewed publications in journals and conferences. He serves as an Editorial Board Member of Pattern Recognition, and an Associate Editor of the International Journal of Pattern Recognition and Artificial Intelligence, and the ACM Transactions on Asian and Low-Resource Language Information Processing.

    Chew Lim Tan is a Professor in the Department of Computer Science, School of Computing, National University of Singapore. He received his B.Sc. (Hons) degree in Physics in 1971 from University of Singapore, his M.Sc. degree in Radiation Studies in 1973 from University of Surrey, UK, and his Ph.D. degree in Computer Science in 1986 from University of Virginia, USA. His research interests include document image analysis, text and natural language processing, neural networks and genetic programming. He has published more than 400 research publications in these areas. He is an Associate Editor of ACM Transactions on Asian Language Information Processing and an Editorial Member of the International Journal on Document Analysis and Recognition. He is a Fellow of the International Association of Pattern Recognition (IAPR) and also a Senior Member of IEEE.
