International Journal of Computer Vision

27-08-2020

Scene Text Detection and Recognition: The Deep Learning Era

Authors: Shangbang Long, Xin He, Cong Yao

Published in: International Journal of Computer Vision | Issue 1/2021

Abstract

With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an important research area in computer vision, scene text detection and recognition has inevitably been influenced by this wave of revolution, consequently entering the era of deep learning. In recent years, the community has witnessed substantial advancements in mindset, methodology, and performance. This survey summarizes and analyzes the major changes and significant progress in scene text detection and recognition in the deep learning era. Through this article, we aim to: (1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; (3) look ahead to future trends. Specifically, we emphasize the dramatic differences brought by deep learning and the remaining grand challenges. We expect that this review paper will serve as a reference for researchers in this field. Related resources are also collected in our GitHub repository (https://github.com/Jyouhou/SceneTextPapers).

Almazán, J., Gordo, A., Fornés, A., & Valveny, E. (2014). Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12), 2552–2566.CrossRef

Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5), 898–916.CrossRef

Baek, J., Kim, G., Lee, J., Park, S., Han, D., Yun, S., et al. (2019a). What is wrong with scene text recognition model comparisons? Dataset and model analysis. In Proceedings of the IEEE international conference on computer vision (pp. 4715–4723).

Baek, Y., Lee, B., Han, D., Yun, S., & Lee, H. (2019b). Character region awareness for text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 9365–9374).

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. In ICLR 2015.

Bai, F., Cheng, Z., Niu, Y., Pu, S., & Zhou, S. (2018). Edit probability for scene text recognition. In CVPR 2018.

Bartz, C., Yang, H., & Meinel, C. (2017). See: Towards semi-supervised end-to-end scene text recognition. arXiv preprint arXiv:1712.05404.

Bissacco, A., Cummins, M., Netzer, Y., & Neven, H. (2013). Photoocr: Reading text in uncontrolled conditions. In Proceedings of the IEEE international conference on computer vision (pp. 785–792).

Borisyuk, F., Gordo, A., & Sivakumar, V. (2018). Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining (pp. 71–79). ACM.

Busta, M., Neumann, L., & Matas, J. (2015). Fastext: Efficient unconstrained scene text detector. In Proceedings of the IEEE international conference on computer vision (ICCV) (pp. 1206–1214).

Busta, M., Neumann, L., & Matas, J. (2017). Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In Proceedings of ICCV.

Chen, X., Yang, J., Zhang, J., & Waibel, A. (2004). Automatic detection and recognition of signs from natural scenes. IEEE Transactions on Image Processing, 13(1), 87–99.CrossRef

Cheng, Z., Bai, F., Xu, Y., Zheng, G., Pu, S., & Zhou, S. (2017a). Focusing attention: Towards accurate text recognition in natural images. In 2017 IEEE international conference on computer vision (ICCV) (pp. 5086–5094). IEEE.

Cheng, Z., Liu, X., Bai, F., Niu, Y., Pu, S., & Zhou, S. (2017b). Arbitrarily-oriented text recognition. In CVPR2018.

Ch’ng, C.K., & Chan, C. S. (2017). Total-text: A comprehensive dataset for scene text detection and recognition. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) (Vol. 1, pp. 935–942). IEEE.

Chowdhury, M. A., & Deb, K. (2013). Extracting and segmenting container name from container images. International Journal of Computer Applications, 74(19), 18–22.CrossRef

Coates, A., Carpenter, B., Case, C., Satheesh, S., Suresh, B., Wang, T., et al. (2011). Text detection and character recognition in scene images with unsupervised feature learning. In 2011 international conference on document analysis and recognition (ICDAR) (pp. 440–445). IEEE.

Dai, Y., Huang, Z., Gao, Y., & Chen, K. (2017). Fused text segmentation networks for multi-oriented scene text detection. arXiv preprint arXiv:1709.03272.

Dalal, N., & Triggs, B., (2005). Histograms of oriented gradients for human detection. In IEEE computer society conference on computer vision and pattern recognition (CVPR) (Vol. 1, pp. 886–893). IEEE.

Deng, D., Liu, H., Li, X., & Cai, D. (2018). Pixellink: Detecting scene text via instance segmentation. Proceedings of AAA, I, 2018.

DeSouza, G. N., & Kak, A. C. (2002). Vision for mobile robot navigation: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), 237–267.CrossRef

Dollár, P., Appel, R., Belongie, S., & Perona, P. (2014). Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8), 1532–1545.CrossRef

Dvorin, Y., & Havosha, U. E. (2009). Method and device for instant translation, June 4. US Patent App. 11/998,931.

Epshtein, B., Ofek, E., & Wexler, Y. (2010). Detecting text in natural scenes with stroke width transform. In 2010 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2963–2970). IEEE.

Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.CrossRef

Felzenszwalb, P. F., & Huttenlocher, D. P. (2005). Pictorial structures for object recognition. International Journal of Computer Vision, 61(1), 55–79.CrossRef

Fu, C.-Y., Liu, W., Ranga, A., Tyagi, A., & Berg, A. C. (2017). DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.

Gao, Y., Chen, Y., Wang, J., & Lu, H. (2017). Reading scene text with attention convolutional sequence modeling. arXiv preprint arXiv:1709.04303.

Girshick, R. (2015). Fast R-CNN. In The IEEE international conference on computer vision (ICCV).

Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 580–587).

Goldberg, A. V. (1997). An efficient implementation of a scaling minimum-cost flow algorithm. Journal of Algorithms, 22(1), 1–29.MathSciNetCrossRef

Gordo, A. (2015). Supervised mid-level features for word image representation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2956–2964).

Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on machine learning (pp. 369–376). ACM.

Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., & Fernández, S. (2008). Unconstrained on-line handwriting recognition with recurrent neural networks. In Advances in neural information processing systems (pp. 577–584).

Gupta, A., Vedaldi, A., & Zisserman, A. (2016). Synthetic data for text localisation in natural images. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2315–2324).

Ham, Y. K., Kang, M. S., Chung, H. K., Park, R.-H., & Park, G. T. (1995). Recognition of raised characters for automatic classification of rubber tires. Optical Engineering, 34(1), 102–110.CrossRef

Han, J., Zhang, D., Cheng, G., Liu, N., & Xu, D. (2018). Advanced deep-learning techniques for salient and category-specific object detection: A survey. IEEE Signal Processing Magazine, 35(1), 84–100.CrossRef

He, D., Yang, X., Liang, C., Zhou, Z., Ororbia, A. G., Kifer, D., & Giles, C. L. (2017a). Multi-scale FCN with cascaded instance aware segmentation for arbitrary oriented word spotting in the wild. In 2017 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 474–483). IEEE.

He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017b). Mask R-CNN. In 2017 IEEE international conference on computer vision (ICCV) (pp. 2980–2988). IEEE.

He, P., Huang, W., He, T., Zhu, Q., Qiao, Y., & Li, X. (2017c). Single shot text detector with regional attention. In The IEEE international conference on computer vision (ICCV).

He, P., Huang, W., Qiao, Y., Loy, C. C., & Tang, X. (2016). Reading scene text in deep convolutional sequences. In Thirtieth AAAI conference on artificial intelligence.

He, T., Tian, Z., Huang, W., Shen, C., Qiao, Y., & Sun, C. (2018). An end-to-end textspotter with explicit alignment and attention. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5020–5029).

He, W., Zhang, X.-Y., Yin, F., & Liu, C.-L. (2017d). Deep direct regression for multi-oriented scene text detection. In The IEEE international conference on computer vision (ICCV).

He, Z., Liu, J., Ma, H., & Li, P. (2005). A new automatic extraction method of container identity codes. IEEE Transactions on Intelligent Transportation Systems, 6(1), 72–78.CrossRef

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.CrossRef

Hu, H., Zhang, C., Luo, Y., Wang, Y., Han, J., & Ding, E. (2017). Wordsup: Exploiting word annotations for character based text detection. In Proceedings of the IEEE international conference on computer vision. 2017.

Huang, W., Lin, Z., Yang, J., & Wang, J. (2013). Text localization in natural images using stroke feature transform and text covariance descriptors. In Proceedings of the IEEE international conference on computer vision (pp. 1241–1248).

Huang, W., Qiao, Y., & Tang, X. (2014). Robust scene text detection with convolution neural network induced MSER trees. In European conference on computer vision (pp. 497–511). Springer.

Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014a). Deep structured output learning for unconstrained text recognition. In ICLR2015.

Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014b). Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227.

Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2016). Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1), 1–20.MathSciNetCrossRef

Jaderberg, M., Simonyan, K., Zisserman, A. et al. (2015). Spatial transformer networks. In Advances in neural information processing systems (pp. 2017–2025).

Jaderberg, M., Vedaldi, A., & Zisserman, A. (2014c). Deep features for text spotting. In In Proceedings of European conference on computer vision (ECCV) (pp. 512–528). Springer.

Jain, A. K., & Yu, B. (1998). Automatic text location in images and video frames. Pattern Recognition, 31(12), 2055–2076.CrossRef

Jiang, Y., Zhu, X., Wang, X., Yang, S., Li, W., Wang, H., Fu, P., & Luo, Z. (2017). R2CNN: rotational region CNN for orientation robust scene text detection. arXiv preprint arXiv:1706.09579.

Jung, K., Kim, K. I., & Jain, A. K. (2004). Text information extraction in images and video: A survey. Pattern Recognition, 37(5), 977–997.CrossRef

Kang, L., Li, Y., & Doermann, D. (2014). Orientation robust text line detection in natural images. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4034–4041).

Karatzas, D., & Antonacopoulos, A. (2004). Text extraction from web images based on a split-and-merge segmentation method using colour perception. In Proceedings of the 17th international conference on pattern recognition, 2004. ICPR 2004 (Vol. 2, pp. 634–637). IEEE.

Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., et al. (2015). ICDAR 2015 competition on robust reading. In 2015 13th international conference on document analysis and recognition (ICDAR) (pp. 1156–1160). IEEE.

Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Bigorda, L. G. I., Mestre, S. R., et al. (2013). ICDAR 2013 robust reading competition. In 2013 12th international conference on document analysis and recognition (ICDAR) (pp. 1484–1493). IEEE.

Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105).

Lee, C.-Y., & Osindero, S. (2016). Recursive recurrent nets with attention modeling for OCR in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2231–2239).

Lee, J.-J, Lee, P.-H., Lee, S.-W., Yuille, A., & Koch, C. (2011). Adaboost for text detection in natural scene. In 2011 international conference on document analysis and recognition (ICDAR) (pp. 429–434). IEEE.

Lee, S., & Kim, J. H. (2013). Integrating multiple character proposals for robust scene text extraction. Image and Vision Computing, 31(11), 823–840.CrossRef

Li, H., Wang, P., & Shen, C. (2017a). Towards end-to-end text spotting with convolutional recurrent neural networks. In The IEEE international conference on computer vision (ICCV).

Li, H., Wang, P., Shen, C., & Zhang, G. (2019). Show, attend and read: A simple and strong baseline for irregular text recognition. In AAAI.

Li, R., En, M., Li, J., & Zhang, H. (2017b). Weakly supervised text attention network for generating text proposals in scene images. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) (Vol. 1, pp. 324–330). IEEE.

Liao, M., Shi, B., & Bai, X. (2018a). Textboxes++: A single-shot oriented scene text detector. IEEE Transactions on Image Processing, 27(8), 3676–3690.MathSciNetCrossRef

Liao, M., Shi, B., Bai, X., Wang, X., & Liu, W. (2017). Textboxes: A fast text detector with a single deep neural network. In AAAI (pp. 4161–4167).

Liao, M., Song, B., He, M., Long, S., Yao, C., & Bai, X. (2019a). Synthtext3d: Synthesizing scene text images from 3d virtual worlds. arXiv preprint arXiv:1907.06007.

Liao, M., Zhang, J., Wan, Z., Xie, F., Liang, J., Lyu, P., Yao, C., & Bai, X. (2019b). Scene text recognition from two-dimensional perspective. In AAAI.

Liao, M., Zhu, Z., Shi, B., Xia, G.-S., & Bai, X. (2018b). Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5909–5918).

Liu, F., Shen, C., & Lin, G. (2015). Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5162–5170).

Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., & Pietikäinen, M. (2018a). Deep learning for generic object detection: A survey. arXiv preprint arXiv:1809.02165.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016a). SSD: Single shot multibox detector. In In Proceedings of European conference on computer vision (ECCV) (pp. 21–37). Springer.

Liu, W., Chen, C., & Wong, K. (2018b). Char-net: A character-aware neural network for distorted scene text recognition. In AAAI conference on artificial intelligence, New Orleans, Louisiana, USA.

Liu, W., Chen, C., Wong, K.-Y. K., Su, Z., & Han, J. (2016b). Star-net: A spatial attention residue network for scene text recognition. In BMVC (Vol. 2, p. 7).

Liu, X. (1975). Old book of tang. Beijing: Zhonghua Book Company.

Liu, X., Liang, D., Yan, S., Chen, D., Qiao, Y., & Yan, J. (2018c). FOTS: Fast oriented text spotting with a unified network. In CVPR2018.

Liu, X., & Samarabandu, J. (2005a). An edge-based text region extraction algorithm for indoor mobile robot navigation. In 2005 IEEE international conference mechatronics and automation (Vol. 2, pp. 701–706). IEEE.

Liu, X., & Samarabandu, J. K. (2005b). A simple and fast text localization algorithm for indoor mobile robot navigation. In Image processing: Algorithms and systems IV (Vol. 5672, pp. 139–151). International Society for Optics and Photonics.

Liu, Y., & Jin, L. (2017). Deep matching prior network: Toward tighter multi-oriented text detection.

Liu, Y., Jin, L., Xie, Z., Luo, C., Zhang, S., & Xie, L. (2019). Tightness-aware evaluation protocol for scene text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 9612–9620).

Liu, Y., Jin, L., Zhang, S., & Zhang, S. (2017). Detecting curve text in the wild: New dataset and new solution. arXiv preprint arXiv:1712.02170.

Liu, Z., Li, Y., Ren, F., Yu, H., & Goh, W. (2018d). Squeezedtext: A real-time scene text recognition by binary convolutional encoder–decoder network. In AAAI.

Liu, Z., Lin, G., Yang, S., Feng, J., Lin, W., & Goh, W. L. (2018e). Learning Markov clustering networks for scene text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 6936–6944).

Long, S., Guan, Y., Bian, K., & Yao, C. (2020). A new perspective for flexible feature gathering in scene text recognition via character anchor pooling. In ICASSP 2020—2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2458–2462. IEEE.

Long, S., Guan, Y., Wang, B., Bian, K., & Yao, C. (2019). Alchemy: Techniques for rectification based irregular scene text recognition. arXiv preprint arXiv:1908.11834.

Long, S., Ruan, J., Zhang, W., He, X., Wu, W., & Yao, C. (2018). Textsnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of European conference on computer vision (ECCV).

Long, S., & Yao, C. (2020). Unrealtext: Synthesizing realistic scene text images from the unreal world. arXiv preprint arXiv:2003.10608.

Lyu, P., Liao, M., Yao, C., Wu, W., & Bai, X. (2018a). Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of European conference on computer vision (ECCV).

Lyu, P., Yao, C., Wu, W., Yan, S., & Bai, X. (2018b). Multi-oriented scene text detection via corner localization and region segmentation. In 2018 IEEE conference on computer vision and pattern recognition (CVPR).

Ma, J., Shao, W., Ye, H., Wang, L., Wang, H., Zheng, Y., et al. (2018). Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 20, 3111–3122.CrossRef

Mammeri, A., & Boukerche, A. et al. (2016). MSER-based text detection and communication algorithm for autonomous vehicles. In 2016 IEEE symposium on computers and communication (ISCC) (pp. 1218–1223). IEEE.

Mammeri, A., Khiari, E.-H., & Boukerche, A. (2014). Road-sign text recognition architecture for intelligent transportation systems. In 2014 IEEE 80th vehicular technology conference (VTC Fall) (pp. 1–5). IEEE.

Mishra, A., Alahari, K., & Jawahar, C. (2011). An MRF model for binarization of natural scene text. In ICDAR-international conference on document analysis and recognition. IEEE.

Mishra, A., Alahari, K., & Jawahar, C. (2012). Scene text recognition using higher order language priors. In BMVC-British machine vision conference. BMVA.

Neumann, L., & Matas, J. (2010). A method for text localization and recognition in real-world images. In Asian conference on computer vision (pp. 770–783). Springer.

Neumann, L., & Matas, J. (2012). Real-time scene text localization and recognition. In 2012 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3538–3545). IEEE.

Neumann, L., & Matas, J. (2013). On combining multiple segmentations in scene text recognition. In 2013 12th international conference on document analysis and recognition (ICDAR) (pp. 523–527). IEEE.

Nomura, S., Yamanaka, K., Katai, O., Kawakami, H., & Shiose, T. (2005). A novel adaptive morphological approach for degraded character image segmentation. Pattern Recognition, 38(11), 1961–1975.CrossRef

Parkinson, C., Jacobsen, J. J., Ferguson, D. B., & Pombo, S. A. (2016). Instant translation system, Nov. 29. US Patent 9,507,772.

Qin, S., Bissacco, A., Raptis, M., Fujii, Y., & Xiao, Y. (2019). Towards unconstrained end-to-end text spotting. In Proceedings of the IEEE international conference on computer vision (pp. 4704–4714).

Qiu, W., Zhong, F., Zhang, Y., Qiao, S., Xiao, Z., Kim, T. S., & Wang, Y. (2017). Unrealcv: Virtual worlds for computer vision. In Proceedings of the 25th ACM international conference on multimedia (pp. 1221–1224). ACM.

Phan, T. Q., Shivakumara, P., Tian, S., & Tan, C. L. (2013). Recognizing text with perspective distortion in natural scenes. In Proceedings of the IEEE international conference on computer vision (ICCV) (pp. 569–576).

Redmon, J., & Farhadi, A. (2017). Yolo9000: Better, faster, stronger. arXiv preprint.

Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 779–788).

Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91–99).

Rodriguez-Serrano, J. A., Gordo, A., & Perronnin, F. (2015). Label embedding: A frugal baseline for text recognition. International Journal of Computer Vision, 113(3), 193–207.CrossRef

Rodriguez-Serrano, J. A., Perronnin, F., & Meylan, F. (2013). Label embedding for text recognition. In Proceedings of the British machine vision conference. Citeseer.

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. Berlin: Springer.

Roy, P. P., Pal, U., Llados, J., & Delalandre, M. (2009). Multi-oriented and multi-sized touching character segmentation using dynamic programming. In 10th international conference on document analysis and recognition, 2009. IEEE.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.MathSciNetCrossRef

Schroth, G., Hilsenbeck, S., Huitl, R., Schweiger, F., & Steinbach, E. (2011). Exploiting text-related features for content-based image retrieval. In 2011 IEEE international symposium on multimedia (pp. 77–84). IEEE.

Schulz, R., Talbot, B., Lam, O., Dayoub, F., Corke, P., Upcroft, B., & Wyeth, G. (2015). Robot navigation using human cues: A robot navigation system for symbolic goal-directed exploration. In Proceedings of the 2015 IEEE international conference on robotics and automation (ICRA 2015) (pp. 1100–1105). IEEE.

Sheshadri, K., & Divvala, S. K. (2012). Exemplar driven character recognition in the wild. In BMVC (pp. 1–10).

Shi, B., Bai, X., & Belongie, S. (2017a). Detecting oriented text in natural images by linking segments. In The IEEE conference on computer vision and pattern recognition (CVPR).

Shi, B., Bai, X., & Yao, C. (2017b). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11), 2298–2304.CrossRef

Shi, B., Wang, X., Lyu, P., Yao, C., & Bai, X. (2016). Robust scene text recognition with automatic rectification. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4168–4176).

Shi, B., Yang, M., Wang, X., Lyu, P., Bai, X., & Yao, C. (2018). Aster: An attentional scene text recognizer with flexible rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11), 855–868.

Shi, C., Wang, C., Xiao, B., Zhang, Y., Gao, S., & Zhang, Z. (2013). Scene text recognition using part-based tree-structured character detection. In 2013 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2961–2968). IEEE.

Shivakumara, P., Bhowmick, S., Su, B., Tan, C. L., & Pal, U. (2011). A new gradient based character segmentation method for video text recognition. In 2011 international conference on document analysis and recognition (ICDAR). IEEE.

Su, B., & Lu, S. (2014). Accurate scene text recognition based on recurrent neural network. In Asian conference on computer vision (pp. 35–48). Springer.

Sun, Y., Liu, J., Liu, W., Han, J., Ding, E., & Liu, J. (2019). Chinese street view text: Large-scale Chinese text reading with partially supervised learning. In Proceedings of the IEEE international conference on computer vision (pp. 9086–9095).

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104–3112).

Tian, S., Pan, Y., Huang, C., Lu, S., Yu, K., & Tan, C. L. (2015). Text flow: A unified text detection system in natural scene images. In Proceedings of the IEEE international conference on computer vision (pp. 4651–4659).

Tian, S., Lu, S., & Li, C. (2017). Wetext: Scene text detection under weak supervision. In Proceedings of ICCV.

Tian, Z. Huang, W., He, T., He, P., & Qiao, Y. (2016). Detecting text in natural image with connectionist text proposal network. In In Proceedings of European conference on computer vision (ECCV) (pp. 56–72). Springer.

Tian, Z., Shu, M., Lyu, P., Li, R., Zhou, C., Shen, X., & Jia, J. (2019). Learning shape-aware embedding for scene text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4234–4243).

Tsai, S. S., Chen, H., Chen, D., Schroth, G., Grzeszczuk, R., & Girod, B. (2011). Mobile visual search on printed documents using text and low bit-rate features. In 18th IEEE international conference on image processing (ICIP) (pp. 2601–2604). IEEE.

Tu, Z., Ma, Y., Liu, W., Bai, X., & Yao, C. (2012). Detecting texts of arbitrary orientations in natural images. In 2012 IEEE conference on computer vision and pattern recognition (pp. 1083–1090). IEEE.

Uchida, S. (2014). Text localization and recognition in images and video. In Handbook of document image processing and recognition (pp. 843–883). Springer.

Wachenfeld, S., Klein, H.-U., & Jiang, X. (2006). Recognition of screen-rendered text. In 18th international conference on pattern recognition, 2006. ICPR 2006 (Vol. 2, pp. 1086–1089). IEEE.

Wakahara, T., & Kita, K. (2011). Binarization of color character strings in scene images using k-means clustering and support vector machines. In 2011 international conference on document analysis and recognition (ICDAR) (pp. 274–278). IEEE.

Wang, C., Yin, F., & Liu, C.-L. (2017). Scene text detection with novel superpixel based character candidate extraction. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) (Vol. 1, pp. 929–934). IEEE.

Wang, F., Zhao, L., Li, X., Wang, X., & Tao, D. (2018). Geometry-aware scene text detection with instance transformation network. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1381–1389).

Wang, K., Babenko, B., & Belongie, S. (2011). End-to-end scene text recognition. In 2011 IEEE international conference on computer vision (ICCV), (pp. 1457–1464). IEEE.

Wang, T., Wu, D. J., Coates, A., & Ng, A. Y. (2012). End-to-end text recognition with convolutional neural networks. In 2012 21st international conference on pattern recognition (ICPR) (pp. 3304–3308). IEEE.

Wang, W., Xie, E., Li, X., Hou, W., Lu, T., Yu, G., & Shao, S. (2019a). Shape robust text detection with progressive scale expansion network. Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

Wang, X., Jiang, Y., Luo, Z., Liu, C.-L., Choi, H., & Kim, S. (2019b). Arbitrary shape scene text detection with adaptive text region representation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6449–6458).

Weinman, J., Learned-Miller, E., & Hanson, A. (2007). Fast lexicon-based scene text recognition with sparse belief propagation. In ICDAR (pp. 979–983). IEEE.

Wolf, C., & Jolion, J.-M. (2006). Object count/area graphs for the evaluation of object detection and segmentation algorithms. International Journal of Document Analysis and Recognition (IJDAR), 8(4), 280–296.CrossRef

Wu, L., Zhang, C., Liu, J., Han, J., Liu, J., Ding, E., & Bai, X. (2019). Editing text in the wild. In Proceedings of the 27th ACM international conference on multimedia (pp. 1500–1508).

Wu, Y., & Natarajan, P. (2017). Self-organized text detection with minimal post-processing via border learning. In Proceedings of the IEEE conference on CVPR (pp. 5000–5009).

Xia, Y., Tian, F., Wu, L., Lin, J., Qin, T., Yu, N., & Liu, T.-Y. (2017). Deliberation networks: Sequence generation beyond one-pass decoding. In Advances in neural information processing systems (pp. 1784–1794).

Xing, L., Tian, Z., Huang, W., & Scott, M. R. (2019). Convolutional character networks. In Proceedings of the IEEE international conference on computer vision (pp. 9126–9136).

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., et al. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (pp. 2048–2057).

Xue, C., Lu, S., & Zhan, F. (2018). Accurate scene text detection through border semantics awareness and bootstrapping. In In Proceedings of European conference on computer vision (ECCV).

Yang, M., Guan, Y., Liao, M., He, X., Bian, K., Bai, S., et al. (2019). Symmetry-constrained rectification network for scene text recognition. In Proceedings of the IEEE international conference on computer vision (pp. 9147–9156).

Yang, Q., Jin, H., Huang, J., & Lin, W. (2020). Swaptext: Image based texts transfer in scenes. arXiv preprint arXiv:2003.08152.

Yang, X., He, D., Zhou, Z., Kifer, D., & Giles, C. L. (2017). Learning to read irregular text with attention mechanisms. In Proceedings of the twenty-sixth international joint conference on artificial intelligence, IJCAI-17 (pp. 3280–3286).

Yao, C., Bai, X., Shi, B., & Liu, W. (2014). Strokelets: A learned multi-scale representation for scene text recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4042–4049).

Yao, C., Bai, X., Sang, N., Zhou, X., Zhou, S., & Cao, Z. (2016). Scene text detection via holistic, multi-channel prediction. arXiv preprint arXiv:1606.09002.

Ye, Q., & Doermann, D. (2015). Text detection and recognition in imagery: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(7), 1480–1500.

Ye, Q., Gao, W., Wang, W., & Zeng, W. (2003). A robust text detection algorithm in images and video frames. In IEEE ICICS-PCM (pp. 802–806).

Yi, C., & Tian, Y. (2011). Text string detection from natural scenes by structure-based partition and grouping. IEEE Transactions on Image Processing, 20(9), 2594–2605.

Yin, F., Wu, Y.-C., Zhang, X.-Y., & Liu, C.-L. (2017). Scene text recognition with sliding convolutional character models. arXiv preprint arXiv:1709.01727.

Yin, X.-C., Yin, X., Huang, K., & Hao, H.-W. (2014). Robust text detection in natural scene images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 970–983.

Yin, X.-C., Zuo, Z.-Y., Tian, S., & Liu, C.-L. (2016). Text detection, tracking and recognition in video: A comprehensive survey. IEEE Transactions on Image Processing, 25(6), 2752–2773.

Yu, D., Li, X., Zhang, C., Han, J., Liu, J., & Ding, E. (2020). Towards accurate scene text recognition with semantic reasoning networks. arXiv preprint arXiv:2003.12294.

Yuan, T.-L., Zhu, Z., Xu, K., Li, C.-J., & Hu, S.-M. (2018). Chinese text in the wild. arXiv preprint arXiv:1803.00085.

Zhan, F., & Lu, S. (2019). ESIR: End-to-end scene text recognition via iterative image rectification. In Proceedings of the IEEE conference on computer vision and pattern recognition.

Zhan, F., Lu, S., & Xue, C. (2018). Verisimilar image synthesis for accurate detection and recognition of texts in scenes.

Zhang, C., Liang, B., Huang, Z., En, M., Han, J., Ding, E., & Ding, X. (2019). Look more than once: An accurate detector for text of arbitrary shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

Zhang, D., & Chang, S.-F. (2003). A Bayesian framework for fusing multiple word knowledge models in videotext recognition. In Computer vision and pattern recognition, 2003. IEEE.

Zhang, S., Liu, Y., Jin, L., & Luo, C. (2018). Feature enhancement network: A refined scene text detector. In Proceedings of AAAI, 2018.

Zhang, S.-X., Zhu, X., Hou, J.-B., Liu, C., Yang, C., Wang, H., & Yin, X.-C. (2020). Deep relational reasoning graph network for arbitrary shape text detection. arXiv preprint arXiv:2003.07493.

Zhang, Z., Zhang, C., Shen, W., Yao, C., Liu, W., & Bai, X. (2016). Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

Zhiwei, Z., Linlin, L., & Lim, T. C. (2010). Edge based binarization for video text images. In 2010 20th international conference on pattern recognition (ICPR) (pp. 133–136). IEEE.

Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W., & Liang, J. (2017). EAST: An efficient and accurate scene text detector. In The IEEE conference on computer vision and pattern recognition (CVPR).

Zhu, Y., Yao, C., & Bai, X. (2016). Scene text detection and recognition: Recent advances and future trends. Frontiers of Computer Science, 10(1), 19–36.

Zitnick, C. L., & Dollár, P. (2014). Edge boxes: Locating object proposals from edges. In Proceedings of European conference on computer vision (ECCV) (pp. 391–405). Springer.

Title: Scene Text Detection and Recognition: The Deep Learning Era
Authors: Shangbang Long, Xin He, Cong Yao
Publication date: 27-08-2020
Publisher: Springer US
Published in: International Journal of Computer Vision / Issue 1/2021
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI: https://doi.org/10.1007/s11263-020-01369-0
