Published in: International Journal of Computer Vision 3/2021

24.10.2020

Residual Dual Scale Scene Text Spotting by Fusing Bottom-Up and Top-Down Processing

Authors: Wei Feng, Fei Yin, Xu-Yao Zhang, Wenhao He, Cheng-Lin Liu


Abstract

Existing methods for arbitrary-shaped text spotting fall into two categories: bottom-up methods detect and recognize local areas of text and then group them into text lines or words, while top-down methods detect text regions of interest and then apply polygon fitting and text recognition to the detected regions. In this paper, we analyze the advantages and disadvantages of the two paradigms and propose a novel text spotter that fuses bottom-up and top-down processing. To detect text of arbitrary shapes, we employ a bottom-up detector that describes text with a series of rotated squares, and design a top-down detector that represents the region of interest with a minimum enclosing rotated rectangle. The text boundary is then determined by fusing the outputs of the two detectors. To connect arbitrary-shaped text detection and recognition, we propose a differentiable operator named RoISlide, which extracts features for arbitrary text regions from whole-image feature maps. On top of the features extracted through RoISlide, a CNN- and CTC-based text recognizer makes the framework free from character-level annotations. To improve robustness against scale variance, we further propose a residual dual-scale spotting mechanism, in which two spotters work on different feature levels and the high-level spotter is based on the residuals of the low-level spotter. Our method achieves state-of-the-art performance on four English datasets and one Chinese dataset, covering both arbitrary-shaped and oriented text. We also provide extensive ablation experiments analyzing how the key components affect performance.
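The core idea behind a RoISlide-style operator can be illustrated with a short sketch: given a sequence of rotated squares along a text line, bilinearly sample a fixed-size grid from the shared feature map at each square, yielding a feature sequence suitable for a CTC-based recognizer. This is a rough NumPy illustration of the concept only; the function names, output grid size, and sampling details are assumptions, not the paper's actual implementation.

```python
import numpy as np

def roislide_features(feature_map, centers, angles, side, out_size=4):
    """Illustrative RoISlide-style sampling (not the paper's code).

    For each rotated square, given by (center, angle, side length), along
    a text line, bilinearly sample an out_size x out_size grid from the
    feature map, producing a feature sequence of shape (T, S, S, C)."""
    seq = []
    for (cx, cy), theta in zip(centers, angles):
        cos_t, sin_t = np.cos(theta), np.sin(theta)
        # Regular grid in the square's local frame, rotated into image coords.
        lin = np.linspace(-side / 2.0, side / 2.0, out_size)
        ys, xs = np.meshgrid(lin, lin, indexing="ij")
        gx = cx + xs * cos_t - ys * sin_t
        gy = cy + xs * sin_t + ys * cos_t
        seq.append(_bilinear(feature_map, gx, gy))
    return np.stack(seq)

def _bilinear(fmap, gx, gy):
    """Bilinear interpolation of fmap (H, W, C) at float coords (gx, gy)."""
    H, W, _ = fmap.shape
    x0 = np.clip(np.floor(gx).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(gy).astype(int), 0, H - 2)
    dx = np.clip(gx - x0, 0.0, 1.0)[..., None]
    dy = np.clip(gy - y0, 0.0, 1.0)[..., None]
    f00, f01 = fmap[y0, x0], fmap[y0, x0 + 1]
    f10, f11 = fmap[y0 + 1, x0], fmap[y0 + 1, x0 + 1]
    return (f00 * (1 - dx) * (1 - dy) + f01 * dx * (1 - dy)
            + f10 * (1 - dx) * dy + f11 * dx * dy)
```

Because the sampling is bilinear, the operator is differentiable with respect to the feature map, which is what allows detection and recognition to be trained jointly from word-level annotations alone.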


Metadata
Title
Residual Dual Scale Scene Text Spotting by Fusing Bottom-Up and Top-Down Processing
Authors
Wei Feng
Fei Yin
Xu-Yao Zhang
Wenhao He
Cheng-Lin Liu
Publication date
24.10.2020
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 3/2021
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-020-01388-x
