ABSTRACT
Video text spotting remains an important research topic because of its many real-world applications. Previous approaches usually follow a four-stage pipeline: detecting text in individual frames, recognizing the localized text regions frame by frame, tracking text streams, and generating final results with complicated post-processing. Such pipelines can suffer from high computational cost and from interference by low-quality text. In this paper, we propose a fast and robust video text spotting framework that recognizes each localized text only once instead of in every frame. Specifically, we first obtain text regions in videos with a well-designed spatial-temporal detector. We then develop a novel text recommender that selects the highest-quality text from each text stream, so that only the selected text is recognized. The recommender assembles text tracking, quality scoring and recognition into an end-to-end trainable module, which both avoids interference from low-quality text and dramatically speeds up the video text spotting process. In addition, to support the video text spotting community, we collect a larger-scale video text dataset (LSVTD), which contains 100 text videos from 22 different real-life scenarios. Extensive experiments on two public benchmarks show that our method speeds up the recognition process by 71 times on average compared with the frame-wise approach, and also achieves state-of-the-art performance.
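The recognize-once idea can be illustrated with a minimal sketch. This is not the paper's recommender, which learns tracking, quality scoring and recognition jointly; here the stream identifiers and quality scores are assumed to be given, and every name (TextRegion, select_best_per_stream, spot_text_once, toy_recognize) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TextRegion:
    frame_index: int  # frame in which the region was detected
    track_id: int     # assumed: stream identifier produced by tracking
    quality: float    # assumed: quality score, higher is better
    crop: str         # stand-in for the cropped text image


def select_best_per_stream(regions: List[TextRegion]) -> Dict[int, TextRegion]:
    """Keep only the highest-quality region of each text stream."""
    best: Dict[int, TextRegion] = {}
    for region in regions:
        current = best.get(region.track_id)
        if current is None or region.quality > current.quality:
            best[region.track_id] = region
    return best


def spot_text_once(
    regions: List[TextRegion], recognize: Callable[[str], str]
) -> Dict[int, str]:
    """Call the recognizer once per stream instead of once per detection."""
    selected = select_best_per_stream(regions)
    return {track_id: recognize(r.crop) for track_id, r in selected.items()}


# Toy data: two text streams observed over three frames.
regions = [
    TextRegion(frame_index=0, track_id=1, quality=0.2, crop="EX1T"),
    TextRegion(frame_index=1, track_id=1, quality=0.9, crop="EXIT"),
    TextRegion(frame_index=1, track_id=2, quality=0.7, crop="STOP"),
    TextRegion(frame_index=2, track_id=2, quality=0.4, crop="ST0P"),
]

calls: List[str] = []


def toy_recognize(crop: str) -> str:
    # Placeholder recognizer: records each call and echoes the crop.
    calls.append(crop)
    return crop


results = spot_text_once(regions, toy_recognize)
print(results)
print(f"recognizer calls: {len(calls)} of {len(regions)} detections")
```

In this sketch the number of recognizer calls equals the number of text streams rather than the number of detections, which is where the saving over frame-wise recognition comes from. The low-quality crops are never passed to the recognizer, so they cannot affect the final result.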
Supplemental Material
The supp.zip contains a demo video and a supplement document of the main paper.
Index Terms
- You Only Recognize Once: Towards Fast Video Text Spotting