research-article

You Only Recognize Once: Towards Fast Video Text Spotting

Authors:
Zhanzhan Cheng, Hikvision Research Institute, Hangzhou, China
Jing Lu, Hikvision Research Institute, Hangzhou, China
Yi Niu, Hikvision Research Institute, Hangzhou, China
Shiliang Pu, Hikvision Research Institute, Hangzhou, China
Fei Wu, Zhejiang University, Hangzhou, China
Shuigeng Zhou, Fudan University, Shanghai, China

MM '19: Proceedings of the 27th ACM International Conference on Multimedia, October 2019, Pages 855–863
https://doi.org/10.1145/3343031.3351093

Published: 15 October 2019

ABSTRACT

Video text spotting remains an important research topic because of its many real-world applications. Previous approaches usually follow a four-stage pipeline: detecting text in individual frames, recognizing the localized text regions frame by frame, tracking text streams, and generating final results with complicated post-processing. Such pipelines may suffer from high computational cost as well as interference from low-quality text. In this paper, we propose a fast and robust video text spotting framework that recognizes each localized text only once instead of recognizing it in every frame. Specifically, we first obtain text regions in videos with a well-designed spatial-temporal detector. We then develop a novel text recommender that selects the highest-quality text from each text stream and recognizes only the selected text. The recommender assembles text tracking, quality scoring and recognition into an end-to-end trainable module, which not only avoids interference from low-quality text but also dramatically speeds up the video text spotting process. In addition, we collect a larger-scale video text dataset (LSVTD) to support the video text spotting community; it contains 100 text videos from 22 different real-life scenarios. Extensive experiments on two public benchmarks show that our method speeds up the recognition process by 71 times on average compared with frame-wise recognition, and also achieves state-of-the-art performance.
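
The recognize-once idea in the abstract can be illustrated with a minimal Python sketch. It assumes that tracking has already grouped detections into text streams and that every detection carries a scalar quality score. The names `TextRegion`, `select_best_per_stream` and `recognize_once` are hypothetical and do not come from the paper, and the selection rule is a plain arg-max over given scores rather than the end-to-end trainable recommender the authors propose.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class TextRegion:
    """One detected text region in one frame (all fields are hypothetical)."""
    track_id: int    # text stream this region was assigned to by tracking
    frame_idx: int   # frame in which the region was detected
    quality: float   # quality score; higher is assumed to mean more legible
    crop: str        # stand-in for the cropped image patch


def select_best_per_stream(regions: Iterable[TextRegion]) -> Dict[int, TextRegion]:
    """Keep only the highest-quality region of each text stream."""
    best: Dict[int, TextRegion] = {}
    for region in regions:
        current = best.get(region.track_id)
        # Strict comparison: on a tie, the earliest region seen is kept.
        if current is None or region.quality > current.quality:
            best[region.track_id] = region
    return best


def recognize_once(
    regions: Iterable[TextRegion],
    recognizer: Callable[[str], str],
) -> Dict[int, str]:
    """Call the recognizer a single time per stream, on its best region."""
    selected = select_best_per_stream(regions)
    return {track_id: recognizer(region.crop) for track_id, region in selected.items()}
```

With N detections spread over K text streams, this sketch calls the recognizer K times instead of N, which is the kind of saving the recognize-once design targets; low-scoring regions never reach the recognizer at all.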

Supplemental Material

fp997aux.zip (9.5 MB)

The supp.zip contains a demo video and a supplementary document for the main paper.

Index Terms

Computing methodologies
  Artificial intelligence
    Computer vision
      Computer vision problems
        Object detection
        Object recognition
        Tracking

Published in
MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031
General Chairs: Laurent Amsaleg (CNRS-IRISA, France), Benoit Huet (EURECOM, France), Martha Larson (Radboud University and TU Delft, Netherlands)
Program Chairs: Guillaume Gravier (CNRS-IRISA, France), Hayley Hung (Delft University of Technology, Netherlands), Chong-Wah Ngo (City University of Hong Kong, Hong Kong), Wei Tsang Ooi (National University of Singapore, Singapore)
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
detection
quality scoring
tracking
video text spotting
Qualifiers
- research-article
Conference

Acceptance Rates
MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%
Overall Acceptance Rate: 995 of 4,171 submissions, 24%
