Label Embedding: A Frugal Baseline for Text Recognition

Authors: Jose A. Rodriguez-Serrano, Albert Gordo, Florent Perronnin

Published in: International Journal of Computer Vision | Issue 3/2015

Publication date: 1 July 2015

Abstract

The standard approach to recognizing text in images consists of first classifying local image regions into candidate characters and then combining them with high-level word models such as conditional random fields. This paper explores a new paradigm that departs from this bottom-up view. We propose to embed word labels and word images into a common Euclidean space. Given a word image to be recognized, the text recognition problem is cast as one of retrieval: find the closest word label in this space. This common space is learned using the Structured SVM framework by enforcing matching label-image pairs to be closer than non-matching pairs. This method presents several advantages: it does not require ad-hoc or costly pre-/post-processing operations, it can build on top of any state-of-the-art image descriptor (Fisher vectors in our case), it allows for the recognition of never-seen-before words (zero-shot recognition), and the recognition process is simple and efficient, as it amounts to a nearest neighbor search. Experiments are performed on challenging datasets of license plates and scene text. The main conclusion of the paper is that with such a frugal approach it is possible to obtain results which are competitive with standard bottom-up approaches, thus establishing label embedding as an interesting and simple-to-compute baseline for text recognition.
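The retrieval view described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `embed_label` is a toy character-count embedding (not the paper's label embedding), the projections `U` and `V` are identity matrices standing in for learned parameters, and the "image descriptor" is a noisy copy of the true label embedding rather than a Fisher vector.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def embed_label(word):
    """Toy label embedding (hypothetical): L2-normalized character counts."""
    v = np.zeros(len(ALPHABET))
    for ch in word.lower():
        v[ALPHABET.index(ch)] += 1.0
    return v / np.linalg.norm(v)

def recognize(image_vec, lexicon, U, V):
    """Return the lexicon word closest to the image in the common space."""
    img = U @ image_vec                                        # image side
    labels = np.stack([V @ embed_label(w) for w in lexicon])   # label side
    dists = np.linalg.norm(labels - img, axis=1)               # Euclidean distances
    return lexicon[int(np.argmin(dists))]

# Toy usage: identity projections and a simulated image descriptor.
rng = np.random.default_rng(0)
lexicon = ["street", "hotel", "exit", "abc123"]
d = len(ALPHABET)
U = np.eye(d)  # assumed placeholder for a learned image projection
V = np.eye(d)  # assumed placeholder for a learned label projection
image_vec = embed_label("hotel") + 0.01 * rng.standard_normal(d)
result = recognize(image_vec, lexicon, U, V)
```

Because recognition is only a nearest neighbor search over label embeddings, a word can be added to `lexicon` without retraining, which is the sense in which such a scheme supports zero-shot recognition.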

An alternative upper bound is the slack-rescaled hinge loss \(\max_{y \in \mathcal{Y}} \Delta(y_n,y) (1 - F(x_n,y_n;w) + F(x_n,y;w))\). Note that in the 0/1 loss case, the two bounds are equivalent. See Nowozin and Lampert (2011, p. 120) for more details.
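The equivalence under the 0/1 loss can be checked numerically. The sketch below assumes that the bound this note contrasts with is the usual margin-rescaled hinge loss, \(\max_{y} \Delta(y_n,y) - F(x_n,y_n;w) + F(x_n,y;w)\); the scores and loss values are made up for illustration.

```python
import numpy as np

def margin_rescaled(scores, deltas, true_idx):
    """max_y [ Delta(y_n, y) - F(x_n, y_n) + F(x_n, y) ]"""
    return float(np.max(deltas - scores[true_idx] + scores))

def slack_rescaled(scores, deltas, true_idx):
    """max_y Delta(y_n, y) * (1 - F(x_n, y_n) + F(x_n, y))"""
    return float(np.max(deltas * (1.0 - scores[true_idx] + scores)))

# Made-up compatibility scores F(x_n, y; w) for four candidate labels.
scores = np.array([0.9, 1.4, -0.3, 0.5])
true_idx = 0

# 0/1 loss: zero for the correct label, one for every other label.
zero_one = (np.arange(len(scores)) != true_idx).astype(float)
m01 = margin_rescaled(scores, zero_one, true_idx)
s01 = slack_rescaled(scores, zero_one, true_idx)

# A graded loss (also made up) for which the two bounds differ.
graded = np.array([0.0, 2.0, 0.5, 1.0])
mg = margin_rescaled(scores, graded, true_idx)
sg = slack_rescaled(scores, graded, true_idx)
```

With the 0/1 loss, every incorrect label contributes the same term \(1 - F(x_n,y_n;w) + F(x_n,y;w)\) to both maxima and the correct label contributes zero, so `m01` and `s01` coincide; with the graded loss, `mg` and `sg` generally do not.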

Marginalization can be done “early”, by constructing a string representation that includes all possible symbols in the wildcard position (weighted by the size of the symbols’ alphabet), or “late”, by explicitly generating the set of queries that match the query with the wildcard and averaging the similarities of those queries with the image. Late marginalization is equivalent to generating the new set of queries, averaging them, and then computing the similarity between that average query and the image. The subtle differences between “early” and “late” marginalization are due only to the way the string representation is normalized. We focus on late marginalization since it obtained slightly better results than early marginalization.
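A minimal sketch of late marginalization follows, under several assumptions: the similarity is a dot product and therefore linear in the query embedding, `embed` is a toy one-hot position/symbol embedding (not the paper's string representation), the wildcard symbol is `?`, and `projected_image` is a random vector standing in for an image already mapped into the label space. Normalization is deliberately left out, since the note identifies it as the source of the difference between the two variants.

```python
import numpy as np

ALPHABET = "0123456789"

def embed(word):
    """Toy embedding (hypothetical): one-hot per (position, symbol), unnormalized."""
    v = np.zeros(len(word) * len(ALPHABET))
    for i, ch in enumerate(word):
        v[i * len(ALPHABET) + ALPHABET.index(ch)] = 1.0
    return v

def expand(query, wildcard="?"):
    """All strings obtained by replacing each wildcard with every alphabet symbol."""
    results = [""]
    for ch in query:
        options = ALPHABET if ch == wildcard else ch
        results = [r + o for r in results for o in options]
    return results

rng = np.random.default_rng(0)
query = "4?7"
projected_image = rng.standard_normal(len(query) * len(ALPHABET))

candidates = expand(query)

# Late marginalization: average the similarities of all matching queries.
late = np.mean([projected_image @ embed(c) for c in candidates])

# Equivalent form: average the query embeddings first, then compare once.
avg_query = np.mean([embed(c) for c in candidates], axis=0)
via_average = projected_image @ avg_query
```

The two quantities agree because the dot product is linear in the query embedding; the second form needs only one similarity computation per image.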

Almazán, J., Gordo, A., Fornés, A., & Valveny, E. (2013). Handwritten word spotting with corrected attributes. In ICCV.

Almazán, J., Gordo, A., Fornés, A., & Valveny, E. (2014). Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12), 2552–2566.

Bai, B., Weston, J., Grangier, D., Collobert, R., Chapelle, O., & Weinberger, K. (2009). Supervised semantic indexing. In CIKM.

Bazzi, I., Schwartz, R., & Makhoul, J. (1999). An omnifont open-vocabulary OCR system for English and Arabic. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(6), 495–504.

Bishop, C. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Bissacco, A., Cummins, M., Netzer, Y., & Neven, H. (2013). PhotoOCR: Reading text in uncontrolled conditions. In ICCV.

Brakensiek, A., & Rigoll, G. (2004). Handwritten address recognition using hidden Markov models. In Reading and Learning (pp. 103–122). Berlin: Springer.

Brakensiek, A., Rottland, J., Kosmala, A., & Rigoll, G. (2000). Off-line handwriting recognition using various hybrid modeling techniques and character n-grams. In ICFHR.

Breuel, T. M. (2001). Segmentation of handprinted letter strings using a dynamic programming algorithm. In ICDAR.

Bunke, H., Roth, M., & Schukat-Talamazzini, E. G. (1995). Off-line cursive handwriting recognition using hidden Markov models. Pattern Recognition, 28(9), 1399–1413.

Cash, G. L., & Hatamian, M. (1987). Optical character recognition by the method of moments. Computer Vision, Graphics, and Image Processing, 39(3), 291–310.

Chatfield, K., Lempitsky, V., Vedaldi, A., & Zisserman, A. (2011). The devil is in the details: an evaluation of recent feature encoding methods. In BMVC.

Chen, M. Y., Kundu, A., & Zhou, J. (1994). Off-line handwritten word recognition using a hidden Markov model type stochastic network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 481–496. doi:10.1109/34.291449.

Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In ECCV SLCV workshop.

Dutta, S., Sankaran, N., Sankar, K. P., & Jawahar, C. V. (2012). Robust recognition of degraded documents using character n-grams. In DAS.

El-Yacoubi, A., Sabourin, R., Suen, C. Y., & Gilloux, M. (1999). An HMM-based approach for off-line unconstrained handwritten word modeling and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8), 752–760.

Hinton, G., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv.

Jain, R., & Jawahar, C. (2010). Towards more effective distance functions for word image matching. In DAS (pp. 363–370). ACM.

Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., & Schmid, C. (2012). Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1704–1716.

Joachims, T. (2002). Optimizing search engines using clickthrough data. In SIGKDD.

Kedem, D., Tyree, S., Sha, F., Lanckriet, G. R., & Weinberger, K. Q. (2012). Non-linear metric learning. In NIPS.

Knerr, S., Augustin, E., Baret, O., & Price, D. (1998). Hidden Markov model based word recognition and its application to legal amount reading on French checks. Computer Vision and Image Understanding, 70(3), 404–419.

Koerich, A. L., Sabourin, R., & Suen, C. Y. (2003). Large vocabulary off-line handwriting recognition: A survey. Pattern Analysis and Applications, 6(2), 97–121.

Larochelle, H., Erhan, D., & Bengio, Y. (2008). Zero-data learning of new tasks. In AAAI.

Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.

LeCun, Y., Bottou, L., Orr, G., & Muller, K. (1998). Efficient backprop. In G. Orr & K. Muller (Eds.), Neural networks: Tricks of the trade. New York: Springer.

Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002). Text classification using string kernels. Journal of Machine Learning Research, 2, 419–444.

Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

Madhvanath, S., & Govindaraju, V. (2001). The role of holistic paradigms in handwritten word recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2), 149–164.

Marti, U. V., & Bunke, H. (2001). Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system. International Journal of Pattern Recognition and Artificial Intelligence, 15, 65–90.

Mensink, T., Verbeek, J., Perronnin, F., & Csurka, G. (2012). Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In NIPS.

Mishra, A., Alahari, K., & Jawahar, C. V. (2012). Scene text recognition using higher order language priors. In BMVC.

Mishra, A., Alahari, K., & Jawahar, C. V. (2012). Top-down and bottom-up cues for scene text recognition. In CVPR.

Mohamed, M. A., & Gader, P. D. (1996). Handwritten word recognition using segmentation-free hidden Markov modeling and segmentation-based dynamic programming techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(5), 548–554. doi:10.1109/34.494644.

Mori, S., Nishida, H., & Yamada, H. (1999). Optical character recognition. New York: Wiley.

Nagy, G. (2000). Twenty years of document image analysis in PAMI. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 38–62.

Neumann, L., & Matas, J. (2012). Real-time scene text localization and recognition. In CVPR.

Novikova, T., Barinova, O., Kohli, P., & Lempitsky, V. (2012). Large-lexicon attribute-consistent text recognition in natural images. In ECCV.

Nowozin, S., & Lampert, C. (2011). Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision.

Perronnin, F., & Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In CVPR.

Perronnin, F., Liu, Y., Sánchez, J., & Poirier, H. (2010). Large-scale image retrieval with compressed Fisher vectors. In CVPR.

Perronnin, F., Sánchez, J., & Liu, Y. (2010). Large-scale image categorization with explicit data embedding. In CVPR.

Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the Fisher kernel for large-scale image classification. In ECCV.

Rath, T. M., & Manmatha, R. (2003). Word image matching using dynamic time warping. In CVPR.

Rodríguez-Serrano, J. A., & Perronnin, F. (2012). A model-based sequence similarity with application to handwritten word spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2108–2120.

Rodriguez-Serrano, J. A., & Perronnin, F. (2013). Label embedding for text recognition. In BMVC.

Rodríguez-Serrano, J. A., Sandhawalia, H., Bala, R., Perronnin, F., & Saunders, C. (2012). Data-driven vehicle identification by image matching. In ECCV Workshop on Computer Vision for Vehicle Technology.

Sankar, K., Jawahar, C. V., & Manmatha, R. (2010). Nearest neighbor based collection OCR. In DAS.

Schölkopf, B., Smola, A., & Müller, K. R. (1998). Non-linear component analysis as a kernel eigenvalue problem. Neural Computation.

Senior, A. W., & Robinson, A. J. (1998). An off-line cursive handwriting recognition system. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 309–321. doi:10.1109/34.667887.

Vinciarelli, A., Bengio, S., & Bunke, H. (2004). Offline recognition of unconstrained handwritten texts using HMMs and statistical language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6), 709–720.

Wang, K., Babenko, B., & Belongie, S. (2011). End-to-end scene text recognition. In ICCV.

Wang, K., & Belongie, S. (2010). Word spotting in the wild. In ECCV.

Weston, J., Bengio, S., & Usunier, N. (2010). Large scale image annotation: Learning to rank with joint word-image embeddings. In ECML.

Williams, C., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In NIPS.

Yao, C., Bai, X., Shi, B., & Liu, W. (2014). Strokelets: A learned multi-scale representation for scene text recognition. In CVPR.

Zimmermann, M., Chappelier, J. C., & Bunke, H. (2006). Offline grammar-based recognition of handwritten sentences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(5), 818–821.

Title: Label Embedding: A Frugal Baseline for Text Recognition
Authors: Jose A. Rodriguez-Serrano, Albert Gordo, Florent Perronnin
Publication date: 1 July 2015
Publisher: Springer US
Published in: International Journal of Computer Vision / Issue 3/2015
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI: https://doi.org/10.1007/s11263-014-0793-6
