DOI: 10.1145/2034617
MOCR_AND '11: Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
ACM 2011 Proceeding
Publisher:
  • Association for Computing Machinery
  • New York
  • NY
  • United States
Conference:
MOCR/AND '11: The Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data Beijing China 17 September 2011
ISBN:
978-1-4503-0685-0
Published:
17 September 2011

Abstract

It is our great pleasure to welcome all participants to the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data (J-MOCR-AND). This year's workshop is born of a merger between what would have been the Fifth Workshop on Analytics for Noisy Unstructured Text Data and the Third Workshop on Multilingual OCR. The two topics are naturally related and we look forward to an exciting exchange of new research ideas.

SESSION: Analytics for noisy unstructured text data
research-article
New method for the selection of binarization parameters based on noise features of historical documents

Historical documents generally contain different kinds of degradation. Because of these degradations, applying noise removal methods during a preprocessing stage seems necessary. Since the noise which exists in the original document can ...

research-article
A real-world noisy unstructured handwritten notebook corpus for document image analysis research

Traditionally, document image analysis (DIA) is conducted on datasets that are prepared for research purposes. Many existing handwriting datasets, however, do not necessarily represent the range of problems we wish to solve in real life. In this work, ...

research-article
Acquiring competitive intelligence from social media

Competitive intelligence is the art of defining, gathering and analyzing intelligence about competitors' products, promotions, sales, etc. from external sources. The Web is an important source for gathering competitive intelligence. News, ...

research-article
Experiments with artificially generated noise for cleansing noisy text

Recent works show that the problem of noisy text normalization can be treated as a machine translation (MT) problem with convincing results. There have been supervised MT approaches which use noisy-regular parallel data for training an MT model, as well ...

research-article
Adapting a WSJ trained part-of-speech tagger to noisy text: preliminary results

With the increase in the number of people communicating through the internet, there has been a steady increase in the amount of text available online. Most such text differs from the standard language, as people try to use various kinds of short forms ...

research-article
Tackling content spamming with a term weighting scheme

A term weighting scheme is described here which is able to circumvent the effect of web spam and content stuffing such as keyword stuffing, hidden unrelated text and meta tag stuffing. This scheme is composed of three components, namely, term frequency, ...

research-article
Segmenting eBay item descriptions into coherent sections

Item descriptions on an online e-Commerce site such as eBay consist of item-specific information along with generic information such as shipping and return policies, requests for feedback, and contact information. Extracting these textual segments from ...

research-article
Recognizing garbage in OCR output on historical documents

Erroneous tokens in the output of an OCR engine can be roughly divided into two categories. For less serious OCR errors, human readers, and in many cases text correction systems as well, are typically able to reconstruct the correct original word, or to ...

SESSION: Multilingual OCR
research-article
Experiences of integration and performance testing of multilingual OCR for printed Indian scripts

This paper presents an integration and testing scheme for managing a large multilingual OCR project. The project is an attempt to implement an integrated platform for OCR of different Indian languages. Software engineering, workflow management and testing ...

research-article
Topological features for recognizing printed and handwritten Bangla characters

In this paper, we present novel topological features based on the structural shape of a character. We detect the convex-shaped segments formed by the various strokes. The convex segments are then represented with shape primitives from a repertoire. The ...

research-article
Script based text identification: a multi-level architecture

Script identification in a multi-lingual document environment has numerous applications in the field of document image analysis, such as indexing and retrieval or as an initial step towards optical character recognition. In this paper, we propose a ...

research-article
Recognition of Tibetan wood block prints with generalized hidden Markov and kernelized modified quadratic distance function
Article No.: 12, pp 1–14, https://doi.org/10.1145/2034617.2034631

Recognition of Tibetan wood block prints is a difficult problem that has many challenging steps. We propose a two-stage framework involving image preprocessing, which consists of noise removal and baseline detection, and simultaneous character ...

research-article
Lampung - a new handwritten character benchmark: database, labeling and recognition

This research paper deals with our work on the creation and recognition of isolated Lampung characters, a script that originated in Indonesia. The aim is to describe this new script with all its peculiarities, propose a labeling scheme to manage a large ...

research-article
MAST: multi-script annotation toolkit for scenic text

This paper describes a semi-automatic tool for annotation of multi-script text from natural scene images. To our knowledge, this is the first tool that deals with multi-script text or arbitrary orientation. The procedure involves manual seed selection ...

research-article
Text level performance evaluation of Indic OCR using split & merge

A methodology to evaluate the text level performance of Indic OCR is proposed in this paper. Indian language OCR has been an active area of research for a long time, and there has been a lot of work in this challenging area. Because of the complexity of Indic ...

research-article
Unconstrained Bangla online handwriting recognition based on MLP and SVM

With the increasing popularity of pen-based digital devices, online handwriting recognition has generated potential markets in India. Hand-held devices equipped with pen-based technologies are now affordable to a large section of the Indian population. ...

research-article
Automatic localization of page segmentation errors

Page segmentation is a basic step in any character recognition system. Its failure is one of the major causes of the deteriorating overall accuracy of current Indian language OCR engines. Many segmentation algorithms have been proposed in the literature. Often ...

research-article
Sparsity-based super-resolution for offline handwriting recognition

We present a sparsity-based approach to super-resolution for handwritten document images, and demonstrate that it improves handwriting recognition accuracy. Given high-resolution training images, low- and high-resolution dictionaries are constructed by ...

Contributors
  • Tata Consultancy Services India
  • University at Buffalo, The State University of New York
  • Lehigh University
  • Amazon.com, Inc.
  • Ludwig-Maximilians-University Munich
  • Xerox Research Center India