It is our great pleasure to welcome all participants to the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data (J-MOCR-AND). This year's workshop is born of a merger between what would have been the Fifth Workshop on Analytics for Noisy Unstructured Text Data and the Third Workshop on Multilingual OCR. The two topics are naturally related and we look forward to an exciting exchange of new research ideas.
New method for the selection of binarization parameters based on noise features of historical documents
Historical documents generally contain different kinds of degradation. Due to these degradations, the application of noise removal methods during a preprocessing stage seems to be necessary. Since the noise which exists in the original document can ...
A real-world noisy unstructured handwritten notebook corpus for document image analysis research
Traditionally, document image analysis (DIA) is conducted on datasets that are prepared for research purposes. Many existing handwriting datasets, however, do not necessarily represent the range of problems we wish to solve in real life. In this work, ...
Acquiring competitive intelligence from social media
Competitive intelligence is the art of defining, gathering and analyzing intelligence about competitors' products, promotions, sales, etc. from external sources. The Web is an important source for gathering competitive intelligence. News, ...
Experiments with artificially generated noise for cleansing noisy text
Recent works show that the problem of noisy text normalization can be treated as a machine translation (MT) problem with convincing results. There have been supervised MT approaches which use noisy-regular parallel data for training an MT model, as well ...
Adapting a WSJ trained part-of-speech tagger to noisy text: preliminary results
With the increase in the number of people communicating through the internet, there has been a steady increase in the amount of text available online. Most such text differs from the standard language, as people try to use various kinds of short forms ...
Tackling content spamming with a term weighting scheme
A term weighting scheme is described here which is able to circumvent the effect of web spam and content stuffing such as keyword stuffing, hidden unrelated text and meta tag stuffing. This scheme is composed of three components, namely, term frequency, ...
Segmenting eBay item descriptions into coherent sections
Item descriptions on an online e-Commerce site such as eBay consist of item-specific information along with generic information such as shipping and return policies, requests for feedback, and contact information. Extracting these textual segments from ...
Recognizing garbage in OCR output on historical documents
Erroneous tokens in the output of an OCR engine can be roughly divided into two categories. For less serious OCR errors, human readers (and in many cases also text correction systems) are typically able to reconstruct the correct original word, or to ...
Experiences of integration and performance testing of multilingual OCR for printed Indian scripts
- Deepak Arya,
- C. V. Jawahar,
- Chakravorty Bhagvati,
- Tushar Patnaik,
- B. B. Chaudhuri,
- G. S. Lehal,
- Santanu Chaudhury,
- A. G. Ramakrishna
This paper presents an integration and testing scheme for managing a large multilingual OCR project. The project is an attempt to implement an integrated platform for OCR of different Indian languages. Software engineering, workflow management and testing ...
Topological features for recognizing printed and handwritten Bangla characters
In this paper, we present novel topological features based on the structural shape of a character. We detect the convex-shaped segments formed by the various strokes. The convex segments are then represented with shape primitives from a repertoire. The ...
Script based text identification: a multi-level architecture
Script identification in a multi-lingual document environment has numerous applications in the field of document image analysis, such as indexing and retrieval or as an initial step towards optical character recognition. In this paper, we propose a ...
Recognition of Tibetan wood block prints with generalized hidden Markov and kernelized modified quadratic distance function
Recognition of Tibetan wood block print is a difficult problem that has many challenging steps. We propose a two stage framework involving image preprocessing, which consists of noise removal and baseline detection, and simultaneous character ...
Lampung - a new handwritten character benchmark: database, labeling and recognition
This research paper deals with our effort in the creation and recognition of isolated Lampung characters, a script originating from Indonesia. The aim is to describe this new script with all its peculiarities, propose a labeling scheme to manage a large ...
MAST: multi-script annotation toolkit for scenic text
This paper describes a semi-automatic tool for annotation of multi-script text from natural scene images. To our knowledge, this is the maiden tool that deals with multi-script text or arbitrary orientation. The procedure involves manual seed selection ...
Text level performance evaluation of Indic OCR using split & merge
A methodology to evaluate the text level performance of Indic OCR is proposed in this paper. Indian language OCR has been an active area of research for a long time, and there has been a lot of work in this challenging area. Because of the complexity of Indic ...
Unconstrained Bangla online handwriting recognition based on MLP and SVM
With the increasing popularity of pen-based digital devices, online handwriting recognition has generated potential markets in India. Hand-held devices equipped with pen-based technologies are now affordable to a large section of the Indian population. ...
Automatic localization of page segmentation errors
Page segmentation is a basic step in any character recognition system. Its failure is one of the major causes of the deteriorating overall accuracy of current Indian language OCR engines. Many segmentation algorithms have been proposed in the literature. Often ...
Sparsity-based super-resolution for offline handwriting recognition
We present a sparsity-based approach to super-resolution for handwritten document images, and demonstrate that it improves handwriting recognition accuracy. Given high resolution training images, low and high resolution dictionaries are constructed by ...