
2011 | Book

Information Systems for Indian Languages

International Conference, ICISIL 2011, Patiala, India, March 9-11, 2011. Proceedings

Editors: Chandan Singh, Gurpreet Singh Lehal, Jyotsna Sengupta, Dharam Veer Sharma, Vishal Goyal

Publisher: Springer Berlin Heidelberg

Book Series: Communications in Computer and Information Science


About this book

This book constitutes the refereed proceedings of the International Conference on Information Systems for Indian Languages, ICISIL 2011, held in Patiala, India, in March 2011. The 63 revised papers presented were carefully reviewed and selected from 126 paper submissions (full papers as well as poster papers) and 25 demo submissions. The papers address all current aspects of localization, e-governance, Web content accessibility, search engines and information retrieval systems, online and offline OCR, handwriting recognition, machine translation and transliteration, and text-to-speech and speech recognition - all with a particular focus on Indic scripts and languages.

Table of Contents

Frontmatter

Oral

A Novel Method to Segment Online Gurmukhi Script

Segmentation of handwritten script in general, and of Gurmukhi script in particular, is a critical task due to the shapes of the characters and the large variation in writing style across users. The data captured for online Gurmukhi script consists of the x, y coordinates of the pen position on the tablet, the pressure of the pen on the tablet, the time of each point and the pen-down status. A novel method to segment Gurmukhi script based on pressure, pen-down status and time together is presented in this paper. For cases where characters get over-segmented due to character shape or the user's style of writing, a method to merge the sub-strokes is presented. The proposed algorithms have been applied on a set of 2150 words and have given very good results.

Manoj K. Sachan, Gurpreet Singh Lehal, Vijender Kumar Jain
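
The segmentation step described above lends itself to a simple illustration. The sketch below splits an online ink stream into strokes whenever the pen lifts, the pressure drops, or the time gap between packets grows too large; the packet fields mirror the data listed in the abstract, but the field names and thresholds are assumptions for illustration, not the authors' values.

```python
def segment_strokes(packets, pressure_min=5, gap_ms=150):
    """Split an online-ink packet stream into strokes.

    packets: list of dicts with keys 'x', 'y', 'pressure', 'time', 'pen_down'
    (hypothetical field names). A new stroke starts when the pen lifts, the
    pressure falls below pressure_min, or the gap between consecutive
    packets exceeds gap_ms milliseconds.
    """
    strokes, current, prev_time = [], [], None
    for p in packets:
        lifted = (not p["pen_down"]) or p["pressure"] < pressure_min
        long_gap = prev_time is not None and (p["time"] - prev_time) > gap_ms
        if (lifted or long_gap) and current:
            strokes.append(current)        # close the running stroke
            current = []
        if not lifted:
            current.append((p["x"], p["y"]))
        prev_time = p["time"]
    if current:
        strokes.append(current)
    return strokes
```

Over-segmented sub-strokes would then be merged in a second pass, as the abstract notes; that step is not shown here.
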
Automatic Speech Segmentation and Multi Level Labeling Tool

An accurate, properly labeled speech corpus is very important for speech research. However, manual segmentation and labeling is laborious and error prone. This paper describes an automatic tool for segmenting and labeling Malayalam speech data. The tool is based on the Hidden Markov Model (HMM); the HMM Tool Kit is used for training, segmentation and labeling of the data. Special care was taken in the preparation of the pronunciation dictionary so that it covers most of the possible pronunciation variations. A syllabification rule is applied to the phone labels to generate syllable labels as well. The segmentation and labeling experiment was done on the speech corpus collected for building a text-to-speech system. The performance of the tool is reasonably good, as it shows only 19 ms average deviation compared to manual labels.

R. Ravindra Kumar, K. G. Sulochana, Jose Stephen
Computational Aspect of Verb Classification in Malayalam

In applications like morphological analyzers, Machine Aided Translation (MAT) and spell checkers, verb synthesis and/or generation are prime tasks. For a paradigm approach, verb classification is needed. There exist many verb classifications for Malayalam: Suranad Kunjan Pillai's classification contains sixteen classes, Wickremasinghe and Menon proposed eight, Sekhar and Glazov have twelve, Asher and Prabodhchandran Nair have four, and Valentine has two [1]. All descriptions focus on past tense forms, because the much simpler present and future tense forms are easily predictable. With regard to verbs, an entirely new item of work had to be undertaken: the verbs in the language present a multiplicity of conjugational forms which may perplex anyone who is not thoroughly familiar with them [3]. This paper focuses on the classification of verbs based on their past forms and the morphophonemic changes in the verb roots. This classification is primarily done for a rule based MAT system and can be used in similar NLP applications.

R. Ravindra Kumar, K. G. Sulochana, V. Jayan
Period Prediction System for Tamil Epigraphical Scripts Based on Support Vector Machine

Tamil is one of the ancient languages of the world, with records in the language dating back over two millennia. Epigraphical scripts are inscriptions written on various materials, and their study is vital to understanding the civilized past; hence classification of characters belonging to various periods is imperative before using the character bank of a particular period. Therefore a system is proposed for prediction of the period, which is done by examining a few characters, referred to as test characters, in the Tamil language. These test characters are sampled from the script automatically and matched with the characters available for different periods using machine intelligence. The proposed system has various modules such as binarization, thinning, segmentation, feature extraction and finally classification and period prediction using a Support Vector Machine. Its performance is most successful in differentiating between characters of four different centuries. The performance of the system is measured using four parameters: prediction rate, correction rate, error rate and time taken to predict the centuries. The system achieves an overall accuracy of 90.45%.

P. Subashini, M. Krishnaveni, N. Sridevi
Name Entity Recognition Systems for Hindi Using CRF Approach

This paper describes a Named Entity Recognition (NER) system for Hindi using a CRF approach. Our experiments with various feature combinations for Hindi NER are explained. The training set has been manually annotated with a Named Entity (NE) tagset of 12 tags. The performance of the system shows improvements when using the part-of-speech (POS) information of the current and surrounding words, together with gazetteer lists such as a name list, location name list, organization list and person prefix list. It has been observed that using prefix and suffix features helped a lot in improving the results. We have achieved Precision, Recall and F-score of 72.78%, 65.82% and 70.45% respectively for the current Hindi NER system. We have used the CRF++ toolkit for training and testing.

Rajesh Sharma, Vishal Goyal
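
As a rough sketch of the kind of per-token features the abstract lists (current and surrounding words and POS tags, prefixes, suffixes, gazetteer flags), the following function builds a feature dictionary that could be fed to a CRF toolkit. The feature names and gazetteer handling are illustrative assumptions, not the authors' CRF++ templates.

```python
def token_features(sent, i, person_prefixes=frozenset(), location_list=frozenset()):
    """Feature dict for token i of sent, a list of (word, pos_tag) pairs."""
    word, pos = sent[i]
    prev_w, prev_p = sent[i - 1] if i > 0 else ("<S>", "<S>")
    next_w, next_p = sent[i + 1] if i + 1 < len(sent) else ("</S>", "</S>")
    return {
        "word": word, "pos": pos,
        "prefix3": word[:3], "suffix3": word[-3:],            # affix features
        "prev_word": prev_w, "prev_pos": prev_p,              # left context
        "next_word": next_w, "next_pos": next_p,              # right context
        "prev_is_person_prefix": prev_w in person_prefixes,   # gazetteer flags
        "in_location_list": word in location_list,
    }
```
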
An N-Gram Based Method for Bengali Keyphrase Extraction

Keyphrases provide the subject metadata that gives clues about the content of a document. In this paper, we present a new method for Bengali keyphrase extraction. The proposed method has several steps, such as extraction of n-grams, identification of candidate keyphrases and assigning scores to the candidate keyphrases. Since Bengali is a highly inflectional language, we have developed a lightweight stemmer for stemming the candidate keyphrases. The proposed method has been tested on a collection of Bengali documents selected from a Bengali corpus downloadable from the TDIL website.

Kamal Sarkar
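
A minimal sketch of the n-gram candidate extraction and scoring pipeline the abstract outlines is given below; here the score is simply the frequency of the stemmed n-gram and the stemmer is a caller-supplied function, whereas the paper's actual scoring and lightweight Bengali stemmer are more elaborate.

```python
from collections import Counter

def candidate_keyphrases(tokens, stopwords, max_n=3, stem=lambda w: w):
    """Rank candidate keyphrases drawn from uni-, bi- and tri-grams.

    Candidates that start or end with a stopword are discarded; remaining
    n-grams are stemmed and scored by raw frequency (an illustrative score).
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in stopwords or gram[-1] in stopwords:
                continue
            counts[tuple(stem(w) for w in gram)] += 1
    return counts.most_common()
```
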
Feature Extraction and Recognition of Bengali Word Using Gabor Filter and Artificial Neural Network

Character recognition is an emerging area of research in the fields of image processing and pattern recognition. The objective here is to generate a Bengali dictionary and develop an artificial neural network based technique for matching an input word with a dictionary word. The technique uses the features of the word as a whole rather than the features of each character. For feature extraction, we have used a 2D Gabor filter. For dictionary words, the system shows 93.67% accuracy in matching, and for non-dictionary words it shows 83% accuracy in non-matching. The overall accuracy of the system is 91%.

Mahua Nandy (Pal), Sumit Majumdar
The Segmentation of Half Characters in Handwritten Hindi Text

Character recognition is an important stage of any text recognition system. In an Optical Character Recognition (OCR) system, the presence of half characters decreases the recognition rate. Because half characters touch full characters, determining the presence of a half character is a very challenging task. In this paper, we propose a new algorithm based on structural properties of text to segment the half characters in handwritten Hindi text. Results are shown for both handwritten and printed Hindi text. The proposed algorithm achieves a segmentation accuracy of 83.02% for half characters in handwritten text and 87.5% in printed text.

Naresh Kumar Garg, Lakhwinder Kaur, M. K. Jindal
Finding Influence by Cross-Lingual Blog Mining through Multiple Language Lists

Blogs have become one of the important sources of information on the internet, and nowadays a lot of Indian language content is being generated in the form of blogs. People express their opinions on various situations and events. The content in the blogs may contain named entities: names of people, places, and organizations. Named entities also include the names of eminent personalities who are famous within or outside that language community. The goal of this paper is to find the influence of a personality among cross-language bloggers. The approach we follow is to collect information from blog pages and index the named entities along with their probabilities of occurrence, after removing irrelevant information from the blog. When a user searches to find the influence of a personality through a query in an Indian language, we use a cross-language lexicon in the form of multiple parallel language lists to transliterate the query into other Indian languages, and mine blogs to return the influence of the personality across Indian language bloggers. An overview of the system and preliminary results are described.

Aditya Mogadala, Vasudeva Varma
Renaissance of Opinion Mining

People have only a short span of time, and the amount of information to be analyzed is very large. Opinion finding or sentiment analysis provides a quick indication to the user of whether a sentence expresses a positive or negative opinion. As the WWW grows ever more rapidly, more and more information becomes available on the web. Various sites provide everyday facilities such as shopping, blogs and consultancy. On a shopping site, users provide reviews for a particular product along with a rating, but reading every review (where thousands of reviews have been posted by users) is difficult and time consuming. Sentiment analysis or opinion finding provides a summarization and an overall opinion over all the reviews. Sentiment can be positive or negative, favorable or unfavorable. In this paper, we discuss the research work done by various researchers on sentiment analysis.

Ankur Rana, Vishal Goyal, Vimal K. Soni
OpenLogos Machine Translation: Exploring and Using It in Anusaaraka Platform

OpenLogos is the open source version of the Logos Machine Translation System. The current system translates from English and German into European languages (French, Italian, Spanish and Portuguese). This paper deals with extracting parse and other useful linguistic information from the English-German OpenLogos MT system. Understanding and extracting useful information from the linguistically rich diagnosis file is explained in detail. Various parse relations such as POS, clause boundary, dependency and constituent information are extracted and mapped to the Paninian format for use in the English to Hindi MT system Anusaaraka.

Sriram Chaudhury, Sukhada, Akshar Bharati
Role of e-Learning Models for Indian Languages to Implement e-Governance

E-learning is becoming a dominant delivery method in workplaces around the globe, in various sectors and organizations of varying sizes; it is by now a three-decade-old technology alongside computer based training and education. There is an essential need to design new and efficient e-learning models which can incorporate all the Indian languages spoken across the country for successful implementation of e-governance. Basically, e-learning models are attempts to develop a generalized framework to address the concerns of the learner and the challenges presented by the technology so that online learning can take place effectively. The growth of e-learning changes the very nature of education: how it is designed, administered, delivered, supported, and evaluated. In this paper, a set of e-learning models has been proposed for Indian languages to implement e-governance: to develop school networks, to upgrade non-formal systems to improve literacy and life skills, for teacher education, for development of policy in information and communications technology systems, and for modernizing curricula and learning methods.

Avinash Sharma, Vijay Singh Rathore
Retracted Chapter: A Compiler for Morphological Analyzer Based on Finite-State Transducers

Morphological analyzers are an essential part of many natural language processing (NLP) systems such as machine translation systems, and they may be efficiently implemented as finite state transducers. This paper describes a morphological system that can be used as a stemmer, lemmatizer, spell checker, POS tagger, and as an e-learning tool for learners of Kannada, giving detailed explanations of the various morphophonemic changes that occur in sandhi (saMdhi). The language specific components, the lexicon and the rules, can be combined with a runtime engine applicable to all languages. Building a morphological analyzer/generator for a morphologically complex and agglutinative language like Kannada is highly challenging. The major types of morphological processes, namely inflection, derivation, and compounding, are handled in this system.

Bhuvaneshwari C. Melinamath, A. G. Math, Sunanda D. Biradar
On Multifont Character Classification in Telugu

A major requirement in the design of robust OCRs is the invariance of the feature extraction scheme to the popular fonts used in print. Many statistical and structural features have been tried for character classification in the past. In this paper, motivated by recent successes in the object category recognition literature, we use a spatial extension of the histogram of oriented gradients (HOG) for character classification. Our experiments are conducted on 1,453,950 Telugu character samples in 359 classes and 15 fonts. On this data set, we obtain an accuracy of 96-98% with an SVM classifier.

Venkat Rasagna, K. J. Jinesh, C. V. Jawahar
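
For readers who want to experiment with the general recipe (HOG descriptors fed to an SVM), here is a small sketch using the stock HOG implementation from scikit-image and a linear SVM from scikit-learn, with random arrays standing in for real Telugu glyph images; the paper's spatial extension of HOG and its experimental setup are not reproduced here.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(glyph):
    """Histogram-of-oriented-gradients descriptor for one size-normalised glyph."""
    return hog(glyph, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

# Illustration only: random 48x48 "glyphs" stand in for real character samples.
rng = np.random.default_rng(0)
X = np.array([hog_features(rng.random((48, 48))) for _ in range(40)])
y = rng.integers(0, 4, size=40)                 # four pretend character classes

clf = LinearSVC(C=1.0, max_iter=5000).fit(X, y)
print(clf.predict(X[:5]))
```
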
Parallel Implementation of Devanagari Document Image Segmentation Approach on GPU

Fast and accurate algorithms are necessary for Optical Character Recognition (OCR) systems to perform operations on document images such as pre-processing, segmentation, feature extraction, training and testing of classifiers, and post-processing. The main goal of this work is to make segmentation accurate and faster for processing large numbers of Devanagari document images by implementing the algorithm in parallel on a Graphics Processing Unit (GPU). The proposed method makes extensive use of the highly multithreaded architecture and shared memory of the multi-core GPU. Efficient use of shared memory is required to optimize parallel reduction in the Compute Unified Device Architecture (CUDA). The proposed method achieved a speedup of 20x-30x over the serial implementation when running on a GeForce 9500 GT GPU.

Brijmohan Singh, Nitin Gupta, Rashi Tyagi, Ankush Mittal, Debashish Ghosh
A Rule Based Schwa Deletion Algorithm for Punjabi TTS System

Phonetically, schwa is a very short neutral vowel sound and, like all vowels, its precise quality varies depending on the adjacent consonants. During the utterance of words, not every schwa following a consonant is pronounced. In order to determine the proper pronunciation of words, it is necessary to identify which schwas are to be deleted and which are to be retained. Schwa deletion is an important step for the development of a high quality Text-To-Speech synthesis system. This paper specifically describes the schwa deletion rules for Punjabi written in Gurmukhi script. Performance analysis of the implemented rule based schwa deletion algorithm evaluates its accuracy to be 98.27%.

Parminder Singh, Gurpreet Singh Lehal
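
To make the idea of a rule based schwa deletion pass concrete, here is a toy sketch that applies a single, commonly cited context rule (delete the inherent schwa in a vowel-consonant_consonant-vowel context) to a phoneme sequence. The vowel inventory and the rule itself are placeholders; the Punjabi rule set described in the paper is larger and more careful.

```python
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "o"}   # placeholder inventory

def delete_schwas(phones, vowels=VOWELS):
    """Drop the inherent schwa 'a' when it sits in a VC_CV context."""
    out = []
    for idx, ph in enumerate(phones):
        in_context = (
            ph == "a" and 2 <= idx <= len(phones) - 3
            and phones[idx - 2] in vowels and phones[idx - 1] not in vowels
            and phones[idx + 1] not in vowels and phones[idx + 2] in vowels
        )
        if not in_context:                 # contexts are checked on the input,
            out.append(ph)                 # so deletions do not feed each other
    return out

print(delete_schwas(["k", "a", "m", "a", "l", "e"]))   # ['k', 'a', 'm', 'l', 'e']
```
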
Clause Based Approach for Ordering in MT Using OpenLogos

We propose an approach to improve the final output coming from the English-Hindi Anusaaraka system. To improve the final output, we need to refine the way the Hindi sentence is ordered in the existing system. The basic idea is to improve target language reordering by marking clausal level information using the OpenLogos diagnosis file. This paper presents how to use clausal information, along with pada and relation information (already used by the present Anusaaraka system), to reorder the Hindi sentence correctly.

Sriram Chaudhury, Arpana Sharma, Neha Narang, Sonal Dixit, Akshar Bharati
Comparison of Feature Extraction Methods for Recognition of Isolated Handwritten Characters in Gurmukhi Script

The present paper is a comparative study of different feature extraction techniques for recognition of isolated handwritten characters in Gurmukhi script. The whole process consists of three stages. The first, the feature extraction stage, analyzes the set of isolated characters and selects the set of features that can be used to uniquely identify the characters. For selecting a stable and representative set of features, Zoning, Directional Distance Distribution (DDD) and Gabor methods have been used. The second stage is the classification stage, which uses the features extracted in the first stage to identify the character; a Support Vector Machine (SVM) has been used for classification. In the third stage, the feature extraction methods are compared with respect to recognition rate. An annotated sample image database of isolated handwritten characters in Gurmukhi script has been prepared and used for training and testing the system. Gabor based feature extraction proved to be better compared to the others.

Dharam Veer Sharma, Puneet Jhajj
Dewarping Machine Printed Documents of Gurmukhi Script

During the scanning of bound documents, part of the document image is curled near the corners or near the binding, resulting in bending of text lines. This hard-to-tackle distortion makes recognition very difficult. A method has been proposed for estimation and removal of line bending deformations introduced in document images during the process of scanning. Estimation of the bend involves determining the side of the document on which curl is present and the direction of the bend. The method has been tested on a variety of printed Gurmukhi document images containing bent text lines at the page borders. The method consists of three stages. In the first stage, a decision methodology is proposed to locate the site and direction of the deformation. An elliptical approximation model is derived to estimate the amount of deformation in the second stage. Finally, a transformation process carries out the correction. Experiments show that the method works well under conditions where the pixel distribution is uniform.

Dharam Veer Sharma, Shilpi Wadhwa
Developing Oriya Morphological Analyzer Using Lt-Toolbox

In this paper we present the work done on developing a Morphological Analyzer (MA) for the Oriya language, following the paradigm approach. A paradigm defines all the word forms of a given stem, and also provides a feature structure associated with every word form. The analyzer consists of various paradigms under which nouns, adjectives, indeclinables (avyaya) and finite verbs of Oriya are classified. Further, we discuss the construction of the paradigms and the thought process that goes into it. The paradigms have been created using an XML based morphological dictionary from the Lt-toolbox package.

Itisree Jena, Sriram Chaudhury, Himani Chaudhry, Dipti M. Sharma
Durational Characteristics of Indian Phonemes for Language Discrimination

Speech is the most important and common means of communication. Human beings identify a language by looking at the acoustics and the letter to sound rules (LTS) that govern the language. But pronunciation is governed by the person’s exposure to his/her native language. This is a major issue while considering words, especially nouns in Indian languages. In this paper, a new methodology of analyzing phoneme durations for language discrimination is presented. The work has been carried out on a database built with words, mostly nouns, common to Hindi, Tamil and Telugu languages. Durational analysis of phonemes has been carried out on the collected database. Our results show that phoneme durations play a significant role in differentiating Hindi, Telugu and Tamil languages with regard to stop sounds, vowels and nasals.

B. Lakshmi Kanth, Venkatesh Keri, Kishore S. Prahallad
A Transliteration Based Word Segmentation System for Shahmukhi Script

Word segmentation is an important prerequisite for almost all Natural Language Processing (NLP) applications. Since the word is a fundamental unit of any language, almost every NLP system first needs to segment input text into a sequence of words before further processing. In this paper, Shahmukhi word segmentation is discussed in detail. The presented word segmentation module is part of a Shahmukhi-Gurmukhi transliteration system. Shahmukhi script is usually written without short vowels, leading to ambiguity. We have therefore designed a novel approach for Shahmukhi word segmentation in which we use target Gurmukhi script lexical resources instead of Shahmukhi resources. We employ a combination of techniques to build an effective algorithm, applying a syntactical analysis process that uses a Shahmukhi-Gurmukhi dictionary, writing system rules and statistical methods based on n-gram models.

Gurpreet Singh Lehal, Tejinder Singh Saini
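
The dictionary driven core of such a segmenter can be sketched as a greedy longest-match pass over the unsegmented text, as below; the toy romanised lexicon is purely illustrative, and the writing system rules and n-gram statistics that the paper combines with the lookup are omitted.

```python
def segment(text, lexicon, max_len=12):
    """Greedy longest-match word segmentation against a lexicon."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:       # longest dictionary entry wins
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])          # unknown character as a fallback
            i += 1
    return words

print(segment("gurmukhi", {"gur", "mukhi", "panj", "abi"}))   # ['gur', 'mukhi']
```
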
Optimizing Character Class Count for Devanagari Optical Character Recognition

Optical character recognition is a widely used technique for generating digital counterparts of printed or handwritten text. A lot of work has been done in the field of character recognition for the Devanagari script. Devanagari consists of several basic characters, half forms of characters, vowel modifiers and diacritics. From a character recognition point of view, only 78 character classes are sufficient for the identification of these characters. But in Devanagari the characters fuse with each other, which results in segmentation errors. Therefore, to avoid such errors, we consider such compound characters as separate recognizable units. We have identified 864 such compound characters, which makes a total of 942 recognizable units. But it is very difficult to handle such a large number of classes; therefore we have optimized the character class count. We have found that the first 100 classes can account for 98.0898% of the overall recognition.

Jasbir Singh, Gurpreet Singh Lehal
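
The class-count optimisation the abstract describes amounts to ranking classes by frequency and keeping the smallest prefix that reaches a coverage target. A minimal sketch of that selection, with made-up counts, is shown below.

```python
def classes_for_coverage(class_counts, target=0.98):
    """Smallest frequency-ranked prefix of classes reaching the coverage target."""
    total = sum(class_counts.values())
    chosen, covered = [], 0.0
    for cls, cnt in sorted(class_counts.items(), key=lambda kv: -kv[1]):
        chosen.append(cls)
        covered += cnt / total
        if covered >= target:
            break
    return chosen, covered

# Toy counts only; the paper ranks its 942 recognizable units in the same way.
print(classes_for_coverage({"ka": 500, "kha": 300, "ga": 150, "gha": 50}, 0.9))
```
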
Multifont Oriya Character Recognition Using Curvelet Transform

In this paper, we propose a new character recognition method for the Oriya script based on the curvelet transform. Multi-font Oriya character recognition has not been attempted previously; ten popular Oriya fonts have been used for the purpose of character recognition. The wavelet transform has widely been used for character recognition, but it cannot describe curve discontinuities well. We therefore use the curvelet transform, with recognition performed using curvelet coefficients. The method is suitable for Oriya character recognition as well as for the recognition of various other scripts. The proposed method is simple and effectively extracts the features in the target region, which characterize the characters better and represent them more robustly. The experimental results validate that the proposed method greatly improves recognition accuracy and efficiency over other traditional methods.

Swati Nigam, Ashish Khare
Exploiting Ontology for Concept Based Information Retrieval

Traditional approaches to information retrieval from textual documents are based on keyword similarity. A key limitation of these approaches is that they do not take into account the meaning of words and the semantic relationships between them. Recently some work has been done on concept based information retrieval (CBIR), which captures semantic relations between words in order to identify the importance of a word. These semantic relations can be explored by using an ontology. Most of the work on CBIR has been done for the English language. In this paper we explore the use of the Hindi WordNet ontology for CBIR from Hindi text documents. Our work is significant because only a very limited amount of work has been done on CBIR for Hindi documents. The basic motivation of this paper is to provide an efficient structure for representing concept clusters and to develop an algorithm for identifying concept clusters. Further, we suggest a way of assigning weights to words based on their semantic importance in the document.

Aditi Sharan, Manju Lata Joshi, Anupama Pandey
Parsing of Kumauni Language Sentences after Modifying Earley’s Algorithm

The Kumauni language is one of the relatively understudied regional languages of India. Here, we have attempted to develop a parsing tool for use in Kumauni language studies, with the eventual aim of developing a technique for checking the grammatical structure of sentences in Kumauni. For this purpose, we have taken a set of pre-existing Kumauni sentences and derived rules of grammar from them, which have been converted to a mathematical model using Earley's algorithm, suitably modified by us. The mathematical model so developed has been verified by testing it on a separate set of pre-existing Kumauni sentences. This model can be used for parsing new Kumauni sentences, thus providing researchers with a new parsing tool.

Rakesh Pandey, Nihar Ranjan Pande, H. S. Dhami
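
For readers unfamiliar with Earley parsing, a compact textbook recognizer is sketched below, with a toy grammar standing in for the rules derived from Kumauni sentences; it is not the authors' modified algorithm and does not handle empty productions.

```python
from collections import namedtuple

Rule = namedtuple("Rule", "lhs rhs")     # rhs is a tuple of symbols

def earley_recognize(grammar, start, tokens):
    """Return True if tokens can be derived from start under grammar."""
    chart = [set() for _ in range(len(tokens) + 1)]   # states: (rule, dot, origin)
    for rule in grammar:
        if rule.lhs == start:
            chart[0].add((rule, 0, 0))
    for i in range(len(tokens) + 1):
        added = True
        while added:
            added = False
            for (rule, dot, origin) in list(chart[i]):
                if dot < len(rule.rhs):
                    sym = rule.rhs[dot]
                    if any(r.lhs == sym for r in grammar):          # predictor
                        for r in grammar:
                            if r.lhs == sym and (r, 0, i) not in chart[i]:
                                chart[i].add((r, 0, i)); added = True
                    elif i < len(tokens) and tokens[i] == sym:      # scanner
                        chart[i + 1].add((rule, dot + 1, origin))
                else:                                               # completer
                    for (r2, d2, o2) in list(chart[origin]):
                        if d2 < len(r2.rhs) and r2.rhs[d2] == rule.lhs \
                                and (r2, d2 + 1, o2) not in chart[i]:
                            chart[i].add((r2, d2 + 1, o2)); added = True
    return any(r.lhs == start and d == len(r.rhs) and o == 0
               for (r, d, o) in chart[len(tokens)])

# Toy SOV-style grammar; real rules would come from the Kumauni sentence set.
grammar = [Rule("S", ("NP", "VP")), Rule("NP", ("noun",)),
           Rule("VP", ("NP", "verb")), Rule("VP", ("verb",))]
print(earley_recognize(grammar, "S", ["noun", "noun", "verb"]))     # True
```
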
Comparative Analysis of Gabor and Discriminating Feature Extraction Techniques for Script Identification

A considerable amount of success has been achieved in developing monolingual OCR systems for Indian scripts. But in a country like India, where many languages and scripts coexist, it is common for a single document to contain words from more than one script. Therefore a script identification system is required to select the appropriate OCR. This paper presents a comparative analysis of two different feature extraction techniques for word-level script identification. In this work, discriminating and Gabor filter based features are computed for Punjabi words and English numerals. The extracted features are fed to kNN and SVM classifiers to identify the script, and the recognition rates are then compared. It has been observed that by selecting an appropriate value of k and an appropriate kernel function, together with an appropriate combination of feature extraction and classification scheme, there is a significant drop in the error rate.

Rajneesh Rani, Renu Dhir, G. S. Lehal

Poster

Automatic Word Aligning Algorithm for Hindi-Punjabi Parallel Text

In this paper, an automatic alignment system for Hindi-Punjabi parallel texts at the word level is described. Automatic word alignment means that the parallel corpus is aligned word by word by the machine, accurately and without human interaction. Boundary-detection and minimum distance function approaches have been used to deal with multi-words. In the existing algorithm, only 1:1 partial word alignment had been done, with rather low accuracy, and no work had been implemented for multi-word alignment. To remove this limitation of the existing system, different techniques such as boundary detection, dictionary lookup and a scoring-based minimum distance function for word alignment have been used in the present system. After implementing the above mentioned techniques, the present system's accuracy was found to be 99% for one-to-one word alignment and 83% for multi-word alignment.

Karuna Jindal, Vishal Goyal, Shikha Jindal
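
The 1:1 part of such an aligner can be sketched as a dictionary lookup with a scoring fallback, as below. The similarity function is supplied by the caller (for instance an edit-distance score over transliterated forms), and the boundary detection used for multi-word units is not shown; all of this is an illustrative reading of the abstract rather than the authors' implementation.

```python
def align_one_to_one(src_words, tgt_words, lexicon, similarity, threshold=0.7):
    """Link each source word to the best target word, or to None.

    lexicon: dict mapping a source word to a set of target translations.
    similarity: callable(src, tgt) -> score in [0, 1], used when lookup fails.
    """
    links = []
    for i, sw in enumerate(src_words):
        candidates = lexicon.get(sw, set())
        best_j, best_score = None, threshold
        for j, tw in enumerate(tgt_words):
            score = 1.0 if tw in candidates else similarity(sw, tw)
            if score > best_score:
                best_j, best_score = j, score
        links.append((i, best_j))          # best_j stays None if nothing matched
    return links
```
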
Making Machine Translations Polite: The Problematic Speech Acts

In this paper, a study of politeness in a translated parallel corpus of Hindi and English is done. It presents how politeness in a Hindi text is translated into English. A theoretical model (consisting of different situations that may arise while translating politeness from one language to another and different consequences of these situations) has been developed to compare the politeness value in the source and the translated text. The polite speech acts of Hindi which are most likely to be translated improperly into English are described. Based on this description, such rules will be developed which could be fed into the MT systems so that the problematic polite speech acts could be handled effectively and efficiently by the machine while translating.

Ritesh Kumar
Tagging Sanskrit Corpus Using BIS POS Tagset

This paper presents the application of BIS POS tagset for tagging Sanskrit. Traditionally, the number of grammatical categories for Sanskrit varies from one to five [3]. The language has been exhaustively described in the tradition. And this description is still prevalent in today’s grammar teaching. In such a situation, the application of this tagset, which is a new paradigm with respect to Sanskrit, is a challenge. In this paper, we explore how this tagset could be used in categorizing/describing the language.

Madhav Gopal, Girish Nath Jha
Manipuri Transliteration from Bengali Script to Meitei Mayek: A Rule Based Approach

This paper describes the transliteration of Manipuri from the Bengali script to Meitei Mayek (Meitei script). So far, no work on Manipuri transliteration has been done, and since Manipuri is an Eighth Schedule language of the Indian Constitution, we felt it necessary to start with a rule based approach. A model and algorithm have been designed for transliterating Manipuri from the Bengali script to Meitei Mayek. Even though the model follows a simple rule based approach, to our surprise the algorithm achieved an accuracy of 86.28%.

Kishorjit Nongmeikapam, Ningombam Herojit Singh, Sonia Thoudam, Sivaji Bandyopadhyay
Online Handwriting Recognition for Malayalam Script

Online handwriting recognition refers to machine recognition of handwriting captured in the form of pen trajectories. This paper describes a trainable online handwriting recognition system for Malayalam using an elastic matching technique. Each character/stroke is subjected to a feature extraction procedure. The extracted features form the input to a nearest neighbour classifier, which returns the label having the minimum distance. The recognized characters are assigned their corresponding Unicode code points and are displayed using appropriate fonts. With a database containing 8389 handwritten samples, we get an average word recognition rate of 82%.

R. Ravindra Kumar, K. G. Sulochana, T. R. Indhu
Optimized Multi Unit Speech Database for High Quality FESTIVAL TTS

This paper describes the development of an optimized multi-unit speech database for a high quality concatenative TTS system. The core engine used for the development of the text-to-speech system is the open source FESTIVAL engine, and it has been tested for the Malayalam language. The optimal text selection algorithm selects the optimal text, ensuring maximum coverage of units without discarding entire low frequency units. In this work we created a multi-unit database with the syllable as the highest unit, ensuring coverage of all CV-VC units and phones.

R. Ravindra Kumar, K. G. Sulochana, T. Sajini
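
Optimal text selection of this kind is usually cast as a greedy set-cover problem: repeatedly pick the sentence that adds the most uncovered units until the target inventory (syllables, CV-VC pairs, phones) is covered. The sketch below shows that greedy loop only; the paper's handling of low-frequency units is not reproduced.

```python
def select_recording_script(sentences, units_of, target_units):
    """Greedy set-cover selection of sentences that cover target_units.

    units_of: callable returning the set of units (e.g. syllables) a sentence
    contains; target_units: the inventory that must be covered.
    """
    remaining = set(target_units)
    pool = [(s, set(units_of(s)) & set(target_units)) for s in sentences]
    chosen = []
    while remaining and pool:
        best = max(pool, key=lambda su: len(su[1] & remaining))
        sentence, units = best
        if not units & remaining:          # nothing left in the pool helps
            break
        chosen.append(sentence)
        remaining -= units
        pool.remove(best)
    return chosen, remaining               # remaining is non-empty if uncoverable
```
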
Comparative Analysis of Printed Hindi and Punjabi Text Based on Statistical Parameters

Statistical analysis of a language is a vital part of natural language processing. In this paper, a statistical analysis of printed Hindi text is performed and then compared with the analysis already available for printed Punjabi text. Besides analysis of character frequencies and word lengths, a more useful unigram and bigram analysis is done. Miscellaneous analyses, such as the percentage occurrence of various grouped characters and the number of distinct words and their coverage in the Hindi and Punjabi corpora, are also studied.

Lalit Goyal
Participles in English to Sanskrit Machine Translation

In this paper, we discuss the participle type of English sentences in our English to Sanskrit machine translation (EST) system. Our EST system is an integrated model of a rule based machine translation (RBMT) system with an artificial neural network (ANN) model, which translates an English sentence into an equivalent Sanskrit sentence. We use a feed-forward ANN for the selection of Sanskrit words such as nouns, verbs, objects and adjectives from the English to Sanskrit user data vector (UDV). Our system uses only morphological markings to identify the various parts of speech (POS) as well as the participle type of sentences.

Vimal Mishra, R. B. Mishra
Web-Drawn Corpus for Indian Languages: A Case of Hindi

Text in Hindi on the web has come of age since the advent of Unicode standards in Indic languages. The Hindi content has been growing by leaps and bounds and is now easily accessible on the web at large. For linguists and Natural Language Processing practitioners this could serve as a great corpus to conduct studies. This paper describes how good a manually collected corpus from the web could be. I start with my observations on finding the Hindi text and creating a representative corpus out of it. I compare this corpus with another standard corpus crafted manually and draw conclusions as to what needs to be done with such a web corpus to make it more useful for studies in linguistics.

Narayan Choudhary
Handwritten Hindi Character Recognition Using Curvelet Transform

In this paper, we propose a new approach to Hindi character recognition using the digital curvelet transform. The curvelet transform approximates the curved singularities of images well and is therefore very useful for extracting features from character images. The Devanagari script contains more than 49 characters (13 vowels and 33 consonants), and all the characters are rich in curve information. The input image is segmented first, and then curvelet features are obtained by calculating statistics of thick and thin images after applying the curvelet transform. The system is trained with a K-Nearest Neighbor classifier. The experiments are evaluated on an in-house dataset containing 200 images of the character set (each image contains all Hindi characters). The results obtained are very promising, with more than 90% recognition accuracy.

Gyanendra K. Verma, Shitala Prasad, Piyush Kumar
Challenges in Developing a TTS for Sanskrit

In this paper the authors present ongoing research on a Sanskrit Text-to-Speech (TTS) system called ‘Samvachak’ at the Special Centre for Sanskrit Studies, JNU. No TTS for Sanskrit has been developed so far. After reviewing the related research work, the paper focuses on the development of the different modules of the TTS system and the possible challenges. The research for the TTS can be divided into two categories: TTS independent linguistic study, and TTS related Research and Development (R&D). The TTS development is based on the Festival Speech Synthesis Engine.

Diwakar Mishra, Girish Nath Jha, Kalika Bali
A Hybrid Learning Algorithm for Handwriting Recognition

Generally, gradient based learning algorithms have shown reasonable performance in the training of multi layer feed forward neural networks, but they are still relatively slow in learning. In this context, a hybrid learning algorithm is used, combining the differential evolution (DE) algorithm and the Moore-Penrose (MP) generalized inverse, to classify handwritten Malayalam characters. DE is used to select the input weights and biases of a single layer feed forward neural network (SLFN), and the output weights are analytically determined with the MP inverse. A new set of features, known as division point distance from centroid (DPDC), is used to generate patterns for the classifier. The system provides an overall recognition accuracy of 83.98% while spending only 184 seconds on training.

Binu P Chacko, P. Babu Anto
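
The analytic half of the hybrid, solving the output weights of the SLFN with the Moore-Penrose pseudo-inverse once DE has fixed the input weights and biases, can be written in a few lines of NumPy. The shapes and synthetic data below are assumptions for illustration; the DE search itself and the DPDC features are not shown.

```python
import numpy as np

def slfn_output_weights(X, T, W, b):
    """Least-squares output weights for a single hidden layer network.

    X: (n_samples, n_features) inputs, T: (n_samples, n_classes) one-hot targets,
    W: (n_hidden, n_features) input weights, b: (n_hidden,) biases (from DE).
    """
    H = np.tanh(X @ W.T + b)                # hidden-layer activations
    return np.linalg.pinv(H) @ T            # Moore-Penrose solution

# Synthetic illustration of the shapes involved.
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 8))            # 20 samples of an 8-dim feature vector
T = np.eye(5)[rng.integers(0, 5, 20)]       # one-hot targets for 5 classes
W, b = rng.standard_normal((30, 8)), rng.standard_normal(30)   # stand-ins for DE output
beta = slfn_output_weights(X, T, W, b)
pred = np.argmax(np.tanh(X @ W.T + b) @ beta, axis=1)
```
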
Hindi to Punjabi Machine Translation System

Hindi and Punjabi being a closely related language pair, a hybrid machine translation approach has been used for developing the Hindi to Punjabi machine translation system. Non-availability of lexical resources, spelling variations in the source language text, ambiguous words in the source text, named entity recognition and collocations are the major challenges faced while developing this system. The key activities involved in the translation process are preprocessing, the translation engine and post-processing. Lookup algorithms, pattern matching algorithms etc. formed the basis for solving these issues. The system's accuracy has been evaluated using an intelligibility test, an accuracy test and the BLEU score. The hybrid system is found to perform better than the constituent systems.

Vishal Goyal, Gurpreet Singh Lehal
Cascading Style Sheet Styling Issues in Punjabi Language

This paper describes the styling issues in Punjabi Websites using Cascading Style Sheets (CSS) in various web browsers. Seven different styling issues for Indian Languages have been identified by other researchers. It has been noted that most Punjabi websites make use of only underline and hyperlinks for styling the web content. To test all the styling issues, we developed our own testing website for Punjabi with the use of CSS. We have checked all the styling issues in six different browsers. The results of comparative study in different browsers are presented in this paper.

Swati Mittal, R. K. Sharma, Parteek Bhatia
Translation of Hindi se to Tamil in a MT System

The paper attempts to describe how a word like se of Hindi can be a challenging task for a Machine Translation (MT) system. In most of the literature on Hindi and Urdu, a noun marked with se is assigned instrumental or ablative case, so se is called an instrumental and ablative case marker. But a close look at its distribution shows that apart from the instrumental and ablative case functions, it denotes other functions as well, and in each of these types it is translated differently in Tamil.

Sobha Lalitha Devi, P. Pralayankar, V. Kavitha, S. Menaka
Preprocessing Phase of Punjabi Language Text Summarization

Punjabi text summarization is the process of condensing source Punjabi text into a shorter version, preserving its information content and overall meaning. It comprises two phases: 1) pre-processing and 2) processing. Pre-processing produces a structured representation of the Punjabi text. This paper concentrates on the pre-processing phase of Punjabi text summarization. The various sub-phases of pre-processing are: Punjabi word boundary identification, Punjabi stop word elimination, Punjabi noun stemming, finding common English-Punjabi noun words, finding Punjabi proper nouns, Punjabi sentence boundary identification, and identification of Punjabi cue phrases in a sentence.

Vishal Gupta, Gurpreet Singh Lehal
Comparative Analysis of Tools Available for Developing Statistical Approach Based Machine Translation System

Statistical Machine Translation (SMT) models take the view that every sentence in the target language is a translation of the source language sentence with some probability; the best translation, of course, is the sentence that has the highest probability. A large sample of human-translated text (a parallel corpus) is examined by the SMT algorithms for automatic learning of translation parameters. SMT has undergone tremendous development in the last two decades, and a large number of tools have been developed for it and put to work on different language pairs with fair accuracy. This paper gives a brief introduction to Statistical Machine Translation, describes the tools available for developing SMT systems, and presents a comparative study of them. The paper will help researchers find information about SMT tools in one place.

Ajit Kumar, Vishal Goyal
Discriminative Techniques for Hindi Speech Recognition System

For the last two decades, research in the field of automatic speech recognition (ASR) has been carried out intensively worldwide, motivated by advances in signal processing techniques, pattern recognition algorithms, computational resources and storage capability. Most state-of-the-art speech recognition systems are based on the principles of statistical pattern recognition. In such systems, the speech signal is captured and preprocessed at the front-end for feature extraction and evaluated at the back-end using continuous density hidden Markov models (CDHMM). Maximum likelihood estimation (MLE) and several discriminative training methods have been used to train ASR systems for European languages such as English. This paper reviews the existing discriminative techniques, such as maximum mutual information estimation (MMIE), minimum classification error (MCE), and minimum phone error (MPE), and presents a comparative study in the context of Hindi language ASR. The system is speaker independent and works with a medium size vocabulary in typical field conditions.

Rajesh Kumar Aggarwal, Mayank Dave
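
For reference, the objective maximised by MMIE training can be stated as follows (a textbook formulation, not taken from the paper), where O_r is the r-th training utterance, s_r its reference transcription, κ an acoustic scaling factor, and the denominator sum runs over competing hypotheses, in practice a word lattice:

```latex
F_{\mathrm{MMI}}(\lambda) \;=\; \sum_{r=1}^{R}\log
\frac{p_{\lambda}(O_r \mid s_r)^{\kappa}\,P(s_r)}
     {\sum_{s} p_{\lambda}(O_r \mid s)^{\kappa}\,P(s)}
```

MCE and MPE replace this criterion with a smoothed error count and a phone-level accuracy expectation respectively.
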
An Experiment on Resolving Pronominal Anaphora in Hindi: Using Heuristics

India is a multilingual, linguistically dense and diverse country with rich resources of information. In this paper we describe heuristics used in a pre-processor layer on top of existing anaphora resolution approaches for the Hindi language, and test them in an experiment. The experiment was conducted on pronouns, and the results are presented as observations.

Kiran Pala, Rafiya Begum
A Novel GA Based OCR Enhancement and Segmentation Methodology for Marathi Language in Bimodal Framework

Automated learning systems used to extract information from images play a major role in document analysis. Optical character recognition (OCR) has been widely used to automatically segment and index documents from a wide space. Most of the methods used for OCR recognition and extraction mentioned in the literature, such as HMMs and neural networks, produce errors which require human operators to rectify, and fail on images with blur as well as illumination variance. This paper proposes an enhancement-supported, threshold based pre-processing methodology for word spotting in Marathi printed bimodal images using image segmentation. The methodology makes use of an enhanced image obtained by histogram equalization, followed by image segmentation using a specific threshold. The threshold is obtained using genetic algorithms: the GA based segmentation technique is codified as an optimization problem used to efficiently search the maxima and minima of the image histogram to obtain the threshold for segmentation. The system described is capable of extracting normal as well as blurred images and images under different lighting conditions. The same inputs are tested with a standard GA based methodology and the results are compared with the proposed method. The paper further elaborates the limitations of the method.

Amarjot Singh, Ketan Bacchuwar, Akash Choubey
Panmozhi Vaayil - A Multilingual Indic Keyboard Interface for Business and Personal Use

A multilingual Indic keyboard interface is an input method that can be used to input text in any Indic language. The input can follow the phonetic style, making use of the standard QWERTY layout, along with support for popular keyboard and typewriter layouts [1] of Indic languages using overlays. Indic-keyboards provides a simple and clean interface supporting multiple languages and multiple styles of input, working on multiple platforms. XML based processing makes it possible to add new layouts or new languages on the fly. These features, along with the provision to change key maps in real time, make this input method suitable for most, if not all, text editing purposes. Since Unicode is used to represent text, the input method works with most applications. It is available for free download and free use by individuals or commercial organizations, on code.google.com under the Apache 2.0 license.

H. R. Shiva Kumar, Abhinava Shivakumar, Akshay Rao, S. Arun, A. G. Ramakrishnan
Power Spectral Density Estimation Using Yule Walker AR Method for Tamil Speech Signal

Windowing has always been an active topic of research in digital signal processing; it is mainly used for leakage reduction in spectral analysis. In this paper, the effect of windowing on power spectral density estimation of Tamil speech signals is analyzed. Four different window functions are implemented and their performances are evaluated based on parameters such as sidelobe level, fall-off and gain. Based on the experiments, it is found that the Hamming window best suits the Tamil speech signal: it reduces the spectral discontinuities of the signal, and this effect is taken as the key metric for estimating the spectral power of the Tamil speech signal. Here the Power Spectral Density (PSD) estimate is computed using parametric and non-parametric methods. The reduction of the noise ratio in the PSD is considered as the parameter and is estimated through the crest factor. Finally, the paper concludes with the need for the best windowing method for PSD estimation, particularly in parametric techniques. Evaluation is handled both objectively and subjectively for Tamil speech datasets.

V. Radha, C. Vimala, M. Krishnaveni
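
A generic NumPy/SciPy sketch of Yule-Walker AR spectral estimation on a windowed frame is given below; it is a standard formulation, not the authors' code, and the model order, FFT size and synthetic test frame are arbitrary choices.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def yule_walker_psd(x, order=12, nfft=512, fs=16000):
    """AR(order) power spectral density estimate via the Yule-Walker equations."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:] / len(x)   # biased autocorrelation
    a = solve_toeplitz(r[:order], r[1:order + 1])               # AR coefficients
    sigma2 = r[0] - a @ r[1:order + 1]                          # driving-noise power
    w = np.linspace(0, np.pi, nfft)
    denom = np.abs(1 - np.exp(-1j * np.outer(w, np.arange(1, order + 1))) @ a) ** 2
    return w * fs / (2 * np.pi), sigma2 / denom                 # (freqs in Hz, PSD)

# Windowed toy frame: an 800 Hz tone shaped by a Hamming window.
fs = 16000
t = np.arange(400) / fs
frame = np.hamming(400) * np.sin(2 * np.pi * 800 * t)
freqs, psd = yule_walker_psd(frame, order=12, fs=fs)
print(freqs[np.argmax(psd)])    # the peak should fall near 800 Hz
```
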
Challenges in NP Case-Mapping in Sanskrit Hindi Machine Translation

Sanskrit and Hindi are considered structurally close owing to their genealogical relationship. However, on a closer look, Hindi appears to have diverged significantly more in terms of structure than in lexical ingenuities. Gender and number distinctions, the ergative, postpositions, verb groups, double causatives and echo constructions are some (among many) remarkable structural innovations that Hindi has gone through over the ages. While the structure of the Sanskrit vibhakti was fairly organized, the same may not be true for Hindi. The present paper is a study in mapping Sanskrit Noun Phrase (NP) case markers to Hindi for Machine Translation (MT) purposes, with a view to evolving a cross-linguistic model for Indian languages.

Kumar Nripendra Pathak, Girish Nath Jha
Modified BLEU for Measuring Performance of a Machine-Translation Software

The BLEU score compares the various word n-grams of an MT system's output with an ideal or reference translation of a sentence. We suggest a Weighted BLEU (WBLEU) which is probably more suitable for translation systems from English into an Indian language. The weights are obtained so that the correlation between the weighted BLEU scores and a human evaluator's scores is maximized.

Kalyan Joshi, M. B. Rajarshi
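
For context, standard BLEU combines the modified n-gram precisions p_n with uniform weights w_n = 1/N and a brevity penalty; the WBLEU described in the abstract instead fits the weights so that the score correlates best with human judgements. A textbook statement of the score (not reproduced from the paper), with c and r the candidate and reference lengths:

```latex
\mathrm{BLEU} \;=\; \mathrm{BP}\cdot\exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad
\mathrm{BP} \;=\;
\begin{cases}
1 & \text{if } c > r,\\[2pt]
e^{\,1 - r/c} & \text{if } c \le r.
\end{cases}
```
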

Demo Abstracts

A System for Online Gurmukhi Script Recognition

Handwriting recognition is the task of transforming a language represented in its spatial form of graphical marks into its symbolic representation. There are two types of handwriting recognition: offline and online. In offline handwriting recognition, the user writes on paper, which is digitized by a scanner; the output of the scanner is presented as an image to the system, which recognizes the writing. In contrast, online handwriting recognition requires that the user's writing be captured through a digitizer pen and tablet before recognition. Online handwriting recognition is important because it is still much more convenient to write with a pen than to type on a keyboard. Secondly, many PDAs and handheld devices are in use these days where it is easier to work with a stylus than with a keyboard. This has motivated research in online handwriting recognition in different languages of the world, including Indic scripts such as Tamil, Telugu, Kannada, Devanagari and Gurmukhi. In our work, a system for recognition of Gurmukhi script is presented. The input of the user's handwriting is taken as a sequence of packets captured through the movement of the stylus or pen on the surface of the tablet. A packet consists of the x, y position of the stylus, the button state (tip of the stylus), the pressure of the stylus and the time of each packet. The user's writing is preprocessed and segmented into meaningful shapes. The segmented shapes are processed to extract features, which are Distributed Directional Features. The feature data is fed to the recognition engine, which is a Nearest Neighbor classifier. The average recognition accuracy is approximately 76%. A block diagram of the system for online Gurmukhi script recognition is shown in Fig. 1. The main strength of this system is that it takes a complete word for segmentation and recognition.

Manoj K. Sachan, Gurpreet Singh Lehal, Vijender Kumar Jain
Spoken Isolated Word Recognition of Punjabi Language Using Dynamic Time Warp Technique

This research work develops a speech recognition system for speaker dependent, real time, isolated words of the Punjabi language. The methods used for speech recognition have been developed and improved with increasing accuracy and efficiency, leading to a better human-machine interface. In this work, I have developed a speech recognition system which has a medium size dictionary of isolated Punjabi words. The study involved detailed learning of the various phases of the signal modeling process, such as preprocessing and feature extraction, as well as the study of the multimedia API (Application Programming Interface) implemented in Windows 98/95 or above. Visual C++ has been used to program the Sound Blaster using MCI (Media Control Interface) commands. In this system the input speech is captured with the help of a microphone, with MCI commands used to record the speech. The sampling frequency is 16 kHz, the sample size is 8 bits, and a single (mono) channel is used. Vector Quantization and Dynamic Time Warping (DTW) have been used for the recognition system, and some modifications have been proposed to the noise detection and word detection algorithms. In this work, a vector quantization codebook of size 256 is used; this size was selected on the basis of experimental results, with experiments performed for different codebook sizes (8, 16, 32, 64, 128, and 256). DTW has two modes: a training mode and a testing mode. In training mode, a database of the features (LPC coefficients or LPC-derived coefficients) of the training data is created. In testing mode, the test pattern (the features of the test token) is compared with each reference pattern using dynamic time warp alignment, which simultaneously provides a distance score associated with the alignment. The distance scores for all the reference patterns are sent to a decision rule, which returns the word with the least distance as the recognized word. A symmetrical DTW algorithm is used in the implementation of this work. The system, with a small isolated-word vocabulary of Punjabi, gives 94.0% accuracy. The system can recognize 20-24 words per minute of an interactive nature, with recording times of 3 to 2.5 seconds respectively.

Ravinder Kumar, Mohanjit Singh
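
The dynamic time warp alignment at the heart of the demo can be sketched in a few lines. The sketch below compares sequences of per-frame feature vectors (e.g. LPC-derived coefficients) directly and omits the vector quantization, endpoint detection and decision-rule details described above.

```python
import numpy as np

def dtw_distance(ref, test):
    """Length-normalised DTW distance between two feature-vector sequences."""
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(ref[i - 1] - test[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

def recognise(test, templates):
    """Return the vocabulary word whose reference template warps closest to test."""
    return min(templates, key=lambda word: dtw_distance(templates[word], test))
```
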
Text-To-Speech Synthesis System for Punjabi Language

A Text-To-Speech (TTS) synthesis system has been developed for Punjabi text written in Gurmukhi script. A concatenative method has been used to develop this TTS system. Syllables have been reported to be a good choice of speech unit for speech databases of many languages. Since Punjabi is a syllabic language, the syllable has been selected as the basic speech unit for this TTS system, which preserves within-unit co-articulation effects. The working of this Punjabi TTS system can be divided into two modules: an online process and an offline process. The online process is responsible for pre-processing of the input text, schwa deletion, syllabification and then searching for the syllables in the speech database. Pre-processing involves the expansion of abbreviations, numeric figures, special symbols etc. Schwa deletion is an important step for the development of a high quality Text-To-Speech synthesis system. Phonetically, schwa is a very short neutral vowel sound and, like all vowels, its precise quality varies depending on the adjacent consonants. During the utterance of words, not every schwa following a consonant is pronounced. In order to determine the proper pronunciation of words, it is necessary to identify which schwas are to be deleted and which are to be retained. Grammar rules, inflectional rules and the morphotactics of the language play an important role in identifying the schwas that are to be deleted. A rule based schwa deletion algorithm has been developed for Punjabi, with an accuracy of about 98.27%. Syllabification of the words of the input text is also a challenging task, as defining a syllable in a language is complex. There are many theories available in phonetics and phonology to define a syllable: in phonetics, syllables are defined based upon articulation, whereas in the phonological approach, syllables are defined by the different sequences of phonemes. In every language, certain sequences of phonemes are recognized. In Punjabi, seven types of syllables are recognized: V, VC, CV, VCC, CVC, CVCC and CCVC (where V and C represent vowel and consonant respectively), which combine in turn to produce words. A syllabification algorithm for Punjabi has been developed, with an accuracy of about 96.7%, which works on the output of the schwa deletion algorithm.

The offline process of this TTS system involved the development of the Punjabi speech database. In order to minimize the size of the speech database, an effort has been made to select a minimal set of syllables covering almost the whole Punjabi word set. To accomplish this, all Punjabi syllables have been statistically analyzed on a Punjabi corpus having more than 104 million words. Interesting and very important results have been obtained from this analysis, which help to select a relatively small set of the most frequently occurring syllables (about the first ten thousand syllables, 0.86% of the 1,156,740 available syllables) with a cumulative frequency of occurrence just under 99.81%. The developed Punjabi speech database stores the starting and end positions of the selected syllable sounds, labeled carefully in a wave file of recorded words. As a syllable sound varies depending upon its position (starting, middle or end) in the word, separate entries for these three positions have been made in the database for each syllable. An algorithm based on the set covering problem has been developed for selecting the minimum number of words containing the selected syllables, for recording the sound file in which the syllable positions are marked.

The syllables of the input text are first searched in the speech database for the corresponding syllable-sound positions in the recorded wave file, and then these syllable sounds are concatenated. Normalisation of the synthesized Punjabi sound is done in order to remove the discontinuities at the concatenation points, hence producing smooth, natural sound. Good quality sound is produced by this TTS system for the Punjabi language.

Parminder Singh, Gurpreet Singh Lehal
Hand-Filled Form Processing System for Gurmukhi Script

A form processing system improves the efficiency of data entry and analysis in offices using state-of-the-art technology. It typically consists of several sequential tasks or functional components, viz. form designing, form template registration, field isolation, bounding box removal or colour dropout, field-image extraction, segmentation, feature extraction from the field image, and field recognition. The major challenges for a form processing system are the large quantity of forms and the large variety of writing styles of different individuals.

Some of the Indian scripts have very complex structures, e.g. Gurmukhi, Devanagari and Bengali. The use of a headline, the appearance of vowels, parts of vowels or half characters above the headline and below the normal characters (in the foot), and compound characters make segmentation, and consequently recognition, very difficult.

The present system is a pioneering effort at developing a form processing system for an Indian language. The system covers form template generation, form image scanning and digitization, pre-processing, feature extraction, classification and post-processing. Pre-processing covers form level skew detection, field data extraction by field frame boundary removal, field segmentation, word level skew correction, word segmentation, character level slant correction and size normalization. For feature extraction, Zoning, DDD and Gabor filters have been used, and for classification, kNN and SVM have been put to use. A new method has been developed for post-processing based on the shape similarity of handwritten characters.

The results of using the kNN classifier for different values of k, with all features combined, are 72.64 percent for alphabets and 93.00 percent for digits. With SVM as the classifier and all features combined, the results improve marginally (73.63 percent for alphabets and 94.83 percent for digits). In this demo we demonstrate the working of the whole system.

Dharam Veer Sharma, Gurpreet Singh Lehal
Urdu to Hindi and Reverse Transliteration System

In spoken form Hindi and Urdu are mutually comprehensible, but they are written in mutually incomprehensible scripts. This research work aims to bring Urdu and Hindi speaking people closer by developing a transliteration tool for the two languages. Even though the underlying spoken language is largely the same, developing a high-accuracy transliteration system is not a trivial job. The statistical and rule-based Urdu-Hindi and reverse transliteration systems achieve word-level accuracies of 97.12% and 99.46% respectively.
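To illustrate why the task is non-trivial, the toy rule below maps a handful of Urdu consonants to Devanagari one-for-one. Because written Urdu usually omits short vowels and many correspondences are context-dependent, such direct rules alone cannot reach the accuracies quoted above, which is why a statistical component matters. The mapping and function name are purely illustrative and are not taken from the paper.

# Toy character-level Urdu-to-Devanagari mapping (illustrative subset only).
URDU_TO_DEVANAGARI = {
    "ب": "ब", "پ": "प", "ت": "त", "د": "द", "ر": "र",
    "س": "स", "ک": "क", "ل": "ल", "م": "म", "ن": "न",
}

def naive_transliterate(text):
    # characters without a rule are passed through unchanged
    return "".join(URDU_TO_DEVANAGARI.get(ch, ch) for ch in text)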

Gurpreet Singh Lehal, Tejinder Singh Saini, V. S. Kalra
iPlugin: Indian Language Web Application Development Tool

iPlugin is an Indian language web application development software tool. It allows users to type in Indian languages in web pages over the Internet and helps developers build interactive applications for users in Indian languages. iPlugin is ideal for creating interactive applications such as online chat, localized database query applications, blogs typed in one's own language, feedback and e-mail in Indian languages, reports, or any other application that requires support for typing in Indian languages over the Web. iPlugin helps in the creation of Indian language web content for the front end and back end of internet and intranet portal solutions.

Anup Kanaskar, Vrundesh Waghmare
An OCR System for Printed Indic Scripts

The project ‘Development of Robust Document Analysis and Recognition for Printed Indian Scripts’ is a Department of Information Technology sponsored project to develop OCR for printed Indian scripts. A consortium led by IIT Delhi has completed phase I of the OCR work. The consortium members include:

1. IIT Delhi
2. IISc Bangalore
3. ISI Kolkata
4. IIIT Hyderabad
5. Central University, Hyderabad
6. Punjabi University, Patiala
7. MS University, Baroda
8. Utkal University, Bhubaneswar
9. CDAC Noida
10. CDAC Pune

Different consortium members are responsible for OCRs for different languages; for example, Punjabi University has contributed the Gurmukhi OCR and IIIT Hyderabad the Malayalam OCR. CDAC Noida has integrated the OCRs with the pre-processing modules.

Tushar Patnaik
Gujarati Text – To – Speech System

The need for text-to-speech systems in various languages is obvious given the current fast-paced development in information and communication technology. Keeping the Gujarati language, used by some 55 million people in India and abroad, abreast of technological development is not just logical but heartfelt.

Samyak Bhuta, S. Rama Mohan
Large Web Corpora for Indian Languages

A crucial resource for language technology development is a corpus. It should, if possible, be large and varied; otherwise it may fail to cover all the core phenomena of the language, and tools based on it will sometimes fail because they encounter something that was not present in the development corpus. Corpora are critical for the development of morphological analysers because they provide a good sample of all the words, in all their forms, that the analyser might be expected to handle. Since the advent of the web, corpus development has become much easier: for many languages and many text types, vast quantities of text are now available at a mouse-click (Kilgarriff and Grefenstette 2003).

‘Corpora for all’ is our company’s mission. We want to encourage corpus use by linguists, language learners and language technologists in all sorts of contexts. To that end we have set up a ‘Corpus Factory’ which quickly and efficiently creates general corpora, from the web, for any widely spoken language (Kilgarriff et al 2010). We started with the large languages of Europe and East Asia; we now have large corpora for all of those and have moved on to the many large languages of the subcontinent (Hindi, Bengali, Malayalam, Telugu, Kannada, Urdu, Gujarati, Tamil, etc.). At the time of writing, all of these corpora run to many millions of words. We believe they are larger and more varied than any others available for the languages in question.

Once we have created a corpus, we load it into the Sketch Engine corpus query system (Kilgarriff et al 2004) and make it available through the web service at http://www.sketchengine.co.uk. (Sign up for a free trial; all the corpora listed above, and more as the months proceed, will be available for you to explore.)

The Sketch Engine is a web-based corpus query system which takes as its input a corpus of any language with an appropriate level of linguistic mark-up and offers a number of language-analysis functions such as concordances, word sketches, a distributional thesaurus and sketch differences. A concordance is a display of all occurrences in the corpus matching a given query; the system accepts simple queries (a lemma) as well as complex queries in CQL [4] format. A word sketch is a corpus-based summary of a word’s grammatical and collocational behaviour. The system also checks which words occur with the same collocates as other words and, on the basis of this data, generates a ‘distributional thesaurus’: an automatically produced ‘thesaurus’ which finds words that tend to occur in similar contexts to the target word. Finally, sketch difference is a neat way of comparing two very similar words: it shows the patterns and combinations the two items have in common, and also those that are more typical of, or unique to, one word rather than the other. Ideally, prior to loading into the Sketch Engine, we lemmatise and part-of-speech-tag the data, which then lets us prepare word sketches, the distributional thesaurus and sketch differences.
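The distributional-thesaurus idea can be sketched as follows: words are compared by the collocates they share. The Sketch Engine itself uses grammatical relations and association scores; in this illustration plain co-occurrence counts and cosine similarity stand in for them, and all names are hypothetical.

from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two collocate-count mappings."""
    common = set(a) & set(b)
    num = sum(a[x] * b[x] for x in common)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def thesaurus_neighbours(target, collocate_counts, k=10):
    """collocate_counts: dict word -> Counter of its collocates."""
    scores = {w: cosine(collocate_counts[target], c)
              for w, c in collocate_counts.items() if w != target}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Example: counts = {"car": Counter(drive=9, road=7), "bus": Counter(drive=5, road=8), "apple": Counter(eat=9)}
# thesaurus_neighbours("car", counts) -> ["bus", "apple"]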

We are currently looking for collaborators with expertise in lemmatisers and taggers for one or more of the Indian languages, so we can jointly prepare world class resources for Indian languages to match those for European and East Asian ones.

Adam Kilgarriff, Girish Duvuru
Localization of EHCPRs System in the Multilingual Domain: An Implementation

The increase in cross-cultural communication triggered by the Internet, and the diverse language distribution of Internet users, intensifies the need for globalization of online intelligent systems. The Extended Hierarchical Censored Production Rules (EHCPRs) system can act as a generalized intelligent agent that takes care of context sensitivity in its reasoning. Efforts are underway to make the EHCPRs system available online so that it can be localized to the requirements of any specific multilingual domain of users.

Sarika Jain, Deepa Chaudhary, N. K. Jain
Retraction Note to: A Compiler for Morphological Analyzer Based on Finite-State Transducers

The paper starting on page 81 of this volume has been retracted as a significant amount of text was copied from the following paper:

Alicia Garrido, Amaia Iturraspe, Sandra Montserrat, Hermínia Pastor, Mikel L. Forcada: “A compiler for morphological analysers and generators based on finite-state transducers”, Procesamiento del Lenguaje Natural, vol. 25 (1999), pages 93-98.

Bhuvaneshwari C. Melinamath, A. G. Math, Sunanda D. Biradar
Backmatter
Metadata
Title
Information Systems for Indian Languages
Editors
Chandan Singh
Gurpreet Singh Lehal
Jyotsna Sengupta
Dharam Veer Sharma
Vishal Goyal
Copyright Year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-19403-0
Print ISBN
978-3-642-19402-3
DOI
https://doi.org/10.1007/978-3-642-19403-0
