
2011 | Book

Computational Linguistics and Intelligent Text Processing

12th International Conference, CICLing 2011, Tokyo, Japan, February 20-26, 2011. Proceedings, Part II


About this book

This two-volume set, consisting of LNCS 6608 and LNCS 6609, constitutes the thoroughly refereed proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2011, held in Tokyo, Japan, in February 2011. The 74 full papers, presented together with 4 invited papers, were carefully reviewed and selected from 298 submissions. The contents have been ordered according to the following topical sections: lexical resources; syntax and parsing; part-of-speech tagging and morphology; word sense disambiguation; semantics and discourse; opinion mining and sentiment detection; text generation; machine translation and multilingualism; information extraction and information retrieval; text categorization and classification; summarization and recognizing textual entailment; authoring aid, error correction, and style analysis; and speech recognition and generation.

Table of Contents

Frontmatter

Machine Translation and Multilingualism

Ontology Based Interlingua Translation

In this paper we describe an interlingua translation system from Italian to Italian Sign Language. The main components of this system are a broad-coverage dependency parser, an ontology-based semantic interpreter and a grammar-based generator; we describe the main features of these components.

Leonardo Lesmo, Alessandro Mazzei, Daniele P. Radicioni
Phrasal Equivalence Classes for Generalized Corpus-Based Machine Translation

Generalizations of sentence pairs in Example-Based Machine Translation (EBMT) have been shown to increase coverage and translation quality in the past. These template-based approaches (G-EBMT) find common patterns in the bilingual corpus to generate generalized templates. In the past, patterns in the corpus were found in only a few of the following ways: finding similar or dissimilar portions of text in groups of sentence pairs, finding semantically similar words, or using dictionaries and parsers to find syntactic correspondences. This paper combines all three aspects for generating templates. Here, the boundaries for aligning and extracting members (phrase pairs) for clustering are found using chunkers (hence, syntactic information) trained independently on the two languages under consideration. Semantically related phrase pairs are then grouped based on the contexts in which they appear. Templates are then constructed by replacing these clustered phrase pairs with their class labels. We also perform a filtering step, simulating human labelers, to retain only those phrase pairs with high correspondence between the source and target phrases that make them up. Templates for the English-Chinese and English-French language pairs gave significant improvements over a baseline with no templates.

Rashmi Gangadharaiah, Ralf D. Brown, Jaime Carbonell
A Multi-view Approach for Term Translation Spotting

This paper presents a multi-view approach for term translation spotting, based on a bilingual lexicon and comparable corpora. We propose to study different levels of representation for a term: the context, the theme and the orthography. These three approaches are studied individually and combined in order to rank translation candidates. We focus our task on French-English medical terms. Experiments show a significant improvement over the classical context-based approach, with an F-score of 40.3% for the first-ranked translation candidates.

Raphaël Rubino, Georges Linarès
ICE-TEA: In-Context Expansion and Translation of English Abbreviations

The wide use of abbreviations in modern texts poses interesting challenges and opportunities in the field of NLP. In addition to their dynamic nature, abbreviations are highly polysemous compared to regular words. Technologies that exhibit some level of language understanding may be adversely impacted by the presence of abbreviations. This paper addresses two related problems: (1) expansion of abbreviations given a context, and (2) translation of sentences with abbreviations. First, an efficient retrieval-based method for English abbreviation expansion is presented. Then, a hybrid system is used to pick among simple abbreviation-translation methods. The hybrid system achieves an improvement of 1.48 BLEU points over the baseline MT system, using sentences that contain abbreviations as a test set.

Waleed Ammar, Kareem Darwish, Ali El Kahki, Khaled Hafez
Word Segmentation for Dialect Translation

This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous source language text in order to improve the translation quality of statistical machine translation (SMT) approaches for the translation of local dialects by exploiting linguistic information of the standard language. The method iteratively learns multiple segmentation schemes that are consistent with (1) the standard dialect segmentations and (2) the phrasal segmentations of an SMT system trained on the resegmented bitext of the local dialect. In a second step, multiple segmentation schemes are integrated into a single SMT system by characterizing the source language side and merging identical translation pairs of differently segmented SMT models. Experimental results translating three Japanese local dialects (Kumamoto, Kyoto, Osaka) into three Indo-European languages (English, German, Russian) revealed that the proposed system outperforms SMT engines trained on character-based as well as standard dialect segmentation schemes for the majority of the investigated translation tasks and automatic evaluation metrics.

Michael Paul, Andrew Finch, Eiichiro Sumita
TEP: Tehran English-Persian Parallel Corpus

Parallel corpora are one of the key resources in natural language processing. In spite of their importance in many multi-lingual applications, no large-scale English-Persian corpus has been made available so far, given the difficulties in its creation and the intensive labor required. In this paper, we address the construction process of the Tehran English-Persian parallel corpus (TEP) using movie subtitles, together with some of the difficulties we experienced during data extraction and sentence alignment. To the best of our knowledge, TEP is the first freely released large-scale (on the order of millions of words) English-Persian parallel corpus.

Mohammad Taher Pilevar, Heshaam Faili, Abdol Hamid Pilevar
Effective Use of Dependency Structure for Bilingual Lexicon Creation

Existing dictionaries may be effectively enlarged by finding the translations of single words, using comparable corpora. The idea is based on the assumption that similar words have similar contexts across multiple languages. However, previous research either uses a simple bag-of-words model to capture the lexical context, or assumes that sufficient context information can be captured by the successor and predecessor in the dependency tree. While the latter may be sufficient for a close language pair, we observed that the method is insufficient if the languages differ significantly, as is the case for Japanese and English. Given a query word, our proposed method uses a statistical model to extract relevant words which tend to co-occur in the same sentence; additionally, it uses three statistical models to extract relevant predecessors, successors and siblings in the dependency tree. We then combine the information gained from the four statistical models, and compare this lexical-dependency information across English and Japanese to identify likely translation candidates. Experiments based on openly accessible comparable corpora verify that our proposed method increases Top 1 accuracy by a statistically significant margin of around 13 percentage points, to 53%, and Top 20 accuracy to 91%.

Daniel Andrade, Takuya Matsuzaki, Jun’ichi Tsujii
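As a rough illustration of the comparison step described in the abstract above, the sketch below builds separate context vectors for sentence co-occurrence and for dependency predecessors, successors and siblings, maps a query word's vectors into the target language through a seed dictionary, and ranks candidates by the combined cosine similarity. The tuple-based corpus format, the `seed_dict` mapping and the uniform combination weights are illustrative assumptions, not the paper's actual statistical models.

```python
from collections import Counter
from math import sqrt

RELATIONS = ("sentence", "pred", "succ", "sibling")  # the four context models

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def context_vectors(word, corpus):
    """corpus: list of (word, relation, context_word) tuples; returns one Counter per relation."""
    vecs = {rel: Counter() for rel in RELATIONS}
    for w, rel, ctx in corpus:
        if w == word and rel in vecs:
            vecs[rel][ctx] += 1
    return vecs

def translate_vector(vec, seed_dict):
    """Project a source-language context vector into the target language via a seed dictionary."""
    out = Counter()
    for src_word, count in vec.items():
        for tgt_word in seed_dict.get(src_word, []):
            out[tgt_word] += count
    return out

def rank_candidates(query, src_corpus, tgt_corpus, seed_dict, candidates):
    q_vecs = context_vectors(query, src_corpus)
    scores = {}
    for cand in candidates:
        c_vecs = context_vectors(cand, tgt_corpus)
        # uniform combination of the four relation-specific similarities (an assumption)
        scores[cand] = sum(cosine(translate_vector(q_vecs[r], seed_dict), c_vecs[r])
                           for r in RELATIONS) / len(RELATIONS)
    return sorted(scores.items(), key=lambda x: -x[1])
```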
Online Learning via Dynamic Reranking for Computer Assisted Translation

New techniques for online adaptation in computer assisted translation are explored and compared to previously existing approaches. Under the online adaptation paradigm, the translation system needs to adapt itself to real-world changing scenarios, where training and tuning may only take place once, when the system is set up for the first time. For this purpose, post-edit information, as described by a given quality measure, is used as valuable feedback within a dynamic reranking algorithm. Two possible approaches are presented and evaluated. The first one relies on the well-known perceptron algorithm, whereas the second one is a novel approach using Ridge regression to compute the optimum scaling factors within a state-of-the-art SMT system. Experimental results show that such algorithms are able to improve translation quality by learning from the errors produced by the system on a sentence-by-sentence basis.

Pascual Martínez-Gómez, Germán Sanchis-Trilles, Francisco Casacuberta
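A minimal sketch of the second, Ridge-based variant described above: each hypothesis in an n-best list is described by a feature vector, post-edit feedback supplies a quality score per hypothesis, and ridge regression re-estimates the scaling factors used to rerank the next n-best list. The feature layout and the quality scores are placeholders; the actual system operates inside a full SMT pipeline.

```python
import numpy as np

class RidgeReranker:
    def __init__(self, n_features, alpha=1.0):
        self.alpha = alpha                    # ridge regularization strength
        self.w = np.zeros(n_features)         # scaling factors over the model features
        self.X, self.y = [], []               # accumulated post-edit feedback

    def rerank(self, nbest_features):
        """nbest_features: (n_hyps, n_features) array; returns hypothesis indices, best first."""
        scores = nbest_features @ self.w
        return np.argsort(-scores)

    def update(self, nbest_features, quality_scores):
        """Store feedback (e.g., quality of each hypothesis vs. the post-edit) and refit."""
        self.X.extend(nbest_features)
        self.y.extend(quality_scores)
        X, y = np.asarray(self.X), np.asarray(self.y)
        # closed-form ridge solution: w = (X^T X + alpha I)^{-1} X^T y
        d = X.shape[1]
        self.w = np.linalg.solve(X.T @ X + self.alpha * np.eye(d), X.T @ y)

# sentence-by-sentence online loop on toy data
rr = RidgeReranker(n_features=3)
feats = np.array([[0.2, 1.0, 0.5], [0.9, 0.1, 0.4], [0.4, 0.6, 0.7]])
print(rr.rerank(feats))          # ranking before any feedback
rr.update(feats, quality_scores=[0.3, 0.8, 0.5])
print(rr.rerank(feats))          # ranking after learning from post-edit feedback
```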

Information Extraction and Information Retrieval

Learning Relation Extraction Grammars with Minimal Human Intervention: Strategy, Results, Insights and Plans

The paper describes the operation and evolution of a linguistically oriented framework for the minimally supervised learning of relation extraction grammars from textual data. Cornerstones of the approach are the acquisition of extraction rules from parsing results, the utilization of closed-world semantic seeds and a filtering of rules and instances by confidence estimation. Through a systematic walk through the major challenges of this approach, the obtained results and insights are summarized. Open problems are addressed and strategies for solving them are outlined.

Hans Uszkoreit
Using Graph Based Method to Improve Bootstrapping Relation Extraction

Many bootstrapping relation extraction systems that process large corpora or work on the Web have been proposed in the literature. These systems usually return a large number of extracted relation instances as an unordered set. However, the returned result set often contains many irrelevant or weakly related instances. Ordering the extracted examples by their relevance to the given seeds helps to filter out irrelevant instances. Furthermore, ranking the extracted examples makes the selection of the most similar instances easier. In this paper, we use a graph-based method to rank the relation instances returned by a bootstrapping relation extraction system. We compare the algorithm to existing methods, namely relevance-score-based and frequency-based methods; the results indicate that the proposed algorithm can improve the performance of bootstrapping relation extraction systems.

Haibo Li, Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka
A Hybrid Approach for the Extraction of Semantic Relations from MEDLINE Abstracts

With the continuous digitisation of medical knowledge, information extraction tools become more and more important for practitioners of the medical domain. In this paper we tackle semantic relation extraction from medical texts. We focus on the relations that may occur between diseases and treatments. We propose an approach relying on two different techniques to extract the target relations: (i) relation patterns based on human expertise and (ii) machine learning based on SVM classification. The presented approach takes advantage of both techniques, relying more on manual patterns when few relation examples are available and more on feature values when a sufficient number of examples is available. Our approach obtains an overall 94.07% F-measure for the extraction of cure, prevent and side effect relations.

Asma Ben Abacha, Pierre Zweigenbaum
An Active Learning Process for Extraction and Standardisation of Medical Measurements by a Trainable FSA

Medical scores and measurements are a very important part of clinical notes, as clinical staff infer a patient’s state by analysing them, especially their variation over time. We have devised an active learning process for rapid training of an engine for detecting regular patterns of scores, measurements, people and places in clinical texts. There are two objectives to this task: firstly, to find a comprehensive collection of validated patterns in a time-efficient manner, and secondly, to transform the captured examples into canonical forms. The first step of the process was to train an FSA from seed patterns and then use the FSA to extract further examples of patterns from the corpus.

The next step was to identify partial true positives (PTPs) from the newly extracted examples. A manual annotator reviewed the extractions to identify the PTPs and added the corrected forms of these examples to the training set as new patterns. This cycle continued until no new PTPs were detected. The process proved effective, requiring 5 cycles to create 371 true positives from 200 texts. We believe this gives 95% coverage of the TPs in the corpus.

Jon Patrick, Mojtaba Sabbagh
Topic Chains for Understanding a News Corpus

The Web is a great resource and archive of news articles for the world. We present a framework, based on probabilistic topic modeling, for uncovering the meaningful structure and trends of important topics and issues hidden within the news archives on the Web. Central to the framework is a topic chain, a temporal organization of similar topics. We experimented with various topic similarity metrics and present our insights on how best to construct topic chains. We discuss how to interpret the topic chains to understand the news corpus by looking at long-term topics, temporary issues, and shifts of focus in the topic chains. We applied our framework to nine months of a Korean Web news corpus and present our findings.

Dongwoo Kim, Alice Oh
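As a sketch of the chain-building step, the fragment below links topics from consecutive time slices whenever the similarity of their word distributions exceeds a threshold, using cosine similarity as one possible topic similarity metric among those the abstract mentions. The topic representation (a word-probability dict per topic) and the threshold value are illustrative assumptions.

```python
from math import sqrt

def cosine(p, q):
    dot = sum(p[w] * q[w] for w in p if w in q)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def build_topic_chains(slices, threshold=0.3):
    """slices: list of time slices, each a list of topics (word -> probability dicts).
    Returns chains as lists of (slice_index, topic_index) pairs."""
    chains = []
    open_chains = {}                       # topic index in the previous slice -> its chain
    for t, topics in enumerate(slices):
        new_open = {}
        for j, topic in enumerate(topics):
            best, best_sim = None, threshold
            for i in open_chains:          # compare against topics of the previous slice
                sim = cosine(slices[t - 1][i], topic)
                if sim > best_sim:
                    best, best_sim = i, sim
            if best is not None:           # extend an existing chain
                chain = open_chains[best]
            else:                          # no similar predecessor: start a new chain
                chain = []
                chains.append(chain)
            chain.append((t, j))
            new_open[j] = chain
        open_chains = new_open
    return chains
```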
From Italian Text to TimeML Document via Dependency Parsing

This paper describes the first prototype for building TimeML XML documents from raw Italian text. First, the text is parsed with the TULE parser, a dependency parser developed at the University of Turin. The parsed text is then used as input to the rule-based TimeML module we have implemented, henceforth called ‘the converter’. So far, the converter identifies and classifies events in the sentence. The results are rather satisfactory, which leads us to support the use of dependency syntactic relations for the development of higher-level semantic tools.

Livio Robaldo, Tommaso Caselli, Irene Russo, Matteo Grella
Self-adjusting Bootstrapping

Bootstrapping has been used as a very efficient method to extract a group of items similar to a given set of seeds. However, the bootstrapping method intrinsically has several parameters whose optimal values differ from task to task and from target to target. In this paper, we first demonstrate that this is indeed the case and a serious problem. We then propose self-adjusting bootstrapping, where the original seed set is segmented into a real seed and validation data. We initially bootstrap starting from the real seed, trying alternative parameter settings, and use the validation data to identify the optimal settings. This is done repeatedly with alternative segmentations in typical cross-validation fashion. The final bootstrapping is then performed using the best parameter setting and the entire original seed set in order to create the final output. We conducted experiments to collect sets of company names in different categories; self-adjusting bootstrapping substantially outperformed a baseline using a uniform parameter setting.

Shoji Fujiwara, Satoshi Sekine
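A compact sketch of the parameter-selection loop described above, assuming a generic `bootstrap_expand(seed, params, corpus)` callable as a placeholder for the underlying bootstrapping system and a simple recall-style score against the held-out validation seeds; the grid search and fold scheme are illustrative.

```python
from itertools import product
import random

def score(extracted, validation):
    """Fraction of held-out seed items recovered by the bootstrapping run."""
    return len(set(extracted) & set(validation)) / max(len(validation), 1)

def self_adjusting_bootstrap(seed, corpus, bootstrap_expand, param_grid, n_folds=5):
    seed = list(seed)
    random.shuffle(seed)
    folds = [seed[i::n_folds] for i in range(n_folds)]
    settings = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
    best_params, best_score = None, -1.0
    for params in settings:
        total = 0.0
        for k in range(n_folds):                        # cross-validation over seed splits
            validation = folds[k]
            real_seed = [s for j, f in enumerate(folds) if j != k for s in f]
            total += score(bootstrap_expand(real_seed, params, corpus), validation)
        if total / n_folds > best_score:
            best_params, best_score = params, total / n_folds
    # final run: best parameter setting, entire original seed set
    return bootstrap_expand(seed, best_params, corpus), best_params
```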
Story Link Detection Based on Event Words

In this paper, we propose an event-word-based method for story link detection. Differing from previous studies, we use time and place to label nouns and named entities; the labeled nouns/named entities are called event words. In our approach, a document is represented along five dimensions: nouns/named entities, time-featured nouns/named entities, place-featured nouns/named entities, time-and-place-featured nouns/named entities, and publication date. Experimental results show that our method gains a significant improvement over the baseline and that event words play a vital role in this improvement. In particular, when using the publication date, we reach a precision as high as 92%.

Letian Wang, Fang Li
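As a minimal illustration of the five-dimensional representation above, the sketch below compares two stories dimension by dimension with cosine similarity, adds a publication-date closeness term, and declares a link when the weighted combination crosses a threshold. The uniform weights, the date decay and the threshold are assumptions for illustration, not the paper's settings.

```python
from math import sqrt, exp

DIMENSIONS = ("nouns_ne", "time_ne", "place_ne", "time_place_ne")

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    n = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / n if n else 0.0

def story_similarity(doc1, doc2, weights=None, date_scale=7.0):
    """doc: {'nouns_ne': {term: count}, ..., 'pub_date': ordinal day}; returns a combined score."""
    weights = weights or {d: 0.2 for d in DIMENSIONS}
    sim = sum(weights[d] * cosine(doc1[d], doc2[d]) for d in DIMENSIONS)
    # publication-date closeness, decaying with the gap in days (illustrative)
    sim += 0.2 * exp(-abs(doc1["pub_date"] - doc2["pub_date"]) / date_scale)
    return sim

def linked(doc1, doc2, threshold=0.5):
    return story_similarity(doc1, doc2) >= threshold
```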
Ranking Multilingual Documents Using Minimal Language Dependent Resources

This paper proposes an approach that extracts simple and effective features to enhance multilingual document ranking (MLDR). There is limited prior research on capturing the concept of multilingual document similarity for determining the ranking of documents, and the available literature relies heavily on language-specific tools, making those approaches hard to reimplement for other languages. Our approach extracts various multilingual and monolingual similarity features using a basic language resource (a bilingual dictionary). No language-specific tools are used, making this approach extensible to other languages. We used the datasets provided by the Forum for Information Retrieval Evaluation (FIRE) for their 2010 ad hoc cross-lingual document retrieval task on Indian languages. Experiments have been performed with different ranking algorithms and their results are compared. The results showcase the effectiveness of the considered features in enhancing multilingual document ranking.

G. S. K. Santosh, N. Kiran Kumar, Vasudeva Varma
Measuring Chinese-English Cross-Lingual Word Similarity with HowNet and Parallel Corpus

Cross-lingual word similarity (CLWS) is a basic component in cross-lingual information access systems. Designing a CLWS measure faces three challenges: (i) cross-lingual knowledge bases are rare; (ii) cross-lingual corpora are limited; and (iii) no benchmark cross-lingual dataset is available for CLWS evaluation. This paper presents some Chinese-English CLWS measures that adopt HowNet as the cross-lingual knowledge base and a sentence-level parallel corpus as development data. In order to evaluate these measures, a Chinese-English cross-lingual benchmark dataset is compiled based on the Miller-Charles dataset. Two conclusions are drawn from the experimental results. Firstly, HowNet is a promising knowledge base for CLWS measures. Secondly, the parallel corpus is promising for fine-tuning the word similarity measures using cross-lingual co-occurrence statistics.

Yunqing Xia, Taotao Zhao, Jianmin Yao, Peng Jin

Text Categorization and Classification

Comparing Manual Text Patterns and Machine Learning for Classification of E-Mails for Automatic Answering by a Government Agency

E-mails to government institutions as well as to large companies may contain a large proportion of queries that can be answered in a uniform way. We analysed and manually annotated 4,404 e-mails from citizens to the Swedish Social Insurance Agency, and compared two methods for detecting answerable e-mails: manually created text patterns (rule-based) and machine learning-based methods. We found that the text-pattern-based method gave much higher precision, at 89 percent, than the machine learning-based method, which gave only 63 percent precision. The recall was slightly higher for the machine learning-based method (66 percent) than for the text patterns (47 percent). We also found that 23 percent of the total e-mail flow was processed by the automatic e-mail answering system.

Hercules Dalianis, Jonas Sjöbergh, Eriks Sneiders
Using Thesaurus to Improve Multiclass Text Classification

With the growing amount of textual information available on the Internet, the importance of automatic text classification has increased over the last decade. In this paper, a system is presented for the classification of multi-class Farsi documents using a Support Vector Machine (SVM) classifier. The new idea proposed in the present paper is based on extending the feature vector by adding words extracted from a thesaurus. The goal is to assist the classifier when the training dataset is not comprehensive for some categories. For corpus preparation, the Farsi Wikipedia website and articles from some archived newspapers and magazines are used. As the results indicate, classification efficiency improves by applying this approach: a micro F-measure of 0.89 was achieved for the classification of 10 categories of Farsi texts.

Nooshin Maghsoodi, Mohammad Mehdi Homayounpour
Adaptable Term Weighting Framework for Text Classification

In text classification, term frequency and term co-occurrence factors are dominantly used in weighting term features. Category relevance factors have recently been used to propose term weighting approaches. However, these approaches are mainly based on their own purpose-built text classifiers to adapt to category information, ignoring the advantages of popular text classifiers. This paper proposes a term weighting framework for text classification tasks. The framework firstly inherits the benefits of the provided category information to estimate the weighting of features. Secondly, based on feedback information, it continuously adjusts feature weightings to find the best representations for documents. Thirdly, the framework makes it possible to work robustly with different text classifiers when classifying the text representations based on category information. Experiments on several corpora with an SVM classifier show that, given predictions from the TFxIDF method as the initial status, the proposed approach improves accuracy and outperforms current text classification approaches.

Dat Huynh, Dat Tran, Wanli Ma, Dharmendra Sharma
Automatic Specialized vs. Non-specialized Sentence Differentiation

Compilation of corpora for Languages for Specific Purposes (LSP) is a task fraught with difficulties (mainly time and human effort), because it is not easy to discern between specialized and non-specialized text. The aim of this work is to study automatic specialized vs. non-specialized sentence differentiation. The experiments are carried out on two corpora of sentences extracted from specialized and non-specialized texts: one in economics (academic publications and news from newspapers), the other about sexuality (academic publications and texts from forums and blogs). First, we show the feasibility of the task using a statistical n-gram classifier. Then we show that grammatical features can also be used to classify sentences from the first corpus; for this purpose we use association rule mining.

Iria da Cunha, M. Teresa Cabré, Eric SanJuan, Gerardo Sierra, Juan Manuel Torres-Moreno, Jorge Vivaldi
Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features

Wikipedia is an online encyclopedia which anyone can edit. While most edits are constructive, about 7% are acts of vandalism. Such behavior is characterized by modifications made in bad faith, such as introducing spam and other inappropriate content.

In this work, we present the results of an effort to integrate three of the leading approaches to Wikipedia vandalism detection: a spatio-temporal analysis of metadata (STiki), a reputation-based system (WikiTrust), and natural language processing features. The performance of the resulting joint system improves on the state of the art of all previous methods and establishes a new baseline for Wikipedia vandalism detection. We examine in detail the contribution of the three approaches, both for the task of discovering fresh vandalism and for the task of locating vandalism in the complete set of Wikipedia revisions.

B. Thomas Adler, Luca de Alfaro, Santiago M. Mola-Velasco, Paolo Rosso, Andrew G. West
Costco: Robust Content and Structure Constrained Clustering of Networked Documents

Connectivity analysis of networked documents provides high-quality link structure information, which is usually lost in a content-based learning system. It is well known that combining links and content has the potential to improve text analysis. However, exploiting link structure is non-trivial because links are often noisy and sparse. Moreover, it is difficult to balance term-based content analysis and link-based structure analysis to reap the benefits of both. We introduce a novel networked document clustering technique that integrates content and link information in a unified optimization framework. Under this framework, a novel dimensionality reduction method called COntent & STructure COnstrained (Costco) Feature Projection is developed. In order to extract robust link information from sparse and noisy link graphs, two link analysis methods are introduced. Experiments on benchmark data and diverse real-world text corpora validate the effectiveness of the proposed methods.

Su Yan, Dongwon Lee, Alex Hai Wang

Summarization and Recognizing Textual Entailment

Learning Predicate Insertion Rules for Document Abstracting

The insertion of linguistic material into document sentences to create new sentences is a common activity in document abstracting. We investigate a transformation-based learning method to simulate this type of operation, which is relevant for text summarization. Our work is framed within a theory of transformation-based abstracting, where an initial text summary is transformed into an abstract by the application of a number of rules learnt from a corpus of examples. Our results are as good as recent work on classification-based predicate insertion.

Horacio Saggion
Multi-topical Discussion Summarization Using Structured Lexical Chains and Cue Words

We propose a method to summarize threaded, multi-topical texts automatically, particularly online discussions and e-mail conversations. These corpora have a so-called reply-to structure among the posts, where multiple topics are discussed simultaneously with a certain level of continuity, although each post is typically short. We specifically focus on the multi-topical aspect of the corpora, and propose the use of two linguistically motivated features: lexical chains and cue words, which capture the topics and topic structure. In particular, we introduce the structured lexical chain, which is a combination of traditional lexical chains with the thread structure. In experiments, we show the effectiveness of these features on the Innovation Jam 2008 Corpus and the BC3 Mailing List Corpus based on two task settings: key-sentence and keyword extraction. We also present a detailed analysis of the results with some intuitive examples.

Jun Hatori, Akiko Murakami, Jun’ichi Tsujii
Multi-document Summarization Using Link Analysis Based on Rhetorical Relations between Sentences

With the accelerating rate of data growth on the Internet, automatic multi-document summarization has become an important task. In this paper, we propose link analysis incorporating rhetorical relations between sentences to perform extractive summarization of multiple documents. We make use of the documents' headlines to extract sentences with salient terms from the document set using a statistical model. Then we assign rhetorical relations, learned by SVMs, to determine the connectivity between the sentences which include the salient terms. Finally, we rank these sentences by measuring their relative importance within the document set based on a link analysis method, PageRank. The rhetorical relations are used to evaluate the complementarity and redundancy of the ranked sentences. Our evaluation results show that the combination of PageRank with rhetorical relations among sentences helps to improve the quality of extractive summarization.

Nik Adilah Hanin Binti Zahri, Fumiyo Fukumoto
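A small sketch of the ranking step described above: sentences that stand in a rhetorical relation become nodes joined by an edge, and PageRank scores computed by power iteration give the extraction order. Building the graph from SVM-assigned rhetorical relations is represented here only by a precomputed edge list; the damping factor and iteration count are conventional defaults, not values from the paper.

```python
import numpy as np

def pagerank_scores(n_sentences, edges, damping=0.85, iterations=50):
    """edges: list of (i, j) pairs meaning sentence i is rhetorically related to sentence j."""
    A = np.zeros((n_sentences, n_sentences))
    for i, j in edges:                      # undirected connectivity between related sentences
        A[i, j] = A[j, i] = 1.0
    out_degree = A.sum(axis=1)
    out_degree[out_degree == 0] = 1.0       # avoid division by zero for isolated sentences
    M = A / out_degree[:, None]             # row-stochastic transition matrix
    scores = np.full(n_sentences, 1.0 / n_sentences)
    for _ in range(iterations):
        scores = (1 - damping) / n_sentences + damping * M.T @ scores
    return scores

# toy example: sentence 0 is related to all others, so it should rank highest
ranks = pagerank_scores(4, [(0, 1), (0, 2), (0, 3), (1, 2)])
summary_order = np.argsort(-ranks)          # pick top-ranked sentences for the summary
print(summary_order, ranks.round(3))
```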
Co-clustering Sentences and Terms for Multi-document Summarization

Two issues are crucial to multi-document summarization: diversity and redundancy. Content within topically related articles is usually redundant, while the topic is delivered from diverse perspectives. This paper presents a co-clustering based multi-document summarization method that makes full use of the diverse and redundant content. A multi-document summary is generated in three steps. First, the sentence-term co-occurrence matrix is designed to reflect diversity and redundancy. Second, the co-clustering algorithm is performed on the matrix to find globally optimal clusters for sentences and terms in an iterative manner. Third, a more accurate summary is generated by selecting representative sentences from the optimal clusters. Experiments on the DUC2004 dataset show that the co-clustering based multi-document summarization method is promising.

Yunqing Xia, Yonggang Zhang, Jianmin Yao
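The sketch below illustrates the three-step pipeline using scikit-learn's SpectralCoclustering as a stand-in for the paper's iterative co-clustering algorithm: build the sentence-term matrix, co-cluster sentences and terms, then pick one representative sentence per sentence cluster (here, the sentence closest to the cluster centroid). The vectorizer, the cluster count, and the representative-selection rule are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import SpectralCoclustering

def coclustering_summary(sentences, n_clusters=3):
    # 1. sentence-term co-occurrence matrix
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(sentences).toarray().astype(float)
    # 2. co-cluster sentences (rows) and terms (columns)
    model = SpectralCoclustering(n_clusters=n_clusters, random_state=0)
    model.fit(X + 1e-9)                     # small constant keeps row/column sums positive
    # 3. one representative sentence per sentence cluster: closest to the cluster centroid
    summary = []
    for c in range(n_clusters):
        rows = np.where(model.row_labels_ == c)[0]
        if len(rows) == 0:
            continue
        centroid = X[rows].mean(axis=0)
        best = rows[np.argmin(np.linalg.norm(X[rows] - centroid, axis=1))]
        summary.append(sentences[best])
    return summary
```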
Answer Validation Using Textual Entailment

We present an Answer Validation (AV) system based on Textual Entailment and Question Answering. The important features used to develop the AV system are lexical textual entailment, named entity recognition, question-answer type analysis, a chunk boundary module and a syntactic similarity module. The proposed AV system is rule-based. We first combine the question and the answer into a Hypothesis (H) and use the supporting text as the Text (T) to identify the entailment relation as either “VALIDATED” or “REJECTED”. The important features used for the lexical textual entailment module in the present system are WordNet-based unigram match, bigram match and skip-gram match. In the syntactic similarity module, the important features used are subject-subject comparison, subject-verb comparison, object-verb comparison and cross subject-verb comparison. The results obtained from the answer validation modules are integrated using a voting technique. For training purposes, we used the AVE 2008 development set. Evaluation scores obtained on the AVE 2008 test set show 66% precision and 65% F-score for the “VALIDATED” decision.

Partha Pakray, Alexander Gelbukh, Sivaji Bandyopadhyay
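A minimal sketch of the lexical side of the validation decision: unigram, bigram and skip-gram overlap between the text T and the hypothesis H (built from the question plus candidate answer) each cast a vote, and the majority decides between “VALIDATED” and “REJECTED”. The thresholds and the plain token overlap (instead of the paper's WordNet-based matching and syntactic modules) are simplifying assumptions.

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def skip_bigrams(tokens, max_skip=2):
    return {(a, b) for i, a in enumerate(tokens)
            for b in tokens[i + 1:i + 2 + max_skip]}

def overlap(h_set, t_set):
    return len(h_set & t_set) / len(h_set) if h_set else 0.0

def validate(text, hypothesis, thresholds=(0.6, 0.3, 0.3)):
    t, h = text.lower().split(), hypothesis.lower().split()
    features = [
        overlap(ngrams(h, 1), ngrams(t, 1)),          # unigram match
        overlap(ngrams(h, 2), ngrams(t, 2)),          # bigram match
        overlap(skip_bigrams(h), skip_bigrams(t)),    # skip-gram match
    ]
    votes = sum(f >= th for f, th in zip(features, thresholds))
    return "VALIDATED" if votes >= 2 else "REJECTED"

print(validate("Tokyo is the capital of Japan and its largest city",
               "Tokyo is the capital of Japan"))
```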

Authoring Aid, Error Correction, and Style Analysis

SPIDER: A System for Paraphrasing in Document Editing and Revision — Applicability in Machine Translation Pre-editing

This paper presents SPIDER, a system for paraphrasing in document editing and revision with applicability in machine translation pre-editing. SPIDER applies its linguistic knowledge (dictionaries and grammars) to create paraphrases of distinct linguistic phenomena. The first version of this tool was initially developed for Portuguese (ReEscreve v01), but it is extensible to other languages and can also operate across languages. SPIDER has a totally new interface, new resources which provide wider coverage of linguistic phenomena, and applicability to legal terminology, all of which are described here.

Anabela Barreiro
Providing Cross-Lingual Editing Assistance to Wikipedia Editors

We propose a framework to assist Wikipedia editors in transferring information among different languages. First, with the help of machine translation tools, we analyse the texts in two different language editions of an article and identify information that is only available in one edition. Next, we propose an algorithm to look for the most probable position in the other edition where the new information can be inserted. We show that our method can accurately suggest positions for new information. Our proposal is beneficial to both readers and editors of Wikipedia, and can easily be generalised and applied to other multi-lingual corpora.

Ching-man Au Yeung, Kevin Duh, Masaaki Nagata
Reducing Overdetections in a French Symbolic Grammar Checker by Classification

We describe the development of an “overdetection” identifier, a system for filtering detections erroneously flagged by a grammar checker. Various families of classifiers have been trained in a supervised way for 14 types of detections made by a commercial French grammar checker. Eight of these were integrated into the most recent commercial version of the system. This is a striking illustration of how a machine learning component can be successfully embedded in Antidote, a robust, commercial, and popular natural language application.

Fabrizio Gotti, Philippe Langlais, Guy Lapalme, Simon Charest, Éric Brunelle
Performance Evaluation of a Novel Technique for Word Order Errors Correction Applied to Non Native English Speakers’ Corpus

This work presents the evaluation results of a novel technique for word order error correction, using a corpus of non-native English speakers. The technique, which is language independent, repairs word order errors in sentences using the probabilities of the most typical trigrams and bigrams extracted from a large text corpus such as the British National Corpus (BNC). A good indicator of whether a person really knows a language is the ability to use the appropriate words in a sentence in the correct word order; “scrambled” words produce a meaningless sentence. Most languages have a fairly fixed word order, and for non-native speakers and writers, word order errors are frequent in English as a Second Language. These errors often arise when the student is translating (thinking in his or her native language and trying to translate it into English). For this reason, the experimentation task involves a test set of 50 sentences translated from Greek to English. The purpose of this experiment is to determine how the system performs on real data produced by non-native English speakers.

Theologos Athanaselis, Stelios Bakamidis, Ioannis Dologlou
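A toy sketch of the scoring idea: candidate reorderings of a short sentence are scored with trigram counts (backing off to bigrams) from a reference corpus, and the highest-scoring permutation is returned. Exhaustive permutation is only feasible for short sentences and stands in for the paper's repair procedure; the tiny training corpus is a placeholder for counts extracted from something like the BNC.

```python
from collections import Counter
from itertools import permutations
from math import log

def train_counts(corpus_sentences):
    bi, tri = Counter(), Counter()
    for sent in corpus_sentences:
        toks = ["<s>", "<s>"] + sent.lower().split() + ["</s>"]
        for i in range(2, len(toks)):
            bi[(toks[i - 1], toks[i])] += 1
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
    return bi, tri

def sentence_score(tokens, bi, tri):
    toks = ["<s>", "<s>"] + tokens + ["</s>"]
    score = 0.0
    for i in range(2, len(toks)):
        t = tri[(toks[i - 2], toks[i - 1], toks[i])]
        b = bi[(toks[i - 1], toks[i])]
        score += log(1 + 2 * t + b)          # crude trigram-favouring score, not a true LM
    return score

def correct_word_order(sentence, bi, tri, max_len=7):
    tokens = sentence.lower().split()
    if len(tokens) > max_len:                # exhaustive search only for short sentences
        return sentence
    best = max(permutations(tokens), key=lambda p: sentence_score(list(p), bi, tri))
    return " ".join(best)

bi, tri = train_counts(["i can speak english very well", "she can speak french very well"])
print(correct_word_order("speak i can english well very", bi, tri))
```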
Correcting Verb Selection Errors for ESL with the Perceptron

We study the task of correcting verb selection errors for English as a Second Language (ESL) learners, which is meaningful but also challenging. The difficulties of this task lie in two aspects: the lack of annotated data and the diversity of verb usage contexts. We propose a novel perceptron-based approach to this task. More specifically, our method generates correction candidates using predefined confusion sets, to avoid tedious and prohibitively expensive human labeling; moreover, rich linguistic features are integrated to represent the verb usage context, using a global linear model learnt by the perceptron algorithm. The features used in our method include a language model, local text, chunks, and semantic collocations. Our method is evaluated on both synthetic and real-world corpora, and consistently achieves encouraging results, outperforming all baselines.

Xiaohua Liu, Bo Han, Ming Zhou
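A compact sketch of the global linear model described above: for each training instance, the perceptron scores every verb in the confusion set against context features and updates the weights when its top choice differs from the correct verb. The bag-of-context-words feature extractor is a placeholder for the paper's richer language-model, chunk and collocation features, and the toy confusion sets are invented for illustration.

```python
from collections import defaultdict

def features(context_words, verb):
    """Conjoin each context word with the candidate verb (placeholder feature set)."""
    return [f"{verb}|{w}" for w in context_words] + [f"bias|{verb}"]

class VerbSelectionPerceptron:
    def __init__(self, confusion_sets):
        self.confusion_sets = confusion_sets          # e.g. {"do": {"make", "do", "take"}, ...}
        self.w = defaultdict(float)

    def score(self, context, verb):
        return sum(self.w[f] for f in features(context, verb))

    def predict(self, context, observed_verb):
        candidates = sorted(self.confusion_sets.get(observed_verb, {observed_verb}))
        return max(candidates, key=lambda v: self.score(context, v))

    def train(self, data, epochs=5):
        """data: list of (context_words, observed_verb, correct_verb) triples."""
        for _ in range(epochs):
            for context, observed, gold in data:
                guess = self.predict(context, observed)
                if guess != gold:             # standard perceptron update on mistakes
                    for f in features(context, gold):
                        self.w[f] += 1.0
                    for f in features(context, guess):
                        self.w[f] -= 1.0

confusion = {"make": {"make", "do", "take"}, "do": {"make", "do", "take"}}
model = VerbSelectionPerceptron(confusion)
model.train([(["a", "decision"], "do", "make"), (["my", "homework"], "make", "do")])
print(model.predict(["a", "decision"], "do"))    # expected: "make"
```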
A Posteriori Agreement as a Quality Measure for Readability Prediction Systems

All readability research is ultimately concerned with the research question of whether it is possible for a prediction system to automatically determine the level of readability of an unseen text. A significant problem for such a system is that readability might depend in part on the reader. If different readers assess the readability of texts in fundamentally different ways, there is insufficient a priori agreement to justify the correctness of a readability prediction system based on the texts assessed by those readers. We built a data set of readability assessments by expert readers. We clustered the experts into groups with greater a priori agreement and then measured for each group whether classifiers trained only on data from this group exhibited a classification bias. As this was found to be the case, the classification mechanism cannot be unproblematically generalized to a different user group.

Philip van Oosten, Véronique Hoste, Dries Tanghe
A Method to Measure the Reading Difficulty of Japanese Words

In this paper, we propose an automatic method to measure the reading difficulty of Japanese words. The proposed method uses a statistical transliteration framework, which was inspired by statistical machine translation research. A Dirichlet process model is used for the alignment between single kanji characters and one or more hiragana characters. The joint probability of kanji and hiragana is used to measure the difficulty. In our experiment, we carried out a linear discriminant analysis using three kinds of lexicons: a Japanese place name lexicon, a Japanese last name lexicon and a general noun lexicon. We compared the discrimination ratio given by the proposed method and the conventional method, which estimates word difficulty based on manually defined kanji difficulty. According to the experimental results, the proposed method performs well for scoring the reading difficulty of Japanese proper nouns, producing a higher discrimination ratio on the proper noun lexicons (14 points higher on the place name lexicon and 26.5 points higher on the last name lexicon) than the conventional method.

Keiji Yasuda, Andrew Finch, Eiichiro Sumita
Informality Judgment at Sentence Level and Experiments with Formality Score

Formality and its converse, informality, are important dimensions of authorial style that serve to determine the social background a particular document comes from, and the potential audience it is targeted at. In this paper we explored the concept of formality at the sentence level from two different perspectives. One was the Formality Score (F-score) and its distribution across different datasets, how they compared with each other, and how the F-score could be linked to human-annotated sentences. The other was to measure the inherent agreement between two independent judges on a sentence annotation task, which gave us an idea of how subjective the concept of formality is at the sentence level. Finally, we looked into the related issue of document readability and measured its correlation with document formality.

Shibamouli Lahiri, Prasenjit Mitra, Xiaofei Lu
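For readers unfamiliar with the F-score mentioned above, the sketch below computes the Heylighen-Dewaele formality measure from part-of-speech frequencies, which is the usual definition of the Formality Score; whether the paper uses exactly this formulation is an assumption here, and the POS tag names are illustrative.

```python
def formality_score(pos_counts):
    """Heylighen & Dewaele F-score from a dict of POS tag -> token count.
    Frequencies are percentages of all tokens; a higher F means a more formal text."""
    total = sum(pos_counts.values())
    def freq(tag):
        return 100.0 * pos_counts.get(tag, 0) / total if total else 0.0
    formal = freq("noun") + freq("adjective") + freq("preposition") + freq("article")
    deictic = freq("pronoun") + freq("verb") + freq("adverb") + freq("interjection")
    return (formal - deictic + 100.0) / 2.0

# example: a noun-heavy sentence scores higher than a pronoun/verb-heavy one
print(formality_score({"noun": 5, "preposition": 2, "article": 2, "verb": 1}))
print(formality_score({"pronoun": 3, "verb": 3, "adverb": 2, "noun": 1}))
```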

Speech Recognition and Generation

Combining Word and Phonetic-Code Representations for Spoken Document Retrieval

The traditional approach to spoken document retrieval (SDR) uses an automatic speech recognizer (ASR) in combination with a word-based information retrieval method. This approach has shown only limited accuracy, partially because ASR systems tend to produce transcriptions of spontaneous speech with significant word error rates. In order to overcome this limitation, we propose a method which uses word and phonetic-code representations in combination. The idea of this combination is to reduce the impact of transcription errors in the processing of some (presumably complex) queries by representing words with similar pronunciations through the same phonetic code. Experimental results on the CLEF-CLSR-2007 corpus are encouraging; the proposed hybrid method improved the mean average precision and the number of retrieved relevant documents over the traditional word-based approach by 3% and 7%, respectively.

Alejandro Reyes-Barragán, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda
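To make the idea concrete, the sketch below indexes an ASR transcript both as words and as phonetic codes, and ranks documents by a weighted sum of the two match scores. Soundex is used here only as a stand-in for the paper's phonetic coding, and the interpolation weight is an illustrative assumption.

```python
def soundex(word):
    """Classic Soundex code: many similarly pronounced words map to the same code."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":
            prev = code
    return (out + "000")[:4]

def match_score(query_terms, doc_terms):
    doc_set = set(doc_terms)
    return sum(t in doc_set for t in query_terms) / len(query_terms) if query_terms else 0.0

def hybrid_score(query, transcript, word_weight=0.7):
    q_words, d_words = query.lower().split(), transcript.lower().split()
    word_sim = match_score(q_words, d_words)
    phon_sim = match_score([soundex(w) for w in q_words], [soundex(w) for w in d_words])
    return word_weight * word_sim + (1 - word_weight) * phon_sim

# an ASR error ("clymate" for "climate") still matches through the phonetic code
print(hybrid_score("climate change", "the clymate change report was discussed"))
```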
Automatic Rule Extraction for Modeling Pronunciation Variation

This paper describes a technique for the automatic extraction of pronunciation rules from a continuous speech corpus. The purpose of the work is to model pronunciation variation in phoneme-based continuous speech recognition at the language model level. In modeling pronunciation variations, morphological variations and the out-of-vocabulary words problem are also implicitly modeled in the system. It is not possible to model these kinds of variations using a dictionary-based approach in phoneme-based automatic speech recognition. The variations are automatically learned from an annotated continuous speech corpus. The corpus is first aligned, on the basis of phonemes and letters, using a dynamic string alignment (DSA) algorithm. The DSA is applied to isolated words to deal with intra-word variations, as well as to complete sentences in the corpus to deal with inter-word variations. The pronunciation rules (phonemes → letters) are extracted from these aligned speech units to build the pronunciation model. The rules are finally fed to a phoneme-to-word decoder for the recognition of words that have different pronunciations or that are OOV.

Zeeshan Ahmed, Julie Carson-Berndsen
Predicting Word Pronunciation in Japanese

This paper addresses the problem of predicting the pronunciation of Japanese words, especially those that are newly created and therefore not in the dictionary. This is an important task for many applications including text-to-speech and text input methods, and is also challenging, because Japanese kanji (ideographic) characters typically have multiple possible pronunciations. We approach this problem by considering it as a simplified machine translation/transliteration task, and propose a solution that takes advantage of the recent technologies developed for machine translation and transliteration research. More specifically, we divide the problem into two subtasks: (1) discovering the pronunciation of new words, or those words that are difficult to pronounce, by mining unannotated text, much like the creation of a bilingual dictionary using the web; (2) building a decoder for the task of pronunciation prediction, for which we apply a state-of-the-art discriminative substring-based approach. Our experimental results show that our classifier for validating the word-pronunciation pairs harvested from unannotated text achieves over 98% precision and recall. On the pronunciation prediction task for unseen words, our decoder achieves over 70% accuracy, which significantly improves over the previously proposed models.

Jun Hatori, Hisami Suzuki
A Minimum Cluster-Based Trigram Statistical Model for Thai Syllabification

Syllabification is the process of extracting syllables from a word. Problems in syllabification are mainly caused by unknown and ambiguous words. This research aims to resolve these problems for the Thai language by exploiting relationships among characters in a word. A character clustering scheme is proposed to generate units smaller than a syllable, called Thai Minimum Clusters (TMCs), from a word. TMCs are then merged into syllables using a trigram statistical model. Experimental evaluations are performed to assess the effectiveness of the proposed technique on a standard data set of 77,303 words. The results show that the technique yields 97.61% accuracy.

Chonlasith Jucksriporn, Ohm Sornil
Automatic Generation of a Pronunciation Dictionary with Rich Variation Coverage Using SMT Methods

Constructing a pronunciation lexicon with variants in a fully automatic and language-independent way is a challenge, with many uses in human language technologies. Moreover, with the growing use of web data, there is a recurrent need to add words to existing pronunciation lexicons, and an automatic method can greatly simplify the effort required to generate pronunciations for these out-of-vocabulary words. In this paper, a machine translation approach is used to perform grapheme-to-phoneme (g2p) conversion, the task of finding the pronunciation of a word from its written form. Two alternative methods are proposed to derive pronunciation variants. In the first case, an n-best pronunciation list is extracted directly from the g2p converter. The second is a novel method based on a pivot approach, traditionally used for the paraphrase extraction task, and applied as a post-processing step to the g2p converter. The performance of these two methods is compared under different training conditions. The range of applications which require pronunciation lexicons is discussed and the generated pronunciations are further tested in some preliminary automatic speech recognition experiments.

Panagiota Karanasou, Lori Lamel
Backmatter
Metadata
Title
Computational Linguistics and Intelligent Text Processing
edited by
Alexander Gelbukh
Copyright Year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-19437-5
Print ISBN
978-3-642-19436-8
DOI
https://doi.org/10.1007/978-3-642-19437-5