
2010 | Book

Computational Linguistics and Intelligent Text Processing

11th International Conference, CICLing 2010, Iaşi, Romania, March 21-27, 2010. Proceedings

About this Book

CICLing 2010 was the 11th Annual Conference on Intelligent Text Processing and Computational Linguistics. The CICLing conferences provide a wide-scope forum for discussion of the art and craft of natural language processing research as well as the best practices in its applications. This volume contains three invited papers and the regular papers accepted for oral presentation at the conference. The papers accepted for poster presentation were published in a special issue of another journal (see information on the website). Since 2001, the proceedings of the CICLing conferences have been published in Springer's Lecture Notes in Computer Science series, as volumes 2004, 2276, 2588, 2945, 3406, 3878, 4394, 4919, and 5449.

The volume is structured into 12 sections:

– Lexical Resources
– Syntax and Parsing
– Word Sense Disambiguation and Named Entity Recognition
– Semantics and Dialog
– Humor and Emotions
– Machine Translation and Multilingualism
– Information Extraction
– Information Retrieval
– Text Categorization and Classification
– Plagiarism Detection
– Text Summarization
– Speech Generation

The 2010 event received a record high number of submissions in the 11-year history of the CICLing series. A total of 271 papers by 565 authors from 47 countries were submitted for evaluation by the International Program Committee (see Tables 1 and 2). This volume contains revised versions of 61 papers, by 152 authors, selected for oral presentation; the acceptance rate was 23%.

Table of Contents

Frontmatter

Lexical Resources

Invited Paper

Planning the Future of Language Resources: The Role of the FLaReNet Network

In this paper we analyse the role of Language Resources (LR) and Language Technologies (LT) in today's Human Language Technology field and speculate on some of the priorities for the coming years, from the particular perspective of the FLaReNet project, which has been asked to act as an observatory to assess the current status of the field of Language Resources and Technologies and to indicate priorities of action for the future.

Nicoletta Calzolari, Claudia Soria

Best Paper Award – Second Place

Cross-Lingual Alignment of FrameNet Annotations through Hidden Markov Models

Resources annotated with frame semantic information support the development of robust systems for shallow semantic parsing. Several studies have proposed automatically transferring the semantic information available for English corpora to other resource-poor languages. In this paper, a semantic transfer approach based on Hidden Markov Models applied to aligned corpora is proposed. The experimental evaluation over an English-Italian corpus is successful, achieving 86% accuracy on average, and improves on state-of-the-art methods for the same task.

Paolo Annesi, Roberto Basili
On the Automatic Generation of Intermediate Logic Forms for WordNet Glosses

This paper presents an automatically generated Intermediate Logic Form of WordNet’s glosses. Our proposed logic form includes neo-Davidsonian reification in a simple and flat syntax close to natural language. We offer a comparison with other semantic representations such as those provided by Hobbs and Extended WordNet. The Intermediate Logic Forms are straightforwardly obtained from the output of a pipeline consisting of a part-of-speech tagger, a dependency parser and our own Intermediate Logic Form generator (all freely available tools). We apply the pipeline to the glosses of WordNet 3.0 to obtain a lexical resource ready to be used as a knowledge base or resource for a variety of tasks involving some kind of semantic inference. We present a qualitative evaluation of the resource and discuss its possible application in Natural Language Understanding.

Rodrigo Agerri, Anselmo Peñas
Worth Its Weight in Gold or Yet Another Resource — A Comparative Study of Wiktionary, OpenThesaurus and GermaNet

In this paper, we analyze the topology and the content of a range of lexical semantic resources for the German language constructed either in a controlled (GermaNet), semi-controlled (OpenThesaurus), or collaborative, i.e. community-based, manner (Wiktionary). For the first time, the comparison of the corresponding resources is performed at the word sense level. For this purpose, the word senses of terms are automatically disambiguated in Wiktionary and the content of all resources is converted to a uniform representation. We show that the topologies of the resources are comparable, as they share the small-world property and contain a comparable number of entries, although differences in their connectivity exist. Our study of content-related properties reveals that the German Wiktionary has a different distribution of word senses and contains more polysemous entries than both other resources. We find that each resource contains the highest number of relations of a particular semantic type. We finally increase the number of relations in Wiktionary by considering symmetric and inverse relations, which have been found to be usually absent in this resource.

Christian M. Meyer, Iryna Gurevych
Issues in Analyzing Telugu Sentences towards Building a Telugu Treebank

This paper describes an effort towards building a Telugu Dependency Treebank. We discuss the basic framework and the issues we encountered while annotating; 1,487 sentences have been annotated in the Paninian framework. We also discuss how some of the annotation decisions would affect the development of a parser for Telugu.

Chaitanya Vempaty, Viswanatha Naidu, Samar Husain, Ravi Kiran, Lakshmi Bai, Dipti M Sharma, Rajeev Sangal
EusPropBank: Integrating Semantic Information in the Basque Dependency Treebank

This paper deals with theoretical problems encountered in the ongoing work of annotating semantic roles in the Basque Dependency Treebank (BDT). We present the resources used and the way the annotation is being done. Following the model proposed in the PropBank project, we show the problems found in the annotation process and the decisions we have taken. The representation of the semantic tag has been established and detailed guidelines for the annotation process have been defined, although this is a task that needs continuous updating. In addition, we have adapted AbarHitz, a tool used in the construction of the BDT, to this task.

Izaskun Aldezabal, María Jesús Aranzabe, Arantza Díaz de Ilarraza, Ainara Estarrona, Larraitz Uria
Morphological Annotation of a Corpus with a Collaborative Multiplayer Game

In most natural language processing tasks, state-of-the-art systems usually rely on machine learning methods for building their mathematical models. Given that the majority of these systems employ supervised learning strategies, a corpus annotated for the problem area is essential. The current method for annotating a corpus is to hire several experts and have them annotate the corpus manually or by using helper software. However, this method is costly and time-consuming. In this paper, we propose a novel method that aims to solve these problems. By employing a multiplayer collaborative game that is playable by ordinary people on the Internet, it seems possible to direct the covert labour force so that people can contribute by just playing a fun game. Through a game site which incorporates some functionality inherited from social networking sites, people are motivated to contribute to the annotation process by answering questions about the underlying morphological features of a target word. The experiments show that 63.5% of the actual question types are successful based on a two-phase evaluation.

Onur Güngör, Tunga Güngör

Syntax and Parsing

Invited Paper

Computational Models of Language Acquisition

Child language acquisition, one of Nature’s most fascinating phenomena, is to a large extent still a puzzle. Experimental evidence seems to support the view that early language is highly formulaic, consisting for the most part of frozen items with limited productivity. Fairly quickly, however, children find patterns in the ambient language and generalize them to larger structures, in a process that is not yet well understood. Computational models of language acquisition can shed interesting light on this process. This paper surveys various works that address language learning from data; such works are conducted in different fields, including psycholinguistics, cognitive science and computer science, and we maintain that knowledge from all these domains must be consolidated in order for a well-informed model to emerge. We identify the commonalities and differences between the various existing approaches to language learning, and specify desiderata for future research that must be considered by any plausible solution to this puzzle.

Shuly Wintner
ETL Ensembles for Chunking, NER and SRL

We present a new ensemble method that uses Entropy Guided Transformation Learning (ETL) as the base learner. The proposed approach, ETL Committee, combines the main ideas of Bagging and Random Subspaces. We also propose a strategy to include redundancy in transformation-based models. To evaluate the effectiveness of the ensemble method, we apply it to three Natural Language Processing tasks: Text Chunking, Named Entity Recognition and Semantic Role Labeling. Our experimental findings indicate that ETL Committee significantly outperforms single ETL models, achieving competitive state-of-the-art results. Some positive characteristics of the proposed ensemble strategy are worth mentioning. First, it improves the ETL effectiveness without any additional human effort. Second, it is particularly useful when dealing with very complex tasks that use large feature sets. And finally, the resulting training and classification processes are very easy to parallelize.

Cícero N. dos Santos, Ruy L. Milidiú, Carlos E. M. Crestana, Eraldo R. Fernandes
Unsupervised Part-of-Speech Disambiguation for High Frequency Words and Its Influence on Unsupervised Parsing

Current unsupervised part-of-speech tagging algorithms build context vectors containing high-frequency words as features and cluster words – with respect to their context vectors – into classes. While part-of-speech disambiguation for mid- and low-frequency words is achieved by applying a Hidden Markov Model, no corresponding method is applied to high-frequency terms. But those are exactly the words that are essential for analyzing syntactic dependencies of natural language. Thus, we introduce an approach employing unsupervised clustering of contexts to detect and separate a word's different syntactic roles. Experiments on German and English corpora show how this methodology addresses and solves some of the major problems of unsupervised part-of-speech tagging.

Christian Hänig
A Machine Learning Parser Using an Unlexicalized Distituent Model

Despite the popularity of lexicalized parsing models, practical concerns such as data sparseness and applicability to domains with different vocabularies mean that unlexicalized models, which do not refer to word tokens themselves, deserve more attention. A classifier-based parser using an unlexicalized parsing model has been developed. Most importantly, to enhance the accuracy of these tasks, we investigated the notion of distituency (the possibility that two parts of speech cannot remain in the same constituent or phrase) and incorporated it as attributes using various statistical measures. A machine learning method integrates linguistic attributes and information-theoretic attributes in two tasks, namely sentence chunking and phrase recognition. The parser was applied to parsing English and Chinese sentences in the Penn Treebank and the Tsinghua Chinese Treebank. It achieved a parsing performance of F-score 80.3% in English and 82.4% in Chinese.

Samuel W. K. Chan, Lawrence Y. L. Cheung, Mickey W. C. Chong
Ontology-Based Semantic Interpretation as Grammar Rule Constraints

We present an ontology-based semantic interpreter that can be linked to a grammar through grammar rule constraints, providing access to meaning during parsing and generation. In this approach, the parser will take as input natural language utterances and will produce ontology-based semantic representations. We rely on a recently developed constraint-based grammar formalism, which balances expressiveness with practical learnability results. We show that even with a weak “ontological model”, the semantic interpreter at the grammar rule level can help remove erroneous parses obtained when we do not have access to meaning.

Smaranda Muresan
Towards a Cascade of Morpho-syntactic Tools for Arabic Natural Language Processing

This paper presents a cascade of morpho-syntactic tools for Arabic natural language processing. It begins with the description of a large-coverage formalization of the Arabic lexicon. The resulting electronic dictionary, named "El-DicAr" (which stands for "Electronic Dictionary for Arabic"), links inflectional, morphological, and syntactic-semantic information to the list of lemmas. Automated inflectional and derivational routines are applied to each lemma, producing over 3 million inflected forms. El-DicAr represents the linguistic engine for the automatic analyzer, built through a lexical analysis module and a cascade of morpho-syntactic tools including: a morphological analyzer, a spell-checker, a named entity recognition tool, an automatic annotator, and tools for linguistic research and contextual exploration. The morphological analyzer identifies the component morphemes of agglutinative forms using large-coverage morphological grammars. The spell-checker corrects the most frequent typographical errors. The lexical analysis module handles the different vocalization statements in Arabic written texts. Finally, the named entity recognition tool is based on a combination of the morphological analysis results and a set of rules represented as local grammars.

Slim Mesfar
An Open-Source Computational Grammar for Romanian

We describe the implementation of a computational grammar for Romanian as a resource grammar in the GF project (Grammatical Framework). Resource grammars are the basic constituents of the GF library. They consist of morphological and syntactical modules which implement a common abstract syntax, also describing the basic features of a language. The present paper explores the main features of the Romanian grammar, along with the way they fit into the framework that GF provides. We also compare the implementation for Romanian with related resource grammars that exist already in the library. The current resource grammar allows generation and parsing of natural language and can be used in multilingual translations and other GF-related applications. Covering a wide range of specific morphological and syntactical features of the Romanian language, this GF resource grammar is the most comprehensive open-source grammar existing so far for Romanian.

Ramona Enache, Aarne Ranta, Krasimir Angelov
Chinese Event Descriptive Clause Splitting with Structured SVMs

Chinese event descriptive clause splitting is a novel task in Chinese information processing. Unlike the English clause splitting problem, Chinese event descriptive clause splitting aims at recognizing high-level clauses. In this paper, we present a Chinese clause splitting system with a discriminative approach. By formulating the Chinese clause splitting task as a sequence labeling problem, we apply the structured SVMs model to Chinese clause splitting. Compared with two other baseline systems, our approach gives much better performance.

Junsheng Zhou, Yabing Zhang, Xinyu Dai, Jiajun Chen

Word Sense Disambiguation and Named Entity Recognition

Best Paper Award – First Place

An Experimental Study on Unsupervised Graph-based Word Sense Disambiguation

Recent research on unsupervised word sense disambiguation reports an increase in performance, which narrows the gap with the respective supervised approaches for the same task. Among the latest state-of-the-art methods, those that use semantic graphs report the best results. Such methods create a graph comprising the words to be disambiguated and their corresponding candidate senses. The graph is expanded by adding semantic edges and nodes from a thesaurus. The selection of the most appropriate sense per word occurrence is then made through the use of graph processing algorithms that assign a degree of importance to the graph vertices. In this paper we experimentally investigate the performance of such methods. We additionally evaluate a new method, which is based on a recently introduced algorithm for computing similarity between graph vertices, P-Rank. We evaluate the performance of all alternatives on two benchmark data sets, Senseval 2 and 3, using WordNet. The current study shows the differences in the performance of each method when applied to the same semantic graph representation, and analyzes the pros and cons of each method for each part of speech separately. Furthermore, it analyzes the levels of inter-agreement at the sense selection level, giving further insight into how these methods could be employed in an unsupervised ensemble for word sense disambiguation.

George Tsatsaronis, Iraklis Varlamis, Kjetil Nørvåg
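
To make the general recipe above concrete, here is a minimal sketch (not the authors' implementation) of graph-based sense ranking: candidate senses become graph vertices, thesaurus relations become edges, and a centrality algorithm (plain PageRank via networkx here, standing in for measures such as P-Rank) scores the candidates. The sense inventory and edges are toy values.

```python
# Hypothetical illustration of graph-based WSD: build a graph of candidate
# senses, connect semantically related senses, and rank them by centrality.
import networkx as nx

# Toy sense inventory: word -> candidate sense ids (assumed, not from the paper)
candidate_senses = {
    "bank":  ["bank.n.01", "bank.n.02"],   # financial institution / river side
    "money": ["money.n.01"],
    "river": ["river.n.01"],
}

# Toy semantic edges that a thesaurus (e.g. WordNet) might contribute
semantic_edges = [
    ("bank.n.01", "money.n.01"),
    ("bank.n.02", "river.n.01"),
    ("money.n.01", "river.n.01"),
]

G = nx.Graph()
for word, senses in candidate_senses.items():
    G.add_nodes_from(senses)
G.add_edges_from(semantic_edges)

# Centrality step: any vertex-importance algorithm can be plugged in here.
scores = nx.pagerank(G)

# Pick the highest-scoring candidate sense for each ambiguous word.
for word, senses in candidate_senses.items():
    best = max(senses, key=lambda s: scores.get(s, 0.0))
    print(word, "->", best)
```
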
A Case Study of Using Web Search Statistics: Case Restoration

We investigate the use of Web search engine statistics for the task of case restoration. Because most engines are case insensitive, an approach based on search hit counts, as employed in previous work on natural language ambiguity resolution, is not applicable to this task. Consequently, we study the use of statistics computed from the snippets generated by a Web search engine, and we show that such statistics can achieve performance similar to corpus-based approaches. We also note that the top few results returned by a search engine may not be the most representative for modeling phenomena in a language.

Silviu Cucerzan
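
As an illustration of the snippet-statistics idea, the following hedged sketch restores the casing of a token by counting its surface forms in a set of search snippets; the snippets and the majority-vote rule are assumptions for the example, not the paper's exact procedure.

```python
# Hypothetical sketch of snippet-based case restoration: given search-engine
# snippets containing the target token, pick its most frequent casing variant.
from collections import Counter
import re

def restore_case(token, snippets):
    """Return the most frequent surface form of `token` seen in the snippets."""
    counts = Counter()
    pattern = re.compile(r"\b" + re.escape(token) + r"\b", re.IGNORECASE)
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            counts[match.group(0)] += 1
    # Fall back to the input form if the word never occurs in the snippets.
    return counts.most_common(1)[0][0] if counts else token

snippets = [
    "The New York Times reported that ...",
    "new york times square is crowded ...",
    "Subscribe to The New York Times today.",
]
print(restore_case("times", snippets))   # -> "Times"
```
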
A Named Entity Extraction using Word Information Repeatedly Collected from Unlabeled Data

This paper proposes a method for Named Entity (NE) extraction using NE-related labels of words repeatedly collected from unlabeled data. NE-related labels of words are candidate NE classes of each word, NE classes of co-occurring words of each word, and so on. To collect NE-related labels of words, we extract NEs from unlabeled data with an NE extractor. Then we collect NE-related labels of words from the extraction results. We create a new NE extractor using the NE-related labels of each word as new features. The new NE extractor is used to collect new NE-related labels of words. The experimental results using the IREX data set for Japanese NE extraction show that our method contributes to improved accuracy.

Tomoya Iwakura
A Distributional Semantics Approach to Simultaneous Recognition of Multiple Classes of Named Entities

Named Entity Recognition and Classification has been studied for the last two decades. Since semantic features take a huge amount of training time and are slow at inference, existing tools apply features and rules mainly at the word level or use lexicons. Recent advances in distributional semantics allow us to efficiently create paradigmatic models that encode word order. We used Sahlgren et al.'s permutation-based variant of the Random Indexing model to create a scalable and efficient system that simultaneously recognizes multiple entity classes mentioned in natural language; it is validated on the GENIA corpus, which has annotations for 46 biomedical entity classes and supports nested entities. Using distributional semantics features only, it achieves an overall micro-averaged F-measure of 67.3% based on fragment matching, with performance ranging from 7.4% for “DNA substructure” to 80.7% for “Bioentity”.

Siddhartha Jonnalagadda, Robert Leaman, Trevor Cohen, Graciela Gonzalez
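
The following is a simplified, assumption-laden sketch of permutation-based Random Indexing in the spirit of Sahlgren et al.: each word receives a fixed sparse random index vector, and context vectors accumulate neighbours' index vectors rotated by their relative position (np.roll stands in for the permutation). Dimensionality, sparsity and window size are toy values, not the paper's settings.

```python
# Simplified sketch of permutation-based Random Indexing.
import numpy as np

DIM, NONZERO, WINDOW = 512, 8, 2
rng = np.random.default_rng(0)
index_vectors = {}

def index_vector(word):
    # Sparse ternary random vector, created on first use and then kept fixed.
    if word not in index_vectors:
        v = np.zeros(DIM)
        pos = rng.choice(DIM, size=NONZERO, replace=False)
        v[pos] = rng.choice([-1.0, 1.0], size=NONZERO)
        index_vectors[word] = v
    return index_vectors[word]

def context_vectors(tokens):
    ctx = {w: np.zeros(DIM) for w in tokens}
    for i, w in enumerate(tokens):
        for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
            if i != j:
                # np.roll plays the role of the permutation encoding word order.
                ctx[w] += np.roll(index_vector(tokens[j]), j - i)
    return ctx

ctx = context_vectors("the protein binds the dna region".split())
print(ctx["protein"][:10])
```
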

Semantics and Dialog

Invited Paper

The Recognition and Interpretation of Motion in Language

In this paper, we develop a framework for interpreting linguistic descriptions of places and locations as well as objects in motion as found in natural language texts. We present an overview of existing qualitative spatiotemporal models in order to discuss a more dynamic model of motion called Dynamic Interval Temporal Logic (DITL). The resulting static and dynamic descriptions are represented in a spatiotemporal markup language called STML. The STML output then enables a grounding within a metric representation such as Google Earth, through an automatic conversion to KML. Consistent with the STML output, DITL provides a semantics for STML for subsequent reasoning about the text.

James Pustejovsky, Jessica Moszkowicz, Marc Verhagen
Flexible Disambiguation in DTS

Quantifier scope ambiguities may engender several logical readings of an NL sentence. For instance, sentence (1) yields six possible readings, depending on the scoping of its three quantifiers: (∀ 5 ∃), (∀ ∃ 5), (∃ ∀ 5), (∃ 5 ∀), (5 ∀ ∃), and (5 ∃ ∀).

Livio Robaldo, Jurij Di Carlo
A Syntactic Textual Entailment System Based on Dependency Parser

The development of a syntactic textual entailment system that compares the dependency relations in both the text and the hypothesis is reported. The Stanford Dependency Parser has been run on the 2-way RTE-3 development set and the dependency relations obtained for a text and hypothesis pair have been compared. Some of the important comparisons are: subject-subject comparison, subject-verb comparison, object-verb comparison and cross subject-verb comparison. Corresponding verbs are further compared using WordNet. Each of the matches is assigned a weight learnt from the development corpus. A threshold on the fraction of matching hypothesis relations has been set based on the development set. The threshold score has been applied to the RTE-4 gold standard test set using the same methods of dependency parsing followed by comparisons. Evaluation scores obtained on the test set show 54.75% precision and 53% recall for YES decisions and 54.45% precision and 56.2% recall for NO decisions.

Partha Pakray, Alexander Gelbukh, Sivaji Bandyopadhyay
Semantic Annotation of Transcribed Audio Broadcast News Using Contextual Features in Graphical Discriminative Models

In this paper we propose an efficient approach to named entity recognition (NER) that exploits the hierarchical structure of entities in transcribed speech documents. The NER task consists of identifying and classifying every word in a document into some predefined categories such as person names, locations, organizations, and dates. Classical NER systems usually use generative approaches to learn models that consider only word characteristics (word context). In this work we show that NER is also sensitive to syntactic and semantic contexts. For this reason, we introduce an extension of the conditional random fields (CRFs) approach to consider multiple contexts. We present an adaptation of the text-based approach to automatic speech recognition (ASR) outputs. Experimental results show that the proposed approach outperforms a simple application of CRFs. Our experiments are done using ESTER 2 campaign data. The proposed approach ranked 4th among the ESTER 2 participating sites, achieving a significant relative improvement of 18% in the slot error rate (SER) measure over the HMM-based method.

Azeddine Zidouni, Hervé Glotin
Lexical Chains Using Distributional Measures of Concept Distance

In practice, lexical chains are typically built using term reiteration or resource-based measures of semantic distance. The former approach misses out on a significant portion of the inherent semantic information in a text, while the latter suffers from the limitations of the linguistic resource it depends upon.

In this paper, chains are constructed using the framework of distributional measures of concept distance, which combines the advantages of resource-based and distributional measures of semantic distance. These chains were evaluated by applying them to the task of text segmentation, where they performed as well as or better than state-of-the-art methods.

Meghana Marathe, Graeme Hirst
Incorporating Cohesive Devices into Entity Grid Model in Evaluating Local Coherence of Japanese Text

This paper describes improvements made to the entity grid local coherence model for Japanese text. We investigate the effectiveness of taking into account cohesive devices, such as conjunction, demonstrative pronoun, lexical cohesion, and refining syntactic roles for a topic marker in Japanese. To take into account lexical cohesion, we consider a semantic relation between entities using lexical chaining. Through the experiments on discrimination where the system has to select the more coherent sentence ordering, and comparison of the system’s ranking of automatically created summaries against human judgment based on quality questions, we show that these factors contribute to improve the performance of the entity grid model.

Hikaru Yokono, Manabu Okumura
A Sequential Model for Discourse Segmentation

Identifying discourse relations in a text is essential for various tasks in Natural Language Processing, such as automatic text summarization, question-answering, and dialogue generation. The first step of this process is segmenting a text into elementary units. In this paper, we present a novel model of discourse segmentation based on sequential data labeling. Namely, we use Conditional Random Fields to train a discourse segmenter on the RST Discourse Treebank, using a set of lexical and syntactic features. Our system is compared to other statistical and rule-based segmenters, including one based on Support Vector Machines. Experimental results indicate that our sequential model outperforms current state-of-the-art discourse segmenters, with an F-score of 0.94. This performance level is close to the human agreement F-score of 0.98.

Hugo Hernault, Danushka Bollegala, Mitsuru Ishizuka
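
A minimal sketch of discourse segmentation as sequence labeling, in the spirit of (but not identical to) the approach above: tokens are tagged B/I for the start/continuation of an elementary discourse unit and a CRF is trained with the sklearn-crfsuite package. The features and the single training sentence are illustrative assumptions, not the paper's lexical and syntactic feature set.

```python
# Hypothetical sketch of sequential discourse segmentation as token labeling:
# "B" marks the first token of an elementary discourse unit, "I" the rest.
import sklearn_crfsuite

def token_features(tokens, i):
    w = tokens[i]
    return {
        "word": w.lower(),
        "is_punct": w in ",.;:",
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# One toy training sentence, segmented into two discourse units.
tokens = "He left early because he was tired".split()
labels = ["B", "I", "I", "B", "I", "I", "I"]

X_train = [[token_features(tokens, i) for i in range(len(tokens))]]
y_train = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train)[0])
```
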
Towards Automatic Detection and Tracking of Topic Change

We present an approach for the automatic detection of topic change. Our approach is based on the analysis of statistical features of topics in time-sliced corpora and their dynamics over time. Processing large amounts of time-annotated news text, we identify new facets regarding a stream of topics consisting of the latest news of public interest. As an addition to the well-known task of topic detection and tracking, we aim to boil down a daily news stream to its novelty. For that we examine the contextual shift of concepts over time slices. To quantify the amount of change, we adopt the volatility measure from econometrics and propose a new algorithm for frequency-independent detection of topic drift and change of meaning. The proposed measure does not rely on plain word frequency but on the mixture of the co-occurrences of words. Thus, the analysis is largely independent of absolute word frequencies and works over the whole frequency spectrum, performing especially well for low-frequency words. Aggregating the computed time-related data for the terms allows us to build overview illustrations of the most evolving terms for a whole time span.

Florian Holz, Sven Teresniak
Modelling Illocutionary Structure: Combining Empirical Studies with Formal Model Analysis

In this paper we revisit the topic of dialogue grammars at the illocutionary force level and present a new approach to the formal modelling, evaluation and comparison of these models based on recursive transition networks. Through the use of appropriate tools such finite-state grammars can be formally analysed and validated against empirically collected corpora. To illustrate our approach we show: (a) the construction of human-human dialogue grammars on the basis of recently collected natural language dialogues in joint-task situations; and (b) the evaluation and comparison of these dialogue grammars using formal methods. This work provides a novel basis for developing and evaluating dialogue grammars as well as for engineering corpus-tailored dialogue managers which can be verified for adequacy.

Hui Shi, Robert J. Ross, Thora Tenbrink, John Bateman
A Polyphonic Model and System for Inter-animation Analysis in Chat Conversations with Multiple Participants

Discourse in instant messenger conversations (chats) with multiple participants is often composed of several intertwining threads. Some chat environments for Computer-Supported Collaborative Learning (CSCL) support and encourage the existence of parallel threads by providing explicit referencing facilities. The paper proposes a discourse model for such chats, based on Mikhail Bakhtin's dialogic theory. It considers that multiple voices (which are not limited to the participants) inter-animate, sometimes in a polyphonic, contrapuntal way. An implemented system is also presented, which analyzes such chat logs to detect additional, implicit links among utterances and threads and, more importantly for CSCL, to detect the involvement (inter-animation) of the participants in problem solving. The system begins with an NLP pipeline and concludes with inter-animation identification in order to generate feedback and to propose grades for the learners.

Stefan Trausan-Matu, Traian Rebedea

Humor and Emotions

Computational Models for Incongruity Detection in Humour

Incongruity resolution is one of the most widely accepted theories of humour, suggesting that humour is due to the mixing of two disparate interpretation frames in one statement. In this paper, we explore several computational models for incongruity resolution. We introduce a new data set, consisting of a series of ‘set-ups’ (preparations for a punch line), each of them followed by four possible coherent continuations out of which only one has a comic effect. Using this data set, we redefine the task as the automatic identification of the humorous punch line among all the plausible endings. We explore several measures of semantic relatedness, along with a number of joke-specific features, and try to understand their appropriateness as computational models for incongruity detection.

Rada Mihalcea, Carlo Strapparava, Stephen Pulman
Emotions in Words: Developing a Multilingual WordNet-Affect

In this paper we describe the process of Russian and Romanian WordNet-Affect creation. WordNet-Affect is a lexical resource created on the basis of the Princeton WordNet which contains information about the emotions that the words convey. It is organized in six basic emotions: anger, disgust, fear, joy, sadness, surprise. We translated the WordNet-Affect synsets into Russian and Romanian and created an aligned English – Romanian – Russian lexical resource. The resource is freely available for research purposes.

Victoria Bobicev, Victoria Maxim, Tatiana Prodan, Natalia Burciu, Victoria Angheluş
Emotion Holder for Emotional Verbs – The Role of Subject and Syntax

A human-like holder plays an important role in identifying the actual emotion expressed in text. This paper presents a baseline followed by a syntactic approach for capturing emotion holders in emotional sentences. The emotional verbs collected from the WordNet Affect List (WAL) have been used in extracting the holder-annotated emotional sentences from VerbNet. The baseline model is developed based on the subject information of the dependency-parsed emotional sentences. The unsupervised syntax-based model is based on the relationship of the emotional verbs with their argument structure, extracted from the head information of the chunks in the parsed sentences. Comparing the system-extracted argument structure with the available VerbNet frames' syntax for 942 emotional verbs, it has been observed that the syntax-based model outperforms the baseline model. The precision, recall and F-score values are 63.21%, 66.54% and 64.83% for the baseline model and 68.11%, 65.89% and 66.98% for the syntax-based model, respectively, on a collection of 4,112 emotional sentences.

Dipankar Das, Sivaji Bandyopadhyay

Machine Translation and Multilingualism

Best Paper Award – Third Place

A Chunk-Driven Bootstrapping Approach to Extracting Translation Patterns

We present a linguistically-motivated sub-sentential alignment system that extends the intersected IBM Model 4 word alignments. The alignment system is chunk-driven and requires only shallow linguistic processing tools for the source and the target languages, i.e. part-of-speech taggers and chunkers.

We conceive the sub-sentential aligner as a cascaded model consisting of two phases. In the first phase, anchor chunks are linked based on the intersected word alignments and syntactic similarity. In the second phase, we use a bootstrapping approach to extract more complex translation patterns.

The results show an overall AER reduction and competitive F-measures in comparison to the commonly used symmetrized IBM Model 4 predictions (intersection, union and grow-diag-final) on six different text types for English-Dutch. In particular, in comparison with the intersected word alignments, the proposed method improves recall without sacrificing precision. Moreover, the system is able to align discontiguous chunks, which frequently occur in Dutch.

Lieve Macken, Walter Daelemans
Computing Transfer Score in Example-Based Machine Translation

This paper presents an idea in Example-Based Machine Translation - computing the transfer score for each produced translation. When an EBMT system finds an example in the translation memory, it tries to modify the sentence in order to produce the best possible translation of the input sentence. The user of the system, however, is unable to judge the quality of the translation. This problem can be solved by providing the user with a percentage score for each translated sentence.

The idea to base transfer score computation on the similarity between the input sentence and the example is not sufficient. Real-life examples show that the transfer process is as likely to go well with a bad translation memory example as to fail with a good example.

This paper describes a method of computing transfer score strictly associated with the transfer process. The transfer score is inversely proportional to the number of linguistic operations executed on the example target sentence. The paper ends with an evaluation of the suggested method.

Rafał Jaworski
Systematic Processing of Long Sentences in Rule Based Portuguese-Chinese Machine Translation

Translation quality and parsing efficiency are often disappointing when rule-based machine translation systems deal with long sentences. Due to the complicated syntactic structure of the language, many ambiguous parse trees can be generated during the translation process, and it is not easy to select the most suitable parse tree for generating the correct translation. This paper presents an approach to parse and translate long sentences efficiently, applied to rule-based Portuguese-Chinese machine translation. A systematic approach that breaks down the length of the sentences based on patterns, clauses, conjunctions, and punctuation is considered to improve the performance of the parsing analysis. In addition, Constraint Synchronous Grammar is used to model both source and target languages simultaneously at the parsing stage to further reduce ambiguities and improve parsing efficiency.

Francisco Oliveira, Fai Wong, Iok-Sai Hong
Syntax Augmented Inversion Transduction Grammars for Machine Translation

In this paper we propose a novel method for inferring an Inversion Transduction Grammar (ITG) from a bilingual parallel corpus with linguistic information from the source or target language. Our method combines bilingual ITG parse trees with monolingual linguistic trees in order to obtain a Syntax Augmented ITG (SAITG). The use of a modified bilingual parsing algorithm with bracketing information makes possible that each bilingual subtree has a correspondent subtree in the monolingual parsing. In addition, several binarization techniques have been tested for the resulting SAITG. In order to evaluate the effects of the use of SAITGs in Machine Translation tasks, we have used them in an ITG-based machine translation decoder. The results obtained using SAITGs with the decoder for the IWSLT-08 Chinese-English machine translation task produce significant improvements in BLEU.

Guillem Gascó Mora, Joan Andreu Sánchez Peiró
Syntactic Structure Transfer in a Tamil to Hindi MT System – A Hybrid Approach

We describe syntactic structure transfer, a central design question in machine translation, between two languages, Tamil (source) and Hindi (target), belonging to two different language families, Dravidian and Indo-Aryan respectively. Tamil and Hindi differ extensively at the clausal construction level, and transferring the structure is difficult. The syntactic structure transfer described here is a hybrid approach where we use CRFs for identifying the clause boundaries in the source language, Transformation Based Learning (TBL) for extracting the rules, and semantic classification of postpositions (PSP) for choosing the semantically appropriate structure in constructions where there is a one-to-many mapping to the target language. We have evaluated the system using web data and the results are encouraging.

Sobha Lalitha Devi, Vijay Sundar Ram R, Pravin Pralayankar, Bakiyavathi T
A Maximum Entropy Approach to Syntactic Translation Rule Filtering

In this paper we will present a maximum entropy filter for the translation rules of a statistical machine translation system based on tree transducers. This filter can be successfully used to reduce the number of translation rules by more than 70% without negatively affecting translation quality as measured by BLEU. For some filter configurations, translation quality is even improved.

Our investigations include a discussion of the relationship of Alignment Error Rate and Consistent Translation Rule Score with translation quality in the context of Syntactic Statistical Machine Translation.

Marcin Junczys-Dowmunt
Mining Parenthetical Translations for Polish-English Lexica

Documents written in languages other than English sometimes include parenthetical English translations, usually of technical and scientific terminology. Techniques have been developed for extracting such translations (as well as transliterations) from large Chinese text corpora. This paper presents methods for mining parenthetical translations in Polish texts. The main difference between translation mining in Chinese and in Polish is that the latter language is based on the Latin alphabet, which makes it more difficult to identify English translations in Polish texts. On the other hand, some parenthetically translated terms are preceded by the abbreviation "ang." (= English), a kind of "anchor" that allows querying a Web search engine for such translations.

Filip Graliński
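
A hedged sketch of the "ang." anchor idea: a regular expression that captures a short Polish phrase followed by a parenthetical English gloss introduced by "ang.". The pattern, the allowed phrase length and the example sentence are illustrative assumptions, not the paper's extraction rules.

```python
# Hypothetical regex for parenthetical translations such as
# "wyszukiwanie informacji (ang. information retrieval)".
import re

PATTERN = re.compile(
    r"(?P<polish>[\wąćęłńóśźż-]+(?:\s+[\wąćęłńóśźż-]+){0,3})"   # up to 4 words
    r"\s*\(\s*ang\.\s*(?P<english>[^)]+)\)"                      # "(ang. ...)" anchor
)

text = ("System wyszukiwania informacji (ang. information retrieval system) "
        "zwraca dokumenty w odpowiedzi na zapytanie.")

for m in PATTERN.finditer(text):
    print(m.group("polish"), "=", m.group("english").strip())
```
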
Automatic Generation of Bilingual Dictionaries Using Intermediary Languages and Comparable Corpora

This paper outlines a strategy to build new bilingual dictionaries from existing resources. The method is based on two main tasks: first, a new set of bilingual correspondences is generated from two available bilingual dictionaries; second, the generated correspondences are validated by making use of a bilingual lexicon automatically extracted from non-parallel, comparable corpora. The quality of the entries of the derived dictionary is very high, similar to that of hand-crafted dictionaries. We report a case study where a new, noise-free English-Galician dictionary with about 12,000 correct bilingual correspondences was automatically generated.

Pablo Gamallo Otero, José Ramom Pichel Campos
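
A minimal sketch of the first step under stated assumptions (toy dictionaries, Spanish as the intermediary language): new correspondences are proposed by composing two existing bilingual dictionaries through the pivot language; the validation step against comparable corpora is only indicated in a comment.

```python
# Toy bilingual dictionaries (assumed data, not the authors' resources).
english_spanish = {"dog": ["perro"], "river": ["río"]}
spanish_galician = {"perro": ["can"], "río": ["río"]}

def compose(dict_ab, dict_bc):
    """Derive A->C correspondences by pivoting through language B."""
    derived = {}
    for src, pivots in dict_ab.items():
        targets = []
        for pivot in pivots:
            targets.extend(dict_bc.get(pivot, []))
        if targets:
            derived[src] = sorted(set(targets))
    return derived

candidates = compose(english_spanish, spanish_galician)
print(candidates)   # {'dog': ['can'], 'river': ['río']}

# Second step (not shown): keep only candidates whose source and target words
# are also translation equivalents in a lexicon extracted from comparable corpora.
```
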
Hierarchical Finite-State Models for Speech Translation Using Categorization of Phrases

In this work a hierarchical translation model is formally defined and integrated into a speech translation system. As is well known, the relations between two languages are better arranged in terms of phrases than in terms of running words. Nevertheless, phrase-based models may suffer from data sparsity at training time. The aim of this work is to improve current speech translation systems by integrating categorization within the translation model. The categories are sets of phrases, either linguistically or statistically motivated. The category, translation and acoustic models are all within the framework of finite-state models. As far as temporal cost is concerned, finite-state models count on efficient decoding algorithms. Regarding spatial cost, all the models were integrated on the fly at decoding time, allowing an efficient use of memory.

Raquel Justo, Alicia Pérez, M. Inés Torres, Francisco Casacuberta
Drive-by Language Identification
A Byproduct of Applied Prototype Semantics

While there exist many effective and efficient algorithms, most of them based on supervised n-gram or word dictionary methods, we propose a semi-supervised approach to language identification, based on prototype semantics.

Our method is primarily aimed at noise-rich environments with only very small text fragments to analyze and no training data available, even at analyzing the probable language affiliations of single words.

We have integrated our prototype system into a larger web crawling and information management architecture and evaluated the prototype against an experimental setup including datasets in 11 European languages.

Ronald Winnemöller
Identification of Translationese: A Machine Learning Approach

This paper presents a machine learning approach to the study of translationese. The goal is to train a computer system to distinguish between translated and non-translated text, in order to determine the characteristic features that influence the classifiers. Several algorithms reach up to 97.62% success rate on a technical dataset. Moreover, the SVM classifier consistently reports a statistically significant improved accuracy when the learning system benefits from the addition of simplification features to the basic translational classifier system. Therefore, these findings may be considered an argument for the existence of the Simplification Universal.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, Ruslan Mitkov

Information Extraction

Acquiring IE Patterns through Distributional Lexical Semantic Models

Techniques for the automatic acquisition of Information Extraction patterns are still a crucial issue in knowledge engineering. A semi-supervised learning method based on large-scale linguistic resources, such as FrameNet and WordNet, is discussed. In particular, a robust method for assigning conceptual relations (i.e. roles) to relevant grammatical structures is defined according to distributional models of lexical semantics over a large-scale corpus. Experimental results show that the use of the resulting knowledge base provides significant results, i.e. correct interpretations for about 90% of the covered sentences. This confirms the impact of the proposed approach on the quality and development time of large-scale IE systems.

Roberto Basili, Danilo Croce, Cristina Giannone, Diego De Cao
Multi-view Bootstrapping for Relation Extraction by Exploring Web Features and Linguistic Features

Binary semantic relation extraction from Wikipedia is particularly useful for various NLP and Web applications. Currently, frequent pattern mining-based methods and syntactic analysis-based methods are the two leading approaches to the semantic relation extraction task. With a novel view on integrating syntactic analysis of Wikipedia text with redundancy information from the Web, we propose a multi-view learning approach for bootstrapping relationships between entities, exploiting the complementarity of the Web view and the linguistic view. On the one hand, from the linguistic view, linguistic features are generated by linguistic parsing of Wikipedia texts, abstracting away from different surface realizations of semantic relations. On the other hand, Web features are extracted from the Web corpus to provide frequency information for relation extraction. Experimental evaluation on a relational dataset demonstrates that linguistic analysis of Wikipedia texts and Web collective information reveal different aspects of the nature of entity-related semantic relationships. It also shows that our multi-view learning method considerably boosts performance compared to learning with only one view of features, with the weaknesses of one view complemented by the strengths of the other.

Yulan Yan, Haibo Li, Yutaka Matsuo, Mitsuru Ishizuka
Sequential Patterns to Discover and Characterise Biological Relations

In this paper, we present a method to automatically detect and characterise interactions between genes in biomedical literature. Our approach is based on a combination of data mining techniques: frequent sequential patterns filtered by linguistic constraints and recursive mining. Unlike most Natural Language Processing (NLP) approaches, our approach does not use syntactic parsing to learn and apply linguistic rules. It does not require any resource except the training corpus to learn patterns.

The process is in two steps. First, frequent sequential patterns are extracted from the training corpus. Second, after validation of those patterns, they are applied on the application corpus to detect and characterise new interactions. An advantage of our method is that interactions can be enhanced with modalities and biological information.

We use two corpora containing only sentences with gene interactions as training corpus. Another corpus from PubMed abstracts is used as application corpus. We conduct an evaluation that shows that the precision of our approach is good and the recall correct for both targets: interaction detection and interaction characterisation.

Peggy Cellier, Thierry Charnois, Marc Plantevit
Extraction of Genic Interactions with the Recursive Logical Theory of an Ontology

We introduce an Information Extraction (IE) system which uses the logical theory of an ontology as a generalisation of typical information extraction patterns to extract biological interactions from text. This provides inference capabilities beyond current approaches: first, our system is able to handle multiple relations; second, it allows handling dependencies between relations, by deriving new relations from the previously extracted ones and using inference at a semantic level; third, it addresses recursive or mutually recursive rules. In this context, automatically acquiring the resources of an IE system becomes an ontology learning task: terms, synonyms, the conceptual hierarchy, the relational hierarchy, and the logical theory of the ontology have to be acquired. We focus on the last point, as learning the logical theory of an ontology, and a fortiori of a recursive one, remains a seldom studied problem. We validate our approach by using a relational learning algorithm, which handles recursion, to learn a recursive logical theory from a text corpus on the bacterium Bacillus subtilis. This theory achieves good recall and precision for the ten defined semantic relations, reaching a global recall of 67.7% and a precision of 75.5%, but more importantly, it captures complex mutually recursive interactions which were implicitly encoded in the ontology.

Alain-Pierre Manine, Erick Alphonse, Philippe Bessières
Ontological Parsing of Encyclopedia Information

Semi-automatic ontology learning from an encyclopedia is presented, with the primary focus on the syntactic and semantic analysis of definitions.

Victor Bocharov, Lidia Pivovarova, Valery Rubashkin, Boris Chuprin

Information Retrieval

Selecting the N-Top Retrieval Result Lists for an Effective Data Fusion

Although the application of data fusion in information retrieval has yielded good results in the majority of cases, it has been noticed that its success depends on the quality of the input result lists. In order to tackle this problem, in this paper we explore the combination of only the n-top result lists as an alternative to the fusion of all available data. In particular, we describe a heuristic measure based on redundancy and ranking information to evaluate the quality of each result list and, consequently, to select the presumably n-best lists per query. Preliminary results on four IR test collections, containing a total of 266 queries and employing three different DF methods, are encouraging. They indicate that the proposed approach can significantly outperform the results achieved by fusing all available lists, showing improvements in mean average precision of 10.7%, 3.7% and 18.8% when used along with the Maximum RSV, CombMNZ and Fuzzy Borda methods.

Antonio Juárez-González, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda, David Pinto-Avendaño, Manuel Pérez-Coutiño
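
For illustration, here is a hedged sketch of one of the fusion methods named above, CombMNZ, combined with a stand-in for the list-selection step; the quality measure used here (mean score of a list) is an assumption, not the paper's redundancy- and ranking-based heuristic.

```python
# CombMNZ fusion over a selected subset of result lists. Each result list maps
# document ids to normalized retrieval scores; all data below is illustrative.
from collections import defaultdict

def comb_mnz(result_lists):
    """CombMNZ: sum of normalized scores times the number of lists a doc appears in."""
    sums = defaultdict(float)
    hits = defaultdict(int)
    for scores in result_lists:
        for doc, s in scores.items():
            sums[doc] += s
            hits[doc] += 1
    fused = {doc: sums[doc] * hits[doc] for doc in sums}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

def select_top_lists(result_lists, n):
    # Stand-in for the paper's quality measure: keep the n lists with the
    # highest mean score.
    ranked = sorted(result_lists,
                    key=lambda l: sum(l.values()) / len(l), reverse=True)
    return ranked[:n]

lists = [
    {"d1": 0.9, "d2": 0.7, "d3": 0.1},
    {"d1": 0.8, "d3": 0.6},
    {"d2": 0.2, "d4": 0.1},
]
print(comb_mnz(select_top_lists(lists, n=2)))
```
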
Multi Word Term Queries for Focused Information Retrieval

In this paper, we address both standard and focused retrieval tasks based on comprehensible language models and interactive query expansion (IQE). Query topics are expanded using an initial set of Multi Word Terms (MWTs) selected from top n ranked documents. MWTs are special text units that represent domain concepts and objects. As such, they can better represent query topics than ordinary phrases or n-grams. We tested different query representations: bag-of-words, phrases, flat list of MWTs, subsets of MWTs. We also combined the initial set of MWTs obtained in an IQE process with automatic query expansion (AQE) using language models and smoothing mechanism. We chose as baseline the Indri IR engine based on the language model using Dirichlet smoothing. The experiment is carried out on two benchmarks: TREC Enterprise track (TRECent) 2007 and 2008 collections; INEX 2008 Ad-hoc track using the Wikipedia collection.

Eric SanJuan, Fidelia Ibekwe-SanJuan
Optimal IR: How Far Away?

There exists a gap between what a human user has in mind and what (s)he can get from information retrieval (IR) systems through his/her queries. We say an IR system is perfect if it can always provide the users with what they want in their minds, if available in the corpus, and optimal if it can present to the users what it finds in an optimal way. In this paper, we empirically study how far away we still are from the optimal IR or the perfect IR based on submitted runs to the TREC Genomics track 2007. We assume that perfect IR would always achieve a score of 100% for the given evaluation methods. The optimal IR is simulated by optimized runs based on the evaluation methods provided by TREC. The average performance difference between submitted runs and the perfect or optimal runs can then be obtained. Given the annual average performance improvement made by reranking reported in the literature, we figure out how far away we are from the optimal or the perfect IR. The study indicates that we are about 7 and 16 years away from the optimal and the perfect IR, respectively. These are by no means exact distances, but they do give us a partial perspective regarding where we are on the IR development path. This study also provides us with the lowest upper bound on IR performance improvement by reranking.

Xiangdong An, Xiangji Huang, and Nick Cercone
Adaptive Term Weighting through Stochastic Optimization

Term weighting strongly influences the performance of text mining and information retrieval approaches. Usually term weights are determined through statistical estimates based on static weighting schemes. Such static approaches lack the capability to generalize to different domains and different data sets. In this paper, we introduce an on-line learning method for adapting term weights in a supervised manner. Via stochastic optimization we determine a linear transformation of the term space to approximate expected similarity values among documents. We evaluate our approach on 18 standard text data sets and show that the performance improvement of a k-NN classifier ranges between 1% and 12% when adaptive term weighting is used as a preprocessing step. Further, we provide empirical evidence that our approach is efficient enough to cope with larger problems.

Michael Granitzer
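
A rough sketch, under stated assumptions, of what adapting term weights via stochastic optimization can look like: a diagonal (per-term) weighting is updated by SGD so that weighted document similarities approach supervised targets (1 for same-class pairs, 0 otherwise). The data, loss and learning rate are toy choices, not the paper's setup.

```python
# Toy numpy sketch: learn per-term weights so that weighted document similarity
# matches a supervised target (same class -> 1, different class -> 0).
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_terms = 20, 50
X = rng.random((n_docs, n_terms))
X = X / np.linalg.norm(X, axis=1, keepdims=True)   # length-normalized doc vectors
y = rng.integers(0, 2, size=n_docs)                # toy class labels

w = np.ones(n_terms)    # start from the static (uniform) weighting
lr = 0.05

for _ in range(5000):
    i, j = rng.integers(0, n_docs, size=2)
    target = 1.0 if y[i] == y[j] else 0.0
    pred = np.dot(w * X[i], w * X[j])              # similarity under current weights
    # Gradient of the squared error (pred - target)^2 with respect to w.
    grad = 2.0 * (pred - target) * 2.0 * w * X[i] * X[j]
    w -= lr * grad

print("learned term weights (first 10):", np.round(w[:10], 3))
```
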

Text Categorization and Classification

Enhancing Text Classification by Information Embedded in the Test Set

Current text classification methods are mostly based on a supervised approach, which requires a large number of examples to build accurate models. Unfortunately, in several tasks training sets are extremely small and their generation is very expensive. In order to tackle this problem, in this paper we propose a new text classification method that takes advantage of the information embedded in the test set itself. This method is supported by the idea that similar documents must belong to the same category. In particular, it classifies documents by considering not only their own content but also information about the category assigned to other similar documents from the same test set. Experimental results on four data sets of different sizes are encouraging. They indicate that the proposed method is appropriate for use with small training sets, where it can significantly outperform traditional approaches such as Naive Bayes and Support Vector Machines.

Gabriela Ramírez-de-la-Rosa, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda
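
A hedged sketch of the general idea (not the authors' exact algorithm): test documents are first classified with a standard supervised model, and each document's label is then refined by a vote over the labels predicted for its most similar neighbours within the same test set. Data and the blending rule are illustrative.

```python
# Two-pass classification that reuses information embedded in the test set.
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics.pairwise import cosine_similarity

train_docs = ["goal scored in the match", "election results announced",
              "the team won the cup", "parliament passed the law"]
train_labels = ["sport", "politics", "sport", "politics"]
test_docs = ["the cup final match", "new law in parliament",
             "team scored a late goal"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

# First pass: ordinary supervised predictions.
clf = MultinomialNB().fit(X_train, train_labels)
first_pass = clf.predict(X_test)

# Second pass: each test document also votes with its nearest test neighbours.
sim = cosine_similarity(X_test)
refined = []
for i in range(len(test_docs)):
    neighbours = sim[i].argsort()[::-1][1:3]     # two most similar test docs
    votes = Counter([first_pass[i]] * 2)         # own prediction counts double
    votes.update(first_pass[j] for j in neighbours)
    refined.append(votes.most_common(1)[0][0])

print(list(zip(test_docs, first_pass, refined)))
```
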
Rank Distance Aggregation as a Fixed Classifier Combining Rule for Text Categorization

In this paper we show that Rank Distance Aggregation can improve ensemble classifier precision in the classical text categorization task by presenting a series of experiments performed on a 20-class newsgroup corpus, with a single correct class per document. We aggregate four established document classification methods (TF-IDF, Probabilistic Indexing, Naive Bayes and KNN) in different training scenarios and compare these results to widely used fixed combining rules such as Voting, Min, Max, Sum, Product and Median.

Liviu P. Dinu, Andrei Rusu
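
A small illustrative sketch of rank-distance-based aggregation: each classifier contributes a ranking of the candidate classes, and the aggregate is the ordering that minimizes the summed rank distance to all inputs (feasible by enumeration for a handful of classes). The class names and rankings below are toy values, not the paper's data.

```python
# Rank distance aggregation as a fixed combining rule for classifier ensembles.
from itertools import permutations

def rank_distance(r1, r2):
    """Rank distance: sum of absolute differences of item positions."""
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    return sum(abs(pos1[c] - pos2[c]) for c in pos1)

def aggregate(rankings):
    classes = rankings[0]
    best = min(permutations(classes),
               key=lambda cand: sum(rank_distance(cand, r) for r in rankings))
    return list(best)

# Rankings of four classes produced by four hypothetical base classifiers.
rankings = [
    ["sci", "rec", "talk", "comp"],     # e.g. TF-IDF
    ["sci", "talk", "rec", "comp"],     # e.g. Probabilistic Indexing
    ["rec", "sci", "comp", "talk"],     # e.g. Naive Bayes
    ["sci", "rec", "comp", "talk"],     # e.g. KNN
]
aggregated = aggregate(rankings)
print("aggregated ranking:", aggregated)
print("predicted class:", aggregated[0])
```
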
The Influence of Collocation Segmentation and Top 10 Items to Keyword Assignment Performance

Automatic document annotation from a controlled conceptual thesaurus is useful for establishing precise links between similar documents. This study presents a language-independent document annotation system based on features derived from a novel collocation segmentation method. Using the multilingual conceptual thesaurus EuroVoc, we evaluate filtered and unfiltered versions of the method, comparing it against other language-independent methods based on single words and bigrams. Testing our new method against the manually tagged multilingual corpus Acquis Communautaire 3.0 (AC) using all descriptors found there, we attain improvements in keyword assignment precision from 18 to 29 percent and in F-measure from 17.2 to 27.6 for 5 keywords assigned to a document. Further filtering out the top 10 most frequent items improves precision by 4 percent, and collocation segmentation improves precision by 9 percent on average, over the 21 languages tested.

Vidas Daudaravicius
A General Bio-inspired Method to Improve the Short-Text Clustering Task

“Short-text clustering” is a very important research field due to the current tendency for people to use very short documents, e.g. blogs, text-messaging and others. In some recent works, new clustering algorithms have been proposed to deal with this difficult problem and novel bio-inspired methods have reported the best results in this area. In this work, a general bio-inspired method based on the AntTree approach is proposed for this task. It takes as input the results obtained by arbitrary clustering algorithms and refines them in different stages. The proposal shows an interesting improvement in the results obtained with different algorithms on several short-text collections.

Diego Ingaramo, Marcelo Errecalde, Paolo Rosso
An Empirical Study on the Feature’s Type Effect on the Automatic Classification of Arabic Documents

The Arabic language is highly inflectional and morphologically very rich. It presents serious challenges to the automatic classification of documents, one of which is determining what type of attribute to use in order to obtain optimal classification results. Some researchers use roots or lemmas, which, they say, are able to handle inflections that do not appear in other languages in that fashion. Others prefer to use character-level n-grams, since n-grams are simpler to implement, language independent, and produce satisfactory results. So which of these two approaches is better, if either? This paper tries to answer this question by offering a comparative study of four feature types: words in their original form, lemmas, roots, and character-level n-grams, and shows how each affects the performance of the classifier. We used and compared the performance of the Support Vector Machines and Naïve Bayesian Networks algorithms.

Saeed Raheel, Joseph Dichy

Plagiarism Detection

Word Length n-Grams for Text Re-use Detection

The automatic detection of shared content in written documents – which includes text reuse and its unacknowledged commitment, plagiarism – has become an important problem in Information Retrieval. This task requires exhaustive comparison of texts in order to determine how similar they are. However, such comparison is impossible in those cases where the amount of documents is too high. Therefore, we have designed a model for the proper pre-selection of closely related documents in order to perform the exhaustive comparison afterwards. We use a similarity measure based on word-level n-grams, which has proved to be quite effective in many applications. As this approach becomes impracticable for real-world large datasets, we propose a method based on a preliminary word-length encoding of texts, substituting each word by its length, which provides three important advantages: (i) since the alphabet of the documents is reduced to nine symbols, the space needed to store n-gram lists is reduced; (ii) computation times are decreased; and (iii) length n-grams can be represented in a trie, allowing a more flexible and fast comparison. We show experimentally, on the basis of the perplexity measure, that the noise introduced by the length encoding does not significantly decrease the expressiveness of the text. The method is then tested on two large datasets of co-derivatives and simulated plagiarism.

Alberto Barrón-Cedeño, Chiara Basile, Mirko Degli Esposti, Paolo Rosso
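
A minimal sketch of the word-length encoding described above, assuming lengths are capped at 9 so the alphabet stays at nine symbols; documents are compared by the Jaccard overlap of their length n-grams as a cheap pre-selection signal (the trie representation and the perplexity analysis are not shown).

```python
# Word-length encoding and length n-gram comparison for pre-selection.
def length_encode(text):
    return [min(len(w), 9) for w in text.split()]

def length_ngrams(text, n=5):
    code = length_encode(text)
    return {tuple(code[i:i + n]) for i in range(len(code) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

doc_a = "the automatic detection of shared content in written documents"
doc_b = "an automatic detection of copied content in written reports"

# A high overlap of length n-grams flags the pair for exhaustive comparison.
print(round(jaccard(length_ngrams(doc_a), length_ngrams(doc_b)), 3))
```
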
Who’s the Thief? Automatic Detection of the Direction of Plagiarism

Determining the direction of plagiarism (who plagiarized whom in a given pair of documents) is one of the most interesting problems in the field of automatic plagiarism detection. We present here an approach using an extension of the method Encoplot, which won the 1st international competition on plagiarism detection in 2009. We have tested it on a large-scale corpus of artificial plagiarism, with good results.

Cristian Grozea, Marius Popescu

Text Summarization

Best Student Paper Award

Integer Linear Programming for Dutch Sentence Compression

Sentence compression is a valuable task in the framework of text summarization. In this paper we compress sentences from news articles from Dutch and Flemish newspapers written in Dutch using an integer linear programming approach. We rely on the Alpino parser available for Dutch and on the Latent Words Language Model. We demonstrate that the integer linear programming approach yields good results for compressing Dutch sentences, despite the large freedom in word order.

Jan De Belder, Marie-Francine Moens
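
To make the ILP formulation concrete, here is a hedged sketch using the PuLP solver rather than the paper's exact model: a binary variable decides whether each token is kept, the objective maximizes the importance of kept tokens, and constraints bound the compressed length and encode a toy dependency restriction. Importance scores and the example sentence are assumptions; the paper derives its scores and constraints from the Alpino parse and the Latent Words Language Model.

```python
# Sentence compression as an integer linear program (illustrative only).
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

tokens = ["de", "minister", "heeft", "gisteren", "het", "nieuwe", "plan", "aangekondigd"]
importance = [0.1, 0.9, 0.6, 0.2, 0.1, 0.4, 0.9, 0.8]
max_length = 5

prob = LpProblem("sentence_compression", LpMaximize)
keep = [LpVariable(f"keep_{i}", cat="Binary") for i in range(len(tokens))]

# Objective: total importance of the tokens that survive compression.
prob += lpSum(importance[i] * keep[i] for i in range(len(tokens)))
# Length constraint on the compressed sentence.
prob += lpSum(keep) <= max_length
# Example dependency-style constraint: keep the determiner only with its noun.
prob += keep[4] <= keep[6]

prob.solve()
print([tokens[i] for i in range(len(tokens)) if value(keep[i]) == 1])
```
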
GEMS: Generative Modeling for Evaluation of Summaries

Automated evaluation is crucial in the context of automated text summaries, as is the case with evaluation of any of the language technologies. In this paper we present a Generative Modeling framework for the evaluation of the content of summaries. We used two simple alternatives for identifying signature-terms from the reference summaries, based on model consistency and Parts-Of-Speech (POS) features. By using a Generative Modeling approach we capture the sentence-level presence of these signature-terms in peer summaries. We show that parts-of-speech such as noun and verb give a simple and robust method for signature-term identification for the Generative Modeling approach. We also show that having a large set of ‘significant signature-terms’ is better than a small set of ‘strong signature-terms’ for our approach. Our results show that the generative modeling approach is indeed promising — providing high correlations with manual evaluations — and that further investigation of signature-term identification methods would yield further improvements. The efficacy of the approach can be seen from its ability to capture ‘overall responsiveness’ much better than the state-of-the-art in distinguishing a human from a system.

Rahul Katragadda
Quantitative Evaluation of Grammaticality of Summaries

Automated evaluation is crucial in the context of automated text summaries, as is the case with evaluation of any of the language technologies. While the quality of a summary is determined by both the content and the form of a summary, throughout the literature there has been extensive study of the automatic and semi-automatic evaluation of the content of summaries, and most such applications have been largely successful. What is lacking is a careful investigation of the automated evaluation of the readability aspects of a summary. In this work we dissect readability into five parameters and try to automate the evaluation of the grammaticality of text summaries. We use surface-level methods like n-grams and longest common subsequences over POS-tag sequences and chunk-tag sequences to capture acceptable grammatical constructions, and these approaches have produced impressive results. Our results show that it is possible to use relatively shallow features to quantify the degree of acceptability of grammaticality.

Ravikiran Vadlapudi, Rahul Katragadda

Speech Generation

Integrating Contrast in a Framework for Predicting Prosody

Information Structure (IS) is known to have a significant effect on prosody, making the identification of this effect crucial for improving the quality of synthetic speech. Recent theories identify contrast as a central IS element affecting accentuation. This paper presents the results of two experiments aiming to investigate the function of the different levels of contrast within the topic and focus of the utterance, and their effect on the prosody of Greek. Analysis showed that distinguishing between at least two contrast types is important for determining the appropriate accent type, and, therefore, such a distinction should be included in a description of the IS – Prosody interaction. For this description to be useful for practical applications, a framework is required that makes this information accessible to the speech synthesizer. This work reports on the integration into such a language-independent framework of all identified grammatical and syntactic prerequisites for creating a linguistically enriched input for speech synthesis.

Pepi Stavropoulou, Dimitris Spiliotopoulos, Georgios Kouroupetroglou
Backmatter
Metadata
Title
Computational Linguistics and Intelligent Text Processing
Edited by
Alexander Gelbukh
Copyright Year
2010
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-12116-6
Print ISBN
978-3-642-12115-9
DOI
https://doi.org/10.1007/978-3-642-12116-6