About this Book

This two-volume set, consisting of LNCS 6608 and LNCS 6609, constitutes the thoroughly refereed proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2011, held in Tokyo, Japan, in February 2011. The 74 full papers, presented together with 4 invited papers, were carefully reviewed and selected from 298 submissions. The contents are organized in the following topical sections: lexical resources; syntax and parsing; part-of-speech tagging and morphology; word sense disambiguation; semantics and discourse; opinion mining and sentiment detection; text generation; machine translation and multilingualism; information extraction and information retrieval; text categorization and classification; summarization and recognizing textual entailment; authoring aid, error correction, and style analysis; and speech recognition and generation.

Table of Contents

Frontmatter

Lexical Resources

Influence of Treebank Design on Representation of Multiword Expressions

Multiword Expressions (MWEs) are important linguistic units that require special treatment in many NLP applications, so it is desirable to be able to recognize them automatically. Semantically annotated corpora should mark MWEs in a clear way that facilitates the development of automatic recognition tools. In the present paper we discuss various corpus design decisions from this perspective. We propose guidelines that should lead to MWE-friendly annotation and evaluate them on numerous example sentences. Our experience of identifying MWEs in the Prague Dependency Treebank provides the basis for the discussion, and examples from other languages are added where appropriate.

Eduard Bejček, Pavel Straňák, Daniel Zeman

Combining Contextual and Structural Information for Supersense Tagging of Chinese Unknown Words

Supersense tagging classifies unknown words into semantic categories defined by lexicographers and inserts them into a thesaurus. Previous studies on supersense tagging show that context-based methods perform well for English unknown words, while structure-based methods perform well for Chinese unknown words. The challenge before us is how to successfully combine contextual and structural information for supersense tagging of Chinese unknown words. We propose a simple yet effective approach to address this challenge. In this approach, contextual information is used to measure the contextual similarity between words, while structural information is used to filter candidate synonyms and to adjust the contextual similarity score. Experimental results show that the proposed approach outperforms the state-of-the-art context-based and structure-based methods.

Likun Qiu, Yunfang Wu, Yanqiu Shao
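
As an illustration of the combination this abstract describes, here is a minimal sketch (not the authors' implementation) in which structural information filters candidate synonyms, here via the shared head character of a Chinese compound, and rescales a contextual cosine similarity; the `bonus` weight, the dict-based context vectors, and the head-character heuristic are assumptions of the sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse context vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(unknown, thesaurus, context_vectors, bonus=0.3):
    """Rank thesaurus words as synonym candidates for an unknown word.

    Structural filter: keep only candidates sharing the unknown word's
    head character (in Chinese compounds, often the final character).
    Structural adjustment: boost the contextual score of survivors.
    """
    head = unknown[-1]
    scored = []
    for cand in thesaurus:
        if cand[-1] != head:                    # structural filter
            continue
        sim = cosine(context_vectors[unknown], context_vectors[cand])
        scored.append((cand, sim * (1.0 + bonus)))  # structural adjustment
    return sorted(scored, key=lambda x: x[1], reverse=True)
```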

Identification of Conjunct Verbs in Hindi and Its Effect on Parsing Accuracy

This paper presents work on the identification of conjunct verbs in Hindi. We first investigate which noun-verb combinations make a conjunct verb in Hindi, using a set of linguistic diagnostics. We then examine which of these diagnostics can be used as features in a MaxEnt-based automatic identification tool. Finally, we use this tool to incorporate certain features into a graph-based dependency parser and show an improvement over the previous best Hindi parsing accuracy.

Rafiya Begum, Karan Jindal, Ashish Jain, Samar Husain, Dipti Misra Sharma

Identification of Reduplicated Multiword Expressions Using CRF

This paper deals with the identification of Reduplicated Multiword Expressions (RMWEs), which is important for natural language applications such as Machine Translation and Information Retrieval. In the present task, reduplicated MWEs are identified in Manipuri-language texts using a CRF tool. Manipuri is highly agglutinative, and reduplication is frequent in this language. The features selected for the CRF tool include stem words, the number of suffixes, the number of prefixes, the prefixes in the word, the suffixes in the word, the Part-Of-Speech (POS) tags of the surrounding words, the surrounding stem words, word length, word frequency, and a digit feature. Experimental results show the effectiveness of the proposed approach, with overall average Recall, Precision and F-Score values of 92.91%, 91.90% and 92.40%, respectively.

Kishorjit Nongmeikapam, Dhiraj Laishram, Naorem Bikramjit Singh, Ngariyanbam Mayekleima Chanu, Sivaji Bandyopadhyay
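
The feature set listed above lends itself to a straightforward per-token encoding for a CRF toolkit. The sketch below assumes pre-computed stems, POS tags, and affix lists; the feature names and the two-token context window are illustrative choices, not taken from the paper.

```python
def token_features(i, words, stems, pos_tags, prefixes, suffixes):
    """Build a feature dict for token i, echoing the features listed above:
    stem, affix counts, context POS/stems, word length, digit flag."""
    w = words[i]
    feats = {
        "stem": stems[i],
        "pos": pos_tags[i],
        "length": len(w),
        "has_digit": any(c.isdigit() for c in w),
        "n_prefixes": sum(w.startswith(p) for p in prefixes),
        "n_suffixes": sum(w.endswith(s) for s in suffixes),
    }
    # Context window: POS tags and stems of neighbouring tokens.
    for off in (-2, -1, 1, 2):
        j = i + off
        if 0 <= j < len(words):
            feats[f"pos[{off}]"] = pos_tags[j]
            feats[f"stem[{off}]"] = stems[j]
    return feats
```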

Syntax and Parsing

Computational Linguistics and Natural Language Processing

Research in Computational Linguistics (CL) and Natural Language Processing (NLP) has become increasingly dissociated. Empirical techniques in NLP show good performance on some tasks when large amounts of annotated data are available. However, in order for these techniques to be adapted easily to new text types or domains, or for similar techniques to be applied to tasks more complex than POS tagging or parsing, such as textual entailment, a rational understanding of language is required. Engineering techniques have to be underpinned by scientific understanding. In this paper, taking grammar in CL and parsing in NLP as an example, we discuss how to re-integrate these two research disciplines. Research results of our group on parsing are presented to show how grammar in CL is used as the backbone of a parser.

Jun’ichi Tsujii

An Unsupervised Approach for Linking Automatically Extracted and Manually Crafted LTAGs

Though the lack of a semantic representation in automatically extracted LTAGs is an obstacle to using this formalism, the advent of powerful statistical parsers trained on them has brought these grammars more attention than before. In contrast to this grammatical class, there are widely used manually crafted LTAGs that are enriched with a semantic representation but suffer from the lack of efficient parsers. The semantic representation available in the latter grammars, together with the statistical capabilities of the former, encouraged us to construct a link between them.

Here, focusing on the automatically extracted LTAG used by MICA [4] and the manually crafted English LTAG known as the XTAG grammar [32], a statistical approach based on HMMs is proposed that maps each sequence of the former's elementary trees onto a sequence of the latter's elementary trees. To prevent the HMM training algorithm from converging to a local optimum, an EM-based process for initializing the HMM parameters is also proposed. Experimental results show that the mapping method provides a satisfactory way to cover the deficiencies that arise in one grammar with the capabilities available in the other.

Heshaam Faili, Ali Basirat
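
At decoding time, the mapping described above is a standard HMM problem: hidden states are elementary trees of one grammar, observations are elementary trees of the other. The following generic Viterbi decoder is a sketch under that reading; the probability tables (plain nested dicts here) would come from the EM-initialized training the authors describe and are assumed inputs.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state sequence (e.g. XTAG trees) for an
    observation sequence (e.g. MICA trees). All tables are nested dicts
    of probabilities; log-space avoids underflow on long sequences."""
    lp = math.log
    # Initialization with the first observation.
    V = {s: (lp(start_p[s]) + lp(emit_p[s][obs[0]]), [s]) for s in states}
    for o in obs[1:]:
        nxt = {}
        for s in states:
            # Best predecessor for state s.
            score, path = max(
                ((V[p][0] + lp(trans_p[p][s]), V[p][1]) for p in states),
                key=lambda t: t[0])
            nxt[s] = (score + lp(emit_p[s][o]), path + [s])
        V = nxt
    return max(V.values(), key=lambda t: t[0])[1]
```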

Tamil Dependency Parsing: Results Using Rule Based and Corpus Based Approaches

Very few attempts have been reported in the literature on dependency parsing for Tamil. In this paper, we report results obtained for Tamil dependency parsing with rule-based and corpus-based approaches. We designed an annotation scheme partially based on the Prague Dependency Treebank (PDT) and manually annotated Tamil data (about 3,000 words) with dependency relations. For the corpus-based approach, we used two well-known parsers, MaltParser and MSTParser; for the rule-based approach, we implemented a series of linguistic rules (for resolving coordination, complementation, predicate identification, and so on) to build dependency structures for Tamil sentences. Our initial results show that both the rule-based and corpus-based approaches achieved accuracies of more than 74% for the unlabeled task and more than 65% for the labeled task. Rule-based parsing accuracy dropped considerably when the input was tagged automatically.

Loganathan Ramasamy, Zdeněk Žabokrtský

Incremental Combinatory Categorial Grammar and Its Derivations

Incremental parsing is appealing for applications such as speech recognition and machine translation due to its inherent efficiency as well as being a natural match for the language models commonly used in such systems. In this paper we introduce an Incremental Combinatory Categorial Grammar (ICCG) that extends the standard CCG grammar to enable fully incremental left-to-right parsing. Furthermore, we introduce a novel dynamic programming algorithm to convert CCGbank normal-form derivations to incremental left-to-right derivations and show that our incremental CCG derivations can recover the unlabeled predicate-argument dependency structures with more than 96% F-measure. The introduced CCG incremental derivations can be used to train an incremental CCG parser.

Ahmed Hefny, Hany Hassan, Mohamed Bahgat

Dependency Syntax Analysis Using Grammar Induction and a Lexical Categories Precedence System

The unsupervised approach to syntactic analysis tries to discover the structure of the text using only raw text. In this paper we explore this approach using grammar inference algorithms. While there is still room for improvement, our approach tries to minimize the effect of the current limitations of some grammar inductors by adding morphological information before the grammar induction process, and introduces a novel system for converting a shallow parse to dependencies, which reconstructs information about the inductor's undiscovered heads by means of a lexical categories precedence system. The performance of our parser, which needs no syntactically tagged resources or rules and is trained on a small corpus, is 10% below that of commercial semi-supervised dependency analyzers for Spanish, and comparable to the state of the art for English.

Hiram Calvo, Omar J. Gambino, Alexander Gelbukh, Kentaro Inui

Labelwise Margin Maximization for Sequence Labeling

In sequence labeling problems, the objective functions of most learning algorithms are usually inconsistent with evaluation measures such as Hamming loss. In this paper, we propose an online learning algorithm that addresses labelwise margin maximization for sequence labeling. We decompose the sequence margin into per-label margins and maximize these per-label margins individually, which serves to minimize the Hamming loss of the sequence. We compare our algorithm with three state-of-the-art methods on three tasks, and the experimental results show that our algorithm outperforms the others.

Wenjun Gao, Xipeng Qiu, Xuanjing Huang
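
One way to picture labelwise margin maximization is as an online update that enforces a margin at each position separately rather than over the whole sequence score. The sketch below is a simplification under that reading, not the authors' algorithm; the per-position feature map `phi`, the unit margin, and the omission of transition features are all assumptions.

```python
def labelwise_update(w, feats, gold, labels, phi, eta=0.1):
    """One online pass over a sequence.

    w: numpy weight vector; gold: gold label sequence;
    phi(feats, t, label): numpy feature vector for assigning `label`
    at position t (transition features are ignored in this sketch)."""
    for t, g in enumerate(gold):
        # Highest-scoring incorrect label at this position.
        rival = max((l for l in labels if l != g),
                    key=lambda l: w @ phi(feats, t, l))
        # Per-label margin between the gold label and its best rival.
        margin = w @ phi(feats, t, g) - w @ phi(feats, t, rival)
        if margin < 1.0:  # labelwise margin violated at position t
            w = w + eta * (phi(feats, t, g) - phi(feats, t, rival))
    return w
```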

Co-related Verb Argument Selectional Preferences

Learning selectional preferences has been approached as a verb-and-argument problem, or at most as a ternary relationship between subject, verb and object. The correlation of all arguments in a sentence, however, has not been extensively studied for measuring sentence plausibility, because of the increased number of potential combinations and data sparseness. We propose a unified model for machine learning using SVM (Support Vector Machines) with features based on topic-projected words from a PLSI (Probabilistic Latent Semantic Indexing) model and PMI (Pointwise Mutual Information) as co-occurrence features, and WordNet top-concept projected words as semantic classes. We perform tests using a pseudo-disambiguation task. We found that considering all arguments in a sentence improves the correct identification of plausible sentences, with, among other things, a 10% increase in recall.

Hiram Calvo, Kentaro Inui, Yuji Matsumoto

Combining Diverse Word-Alignment Symmetrizations Improves Dependency Tree Projection

For many languages, we are not able to train any supervised parser, because no manually annotated data are available. This problem can be solved by using a parallel corpus with English, parsing the English side, projecting the dependencies through word-alignment connections, and training a parser on the projected trees. In this paper, we introduce a simple algorithm using a combination of various word-alignment symmetrizations. We show that our method outperforms previous work, even though it uses McDonald's maximum-spanning-tree parser as is, without any "unsupervised" modifications.

David Mareček
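
The core of the combination can be sketched with deliberately simplified data structures (real word alignments are many-to-many): project each English dependency edge through every symmetrized alignment and keep the target-side edges supported by most symmetrizations. The majority threshold is an illustrative choice, not the paper's rule.

```python
from collections import Counter

def project_edges(en_edges, alignments):
    """en_edges: iterable of (head, dep) token indices on the English side.
    alignments: one dict per symmetrization (e.g. intersection,
    grow-diag-final, union), mapping English index -> target index."""
    votes = Counter()
    for align in alignments:
        for head, dep in en_edges:
            if head in align and dep in align:
                votes[(align[head], align[dep])] += 1
    # Keep target-side edges supported by a majority of symmetrizations.
    threshold = len(alignments) / 2
    return [edge for edge, v in votes.items() if v > threshold]
```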

An Analysis of Tree Topological Features in Classifier-Based Unlexicalized Parsing

A novel set of “tree topological features” (TTFs) is investigated for improving a classifier-based unlexicalized parser. The features capture the location and shape of subtrees in the treebank. Four main categories of TTFs are proposed and compared. Experimental results showed that each of the four categories independently improved the parsing accuracy significantly over the baseline model. When combined using the ensemble technique, the best unlexicalized parser achieves 84% accuracy without any extra language resources, and matches the performance of early lexicalized parsers. Linguistically, TTFs approximate linguistic notions such as grammatical weight, branching property and structural parallelism. This is illustrated by studying how the features capture structural parallelism in processing coordinate structures.

Samuel W. K. Chan, Mickey W. C. Chong, Lawrence Y. L. Cheung

Part of Speech Tagging and Morphology

Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?

I examine what would be necessary to move part-of-speech tagging performance from its current level of about 97.3% token accuracy (56% sentence accuracy) to close to 100% accuracy. I suggest that it must still be possible to greatly increase tagging performance and examine some useful improvements that have recently been made to the Stanford Part-of-Speech Tagger. However, an error analysis of some of the remaining errors suggests that there is limited further mileage to be had either from better machine learning or better features in a discriminative sequence classifier. The prospects for further gains from semi-supervised learning also seem quite limited. Rather, I suggest and begin to demonstrate that the largest opportunity for further progress comes from improving the taxonomic basis of the linguistic resources from which taggers are trained. That is, from improved descriptive linguistics. However, I conclude by suggesting that there are also limits to this process. The status of some words may not be adequately captured by assigning them to one of a small number of categories. While conventions can be used in such cases to improve tagging consistency, they lack a strong linguistic basis.

Christopher D. Manning

Ripple Down Rules for Part-of-Speech Tagging

This paper presents a new approach to learning a rule-based system for part-of-speech tagging. Our approach is based on an incremental knowledge acquisition methodology where rules are stored in an exception structure and new rules are added only to correct errors of existing rules, thus allowing systematic control of the interaction between rules. Experimental results of our approach on English show that we achieve the best accuracy published to date: 97.095% on the Penn Treebank corpus. We also obtain the best performance on the Vietnamese VietTreeBank corpus.

Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham, Dang Duc Pham

An Efficient Part-of-Speech Tagger for Arabic

In this paper, we present an efficient part-of-speech (POS) tagger for Arabic based on a Hidden Markov Model. We explore different enhancements to improve the baseline system. Despite the morphological complexity of Arabic, our approach is data-driven and does not utilize a morphological analyzer or a lexicon, as many other Arabic POS taggers do. This makes our approach simple, very efficient, and valuable for use in real-life applications, while the obtained accuracy remains comparable to other Arabic POS taggers. In the experiments, we also thoroughly investigate different aspects of Arabic POS tagging, including tag sets and prefix and suffix analyses, which were not examined in detail before. Our part-of-speech tagger achieves an accuracy of 95.57% on a standard tagset for Arabic. A detailed error analysis is provided for a better evaluation of the system. We also applied the same approach to other languages, such as Farsi and German, to show its language-independent aspect. Accuracy rates for these languages are also provided.

Selçuk Köprü

An Evaluation of Part of Speech Tagging on Written Second Language Spanish

With the increase in the number and size of computer learner corpora in the field of Second Language Acquisition, there is a growing need to automatically analyze the language produced by learners. However, the computational tools developed for natural language processing are generally not considered appropriate because they are designed to process native language. This paper analyzes the reliability of two part-of-speech taggers on second language Spanish and investigates the most frequent tagger errors and the impact of learner errors on the performance of the taggers.

M. Pilar Valverde Ibañez

Onoma: A Linguistically Motivated Conjugation System for Spanish Verbs

In this paper we introduce a new conjugation tool which generates and analyses both existing verbs and verb neologisms in Spanish. This application of finite state transducers is based on novel, linguistically motivated morphological rules describing the verbal paradigm. Given that these rules are simpler than those created in previous developments and are easy to learn and remember, the method can also be employed as a pedagogic tool in itself. A comparative evaluation of the tool against other online conjugators demonstrates its efficacy.

Luz Rello, Eduardo Basterrechea

Word Sense Disambiguation

Measuring Similarity of Word Meaning in Context with Lexical Substitutes and Translations

Representation of word meaning has been a topic of considerable debate within the field of computational linguistics, and particularly in the subfield of word sense disambiguation. While word senses enumerated in manually produced inventories have been very useful as a starting point for research, we know that the inventory should be selected for the purposes of the application. Unfortunately, we have no clear understanding of how to determine the appropriateness of an inventory for monolingual applications, or when the target language is unknown in cross-lingual applications. In this paper we examine datasets which have paraphrases or translations as alternative annotations of lexical meaning on the same underlying corpus data. We demonstrate that overlap in lexical paraphrases (substitutes) between two uses of the same lemma correlates with overlap in translations. We compare the degree of overlap with annotations of usage similarity on the same data and show that the overlaps in paraphrases or translations also correlate with the similarity judgements. This bodes well for using any of these methods to evaluate unsupervised representations of lexical semantics. We do, however, find that the relationship breaks down for some lemmas, and this behaviour on a lemma-by-lemma basis itself correlates with low inter-tagger agreement and higher proportions of mid-range points on a usage similarity dataset. Lemmas which have many inter-related usages might potentially be predicted from such data.

Diana McCarthy
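
The overlap measure at the heart of this comparison can be pictured very simply. The sketch below uses a Jaccard-style intersection-over-union between two usages' annotation sets; this particular formula is an illustrative choice, not necessarily the exact measure used in the paper.

```python
def overlap(annots_a, annots_b):
    """annots_a, annots_b: substitute (or translation) sets produced by
    annotators for two uses of the same lemma. Returns the
    intersection-over-union of the two sets, in [0, 1]."""
    a, b = set(annots_a), set(annots_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Two uses of "bank": high substitute overlap suggests similar meaning.
print(overlap({"shore", "edge", "side"}, {"shore", "side", "slope"}))  # 0.5
```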

A Quantitative Evaluation of Global Word Sense Induction

Word sense induction (WSI) is the task of automatically identifying the senses of words in texts, without the need for handcrafted resources or annotated data. Until now, most WSI algorithms have extracted the different senses of a word ‘locally’ on a per-word basis, i.e. the different senses for each word are determined separately. In this paper, we compare the performance of such algorithms to an algorithm that uses a ‘global’ approach, i.e. the different senses of a particular word are determined by comparing them to, and demarcating them from, the senses of other words in a full-blown word space model. We adopt the evaluation framework proposed in the SemEval-2010 Word Sense Induction & Disambiguation task. All systems that participated in this task use a local scheme for determining the different senses of a word. We compare their results to the ones obtained by the global approach, and discuss the advantages and weaknesses of both approaches.

Marianna Apidianaki, Tim Van de Cruys

Incorporating Coreference Resolution into Word Sense Disambiguation

Word sense disambiguation (WSD) and coreference resolution are two fundamental tasks in natural language processing. Unfortunately, they are seldom studied together. In this paper, we propose to incorporate coreference resolution into a word sense disambiguation system to improve disambiguation precision. Our work is based on the existing instance knowledge network (IKN) approach to WSD. With the help of coreference resolution, we are able to connect related candidate dependency graphs at the candidate level, and similarly the related instance graph patterns at the instance level in IKN. Consequently, the contexts that can be considered for WSD are expanded and precision is improved. Based on the Senseval-3 all-words task, we run extensive experiments following the same experimental approach as the IKN-based WSD. It turns out that each combination of the extended IKN WSD algorithm with one of the five best existing algorithms consistently outperforms the corresponding combination of the original IKN WSD algorithm with that existing algorithm.

Shangfeng Hu, Chengfei Liu

Semantics and Discourse

Deep Semantics for Dependency Structures

Although dependency parsers have become increasingly popular, little work has been done on how to associate dependency structures with deep semantic representations. In this paper, we propose a semantic calculus for dependency structures which can be used to construct deep semantic representations from joint syntactic and semantic dependency structures similar to those used in the CoNLL 2008 Shared Task.

Paul Bédaride, Claire Gardent

Combining Heterogeneous Knowledge Resources for Improved Distributional Semantic Models

The Explicit Semantic Analysis (ESA) model, based on term cooccurrences in Wikipedia, has been regarded as the state-of-the-art semantic relatedness measure in recent years. We provide an analysis of the important parameters of ESA using datasets in five different languages. Additionally, we propose the use of ESA with multiple lexical semantic resources, thus exploiting multiple sources of evidence of term cooccurrence to improve over the Wikipedia-based measure. Exploiting the improved robustness and coverage of the proposed combination, we report improved performance over single resources in word semantic relatedness, solving word choice problems, classification of semantic relations between nominals, and text similarity.

György Szarvas, Torsten Zesch, Iryna Gurevych

Improving Text Segmentation with Non-systematic Semantic Relation

Text segmentation is a fundamental problem in natural language processing, with applications in information retrieval, question answering, and text summarization. Almost all previous work on unsupervised text segmentation is based on the assumption of lexical cohesion, which is indicated by relations between words in the two units of text. However, such work only takes into account reiteration, a category of lexical cohesion that includes word repetition, synonymy, and superordination. In this research, we investigate the non-systematic semantic relation, classified as collocation in lexical cohesion. This relation holds between two words or phrases in a discourse when they pertain to a particular theme or topic. The relation is recognized via a topic model, which is, in turn, acquired from a large collection of texts. The experimental results on a public dataset show the advantages of our approach in comparison to the available unsupervised approaches.

Viet Cuong Nguyen, Le Minh Nguyen, Akira Shimazu
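
One reading of the approach: segment boundaries are hypothesized where the topical similarity between adjacent blocks of sentences drops. The sketch below assumes a `topic_dist` function (e.g., inference with a pre-trained topic model) returning a topic distribution for a span; the function, the fixed window, and cosine similarity are assumptions of this illustration.

```python
import math

def cosine(p, q):
    """Cosine similarity between two topic distributions (lists)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def boundary_scores(sentences, topic_dist, window=3):
    """Score each candidate boundary i by the topical similarity of the
    `window` sentences before and after it; low scores suggest boundaries."""
    scores = []
    for i in range(window, len(sentences) - window + 1):
        left = topic_dist(sentences[i - window:i])
        right = topic_dist(sentences[i:i + window])
        scores.append((i, cosine(left, right)))
    return scores
```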

Automatic Identification of Cause-Effect Relations in Tamil Using CRFs

We present our work on the automatic identification of cause-effect relations in Tamil text. Based on an analysis of causal constructions in Tamil, we identified a set of causal markers and arrived at certain features used to develop our language model. We manually annotated a Tamil corpus of 8,648 sentences for cause-effect relations. With this corpus, we developed a model for identifying causal relations using Conditional Random Fields (CRFs). We performed experiments and the results are encouraging. An error analysis of the results showed that the errors can be attributed to some very interesting structural interdependencies between closely occurring causal relations. After comparing these structures in Tamil and English, we claim that, at the discourse level, structural interdependencies between causal relations are more complex in Tamil than in English due to the free word order of Tamil.

Menaka S., Pattabhi R. K. Rao, Sobha Lalitha Devi

Comparing Approaches to Tag Discourse Relations

It is widely accepted that in a text, sentences and clauses cannot be understood in isolation but in relation with each other through discourse relations that may or may not be explicitly marked. Discourse relations have been found useful in many applications such as machine translation, text summarization, and question answering; however, they are often not considered in computational language applications because domain- and genre-independent robust discourse parsers are very few. In this paper, we analyze existing approaches to identify five discourse relations automatically (namely, comparison, contingency, illustration, attribution, and topic-opinion), and propose a new approach to identify attributive relations. We evaluate the accuracy of each approach with respect to the discourse relations it can identify and compare it to a human gold standard. The evaluation results show that state-of-the-art systems are rather effective at identifying most of the relations considered, but other relations, such as attribution, are still not identified with high accuracy.

Shamima Mithun, Leila Kosseim

Semi-supervised Discourse Relation Classification with Structural Learning

The corpora available for training discourse relation classifiers are annotated using a general set of discourse relations. However, certain applications require custom discourse relations. Creating a new annotated corpus with a new relation taxonomy is a time-consuming and costly process. We address this problem by proposing a semi-supervised approach to discourse relation classification based on Structural Learning. First, we solve a set of auxiliary classification problems using unlabeled data. Second, the learned classifiers are used to extend the feature vectors on which a discourse relation classifier is trained. By defining a relevant set of auxiliary classification problems, we show that the proposed method brings an improvement of at least 50% in accuracy and F-score on the RST Discourse Treebank and the Penn Discourse Treebank when small training sets of ca. 1,000 instances are employed. This is an attractive perspective for training discourse relation classifiers in domains where only a small amount of labeled training data is available.

Hugo Hernault, Danushka Bollegala, Mitsuru Ishizuka

Integrating Japanese Particles Function and Information Structure

This paper presents a new analysis of the discourse functions of the Japanese particles wa and ga. These functions are integrated with information structure into a constraint-based grammar under the HPSG framework. We examine the distribution of these particles and demonstrate how the thematic-rhematic dichotomy of a constituent can be determined by the informational status of one or more of its daughter constituents through various linguistic constraints. We show that the relation between the syntactic constituency and information structure of a sentence is not the one-to-one mapping a purely syntax-based analysis assumes, and then propose a multi-dimensional grammar which expresses mutual constraints on thematic-rhematic interpretation, syntax and phonology.

Akira Ohtani

Assessing Lexical Alignment in Spontaneous Direction Dialogue Data by Means of a Lexicon Network Model

We apply a network model of lexical alignment, called Two-Level Time-Aligned Network Series, to natural route-direction dialogue data. The model accounts for the structural similarity of interlocutors' dialogue lexica. As a classification criterion, the directions are divided into effective and ineffective ones. We found that effective direction dialogues can be separated from ineffective ones with a hit ratio of 96% with regard to the structure of the corresponding dialogue lexica. This value is achieved when taking into account just nouns; the hit ratio decreases slightly as soon as other parts of speech are also considered. Thus, this paper provides a machine learning framework for distinguishing effective dialogues from ineffective ones. It also implements first steps towards more fine-grained alignment studies: we found a difference in the efficiency contribution between (the interaction of) lemmata of different parts of speech.

Alexander Mehler, Andy Lücking, Peter Menke

Opinion Mining and Sentiment Detection

Towards Well-Grounded Phrase-Level Polarity Analysis

We propose a new rule-based system for phrase-level polarity analysis and show how it benefits from empirically validating its polarity composition through surveys with human subjects. The system's two-layer architecture and its underlying structure, i.e. its composition model, are presented. Two functions for polarity aggregation are introduced that operate on newly defined semantic categories. These categories detach a word's syntactic behavior from its semantic behavior. An experimental setup is described that we use to carry out a thorough evaluation. It incorporates a newly created German-language data set that is made freely and publicly available. This data set contains polarity annotations at word level, phrase level and sentence level, and facilitates comparability between different studies and reproducibility of our results.

Robert Remus, Christian Hänig

Implicit Feature Identification via Co-occurrence Association Rule Mining

In sentiment analysis, identifying the features associated with an opinion can help produce a finer-grained understanding of online reviews. The vast majority of existing approaches focus on explicit feature identification; few attempts have been made to identify implicit features in reviews. In this paper, we propose a novel two-phase co-occurrence association rule mining approach to identifying implicit features. Specifically, in the first phase of rule generation, for each opinion word occurring in an explicit sentence in the corpus, we mine a significant set of association rules of the form [opinion-word, explicit-feature] from a co-occurrence matrix. In the second phase of rule application, we first cluster the rule consequents (explicit features) to generate more robust rules for each opinion word. Given a new opinion word with no explicit feature, we then search a matched list of robust rules, among which the rule whose feature cluster has the highest frequency weight is fired; accordingly, we assign the representative word of that cluster as the final identified implicit feature. Experimental results show considerable improvements of our approach over related methods, including baseline dictionary lookups, statistical semantic association models, and bi-bipartite reinforcement clustering.

Zhen Hai, Kuiyu Chang, Jung-jae Kim
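
A stripped-down version of the first phase (rule generation) might look as follows. The support and confidence thresholds, the keep-only-the-best-consequent shortcut, and the omission of the clustering step are all simplifications relative to the paper.

```python
from collections import Counter, defaultdict

def mine_rules(sentences, opinion_words, features,
               min_support=3, min_conf=0.2):
    """Mine [opinion-word -> explicit-feature] rules from co-occurrence
    counts over explicit sentences (tokenized lists of words)."""
    cooc = defaultdict(Counter)
    for toks in sentences:
        ops = [t for t in toks if t in opinion_words]
        fts = [t for t in toks if t in features]
        for o in ops:
            for f in fts:
                cooc[o][f] += 1
    rules = {}
    for o, counts in cooc.items():
        total = sum(counts.values())
        best_f, n = counts.most_common(1)[0]
        if n >= min_support and n / total >= min_conf:
            rules[o] = best_f   # back-off feature for sentences lacking one
    return rules
```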

Construction of Wakamono Kotoba Emotion Dictionary and Its Application

Currently, many weblogs are written by young people, and in them they tend to describe their undiluted emotions and opinions. To analyze the emotions of young people and what causes those emotions, our study focuses on the specific Japanese language used among young people, called Wakamono Kotoba. The proposed method classifies Wakamono Kotoba into emotion categories based on superficial information and on the descriptive texts of the words. Specifically, the method uses as features the scripts used to write the Wakamono Kotoba (Katakana, Hiragana, Kanji, etc.), the stroke count, and the difficulty level of the Kanji. We then classify Wakamono Kotoba into emotion categories according to the superficial similarity between the word and the Wakamono Kotoba registered in a dictionary annotated with emotional strength levels. We also propose a second method that classifies Wakamono Kotoba into emotion categories using the co-occurrence relation between the words included in the descriptive text of the Wakamono Kotoba and the emotion words included in an existing emotion word dictionary. We construct the Wakamono Kotoba emotion dictionary based on these two methods. Finally, applications of the Wakamono Kotoba emotion dictionary are discussed.

Kazuyuki Matsumoto, Fuji Ren

Temporal Analysis of Sentiment Events – A Visual Realization and Tracking

In recent years, the extraction of temporal relations for events that express sentiments has drawn great attention from the Natural Language Processing (NLP) research communities. In this work, we propose a method that involves the association and contribution of sentiments in determining event-event relations from texts. Firstly, we employ a machine learning approach based on Conditional Random Fields (CRFs) for solving Task C (identification of event-event relations) of TempEval-2007 within the TimeML framework, by considering sentiment as a feature of an event. Incorporating the sentiment property, our system outperforms all the participating state-of-the-art systems of TempEval-2007. Evaluation results on the Task C test set yield F-scores of 57.2% under the strict evaluation scheme and 58.6% under the relaxed evaluation scheme. Both positive/negative coarse-grained sentiments and Ekman's six basic universal emotions (fine-grained sentiments) are assigned to the events. Thereafter, we analyze the temporal relations between events in order to track the sentiment events. Representing the temporal relations in a graph format provides a shallow visual realization path for tracking sentiments over events. A manual evaluation of the temporal relations of sentiment events identified in 20 documents is satisfactory from the standpoint of event-sentiment tracking.

Dipankar Das, Anup Kumar Kolya, Asif Ekbal, Sivaji Bandyopadhyay

Text Generation

Highly-Inflected Language Generation Using Factored Language Models

Statistical language models based on n-gram counts have been shown to successfully replace grammar rules in standard 2-stage (or ‘generate-and-select’) Natural Language Generation (NLG). In highly-inflected languages, however, the amount of training data required to cope with n-gram sparseness may be simply unobtainable, and the benefits of a statistical approach become less obvious. In this work we address the issue of text generation in a highly-inflected language by making use of factored language models (FLM) that take morphological information into account. We present a number of experiments involving the use of simple FLMs applied to various surface realisation tasks, showing that FLMs may implement 2-stage generation with results that are far superior to standard n-gram models alone.

Eder Miranda de Novais, Ivandré Paraboni, Diogo Takaki Ferreira
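
The contrast with standard n-gram backoff can be made concrete: instead of shortening the history, a factored model can back off to coarser factors of the same history. The sketch below hardwires one illustrative backoff path over (surface, lemma, morph) bundles; the path, the factor inventory, and the floor probability are assumptions, not the paper's model.

```python
def flm_prob(token, history, counts):
    """token and history items are (surface, lemma, morph) factor bundles.
    counts maps factor tuples to training counts; a context key is the
    1-tuple prefix of the corresponding 2-tuple event key."""
    backoff_paths = [
        lambda t, h: (h[-1][0], t[0]),  # surface bigram
        lambda t, h: (h[-1][2], t[0]),  # previous morph tag -> surface
        lambda t, h: (h[-1][2], t[2]),  # morph-tag bigram
    ]
    for path in backoff_paths:
        event = path(token, history)
        context = event[:-1]
        if counts.get(event, 0) and counts.get(context, 0):
            return counts[event] / counts[context]
    return 1e-6  # floor for unseen events (illustrative)
```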

Prenominal Modifier Ordering in Bengali Text Generation

In this paper, we propose a machine learning based approach for ordering adjectival premodifiers of a noun phrase (NP) in Bengali. We propose a novel method to learn the pairwise orders of the modifiers. Using the learned pairwise orders, longer sequences of prenominal modifiers are ordered by a graph-based method. The proposed modifier-ordering approach is compared with an existing approach on our own dataset. We achieved an increase of approximately 4% in F-measure with our approach, indicating an overall improvement. The modifier-ordering approach proposed here can be incorporated into a Bengali text generation system, resulting in more fluent and natural output.

Sumit Das, Anupam Basu, Sudeshna Sarkar
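
The graph-based step described above can be pictured as a precedence graph followed by a topological sort. The sketch below assumes an already-learned pairwise predictor and acyclic preferences (with a cycle, the returned order would be incomplete); both are simplifications of the paper's method.

```python
def order_modifiers(mods, prefer):
    """Order modifiers using pairwise preferences.

    prefer(a, b) -> True if modifier a should precede b (learned model).
    Builds a precedence graph over all pairs, then runs Kahn's algorithm."""
    indeg = {m: 0 for m in mods}
    succ = {m: [] for m in mods}
    for i, a in enumerate(mods):
        for b in mods[i + 1:]:
            first, second = (a, b) if prefer(a, b) else (b, a)
            succ[first].append(second)
            indeg[second] += 1
    order = []
    ready = [m for m in mods if indeg[m] == 0]
    while ready:
        m = ready.pop()
        order.append(m)
        for n in succ[m]:
            indeg[n] -= 1
            if indeg[n] == 0:
                ready.append(n)
    return order  # shorter than mods if the preferences contain a cycle
```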

Bootstrapping Multiple-Choice Tests with The-Mentor

It is very likely that, at least once in their lifetime, everyone has answered a multiple-choice test. Multiple-choice tests are considered an effective technique for knowledge assessment, requiring a short response time and offering the possibility of covering a broad set of topics. Nevertheless, their creation can be a time-consuming and labour-intensive task. Computer-aided generation of multiple-choice tests can reduce these drawbacks: the human assessor is left with the final task of approving or rejecting the generated test items, depending on their quality.

In this paper we present The-Mentor, a system that employs a fully automatic approach to generate multiple-choice tests. In a first, offline step, a set of lexico-syntactic patterns is bootstrapped from several question/answer seed pairs by leveraging the redundancy of the Web. Afterwards, in an online step, the patterns are used to select sentences in a text document from which answers can be extracted and the respective questions built. In the end, several filters are applied to discard low-quality items, and distractors (named entities that comply with the question category) are extracted from the same text.

Ana Cristina Mendes, Sérgio Curto, Luísa Coheur

Backmatter
