
2013 | Book

Computational Linguistics and Intelligent Text Processing

14th International Conference, CICLing 2013, Samos, Greece, March 24-30, 2013, Proceedings, Part I


About this book

This two-volume set, consisting of LNCS 7816 and LNCS 7817, constitutes the thoroughly refereed proceedings of the 14th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2013, held on Samos, Greece, in March 2013. The 91 contributions presented were carefully reviewed and selected for inclusion in the proceedings. The papers are organized in topical sections named: general techniques; lexical resources; morphology and tokenization; syntax and named entity recognition; word sense disambiguation and coreference resolution; semantics and discourse; sentiment, polarity, subjectivity, and opinion; machine translation and multilingualism; text mining, information extraction, and information retrieval; text summarization; stylometry and text simplification; and applications.

Table of Contents

Frontmatter

General Techniques

Unsupervised Feature Adaptation for Cross-Domain NLP with an Application to Compositionality Grading

In this paper, we introduce feature adaptation, an unsupervised method for cross-domain natural language processing (NLP). Feature adaptation adapts a supervised NLP system to a new domain by recomputing feature values while retaining the model and the feature definitions used on the original domain. We demonstrate the effectiveness of feature adaptation through cross-domain experiments in compositionality grading and show that it rivals supervised target domain systems when moving from generic web text to a specialized physics text domain.

Lukas Michelbacher, Qi Han, Hinrich Schütze
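
The core mechanism (keep the model and feature definitions, recompute feature values on the new domain) can be illustrated with a minimal sketch in Python; the corpora, feature template, and model weights below are invented for illustration and are not the authors' implementation.

```python
def cooccurrence_features(pair, corpus_tokens, window=5):
    """Recompute corpus-dependent feature values (here, a toy co-occurrence
    count within a window) from a given corpus. The feature *definitions*
    stay fixed; only the corpus they are computed from changes."""
    w1, w2 = pair
    cooc = 0
    for i, tok in enumerate(corpus_tokens):
        if tok == w1:
            window_toks = corpus_tokens[max(0, i - window): i + window + 1]
            cooc += window_toks.count(w2)
    return {"cooc": cooc, "freq_w1": corpus_tokens.count(w1)}

# A fixed, already-trained model (weights are illustrative placeholders).
MODEL_WEIGHTS = {"cooc": 0.8, "freq_w1": 0.1}

def score(features):
    """Apply the unchanged source-domain model to (possibly adapted) features."""
    return sum(MODEL_WEIGHTS[name] * value for name, value in features.items())

source_corpus = "the red car and the red bus drive on the red road".split()
target_corpus = "quantum field theory describes the quantum field of particles".split()

pair = ("quantum", "field")
# Feature adaptation: same model, same feature definitions, but feature
# values recomputed on the target-domain corpus before scoring.
print(score(cooccurrence_features(pair, source_corpus)))   # low: pair absent from source
print(score(cooccurrence_features(pair, target_corpus)))   # higher: pair frequent in target
```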
Syntactic Dependency-Based N-grams: More Evidence of Usefulness in Classification

The paper introduces and discusses the concept of syntactic n-grams (sn-grams), which can be applied instead of traditional n-grams in many NLP tasks. Sn-grams are constructed by following paths in syntactic trees, so they bring syntactic knowledge into machine learning methods; their construction, however, requires prior parsing. We applied sn-grams to the task of authorship attribution for corpora of three and seven authors with very promising results.

Grigori Sidorov, Francisco Velasquez, Efstathios Stamatatos, Alexander Gelbukh, Liliana Chanona-Hernández
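
A rough illustration of how sn-grams differ from surface n-grams, assuming a toy dependency tree given as head indices; this is our sketch, not the authors' code.

```python
def sn_grams(words, heads, n=2):
    """Extract syntactic n-grams by following head -> dependent paths
    in a dependency tree. heads[i] is the index of the head of word i,
    or -1 for the root."""
    children = {i: [] for i in range(len(words))}
    for i, h in enumerate(heads):
        if h >= 0:
            children[h].append(i)

    def paths_from(node, length):
        if length == 1:
            return [[node]]
        result = []
        for child in children[node]:
            for tail in paths_from(child, length - 1):
                result.append([node] + tail)
        return result

    grams = []
    for node in range(len(words)):
        for path in paths_from(node, n):
            grams.append(tuple(words[i] for i in path))
    return grams

# "She reads old books": reads <- She, reads <- books, books <- old
words = ["She", "reads", "old", "books"]
heads = [1, -1, 3, 1]
print(sn_grams(words, heads, n=2))
# [('reads', 'She'), ('reads', 'books'), ('books', 'old')]
```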

Lexical Resources

A Quick Tour of BabelNet 1.1

In this paper we present BabelNet 1.1, a brand-new release of the largest “encyclopedic dictionary”, obtained from the automatic integration of the most popular computational lexicon of English, i.e. WordNet, and the largest multilingual Web encyclopedia, i.e. Wikipedia. BabelNet 1.1 covers 6 languages and comes with a renewed Web interface, graph explorer and programmatic API. BabelNet is available online at http://www.babelnet.org.

Roberto Navigli
Automatic Pipeline Construction for Real-Time Annotation

Many annotation tasks in computational linguistics are tackled with manually constructed pipelines of algorithms. In real-time tasks where information needs are stated and addressed ad-hoc, however, manual construction is infeasible. This paper presents an artificial intelligence approach to automatically construct annotation pipelines for given information needs and quality prioritizations. Based on an abstract ontological model, we use partial order planning to select a pipeline’s algorithms and informed search to obtain an efficient pipeline schedule. We realized the approach as an expert system on top of Apache UIMA, which offers evidence that pipelines can be constructed ad-hoc in near-zero time.

Henning Wachsmuth, Mirko Rose, Gregor Engels
A Multilingual GRUG Treebank for Underresourced Languages

In this paper, we describe the outcomes of an undertaking to build Treebanks for the underresourced languages Georgian, Russian, and Ukrainian, as well as German, one of the “major” languages in the NLP world. The monolingual parallel sentences in the four languages were syntactically annotated manually using the Synpathy tool. The tagsets follow an adapted version of the German TIGER guidelines with the changes necessary for a formal grammatical description of Georgian, Russian, and Ukrainian. The output of the monolingual syntactic annotation is in the TIGER-XML format. Alignment of the monolingual repositories into bilingual Treebanks was done with the Stockholm TreeAligner software. A demo of the GRUG treebank resources will be held during a poster session.

Oleg Kapanadze, Alla Mishchenko
Creating an Annotated Corpus for Extracting Canonical Citations from Classics-Related Texts by Using Active Annotation

This paper describes the creation of an annotated corpus supporting the task of extracting information, particularly canonical citations (references to the ancient sources), from Classics-related texts. The corpus is multilingual and contains approximately 30,000 tokens of POS-tagged, cleanly transcribed text drawn from L’Année Philologique. In the corpus, the named entities needed to capture such citations were annotated using an annotation scheme devised specifically for this task.

The contribution of the paper is two-fold: firstly, it describes how the corpus was created using Active Annotation, an approach which combines automatic and manual annotation to optimize the human resources required to create any corpus. Secondly, the performance of an NER classifier based on Conditional Random Fields is evaluated using the created corpus as training and test set: the results obtained with three different feature sets are compared and discussed.

Matteo Romanello
Approaches of Anonymisation of an SMS Corpus

This paper presents two anonymisation methods to process an SMS corpus. The first one is based on an unsupervised approach called Seek&Hide. The implemented system uses several dictionaries and rules in order to predict whether an SMS needs anonymisation. The second method is based on a supervised approach using machine learning techniques. We evaluate the two approaches and propose a way to use them together: only when the two methods do not agree on their prediction is the SMS checked by a human expert. This greatly reduces the cost of anonymising the corpus.

Namrata Patel, Pierre Accorsi, Diana Inkpen, Cédric Lopez, Mathieu Roche
A Corpus Based Approach for the Automatic Creation of Arabic Broken Plural Dictionaries

Research has shown that Arabic broken plurals constitute approximately 10% of the content of Arabic texts. Detecting Arabic broken plurals and mapping them to their singular forms is a task that can greatly affect the performance of information retrieval, annotation or tagging tasks, and many other text mining applications. It has been reported that the most effective way of detecting broken plurals is through the use of dictionaries. However, if the target domain is a specialized one, or one for which no such dictionaries exist, building them manually becomes a tiresome, not to mention expensive, task. This paper presents a corpus-based approach for automatically building broken plural dictionaries. The approach utilizes a set of rules for mapping broken plural patterns to their candidate singular forms, and a corpus-based co-occurrence statistic to determine when an entry should be added to the broken plural dictionary. Evaluation of the approach has shown that it is capable of creating dictionaries with high levels of precision and recall.

Samhaa R. El-Beltagy, Ahmed Rafea
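
The decision step can be sketched as follows; the pointwise mutual information statistic, threshold, and toy documents are placeholders standing in for the paper's corpus-based co-occurrence measure and rule set.

```python
import math

def pmi(pair_count, count_a, count_b, total):
    """Pointwise mutual information of two word types over documents."""
    if pair_count == 0:
        return float("-inf")
    return math.log((pair_count * total) / (count_a * count_b), 2)

def accept_pair(plural, singular_candidate, docs, threshold=1.0):
    """Keep a (broken plural, candidate singular) pair only when the two
    forms co-occur in documents more often than chance (a stand-in for
    the paper's corpus-based co-occurrence statistic)."""
    total = len(docs)
    count_p = sum(1 for d in docs if plural in d)
    count_s = sum(1 for d in docs if singular_candidate in d)
    count_both = sum(1 for d in docs if plural in d and singular_candidate in d)
    return pmi(count_both, max(count_p, 1), max(count_s, 1), total) >= threshold

# Toy document collection (transliterated placeholders).
docs = [{"kutub", "kitab"}, {"kutub", "kitab"}, {"qalam", "aqlam"}, {"maktaba"}]
print(accept_pair("kutub", "kitab", docs))   # candidate supported by co-occurrence
print(accept_pair("kutub", "qalam", docs))   # unrelated pair, rejected
```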
Temporal Classifiers for Predicting the Expansion of Medical Subject Headings

Ontologies such as the Medical Subject Headings (MeSH) and the Gene Ontology (GO) play a major role in biology and medicine since they facilitate data integration and the consistent exchange of information between different entities. They can also be used to index and annotate data and literature, thus enabling efficient search and analysis. Unfortunately, maintaining the ontologies manually is a complex, error-prone, and time- and personnel-consuming effort. One major problem is the continuous growth of the biomedical literature, which expands by almost 1 million new scientific papers per year, indexed by Medline. The enormous annual increase of scientific publications makes the task of monitoring and following the changes and trends in the biomedical domain extremely difficult. For this purpose, approaches that try to learn and maintain ontologies automatically from text and data have been developed in the past. The goal of this paper is to develop temporal classifiers in order to create, for the first time to the best of our knowledge, an automated method that may predict which regions of the MeSH ontology will expand in the near future.

George Tsatsaronis, Iraklis Varlamis, Nattiya Kanhabua, Kjetil Nørvåg
Knowledge Discovery on Incompatibility of Medical Concepts

This work proposes a method for automatically discovering incompatible medical concepts in text corpora. The approach is distantly supervised based on a seed set of incompatible concept pairs like symptoms or conditions that rule each other out. Two concepts are considered incompatible if their definitions match a template, and contain an antonym pair derived from WordNet, VerbOcean, or a hand-crafted lexicon. Our method creates templates from dependency parse trees of definitional texts, using seed pairs. The templates are applied to a text corpus, and the resulting candidate pairs are categorized and ranked by statistical measures. Since experiments show that the results face semantic ambiguity problems, we further cluster the results into different categories. We applied this approach to the concepts in Unified Medical Language System, Human Phenotype Ontology, and Mammalian Phenotype Ontology. Out of 77,496 definitions, 1,958 concept pairs were detected as incompatible with an average precision of 0.80.

Adam Grycner, Patrick Ernst, Amy Siu, Gerhard Weikum
Extraction of Part-Whole Relations from Turkish Corpora

In this work, we present a model for semi-automatically extracting part-whole relations from raw Turkish text. The model takes a list of manually prepared seeds to induce syntactic patterns and estimates their reliabilities. It then captures the variations of part-whole candidates from the corpus. To get precise meronymic relationships, the candidates are ranked and selected according to their reliability scores. We use and compare several metrics to evaluate the strength of association between a pattern and matched pairs. We conclude with a discussion of the results and show that the model presented here gives promising results for Turkish text.

Tuğba Yıldız, Savaş Yıldırım, Banu Diri
Chinese Terminology Extraction Using EM-Based Transfer Learning Method

As an important part of information extraction, terminology extraction has attracted increasing attention. Currently, statistical and rule-based methods are used to extract terminology in a specific domain; the cross-domain terminology extraction task, however, has not been well addressed yet. In this paper we propose an EM-based transfer learning method for cross-domain Chinese terminology extraction. First, a naive Bayes model is learned from the source domain. Then an EM-based transfer learning algorithm is used to adapt the classifier learned from the source domain to the target domain, which has a different data distribution. The advantage of the proposed method is that it enables the target domain to utilize knowledge from the source domain. Experimental results on the computer and environment domains show that the proposed Chinese terminology extraction method with EM-based transfer learning significantly outperforms traditional statistical terminology extraction.

Yanxia Qin, Dequan Zheng, Tiejun Zhao, Min Zhang
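
A simplified reconstruction of an EM-style adaptation loop for a naive Bayes text classifier, in the spirit of the method described above but not the authors' exact algorithm; the domains, texts, and labels are invented, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

source_texts = ["memory cache processor", "cpu register cache",
                "forest river pollution", "emission river climate"]
source_labels = np.array([1, 1, 0, 0])          # 1 = term context, 0 = not (illustrative)
target_texts = ["gpu cache bandwidth", "wetland emission soil", "kernel scheduler cpu"]

vec = CountVectorizer()
X_src = vec.fit_transform(source_texts).toarray()
X_tgt = vec.transform(target_texts).toarray()

clf = MultinomialNB()
clf.fit(X_src, source_labels)                    # initial model from the source domain only

for _ in range(5):                               # EM iterations
    soft = clf.predict_proba(X_tgt)              # E-step: soft labels for the target texts
    # M-step: refit on the source data plus the target data, where each target
    # text contributes to both classes, weighted by its soft label.
    X_all = np.vstack([X_src, X_tgt, X_tgt])
    y_all = np.concatenate([source_labels, np.zeros(len(target_texts)), np.ones(len(target_texts))])
    w_all = np.concatenate([np.ones(len(source_labels)), soft[:, 0], soft[:, 1]])
    clf = MultinomialNB()
    clf.fit(X_all, y_all, sample_weight=w_all)

print(clf.predict(X_tgt))                        # labels adapted to the target domain
```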
Orthographic Transcription for Spoken Tunisian Arabic

Transcribing spoken Arabic dialects is an important task for building speech corpora, and it requires following a definite orthography and annotation scheme. In this paper, we present OTTA, an Orthographic Transcription convention for Tunisian Arabic. The convention proposes rules based on standard Arabic transcription conventions, and we define a set of conventions that preserve the particularities of the Tunisian dialect.

Inès Zribi, Marwa Graja, Mariem Ellouze Khmekhem, Maher Jaoua, Lamia Hadrich Belguith

Morphology and Tokenization

An Improved Stemming Approach Using HMM for a Highly Inflectional Language

Stemming is a common method for morphological normalization of natural language texts. Modern information retrieval systems rely on such normalization techniques for automatic document processing tasks. High quality stemming is difficult in highly inflectional Indic languages, and little research has been performed on designing stemming algorithms for them. In this study, we focus on the problem of stemming texts in Assamese, a low-resource Indic language spoken in the North-Eastern part of India by approximately 30 million people. Stemming is hard in Assamese due to the common appearance of single letter suffixes as morphological inflections: more than 50% of the inflections in Assamese appear as single letter suffixes. Such single letter morphological inflections cause ambiguity when predicting the underlying root word. Therefore, we propose a new method that combines a rule based algorithm for predicting multiple letter suffixes and an HMM based algorithm for predicting the single letter suffixes. The combined approach can predict morphologically inflected words with 92% accuracy.

Navanath Saharia, Kishori M. Konwar, Utpal Sharma, Jugal K. Kalita
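
A hedged sketch of the combined strategy: explicit rules handle multi-letter suffixes, and a probabilistic score (a simple lookup table standing in for the paper's HMM) decides ambiguous single-letter suffixes. The suffixes, probabilities, and example words below are placeholders, not actual Assamese data.

```python
MULTI_LETTER_SUFFIXES = ["khon", "bor", "tu"]                    # illustrative only
# P(final character is a suffix | final character, preceding character)
SINGLE_LETTER_SUFFIX_PROB = {("r", "a"): 0.9, ("t", "a"): 0.2}   # illustrative only

def stem(word, threshold=0.5):
    # Rule-based step: strip a known multi-letter suffix if present.
    for suffix in MULTI_LETTER_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    # Statistical step: decide whether the last character is an inflection.
    if len(word) > 2:
        prob = SINGLE_LETTER_SUFFIX_PROB.get((word[-1], word[-2]), 0.0)
        if prob >= threshold:
            return word[:-1]
    return word

print(stem("kitapkhon"))   # rule-based: multi-letter suffix stripped
print(stem("ghorar"))      # statistical: single-letter suffix stripped
print(stem("pat"))         # unchanged
```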
Semi-automatic Acquisition of Two-Level Morphological Rules for Iban Language

We describe in this paper a semi-automatic acquisition of morphological rules for a morphological analyser of an under-resourced language, Iban. We modify ideas from previous automatic morphological rule acquisition approaches, whose input requirements become constraints when developing an analyser for an under-resourced language. This work introduces three main steps in acquiring the rules: morphological data acquisition, morphological information validation, and morphological rule extraction. The experiment shows that this approach gives successful results, with a precision of 0.76 and a recall of 0.99. Our findings also suggest that the availability of linguistic references and the selection of assorted techniques for morphological analysis can guide the design of the workflow. We believe this workflow will assist other researchers in building morphological analysers with validated morphological rules for under-resourced languages.

Suhaila Saee, Lay-Ki Soon, Tek Yong Lim, Bali Ranaivo-Malançon, Enya Kong Tang
Finite State Morphology for Amazigh Language

With the aim of safeguarding the Amazigh heritage from the threat of disappearance, it seems opportune to equip this language with the means necessary to confront the stakes of access to the domain of New Information and Communication Technologies (ICT). In this context, and with the goal of building tools and linguistic resources for the automatic processing of the Amazigh language, we have undertaken to develop a module for automatic lexical analysis of Amazigh that can recognize lexical units in texts. To achieve this goal, we first formalized the Amazigh vocabulary, namely nouns, verbs, and particles. This work began with the formalization of the noun and particle categories by building a dictionary named “EDicAm” (Electronic Dictionary for Amazigh), in which each entry is associated with linguistic information such as lexical categories and semantic distribution classes.

Fatima Zahra Nejme, Siham Boulaknadel, Driss Aboutajdine
New Perspectives in Sinographic Language Processing through the Use of Character Structure

Chinese characters have a complex and hierarchical graphical structure carrying both semantic and phonetic information. We use this structure to enhance the text model and obtain better results in standard NLP operations. First of all, to tackle the problem of graphical variation, we define allographic classes of characters. Next, the relation of inclusion of a subcharacter in a character provides us with a directed graph of allographic classes. We provide this graph with two weights: semanticity (the semantic relation between subcharacter and character) and phoneticity (the phonetic relation), and calculate “most semantic subcharacter paths” for each character. Finally, by adding the information contained in these paths to unigrams, we aim to increase the efficiency of text mining methods. We evaluate our method on a text classification task on two corpora (Chinese and Japanese) of a total of 18 million characters and get an improvement of 3% on an already high baseline of 89.6% precision, obtained by a linear SVM classifier. Other possible applications and perspectives of the system are discussed.

Yannis Haralambous
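
One step of the approach, the search for a "most semantic subcharacter path", can be sketched as a best-path search over the inclusion graph; the inclusion edges and semanticity weights below are illustrative assumptions, not the paper's data.

```python
def best_path(char, components, semanticity):
    """Return the subcharacter path with the highest total semanticity,
    searched recursively over the inclusion graph."""
    best = ([char], 0.0)
    for sub in components.get(char, []):
        tail, score = best_path(sub, components, semanticity)
        candidate = ([char] + tail, semanticity.get((char, sub), 0.0) + score)
        if candidate[1] > best[1]:
            best = candidate
    return best

# Illustrative inclusion graph: 媽 contains 女 (semantic) and 馬 (phonetic);
# 好 contains 女 and 子. Weights are invented for the example.
components = {"媽": ["女", "馬"], "好": ["女", "子"]}
semanticity = {("媽", "女"): 0.9, ("媽", "馬"): 0.1, ("好", "女"): 0.6, ("好", "子"): 0.5}
print(best_path("媽", components, semanticity))   # (['媽', '女'], 0.9)
```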
The Application of Kalman Filter Based Human-Computer Learning Model to Chinese Word Segmentation

This paper presents a human-computer interaction learning model for segmenting Chinese texts that relies on neither a lexicon nor any annotated corpus. It enables users to add language knowledge to the system by directly intervening in the segmentation process. Within a limited number of user interventions, a segmentation result that fully matches the user's intent (i.e. an accuracy of 100% by manual judgement) is returned. A Kalman filter based model is adopted to learn and estimate the intention of users quickly and precisely from their interventions, so as to reduce subsequent system prediction errors. Experiments show that the approach achieves encouraging performance in saving human effort, and that the segmenter with knowledge learned from users outperforms the baseline model by about 10% when segmenting homogeneous texts.

Weimeng Zhu, Ni Sun, Xiaojun Zou, Junfeng Hu
Machine Learning for High-Quality Tokenization Replicating Variable Tokenization Schemes

In this work, we investigate the use of sequence labeling techniques for tokenization, arguably the most foundational task in NLP, which has been traditionally approached through heuristic finite-state rules. Observing variation in tokenization conventions across corpora and processing tasks, we train and test multiple CRF binary sequence labelers and obtain substantial reductions in tokenization error rate over off-the-shelf standard tools. From a domain adaptation perspective, we experimentally determine the effects of training on mixed gold-standard data sets and make a tentative recommendation for practical usage. Furthermore, we present a perspective on this work as a feedback mechanism to resource creation, i.e. error detection in annotated corpora. To investigate the limits of our approach, we study an interpretation of the tokenization problem that shows stark contrasts to ‘classic’ schemes, presenting many more token-level ambiguities to the sequence labeler (reflecting use of punctuation and multi-word lexical units). In this setup, we also look at partial disambiguation by presenting a token lattice to downstream processing.

Murhaf Fares, Stephan Oepen, Yi Zhang
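
Casting tokenization as binary sequence labelling can be illustrated with a minimal character-level sketch: each position is labelled according to whether a token boundary follows it. The feature template and gold labels below are illustrative; the paper's CRF labellers use richer, corpus-specific features.

```python
def char_features(text, i):
    """Toy feature template for the character at position i."""
    ch = text[i]
    nxt = text[i + 1] if i + 1 < len(text) else "</s>"
    return {
        "char": ch,
        "next": nxt,
        "is_punct": ch in ".,;:!?\"'()",
        "next_is_space": nxt == " ",
        "is_digit": ch.isdigit(),
    }

def boundaries_to_tokens(text, labels):
    """Rebuild tokens from per-character boundary decisions (1 = boundary follows)."""
    tokens, current = [], ""
    for ch, boundary in zip(text, labels):
        if ch != " ":
            current += ch
        if boundary and current:
            tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens

text = "Dr. Smith arrived."
# Gold labels for a scheme that keeps "Dr." as one token and splits the final period.
labels = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1]
print([char_features(text, i) for i in range(3)])
print(boundaries_to_tokens(text, labels))   # ['Dr.', 'Smith', 'arrived', '.']
```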

Syntax and Named Entity Recognition

Structural Prediction in Incremental Dependency Parsing

For dependency structures of incomplete sentences to be fully connected, nodes in addition to those corresponding to the words in the sentence prefix are necessary. We analyze a German dependency corpus to estimate the extent of such predictive structures. We also present an incremental parser based on the WCDG framework and describe how the results from the corpus study can be used to adapt an existing weighted constraint dependency grammar for complete sentences to the case of incremental parsing.

Niels Beuck, Wolfgang Menzel
Semi-supervised Constituent Grammar Induction Based on Text Chunking Information

There is a growing interest in unsupervised grammar induction, which does not require syntactic annotations, but provides less accurate results than the supervised approach. Aiming at improving the accuracy of the unsupervised approach, we have resorted to additional information, which can be obtained more easily. Shallow parsing or chunking identifies the sentence constituents (noun phrases, verb phrases, etc.), but without specifying their internal structure. There exist highly accurate systems to perform this task, and thus this information is available even for languages for which large syntactically annotated corpora are lacking. In this work we have investigated how the results of a pattern-based unsupervised grammar induction system improve as data on new kinds of phrases are added, leading to a significant improvement in performance. We have analyzed the results for three different languages. We have also shown that the system is able to significantly improve the results of the unsupervised system using the chunks provided by automatic chunkers.

Jesús Santamaría, Lourdes Araujo
Turkish Constituent Chunking with Morphological and Contextual Features

State-of-the-art phrase chunking focuses on English and shows high accuracy with very basic word features such as the word itself and the POS tag. In the case of morphologically rich languages like Turkish, basic features are not sufficient. Moreover, phrase chunking may not be appropriate, and the “chunk” term should be redefined for these languages. In this paper, the first study on Turkish constituent chunking, using two different methods, is presented. In the first method, we directly extracted chunks from the results of the Turkish dependency parser. In the second method, we experimented with a CRF-based chunker enhanced with morphological and contextual features, using the annotated sentences from the Turkish dependency treebank. The experiments showed that CRF-based chunking augmented with extra features outperforms both the baseline chunker with basic features and the dependency parser-based chunker. Overall, we developed a CRF-based Turkish chunker with an F-measure of 91.95 for verb chunks and 87.50 for general chunks.

İlknur Durgar El-Kahlout, Ahmet Afşın Akın
Enhancing Czech Parsing with Verb Valency Frames

In this paper, an exploitation of verb valency lexicons for the Czech parsing system Synt is presented, and an effective implementation is described that uses the syntactic information in the complex valency frames to resolve some of the standard parsing ambiguities, thereby improving the analysis results. We discuss the implementation in detail and provide an evaluation showing improvements in parsing accuracy on the Brno Phrasal Treebank.

Miloš Jakubıček, Vojtěch Kovář
An Automatic Approach to Treebank Error Detection Using a Dependency Parser

Treebanks play an important role in the development of various natural language processing tools. Amongst other things, they provide crucial language-specific patterns that are exploited by various machine learning techniques. Quality control in any treebanking project is therefore extremely important. Manual validation of the treebank is one of the steps that is generally necessary to ensure good annotation quality. Needless to say, manual validation requires a lot of human time and effort. In this paper, we present an automatic approach which helps in detecting potential errors in a treebank. We use a dependency parser to detect such errors. By using this tool, validators can validate a treebank in less time and with reduced human effort.

Bhasha Agrawal, Rahul Agarwal, Samar Husain, Dipti M. Sharma
Topic-Oriented Words as Features for Named Entity Recognition

Research has shown that topic-oriented words are often related to named entities and can be used for Named Entity Recognition. Many have proposed to measure the topicality of words in terms of ‘informativeness’ based on global distributional characteristics of words in a corpus. However, this study shows that there can be a large discrepancy between informativeness and topicality; empirically, informativeness-based features can damage the learning accuracy of NER. This paper proposes to measure words’ topicality based on local distributional features specific to individual documents, and proposes methods to transform topicality into gazetteer-like features for NER by binning. Evaluated using five datasets from three domains, the methods have shown consistent improvement over a baseline by between 0.9 and 4.0 in F-measure, and always outperformed methods that use informativeness measures.

Ziqi Zhang, Trevor Cohn, Fabio Ciravegna
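
A sketch of the binning idea: a document-local topicality score is discretized into a small set of gazetteer-like features. The scoring function, bin count, and example document are our assumptions, not the paper's exact features.

```python
from collections import Counter

def topicality(word, doc_tokens):
    """Crude local topicality: relative frequency of the word in the document,
    scaled by how early it first appears (a stand-in for the paper's
    document-specific distributional measures)."""
    counts = Counter(doc_tokens)
    rel_freq = counts[word] / len(doc_tokens)
    first_pos = doc_tokens.index(word) / len(doc_tokens)
    return rel_freq * (1.0 - first_pos)

def binned_feature(score, n_bins=4, max_score=0.1):
    """Discretize the topicality score into 'TOPIC_BIN_k' features."""
    bin_idx = min(int(score / max_score * n_bins), n_bins - 1)
    return f"TOPIC_BIN_{bin_idx}"

doc = ("obama met merkel in berlin yesterday obama said the talks "
       "with merkel were productive").split()
for word in ["obama", "merkel", "the"]:
    print(word, binned_feature(topicality(word, doc)))
```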
Named Entities in Judicial Transcriptions: Extended Conditional Random Fields

The progressive deployment of ICT in the courtroom is leading to the development of integrated multimedia folders where the entire trial contents (documents, audio and video recordings) are available for online consultation via web-based platforms. The current amount of unstructured textual data available in the judicial domain, especially related to hearing transcriptions, therefore highlights the need to automatically extract structured data from the unstructured ones in order to improve the efficiency of consultation processes. In this paper we address the problem of extracting structured information from transcriptions generated automatically by an ASR (Automatic Speech Recognition) system, by integrating Conditional Random Fields with available background information. The computational experiments show promising results in structuring ASR outputs, enabling robust and efficient document consultation.

Elisabetta Fersini, Enza Messina
Introducing Baselines for Russian Named Entity Recognition

Current research efforts in Named Entity Recognition deal mostly with the English language. Even though the interest in multi-language Information Extraction is growing, there are only a few works reporting results for the Russian language. This paper introduces quality baselines for the Russian NER task. We propose a corpus which was manually annotated with organization and person names. The main purpose of this corpus is to provide a gold standard for evaluation. We implemented and evaluated two approaches to NER: knowledge-based and statistical. The first one comprises several components: dictionary matching, pattern matching and rule-based search of lexical representations of entity names within a document. We assembled a set of linguistic resources and evaluated their impact on performance. For the data-driven approach we utilized our implementation of a linear-chain CRF which uses a rich set of features. The performance of both systems is promising (62.17% and 75.05% F1 measure), although they do not employ morphological or syntactical analysis.

Rinat Gareev, Maksim Tkachenko, Valery Solovyev, Andrey Simanovsky, Vladimir Ivanov

Word Sense Disambiguation and Coreference Resolution

Five Languages Are Better Than One: An Attempt to Bypass the Data Acquisition Bottleneck for WSD

This paper presents a multilingual classification-based approach to Word Sense Disambiguation that directly incorporates translational evidence from four other languages. The need for a large predefined monolingual sense inventory (such as WordNet) is avoided by taking a language-independent approach where the word senses are derived automatically from word alignments on a parallel corpus. As a consequence, the task is turned into a cross-lingual WSD task, which consists in selecting the contextually correct translation of an ambiguous target word.

In order to evaluate the viability of cross-lingual Word Sense Disambiguation, we built five classifiers with English as an input language and translations in the five supported languages (viz. French, Dutch, Italian, Spanish and German) as classification output. The feature vectors incorporate both local context features as well as translation features that are extracted from the aligned translations. The experimental results confirm the validity of our approach: the classifiers that employ translational evidence outperform the classifiers that only exploit local context information. Furthermore, a comparison with state-of-the-art systems for the same task revealed that our system outperforms all other systems for all five target languages.

Els Lefever, Véronique Hoste, Martine De Cock
Analyzing the Sense Distribution of Concordances Obtained by Web as Corpus Approach

In corpus-based lexicography and natural language processing, some authors have proposed using the Internet as a source of corpora for obtaining concordances of words. Most techniques implemented with this method are based on information retrieval-oriented web searchers. However, rankings of concordances obtained by these search engines are built not according to linguistic criteria but to topic similarity or navigation-oriented criteria, such as page-rank. It follows that examples or concordances might not be linguistically representative, and so linguistic knowledge mined by these methods might not be very useful. This work analyzes the linguistic representativeness of concordances obtained by web search engines based on different relevance criteria (web, blog and news search engines). The analysis consists of comparing web concordances and SemCor (the reference) with regard to the distribution of word senses. Results showed that sense distributions in concordances obtained by web search engines are, in general, quite different from those obtained from the reference corpus. Among the search engines, those found to be most similar to the reference were the informational oriented engines (news and blog search engines).

Xabier Saralegi, Pablo Gamallo
MaxMax: A Graph-Based Soft Clustering Algorithm Applied to Word Sense Induction

This paper introduces a linear time graph-based soft clustering algorithm. The algorithm applies a simple idea: given a graph, vertex pairs are assigned to the same cluster if either vertex has maximal affinity to the other. Clusters of varying size, shape, and density are found automatically, making the algorithm suited to tasks such as Word Sense Induction (WSI), where the number of classes is unknown and where class distributions may be skewed. The algorithm is applied to two WSI tasks, obtaining results comparable with those of systems adopting existing, state-of-the-art methods.

David Hope, Bill Keller
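
A simplified reconstruction of the core idea stated in the abstract (not the full published algorithm): two vertices are linked if either has maximal affinity to the other, and groups of linked vertices form the induced clusters. For brevity this sketch yields a hard partition, whereas MaxMax itself is a soft clustering algorithm; the example graph is invented.

```python
from collections import defaultdict

def maxmax_clusters(edges):
    """edges maps frozenset({u, v}) -> affinity weight. Returns a list of clusters."""
    neighbours = defaultdict(dict)
    for pair, w in edges.items():
        u, v = tuple(pair)
        neighbours[u][v] = w
        neighbours[v][u] = w

    # For each vertex, the set of its maximal-affinity neighbours.
    maximal = {v: {u for u, w in nbrs.items() if w == max(nbrs.values())}
               for v, nbrs in neighbours.items()}

    # Link u and v if either is maximal for the other, then take components.
    linked = defaultdict(set)
    for v, maxs in maximal.items():
        for u in maxs:
            linked[v].add(u)
            linked[u].add(v)

    clusters, seen = [], set()
    for start in neighbours:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(linked[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

edges = {frozenset({"bank", "money"}): 5, frozenset({"bank", "loan"}): 4,
         frozenset({"money", "loan"}): 3, frozenset({"bank", "river"}): 2,
         frozenset({"river", "water"}): 6}
print(maxmax_clusters(edges))   # two induced groups: financial vs. water senses
```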
A Model of Word Similarity Based on Structural Alignment of Subject-Verb-Object Triples

In this paper we propose a new model of word semantics and similarity that is based on the structural alignment of 〈Subject Verb Object〉 triples extracted from a corpus. The model gives transparent and meaningful representations of word semantics in terms of the predicates asserted of those words in a corpus. The model goes beyond current corpus-based approaches to word similarity in that it reflects the current psychological understanding of similarity as based on structural comparison and alignment. In an assessment comparing the model’s similarity scores with those provided by people for 350 word pairs, the model closely matches people’s similarity judgments and gives a significantly better fit to people’s judgments than that provided by a standard measure of semantic similarity.

Dervla O’Keeffe, Fintan Costello
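
A much-simplified sketch of the underlying representation: a noun is described by the predicates asserted of it in 〈Subject Verb Object〉 triples, and two nouns are compared through those predicate sets. The paper performs structural alignment; here we only measure Jaccard overlap of (role, verb) contexts on invented triples, as an illustration.

```python
def predicate_contexts(noun, triples):
    """Collect the (role, verb) predicates asserted of a noun in SVO triples."""
    contexts = set()
    for subj, verb, obj in triples:
        if subj == noun:
            contexts.add(("subj-of", verb))
        if obj == noun:
            contexts.add(("obj-of", verb))
    return contexts

def similarity(noun_a, noun_b, triples):
    """Jaccard overlap of the two nouns' predicate contexts."""
    a, b = predicate_contexts(noun_a, triples), predicate_contexts(noun_b, triples)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

triples = [("dog", "chase", "cat"), ("dog", "eat", "bone"),
           ("wolf", "chase", "deer"), ("wolf", "eat", "meat"),
           ("car", "need", "fuel")]
print(similarity("dog", "wolf", triples))   # share 'chase' and 'eat' as subjects
print(similarity("dog", "car", triples))    # no shared predicates
```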
Coreference Annotation Schema for an Inflectional Language

Creating a coreference corpus for an inflectional and free-word-order language is a challenging task due to specific syntactic features largely ignored by existing annotation guidelines, such as the absence of definite/indefinite articles (making quasi-anaphoricity very common), frequent use of zero subjects or discrepancies between syntactic and semantic heads. This paper comments on the experience gained in preparation of such a resource for an ongoing project (CORE), aiming at creating tools for coreference resolution.

Starting with a clarification of the relation between noun groups and mentions, through the definition of the annotation scope and strategies, up to actual decisions for borderline cases, we present the process of building the first, to the best of our knowledge, corpus of general coreference for Polish.

Maciej Ogrodniczuk, Magdalena Zawisławska, Katarzyna Głowińska, Agata Savary
Exploring Coreference Uncertainty of Generically Extracted Event Mentions

Because event mentions in text may be referentially ambiguous, event coreferentiality often involves uncertainty. In this paper we consider event coreference uncertainty and explore how it is affected by the context. We develop a supervised event coreference resolution model based on the comparison of generically extracted event mentions. We analyse event coreference uncertainty in both human annotations and predictions of the model, and in both within-document and cross-document setting. We frame event coreference as a classification task when full context is available and no uncertainty is involved, and a regression task in a limited context setting that involves uncertainty. We show how a rich set of features based on argument comparison can be utilized in both settings. Experimental results on English data suggest that our approach is especially suitable for resolving cross-document event coreference. Results also suggest that modelling human coreference uncertainty in the case of limited context is feasible.

Goran Glavaš, Jan Šnajder

Semantics and Discourse

LIARc: Labeling Implicit ARguments in Spanish Deverbal Nominalizations

This paper deals with the automatic identification and annotation of the implicit arguments of deverbal nominalizations in Spanish. We present the first version of the LIAR system, focusing on its classifier component. We have built a supervised Machine Learning feature based model that uses a subset of AnCora-Es as a training corpus. We have built four different models and the overall F-Measure is 89.9%, which means an increase in F-Measure of approximately 35 points over the baseline (55%). However, a detailed analysis of the feature performance is still needed. Future work will focus on using LIAR to automatically annotate the implicit arguments in the whole AnCora-Es.

Aina Peris, Mariona Taulé, Horacio Rodríguez, Manuel Bertran Ibarz
Automatic Detection of Idiomatic Clauses

We describe several experiments whose goal is to automatically identify idiomatic expressions in written text. We explore two approaches for the task: 1) idiom recognition as outlier detection; and 2) supervised classification of sentences. We apply principal component analysis for outlier detection. Detecting idioms as lexical outliers does not exploit class label information, so in the following experiments we use linear discriminant analysis to obtain a discriminant subspace and then a three-nearest-neighbor classifier to measure accuracy. We discuss the pros and cons of each approach. All the approaches are more general than previous algorithms for idiom detection: they neither rely on target idiom types, lexicons, or large manually annotated corpora, nor limit the search space to a particular type of linguistic construction.

Anna Feldman, Jing Peng
Evaluating the Results of Methods for Computing Semantic Relatedness

The semantic relatedness between two concepts is a measure that quantifies the extent to which two concepts are semantically related. Due to the growing interest of researchers in areas such as the Semantic Web, Information Retrieval and NLP, various approaches have been proposed in the literature for automatically computing semantic relatedness. However, despite the growing number of proposed approaches, there are still significant difficulties in evaluating the results returned by different semantic relatedness methods. The limitations of the state-of-the-art evaluation mechanisms prevent effective evaluation, and several works in the literature emphasize that the exploited approaches are rather inconsistent. In this paper we describe the limitations of the mechanisms used for evaluating the results of semantic relatedness methods. Taking these limitations into account, we propose a new methodology and new resources for comparing different semantic relatedness approaches in an effective way.

Felice Ferrara, Carlo Tasso
Similarity Measures Based on Latent Dirichlet Allocation

We present in this paper the results of our investigation of semantic similarity measures at word- and sentence-level based on two fully-automated approaches to deriving meaning from large corpora: Latent Dirichlet Allocation, a probabilistic approach, and Latent Semantic Analysis, an algebraic approach. The focus is on similarity measures based on Latent Dirichlet Allocation, due to their novelty, while the Latent Semantic Analysis measures are used for comparison purposes. We explore two types of measures based on Latent Dirichlet Allocation: measures based on distances between probability distributions that can be applied directly to larger texts such as sentences, and a word-to-word similarity measure that is then expanded to work at sentence-level. We present results using paraphrase identification data in the Microsoft Research Paraphrase corpus.

Vasile Rus, Nobal Niraula, Rajendra Banjade
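
One of the distribution-based measures can be shown as a small worked example: sentence similarity derived from the Jensen–Shannon divergence between LDA topic mixtures. The topic distributions below are made up; in practice they would be inferred for each sentence with a trained LDA model, and the exact distance used in the paper may differ.

```python
import math

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (log base 2, bounded by 1) between two
    discrete probability distributions of equal length."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def lda_similarity(p, q):
    """Map the divergence (0 = identical, 1 = disjoint) to a similarity score."""
    return 1.0 - jensen_shannon(p, q)

sent_a = [0.70, 0.20, 0.05, 0.05]   # topic mixture of sentence A
sent_b = [0.65, 0.25, 0.05, 0.05]   # near-paraphrase: similar mixture
sent_c = [0.05, 0.05, 0.30, 0.60]   # unrelated sentence
print(lda_similarity(sent_a, sent_b))   # close to 1
print(lda_similarity(sent_a, sent_c))   # much lower
```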
Evaluating the Premises and Results of Four Metaphor Identification Systems

This study first examines the implicit and explicit premises of four systems for identifying metaphoric utterances from unannotated input text. All four systems are then evaluated on a common data set in order to see which premises are most successful. The goal is to see if these systems can find metaphors in a corpus that is mostly non-metaphoric without over-identifying literal and humorous utterances as metaphors. Three of the systems are distributional semantic systems, including a source-target mapping method [1-4]; a word abstractness measurement method [5], [6, 7]; and a semantic similarity measurement method [8, 9]. The fourth is a knowledge-based system which uses a domain interaction method based on the SUMO ontology [10, 11], implementing the hypothesis that metaphor is a product of the interactions among all of the concepts represented in an utterance [12, 13].

Jonathan Dunn
Determining the Conceptual Space of Metaphoric Expressions

We present a method of constructing the semantic signatures of target concepts expressed in metaphoric expressions as well as a method to determine the conceptual space of a metaphor using the constructed semantic signatures and a semantic expansion. We evaluate our methodology by focusing on metaphors where the target concept is Governance. Using the semantic signature constructed for this concept, we show that the conceptual spaces generated by our method are judged to be highly acceptable by humans.

David B. Bracewell, Marc T. Tomlinson, Michael Mohler
What is being Measured in an Information Graphic?

Information graphics (such as bar charts and line graphs) are widely used in popular media. The majority of such non-pictorial graphics have the purpose of communicating a high-level message which is often not repeated in the text of the article. Thus, information graphics together with the textual segments contribute to the overall purpose of an article and cannot be ignored. Unfortunately, information graphics often do not label the dependent axis with a full descriptor of what is being measured. In order to realize the high-level message of an information graphic in natural language, a referring expression for the dependent axis must be generated. This task is complex in that the required referring expression often must be constructed by extracting and melding pieces of information from the textual content of the graphic. Our heuristic-based solution to this problem has been shown to produce reasonable text for simple bar charts. This paper presents the extensibility of that approach to other kinds of graphics, in particular to grouped bar charts and line graphs. We discuss the set of component texts contained in these two kinds of graphics, how the methodology for simple bar charts can be extended to these kinds, and the evaluation of the enhanced approach.

Seniz Demir, Stephanie Elzer Schwartz, Richard Burns, Sandra Carberry
Comparing Discourse Tree Structures

Existing discourse parsing systems make use of different theories as the basis of the process of building discourse trees. Many of them use Recall, Precision and F-measure to compare discourse tree structures. These measures can be used only on topologically identical structures. However, there are known cases when two different tree structures of the same text express the same discourse interpretation, or something very similar. In these cases Precision, Recall and F-measure are not so conclusive. In this paper, we propose three new scores for comparing discourse trees, which take into consideration progressively more constraints. As basic elements of building the discourse structure we use those embraced by two discourse theories: Rhetorical Structure Theory (RST) and Veins Theory, both using binary trees augmented with nuclearity notation. We ignore the second notation used in RST, the names of relations. The first score takes into account the coverage of inner nodes. The second score complements the first with the nuclearity of the relation. The third score computes Precision, Recall and F-measure on the vein expressions of the elementary discourse units. We show that these measures yield comparable scores where differences in structure are not accompanied by differences in interpretation.

Elena Mitocariu, Daniel Alexandru Anechitei, Dan Cristea
Assessment of Different Workflow Strategies for Annotating Discourse Relations: A Case Study with HDRB

In this paper we present our experiments with different annotation workflows for annotating discourse relations in the Hindi Discourse Relation Bank (HDRB). In view of the growing interest in the development of discourse data-banks based on the PDTB framework and the complexities associated with discourse annotation, it is important to study and analyze the approaches and practices followed in the annotation process. The ultimate goal is to find an optimal balance between accurate description of discourse relations and maximal inter-rater reliability. We address the question of the choice of annotation workflow for discourse and how it affects the consistency and hence the quality of annotation. We conduct multiple annotation experiments using different workflow strategies, and evaluate their impact on inter-annotator agreement. Our results show that the choice of annotation workflow has a significant effect on the annotation load and the comprehension of discourse relations for annotators, as is reflected in the inter-annotator agreement results.

Himanshu Sharma, Praveen Dakwale, Dipti M. Sharma, Rashmi Prasad, Aravind Joshi
Building a Discourse Parser for Informal Mathematical Discourse in the Context of a Controlled Natural Language

The lack of specific data sets makes discourse parsing for Informal Mathematical Discourse (IMD) difficult. In this paper, we propose a data-driven approach to identify arguments and connectives in an IMD structure within the context of a Controlled Natural Language (CNL). Our approach follows low-level discourse parsing under the Penn Discourse TreeBank (PDTB) guidelines. Three classifiers have been trained: one that identifies Arg2, another that locates the relative position of Arg1, and a third that identifies the (Arg1 and Arg2) arguments of each connective. These classifiers are instances of Support Vector Machines (SVMs), trained on our own Mathematical TreeBank. Finally, our approach defines an end-to-end discourse parser for IMD, whose results will be used to classify informal deductive proofs via low-level discourse processing of IMD.

Raúl Ernesto Gutiérrez de Piñerez Reyes, Juan Francisco Díaz Frias
Discriminative Learning of First-Order Weighted Abduction from Partial Discourse Explanations

Abduction is inference to the best explanation. Abduction has long been studied in a wide range of contexts and is widely used for modeling artificial intelligence systems, such as diagnostic systems and plan recognition systems. Recent advances in techniques for automatic world knowledge acquisition and inference warrant applying abduction with large knowledge bases to real-life problems. However, less attention has been paid to how to automatically learn score functions, which rank candidate explanations in order of their plausibility. In this paper, we propose a novel approach for learning the score function of first-order logic-based weighted abduction [1] in a supervised manner. Because the manual annotation of abductive explanations (i.e. a set of literals that explains observations) is a time-consuming task in many cases, we propose a framework to learn the score function from partially annotated abductive explanations (i.e. a subset of those literals). More specifically, we assume that we apply abduction to a specific task, where a subset of the best explanation is associated with output labels, and the rest are regarded as hidden variables. We then formulate the learning problem as a task of discriminative structured learning with hidden variables. Our experiments show that our framework successfully reduces the loss in each iteration on a plan recognition dataset.

Kazeto Yamamoto, Naoya Inoue, Yotaro Watanabe, Naoaki Okazaki, Kentaro Inui
Facilitating the Analysis of Discourse Phenomena in an Interoperable NLP Platform

The analysis of discourse phenomena is essential in many natural language processing (NLP) applications. The growing diversity of available corpora and NLP tools brings a multitude of representation formats. In order to alleviate the problem of incompatible formats when constructing complex text mining pipelines, the Unstructured Information Management Architecture (UIMA) provides a standard means of communication between tools and resources. U-Compare, a text mining workflow construction platform based on UIMA, further enhances interoperability through a shared system of data types, allowing free combination of compliant components into workflows. Although U-Compare and its type system already support syntactic and semantic analyses, support for the analysis of discourse phenomena was previously lacking. In response, we have extended the U-Compare type system with new discourse-level types. We illustrate processing and visualisation of discourse information in U-Compare by providing several new deserialisation components for corpora containing discourse annotations. The new U-Compare is downloadable from http://nactem.ac.uk/ucompare.

Riza Theresa Batista-Navarro, Georgios Kontonatsios, Claudiu Mihăilă, Paul Thompson, Rafal Rak, Raheel Nawaz, Ioannis Korkontzelos, Sophia Ananiadou
Backmatter
Metadata
Title: Computational Linguistics and Intelligent Text Processing
Editor: Alexander Gelbukh
Copyright Year: 2013
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-642-37247-6
Print ISBN: 978-3-642-37246-9
DOI: https://doi.org/10.1007/978-3-642-37247-6
