
2014 | Book

Computational Linguistics and Intelligent Text Processing

15th International Conference, CICLing 2014, Kathmandu, Nepal, April 6-12, 2014, Proceedings, Part I


About this book

This two-volume set, consisting of LNCS 8403 and LNCS 8404, constitutes the thoroughly refereed proceedings of the 15th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2014, held in Kathmandu, Nepal, in April 2014. The 85 revised papers presented together with 4 invited papers were carefully reviewed and selected from 300 submissions. The papers are organized in the following topical sections: lexical resources; document representation; morphology, POS-tagging, and named entity recognition; syntax and parsing; anaphora resolution; recognizing textual entailment; semantics and discourse; natural language generation; sentiment analysis and emotion recognition; opinion mining and social networks; machine translation and multilingualism; information retrieval; text classification and clustering; text summarization; plagiarism detection; style and spelling checking; speech processing; and applications.

Table of contents

Frontmatter

Lexical Resources

Using Word Association Norms to Measure Corpus Representativeness

An obvious way to measure how representative a corpus is for the language environment of a person would be to observe this person over a longer period of time, record all written or spoken input, and compare this data to the corpus in question. As this is not very practical, we suggest here a more indirect way to do this. Previous work suggests that people’s word associations can be derived from corpus statistics. These word associations are known to some degree, as psychologists have collected them from test persons in large scale experiments. The outputs of these experiments are tables of word associations, the so-called word association norms. In this paper we assume that the more representative a corpus is for the language environment of the test persons, the better the associations generated from it should match people’s associations. That is, we compare the corpus-generated associations to the association norms collected from humans, and take the similarity between the two as a measure of corpus representativeness. To our knowledge, this is the first attempt to do so.

Reinhard Rapp
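
As a rough, hypothetical illustration of the comparison described in the abstract above (the toy corpus, the association measure, and the norms below are all assumptions, not the paper's actual setup), corpus associations can be derived with pointwise mutual information and their overlap with the human norms taken as a representativeness score:

```python
from collections import Counter
from itertools import combinations
from math import log

# Toy corpus standing in for a large balanced corpus.
corpus = [
    "the doctor visited the hospital".split(),
    "the nurse works at the hospital".split(),
    "the doctor and the nurse met".split(),
    "the doctor saw a patient at the hospital".split(),
    "the nurse helped the doctor".split(),
]

word_freq = Counter(w for sent in corpus for w in sent)
pair_freq = Counter()
for sent in corpus:
    for a, b in combinations(sorted(set(sent)), 2):
        pair_freq[(a, b)] += 1

n_tokens = sum(word_freq.values())

def pmi(a, b):
    """Pointwise mutual information of a sentence-level co-occurrence."""
    joint = pair_freq[tuple(sorted((a, b)))]
    if joint == 0:
        return float("-inf")
    return log(joint * n_tokens / (word_freq[a] * word_freq[b]))

def top_associations(cue, k=3, min_freq=2):
    # The frequency cutoff counters PMI's bias toward rare words.
    cands = [w for w in word_freq if w != cue and word_freq[w] >= min_freq]
    return sorted(cands, key=lambda w: pmi(cue, w), reverse=True)[:k]

# Hypothetical association norms: cue word -> human responses.
norms = {"doctor": ["nurse", "hospital", "patient"]}

# Representativeness proxy: overlap of corpus associations with the norms.
for cue, responses in norms.items():
    generated = top_associations(cue, k=len(responses))
    overlap = len(set(generated) & set(responses)) / len(responses)
    print(cue, generated, f"overlap={overlap:.2f}")
```
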
Optimality Theory as a Framework for Lexical Acquisition

This paper re-investigates a lexical acquisition system initially developed for French. We show that, interestingly, the architecture of the system reproduces and implements the main components of Optimality Theory. However, we formulate the hypothesis that some of its limitations are mainly due to a poor representation of the constraints used. Finally, we show how a better representation of these constraints yields better results.

Thierry Poibeau
Verb Clustering for Brazilian Portuguese

Levin-style classes which capture the shared syntax and semantics of verbs have proven useful for many Natural Language Processing (NLP) tasks and applications. However, lexical resources which provide information about such classes are only available for a handful of the world’s languages. Because manual development of such resources is extremely time consuming and cannot reliably capture domain variation in classification, methods for automatic induction of verb classes from texts have gained popularity. However, to date such methods have been applied to English and a handful of other, mainly resource-rich languages. In this paper, we apply the methods to Brazilian Portuguese - a language for which no VerbNet or automatic class induction work exists yet. Since Levin-style classification is said to have a strong cross-linguistic component, we use unsupervised clustering techniques similar to those developed for English, without language-specific feature engineering. This yields interesting results which line up well with those obtained for other languages, demonstrating the cross-linguistic nature of this type of classification. However, we also discover and discuss issues which require specific consideration when aiming to optimise the performance of verb clustering for Brazilian Portuguese and other less-resourced languages.

Carolina Scarton, Lin Sun, Karin Kipper-Schuler, Magali Sanches Duran, Martha Palmer, Anna Korhonen
Spreading Relation Annotations in a Lexical Semantic Network Applied to Radiology

Domain specific ontologies are invaluable but their development faces many challenges. In most cases, domain knowledge bases are built with very limited scope, without considering the benefits of including domain knowledge in a general ontology. Furthermore, most existing resources lack meta-information about association strength (weights) and annotations (frequency information like frequent, rare ... or relevance information like pertinent or irrelevant). In this paper, we present a semantic resource for radiology built over an existing general semantic lexical network (JeuxDeMots). This network combines weights and annotations on typed relations between terms and concepts. Some inference mechanisms are applied to the network to improve its quality and coverage. We extend these mechanisms to relation annotations. We describe how annotations are handled and how they improve the network by imposing new constraints, especially those founded on medical knowledge.

Lionel Ramadier, Manel Zarrouk, Mathieu Lafourcade, Antoine Micheau
Issues in Encoding the Writing of Nepal’s Languages

The major language of Nepal, known today as Nepali, is spoken as mother tongue by nearly half the population, and as a second language by nearly all of the rest. A considerable volume of computational linguistics work has been done on Nepali, both in research establishments and commercial organizations. However, there are another 94 languages indigenous to the country, and the situation for these is not good. In order to apply computational linguistics methods to a language it must first be represented in the computer, but most of the languages of Nepal have no written tradition, let alone any support by computers. It is the written form that is needed for full computational processes, and it is here that we encounter barriers or at best inappropriate compromises. We will look at the situation in Nepal, ignoring the 17 cross-border languages where the major speaker population lies outside Nepal. We are left with only three languages with written traditions: Nepali, which is well served; Newari, with over 1000 years of written tradition but which so far has been frustrated in attempts to encode its writing; and Limbu, which does have its writing encoded, though with defects. Many of the remaining languages may be written in Devanagari, but aspire to something different that relates to their languages and has a more visually distinctive writing to mark their identity. We look at what can be done for these remaining languages and speculate whether a common writing system and encoding could cover all the languages of Nepal. Inevitably we must focus on the current standard for the computer encoding of writing, Unicode, but we find that while language activists in Nepal do not adequately understand what is possible with the technology and pursue objectives within Unicode that are not necessary or helpful, external experts have only a limited understanding of all the issues involved and of the requirements of living languages and their users, and instead pursue scholarly interests which offer limited support for living users.

Pat Hall, Bal Krishna Bal, Sagun Dhakhwa, Bhim Narayan Regmi
Compound Terms and Their Multi-word Variants: Case of German and Russian Languages

The terminology of any language and any domain continuously evolves, leading to constant term renewal. Terms undergo a wide range of morphological and syntactic variations which have to be handled by NLP applications. While the syntactic variations of multi-word terms have been described and tools designed to process them, only a few works have studied the syntagmatic variants of compound terms. This paper is dedicated to the identification of such variants, and more precisely to the detection of synonymic pairs that consist of “compound term – multi-word term”. We describe a pipeline for their detection, from compound recognition and splitting to alignment of the variants with original terms, through multi-word term extraction. The experiments are carried out for two compound-producing languages, German and Russian, and two specialised domains: wind energy and breast cancer. We identify variation patterns for these two languages and demonstrate that the transformation of a morphological compound into a syntagmatic compound mainly occurs when the term denomination needs to be enlarged.

Elizaveta Clouet, Béatrice Daille
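
One classical ingredient of such a pipeline is frequency-based compound splitting. The following minimal sketch (with invented frequencies and a simplified treatment of German linking elements; not the authors' implementation) picks the split whose parts are best attested in a corpus:

```python
# Toy corpus frequencies; a real system would derive these from large
# domain corpora (here: wind energy), as the pipeline above does.
freq = {"wind": 500, "energie": 400, "windenergie": 80, "anlage": 300}

def split_compound(word, min_part=3):
    """Return the binary split whose parts have the highest geometric-mean
    corpus frequency, or the unsplit word if no split beats it."""
    best, best_score = (word,), float(freq.get(word, 1))
    for i in range(min_part, len(word) - min_part + 1):
        left, right = word[:i], word[i:]
        # Strip the German linking element "s" (e.g. Arbeit+s+markt).
        if left not in freq and left.endswith("s") and left[:-1] in freq:
            left = left[:-1]
        if left in freq and right in freq:
            score = (freq[left] * freq[right]) ** 0.5
            if score > best_score:
                best, best_score = (left, right), score
    return best

print(split_compound("windenergie"))        # ('wind', 'energie')
print(split_compound("windenergieanlage"))  # ('windenergie', 'anlage')
```
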
A Fully Automated Approach for Arabic Slang Lexicon Extraction from Microblogs

With the rapid increase in the volume of Arabic opinionated posts on different social media forums comes an increased demand for Arabic sentiment analysis tools and resources. Social media posts, especially those made by the younger generation, are usually written in colloquial Arabic and include a lot of slang, much of which evolves over time. While some work has been carried out to build modern standard Arabic sentiment lexicons, these need to be supplemented with dialectical terms and continuously updated with slang. This paper proposes a fully automated approach for building a dialectical/slang subjectivity lexicon for use in Arabic sentiment analysis using lexico-syntactic patterns. Since existing Arabic part-of-speech taggers and other morphological resources have been found to handle colloquial Arabic very poorly, the presented approach does not employ any such tools, allowing it to generalize across dialects with some minor modifications. Results of experiments that targeted Egyptian Arabic show the approach’s ability to detect subjective internet slang represented by single words or by multi-word expressions, as well as to classify their polarity with a high degree of precision.

Hady ElSahar, Samhaa R. El-Beltagy
Simple TF·IDF Is Not the Best You Can Get for Regionalism Classification

In broadly spoken languages such as English or Spanish, there are words akin to a particular region. For example, there are words typically used in the UK, such as cooker, while stove is preferred for that concept in the US. Identifying the particular words a region cultivates involves discriminating them from the set of words common to all regions. This yields the problem where a term’s frequency should be salient enough for the term to be considered important, while being a common term tames this salience. This is the known problem of Term Frequency versus Inverse Document Frequency; nevertheless, typical TF·IDF applications do not include weighting factors. In this work we first propose several alternative formulae empirically, and then conclude that we need to dig into a broader search space; thereby, we propose using Genetic Programming to find a suitable expression composed of TF and IDF terms that maximizes the discrimination of such terms, given a reduced bootstrapping set of examples labeled for each region (400). We present performance examples for the Spanish variations across the Americas and Spain.

Hiram Calvo
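
To make the contrast concrete, the sketch below scores region-specific terms with a parameterized tf^a·idf^b; with a = b = 1 it reduces to the plain TF·IDF the title refers to, while the paper's Genetic Programming instead searches over whole TF/IDF expressions. All data and weights here are invented:

```python
from math import log

# Each "document" is the aggregated text of one region (toy data; the
# paper works with Spanish variants across the Americas and Spain).
regions = {
    "ES": "coche coche ordenador piso".split(),
    "MX": "carro carro computadora departamento".split(),
    "AR": "auto auto computadora departamento".split(),
}

def weighted_tfidf(term, region, a=1.0, b=1.0):
    """tf^a * idf^b: with a = b = 1 this is plain TF·IDF. The exponents
    only hint at the broader expression space a GP search would explore."""
    doc = regions[region]
    tf = doc.count(term) / len(doc)
    df = sum(term in words for words in regions.values())
    idf = log(len(regions) / df)
    return (tf ** a) * (idf ** b)

for region in regions:
    ranked = sorted(set(regions[region]),
                    key=lambda t: weighted_tfidf(t, region, a=1.0, b=2.0),
                    reverse=True)
    print(region, ranked[:2])  # the most region-specific candidate terms
```
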
Improved Text Extraction from PDF Documents for Large-Scale Natural Language Processing

The lack of reliable text extraction from arbitrary documents is often an obstacle for large-scale NLP based on resources crawled from the Web. One of the largest problems in the conversion of PDF documents is the detection of the boundaries of common textual units such as paragraphs, sentences and words. PDF is a file format optimized for printing and encapsulates a complete description of the layout of a document including text, fonts, graphics and so on. This paper describes a tool for extracting texts from arbitrary PDF files for the support of large-scale data-driven natural language processing. Our approach combines the benefits of several existing solutions for the conversion of PDF documents to plain text and adds a language-independent post-processing procedure that cleans the output for further linguistic processing. In particular, we use the PDF-rendering libraries pdfXtk, Apache Tika and Poppler in various configurations. From the output of these tools we recover proper boundaries using on-the-fly language models and language-independent extraction heuristics. In our research, we looked especially at publications from the European Union, which constitute a valuable multilingual resource, for example, for training statistical machine translation models. We use our tool for the conversion of a large multilingual database crawled from the EU bookshop with the aim of building parallel corpora. Our experiments show that our conversion software is capable of fixing various common issues, leading to cleaner data sets in the end.

Jörg Tiedemann
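
To give a flavor of the language-independent post-processing step, here is a heuristic sketch (the real tool combines pdfXtk, Apache Tika and Poppler with on-the-fly language models; the vocabulary check below is only a stand-in for those models):

```python
import re

def clean_extracted_text(raw, vocabulary=None):
    """Heuristic cleanup of PDF-extracted text: rejoin words hyphenated
    across line breaks and merge hard-wrapped lines into paragraphs.
    A vocabulary (or a language model, as in the paper) can veto joins
    that do not yield a known word."""
    def join(match):
        candidate = match.group(1) + match.group(2)
        if vocabulary is None or candidate.lower() in vocabulary:
            return candidate
        return match.group(1) + "-" + match.group(2)  # keep the hyphen

    text = re.sub(r"(\w+)-\n(\w+)", join, raw)
    # Blank lines mark paragraphs; single newlines are soft wraps.
    paragraphs = [re.sub(r"\s*\n\s*", " ", p).strip()
                  for p in re.split(r"\n\s*\n", text)]
    return "\n\n".join(p for p in paragraphs if p)

raw = "The multi-\nlingual corpus was\ncrawled.\n\nNew paragraph here."
print(clean_extracted_text(raw, vocabulary={"multilingual"}))
```
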

Document Representation

Dependency-Based Semantic Parsing for Concept-Level Text Analysis

Concept-level text analysis is superior to word-level analysis as it preserves the semantics associated with multi-word expressions. It offers a better understanding of text and helps to significantly increase the accuracy of many text mining tasks. Concept extraction from text is a key step in concept-level text analysis. In this paper, we propose a ConceptNet-based semantic parser that deconstructs natural language text into concepts based on the dependency relation between clauses. Our approach is domain-independent and is able to extract concepts from heterogeneous text. Through this parsing technique, 92.21% accuracy was obtained on a dataset of 3,204 concepts. We also show experimental results on three different text analysis tasks, on which the proposed framework outperformed state-of-the-art parsing techniques.

Soujanya Poria, Basant Agarwal, Alexander Gelbukh, Amir Hussain, Newton Howard
Obtaining Better Word Representations via Language Transfer

Vector space word representations have recently achieved great success in improving performance across various NLP tasks. However, existing word embedding learning methods utilize only monolingual corpora. Inspired by transfer learning, we propose a novel method to obtain word embeddings via language transfer. Under this method, in order to obtain word embeddings for one language (the target language), we instead train models on corpora of a different language (the source language). We then use the obtained source language word embeddings to represent target language word embeddings. We evaluate the word embeddings obtained by the proposed method on word similarity tasks across several benchmark datasets. The results show that our method is surprisingly effective, outperforming competitive baselines by a large margin. Another benefit of our method is that the process of collecting a new corpus might be skipped.

Changliang Li, Bo Xu, Gaowei Wu, Xiuying Wang, Wendong Ge, Yan Li
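
The abstract leaves the transfer mechanism open; one heavily simplified reading (the bilingual dictionary, the vectors, and the lookup below are all assumptions, and the paper's actual mechanism may differ) is to represent a target-language word by the source-language embedding of its translation:

```python
import numpy as np

# Hypothetical source-language vectors and a toy bilingual dictionary.
rng = np.random.default_rng(0)
source_vecs = {w: rng.normal(size=50) for w in ["dog", "cat", "car"]}
bilingual_dict = {"perro": "dog", "gato": "cat", "coche": "car"}

def transfer_embedding(target_word):
    """Represent a target word via its translation's source embedding."""
    translation = bilingual_dict.get(target_word)
    if translation is None:
        return None  # out-of-dictionary: no transferred representation
    return source_vecs[translation]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Word-similarity evaluation, as in the paper's benchmarks.
print(cosine(transfer_embedding("perro"), transfer_embedding("gato")))
```
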
Exploring Applications of Representation Learning in Nepali

We explore the applications of representation learning in Nepali, an under-resourced language. Using distributional similarity on a large amount of unlabeled Nepali text, we induce clusters of different sizes. The use of these clusters as features significantly improves performance over the baseline on two standard NLP tasks. In a part-of-speech (PoS) tagging experiment where the train and test domains are the same, the accuracy on unknown words increased by up to 5% compared to the baseline. In a named-entity recognition (NER) experiment in a domain adaptation setting with a small training data size, the F1 score improved by up to 41% compared to the baseline. In a setting where the train and test domains are the same, the F1 score improved by 13% compared to the baseline.

Anjan Nepal, Alexander Yates
Topic Models Incorporating Statistical Word Senses

LDA considers a surface word to be identical across all documents and measures the contribution of a surface word to each topic. However, a surface word may present different signatures in different contexts, i.e. polysemous words can be used with different senses in different contexts. Intuitively, disambiguating word senses for topic models can enhance their discriminative capabilities. In this work, we propose a joint model to automatically induce document topics and word senses simultaneously. Instead of using some pre-defined word sense resources, we capture the word sense information via a latent variable and directly induce them in a fully unsupervised manner from the corpora. Experimental results show that the proposed joint model outperforms the classic LDA and a standalone sense-based LDA model significantly in document clustering.

Guoyu Tang, Yunqing Xia, Jun Sun, Min Zhang, Thomas Fang Zheng
How Preprocessing Affects Unsupervised Keyphrase Extraction

Unsupervised keyphrase extraction techniques generally consist of candidate phrase selection and ranking. Previous studies treat candidate phrase selection and ranking as a whole, while the effectiveness of identifying candidate phrases and its impact on ranking algorithms have remained unexplored. This paper surveys common candidate selection techniques and analyses the effect of different candidate selection approaches on the performance of ranking algorithms. Our evaluation shows that candidate selection approaches with better coverage and accuracy can boost the performance of the ranking algorithms.

Rui Wang, Wei Liu, Chris McDonald
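
A minimal sketch of the two-stage decomposition the survey rests on, with toy POS-tagged input and a deliberately simple ranker (graph-based rankers such as TextRank would slot into stage 2 unchanged):

```python
from collections import Counter

# Toy POS-tagged sentence; a real system would run a tagger first.
tagged = [("unsupervised", "JJ"), ("keyphrase", "NN"), ("extraction", "NN"),
          ("relies", "VBZ"), ("on", "IN"), ("candidate", "NN"),
          ("selection", "NN"), ("and", "CC"), ("keyphrase", "NN"),
          ("ranking", "NN")]

def select_candidates(tagged_tokens):
    """Stage 1: collect maximal runs of adjectives/nouns as candidates."""
    candidates, run = [], []
    for word, pos in tagged_tokens + [("", "")]:  # sentinel flushes last run
        if pos.startswith(("JJ", "NN")):
            run.append(word)
        elif run:
            candidates.append(" ".join(run))
            run = []
    return candidates

def rank(candidates):
    """Stage 2: score each phrase by the mean frequency of its words."""
    word_freq = Counter(w for c in candidates for w in c.split())
    return sorted(candidates, reverse=True,
                  key=lambda c: sum(word_freq[w] for w in c.split())
                                / len(c.split()))

print(rank(select_candidates(tagged)))
```
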

Morphology, POS-tagging, and Named Entity Recognition

Methods and Algorithms for Unsupervised Learning of Morphology

This paper is a survey of methods and algorithms for unsupervised learning of morphology. We provide a description of the methods and algorithms used for morphological segmentation from a computational linguistics point of view. We survey morphological segmentation methods covering methods based on MDL (minimum description length), MLE (maximum likelihood estimation), MAP (maximum a posteriori), parametric and non-parametric Bayesian approaches. A review of the evaluation schemes for unsupervised morphological segmentation is also provided along with a summary of evaluation results on the Morpho Challenge evaluations.

Burcu Can, Suresh Manandhar
Morphological Analysis of the Bishnupriya Manipuri Language Using Finite State Transducers

In this work we present a morphological analysis of the Bishnupriya Manipuri language, an Indo-Aryan language spoken in north-eastern India. As of now, there is no computational work available for the language. Finite state morphology is one of the successful approaches applied to a wide variety of languages over the years. We therefore adapted the finite state approach to analyse the morphology of the Bishnupriya Manipuri language.

Nayan Jyoti Kalita, Navanath Saharia, Smriti Kumar Sinha
A Hybrid Approach to the Development of Part-of-Speech Tagger for Kafi-noonoo Text

Although natural language processing (NLP) is now a popular area of research and development, less-resourced languages are not receiving much attention from developers. One such under-resourced language is Kafi-noonoo, which is spoken in the south-western regions of Ethiopia. This paper presents the development of a part-of-speech tagger for Kafi-noonoo. In order to develop the tagger, we employed a hybrid of two systems: statistical and rule-based taggers. The lexical and transitional probabilities of word classes are modeled using an HMM. However, due to the limited corpus available for the language, a set of transformation rules is applied to improve the result. The system was tested on a test corpus and, with 90% of the corpus used for training, the hybrid tagger yielded an accuracy of 80.47%.

Zelalem Mekuria, Yaregal Assabie
Modified Differential Evolution for Biochemical Name Recognizer

In this paper we propose modified differential evolution (MDE) based feature selection and ensemble learning algorithms for a biochemical entity recognizer. Identification and classification of chemical entities are relatively more complex and challenging compared to other related tasks. As chemical entities we focus on IUPAC and IUPAC-related entities. The algorithm performs feature selection within the framework of a robust machine learning algorithm, namely Conditional Random Fields. Features are identified and implemented mostly without using any domain specific knowledge and/or resources. In this paper we modify traditional differential evolution to perform two tasks, viz. determining a relevant set of features as well as determining proper voting weights for constructing an ensemble. The feature selection technique produces a set of potential solutions in the final population. We develop many CRF models using these feature combinations. In order to further improve the performance, the outputs of these classifiers are combined using a classifier ensemble technique based on modified DE. Our experiments with the benchmark datasets yield recall, precision and F-measure values of 82.34%, 88.26% and 85.20%, respectively.

Utpal Kumar Sikdar, Asif Ekbal, Sriparna Saha
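
A minimal binary differential-evolution loop for feature-subset selection, in the spirit of the modified DE above; the fitness function is a placeholder, whereas the paper trains a CRF per candidate subset and also evolves ensemble voting weights:

```python
import random

random.seed(42)
N_FEATURES, POP, F, CR, GENS = 10, 12, 0.8, 0.9, 30

def fitness(mask):
    # Placeholder: pretend even-indexed features help, others hurt.
    # The paper would train and evaluate a CRF on the selected features.
    return sum(1 if i % 2 == 0 else -1 for i, b in enumerate(mask) if b)

pop = [[random.random() < 0.5 for _ in range(N_FEATURES)] for _ in range(POP)]
for _ in range(GENS):
    for i, target in enumerate(pop):
        a, b, c = random.sample([p for j, p in enumerate(pop) if j != i], 3)
        # DE/rand/1 mutation adapted to bit vectors: copy a, and flip each
        # position where b and c differ with probability F.
        mutant = [a[k] != (b[k] != c[k] and random.random() < F)
                  for k in range(N_FEATURES)]
        # Binomial crossover between target and mutant.
        trial = [mutant[k] if random.random() < CR else target[k]
                 for k in range(N_FEATURES)]
        if fitness(trial) >= fitness(target):  # greedy selection
            pop[i] = trial

best = max(pop, key=fitness)
print([i for i, bit in enumerate(best) if bit])  # selected feature indices
```
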

Syntax and Parsing

Extended CFG Formalism for Grammar Checker and Parser Development

This paper reports on the implementation of grammar checkers and parsers for highly inflected and under-resourced languages. As the classical context-free grammar (CFG) formalism performs poorly on languages with a rich morphological feature system, we have extended the CFG formalism by adding syntactic roles, lexical constraints, and constraints on morpho-syntactic feature values. The formalism also allows assigning morpho-syntactic feature values to phrases and specifying optional constituents. The paper also describes how we implement the grammar checker by using two sets of rules – rules describing correct sentences and rules describing grammar errors. The same engine with a different rule set can be used for different purposes – to parse text or to find grammar errors. Finally, the paper describes the implementation of the Latvian and Lithuanian parsers and grammar checkers and the methods used for quality assessment.

Daiga Deksne, Inguna Skadiņa, Raivis Skadiņš
Dealing with Function Words in Unsupervised Dependency Parsing

In this paper, we show some properties of function words in dependency trees. Function words are grammatical words, such as articles, prepositions, pronouns, conjunctions, or auxiliary verbs. These words are often short and very frequent in texts and therefore many of them can be easily recognized. We formulate a hypothesis that function words tend to have a fixed number of dependents and we prove this hypothesis on treebanks. Using this hypothesis, we are able to improve unsupervised dependency parsing and outperform previously published state-of-the-art results for many languages.

David Mareček, Zdeněk Žabokrtský
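
The fixed-dependents hypothesis is easy to probe on a treebank. In the toy sketch below (invented sentences; the paper's experiments use full treebanks), each word's dependent counts are tabulated, and function words such as "the" show a near-constant count:

```python
from collections import Counter, defaultdict

# Each sentence is a list of (token, head_index) pairs; heads are
# 1-based token positions, and 0 denotes the artificial root.
treebank = [
    [("the", 2), ("dog", 3), ("sleeps", 0)],
    [("the", 2), ("cat", 3), ("ran", 0), ("to", 3), ("the", 6), ("door", 4)],
]

dep_counts = defaultdict(list)
for sent in treebank:
    n_deps = Counter(head for _, head in sent)
    for idx, (token, _) in enumerate(sent, start=1):
        dep_counts[token].append(n_deps.get(idx, 0))

for token, counts in dep_counts.items():
    # Function words like "the" (always 0 dependents here) should show a
    # near-constant count; content words vary much more.
    print(token, counts)
```
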
When Rules Meet Bigrams

This paper discusses an on-going project aiming at improving the quality and the efficiency of a rule-based parser by the addition of a statistical component. The proposed technique relies on bigrams of pairs (word+category) selected from the homographs contained in our lexical database and computed over a large, previously tagged section of the Hansard corpus. The bigram table is used by the parser to rank and prune the set of alternatives. To evaluate the gains obtained by the hybrid system, we conducted two manual evaluations: one over a small subset of the Hansard corpus, the other with a corpus of about 50 articles taken from the magazine The Economist. In both cases, we compare analyses obtained by the parser with and without the statistical component, focusing only on one important source of mistakes, the confusion between nominal and verbal readings for ambiguous words such as announce, sets, costs, labour, etc.

Eric Wehrli, Luka Nerima
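
A sketch of how such a bigram table can rank noun/verb readings of an ambiguous word like "costs" (the probabilities are invented; the paper estimates them from a tagged portion of the Hansard corpus):

```python
from math import log

# Hypothetical (word+category) bigram log-probabilities.
bigram_logp = {
    (("the", "DET"), ("costs", "NOUN")): log(0.020),
    (("the", "DET"), ("costs", "VERB")): log(0.0001),
    (("it", "PRON"), ("costs", "VERB")): log(0.015),
    (("it", "PRON"), ("costs", "NOUN")): log(0.0002),
}

def rank_readings(prev, word, categories):
    """Order candidate categories for `word` by bigram log-probability,
    so the parser can prune the low-ranked alternatives."""
    default = log(1e-6)  # back-off for unseen bigrams
    return sorted(categories,
                  key=lambda c: bigram_logp.get((prev, (word, c)), default),
                  reverse=True)

print(rank_readings(("the", "DET"), "costs", ["NOUN", "VERB"]))  # NOUN first
print(rank_readings(("it", "PRON"), "costs", ["NOUN", "VERB"]))  # VERB first
```
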
Methodology for Connecting Nouns to Their Modifying Adjectives

Adjectives are words that describe or modify other elements in a sentence. As such, they are frequently used to convey facts and opinions about the nouns they modify. Connecting nouns to the corresponding adjectives becomes vital for intelligent tasks such as aspect-level sentiment analysis or the interpretation of complex queries (e.g., "small hotel with large rooms") for fine-grained information retrieval. To respond to this need, we propose a methodology that identifies dependencies of nouns and adjectives by looking at syntactic clues related to part-of-speech sequences that help recognize such relationships. These sequences are generalized into patterns that are used to train a binary classifier using machine learning methods. The capabilities of the new method are demonstrated in two syntactically different languages: English, the leading language of international discourse, and Hebrew, whose rich morphology poses additional challenges for parsing. In each language we compare our method with a designated, state-of-the-art parser and show that it performs similarly in terms of accuracy while: (a) our method uses a simple and relatively small training set; (b) it does not require language-specific adaptation; and (c) it is robust across a variety of writing styles.

Nir Ofek, Lior Rokach, Prasenjit Mitra
Constituency Parsing of Complex Noun Sequences in Hindi

A complex noun sequence is one in which a head noun is recursively modified by one or more bare nouns and/or genitives. Constituency analysis of a complex noun sequence is a prerequisite for finding the dependency relations (semantic relations) between components of the sequence. Identification of dependency relations is useful for various applications such as question answering, information extraction, textual entailment, and paraphrasing.

In Hindi, syntactic agreement rules can handle to a large extent the parsing of recursive genitives (Sharma, 2012) [12]. This paper implements frequency-based, corpus-driven approaches for parsing recursive genitive structures that syntactic rules cannot handle, as well as recursive compound nouns and combinations of genitive and compound noun sequences. Using syntactic rules and the dependency global algorithm, an accuracy of 92.85% is obtained.

Arpita Batra, Soma Paul, Amba Kulkarni
Amharic Sentence Parsing Using Base Phrase Chunking

Parsing plays a significant role in many natural language processing (NLP) applications, as their efficiency relies on having an effective parser. This paper presents an Amharic sentence parser developed using a base phrase chunker that groups syntactically correlated words at different levels. We use an HMM to chunk base phrases, where incorrectly chunked phrases are pruned with rules. Parsing is then performed by taking the chunking results as inputs. A bottom-up approach with a transformation algorithm is used to transform the chunker into the parser. A corpus from Amharic news outlets and books was collected for training and testing. The training and testing datasets were prepared using the 10-fold cross validation technique. Tests on the test data showed an average parsing accuracy of 93.75%.

Abeba Ibrahim, Yaregal Assabie

Anaphora Resolution

A Machine Learning Approach to Pronominal Anaphora Resolution in Dialogue Based Intelligent Tutoring Systems

Anaphora resolution is a central topic in dialogue and discourse that deals with finding the referent of a pronoun. It plays a critical role in conversational Intelligent Tutoring Systems (ITSs) as it can increase the accuracy of assessing students’ knowledge level, i.e. mental model, based on their natural language inputs. Although anaphora resolution is one of the most studied problems in Natural Language Processing, there are very few studies that focus on anaphora resolution in dialogue based ITSs. To this end, we present the Deep Anaphora Resolution Engine++ (DARE++), which adapts and extends existing machine learning solutions to resolve pronouns in ITS dialogues. Experiments showed that DARE++ achieves an F-measure of 88.93%, proving the potential of the proposed method for resolving pronouns in student-tutor dialogues.

Nobal B. Niraula, Vasile Rus
A Maximum Entropy Based Honorificity Identification for Bengali Pronominal Anaphora Resolution

This paper presents a maximum entropy based method for determining the honorific identities of personal nouns in Bengali. This information is then used in a pronoun (anaphora) resolution system for Bengali, as honorificity plays an important role in pronominal anaphora resolution in Bengali. Experiments were conducted on a publicly available dataset. Experimental results show that when the module for honorific identification is added to the existing pronoun resolution system, the accuracy (avg. F1-score) of the system improves from 0.602 to 0.703, and this improvement is shown to be statistically significant.

Apurbalal Senapati, Utpal Garain

Recognizing Textual Entailment

Statistical Relational Learning to Recognise Textual Entailment

We propose a novel approach to recognise textual entailment (RTE) following a two-stage architecture – alignment and decision – where both stages are based on semantic representations. In the alignment stage the entailment candidate pairs are represented and aligned using predicate-argument structures. In the decision stage, a Markov Logic Network (MLN) is learnt using rich relational information from the alignment stage to predict an entailment decision. We evaluate this approach using the RTE Challenge datasets. It achieves the best results for the RTE-3 dataset and shows comparable performance against the state of the art approaches for other datasets.

Miguel Rios, Lucia Specia, Alexander Gelbukh, Ruslan Mitkov
Annotation Game for Textual Entailment Evaluation

Recognizing textual entailment (RTE) is a well-defined task concerning semantic analysis. It is evaluated against a manually annotated collection of hypothesis–text pairs. A pair is annotated true if the text entails the hypothesis and false otherwise. Such a collection can be used for training or testing an RTE application only if it is large enough.

We present a game whose purpose is to collect h–t pairs. It follows a detective story narrative pattern: a brilliant detective and his slower assistant talk about the riddle to reveal the solution to readers. In the game, the detective (human player) provides a short story. The assistant (the application) proposes hypotheses that the detective judges true, false or nonsense.

Hypothesis generation is a rule-based process but the most likely hypotheses that are offered for annotation are calculated from a language model. During generation individual sentence constituents are rearranged to produce syntactically correct sentences.

The game is intended to collect data in the Czech language; however, the idea can be applied to other languages. The paper concentrates on a description of the most interesting modules from a language-independent point of view, as well as on the game elements.

Zuzana Nevěřilová

Semantics and Discourse

Axiomatizing Complex Concepts from Fundamentals

We have been engaged in the project of encoding commonsense theories of cognition, or how we think we think, in a logical representation. In this paper we use the concept of a “serious threat” as our prime example, and examine the infrastructure required for capturing the meaning of this complex concept. It is one of many examples we could have used, but it is particularly interesting because building up to this concept from fundamentals, such as causality and scalar notions, highlights a number of representational issues that have to be faced along the way, where the complexity of the target concepts strongly influences how we resolve those issues.

We first describe our approach to definition, defeasibility, and reification, where hard decisions have to be made to get the enterprise off the ground. We then sketch our approach to causality, scalar notions, goals, and importance. Finally we use all this to characterize what it is to be a serious threat. All of this is necessarily sketchy, but the key ideas essential to the target concept should be clear.

Jerry R. Hobbs, Andrew Gordon
A Semantics Oriented Grammar for Chinese Treebanking

Chinese grammar engineering has been a much debated task. Whilst semantic information has been reckoned crucial for Chinese syntactic analysis and downstream applications, existing Chinese treebanks lack a consistent and strict sentential semantic formalism. In this paper, we introduce a semantics oriented grammar for Chinese, designed to provide basic support for tasks such as automatic semantic parsing and sentence generation. It has a directed acyclic graph structure with a simple yet expressive label set, and leverages elementary predication to support logical form conversion. To our knowledge, it is the first Chinese grammar representation capable of direct transformation into logical forms.

Meishan Zhang, Yue Zhang, Wanxiang Che, Ting Liu
Unsupervised Interpretation of Eventive Propositions

This work addresses the challenge of automatically unfolding transfers of meaning in eventive propositions. For example, if we want to interpret “throw pass” in the context of sports, we need to find the object (“ball”) that transferred some semantic properties to “pass” to make it acceptable as an argument for “throw”. We propose a probabilistic model for interpreting an eventive proposition by recovering two additional coupled propositions related to the one under interpretation. We gather the statistics after building a Proposition Store from a document collection, and explore different configurations to couple propositions based on WordNet relations. These coupled propositions compose an actual interpretation of the original proposition with a precision of 0.57, but only for 18% of the samples. If we instead evaluate whether the interpretation is useful for recovering background knowledge required for interpretation, results rise to 0.71 precision and recall.

Anselmo Peñas, Bernardo Cabaleiro, Mirella Lapata
Sense-Specific Implicative Commitments

Natural language processing systems, even when given proper syntactic and semantic interpretations, still lack the common sense inference capabilities required for genuinely understanding a sentence. Recently, there have been several studies developing a semantic classification of verbs and their sentential complements, aiming at determining which inferences people draw from them. Such constructions may give rise to implied commitments that the author normally cannot disavow without being incoherent or without contradicting herself, as described for instance in the work of Karttunen. In this paper, we model such knowledge at the semantic level by attempting to associate such inferences with specific word senses, drawing on WordNet and VerbNet. This allows us to investigate to what extent the inferences apply to semantically equivalent words within and across languages.

Gerard de Melo, Valeria de Paiva
A Tiered Approach to the Recognition of Metaphor

We present a tiered approach to the recognition of metaphor. The first tier is made up of highly precise expert-driven lexico-syntactic patterns, which are automatically expanded in the second tier using lexical and dependency transformations. The final tier utilizes an SVM classifier using a variety of syntactic, semantic, and psycholinguistic features to determine if an expression is metaphoric. We focus on the recognition of metaphors in which the target is associated with the concept of “Economic Inequality” and examine the effectiveness of our approach for metaphors expressed in English, Farsi, Russian, and Spanish. Through experimental analysis we show that the proposed approach is capable of achieving 67.4% to 77.8% F-measure depending on the language.

David B. Bracewell, Marc T. Tomlinson, Michael Mohler, Bryan Rink
Knowledge Discovery with CRF-Based Clustering of Named Entities without a Priori Classes

Knowledge discovery aims at bringing out coherent groups of entities. It is usually based on clustering, which necessitates defining a notion of similarity between the relevant entities. In this paper, we propose to divert a supervised machine learning technique (namely Conditional Random Fields, widely used for supervised labeling tasks) in order to calculate, indirectly and without supervision, similarities among text sequences. Our approach consists in generating artificial labeling problems on the data to reveal regularities between entities through their labeling. We describe how this framework can be implemented and experiment with it on two information extraction/discovery tasks. The results demonstrate the usefulness of this unsupervised approach and open many avenues for defining similarities for complex representations of textual data.

Vincent Claveau, Abir Ncibi
Semi-supervised SRL System with Bayesian Inference

We propose a new approach to perform semi-supervised training of Semantic Role Labeling models with a very small amount of initial labeled data. The proposed approach combines supervised and unsupervised training in a novel way, by forcing the supervised classifier to overgenerate potential semantic candidates and then letting unsupervised inference choose the best ones. Hence, the supervised classifier can be trained on a very small corpus and with coarse-grained features, because its precision does not need to be high: its role is mainly to constrain Bayesian inference to explore only a limited part of the full search space. This approach is evaluated on French and English. In both cases, it achieves very good performance and outperforms a strong supervised baseline when only a small number of annotated sentences is available, even without using any previously trained syntactic parser.

Alejandra Lorenzo, Christophe Cerisara
A Sentence Similarity Method Based on Chunking and Information Content

This paper introduces a method for assessing the semantic similarity between sentences, which relies on the assumption that the meaning of a sentence is captured by its syntactic constituents and the dependencies between them. We obtain both the constituents and their dependencies from a syntactic parser. Our algorithm considers that two sentences have the same meaning if it can find a good mapping between their chunks and also if the chunk dependencies in one text are preserved in the other. Moreover, the algorithm takes into account that every chunk has a different importance with respect to the overall meaning of a sentence, which is computed based on the information content of the words in the chunk. The experiments conducted on a well-known paraphrase data set show that the performance of our method is comparable to state of the art.

Dan Ştefănescu, Rajendra Banjade, Vasile Rus
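
A sketch of the IC-weighted chunk mapping idea (the chunking, corpus counts, and the greedy alignment below are simplifying assumptions; the paper obtains chunks from a syntactic parser and also checks chunk dependencies):

```python
from math import log

# Toy corpus frequencies used to compute information content (IC).
corpus_freq = {"a": 1000, "the": 1200, "man": 40, "guitar": 5,
               "plays": 30, "strums": 2}
TOTAL = sum(corpus_freq.values())

def info_content(word):
    return -log(corpus_freq.get(word, 1) / TOTAL)

def chunk_weight(chunk):
    """A chunk's importance: the information content of its words."""
    return sum(info_content(w) for w in chunk)

def chunk_sim(c1, c2):
    overlap = set(c1) & set(c2)
    denom = max(chunk_weight(c1), chunk_weight(c2))
    return sum(info_content(w) for w in overlap) / denom if denom else 0.0

def sentence_sim(chunks1, chunks2):
    """Greedy best-match mapping of chunks, weighted by IC importance."""
    total_w = sum(chunk_weight(c) for c in chunks1)
    score = sum(chunk_weight(c) * max(chunk_sim(c, c2) for c2 in chunks2)
                for c in chunks1)
    return score / total_w

s1 = [["the", "man"], ["plays"], ["a", "guitar"]]
s2 = [["a", "man"], ["strums"], ["the", "guitar"]]
print(round(sentence_sim(s1, s2), 3))
```
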
An Investigation on the Influence of Genres and Textual Organisation on the Use of Discourse Relations

In this paper, we investigate some of the problems associated with the automatic extraction of discourse relations. In particular, we study the influence of the communicative goals encoded in one genre against another, and between the various communicative goals encoded in different sections of documents of the same genre. Some investigations have been made in the past in order to identify the differences seen across either genres or textual organization, but none have made a thorough statistical analysis of these differences across currently available annotated corpora. In this paper, we show that both the communicative goal of a given genre and, to a lesser extent, that of a particular topic tackled by that genre do in fact influence the distribution of discourse relations. Using a statistically grounded approach, we show that certain discourse relations are more likely to appear within given genres and subsequently within sections within a genre. In particular, we observed that Attribution relations are common in the newspaper article genre, while Joint relations are comparatively more frequent in online reviews. We also notice that Temporal relations are statistically more common in the methodology sections of scientific research documents than in the rest of the text. These results are important as they give clues to allow the tailoring of current discourse taggers to specific textual genres.

Félix-Hervé Bachand, Elnaz Davoodi, Leila Kosseim
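
The statistical grounding can be illustrated with a chi-square test of independence between genre and relation counts (the counts below are invented; the paper works with existing annotated discourse corpora):

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table of discourse-relation counts per genre.
#                 Attribution  Joint  Temporal
counts = [
    [530, 210, 120],   # newspaper articles
    [140, 380,  90],   # online reviews
    [ 60,  70, 150],   # scientific methodology sections
]

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.2e}")
# A small p-value indicates the relation distribution depends on genre;
# per-cell residuals (observed - expected) show which relations drive it.
```
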
Discourse Tagging for Indian Languages

The Indian Language Discourse Project aims to develop a large corpus annotated with various types of explicit and implicit discourse relations. As an initial step towards this, we have annotated corpora in three languages, Hindi, Tamil and Malayalam, belonging to the two major language families in India: Indo-Aryan and Dravidian. In this paper we describe our initial experiments in annotating the three language corpora; the domain of the corpora is health. The initial experiments brought out the various types of discourse connectives in the three languages and how they vary among the languages. The preliminary study itself revealed that there is cross-linguistic variation among the three languages. We report the inter-annotator agreement for all three languages.

Sobha Lalitha Devi, S. Lakshmi, Sindhuja Gopalan

Natural Language Generation

Classification-Based Referring Expression Generation

This paper presents a study in the field of Natural Language Generation (NLG), focusing on the computational task of referring expression generation (REG). We describe a standard REG implementation based on the well-known Dale & Reiter Incremental algorithm, and a classification-based approach that combines the output of several support vector machines (SVMs) to generate definite descriptions from two publicly available corpora. Preliminary results suggest that the SVM approach generally outperforms incremental generation, which paves the way to further research on machine learning methods applied to the task.

Thiago Castro Ferreira, Ivandré Paraboni
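
For reference, a compact rendering of the Dale & Reiter Incremental algorithm that serves as the baseline above (the domain objects, attributes, and preference order are toy assumptions):

```python
# Each domain object is described by attribute-value pairs.
domain = {
    "d1": {"type": "dog", "colour": "black", "size": "large"},
    "d2": {"type": "dog", "colour": "white", "size": "large"},
    "d3": {"type": "cat", "colour": "black", "size": "small"},
}
PREFERENCE = ["type", "colour", "size"]  # fixed attribute ordering

def incremental(target, distractors):
    """Build a distinguishing description by adding preferred attributes
    whenever they rule out at least one remaining distractor."""
    description = {}
    remaining = set(distractors)
    for attr in PREFERENCE:
        value = domain[target][attr]
        ruled_out = {d for d in remaining if domain[d][attr] != value}
        if ruled_out:  # include the attribute only if it helps
            description[attr] = value
            remaining -= ruled_out
        if not remaining:
            return description
    return None  # no distinguishing description exists

print(incremental("d1", {"d2", "d3"}))  # {'type': 'dog', 'colour': 'black'}
```
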
Generating Relational Descriptions Involving Mutual Disambiguation

This paper discusses the generation of relational referring expressions in which target and landmark descriptions are allowed to help disambiguate each other. Using a corpus of referring expressions in a simple visual domain - in which these descriptions are likely to occur - we propose a classification approach to decide when to generate them. The classifier is then embedded in a REG algorithm whose results outperform a number of naive baseline systems, suggesting that mutual disambiguation is fairly common in language use, and that this may not be entirely accounted for by existing REG algorithms.

Caio V. M. Teixeira, Ivandré Paraboni, Adriano S. R. da Silva, Alan K. Yamasaki
Bayesian Inverse Reinforcement Learning for Modeling Conversational Agents in a Virtual Environment

This work proposes a Bayesian approach to learn the behavior of human characters that give advice and help users to complete tasks in a situated environment. We apply Bayesian Inverse Reinforcement Learning (BIRL) to infer this behavior in the context of a serious game, given evidence in the form of stored dialogues provided by experts who play the role of several conversational agents in the game. We show that the proposed approach converges relatively quickly and that it outperforms two baseline systems, including a dialogue manager trained to provide “locally” optimal decisions.

Lina M. Rojas-Barahona, Christophe Cerisara
Learning to Summarize Time Series Data

In this paper we focus on content selection for summarizing time series data using Machine Learning techniques. The goal is to exploit a parallel corpus to predict the appropriate level of abstraction required for a summarization task. This is an important step towards building an automated NLG (Natural Language Generation) system to generate text for unseen data. Machine learning approaches are used to induce the underlying rules for text summarization, which are potentially close to the ones that humans use to generate textual summaries. We present an approach to select important points in a time series that can aid in generating captions or textual summaries. We evaluate our techniques on a parallel corpus of human generated weather forecast text corresponding to numerical weather prediction data.

Pranay Kumar Venkata Sowdaboina, Sutanu Chakraborti, Somayajulu Sripada
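
One simple, hand-crafted baseline for the content-selection step (the paper instead learns the selection from a parallel weather-forecast corpus) is to keep the points that deviate most from a linear interpolation of those already selected:

```python
def select_points(values, k=4):
    """Greedily pick k salient indices: the endpoints, then repeatedly the
    point farthest from the straight line between its surrounding picks."""
    selected = {0, len(values) - 1}
    while len(selected) < k:
        anchors = sorted(selected)
        best_i, best_err = None, -1.0
        for lo, hi in zip(anchors, anchors[1:]):
            for i in range(lo + 1, hi):
                # linear interpolation between the surrounding anchors
                t = (i - lo) / (hi - lo)
                interp = values[lo] * (1 - t) + values[hi] * t
                err = abs(values[i] - interp)
                if err > best_err:
                    best_i, best_err = i, err
        if best_i is None:
            break
        selected.add(best_i)
    return sorted(selected)

wind_speed = [5, 6, 14, 13, 12, 7, 6, 6, 9, 15]  # toy time series
idx = select_points(wind_speed)
print([(i, wind_speed[i]) for i in idx])  # points a summary should mention
```
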
Backmatter
Metadata
Title
Computational Linguistics and Intelligent Text Processing
edited by
Alexander Gelbukh
Copyright year
2014
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-54906-9
Print ISBN
978-3-642-54905-2
DOI
https://doi.org/10.1007/978-3-642-54906-9