
2001 | Book

Computational Linguistics and Intelligent Text Processing

Second International Conference, CICLing 2001, Mexico City, Mexico, February 18–24, 2001, Proceedings

Edited by: Alexander Gelbukh

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

CICLing 2001 is the second annual Conference on Intelligent text processing and Computational Linguistics (hence the name CICLing); see www.CICLing.org. It is intended to provide a balanced view of cutting-edge developments in both the theoretical foundations of computational linguistics and the practice of natural language text processing with its numerous applications. A feature of the CICLing conferences is their wide scope, covering nearly all areas of computational linguistics and all aspects of natural language processing applications. The conference is a forum for dialogue between the specialists working in these two areas. This year our invited speakers were Graeme Hirst (U. Toronto, Canada), Sylvain Kahane (U. Paris 7, France), and Ruslan Mitkov (U. Wolverhampton, UK). They delivered excellent extended lectures and organized vivid discussions. A total of 72 submissions were received, nearly all of surprisingly high quality. After careful reviewing, the Program Committee selected 53 of them for presentation, 41 as full papers and 12 as short papers, by 98 authors from 19 countries: Spain (19 authors), Japan (15), USA (12), France and Mexico (9 each), Sweden (6), Canada, China, Germany, Italy, Malaysia, Russia, and United Arab Emirates (3 each), Argentina (2), and Bulgaria, The Netherlands, Ukraine, UK, and Uruguay (1 each).

Table of contents

Frontmatter

Computational Linguistics

Computational Linguistic Theories

What Is a Natural Language and How to Describe It? Meaning-Text Approaches in Contrast with Generative Approaches

The paper expounds the general conceptions of the Meaning-Text theory about what a natural language is and how it must be described. In a second part, a formalization of these conceptions - the transductive grammars - is proposed and compared with generative approaches.

Sylvain Kahane
A Fully Lexicalized Grammar for French Based on Meaning-Text Theory

The paper presents a formal lexicalized dependency grammar based on Meaning-Text theory. This grammar associates semantic graphs with sentences. We propose a fragment of a grammar for French, including the description of extractions. The main particularity of our grammar is that it builds bubble trees as syntactic representations, that is, trees whose nodes can be filled by bubbles, which can contain other nodes. Our grammar needs more complex operations for combining elementary structures than other lexicalized grammars, such as TAG or CG, but avoids the multiplication of elementary structures and provides linguistically well-motivated treatments.

Sylvain Kahane
Modeling the Level of Involvement of Verbal Arguments

The purpose of this paper is to discuss the output of a neural network based on a linguistic model for recognizing the levels of involvement of different verbal arguments, assuming the non-discreteness of thematic relations and their non-primitiveness in linguistic theory. The network’s output called for hypothesizing that, contrary to the received view, there is no equal level of involvement in verbal arguments, one having to be always more involved than the others, even when having the same number of Proto-Agent and Proto-Patient contributing properties.

Leo Ferres
Magical Number Seven Plus or Minus Two: Syntactic Structure Recognition in Japanese and English Sentences

George A. Miller said that human beings have only seven chunks in short-term memory, plus or minus two. We counted the number of bunsetsus (phrases) whose modifiees are undetermined in each step of an analysis of the dependency structure of Japanese sentences, and which therefore must be stored in short-term memory. The number was roughly less than nine, the upper bound of seven plus or minus two. We also obtained similar results with English sentences under the assumption that human beings recognize a series of words, such as a noun phrase (NP), as a unit. This indicates that if we assume that the human cognitive units in Japanese and English are bunsetsu and NP respectively, analysis will support Miller’s 7 ± 2 theory.

Masaki Murata, Kiyotaka Uchimoto, Qing Ma, Hitoshi Isahara
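
The counting idea is easy to make concrete. Below is a minimal sketch (ours, not the authors' code; the head-index encoding and function name are assumptions) of the memory-load measure: at each position of a left-to-right scan, it counts the bunsetsus whose modifiee has not yet appeared and which therefore occupy short-term memory.

    def memory_load(heads):
        """heads[i] = index of the bunsetsu that bunsetsu i modifies;
        in Japanese, modifiers precede their modifiees, so heads point
        rightward (the sentence-final bunsetsu gets head -1)."""
        return [sum(1 for j in range(i + 1) if heads[j] > i)
                for i in range(len(heads))]

    # A chain of modifications keeps the load at 1 throughout, while
    # fully nested dependencies push it up to n - 1:
    print(memory_load([1, 2, 3, 4, -1]))  # [1, 1, 1, 1, 0]
    print(memory_load([4, 4, 4, 4, -1]))  # [1, 2, 3, 4, 0]

The paper's observation is that, over real corpora, the maximum of this quantity stays roughly below nine, the upper bound of 7 ± 2.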

Semantics

Spatio-temporal Indexing in Database Semantics

In logic, the spatio-temporal location of a proposition is characterized precisely within a Cartesian system of space and time coordinates. This is suitable for characterizing the truth value of propositions relative to possible worlds, but not for modeling the spatio-temporal orientation of natural cognitive agents. This paper presents an alternative approach to representing space and time. While on the same level of abstraction as the logical approach, it is designed for an analysis of spatio-temporal inferences in humans. Such an analysis is important for modeling natural language communication because spatio-temporal information is constantly coded into language by the speaker and decoded by the hearer. Starting from the spatio-temporal characterization of direct observation in cognitive agents without language, the speaker’s coding of spatio-temporal information into language is analyzed, followed by the hearer’s reconstruction of this location. These procedures of transferring spatio-temporal information from the speaker to the hearer are embedded in the general structure of database semantics.

Roland Hausser
Russellian and Strawsonian Definite Descriptions in Situation Semantics

In this paper I give two alternatives for representing the definite descriptions - Russellian and Strawsonian approaches realized in situation semantics. I show how the Russellian treatment of the definites can be made “referential”, while preserving its original generalized quantifier mode. The Strawsonian representation is substantially referential with a presuppositional effect of the restriction imposed over the parametric representative of the potential referent of the definite description. The situation-theoretical restricted parameters, speakers’ reference functions in particular contexts of utterances, and the notion of a resource situation for evaluating NPs are the key formal tools used. This work does not reject any of the two accounts in favor of the other, but shows how both approaches give better results in a theory which models partiality of the linguistic information, discourse dependency and, in particular, speakers’ references. In both approaches, the prototypical definite NPs get appropriate “parametric” interpretations.

Roussanka Loukanova
Treatment of Personal Pronouns Based on Their Parameterization

Personal pronouns of several European languages are re-demarcated and parameterized by means of their morphological, syntactic, and semantic features. A universal semantic representation of personal pronouns as vectors of a few semantic components is proposed. The main semantic components are implied by syntactic features: grammatical person, number, and gender. The results of parameterization are applied to morpho-syntactic and semantic agreement, translation of pronouns from one language to another, and disambiguation of several verbal constructions.

Igor A. Bolshakov
Modeling Textual Context in Linguistic Pattern Matching

We present a model to describe linguistic patterns regarding their textual contexts. The description of contexts takes into account all the information about text structure and contents. The model is an algebra that manipulates arbitrary regions of text. We show how to use the model to describe linguistic patterns and contextual rules in order to identify semantic information among parts of text.

Slim Ben Hazez
Statistical Methods in Studying the Semantics of Size Adjectives

The present study deals with a statistical analysis of the lexico-semantic group of size adjectives (great, big, large, little, small, and the like). The statistical procedures and methods, as well as the mathematical formulae used, reveal the integrative and differential semes, along with the lexico-semantic links found in the microsystem.

Valentyna Arkhelyuk
Numerical Model of the Strategy for Choosing Polite Expressions

Japanese speakers often use different expressions having the same meaning but with different levels of politeness when speaking to different listeners. Brown and Levinson proposed a theory to explain how the expression to use is selected, but it is only qualitative. We propose a numerical model, constructed using the quantification-I method and multiple regression analysis, for predicting the expressions selected in various relationships such as familiarity and relative social power between speaker and listener. The politeness of each expression is quantified by the method of paired comparison.

Tamotsu Shirado, Hitoshi Isahara
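
As a rough illustration of the regression step, here is a least-squares fit of quantified politeness scores against relationship features; the feature names and all numbers are invented for the example, and the paper's quantification-I analysis is only approximated here by plain multiple regression.

    import numpy as np

    # rows: (familiarity, listener's relative social power); targets are
    # politeness scores obtained by paired comparison (values invented)
    X = np.array([[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.8], [0.5, 0.5]])
    y = np.array([0.2, 0.3, 0.9, 0.8, 0.5])

    A = np.column_stack([X, np.ones(len(X))])     # add an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # multiple regression fit
    print(coef)      # weights for familiarity, power, and the intercept
    print(A @ coef)  # predicted politeness for the training pairs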

Anaphora and Reference

Outstanding Issues in Anaphora Resolution

This paper argues that even though there has been considerable advance in anaphora resolution research over the last 10 years, there are still a number of outstanding issues. The paper discusses several of these issues and outlines some of the work underway to address them, with particular reference to the work carried out by the author’s research team.

Ruslan Mitkov
PHORA: A NLP System for Spanish

In this paper we present a whole Natural Language Processing (NLP) system for Spanish. The core of this system is the parser, which uses the grammatical formalism Lexical-Functional Grammar (LFG). The system uses the Specification Marks Method in order to resolve lexical ambiguity. Another important component of this system is the anaphora resolution module. To solve the anaphora, this module contains a method based on linguistic information (lexical, morphological, syntactic and semantic), structural information (the anaphoric accessibility space in which the anaphor obtains its antecedent) and statistical information. This method is based on constraints and preferences and solves pronouns and definite descriptions. Moreover, this system fits dialogue and non-dialogue discourse features. The anaphora resolution module uses several resources, such as a lexical database (Spanish WordNet) to provide semantic information and a POS tagger providing the part of speech for each word and its root to make this resolution process easier.

Manuel Palomar, Maximiliano Saiz-Noeda, Rafael Muñoz, Armando Suárez, Patricio Martínez-Barco, Andrés Montoyo
Belief Revision on Anaphora Resolution

The aim of this article is to describe a new approach to solving the anaphora problem. We focus on anaphora resolution by means of a belief revision process.

Sandra Roger
A Machine-Learning Approach to Estimating the Referential Properties of Japanese Noun Phrases

The referential properties of noun phrases in the Japanese language, which has no articles, are useful for article generation in Japanese-to-English machine translation and for anaphora resolution in Japanese noun phrases. They are generally classified as generic noun phrases, definite noun phrases, and indefinite noun phrases. In previous work, referential properties were estimated by developing rules that used clue words. If two or more rules were in conflict with each other, the category having the maximum total score given by the rules was selected as the desired category. The score given by each rule was established by hand, so the manpower cost was high. In this work, we automatically adjusted these scores by using a machine-learning method and succeeded in reducing the amount of manpower needed to adjust these scores.

Masaki Murata, Kiyotaka Uchimoto, Qing Ma, Hitoshi Isahara
The Referring Expressions in the Other’s Comment

The linguistic facts required for artificial intelligence to be able to distinguish between text and metatext include expressions referring to “text about text” insertions. Such data can be provided by discourse linguistics. The present paper offers a list of expressions used to identify text about text in the current conversation and outlines other semantic and pragmatic phenomena which may constitute the individual profile of the system designed to process the other’s comment containing those expressions.

Tamara Matulevich

Disambiguation

Lexical Semantic Ambiguity Resolution with Bigram-Based Decision Trees

This paper presents a corpus-based approach to word sense disambiguation where a decision tree assigns a sense to an ambiguous word based on the bigrams that occur nearby. This approach is evaluated using the sense-tagged corpora from the 1998 SENSEVAL word sense disambiguation exercise. It is more accurate than the average results reported for 30 of 36 words, and more accurate than the best results for 19 of 36 words.

Ted Pedersen
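
A minimal sketch of the setup, with present-day scikit-learn standing in for the paper's decision tree learner and a toy sense-tagged sample in place of the SENSEVAL data:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.tree import DecisionTreeClassifier

    # contexts of the ambiguous word "bank", with their sense tags
    contexts = ["deposit money in the bank account",
                "fishing from the bank of the river",
                "the bank raised its interest rates",
                "we sat on the river bank at dusk"]
    senses = ["finance", "river", "finance", "river"]

    # binary features: which bigrams occur near the target word
    bigrams = CountVectorizer(ngram_range=(2, 2), binary=True)
    X = bigrams.fit_transform(contexts)

    tree = DecisionTreeClassifier().fit(X, senses)
    test = bigrams.transform(["she opened a bank account yesterday"])
    print(tree.predict(test))  # sense chosen from the nearby bigrams
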
Interpretation of Compound Nominals Using WordNet

We describe an approach to interpreting noun-noun compounds within a question answering system. The system’s lexicon, based on WordNet, provides the basis for heuristics that group noun-noun compounds with semantically similar words. The semantic relationship between the nouns in a compound is determined by the choice of heuristic for the compound. We discuss procedures for selecting one heuristic in cases where several can apply to a compound, the effects of lexical ambiguity, and some initial results of our methods.

Leslie Barrett, Anthony R. Davis, Bonnie J. Dorr
Specification Marks for Word Sense Disambiguation: New Development

This paper presents a new method for the automatic resolution of the lexical ambiguity of nouns in English texts. This new method is made up of three new heuristics, which improve the previous Specification Marks Method. These heuristics rely on the use of the gloss in the semantic relations (hypernymy and hyponymy) and the hierarchic organization of WordNet. An evaluation of the new method was done on both the Semantic Concordance corpus (Semcor) [7] and the Microsoft Encarta 98 Encyclopaedia Deluxe. The percentages of correct resolutions were 66.2% for Semcor and 66.3% for Encarta, respectively. These percentages show that successful results can be obtained on corpora from different domains, and therefore our proposed method can be applied successfully to any corpus.

Andrés Montoyo, Manuel Palomar
Three Mechanisms of Parser Driving for Structure Disambiguation

Structural ambiguity is one of the most difficult problems in natural language processing. Two disambiguation mechanisms for unrestricted text analysis are commonly used: lexical knowledge and context considerations. Our parsing method includes three different mechanisms to reveal syntactic structures and an additional voting module to obtain the most probable structures for a sentence. The developed tools do not require any tagging or syntactic marking of texts.

Sofía N. Galicia-Haro, Alexander Gelbukh, Igor A. Bolshakov

Translation

Recent Research in the Field of Example-Based Machine Translation

Machine Translation (MT) is a multi-disciplinary, fast evolving research domain, which makes use of almost all computational methods known in artificial intelligence. Paradigms as different as logic programming, case-based reasoning, genetic algorithms, artificial neural networks, and probabilistic and statistical methods have all been employed for the translation task. However, none of the proposed methods has yet led to overall satisfactory results and given birth to a universal machine translation engine. Instead, the search for what translation quality and what coverage of the system one would realistically need, what methods and knowledge resources are required for that end, and how much one is willing to invest, is a research domain in itself. Until the end of the eighties, MT was strongly dominated by rule-based systems which deduce translations of natural language texts based on a bilingual lexicon and a grammar formalism. Source language sentences were analyzed with a monolingual grammar and the source language representation was mapped into the target language by means of a transfer grammar. The target representations were then refined and adapted to the target language requirements. Since the beginning of the nineties, huge corpora of bilingual translated texts have been made available in computer-readable format. This empirical knowledge has given rise to a new paradigm in machine translation, that of corpus-based machine translation. Corpus-based machine translation (CBMT) systems make use of reference translations which are assumed to be ideal with respect to text type and domain, target reader, and intention.

Michael Carl
Intelligent Case Based Machine Translation System

The Interactive Hybrid Strategies Machine Translation (IHSMT) system has been designed to solve translation problems. It forms an interdependent cooperative relation between human and machine through interaction. The system achieves hybrid-strategy translation by synthesizing rule-based reasoning and case-based reasoning, and therefore overcomes the demerits of a single strategy. This paper reports work on the learning mechanism of this system and proposes a learning model of human-machine tracking and memorizing (HMTM). This model stores the information from human-machine interaction in a memory base as cases for machine learning, and then gradually accumulates knowledge to improve the intelligence of the MT system.

Wang JianDe, Chen ZhaoXiong, Huang HeYan
A Hierarchical Phrase Alignment from English and Japanese Bilingual Text

We propose a phrase alignment method that aims to acquire translation knowledge automatically from bilingual text. Here, phrase alignment refers to the extraction of equivalent partial word sequences between bilingual sentences. We used the term phrase alignment since these word sequences include not only words but also noun phrases, verb phrases, relative clauses, and so on. We consider English and Japanese as the target languages.

Kenji Imamura

Text Generation

Title Generation Using a Training Corpus

This paper discusses fundamental issues involved in word selection for title generation. We review several methods for title generation, namely extractive summarization and two versions of a Naïve Bayesian approach, and compare the performance of these methods using an F1 metric. In addition, we introduce a novel approach to title generation using the k-nearest neighbor (KNN) algorithm. Both the KNN method and a limited-vocabulary Naïve Bayesian method outperform the other evaluated methods, with an F1 score of around 20%. Since KNN produces complete and legible titles, we conclude that KNN is a very promising method for title generation, provided good content overlap exists between the training corpus and the test documents.

Rong Jin, Alexander G. Hauptmann
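
A rough sketch of the KNN step under stated assumptions (tf-idf vectors, cosine similarity, k = 1, and simply reusing the nearest training document's title; the documents and titles are invented):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    train_docs = ["stock markets fell sharply on inflation fears",
                  "the home team won the championship in extra time"]
    train_titles = ["Markets Tumble on Inflation Fears",
                    "Champions in Extra Time"]

    vec = TfidfVectorizer()
    M = vec.fit_transform(train_docs)

    def knn_title(doc, k=1):
        sims = cosine_similarity(vec.transform([doc]), M)[0]
        nearest = np.argsort(sims)[::-1][:k]   # indices of the k nearest
        return train_titles[nearest[0]]        # k = 1: borrow that title

    print(knn_title("inflation worries sent stocks lower"))
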
A New Approach in Building a Corpus for Natural Language Generation Systems

One of the main difficulties in building NLG systems is producing a good requirement specification. One way to face this problem is by using a corpus to show the client the features of the system to be developed. In this paper we describe a method to elaborate such a corpus and how it can be used for a particular system. This method consists of five steps: text collection, input determination, text and input analysis, corpus construction, and pattern extraction.

M del Socorro Bernardos Galindo, Guadalupe Aguado de Cea
A Study on Text Generation from Non-verbal Information on 2D Charts

This paper describes a text generation method to explain non-verbal information with verbal information, i.e., natural language. As non-verbal information we use two-dimensional numerical information, whose behavior is explained in natural language. To bridge the gap between non-verbal information and verbal information, I apply fuzzy set theory to translate quantitative information into qualitative information. The proposed method is verified by showing the generation process of some example texts which explain the behavior of a line chart.

Ichiro Kobayashi
Interactive Multilingual Generation

The paper presents interactive multilingual generation, which in most cases is a viable alternative to machine translation and automatic generation. The idea is to automatically produce a preliminary version of the text, and allow the user to modify it in his native language. The software then produces the other languages following the user’s modifications. This technique has been used in an international project and validated in operation on a dozen European sites.

José Coch, Karine Chevreau
A Computational Feature Analysis for Multilingual Character-to-Character Dialogue

Natural language generation systems to date have concentrated on the tasks of explanation generation, tutorial dialogue, automated software documentation, and similar technical tasks. A largely unexplored area is narrative prose generation, or the production of texts used in stories such as novels, mysteries, and fairy tales. We present a feature analysis of one complex area of NLG found in narrative prose but not in technical generation tasks: character-to-character dialogue. This analysis has enabled us to modify a surface realization system to include the necessary features that dialogue requires and thus to write the types of texts found in narratives.

Charles Callaway

Dictionaries and Corpora

Experiments on Extracting Knowledge from a Machine-Readable Dictionary of Synonym Differences

In machine translation and natural language generation, making the wrong word choice from a set of near-synonyms can be imprecise or awkward, or convey unwanted implications. Using Edmonds’s model of lexical knowledge to represent clusters of near-synonyms, our goal is to automatically derive a lexical knowledge-base from the Choose the Right Word dictionary of near-synonym discrimination. We do this by automatically classifying sentences in this dictionary according to the classes of distinctions they express. We use a decision-list learning algorithm to learn words and expressions that characterize the classes DENOTATIONAL DISTINCTIONS and ATTITUDE-STYLE DISTINCTIONS. These results are then used by an extraction module to actually extract knowledge from each sentence. We also integrate a module to resolve anaphors and word-to-word comparisons. We evaluate the results of our algorithm for several randomly selected clusters against a manually built standard solution, and compare them with the results of a baseline algorithm.

Diana Zaiu Inkpen, Graeme Hirst
Recognition of Author’s Scientific and Technical Terms

The intensive use of terms of a specific terminology is admittedly one of the most distinguishing features of scientific and technical (sci-tech) texts. We propose to categorize the terms as either dictionary units of generally accepted terminology or new terms introduced into a sci-tech text to facilitate the description of the author’s new concepts; we call such terms author’s terms. The paper discusses issues concerned with author’s terms, including frequent ways of defining and using them in sci-tech texts. It is argued that recognition of author’s sci-tech terms is important for NLP applications such as computer-aided scientific and literary editing, automatic text abstracting and summarization, or the acquisition of expert knowledge. A sketch of the procedure for recognizing author’s terms with a certain degree of accuracy is given.

Elena I. Bolshakova
Lexical-Semantic Tagging of an Italian Corpus

Semantically tagged corpora are becoming an urgent need for training and evaluation within many applications. They are also the natural accompaniment of semantic lexicons, for which they constitute both a useful testbed to evaluate their adequacy and a repository of corpus examples for the attested senses. It is essential that sound criteria are defined for their construction and a specific methodology is set up for the treatment of various semantic phenomena. We present some observations and results concerning the lexical-semantic tagging of an Italian corpus within the framework of two projects: the ELSNET feasibility study, part of a preparatory phase started with Senseval/Romanseval, and an Italian National Project (TAL), where one of the components is the lexical-semantic annotation of larger quantities of texts for an Italian syntactic-semantic Treebank. The results of the ELSNET experiment have been of utmost importance for the definition of the technical guidelines for the lexical-semantic level of annotation of the Treebank.

Nicoletta Calzolari, Ornella Corazzari, Antonio Zampolli
Meaning Sort — Three Examples: Dictionary Construction, Tagged Corpus Construction, and Information Presentation System —

It is often useful to sort words into an order that reflects relations among their meanings as obtained by using a thesaurus. In this paper, we introduce a method of arranging words semantically by using several types of ‘is-a’ thesauri and a multi-dimensional thesaurus. We also describe three major applications where a meaning sort is useful and show the effectiveness of a meaning sort. Since there is no doubt that a word list in meaning-order is easier to use than a word list in some random order, a meaning sort, which can easily produce a word list in meaning-order, must be useful and effective.

Masaki Murata, Kyoko Kanzaki, Kiyotaka Uchimoto, Qing Ma, Hitoshi Isahara
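
The core idea can be shown in a few lines: sort by thesaurus category code rather than by spelling, so that semantically related words end up adjacent. The codes below are invented for illustration; in the paper they would come from ‘is-a’ thesauri:

    thesaurus_code = {   # illustrative is-a category codes, not a real thesaurus
        "sparrow": "animal.bird.small",
        "eagle":   "animal.bird.raptor",
        "trout":   "animal.fish",
        "oak":     "plant.tree",
        "pine":    "plant.tree",
    }
    words = ["oak", "trout", "sparrow", "pine", "eagle"]
    print(sorted(words))                                   # alphabetical order
    print(sorted(words, key=lambda w: thesaurus_code[w]))  # meaning order
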
Converting Morphological Information Using Lexicalized and General Conversion

Today, many kinds of tagged corpora are available for research use. Often a different morphological system is used in each corpus. This makes it difficult to merge different types of morphological information, since conversion between different systems is complex and necessitates an understanding of both systems. This paper describes a method of converting morphological information between two different systems by using lexicalized and general conversion. The difference between lexicalized and general conversion is the existence or absence of a lexicalized condition. Which conversion is applied depends on the frequency of segments.

Mitsuo Shimohata, Eiichiro Sumita
Zipf and Heaps Laws’ Coefficients Depend on Language

We observed that the coefficients of two important empirical statistical laws of language - Zipf’s law and Heaps’ law - are different for different languages, as we illustrate with English and Russian examples. This may have both theoretical and practical implications. On the one hand, the reasons for this may shed light on the nature of language. On the other hand, these two laws are important in, say, full-text database design, where they allow predicting the index size.

Alexander Gelbukh, Grigori Sidorov
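
For reference, the two laws in their usual parametric form (standard formulations, not reproduced from the paper); the claim is that the exponents and constants below differ between, e.g., English and Russian:

    f(r) \approx \frac{C}{r^{\alpha}}   % Zipf's law: frequency of the r-th most
                                        % frequent word; classically alpha is near 1
    V(n) \approx K \, n^{\beta}         % Heaps' law: vocabulary size after n
                                        % running words; 0 < beta < 1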

Morphology

Applying Productive Derivational Morphology to Term Indexing of Spanish Texts

This paper deals with the application of natural language processing techniques to the field of information retrieval. To be precise, we propose the application of morphological families for single-term conflation in order to reduce the linguistic variety of indexed documents written in Spanish. A system for the automatic generation of morphological families by means of Productive Derivational Morphology is discussed. The main characteristics of this system are the use of a minimum of linguistic resources, a low computational cost, and independence with respect to the indexing engine.

Jesús Vilares, David Cabrero, Miguel A. Alonso
Unification-Based Lexicon and Morphology with Speculative Feature Signalling

The implementation of a unification-based lexicon is discussed, as well as the morphological rules needed for mapping between the lexicon and grammar. It is shown how different feature usages can be utilized in the implementation to reach the intended surface word-form matches, with the correct feature settings. A novelty is the way features are used as binary signals in the compositional morphology. Finally, the lexicon coverage of the closed-class words and of the different types of open-class words is evaluated.

Björn Gambäck
A Method of Pre-computing Connectivity Relations for Japanese/Korean POS Tagging

This paper presents an efficient dictionary structure for Part-of-Speech (POS) tagging of Japanese/Korean, obtained by extending Aho and Corasick’s pattern matching machine. The proposed method is a simple and fast algorithm that finds all possible morphemes in an input sentence in a single pass, and it stores the relations of grammatical connectivity of neighboring morphemes in the output functions. Therefore, the proposed method can reduce the costs of both the dictionary lookup and the connection check needed to find the most suitable word segmentation. Simulation results show that the proposed method was 21.8% faster (CPU time) than the general approach using the trie structure. Concerning the number of candidates for checking connections, it was 27.4% less than that of the original morphological analysis.

Kazuaki Ando, Tae-hun Lee, Masami Shishibori, Jun-ichi Aoe
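
To make the pattern-matching backbone concrete, here is a compact Aho-Corasick sketch (ours) that finds all dictionary entries in one left-to-right pass; the paper's extension additionally stores grammatical-connectivity data in the output function, which this sketch omits:

    from collections import deque

    def build(words):
        goto, fail, out = [{}], [0], [set()]
        for w in words:                        # build the trie (goto function)
            s = 0
            for ch in w:
                if ch not in goto[s]:
                    goto.append({})
                    fail.append(0)
                    out.append(set())
                    goto[s][ch] = len(goto) - 1
                s = goto[s][ch]
            out[s].add(w)
        q = deque(goto[0].values())            # breadth-first failure links
        while q:
            s = q.popleft()
            for ch, t in goto[s].items():
                q.append(t)
                f = fail[s]
                while f and ch not in goto[f]:
                    f = fail[f]
                fail[t] = goto[f].get(ch, 0)
                out[t] |= out[fail[t]]         # inherit shorter suffix matches
        return goto, fail, out

    def find_all(text, machine):
        goto, fail, out = machine
        s, hits = 0, []
        for i, ch in enumerate(text):
            while s and ch not in goto[s]:
                s = fail[s]
            s = goto[s].get(ch, 0)
            hits += [(i - len(w) + 1, w) for w in out[s]]
        return hits

    m = build(["he", "she", "his", "hers"])
    print(find_all("ushers", m))  # (1, 'she'), (2, 'he'), (2, 'hers'), in some order
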
A Hybrid Approach of Text Segmentation Based on Sensitive Word Concept for NLP

Natural language processing tasks, such as checking and correction of texts, machine translation, and information retrieval, usually start from words. The identification of words in Indo-European languages is a trivial task. However, this problem, named text segmentation, has been and still is a bottleneck for various Asian languages, such as Chinese. There have been two main groups of approaches to Chinese segmentation: dictionary-based approaches and statistical approaches. However, both approaches have difficulty dealing with some Chinese text. To address the difficulties, we propose a hybrid approach to Chinese text segmentation using the Sensitive Word Concept. Sensitive words are compound words whose syntactic category is different from those of their components. Depending on the segmentation, a sensitive word may play different roles, leading to significantly different syntactic structures. In this paper, we first explain the concept of sensitive words and their efficacy in text segmentation, and then describe the hybrid approach that combines the rule-based method and the probability-based method using the concept of sensitive words. Our experimental results show that the presented approach is able to address the text segmentation problems effectively.

Fuji Ren
Web-Based Arabic Morphological Analyzer

This paper presents an Arabic morphological analyzer that is a component of an architecture which can process unrestricted text from the Internet. The morphological analyzer uses an object-oriented model to represent the morphological rules for verbs and nouns, a matching algorithm to isolate the affixes and the root of a given word-form, and a linguistic knowledge base consisting of lists of words. The morphological rules fall into two categories: the regular morphological rules of Arabic grammar and the exception rules that represent the language’s exceptions. The representation and the implementation of these rules and the matching algorithm are discussed in this paper.

Jawad Berri, Hamza Zidoum, Yacine Atif

Parsing Techniques

Stochastic Parsing and Parallelism

CYK-like parsing algorithms are inherently parallel: there are a lot of cells in the chart that can be calculated simultaneously. In this work, we present a study of the appropriate parallelization techniques to obtain optimal performance of the extended CYK algorithm, a stochastic parsing algorithm that preserves the same level of expressiveness as the original grammar and improves further tasks of robust parsing. We consider two methods of parallelization: distributed memory and shared memory. The excellent performance obtained with the second one turns this algorithm into an alternative that could compete with other parsing techniques that are more efficient a priori.

Francisco-Mario Barcala, Oscar Sacristán, Jorge Graña
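
The parallelism is already visible in a sequential CYK skeleton: all cells of a given span length depend only on strictly shorter spans, so the loop over start positions can be distributed. A toy sketch (ours), with a CNF grammar encoded as a dictionary:

    def cyk(words, lexicon, rules):
        n = len(words)
        # chart[i][l] = set of nonterminals deriving words[i:i+l]
        chart = [[set() for _ in range(n + 1)] for _ in range(n)]
        for i, w in enumerate(words):
            chart[i][1] = set(lexicon[w])
        for length in range(2, n + 1):
            # cells of one span length are mutually independent, so this
            # loop over i is the natural place to parallelize
            for i in range(n - length + 1):
                for k in range(1, length):
                    for B in chart[i][k]:
                        for C in chart[i + k][length - k]:
                            chart[i][length] |= rules.get((B, C), set())
        return chart[0][n]

    lexicon = {"she": ["NP"], "eats": ["V"], "fish": ["NP"]}
    rules = {("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
    print(cyk("she eats fish".split(), lexicon, rules))  # {'S'}
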
Practical Nondeterministic DR(k) Parsing on Graph-Structured Stack

A new approach to parsing context-free grammars is presented. It relies on discriminating-reverse, DR(k), parsers, with a Tomita-like nondeterminism-controlling graph-structured stack (GSS) algorithm. The advantage of this generalized discriminating-reverse (GDR) approach over GLR lies in the possibility of using DR(k) parsers, which combine full LR(k) parsing power with a small number of states even for k > 1. This can greatly reduce nondeterminism originating from limited parsing power, as is typical of the restricted direct LR parsers (SLR, LALR) commonly used in Tomita’s algorithm. Furthermore, relying on a DR parser allows a GSS that associates nodes with symbols instead of direct-LR states, and makes computation of the shared forest easier. Moreover, DR(k) parsers have been shown to be linear for LR(k) grammars, and DR(k) parser efficiency has been found in practice to be very similar to that of direct LR(k) parsers. The paper first presents the nondeterministic DR(k) generation algorithm (for non-LR(k) grammars). Then, it discusses the corresponding adaptation of the GSS algorithm and shows how the shared forest computation is naturally handled.

José Fortes Gálvez, Jacques Farré, Miguel Ángel Pérez Aguiar

Intelligent Text Processing

Text Categorization

Text Categorization Using Adaptive Context Trees

A new way of representing texts written in natural language is introduced: a conditional probability distribution at the letter level, learned with a variable-length Markov model called an adaptive context tree model. Text categorization experiments demonstrate the ability of this representation to capture information about the semantic content of the text.

Jean-Philippe Vert
Text Categorization through Multistrategy Learning and Visualization

This paper introduces a multistrategy learning approach to the categorization of text documents. The approach benefits from two existing, and in our view complementary, sets of categorization techniques: those based on Rocchio’s algorithm and those belonging to the rule-learning class of machine learning algorithms. Visualization is used for the presentation of the output of learning.

Ali Hadjarian, Jerzy Bala, Peter Pachowicz
Automatic Topic Identification Using Ontology Hierarchy

This paper proposes a method of using an ontology hierarchy in automatic topic identification. The fundamental idea behind this work is to exploit an ontology’s hierarchical structure in order to find the topic of a text. The keywords that are extracted from a given text are mapped onto their corresponding concepts in the ontology. By optimizing over the corresponding concepts, we pick a single node among the concept nodes that we believe is the topic of the target text. However, a limited-vocabulary problem is encountered while mapping the keywords onto their corresponding concepts. This situation forces us to extend the ontology by enriching each of its concepts with new concepts using an external linguistic knowledge base (WordNet). Our intuition is that with a high number of keywords mapped onto the ontology concepts, our topic identification technique can perform at its best.

Sabrina Tiun, Rosni Abdullah, Tang Enya Kong
Software for Creating Domain-Oriented Dictionaries and Document Clustering in Full-Text Databases

The problem of reorganization and classification of full-text databases by means of the clustering of their documents is considered. The clustering is performed in the space of words relating to special domains. Methods of constructing domain-oriented dictionaries are suggested. The software developed can be used when the number of clusters is unknown beforehand. This aprioristic uncertainty necessitates several steps of clustering, with large clusters undergoing further subdivision.

Pavel Makagonov, Konstantin Sboychakov
Chi-Square Classifier for Document Categorization

The problem of document categorization is considered. The set of domains and the keywords specific to these domains are supposed to be selected beforehand as initial data. We apply a well-known statistical hypothesis test that considers images of documents and domains as normalized vectors. In comparison with existing methods, this approach allows taking into account the random character of the initial data. The classifier is developed in the framework of the Document Investigator software package.

Mikhail Alexandrov, Alexander Gelbukh, George Lozovoi
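
A hedged sketch of the test: compare a document's keyword counts with each domain's normalized keyword profile using a chi-square goodness-of-fit statistic, and choose the best-fitting domain. Keywords, profiles, and counts below are invented:

    def chi_square(observed, expected):
        """observed: raw keyword counts; expected: normalized domain profile."""
        total = sum(observed)
        return sum((o - total * e) ** 2 / (total * e)
                   for o, e in zip(observed, expected) if e > 0)

    domains = {                      # keyword order: bank, market, patient
        "finance":  [0.6, 0.3, 0.1],
        "medicine": [0.1, 0.1, 0.8],
    }
    doc = [12, 5, 2]                 # keyword counts observed in the document
    print(min(domains, key=lambda d: chi_square(doc, domains[d])))  # finance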

Information Retrieval

Information Retrieval of Electronic Medical Records

This paper presents common issues associated with information retrieval from electronic medical records and presents linguistic approaches to resolve these issues. Linguistic analyses of three medical topics (heart attacks, smoking, and death reports) are presented to highlight common issues and our approaches to resolve them. We demonstrate how the Clinical Practice Analysis (CPA) system developed by Synthesys Technologies, Inc. enables the medical researcher to create powerful queries to retrieve information about individual patients or entire patient populations quickly and easily. The efficiency of this system has been enhanced by the implementation of linguistic generalizations specific to medical records, such as lexical variation, ambiguity, argument alternation, anaphora resolution, belief contexts, downward-entailing contexts and presupposition.

Anne-Marie Currie, Jocelyn Cohan, Larisa Zlatic
Automatic Keyword Extraction Using Domain Knowledge

Documents can be assigned keywords by frequency analysis of the terms found in the document text, which arguably is the primary source of knowledge about the document itself. By including a hierarchically organised domain-specific thesaurus as a second knowledge source, the quality of such keywords was improved considerably, as measured by the match to previously manually assigned keywords. In the presented experiment, the combination of the evidence from frequency analysis and the hierarchically organised thesaurus was done using inductive logic programming.

Anette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström, Lars Asker
Approximate VLDC Pattern Matching in Shared-Forest

We present a matching-based proposal intended to deal with querying for structured text databases. Our approach extends approximate VLDC matching techniques by allowing a query to exploit sharing of common parts between patterns used to index the document.

Manuel Vilares, Francisco J. Ribadas, Victor M. Darriba
Knowledge Engineering for Intelligent Information Retrieval

This paper presents a clustered approach to designing an overall ontological model together with a general rule-based component that serves as a mapping device. By observational criteria, a multi-lingual team of experts excerpts concepts from general communication in the media. The team then finds equivalent expressions in English, German, French, and Spanish. On the basis of a set of ontological and lexical relations, a conceptual network is built up. Concepts are thought to be universal. Objects unique in time and space are identified by names and will be explained by the universals as their instances. Our approach relies on multi-relational descriptions of concepts. It provides a powerful tool for documentation and conceptual language learning. First and foremost, our multi-lingual, polyhierarchical ontology fills the gap in semantically-based information retrieval by generating enhanced and improved queries for internet search.

Guido Drexel
Is Peritext a Key for Audiovisual Documents? The Use of Texts Describing Television Programs to Assist Indexing

At the INA (Institut national de l’audiovisuel, the French National Broadcasting Institute), indexers work on audiovisual document retrieval. Because of problems of variability and cost in the indexing process, we are trying to provide new assistance for accessing audiovisual documents by introducing new textual information on the digital medium.

Karine Lespinasse, Bruno Bachimont
An Information Space Using Topic Identification for Retrieved Documents

We present a model of an Information Space (IS) that combines the Vector Space Model (VSM) with conceptual hierarchies for Information Retrieval (IR). This model allows navigating between the two representations, helping the user in the retrieval process. The goal is to filter the documents the user is interested in, based on their theme.

David Escorial

Structure Identification. Text Mining

Contextual Rules for Text Analysis

In this paper we describe a rule-based formalism for the analysis and labelling of text segments. The rules are contextual rewriting rules with a restricted form of negation. They allow underspecifying text segments not considered relevant to a given task and basing decisions upon context. A parser for these rules is presented, and consistency and completeness issues are discussed. Some results of an implementation of this parser with a set of rules oriented to the segmentation of texts into propositions are shown.

Dina Wonsever, Jean-Luc Minel
Finding Correlative Associations among News Topics

A method for finding real-world associations between news topics (as distinguished from apparent associations caused by the constant size of the newspaper) is described. This is important for studying society’s interests.

Manuel Montes-y-Gómez, Aurelio López-López, Alexander Gelbukh
Backmatter
Metadata
Title
Computational Linguistics and Intelligent Text Processing
Edited by
Alexander Gelbukh
Copyright year
2001
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-44686-6
Print ISBN
978-3-540-41687-6
DOI
https://doi.org/10.1007/3-540-44686-9