
2002 | Book

Natural Language Processing and Information Systems

6th International Conference on Applications of Natural Language to Information Systems, NLDB 2002, Stockholm, Sweden, June 27–28, 2002, Revised Papers

Edited by: Birger Andersson, Maria Bergholtz, Paul Johannesson

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

The workshop on Applications of Natural Language to Information Systems (NLDB) has since 1995 provided a forum for academic and industrial researchers and practitioners to discuss the application of natural language to both the development and use of software applications. The use of natural language in relation to software has contributed to improving the development of software from the viewpoints of both the developers and the users. Developers benefit from improvements in conceptual modeling, software validation, natural language program specifications, and many other areas. Users benefit from increased usability of applications through natural language query interfaces, semantic webs, text summarizations, etc. The integration of natural language and information systems has been a research objective for a long time now. Today, the goal of good integration seems not so far-fetched. This is due mainly to the rapid progress of research in natural language and to the development of new and powerful technologies. The integration of natural language and information systems has become a convergent point towards which many researchers from several research areas are focusing.

Table of Contents

Frontmatter

Linguistic Aspects of Modelling

An Ontology-Based Framework for Generating and Improving Database Design
Abstract
It is well accepted that there is a need to incorporate domain knowledge into system development tools of various kinds. Most of these are CASE tools that have been quite successful in supporting design on a syntactic basis. In general, however, they are not capable of representing and using information about the semantics of an application domain. This research presents a framework for supporting the generation and analysis of conceptual database designs through the use of ontologies. The framework is implemented in a database design assistant prototype that illustrates the research results.
Vijayan Sugumaran, Veda C. Storey
A Web Information Extraction System to DB Prototyping
Abstract
Database prototyping is a technique widely used both to validate user requirements and to verify certain application functionality. These tasks usually require the population of the underlying data structures with sampling data that, additionally, may need to stick to certain restrictions. Although some existing approaches have already automated this population task by means of random data generation, the lack of semantic meaning of the resulting structures may interfere both in the user validation and in the designer verification task.
In order to solve this problem and improve the intuitiveness of the resulting prototypes, this paper presents a population system that, starting from the information contained in a UML-compliant Domain Conceptual Model, applies Information Extraction techniques to compile meaningful information sets from texts available through the Internet. The system is based on the semantic information extracted from the EWN lexical resource and includes, among other features, a named entity recognition system and an ontology that speed up the prototyping process and improve the quality of the sampling data.
P. Moreda, R. Muñoz, P. Martínez-Barco, C. Cachero, Manuel Palomar
Automatic Help for Building and Maintaining Ontologies
Abstract
“Is_A” links are the core component of all ontologies and are organized into “hierarchies of concepts”. In this paper we will first address the problem of an automatic help to build sound hierarchies. Dependencies called “existence constraints” are the foundation for the definition of a “normalized” hierarchy of concepts. In the first part of the paper algorithms are provided to obtain a normalized hierarchy starting either from concepts or from instances using Boolean functions. The second part of the paper is devoted to the hierarchy maintenance: automatically inserting, merging or removing pieces of knowledge. We also provide a way to give synthetic views of the hierarchy.
Nadira Lammari, Elisabeth Métais

Information Retrieval

Semi-automatic Content Extraction from Specifications
Abstract
Specifications are critical to companies involved in complex manufacturing. The constant reading, reviewing, and analysis of materials and process specifications is extremely labor-intensive, quality-impacting, and time-consuming. A conceptual design for a tool that provides computer assistance in the interpretation of specification requirements has been created, and a strategy for semantic markup, which is the overlaying of abstract syntax ("the essence") on the text, has been developed. The solution is based on techniques for Information Extraction and on XML technology, and it captures the specification content within a semantic ontology. The working prototype of the tool being built will serve as the foundation for potential full-scale commercialization.
Krishnaprasad Thirunarayan, Aaron Berkovich, Dan Sokol
Integrating Retrieval Functionality in Web Sites Based on Storyboard Design and Word Fields
Abstract
Large information-intensive sites are being developed everywhere, and each needs its own retrieval support. This support comprises functionality for keyword and catalogue search as well as context-based retrieval. The search functionality is often added after the complete site has been designed, using common search engines or site-map tools. The information returned by such search facilities is often not specific to the user's needs. We focus on an integrated development of search facilities within the web site design process. Additionally, we use the linguistic instrument of 'word fields' for defining abstract search keys during the design process.
Antje Düsterhöft, Bernhard Thalheim
Vulcain — An Ontology-Based Information Extraction System
Abstract
This paper describes an information extraction system, Vulcain, dedicated to message filtering for a specific domain. The paper focuses on a method for identifying domain-specific terms and concepts, using syntactic information and an existing domain ontology. We focused on a method for identifying terms by partial syntactic analysis, based on TAG grammars. The domain ontology is represented in description logics, and DL inference mechanisms are used to validate the candidate concepts.
Amalia Todirascu, Laurent Romary, Dalila Bekhouche

Natural Language Text Understanding

Automatic Identification of European Languages
Abstract
We describe our word-based implementation of a language identification system for text messages written in European languages. Specifically, we use and compare linguistic (based on functional words) and statistical (based on word frequency) approaches to constructing the identifying vocabularies. Our version of the statistical approach copes with differences in the degree of word overlap among languages and with the problem of small messages. In addition, it allows a user to choose the accuracy of language identification. At present, our system identifies 8 languages (Bulgarian, English, French, German, Italian, Russian, Spanish, and Swedish) in various encodings. With identifying vocabularies of limited size (fewer than 1500 keys per language), the accuracy of identification reaches 99% even for messages containing only one sentence.
Anna V. Zhdanova
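The word-based approach described above can be sketched as a vocabulary-lookup scorer. The tiny per-language key sets below are illustrative placeholders only; the paper's real vocabularies hold up to ~1500 keys per language, built from functional words or word-frequency statistics.

```python
from collections import Counter

# Hypothetical key vocabularies; the real system's lists are far larger
# and cover 8 languages in multiple encodings.
VOCABULARIES = {
    "english": {"the", "and", "of", "to", "is", "in", "that"},
    "german":  {"der", "die", "und", "ist", "das", "nicht", "ein"},
    "french":  {"le", "la", "et", "est", "les", "dans", "que"},
}

def identify_language(text: str) -> str:
    """Score each language by how many of the message's words appear
    in its key vocabulary; return the best-scoring language."""
    words = [w.lower() for w in text.split()]
    scores = Counter()
    for lang, vocab in VOCABULARIES.items():
        scores[lang] = sum(1 for w in words if w in vocab)
    return scores.most_common(1)[0][0]

print(identify_language("the cat is in the house"))  # english
```

Even this crude scorer separates short messages; the paper's contribution lies in how the vocabularies are chosen so that overlapping words (e.g. shared function words) do not confuse closely related languages.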
A Method for Maintaining Document Consistency Based on Similarity Contents
Abstract
The advent of the WWW and distributed information systems has made it possible to share documents between different users and organisations. However, this has created many problems related to the security, accessibility, rights and, most importantly, the consistency of documents. It is important that the people involved have access to the most up-to-date versions, retrieve the correct documents, and be able to update the document repository in such a way that their documents are known to others. In this paper we propose a method for organising, storing and retrieving documents based on content similarity. The method uses techniques based on information retrieval, document summarisation, and term extraction and indexing. This methodology is developed for the E-Cognos project, which aims at developing tools for the management and sharing of documents in the construction domain.
Farid Meziane, Yacine Rezgui
Evaluation and Construction of Training Corpuses for Text Classification: A Preliminary Study
Abstract
Text classification is becoming more and more important with the rapid growth of the on-line information available. It has been observed that the quality of the training corpus impacts the performance of the trained classifier. This paper proposes an approach to building high-quality training corpuses for better classification performance by first exploring the properties of training corpuses and then giving an algorithm for constructing them semi-automatically. Preliminary experimental results validate our approach: classifiers based on the training corpuses constructed by our approach can achieve good performance while the training corpus size is significantly reduced. Our approach can be used for building efficient and lightweight classification systems.
Shuigeng Zhou, Jihong Guan

Knowledge Bases

Replicating Quantified Noun Phrases in Database Semantics
Abstract
Predicate calculus treats determiner-noun sequences like the man, every man, or several men as 'quantified noun phrases.' This analysis in terms of quantifiers, variables, and connectives creates a major structural difference compared to the handling of proper names. The modeling of natural language communication in database semantics (DBS), in contrast, treats the functor-argument structure as primary, regardless of whether an argument is of the sign type symbol (determiner-noun sequence), name, or indexical (pronoun). The meanings carried by different determiners are reanalyzed as controlling the matching between nominal symbols and individuals, or sets of individuals, at the level of context.
Roland Hausser
Ontological Extraction of Content for Text Querying
Abstract
This paper describes a method and a system ONTOQUERY for content-based querying of texts based on the availability of an ontology for the concepts in the text domain. A key principle in the system is the extraction of conceptual content of noun phrases into descriptors forming an integral part of the ontology.
The retrieval of text passages rests on matching descriptors from the text against descriptors from the noun phrases in the query. The match need not be exact but is mediated by the ontology, invoking in particular taxonomic reasoning with sub- and super concepts. The paper also reports on a prototype implementation of the system.
Troels Andreasen, Per Anker Jensen, Jørgen Fischer Nilsson, Patrizia Paggio, Bolette Sandford Pedersen, Hanne Erdman Thomsen
Ontology-Based Data Cleaning
Abstract
Multi-source information systems, such as data warehouses, are composed of a set of heterogeneous and distributed data sources. The relevant information is extracted from these sources, cleaned, transformed and then integrated. The confrontation of two different data sources may reveal different kinds of heterogeneities: at the intensional level, the conflicts are related to the structure of the data. At the extensional level, the conflicts are related to the instances of the data. The process of detecting and solving the conflicts at the extensional level is known as data cleaning. In this paper, we will focus on the problem of differences in terminologies and we propose a solution based on linguistic knowledge provided by a domain ontology. This approach is well suited for application domains with intensive classification of data such as medicine or pharmacology. The main idea is to automatically generate some correspondence assertions between instances of objects. The user can parametrize this generation process by defining a level of accuracy expressed using the domain ontology.
Zoubida Kedad, Elisabeth Métais
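The generation of correspondence assertions described above can be sketched as matching instance values through ontology-provided synonymy. Everything below is a minimal illustration, assuming a toy ontology of synonym sets; the paper's system additionally lets the user tune a level of accuracy expressed via the ontology (e.g. matching through hypernyms).

```python
# Hypothetical domain ontology reduced to synonym sets; a real medical or
# pharmacological ontology would be far richer and hierarchical.
ONTOLOGY_SYNONYMS = [
    {"myocardial infarction", "heart attack"},
    {"hypertension", "high blood pressure"},
]

def same_concept(term_a: str, term_b: str) -> bool:
    """Two instance values denote the same concept if they are equal
    or belong to the same synonym set in the domain ontology."""
    a, b = term_a.lower(), term_b.lower()
    if a == b:
        return True
    return any(a in s and b in s for s in ONTOLOGY_SYNONYMS)

def correspondence_assertions(source_a, source_b):
    """Automatically pair instances from two data sources that refer
    to the same concept under the ontology (the data-cleaning step)."""
    return [(x, y) for x in source_a for y in source_b if same_concept(x, y)]
```

Calling `correspondence_assertions(["Heart attack"], ["myocardial infarction"])` pairs the two terminologically different but semantically identical instances, which is exactly the extensional conflict the abstract targets.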

Recognition of Information in Natural Language Descriptions

Retrieving NASA Problem Reports with Natural Language
Abstract
A system that retrieves problem reports from a NASA database is described. The database is queried with natural language questions. Part-of-speech tags are first assigned to each word in the question using a rule-based tagger. A partial parse of the question is then produced with independent sets of deterministic finite state automata. Using partial parse information, a look-up strategy searches the database for problem reports relevant to the question. A bigram stemmer and irregular verb conjugates have been incorporated into the system to improve accuracy. The system is evaluated on a set of fifty-five questions posed by NASA engineers. A discussion of future research is also presented.
Sebastian van Delden, Fernando Gomez
Access to Multimedia Information through Multisource and Multilanguage Information Extraction
Abstract
We describe our work on information extraction from multiple sources for the Multimedia Indexing and Searching Environment, a project aiming at developing technology to produce formal annotations about essential events in multimedia programme material. The creation of a composite index from multiple and multi-lingual sources is a unique aspect of this project. The domain chosen for tuning the software components and testing is football. Our information extraction system is based on the use of finite state machinery pipelined with full semantic analysis and discourse interpretation.
Horacio Saggion, Hamish Cunningham, Kalina Bontcheva, Diana Maynard, Cris Ursu, Oana Hamza, Yorick Wilks
On Semantic Classification of Modifiers
Abstract
To search a large dictionary for a collocation expressing a desired meaning, the human reader needs some kind of hierarchical structure that facilitates such a search. In this paper, fragments of a semantic classification of modifiers are elaborated based on several highly modifier-productive nouns, namely the common nouns person, action, look, corporation, and price, as well as the terms coating, medium, and check. By modifiers we mean adjectives, participles, or prepositional phrases syntactically dependent on the nouns. The classification rubrics proved to be heavily dependent on the modified headword noun and are representative fragments of a Roget-like thesaurus. It is shown that the modifiers under consideration are rather selective in their use, similarly to the standard lexical functions (LFs) of Mel'čuk, while for many nouns LFs can be absent. The obtained classification rubrics can be used for other English nouns and for other languages. Some deficiencies of the proposed rubrics are discussed.
Igor A. Bolshakov, Alexander Gelbukh

Natural Language Conversational Systems

Evaluating a Spelling Support in a Search Engine
Abstract
The information in a database is usually accessed using SQL or some other query language, but with a free text retrieval system the retrieval of text-based information becomes much easier and more user-friendly, since one can use natural language techniques such as automatic spell checking and stemming. The free text retrieval system first needs to index the database; after that, searching is straightforward. Normally a search engine gives no answers to queries whose search words do not exist in the index; we therefore connected a spell checker module to a search engine and evaluated it. The domain used was the web site of the Swedish National Tax Board (Riksskatteverket, RSV), where the search engine was in use between April and September 2001. One million queries were made by the public. Of these queries, 10 percent were misspelled or erroneous, and our spell checker corrected around 90 percent of those.
Hercules Dalianis
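The core mechanism, correcting query words that are absent from the search index, can be sketched as follows. This is a minimal stand-in using Python's `difflib`; the deployed system used a dedicated spell-checker module, and the Swedish terms below are illustrative placeholders for the real index vocabulary.

```python
from difflib import get_close_matches

# Hypothetical vocabulary extracted from the indexed documents.
INDEX_TERMS = ["deklaration", "skatt", "moms", "avdrag", "fastighet"]

def correct_query(query: str, vocabulary) -> str:
    """Replace each query word that is absent from the index with its
    closest indexed term, so the engine can still return hits."""
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
        else:
            matches = get_close_matches(word, vocabulary, n=1, cutoff=0.7)
            corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

print(correct_query("dekleration moms", INDEX_TERMS))  # deklaration moms
```

Restricting candidate corrections to words that actually occur in the index is what makes this approach effective: every correction is guaranteed to produce at least one hit.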
Opening Statistical Translation Engines to Terminological Resources
Abstract
The past decade has witnessed exciting work in the field of Statistical Machine Translation (SMT). However, accurate evaluation of its potential in real-life contexts is still a questionable issue. In this study, we investigate the behavior of an SMT engine faced with a corpus far different from the one it has been trained on. We show that terminological databases are obvious resources that should be used to boost the performance of a statistical engine. We propose and evaluate a way of integrating terminology into an SMT engine which yields a significant reduction in word error rate.
Philippe Langlais

Short Papers

User-Centred Ontology Learning for Knowledge Management
Abstract
Automatic ontology building is a vital issue in many fields where ontologies are currently built manually. This paper presents a user-centred methodology for ontology construction based on the use of Machine Learning and Natural Language Processing. In our approach, the user selects a corpus of texts and sketches a preliminary ontology (or selects an existing one) for a domain, with a preliminary vocabulary associated to the elements in the ontology (lexicalisations). Examples of sentences involving such lexicalisations (e.g. the ISA relation) are automatically retrieved from the corpus by the system. Retrieved examples are validated by the user and used by an adaptive Information Extraction system to generate patterns that discover other lexicalisations of the same objects in the ontology, possibly identifying new concepts or relations. New instances are added to the existing ontology or used to tune it. This process is repeated until a satisfactory ontology is obtained. The methodology largely automates the ontology construction process, and the output is an ontology with an associated trained learner to be used for further ontology modifications.
Christopher Brewster, Fabio Ciravegna, Yorick Wilks
A Multilevel Text Processing Model of Newsgroup Dynamics
Abstract
We present a multilevel model of discussions in USENET newsgroups that includes the use of statistical and linguistic methods to obtain lexical, semantic and discourse characteristics of the text. We expose constraints that make information extraction and summarization more amenable to analysis at different levels. Our model makes use of posting structure, times of posting, time spans, and length and depth of a thread in order to extract higher-level information on subject matter, interest level, topicality, and discussion trends.
G. Sampath, Miroslav Martinovic
Best Feature Selection for Maximum Entropy-Based Word Sense Disambiguation
Abstract
In this paper, a supervised learning method for word sense disambiguation based on maximum entropy conditional probability models is presented. The system acquires linguistic knowledge from an annotated corpus, and this knowledge is represented in the form of features. Several types of features have been analyzed for a few words selected from the DSO corpus. The main contribution of this paper is the selection of the best sets of features for each word from the training data in order to build the classifiers. Our experimentation shows that our method reaches good accuracy when compared with, for example, the systems at SENSEVAL-2.
Armando Suárez, Manuel Palomar
Linguistics in Large-Scale Web Search
Abstract
In spite of intensive research on linguistic techniques in information retrieval, there are still few large-scale search engines that have taken full advantage of these techniques. This paper presents the integration of various linguistic techniques in one of the largest search engines on the Internet. The techniques include language identification, offensive content filtering, phrasing and anti-phrasing, normalization, and clustering. We go into some of the challenges of Internet search and discuss our experiences with these techniques.
Jon Atle Gulla, Per Gunnar Auran, Knut Magne Risvik
Similarity Model and Term Association for Document Categorization
Abstract
In the information retrieval and document categorization context, both Euclidean distance-based and cosine-based similarity models rest on the assumption that term vectors are orthogonal. But this assumption is not true: term associations are ignored in such similarity models. This paper analyzes the properties of the term-document space, the term-category space, and the category-document space. Then, without the assumption of term independence, we propose a new mathematical model to estimate the association between terms and define an ∞-similarity model of documents. We make the best use of the existing category membership represented by the corpus, with the objective of improving categorization performance. Empirical results obtained with a k-NN classifier over the Reuters-21578 corpus show that utilizing term association can improve the effectiveness of a categorization system and that the ∞-similarity model outperforms models without term association.
Huaizhong Kou, Georges Gardarin
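The orthogonality problem raised above can be illustrated with a generalized cosine similarity in which inner products are taken through a term-association matrix S. This is a generic sketch of the idea, not the paper's ∞-similarity model: the matrix S below is a made-up example, whereas the paper estimates term associations from category membership in the corpus.

```python
import numpy as np

def associated_similarity(x, y, S):
    """Generalized cosine: inner products go through a term-association
    matrix S; S = I recovers standard cosine (orthogonal terms)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    num = x @ S @ y
    den = np.sqrt(x @ S @ x) * np.sqrt(y @ S @ y)
    return num / den if den else 0.0

# Two documents that share no terms directly ...
x = np.array([1.0, 0.0])   # uses term 0 only
y = np.array([0.0, 1.0])   # uses term 1 only

I = np.eye(2)
S = np.array([[1.0, 0.8],  # ... but terms 0 and 1 are strongly associated
              [0.8, 1.0]])

print(associated_similarity(x, y, I))  # 0.0 under the orthogonality assumption
print(associated_similarity(x, y, S))  # 0.8 once the association is modeled
```

The example makes the abstract's point concrete: documents with disjoint vocabularies get zero similarity under plain cosine, while modeling term association recovers their relatedness.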
Omnibase: Uniform Access to Heterogeneous Data for Question Answering
Abstract
Although the World Wide Web contains a tremendous amount of information, the lack of uniform structure makes finding the right knowledge difficult. A solution is to turn the Web into a "virtual database" and to access it through natural language. We built Omnibase, a system that integrates heterogeneous data sources using an object-property-value model. With the help of Omnibase, our Start natural language system can now access numerous heterogeneous data sources on the Web in a uniform manner, and answers millions of user questions with high precision.
Boris Katz, Sue Felshin, Deniz Yuret, Ali Ibrahim, Jimmy Lin, Gregory Marton, Alton Jerome McFarland, Baris Temelkuran
Automated Question Answering Using Question Templates That Cover the Conceptual Model of the Database
Abstract
The question-answering system developed in this research matches one-sentence-long user questions to a number of question templates that cover the conceptual model of the database and describe the concepts, their attributes, and the relationships in the form of natural language questions. A question template resembles a frequently asked question (FAQ). Unlike a static FAQ, however, a question template may contain entity slots that are replaced by data instances from the underlying database. During the question-answering process, the system retrieves relevant data instances and question templates, and offers one or several interpretations of the original question. The user selects an interpretation to be answered.
Eriks Sneiders
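The template-with-entity-slots mechanism described above can be sketched as follows. The templates, product names, and SQL mappings are hypothetical placeholders; the paper's system covers a full conceptual model and lets the user choose among competing interpretations.

```python
# Hypothetical question templates; <product> slots are filled with
# data instances retrieved from the underlying database.
TEMPLATES = [
    ("What is the price of <product>?",
     "SELECT price FROM products WHERE name = '{product}'"),
    ("Who supplies <product>?",
     "SELECT supplier FROM products WHERE name = '{product}'"),
]

PRODUCTS = ["laptop", "printer"]  # stand-in for database instances

def interpretations(question: str):
    """Fill each template's slot with database instances found in the
    question; return the (template, query) interpretations that match."""
    q = question.strip().lower()
    results = []
    for product in PRODUCTS:
        if product in q:
            for text, sql in TEMPLATES:
                filled = text.replace("<product>", product).lower()
                if filled == q:
                    results.append((text, sql.format(product=product)))
    return results

print(interpretations("What is the price of laptop?"))
```

A real system would match more loosely than exact string equality (the one-sentence questions rarely mirror a template verbatim), but the slot-filling step and the template-to-query mapping follow the same pattern.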
Backmatter
Metadata
Title
Natural Language Processing and Information Systems
Edited by
Birger Andersson
Maria Bergholtz
Paul Johannesson
Copyright year
2002
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-36271-5
Print ISBN
978-3-540-00307-6
DOI
https://doi.org/10.1007/3-540-36271-1