About this Book

This book constitutes the refereed proceedings of the 18th International Conference on Applications of Natural Language to Information Systems, NLDB 2013, held in Salford, UK, in June 2013. The 21 long papers, 15 short papers and 17 poster papers presented in this volume were carefully reviewed and selected from 80 submissions. The papers cover the following topics: requirements engineering, question answering systems, named entity recognition, sentiment analysis and mining, forensic computing, semantic web, and information search.

Table of Contents

Frontmatter

Full Papers

Extraction of Statements in News for a Media Response Analysis

The extraction of statements is an essential step in a Media Response Analysis (MRA), because statements in news represent the most important information for a customer of an MRA and can be used as the underlying data for Opinion Mining in newspaper articles. We propose a machine learning approach to tackle this problem. For each sentence, our method extracts different features which indicate the importance of the sentence for an MRA. Classified sentences are filtered through density-based clustering before selected sentences are combined into statements. In our evaluation, this technique achieved better results than comparison methods from Text Summarization and Opinion Mining on two real-world datasets.

Thomas Scholz, Stefan Conrad
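
A minimal sketch of the pipeline described in this abstract, namely per-sentence features, a classifier, and density-based filtering; the features, training data, classifier choice and DBSCAN parameters below are invented for illustration, not the paper's actual configuration.

```python
# Hypothetical statement-extraction sketch: score sentences with hand-crafted
# features, classify them, then keep only candidates in dense feature regions.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.linear_model import LogisticRegression

def sentence_features(sentence, position, doc_len):
    """Toy importance features; the paper's feature set is richer."""
    tokens = sentence.split()
    return [
        len(tokens),                           # sentence length
        position / max(doc_len - 1, 1),        # relative position in the article
        sum(t[:1].isupper() for t in tokens),  # crude named-entity signal
        sentence.count('"'),                   # quotes often mark statements
    ]

# Pretend (X_train, y_train) came from sentences labeled statement / non-statement.
X_train = np.array([[12, 0.0, 3, 2], [5, 0.9, 0, 0], [15, 0.1, 4, 2], [4, 0.8, 0, 0]])
y_train = np.array([1, 0, 1, 0])
clf = LogisticRegression().fit(X_train, y_train)

doc = ['The CEO said the merger was "a great success".',
       'Read more on page 12.',
       'Analysts called the deal "risky but promising".']
X = np.array([sentence_features(s, i, len(doc)) for i, s in enumerate(doc)])
candidates = X[clf.predict(X) == 1]

# Density-based filtering: DBSCAN labels outliers as -1, and they are discarded.
labels = DBSCAN(eps=5.0, min_samples=2).fit_predict(candidates)
print(len(candidates[labels != -1]), "candidate statement sentences kept")
```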

Sentiment-Based Ranking of Blog Posts Using Rhetorical Structure Theory

Polarity estimation in large-scale and multi-topic domains is a difficult issue. Most state-of-the-art solutions essentially rely on frequencies of sentiment-carrying words (e.g., taken from a lexicon) when analyzing the sentiment conveyed by natural language text. These approaches ignore the structural aspects of a document, which contain valuable information. Rhetorical Structure Theory (RST) provides important information about the relative importance of the different text spans in a document. This knowledge could be useful for sentiment analysis and polarity classification. However, RST has only been studied for polarity classification problems in constrained and small scale scenarios. The main objective of this paper is to explore the usefulness of RST in large-scale polarity ranking of blog posts. We apply sentence-level methods to select the key sentences that convey the overall on-topic sentiment of a blog post. Then, we apply RST analysis to these core sentences in order to guide the classification of their polarity and thus to generate an overall estimation of the document’s polarity with respect to a specific topic. Our results show that RST provides valuable information about the discourse structure of the texts that can be used to make a more accurate ranking of documents in terms of their estimated sentiment in multi-topic blogs.

Jose M. Chenlo, Alexander Hogenboom, David E. Losada

Automatic Detection of Ambiguous Terminology for Software Requirements

Identifying ambiguous requirements is an important aspect of software development, as it prevents design and implementation errors that are costly to correct. Unfortunately, few efforts have been made to automatically solve the problem. In this paper, we study the problem of lexical ambiguity detection and propose methods that can automatically identify potentially ambiguous concepts in software requirement specifications. Specifically, we focus on two types of lexical ambiguities, i.e., Overloaded and Synonymous ambiguity. Experimental results over four real-world software requirement collections show that the proposed methods are effective in detecting ambiguous terminology.

Yue Wang, Irene L. Manotas Gutièrrez, Kristina Winbladh, Hui Fang

An OpenCCG-Based Approach to Question Generation from Concepts

Dialogue systems are often regarded as being tedious and inflexible. We believe that one reason is rigid and inadaptable system utterances. A good dialogue system should automatically choose a formulation that reflects the user’s expectations. However, current dialogue system development environments only allow the definition of questions with unchangeable formulations. In this paper we present a new approach to the generation of system questions by only defining basic concepts. This is the basis for realising adaptive, user-tailored, and human-like system questions in dialogue systems.

Markus M. Berg, Amy Isard, Johanna D. Moore

A Hybrid Approach for Arabic Diacritization

The orthography of Modern Standard Arabic (MSA) includes a set of special marks called diacritics that carry the intended pronunciation of words. Arabic text is usually written without diacritics, which leads to major linguistic ambiguities in most cases, since Arabic words have different meanings depending on how they are diacritized. This paper introduces a hybrid diacritization system combining both rule-based and data-driven techniques targeting standard Arabic text. Our system relies on automatic correction, morphological analysis, part-of-speech tagging and out-of-vocabulary diacritization components. The system shows improved results over the best reported systems in terms of full-form diacritization, and comparable results on the level of morphological diacritization. We report these results by evaluating our system using the same training and evaluation sets used by the systems we compare against. Our system shows a word error rate (WER) of 4.4% on morphological diacritization, ignoring the last-letter diacritics, and 11.4% on full-form diacritization including case-ending diacritics. This means an absolute 1.1% reduction in WER over the best reported system.

Ahmed Said, Mohamed El-Sharqwi, Achraf Chalabi, Eslam Kamal
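
For concreteness, the word error rate (WER) quoted above is conventionally a word-level Levenshtein distance between the system output and the reference, normalized by the reference length. A minimal sketch of that standard metric (the example strings are invented, and this is not the authors' evaluation code):

```python
# Word error rate: edit distance over words, divided by the reference length.
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and
    # the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(r)

print(wer("the cat sat", "the cat sit"))  # one substitution in three words -> 0.333...
```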

EDU-Based Similarity for Paraphrase Identification

We propose a new method to compute the similarity between two sentences based on elementary discourse units, EDU-based similarity. Unlike conventional methods, which directly compute similarities based on sentences, our method divides sentences into discourse units and uses them to compute similarities. We also show the relation between paraphrases and discourse units, which plays an important role in paraphrasing. We apply our method to the paraphrase identification task. By using only a single SVM classifier, we achieve 93.1% accuracy on the PAN corpus, a large corpus for detecting paraphrases.

Ngo Xuan Bach, Nguyen Le Minh, Akira Shimazu
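
A rough sketch of the EDU-based idea, assuming the discourse units are already segmented; here commas stand in for a real discourse segmenter, and Jaccard overlap stands in for the paper's similarity features.

```python
# Compare two sentences by aligning their (approximate) discourse units rather
# than the whole sentences.
def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def edu_similarity(sent1, sent2):
    """Average best-match similarity between the EDUs of two sentences."""
    edus1 = [e.strip() for e in sent1.split(',') if e.strip()]
    edus2 = [e.strip() for e in sent2.split(',') if e.strip()]
    if not edus1 or not edus2:
        return jaccard(sent1, sent2)
    scores = [max(jaccard(e1, e2) for e2 in edus2) for e1 in edus1]
    return sum(scores) / len(scores)

s1 = "The law was passed in 2010, despite strong opposition"
s2 = "Despite fierce opposition, the bill became law in 2010"
print(round(edu_similarity(s1, s2), 3))
```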

Exploiting Query Logs and Field-Based Models to Address Term Mismatch in an HIV/AIDS FAQ Retrieval System

One of the main challenges in the retrieval of Frequently Asked Questions (FAQ) is that the terms used by information seekers to express their information need are often different from those used in the relevant FAQ documents. This lexical disagreement (aka term mismatch) can result in a less effective ranking of the relevant FAQ documents by retrieval systems that rely on keyword matching in their weighting models. In this paper, we tackle such a lexical gap in an SMS-Based HIV/AIDS FAQ retrieval system by enriching the traditional FAQ document representation using terms from a query log, which are added as a separate field in a field-based model. We evaluate our approach using a collection of FAQ documents produced by a national health service and a corresponding query log collected over a period of 3 months. Our results suggest that by enriching the FAQ documents with additional terms from the SMS queries for which the true relevant FAQ documents are known and combining term frequencies from the different fields, the lexical mismatch problem in our system is markedly alleviated, leading to an overall improvement in the retrieval performance in terms of Mean Reciprocal Rank (MRR) and recall.

Edwin Thuma, Simon Rogers, Iadh Ounis
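
The field-based enrichment can be pictured with a small, hypothetical scorer: the FAQ question, answer and past user queries are kept as separate fields whose term frequencies are combined with field weights before scoring. The field names, weights and the simplified BM25-style formula (no document-length normalization) are assumptions for illustration, not the paper's model.

```python
# Toy field-based scoring: a query-log field enriches the FAQ representation.
from collections import Counter
import math

def field_score(query, doc_fields, weights, collection, k1=1.2):
    """Score one document: weighted per-field term frequencies + BM25-style idf."""
    N = len(collection)
    score = 0.0
    for term in query.lower().split():
        tf = sum(w * Counter(doc_fields[f].lower().split())[term]
                 for f, w in weights.items())
        df = sum(any(term in d[f].lower().split() for f in d) for d in collection)
        if tf > 0 and df > 0:
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1)
    return score

faq = {"question": "What are the symptoms of HIV?",
       "answer": "Early symptoms include fever and fatigue.",
       "query_log": "signs hiv sick fever how know infected"}  # enrichment field
weights = {"question": 1.0, "answer": 0.5, "query_log": 1.5}
print(field_score("signs of infection", faq, weights, [faq]))
```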

Exploring Domain-Sensitive Features for Extractive Summarization in the Medical Domain

This paper describes experiments to adapt document summarization to the medical domain. Our summarizer combines linguistic features corresponding to text fragments (typically sentences) and applies a machine learning approach to extract the most important text fragments from a document to form a summary. The generic features comprise features used in previous research on summarization. We propose to adapt the summarizer to the medical domain by adding domain-specific features. We explore two types of additional features: medical domain features and semantic features. The evaluation of the summarizer is based on medical articles and targets different aspects: i) the classification of text fragments into ones which are important and ones which are unimportant for a summary; ii) analyzing the effect of each feature on the performance; and iii) system improvement over our baseline summarizer when adding features for domain adaptation. Evaluation metrics include accuracy for training the sentence extraction and the ROUGE measure computed against reference summaries. We achieve an accuracy of 84.16% on balanced medical training data by using an IB1 classifier. Training on unbalanced data achieves higher accuracy than training on balanced data. Domain adaptation using all domain-specific features outperforms the baseline summarization with respect to ROUGE scores, which shows that domain adaptation succeeds with simple means.

Dat Tien Nguyen, Johannes Leveling

A Corpus-Based Approach for the Induction of Ontology Lexica

While there are many large knowledge bases (e.g. Freebase, Yago, DBpedia) as well as linked data sets available on the web, they typically lack lexical information stating how the properties and classes are realized lexically. If at all, typically only one label is attached to these properties, thus lacking any deeper syntactic information, e.g. about syntactic arguments and how these map to the semantic arguments of the property, as well as about possible lexical variants or paraphrases. While there are lexicon models such as lemon that allow a lexicon to be defined for a given ontology, the cost involved in creating and maintaining such lexica is substantial, requiring a high manual effort. Towards lowering this effort, in this paper we present a semi-automatic approach that exploits a corpus to find occurrences in which a given property is expressed, and generalizes over these occurrences by extracting dependency paths that can be used as a basis to create lemon lexicon entries. We evaluate the resulting automatically generated lexica with respect to DBpedia as dataset and Wikipedia as corresponding corpus, both in an automatic mode, by comparing to a manually created lexicon, and in a semi-automatic mode in which a lexicon engineer inspected the results of the corpus-based approach, adding them to the existing lexicon if appropriate.

Sebastian Walter, Christina Unger, Philipp Cimiano

SQUALL: A Controlled Natural Language as Expressive as SPARQL 1.1

The Semantic Web is now made of billions of triples, which are available as Linked Open Data (LOD) or as RDF stores. The most common approach to access RDF datasets is through SPARQL, an expressive query language. However, SPARQL is difficult to learn for most users because it exhibits low-level notions of relational algebra such as union, filters, or grouping. We present SQUALL, a high-level language for querying and updating an RDF dataset. It has a strong compliance with RDF, covers all features of SPARQL 1.1, and has a controlled natural language syntax that completely abstracts from low-level notions. SQUALL is available as two web services: one for translating a SQUALL sentence to a SPARQL query or update, and another for directly querying a SPARQL endpoint such as DBpedia.

Sébastien Ferré

Evaluating Syntactic Sentence Compression for Text Summarisation

This paper presents our work on the evaluation of syntax-based sentence compression for automatic text summarization. Sentence compression techniques can contribute to text summarization by removing redundant and irrelevant information and allowing more space for more relevant content. However, very little work has focused on evaluating the contribution of this idea to summarization. In this paper, we focus on pruning individual sentences in extractive summaries using phrase structure grammar representations. We have implemented several syntax-based pruning techniques and evaluated them in the context of automatic summarization, using standard evaluation metrics. We have performed our evaluation on the TAC and DUC corpora using the BlogSum and MEAD summarizers. The results show that sentence pruning can achieve compression rates as low as 60%; however, when this extra space is used to fill in more sentences, ROUGE scores do not improve significantly.

Prasad Perera, Leila Kosseim
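
A small sketch of syntax-based pruning on a phrase-structure parse; the set of prunable constituent labels below is an illustrative assumption, not the paper's tuned configuration.

```python
# Drop optional constituents (here PPs, SBARs, ADVPs) from a parse tree and
# read the compressed sentence off the remaining leaves.
from nltk import Tree

PRUNE_LABELS = {"PP", "SBAR", "ADVP"}  # assumed prunable categories

def prune(tree):
    if isinstance(tree, str):          # leaf: a word
        return tree
    kept = [prune(c) for c in tree
            if isinstance(c, str) or c.label() not in PRUNE_LABELS]
    return Tree(tree.label(), kept)

parse = Tree.fromstring(
    "(S (NP (DT The) (NN committee)) "
    "(VP (VBD approved) (NP (DT the) (NN plan)) "
    "(PP (IN despite) (NP (JJ strong) (NNS objections)))))")
print(" ".join(prune(parse).leaves()))  # -> The committee approved the plan
```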

An Unsupervised Aspect Detection Model for Sentiment Analysis of Reviews

With the rapid growth of user-generated content on the internet, sentiment analysis of online reviews has recently become a hot research topic, but due to the variety and wide range of products and services, supervised and domain-specific models are often not practical. As the number of reviews expands, it is essential to develop an efficient sentiment analysis model that is capable of extracting product aspects and determining the sentiments for those aspects. In this paper, we propose an unsupervised model for detecting aspects in reviews. In this model, first, a generalized method is proposed to learn multi-word aspects. Second, a set of heuristic rules is employed to take into account the influence of an opinion word on detecting the aspect. Third, a new metric based on mutual information and aspect frequency is proposed to score aspects, with a new bootstrapping iterative algorithm. The presented bootstrapping algorithm works with an unsupervised seed set. Finally, two pruning methods based on the relations between aspects in reviews are presented to remove incorrect aspects. The proposed model does not require labeled training data and can be applied to other languages or domains. We demonstrate the effectiveness of our model on a collection of product reviews, where it outperforms other techniques.

Ayoub Bagheri, Mohamad Saraee, Franciska de Jong
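
The scoring step can be illustrated with a toy mix of association and frequency; the window size, seed lexicon and the exact combination below are invented stand-ins for the paper's mutual-information/frequency metric and bootstrapping loop.

```python
# Rank candidate aspects by association (PMI-like) with opinion words times
# a frequency term, over a tiny invented review collection.
import math
from collections import Counter

reviews = ["great battery life and a sharp screen",
           "battery life is poor but the screen is great",
           "love the screen , battery could be better"]
opinion_words = {"great", "poor", "sharp", "love", "better"}  # seed lexicon

tokens = [t for r in reviews for t in r.split()]
counts = Counter(tokens)
co = Counter()                         # term -> co-occurrences with opinion words
for r in reviews:
    words = r.split()
    for i, w in enumerate(words):
        if w in opinion_words:
            for n in words[max(0, i - 3):i + 4]:  # +/- 3-word window
                co[n] += 1

total = len(tokens)
def score(term):
    if counts[term] == 0 or co[term] == 0:
        return 0.0
    pmi = math.log(co[term] * total / (counts[term] * sum(co.values())))
    return pmi * math.log(1 + counts[term])       # association x frequency

print(sorted({"battery", "screen", "life", "and"}, key=score, reverse=True))
```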

Cross-Lingual Natural Language Querying over the Web of Data

The rapid growth of the Semantic Web offers a wealth of semantic knowledge in the form of Linked Data and ontologies, which can be considered as large knowledge graphs of marked up Web data. However, much of this knowledge is only available in English, affecting effective information access in the multilingual Web. A particular challenge arises from the vocabulary gap resulting from the difference in the query and the data languages. In this paper, we present an approach to perform cross-lingual natural language queries on Linked Data. Our method includes three components: entity identification, linguistic analysis, and semantic relatedness. We use Cross-Lingual Explicit Semantic Analysis to overcome the language gap between the queries and data. The experimental results are evaluated against 50 German natural language queries. We show that an approach using a cross-lingual similarity and relatedness measure outperforms other systems that use automatic translation. We also discuss the queries that can be handled by our approach.

Nitish Aggarwal, Tamara Polajnar, Paul Buitelaar

Extractive Text Summarization: Can We Use the Same Techniques for Any Text?

In this paper we address two issues. The first analyzes whether the performance of a text summarization method depends on the topic of a document. The second is concerned with how certain linguistic properties of a text may affect the performance of a number of automatic text summarization methods. For this we consider semantic analysis methods, such as textual entailment and anaphora resolution, and we study how they are related to proper noun, pronoun and noun ratios calculated over original documents that are grouped into related topics. Given the obtained results, we conclude that although our first hypothesis is not supported, since no evident relationship was found between the topic of a document and the performance of the methods employed, adapting summarization systems to the linguistic properties of input documents benefits the process of summarization.

Tatiana Vodolazova, Elena Lloret, Rafael Muñoz, Manuel Palomar

Unsupervised Medical Subject Heading Assignment Using Output Label Co-occurrence Statistics and Semantic Predications

Librarians at the National Library of Medicine tag each biomedical abstract to be indexed by their PubMed information system with terms from the Medical Subject Headings (MeSH) terminology. The MeSH terminology has over 26,000 terms, and indexers look at each article's full text to assign a set of the most suitable terms for indexing it. Several recent automated attempts focused on using the article title and abstract text to identify MeSH terms for the corresponding article. Most of these approaches used supervised machine learning techniques over already indexed articles and the corresponding MeSH terms. In this paper, we present a novel unsupervised approach using named entity recognition, relationship extraction, and output label co-occurrence frequencies of MeSH term pairs from the existing set of 22 million articles already indexed with MeSH terms by librarians at NLM. The main goal of our study is to gauge the potential of output label co-occurrence statistics and relationships extracted from free text in unsupervised indexing approaches. Especially in biomedical domains, output label co-occurrences are generally easier to obtain than training data involving document and label set pairs, owing to the sensitive nature of textual documents containing protected health information. Our methods achieve a micro F-score that is comparable to those obtained using supervised machine learning techniques with training data consisting of document and label set pairs. Baseline comparisons reveal strong prospects for further research in exploiting label co-occurrences and relationships extracted from free text in recommending terms for indexing biomedical articles.

Ramakanth Kavuluru, Zhenghao He
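
The co-occurrence idea can be pictured with a tiny, invented example: terms found directly by entity recognition "vote" for additional MeSH terms that were frequently co-assigned with them in already-indexed articles. The terms and counts below are made up.

```python
# Recommend extra index terms from label co-occurrence statistics.
from collections import Counter

cooccur = {  # MeSH term -> how often other terms were co-assigned with it
    "Hypertension": Counter({"Blood Pressure": 950, "Antihypertensive Agents": 620}),
    "Obesity": Counter({"Body Mass Index": 810, "Blood Pressure": 240}),
}

def recommend(found_terms, top_n=2):
    votes = Counter()
    for t in found_terms:
        votes.update(cooccur.get(t, Counter()))
    for t in found_terms:          # do not re-recommend what was already found
        votes.pop(t, None)
    return [t for t, _ in votes.most_common(top_n)]

print(recommend({"Hypertension", "Obesity"}))
# -> ['Blood Pressure', 'Body Mass Index']
```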

Bayesian Model Averaging and Model Selection for Polarity Classification

One of the most relevant tasks in Sentiment Analysis is Polarity Classification. In this paper, we discuss how to explore the potential of ensembles of classifiers and propose a voting mechanism based on Bayesian Model Averaging (BMA). An important issue to be addressed when using ensemble classification is the model selection strategy. In order to help select the best ensemble composition, we propose a heuristic aimed at evaluating the a priori contribution of each model to the classification task. Experimental results on different datasets show that Bayesian Model Averaging, together with the proposed heuristic, outperforms traditional classification methods and the well-known Majority Voting mechanism.

Federico Alberto Pozzi, Elisabetta Fersini, Enza Messina
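
The voting rule can be sketched in a few lines: each classifier's posterior is weighted by an estimate of the model's posterior probability. Approximating that weight by validation accuracy is an assumption made here for illustration, not the paper's exact estimator.

```python
# Bayesian Model Averaging as weighted posterior averaging across classifiers.
import numpy as np

def bma_predict(posteriors, model_weights):
    """posteriors: list of (n_samples, n_classes) arrays, one per classifier."""
    w = np.asarray(model_weights, dtype=float)
    w /= w.sum()                              # normalize P(model | data)
    avg = sum(wi * p for wi, p in zip(w, posteriors))
    return avg.argmax(axis=1)

# Two toy classifiers disagree on the second sample; the stronger model wins.
p1 = np.array([[0.9, 0.1], [0.4, 0.6]])      # validation accuracy 0.85
p2 = np.array([[0.8, 0.2], [0.7, 0.3]])      # validation accuracy 0.60
print(bma_predict([p1, p2], [0.85, 0.60]))   # -> [0 0]
```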

An Approach for Extracting and Disambiguating Arabic Persons’ Names Using Clustered Dictionaries and Scored Patterns

Building a system to extract Arabic named entities is a complex task due to the ambiguity and structure of Arabic text. Previous approaches that have tackled the problem of Arabic named entity recognition relied heavily on Arabic parsers and taggers combined with a huge set of gazetteers and sometimes large training sets to solve the ambiguity problem. But while these approaches are applicable to Modern Standard Arabic (MSA) text, they cannot handle colloquial Arabic. With the rapid increase in online social media usage by Arabic speakers, it is important to build an Arabic named entity recognition system that deals with both colloquial Arabic and MSA text. This paper introduces an approach for extracting Arabic persons' names without utilizing any Arabic parsers or taggers. Evaluation of the presented approach shows that it achieves high precision and an acceptable level of recall on a benchmark dataset.

Omnia Zayed, Samhaa El-Beltagy, Osama Haggag

ANEAR: Automatic Named Entity Aliasing Resolution

Identifying the different aliases used by or for an entity is emerging as a significant problem in reliable Information Extraction systems, especially with the proliferation of social media and their ever growing impact on different aspects of modern life such as politics, finance, security, etc. In this paper, we address the novel problem of Named Entity Aliasing Resolution (NEAR). We attempt to solve the NEAR problem in a language-independent setting by extracting the different aliases and variants of person named entities. We generate feature vectors for the named entities by building co-occurrence models that use different weighting schemes. The aliasing resolution process applies unsupervised machine learning techniques over the vector space models in order to produce groups of entities along with their aliases. We test our approach on two languages: Arabic and English. We study the impact of varying the level of morphological preprocessing of the words, as well as the part of speech tags surrounding the person named entities, and the named entities' distribution in the data set. We create novel evaluation data sets for both languages. NEAR yields better overall performance in Arabic than in English for comparable amounts of data, effectively using the POS tag information to improve performance. Our approach achieves an F_{β=1} score of 67.85% and 70.03% for raw English and Arabic data sets, respectively.

Ayah Zirikly, Mona Diab

Improving Candidate Generation for Entity Linking

Entity linking is the task of linking names in free text to the referent entities in a knowledge base. Most recently proposed linking systems can be broken down into two steps: candidate generation and candidate ranking. The first step retrieves candidates from the knowledge base and the second step disambiguates them. Previous work has focused on the recall of the generation step, because if the target entity is absent from the candidate set, no ranking method can return the correct result. Most recall-driven generation strategies increase the number of candidates. However, with large candidate sets, memory- and time-consuming systems are impractical for online applications. In this paper, we propose a novel candidate generation approach that generates a high-recall candidate set of small size. Experimental results on two KBP data sets show that the candidate generation recall reaches more than 93%. By leveraging our approach, the candidate number is reduced from hundreds to dozens, system runtime is reduced by 70.3% and 76.6% over the baseline, and the highest micro-averaged accuracy in the evaluation is improved by 2.2% and 3.4%.

Yuhang Guo, Bing Qin, Yuqin Li, Ting Liu, Sheng Li
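
A hypothetical sketch of compact candidate generation: look a mention up in an alias dictionary built offline from the knowledge base and keep only the top-k entities by prior popularity, trading a little recall for a much smaller candidate set. The aliases, priors and cutoff below are invented.

```python
# Dictionary-based candidate generation with a popularity cutoff.
alias_dict = {  # alias -> [(entity_id, prior)] built offline from KB anchor texts
    "jordan": [("Michael_Jordan", 0.61), ("Jordan_(country)", 0.27),
               ("Jordan_River", 0.07), ("Jordan_Peterson", 0.05)],
}

def generate_candidates(mention, k=2):
    cands = alias_dict.get(mention.lower(), [])
    return [e for e, _ in sorted(cands, key=lambda x: -x[1])[:k]]

print(generate_candidates("Jordan"))  # -> ['Michael_Jordan', 'Jordan_(country)']
```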

Person Name Recognition Using the Hybrid Approach

Arabic Person Name Recognition has been tackled mostly using either of two approaches: a rule-based or a Machine Learning (ML) based approach, each with its own strengths and weaknesses. In this paper, the problem of Arabic Person Name Recognition is tackled by integrating the two approaches in a pipelined process to create a hybrid system, with the aim of enhancing the overall performance of Person Name Recognition tasks. Extensive experiments are conducted using three different ML classifiers to evaluate the overall performance of the hybrid system. The empirical results indicate that the hybrid approach outperforms both the rule-based and the ML-based approaches. Moreover, our system outperforms the state of the art in Arabic Person Name Recognition in terms of accuracy when applied to the ANERcorp dataset, with a precision of 0.949, recall of 0.942 and F-measure of 0.945.

Mai Oudah, Khaled Shaalan

A Broadly Applicable and Flexible Conceptual Metagrammar as a Basic Tool for Developing a Multilingual Semantic Web

The paper formulates the problem of constructing a broadly applicable and flexible Conceptual Metagrammar (CM): a collection of rules enabling us to construct, step by step, a semantic representation (or text meaning representation) of a practically arbitrary sentence or discourse pertaining to broad spheres of human professional activity. The paper argues that the first version of a broadly applicable and flexible CM is already available in the scientific literature. It is conjectured that the definition of the class of SK-languages (standard knowledge languages) provided by the theory of K-representations (knowledge representations) can be interpreted as this first version. The current version of the latter theory is stated in the author's monograph published by Springer in 2010. The final part of the paper describes connections with related approaches, in particular with studies on developing a Multilingual Semantic Web.

Vladimir A. Fomichov

Short Papers

MOSAIC: A Cohesive Method for Orchestrating Discrete Analytics in a Distributed Model

Achieving an HLT analytic architecture that supports easy integration of new and legacy analytics is challenging given the independence of analytic development, the diversity of data modeling, and the need to avoid rework. Our solution is to separate input, artifacts, and results from execution by delineating different subcomponents including an inbound gateway, an executive, an analytic layer, an adapter layer, and a data bus. Using this design philosophy, MOSAIC is an architecture of replaceable subcomponents built to support workflows of loosely-coupled analytics bridged by a common data model.

Ransom Winder, Joseph Jubinski, John Prange, Nathan Giles

Ranking Search Intents Underlying a Query

Observations of search engine query logs indicate that queries are usually ambiguous. Similar to document ranking, search intents should be ranked to facilitate information search. Previous work attempts to rank intents using relevance scores alone. We argue that diversity is also important. In this work, unified models are proposed to rank the intents underlying a query by combining a relevance score with a diversity degree, where the latter is reflected by the non-overlapping ratio of each intent and the aggregated non-overlapping ratio of a set of intents. Three conclusions are drawn from the experimental results. Firstly, diversity plays an important role in intent ranking. Secondly, URLs are more effective than similarity in detecting unique subtopics. Thirdly, the aggregated non-overlapping ratio makes some contribution to similarity-based intent ranking but little to URL-based intent ranking.

Yunqing Xia, Xiaoshi Zhong, Guoyu Tang, Junjun Wang, Qiang Zhou, Thomas Fang Zheng, Qinan Hu, Sen Na, Yaohai Huang
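
A minimal sketch of combining relevance with a non-overlapping ratio: intents are picked greedily, and an intent's diversity is the fraction of its result URLs not yet covered by previously selected intents. The lambda trade-off and toy data are assumptions, not the paper's unified models.

```python
# Greedy intent ranking mixing relevance and URL-based non-overlap.
def rank_intents(intents, lam=0.5):
    """intents: list of (name, relevance, set_of_result_urls)."""
    ranked, covered, remaining = [], set(), list(intents)
    while remaining:
        def combined(it):
            name, rel, urls = it
            novel = len(urls - covered) / len(urls) if urls else 0.0
            return lam * rel + (1 - lam) * novel   # relevance + diversity
        best = max(remaining, key=combined)
        remaining.remove(best)
        ranked.append(best[0])
        covered |= best[2]
    return ranked

intents = [("jaguar car",    0.9, {"u1", "u2", "u3"}),
           ("jaguar animal", 0.6, {"u4", "u5"}),
           ("jaguar dealer", 0.7, {"u1", "u2"})]   # overlaps with "jaguar car"
print(rank_intents(intents))  # the animal intent outranks the overlapping dealer one
```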

Linguistic Sentiment Features for Newspaper Opinion Mining

The sentiment in news articles is not created through single words alone; linguistic factors invoked by different contexts also influence the opinion-bearing words. In this paper, we apply various commonly used approaches for sentiment analysis and extend this research by analysing semantic features and their influence on the sentiment. We use a machine learning approach to learn from these features and influences and to classify the resulting sentiment. The evaluation is performed on two datasets containing over 4,000 German news articles and illustrates that this technique can increase performance.

Thomas Scholz, Stefan Conrad

Text Classification of Technical Papers Based on Text Segmentation

The goal of this research is to design a multi-label classification model which determines the research topics of a given technical paper. Based on the idea that papers are well organized and some parts of papers are more important than others for text classification, segments such as the title, abstract, introduction and conclusion are intensively used in the text representation. In addition, new features called Title Bi-Gram and Title SigNoun are used to improve the performance. The results of the experiments indicate that feature selection based on text segmentation and these two features are effective. Furthermore, we propose a new model for text classification based on the structure of papers, called the Back-off model, which achieves a 60.45% Exact Match Ratio and 68.75% F-measure. It is also shown that the Back-off model outperforms two existing methods, ML-kNN and Binary Approach.

Thien Hai Nguyen, Kiyoaki Shirai

Product Features Categorization Using Constrained Spectral Clustering

Opinion mining has increasingly become a valuable practice to grasp public opinions towards various products and related features. However, for the same feature, people may express it using different but related words and phrases. It is helpful to categorize these words and phrases, which are domain synonyms, under the same feature group to produce an effective opinion summary. In this paper, we propose a novel semi-supervised product feature categorization strategy using constrained spectral clustering. Different from existing methods that cluster product features using lexical and distributional similarities, we exploit the morphological and contextual characteristics between product features as prior constraint knowledge to enhance the categorization process. Experimental evaluation on a real-life dataset demonstrates that our proposed method achieves better results than the baselines.

Sheng Huang, Zhendong Niu, Yulong Shi

A New Approach for Improving Cross-Document Knowledge Discovery Using Wikipedia

In this paper, we present a new model that incorporates the extensive knowledge derived from Wikipedia for cross-document knowledge discovery. The model proposed here is based on our previously introduced Concept Chain Queries (CCQ), a special case of text mining focusing on detecting semantic relationships between two concepts across multiple documents. We attempt to overcome the limitations of CCQ by building a semantic kernel for concept closeness computation to complement the knowledge existing in the text corpus. The experimental evaluation demonstrates that the kernel-based approach outperforms the original CCQ approach in ranking the important chains retrieved in the search results.

Peng Yan, Wei Jin

Using Grammar-Profiles to Intrinsically Expose Plagiarism in Text Documents

Intrinsic plagiarism detection deals with the task of finding plagiarized sections in text documents without using a reference corpus. This paper describes a novel approach in this field that analyzes the grammar of authors and uses sliding windows to find significant differences in writing style. To find suspicious text passages, the algorithm splits a document into single sentences, calculates syntax grammar trees and builds profiles based on frequently used grammar patterns. The text is then traversed with a sliding window, and each window is compared to the document profile using a distance metric. Finally, all sentences whose distance is significantly higher according to a fitted Gaussian normal distribution are marked as suspicious. A preliminary evaluation of the algorithm shows very promising results.

Michael Tschuggnall, Günther Specht
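
The sliding-window test can be pictured with a toy version: build a document profile of frequent grammar patterns, slide a window over the sentences, and flag windows whose distance exceeds the mean plus two standard deviations. The one-pattern-per-sentence encoding and the mismatch distance below are simplifications of the paper's syntax-tree profiles.

```python
# Flag stylistically deviant windows against a document-level grammar profile.
import statistics

doc_patterns = ["NP-VP", "NP-VP", "NP-VP", "NP-VP",
                "NP-VP", "NP-VP", "VP-NP-SBAR", "VP-NP-SBAR"]

def distance(window, profile):
    return sum(1 for p in window if p != profile)  # crude mismatch count

profile = max(set(doc_patterns), key=doc_patterns.count)  # dominant pattern
win = 2
dists = [distance(doc_patterns[i:i + win], profile)
         for i in range(len(doc_patterns) - win + 1)]
mu, sigma = statistics.mean(dists), statistics.pstdev(dists)

for i, d in enumerate(dists):
    if d > mu + 2 * sigma:                         # Gaussian-style threshold
        print("suspicious window at sentences", i, "-", i + win - 1)
```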

Feature Selection Methods in Persian Sentiment Analysis

With the enormous growth of digital content on the internet, various types of online reviews, such as product and movie reviews, present a wealth of subjective information that can be very helpful for potential users. Sentiment analysis aims to use automated tools to detect subjective information in reviews. Little research has been conducted on feature selection in sentiment analysis, and work on Persian sentiment analysis is rarer still. This paper considers the problem of sentiment classification using different feature selection methods for online customer reviews in the Persian language. Three challenges of Persian text are its wide variety of declensional suffixes, inconsistent word spacing, and many informal or colloquial words. In this paper we address these challenges by proposing a model for sentiment classification of Persian review documents. The proposed model is based on stemming and feature selection, and employs the Naive Bayes algorithm for classification. We evaluate the performance of the model on a collection of cellphone reviews, where the results show the effectiveness of the proposed approaches.

Mohamad Saraee, Ayoub Bagheri

Towards the Refinement of the Arabic Soundex

In this paper, we present phonetic encoding functions that play the role of hash functions in the indexation of an Arabic dictionary. They allow us to answer approximate queries that, given a query word, ask for all the words that are phonetically similar to it. They consider the phonetic features of the standard Arabic language and involve some possible phonetic alterations induced by specific habits in the pronunciation of Arabic.

We propose two functions: the first is called the "Algerian Dialect Refinement" and takes into account phonetic confusions common among Algerian speakers of Arabic; the second is named the "Speech Therapy Refinement" and examines some mispronunciations common to children.

Nedjma Djouhra Ousidhoum, Nacéra Bensaou

An RDF-Based Semantic Index

Managing very large amounts of digital documents efficiently and effectively requires the definition of novel indexes able to capture and express document semantics. In this work, we propose a novel semantic indexing technique particularly suitable for knowledge management applications. Algorithms and data structures are presented and preliminary experiments are reported, showing the efficiency and effectiveness of the proposed index for semantic queries.

F. Amato, F. Gargiulo, A. Mazzeo, V. Moscato, A. Picariello

Experiments in Producing Playful “Explanations” for Given Names (Anthroponyms) in Hebrew and English

In this project, we investigate the generation of wordplay that can serve as playful “explanations” for given names. We present a working system (part of work in progress), which segments and/or manipulates input names. The system does so by decomposing them into sequences (or phrases) composed of at least two words and/or transforming them into other words. Experiments reveal that the output stimulates human users into completing explanations creatively, even without sophisticated derivational grammar. This research applies to two languages: Hebrew and English. The applied transformations are: addition of a letter, deletion of a letter and replacement of a similar letter. Experiments performed in these languages show that in Hebrew the input and output are perceived to be reasonably associated; whereas, the English output, if perceived to be acceptable rather than absurd, is accepted as a humorous pun.

Yaakov HaCohen-Kerner, Daniel Nisim Cohen, Ephraim Nissan

Collaborative Enrichment of Electronic Dictionaries Standardized-LMF

Collaborative enrichment is a new tendency in constructing resources, notably electronic dictionaries. This approach is considered very efficient for weakly structured resources. In this paper, we deal with applying collaborative enrichment to electronic dictionaries standardized according to LMF-ISO 24613. The models of such dictionaries are complex and finely structured. The purpose of the paper is, on the one hand, to expose the challenges related to this framework and, on the other hand, to propose practical solutions based on an appropriate approach. This approach ensures the properties of completeness, consistency and non-redundancy of lexical data. In order to illustrate the proposed approach, we describe the experimentation carried out on a standardized Arabic dictionary.

Aida Khemakhem, Bilel Gargouri, Abdelmajid Ben Hamadou

Enhancing Machine Learning Results for Semantic Relation Extraction

This paper describes a large-scale method to extract semantic relations between named entities. It is characterized by a large number of relations and can be applied to various domains and languages. Our approach is based on rule mining from an Arabic corpus using lexical, semantic and numerical features.

Three main steps are needed: firstly, we extract the learning features from annotated examples. Then, a set of rules is generated automatically using three learning algorithms: Apriori, Tertius and the decision tree algorithm C4.5. Finally, we add a module for selecting significant rules, in which we use an automatic technique based on extensive experiments. We achieved satisfactory results when the method was applied to our test corpus.

Ines Boujelben, Salma Jamoussi, Abdelmajid Ben Hamadou

GenDesc: A Partial Generalization of Linguistic Features for Text Classification

This paper presents an application that belongs to the automatic classification of textual data by supervised learning algorithms. The aim is to study how a better representation of textual data can improve the quality of classification. Considering that a word's meaning depends on its context, we propose to use features that give important information about word contexts. We present a method named GenDesc, which generalizes (with POS tags) the least relevant words for the classification task.

Guillaume Tisserant, Violaine Prince, Mathieu Roche

Entangled Semantics

In the context of monolingual and bilingual retrieval, Simple Knowledge Organisation System (SKOS) datasets can play a dual role as knowledge bases for semantic annotations and as language-independent resources for translation. As there are no formal evaluations of these aspects for datasets in SKOS format on record, we describe a case study on the usage of the Thesaurus for the Social Sciences in SKOS format for a retrieval setup based on the CLEF 2004-2006 Domain-Specific Track topics, documents and relevance assessments. Results showed a mixed picture, with significant system-level improvements in terms of mean average precision in the bilingual runs. Our experiments set a new and improved baseline for using SKOS-based datasets with the GIRT collection and are an example of component-based evaluation.

Diana Tanase, Epaminondas Kapetanios

Poster Papers

Phrase Table Combination Deficiency Analyses in Pivot-Based SMT

As parallel corpora are not always available, pivot languages were introduced to address parallel-corpus sparseness in statistical machine translation. In this paper, we carried out several phrase-based SMT experiments and analyzed in detail the reasons for the decline in translation performance. Experimental results indicated that both the coverage rate of phrase pairs and the accuracy of translation probabilities affect the quality of translation.

Yiming Cui, Conghui Zhu, Xiaoning Zhu, Tiejun Zhao, Dequan Zheng

Analysing Customers Sentiments: An Approach to Opinion Mining and Classification of Online Hotel Reviews

Customer opinion holds a very important place in product and service businesses, especially for companies and potential customers. In recent years, opinions have become even more important due to global Internet usage as a pool of opinions. Unfortunately, looking through customer reviews and extracting information to improve a service is difficult work due to the large number of existing reviews. In this work we present a system designed to mine client opinions, classify them as positive or negative, and classify them according to the hotel features they refer to. To obtain this classification we use a machine learning classifier, reinforced with lexical resources to extract polarity and a specialized hotel feature taxonomy.

Juan Sixto, Aitor Almeida, Diego López-de-Ipiña

An Improved Discriminative Category Matching in Relation Identification

This paper describes an improved method for relation identification, which is the last step of unsupervised relation extraction. Similar entity pairs may be grouped into the same cluster. It is also important to select a key word that describes the relation accurately. Therefore, an improved DF feature selection method is employed to rearrange the features of low-frequency entity pairs in order to obtain a feature set for each cluster. Then we use an improved Discriminative Category Matching (DCM) method to select typical and discriminative words for the relations of entity pairs. Our experimental results show that the improved DCM method is better than the original DCM method at relation identification.

Yongliang Sun, Jing Yang, Xin Lin

Extracting Fine-Grained Entities Based on Coordinate Graph

Most previous entity extraction studies focus on a small set of coarse-grained classes, such as person. However, the distribution of entities within the query logs of search engines indicates that users are more interested in a wider range of fine-grained entities, such as GRAMMY winner and Ivy League member. In this paper, we present a semi-supervised method to extract fine-grained entities from an open-domain corpus. We build a graph based on entities in coordinate lists, which are HTML nodes with the same tag path in the DOM trees. Then class labels are propagated over the graph from known entities to unknown ones. Experiments on a large corpus from the ClueWeb09a dataset show that our proposed approach achieves promising results.

Qing Yang, Peng Jiang, Chunxia Zhang, Zhendong Niu

NLP-Driven Event Semantic Ontology Modeling for Story

This paper presents an NLP-driven approach to semantic ontology modeling for unstructured data from Chinese children's stories. We use a weakly supervised approach to capture n-ary facts based on the output of a dependency parser and regular expressions. After post-processing the n-ary facts, we populate the extracted event facts into SOSDL (Story-Oriented Semantic Language), an event ontology designed for modeling the semantic elements and relations of events, to form a machine-readable format. Experiments indicate the soundness and feasibility of our approach.

Chun-Ming Gao, Qiu-Mei Xie, Xiao-Lan Wang

The Development of an Ontology for Reminiscence

The research presented in this paper investigates the construction of an ontology of reminiscence and the feasibility of its use in a conversational agent (CA) with suitable reminiscence mechanisms, for non-clinical use within a healthy aging population who may have memory loss as part of normal aging, thereby improving subjective well-being (SWB).

Collette Curry, James O’Shea, Keeley Crockett, Laura Brown

Chinese Sentence Analysis Based on Linguistic Entity-Relationship Model

We propose a new model called the linguistic entity relationship model (LERM) for Chinese syntactic parsing. In this model, we implement the analysis algorithm based on the analysis and verification of linguistic entity relationship modes, which are extracted and defined to describe the most basic syntactic and semantic structures. In contrast with corpus-based and rule-based methods, we neither manually write a large number of rules, as in traditional rule-based methods, nor use a corpus to train the model; we only use a few meta-rules to describe the grammars. A Chinese syntactic parsing system based on the model has been developed, and its syntactic parsing performance outperforms that of a corpus-based baseline system.

Dechun Yin

A Dependency Graph Isomorphism for News Sentence Searching

Given that the amount of news being published is only increasing, an effective search tool is invaluable to many Web-based companies. With word-based approaches ignoring much of the information in texts, we propose Destiny, a linguistic approach that leverages the syntactic information in sentences by representing sentences as graphs with disambiguated words as nodes and grammatical relations as edges. Destiny performs approximate sub-graph isomorphism on the query graph and the news sentence graphs, exploiting word synonymy as well as hypernymy. Employing a custom corpus of user-rated queries and sentences, the algorithm is evaluated using the normalized Discounted Cumulative Gain, Spearman’s Rho, and Mean Average Precision and it is shown that Destiny performs significantly better than a TF-IDF baseline on the considered measures and corpus.

Kim Schouten, Flavius Frasincar
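
The matching idea can be sketched on two tiny dependency graphs: node pairs are scored by lemma identity or synonymy (where WordNet synonymy and hypernymy would plug in), and edge pairs by matching grammatical relations. The graphs, synonym set and scores are invented, and this is not the authors' exact algorithm.

```python
# Score a candidate mapping between a query graph and a sentence graph.
query = {"nodes": {1: "buy", 2: "company"},
         "edges": {(1, 2): "dobj"}}
sentence = {"nodes": {1: "acquire", 2: "firm", 3: "quickly"},
            "edges": {(1, 2): "dobj", (1, 3): "advmod"}}

SYNONYMS = {("buy", "acquire"), ("company", "firm")}  # stand-in for WordNet

def node_sim(a, b):
    return 1.0 if a == b else 0.8 if (a, b) in SYNONYMS else 0.0

def match_score(q, s, mapping):
    node = sum(node_sim(q["nodes"][qn], s["nodes"][sn])
               for qn, sn in mapping.items())
    edge = sum(1.0 for (a, b), rel in q["edges"].items()
               if s["edges"].get((mapping[a], mapping[b])) == rel)
    return node + edge

print(match_score(query, sentence, {1: 1, 2: 2}))  # -> 2.6
```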

Unsupervised Gazette Creation Using Information Distance

The Named Entity extraction (NEX) problem consists of automatically constructing a gazette containing instances of each NE of interest. NEX is important for domains which lack a corpus with tagged NEs. In this paper, we propose a new unsupervised (bootstrapping) NEX technique, based on a new variant of the Multiword Expression Distance (MED) [1] and information distance [2]. The efficacy of our method is shown through comparison with BASILISK and PMI in the agriculture domain. Our method discovered 8 new diseases which are not found in Wikipedia.

Sangameshwar Patil, Sachin Pawar, Girish K. Palshikar, Savita Bhat, Rajiv Srivastava

A Multi-purpose Online Toolset for NLP Applications

This paper presents a new implementation of the multi-purpose set of NLP tools for Polish, made available online in a common web service framework. The tool set comprises a morphological analyzer, a tagger, a named entity recognizer, a dependency parser, a constituency parser and a coreference resolver. Additionally, a web application offering chaining capabilities and a common BRAT-based presentation framework is presented.

Maciej Ogrodniczuk, Michał Lenart

A Test-Bed for Text-to-Speech-Based Pedestrian Navigation Systems

This paper presents an Android system to support eyes-free, hands-free navigation through a city. The system operates in two distinct modes: manual and automatic. In manual mode, a human operator sends text messages which are realized via TTS into the subject's earpiece. The operator sees the subject's GPS position on a map, hears the subject's speech, and sees a 1 fps movie taken from the subject's phone, worn as a necklace. In automatic mode, a programmed controller attempts to achieve the same guidance task as the human operator.

We have fully built our manual system and have verified that it can be used to successfully guide pedestrians through a city. All activities are logged into a single, large database state. We are building a series of automatic controllers which require us to confront a set of research challenges, some of which we briefly discuss in this paper. We plan to demonstrate our work live at NLDB.

Michael Minock, Johan Mollevik, Mattias Åsander, Marcus Karlsson

Automatic Detection of Arabic Causal Relations

The work described in this paper concerns the automatic detection and extraction of causal relations that are explicitly expressed in Modern Standard Arabic (MSA) texts. In this initial study, a set of linguistic patterns was derived to indicate the presence of cause-effect information in sentences from open-domain texts. The patterns were constructed based on a set of syntactic features acquired by analyzing a large untagged Arabic corpus, so that the parts of a sentence representing the cause and those representing the effect can be distinguished. To the best of the researchers' knowledge, no previous studies have dealt with this type of relation for the Arabic language.

Jawad Sadek

A Framework for Employee Appraisals Based on Inductive Logic Programming and Data Mining Methods

This paper develops a new semantic framework that supports employee performance appraisals, based on inductive logic programming and data mining techniques. The framework is applied to learn a grammar for writing SMART objectives and provide feedback. The paper concludes with an empirical evaluation of the framework which shows promising results.

Darah Aqel, Sunil Vadera

A Method for Improving Business Intelligence Interpretation through the Use of Semantic Technology

Although business intelligence applications are increasingly important for business operations, the interpretation of results from business intelligence tools relies greatly upon the knowledge and experience of the analyst. This research develops a methodology for capturing the knowledge of employees carrying out the interpretation of business intelligence output. The methodology requires the development of targeted ontologies that contribute to analyzing the output. An architecture is presented.

Shane Givens, Veda Storey, Vijayan Sugumaran

Code Switch Point Detection in Arabic

This paper introduces a dual-mode stochastic system to automatically identify linguistic code switch points in Arabic. The first of these modes determines the most likely word tag (i.e., dialectal or Modern Standard Arabic) by choosing the sequence of Arabic word tags with maximum marginal probability via lattice search and 5-gram probability estimation. When words are out of vocabulary, the system switches to the second mode, which uses dialectal Arabic (DA) and Modern Standard Arabic (MSA) morphological analyzers. If the OOV word is analyzable using the DA morphological analyzer only, it is tagged as "DA"; if it is analyzable using the MSA morphological analyzer only, it is tagged as "MSA"; otherwise, if it is analyzable using both, it is tagged as "both". The system yields an F_{β=1} score of 76.9% on the development dataset and 76.5% on the held-out test dataset, both judged against human-annotated Egyptian forum data.

Heba Elfardy, Mohamed Al-Badrashiny, Mona Diab
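
The second (OOV fallback) mode reduces to a simple decision rule, sketched below with stand-in vocabularies in place of the real DA and MSA morphological analyzers; the example words are invented.

```python
# Tag an out-of-vocabulary word by which morphological analyzer accepts it.
DA_VOCAB = {"ezayak", "keda"}             # pretend dialectal analyzer coverage
MSA_VOCAB = {"kitab", "madrasa", "keda"}  # pretend MSA analyzer coverage

def tag_oov(word):
    da, msa = word in DA_VOCAB, word in MSA_VOCAB
    if da and msa:
        return "both"
    if da:
        return "DA"
    if msa:
        return "MSA"
    return "unknown"

print([tag_oov(w) for w in ["ezayak", "kitab", "keda", "xyz"]])
# -> ['DA', 'MSA', 'both', 'unknown']
```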

SurveyCoder: A System for Classification of Survey Responses

Survey coding is the process of analyzing text responses to open-ended questions in surveys. We present SurveyCoder, a research prototype which helps the survey analysts to achieve significant automation of the survey coding process. SurveyCoder’s multi-label text classification algorithm makes use of a knowledge base that consists of linguistic resources, historical data, domain specific rules and constraints. Our method is applicable to surveys carried out in different domains.

Sangameshwar Patil, Girish K. Palshikar

Rhetorical Representation and Vector Representation in Summarizing Arabic Text

This paper examines the benefits of both a Rhetorical Representation and a Vector Representation for Arabic text summarization. The Rhetorical Representation uses Rhetorical Structure Theory (RST) to build the Rhetorical Structure Tree (RS-Tree) and extracts the most significant paragraphs as a summary. The Vector Representation, on the other hand, uses a cosine similarity measure to rank and extract the most significant paragraphs as a summary. The framework evaluates both summaries using precision. Statistical results show that the Rhetorical Representation is superior to the Vector Representation. Moreover, the rhetorical summary keeps the text in context and avoids the loss of cohesion that arises when anaphoric references are broken, i.e., it improves the ability to extract the semantics behind the text.

Ahmed Ibrahim, Tarek Elghazaly
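
The Vector Representation side reduces to a familiar recipe, sketched here: rank paragraphs by cosine similarity to the whole document and extract the top ones. The toy paragraphs and the TF-IDF weighting are illustrative assumptions, not the paper's exact setup.

```python
# Cosine-similarity paragraph ranking for extractive summarization.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paragraphs = [
    "The treaty was signed by both governments after long negotiations.",
    "Weather on the day of the ceremony was unusually warm.",
    "The agreement establishes new trade rules between the two countries.",
]
vec = TfidfVectorizer()
P = vec.fit_transform(paragraphs)
D = vec.transform([" ".join(paragraphs)])      # whole-document vector
scores = cosine_similarity(P, D).ravel()
ranked = sorted(zip(scores, paragraphs), reverse=True)
print([p for _, p in ranked[:2]])              # two most central paragraphs
```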

Backmatter
