2005 | Book

Natural Language Processing – IJCNLP 2005

Second International Joint Conference, Jeju Island, Korea, October 11-13, 2005. Proceedings

Edited by: Robert Dale, Kam-Fai Wong, Jian Su, Oi Yee Kwong

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

Table of Contents

Frontmatter

Information Retrieval

A New Method for Sentiment Classification in Text Retrieval

Traditional text categorization is usually a topic-based task, but a subtler demand in information retrieval is to distinguish between positive and negative views on a text's topic. In this paper, a new method is explored to solve this problem. Firstly, a set of Concerned Concepts in the domain under study is predefined. Secondly, special knowledge representing the positive or negative context of these concepts within sentences is built up. Finally, an evaluating function based on this knowledge is defined for the sentiment classification of free text. We introduce linguistic knowledge into these procedures to make our method effective. As a result, the new method proves better than SVM in experiments on Chinese texts about a given topic.

Yi Hu, Jianyong Duan, Xiaoming Chen, Bingzhen Pei, Ruzhan Lu
Topic Tracking Based on Linguistic Features

This paper explores two linguistically motivated restrictions on the set of words used for topic tracking in newspaper articles: named entities and headline words. We assume that named entities are one of the linguistic features relevant to topic tracking, since both topics and events are related to a specific place and time in a story. The basic idea behind using headline words for the tracking task is that a headline is a compact representation of the original story, which helps people quickly grasp the most important information it contains. Headline words are generated automatically using a headline generation technique. The method was tested on the Mainichi Shimbun newspaper in Japanese, and the topic tracking results show that the system works well even with a small amount of positive training data.

Fumiyo Fukumoto, Yusuke Yamaji
The Use of Monolingual Context Vectors for Missing Translations in Cross-Language Information Retrieval

For cross-language text retrieval systems that rely on bilingual dictionaries to bridge the language gap between the source query language and the target document language, good bilingual dictionary coverage is imperative. For terms with missing translations, most systems employ some approach to expanding the existing translation dictionaries. In this paper, instead of lexicon expansion, we explore whether using the context of the unknown terms can help mitigate the loss of meaning due to missing translations. Our approach consists of two steps: (1) identifying terms that are closely associated with the unknown source-language terms as context vectors, and (2) using the translations of the associated terms in the context vectors as surrogate translations of the unknown terms. We describe a query-independent version and a query-dependent version using such monolingual context vectors. These methods are evaluated on Japanese-to-English retrieval using the NTCIR-3 topics and data sets. Empirical results show that both methods improved CLIR performance for short and medium-length queries, and that the query-dependent context vectors performed better than the query-independent versions.

Yan Qu, Gregory Grefenstette, David A. Evans
Automatic Image Annotation Using Maximum Entropy Model

Automatic image annotation is a newly developed and promising technique for providing semantic image retrieval via text descriptions. It is the process of automatically labeling image contents with a predefined set of keywords that represent the image semantics. A Maximum Entropy Model-based approach to automatic image annotation is proposed in this paper. In the training phase, a basic visual vocabulary consisting of blob-tokens describing image content is generated first; then the statistical relationship between the blob-tokens and keywords is modeled by a Maximum Entropy Model constructed from the training set of labeled images. In the annotation phase, the most likely associated keywords for an unlabeled image are predicted from the blob-token set extracted from the given image. We carried out experiments on a medium-sized image collection of about 5,000 images from Corel Photo CDs. The experimental results demonstrate that this method outperforms some traditional annotation methods by about 8% in mean precision, showing the potential of the Maximum Entropy Model for automatic image annotation.

Wei Li, Maosong Sun

Corpus-Based Parsing

Corpus-Based Analysis of Japanese Relative Clause Constructions

Japanese relative clause constructions (RCCs) are defined as NPs of the structure 'S NP', noting the lack of a relative pronoun or any other explicit form of noun-clause demarcation. Japanese relative clause modification should be classified into at least two major semantic types: case-slot gapping and head restrictive. However, these types of relative clause modification are not overtly distinguished. In this paper we propose a method of identifying an RCC's type with a machine learning technique. The features used in our approach not only represent the characteristics of RCCs but are also automatically obtained from large corpora. The results of our evaluation revealed that our method outperformed the traditional case-frame-based method, and that the features we presented were effective in identifying RCC types.

Takeshi Abekawa, Manabu Okumura
Parsing Biomedical Literature

We present a preliminary study of several parser adaptation techniques evaluated on the GENIA corpus of MEDLINE abstracts [1,2]. We begin by observing that the Penn Treebank (PTB) is lexically impoverished when measured on various genres of scientific and technical writing, and that this significantly impacts parse accuracy. To resolve this without requiring in-domain treebank data, we show how existing domain-specific lexical resources may be leveraged to augment PTB-training: part-of-speech tags, dictionary collocations, and named-entities. Using a state-of-the-art statistical parser [3] as our baseline, our lexically-adapted parser achieves a 14.2% reduction in error. With oracle-knowledge of named-entities, this error reduction improves to 21.2%.

Matthew Lease, Eugene Charniak
Parsing the Penn Chinese Treebank with Semantic Knowledge

We build a class-based selectional preference sub-model to incorporate external semantic knowledge from two Chinese electronic semantic dictionaries. This sub-model is combined with a modifier-head generation sub-model. After being optimized on held-out data by the EM algorithm, our improved parser achieves 79.4% (F1 measure) on the Penn Chinese Treebank (CTB), a 4.4% relative decrease in error rate. Further analysis of the performance improvement indicates that semantic knowledge is helpful for nominal compounds, coordination, and N⋄V tagging disambiguation, as well as for alleviating the sparseness of information available in the treebank.

Deyi Xiong, Shuanglong Li, Qun Liu, Shouxun Lin, Yueliang Qian
Using a Partially Annotated Corpus to Build a Dependency Parser for Japanese

We explore the use of a partially annotated corpus to build a dependency parser for Japanese. We examine two types of partially annotated corpora. We find that a parser trained on a corpus that has no grammatical tags for words can achieve an accuracy of 87.38%, which is comparable to the current state-of-the-art accuracy on the Kyoto University Corpus. In contrast, a parser trained on a corpus that has only dependency annotations for each pair of adjacent bunsetsus (chunks) shows moderate performance. Nonetheless, it is notable that features based on character n-grams are found to be very useful for a Japanese dependency parser.

Manabu Sassano

Web Mining

Entropy as an Indicator of Context Boundaries: An Experiment Using a Web Search Engine

Previous works have suggested that the uncertainty of tokens coming after a sequence helps determine whether a given position is at a context boundary. This feature of language has been applied to unsupervised text segmentation and term extraction. In this paper, we fundamentally verify this feature. An experiment was performed using a web search engine, in order to clarify the extent to which this assumption holds. The verification was applied to Chinese and Japanese.

Kumiko Tanaka-Ishii
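
The branching-entropy idea in the abstract above lends itself to a small illustration. The sketch below is not the authors' web-search-based setup; it is a minimal, assumption-laden version that estimates the entropy of the character following each prefix of a string from a tiny in-memory corpus and flags positions where the entropy rises as candidate boundaries. The corpus, the threshold min_rise, and the function names are placeholders.

```python
import math
from collections import Counter

def successor_entropy(corpus, prefix):
    """Entropy (in bits) of the character that follows `prefix` in the corpus."""
    followers = Counter()
    for text in corpus:
        start = 0
        while True:
            i = text.find(prefix, start)
            if i == -1 or i + len(prefix) >= len(text):
                break
            followers[text[i + len(prefix)]] += 1
            start = i + 1
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in followers.values())

def boundary_candidates(corpus, string, min_rise=0.5):
    """Flag positions where successor entropy rises, a cue for a context boundary."""
    candidates = []
    prev = successor_entropy(corpus, string[:1])
    for k in range(2, len(string) + 1):
        h = successor_entropy(corpus, string[:k])
        if h - prev > min_rise:
            candidates.append(k)
        prev = h
    return candidates

# Toy corpus of unsegmented strings; real experiments would query a search engine.
corpus = ["naturallanguageprocessing", "naturallanguagegeneration", "languageprocessingtools"]
print(boundary_candidates(corpus, "naturallanguage"))
```
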
Automatic Discovery of Attribute Words from Web Documents

We propose a method of acquiring attribute words for a wide range of objects from Japanese Web documents. The method is a simple unsupervised method that utilizes the statistics of words, lexico-syntactic patterns, and HTML tags. To evaluate the attribute words, we also establish criteria and a procedure based on question-answerability about the candidate word.

Kosuke Tokunaga, Jun’ichi Kazama, Kentaro Torisawa
Aligning Needles in a Haystack: Paraphrase Acquisition Across the Web

This paper presents a lightweight method for unsupervised extraction of paraphrases from arbitrary textual Web documents. The method differs from previous approaches to paraphrase acquisition in that 1) it removes the assumptions on the quality of the input data, by using inherently noisy, unreliable Web documents rather than clean, trustworthy, properly formatted documents; and 2) it does not require any explicit clue indicating which documents are likely to encode parallel paraphrases, as they report on the same events or describe the same stories. Large sets of paraphrases are collected through exhaustive pairwise alignment of small needles, i.e., sentence fragments, across a haystack of Web document sentences. The paper describes experiments on a set of about one billion Web documents, and evaluates the extracted paraphrases in a natural-language Web search application.

Marius Paşca, Péter Dienes
Confirmed Knowledge Acquisition Using Mails Posted to a Mailing List

In this paper, we first discuss a problem in developing a knowledge base from natural language documents: wrong information. It is almost inevitable that natural language documents, especially web documents, contain wrong information. It is therefore important to investigate methods for detecting and correcting wrong information in natural language documents when developing a knowledge base from them. We report a method of detecting wrong information in mails posted to a mailing list and of developing a knowledge base from these mails. We then describe a QA system that can answer how-type questions based on the knowledge base, and show that question and answer mails posted to a mailing list can be used as a knowledge base for a QA system.

Yasuhiko Watanabe, Ryo Nishimura, Yoshihiro Okada

Rule-Based Parsing

Automatic Partial Parsing Rule Acquisition Using Decision Tree Induction

Partial parsing techniques try to recover syntactic information efficiently and reliably by sacrificing completeness and depth of analysis. One of the difficulties of partial parsing is finding a means to extract the grammar involved automatically. In this paper, we present a method for automatically extracting partial parsing rules from a tree-annotated corpus using decision tree induction. We define partial parsing rules as those that can decide the structure of a substring in an input sentence deterministically. This decision can be considered a classification: for a substring in an input sentence, a proper structure is chosen among the structures observed in the corpus. For the classification, we use decision tree induction, and we induce partial parsing rules from the decision tree. The acquired grammar is similar to a phrase structure grammar with contextual and lexical information, but it allows building structures of depth one or more. Our experiments show that the proposed partial parser using the automatically extracted rules is not only accurate and efficient, but also achieves reasonable coverage for Korean.

Myung-Seok Choi, Chul Su Lim, Key-Sun Choi
Chunking Using Conditional Random Fields in Korean Texts

We present a method of chunking Korean texts using conditional random fields (CRFs), a recently introduced probabilistic model for labeling and segmenting sequences of data. In agglutinative languages such as Korean and Japanese, rule-based chunking methods are predominantly used for their simplicity and efficiency. A hybrid of rule-based and machine learning methods has also been proposed to handle exceptional cases of the rules. In this paper, we show how CRFs can be applied to the task of chunking Korean texts. Experiments on the STEP 2000 dataset show that the proposed method significantly improves performance and outperforms previous systems.

Yong-Hun Lee, Mi-Young Kim, Jong-Hyeok Lee
High Efficiency Realization for a Wide-Coverage Unification Grammar

We give a detailed account of an algorithm for efficient tactical generation from underspecified logical-form semantics, using a wide-coverage grammar and a corpus of real-world target utterances. Some earlier claims about chart realization are critically reviewed and corrected in the light of a series of practical experiments. As well as a set of algorithmic refinements, we present two novel techniques: the integration of subsumption-based local ambiguity factoring, and a procedure to selectively unpack the generation forest according to a probability distribution given by a conditional, discriminative model.

John Carroll, Stephan Oepen
Linguistically-Motivated Grammar Extraction, Generalization and Adaptation

In order to obtain a high-precision and high-coverage grammar, we proposed a model for measuring grammar coverage and designed a PCFG parser to measure the efficiency of the grammar. To generalize grammars, a grammar binarization method was proposed to increase the coverage of a probabilistic context-free grammar. At the same time, linguistically motivated feature constraints were added to grammar rules to maintain the precision of the grammar. The generalized grammar increases grammar coverage from 93% to 99% and bracketing F-score from 87% to 91% in parsing Chinese sentences. To cope with error propagation from word segmentation and part-of-speech tagging errors, we also proposed a grammar blending method to adapt to such errors. The blended grammar can reduce about 20-30% of the parsing errors caused by incorrect part-of-speech assignments made by a word segmentation system.

Yu-Ming Hsieh, Duen-Chi Yang, Keh-Jiann Chen

Disambiguation

PP-Attachment Disambiguation Boosted by a Gigantic Volume of Unambiguous Examples

We present a PP-attachment disambiguation method based on a gigantic volume of unambiguous examples extracted from a raw corpus. The unambiguous examples are used to acquire precise lexical preferences for PP-attachment disambiguation. Attachment decisions are made by a machine learning method that optimizes the use of the lexical preferences. Our experiments indicate that the precise lexical preferences work effectively.

Daisuke Kawahara, Sadao Kurohashi
Adapting a Probabilistic Disambiguation Model of an HPSG Parser to a New Domain

This paper describes a method of adapting a domain-independent HPSG parser to a biomedical domain. Without modifying the grammar or the probabilistic model of the original HPSG parser, we develop a log-linear model with additional features on a treebank of the biomedical domain. Since the treebank of the target domain is limited, we need to exploit the original disambiguation model, which was trained on a larger treebank. Our model incorporates the original model as a reference probability distribution. Experimental results for our model trained on a small treebank demonstrate an improvement in parsing accuracy.

Tadayoshi Hara, Yusuke Miyao, Jun’ichi Tsujii
A Hybrid Approach to Single and Multiple PP Attachment Using WordNet

The problem of prepositional phrase attachment is crucial to various natural language processing tasks and has received wide attention in the literature. In this paper, we propose an algorithm to disambiguate between PP attachment sites. The algorithm uses a combination of supervised and unsupervised learning along with WordNet information, implemented using a back-off model. Our use of available lexical knowledge sources in combination with large unannotated corpora generalizes existing algorithms with improved performance. The algorithm achieved an average accuracy of 86.68% over three test data sets with 100% recall. It is further extended to deal with the multiple PP attachment problem using training based on single PP attachment sites, and it shows improvement over earlier work on multiple PP attachment.

Akshar Bharathi, U. Rohini, P. Vishnu, S. M. Bendre, Rajeev Sangal
Period Disambiguation with Maxent Model

This paper presents our recent work on period disambiguation, the core problem in sentence boundary identification, with the maximum entropy (Maxent) model. A number of experiments are conducted on the PTB-II WSJ corpus to investigate how the context window, the feature space, and lexical information such as abbreviated and sentence-initial words affect the learning performance. Such lexical information can be automatically acquired from a training corpus by a learner. Our experimental results show that extending the feature space to integrate these two kinds of lexical information eliminates 93.52% of the remaining errors from the baseline Maxent model, achieving an F-score of 99.8227%.

Chunyu Kit, Xiaoyue Liu
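
As a rough sketch of the kind of classifier the abstract above describes (not the authors' system), the following trains a binary maximum entropy model, here scikit-learn's logistic regression, on context-window and lexical features around each period. The toy sentences, the abbreviation list, and the feature set are illustrative assumptions only.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def period_features(tokens, i, abbreviations):
    """Context-window features for the period at token position i."""
    prev_tok = tokens[i - 1] if i > 0 else "<s>"
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {
        "prev": prev_tok.lower(),
        "next": next_tok.lower(),
        "prev_is_abbrev": prev_tok.lower() in abbreviations,
        "prev_is_single_char": len(prev_tok) == 1,
        "next_is_capitalized": next_tok[:1].isupper(),
    }

# Toy training data: (tokens, index of '.', 1 if sentence boundary else 0).
abbrevs = {"dr", "mr", "inc", "co"}
samples = [
    (["He", "met", "Dr", ".", "Smith", "yesterday"], 3, 0),
    (["The", "deal", "closed", ".", "Markets", "rallied"], 3, 1),
    (["Prices", "rose", ".", "Analysts", "agreed"], 2, 1),
    (["She", "joined", "Acme", "Inc", ".", "in", "May"], 4, 0),
]
X = [period_features(t, i, abbrevs) for t, i, _ in samples]
y = [label for _, _, label in samples]

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

test_tokens, test_index = ["Send", "it", "to", "Mr", ".", "Lee", "today"], 4
print(clf.predict(vec.transform([period_features(test_tokens, test_index, abbrevs)])))
```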

Text Mining

Acquiring Synonyms from Monolingual Comparable Texts

This paper presents a method for acquiring synonyms from monolingual comparable text (MCT). MCT denotes a set of monolingual texts whose contents are similar and which can be obtained automatically. Our acquisition method takes advantage of a characteristic of MCT: the words included and their relations are confined. Our method uses the contextual information of one word on each side of the target words. To improve acquisition precision, prevention of outside appearance is used. This method has the advantages that it requires only part-of-speech information and that it can acquire infrequent synonyms. We evaluated our method on two kinds of news article data: sentence-aligned parallel texts and document-aligned comparable texts. When applied to the former data, our method acquires synonym pairs with 70.0% precision. Re-evaluation of incorrect word pairs against their source texts indicates that the method captures the appropriate parts of the source texts with 89.5% precision. When applied to the latter data, acquisition precision reaches 76.0% in English and 76.3% in Japanese.

Mitsuo Shimohata, Eiichiro Sumita
A Method of Recognizing Entity and Relation

Entity and relation recognition, i.e. (1) assigning semantic classes to entities in a sentence and (2) determining the relations that hold between entities, is an important task in areas such as information extraction. Subtasks (1) and (2) are typically carried out sequentially, but this approach is problematic: the errors made in subtask (1) are propagated to subtask (2) with an accumulative effect, and the information available only in subtask (2) cannot be used in subtask (1). To address this problem, we propose a method that allows subtasks (1) and (2) to be associated more closely with each other. The process is performed in three stages: first, two classifiers carry out subtasks (1) and (2) independently; second, an entity is recognized by taking all the entities and relations into account, using a model called the Entity Relation Propagation Diagram; third, a relation is recognized based on the results of the preceding stage. The experiments show that the proposed method can improve entity and relation recognition to some degree.

Xinghua Fan, Maosong Sun
Inversion Transduction Grammar Constraints for Mining Parallel Sentences from Quasi-Comparable Corpora

We present a new implication of Wu's (1997) Inversion Transduction Grammar (ITG) Hypothesis for the problem of retrieving truly parallel sentence translations from large collections of highly non-parallel documents. Our approach leverages a strong language-universal constraint posited by the ITG Hypothesis, which can serve as a strong inductive bias for various language learning problems, resulting in both efficiency and accuracy gains. The task we attack is highly practical, since non-parallel multilingual data exists in far greater quantities than parallel corpora, but parallel sentences are a much more useful resource. Our aim here is to mine truly parallel sentences, as opposed to comparable sentence pairs or loose translations as in most previous work. The method we introduce exploits Bracketing ITGs to produce the first known results for this problem. Experiments show that it obtains large accuracy gains on this task compared to the expected performance of state-of-the-art models that were developed for the less stringent task of mining comparable sentence pairs.

Dekai Wu, Pascale Fung
Automatic Term Extraction Based on Perplexity of Compound Words

Many methods of term extraction have been discussed in terms of their accuracy on huge corpora. However, when we try to apply methods that rely on frequency to a small corpus, we may not be able to achieve sufficient accuracy because of the shortage of statistical information about frequency. This paper reports a new way of extracting terms that is tuned for a very small corpus. It focuses on the structure of compound terms and calculates perplexity on the left side and right side of each term unit. The results of our experiments revealed that the proposed method alone did not offer a clear accuracy advantage. However, the method combining perplexity and frequency information obtained the highest average precision in comparison with other methods.

Minoru Yoshida, Hiroshi Nakagawa
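
To make the left/right-perplexity idea concrete, here is a simplified, single-token sketch (the paper works with compound term units and also folds in frequency information, which this does not): the perplexity of the neighbor distribution on each side of a candidate unit is computed and combined. The data and function names are invented for illustration.

```python
import math
from collections import Counter

def side_perplexity(neighbors):
    """Perplexity of the distribution of tokens adjacent to a candidate unit.
    A high value means many different neighbors, i.e. the boundary is 'free'."""
    total = sum(neighbors.values())
    if total == 0:
        return 1.0
    entropy = -sum((c / total) * math.log2(c / total) for c in neighbors.values())
    return 2 ** entropy

def term_score(corpus_tokens, unit):
    """Geometric mean of left- and right-side perplexity for a single-token unit."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(corpus_tokens):
        if tok == unit:
            if i > 0:
                left[corpus_tokens[i - 1]] += 1
            if i + 1 < len(corpus_tokens):
                right[corpus_tokens[i + 1]] += 1
    return math.sqrt(side_perplexity(left) * side_perplexity(right))

tokens = ("maximum entropy model , maximum entropy classifier , "
          "the maximum likelihood estimate , entropy regularization").split()
for unit in ["maximum", "entropy", "the"]:
    print(unit, round(term_score(tokens, unit), 2))
```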

Document Analysis

Document Clustering with Grouping and Chaining Algorithms

Document clustering has many uses in natural language tools and applications. For instance, summarizing sets of documents that all describe the same event requires first identifying and grouping the documents talking about that event. Document clustering involves dividing a set of documents into non-overlapping clusters. In this paper, we present two document clustering algorithms: a grouping algorithm and a chaining algorithm. We compared them with the k-means and EM algorithms. The evaluation results showed that our two algorithms perform better than the k-means and EM algorithms in different experiments.

Yllias Chali, Soufiane Noureddine
Using Multiple Discriminant Analysis Approach for Linear Text Segmentation

Research on linear text segmentation has been an ongoing focus in NLP for the last decade, and it has great potential for a wide range of applications such as document summarization, information retrieval, and text understanding. However, linear text segmentation involves two critical problems: automatic boundary detection and automatic determination of the number of segments in a document. In this paper, we propose a new domain-independent statistical model for linear text segmentation. In our model, a Multiple Discriminant Analysis (MDA) criterion function is used to achieve global optimization in finding the best segmentation, by seeking the largest word similarity within a segment and the smallest word similarity between segments. To alleviate the high computational complexity introduced by the model, genetic algorithms (GAs) are used. Comparative experimental results show that our method based on MDA criterion functions achieves better results on the Pk measure (Beeferman) than the baseline system using the TextTiling algorithm.

Zhu Jingbo, Ye Na, Chang Xinzhi, Chen Wenliang, Benjamin K Tsou
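
The Pk measure (Beeferman) used for evaluation above is easy to state in code. The sketch below is a generic implementation under one common convention, boundary strings with one character per inter-sentence gap; it is not the paper's evaluation script, and the example segmentations are made up.

```python
def pk(reference, hypothesis, k=None):
    """Beeferman's Pk: probability that two positions k apart are wrongly judged
    to be in the same / different segments. Inputs are '1'/'0' strings with one
    character per inter-sentence gap ('1' = segment boundary at that gap)."""
    assert len(reference) == len(hypothesis)
    if k is None:
        # Convention: half the mean reference segment length (in sentences).
        n_segments = reference.count("1") + 1
        k = max(1, round(0.5 * (len(reference) + 1) / n_segments))
    errors, windows = 0, 0
    for i in range(len(reference) - k + 1):
        same_ref = "1" not in reference[i:i + k]   # window ends in same segment?
        same_hyp = "1" not in hypothesis[i:i + k]
        errors += same_ref != same_hyp
        windows += 1
    return errors / windows

ref = "0001000010000"   # reference boundaries after the 4th and 9th sentences
hyp = "0000100010000"   # hypothesis is off by one on the first boundary
print(round(pk(ref, hyp), 3))
```
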
Classifying Chinese Texts in Two Steps

This paper proposes a two-step method for Chinese text categorization (TC). In the first step, a Naïve Bayesian classifier is used to fix the fuzzy area between two categories; in the second step, a classifier with more subtle and powerful features is used to deal with documents in the fuzzy area, which are considered unreliable in the first step. A preliminary experiment validated the soundness of this method. The method is then extended from two-class TC to multi-class TC. Within this two-step framework, we try to further improve the classifier by taking the dependencies among features into consideration in the second step, resulting in a Causality Naïve Bayesian Classifier.

Xinghua Fan, Maosong Sun, Key-sun Choi, Qin Zhang
Assigning Polarity Scores to Reviews Using Machine Learning Techniques

We propose a novel type of document classification task that quantifies how much a given document (review) appreciates the target object, using not a binary polarity (good or bad) but a continuous measure called the sentiment polarity score (sp-score). An sp-score gives a very concise summary of a review and provides more information than binary classification. The difficulty of this task lies in the quantification of polarity. In this paper we use support vector regression (SVR) to tackle the problem. Experiments on book reviews with five-point scales show that SVR outperforms a multi-class classification method using support vector machines, and that the results are close to human performance.

Daisuke Okanohara, Jun’ichi Tsujii
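
A minimal sketch of the regression setup the abstract describes, using scikit-learn's SVR over bag-of-words features; the five toy reviews and their scores merely stand in for the book-review data, and the tf-idf representation is an assumption, not necessarily the authors' choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVR

# Toy reviews with 1-5 point scores standing in for the book-review data.
reviews = [
    ("a wonderful, insightful book that I could not put down", 5),
    ("solid treatment of the topic, though a little dry", 4),
    ("average; some useful chapters, some filler", 3),
    ("disappointing and repetitive", 2),
    ("a complete waste of time", 1),
]
texts, scores = zip(*reviews)

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
reg = SVR(kernel="linear", C=1.0).fit(X, scores)

test = ["useful but repetitive in places"]
print(reg.predict(vec.transform(test)))  # a continuous sp-score, not a binary label
```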

Ontology and Thesaurus

Analogy as Functional Recategorization: Abstraction with HowNet Semantics

One generally accepted hallmark of creative thinking is an ability to look beyond conventional labels and recategorize a concept based on its behaviour and functional potential. While taxonomies are useful in any domain of reasoning, they typically represent the conventional label set that creative thinking attempts to look beyond. If a linguistic taxonomy like WordNet [1] is to be useful in driving linguistic creativity, it must therefore support some basis for recategorization, allowing an agent to reorganize its category structures in a way that unlocks the functional potential of objects, or that recognizes similarity between literally dissimilar ideas. In this paper we consider how recategorization can be used to generate analogies using the HowNet [2] ontology, a lexical resource like WordNet that, in addition to being bilingual (Chinese/English), also provides explicit semantic definitions for each of the terms it defines.

Tony Veale
PLSI Utilization for Automatic Thesaurus Construction

When acquiring synonyms from large corpora, it is important to deal not only with surface information such as the contexts of words but also with their latent semantics. This paper describes how the latent semantic model PLSI can be used to acquire synonyms automatically from large corpora. PLSI is shown to achieve better performance than conventional methods such as tf·idf and LSI, making it applicable to automatic thesaurus construction. Various PLSI techniques are also shown to be effective, including (1) the use of Skew Divergence as a distance/similarity measure, (2) the removal of low-frequency words, and (3) multiple runs of PLSI with integration of the results.

Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama
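
Skew Divergence, mentioned above as the distance measure of choice, can be written down in a few lines. The sketch below shows only that measure applied to two context-count vectors; the PLSI training itself is not reproduced here, and the example vectors are made up.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q); assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def skew_divergence(q, r, alpha=0.99):
    """Lee's skew divergence s_alpha(q, r) = D(r || alpha*q + (1-alpha)*r).
    Mixing a little of r into q keeps the second argument nonzero, so the
    divergence stays finite even when q has zero entries."""
    q, r = np.asarray(q, dtype=float), np.asarray(r, dtype=float)
    q, r = q / q.sum(), r / r.sum()
    return kl(r, alpha * q + (1 - alpha) * r)

# Context-count vectors for two words over a shared vocabulary of context terms.
w1 = [10, 0, 3, 7, 0]
w2 = [8, 1, 4, 5, 2]
print(skew_divergence(w1, w2), skew_divergence(w2, w1))  # asymmetric, like KL
```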
Analysis of an Iterative Algorithm for Term-Based Ontology Alignment

This paper analyzes the results of automatic concept alignment between two ontologies. We use an iterative algorithm to perform concept alignment. The algorithm uses the similarity of shared terms to find the most appropriate target concept for a particular source concept. The results show that the proposed algorithm not only finds the relations between target concepts and source concepts, but also reveals some flaws in the ontologies. These results can be used to improve the correctness of the ontologies.

Shisanu Tongchim, Canasai Kruengkrai, Virach Sornlertlamvanich, Prapass Srichaivattana, Hitoshi Isahara
Finding Taxonomical Relation from an MRD for Thesaurus Extension

Building a thesaurus is a very costly and time-consuming task. To alleviate this problem, this paper proposes a new method for extending a thesaurus by adding taxonomic information automatically extracted from an MRD. The proposed method adopts a machine learning algorithm to acquire rules for identifying taxonomic relationships, minimizing human intervention. The accuracy of our method in identifying hypernyms of a noun is 89.7%, which shows that the proposed method can be successfully applied to the problem of extending a thesaurus.

SeonHwa Choi, HyukRo Park

Relation Extraction

Relation Extraction Using Support Vector Machine

This paper presents a supervised approach to relation extraction. We apply Support Vector Machines to detect and classify the relations in the Automatic Content Extraction (ACE) corpus. We use a set of features including lexical tokens, syntactic structures, and semantic entity types for the relation detection and classification problem. Besides these linguistic features, we successfully utilize the distance between two entities to improve performance. In relation detection, we filter out negative relation candidates using an entity distance threshold. In relation classification, we use the entity distance as a feature for the Support Vector Classifier. The system is evaluated in terms of recall, precision, and F-measure, and the system's errors are analyzed and solutions proposed.

Gumwon Hong
Discovering Relations Between Named Entities from a Large Raw Corpus Using Tree Similarity-Based Clustering

We propose a tree-similarity-based unsupervised learning method to extract relations between Named Entities from a large raw corpus. Our method regards relation extraction as a clustering problem over shallow parse trees. First, we modify previous tree kernels for relation extraction to estimate the similarity between parse trees more efficiently. Then, the similarity between parse trees is used in a hierarchical clustering algorithm to group entity pairs into different clusters. Finally, each cluster is labeled with an indicative word and unreliable clusters are pruned out. Evaluation on the New York Times (1995) corpus shows that our method outperforms the only previous work by 5 points in F-measure. It also shows that our method performs well on both high-frequency and low-frequency entity pairs. To the best of our knowledge, this is the first work to use a tree similarity metric in relation clustering.

Min Zhang, Jian Su, Danmei Wang, Guodong Zhou, Chew Lim Tan
Automatic Relation Extraction with Model Order Selection and Discriminative Label Identification

In this paper, we study the problem of unsupervised relation extraction based on model order identification and discriminative feature analysis. The model order identification is achieved by stability-based clustering and is used to infer the number of relation types between entity pairs automatically. The discriminative feature analysis is used to find discriminative feature words with which to name the relation types. Experiments on the ACE corpus show that the method is promising.

Chen Jinxiu, Ji Donghong, Tan Chew Lim, Niu Zhengyu
Mining Inter-Entity Semantic Relations Using Improved Transductive Learning

This paper studies the problem of mining relational data hidden in natural language text. In particular, it approaches the relation classification problem with the strategy of transductive learning. Different algorithms are presented and empirically evaluated on the ACE corpus. We show that transductive learners exploiting various lexical and syntactic features can achieve promising classification performance. More importantly, transductive learning performance can be significantly improved by using an induced similarity function.

Zhu Zhang

Text Classification

A Preliminary Work on Classifying Time Granularities of Temporal Questions

Temporal question classification assigns time granularities to temporal questions according to their anticipated answers. It is very important for answer extraction and verification in temporal question answering. Rather than simply distinguishing between "date" and "period", a more fine-grained classification hierarchy scaling down from "millions of years" to "second" is proposed in this paper. Based on it, a SNoW-based classifier, combining user preference, word N-grams, the granularity of time expressions, special patterns, and event types, is built to choose appropriate time granularities for ambiguous temporal questions, such as When- and How long-like questions. Evaluation on 194 such questions achieves 83.5% accuracy, close to the manual tagging accuracy of 86.2%. Experiments reveal that user preferences make significant contributions to time granularity classification.

Wei Li, Wenjie Li, Qin Lu, Kam-Fai Wong
Classification of Multiple-Sentence Questions

Conventional QA systems cannot answer questions composed of two or more sentences. We therefore aim to construct a QA system that can answer such multiple-sentence questions. As a first stage, we propose a method for classifying multiple-sentence questions into question types. Specifically, we first extract the core sentence from a given question text, and we use the core sentence and its question focus in question classification. Experimental results show that the proposed method improves F-measure by 8.8% and accuracy by 4.4%.

Akihiro Tamura, Hiroya Takamura, Manabu Okumura

Transliteration

A Rule Based Syllabification Algorithm for Sinhala

This paper presents a study of Sinhala syllable structure and an algorithm for identifying syllables in Sinhala words. After a thorough study of the syllable structure and linguistic rules for syllabification of Sinhala words, and a survey of the relevant literature, a set of rules was identified and implemented as a simple algorithm. The algorithm was tested on 30,000 distinct words obtained from a corpus and compared against the same words manually syllabified. The algorithm performs with 99.95% accuracy.

Ruvan Weerasinghe, Asanka Wasala, Kumudu Gamage
An Ensemble of Grapheme and Phoneme for Machine Transliteration

Machine transliteration is an automatic method for generating characters or words in one alphabetical system for the corresponding characters in another alphabetical system. There has been increasing interest in machine transliteration as an aid to machine translation and information retrieval. Three machine transliteration models have been proposed: the grapheme-based model, the phoneme-based model, and the hybrid model. However, few works try to make use of the correspondence between source graphemes and phonemes, although this correspondence plays an important role in machine transliteration, and few works handle source graphemes and phonemes dynamically. In this paper, we propose a new transliteration model based on an ensemble of grapheme and phoneme. Our model makes use of the correspondence and uses source graphemes and phonemes dynamically. Our method outperforms previous work by about 15-23% in English-to-Korean transliteration and about 15-43% in English-to-Japanese transliteration.

Jong-Hoon Oh, Key-Sun Choi

Machine Translation – I

Improving Statistical Word Alignment with Ensemble Methods

This paper proposes an approach to improving statistical word alignment with ensemble methods. Two ensemble methods are investigated: bagging and cross-validation committees. For both methods, weighted voting and unweighted voting are compared on the word alignment task. In addition, we analyze the effect of different training set sizes on the bagging method. Experimental results indicate that both bagging and cross-validation committees improve the word alignment results regardless of whether weighted or unweighted voting is used. Weighted voting performs consistently better than unweighted voting across different training set sizes.

Hua Wu, Haifeng Wang
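
A small sketch of the voting step only, not the authors' aligner: alignment link sets proposed by several ensemble members are combined by unweighted or weighted voting with a configurable threshold. The link sets and weights below are invented for illustration.

```python
from collections import defaultdict

def vote_alignments(alignments, weights=None, threshold=0.5):
    """Combine word-alignment link sets from several ensemble members by voting.
    `alignments` is a list of sets of (source_pos, target_pos) links; a link is
    kept if its (weighted) vote share reaches `threshold`."""
    if weights is None:
        weights = [1.0] * len(alignments)          # unweighted voting
    total = sum(weights)
    votes = defaultdict(float)
    for links, w in zip(alignments, weights):
        for link in links:
            votes[link] += w
    return {link for link, v in votes.items() if v / total >= threshold}

# Links proposed by three bagged aligners for one sentence pair.
m1 = {(0, 0), (1, 2), (2, 1)}
m2 = {(0, 0), (1, 2), (3, 3)}
m3 = {(0, 0), (2, 1), (3, 3)}
print(sorted(vote_alignments([m1, m2, m3])))                     # unweighted
print(sorted(vote_alignments([m1, m2, m3], weights=[3, 1, 1])))  # weighted
```
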
Empirical Study of Utilizing Morph-Syntactic Information in SMT

In this paper, we present an empirical study that utilizes morpho-syntactic information to improve translation quality. With three kinds of language pairs matched according to morpho-syntactic similarity or difference, we investigate the effects of various kinds of morpho-syntactic information, such as base form, part-of-speech, and the relative positional information of a word, in a statistical machine translation framework. We learn not only translation models but also word-based and class-based language models by manipulating morphological and relative positional information, and we integrate the models into a log-linear model. Experiments on multilingual translations show that morphological information such as part-of-speech and base form is effective for improving performance in morphologically rich language pairs, and that relative positional features within a word group are useful for reordering local word order. Moreover, the use of a class-based n-gram language model improves performance by alleviating the data sparseness problem of a word-based language model.

Young-Sook Hwang, Taro Watanabe, Yutaka Sasaki

Question Answering

Instance-Based Generation for Interactive Restricted Domain Question Answering Systems

One important component of interactive systems is the generation component. While template-based generation is appropriate in many cases (for example, task oriented spoken dialogue systems), interactive question answering systems require a more sophisticated approach. In this paper, we propose and compare two example-based methods for generation of information seeking questions.

Matthias Denecke, Hajime Tsukada
Answering Definition Questions Using Web Knowledge Bases

This paper presents a definition question answering approach that is capable of mining textual definitions from large collections of documents. In order to automatically identify definition sentences from a large collection of documents, we utilize existing definitions in Web knowledge bases instead of hand-crafted rules or an annotated corpus. Effective methods are adopted to make full use of Web knowledge bases, and they yield high-quality responses to definition questions. We applied our system to the TREC 2004 definition question-answering task and achieved an encouraging performance, with an F-measure score of 0.404, which was ranked second among all submitted runs.

Zhushuo Zhang, Yaqian Zhou, Xuanjing Huang, Lide Wu
Exploring Syntactic Relation Patterns for Question Answering

In this paper, we explore syntactic relation patterns for open-domain factoid question answering. We propose a pattern extraction method to extract the various relations between proper answers and different types of question words, including target words, head words, subject words, and verbs, from syntactic trees. We further propose a QA-specific tree kernel to partially match the syntactic relation patterns. This allows more tolerant matching between two patterns and helps to address the data sparseness problem. Lastly, we incorporate the patterns into a Maximum Entropy Model to rank the answer candidates. Experiments on TREC questions show that the syntactic relation patterns improve performance by 6.91 MRR on top of the common features.

Dan Shen, Geert-Jan M. Kruijff, Dietrich Klakow
Web-Based Unsupervised Learning for Query Formulation in Question Answering

Converting questions to effective queries is crucial to open-domain question answering systems. In this paper, we present a web-based unsupervised learning approach for transforming a given natural-language question to an effective query. The method involves querying a search engine for Web passages that contain the answer to the question, extracting patterns that characterize fine-grained classification for answers, and linking these patterns with n-grams in answer passages. Independent evaluation on a set of questions shows that the proposed approach outperforms a naive keyword-based approach in terms of mean reciprocal rank and human effort.

Yi-Chia Wang, Jian-Cheng Wu, Tyne Liang, Jason S. Chang

Morphological Analysis

A Chunking Strategy Towards Unknown Word Detection in Chinese Word Segmentation

This paper proposes a chunking strategy to detect unknown words in Chinese word segmentation. First, a raw sentence is pre-segmented into a sequence of word atoms using a maximum matching algorithm. Then a chunking model is applied to detect unknown words by chunking one or more word atoms together according to the word-formation patterns of the word atoms. A discriminative Markov model, named the Mutual Information Independence Model (MIIM), is adopted for chunking. In addition, a maximum entropy model is applied to integrate various types of contexts and resolve the data sparseness problem in MIIM. Moreover, an error-driven learning approach is proposed to learn useful contexts in the maximum entropy model. In this way, the number of contexts in the maximum entropy model can be significantly reduced without a decrease in performance, which makes it possible to further improve performance by considering more types of contexts. Evaluation on the PK and CTB corpora from the First SIGHAN Chinese word segmentation bakeoff shows that our chunking approach successfully detects about 80% of unknown words in both corpora and outperforms the best-reported systems by 8.1% and 7.1% in unknown word detection, respectively.

Zhou GuoDong
A Lexicon-Constrained Character Model for Chinese Morphological Analysis

This paper proposes a lexicon-constrained character model that combines both word and character features to solve complicated issues in Chinese morphological analysis. A Chinese character-based model constrained by a lexicon is built to acquire word building rules. Each character in a Chinese sentence is assigned a tag by the proposed model. The word segmentation and part-of-speech tagging results are then generated based on the character tags. The proposed method solves such problems as unknown word identification, data sparseness, and estimation bias in an integrated, unified framework. Preliminary experiments indicate that the proposed method outperforms the best SIGHAN word segmentation systems in the open track on 3 out of the 4 test corpora. Additionally, our method can be conveniently integrated with any other Chinese morphological systems as a post-processing module leading to significant improvement in performance.

Yao Meng, Hao Yu, Fumihito Nishino
Relative Compositionality of Multi-word Expressions: A Study of Verb-Noun (V-N) Collocations

Recognition of Multi-word Expressions (MWEs) and their relative compositionality are crucial to Natural Language Processing. Various statistical techniques have been proposed to recognize MWEs. In this paper, we integrate all the existing statistical features and investigate a range of classifiers for their suitability for recognizing the non-compositional Verb-Noun (V-N) collocations. In the task of ranking the V-N collocations based on their relative compositionality, we show that the correlation between the ranks computed by the classifier and human ranking is significantly better than the correlation between ranking of individual features and human ranking. We also show that the properties ‘Distributed frequency of object’ (as defined in [27] ) and ‘Nearest Mutual Information’ (as adapted from [18]) contribute greatly to the recognition of the non-compositional MWEs of the V-N type and to the ranking of the V-N collocations based on their relative compositionality.

Sriram Venkatapathy, Aravind K. Joshi
Automatic Extraction of Fixed Multiword Expressions

Fixed multiword expressions are strings of words which together behave like a single word. This research establishes a method for the automatic extraction of such expressions. Our method involves three stages. In the first, a statistical measure is used to extract candidate bigrams. In the second, we use this list to select occurrences of candidate expressions in a corpus, together with their surrounding contexts. These examples are used as training data for supervised machine learning, resulting in a classifier which can identify target multiword expressions. The final stage is the estimation of the part of speech of each extracted expression based on its context of occurrence. Evaluation demonstrated that collocation measures alone are not effective in identifying target expressions. However, when trained on one million examples, the classifier identified target multiword expressions with precision greater than 90%. Part-of-speech estimation had precision and recall of over 95%.

Campbell Hore, Masayuki Asahara, Yūji Matsumoto
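
The first stage described above, extracting candidate bigrams with a statistical measure, might look roughly like the following pointwise-mutual-information ranking; the later supervised-classification and part-of-speech-estimation stages are not shown, and the measure, corpus, and thresholds here are illustrative assumptions.

```python
import math
from collections import Counter

def candidate_bigrams(tokens, min_count=2, top_n=10):
    """Rank adjacent word pairs by pointwise mutual information (PMI) and
    return the strongest candidates for fixed multiword expressions."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue
        pmi = math.log2((c / (n - 1)) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        scored.append(((w1, w2), pmi))
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_n]

text = ("of course he said that of course the ad hoc committee met "
        "and the ad hoc rules applied of course").split()
for bigram, score in candidate_bigrams(text):
    print(bigram, round(score, 2))
```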

Machine Translation – II

Phrase-Based Statistical Machine Translation: A Level of Detail Approach

The merit of phrase-based statistical machine translation is often reduced by the complexity of constructing it. In this paper, we address several issues in phrase-based statistical machine translation: the size of the phrase translation table, the use of the underlying translation model probability, and the length of the phrase unit. We present the Level-Of-Detail (LOD) approach, an agglomerative approach for learning phrase-level alignments. Our experiments show that the LOD approach significantly improves the performance of the word-based approach. LOD demonstrates a clear advantage in that the phrase translation table grows only sub-linearly with the maximum phrase length, while its performance is comparable to that of other phrase-based approaches.

Hendra Setiawan, Haizhou Li, Min Zhang, Beng Chin Ooi
Why Is Zero Marking Important in Korean?

This paper argues for the necessity of zero pronoun annotations in Korean treebanks and provides an annotation scheme that can be used to develop a gold standard for testing different anaphora resolution algorithms. Relevant issues of pronoun annotation are discussed by comparing the Penn Korean Treebank, which has zero pronoun mark-up, with the newly developing Sejong Treebank, which lacks it. In addition to supporting evidence for zero marking, the morphosyntactic and semantic features necessary for zero annotation in Korean treebanks are suggested.

Sun-Hee Lee, Donna K. Byron, Seok Bae Jang
A Phrase-Based Context-Dependent Joint Probability Model for Named Entity Translation

We propose a phrase-based context-dependent joint probability model for Named Entity (NE) translation. Our proposed model consists of a lexical mapping model and a permutation model. Target phrases are generated by the context-dependent lexical mapping model, and word reordering is performed by the permutation model at the phrase level. We also present a two-step search to decode the best result from the models. Our proposed model is evaluated on the LDC Chinese-English NE translation corpus. The experimental results show that our proposed model is highly effective for NE translation.

Min Zhang, Haizhou Li, Jian Su, Hendra Setiawan
Machine Translation Based on Constraint-Based Synchronous Grammar

This paper proposes a variation of synchronous grammar based on the context-free grammar formalism, named Constraint-based Synchronous Grammar (CSG), which generalizes the first component of productions that models the source text. Unlike other synchronous grammars, CSG allows multiple target productions to be associated with a single source production rule, which can be used to guide a parser to infer different possible translational equivalences for a recognized input string according to the feature constraints of symbols in the pattern. Furthermore, CSG is augmented with independent rewriting, which allows it to express discontinuous constituents in the inference rules. Such a grammar turns out to be more expressive in modeling the translational equivalences of parallel texts for machine translation, and in this paper we propose the use of CSG as a basis for building a machine translation (MT) system for Portuguese-to-Chinese translation.

Fai Wong, Dong-Cheng Hu, Yu-Hang Mao, Ming-Chui Dong, Yi-Ping Li

Text Summarization

A Machine Learning Approach to Sentence Ordering for Multidocument Summarization and Its Evaluation

Ordering information is a difficult but important task for natural language generation applications. A wrong order of information not only makes the text difficult to understand, but can also convey an entirely different idea to the reader. This paper proposes an algorithm that learns orderings from a set of human-ordered texts. Our model consists of a set of ordering experts. Each expert gives its precedence preference between two sentences. We combine these preferences to order sentences. We also propose two new metrics for the evaluation of sentence orderings. Our experimental results show that the proposed algorithm outperforms existing methods on all evaluation metrics.

Danushka Bollegala, Naoaki Okazaki, Mitsuru Ishizuka
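
As an illustration of the expert-combination idea, with invented experts, weights, and data rather than the paper's learned experts or its evaluation metrics, the sketch below combines pairwise precedence preferences from two toy experts and orders sentences greedily.

```python
def chronology_expert(a, b):
    """Toy expert: prefer the sentence published earlier."""
    return 1.0 if a["time"] < b["time"] else 0.0 if a["time"] > b["time"] else 0.5

def length_expert(a, b):
    """Toy expert: prefer putting shorter (lead-like) sentences first."""
    la, lb = len(a["text"].split()), len(b["text"].split())
    return 1.0 if la < lb else 0.0 if la > lb else 0.5

def order(sentences, experts, weights):
    """Greedy ordering: repeatedly pick the sentence whose combined preference
    for preceding all remaining sentences is highest."""
    remaining, result = list(sentences), []
    while remaining:
        def total_pref(s):
            return sum(w * e(s, t) for e, w in zip(experts, weights)
                       for t in remaining if t is not s)
        best = max(remaining, key=total_pref)
        result.append(best)
        remaining.remove(best)
    return result

docs = [
    {"text": "The company later denied the report.", "time": 2},
    {"text": "Shares fell sharply.", "time": 1},
    {"text": "A newspaper said the firm had hidden losses for years.", "time": 0},
]
for s in order(docs, [chronology_expert, length_expert], [0.7, 0.3]):
    print(s["text"])
```
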
Significant Sentence Extraction by Euclidean Distance Based on Singular Value Decomposition

This paper describes an automatic summarization approach that constructs a summary by extracting significant sentences. The approach takes advantage of the co-occurrence relationships between terms within the document alone. The techniques used are principal component analysis (PCA), to extract significant terms, and singular value decomposition (SVD), to find the significant sentences. PCA can quantify both term frequency and term-term relationships in the document through eigenvalue-eigenvector pairs. The sentence-term matrix is decomposed by SVD into reduced-dimensional sentence-oriented and term-oriented matrices, which are used to compute Euclidean distances between sentence and term vectors and which also remove the noise of variability in term usage. Experimental results on Korean newspaper articles show that the proposed method is preferable to random sentence selection or to PCA alone when summarization is the goal.

Changbeom Lee, Hyukro Park, Cheolyoung Ock
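
A condensed sketch of the SVD side of such an approach, under simplifying assumptions: sentences are scored by the length of their vectors in a truncated latent space of the sentence-term matrix, which stands in for the paper's combination of PCA-selected terms and Euclidean distances. The example sentences and the number of dimensions k are placeholders.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The central bank raised interest rates by half a point.",
    "Analysts expected the rate increase after months of inflation data.",
    "A local football club announced a new stadium sponsor.",
    "Higher interest rates are meant to cool inflation.",
]

# Sentence-term matrix and its SVD.
vec = CountVectorizer(stop_words="english")
A = vec.fit_transform(sentences).toarray().astype(float)   # rows = sentences
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                    # number of latent dimensions kept
sent_vecs = U[:, :k] * s[:k]             # sentence coordinates in latent space
scores = np.linalg.norm(sent_vecs, axis=1)

ranked = np.argsort(scores)[::-1]
for i in ranked[:2]:                     # a two-sentence extract
    print(round(scores[i], 2), sentences[i])
```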

Named Entity Recognition

Two-Phase Biomedical Named Entity Recognition Using A Hybrid Method

Biomedical named entity recognition (NER) is a difficult problem in biomedical information processing due to the widespread ambiguity of terms out of context and extensive lexical variation. This paper presents a two-phase biomedical NER method consisting of term boundary detection and semantic labeling. By dividing the problem, we can adopt an effective model for each process. In our study, we use two exponential models, conditional random fields and maximum entropy, one at each phase. Moreover, the results of this machine-learning-based model are refined by rule-based post-processing implemented using a finite-state method. Experiments show that it achieves an F-score of 71.19% on the JNLPBA 2004 shared task of identifying five classes of biomedical NEs.

Seonho Kim, Juntae Yoon, Kyung-Mi Park, Hae-Chang Rim
Heuristic Methods for Reducing Errors of Geographic Named Entities Learned by Bootstrapping

One of the issues in bootstrapping for named entity recognition is how to control the annotation errors introduced at every iteration. In this paper, we present several heuristics for reducing such errors using external resources such as WordNet, an encyclopedia, and Web documents. Bootstrapping is applied to identify and classify fine-grained geographic named entities, which are useful for applications such as information extraction and question answering, in addition to standard named entities such as PERSON and ORGANIZATION. The experiments show the usefulness of the suggested heuristics and the learning curve evaluated at each bootstrapping loop. When our approach was applied to a newspaper corpus, it achieved an F1 value of 87, which is quite promising for the fine-grained named entity recognition task.

Seungwoo Lee, Gary Geunbae Lee

Linguistic Resources and Tools

Building a Japanese-Chinese Dictionary Using Kanji/Hanzi Conversion

A new bilingual dictionary can be built from two existing bilingual dictionaries, for example building a Japanese-Chinese dictionary from Japanese-English and English-Chinese ones. However, since Japanese and Chinese are closer to each other than either is to English, there should be a more direct way of doing this. Since many Japanese words are composed of kanji, which are similar to hanzi in Chinese, we attempt to build a dictionary for kanji words by simple conversion from kanji to hanzi. Our survey shows that around two thirds of the nouns and verbal nouns in Japanese are kanji words, and more than one third of them can be translated into Chinese directly. The accuracy of conversion is 97%. In addition, we obtain translation candidates for 24% of the Japanese words, with 77% accuracy, using English as a pivot language. By adding the kanji/hanzi conversion method, we increase coverage by 9 percentage points, to 33%, with better-quality candidates.

Chooi-Ling Goh, Masayuki Asahara, Yuji Matsumoto
Automatic Acquisition of Basic Katakana Lexicon from a Given Corpus

Katakana, the Japanese phonograms mainly used for loan words, are a source of trouble in Japanese word segmentation. Since Katakana words are heavily domain-dependent and Katakana neologisms are common, it is almost impossible to construct and maintain a Katakana word dictionary by hand. This paper proposes an automatic segmentation method for Japanese Katakana compounds, which makes it possible to construct a precise and concise Katakana word dictionary automatically, given only a medium-sized or large Japanese corpus from some domain.

Toshiaki Nakazawa, Daisuke Kawahara, Sadao Kurohashi
CTEMP: A Chinese Temporal Parser for Extracting and Normalizing Temporal Information

Temporal information is useful in many NLP applications, such as information extraction, question answering, and summarization. In this paper, we present a temporal parser for extracting and normalizing temporal expressions from Chinese texts. An integrated temporal framework is proposed, which includes basic temporal concepts and a classification of temporal expressions. The identification of temporal expressions is performed by chart parsing based on grammar rules and constraint rules. We evaluated the system on a substantial corpus and obtained promising results.

Wu Mingli, Li Wenjie, Lu Qin, Li Baoli
French-English Terminology Extraction from Comparable Corpora

This article presents a method of extracting bilingual lexica composed of single-word terms (SWTs) and multi-word terms (MWTs) from comparable corpora of a technical domain. First, this method extracts MWTs in each language, and then uses statistical methods to align single words and MWTs by exploiting the term contexts. After explaining the difficulties involved in aligning MWTs and specifying our approach, we show the adopted process for bilingual terminology extraction and the resources used in our experiments. Finally, we evaluate our approach and demonstrate its significance, particularly in relation to non-compositional MWT alignment.

Béatrice Daille, Emmanuel Morin

Discourse Analysis

A Twin-Candidate Model of Coreference Resolution with Non-Anaphor Identification Capability

Although effective for antecedent determination, the traditional twin-candidate model cannot prevent the invalid resolution of non-anaphors without additional measures. In this paper we propose a modified learning framework for the twin-candidate model. In the new framework, we use non-anaphors to create a special class of training instances, which leads to a classifier capable of identifying non-anaphors during resolution. In this way, the twin-candidate model itself can avoid resolving non-anaphors, and thus can be deployed directly for coreference resolution. Evaluation on the newswire domain shows that the twin-candidate based system with our modified framework achieves better and more reliable performance than those with other solutions.

Xiaofeng Yang, Jian Su, Chew Lim Tan
Improving Korean Speech Acts Analysis by Using Shrinkage and Discourse Stack

A speech act is a linguistic action intended by a speaker. Analyzing speech acts is important for a dialogue understanding system because the speech act of an utterance is closely tied to the user's intention in the utterance. This paper proposes using a speech act hierarchy and a discourse stack to improve the accuracy of classifiers in speech act analysis. We first adopt a hierarchical statistical technique called shrinkage to solve the data sparseness problem. In addition, we use a discourse stack to easily apply discourse structure information to speech act analysis. The experimental results show that the proposed model yields a significant improvement for Korean speech act analysis, and that it is even more useful when training data is insufficient.

Kyungsun Kim, Youngjoong Ko, Jungyun Seo
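
A minimal sketch of shrinkage over a speech-act hierarchy as described in the abstract above: the word distribution of a leaf class is smoothed with the distributions of its ancestors. The hierarchy, example utterances, and fixed mixture weights are illustrative assumptions (in practice the weights would be estimated, e.g. by EM).

```python
from collections import Counter

# Toy hierarchy: leaf speech act -> list of ancestors up to the root (assumption).
HIERARCHY = {"request-action": ["request", "root"],
             "request-info":   ["request", "root"],
             "inform":         ["statement", "root"]}

def ml_estimates(examples):
    """Maximum-likelihood word distributions for every node in the hierarchy."""
    counts = {}
    for label, words in examples:
        for node in [label] + HIERARCHY[label]:
            counts.setdefault(node, Counter()).update(words)
    return {node: {w: c / sum(cnt.values()) for w, c in cnt.items()}
            for node, cnt in counts.items()}

def shrunk_prob(word, leaf, estimates, weights=(0.6, 0.3, 0.1)):
    """P(word | leaf) as a weighted mixture of the leaf and its ancestors."""
    nodes = [leaf] + HIERARCHY[leaf]
    return sum(w * estimates.get(n, {}).get(word, 1e-6)
               for w, n in zip(weights, nodes))

examples = [("request-action", ["please", "close", "the", "door"]),
            ("request-info",   ["what", "time", "is", "it"]),
            ("inform",         ["the", "meeting", "starts", "at", "nine"])]
est = ml_estimates(examples)
print(shrunk_prob("please", "request-info", est))  # benefits from the shared parent node
```
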
Anaphora Resolution for Biomedical Literature by Exploiting Multiple Resources

In this paper, a resolution system is presented to tackle nominal and pronominal anaphora in biomedical literature by using a rich set of syntactic and semantic features. Unlike previous research, the verification of semantic association between anaphors and their antecedents is facilitated by exploiting additional external resources, including UMLS, WordNet, the GENIA Corpus 3.02p, and PubMed. Moreover, the resolution employs a genetic algorithm for feature selection. Experimental results on different biomedical corpora show that such an approach can achieve promising results in resolving the two common types of anaphora.

Tyne Liang, Yu-Hsiang Lin
Automatic Slide Generation Based on Discourse Structure Analysis

In this paper, we describe a method for automatically generating summary slides from a text. The slides are generated by itemizing topic/non-topic parts that are extracted from the text based on syntactic/case analysis. The indentation of the items is controlled according to the discourse structure, which is detected by cue phrases, identification of word chains, and similarity between two sentences. Our experiments demonstrate that the generated slides are far easier to read than the original texts.

Tomohide Shibata, Sadao Kurohashi

Semantic Analysis – I

Using the Structure of a Conceptual Network in Computing Semantic Relatedness

We present a new method for computing semantic relatedness of concepts. The method relies solely on the structure of a conceptual network and eliminates the need for performing additional corpus analysis. The network structure is employed to generate artificial conceptual glosses. They replace textual definitions proper, written by humans, and are processed by a dictionary-based metric of semantic relatedness [1]. We implemented the metric on the basis of GermaNet, the German counterpart of WordNet, and evaluated the results on a German dataset of 57 word pairs rated by human subjects for their semantic relatedness. Our approach can be easily applied to compute semantic relatedness based on alternative conceptual networks, e.g. in the domain of life sciences.

Iryna Gurevych
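
A minimal sketch of the idea above, assuming the conceptual network is given as a toy adjacency map: an artificial gloss is the set of lemmas reachable within a small radius, and relatedness is a Lesk-style overlap of two such glosses. This is a simplification of the dictionary-based metric cited as [1].

```python
# Toy conceptual network: concept -> neighbouring concepts (assumption).
NETWORK = {"car":    ["vehicle", "wheel", "engine"],
           "truck":  ["vehicle", "wheel", "cargo"],
           "banana": ["fruit", "plant"]}

def pseudo_gloss(concept, radius=2):
    """Artificial gloss: all lemmas reachable within `radius` steps."""
    gloss, frontier = {concept}, {concept}
    for _ in range(radius):
        frontier = {nbr for node in frontier for nbr in NETWORK.get(node, [])}
        gloss |= frontier
    return gloss

def relatedness(c1, c2):
    """Lesk-style normalised overlap of the two pseudo-glosses."""
    g1, g2 = pseudo_gloss(c1), pseudo_gloss(c2)
    return len(g1 & g2) / len(g1 | g2)

print(relatedness("car", "truck"))    # high: shared vehicle/wheel neighbours
print(relatedness("car", "banana"))   # low: no shared neighbours
```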
Semantic Role Labelling of Prepositional Phrases

We propose a method for labelling prepositional phrases according to two different semantic role classifications, as contained in the Penn treebank and the CoNLL 2004 Semantic Role Labelling data set. Our results illustrate the difficulties in determining preposition semantics, but also demonstrate the potential for PP semantic role labelling to improve the performance of a holistic semantic role labelling system.

Patrick Ye, Timothy Baldwin
Global Path-Based Refinement of Noisy Graphs Applied to Verb Semantics

Recently, researchers have applied text- and web-mining algorithms to mine semantic resources. The result is often a noisy graph of relations between words. We propose a mathematically rigorous refinement framework, which uses path-based analysis, updating the likelihood of a relation between a pair of nodes using evidence provided by multiple indirect paths between the nodes. Evaluation on refining temporal verb relations in a semantic resource called VerbOcean showed a 16.1% error reduction after refinement.

Timothy Chklovski, Patrick Pantel
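
A minimal sketch of path-based refinement, assuming the noisy resource is a weighted directed graph; the noisy-or combination of direct and two-step path evidence used here is a simplified stand-in for the paper's update rule, and the edge weights are invented.

```python
# Noisy relation graph: (node_a, node_b) -> confidence of the relation (toy data).
EDGES = {("plan", "execute"): 0.6, ("plan", "draft"): 0.7,
         ("draft", "execute"): 0.8, ("execute", "finish"): 0.9}

def two_step_paths(a, b):
    """Confidence products of all indirect paths a -> x -> b."""
    return [EDGES[(a, x)] * EDGES[(x, b)]
            for (a2, x) in EDGES if a2 == a and (x, b) in EDGES]

def refine(a, b):
    """Combine the direct edge with indirect-path evidence via noisy-or."""
    belief = EDGES.get((a, b), 0.0)
    for p in two_step_paths(a, b):
        belief = 1.0 - (1.0 - belief) * (1.0 - p)
    return belief

print(refine("plan", "execute"))   # raised above 0.6 by the plan -> draft -> execute path
```
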
Semantic Role Tagging for Chinese at the Lexical Level

This paper reports on a study of semantic role tagging in Chinese, in the absence of a parser. We investigated the effect of using only lexical information in statistical training; and proposed to identify the relevant headwords in a sentence as a first step to partially locate the corresponding constituents to be labelled. Experiments were done on a textbook corpus and a news corpus, representing simple data and complex data respectively. Results suggested that in Chinese, simple lexical features are useful enough when constituent boundaries are known, while parse information might be more important for complicated sentences than simple ones. Several ways to improve the headword identification results were suggested, and we also plan to explore some class-based techniques for the task, with reference to existing semantic lexicons.

Oi Yee Kwong, Benjamin K. Tsou

NLP Applications

Detecting Article Errors Based on the Mass Count Distinction

This paper proposes a method for detecting errors concerning article usage and singular/plural usage based on the mass count distinction. Although the mass count distinction is particularly important in detecting these errors, it has been pointed out that it is hard to devise heuristic rules for distinguishing mass and count nouns. To solve this problem, the proposed method first collects instances of mass and count nouns automatically from a corpus by exploiting surface information. Then, words surrounding the mass (count) instances are weighted based on their frequencies. Finally, the weighted words are used to distinguish mass and count nouns. Once mass and count nouns are distinguished, the above errors can be detected by heuristic rules. Experiments show that the proposed method distinguishes mass and count nouns in the writing of Japanese learners of English with an accuracy of 93% and that 65% of article errors are detected with a precision of 70%.

Ryo Nagata, Takahiro Wakana, Fumito Masui, Atsuo Kawai, Naoki Isu
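
A minimal sketch of the weighting step described above: words that frequently surround known count (or mass) instances vote, weighted by frequency, for the class of a new noun occurrence. The seed instances below are toy data, not the corpus instances collected by surface cues in the paper.

```python
from collections import Counter

# Frequency-weighted context words of seed count and mass instances (toy data).
count_ctx, mass_ctx = Counter(), Counter()
seed = [("count", ["i", "bought", "a", "book", "yesterday"]),
        ("count", ["she", "read", "two", "books", "today"]),
        ("mass",  ["we", "need", "more", "information", "now"]),
        ("mass",  ["too", "much", "furniture", "here"])]
for label, words in seed:
    (count_ctx if label == "count" else mass_ctx).update(words)

def classify(context_words):
    """Score a noun occurrence by the frequency-weighted votes of its context."""
    c = sum(count_ctx[w] for w in context_words)
    m = sum(mass_ctx[w] for w in context_words)
    return "count" if c >= m else "mass"

print(classify(["he", "bought", "two", "chairs"]))        # surface cue "two" -> count
print(classify(["we", "collected", "much", "evidence"]))  # surface cue "much" -> mass
```
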
Principles of Non-stationary Hidden Markov Model and Its Applications to Sequence Labeling Task

The Hidden Markov Model (HMM) is one of the most popular language models. To improve its predictive power, one of its hypotheses, the limited history hypothesis, is usually relaxed, yielding a higher-order HMM. However, several severe problems hamper the application of higher-order HMMs, such as parameter space explosion, data sparseness, and system resource exhaustion. From another point of view, this paper relaxes the other HMM hypothesis, the stationary (time-invariant) hypothesis, making use of time information, and proposes a non-stationary HMM (NSHMM). This paper describes the NSHMM in detail, including its definition, the representation of time information, its algorithms, and its parameter space. Moreover, to further reduce the parameter space for mobile applications, this paper proposes a variant form of the NSHMM (VNSHMM). The NSHMM and VNSHMM are then applied to two sequence labeling tasks: POS tagging and pinyin-to-character conversion. Experimental results show that, compared with an HMM, the NSHMM and VNSHMM greatly reduce the error rate on both tasks, which proves that they have much more predictive power than the HMM.

Xiao JingHui, Liu BingQuan, Wang XiaoLong
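
A minimal sketch of the non-stationary idea: transition probabilities are indexed by the position in the sequence rather than shared across time, and decoding is ordinary Viterbi with a per-position transition matrix. The toy tag set and all probabilities are illustrative assumptions, not estimates from data.

```python
import numpy as np

states = ["N", "V"]
start = np.array([0.7, 0.3])
# One transition matrix per position (here: transitions into positions 2 and 3).
trans_t = [np.array([[0.4, 0.6], [0.8, 0.2]]),
           np.array([[0.9, 0.1], [0.5, 0.5]])]
emit = {"dogs": np.array([0.8, 0.2]), "bark": np.array([0.3, 0.7]),
        "loudly": np.array([0.5, 0.5])}

def viterbi(words):
    """Viterbi decoding with time-dependent (non-stationary) transitions."""
    delta = start * emit[words[0]]
    back = []
    for t, w in enumerate(words[1:]):
        scores = delta[:, None] * trans_t[t] * emit[w][None, :]
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.insert(0, int(bp[path[0]]))
    return [states[i] for i in path]

print(viterbi(["dogs", "bark", "loudly"]))
```
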
Integrating Punctuation Rules and Naïve Bayesian Model for Chinese Creation Title Recognition

Creation titles, i.e. titles of literary and/or artistic works, comprise over 7% of named entities in Chinese documents. They are the fourth largest class of named entities in Chinese, after personal names, location names, and organization names. However, they have rarely been mentioned or studied before. Chinese title recognition is challenging for the following reasons. There are few internal features and nearly no restrictions in the naming style of titles. Their lengths and structures vary. Worst of all, they are generally composed of common words, so they look like common fragments of sentences. In this paper, we integrate punctuation rules, a lexicon, and naïve Bayesian models to recognize creation titles in Chinese documents. This pioneering study achieves a precision of 0.510 and a recall of 0.685. The promising results can be integrated into Chinese segmentation, used to retrieve relevant information for specific titles, and so on.

Conrad Chen, Hsin-Hsi Chen
A Connectionist Model of Anticipation in Visual Worlds

Recent “visual worlds” studies, wherein researchers study language in context by monitoring eye-movements in a visual scene during sentence processing, have revealed much about the interaction of diverse information sources and the time course of their influence on comprehension. In this study, five experiments that trade off scene context with a variety of linguistic factors are modelled with a Simple Recurrent Network modified to integrate a scene representation with the standard incremental input of a sentence. The results show that the model captures the qualitative behavior observed during the experiments, while retaining the ability to develop the correct interpretation in the absence of visual input.

Marshall R. Mayberry III, Matthew W. Crocker, Pia Knoeferle

Tagging

Automatically Inducing a Part-of-Speech Tagger by Projecting from Multiple Source Languages Across Aligned Corpora

We implement a variant of the algorithm described by Yarowsky and Ngai in [21] to induce an HMM POS tagger for an arbitrary target language using only an existing POS tagger for a source language and an unannotated parallel corpus between the source and target languages. We extend this work by projecting from multiple source languages onto a single target language. We hypothesize that systematic transfer errors from differing source languages will cancel out, improving the quality of bootstrapped resources in the target language. Our experiments confirm the hypothesis. Each experiment compares three cases: (a) source data comes from a single language A, (b) source data comes from a single language B, and (c) source data comes from both A and B, but half as much from each. Apart from the source language, other conditions are held constant in all three cases – including the total amount of source data used. The null hypothesis is that performance in the mixed case would be an average of performance in the single-language cases, but in fact, mixed-case performance always exceeds the maximum of the single-language cases. We observed this effect in all six experiments we ran, involving three different source-language pairs and two different target languages.

Victoria Fossum, Steven Abney
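
A minimal sketch of projection and combination, assuming word alignments are given as (source index, target index) pairs: tags projected from each source language are pooled per target token and resolved by majority vote. The sentences, tags, and alignments are toy data, and the vote is a simple stand-in for retraining an HMM tagger on the projected annotations.

```python
from collections import Counter

def project(source_tags, alignment):
    """Project POS tags across an alignment: list of (source_index, target_index)."""
    projected = {}
    for s, t in alignment:
        projected.setdefault(t, []).append(source_tags[s])
    return projected

def combine(projections):
    """Merge projections from several source languages by majority vote."""
    votes = {}
    for proj in projections:
        for t, tags in proj.items():
            votes.setdefault(t, Counter()).update(tags)
    return {t: cnt.most_common(1)[0][0] for t, cnt in votes.items()}

# Hypothetical projections from two source languages onto a 3-token target sentence.
p_a = project(["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)])
p_b = project(["NOUN", "VERB"], [(0, 1), (1, 2)])
print(combine([p_a, p_b]))   # {0: 'DET', 1: 'NOUN', 2: 'VERB'}
```
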
The Verbal Entries and Their Description in a Grammatical Information-Dictionary of Contemporary Tibetan

This paper discusses the verb information items in a grammatical knowledge dictionary of the Tibetan language. The verb information items cover morphology, word formation, and syntax, at the core of which is a verb classification based on syntactic and semantic features. The influence of 12 types of verbs on clause structure is discussed. The paper also designs an information table to compile the verb entries and their related knowledge, in which most details of the verbs are described. Finally, some newly found phenomena are discussed, and the paper proposes that there are still special tokens and constructs in Tibetan that need to be mined.

Jiang Di, Long Congjun, Zhang Jichuan
Tense Tagging for Verbs in Cross-Lingual Context: A Case Study

The current work applies Conditional Random Fields to the problem of temporal reference mapping from Chinese text to English text. The learning algorithm utilizes a moderate number of linguistic features that are easy and inexpensive to obtain. We train a tense classifier upon a small amount of manually labeled data. The evaluation results are promising according to standard measures as well as in comparison with a pilot tense annotation experiment involving human judges. Our study exhibits potential value for full-scale machine translation systems and other natural language processing tasks in a cross-lingual scenario.

Yang Ye, Zhu Zhang
Regularisation Techniques for Conditional Random Fields: Parameterised Versus Parameter-Free

Recent work on Conditional Random Fields (CRFs) has demonstrated the need for regularisation when applying these models to real-world NLP data sets. Conventional approaches to regularising CRFs have focused on using a Gaussian prior over the model parameters. In this paper we explore other possibilities for CRF regularisation. We examine alternative choices of prior distribution and we relax the usual simplifying assumptions made with the use of a prior, such as constant hyperparameter values across features. In addition, we contrast the effectiveness of priors with an alternative, parameter-free approach. Specifically, we employ logarithmic opinion pools (LOPs). Our results show that a LOP of CRFs can outperform a standard unregularised CRF and attain a performance level close to that of a regularised CRF, without the need for intensive hyperparameter search.

Andrew Smith, Miles Osborne
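
A minimal sketch of a logarithmic opinion pool: the pooled distribution is a weighted geometric mean of the member distributions, renormalised. The two member distributions below stand in for per-label marginals of trained CRFs; in the paper the pooling is applied to whole label sequences.

```python
import numpy as np

def log_opinion_pool(distributions, weights):
    """Weighted geometric mean of the member distributions, renormalised."""
    log_pool = sum(w * np.log(p) for w, p in zip(weights, distributions))
    pool = np.exp(log_pool)
    return pool / pool.sum()

p1 = np.array([0.7, 0.2, 0.1])   # expert 1 over labels (B, I, O) -- toy values
p2 = np.array([0.5, 0.4, 0.1])   # expert 2
print(log_opinion_pool([p1, p2], weights=[0.5, 0.5]))
```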

Semantic Analysis – II

Exploiting Lexical Conceptual Structure for Paraphrase Generation

Lexical Conceptual Structure (LCS) represents verbs as semantic structures with a limited number of semantic predicates. This paper explores how LCS can be used to explain the regularities underlying lexical and syntactic paraphrases, such as verb alternation, compound word decomposition, and lexical derivation. We propose a paraphrase generation model which transforms the LCSs of verbs, and then conduct an empirical experiment taking the paraphrasing of Japanese light-verb constructions as an example. Experimental results confirm that the syntactic and semantic properties of verbs encoded in LCS are useful for semantically constraining the syntactic transformation in paraphrase generation.

Atsushi Fujita, Kentaro Inui, Yuji Matsumoto
Word Sense Disambiguation by Relative Selection

This paper describes a novel method for word sense disambiguation that utilizes relatives (i.e. synonyms, hypernyms, meronyms, etc. in WordNet) of a target word and raw corpora. The method disambiguates senses of a target word by selecting the relative that most probably occurs in a new sentence containing the target word. Only one co-occurrence frequency matrix is utilized to efficiently disambiguate the senses of many target words. Experiments on several English datasets show that our proposed method achieves good performance.

Hee-Cheol Seo, Hae-Chang Rim, Myung-Gil Jang
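
A minimal sketch of relative selection, assuming relatives per sense and a co-occurrence frequency matrix are available: the sense whose relative co-occurs most strongly with the context words is chosen. The relatives and counts below are invented for illustration.

```python
# Relatives per sense of the target word "bank", and toy co-occurrence counts.
RELATIVES = {"bank_finance": ["deposit", "loan"], "bank_river": ["shore", "riverside"]}
COOC = {("deposit", "money"): 50, ("loan", "interest"): 40,
        ("shore", "water"): 45, ("riverside", "fishing"): 30}

def disambiguate(context):
    def score(relative):
        # Symmetric lookup in the co-occurrence matrix.
        return sum(COOC.get((relative, w), 0) + COOC.get((w, relative), 0)
                   for w in context)
    best_sense, best = None, -1
    for sense, relatives in RELATIVES.items():
        s = max(score(r) for r in relatives)   # best-matching relative of this sense
        if s > best:
            best_sense, best = sense, s
    return best_sense

print(disambiguate(["money", "interest", "account"]))  # -> bank_finance
print(disambiguate(["water", "fishing", "walk"]))      # -> bank_river
```
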
Towards Robust High Performance Word Sense Disambiguation of English Verbs Using Rich Linguistic Features

This paper shows that our WSD system using rich linguistic features achieved high accuracy in the classification of English SENSEVAL2 verbs for both fine-grained (64.6%) and coarse-grained (73.7%) senses. We describe three specific enhancements to our treatment of rich linguistic features and present their separate and combined contributions to our system’s performance. Further experiments showed that our system had robust performance on test data without high quality rich features.

Jinying Chen, Martha Palmer
Automatic Interpretation of Noun Compounds Using WordNet Similarity

The paper introduces a method for interpreting novel noun compounds with semantic relations. The method is built around word similarity with pre-tagged noun compounds, based on WordNet::Similarity. Over 1,088 training instances and 1,081 test instances from the Wall Street Journal in the Penn Treebank, the proposed method was able to correctly classify 53.3% of the test noun compounds. We also investigated the relative contribution of the modifier and the head noun in noun compounds of different semantic types.

Su Nam Kim, Timothy Baldwin
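
A minimal sketch of the nearest-neighbour interpretation step, with a toy similarity table standing in for WordNet::Similarity: a new compound receives the relation of the pre-tagged compound whose modifier and head are most similar to its own.

```python
# Toy word-similarity table; a real system would plug in a WordNet-based metric.
SIM = {("apple", "orange"): 0.9, ("juice", "cake"): 0.6,
       ("steel", "orange"): 0.1, ("knife", "cake"): 0.2}

def word_sim(a, b):
    return SIM.get((a, b), SIM.get((b, a), 1.0 if a == b else 0.0))

# Pre-tagged training compounds: ((modifier, head), semantic relation).
TRAINING = [(("orange", "juice"), "SOURCE"), (("steel", "knife"), "MATERIAL")]

def interpret(modifier, head):
    """Assign the relation of the most similar pre-tagged compound."""
    def compound_sim(pair):
        (m, h), _ = pair
        return (word_sim(modifier, m) + word_sim(head, h)) / 2.0
    return max(TRAINING, key=compound_sim)[1]

print(interpret("apple", "juice"))   # SOURCE, via similarity to "orange juice"
print(interpret("steel", "fork"))    # MATERIAL, via similarity to "steel knife"
```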

Language Models

An Empirical Study on Language Model Adaptation Using a Metric of Domain Similarity

This paper presents an empirical study on four techniques of language model adaptation, including a maximum a posteriori (MAP) method and three discriminative training models, in the application of Japanese Kana-Kanji conversion. We compare the performance of these methods from various angles by adapting the baseline model to four adaptation domains. In particular, we attempt to interpret the results given in terms of the character error rate (CER) by correlating them with the characteristics of the adaptation domain measured using the information-theoretic notion of cross entropy. We show that such a metric correlates well with the CER performance of the adaptation methods, and also show that the discriminative methods are not only superior to a MAP-based method in terms of achieving larger CER reduction, but are also more robust against the similarity of background and adaptation domains.

Wei Yuan, Jianfeng Gao, Hisami Suzuki
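
A minimal sketch of the domain-similarity metric, under the assumption that it is computed as the cross entropy of adaptation-domain text under a model of the background domain (here a unigram model with add-one smoothing); lower values indicate closer domains. The sentences are toy data.

```python
import math
from collections import Counter

def unigram_model(text, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary."""
    counts = Counter(text)
    total = len(text) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def cross_entropy(adapt_text, model):
    """Bits per word of the adaptation text under the background model."""
    return -sum(math.log2(model[w]) for w in adapt_text) / len(adapt_text)

background = "the market rose as investors bought shares".split()
adapt_near = "the investors bought more shares".split()
adapt_far = "the patient received a new diagnosis".split()
vocab = set(background + adapt_near + adapt_far)
model = unigram_model(background, vocab)
print(cross_entropy(adapt_near, model) < cross_entropy(adapt_far, model))  # True
```
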
A Comparative Study of Language Models for Book and Author Recognition

Linguistic information can help improve evaluation of similarity between documents; however, the kind of linguistic information to be used depends on the task. In this paper, we show that distributions of syntactic structures capture the way works are written and accurately identify individual books more than 76% of the time. In comparison, baseline features, e.g., tfidf-weighted keywords, function words, etc., give an accuracy of at most 66%. However, testing the same features on authorship attribution shows that distributions of syntactic structures are less successful than function words on this task; syntactic structures vary even among the works of the same author whereas features such as function words are distributed more similarly among the works of an author and can more effectively capture authorship.

Özlem Uzuner, Boris Katz

Spoken Language

Lexical Choice via Topic Adaptation for Paraphrasing Written Language to Spoken Language

Our research aims at developing a system that paraphrases written-language text into spoken-language style. In such a system, it is important to distinguish between appropriate and inappropriate words in an input text for spoken language. We call this task lexical choice for paraphrasing. In this paper, we describe a method of lexical choice that considers the topic. Basically, our method is based on word probabilities in written and spoken language corpora. The novelty of our method is topic adaptation. In our framework, the corpora are classified into topic categories, and the probability is estimated using the corpora that have the same topic as the input text. Evaluation results showed the effectiveness of topic adaptation.

Nobuhiro Kaji, Sadao Kurohashi
A Case-Based Reasoning Approach for Speech Corpus Generation

Corpus-based stochastic language models have achieved significant success in speech recognition, but construction of a corpus pertaining to a specific application is a difficult task. This paper introduces a Case-Based Reasoning system to generate natural language corpora. In comparison to traditional natural language generation approaches, this system overcomes the inflexibility of template-based methods while avoiding the linguistic sophistication of rule-based packages. The evaluation of the system indicates our approach is effective in generating users’ specifications or queries as 98% of the generated sentences are grammatically correct. The study result also shows that the language model derived from the generated corpus can significantly outperform a general language model or a dictation grammar.

Yandong Fan, Elizabeth Kendall

Terminology Mining

Web-Based Terminology Translation Mining

Mining terminology translations from a large amount of Web data can be applied in many fields, such as reading/writing assistance, machine translation, and cross-language information retrieval. The challenging issues are how to find comprehensive results from the Web, how to determine the boundaries of candidate translations, and how to remove irrelevant noise and rank the remaining candidates. In this paper, after reviewing and analyzing possible methods of acquiring translations, a feasible statistics-based method is proposed to mine terminology translations from the Web. In the proposed method, on the basis of an analysis of different forms of term translation distributions, character-based string frequency estimation is presented to construct term translation candidates, exploring more translations and their boundaries; then sort-based subset deletion and mutual information methods are proposed to deal with the subset redundancy and prefix/suffix redundancy formed in the estimation process. Extensive experiments on two test sets of 401 and 3511 English terms validate that our system achieves better performance.

Gaolin Fang, Hao Yu, Fumihito Nishino
Extracting Terminologically Relevant Collocations in the Translation of Chinese Monograph

This paper suggests a methodology aimed at extracting terminologically relevant collocations for translation purposes. Our basic idea is to use a hybrid method which combines statistical measures and linguistic rules. The extraction system operates in three steps: (1) tokenization and POS tagging of the corpus; (2) extraction of multi-word units using a statistical measure; (3) linguistic filtering using syntactic patterns and a stop-word list. The hybrid method with linguistic filters proved well suited to selecting terminological collocations; it considerably improved the precision of the extraction, which is much higher than that of a purely statistical method. In our test, the hybrid method combining the log-likelihood ratio and linguistic rules gave the best extraction performance. We believe that terminological collocations and phrases extracted in this way could be used effectively either to supplement existing terminological collections or in addition to traditional reference works.

Byeong-Kwu Kang, Bao-Bao Chang, Yi-Rong Chen, Shi-Wen Yu
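
A minimal sketch of the statistical step, using Dunning's log-likelihood ratio for a candidate bigram given its contingency counts; the counts in the example calls are invented, not taken from the paper's corpus.

```python
import math

def llr(k11, k12, k21, k22):
    """k11: freq(w1 w2); k12: w1 without w2; k21: w2 without w1; k22: neither."""
    def h(*ks):
        # Unnormalised entropy term: N*log N - sum k*log k.
        n = sum(ks)
        return n * math.log(n) - sum(k * math.log(k) for k in ks if k > 0)
    row = h(k11 + k12, k21 + k22)
    col = h(k11 + k21, k12 + k22)
    mat = h(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

# A strongly associated pair scores much higher than an incidental co-occurrence.
print(llr(110, 2442, 111, 29114))   # e.g. a recurring technical term
print(llr(10, 2542, 211, 29014))    # a chance co-occurrence
```
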
Backmatter
Metadata
Title: Natural Language Processing – IJCNLP 2005
Edited by: Robert Dale, Kam-Fai Wong, Jian Su, Oi Yee Kwong
Copyright year: 2005
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-540-31724-1
Print ISBN: 978-3-540-29172-5
DOI: https://doi.org/10.1007/11562214
