2006 | Book

Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead

21st International Conference, ICCPOL 2006, Singapore, December 17-19, 2006. Proceedings

Edited by: Yuji Matsumoto, Richard W. Sproat, Kam-Fai Wong, Min Zhang

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

About this book

This book constitutes the thoroughly refereed proceedings of the 21st International Conference on Computer Processing of Oriental Languages, ICCPOL 2006, held in Singapore in December 2006, colocated with ISCSLP 2006, the 5th International Symposium on Chinese Spoken Language Processing.

The 36 revised full papers and 20 revised short papers presented were carefully reviewed and selected from 169 submissions. The papers are organized in topical sections on information retrieval, document classification, questions and answers, summarization, machine translation, word segmentation, chunking, abbreviation expansion, writing-system issues, parsing, semantics, and lexical resources.

Table of contents

Frontmatter

Information Retrieval/Document Classification/QA/Summarization I

Answering Contextual Questions Based on the Cohesion with Knowledge

In this paper, we propose a Japanese question-answering (QA) system that answers contextual questions using a Japanese non-contextual QA system. Contextual questions usually contain reference expressions that refer to previous questions and their answers. We address reference resolution in contextual questions by finding the interpretation of references that maximizes the cohesion with knowledge. We use the appropriateness of the answer candidate obtained from the non-contextual QA system as the degree of cohesion. The experimental results show that the proposed method is effective in disambiguating the interpretation of contextual questions.

Tatsunori Mori, Shinpei Kawaguchi, Madoka Ishioroshi
Segmentation of Mixed Chinese/English Document Including Scattered Italic Characters

It is difficult to segment mixed Chinese/English documents when many italic characters are scattered through them. Most previous work pays more attention to English documents. However, a mixed document differs from an English document, and some special features should be considered. This paper gives a new way to solve the problem. First, an appropriate character area is chosen to detect italics. Next, a two-step strategy is adopted: italic determination is done first, and if the character pattern is identified as italic, the slant angle is estimated. Finally, the italic character pattern is corrected by a shear transform. A method that uses a two-step weighted projection profile histogram for italic determination is introduced, together with a fast algorithm to estimate the slant angle. Three large sample collections, consisting of characters, character pairs, and documents respectively, are provided to evaluate our method, and encouraging results are achieved.
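
The final correction step mentioned in this abstract, a shear transform, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `deslant`, the coordinate convention (y measured upward from the baseline) and the definition of the slant angle (measured from the vertical, leaning right) are all assumptions made for illustration.

```python
import math

def deslant(points, slant_degrees):
    """Shear (x, y) points horizontally so a right-leaning stroke becomes upright.

    Assumed convention: y grows upward from the baseline and the slant angle
    is measured from the vertical axis.
    """
    k = math.tan(math.radians(slant_degrees))
    # Each point moves left in proportion to its height above the baseline.
    return [(x - k * y, y) for x, y in points]

# A stroke leaning 15 degrees to the right, sampled at its foot and its top.
k15 = math.tan(math.radians(15.0))
stroke = [(0.0, 0.0), (10.0 * k15, 10.0)]
upright = deslant(stroke, 15.0)
```

After the shear, both sample points share (up to rounding) the same x coordinate, so the stroke is vertical. Estimating the slant angle itself is a separate step that this sketch does not cover.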

Yong Xia, Chun-Heng Wang, Ru-Wei Dai
Using Pointwise Mutual Information to Identify Implicit Features in Customer Reviews

This paper is concerned with the automatic identification of implicit product features expressed in product reviews in the context of opinion question answering. Using a polarity lexicon, we map each adjective in the lexicon to a set of predefined product features. Based on the relationship between those opinion-oriented words and product features, we can identify which feature a review refers to even when no explicit feature nouns or phrases appear. The results of our experiments support the validity of this method.
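
As a rough illustration of the measure named in the title, pointwise mutual information is defined as PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The sketch below applies it to adjective/feature co-occurrences. The toy pairs, the function names `pmi_table` and `implied_feature`, and the choice of simply taking the highest-scoring feature are assumptions for illustration, not details taken from the paper.

```python
import math
from collections import Counter

def pmi_table(pairs):
    """Return PMI scores for every observed (adjective, feature) pair."""
    pair_counts = Counter(pairs)
    adj_counts = Counter(adj for adj, _ in pairs)
    feat_counts = Counter(feat for _, feat in pairs)
    total = len(pairs)
    table = {}
    for (adj, feat), n in pair_counts.items():
        p_joint = n / total
        p_adj = adj_counts[adj] / total
        p_feat = feat_counts[feat] / total
        table[(adj, feat)] = math.log2(p_joint / (p_adj * p_feat))
    return table

def implied_feature(adjective, table):
    """Pick the feature with the highest PMI for this adjective, if any."""
    scores = {feat: s for (adj, feat), s in table.items() if adj == adjective}
    return max(scores, key=scores.get) if scores else None

# Hypothetical co-occurrence observations (adjective, explicit feature).
toy_pairs = [
    ("heavy", "weight"), ("heavy", "weight"), ("heavy", "price"),
    ("expensive", "price"), ("expensive", "price"),
    ("blurry", "picture"),
]
table = pmi_table(toy_pairs)
guess = implied_feature("heavy", table)
```

With these toy counts, "heavy" is associated more strongly with "weight" than with "price", so a review saying only "it is heavy" would be mapped to the weight feature.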

Qi Su, Kun Xiang, Houfeng Wang, Bin Sun, Shiwen Yu
Using Semi-supervised Learning for Question Classification

This paper uses unlabelled questions in combination with labelled questions for semi-supervised learning, in order to improve the performance of the question classification task. We also give two proposals for modifying Tri-training, a simple but efficient co-training style algorithm, to make it more suitable for question data. To avoid bootstrap-sampling the training set to obtain different sets for training the three classifiers, the first proposal is to use multiple algorithms for the classifiers in Tri-training; the second is to use multiple algorithms for the classifiers in combination with multiple views. The modification prevents the error rate at the initial step from increasing, and our experiments show promising results.

Nguyen Thanh Tri, Nguyen Minh Le, Akira Shimazu
Query Similarity Computing Based on System Similarity Measurement

Query similarity computation is one of the important factors in the process of query clustering. It has been used widely in the field of information processing. In this paper, a unified model for query similarity computation is presented based on system similarity. The novel approach to similarity computation uses the literal, semantic and statistical relative features of a query. The method can take advantage of the usual approaches to improve computation accuracy. Experiments show that the proposed method is an effective solution to the query similarity computation problem, and that it can be generalized to measure the similarity of other components of text, such as sentences and paragraphs.

Chengzhi Zhang, Xiaoqin Xu, Xinning Su

Machine Translation I

An Improved Method for Finding Bilingual Collocation Correspondences from Monolingual Corpora

Bilingual collocation correspondence is helpful to machine translation and second language learning. Existing techniques for identifying Chinese-English collocation correspondence suffer from two major problems: they are sensitive to the coverage of the bilingual dictionary and insensitive to semantic and contextual information. This paper presents the ICT (Improved Collocation Translation) method to overcome these problems. For a given Chinese collocation, the word translation candidates extracted from a bilingual dictionary are expanded to improve the coverage. A new translation model, which incorporates statistics extracted from monolingual corpora, word semantic similarities from a monolingual thesaurus and bilingual context similarities, is employed to estimate and rank the probabilities of the collocation correspondence candidates. Experiments show that ICT is robust to the coverage of the bilingual dictionary. It achieves 50.1% accuracy for the first candidate and 73.1% accuracy for the top-3 candidates.

Ruifeng Xu, Kam-Fai Wong, Qin Lu, Wenjie Li
A Syntactic Transformation Model for Statistical Machine Translation

We present a phrase-based SMT approach in which the word-order problem is solved using syntactic transformation in the preprocessing phase (there is no reordering in the decoding phase). We describe a syntactic transformation model based on the probabilistic context-free grammar. This model is trained using a bilingual corpus and a broad-coverage parser of the source language. This phrase-based SMT approach is applicable to language pairs in which the target language is poor in resources. We considered translation from English to Vietnamese and from English to French. Our experiments showed significant BLEU-score improvements in comparison with Pharaoh, a state-of-the-art phrase-based SMT system.

Thai Phuong Nguyen, Akira Shimazu
Word Alignment Between Chinese and Japanese Using Maximum Weight Matching on Bipartite Graph

The word-aligned bilingual corpus is an important knowledge source for many tasks in NLP, especially in machine translation. Among the existing word alignment methods, the unknown word problem, the synonym problem and the global optimization problem are very important factors affecting the recall and precision of alignment results. In this paper, we propose a word alignment model between Chinese and Japanese which measures similarity in terms of morphological similarity, semantic distance, part of speech and co-occurrence, and matches words by maximum weight matching on a bipartite graph. The model can partly solve the problems mentioned above. Experiments showed the model to be effective: it achieved an F-score of 80%, compared with 72% for GIZA++.
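
To make the matching step concrete, the following minimal sketch finds a maximum weight matching on a small bipartite graph by brute force over all assignments. This is only an illustration of the objective: the paper's actual matching algorithm and similarity function are not given in the abstract, the similarity values are hypothetical, and exhaustive search is practical only for very short sentences. The sketch assumes there are no more source words than target words.

```python
from itertools import permutations

def max_weight_matching(weights):
    """Return (pairs, score) maximizing the summed similarity.

    weights[i][j] is the similarity between source word i and target word j.
    Assumes len(weights) <= len(weights[0]); each source word gets one
    distinct target word.
    """
    n_src, n_tgt = len(weights), len(weights[0])
    best_score, best_pairs = float("-inf"), []
    for targets in permutations(range(n_tgt), n_src):
        score = sum(weights[i][j] for i, j in enumerate(targets))
        if score > best_score:
            best_score, best_pairs = score, list(enumerate(targets))
    return best_pairs, best_score

# Hypothetical similarity scores for three source and three target words.
similarity = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
    [0.10, 0.30, 0.70],
]
pairs, score = max_weight_matching(similarity)
```

Here a greedy aligner would link source word 0 to target word 0 first (0.90) and be left with a poor link for source word 1, whereas the global optimum pairs 0 with 1 and 1 with 0. This is the kind of global optimization issue the abstract refers to.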

Honglin Wu, Shaoming Liu
Improving Machine Transliteration Performance by Using Multiple Transliteration Models

Machine transliteration has received significant attention as a supporting tool for machine translation and cross-language information retrieval. During the last decade, four kinds of transliteration model have been studied: the grapheme-based model, the phoneme-based model, the hybrid model, and the correspondence-based model. These models are classified in terms of the information sources for transliteration or the units to be transliterated: source graphemes, source phonemes, both source graphemes and source phonemes, and the correspondence between source graphemes and phonemes, respectively. Although each transliteration model has shown relatively good performance, one model alone has limitations in handling complex transliteration behaviors. To address the problem, we combined different transliteration models with a “generating transliterations followed by their validation” strategy. The strategy makes it possible to consider complex transliteration behaviors using the strengths of each model and to improve transliteration performance by validating transliterations. Our method makes use of web-based and transliteration model-based validation for transliteration validation. Experiments showed that our method outperforms both the individual transliteration models and previous work.

Jong-Hoon Oh, Key-Sun Choi, Hitoshi Isahara

Information Retrieval/Document Classification/QA/Summarization II

Clique Percolation Method for Finding Naturally Cohesive and Overlapping Document Clusters

Techniques for finding document clusters mostly depend on models that impose strong explicit and/or implicit a priori assumptions. As a consequence, the clustering effects tend to be unnatural and stray away from the intrinsic grouping nature of a document collection. We apply a novel graph-theoretic technique called the Clique Percolation Method (CPM) for document clustering. In this method, a process of enumerating highly cohesive maximal document cliques is performed in a random graph, where strongly adjacent cliques are merged to form naturally overlapping clusters. Our clustering results can unveil the inherent structural connections of the underlying data. Experiments show that CPM can outperform some typical algorithms on benchmark data sets, and shed light on its advantages for natural document clustering.

Wei Gao, Kam-Fai Wong, Yunqing Xia, Ruifeng Xu
Hybrid Approach to Extracting Information from Web-Tables

This study concerns the extracting of information from tables in HTML documents. In our previous work, as a prerequisite for information extraction from tables in HTML, algorithms for separating meaningful tables and decorative tables were constructed, because only meaningful tables can be used to extract information and a preponderant proportion of decorative tables in training harms the learning result. In order to extract information, this study separated the head from the body in meaningful tables by extending the head extraction algorithm that was constructed in our previous work, using a machine learning algorithm, C4.5, and set up heuristics for table-schema extraction from meaningful tables by analyzing their head(s). In addition, table information in triples was extracted by determining the relation between the data and the extracted table schema. We obtained 71.2% accuracy in extracting table-schemata and information from the meaningful tables.

Sung-won Jung, Mi-young Kang, Hyuk-chul Kwon
A Novel Hierarchical Document Clustering Algorithm Based on a kNN Connection Graph

Bottom-up hierarchical document clustering normally merges the two most similar clusters in each iteration. This paper proposes a novel bottom-up hierarchical document clustering algorithm that merges several pairs of most similar clusters in each step. This is done via the concept of “kNN-connectedness”, which measures the mutual connectedness of clusters in kNNs, and a kNN connection graph, which organizes given clusters into several sets of kNN-connected clusters. In such a graph, a connection between any two clusters only exists in the kNN-connected clusters of the same set. Moreover, a new kNN-based attraction function is proposed to measure the similarity between two clusters and indicate the potential probability of the two clusters being merged. The attraction function only considers the relative distribution of nearest neighbors between two clusters in a vector space, while other criteria, such as the well-known cluster-based cosine similarity function, measure the absolute distance between two clusters. This makes the attraction function effective in cases where different clusters have very different distance variation. In each step, a kNN connection graph, consisting of several sets of kNN-connected clusters, is first constructed from the given clusters using a kNN algorithm and the concept of “kNN-connectedness”. For each set of kNN-connected clusters, the attraction degree between any two clusters is calculated and the top connected cluster pairs are merged. In this way, the number of iterations can be greatly reduced and the clustering process greatly sped up. Evaluation on a news document corpus shows that the kNN connection graph-based hierarchical document clustering algorithm can achieve better performance than the well-known k-means clustering algorithm while sharply reducing the number of iterations in comparison with normal hierarchical document clustering.

Qiaoming Zhu, Junhui Li, Guodong Zhou, Peifeng Li, Peide Qian

Poster Session 1

The Great Importance of Cross-Document Relationships for Multi-document Summarization

Graph-based methods have been developed for multi-document summarization in recent years and they make use of the relationships between sentences in a graph-based ranking algorithm to extract salient sentences. This paper proposes to differentiate the cross-document relationships and the within-document relationships between sentences for multi-document summarization. The two kinds of relationships between sentences are deemed to have unequal contributions in the graph-based ranking algorithm. We apply the graph-based ranking algorithm based on each kind of sentence relationships and explore their relative importance for multi-document summarization. Experimental results on DUC 2002 and DUC 2004 data demonstrate the great importance of the cross-document relationships between sentences for multi-document summarization. Even the system based only on the cross-document relationships can perform better than or at least as well as the systems based on both kinds of relationships between sentences.

Xiaojun Wan, Jianwu Yang, Jianguo Xiao
The Effects of Computer Assisted Instruction to Train People with Reading Disabilities Recognizing Chinese Characters

Chinese stem-deriving instruction has been proved to effectively help people with reading disabilities recognize Chinese characters. With the applications and development of information technology, cybernetic Chinese stem-deriving instruction can help more people with reading disabilities learn Chinese characters and read articles more effectively. In this study, we develop a computer-assisted instruction method for Chinese stem-deriving instruction and compare three teaching strategies. We recruit three elementary students with reading disabilities as participants, and evaluate the effectiveness of instruction with a proposed teaching strategy.

Wan-Chih Sun, Tsung-Ren Yang, Chih-Chin Liang, Ping-Yu Hsu, Yuh-Wei Kung
Discrimination-Based Feature Selection for Multinomial Naïve Bayes Text Classification

In this paper we focus on class discrimination issues in order to improve the performance of text classification, and study a discrimination-based feature selection technique in which features are selected based on the criterion of enlarging the separation among competing classes, referred to as discrimination capability. The proposed approach discards features with small discrimination capability, measured by Gaussian divergence, so as to enhance the robustness and the discrimination power of the text classification system. To evaluate its performance, comparison experiments with a multinomial naïve Bayes classifier model are conducted on the Newsgroup and Reuters-21578 data collections. Experimental results show that on the Newsgroup data set the divergence measure outperforms the MI measure and performs slightly better than the DF measure, and that it outperforms both measures on the Reuters-21578 data set. This shows that the discrimination-based feature selection method contributes to enhancing the discrimination power of the text classification model.

Jingbo Zhu, Huizhen Wang, Xijuan Zhang
A Comparative Study on Chinese Word Clustering

This paper evaluates four unsupervised Chinese word clustering methods, namely maximum mutual information (MMI), function word (FW), high frequent word (HFW), and word cluster (WC). Two evaluation measures, part-of-speech (POS) precision and semantic precision, are employed. Testing results show that MMI reaches the best performance: 79.09% on POS precision and 49.75% on semantic precision, while the other three exceed 51.09% and 29.78% respectively. When word clusters generated by the methods mentioned above are applied to alignment-based automatic Chinese syntactic induction, the performance is further improved.

Bo Wang, Houfeng Wang
Populating FrameNet with Chinese Verbs Mapping Bilingual Ontological WordNet with FrameNet

This paper describes the construction of a linguistic knowledge base using Frame Semantics, instantiated with Chinese Verbs imported from the Chinese-English Bilingual Ontological WordNet (BOW). The goal is to use this knowledge base to assist with semantic role labeling. This is accomplished through the mapping of FrameNet and WordNet and a novel verb selection restriction using both the WordNet inter-relations and the concept classification in the Suggested Upper Merged Ontology (SUMO). The FrameNet WordNet mapping provides a channel for Chinese verbs to interface with Frame Semantics. By taking the mappings between verbs and frames as learning data, we attempt to identify subsuming SUMO concepts for each frame and further identify more Chinese verbs on the basis of synset inter-relations in WordNet.

Ian C. Chow, Jonathan J. Webster
Collecting Novel Technical Terms from the Web by Estimating Domain Specificity of a Term

This paper proposes a method of domain specificity estimation of technical terms using the Web. In the proposed method, it is assumed that, for a certain technical domain, a list of known technical terms of the domain is given. Technical documents of the domain are collected through a Web search engine and are then used for generating a vector space model for the domain. The domain specificity of a target term is estimated according to the distribution of the domain of the sample pages of the target term. We apply this technique of estimating the domain specificity of a term to the task of discovering novel technical terms that are not included in any existing lexicon of technical terms of the domain. Out of 1,000 randomly selected candidate technical terms per domain, we discovered about 100 to 200 novel technical terms.

Takehito Utsuro, Mitsuhiro Kida, Masatsugu Tonoike, Satoshi Sato
Building Document Graphs for Multiple News Articles Summarization: An Event-Based Approach

Since most news articles report several events and these events are referred to in many related documents, we propose an event-based approach to visualize documents as graphs at different conceptual granularities. With a graph-based ranking algorithm, we illustrate the application of the document graph to multi-document summarization. Experiments on DUC data indicate that our approach is competitive with state-of-the-art summarization techniques. This graphical representation, which does not require training corpora, can potentially be adapted to other languages.

Wei Xu, Chunfa Yuan, Wenjie Li, Mingli Wu, Kam-Fai Wong
A Probabilistic Feature Based Maximum Entropy Model for Chinese Named Entity Recognition

This paper proposes a probabilistic feature based Maximum Entropy (ME) model for Chinese named entity recognition. Probabilistic feature functions are used instead of binary feature functions; this is one of several differences between this model and most previous ME-based models. We also explore several new features in our model, including confidence functions and the position of features. As in some previous work, we use sub-models to model Chinese person names, foreign names, location names and organization names respectively, but we bring some new techniques to these sub-models. Experimental results show that our ME model, combining the above new elements, brings significant improvements.

Suxiang Zhang, Xiaojie Wang, Juan Wen, Ying Qin, Yixin Zhong
Correcting Bound Document Images Based on Automatic and Robust Curved Text Lines Estimation

Geometric distortion often occurs when taking images of bound documents. This phenomenon greatly impairs recognition accuracy. In this paper, a new one-image based method is proposed to correct geometric distortion in bound document images. According to this method, the document image is binarized first. Next, curved text-line features are extracted. Thirdly, locally optimized text curves are detected using a graph model. Finally, the technique of texture warping is applied to correct the image. Experimental results show that images restored by our proposed method can achieve good perception and recognition results.

Yichao Ma, Chunheng Wang, Ruwei Dai
Cluster-Based Patent Retrieval Using International Patent Classification System

A patent collection provides a great test-bed for cluster-based information retrieval. The International Patent Classification (IPC) system provides a hierarchical taxonomy with 5 levels of specificity. We regard IPC codes of patent applications as cluster information, manually assigned by patent officers according to their subjects. Such manual clustering provides advantages over automatically built clusters using document term similarities. Previous research has successfully applied cluster-based retrieval models using language modeling. We develop cluster-based language models that exploit the advantages of having manually clustered documents.

Jungi Kim, In-Su Kang, Jong-Hyeok Lee
Word Error Correction of Continuous Speech Recognition Using WEB Documents for Spoken Document Indexing

This paper describes an error correction method for continuous speech recognition that uses WEB documents for spoken document indexing. We performed an error correction experiment on automatically transcribed news speech, focusing especially on proper nouns. Two LVCSR systems were used to detect correctly and incorrectly recognized words. Keywords for the Internet search engine were selected from among the correctly transcribed words, and correction candidates for the mis-recognized words were then obtained from the retrieved documents. A Dynamic Programming (DP) technique with a confusion matrix was used to compare the candidates with the mis-recognized words. In the error correction experiment, the recognition rate of proper nouns improved by about 10% through the use of WEB documents.

Hiromitsu Nishizaki, Yoshihiro Sekiguchi
Extracting English-Korean Transliteration Pairs from Web Corpora

Transliteration pair acquisition has received significant attention as a technique for constructing up-to-date transliteration lexicons, and for supporting machine translation and cross-language information retrieval. Previous studies on transliteration pair acquisition focused on only the phonetic similarity model but seldom considered the real-usage of transliterations in texts. Moreover, previous web-based validation models considered only one-way validation (validation from the viewpoint of a source term) rather than joint validation between a source term and a target term. To address these problems, we propose a novel transliteration pair acquisition model that extracts transliteration pairs from the Web and validates the pairs by combining the phonetic similarity and joint web-validation models. Experiments demonstrated that our transliteration pair acquisition model was effective.

Jong-Hoon Oh, Hitoshi Isahara

Word Segmentation/Chunking/Abbreviation Expansion/Writing-System Issues

From Phoneme to Morpheme: Another Verification Using a Corpus

We scientifically test Harris’s hypothesis that morpheme/word boundaries can be detected from changes in the complexity of phoneme sequences. We re-formulate his hypothesis from a more information theoretic viewpoint and use a corpus to test whether the hypothesis holds. We found that his hypothesis holds for morphemes, with an F-score of about 80%, in both English and Chinese. However, we obtained contrary results for English and Chinese with regard to word boundaries; this reflects a difference in the nature of the two languages.

Kumiko Tanaka-Ishii, Zhihui Jin
Chinese Abbreviation Identification Using Abbreviation-Template Features and Context Information

Chinese abbreviations are frequently used without being defined, which causes much difficulty for NLP. In this study, the definition-independent abbreviation identification problem is proposed and resolved as a classification task in which abbreviation candidates are classified as either ‘abbreviation’ or ‘non-abbreviation’ according to the posterior probability. To meet our aim of identifying new abbreviations from existing ones, our solution is to add generalization capability to the abbreviation lexicon by replacing words with word classes, thereby creating abbreviation-templates. Using abbreviation-template features as well as context information, an SVM model is employed as the classifier. The evaluation on a raw Chinese corpus shows encouraging performance. Our experiments further demonstrate the improvement after integrating morphological analysis, substring analysis and person name identification.

Xu Sun, Houfeng Wang
Word Frequency Approximation for Chinese Using Raw, MM-Segmented and Manually Segmented Corpora

Word frequencies play important roles in many NLP-related applications. Word frequency estimation for Chinese remains a big challenge due to the characteristics of Chinese. An underlying fact is that a perfect word-segmented Chinese corpus never exists, and currently we only have raw corpora, which can be of arbitrarily large size, automatically word-segmented corpora derived from raw corpora, and a number of manually word-segmented corpora, with relatively smaller size, which are developed under various word segmentation standards by different researchers. In this paper we propose a new scheme to do word frequency approximation by combining the factors above. Experiments indicate that in most cases this scheme can benefit the word frequency estimation, though in other cases its performance is still not very satisfactory.

Wei Qiao, Maosong Sun
Identification of Maximal-Length Noun Phrases Based on Expanded Chunks and Classified Punctuations in Chinese

In general, there are two types of noun phrases (NP): Base Noun Phrase (BNP), and Maximal-Length Noun Phrase (MNP). MNP identification can largely reduce the complexity of full parsing, help analyze the general structure of complex sentences, and provide important clues for detecting main predicates in Chinese sentences. In this paper, we propose a 2-phase hybrid approach for MNP identification which adopts salient features such as expanded chunks and classified punctuations to improve performance. Experimental results show a high-quality performance of 89.66% in F1-measure.

Xue-Mei Bai, Jin-Ji Li, Dong-Il Kim, Jong-Hyeok Lee
A Hybrid Approach to Chinese Abbreviation Expansion

This paper presents a hybrid approach to Chinese abbreviation expansion. In this study, each short-form in Chinese text is assumed to be created by the method of reduction and the method of elimination or generalization, respectively. A mapping table between short words and long words and a dictionary of non-reduced short-form/full-form pairs are thus applied to generate the respective expansion candidates. Then, a hidden Markov model (HMM) based disambiguation is employed to rank these candidates and select a proper expansion for each ambiguous abbreviation. In order to improve expansion accuracy, some linguistic knowledge like discourse information and abbreviation patterns are further employed to double-check the expanded results and revise some error expansions if any. The proposed approach was evaluated on an abbreviation-expanded corpus built from the Peking University Corpus. The results showed that a recall of 83.8% and a precision of 86.3% can be achieved on average for different types of Chinese abbreviations.

Guohong Fu, Kang-Kuong Luke, Min Zhang, GuoDong Zhou
Category-Pattern-Based Korean Word-Spacing

It is difficult to cope with data sparseness, unless augmenting the size of the dictionary in a stochastic-based word-spacing model is an option. To resolve both data sparseness and the dictionary memory size problem, this paper describes the process of dynamically providing candidate words to detect correct words using morpheme unigrams and their categories. Each candidate word’s probability was estimated from the morpheme probability, which was weighted according to its category. The category weights were trained to minimize the mean of the errors between the observed probability of a word and that estimated by the word’s individual morpheme probability weighted by its category power in a category pattern for producing the given word.

Mi-young Kang, Sung-won Jung, Hyuk-chul Kwon
An Integrated Approach to Chinese Word Segmentation and Part-of-Speech Tagging

This paper discusses and compares various integration schemes of Chinese word segmentation and part-of-speech tagging in the framework of true-integration and pseudo-integration. A true-integration approach, named ‘the divide-and-conquer integration’, is presented. The experiments based on a manually word-segmented and part-of-speech tagged corpus with about 5.8 million words show that this true integration achieves 98.61% F-measure in word segmentation, 95.18% F-measure in part-of-speech tagging, and 93.86% F-measure in word segmentation and part-of-speech tagging, outperforming all other kinds of combinations to some extent. The experimental results demonstrate the potential for further improving the performance of Chinese word segmentation and part-of-speech tagging.

Maosong Sun, Dongliang Xu, Benjamin K. Tsou, Huaming Lu
Kansuke: A Kanji Look-Up System Based on a Few Stroke Prototypes

We have developed a method that makes it easier for language beginners to look up Japanese kanji characters. Instead of using the arbitrary conventions of kanjis, this method is based on three simple prototypes: horizontal, vertical, and other strokes. For example, the code for the kanji 田 (ta, meaning rice field) is ‘3-3-0’, indicating the kanji consists of three horizontal strokes and three vertical strokes. Such codes allow a beginner to look up kanjis even with no knowledge of the ideographic conventions used by native speakers. To make the search easier, a complex kanji can be looked up via the components making up the kanji. We conducted a user evaluation of this system and found that non-native speakers could look up kanjis more quickly and reliably, and with fewer failures, with our system than with conventional methods.
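
The three-prototype code in this abstract can be sketched as a simple count of stroke types. The function name `kansuke_code` and the input representation (a list of stroke labels as a user would perceive them) are assumptions for illustration; how the actual system decomposes a kanji into strokes is not specified in the abstract.

```python
def kansuke_code(strokes):
    """Build a 'horizontal-vertical-other' code from a list of stroke labels."""
    counts = {"horizontal": 0, "vertical": 0, "other": 0}
    for stroke in strokes:
        if stroke not in counts:
            raise ValueError(f"unknown stroke type: {stroke!r}")
        counts[stroke] += 1
    return f"{counts['horizontal']}-{counts['vertical']}-{counts['other']}"

# The rice-field kanji as described in the abstract: three horizontal
# and three vertical strokes, no others.
ta_strokes = ["horizontal"] * 3 + ["vertical"] * 3
ta_code = kansuke_code(ta_strokes)
```

For the example in the abstract this yields the code 3-3-0. Note that the order in which strokes are listed does not matter, which fits the goal of not requiring knowledge of stroke-order conventions.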

Kumiko Tanaka-Ishii, Julian Godon
Modelling the Orthographic Neighbourhood for Japanese Kanji

Japanese kanji recognition experiments are typically narrowly focused, and feature only native speakers as participants. It remains unclear how to apply their results to kanji similarity applications, especially when learners are much more likely to make similarity-based confusion errors. We describe an experiment to collect authentic human similarity judgements from participants of all levels of Japanese proficiency, from non-speaker to native. The data was used to construct simple similarity models for kanji based on pixel difference and radical cosine similarity, in order to work towards genuine confusability data. The latter model proved the best predictor of human responses.
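As a rough illustration of a radical-based cosine similarity, the sketch below treats each kanji as a bag of radicals and compares the resulting count vectors. The function name `radical_cosine` and the example decompositions are assumptions made for illustration; the abstract does not specify how the authors decompose kanji or weight radicals.

```python
import math
from collections import Counter

def radical_cosine(radicals_a, radicals_b):
    """Cosine similarity between two kanji represented as bags of radicals."""
    a, b = Counter(radicals_a), Counter(radicals_b)
    # Dot product over the radicals the two kanji share.
    dot = sum(a[r] * b[r] for r in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty decomposition shares nothing with anything
    return dot / (norm_a * norm_b)

# Illustrative decompositions only; a real system would use a radical dictionary.
decomposition = {
    "明": ["日", "月"],
    "時": ["日", "土", "寸"],
}
similarity = radical_cosine(decomposition["明"], decomposition["時"])
print(round(similarity, 3))
```

Two kanji with no radical in common score 0, and two kanji with identical decompositions score 1, whatever their visual layout, which is one reason the abstract also considers a pixel-based model.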

Lars Yencken, Timothy Baldwin
Reconstructing the Correct Writing Sequence from a Set of Chinese Character Strokes

A Chinese character is composed of several strokes ordered in a particular sequence. The stroke sequence contains useful online information for handwriting recognition and handwriting education. Although some general heuristic stroke sequence rules exist, these rules can sometimes be inconsistent, making it difficult to apply them to determine the standard writing sequence given a set of strokes. In this paper, we propose a method to estimate the standard writing sequence given the strokes of a Chinese character. The strokes are modeled as discrete states, with the state transition costs determined by classifying each stroke pair into forward or backward order using positional features. Candidate sequences are found by a shortest path algorithm, and when there is more than one candidate sequence, the final decision is made according to the total handwriting energy. Experiments show that our method performs better than existing approaches.

Kai-Tai Tang, Howard Leung

Machine Translation II

Expansion of Machine Translation Bilingual Dictionaries by Using Existing Dictionaries and Thesauruses

This paper gives a method of expanding bilingual dictionaries by creating a new multi-word entry (MWE) and its possible translation previously unregistered in bilingual dictionaries by replacing one of the components of a registered MWE with its semantically similar words, and then selecting appropriate lexical entries from the pairs of new MWEs and their possible translations according to a prioritizing method. In the proposed method, the pairs of new nominal MWEs and their possible translations are prioritized by referring to more than one thesaurus and considering the number of original MWEs from which a single new MWE is created. As a result, the pairs which are effective for improving translation quality, if registered in bilingual dictionaries, are acquired with an improvement of 55.0% for the top 500 prioritized pairs. This accuracy rate exceeds the one marked with the baseline method.

Takeshi Kutsumi, Takehiko Yoshimi, Katsunori Kotani, Ichiko Sata, Hitoshi Isahara
Feature Rich Translation Model for Example-Based Machine Translation

Most EBMT systems select the best example scored by the similarity between the input sentence and existing examples. However, much matching and mutual-translation information in the examples remains unexplored. This paper introduces a log-linear translation model into EBMT in order to adequately incorporate the different kinds of features inherent in the translation examples. Instead of designing the translation model by human intuition, this paper formally constructs a multi-dimensional feature space to include various features of different aspects. In the experiments, the proposed model shows significantly better results.

Yin Chen, Muyun Yang, Sheng Li, Hongfei Jiang
Dictionaries for English-Vietnamese Machine Translation

Dictionaries play an important role in Rule-Based Machine Translation. Many efforts have been concentrated on building machine-readable dictionaries. However, researchers have long debated the structure and entries of these dictionaries. We develop a syntactic-semantic structure for the English-Vietnamese dictionary as a first measure to solve the lexical gap problem, then use extended features to improve the Vietnamese dictionary in order to obtain grammatical target sentences. This work describes the dictionaries used in English-Vietnamese Machine Translation (EVMT) at Ho Chi Minh City University of Technology. There are three dictionaries: the English dictionary, the bilingual English-Vietnamese dictionary, and the Vietnamese dictionary.

Le Manh Hai, Nguyen Chanh Thanh, Nguyen Chi Hieu, Phan Thi Tuoi

Poster Session 2

Translation Selection Through Machine Learning with Language Resources

Knowledge acquisition is a critical problem for machine translation and translation selection. In this paper, I propose a translation selection method that combines variable features from multiple language resources using machine learning. I introduce multiple measures for sense disambiguation and word selection that are based on language resources, and apply machine learning to combine those measures for translation selection. In evaluation, the precision of translation selection improves even though a small-sized bilingual corpus is used as training data.

Hyun Ah Lee
Acquiring Translational Equivalence from a Japanese-Chinese Parallel Corpus

This paper presents our work on acquiring translational equivalence from a Japanese-Chinese parallel corpus. We follow and extend existing word alignment techniques, including a statistical model and a heuristic model, in order to achieve high performance. In addition to the statistics of the parallel corpus, lexical knowledge of the language pair, such as orthographic cognates and a bilingual dictionary, is exploited. The implemented aligner is applied to the annotation of word alignment in the parallel corpus, and an evaluation is also conducted. The experimental results demonstrate the usability of the aligner in our task.

Yujie Zhang, Qing Ma, Qun Liu, Wenliang Chen, Hitoshi Isahara
Deep Processing of Korean Floating Quantifier Constructions

The so-called floating quantifier constructions in languages like Korean display intriguing properties whose successful processing can prove the robustness of a parsing system. This paper shows that a constraint-based analysis, in particular couched upon the framework of HPSG, can offer us an efficient way of parsing these constructions together with proper semantic representations. It also shows how the analysis has been successfully implemented in the LKB (Linguistic Knowledge Building) system.

Jong-Bok Kim, Jaehyung Yang
Compilation of a Dictionary of Japanese Functional Expressions with Hierarchical Organization

The Japanese language has many functional expressions, which consist of more than one word and behave like a single functional word. A remarkable characteristic of Japanese functional expressions is that each functional expression has many different surface forms. This paper proposes a methodology for compiling a dictionary of Japanese functional expressions with hierarchical organization. We use a hierarchy with nine abstraction levels: the root node is a dummy node that governs all entries; a node in the first level is a headword in the dictionary; a leaf node corresponds to a surface form of a functional expression. Two or more lists of functional expressions can be integrated into this hierarchy. This hierarchy also provides a way of systematically generating all the different surface forms. We have compiled a dictionary with 292 headwords and 13,958 surface forms, which covers almost all major functional expressions.

Suguru Matsuyoshi, Satoshi Sato, Takehito Utsuro
A System to Indicate Honorific Misuse in Spoken Japanese

We developed a computational system to indicate the misuse of honorifics in word form and in performance of expressions in Japanese speech sentences. The misuse in word form was checked by constructing a list of expressions whose word form is bad in terms of honorifics. The misuse in performance was checked by constructing a consistency table. The consistency table defined the consistency between the honorific features of sentences and the social relationship among the people involved in the sentence. The social relationship was represented by combinations of [the number of people involved in the sentence] × [relative social position among the people] × [in-group/out-group relationship among the people]. The consistency table was obtained by using a machine learning technique. The proposed system was verified using test data prepared by the authors and also by third-party linguistic researchers. The results showed that the system was able to discriminate between the correct and the incorrect honorific sentences in all but a few cases. Furthermore, differences in the educational importance among the norms used in the system were revealed based on experiments using sentences written by people who are not linguistic experts.

Tamotsu Shirado, Satoko Marumoto, Masaki Murata, Kiyotaka Uchimoto, Hitoshi Isahara
A Chinese Corpus with Word Sense Annotation

This paper presents the construction of a Chinese word sense-tagged corpus. The resulting lexical resource includes three main components: 1) a corpus annotated with word senses; 2) a lexicon containing sense distinctions and descriptions in a feature-based formalism; 3) the linking between the sense entries in the lexicon and CCD synsets. A dynamic model is put forward to build the three knowledge bases simultaneously and interactively. A strategy to improve consistency is addressed, since consistency is a thorny issue in constructing semantic resources. The inter-annotator agreement of the sense-tagged corpus is satisfactory. The database will grow into a powerful lexical resource both for linguistic research on Chinese lexical semantics and for word sense disambiguation.

Yunfang Wu, Peng Jin, Yangsen Zhang, Shiwen Yu
Multilingual Machine Translation of Closed Captions for Digital Television with Dynamic Dictionary Adaptation

In this paper, we present a multilingual machine translation system for closed captions for digital television. To cope with the frequent appearance of unregistered words and with articles from multiple domains, as in TV news programs, we propose a Dynamic Dictionary Adaptation method. We adopted live resources of multilingual Named Entities and their translingual equivalences from Web sites of daily news, providing multilingual daily news in Chinese, English, Japanese and Korean. We also utilize Dynamic Domain Identification for automatic dictionary stacking. With these integrated approaches, we obtained an average translation quality enhancement of 0.5 in Mean Opinion Score (MOS) for Korean-to-Chinese. We also had 0.5 and 0.1 average enhancement for Korean-English and Korean-Japanese, respectively. The average enhancement is 0.37, which means almost a third level up to the next higher MOS scale.
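The overall figure is consistent with the three per-pair enhancements quoted in the abstract, assuming it is an unweighted mean over the three language pairs (the abstract does not state how the average was computed):

$$\frac{0.5 + 0.5 + 0.1}{3} = \frac{1.1}{3} \approx 0.37$$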

Sanghwa Yuh, Jungyun Seo
Acquiring Concept Hierarchies of Adjectives from Corpora

We describe a method to acquire a distribution of the concepts of adjectives automatically by using a self-organizing map and a directional similarity measure. A means of evaluating concept hierarchies of adjectives extracted automatically from corpora is elucidated. We used Scheffe’s method of paired comparison to test experimentally the validity of hierarchies thus obtained with human intuition and found that our method was effective for 43% of the hierarchies considered.

Kyoko Kanzaki, Qing Ma, Eiko Yamamoto, Tamotsu Shirado, Hitoshi Isahara
Pronunciation Similarity Estimation for Spoken Language Learning

This paper presents an approach for estimating pronunciation similarity between two speakers using the cepstral distance. General speech recognition systems have been used to find the matched words of a speaker, using the acoustical score of a speech signal and the grammatical score of a word sequence. When a speaker with impaired hearing is learning a language, it is not easy to estimate pronunciation similarity using automatic speech recognition systems, as this requires more information on pronunciation characteristics than on word matching. This is a new challenge for computer-aided pronunciation learning. The dynamic time warping algorithm is used to compute the cepstral distance between two speech data, with the codebook distance subtracted to account for the characteristics of each speaker. Experiments on the Korean fundamental vowel set show that the similarity of two speakers’ pronunciation can be computed efficiently by computer.
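For readers unfamiliar with dynamic time warping, the following is a minimal textbook sketch of the alignment cost between two sequences of feature vectors. It uses a plain Euclidean local distance and does not implement the codebook-distance subtraction mentioned in the abstract; the function names and toy inputs are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw_distance(seq_a, seq_b, local=euclidean):
    """Minimum accumulated local distance over all monotonic alignments."""
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b frame
                                 cost[i][j - 1],      # stretch seq_a frame
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# Toy one-dimensional "cepstral" frames: the second utterance holds the
# first frame one step longer, so warping aligns them at zero cost.
fast = [[0.0], [1.0]]
slow = [[0.0], [0.0], [1.0]]
print(dtw_distance(fast, slow))  # 0.0
```

The warping step is what lets two utterances spoken at different speeds be compared frame by frame; real systems typically also normalize the accumulated cost by path length, which this sketch omits.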

Donghyun Kim, Dongsuk Yook
A Novel Method for Rapid Speaker Adaptation Using Reference Support Speaker Selection

In this paper, we propose a novel method for rapid speaker adaptation based on speaker selection, called reference support speaker selection (RSSS). The speakers who are acoustically close to the test speaker are selected from the reference speakers using our proposed algorithm. Furthermore, a single-pass re-estimation procedure, conditioned on the selected speakers, is presented. The proposed method can quickly obtain a better reference speaker subset because the selection is dynamically determined according to reference support vectors. This adaptation strategy was evaluated in a large vocabulary speech recognition task. The experiments confirm the effectiveness of the proposed method.

Jian Wang, Zhen Yang, Jianjun Lei, Jun Guo
Using Latent Semantics for NE Translation

This paper describes an algorithm that assists in the discovery of Named Entity (NE) translation pairs from large corpora. It is based on Latent Semantic Analysis (LSA) and Cross-Lingual Latent Semantic Indexing (CL-LSI), and is demonstrated to be able to automatically discover new translation pairs in a bootstrapping framework. Some experiments are performed to quantify the interaction between corpus size, features and algorithm parameters, in order to better understand the workings of the proposed approach.

Boon Pang Lim, Richard W. Sproat
Chinese Chunking with Tri-training Learning

This paper presents a practical tri-training method for Chinese chunking using a small amount of labeled training data and a much larger pool of unlabeled data. We propose a novel selection method for tri-training learning in which newly labeled sentences are selected by comparing the agreements of three classifiers. In detail, in each iteration, a new sample is selected for a classifier if the other two classifiers agree on the labels while the classifier itself disagrees. We compare the proposed tri-training learning approach with the co-training learning approach on the UPenn Chinese Treebank V4.0 (CTB4). The experimental results show that the proposed approach can improve the performance significantly.
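The selection rule described in this abstract can be sketched as follows. The sketch assumes each classifier has already produced one label per unlabeled sample (for chunking, a label could be a whole tag sequence stored as a tuple); the function name `select_for_each` and the toy labels are hypothetical and not taken from the paper.

```python
def select_for_each(predictions):
    """Pick new training samples for each of three classifiers.

    predictions: three equal-length lists, one per classifier, holding the
    label each classifier assigns to every unlabeled sample.
    Returns {classifier_index: [(sample_index, agreed_label), ...]}.
    """
    if len(predictions) != 3:
        raise ValueError("tri-training needs exactly three classifiers")
    selected = {0: [], 1: [], 2: []}
    for idx, labels in enumerate(zip(*predictions)):
        for target in range(3):
            first, second = (labels[k] for k in range(3) if k != target)
            # Select only when the other two agree and the target disagrees.
            if first == second and labels[target] != first:
                selected[target].append((idx, first))
    return selected

# Toy labels for three unlabeled samples.
preds = [
    ["B", "I", "O"],  # classifier 0
    ["B", "O", "O"],  # classifier 1
    ["I", "O", "O"],  # classifier 2
]
print(select_for_each(preds))  # {0: [(1, 'O')], 1: [], 2: [(0, 'B')]}
```

Samples on which all three classifiers agree are skipped, since they would add nothing the target classifier does not already predict.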

Wenliang Chen, Yujie Zhang, Hitoshi Isahara
Binarization Approaches to Email Categorization

Email categorization has become very popular in personal information management. However, most n-way classification methods suffer from the feature unevenness problem, namely that features learned from training samples are distributed unevenly across folders. We argue that binarization approaches can handle this problem effectively. In this paper, three binarization techniques are implemented, i.e. one-against-rest, one-against-one and some-against-rest, using two assembling techniques, i.e. round robin and elimination. Experiments on email categorization show that these binarization approaches achieve significant improvement over an n-way baseline classifier.
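To make the binarization idea concrete, the sketch below shows how an n-way folder problem can be split into binary tasks under the one-against-rest and one-against-one schemes, and how pairwise decisions might be combined by simple majority voting. This is a generic illustration under stated assumptions: the folder names and function names are hypothetical, majority voting stands in for round robin assembling as commonly described, and the some-against-rest and elimination variants are not shown.

```python
from collections import Counter
from itertools import combinations

def one_against_rest(folders):
    """One binary task per folder: that folder versus all the others."""
    return [(f, [g for g in folders if g != f]) for f in folders]

def one_against_one(folders):
    """One binary task per unordered pair of folders."""
    return list(combinations(folders, 2))

def round_robin_vote(pairwise_winners):
    """Return the folder that wins the most pairwise decisions."""
    if not pairwise_winners:
        raise ValueError("no pairwise decisions to combine")
    # Ties resolve in favour of the folder seen first.
    return Counter(pairwise_winners).most_common(1)[0][0]

folders = ["work", "family", "spam"]  # hypothetical folder names
print(one_against_rest(folders))
print(one_against_one(folders))
# Suppose the three pairwise classifiers chose these folders for one email.
print(round_robin_vote(["work", "work", "family"]))  # work
```

With n folders, one-against-rest yields n binary tasks and one-against-one yields n(n-1)/2, so the pairwise scheme trains more classifiers, each on a smaller two-folder subset of the data.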

Yunqing Xia, Kam-Fai Wong
Investigating Problems of Semi-supervised Learning for Word Sense Disambiguation

Word Sense Disambiguation (WSD) is the problem of determining the right sense of a polysemous word in a given context. In this paper, we investigate the use of unlabeled data for WSD within the framework of semi-supervised learning, in which the original labeled dataset is iteratively extended by exploiting unlabeled data. This paper addresses two problems occurring in this approach: determining a subset of new labeled data at each extension and generating the final classifier. By giving solutions for these problems, we generate some variants of bootstrapping algorithms and apply them to word sense disambiguation. The experiments were done on the datasets of four words: interest, line, hard, and serve; and on the English lexical sample of Senseval-3.

Anh-Cuong Le, Akira Shimazu, Le-Minh Nguyen
Developing a Dialog System for New Idea Generation Support

A knowledge-based dialog system gives correct answers; however, it is unsuitable for open-ended input. On the other hand, Eliza makes open-ended conversations, but it gives no new information to its user. We propose a new type of dialog system. Our system lies between the above two dialog systems, and it converses about various topics and gives information related to the user’s utterances. This type of dialog is useful for generating new ideas especially when the user has an obscure desire to get information about his or her interest, but no concrete goal. Our system selects an appropriate sentence from a corpus to respond to a user’s utterance. The most proper response will have surface cohesion and semantic coherence with the user’s utterance. We made a trial system to converse about movies.

Masahiro Shibata, Yoichi Tomiura, Hideki Matsumoto, Tomomi Nishiguchi, Kensei Yukino, Akihiro Hino

Parsing/Semantics/Lexical Resources

The Incremental Use of Morphological Information and Lexicalization in Data-Driven Dependency Parsing

Typological diversity among the natural languages of the world poses interesting challenges for the models and algorithms used in syntactic parsing. In this paper, we apply a data-driven dependency parser to Turkish, a language characterized by rich morphology and flexible constituent order, and study the effect of employing varying amounts of morpholexical information on parsing performance. The investigations show that accuracy can be improved by using representations based on inflectional groups rather than word forms, confirming earlier studies. In addition, lexicalization and the use of rich morphological features are found to have a positive effect. By combining all these techniques, we obtain the highest reported accuracy for parsing the Turkish Treebank.

Gülşen Eryiğit, Joakim Nivre, Kemal Oflazer
Pattern Dictionary Development Based on Non-compositional Language Model for Japanese Compound and Complex Sentences

A large-scale sentence pattern dictionary (SP-dictionary) for Japanese compound and complex sentences has been developed. The dictionary has been compiled based on the non-compositional language model. Sentences with 2 or 3 predicates are extracted from a Japanese-to-English parallel corpus of 1 million sentences, and the compositional constituents contained within them are generalized to produce an SP-dictionary containing a total of 215,000 pattern pairs. In evaluation tests, the SP-dictionary achieved a syntactic coverage of 92% and a semantic coverage of 70%.

Satoru Ikehara, Masato Tokuhisa, Jin’ichi Murakami, Masashi Saraki, Masahiro Miyazaki, Naoshi Ikeda
A Case-Based Reasoning Approach to Zero Anaphora Resolution in Chinese Texts

Anaphora is a common phenomenon in discourses as well as an important research issue in the applications of natural language processing. In this paper, both intra-sentential and inter-sentential zero anaphora in Chinese texts are addressed. Unlike general rule-based approaches, our resolution method is embedded with a case-based reasoning mechanism which has the benefit of knowledge acquisition if the case size varies. In addition, the presented approach employs informative features with the help of two outer knowledge resources. Compared to rule-based approaches, our resolution to 1047 zero anaphora instances achieved 82% recall and 77% precision.

Dian-Song Wu, Tyne Liang
Building a Collocation Net

This paper presents an approach to building a novel two-level collocation net, which enables calculation of the collocation relationship between any two words, from a large raw corpus. The first level consists of atomic classes (each atomic class consists of one word and feature bigram), which are clustered into the second level class set. Each class in both levels is represented by its collocation candidate distribution, extracted from the linguistic analysis of the raw training corpus, over possible collocation relation types. In this way, all the information extracted from the linguistic analysis is kept in the collocation net. Our approach applies to both frequently and less frequently occurring words by providing a clustering mechanism to resolve the data sparseness problem through the collocation net. Experimentation shows that the collocation net is efficient and effective in solving the data sparseness problem and determining the collocation relationship between any two words.

GuoDong Zhou, Min Zhang, GuoHong Fu
Backmatter
Metadata
Title
Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead
Edited by
Yuji Matsumoto
Richard W. Sproat
Kam-Fai Wong
Min Zhang
Copyright Year
2006
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-49668-7
Print ISBN
978-3-540-49667-0
DOI
https://doi.org/10.1007/11940098