
About this Book

This book constitutes the refereed proceedings of the 13th China National Conference on Computational Linguistics, CCL 2014, and of the First International Symposium on Natural Language Processing Based on Naturally Annotated Big Data, NLP-NABD 2014, held in Wuhan, China, in October 2014. The 27 papers presented were carefully reviewed and selected from 233 submissions. The papers are organized in topical sections on word segmentation; syntactic analysis and parsing the Web; semantics; discourse, coreference and pragmatics; textual entailment; language resources and annotation; sentiment analysis, opinion mining and text classification; large‐scale knowledge acquisition and reasoning; text mining, open IE and machine reading of the Web; machine translation; multilinguality in NLP; underresourced languages processing; NLP applications.



Word Segmentation

Unsupervised Joint Monolingual Character Alignment and Word Segmentation

We propose a novel Bayesian model for fully unsupervised word segmentation based on monolingual character alignment. Adapted bilingual word alignment models and a Bayesian language model are combined through a product of experts to estimate the joint posterior distribution of a monolingual character alignment and the corresponding segmentation. Our approach enhances the performance of conventional hierarchical Pitman-Yor language models with richer character-level features. In our experiments, the model achieves an 88.6% word token F-score on the standard Brent version of the Bernstein-Ratner corpus. Moreover, on standard Chinese segmentation datasets, our method outperforms a baseline model by 1.9-2.9 F-score points.
Zhiyang Teng, Hao Xiong, Qun Liu

Syntactic Analysis and Parsing the Web

Improving Multi-pass Transition-Based Dependency Parsing Using Enhanced Shift Actions

In multi-pass transition-based dependency parsing algorithms, the shift actions are usually inconsistent for the same node pair across different passes. Some node pairs do have a dependency relation, but the modifier node has not yet been completed as a subtree; the bottom-up parsing strategy requires performing a shift action for these node pairs. In this paper, we propose a method to improve parsing performance by using enhanced shift actions, which can further be used as features for subsequent parsing decisions. Experimental results show that our method effectively improves parsing performance.
Chenxi Zhu, Xipeng Qiu, Xuanjing Huang

Lexical Semantics and Ontologies

Diachronic Deviation Features in Continuous Space Word Representations

In distributed word representation, each word is represented as a unique point in the vector space. This paper extends this idea to a diachronic setting, where multiple word embeddings are generated from corpora of different time periods. These multiple embeddings can be mapped to a single target space via a linear transformation, so that in this target space each word is represented as a distribution. The deviation features of this distribution reflect the semantic variation of words across time periods. Experiments show that word groups with similar deviation features can indicate the hot topics of different eras, and that the frequency change of these word groups can be used to detect the period of a topic's peak prominence in history.
Ni Sun, Tongfei Chen, Liumingjing Xiao, Junfeng Hu
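The mapping-and-deviation idea in this abstract can be illustrated with a minimal numpy sketch. All vectors, words, and the rotation setup below are invented toy data, and the anchor-word least-squares fit is one common way to learn such a linear map, not necessarily the authors' exact procedure:

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Toy 2-d embeddings (all values invented for illustration).
base = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0]),
        "tree": np.array([1.0, 1.0])}
anchors = ["cat", "dog", "tree"]          # assumed semantically stable
drift = [np.array([0.5, 0.5]), np.array([0.7, 0.3]), np.array([0.9, 0.1])]

# Each period's space is an arbitrary rotation of the target space;
# only "web" actually moves relative to the anchor words.
periods = []
for i, theta in enumerate([0.0, 0.5, 1.0]):
    R = rot(theta)
    emb = {w: R @ base[w] for w in anchors}
    emb["web"] = R @ drift[i]
    periods.append(emb)

target = periods[0]  # map every period into the first period's space

def fit_map(src, tgt, words):
    """Least-squares matrix W with src[w] @ W ~= tgt[w] over anchor words."""
    X = np.stack([src[w] for w in words])
    Y = np.stack([tgt[w] for w in words])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

mapped = []
for p in periods:
    W = fit_map(p, target, anchors)
    mapped.append({w: p[w] @ W for w in p})

def deviation(word):
    """Spread of a word's mapped vectors across the periods."""
    pts = np.stack([m[word] for m in mapped])
    return float(np.linalg.norm(pts.std(axis=0)))

print(deviation("web") > deviation("cat"))  # "web" drifted, "cat" did not
```

In this toy setup the anchors land on the same point in every period after mapping, so their deviation is near zero, while the drifting word keeps a measurable spread.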

Ontology Matching with Word Embeddings

Ontology matching is one of the most important tasks for achieving the goal of the semantic web. To fulfill this task, element-level matching is an indispensable step for obtaining the fundamental alignment. In the element-level matching process, previous work generally utilizes WordNet to compute the semantic similarities among elements, but WordNet is limited by its coverage. In this paper, we introduce word embeddings to the field of ontology matching. We verified the superiority of word embeddings and present a hybrid method that incorporates word embeddings into the computation of semantic similarities among elements. We performed experiments on the OAEI benchmark, the conference track and real-world ontologies. The experimental results show that in element-level matching, word embeddings achieve better performance than previous methods.
Yuanzhe Zhang, Xuepeng Wang, Siwei Lai, Shizhu He, Kang Liu, Jun Zhao, Xueqiang Lv


Exploiting Multiple Resources for Word-Phrase Semantic Similarity Evaluation

Previous research on semantic similarity calculation has mainly focused on documents, sentences or concepts. In this paper, we study the semantic similarity of words and compositional phrases. The task is to judge the semantic similarity of a word and a short sequence of words. Based on a structured resource (WordNet), a semi-structured resource (Wikipedia) and an unstructured resource (the Web), this paper extracts rich, effective features to represent each word-phrase pair. The task can be treated as a binary classification problem, and we employ a Support Vector Machine to estimate whether the word and phrase of a given pair are similar. Experiments are conducted on SemEval 2013 Task 5a. Our method achieves 82.9% accuracy and outperforms the best system (80.3%) that participated in the task. Experimental results demonstrate the effectiveness of our proposed approach.
Xiaoqiang Jin, Chengjie Sun, Lei Lin, Xiaolong Wang
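The feature-then-classify idea can be sketched in a few lines. The gloss dictionary below is an invented stand-in for a WordNet-style resource, and a simple Jaccard overlap with a threshold stands in for the trained SVM; this is a sketch of the feature side only, not the paper's actual classifier:

```python
# Hypothetical gloss resource: word -> set of definition tokens.
glosses = {
    "dormitory": {"building", "where", "students", "live"},
    "cat": {"small", "domesticated", "feline"},
}

def jaccard(a, b):
    """Set overlap ratio; 0.0 for two empty sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def features(word, phrase):
    """Two toy features: resource-based overlap and surface overlap."""
    p = set(phrase.split())
    gloss = glosses.get(word, set())
    return [
        jaccard(gloss, p),                 # definition vs. phrase tokens
        jaccard(set(word), set(phrase)),   # character-level overlap
    ]

def is_similar(word, phrase, threshold=0.3):
    # Stand-in decision rule; the paper trains an SVM on such features.
    return max(features(word, phrase)) >= threshold

print(is_similar("dormitory", "building where students live"))  # True
print(is_similar("cat", "building where students live"))        # False
```

In the full system each resource contributes several such feature dimensions, and the SVM learns the decision boundary instead of a fixed threshold.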

Dependency Graph Based Chinese Semantic Parsing

Semantic Dependency Parsing (SDP) is a deep semantic analysis task, and a well-formed dependency scheme is its foundation. In this paper, we refine the HIT dependency scheme using stronger linguistic theories, yielding a dependency scheme with a clearer hierarchy. To cover Chinese semantics more comprehensively, we break away from the constraints of dependency trees and extend to graphs. Moreover, we utilize an SVM to parse semantic dependency graphs on the basis of dependency-tree parsing.
Yu Ding, Yanqiu Shao, Wanxiang Che, Ting Liu

Discourse, Coreference and Pragmatics

A Joint Learning Approach to Explicit Discourse Parsing via Structured Perceptron

Discourse parsing is a challenging task and plays a critical role in discourse analysis. In this paper, we focus on building an end-to-end PDTB-style explicit discourse parser via structured perceptron by decomposing it into two components: a connective labeler, which identifies connectives in a text and determines the discourse-relation senses they signal, and an argument labeler, which identifies the corresponding arguments for a given connective. In particular, to reduce error propagation and capture the interaction between the two components, a joint learning approach via structured perceptron is proposed. Evaluation on the PDTB corpus shows that our two-component explicit discourse parser achieves performance comparable to the state of the art, and that our joint learning approach significantly outperforms pipeline approaches.
Sheng Li, Fang Kong, Guodong Zhou

Textual Entailment

Chinese Textual Entailment Recognition Based on Syntactic Tree Clipping

Textual entailment has been proposed as a unifying generic framework for modeling language variability and semantic inference in different Natural Language Processing (NLP) tasks. This paper presents a novel statistical method for recognizing Chinese textual entailment in which lexical, syntactic and semantic matching features are combined. To address the difficulty of syntactic tree matching and the tree structure errors caused by Chinese word segmentation, the method first clips the syntactic trees into minimum information trees and then computes syntactic matching similarity on them. All features are used in a voting scheme under different machine learning methods to predict whether the text sentence entails the hypothesis sentence in a text-hypothesis pair. The experimental results show that the feature based on the clipped structure of the syntactic tree is effective and efficient for Chinese textual entailment.
Zhichang Zhang, Dongren Yao, Songyi Chen, Huifang Ma

Language Resources and Annotation

Automatic Collection of the Parallel Corpus with Little Prior Knowledge

As an important resource for machine translation and cross-language information retrieval, the collection of large-scale parallel corpora has received wide attention. With the development of the Internet, researchers have begun to mine parallel corpora from multilingual websites, using prior knowledge such as ad hoc heuristics or computing the similarity of webpage structure and content to find bilingual webpages. This paper presents a method that uses a search engine and little prior knowledge about URL patterns to find bilingual websites on the Internet. The method is fast, with low time cost and no need for large-scale computation on URL pattern matching. We collected 88,915 candidate parallel Chinese-English webpages, whose average accuracy is around 90.8%. In our evaluation, the true bilingual websites that we found have highly similar HTML structure and good-quality translations.
Shutian Ma, Chengzhi Zhang
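The URL-pattern idea can be illustrated with a toy candidate generator: given a page URL, propose counterpart URLs by swapping common language markers. The marker list is an invented example of the "little prior knowledge" the abstract refers to, not the authors' actual pattern set:

```python
# Hypothetical language-marker pairs found in bilingual site URLs.
SWAPS = [("/en/", "/zh/"), ("_en.", "_zh."), ("-english", "-chinese")]

def candidate_urls(url):
    """Propose possible counterpart URLs in the other language."""
    out = []
    for a, b in SWAPS:
        if a in url:
            out.append(url.replace(a, b))
        elif b in url:
            out.append(url.replace(b, a))
    return out

print(candidate_urls("http://example.com/en/news/1.html"))
# -> ['http://example.com/zh/news/1.html']
```

A real pipeline would then fetch each candidate and verify that the page exists and looks like a translation, e.g. by comparing HTML structure.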

The Chinese-English Contrastive Language Knowledge Base and Its Applications

In this paper, we introduce our ongoing research on a Chinese-English Contrastive Language Knowledge Base, including its architecture, the selection of its entries and the XML-based annotation schemes used. We also report on the progress of annotation. The knowledge base is linguistically motivated, focusing on a wide range of sub-sentential contrasts between Chinese and English. It will offer a new form of bilingual resources for NLP tasks, for use in contrastive linguistic research and translation studies, amongst others. Currently, joint efforts are being made to develop tools for Computer-Assisted Translation and Second Language Acquisition using this knowledge base.
Xiaojing Bai, Christoph Zähner, Hongying Zan, Shiwen Yu

Sentiment Analysis, Opinion Mining and Text Classification

Clustering Product Aspects Using Two Effective Aspect Relations for Opinion Mining

Aspect recognition and clustering are important for many sentiment analysis tasks. To date, many algorithms for recognizing product aspects have been explored; however, limited work has been done on clustering the product aspects. In this paper, we focus on the problem of product aspect clustering. Two effective aspect relations, the relevant aspect relation and the irrelevant aspect relation, are proposed to describe the relationships between two aspects. According to these two relations, we collect relevant and irrelevant aspects into two different sets that serve as background knowledge to describe each product aspect. A hierarchical clustering algorithm is then designed to cluster these aspects into different groups, in which aspect similarity is computed from the relevant and irrelevant aspect sets of each product aspect. Experimental results on the camera domain demonstrate that the proposed method performs better than a baseline that does not use the two aspect relations, which in turn shows that the two aspect relations are effective.
Yanyan Zhao, Bing Qin, Ting Liu
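A minimal sketch of the two-relation idea, with invented data: each aspect carries a relevant set and an irrelevant set, similarity rewards shared relevant aspects and penalizes conflicts with the irrelevant set, and the paper's hierarchical clustering is reduced here to a greedy single pass for brevity:

```python
# Toy aspect knowledge (invented): relevant and irrelevant aspect sets.
aspects = {
    "picture": {"rel": {"photo", "image"}, "irr": {"battery"}},
    "photo":   {"rel": {"picture", "image"}, "irr": {"price"}},
    "battery": {"rel": {"power"}, "irr": {"picture", "photo"}},
}

def sim(a, b):
    """Overlap of relevant sets, penalized by irrelevant-set conflicts."""
    ra, rb = aspects[a]["rel"], aspects[b]["rel"]
    ia, ib = aspects[a]["irr"], aspects[b]["irr"]
    overlap = len(ra & rb) / max(len(ra | rb), 1)
    conflict = 1 if (b in ia or a in ib) else 0
    return overlap - conflict

def cluster(names, threshold=0.3):
    """Greedy grouping: join the first group every member agrees with."""
    groups = []
    for n in names:
        for g in groups:
            if all(sim(n, m) >= threshold for m in g):
                g.append(n)
                break
        else:
            groups.append([n])
    return groups

print(cluster(list(aspects)))  # [['picture', 'photo'], ['battery']]
```

The irrelevant set is what keeps "battery" out of the picture/photo group even though all three aspects describe the same product.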

Text Classification with Document Embeddings

Distributed representations have gained a lot of interest in the natural language processing community. In this paper, we propose a method to learn document embeddings with a neural network architecture for the text classification task. In our architecture, each document can be represented as a fine-grained representation of its different meanings, so that classification can be done more accurately. The results of our experiments show that our method achieves better performance on two popular datasets.
Chaochao Huang, Xipeng Qiu, Xuanjing Huang
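The baseline against which such models are usually compared can be sketched in a few lines: averaging word vectors into a document vector and classifying on it. The toy vectors and the sentiment rule below are invented; the paper learns a richer, multi-sense document representation rather than a plain average:

```python
import numpy as np

# Invented toy word vectors; dimension 0 loosely encodes polarity.
word_vecs = {
    "good": np.array([1.0, 0.2]), "great": np.array([0.9, 0.3]),
    "bad": np.array([-1.0, 0.1]), "awful": np.array([-0.8, 0.2]),
}

def doc_embedding(tokens):
    """Average the vectors of known words; zero vector if none known."""
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

def classify(tokens):
    """Toy decision rule on the first embedding dimension."""
    return "pos" if doc_embedding(tokens)[0] > 0 else "neg"

print(classify("good great movie".split()))  # pos
print(classify("awful bad movie".split()))   # neg
```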

Large-Scale Knowledge Acquisition and Reasoning

Reasoning Over Relations Based on Chinese Knowledge Bases

Knowledge bases are a useful resource for many applications, but reasoning about new relationships between new entities based on them is difficult because they often lack knowledge of new relations and entities. In this paper, we introduce the Neural Tensor Network (NTN) [1] model to reason about new facts based on Chinese knowledge bases. We represent entities as the average of their constituent word or character vectors, which shares statistical strength between entities. The NTN model uses a tensor network to replace a standard neural layer, which strengthens the interaction of two entity vectors in a simple and efficient way. In experiments, we compare the NTN with several other models; the results show that all models' performance improves when word vectors are pre-trained on a large unsupervised corpus, while character vectors do not show this advantage. The NTN model outperforms the others and reaches high classification accuracies of 91.1% and 89.6% when using pre-trained word vectors and random character vectors, respectively. Therefore, when Chinese word segmentation is difficult, initialization with random character vectors is a feasible choice.
Guoliang Ji, Yinghua Zhang, Hongwei Hao, Jun Zhao

Text Mining, Open IE and Machine Reading of the Web

Distant Supervision for Relation Extraction via Sparse Representation

In relation extraction, distant supervision has been proposed to automatically generate a large amount of labeled data. Distant supervision heuristically aligns a given knowledge base to free text and treats the alignment as labeled data. This procedure is an effective way to obtain training data; however, the heuristic labeling procedure produces wrong labels, so the extracted features are noisy and cause poor extraction performance. In this paper, we exploit sparse representation to address the noisy feature problem. Given a new test feature vector, we first compute its sparse linear combination of all the training features; to reduce the influence of noisy features, a noise term is adopted in the procedure of finding the sparse solution. Then the residuals with respect to each class are computed, and we classify the test sample by assigning it to the class with the minimal residual. Experimental results demonstrate that the noise term is effective against noisy features and that our approach significantly outperforms state-of-the-art methods.
Daojian Zeng, Siwei Lai, Xuepeng Wang, Kang Liu, Jun Zhao, Xueqiang Lv
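The residual-based classification step can be sketched with numpy. The training features below are invented, and a per-class least-squares fit stands in for the paper's single sparse combination over all classes with a noise term; the decision rule (smallest reconstruction residual wins) is the same:

```python
import numpy as np

# Invented training feature vectors, two per relation class.
train = {
    "born_in":  np.array([[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]),
    "works_at": np.array([[0.0, 1.0, 0.9], [0.1, 0.9, 1.0]]),
}

def residual(x, A):
    """Distance from x to the span of the rows of A (least squares fit)."""
    coef, *_ = np.linalg.lstsq(A.T, x, rcond=None)
    return float(np.linalg.norm(x - A.T @ coef))

def classify(x):
    """Assign x to the class whose training features reconstruct it best."""
    return min(train, key=lambda c: residual(x, train[c]))

print(classify(np.array([0.95, 0.15, 0.05])))  # born_in
```

A sparse solver (e.g. an L1-regularized fit) over the concatenated dictionary, plus an explicit noise component, would turn this sketch into something closer to the paper's formulation.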

Learning the Distinctive Pattern Space Features for Relation Extraction

Recently, Distant Supervision (DS) has been used to automatically generate training data for relation extraction. Owing to the vast redundancy of information on the web, multiple sentences corresponding to a single fact may be obtained. In this paper, we propose pattern space features to leverage this data redundancy. Each dimension of a pattern space feature vector corresponds to a basis pattern, and its value is the similarity of an entity pair's patterns to that basis pattern. To obtain distinctive basis patterns, a pattern selection procedure is adopted to filter out noisy patterns. In addition, since overly specific patterns increase the number of basis patterns, we propose a novel pattern extraction method that avoids extracting overly specific patterns while maintaining pattern distinctiveness. To demonstrate the effectiveness of the proposed features, we conduct experiments on a real-world data set with 6 different relation types. Experimental results demonstrate that pattern space features significantly outperform the state of the art.
Daojian Zeng, Yubo Chen, Kang Liu, Jun Zhao, Xueqiang Lv

Machine Translation

An Investigation on Statistical Machine Translation with Neural Language Models

Recent work has shown the effectiveness of neural probabilistic language models (NPLMs) in statistical machine translation (SMT), both for reranking n-best outputs and in direct decoding. However, some issues remain for the application of NPLMs. In this paper we investigate further through detailed experiments and extensions of state-of-the-art NPLMs. Our experiments on large-scale datasets show that our final setting, i.e., decoding with conventional n-gram LMs plus unnormalized feedforward NPLMs extended with word clusters, can significantly improve translation performance by up to 1.1 BLEU on average over four test datasets, while decoding time remains acceptable. The results also show that current NPLMs, both feedforward and RNN, still cannot simply replace n-gram LMs for SMT.
Yinggong Zhao, Shujian Huang, Huadong Chen, Jiajun Chen

Using Semantic Structure to Improve Chinese-English Term Translation

This paper introduces a method for translating Chinese terms into English. Our motivation is to provide deep semantic-level information for term translation by analyzing the semantic structure of terms. Using the contextual information in the term and the first sememe of each word in HowNet as features, we train a Support Vector Machine (SVM) model to identify the dependencies among words in a term. A Conditional Random Field (CRF) model is then trained to mark semantic relations for term dependencies. During translation, the semantic relations within the Chinese terms are identified, and three features based on semantic structure are integrated into a phrase-based statistical machine translation system. Experimental results show that the proposed method achieves a 1.58 BLEU point improvement over the baseline system.
Guiping Zhang, Ruiqian Liu, Na Ye, Haihong Huang

Query Expansion for Mining Translation Knowledge from Comparable Data

When mining parallel text from comparable corpora, we confront a vast search space, since parallel sentences or sub-sentential fragments can be scattered throughout the source and target corpora. To reduce the search space, most previous approaches have tried to use heuristics to mine comparable documents; however, such heuristics are available only in a few cases. Instead, we take a different direction and adopt the cross-language information retrieval (CLIR) framework to find translation candidates directly at the sentence level from comparable corpora. Moreover, to improve retrieval results, two simple but effective query expansion methods are proposed. Experimental results show that our query expansion methods significantly improve recall and obtain high-quality candidate sentence pairs. Thus, our methods lay a good foundation for subsequently extracting both parallel sentences and fragments.
Lu Xiang, Yu Zhou, Jie Hao, Dakun Zhang

A Comparative Study on Simplified-Traditional Chinese Translation

Due to historical reasons, modern Chinese is written in both traditional and simplified characters, which quite frequently makes text translation between the two scripting systems indispensable. Computer-based simplified-traditional Chinese conversion is available in MS Word, Google Translate and many language tools on the WWW, and its performance has reached very high precision. However, because of one-to-many relationships between simplified and traditional Chinese characters, there is considerable room for improvement. This paper presents a comparative study of simplified-traditional Chinese translation in MS Word, Google Translate and JFJ, followed by a discussion of further development, including improvement of translation accuracy and support for human proofreading.
Xiaoheng Zhang

Multilinguality in NLP

Combining Lexical Context with Pseudo-alignment for Bilingual Lexicon Extraction from Comparable Corpora

Only a few studies have made use of alignment information in bilingual lexicon extraction from comparable corpora, and they require the comparable corpora to be divided into 1-1 aligned document pairs. They have not been able to show that extracted lexicons benefit from the embedding of alignment information; moreover, strict 1-1 alignments do not exist broadly in comparable corpora. In this paper we develop a language-independent approach to lexicon extraction by combining the classic lexical context with pseudo-alignment information. Experiments on an English-French comparable corpus demonstrate that pseudo-alignment in comparable corpora is an essential feature leading to a significant improvement over the standard method of lexicon extraction, a perspective that has never been investigated in a similar way by previous studies.
Bo Li, Qunyan Zhu, Tingting He, Qianjun Chen

Chinese-English OOV Term Translation with Web Mining, Multiple Feature Fusion and Supervised Learning

This paper focuses on a Web-based Chinese-English Out-of-Vocabulary (OOV) term translation pattern, emphasizing translation selection based on multiple feature fusion and ranking based on the Ranking Support Vector Machine (Ranking SVM). Using the SIGHAN 2005 corpus for the Chinese Named Entity Recognition (NER) task together with selected new terms, experiments on different data sources show consistent results. Experimental results for combining our model with Chinese-English Cross-Language Information Retrieval (CLIR) on the TREC data sets show that obvious performance improvements are obtained for both query translation and CLIR.
Yun Zhao, Qinen Zhu, Cheng Jin, Yuejie Zhang, Xuanjing Huang, Tao Zhang

A Universal Phrase Tagset for Multilingual Treebanks

Many syntactic treebanks and parser toolkits have been developed over the past twenty years, including dependency structure parsers and phrase structure parsers. Phrase structure parsers usually utilize different phrase tagsets for different languages, which is inconvenient for multilingual research. This paper designs a refined universal phrase tagset containing 9 commonly used phrase categories, with a mapping that covers 25 constituent treebanks and 21 languages. The experiments show that the universal phrase tagset can generally reduce the cost of parsing models and even improve parsing accuracy.
Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yi Lu, Liangye He, Liang Tian

Underresourced Languages Processing

Co-occurrence Degree Based Word Alignment: A Case Study on Uyghur-Chinese

Most widely used word alignment models are based on word co-occurrence counts in a parallel corpus. However, data sparseness during training means that the word co-occurrence counts of a Uyghur-Chinese parallel corpus cannot effectively indicate associations between source and target words. In this paper, we propose a Uyghur-Chinese word alignment method based on word co-occurrence degree to alleviate the data sparseness problem. Our approach combines co-occurrence counts and fuzzy co-occurrence weights into a word co-occurrence degree; the fuzzy co-occurrence weights are obtained by searching for fuzzy co-occurrence word pairs and computing the length differences between the current Uyghur word and the other Uyghur words in those pairs. Experiments show that the co-occurrence degree based word alignment model outperforms the baseline word alignment model, and the quality of Uyghur-Chinese machine translation is also improved.
Chenggang Mi, Yating Yang, Xi Zhou, Xiao Li, Turghun Osman
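The smoothing intuition can be sketched with stdlib tools. The counts are invented, and a `difflib` surface-similarity score stands in for the paper's length-difference fuzzy weight; the point is that a rare inflected form borrows co-occurrence strength from morphologically similar words seen with the same target:

```python
from difflib import SequenceMatcher

# Invented (Uyghur, Chinese) co-occurrence counts.
cooc = {("kitab", "书"): 3, ("kitablar", "书"): 1}

def fuzzy_weight(word, others):
    """Max surface similarity of word to other source words seen with
    the same target word; a stand-in for the paper's fuzzy weight."""
    return max((SequenceMatcher(None, word, o).ratio() for o in others),
               default=0.0)

def cooc_degree(src, tgt):
    """Plain count plus fuzzy weight borrowed from similar source words."""
    base = cooc.get((src, tgt), 0)
    others = [s for (s, t) in cooc if t == tgt and s != src]
    return base + fuzzy_weight(src, others)

# The rare inflected form "kitablar" borrows strength from "kitab".
print(cooc_degree("kitablar", "书") > cooc.get(("kitablar", "书"), 0))
```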

Calculation Analysis on Consonant and Character for Corpus Study of Gesar Epic "HorLing"

We conducted a quantitative analysis of consonants and characters after establishing a corpus of HorLing, a classic version of the Gesar epic. First, we built a 2-million-consonant corpus for further verification and comparison of the character frequencies in HorLing. Second, we formulated Tibetan consonant combination rules and the set of theoretical Tibetan consonants, and examined the coverage of the theoretical consonants on HorLing. Through this analysis, we not only characterized the consonants and characters of the Gesar epic, but also clarified the use of Tibetan consonants and characters in real life.
Duo La, Tashi Gyal

NLP Applications

Sentence Level Paraphrase Recognition Based on Different Characteristics Combination

This paper proposes a novel method for paraphrase recognition based on combining different characteristics. We employ different measurements to weight the lexical and syntactic parts, because different parts of a sentence contribute differently to sentence semantics in the paraphrase recognition task. In our experiments, we first parse the sentence pairs of the MSRPC, then adopt differentiated weights to calculate the contribution of the different parts of each sentence. With this method, we obtain better precision and average F-value than previous approaches.
Maoyuan Zhang, Hong Zhang, Deyu Wu, Xiaohang Pan

Learning Tag Relevance by Context Analysis for Social Image Retrieval

Tags associated with images significantly promote the development of social image retrieval. However, these user-annotated tags suffer from noise and inconsistency, which limits the role they play in image retrieval. In this paper, we build a novel model to learn tag relevance based on context analysis for each tag. In our model, we first consider user tagging habits and use a multi-modal association network to capture the tag-tag and tag-image relationships, and then run a random walk over the tag graph for each image to refine the tag relevance. Different from earlier work on tag ranking, our contributions focus on a globally comparable tag relevance measure (i.e., one that can be compared across different images) and a better tag relevance learning model based on detailed context analysis for each tag. Our experiments on public data from Flickr obtain very positive results.
Yong Cheng, Wenhui Mao, Cheng Jin, Yuejie Zhang, Xuanjing Huang, Tao Zhang
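The random-walk refinement step can be illustrated on a tiny tag graph. The three tags, edge weights, and restart parameter below are invented, and a plain random walk with restart stands in for the paper's multi-modal association network:

```python
import numpy as np

tags = ["sky", "cloud", "logo"]
# Invented tag-tag affinity weights: sky and cloud are strongly related.
W = np.array([[0.0, 1.0, 0.1],
              [1.0, 0.0, 0.1],
              [0.1, 0.1, 0.0]])
P = W / W.sum(axis=1, keepdims=True)   # row-normalize into transitions

def tag_relevance(seed, alpha=0.85, iters=50):
    """Random walk with restart: iterate r = a*P^T r + (1-a)*seed."""
    r = np.array(seed, dtype=float)
    s = np.array(seed, dtype=float)
    for _ in range(iters):
        r = alpha * P.T @ r + (1 - alpha) * s
    return r / r.sum()

# Starting from an image tagged "sky", "cloud" ends up more relevant
# than the weakly connected "logo".
scores = tag_relevance([1.0, 0.0, 0.0])
print(scores[1] > scores[2])  # True
```

The resulting scores are normalized, which is one simple way to make relevance values comparable across images, in the spirit of the globally comparable measure the abstract describes.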

ASR-Based Input Method for Postal Address Recognition in Chinese Mandarin

As automatic speech recognition (ASR) technology has matured, especially with statistical language models built from web-scale data and the use of the Hidden Markov Model probabilistic framework, speech recognition has become applicable to many domains and usage scenarios. In particular, it can be applied to tasks such as Chinese postal address recognition. This paper presents a first attempt, in both academic and commercial settings, to create an ASR-based input method for postal address recognition in Mandarin Chinese. By customizing the statistical language model to this domain and incorporating knowledge from the structural information provided by geo-topology, our language model successfully captures signals from geographical context and self-corrects possible mis-recognitions. Experimental results provide evidence that our speech-recognition-based approach achieves a faster and more accurate input method compared to traditional keyboard-based input.
Ling Feng Wei, Sun Maosong

