This book constitutes the thoroughly refereed post-workshop proceedings of the 20th Chinese Lexical Semantics Workshop, CLSW 2019, held in Chiayi, Taiwan, in June 2019.

The 39 full papers and 46 short papers included in this volume were carefully reviewed and selected from 254 submissions. They are organized in the following topical sections: lexical semantics; applications of natural language processing; lexical resources; corpus linguistics.



Lexical Semantics


Spatiality and Its Semantic Consequence of the Quantitative Expression “一CCN” in Mandarin Chinese

In Mandarin Chinese, reduplicative classifiers can be used to express plurality, which is closely related to the spatiality of nouns. This paper takes “一CCN” (short for the construction “One + Classifier + Classifier + Noun”) as an example to analyze the spatial presentation of “一CCN” and discusses its semantic consequence. Unlike other quantificational expressions such as “Number + Classifier + Noun”, “一CCN” does not express a small quantity and “一CC” cannot fall in the scope of quantitative adverb and . By analyzing the contexts in which it can or cannot appear, and comparing it with (“many”), we conclude that the “many” meaning of “一CCN” is not its literal meaning but an inferred one.

Jing Sun, Yulin Yuan

Difference and Analysis Between the Structures of “Shai + NP” and “Xiu + NP”

With the popularity of phrases such as “shai + ‘happiness’” and “xiu + ‘love’”, more and more nouns and noun phrases are entering the structures “shai + NP” and “xiu + NP”. Intuitively, the two structures seem to express similar meanings. In observing the corpus, however, we find that some nouns or noun phrases cannot be interchanged between them. To examine this phenomenon, we take the BCC Corpus of Beijing Language and Culture University as the research corpus, extract the relevant data from it, and discuss the similarities and differences between the two structures in terms of word-formation ability and collocates by observing the collocations in concordance lines.

Cuiting Hu, Yanqiu Shao

Relationship Between Discourse Notions and the Lexicon: From the Perspective of Chinese Information Structure

This paper investigates the properties of information structure (IS), focusing on the relationship between the discourse notions (i.e. topic and focus) and the lexicon. The feature-based approach suggests that topic and focus are formal features which are numerated from the lexicon, active in the computational system and encoded in syntax. However, this approach has been criticized regarding the generation of IS, for example for violating the inclusiveness condition, employing the “look-ahead” technique, neglecting the discourse-pragmatic properties of topic/focus and excluding non-configurational IS. The key issue is that the semantic-syntactic properties of the lexical items in IS are relevant to the lexicon, but the pragmatic properties of IS are determined by factors outside the lexicon. IS should not be the result of pure syntactic computation, but a phenomenon at the syntactic-pragmatic interface.

Shun-Hua Fu

A Collostructional Analysis of Ditransitive Constructions in Mandarin

By investigating the frequency distribution of 37 verbs in Mandarin ditransitive constructions and adopting a collostructional analysis (cf. Gries & Stefanowitsch [1]), this study aims to clarify the constructional meaning of each type of ditransitive construction. Preliminary results show that the two constructions differ in fine-grained aspects, such as the number and completion of transfer events. Based on the corpus findings, this study claims that the transfer meaning expressed by double-object constructions entails only one entire macro-event, while the transfer event expressed by prepositional dative constructions highlights and involves more than one event, thus increasing the possibility that the prepositional dative conveys an incomplete transfer meaning.

Huichen S. Hsiao, Lestari Mahastuti
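
The collostructional analysis mentioned in the abstract above can be illustrated with a minimal sketch. It assumes one common formulation, in which the association between a verb and one of two competing constructions is measured with a one-sided Fisher exact test on a 2x2 frequency table and reported as -log10(p). The abstract does not state which variant or statistic the authors used, and the function names and counts below are hypothetical.

```python
from math import comb, log10


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a: tokens of the verb in construction A
    b: tokens of the verb in construction B
    c: tokens of all other verbs in construction A
    d: tokens of all other verbs in construction B
    Returns P(X >= a) under the hypergeometric null of no association.
    """
    verb_total = a + b
    cx_a_total = a + c
    n = a + b + c + d
    upper = min(verb_total, cx_a_total)
    favourable = sum(
        comb(verb_total, k) * comb(n - verb_total, cx_a_total - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(n, cx_a_total)


def collostruction_strength(a, b, c, d):
    """Association strength as -log10(p); larger means stronger attraction."""
    p = fisher_exact_greater(a, b, c, d)
    return float("inf") if p == 0 else -log10(p)


# Hypothetical counts for illustration only, not taken from the paper.
counts = {"verb_x": (40, 10, 200, 750), "verb_y": (5, 45, 235, 715)}
for verb, table in counts.items():
    print(verb, round(collostruction_strength(*table), 2))
```

In this sketch a verb that occurs in construction A far more often than its overall share would predict receives a high score, while a verb that prefers construction B scores near zero for A.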

The Semantic Analysis and Representation of “Hai-NP-Ne” Construction with NP Quoted from Context

This paper mainly analyzes the semantics of three types of “hai-NP-ne” construction with NP quoted from context, formalizes their semantics with the scale structure, and compares similarities and differences among these three types of construction semantics. It is found that all these three types of “hai-NP-ne” are at the lower point of entailing scales. Meanwhile, three different types of entailing scales, including likelihood scale, felicity scale, and truth value scale, are activated by the various types of verbs that are omitted in the “hai-NP-ne” construction. As a result, various meanings are derived.

Xiaoyu Cao

The Centennial Controversy: How to Classify the Chinese Adverb Dōu?

A century of research on the adverb dōu has been highly controversial, and scholars disagree sharply on how to classify its semantic and pragmatic functions. The existing literature offers three main accounts: the trichotomy, the dichotomy, and the univocal account. The trichotomy and the dichotomy define the semantic and pragmatic functions of each sub-type of dōu with Chinese synonyms in analytical formulas. This kind of description is simple, convenient, and intuitive, but also relatively vague and hard to verify. Moreover, these accounts do not extract or explain the distinctive oppositions between the sub-types of dōu. The univocal account takes totalizing universal or universal/distributive quantification as the consistent feature of the adverb dōu. However, it cannot be verified against all corpus instances of dōu, nor can it evade the distinctive oppositions between dōu_a and dōu_b, which run counter to that consistent feature. After reviewing various views on the classification of dōu in the existing literature, this paper illustrates the distinctive semantic and pragmatic oppositions between dōu_a and dōu_b. First, dōu_a has quantificational semantic features with objective truth values, whereas dōu_b does not. Second, dōu_b has the pragmatic features of subjective evaluation, whereas dōu_a does not. This paper therefore supports the dichotomy, under which the adverb dōu divides into dōu_a and dōu_b.

Hua Zhong

Research on the Hidden ‘De’ in Basic Noun Compounds Based on the Large-Scale Corpus

Basic noun compounds are phrases with nominal functions composed of two nouns. Studying the concealment of “de” in basic noun compounds helps to uncover the implicit knowledge in noun-noun combinations and the transformational rules between “N-N phrases” and “N-de-N phrases”. Scholars have described the hidden “de” from different perspectives, but no statistical analysis has been carried out on a large-scale corpus. This paper takes ten years of newspapers in the Dynamic Circulation Corpus (DCC) as the corpus source and extracts basic noun compounds from it. We identify the compounds that occur without “de” by verifying them against the corpus, and then summarize the types of basic noun compounds at two levels: syntactic structure and semantic relationship. We also analyze the reasons for this phenomenon.

Ying Zhang, Pengyuan Liu, Qi Su

Research of Speech Act Verb Interpretations About Dictionaries of Learning Chinese as a Foreign Language from the Perspective of Frame Semantics

Based on the Frame Semantic Theory of Fillmore, C.J., this research summarizes the interpretation modes of 9 speech act verbs with meta-language interpretations. According to the actual occurrence of the frame elements in the corpora, the original frames are modified, and standard interpretation models of the 9 lexical units are constructed. This research analyzes and compares the interpretations of related meanings in four dictionaries, and puts forward some suggestions for the verb interpretations in dictionaries for learners of Chinese as a foreign language from the perspective of frame semantics.

Weili Wang

A Study on the Expressions of Modal Particles of the Suggestion Function in Spoken Chinese: A Case Study on “ba”, “ma”, and “bei”

This paper focuses on the analysis of how modal particles “ba”, “ma”, and “bei” can help express the suggestion function in spoken Chinese. It mainly describes the characteristics of expressions of the suggestion function and the contextual variables which affect the selection of linguistic forms. Through video transcription, this study summarizes three typical means of expressions of the modal particles “ba”, “ma”, and “bei” of the suggestion function from 604,585 words of a spoken Chinese corpus. The study analyzes the contextual variables involved in the expressions of discourse function, making the context tagging available, so as to facilitate future research on the expressions of other functions.

Xie Jingyi

A Study of the Characteristics of ABB-Type Adjectives in Shaoxing Dialect

Shaoxing lies in the north of Zhejiang Province. The dialect of Shaoxing has an ancient pedigree and deep cultural connotations. Shaoxing dialect has abundant ABB adjectives, which differ from those in Mandarin. Previous scholars either compared this kind of adjective with other forms of reduplicated words in the Wu dialect, or gave a simple, closed description of its grammatical functions and ways of word formation, without exploring its semantics. This paper explains ABB adjectives in Shaoxing dialect from the aspects of phonetics, grammar and semantics, and uses ABB adjectives in Mandarin as a reference for comparative analysis, in order to give a more complete description of their characteristics.

Bihua Wang, Yueming Du, Lijiao Yang

The Reclassification of Chinese Nominal Measure Words Based on Definition Mode

The classification of nominal measure words is a long-standing problem in Chinese syntax. This paper tries to solve it on the basis of dictionary definitions. By analyzing the definition modes, we find regularities hidden in the meanings of nominal measure words, which can be used to reclassify these words into the following categories. When the definition mode is “yongyu (used before)… xing (shape)…”, the words belong to the subcategory that highlights the shape of things, such as “tiao (strip)”; when the definition mode is “yongyu (used before) fen (divide)…”, they fall into the subcategory that highlights the constitutive aspect of things, such as “ban (piece)”; when the definition mode is “yongyu (used before)… cheng (become)…”, they fit into the subcategory that highlights the agentive aspect of things, such as “dui (pair)”; and so on. Compared with previous classifications, this paper unifies the classification criteria and enhances the universality and exclusiveness of the classification.

Wang Enxu, Yuan Yulin

A Brief Analysis of Semantic Interactions Between Loanwords and Native Words in the Tang Dynasty

Loanwords are words borrowed from another language. This paper conducts case studies of Chinese loanwords in the Tang Dynasty, including both transliterated and liberally translated words, e.g. “Kurung slave”, “lion”, “camel bird” and “nail aromatic”. With synchronic and diachronic analysis, the study finds that the incorporation of loanwords not only brings in new words, but also triggers semantic interactions between loanwords and native words, resulting in the misconception of both loanwords and native words.

Yuchen Zhu, Renfen Hu

On the Verb Zao in Ha-Fu Northeastern Mandarin

As a multi-ethnic area, Northeast China has a long history and a rich culture. Northeastern Mandarin is a symbol of the region’s unique multicultural character and plays an irreplaceable role in communication there. Based on a dialect corpus collected from local residents, this article studies the frequently used universal verb zao (造) in Ha-Fu Northeastern Mandarin from the perspectives of semantics, grammar and pragmatics, with a view to providing a reference for further research on the lexicology, typology, and computational linguistics of Northeastern Mandarin.

Bing Shao, Minglong Wei

Linguistic Synaesthesia of Mandarin Sensory Adjectives: Corpus-Based and Experimental Approaches

This study examines linguistic synaesthesia based on both the corpus distribution and the modality rating of Mandarin synaesthetic adjectives. We find that the tendencies attested through the corpus-based and the experimental approaches are compatible, including: (1) the modality exclusivity is negatively correlated with the usage of Mandarin sensory adjectives in linguistic synaesthesia; and (2) the ratings on sensory modalities of Mandarin synaesthetic adjectives are consistent with the synaesthetic directionality of these adjectives. The paper thus argues for the cognitive reality of linguistic synaesthesia, which can be evidenced by both the language production in the corpus and the language processing in the behavior experiment.

Qingqing Zhao, Yunfei Long, Chu-Ren Huang

The Negation Marker mei in Northeastern Mandarin

Negation in Mandarin has been studied substantially, but research on negation in dialects is relatively scarce. This article analyzes the negation marker mei in Northeastern Mandarin and finds three different types of mei: mei1, a verbal negator complementary to bu; mei2, an unaccusative intransitive verb which is characteristic of this dialect but has rarely been mentioned in previous research; and mei3, a telic and static aspectual negator which involves a tone sandhi pattern. The tone sandhi of mei3 in Northeastern Mandarin is related to the 4-tone-transformation in Mandarin and can be extended to other negators and even some numerals.

Minglong Wei

Resource Construction and Distribution Analysis of Internal Structure of Modern Chinese Double-Syllable Verb

In this paper, we analyze the need to construct internal structure resources for verbs from the perspectives of linguistics and NLP applications. We also introduce the method and process of annotating the internal structure of double-syllable verbs in the Modern Chinese Dictionary (7th Edition). A total of 9697 double-syllable verbs are annotated. The annotation results show that in 61.19% of the words a character inside the word serves as the center of the verb. Among them, words with the verb-object structure are the most numerous (64.5%), followed by those with the adverbial-head structure (27.77%). This paper can provide basic resources for syntactic and semantic analysis so as to realize the unified analysis of lexicon and syntax.

Guirong Wang, Gaoqi Rao, Endong Xun

Gradability, Subjectivity and the Semantics of the Adjectival zhen ‘real’ and jia ‘fake’ in Mandarin

In this paper, we provide an empirical description and a theoretical analysis of the adjectival zhen ‘real’ and jia ‘fake’ in Mandarin Chinese. The two adjectives resist degree modifiers, and thus have traditionally been treated as non-gradable adjectives. Empirical evidence, however, shows that they can actually fuse degree intensification and expressive meanings together. Based on their semantic behaviors, we follow recent advances in multidimensional semantics to propose that zhen and jia are mixed items with bi-dimensional semantics, i.e., the judgment of truth value as the descriptive meaning, and the degree of similarity/deviation between the facts and the subjective expectations as the expressive meaning.

Fan Liu, Qiongpeng Luo

The Repetition of Chinese Onomatopoeia

The boundary between the reduplication and the repetition of Chinese onomatopoeic forms is relatively vague, which brings some problems to Chinese information processing and Chinese teaching. Based on language practice, we focus on the repetition of modern Chinese onomatopoeia, and consider it a rhetorical device to emphasize repetitive sounds or rhythms. The resulting language forms should be regarded as onomatopoeic phrases, in which pauses can be inserted freely. We summarize some typical features of the repetition of onomatopoeia, and briefly introduce relevant punctuation marks, which can be used to express pauses. In addition, we agree that there is a subtle relationship between the reduplication and the repetition of onomatopoeia in syntax, semanteme and other aspects. In essence, they are both the imitation of repetitive sounds.

Mengzhen Xu

A Degree-Based Analysis of ‘V+A+le2’ Construction

This paper examines both the result-realization and the result-deviation interpretations of the ‘V+A+le2’ construction under the framework of degree semantics. It is argued that the ‘V+A+le2’ construction is a comparative construction and that different interpretations arise because different standards of comparison are adopted. Result-realization interpretations are seen in all instances of the construction, while result-deviation interpretations can be found only in instances with open-scale adjectives. The syntactic structure determines whether the ‘A’ encodes a measure function or the ‘V+A’ encodes a measure of change function in the ‘V+A+le2’ comparative construction.

Mengjie Zhang, Wenhua Duan, Yunqing Lin

Semantic Distinction and Representation of the Chinese Ingestion Verb Chī

Research on the Chinese high-frequency verb chī ‘eat’ is manifold, with quite diverse observations from various analytical proposals. Representative works include the five-element semantic chain [1], the emergent argument structure hypothesis [2], and the MARVS-based semantic accounts [3–6]. However, little consensus has been reached on the polysemy of chī and its semantic-to-syntactic properties. In this paper, a comprehensive study of chī with in-depth lexical semantic analysis is conducted by adopting a corpus-driven, frame-based constructional approach. It proposes that chī can be viewed as having ‘one frame, three profiles and seven constructional meanings’ under the assumption that semantic distinctions can be made only if there is sufficient collo-constructional evidence. This study also demonstrates how the polysemy of chī can be understood through a two-dimensional analytical model that accounts for its semantic extensions based on the interaction of spatial and eventive readings.

Meichun Liu, Mingyu Wan

From Repetition to Continuation: Construction Meaning of Mandarin AXAY Four-Character Idioms

Adopting the theoretical framework of Construction Grammar, the present paper aims to examine the internal structures, constructional meanings, and syntactic categories of Mandarin parallel idiomatic prefabs A-X-A-Y. The analysis once again confirms the importance of semantic integration between lexical and constructional meanings. Although X and Y seem to dictate the syntactic category of a four-character idiomatic expression, the syntactic category of the whole idiom is indeed adjustable in contextualized real-world language use. This prominent feature, i.e., the flexibility in terms of syntactic behavior, results from the markedness of the four-character skeleton, namely, a grammatical construction. This syntactic flexibility is then argued to be the essential property of Chinese four-character idioms.

Chiarung Lu, I-Ni Tsai, I-Wen Su, Te-Hsin Liu

A Study on Classification of Monosyllabic and Disyllabic Onomatopoeias Based on the Relation Between the Form and Meaning

There is a certain connection between the form and meaning of onomatopoeia, so its classification may draw on both formal and semantic criteria. This paper studies the monosyllabic and disyllabic onomatopoeias in the Modern Chinese Dictionary (6th edition), summarizes three perspectives of semantic description (the object producing the sound, sound features, and action features) from the dictionary definitions of onomatopoeias, and initially classifies them as simple onomatopoeias, compound featured onomatopoeias (with sound feature and action feature) and sound featured onomatopoeias. Combined with the formal criteria, this paper further classifies monosyllabic and disyllabic onomatopoeias as A type simple onomatopoeias, A type sound featured onomatopoeias, A type compound featured onomatopoeias, AA type sound featured onomatopoeias, AA type compound featured onomatopoeias, AB type simple onomatopoeias, AB type sound featured onomatopoeias, and AB type compound featured onomatopoeias. Based on this classification, this paper also discusses the semantic and structural characteristics of each type and the relations between its form and meaning.

Bo Xu, Zezhi Zheng

Semantic Features and Internal Differences of Ergative Verbs

The ergative phenomenon in Chinese involves many hotly debated issues in the study of Chinese syntax and semantics and has been explored from various perspectives in previous research. In this paper, a total of 123 ergative verbs are sorted in terms of their syntactic representation, semantic types and semantic features, which yields three findings. First, the common semantic feature of ergative verbs is the meaning of change. Second, there are obvious internal differences in the transitivity, causativity and volition of ergative verbs: unary ergative verbs indicate spontaneous and uncontrollable changes in events, with low transitivity and an obvious non-volitional tendency; binary ergative verbs have higher transitivity and an obvious causative tendency. Third, the two relevant structures, ‘S+V+N’ and ‘N+V’, represent different stages before and after the change. The former structure represents the origin or motive force of the change, while the latter represents the state after the change. A temporal sequence and logical causality exist between them.

Fan Jie

Reduplicated Kind Classifier zhǒngzhǒng in Mandarin Chinese and the Associated Plurality Type

The reduplicated kind classifier zhǒngzhǒng is related to nominal plurality and co-occurs only with abstract nouns. This paper argues that, without any kind-referring interpretation, zhǒngzhǒng groups entities into an approximative taxonomic category. It thus differs from another reduplicated kind classifier, yī-zhǒngzhǒng, which individualizes entities at the kind level and forms a taxonomic category. Zhǒngzhǒng-NP denotes a set of entities which can be interpreted as a sum, group atoms, or individual atoms, terms that cover the distinction between distributive and collective interpretations of plural NPs. Several types of predicates are used to test the denotation of zhǒngzhǒng-NP and to describe its representation of plurality.

Hua-Hung Yuan

A Study on the Semantic Construal of ‘NP yào VP’ Structure from the Perspective of Grounding

In the framework of grounding theory in cognitive grammar, this paper takes yào as a grounding element and “yào VP” a grounded construction. The conceptualization process of “NP yào VP” structure can be described as a process in which the conceptualizer inputs volitional force, deontic force or speculative force into the process profiled by VP and makes the semantic profile of VP an intentional process. Grounding by yào has effects on the situation type of VP.

Limei Yang

Research Into the Additional Meanings of the Words of Shaanxi-Gansu-Ningxia Border Region Consultative Council—Taking the Shaanxi-Gansu-Ningxia Border Region Consultative Council Literature as a Corpus

The words used in the Shaanxi-Gansu-Ningxia border region consultative council differ from the words used in the region during other periods and carry additional meanings. Taking the Shaanxi-Gansu-Ningxia border region’s consultative council literature as a corpus, this research examines the emotional meanings, writing-style meanings, thought processes, and denotations of the words used in general official documents during this period. It also explains the inspiration behind using the interrogative pronoun in topic words and considers the notion of humanistic care. We hope this research offers a relevant point of reference for present-day document writing.

Yao Zhang

From Lexical Semantics to Traditional Ecological Knowledge: On Precipitation, Condensation and Suspension Expressions in Chinese

Precipitation, condensation and suspension are different meteorological events involving water in different forms. They are conceptualised and conventionalised with various verbal constructions in Sinitic languages. In this paper, we analyse data from three Mandarin varieties and 229 Sinitic languages, as well as materials from Old Chinese, to support the claim that there is an underlying shared conceptualisation scheme that accounts for all the variations, and that traditional ecological knowledge (TEK) can be extracted based on the directionality expressed by these linguistic constructions and the PoS of weather words. Specifically, we found that across all Mandarin varieties and Sinitic languages, the weather verbs for precipitation (e.g., rain, snow and hail) typically represent downward movement, and the weather phenomena words can typically act as verbs in Old Chinese. On the other hand, although the weather verbs for condensation (e.g., dew and frost) also tend to represent downward movement, the weather nouns typically do not have verbal usage in Old Chinese. Lastly, the weather verbs for suspension (e.g., fog and mist) are directionally uncertain, and the corresponding weather words cannot function as verbs in Old Chinese either. The radical shared by the Chinese characters denoting these phenomena provides the conceptual ground for their morpho-semantic and grammatical behaviours based on Hantology. Our findings not only have important implications for linguistic ontology and lexical semantics, but also lend support to the emerging area of language-based reconstruction of TEK.

Chu-Ren Huang, Sicong Dong

A Diachronic Study of Structure Patterns of Ditransitive Verbs

This paper mainly studies English dative alternation from a diachronic point of view. As a verb of caused-possession, the typical ditransitive verb give is contrasted with send and tell in respect of their different structural biases diachronically (12th–19th centuries). We built dynamic, syn-diachronic structural models, and made contrastive observations on structural biases of the verbs. The verb give shows a strong tendency to occur in the double object construction, send has a structural bias towards the prepositional object construction and tell shows preference for the clausal complement diachronically, which suggests the determining role of lexical semantics of the verb in its structural preference and the meaning of syntactic structures in which the verb occurs.

Guoyan Lyu, Yanmei Gao

Investigation on the Lexicalization Process and Causes of “Guzhi”

The lexicalization of “guzhi” involved a transformation from a verbal cross-layer structure to an adjective. Several points in this process deserve attention: first, the object after “zhi” was omitted and the structural center shifted toward “gu”; second, the high frequency of “guzhi + VP” made reinterpretation possible; and third, the VPs in the structure were often disyllabic phrases, which further strengthened the role of the foot. These factors worked together, causing the verb-adverbial structure “gu + zhi” to gradually fade out and the adjective “guzhi” to appear in large numbers during the Ming and Qing Dynasties. The lexicalization process was essentially complete by the Qing Dynasty.

Hong Jin, Yingjie Dong

When “Natural Nouns” Surface as Verbs in Old Chinese: A Lexical Semantic Exploration

This paper reports a study of “natural nouns” that are used as verbs in Old Chinese, focusing on the lexical semantic analysis of natural nouns based on Generative Lexicon Theory. The detailed investigation of 39 natural nouns reveals that each of them encodes information of events as various types of Conventionalized Attributes (CAs) in their qualia structures, and is verbalized by activating or exploiting a particular type of CA and realizing this CA as the core meaning of the denominal verb. The discussion shows that the relative salience of a particular CA in the qualia structure largely determines the CA’s probability of triggering N-V conversion. Moreover, there is a clear tendency in CA exploitation: CAs encoding events with human participants are much more likely to be exploited by N-V conversion than CAs encoding events without human participants. This tendency could be accounted for by people’s anthropocentric view of the world.

He Ren

The Classification of Korean Verbs and Its Application in TCFL

Korean verbs can be classified according to meanings and usages, verbal characteristics covering four dimensions (action involvement in other items, action influence on other agents, action initiativity) and verbal independency. Generally, none of these is considered in the classification of Chinese verbs. On the basis of the classification of Korean verbs, we studied the verbs listed in the Outline of HSK (level 1-6) Vocabulary, and discussed the application of this classification in teaching Chinese as a foreign language (TCFL).

Aiping Tu, Duo Qian

The Independence of Monosyllabic Words

Given their characteristics, Chinese characters are the intersection of speech and grammar in Chinese, with stable speech performance. Each Chinese character is a syllable marked by a tone. Moreover, Chinese characters hold their meanings tenaciously, so the meanings of words are not easily lost. In other words, monosyllabic words can always maintain strong independence. Across word classes, however, this independence differs. In this paper we take the components of antonymous compounds as the entry point, and find that monosyllabic nouns have the strongest independence, followed by verbs, while adjectives have the weakest.

Jinzhu Zhang

Applications of Natural Language Processing


Microblog Sentiment Classification Method Based on Dual Attention Mechanism and Bidirectional LSTM

In the information age, network technology continues to develop. As an emerging social media platform, Sina Weibo has a huge user base. Every day, hundreds of millions of users express their opinions on hot events or share the joys and worries of life on Weibo. The analysis of users’ emotions therefore has broad application prospects and could be used in public opinion monitoring, opinion guidance, and advertisement placement. This paper proposes a microblog sentiment classification method based on a dual attention mechanism and a bidirectional LSTM. First, the bidirectional LSTM model is used to semantically encode the microblog text; then self-attention and sentiment word attention are introduced into the bidirectional LSTM model. Finally, a Softmax classifier is used to classify the sentiment of the microblog. To verify the validity of the model, several groups of comparative experiments are carried out on the NLPCC2013 and NLPCC2014 evaluation task datasets. The results show that the proposed model is superior to the models it is compared with.

Wenjie Wei, Yangsen Zhang, Ruixue Duan, Wen Zhang

High Order N-gram Model Construction and Application Based on Natural Annotation

The n-gram language model plays an important role in NLP tasks. In this paper, language models based on language boundaries are proposed to meet the challenge of very large language data: an intra-sentence boundary model and an inter-sentence boundary model. We developed a training tool on the Hadoop platform based on MapReduce programming, and used a prefix tree to compress and store the model. We applied our model to identifying boundaries in syntactic parsing, achieving good results.

Qibo Wang, Gaoqi Rao, Endong Xun
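
As a rough illustration of the prefix-tree storage mentioned in the abstract above, the following minimal sketch counts n-grams inside boundary-delimited segments and stores them in a trie, so that n-grams sharing a prefix share nodes. It is not the authors' Hadoop/MapReduce tool: the class names, the character-level tokens, the punctuation set used as boundaries, and the maximum order of 3 are all assumptions made for this example.

```python
class TrieNode:
    """One node of the prefix tree; `count` is the frequency of the n-gram ending here."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


class NgramTrie:
    """Hypothetical n-gram store: n-grams with a common prefix share trie nodes."""

    def __init__(self, max_order):
        self.max_order = max_order
        self.root = TrieNode()

    def add_segment(self, tokens):
        # Count every n-gram (n <= max_order) that starts at each position.
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_order]:
                node = node.children.setdefault(tok, TrieNode())
                node.count += 1

    def count(self, ngram):
        node = self.root
        for tok in ngram:
            node = node.children.get(tok)
            if node is None:
                return 0
        return node.count


def split_on_boundaries(text, boundaries="，。！？；"):
    """Split text into character segments at punctuation (assumed boundary marks)."""
    segments, current = [], []
    for ch in text:
        if ch in boundaries:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(ch)
    if current:
        segments.append(current)
    return segments


trie = NgramTrie(max_order=3)
for segment in split_on_boundaries("我爱北京，我爱上海。"):
    trie.add_segment(segment)

print(trie.count(["我", "爱"]))        # bigram seen in both segments
print(trie.count(["北", "海"]))        # never adjacent, so the count is 0
```

Because n-grams are never counted across a boundary mark, the trie only stores sequences that occur within one segment, which is one simple way to keep a high-order model small.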

A Printed Chinese Character Recognition Method Based on Area Brightness Feature

This paper proposes a method for printed Chinese character recognition based on the area brightness feature, which is simple and has a low computational cost. It achieves over 93% accuracy in recognizing printed Chinese characters of 10.5 pt or larger, which can meet the needs of certain situations (such as screen capture). The disadvantage of this method is its poor handling of distortion, and the recognition accuracy on smaller images still needs to be improved.

Yonghong Ke

“Love Is as Complex as Math”: Metaphor Generation System for Social Chatbot

With the wide adoption of intelligent chatbots in daily life, user demands for such systems have evolved from basic task-solving conversations to more casual and friend-like communication. To meet user needs and build an emotional bond with users, it is essential for social chatbots to incorporate more human-like and advanced linguistic features. In this paper, the use of a common rhetorical device, metaphor, is investigated for social chatbots. Our work first designs a metaphor generation framework, which generates topic-aware and novel figurative sentences. Human annotators validate the novelty and properness of the generated metaphors. More importantly, we evaluate the effects of employing metaphors in human-chatbot conversations. Experiments indicate that our system effectively arouses user interest in communicating with our chatbot, resulting in significantly longer human-chatbot conversations.

Danning Zheng, Ruihua Song, Tianran Hu, Hao Fu, Jin Zhou

Research on Extraction of Simple Modifier-Head Chunks Based on Corpus

The purpose of this study is to automatically extract a set of simple modifier-head chunks from a large-scale corpus. By analyzing the distribution of simple modifier-head chunks in usage, a set of formal rules for chunk extraction is formulated and a rule-based automatic extraction algorithm is designed. In a random-sampling experiment, the precision of the extraction results with this method reaches 82.63%, which sheds light on knowledge extraction from large-scale corpora.

Wang Chengwen, Zhang Zheng, Rao Gaoqi, Xun Endong, Miao Jingjing

Incorporating HowNet-Based Semantic Relatedness Into Chinese Word Sense Disambiguation

This paper presents a semi-supervised learning method that incorporates sense knowledge into a Chinese word sense disambiguation (WSD) model. This research also exploits HowNet-based semantic relatedness to improve system performance. The proposed method includes a Sense Colony task for improving context expansion and semantic relatedness calculation for sense feature representation. To incorporate sense knowledge into WSD, this paper employs semantic relatedness in a semi-supervised label propagation classifier. This research demonstrates state-of-the-art results on word sense disambiguation tasks.

Qiaoli Zhou, Gu Yue, Yuguang Meng

Protein/Gene Entity Recognition and Normalization with Domain Knowledge and Local Context

Biomedical named entity recognition and normalization aim at recognizing biomedical entity mentions from text and mapping them to their unique database entity identifiers (IDs), which are the primary task of biomedical text mining. However, name variation and entity ambiguity problems make this task challenging. In this paper, we leverage domain knowledge by a novel knowledge feature representation method to recognize more entity variants, and model important local context through a dual attention mechanism and a gating mechanism to perform entity normalization. Experimental results on the BioCreative VI Bio-ID corpus show that our proposed system achieves the new state-of-the-art performance (0.844 F1-score for protein/gene entity recognition and 0.408 F1-score for normalization).

Weihong Yao, Xuefei Li, Zongze Li, Zhe Liu, Shixian Ning

Sentence-Level Readability Assessment for L2 Chinese Learning

Automatic assessment of sentence readability level can support educators in selecting sentence examples suitable for different learning levels to complement teaching materials. Although there is extensive research on document-level and passage-level Chinese readability assessment, sentence-level evaluation remains little explored. We bridge the gap by providing a research framework and a large corpus of nearly 40,000 sentences with ten-level readability annotation. We design experiments to analyze the influence of 88 linguistic features on sentence complexity, and the results suggest that the linguistic features can significantly improve predictive performance, with a highest distance-1 adjacent accuracy of 70.78%. Model comparison also confirms that our proposed set of features can reduce the bias in prediction without adding variance. We hope that our corpus, feature sets, and experimental validation can provide educators and linguists with more language resources, insights, and automatic tools for future related research.

Dawei Lu, Xinying Qiu, Yi Cai
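
The "distance-1 adjacent accuracy" reported in the abstract above is commonly understood as the share of predictions that fall within one level of the annotated level. The sketch below assumes that reading; the function name, the `tolerance` parameter, and the sample levels are invented for illustration and are not taken from the paper.

```python
def adjacent_accuracy(gold, predicted, tolerance=1):
    """Fraction of predictions within `tolerance` levels of the gold level.

    Assumed definition: a prediction counts as correct when
    |gold - predicted| <= tolerance. tolerance=0 gives exact accuracy.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one labelled sentence is required")
    hits = sum(1 for g, p in zip(gold, predicted) if abs(g - p) <= tolerance)
    return hits / len(gold)


# Invented readability levels on a ten-level scale.
gold_levels = [1, 4, 7, 10, 5]
predicted_levels = [2, 4, 9, 9, 5]

print(adjacent_accuracy(gold_levels, predicted_levels))               # 0.8
print(adjacent_accuracy(gold_levels, predicted_levels, tolerance=0))  # 0.4
```

With ten ordered levels, this metric rewards near misses, which is why it is usually reported alongside exact accuracy rather than instead of it.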

Text Readability Assessment for Chinese Second Language Teaching

This paper proposes a multi-type and multi-granularity text readability feature set for Chinese second language teaching, which takes into account the dynamic and static features of the texts and integrates three linguistic units: character, word and sentence. On this basis, this paper analyses and compares various text readability assessment methods, and discusses how to effectively use various features for text readability assessment.

Shuqin Zhu, Jihua Song, Weiming Peng, Dongdong Guo, Gu Wu

Statistical Analysis and Automatic Recognition of Grammatical Errors in Teaching Chinese as a Second Language

Foreigners make various grammatical errors when learning Chinese due to the negative transfer of their mother tongue, learning strategies, etc. At present, research on grammatical errors mainly focuses on a certain word or a certain kind of error, resulting in a lack of comprehensive understanding. In this paper, a statistical analysis is conducted on large-scale data sets of grammatical errors made by second language learners, covering the words involved in grammatical errors and their frequencies. The statistical analysis gives a more comprehensive understanding of grammatical errors and has guiding significance for teaching Chinese as a second language (TCSL). Because of the large proportion of grammatical errors involving “的[de](of)”, the usages of “的[de](of)” are integrated into automatic recognition of Chinese grammatical errors. Experimental results show that overall performance is improved.

Yingjie Han, Mengjie Zhong, Lijuan Zhou, Hongying Zan

Tibetan Case Grammar Error Correction Method Based on Neural Networks

Grammar Error Correction (GEC) is an important research subject among Natural Language Processing tasks. In this work, aiming to tackle genitive and ergative grammatical errors in formal Tibetan text, we collect 1,793,563 consecutive sentence pairs as the training set, and use 5,000 sentence pairs with the same distribution as well as 1,159 sentence pairs with a different distribution as testing sets. In our approach, we first preprocess the Tibetan text data with compositional rules and then build a neural network architecture that combines BERT and Bi-LSTM to estimate the probability of a given token being genitive or ergative. In experiments, the proposed model achieves accuracies of 98.38% and 86.16% on the two testing sets, respectively.

Cizhen Jiacuo, Secha Jia, Sangjie Duanzhu, Cairang Jia

Chinese Text Error Correction Suggestion Generation Based on SoundShape Code

Text error correction is an essential part of text proofreading. This paper presents a method for generating text error correction suggestions based on SoundShape Code. By converting the target words into SoundShape Code and using an improved edit distance algorithm to fuzzily match them against the words in the vocabulary, a set of candidate words whose similarity exceeds a certain threshold is obtained. Based on the contextual relevance model, each word in the candidate set is scored, and reasonable error correction suggestions are then given according to the score. In this paper, four types of errors are marked: substitution errors in two-character words, missing-character errors in words of more than three characters, insertion errors in words of more than three characters, and substitution errors in three-character words. In total, 617 errors are tested and analyzed. Experiments show that the similarity calculation based on SoundShape Code can provide reasonable error correction suggestions.

Hanru Wang, Yangsen Zhang, Lipeng Yang, Congcong Wang
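
The candidate-generation step described in the abstract above (encode, compare by edit distance, keep matches above a threshold) can be sketched as follows. This uses plain Levenshtein distance, not the paper's improved algorithm, and the code strings, word labels, and the 0.75 threshold are placeholders invented for the example; they are not real SoundShape Codes.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance computed with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalised similarity in [0, 1]; 1.0 means identical code strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def candidates(target_code, vocabulary_codes, threshold=0.75):
    """Return (word, similarity) pairs at or above the threshold, best first."""
    scored = [(word, similarity(target_code, code))
              for word, code in vocabulary_codes.items()]
    kept = [(word, score) for word, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Placeholder code strings standing in for encoded vocabulary words.
vocabulary_codes = {
    "word_a": "A1B2C3D4",
    "word_b": "A1B2C3D5",
    "word_c": "Z9Y8X7W6",
}

print(candidates("A1B2C3E4", vocabulary_codes))
# [('word_a', 0.875), ('word_b', 0.75)]
```

In a full system, the surviving candidates would then be re-ranked by a context model, as the abstract describes.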

Extracting Hierarchical Relations Between the Back-of-the-Book Index Terms

To address the problem that a single-level back-of-the-book index system cannot fully capture the semantic relations between index terms, this paper proposes a method for extracting hierarchical relations between index terms based on a combination of lexical-syntactic analysis and text structure features. It first organizes index terms according to the text structure features and constructs index term pairs with hierarchical relations step by step. Then, based on word vectors, the semantic similarity of paired index terms is calculated to eliminate misidentified pairs. Finally, the index term pairs with hierarchical relations are optimized in a directed graph to remove redundant and conflicting relations, and the hierarchical index system is built. Compared with the other results, our method improves the precision rate and F value by 11.44% and 5.65% respectively.

Ning Li, Meng Tian, Shuqi Lv

A Situation Evaluation System for Specific Events in Social Media

With the widespread use of social media, social networks have become an important information carrier and platform for users to explore the world. Social networks not only reflect the hot events in society but also influence the trends and evaluation of events through user network behaviors. In this paper, a situation evaluation system is established for social event trends. We use web crawlers to collect multi-source data for a series of events of interest to form a basic knowledge base, and based on this, we extract statistical data and language features. Then, we use the Analytic Hierarchy Process (AHP) algorithm to calculate the weight of important features that can capture the development of the event situation and establish a hot social event situation evaluation system. Finally, we apply this system and stream computing technology to achieve situational awareness of events in real-time.

Bojia Li, Ruoyu Chen, Yangsen Zhang
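
To make the AHP weighting step in the abstract above concrete, the sketch below derives feature weights from a pairwise comparison matrix with the row geometric-mean method, one common way to approximate the AHP priority vector. The three feature names and the comparison values are invented for illustration and are not taken from the paper.

```python
import math


def ahp_weights(matrix):
    """Approximate AHP priority weights by normalising row geometric means.

    matrix[i][j] states how much more important feature i is than feature j,
    so matrix[j][i] is expected to be 1 / matrix[i][j].
    """
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("pairwise comparison matrix must be square")
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical features and judgements for an event-situation score.
features = ["post_volume", "repost_count", "sentiment_intensity"]
comparisons = [
    [1.0,     2.0,     6.0],
    [1 / 2.0, 1.0,     3.0],
    [1 / 6.0, 1 / 3.0, 1.0],
]

weights = dict(zip(features, ahp_weights(comparisons)))
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

The example matrix is perfectly consistent, so the weights come out as 0.6, 0.3, and 0.1; with real expert judgements a consistency check is usually added before the weights are used.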

Automatic Recognition of Chinese Separable Words Based on CRFs

Currently, most of the automatic recognition tasks of separable words adopt a rule-based method, which relies on automatic word segmentation results and lexical patterns generated from common inserted constituents. However, they suffer from incorrect word segmentation results and inaccurate and limited rules. Moreover, they ignore the rich information contained in the context. To address these issues, this paper proposes a CRFs-based method which employs nine features, such as character, POS tag, punctuation, word boundary, keyword and POS sequential rule. Experimental results on real-world datasets show that our approach can make full use of rich information and achieve significant improvements on recognition efficiency compared to all the baselines.

Ning Dong, Weiming Peng

Tibetan Sentence Similarity Evaluation Based on Vectorized Representation Techniques

Sentence similarity evaluation is an essential subject among the research fields of Natural Language Processing (NLP). However, Tibetan-related research on this subject is fairly inactive and has rarely drawn attention. In this work, we propose an approach that leverages vectorized representation techniques to tackle this problem, implementing two methods: Euclidean distance evaluation and Jaccard similarity evaluation. Experiments indicate that the performance of the presented methods is satisfactory.

Zhou Maoxian, Cizhen Jiacuo, Cai Rangjia
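
The two measures named in the abstract above have standard definitions, sketched below. How the paper builds its sentence vectors and token sets is not described here, so the vectors and tokens in the example are placeholders.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length sentence vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity of two token collections: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty sentences are treated as identical
    return len(set_a & set_b) / len(union)


# Placeholder sentence vectors and token lists.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))             # 5.0
print(jaccard_similarity(["a", "b", "c"], ["b", "c", "d"]))   # 0.5
```

Note that the two measures point in opposite directions: a smaller Euclidean distance means more similar sentences, while a larger Jaccard value does.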

An Easier and Efficient Framework to Annotate Semantic Roles: Evidence from the Chinese AMR Corpus

Semantic role labeling (SRL) is a fundamental task in Chinese language processing, but there are three major problems in the construction of SRL corpora. First, previous studies disagree over the definition and number of semantic roles. Second, it is hard for static predicate frames to cover dynamic predicate usages. Third, dropped semantic roles cannot be annotated. Abstract Meaning Representation (AMR) is a new method which provides a better solution to the above problems. The researchers use 5,000 sentences in the Chinese AMR corpus to compare AMR with other SRL resources. Data analysis shows that within the framework of AMR, it is easier to annotate semantic roles based on a simplified distinction between core and non-core roles. In addition, 1,045 tokens of dropped roles are annotated under this new framework. This study indicates that AMR offers a better solution for Chinese SRL and sentence meaning processing.

Li Song, Yuan Wen, Sijia Ge, Bin Li, Weiguang Qu

Linguistic Knowledge Based on Attention Neural Network for Targeted Sentiment Classification

Deep learning approaches for targeted sentiment classification do not fully exploit linguistic knowledge. In this paper, we propose a Linguistic Knowledge based on Attention Neural Network (LKAN) that employs linguistic knowledge (e.g. sentiment lexicon, negation words, intensity words) to benefit targeted sentiment classification. First, we extract such linguistic knowledge words from sentences using the HowNet vocabulary. Then, we design an attention mechanism which drives the model to concentrate on these words and obtain a weighted combination of word embeddings as the final representation of the sentence. We evaluate our proposed approach on SemEval 2014 Task 4, where its performance reaches the state-of-the-art level.

Chengyu Du, Pengyuan Liu

A Method of Automatic Memorabilia Generation Based on News Reports

This paper proposes a method of automatic memorabilia generation based on news reports, aiming to generate the memorabilia in a certain time period for specific enterprises or departments via machine learning technologies. First, the nonparametric clustering algorithm DBSCAN is used to cluster news reports based on text similarity. Then, we propose a salience ranking model to calculate the salience score of each cluster from different aspects, such as news coverage, report forwarding, and source website importance. Finally, time normalization and description generation are performed on the top-K clusters to generate the final memorabilia. Several experiments are carried out on news reports crawled from the related website. Experimental results show that the proposed method can effectively discover important events from the corpus. This paper explores memorabilia generation and provides a baseline system for this task.

Sun Rui, Zhang Hongyi, Zhang Benkang, Zhao Hanyan, Tang Renbei

Lexical Resources


A Case Study of Schema-Based Categorized Definition Modes in Chinese Dictionaries

Traditional category theories, including classical category theory and prototype category theory, have not provided enough theoretical and practical support for category-based definition in dictionaries. Based on schema category theory, this paper attempts to demonstrate how schema-based categorized definition modes work in Chinese dictionaries.

Hongyan Zhang, Lin Wang, Wuying Liu

The Construction and Analysis of Annotated Imagery Corpus of Three Hundred Tang Poems

Imagery is one of the core elements in understanding and appreciating ancient poetry. A lack of imagery data leads to subjective research in traditional imagery theory. Some quantitative studies have recently been proposed, but such studies lack annotated corpora. This paper reports the construction of a richly annotated imagery corpus compiled from Three Hundred Tang Poems, a classic poetry anthology. An analysis of 4,496 imageries shows that the use of imagery follows a long-tail distribution, which conforms to Zipf’s law, and that poets prefer to use natural imageries with metaphorical meanings. The use of imagery reflects a poet’s writing style to some extent, but it cannot be the gold standard for evaluating the quality of poetry.

Xingyue Hao, Sijia Ge, Yang Zhang, Yuling Dai, Peiyi Yan, Bin Li

Building Semantic Dependency Knowledge Graph Based on HowNet

This paper introduces a method of constructing a semantic dependency knowledge graph (SDKG) using the rich semantic knowledge in HowNet. The establishment of the SDKG depends on the correspondence between the lexical dependency labels in the BLCU-HIT semantic dependency bank and the event roles in HowNet. For words with few event roles or words that are not included in the knowledge graph, sememes are recommended based on the SPWE and SPASE algorithms to extend the SDKG. Experiments show that sememe recommendation achieves an accuracy of 86%. To establish the dependency relationships, we design a correspondence table of 87 entries mapping event role labels to dependency labels. The constructed SDKG has nearly 500,000 nodes that contain rich dependency information, which can be used to assist the analysis of the Semantic Dependency Parser. In addition, the results of Semantic Dependency Analysis can be drawn on to supplement the SDKG.

Siqi Zhu, Yi Li, Yanqiu Shao, Lihui Wang

Study on the Order of Vocabulary Output of International Students

In the process of teaching Chinese as a foreign language, vocabulary output is an important standard to measure effects of learners’ language acquisition. This paper collects 100 questionnaires from junior and senior international students, which require respondents to list the 300 most important daily words they think should be mastered when learning Chinese. Through the induction and statistical analysis of the questionnaire results, it is found that the vocabulary output of international students follows scene clues, word category clues and part of speech clues. At the same time, the vocabulary output of international students also has certain rules and characteristics. The output vocabulary has imageability and two-syllable words are dominant. Vocabulary output has a preliminary sense of morpheme, which embodies scene concept. In addition, it also has gender difference, and follows the acquisition order.

Xi Wang, Zhimin Wang

Construction of the Contemporary Chinese Common Verbs’ Semantic Framework Dictionary

Semantic lexicons and semantic frameworks are the primary support for natural language processing tasks such as information extraction, sentiment analysis, and machine translation. Therefore, it is essential to construct a semantic framework dictionary of contemporary Chinese common verbs that covers rich semantic knowledge. Based on an analysis of current research results, this paper defines the lexical framework of common Chinese verbs. According to the predicate thematic roles, the semantic framework is divided into the basic semantic framework and the extended semantic framework. Frameworks are automatically extracted, taking semantics as the processing unit, and summarized based on a large-scale corpus labeled with lexical and thematic roles. The complete and simplified versions of the verb framework are constructed with the help of manual proofreading. The final verb framework contains detailed descriptions and corresponding example sentences for 2,782 common verbs with 4,516 meanings.

Tongfeng Guan, Kunli Zhang, Xuemin Duan, Hongying Zan, Zhifang Sui

Knowledge Graph Representation of Syntactic and Semantic Information

Representation of linguistic knowledge is one of the keys to helping machines understand natural languages. This paper follows the idea from linguistic data to linguistic knowledge and to knowledge representation. At the syntactic level, the syntactic structure and its variants in the corpus are summarized, and the syntactic functions undertaken by the arguments are analyzed. At the semantic level, the semantic roles and semantic types of arguments are analyzed. The purpose is to reveal the interaction between syntax and semantics. Finally, this paper explores a fusion representation method of linguistic data and linguistic knowledge, and carries out a case study.

Danhui Yan, Yude Bi, Xian Huang

Annotation Scheme and Specification for Named Entities and Relations on Chinese Medical Knowledge Graph

The medical knowledge graph describes medical entities and relations in a structured form, which is one of the most important representations for integrating massive medical resources. It is widely used in intelligent question-answering, clinical decision support, and other medical services. The key to building a high-quality medical knowledge graph is the standardization of named entities and relations. However, the research in annotation and specification of named entities and relations is limited. Based on the current research on the medical annotated corpus, this paper establishes an annotation scheme and specification for named entities and relations centered on diseases under the guidance of physicians. The specification contains 11 medical concepts and 12 medical relations. Medical concepts include the diagnosis, treatment, and prognosis of diseases. Medical relations focus on relation types between diseases and medical concepts. In accordance with the specification, a new Chinese medical annotated corpus of high consistency is constructed.

Donghui Yue, Kunli Zhang, Lei Zhuang, Xu Zhao, Odmaa Byambasuren, Hongying Zan

Directionality and Momentum of Water in Weather: A Morphosemantic Study of Conceptualisation Based on Hantology

We present in this paper a study of the conceptualisation of meteorological events involving water in Chinese based on Hantology, a SUMO-based ontology of Chinese orthography. Our comprehensive investigation of the morphosemantic behaviours of these weather words in both Mandarin and Sinitic languages reveals that they are predicted by the directionality and momentum of their formation and movement. We studied events involving water in both liquid and solid forms: such as rain, snow, hail, fog, dew and frost. They share the radical 雨, which can be linked to two SUMO nodes according to Hantology. This ontological bifurcation can be shown to bring about not only the diversity of direction expressions referring to these words for water, but also the differences of semantic features and PoS between them in Archaic Chinese. Moreover, the momentum of different water forms is proposed to be the physical basis for the differences of PoS, semantic features and node linking.

Sicong Dong, Yike Yang, Chu-Ren Huang, He Ren

Construction of Adverbial-Verb Collocation Database Based on Large-Scale Corpus

This paper constructs a high-quality adverbial-verb collocation database based on a large-scale corpus. First, we established a knowledge system of adverbial-verb collocations based on previous studies and linguistic rules. Then, we designed and implemented a knowledge acquisition model of adverbial-verb collocation based on a large-scale corpus. Finally, we evaluated and analyzed the extracted results. The main purposes are to obtain high-quality adverbial-verb collocations by formal means and to provide data support for natural language processing and theoretical and applied linguistic research.

Dan Xing, Endong Xun, Chengwen Wang, Gaoqi Rao, Luyao Ma

On the Definition of Chinese Quadrasyllabic Idiomatic Expressions in Chinese-French Dictionaries: Problems and Corpus-Based Solution

Defining Chinese quadrasyllabic idiomatic expressions in Chinese-French dictionaries is demanding work for dictionary compilers. By comparing and analyzing the definitions of quadrasyllabic idiomatic expressions in Chinese-French dictionaries, this study discusses four kinds of problems (wrong definition, inappropriate definition, omission of senses, and absence of contextual information), and then gives some suggestions for improving the definitions of these Chinese idioms by observing and analyzing their real use in a large-scale corpus.

Fang Huang

Corpus Linguistics


A Metaphorical Analysis of Five Senses and Emotions in Mandarin Chinese

Emotions can be expressed by the five major external senses of human beings (i.e. vision, hearing, touch, smell and taste) via metaphors. Previous studies have mainly explored the relation between the five senses and emotions from the perspectives of physiology and cognition, and research on the five senses focuses on their semantic meanings. This paper attempts to investigate their relation based on corpus linguistics, centering on sensory verbs and emotional words. It is found that in Mandarin Chinese, the five basic emotions (i.e., happiness, sadness, fear, anger, and surprise) can all be expressed via the olfactory, tactile, visual, and auditory modalities, whereas, among these five emotions, surprise cannot be expressed through taste.

Jie Zhou, Qi Su, Pengyuan Liu

Research on Chinese Animal Words Extraction Based on Children’s Literature Corpus

Categorized and graded vocabularies are an important aspect of children’s graded reading. Taking animal words from the Thesaurus of Modern Chinese as the seed words, this paper studies a method of extracting animal words from a children’s literature corpus and attempts to construct a word sequencing model. The method matches the results of automatic word segmentation against the seed words. A total of 786 animal nouns are extracted from the corpus, an increase of 39.36% over the 564 seed words, and there are 780 derivative animal words. The animal word sequencing model is based on word-work-popularity and word-writer-popularity, which resolves the problem of an unbalanced number of characters and writers’ works.

Huizhou Zhao, Zhimin Wang, Shuning Wang, Lifan Zhang

The Concatenation of Body Part Words and Emotions from the Perspective of Chinese Radicals

Emotional stimuli can cause physical reactions in the body, and physiological responses further lead to emotional experiences. In the past, linguistic studies of emotional body responses mostly examined the emotions expressed by language structures involving body parts, and were mostly limited to dictionary meanings, rarely drawing on corpus data. This paper attempts to examine the concatenation of Chinese body part words and emotions in a microblog corpus from the perspective of Chinese radicals. The study finds that each type of body radical can be used with any emotion, but the strength of the concatenation with emotion varies: for example, “鼻(nose)” and “舌(tongue)” are most closely connected with the emotion of disgust, while “舌(tongue)” and “牙(齿, tooth)” best express the feelings of disgust and surprise. This provides a new perspective for the study of body parts and emotions.

Yue Pan, Pengyuan Liu, Qi Su

A Corpus-Based Study of Keywords in Legislative Chinese and General Chinese

The study of keywords is a hot topic in corpus linguistics and plays an important role in investigating lexical features in specific contexts. For legislative Chinese, as a kind of Chinese for specific purposes, the comparative study of its keywords with those of general Chinese is of great significance for understanding its linguistic features. This study uses the legislative Chinese corpus and Chinese Web 2011 corpus mutually as the observation corpus and the reference corpus, and extracts 50 keywords with the highest keyness score respectively from each corpus. Through a comparative analysis of the semantic classification of these keywords, this study finds that legislative Chinese has the characteristics of focusing on political and economic meanings, showing strong professionalism, having more numerals and monosemous words. This study is of great significance for exploring the characteristics of legislative Chinese vocabulary and exploring the textual features of legislative Chinese.

Shan Wang, Jiuhan Yin
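
The abstract above does not say how the keyness score is computed. One widely used definition is the "simple maths" ratio of smoothed per-million frequencies in the observation corpus and the reference corpus; the sketch below assumes that definition. The function names, the smoothing constant of 1, the corpus sizes, and the word counts are all invented for illustration.

```python
def per_million(count, corpus_size):
    """Relative frequency expressed per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness(count_focus, size_focus, count_ref, size_ref, smoothing=1.0):
    """Assumed 'simple maths' keyness: (fpm_focus + n) / (fpm_ref + n)."""
    return ((per_million(count_focus, size_focus) + smoothing)
            / (per_million(count_ref, size_ref) + smoothing))


def top_keywords(focus_counts, size_focus, ref_counts, size_ref, k=50):
    """Return the k words with the highest keyness in the observation corpus."""
    scores = {
        word: keyness(count, size_focus, ref_counts.get(word, 0), size_ref)
        for word, count in focus_counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Invented counts: a 1M-token observation corpus and a 10M-token reference corpus.
focus_counts = {"law": 500, "the": 1000, "tort": 99}
ref_counts = {"law": 490, "the": 10000}

print(top_keywords(focus_counts, 1_000_000, ref_counts, 10_000_000, k=2))
# ['tort', 'law']
```

Swapping the two corpora, so that the reference corpus becomes the observation corpus, gives the keyword list for the other side of the comparison, which matches the mutual setup the abstract describes.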

The Semantic Prosody of “Youyu”: Evidence from Corpora

Semantic prosody provides a new perspective for identifying the affective meaning of a word. Based on two comparable corpora and an online parallel corpus, this paper explores the semantic prosody of the Chinese function word “youyu” (“because”). The statistics from the Chinese corpus, TorCH2014, indicate that “youyu” has two colligations that exhibit obvious negative semantic prosodies. The evidence from its English equivalents in the English corpus and the parallel corpus also supports the conclusion that the semantic prosody of “youyu” is negative. This study shows that the combination of comparable corpora and a parallel corpus provides a powerful tool for language research.

Zhong Wu, Xi-Jun Lan

Corpus-Based Statistical Analysis of Polysemous Words in Legislative Chinese and General Chinese

Legislative language is an effective carrier of legal and judicial justice. It has many characteristics that are different from general language. However, the study of the language of legislation, especially legislative Chinese, is still relatively weak. This paper extracts high-frequency words from a legislative Chinese corpus and annotates their word meanings in this corpus. Taking them as target words, this paper then randomly extracts sentences from a large-scale general Chinese corpus (the CCL corpus or the corpus of the National Language Committee) for word sense annotation. By comparing word meanings in legislative Chinese and general Chinese, this study finds that there are significant differences between them in terms of the total number of meanings and the frequency of meanings. The reasons for the differences are closely related to the accuracy, written style and contextual features of legislative Chinese in comparison with general Chinese. The comparative study of the two types of language is helpful for exploring the characteristics of polysemous words in legislative Chinese, deepening the teaching and research of legislative Chinese, and providing references for lexical research in legislative Chinese.

Shan Wang, Jiuhan Yin

Corpus-Based Textual Research on the Meanings of the Chinese Word “Xífu(r)”

There are only two entries under the Chinese character “媳” in the 5th edition of the Modern Chinese Dictionary, namely xífù and xífur. They are annotated as two words with completely different meanings, each with two meanings. By searching the Ancient Chinese Corpus and examining the use cases one by one, it is shown that xífur appeared later than xífù, but that they are identical in lexical semantics. Xífù and xífur are actually one word, which should be treated as a single entry. The word has only three meanings: son’s wife, wife, and young married woman in general.

Jingmin Wang

A Research into Third-Person Pronouns in Lun Heng(论衡)

Lun Heng was written in the Eastern Han Dynasty. Because its writing style has features of both Ancient Chinese and Middle Chinese, Lun Heng is worth researching. The third-person pronouns in Lun Heng are mainly “之(zhi)”, “其(qi)”, “彼(bi)”, “厥(jue)”, “若(ruo)”, “夫(fu)”, “此(ci)” and “是(shi)”. Generally speaking, these third-person pronouns in Lun Heng inherit the existing usage in Ancient Chinese. However, they also show developments and changes, which involve the alteration of syntactic functions and the appearance of new third-person pronouns.

Huiping Wang, Zhiying Liu

Analysis of the Collocation of “AA-Type Adjectives” Based on MLC Corpus

Based on data from “Modern Chinese Dictionary” and the Media Language Corpus of Communication University of China, and according to the position of “AA-type adjectives” within the whole collocation, the 104 “AA-type adjectives” studied in this paper can be divided into three categories: post-positioned, pre-positioned, and those with no fixed position. This paper focuses on the syntactic features of “AA-type adjectives and their collocations”, the five types of their expressions, and a comparative analysis of the internal structural relationships of the five collocation types, with a view to providing ideas and clues for the semantic study of “AA-type adjectives and their collocations” and for second language teaching.

Junping Zhang, Rui Song, Ting Zhu, Caihong Cao, Mao Yuan

TG Network: A Model that More Effectively Identifies the Use of the Auxiliary Word “DE”

In the “trinity” knowledge base of function word usage, the auxiliary word “DE” is characterized by high frequency and flexible usage. This paper proposes a neural network model (TG network) to automatically recognize the usage of “DE”. The network uses a self-attention mechanism as the first-layer feature encoder and a GRU (gated recurrent unit) as the second-layer semantic extractor, and reaches a recognition accuracy of 82.8%. Experiments show that the TG network outperforms previous methods. Further experiments with different window sizes show that the larger the window, the better the model performs. A fine-grained analysis of each usage category is also carried out. In the future, the model is expected to automatically recognize more function words, and the recognition results can be applied to other natural language processing tasks.

Chuang Liu, Hongying Zan, Xuemin Duan, Kunli Zhang, Yingjie Han

A Comparative Study of the Collocations in Legislative Chinese and General Chinese

Remarkable achievements have been made in the study of lexis in general Chinese, such as lexical meanings, word structures, lexical systems, and semantic evolution. However, these studies can hardly reflect the unique characteristics of Chinese for special purposes, such as legal Chinese, travel Chinese, and business Chinese. Taking the commonly used word 管理 guǎnlǐ ‘manage; management’ as an example, this paper explores the characteristics of legislative Chinese in terms of the semantic categories and saliency of collocations by comparing them in a legislative Chinese corpus and the BCC corpus. This study finds that 管理 guǎnlǐ ‘manage; management’ is mainly used as a modifier and a modified term in legislative Chinese. Its collocates cover fewer semantic categories than in general Chinese. Most of the collocates are nouns, whose semantic categories mainly come from the political, social and economic fields. The study of the usage of commonly used words in legislative Chinese can not only help to explore the characteristics of legislative Chinese itself and its differences from general Chinese, but also provide references for law revision, legal lexicography, and legal Chinese teaching.

Shan Wang, Jiuhan Yin

From Modern to Ancient Chinese: A Corpus Approach to Beneficiary Structure

This study reports the results of two sets of corpus studies on the use of beneficiary structures (wèi-dòng shì), one in modern and the other in ancient Chinese. First, we analyzed the semantic associations of the word wèile ‘do something for something/someone’ in modern Chinese, using two corpora and the word-embedding model. The results were in line with semantic analyses proposed in the Semantic-Map Model. Second, based on an examination of all the sentences expressing beneficiary meanings in Zuo’s Commentary and Mencius, we established that the beneficiary structure in those works involves a light-verb structure that should be syntactically distinguished from other such structures that introduce causative and intentional events. As well as providing some new evidence regarding the semantic content of the wèi-dòng shì in modern Chinese, we present structural evidence of its source, which can be dated to the pre-Qin period, as shown by the examples in the two target ancient-Chinese texts.

Yu-Yin Hsu, Tao Wang

Research on Gender Tendency of Foreign Student’s Basic Chinese Vocabulary

The paper designs a basic vocabulary sequencing model and explores the differences in basic vocabulary output between male and female learners. It finds that males and females have clear preferences in choosing the vocabulary they need. Males are sensitive to numbers and concerned about world issues, while females show dependence on their living and learning environment and on the evaluation of their mood. Moreover, men tend to output abstract vocabulary and think about issues at a relatively macro level, whereas women tend to output concrete vocabulary and consider things at a relatively micro level. There are also clear differences between males and females in kinship terms and personal pronouns: both give priority to the output of kinship terms of their own sex, but males give priority to the output of personal pronouns earlier than females do. The study reveals the differences in Chinese vocabulary output among learners of different genders.

Zhimin Wang, Huizhou Zhao

The Restrictions on the Genitive Relative Clauses Triggered by Relational Nouns

Genitives can be relativized in Mandarin Chinese. This article focuses on the genitive relative clauses triggered by relational nouns. These constructions are subject to certain grammatical and semantic restrictions. First, the relational nouns in the relative clauses must serve as subjects. Second, the predicates of the relative clauses are usually intransitive verbs. In addition, kinship terms cannot form well-formed genitive relative clauses. Finally, the article explains these restrictions in terms of the theory of the prominence condition.

Xin Kou

The Construction of Interactive Environment for Sentence Pattern Structure Based Treebank Annotation

In the construction of treebanks, manual annotation is inefficient and cannot ensure consistent results. Based on the existing graphical syntax annotation platform for the sentence pattern structure, this paper adds an automatic syntactic analysis function and constructs a human-computer interactive annotation environment. The paper shows that, compared with the manual mode, the human-computer interactive mode can greatly improve efficiency.

Shiyu Guan, Weiming Peng, Jihua Song, Zhiping Xu

Semantic Representations of Terms in Traditional Chinese Medicine

Word embeddings have been widely used in lexical semantics and in neural networks for Natural Language Processing. This article investigates the semantic representations produced by word embedding technologies by verifying them against a human-constructed domain ontology. The domain of Traditional Chinese Medicine (TCM) is used as a testbed in this study, because this domain is knowledge-rich and has a large-scale domain ontology with well-defined entity types and relation types. This article releases a dataset, named “TCMSem”, that captures TCM domain experts’ intuitions of semantic relatedness. The dataset is designed to cover medical entities and relations with as many semantic types as possible, so as to enable a diverse and comprehensive evaluation of word embeddings. Experimental results show that word embeddings are better at detecting synonyms and collocations than other types of semantic relations. Furthermore, the semantic relatedness of thousands of terms from major categories in TCM is visualized using the taxonomy defined in the ontology.

Qinan Hu, Ling Zhu, Feng Yang, Jinghua Li, Qi Yu, Ye Tian, Tong Yu, Yueguo Gu

The Hé-Structure in the Subject Position Revisited

The syntactic status of hé ‘和’ in the Chinese [DP1 hé DP2] structure is ambiguous when the structure occurs in the subject position. It could be a preposition or a conjunction. Three views have been proposed to account for this ambivalence, namely, the context-deterministic account, the multi-categorizer account, and the preposition-taking-all account. This paper argues against the preposition-taking-all account from both theoretical and empirical perspectives, and proposes that hé in the subject position can be and should be coordinative. On this basis, the paper presents several advantages of reinstating the preposition-conjunction dichotomy analysis of hé.

Fanjun Meng

On the Semantics of Suffix -men and NP-men in Mandarin Chinese

The paper deals with the semantics of the nominal suffix -men and NP-men in Mandarin Chinese. In Chinese linguistics, -men is often considered a “marker of collectivity”. However, the term collectivity (jíhé in Chinese) is confusing. In semantics, the concept of collectivity refers to two things: the collective interpretation of an NP and collective nouns. This distinction can help to clarify the usage of -men. The author argues that the suffix -men is a marker of plurality, carrying a [plural] feature by virtue of its denotation. It indicates the number of an NP. In this paper, -men is analyzed with respect to the collective and distributive interpretations of NP-men, especially when it co-occurs with different types of predicates.

Yan Li

Analysis of the Foot Types and Structures of Chinese Four-Syllable Abbreviations

This paper investigates the foot types and structural features of Chinese four-syllable abbreviations. The results show, first, that whether to meet the need for complex and rich ideographic expression or to avoid ambiguity, these abbreviated forms are an essential and meaningful part of modern Chinese abbreviations; and second, that most existing forms tend to have a balanced 2 + 2 prosodic structure and to be nominal in part of speech. In addition, under some specific conditions these forms may be further abbreviated to fewer than four syllables.

Wei Ying, Jianfei Luo

Similarities and Differences Between Chinese and English in Sluicing and Their Theoretical Explanation

Sluicing refers to a certain type of compound sentence in which one clause is a wh-question where all sentential elements except the wh-phrase itself are omitted. In semantic interpretation, a sluicing sentence is comparable to a full wh-interrogative. The study of sluicing involves several important aspects of syntactic theory. Zhang and Xu provide a unified account of sluicing in Chinese and English from the perspective of the predicative Empty Category [1]. This article demonstrates that one important issue remains to be resolved regarding the similarities and differences between Chinese and English in sluicing: what remains after deletion in English is the wh-phrase alone, whereas in Chinese a copular verb must accompany the retained wh-phrase. As the major new viewpoint of this article, this cross-linguistic contrast is shown to be explained in a more principled way by appealing to the theory of focus than by using ad hoc stipulations.

Yewei Qin, Jie Xu

An Investigation of Heterogeneity and Overlap in Semantic Roles

An inventory of semantic roles is needed in semantic role labelling. However, no matter how semantic roles are defined, there is always heterogeneity and overlap among them. The semantic properties of members of the same semantic role can differ, while those of members of different semantic roles can be similar. This is a widespread phenomenon across semantic role types. The paper analyzes the causes of heterogeneity and overlap, and points out the close connection between them. The severity of the heterogeneity and overlap described in the article is assessed using data from a survey of the Chinese PropBank, a publicly available semantic resource consisting of two parts: the Frameset and the corpus.

Long Chen, Weidong Zhan

Cognitive Semantics and Its Application on Lexicography: A Case Study of Idioms with xīn in Modern Chinese Dictionary

This paper attempts to explore the role of cognitive semantics in Chinese lexicography. Specifically, the paper focuses on the multiple senses of the headword 心 xīn and the related idioms in Modern Chinese Dictionary (现代汉语词典 xiàn dài hàn yǔ cí diǎn). It first provides a cognitive analysis of the idioms related to 心 xīn based on their different underlying conceptual metaphors and metonymies. From the cognitive perspective, 心 xīn in Chinese and ‘heart’ in English can be further classified into different categories. As a result, different idioms with 心 xīn may be interpreted in various ways to match their conceptual categories. In the current Modern Chinese Dictionary, all the idioms are arranged in pinyin alphabetical order. This arrangement obscures the headword’s conceptual mechanisms and may confuse some dictionary users, especially learners of Chinese. This paper suggests that in Chinese dictionary compilation, the arrangement of these idioms could take the underlying conceptual mechanisms into account.

Qian Li

Research on Quantifier Phrases Based on the Corpus of International Chinese Textbooks

Chinese is rich in quantifiers, and studying the various types and structures of quantifier phrases is a necessary way to master quantifiers. Using Chinese information processing technology to study quantifier phrases in international Chinese teaching is conducive to promoting deep integration of the two fields. This paper first constructs a knowledge base of quantifier phrase structural modes by tagging the quantifier phrases in a corpus of international Chinese textbooks of a certain scale. Then the characteristics of quantifier phrases in the field of international Chinese teaching are analyzed using the constructed knowledge base. Finally, on the basis of this knowledge base, automatic recognition of quantifier phrases in the corpus of international Chinese textbooks is studied.

Dongdong Guo, Jihua Song, Weiming Peng, Yinbing Zhang

Two-Fold Linguistic Evidences on the Identification of Chinese Translation of Buddhist Sutras: Taking Buddhacarita as a Case

As a product of language contact, the Chinese translation of Buddhist sutras has significant linguistic value. However, it is hard to accurately identify the translators and years of translation of some early translated sutras. Considering that the Chinese translations were mingled with elements borrowed from foreign languages, mostly Sanskrit, identification from the linguistic perspective needs to adopt both external and internal linguistic evidence. The external approach is to analyze the characteristics of translation embodied in the corpus by collating parallel texts of the Sanskrit original and the Chinese translation, and the internal approach is to examine the language style and language use habits of the translator according to the identification criteria of the Chinese corpora. Taking Buddhacarita as a case, this paper attempts to identify its translator using both kinds of evidence, which provides a new way of identifying such corpora.

Bing Qiu


