Neurocomputing

Volume 147, 5 January 2015, Pages 136-146

Federating clustering and cluster labelling capabilities with a single approach based on feature maximization: French verb classes identification with IGNGF neural clustering

https://doi.org/10.1016/j.neucom.2014.02.060

Abstract

Classifications which group together verbs and a set of shared syntactic and semantic properties have proven to be useful in both linguistics and Natural Language Processing tasks. However, most existing approaches for automatically acquiring verb classes fail to associate the verb classes produced with an explicit characterisation of the syntactic and semantic properties shared by the class elements. We propose a novel approach to verb clustering which addresses this shortcoming and permits building verb classifications whose classes group together verbs, subcategorisation frames and thematic grids. Our approach involves the use of a recent neural clustering method called IGNGF (Incremental Growing Neural Gas with Feature maximisation). In IGNGF, the standard distance measure used for determining a winner is replaced by a feature maximisation measure relying on the features of the data that are associated with clusters during learning. A main advantage of the method is that the maximised features used by IGNGF during learning can also be exploited in a final step for accurately labelling the resulting clusters. In this paper, we exploit IGNGF for the unsupervised classification of French verbs and evaluate the obtained clusters (i.e., verb classes) in two different ways. The first is a quantitative analysis of the clustering process relying on a usual gold standard and on complementary unbiased clustering quality indexes. The second is a qualitative analysis of the cluster labelling process. Relying on an adapted gold standard, we evaluate the capacity of the IGNGF cluster labels (i.e., subcategorisation frames and thematic grids) to be exploited for bootstrapping a VerbNet-like classification for French. Both analyses clearly highlight the advantages of the approach.

Introduction

Classifications which group together verbs and a set of shared syntactic and semantic properties have proved to be useful in both linguistics and Natural Language Processing (NLP) tasks.

Linguistically, such classifications have been shown [1] to capture generalisations about a range of (cross-)linguistic properties. For example, verbs which share the meaning component ‘communication’ (such as say, disclose, declare, and write) also behave similarly in terms of subcategorisation (I said/disclosed/declared a few words, I said/disclosed/declared a few words to Helen, I said/disclosed/declared that he should go). These classifications also define the mapping between syntactic arguments and thematic roles, thereby providing a general framework for studying the complex relationship between lexical predicate-argument structures and grammatical functions. They also provide higher level abstractions in terms of syntactic or semantic features which can be used as a principled means for abstracting away from individual words.

In NLP, the predictive power and the syntax/semantic interface provided by these classifications have been shown to benefit such tasks as computational lexicography [2], machine translation [3], word sense disambiguation [4] and subcategorisation acquisition [5].

Developing such classifications however remains a difficult issue. For English, VerbNet [6] was manually developed and provides detailed, large-scale and online syntactic-semantic descriptions of Levin classes organised into a refined taxonomy. But manual development is costly. To remedy this shortcoming, several methods have been proposed to automatically acquire verb classifications [7], [8], [9], [10], [11], [12], [13]. However, these approaches mostly concentrate on acquiring verb classes, that is, sets of verbs which are semantically and/or syntactically coherent. The specific syntactic and semantic features characterising each verb class are usually left implicit: they determine the clustering of similar verbs into verb classes but they do not explicitly label these classes. In other words, none of the existing approaches build classifications which, like VerbNet, associate a set of verbs with a set of syntactic frames and a set of semantic roles characterising each class.

In this paper, we present a novel approach to the automatic acquisition of verb classes which addresses this shortcoming. It produces classifications which not only group together verbs that share a number of features but also explicitly associate each verb class with a set of subcategorisation frames and thematic grids characteristic of that class.

Our approach involves the use of a recent neural clustering method called IGNGF (Incremental Growing Neural Gas with Feature maximisation, [14]). Interestingly, in this clustering method, the features used for producing the classes are also used for labelling the output clusters. That is, each cluster in the output clustering is labelled with a ranked list of features that best characterises that cluster. To acquire a verb classification, we extract the syntactic and semantic features required for learning verb classes from the existing lexical resources for French verbs. These include, in particular, subcategorisation frames, thematic grids and (English) VerbNet class names. Since the features used for clustering are also those used to label the output classes, the classification produced by applying the IGNGF clustering method to the French data associates each verb class with a “cluster labelling profile” that includes one or more VerbNet classes, syntactic frames and/or thematic grids.
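To make the idea of labelling each cluster with a ranked list of its own features more concrete, here is a minimal sketch. It assumes one common formulation from the feature maximisation literature, a feature F-measure that combines feature recall and feature precision; the excerpt itself does not give the formula, so this should not be read as the exact measure used by IGNGF. The verbs, feature names and cluster assignment are hypothetical.

```python
from collections import defaultdict


def feature_f_measures(data, assignment):
    """Return {cluster: {feature: F-measure}}.

    data:       {item: {feature: weight}}
    assignment: {item: cluster}
    Assumed formulation (not taken from the excerpt):
      recall    = weight of f in c / weight of f over all clusters
      precision = weight of f in c / weight of all features in c
    """
    in_cluster = defaultdict(lambda: defaultdict(float))
    feature_total = defaultdict(float)
    cluster_total = defaultdict(float)
    for item, features in data.items():
        c = assignment[item]
        for f, w in features.items():
            if w <= 0:  # ignore absent features to avoid zero divisions
                continue
            in_cluster[c][f] += w
            feature_total[f] += w
            cluster_total[c] += w

    scores = {}
    for c, weights in in_cluster.items():
        scores[c] = {}
        for f, w in weights.items():
            recall = w / feature_total[f]
            precision = w / cluster_total[c]
            scores[c][f] = 2 * recall * precision / (recall + precision)
    return scores


def label_clusters(data, assignment, top_n=3):
    """Rank the features of each cluster by decreasing F-measure."""
    scores = feature_f_measures(data, assignment)
    return {
        c: sorted(fs, key=lambda f: (-fs[f], f))[:top_n]
        for c, fs in scores.items()
    }


# Hypothetical toy data: verbs described by frame and grid features.
verbs = {
    "dire": {"frame:SUJ,OBJ": 1.0, "grid:Agent-Topic": 1.0},
    "annoncer": {"frame:SUJ,OBJ": 1.0, "grid:Agent-Topic": 1.0},
    "courir": {"frame:SUJ,OBJ": 1.0, "frame:SUJ": 1.0},
    "marcher": {"frame:SUJ": 1.0},
}
assignment = {"dire": 0, "annoncer": 0, "courir": 1, "marcher": 1}

labels = label_clusters(verbs, assignment, top_n=2)
print(labels)
```

In this toy setting the frame shared by both clusters ranks below the feature that is specific to each cluster, which is the behaviour one would want from a cluster labelling profile.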

We evaluate the acquired classification both on the clusters (verb sets) it produces and on its cluster labelling, i.e., the syntactic and semantic features associated by the IGNGF clustering with the clusters. On the one hand, we evaluate the verb clusters by comparison against an established test set [7] and by exploiting our own unbiased clustering quality indexes [15]. This evaluation includes an indirect comparison with an alternative state-of-the-art approach exploiting spectral K-means clustering [7]. On the other hand, we carry out an exhaustive qualitative analysis of the clusters, examining both the semantic and the syntactic coherence of each cluster with respect to its associated features. In this part, relying on an adapted gold standard, we more specifically evaluate the capacity of the IGNGF cluster labels (i.e., subcategorisation frames and thematic grids) to be used for bootstrapping a VerbNet-like classification for French.

The paper is structured as follows. Section 2 describes the features used for clustering and introduces the IGNGF clustering algorithm. Section 3 reports the results of the quantitative analysis. Section 4 reports the results of the qualitative analysis. Finally, Section 5 draws the conclusions.

Features and clustering algorithm

Since our aim is to acquire a classification which covers the core verbs of French, we choose to extract the features used for clustering not from a large, automatically parsed corpus but from manually validated resources. Of course, the same approach could be applied to corpus-based data (as done, e.g., in [7]), thus making the approach fully unsupervised and directly applicable to any language for which a parser is available.

Gold standard

To evaluate the association among verbs, frames and grids provided by the IGNGF clustering method, we used a reference corpus called V-gold proposed in [7].

V-gold consists of 16 fine-grained Levin classes with 12 verbs each (translated to French) whose predominant sense in English belongs to that class. Because we aim to use the classification for semantic role labelling and therefore wish to associate each verb with a thematic grid, we use a slightly modified version of this gold standard
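As an illustration of how a set of verb clusters can be compared against a gold standard of this kind, the sketch below computes plain cluster purity. This is only one simple measure, chosen here for illustration; the measures actually used in the paper are not given in this excerpt. The verbs, class names and clusters are hypothetical.

```python
from collections import Counter


def purity(clusters, gold):
    """Fraction of verbs that belong to the majority gold class of their cluster.

    clusters: {cluster_id: [verb, ...]}
    gold:     {verb: gold_class}
    """
    total = sum(len(members) for members in clusters.values())
    if total == 0:
        return 0.0
    majority = 0
    for members in clusters.values():
        counts = Counter(gold[v] for v in members)
        if counts:
            majority += counts.most_common(1)[0][1]
    return majority / total


# Hypothetical gold classes and clustering output.
gold = {
    "dire": "communication",
    "annoncer": "communication",
    "courir": "motion",
    "marcher": "motion",
}
clusters = {0: ["dire", "annoncer", "courir"], 1: ["marcher"]}

print(purity(clusters, gold))
```

Purity rewards clusters dominated by a single gold class, but it can be inflated by producing many small clusters, so it is usually reported together with a complementary measure.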

Qualitative evaluation

The evaluation results presented in the previous section clearly highlight the superiority of the IGNGF method over a reference method such as K-means for the task of French verb clustering. However, that evaluation was restricted to the verb classes, and it thus fails to consider one of the important added values of the IGNGF method, namely cluster labelling.

In our specific context, the IGNGF clustering method does not only provide verb clusters. As a

Conclusions

We presented a novel approach to verb classification which makes it possible both to obtain relevant verb classes and to accurately associate these classes with a semantic role set and a set of subcategorisation frames. Our approach is based on the recent IGNGF incremental neural clustering method which, in addition to its capacity to deal with sparse and high dimensional data representations, supports the labelling of clusters with “cluster profiles” i.e., sets of

References (40)

  • B. Levin, English Verb Classes and Alternations: A Preliminary Investigation, 1993.
  • K. Kipper, H.T. Dang, M. Palmer, Class-based construction of a verb lexicon, in: AAAI/IAAI, 2000, pp....
  • B.J. Dorr, Large-scale dictionary construction for foreign language tutoring and interlingual machine translation, Mach. Transl., 1997.
  • D. Prescher, S. Riezler, M. Rooth, Using a probabilistic class-based lexicon for lexical ambiguity resolution, in: 18th...
  • A. Korhonen, Semantically motivated subcategorization acquisition, in: ACL Workshop on Unsupervised Lexical...
  • K. Kipper Schuler, VerbNet: A broad-coverage, comprehensive verb lexicon (Ph.D. thesis), University of Pennsylvania,...
  • L. Sun, A. Korhonen, T. Poibeau, C. Messiant, Investigating the cross-linguistic potential of VerbNet-style...
  • C. Brew, S. Schulte im Walde, Spectral Clustering for German Verbs, in: Proceedings of the Conference on Empirical...
  • S. Schulte im Walde, Experiments on the automatic induction of German semantic verb classes (Ph.D. thesis), Institut für...
  • S. Schulte im Walde, Experiments on the automatic induction of German semantic verb classes, Comput. Linguist., 2006.
  • A. Oishi et al., Detecting the organization of semantic subclasses of Japanese verbs, Int. J. Corpus Linguist., 1997.
  • H.T. Dang, K. Kipper, M. Palmer, J. Rosenzweig, Investigating regular sense extensions based on interselective Levin...
  • P. Merlo, S. Stevenson, V. Tsang, G. Allaria, A multilingual paradigm for automatic verb classification, in: ACL, 2002,...
  • J.-C. Lamirel, R. Mall, P. Cuxac, G. Safi, Variations to incremental growing neural gas algorithm based on label...
  • J.-C. Lamirel, I. Falk, C. Gardent, Enhancing NLP tasks by the use of a recent neural incremental clustering approach...
  • K. van den Eynde et al., La valence: l'approche pronominale et son application au lexique verbal, J. French Lang. Stud., 2003.
  • A. Kupść et al., Growing TreeLex.
  • M. Gross, Méthodes en syntaxe, 1975.
  • J.H. Randall, Linking: Studies in Natural Language and Linguistic Theory, 2010.
  • C. Chang et al., LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Technol., 2011.

    Jean-Charles Lamirel has been a lecturer since 1997. He currently teaches Information Science and Computer Science at the University of Strasbourg and conducts his research at the INRIA laboratory of Nancy. He was a research member of the INRIA-CORTEX project, whose scope is Neural Networks and Biological Systems. He has recently joined the INRIA TALARIS project, whose main concern is automatic language and text processing. His main domain of research is Textual Data Mining based on Neural Networks. He has interests in both theoretical models for Data Mining and Data Mining applications, and he specializes in unsupervised learning methods. He is the creator of the Data Analysis based on Multiple Viewpoints (MVDA) paradigm, which has been fruitfully implemented in the MultiSOM and MultiGAS models. His other main research topics concern Knowledge Extraction through numerical approaches, Visualization methods for Data Analysis, the Evaluation of Data Analysis methods, and Novelty Detection models. He is a member of the COLLNET Informetrics group. He and his tools have been involved in European projects on Webometrics and Data Analysis, such as the recent EISCTES project. He is also involved in European projects focusing on Intelligent Recommendation System modeling, such as the Satand-Surf ESA project. He takes part in several international collaborations in the domains of Intelligent Patent Mining, Bioinformatics and Webometrics. He organized one of the recent Informetrics/Scientometrics/Webometrics conferences in Nancy (INIST, 2006), and he is a board member of the international Webometrics journal “Collnet Journal of Scientometrics and Information Management”. His research work has led to the successful defence of four Ph.D. theses. It has generated a substantial scientific output: 8 contributions in international journals, 10 invited conferences, the organization of an international conference, 3 special sessions and 9 session chairs in international conferences, 87 publications in international conferences, 3 book chapters, 3 publications in national conferences and 5 European project reports. It has also led him to supervise 12 research internships (engineering, DEA or Master's level), with a systematic policy of publishing in collaboration with the students concerned. This work has also earned him the recognition of many prestigious foreign institutional partners such as NIEHS (USA), NSC (Taiwan), KU Leuven (Belgium), NISTAD (India), and WISELAB (China).

    Ingrid Falk works as a research engineer in Computational Linguistics at the “Linguistique, Langue, Parole (LiLPa)” Department of the University of Strasbourg. She holds a diploma in Mathematics from the University of Bonn, Germany, and a Master's Degree and Ph.D. in Computational Linguistics from the University of Lorraine, France. Her research currently focuses on the modeling, development and (semi-)automatic processing of lexical, textual and terminological resources based on symbolic and machine learning methods. More generally, her research interests concern the application of such techniques to problems in Natural Language Processing and Computational Linguistics.

    Claire Gardent is a tenured senior researcher (Directrice de Recherche) at the French National Center for Scientific Research (CNRS). Her research focuses on the computational treatment of natural language meaning. She has worked on the automatic acquisition of lexical resources for French, on syntactic parsing and semantic role labelling, on text generation, and on the interaction between virtual worlds and natural language processing. She has published a book on analysis and generation (with Karine Baschung) and more than 100 articles in journals and conference proceedings, many of them in the conferences of the ACL (Association for Computational Linguistics). She has been nominated Chair of the European Chapter of the Association for Computational Linguistics (EACL), editor in chief of the journals "Traitement Automatique des Langues" and "Language and Linguistics Compass (Computational and Mathematical Section)", and member of the editorial board of the journals "Computational Linguistics", "Journal of Semantics" and "Journal of Linguistic Modelling". Each year she is on the programme committee of half a dozen international conferences or workshops. She has also acted as scientific chair for various international conferences, workshops and summer schools (ESSLLI, SIGDIAL, EACL, ENLG, SemDIAL).