Federating clustering and cluster labelling capabilities with a single approach based on feature maximization: French verb classes identification with IGNGF neural clustering
Introduction
Classifications which group together verbs and a set of shared syntactic and semantic properties have proved to be useful in both linguistics and Natural Language Processing (NLP) tasks.
Linguistically, such classifications have been shown [1] to capture generalisations about a range of (cross-)linguistic properties. For example, verbs which share the meaning component ‘communication’ (such as say, disclose, declare, and write) also behave similarly in terms of subcategorisation (I said/disclosed/declared a few words, I said/disclosed/declared a few words to Helen, I said/disclosed/declared that he should go). These classifications also define the mapping between syntactic arguments and thematic roles, thereby providing a general framework for studying the complex relationship between lexical predicate-argument structures and grammatical functions. Finally, they provide higher-level abstractions in terms of syntactic or semantic features which can be used as a principled means of abstracting away from individual words.
In NLP, the predictive power and the syntax/semantic interface provided by these classifications have been shown to benefit such tasks as computational lexicography [2], machine translation [3], word sense disambiguation [4] and subcategorisation acquisition [5].
Developing such classifications however remains a difficult issue. For English, VerbNet [6] was manually developed and provides detailed, large-scale and online syntactic-semantic descriptions of Levin classes organised into a refined taxonomy. But manual development is costly. To remedy this shortcoming, several methods have been proposed to automatically acquire verb classifications [7], [8], [9], [10], [11], [12], [13]. However, these approaches mostly concentrate on acquiring verb classes, that is, sets of verbs which are semantically and/or syntactically coherent. The specific syntactic and semantic features characterising each verb class are usually left implicit: they determine the clustering of similar verbs into verb classes but they do not explicitly label these classes. In other words, none of the existing approaches build classifications which, like VerbNet, associate a set of verbs with a set of syntactic frames and a set of semantic roles characterising each class.
In this paper, we present a novel approach to the automatic acquisition of verb classes which addresses this shortcoming. It produces classifications which not only group together verbs that share a number of features but also explicitly associate each verb class with a set of subcategorisation frames and thematic grids characteristic of that class.
Our approach involves the use of a recent neural clustering method called IGNGF (Incremental Growing Neural Gas with Feature maximisation, [14]). Interestingly, in this clustering method, the features used for producing the classes are also used for labelling the output clusters. That is, each cluster in the output clustering is labelled with a ranked list of features that best characterises that cluster. To acquire a verb classification, we extract the syntactic and semantic features required for learning verb classes from the existing lexical resources for French verbs. These include, in particular, subcategorisation frames, thematic grids and (English) VerbNet class names. Since the features used for clustering are also those used to label the output classes, the classification produced by applying the IGNGF clustering method to the French data associates each verb class with a “cluster labelling profile” that includes one or more VerbNet classes, syntactic frames and/or thematic grids.
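The idea that the clustering features also serve as cluster labels can be illustrated with a minimal sketch. It assumes a feature F-measure (the harmonic mean of a feature's recall and precision within a cluster) as the ranking criterion; the exact definition used by IGNGF is given in [14] and may differ. The verbs, feature names and cluster assignment below are hypothetical toy data, and the incremental growth of the neural gas itself is not shown.

```python
from collections import defaultdict

# Hypothetical toy data: each verb maps to weighted features
# (subcategorisation frames and thematic grids; names are illustrative only).
WEIGHTS = {
    "dire":     {"frame:NP_V_NP": 1.0, "grid:Agent-Topic": 1.0},
    "annoncer": {"frame:NP_V_NP": 1.0, "grid:Agent-Topic": 1.0},
    "courir":   {"frame:NP_V": 1.0, "grid:Agent": 1.0},
    "marcher":  {"frame:NP_V": 1.0, "frame:NP_V_NP": 1.0, "grid:Agent": 1.0},
}
# Hypothetical clustering output: cluster id -> member verbs.
CLUSTERS = {0: ["dire", "annoncer"], 1: ["courir", "marcher"]}


def feature_f_measures(weights, clusters):
    """Return {cluster: {feature: F-measure}} (assumed formulation).

    recall(c, f)    = weight of f in c / weight of f over all clusters
    precision(c, f) = weight of f in c / weight of all features in c
    """
    # Accumulate the weight of every feature inside every cluster.
    in_cluster = {c: defaultdict(float) for c in clusters}
    for c, verbs in clusters.items():
        for verb in verbs:
            for feature, w in weights[verb].items():
                in_cluster[c][feature] += w

    # Total weight of each feature across all clusters.
    total = defaultdict(float)
    for sums in in_cluster.values():
        for feature, w in sums.items():
            total[feature] += w

    scores = {}
    for c, sums in in_cluster.items():
        cluster_mass = sum(sums.values())
        scores[c] = {}
        for feature, w in sums.items():
            if w == 0.0:
                continue  # a feature with no weight cannot label the cluster
            recall = w / total[feature]
            precision = w / cluster_mass
            scores[c][feature] = 2 * recall * precision / (recall + precision)
    return scores


def label_clusters(weights, clusters, top_k=3):
    """Label each cluster with its top_k features, ranked by F-measure."""
    scores = feature_f_measures(weights, clusters)
    return {
        # Ties are broken alphabetically so the ranking is deterministic.
        c: sorted(fs, key=lambda f: (-fs[f], f))[:top_k]
        for c, fs in scores.items()
    }


if __name__ == "__main__":
    for cluster, labels in label_clusters(WEIGHTS, CLUSTERS, top_k=2).items():
        print(cluster, labels)
```

In this toy example the communication-like cluster is labelled first by its thematic grid, because that feature occurs in no other cluster, whereas the transitive frame is shared with the second cluster and therefore ranks lower.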
We evaluate the acquired classification both on the clusters (verb sets) it produces and on its cluster labelling, i.e., the syntactic and semantic features that the IGNGF clustering associates with the clusters. On the one hand, we evaluate the verb clusters by comparison against an established test set [7] and by exploiting our own unbiased clustering quality indexes [15]. This evaluation includes an indirect comparison with an alternative state-of-the-art approach exploiting spectral K-means clustering [7]. On the other hand, we carry out an exhaustive qualitative analysis of the clusters, examining both the semantic and the syntactic coherence of each cluster with respect to its associated features. In this part, relying on an adapted gold standard, we more specifically evaluate the capacity of the IGNGF cluster labels (i.e., subcategorisation frames and thematic grids) to be used for bootstrapping a VerbNet-like classification for French.
The paper is structured as follows. Section 2 describes the features used for clustering and introduces the IGNGF clustering algorithm. Section 3 reports the results of the quantitative analysis. Section 4 reports the results of the qualitative analysis. Finally, Section 5 draws the conclusions.
Features and clustering algorithm
Since our aim is to acquire a classification which covers the core verbs of French, we choose to extract the features used for clustering not from a large, automatically parsed corpus but from manually validated resources. Of course, the same approach could be applied to corpus-based data (as done, e.g., in [7]), thus making the approach fully unsupervised and directly applicable to any language for which a parser is available.
Gold standard
To evaluate the association among verbs, frames and grids provided by the IGNGF clustering method, we used a reference corpus called V-gold proposed in [7].
V-gold consists of 16 fine-grained Levin classes with 12 verbs each (translated to French) whose predominant sense in English belongs to that class. Because we aim to use the classification for semantic role labelling and therefore wish to associate each verb with a thematic grid, we use a slightly modified version of this gold standard
Qualitative evaluation
The evaluation results presented in the previous section clearly highlight the superiority of the IGNGF method over a reference method such as K-means for the task of French verb clustering. However, that evaluation was restricted to the verb classes and thus fails to consider one of the important added values of the IGNGF method, namely cluster labelling.
In our specific context, the IGNGF clustering method does not only provide verb clusters. As a
Conclusions
We presented a novel approach to verb classification which makes it possible both to obtain relevant verb classes and to accurately associate these classes with a semantic role set and a set of subcategorisation frames. Our approach is based on the recent IGNGF incremental neural clustering method which, in addition to its capacity to deal with sparse and high-dimensional data representations, supports the labelling of clusters with “cluster profiles” i.e., sets of
References (40)
- English Verb Classes and Alternations: A Preliminary Investigation (1993)
- K. Kipper, H.T. Dang, M. Palmer, Class-based construction of a verb lexicon, in: AAAI/IAAI, 2000, pp....
- Large-scale dictionary construction for foreign language tutoring and interlingual machine translation, Mach. Transl. (1997)
- D. Prescher, S. Riezler, M. Rooth, Using a probabilistic class-based lexicon for lexical ambiguity resolution, in: 18th...
- A. Korhonen, Semantically motivated subcategorization acquisition, in: ACL Workshop on Unsupervised Lexical...
- K. Kipper Schuler, Verbnet: A broad-coverage, comprehensive verb lexicon (Ph.D. thesis), University of Pennsylvania,...
- L. Sun, A. Korhonen, T. Poibeau, C. Messiant, Investigating the cross-linguistic potential of VerbNet-style...
- C. Brew, S. Schulte im Walde, Spectral Clustering for German Verbs, in: Proceedings of the Conference on Empirical...
- S. Schulte im Walde, Experiments on the automatic induction of German semantic verb classes (Ph.D. thesis), Institut fr...
- Experiments on the automatic induction of German semantic verb classes, Comput. Linguist. (2006)
- Detecting the organization of semantic subclasses of Japanese verbs, Int. J. Corpus Linguist.
- La valence, l'approche pronominale et son application au lexique verbal, J. French Lang. Stud.
- Growing treelex
- Méthodes en syntaxe
- Linking: Studies in Natural Language and Linguistic Theory
- LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Technol.
Jean-Charles Lamirel has been a lecturer since 1997. He currently teaches Information Science and Computer Science at the University of Strasbourg and carries out his research at the INRIA laboratory of Nancy. He was a research member of the INRIA-CORTEX project, whose scope is Neural Networks and Biological Systems. He recently joined the INRIA TALARIS project, whose main concern is automatic language and text processing. His main domain of research is Textual Data Mining based on Neural Networks. He has interests in both theoretical models for Data Mining and Data Mining applications, and specializes in unsupervised learning methods. He is the creator of the Data Analysis based on Multiple Viewpoints paradigm (MVDA), which has been fruitfully implemented in the MultiSOM and MultiGAS models. His other main research topics are Knowledge Extraction through numerical approaches, Visualization methods for Data Analysis, the evaluation of Data Analysis methods and Novelty Detection models. He is a member of the COLLNET Informetrics group. He and his tools have been involved in European projects on Webometrics and Data Analysis, such as the recent EISCTES project. He is also involved in European projects focused on Intelligent Recommendation System modeling, such as the Satand-Surf ESA project. He takes part in several international collaborations in the domains of Intelligent Patent Mining, Bioinformatics and Webometrics. He was the organizer of one of the last Informetrics/Scientometrics/Webometrics conferences in Nancy (INIST, 2006), and he is a board member of the international Webometrics journal “Collnet Journal of Scientometrics and Information Management”. His research work has led to the successful defence of four Ph.D. theses.
It has generated a substantial scientific output: 8 contributions in international journals, 10 invited conferences, the organization of an international conference, 3 special sessions and 9 session chairs in international conferences, 87 publications in international conferences, 3 book chapters, 3 publications in national conferences and 5 European project reports. It has also led him to supervise 12 research master's internships (engineers, DEA or Masters), with a systematic policy of publication in collaboration with the students concerned. As a whole, this work has earned him the recognition of many prestigious foreign institutional partners such as NIEHS (USA), NSC (Taiwan), KU Leuven (Belgium), NISTAD (India) and WISELAB (China).
Ingrid Falk works as a research engineer in Computational Linguistics at the “Linguistique, Langue, Parole (LiLPa)” Department of the University of Strasbourg. She holds a diploma in Mathematics from the University of Bonn, Germany, and a Master's Degree and Ph.D. in Computational Linguistics from the University of Lorraine, France. Her research currently focuses on the modeling, development and (semi-)automatic processing of lexical, textual and terminological resources based on symbolic and machine learning methods. More generally, her research interests concern the application of such techniques to problems in Natural Language Processing and Computational Linguistics.
Claire Gardent is a tenured senior researcher (Directrice de Recherche) at the French National Center for Scientific Research (CNRS). Her research focuses on the computational treatment of natural language meaning. She has worked on the automatic acquisition of lexical resources for French, on syntactic parsing and semantic role labelling, on text generation, and on the interaction between virtual worlds and natural language processing. She has published a book on analysis and generation (with Karine Baschung) and more than 100 articles in journals and conference proceedings, many of them in the conferences of the ACL (Association for Computational Linguistics). She has been nominated Chair of the European Chapter of the Association for Computational Linguistics (EACL), editor in chief of the journals "Traitement Automatique des Langues" and "Language and Linguistics Compass (Computational and Mathematical Section)" and member of the editorial board of the journals "Computational Linguistics", "Journal of Semantics" and "Journal of Linguistic Modelling". Each year she is on the programme committee of half a dozen international conferences or workshops; she has also acted as scientific chair for various international conferences, workshops and summer schools (ESSLLI, SIGDIAL, EACL, ENLG, SemDIAL).