
2019 | Book

Formalizing Natural Languages with NooJ 2018 and Its Natural Language Processing Applications

12th International Conference, NooJ 2018, Palermo, Italy, June 20–22, 2018, Revised Selected Papers


About this book

This book constitutes the refereed proceedings of the 12th International Conference, NooJ 2018, held in Palermo, Italy, in June 2018.

The 17 revised full papers and 3 short papers presented in this volume were carefully reviewed and selected from 48 submissions. NooJ is a linguistic development environment that provides tools for linguists to construct linguistic resources that formalize a large gamut of linguistic phenomena: typography, orthography, lexicons for simple words, multiword units and discontinuous expressions, inflectional and derivational morphology, local, structural and transformational syntax, and semantics. The papers in this volume are organized in topical sections on vocabulary and morphology; syntax and semantics; and natural language processing applications.

Table of Contents

Frontmatter
Correction to: Formalizing Natural Languages with NooJ 2018 and Its Natural Language Processing Applications
Ignazio Mauro Mirto, Mario Monteleone, Max Silberztein

Vocabulary and Morphology

Frontmatter
An Automated French-Quechua Conjugator
Abstract
This paper presents the first version of an automated French-Quechua verb conjugation system. Using the key structure of Quechua Undefined Tense conjugation and the transformations induced by Interposed Suffix (IPS) sets, I built a complete system of paradigms. I used the NooJ linguistic platform to formalize the morpho-syntactic grammar, which serves to automatically obtain the entire set of conjugated forms of transitive, intransitive and impersonal verbs in all tenses, moods, aspects and voices.
Maximiliano Duran
Implementation of Arabic Phonological Rules in NooJ
Abstract
In this paper, we implement Arabic phonological rules. We first present the speech organs with a description of Arabic sounds. Then, we describe the phonological changes and provide a brief linguistic study of them. Finally, we propose two solutions to implement Arabic phonological rules in NooJ. The first solution is a newly developed Java module in the NooJ platform that handles phonological rules using an independent formatted file. The second solution uses local grammars within the NooJ platform to locate anomalies in words and then apply the appropriate transformations.
Rafik Kassmi, Mohammed Mourchid, Abdelaziz Mouloudi, Samir Mbarki
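The "independent formatted file" solution described above suggests a rule file that drives string transformations. As a rough, language-agnostic sketch of that idea (the rule shown is an illustrative assimilation rule invented here, not one from the paper), phonological rewrite rules can be represented as pattern/replacement pairs applied in order:

```python
import re

# Hypothetical rule format: (pattern, replacement) pairs read from a rule file.
# The rule below ("n" becomes "m" before "b") is illustrative only.
rules = [(r"n(?=b)", "m")]

def apply_rules(word, rules):
    """Apply each phonological rewrite rule to the word, in order."""
    for pattern, repl in rules:
        word = re.sub(pattern, repl, word)
    return word
```

The paper's second solution achieves the same effect with NooJ local grammars, which locate the anomalous context and emit the transformed form.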
Arabic Broken Plural Generation Using the Extracted Linguistic Conditions Based on Root and Pattern Approach in the NooJ Platform
Abstract
This paper presents a linguistic study of the morphological feature of number, which affects nouns, verbs, adjectives and gerunds (verbal nouns), giving special attention to Arabic Broken Plurals (BPs). The difficulty lies in specifying the candidate Broken Plural Pattern(s) (BPPs) needed to find the BP Form(s) (BPFs) of a given Singular Form (SF). The Arabic BP cannot be generated automatically like other grammatical categories, e.g. verbs; its generation depends on the SF's morphological, phonological and semantic features. From a deep linguistic study we have extracted 108 sets of morphological, phonological and semantic conditions that restrict the generation of the BPF from its SF. The extracted conditions may yield one or more BPPs for a given SF, and we take into account the exceptions that permeate them. We have implemented inflectional and derivational grammars that generate the BPF using the Root-Pattern approach. This requires building a dictionary, in the NooJ platform, whose entries record not only the root and pattern of the SF but also the SF's morphological, phonological and semantic features.
Ilham Blanchete, Mohammed Mourchid, Samir Mbarki, Abdelaziz Mouloudi
Detecting Latin-Based Medical Terminology in Croatian Texts
Abstract
No matter what the main language of a medical text is, there is always evidence of the use of Latin-derived words and formative elements in its terminology. Generally speaking, this usage exhibits language-specific morpho-semantic behaviors in forming both technical-scientific and common-usage words. Nevertheless, the use of Latin in Croatian medical texts is not consistent, since different word-formation mechanisms may be applied to the same term. In order to map all the different occurrences of the same concept to a single one, we propose a model designed within NooJ and based on dictionaries and morphological grammars. Starting from the manual detection of nouns and their variants, we identify several word-formation mechanisms and develop grammars suitable for recognizing Latinisms and Croatinized Latin medical terminology.
Kristina Kocijan, Maria Pia di Buono, Linda Mijić
Processing Croatian Aspectual Derivatives
Abstract
The main objective of this paper is to detect and describe the major derivational processes and affixes used in deriving aspectually connected Croatian verbs. This analysis is enabled by the prior detection of verbal derivational families, i.e. families of verbs sharing the same root, and of the derivational affixes they contain. Using NooJ, we automatically detect such derivational processes and assign aspectual tags to derivatives. The procedure is based on a list of selected base forms and derivatives, a list of derivational affixes and their allomorphs, and a set of derivational rules. For this purpose we selected 15 verbal derivational families comprising approximately 250 derivatives in total. The output is being used to develop a large on-line database of Croatian aspectual pairs, triples and quadruplets. Such a resource will be valuable for research in lexicology and lexicography.
Krešimir Šojat, Kristina Kocijan, Matea Filko
Construction of Morphological Grammars for the Tunisian Dialect
Abstract
The use of the Tunisian dialect is growing rapidly in social networks, yet directly applying standard Arabic tools to Tunisian dialect corpora yields poor results. The construction of resources for this dialect has therefore become mandatory. With the aim of developing inflectional and derivational morphological grammars, we study several Tunisian corpora to elaborate the different forms of each grammatical category. Our proposed method consists of four steps, starting with the extraction of Tunisian dialect words and ending with their morphological, lexical and syntactic enrichment. The method is realized through a set of local morphological grammars implemented in the NooJ linguistic platform, where they are transformed into transducers using NooJ's new technologies. To evaluate our method, we applied our lexical resources to a Tunisian corpus of more than 18,000 words. The obtained results look promising.
Roua Torjmen, Kais Haddar
A Chinese Electronic Dictionary and Its Application in NooJ
Abstract
After four years of research, our project of an electronic dictionary containing about 63,000 entries is nearing completion. All entries consist of atomic linguistic units in simplified Chinese characters, the official script of the People's Republic of China. Together these entries cover the vocabulary needs of daily life, and certain scientific areas, such as mathematics and physics, have also been included. On this basis, the entries have been grammatically categorized so as to clearly identify their different meanings; the dictionary distinguishes 15 grammatical categories of the Chinese language. In this way any Chinese text may be successfully analyzed with the NooJ software; the grammatical rules ensuring a more comprehensive analysis will be completed at a later stage. The grammatical structure of any Chinese sentence can then be automatically analyzed, helping learners achieve a more complete understanding of Chinese. As a final result of the research, an effective module for segmenting Chinese word by word, on the model of Indo-European languages, should be possible, since a Chinese sentence contains no spaces between words. This tool will also give learners easier and more direct access to the Chinese language and its systems, and certain difficult words and key words can be identified through the use of NooJ.
Zhen Cai
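The segmentation problem mentioned above (Chinese sentences have no spaces between words) has a classic dictionary-based baseline, greedy longest-match segmentation. The sketch below is not the paper's NooJ module, just a minimal illustration of how a word list like the 63,000-entry dictionary could drive word-by-word segmentation; the three-word vocabulary is a toy example:

```python
def max_match(sentence, dictionary):
    """Greedy longest-match segmentation: at each position, take the
    longest dictionary word; fall back to a single character."""
    words = []
    i = 0
    while i < len(sentence):
        for j in range(len(sentence), i, -1):  # try the longest candidate first
            if sentence[i:j] in dictionary or j == i + 1:
                words.append(sentence[i:j])
                i = j
                break
    return words

# Toy dictionary for illustration; the paper's dictionary has ~63,000 entries.
vocab = {"我们", "学习", "中文"}
print(max_match("我们学习中文", vocab))  # → ['我们', '学习', '中文']
```

Real systems refine this baseline with grammatical categories and context, which is precisely what the dictionary's 15 categories make possible in NooJ.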

Syntax and Semantics

Frontmatter
Automatic Extraction of Verbal Phrasemes in the Culinary Field with NooJ
Abstract
Phraseology is the Achilles' heel of foreign-language learners. Before verbal phrasemes in the culinary field can be taught, they must first be extracted. Given the needs of our extraction task (modeling and disambiguation), NooJ proved the most appropriate software. After implementing the lexical data in NooJ, we were able to extract the verbal phrasemes from our corpus for teaching purposes.
Tong Yang
Some Considerations Regarding the Adverb in Spanish and Its Automatic Treatment: A Pedagogical Application of the NooJ Platform
Abstract
This paper is part of the research project entitled "The pedagogical application of NooJ to the teaching of Spanish". Our particular aim is to address some elements of analysis concerning adverbs in Spanish, focusing on the intersection of adverbs and adjectives, thus completing what we presented in previous papers, always within the range of possibilities offered by the NooJ platform, and with the teaching of Spanish as our implementation horizon. The corpus consists of two types of texts written in Spanish, which are contrasted: journalistic texts and youth texts, that is, texts produced by young people. Given the low frequency of adverbs and adverbial structures in both corpora, we consider the need to reinforce this category, the adverb.
Andrea Rodrigo, Silvia Reyes, Paula Alonso
Expansive Simple Arabic Sentence Parsing Using NooJ Platform
Abstract
All Arabic sentences, both verbal and nominal, share the same main structure, which consists of two required components, the predicate and the subject, and two optional components, the head and the complement. Simple sentences are based on the most basic noun phrases (simple nouns) and can be expanded in the predicate, the subject or the complement; the expansion yields compound parts rather than simple ones. The aim of this work is to merge our two previous parsers [2, 3] and to extend the merged parser, at the noun phrase level, so that it can parse expansive simple sentences. To this end, we have implemented a set of syntactic grammars modeling Arabic noun phrase structures, enriched with agreement constraints on the noun phrase components. Using our enhanced and extended grammar, we have syntactically parsed several sentences, recognized both nominal and verbal expansive sentences, and generated their possible syntactic trees regardless of the order of the sentence's components. The results were satisfactory.
Said Bourahma, Mohammed Mourchid, Samir Mbarki, Abdelaziz Mouloudi
A Construction Grammar Approach in the NooJ Framework: Semantic Analysis of Lexemes Describing Emotions in Croatian
Abstract
The paper deals with the semantic analysis of several lexemes encoding emotions in Croatian. It embraces the Construction Grammar approach and shows how some of its basic theoretical tenets comply perfectly with the computational capabilities of NooJ. Using examples with the noun strah 'fear', the research aims to point out the possibilities of annotating specific constructional meanings in NooJ: different connotations of the chosen lexemes, generalized uses of those constructions (their distribution in more abstract constructions such as noun phrases), their relations with other constructions (other intensifiers of emotions, causative sentences, etc.), and various distinctive features of their specific meanings (pragmatic as well as semantic and morphosyntactic), all of which reflect different linguistic and cognitive phenomena in language use.
Dario Karl, Božo Bekavac, Ida Raffaelli
The Lexicon-Grammar of Predicate Nouns with ser de in Port4NooJ
Abstract
This paper continues previous efforts to integrate complementary lexicon-grammars in order to expand the paraphrastic capabilities of Port4NooJ, the Portuguese module of NooJ (Silberztein 2016). We describe the integration of the lexicon-grammar of 2,085 predicate nouns that co-occur in constructions with the support verb ser de 'be of' in European Portuguese, as in O Pedro é de uma coragem extraordinária 'Peter is of an extraordinary courage', studied, classified and formalized by Baptista (2005b). This led to a 20% increase in the number of predicate nouns. We also extended previously created paraphrasing grammars, such as those that paraphrase symmetric predicates and those that handle the substitution of one support verb by another. Furthermore, we created new grammars to paraphrase negative constructions, appropriate noun constructions, adjectival constructions, and manner sub-clauses. The paraphrastic capabilities acquired have been integrated into the eSPERTo system.
Cristina Mota, Jorge Baptista, Anabela Barreiro
Unary Transformations for French Transitive Sentences
Abstract
Unary transformations are transformations that link one sentence to another, keeping the same semantic material. This paper presents a system that formalises a subset of Harris’ transformations for French, in particular the transformations described in the lexicon-grammar table #1, which describes a hundred auxiliary, modal and aspectual verbs. Other, more general transformations will be described as well.
Max Silberztein

Natural Language Processing Applications

Frontmatter
A Set of NooJ Grammars to Verify Laboratory Data Correctness
Abstract
Semantic interoperability in clinical processes is necessary to exchange meaningful information among healthcare facilities, and standardized classification and coding systems make such exchange possible. This paper aims to support the validation of mappings between local and standardized clinical content through NooJ syntactic grammars that recognize local linguistic forms and detect the level of data correctness. In particular, this work deals with laboratory observations, which different facilities identify with idiosyncratic codes and names, creating issues in data exchange. The Logical Observation Identifiers Names and Codes (LOINC) is an international standard for uniquely identifying laboratory and clinical observations. Mapping local concepts to LOINC makes it possible to link health data systems, even though it is a costly and time-consuming process. Moreover, in Italy LOINC experts manually double-check all performed mappings in order to validate them; this has become a non-trivial task because of the size of laboratory catalogues and the growing adoption of LOINC. The aim of this work is to realize a NooJ grammar system that supports LOINC experts in validating mappings between local tests and LOINC codes. We constructed syntactic grammars to recognize local linguistic forms and determine data accuracy, and NooJ contextual constraints to identify the correctness threshold of each mapping. The grammars help LOINC experts reduce the time required for mapping validation.
Francesca Parisi, Maria Teresa Chiaravalloti
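The paper's actual mechanism is NooJ grammars with contextual constraints; as a loose analogue of the "threshold of correctness" idea for a local-test-to-LOINC mapping, the hypothetical sketch below scores the similarity between a local test name and a LOINC long name and flags mappings below a threshold (names, threshold value and the example strings are invented for illustration):

```python
from difflib import SequenceMatcher

def mapping_confidence(local_name, loinc_name, threshold=0.6):
    """Score name similarity for a local-to-LOINC mapping and flag it
    for expert review when the score falls below the threshold."""
    score = SequenceMatcher(None, local_name.lower(), loinc_name.lower()).ratio()
    return score, score >= threshold

score, ok = mapping_confidence("Glucose [serum]",
                               "Glucose [Mass/volume] in Serum or Plasma")
```

A grammar-based approach like the paper's goes further than raw string similarity: it can recognize that differently ordered or abbreviated linguistic forms denote the same observation.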
A Semantico-Syntactic Disambiguation System of Arabic Movement and Speech Verbs and Their Automatic Translation to French Using NooJ
Abstract
In this paper, we propose a rule-based method whose purpose is to remove the semantico-syntactic ambiguity of Arabic movement and speech verbs and to translate the disambiguated verbs into French. This method consists of two main phases: semantico-syntactic disambiguation and automatic translation from Arabic to French.
The semantico-syntactic disambiguation phase can be subdivided into three main steps. The first step analyzes and adapts the syntactic patterns used for nominal sentences. The second step builds a dictionary of Arabic movement and speech verbs and a dictionary of Arabic nouns based on these syntactic patterns. The third step identifies the disambiguation transducers based on the identified dictionaries, a morphological grammar and the syntactic patterns. These transducers make it possible to remove the ambiguity of a verb by assigning it an Arabic meaning.
The purpose of the automatic translation phase is to improve the semantico-syntactic disambiguation phase, since assigning an Arabic meaning to a verb is not always sufficient to disambiguate it. This phase rests on translation transducers that address several problems in translating from Arabic to French, the identified dictionary of Arabic movement and speech verbs, and a dictionary of French verbs.
The ideas of the proposed approach were validated by a prototype implemented in JAVA and using the NooJ platform.
Experimenting with the two phases enabled us, on the one hand, to test the feasibility of the realized system and, on the other, to discern its limits. The evaluation metrics F-measure, precision and recall allowed us to evaluate the disambiguation phase. A comparative study with well-known translators supporting Arabic, such as Google, Reverso and Babylon, showed that the prototype gives satisfactory results.
Mariem Essid, Hela Fehri
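The evaluation metrics mentioned above have standard definitions, sketched below as a reminder; the example counts are hypothetical, not the paper's results:

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Standard metrics for evaluating a disambiguation phase:
    precision = TP/(TP+FP), recall = TP/(TP+FN), F-measure = harmonic mean."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for illustration only (not from the paper):
p, r, f = precision_recall_f1(true_positives=80, false_positives=10,
                              false_negatives=20)
# p ≈ 0.889, r = 0.800, f ≈ 0.842
```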
NooJ Grammars and Ethical Algorithms: Tackling On-Line Hate Speech
Abstract
The definition of "on-line hate speech" covers all forms of expression that propagate, incite, promote or justify hatred based on intolerance, including discrimination and hostility against minorities. The concept of hatred also includes sub-concepts such as homophobia, racism, chauvinism, terrorism, nationalism, and tolerance/intolerance. Specifically, on-line hate speech is used in cases of cyber-harassment to harm others deliberately, repeatedly and aggressively, so as to weaken victims psychologically. To counter this phenomenon, the EC has allocated a significant amount of H2020 funds to research projects whose goal is the construction of computer tools to locate, evaluate and eventually block on-line hate speech. Today, the automatic tackling of on-line hatred is a daily operation on social platforms such as Facebook, Twitter and Instagram. However, the algorithms these platforms use are stochastic/statistical and therefore unable to contextualize, syntactically and semantically, the words used inside posts. As a result, when tackling on-line hate speech, statistical algorithms may produce inaccurate or even false results, with rather serious consequences.
Mario Monteleone
Pastries or Soaps?
A Stylometric Analysis of Leonarda Cianciulli’s Manuscript and Other Procedural Documents, with NooJ
Abstract
This paper focuses on the analysis of the manuscript of Leonarda Cianciulli, an Italian serial killer better known as "The Soap-maker of Correggio", who murdered three women (Faustina Setti, Francesca Soavi, Virginia Cacioppo) between 1939 and 1940 and, according to her confessions, turned their bodies into soap and teacakes. The authenticity of her biographical memoirs has long been disputed: many scholars claim they were written by people who aimed to persuade the Court of Assize to limit the effects of the murder charge, because she was educated only to elementary-school level and was probably unable to write such a document of more than 700 pages. Data extracted from the manuscript were compared with other procedural documents: the interrogatories and the letters she wrote to her son while she was in the criminal asylum. The analysis is based on the methodology and tools of Computational Linguistics and has three main objectives:
  • to evaluate the assumptions of non-reliability of the manuscript, supported by different scholars and experts who examined the document;
  • to define a stylometric profile of the woman;
  • to detect the most significant traits of her “magical thinking”.
Specifically, the analysis was carried out with the NooJ software and led to the construction of local grammars able to extract information about her rituals and obsessions, her murder weapons and methods, her writing style and most-used expressions. The identification of stylistic markers plays an important role in forensic linguistics: preferences for certain grammatical constructions, contractions, spelling and punctuation, together with the presence of linguistic mistakes, can provide relevant indicators for inferences about the author's motivations and characterization. The analysis gives a detailed portrait of a tormented woman whose real motives still generate arguments and debates among scholars today.
Sonia Lay
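Stylometric profiles of the kind described above often start from the relative frequencies of function words, which authors use largely unconsciously. The sketch below illustrates the idea with invented English markers and an invented sentence; a study of Cianciulli's documents would of course use Italian markers and the actual corpora:

```python
from collections import Counter
import re

def function_word_profile(text, markers):
    """Relative frequency of selected function words,
    a common stylometric marker for authorship comparison."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {w: counts[w] / total for w in markers}

# Illustrative English markers and sentence (hypothetical example).
markers = ["the", "and", "of"]
profile = function_word_profile("The cook and the judge spoke of the trial.",
                                markers)
```

Comparing such profiles across the manuscript, the interrogatories and the letters is one quantitative way to test whether they plausibly share an author.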
Improvement of Arabic NooJ Parser with Disambiguation Rules
Abstract
Annotating sentences is important for exploiting the different features of Arabic corpora, and such annotation requires a robust analyzer. In this paper we therefore present an improvement of our previous analyzer. We first describe that analyzer, with its advantages and gaps. We then choose a method of improvement inspired by the former one. Finally, we put forward the implementation and experimentation of our new cascade of transducers in the NooJ platform. The obtained results appear satisfactory.
Nadia Ghezaiel Hammouda, Kais Haddar
NooJ App Optimization
Abstract
Most present-day systems are never finished or complete. They often need to undergo changes, to accommodate new user requirements or data formats, fix bugs, improve efficiency or adapt to a new operating environment. This set of procedures, called "software maintenance", is an important phase in the life cycle of any system. Maintaining a system is more difficult than developing it, but it is not difficult to maintain a maintainable system, one that is extensible and adaptable to future changes. A model-driven system can be considered maintainable: it results from the transformation of platform-independent models. In this work, we focus on an approach to automate the process of software maintenance: a model-driven software evolution concept based on the Architecture-Driven Modernization (ADM) approach, in which models replace source code as the key artifact. The objective of the entire process is to build a NooJ web application that is error-free, easy to modify and ready to receive new features. The process comprises three phases: model-driven reengineering, refinement and model-driven migration.
Zineb Gotti, Samir Mbarki, Naziha Laaz, Sara Gotti
Backmatter
Metadata
Title
Formalizing Natural Languages with NooJ 2018 and Its Natural Language Processing Applications
Editors
Ignazio Mauro Mirto
Mario Monteleone
Max Silberztein
Copyright Year
2019
Electronic ISBN
978-3-030-10868-7
Print ISBN
978-3-030-10867-0
DOI
https://doi.org/10.1007/978-3-030-10868-7