2020 | Book

Formalizing Natural Languages with NooJ 2019 and Its Natural Language Processing Applications

13th International Conference, NooJ 2019, Hammamet, Tunisia, June 7–9, 2019, Revised Selected Papers


About this book

This book constitutes the refereed proceedings of the 13th International Conference, NooJ 2019, held in Hammamet, Tunisia, in June 2019. NooJ is a linguistic development environment that allows linguists to formalize several levels of linguistic phenomena. NooJ provides linguists with tools to develop dictionaries, regular grammars, context-free grammars, context-sensitive grammars and unrestricted grammars as well as their graphical equivalent to formalize each linguistic phenomenon. The 18 full papers presented were carefully reviewed and selected from 54 submissions. The papers are organized in the following tracks: Development of Linguistic Resources, Natural Language Processing Applications, NooJ for the Digital Humanities.

Table of Contents


Development of Linguistic Resources

Recognition of Arabic Phonological Changes by Local Grammars in NooJ
In this paper, we present how to use NooJ to recognize all transformations occurring on words following Arabic phonological changes. Our goal is to give the concerned phonological rule, its category, the cause and finally the origin of the word before any transformation. We describe the phonological changes by presenting the three main categories: assimilation, substitution, and weakening. Then, we recall our previous work in this field. We detail all the steps to adopt in order to achieve our goal. We present our classification of word forms, the dictionary, the inflectional grammar, the morphological grammar, and the local grammar. Finally, we present some examples and results.
Rafik Kassmi, Mohammed Mourchid, Abdelaziz Mouloudi, Samir Mbarki
Lexicon-Grammar Tables Development for Arabic Psychological Verbs
The identification of psychological verbs is very important in corpus analysis in order to give the polarity of a given text and define the emotional component. The classification of those verbs represents a challenge for linguists since they classify them according to their needs and their understanding. The aim of this paper is the identification and classification of Arabic psychological verbs through lexicon-grammar tables that are well structured, easy for linguists to use, and allow them to describe all the grammatical, syntactic and semantic characteristics of the lexicon. In this work we create lexicon-grammar tables of Arabic psychological verbs with about 400 verb entries in three main classes and subclasses, for use in lexical, syntactic and semantic analyzers. Using NooJ as an automatic natural language processing platform, we can automatically recognize Arabic psychological verbs by transforming our lexicon-grammar tables into NooJ dictionaries and syntactic grammars, enabling the detection of those verbs in texts and corpora.
Asmaa Amzali, Asmaa Kourtin, Mohammed Mourchid, Abdelaziz Mouloudi, Samir Mbarki
The Identification of English Non-finite Structures Using NooJ Platform
Non-finite clauses, clauses that have a non-finite verb phrase, have proven to be frequent and complex features of the structure of academic discourse in English. They are also indicators of writers’ good mastery of academic discourse structures [5, 20, 22]. Therefore, teaching non-finite clauses using specifically compiled and selected corpora is believed to have the potential to enhance the teaching of these problematic structures. The purpose of this paper is to present the application of NooJ software in the automatic detection and extraction of English non-finite clauses occurring in a business English corpus compiled from a student’s master dissertation, business editorials, and business academic research articles. For the purpose of the pedagogical application, we develop an annotation framework based on syntactic, semantic and discoursal patterns using the NooJ platform. Having implemented the annotation in the NooJ platform and obtained satisfactory results, we expect the syntactico-semantic and discoursal analysis of the target structure to facilitate the autonomous discovery of the rules of use governing English non-finite clauses in the English classroom, with the ultimate aim of enhancing students’ mastery of the elaborate and varied use of non-finite clauses in academic writing.
Olfa Ben Amor, Faiza Derbel
Automatic Recognition and Translation of Polysemous Verbs Using the Platform NooJ
In this work we study the phenomenon of verbal polysemy which poses a big challenge in the domain of automatic treatment of languages. This verbal polysemy constitutes not only a constraint to the automatic linguistic recognition, but also a hurdle in the French-Arabic automatic translation, especially in detecting the exact and precise equivalent in the target language. We will try to find solutions in the process of automatic translation to solve the problem of verbal polysemy using the NooJ platform.
Hajer Cheikhrouhou
Negation of Croatian Nouns
The purpose of this paper is to describe a morphological grammar for recognizing negation of a noun and annotating its polarity accordingly. Not all nouns can be negated on the morphological level. For example, nouns like ‘activity’ and ‘knowledge’ (aktivnost, znanje) can have negatives (neaktivnost, neznanje respectively), but the same is not the case for nouns such as ‘battle’ or ‘table’ (bitka or stol). The most common and frequent Croatian prefix for negation of nouns is ‘ne-’, although several more are used, either of Slavic (‘be-’, ‘bez-’) or Latin origin (‘anti-’, ‘dis-’). In some cases, negated nouns actually denote positive concepts, whereas their non-negated counterparts are used for expressing concepts with negative connotations. For this purpose, all the nouns in the NooJ dictionary that may have nouns in both polarities are provided with a [Polarity = pos] or [Polarity = neg] marker. This information is used in the grammar to switch the polarity of the opposite noun after the insertion of a negative prefix. The grammar is tested on different types of corpora and results are discussed.
Natalija Žanpera, Kristina Kocijan, Krešimir Šojat
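The polarity switch described in this abstract can be illustrated with a short Python sketch. It is not the authors' NooJ grammar: the two-entry lexicon, the polarity values assigned to it, the function name `analyze`, and the restriction to the prefix ‘ne-’ are all assumptions made for the example.

```python
# Hypothetical mini-lexicon: noun -> polarity marker, mirroring the
# [Polarity = pos] / [Polarity = neg] features mentioned in the abstract.
# The entries and their polarity values are illustrative assumptions.
LEXICON = {
    "aktivnost": "pos",  # 'activity'
    "znanje": "pos",     # 'knowledge'
}

# Only 'ne-' is handled here; the abstract also lists be-, bez-, anti-, dis-.
NEGATIVE_PREFIXES = ("ne",)


def analyze(word):
    """Return (base noun, polarity) for a known noun or a negated known
    noun, and None for anything else."""
    if word in LEXICON:
        return word, LEXICON[word]
    for prefix in NEGATIVE_PREFIXES:
        base = word[len(prefix):]
        if word.startswith(prefix) and base in LEXICON:
            # Inserting a negative prefix flips the polarity of the base noun.
            flipped = "neg" if LEXICON[base] == "pos" else "pos"
            return base, flipped
    return None
```

Nouns that cannot be negated morphologically, such as stol, are simply absent from the lexicon in this sketch, so a form like "nestol" is not recognized.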
The Automatic Generation of NooJ Dictionaries from Lexicon-Grammar Tables
Syntactic and semantic analyses constitute an important part of the automatic natural language processing field. Indeed, the complexity and the richness of the language make these tasks more difficult since they require the description of all the grammatical, syntactic and semantic features of the language lexicon. It is in this context that the lexicon-grammar approach has been introduced: it consists of describing the lexicon of the language through readable and intuitive tables for manual human editing. On the other hand, NooJ is an automatic natural language processing platform that includes different levels of analysis: lexical, morphological, syntactic and semantic. In order to integrate the lexicon-grammar approach into this platform, the tables must be transformed into dictionaries and syntactic grammars. However, setting up the dictionaries in NooJ from these tables is currently done manually, which makes it very time-consuming and error-prone. Hence, this work aims to develop a tool that automatically generates NooJ dictionaries from lexicon-grammar tables, saving time and fully exploiting the potential of the lexicon-grammar approach.
Asmaa Kourtin, Asmaa Amzali, Mohammed Mourchid, Abdelaziz Mouloudi, Samir Mbarki
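The table-to-dictionary transformation described in this abstract can be sketched in Python as follows. This is a minimal illustration, not the authors' tool: the table contents, the column names and the function `table_to_entries` are hypothetical, and the output only loosely follows the `lemma,CATEGORY+feature` shape of NooJ dictionary lines.

```python
import csv
import io

# Hypothetical lexicon-grammar table: one row per verb, one column per
# property, with "+" meaning the property holds and "-" meaning it does not.
TABLE = """verb,N0_human,N1_human,passive
frighten,-,+,+
love,+,-,+
"""


def table_to_entries(table_text, category="V"):
    """Turn each table row into a line of the form lemma,CATEGORY+prop+prop."""
    entries = []
    for row in csv.DictReader(io.StringIO(table_text)):
        lemma = row.pop("verb")
        # Keep only the properties marked "+" for this lemma.
        features = [name for name, mark in row.items() if mark == "+"]
        entries.append(",".join([lemma, "+".join([category] + features)]))
    return entries
```

Generating the lines from the table, rather than typing them by hand, is what removes the manual, error-prone step the abstract mentions.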

Natural Language Processing Applications

The Data Scientist on LinkedIn: Job Advertisement Corpus Processing with NooJ
For organizations using big data, one of the most important elements for reaching tangible results is exploiting human resources: it is not possible to manage data without using them intelligently. Considering human intervention in relation to big data means calling into question the so-called “data scientist”. With this in mind, the main aim of this study is to use the linguistic software environment NooJ to process a large corpus of job advertisements for data scientists in Italy, collected on the business-networking site LinkedIn. By creating specific linguistic resources with NooJ, we are able to identify the skills most required by companies and organizations.
When searching for the ideal candidate to hire, companies pay equal attention to technical skills and soft skills, in particular the capacity to work in a team and communicate concerns. Finally, our research confirmed that studying the context in which single words are inserted represents a key step in the process of information extraction from texts.
Maddalena della Volpe, Francesca Esposito
Mining Entrepreneurial Commitment in University Communication: Evidence from Italy
In recent years, the study of language has assumed a central role within the representation of complex systems, triggering greater interdisciplinarity among separate fields. Thus, new perspectives of analysis are stimulated: using computational linguistic tools to evaluate the impact of language in specific contexts and to understand how social and economic phenomena develop. Adopting the Lexicon-Grammar theoretical framework, we used the NooJ application to process a corpus of free texts gathered from official university websites in order to explore the hidden intentions in Italian universities’ web communication. Moreover, we created local grammars with single and compound words associated with the different missions of the university: teaching, research and third mission. The outputs demonstrate that teaching is the most common topic emerging from universities’ web communication. Beyond organizational aspects, renewing universities’ purpose implies the ability to communicate the strategic goals effectively, defining the university’s role in society while aiming to engage several players.
Maddalena della Volpe, Francesca Esposito
Disambiguation for Arabic Question-Answering System
Because of the increasing amount of Arabic content on the Internet and the increasing demand for information, Arabic question answering (QA) systems are gaining great importance. Nevertheless, automatic answering of questions in natural language is one of natural language processing’s most challenging tasks. In this paper, we address the issue of Arabic question answering in the medical domain, where there are several specific challenges. The main challenge in dealing with the medical field in the Arabic language is the need to resolve ambiguity. This issue, though, has not been thoroughly studied in related works. Therefore, our QA system requires a disambiguation solution to select the correct meaning in order to return the correct answer. The goal of this work is to resolve Arabic-related ambiguities as well as medical-related ambiguities. To achieve this goal, we use dictionaries and transducers built with the NooJ platform to answer any factoid or complex medical question. Experiments on the disambiguation task of our Arabic medical question answering system show interesting results.
Sondes Dardour, Héla Fehri, Kais Haddar
Recognition and Analysis of Opinion Questions in Standard Arabic
Nowadays, most question-answering systems have been designed to answer factoid or binary questions (looking for short and precise answers such as dates and locations); however, little research has been carried out to study complex questions.
In this paper, we present a method for analyzing medical opinion questions. The analysis of the question asked by the user, by means of a pattern-based analysis, covers the syntactic as well as the morphological levels. These linguistic patterns allow us to annotate the question and its semantic features by extracting the focus and topic of the question.
We start with the implementation of the identifying rules and the annotation of the various medical named entities. Our named entity recognizer (NER) tool is able to find references to people, places, organizations, diseases and viruses as targets for extracting the correct answer for the user. The NER is embedded in our question answering system. The task of QA is divided into four phases: question analysis, segmentation, passage retrieval, and answer extraction. Each phase plays a crucial role in the overall performance.
We use the NooJ platform, which represents a valuable linguistic development environment. The first evaluations show that the actual results are encouraging and that the approach could be deployed for further question types.
Essia Bessaies, Slim Mesfar, Henda Ben Ghezala
A NooJ Tunisian Dialect Translator
The elaboration of a translation system from Arabic dialect to modern standard Arabic has become an important task in Natural Language Processing applications in recent years. In this context, we are interested in building a translator from Tunisian dialect to modern standard Arabic. In fact, Tunisian dialect is a variant of Arabic, yet it differs from modern standard Arabic. Besides, it is difficult for non-Tunisian people to understand. To elaborate our translator, we study many Tunisian dialect corpora to identify and investigate different phenomena, such as Tunisian dialect word morphology and Tunisian dialect sentences. The proposed translation method is based on a bilingual dictionary extracted from the study corpus and an elaborated set of local grammars. In addition, the local grammars are transformed into finite state transducers using new technologies of the NooJ linguistic platform. To test and evaluate the designed translator, we apply it to a Tunisian dialect test corpus containing more than 18,000 words. The obtained results are ambitious.
Roua Torjmen, Nadia Ghezaiel Hammouda, Kais Haddar
Automatic Text Generation: How to Write the Plot of a Novel with NooJ
Automatic Text Generation (ATG) is a Natural Language Processing (NLP) task that aims at writing acceptable and grammatical written text by exploiting machine-representation systems, such as knowledge bases, taxonomies and ontologies. In this sense, it is possible to state that an ATG system works like a translator that converts data into a natural-language written representation. The methods to produce the final texts may differ from those used by compilers, due to the inherent expressivity of natural languages.
ATG is not a recent discipline, even if commercial ATG technology has only recently become widely available. Today, many software environments cope with ATG, such as Text Spinner, DKB Lettere, or textOmatic*Composer, just to mention a few.
Mario Monteleone

NooJ for the Digital Humanities

Arabic Learning Application to Enhance the Educational Process in Moroccan Mid-High Stage Using NooJ Platform
The article presents a learning web application, which contributes to enhancing the educational process of the Arabic language, especially in Moroccan Mid-High (schools). We use the NooJ linguistic platform [1] to analyze the given syllabus. NooJ’s linguistic engine, with its Text Annotation Structure (TAS), returns an annotation file after performing the linguistic analysis. The application relies on this returned annotation file to recognize and represent both nouns and adjectives that occur in the lesson.
We assume that learners in the preparatory stage must be able to distinguish between nouns, verbs and other grammatical categories, but that they are not able to extract more sophisticated linguistic features, e.g. the pattern of a noun that has a duplicated root. The difficulty lies in the changes that occur in this kind of word, and also in the rule or rules used to generate a certain Broken Plural Form (henceforth BPF) from a singular noun or adjective. The application is meant to recognize and represent these sophisticated linguistic features and provide them to the learner.
The representation process provides the learner with the linguistic features of any noun or adjective that occurs in the lesson, e.g. the application is able to return the root, singular pattern, gender, grammatical category, Broken Plural pattern or patterns, and other morphological, phonological and semantic features of a certain noun or adjective. The application can also examine the rule or rules used to generate BPFs from their singular forms.
The lesson is divided into two main sections: the theoretical section, where the learner is able to extract the linguistic features of any noun or adjective, and the practical section, which aims to make the learner capable of recognizing these linguistic features and representing them in any text. In this first version we provide only one lesson, which is the Arabic Broken Plural lesson.
Ilham Blanchete, Mohammed Mourchid, Samir Mbarki, Abdelaziz Mouloudi
Causal Discourse Connectors in the Teaching of Spanish as a Foreign Language (SLF) for Portuguese Learners Using NooJ
Our paper focuses on the teaching of causal discourse connectors to learners of Spanish as a foreign language (SFL) whose mother tongue is Portuguese. It relies on the project about the pedagogical application of NooJ carried out by the IES_UNR research group since 2015, which mainly follows [11] and [12], and which makes use of [13]. The contrastive analysis in Portuguese is based on [14]. To develop discourse strategies for text comprehension and production, we implemented tags related to discursiveness and causality. Discourse connectors or markers may be understood as “constituents that exceed the limit of units such as the word, the phrase or the sentence” [7]. As cause and consequence concur, they involve the use of causal discourse connectors such as porque, ya que, gracias a, in Spanish, and porque, já que, graças a, in Portuguese (because, since, thanks to). We created dictionaries and grammars including two new features: Connector [C] (to name discourse connectors), and causal [+caus] (to identify causal discourse connectors). These features can be more effective for learners of Spanish, especially the one related to causality, since they refer to more general semantic knowledge.
Andrea Rodrigo, Silvia Reyes, Cristina Mota, Anabela Barreiro
Construction of Educational Games with NooJ
Combining learning and entertainment is an interesting concept involving the game. The latter does not require any specific effort; it allows learning while having fun. Thanks to the game, the user is not under any pressure and progresses at his own pace. Using the game in teaching has several advantages. In fact, it represents a source of motivation and pleasure for the players. In addition, it is an opportunity to practise certain skills (language, reflection, actions). Moreover, it develops the consideration of rules and mutual respect. In this paper, we propose two educational games named: ProMoNooJ and AlphaNooJ.
Héla Fehri, Ines Ben Messaoud
Detecting Hate Speech Online: A Case of Croatian
This project proposes a NooJ algorithm for finding and categorizing various slurs, insults and, ultimately, hate speech in Croatian. The results also provide a more detailed insight into inappropriate language in Croatian. We strongly emphasize the ethical considerations of (mis)identifying hate speech and, as a result, an unethical and undeserved censorship of inappropriate, but free speech. Thus, we tried to make a clear distinction between insults and hate speech.
The test corpus consists of written online comments and remarks posted on five Croatian Facebook news pages during a one-week period. Given the differences between the standard Croatian grammar and syntax and what is actually being used in informal online communication, the false negatives present the biggest difficulty, since some variations (substandard usages of cases, spelling errors, colloquialisms) are impossible to predict and, therefore, extremely hard to implement in the algorithm.
Kristina Kocijan, Lucija Košković, Petra Bajac
Dealing with Producing and Consuming Expressions in Italian Sentiment Analysis
To perform aspect-level sentiment analysis in the context of customer reviews, it is necessary to deal not only with subjective opinions, which are well covered by the literature, but also with fact-implied opinions, which are less examined since they are less common and harder to handle. This work focuses on a specific type of fact-based opinion called producing and consuming expressions. For example, expressions like “this printer consumes a lot of ink” and “this washing-machine makes a lot of noise” are instances of this type of opinion. This research aims to build a tool that can identify these expressions, classify their sentiment polarities and determine their opinion targets. To achieve this task, we started by analyzing these expressions from a linguistic point of view. Then we developed a set of linguistic resources using the NooJ software. This set consists of one dictionary and one grammar. The dictionary contains verbs that express usage and production; adjectives and determiners that modify quantity, size, etc.; and some generic nouns of resources and wastes. The grammar can recognize and tag simple and complex sentences that are producing and consuming expressions. This grammar produces a tag that makes it possible to determine the sentiment of the expression simply by applying the rules of sentiment composition.
Nicola Cirillo
Rule Based Method for Terrorism, Violence and Threat Classification: Application to Arabic Tweets
In this paper, we present a rule-based method to classify tweets under three main categories: terrorism, violence and threat. Given that Arabic is a morphologically complex language, we build a linguistic module to identify a set of patterns for each class. Our proposed method requires three fundamental steps. First, we create our reference corpus, collected from Arabic tweets. Second, from the study of this corpus, we identify a set of linguistic rules. Finally, these patterns are rewritten into local grammars within the linguistic platform NooJ. The evaluation of our system achieved encouraging results, obtaining 84%, 86.8% and 84.7% in terms of recall, precision and F-score, respectively, when applied to the test corpus.
Wissam Elahsoumi, Ines Boujelben, Iskander Keskes
Formalizing Natural Languages with NooJ 2019 and Its Natural Language Processing Applications
Héla Fehri
Slim Mesfar
Max Silberztein