
About this book

The International Conference on Computational Processing of Portuguese (PROPOR) is the main event in the area of natural language processing that is focused on Portuguese and the theoretical and technological issues related to this language. It welcomes contributions for both written and spoken language processing. The event is hosted in Brazil and in Portugal. The meetings have been held in Lisbon/Portugal (1993), Curitiba/Brazil (1996), Porto Alegre/Brazil (1998), Évora/Portugal (1999), Atibaia/Brazil (2000), Faro/Portugal (2003), Itatiaia/Brazil (2006) and Aveiro/Portugal (2008). This meeting has been a highly productive forum for advancing this area and for fostering cooperation among the researchers working on the automated processing of the Portuguese language. PROPOR brings together research groups, promoting the development of methodologies, resources and projects that can be shared among all researchers and practitioners in the field. The ninth edition of this event was held in Porto Alegre, Brazil, at Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS). It had two main tracks: one for language processing and another one for speech processing. This event hosted a special Demonstration Session and the first edition of the PhD and MSc Dissertation Contest, which aimed at recognizing the best academic work on processing of the Portuguese language in the last few years. This edition of the event featured tutorials on statistical machine translation and on speech recognition, as well as invited talks by renowned researchers of natural language processing.



Applications: Information Handling

Improving IdSay: A Characterization of Strengths and Weaknesses in Question Answering Systems for Portuguese

IdSay is a Question Answering system for Portuguese that participated in QA@CLEF 2008 with a baseline version (IdSayBL). Despite the encouraging results, there was still much room for improvement. The participation of six systems in the Portuguese task, with very good results either individually or in a hypothetical combination run, provided a valuable source of information. We analyzed all the answers submitted by all systems to identify their strengths and weaknesses. We used the conclusions of that analysis to guide our improvements, keeping in mind the two key characteristics we want for the system: efficiency in terms of response time and robustness in handling different types of data. As a result, an improved version of IdSay was developed, whose most important enhancement is the introduction of semantic information. We obtained significantly better results, with first-answer accuracy rising from 32.5% in IdSayBL to 50.5% in IdSay, without degradation of response time.

Gracinda Carvalho, David Martins de Matos, Vitor Rocio

Assessing the Impact of Stemming Accuracy on Information Retrieval

The quality of stemming algorithms is typically measured in two different ways: (i) how accurately they map the variant forms of a word to the same stem; or (ii) how much improvement they bring to Information Retrieval. In this paper, we evaluate different Portuguese stemming algorithms in terms of accuracy and in terms of their aid to Information Retrieval. The aim is to assess whether the most accurate stemmers are also the ones that bring the biggest gain in Information Retrieval. Our results show that some kind of correlation does exist, but it is not as strong as one might have expected.

Felipe N. Flores, Viviane P. Moreira, Carlos A. Heuser

Exploiting Multilingual Grammars and Machine Learning Techniques to Build an Event Extraction System for Portuguese

We describe a methodology for building event extraction systems. The approach is based on multilingual domain-specific grammars and exploits weakly supervised machine learning algorithms for lexical acquisition. We report on the process of adapting an already existing event extraction system for the domain of conflicts and crises to the Portuguese language.

Vanni Zavarella, Hristo Tanev, Jens Linge, Jakub Piskorski, Martin Atkinson, Ralf Steinberger

Formalizing CST-Based Content Selection Operations

This paper presents the definition and formalization of content selection operations based on CST (Cross-document Structure Theory) for multidocument summarization purposes.

Maria Lucía Castro Jorge, Thiago Alexandre Salgueiro Pardo

Applications: Text Processing

Translating from Complex to Simplified Sentences

We address the problem of simplifying Portuguese texts at the sentence level by treating it as a “translation task”. We use the Statistical Machine Translation (SMT) framework to learn how to translate from complex to simplified sentences. Given a parallel corpus of original and simplified texts, aligned at the sentence level, we train a standard SMT system and evaluate the “translations” produced using both standard SMT metrics like BLEU and manual inspection. Results are promising according to both evaluations, showing that while the model is usually overcautious in producing simplifications, the overall quality of the sentences is not degraded and certain types of simplification operations, mainly lexical, are appropriately captured.

Lucia Specia
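The abstract above evaluates the "translations" with standard SMT metrics such as BLEU. As a rough illustration of what such a metric measures, the following is a minimal sketch of a simplified, single-reference, sentence-level BLEU score (clipped n-gram precision plus brevity penalty). It is not the evaluation setup used in the paper: standard BLEU is computed at corpus level, usually with n-grams up to length 4, and the function names and the example sentence pair here are invented for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / total


def sentence_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: candidates shorter than the reference are penalised.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * math.exp(log_mean)


# Invented example pair (not taken from the paper's corpus).
reference = "o gato dorme no sofá".split()
candidate = "o gato dorme na cama".split()
score = sentence_bleu(candidate, reference)
```

Because a simplification system that changes very little will share most n-grams with its input, a score of this kind rewards cautious output, which is one reason the abstract pairs it with manual inspection.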

Challenging Choices for Text Simplification

In this paper we discuss particular choices we made during the development of a rule-based syntactic text simplification system. Such choices concern 1) how to deal with adverbial phrases in order to simplify sentences, and 2) the order in which to apply our set of simplification rules. Adverbial phrases have not been considered by previous work on text simplification, but have a considerable impact on the complexity of a sentence. Considering our whole set of simplification rules, we discuss and compare two different orders in which to apply them: empirical and hierarchical.

Caroline Gasperin, Erick Maziero, Sandra M. Aluísio

Comparing Sentence-Level Features for Authorship Analysis in Portuguese

In this paper we compare the robustness of several types of stylistic markers for discriminating authorship at the sentence level. We train an SVM-based classifier using each set of features separately and perform sentence-level authorship analysis over a corpus of editorials published in a Portuguese quality newspaper. Results show that features based on POS information, punctuation and word/sentence length contribute to a more robust sentence-level authorship analysis.

Rui Sousa-Silva, Luís Sarmento, Tim Grant, Eugénio Oliveira, Belinda Maia

Language Processing

A Machine Learning Approach to Portuguese Clause Identification

In this work, we apply and evaluate a machine-learning-based system for Portuguese clause identification. To the best of our knowledge, this is the first machine-learning-based approach to this task. The proposed system is based on Entropy Guided Transformation Learning. In order to train and evaluate the proposed system, we derive a clause-annotated corpus from the Bosque corpus of the Floresta Sintá(c)tica Project, a European and Brazilian Portuguese treebank. We add part-of-speech (POS) tags to the derived corpus by using an automatic state-of-the-art tagger. Additionally, we use a simple heuristic to derive a phrase-chunk-like (PCL) feature from phrases in the Bosque corpus. We train an extractor for this sub-task and use it to automatically include the PCL feature in the derived clause corpus. We use POS and PCL tags as input features in the proposed clause identifier. This system achieves an $F_{\beta=1}$ of 73.90 when using the gold-standard values of the PCL feature. When the automatic values are used, the system obtains $F_{\beta=1}$ = 69.31. These are promising results for a first machine learning approach to Portuguese clause identification. Moreover, these results are achieved using a very simple PCL feature, which is generated by a PCL extractor developed with very little modeling effort.

Eraldo R. Fernandes, Cícero N. dos Santos, Ruy L. Milidiú
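For readers unfamiliar with the score reported in the abstract above, the F-measure combines precision $P$ and recall $R$ of the identified clauses. This is the standard definition, stated here as background; the abstract itself gives only the final F values, not the underlying precision and recall.

```latex
F_{\beta} = \frac{(1 + \beta^{2})\, P\, R}{\beta^{2} P + R},
\qquad
F_{\beta=1} = \frac{2\, P\, R}{P + R}
```

With $\beta = 1$ the measure is the harmonic mean of precision and recall, so it is high only when both are high.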

A Hybrid Approach for Multiword Expression Identification

Considerable attention has been given to the problem of Multiword Expression (MWE) identification and treatment, for NLP tasks like parsing and generation, to improve the quality of results. Statistical methods have often been employed for MWE identification, as an inexpensive and language-independent way of finding co-occurrence patterns. On the other hand, more linguistically motivated methods for identification, which employ information such as POS filters and lexical alignment between languages, can produce more targeted candidate lists. In this paper we propose a hybrid approach that combines the strengths of different sources of information using a machine learning algorithm to produce more robust and precise results. Automatic evaluation on gold standards shows that the performance of our hybrid method is superior to the individual results of statistical and alignment-based MWE extraction approaches for Portuguese and for English. This method can be used to aid lexicographic work by providing a more targeted MWE candidate list.

Carlos Ramisch, Helena de Medeiros Caseli, Aline Villavicencio, André Machado, Maria José Finatto
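The abstract above refers to statistical methods that find co-occurrence patterns. One widely used measure of this kind is pointwise mutual information (PMI) over adjacent word pairs; the sketch below illustrates it on a toy token list. PMI is chosen here only as a familiar example: the abstract does not say which association measures the authors used, and the function name and data are invented.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """PMI (base 2) for every adjacent word pair in a token list.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with probabilities
    estimated by relative frequency in the given tokens.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy input, invented for illustration only.
toy_scores = pmi_scores("a b a b".split())
```

Pairs that occur together more often than their individual frequencies would predict receive a higher score, which makes them candidates for a multiword expression list. In a hybrid setup such scores would be one feature among several.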

Out-of-the-Box Robust Parsing of Portuguese

In this paper we assess to what extent the available Portuguese treebanks and available probabilistic parsers are suitable for out-of-the-box robust parsing of Portuguese. We also announce the release of the best parser coming out of this exercise, which is, to the best of our knowledge, the first robust parser widely available for Portuguese.

João Silva, António Branco, Sérgio Castro, Ruben Reis

LXGram: A Deep Linguistic Processing Grammar for Portuguese

In this paper we present LXGram, a general purpose grammar for the deep linguistic processing of Portuguese that delivers high precision grammatical analysis and detailed meaning representations. We present the main design features and evaluation results on the grammar’s coverage as well as its ability to produce correct grammatical analyses.

Francisco Costa, António Branco

Language Resources

InferenceNet.Br: Expression of Inferentialist Semantic Content of the Portuguese Language

Often, the information necessary for a complete understanding of texts is implicit, which requires drawing inferences from the use of concepts in the linguistic praxis. We consider that the usual semantic reasoners of natural language systems face difficulties in capturing this knowledge, due mainly to the lack of linguistic-semantic resources that support reasoning of this nature. This paper presents a new linguistic resource that expresses semantic-inferentialist knowledge for the Portuguese language – InferenceNet.Br – containing a base of concepts and a base of sentence patterns. These bases provide content for a top layer of semantic reasoning in natural language systems, where semantic relations are considered according to their roles in inferences, as premises or conclusions. This linguistic resource was used in a system for extracting information about crime, and the results of this proof of concept are discussed.

Vladia Pinheiro, Tarcisio Pequeno, Vasco Furtado, Wellington Franco

Comparing Verb Synonym Resources for Portuguese

In this paper we compare verb synonym information contained in four publicly available lexical-semantic resources for Portuguese: TeP, PAPEL, Wiktionary and OpenThesaurusPT. We quantify the extent to which verb synonymy information in the four resources overlaps, and we quantify how much novelty each resource brings in comparison to the others. We demonstrate that the four resources vary with respect to verb synonymy information. Also, we show that by merging the four resources we can obtain a more comprehensive verb thesaurus. Finally, we suggest that resource merging may actually be required in order to avoid the evaluation bias that arises from coverage problems when using only one of these resources.

Jorge Teixeira, Luís Sarmento, Eugénio Oliveira
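The overlap and novelty quantities mentioned in the abstract above can be illustrated with plain set operations. The sketch below assumes each resource is reduced to a set of unordered synonym pairs, uses Jaccard overlap between resources, and defines novelty as the share of a resource's pairs found in no other resource. These definitions, the placeholder resource names A, B and C, and the sample pairs are assumptions for illustration, not the paper's actual measures or data.

```python
def pairwise_overlap(resources):
    """Jaccard overlap for every pair of resources.

    `resources` maps a resource name to a set of synonym pairs.
    """
    names = sorted(resources)
    overlap = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            union = resources[a] | resources[b]
            shared = resources[a] & resources[b]
            overlap[(a, b)] = len(shared) / len(union) if union else 0.0
    return overlap


def novelty(resources):
    """Fraction of each resource's pairs that appear in no other resource."""
    result = {}
    for name, pairs in resources.items():
        others = set().union(
            *(p for other, p in resources.items() if other != name))
        result[name] = len(pairs - others) / len(pairs) if pairs else 0.0
    return result


def pair(v1, v2):
    """Unordered synonym pair, so (v1, v2) and (v2, v1) compare equal."""
    return frozenset({v1, v2})


# Hypothetical toy resources; not real entries from any of the four resources.
toy = {
    "A": {pair("falar", "dizer"), pair("começar", "iniciar")},
    "B": {pair("dizer", "falar"), pair("acabar", "terminar")},
    "C": {pair("acabar", "terminar")},
}
toy_overlap = pairwise_overlap(toy)
toy_novelty = novelty(toy)
merged = set().union(*toy.values())  # the merged thesaurus
```

The merged set is at least as large as any single resource, which is the sense in which merging yields a more comprehensive thesaurus.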

Auxiliary Verbs and Verbal Chains in European Portuguese

This paper describes auxiliary verb constructions in European Portuguese with a view to parsing them correctly in a fully integrated NLP chain. The paper provides data on these constructions over a large corpus and evaluates the performance of the parsing system.

Jorge Baptista, Nuno Mamede, Fernando Gomes

P-AWL: Academic Word List for Portuguese

This paper presents and discusses the methodology for the construction of an Academic Word List for Portuguese, P-AWL, inspired by its English equivalent. The aim of this linguistic resource is to provide a solid base for future studies and applications in Computer Assisted Language Learning, while maintaining comparability with similar resources.

Jorge Baptista, Neuza Costa, Joaquim Guerra, Marcos Zampieri, Maria Cabral, Nuno Mamede

Speech Recognition

Automatic Phone Clustering Based on Confusion Matrices

Phone recognition experiments give information about the confusions between phones. Grouping the most confusable phones and making a multilevel hierarchical classification should improve phone recognition. In this paper a clustering method based on the phone confusion matrix is investigated for the data-driven generation of phonetic broad classes (PBC) of the Portuguese language. The method is based on a statistical similarity measure rather than on acoustic or phonetic knowledge. Results are presented for two phone recognisers (TIMIT corpus and Portuguese TECNOVOZ database).

Carla Lopes, Arlindo Veiga, Fernando Perdigão
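The idea in the abstract above, grouping the phones that a recogniser confuses most often, can be sketched as agglomerative clustering over a confusion matrix. The similarity measure (the mean of the two row-normalised confusion rates), the average-linkage merging rule, and the toy matrix below are all assumptions made for illustration; the abstract does not specify the authors' similarity measure or clustering procedure.

```python
def confusion_similarity(conf):
    """Symmetric similarity between phones i and j: the mean of the rate at
    which i is recognised as j and the rate at which j is recognised as i.
    Assumes every row has at least one observation."""
    n = len(conf)
    rate = [[conf[i][j] / sum(conf[i]) for j in range(n)] for i in range(n)]
    return [[(rate[i][j] + rate[j][i]) / 2 for j in range(n)]
            for i in range(n)]


def cluster_phones(phones, conf, n_classes):
    """Merge the two most confusable clusters (average linkage) until
    only `n_classes` clusters remain."""
    sim = confusion_similarity(conf)
    clusters = [[i] for i in range(len(phones))]
    while len(clusters) > n_classes:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                score = sum(sim[i][j] for i, j in pairs) / len(pairs)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(phones[i] for i in c) for c in clusters]


# Invented confusion counts (rows: spoken phone, columns: recognised phone).
phones = ["p", "b", "s", "z"]
conf = [
    [80, 15, 3, 2],
    [12, 82, 2, 4],
    [2, 3, 78, 17],
    [1, 4, 20, 75],
]
broad_classes = cluster_phones(phones, conf, n_classes=2)
```

On this toy matrix the two plosives and the two fricatives end up in separate classes purely from the confusion counts, without any phonetic knowledge being supplied.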

An Open-Source Speech Recognizer for Brazilian Portuguese with a Windows Programming Interface

This work is part of the effort to develop a speech recognition system for Brazilian Portuguese. The resources for the training and test stages of this system, such as corpora, pronunciation dictionary, language and acoustic models, are publicly available. Here, an application programming interface is proposed in order to facilitate using the open-source Julius speech decoder. Performance tests are presented, comparing the developed systems with commercial software.

Patrick Silva, Pedro Batista, Nelson Neto, Aldebaro Klautau

A Baseline System for Continuous Speech Recognition of Brazilian Portuguese Using the West Point Brazilian Portuguese Speech Corpus

Despite the availability of several speech corpora that can be used to build automatic speech recognition systems, there are only a few corpora for the Brazilian Portuguese (BP) language. This lack of corpora hinders extensive and in-depth research on continuous speech recognition systems for BP. In this work, we present a baseline system for continuous speech recognition for BP and its results using the West Point Brazilian Portuguese Corpus. In addition to the results, the resources developed to build the system are made available for continuing the research on such systems for BP.

Fabiano Weimar dos Santos, Dante Augusto Couto Barone, André Gustavo Adami

Speech Synthesis

Voice Quality of European Portuguese Emotional Speech

In this paper we investigate parameters related to voice quality in European Portuguese (EP) emotional speech. Our main objectives were to obtain, to our knowledge for the first time, values for the parameters commonly considered in acoustic analyses of emotional speech and to investigate whether EP differs from the results obtained for other languages. A small corpus covering five emotions (joy, sadness, despair, fear, cold anger) and neutral speech, produced by a professional actor, was used. Parameters investigated include fundamental frequency (F0), jitter, shimmer and harmonics-to-noise ratio (HNR). In general, results were in accordance with the consulted literature regarding F0 and HNR. For jitter and shimmer our results were, in certain aspects, similar to the ones reported in a study of emotional speech for Spanish, another Latin language. From our analyses, and taking into consideration the reduced size of our corpus and the use of an actor as informant, no clear EP characteristic emerged, except for a possible difference regarding joy, which showed values similar to neutral speech and still needs confirmation.

Ana Nunes, Rosa Lídia Coimbra, António Teixeira
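Two of the voice-quality parameters named in the abstract above, jitter and shimmer, measure cycle-to-cycle irregularity in the glottal period and in the peak amplitude respectively. The sketch below computes the common "local" variants of both, assuming the per-cycle periods and amplitudes have already been extracted from the signal. The abstract does not state which jitter and shimmer variants the authors measured, and the function name and sample values here are invented.

```python
def local_perturbation(values):
    """Mean absolute difference between consecutive cycles, divided by the
    mean value. Applied to periods this is local jitter; applied to peak
    amplitudes it is local shimmer."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


# Invented per-cycle measurements, for illustration only.
periods_s = [0.010, 0.011, 0.010, 0.011]   # glottal periods in seconds
amplitudes = [0.80, 0.78, 0.82, 0.79]      # peak amplitudes (arbitrary units)

jitter = local_perturbation(periods_s)
shimmer = local_perturbation(amplitudes)
```

Both values are ratios and are often reported as percentages; a perfectly regular voice would give zero for each.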

Prosodic Prediction in Brazilian Portuguese: A Contribution to Speech Synthesis

The prosodic prediction phenomenon concerns a listener's ability to predict what will be said from the melodic modulation used in the initial part of a full sentence. This prediction concerns the semantic value and/or the syntactic relations in the sentence. The existence of this phenomenon was confirmed for Brazilian Portuguese, considering the relations between main clauses and subordinate adverbial clauses. Using speech resynthesis, it was possible to identify the specific fundamental frequency and durational behavior responsible for the predictive value in the main clauses. Implementing an algorithm that takes this predictive phenomenon into account in a speech synthesis system should produce more natural-sounding output.

Cirineu Cecote Stein

The Role of Morphology in Generating High-Quality Pronunciation Lexica for Regional Variants of Portuguese

Grapheme to phoneme (GTP) systems for languages such as English, German, and Korean have been shown to achieve better performance rates with the inclusion of a morpho-phonological preprocessing component. While semi-automatic and automatic GTP approaches for Portuguese continue to achieve steady gains, such algorithms do not take morphology into account, despite a growing need to do so, based in part on the recent spelling reform. This paper presents a pilot study in the development of the Portuguese Unisyn Lexicon (LUPo) for assessing the role of morphological information in the generation of high-quality pronunciation lexica for regional variants of Portuguese. Some problematic orthographic contexts are identified, along with the associated difficulties that arise when morphology is left out of the equation. Expanding from known issues that affect Portuguese GTP systems, new orthographic contexts stemming from the recent spelling reform are addressed.

Simone Ashby, José Pedro Ferreira

