About this Book

This book constitutes the refereed proceedings of the 8th Language and Technology Conference: Challenges for Computer Science and Linguistics, LTC 2017, held in Poznan, Poland, in November 2017.

The 31 revised papers presented in this volume were carefully reviewed and selected from 108 submissions. The papers selected for this volume cover the following fields: Speech Processing; Multiword Expressions; Parsing; Language Resources and Tools; Ontologies and Wordnets; Machine Translation; Information and Data Extraction; Text Engineering and Processing; Applications in Language Learning; Emotions, Decisions and Opinions; Less-Resourced Languages.

Table of Contents

Frontmatter

Speech Processing

Intelligent Speech Features Mining for Robust Synthesis System Evaluation

Speech synthesis evaluation involves the analytical description of features sufficient to assess the performance of a speech synthesis system. Its primary focus is to determine the degree of resemblance of a synthetic voice to a natural human voice. The task of evaluation is usually driven by two methods, subjective and objective, which have become a regular standard for evaluating voice quality but are challenged by high speech variability as well as human discernment errors. Machine learning (ML) techniques have proven to be successful in the determination and enhancement of speech quality. Hence, this contribution utilizes both supervised and unsupervised ML tools to recognize and classify speech quality classes. Data were collected from a listening test (experiment), and the speech quality was assessed by domain experts for naturalness, intelligibility, comprehensibility, as well as tone, vowel and consonant correctness. During the pre-processing stage, a Principal Component Analysis (PCA) identified 4 principal components (intelligibility, naturalness, comprehensibility and tone), accounting for 76.79% of the variability in the dataset. An unsupervised visualization using a self-organizing map (SOM) then discovered five distinct target clusters with high densities of instances and showed modest correlation between significant input factors. Pattern recognition using a deep neural network (DNN) produced a confusion matrix with an overall performance accuracy of 93.1%, signifying an excellent classification system.

Moses E. Ekpenyong, Udoinyang G. Inyang, Victor E. Ekong
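
A minimal scikit-learn sketch of the PCA-plus-classifier idea described in this abstract is given below; the rating features, toy labels and synthetic data are illustrative assumptions, and the small neural network is a stand-in for the paper's DNN.

```python
# Sketch only: PCA over listener ratings, then a small classifier.
# Feature names, labels and data are invented for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
# Hypothetical listening-test ratings: one row per synthesized utterance, columns =
# intelligibility, naturalness, comprehensibility, tone, vowel, consonant correctness.
X = rng.uniform(1, 5, size=(500, 6))
y = (X[:, :4].mean(axis=1) > 3).astype(int)      # toy quality classes

pca = PCA(n_components=4)                        # 4 components, as in the abstract
X_pca = pca.fit_transform(X)
print("explained variance:", pca.explained_variance_ratio_.sum())

X_tr, X_te, y_tr, y_te = train_test_split(X_pca, y, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
clf.fit(X_tr, y_tr)
print(confusion_matrix(y_te, clf.predict(X_te)))
```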

Neural Networks Revisited for Proper Name Retrieval from Diachronic Documents

Developing high-quality transcription systems for very large vocabulary corpora is a challenging task. Proper names are usually key to understanding the information contained in a document. To increase the vocabulary coverage, a huge amount of text data should be used. In this paper, we extend previously proposed neural network word embedding models: the word vector representation proposed by Mikolov is enriched with an additional non-linear transformation. This model makes it possible to better capture lexical and semantic word relationships. In the context of broadcast news transcription and in terms of recall, experimental results show a good ability of the proposed model to select new relevant proper names.

Irina Illina, Dominique Fohr
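
The following numpy sketch illustrates the general idea of passing word vectors through an additional non-linear transformation before measuring relatedness; the vocabulary, vectors and transformation matrix are toy assumptions and do not reproduce the paper's trained model.

```python
# Sketch only: embeddings plus a non-linear transformation, then cosine retrieval.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["obama", "merkel", "election", "guitar"]   # hypothetical vocabulary
E = rng.normal(size=(len(vocab), 50))               # pretrained-style embeddings (toy)
W = rng.normal(size=(50, 50)) * 0.1                 # learned transformation (toy)

def transform(v):
    return np.tanh(W @ v)                           # non-linear enrichment

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = transform(E[vocab.index("election")])
scores = {w: cosine(query, transform(E[i])) for i, w in enumerate(vocab)}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```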

Cross-Lingual Adaptation of Broadcast Transcription System to Polish Language Using Public Data Sources

We present methods and procedures designed for cost-efficient adaptation of an existing speech recognition system to Polish. The system (originally built for the Czech language) is adapted using common texts and speech recordings accessible from Polish web pages. The most critical part, an acoustic model (AM) for Polish, is built in several steps, which include: (a) an initial bootstrapping phase that utilizes the existing Czech AM, (b) a lightly supervised iterative scheme for automatic collection and annotation of Polish speech data, and finally (c) acquisition of a large amount of broadcast data in an unsupervised way. The developed system has been evaluated in the task of automatic content monitoring of major Polish TV and radio stations. Its transcription accuracy (measured on a set of 4 complete TV news shows with a total duration of 105 min) is 79.2%. For clean studio speech, its accuracy exceeds 92%.

Jan Nouza, Petr Cerva, Radek Safarik

Automatic Transcription and Subtitling of Slovak Multi-genre Audiovisual Recordings

This paper summarizes recent progress in the development of an automatic transcription system for subtitling Slovak multi-genre audiovisual recordings, such as lectures, talks, discussions, broadcast news or TV/radio shows. The main concept is based on the application of current and innovative principles and methods oriented towards speech and language processing, automatic speech segmentation, speech recognition, statistical modeling, and adaptation of acoustic and language models to a specific topic, gender and speaking style of the speaker. We have developed a working prototype of an automatic transcription system for the Slovak language, mainly designed for subtitling various types of single- or multi-channel audiovisual recordings. Preliminary results show a significant relative decrease in word error rate, ranging from 2.40% to 47.10% for individual speakers, in fully automatic transcription and subtitling of Slovak parliament speeches, broadcast news and TEDx talks.

Ján Staš, Peter Viszlay, Martin Lojka, Tomáš Koctúr, Daniel Hládek, Jozef Juhár

Multiword Expressions

SEJF – A Grammatical Lexicon of Polish Multiword Expressions

We present SEJF, a lexical resource of Polish nominal, adjectival and adverbial multiword expressions. It consists of an intensional module with about 4,700 multiword lemmas assigned to 160 inflection graphs, and an extensional module with 88,000 automatically generated inflected forms annotated with grammatical tags. We show the results of its coverage evaluation against an annotated corpus. The resource is freely available under the Creative Commons BY-SA license.

Monika Czerepowicka, Agata Savary

Lemmatization of Multi-Word Entity Names for Polish Language Using Rules Automatically Generated Based on the Corpus Analysis

The article concerns automatic lemmatization of Multi-Word Units (MWUs) for highly inflective languages. We present an approach in which lemmatization is conducted using rules generated solely on the basis of a corpus analysis. The conducted experiments revealed that the accuracy of automatic lemmatization of MWUs for the Polish language with the developed approach may reach up to 82%.

Jacek Małyszko, Witold Abramowicz, Agata Filipowska, Tomasz Wagner
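
As a rough illustration of corpus-derived lemmatization rules, the sketch below learns suffix-rewriting rules from (form, lemma) pairs; it handles single words only and uses invented Polish examples, whereas the paper targets multi-word units.

```python
# Sketch only: learn "replace suffix X with Y" rules from (form, lemma) pairs.
from collections import Counter

pairs = [("komisji", "komisja"), ("ustawy", "ustawa"), ("szkoły", "szkoła")]  # toy data

def suffix_rule(form, lemma):
    # longest common prefix, then record the differing suffixes
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return (form[i:], lemma[i:])

rules = Counter(suffix_rule(f, l) for f, l in pairs)

def lemmatize(form):
    # apply the most frequent applicable rule
    for (src, tgt), _ in rules.most_common():
        if form.endswith(src):
            return form[: len(form) - len(src)] + tgt
    return form

print(lemmatize("gazety"))   # -> "gazeta" under the toy rules above
```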

Parsing

Parsing of Polish in Graph Database Environment

This paper describes the basic concepts and features of the Langusta system. Langusta is a natural language processing environment embedded in a graph database. The paper presents a rule-based syntactic parsing system for the Polish language using various linguistic resources, including those containing semantic information. The advantages of this approach are directly related to the deployment of the graph paradigm, in particular to the assumption that rules describing the syntax of the Polish language are valid queries in a graph database query language (Cypher).

Jan Posiadała, Hubert Czaja, Eliza Szczechla, Paweł Susicki
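
To make the "rules as graph queries" idea concrete, the hedged sketch below runs a hypothetical Cypher rule through the Neo4j Python driver; the Token nodes, NEXT edges and pos/form properties are assumptions made for illustration, not Langusta's actual data model.

```python
# Sketch only: a syntactic rule expressed as a Cypher query over a token graph.
# The schema and connection details are illustrative assumptions.
from neo4j import GraphDatabase

RULE = """
MATCH (a:Token {pos: 'adj'})-[:NEXT]->(n:Token {pos: 'subst'})
RETURN a.form AS adjective, n.form AS noun
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(RULE):
        print(record["adjective"], record["noun"])
driver.close()
```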

Language Resources and Tools

RetroC – A Corpus for Evaluating Temporal Classifiers

We present a corpus for training and evaluating systems for the dating of Polish texts. A number of baselines (using year references, knowledge of spelling reforms and birth years) are given for the temporal classification task. We also show that the problem can be viewed as a regression problem and that a standard supervised learning tool (Vowpal Wabbit) can be applied. So far, the best result has been achieved with supervised learning using word tokens and character 5-grams as features. In addition, an error analysis of the results obtained with the best solution is presented in this paper.

Filip Graliński, Piotr Wierzchoń
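
A minimal sketch of the regression view of text dating, with word-token and character 5-gram features, is shown below; scikit-learn stands in for the Vowpal Wabbit setup used in the paper, and the two example texts and years are invented.

```python
# Sketch only: text dating as regression over word tokens and character 5-grams.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_union, make_pipeline
from sklearn.linear_model import Ridge

texts = ["Rzeczpospolita obchodzi rocznicę odzyskania niepodległości",
         "Nowy smartfon trafił do sprzedaży w przedsprzedaży online"]
years = [1938.0, 2015.0]                      # toy training targets

features = make_union(
    TfidfVectorizer(analyzer="word"),
    TfidfVectorizer(analyzer="char", ngram_range=(5, 5)),
)
model = make_pipeline(features, Ridge())
model.fit(texts, years)
print(model.predict(["premiera nowego smartfona"]))
```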

Reinvestigating the Classification Approach to the Article and Preposition Error Correction

In this work, we reinvestigate the classifier-based approach to article and preposition error correction going beyond linguistically motivated factors. We show that state-of-the-art results can be achieved without relying on a plethora of heuristic rules, complex feature engineering and advanced NLP tools. A proposed method for detecting spaces for article insertion is even more efficient than methods that use a parser. We examine automatically trained word classes acquired by unsupervised learning as a substitution for commonly used part-of-speech tags. Our best models significantly outperform the top systems from CoNLL-2014 Shared Task in terms of article and preposition error correction.

Roman Grundkiewicz, Marcin Junczys-Dowmunt

Binary Classification Algorithms for the Detection of Sparse Word Forms in New Indo-Aryan Languages

This paper describes experiments in applying statistical classification algorithms for the detection of converbs – rare word forms found in historical texts in New Indo-Aryan languages. The digitized texts were first manually tagged with the help of a custom-made tool called IA Tagger, which enables semi-automatic tagging of the texts. One of the features of the system is the generation of statistical data on occurrences of words and phrases in various contexts, which helps perform historical linguistic analysis at the levels of morphosyntax, semantics and pragmatics. The experiments carried out on data annotated with the use of IA Tagger involved the training of multi-class and binary POS classifiers.

Rafał Jaworski, Krzysztof Jassem, Krzysztof Stroński

Multilingual Tokenization and Part-of-speech Tagging. Lightweight Versus Heavyweight Algorithms

This work focuses on the morphological analysis of raw text and provides a recipe for tokenization, sentence splitting and part-of-speech tagging for all languages included in the Universal Dependencies Corpus. Scalability is an important issue when dealing with large multilingual corpora. The experiments include both lightweight classifiers (linear models and decision trees) and heavyweight LSTM-based architectures that are able to attain state-of-the-art results. All the experiments are carried out using the provided data "as-is". We apply lightweight and heavyweight classifiers to 5 distinct tasks across multiple languages; we present some lessons learned during the training process; we look at per-language results as well as task averages; we present model footprints; and we finally draw a few conclusions regarding trade-offs between the classifiers' characteristics.

Tiberiu Boros, Stefan Daniel Dumitrescu

Ontologies and Wordnets

A Semantic Similarity Measurement Tool for WordNet-Like Databases

The paper describes a new framework for computing the semantic similarity of words and concepts using WordNet-like databases. The main advantage of the presented approach is the ability to implement similarity measures as concise expressions in the embedded query language. The preliminary results of the use of the framework to model the semantic similarity of Polish nouns are reported.

Marek Kubis
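
For orientation, the sketch below computes two classic wordnet-based similarity measures with NLTK and the English Princeton WordNet; these stand in for the paper's framework, its embedded query language and plWordNet, none of which are reproduced here.

```python
# Sketch only: taxonomy-based similarity between two wordnet concepts.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

dog = wn.synsets("dog", pos=wn.NOUN)[0]
cat = wn.synsets("cat", pos=wn.NOUN)[0]

# Shortest-path and Wu-Palmer similarity over the hypernym hierarchy.
print("path:", dog.path_similarity(cat))
print("wup: ", dog.wup_similarity(cat))
```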

Similarity Measure for Polish Short Texts Based on Wordnet-Enhanced Bag-of-words Representation

We present a method for computing the semantic similarity of Polish texts, with the main focus on short texts. We have taken into account the limited set of language tools for Polish, and especially the fact that syntactic and semantic parsers do not offer accuracy and robustness high enough to become a stable basis for similarity computation. A very large wordnet of Polish, namely plWordNet, is used to construct meaning representations for words in such a way that different words with similar meanings receive similar representations. The use of a Word Sense Disambiguation (WSD) tool for Polish brought positive results in one of the method variants, despite the limited accuracy of the WSD tool. The proposed measures have been compared with the manual evaluation of sentence pairs. The measures were also applied as part of a Question Answering system, where improved performance of answer finding was achieved in several types of tests.

Maciej Piasecki, Anna Gut
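
The sketch below illustrates the wordnet-enhanced bag-of-words idea in a hedged way: each word is expanded with the lemmas of its synsets so that different words with similar meanings overlap, and texts are compared by cosine similarity. English WordNet via NLTK is used for illustration; the paper's plWordNet representations and optional WSD step are not reproduced.

```python
# Sketch only: synset-expanded bags of words compared by cosine similarity.
import math
from collections import Counter
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def expanded_bag(text):
    bag = Counter()
    for word in text.lower().split():
        bag[word] += 1
        for syn in wn.synsets(word):
            for lemma in syn.lemma_names():
                bag[lemma] += 1
    return bag

def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[t] * b[t] for t in shared)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

print(cosine(expanded_bag("the car is quick"), expanded_bag("a fast automobile")))
```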

Methods of Linking Linguistic Resources for Semantic Role Labeling

This paper presents the process of enriching the verb frame database of a Hungarian natural language parser to enable the assignment of semantic roles. We accomplished this by linking the parser’s verb frame database to existing linguistic resources such as VerbNet and WordNet, and automatically transferring back semantic knowledge. We developed OWL ontologies that map the various constraint description formalisms of the linked resources and employed a logical reasoning device to facilitate the linking procedure. We present results and discuss the challenges and pitfalls that arose from this undertaking. We also compare our rule-based approach with that of using a state-of-the-art English semantic role labeler pipeline for the thematic role transferring task.

Balázs Indig, Márton Miháltz, András Simonyi

Machine Translation

A Quality Estimation System for Hungarian

Quality estimation is an important field of machine translation evaluation. There are automatic evaluation methods for machine translation that use reference translations created by human translators. The creation of these reference translations is very expensive and time-consuming. Furthermore, these automatic evaluation methods do not work in real time, and the correlation between their results and those of human evaluation is very low in the case of translations from English to Hungarian. The other kind of evaluation approach is quality estimation. These methods address the task by estimating the quality of translations as a prediction task for which features are extracted only from the source and translated sentences. In this study, we describe an English-Hungarian quality estimation system that can predict the quality of translated sentences. Furthermore, using the predicted quality scores, we combined different kinds of machine-translated outputs to improve translation accuracy. For this task, we created a training corpus. Last but not least, using the quality estimation method we created a monolingual quality estimation system for a psycholinguistically motivated parser. In this paper we summarize our results and show some partial results of ongoing projects.

Zijian Győző Yang, Andrea Dömötör, László János Laki

Leveraging the Advantages of Associative Alignment Methods for PB-SMT Systems

Training statistical machine translation systems used to require heavy computation times. It has been shown that approximations in the probabilistic approach can lead to impressive improvements (Fast align). We show that, by leveraging the advantages of the associative approach, we achieve similar, even faster, training times while keeping comparable BLEU scores. Our contributions are of two types: of the engineering type, by introducing multi-processing both in sampling-based alignment and in hierarchical sub-sentential alignment; and of the modeling type, by introducing approximations in hierarchical sub-sentential alignment that lead to important reductions in time without affecting the alignments produced. We test and compare our improvements on six typical language pairs of the Europarl corpus.

Baosong Yang, Yves Lepage

Information and Data Extraction

Events Extractor for Polish Based on Semantics-Driven Extraction Templates

The paper presents a paradigm for extracting events from Polish free texts. We call it semantics-driven because the extraction templates are generated from a specification of domain knowledge expressed in the form of a well-founded ontology. The method is equipped with a supporting tool that has two components: the first one is domain-dependent and serves to generate extraction templates on the basis of an ontology; the second is linguistic and domain-independent and may be used whenever templates are supplied, not necessarily via the generator. We checked the quality of our generator on the basis of a case study.

Jolanta Cybulka, Jakub Dutkiewicz

Understanding Questions and Extracting Answers: Interactive Quiz Game Application Design

The paper discusses two key tasks performed by a Question Answering Dialogue System (QADS): user question interpretation and answer extraction. The system represents an interactive quiz game application. The information that forms the content of the game concerns biographical facts from famous people's lives. Question classification and answer extraction are performed based on a domain-specific taxonomy of semantic roles and relations used to compute the Expected Answer Type (EAT). Question interpretation is achieved by performing a sequence of classification, information extraction, query formalization and query expansion tasks. The expanded query facilitates the search and retrieval of the information. The facts are extracted from Wikipedia pages by means of the same set of semantic relations, whose fillers are identified by trained sequence classifiers and pattern matching tools, and edited to be returned to the player as full-fledged system answers. The results (precision of 85% for the EAT classification of both questions and answers) show that the presented approach fits the data well and can be considered a promising method for other QA domains, in particular when dealing with unstructured information.

Volha Petukhova, Desmond Darma Putra, Alexandr Chernov, Dietrich Klakow

Exploiting Wikipedia-Based Information-Rich Taxonomy for Extracting Location, Creator and Membership Related Information for ConceptNet Expansion

In this paper we present a method for extracting IsA assertions (hyponymy relations), AtLocation assertions (informing of the location of an object or place), LocatedNear assertions (informing of neighboring locations), CreatedBy assertions (informing of the creator of an object) and MemberOf assertions (informing of group membership) automatically from Japanese Wikipedia XML dump files. We use the Hyponymy extraction tool v1.0, which analyses the definition, category and hierarchy structures of Wikipedia articles to extract IsA assertions and produce an information-rich taxonomy. From this taxonomy we extract additional information, in this case AtLocation, LocatedNear, CreatedBy and MemberOf assertions, using our original method. The presented experiments show that both methods produce satisfactory results: we were able to acquire 5,866,680 IsA assertions with 96.0% reliability, 131,760 AtLocation assertion pairs with 93.5% reliability, 6,217 LocatedNear assertion pairs with 98.5% reliability, 270,230 CreatedBy assertion pairs with 78.5% reliability and 21,053 MemberOf assertions with 87.0% reliability. Our method surpassed the baseline system in terms of both precision and the number of acquired assertions.

Marek Krawczyk, Rafal Rzepka, Kenji Araki

Text Engineering and Processing

Lexical Analysis of Serbian with Conditional Random Fields and Large-Coverage Finite-State Resources

This article describes a joint approach to lexical tagging in Serbian, combining three fundamental natural language processing tasks: part-of-speech tagging, compound and named entity recognition. The proposed system relies on conditional random fields that are trained from a newly released annotated corpus and finite-state lexical resources used in an existing symbolic Serbian tagging system. Experimental results show that a joint strategy is more robust than pipeline ones and that the use of lexical resources has a significant positive impact on tagging, in particular on out-of-domain texts.

Mathieu Constant, Cvetana Krstev, Duško Vitas
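
A minimal sklearn-crfsuite sketch of CRF tagging with lexicon-lookup features, in the spirit of combining statistical tagging with lexical resources, is given below; the toy lexicon, tag set and training sentences are invented and do not correspond to the Serbian resources used in the paper.

```python
# Sketch only: CRF tagging with word-form and lexicon-lookup features.
import sklearn_crfsuite

LEXICON = {"novi": {"A"}, "sad": {"N"}, "grad": {"N"}}   # hypothetical lexicon entries

def features(sent, i):
    w = sent[i]
    return {
        "lower": w.lower(),
        "suffix2": w[-2:],
        "is_capitalized": w[0].isupper(),
        "lex_tags": ",".join(sorted(LEXICON.get(w.lower(), {"UNK"}))),
    }

train_sents = [["Novi", "Sad", "je", "grad"], ["To", "je", "novi", "grad"]]
train_tags = [["B-LOC", "I-LOC", "V", "N"], ["PRO", "V", "A", "N"]]   # toy tag set

X = [[features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_tags)
print(crf.predict(X)[0])
```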

Improving Chunker Performance Using a Web-Based Semi-automatic Training Data Analysis Tool

Fine-tuning features for NP chunking is a difficult task. The effects of a modification are sometimes unpredictable, and the tuning process with a (un)supervised learning algorithm does not necessarily produce better results. An online toolkit was developed for this scenario that highlights critical areas in training data which may pose a challenge for the learning algorithm: irregular data, exceptions in trends, and unusual property values. This overview of problematic data might inspire the linguist to enhance the data (for example by dividing a class into more detailed classes). The kit was tested on English and Hungarian corpora. Results show that the preparation of datasets for NP chunking is accelerated effectively, which results in better F-scores. The toolkit runs in a simple browser and its usage poses no difficulties for non-technical users. The tool combines the abstraction ability of a linguist with the power of a statistical engine.

István Endrédy

A Connectionist Model of Reading with Error Correction Properties

Recent models of associative long term memory (LTM) have emerged in the field of neuro-inspired computing. These models have interesting properties of error correction, robustness, storage capacity and retrieval performance. In this context, we propose a connectionist model of written word recognition with correction properties, using associative memories based on neural cliques. Similarly to what occurs in human language, the model takes advantage of the combination of phonological and orthographic information to increase the retrieval performance in error cases. Therefore, the proposed architecture and principles of this work could be applied to other neuro-inspired problems that involve multimodal processing, in particular for language applications.

Max Raphael Sobroza Marques, Xiaoran Jiang, Olivier Dufor, Claude Berrou, Deok-Hee Kim-Dufor

Applications in Language Learning

The Automatic Generation of Nonwords for Lexical Recognition Tests

Lexical recognition tests are frequently used to assess vocabulary knowledge. In such tests, learners need to differentiate between words and artificial nonwords that look much like real words. Our ultimate goal is to create high-quality lexical recognition tests automatically, which enables repeated automated testing for different languages. This task involves both simple (word selection) and complex (nonword generation) subtasks. Our main goal here is to automatically generate word-like nonwords. We compare different ranking strategies and find that our best strategy (a specialized higher-order character-based language model) creates word-like nonwords. We evaluate our nonwords in a user study and find that our automatically generated test yields scores that are highly correlated with a well-established, manually created lexical recognition test.

Osama Hamed, Torsten Zesch
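
The sketch below shows the core ranking idea: score candidate nonwords with a character-based n-gram language model trained on real words, so that word-like candidates come out on top. The tiny lexicon, the bigram order and the candidates are toy assumptions; the paper uses a specialized higher-order model.

```python
# Sketch only: rank nonword candidates by a smoothed character bigram model.
import math
from collections import Counter

words = ["table", "cable", "fable", "stable", "label"]     # toy lexicon
BOUND = "#"

bigrams, unigrams = Counter(), Counter()
for w in words:
    chars = BOUND + w + BOUND
    unigrams.update(chars[:-1])
    bigrams.update(zip(chars, chars[1:]))

def log_prob(candidate):
    chars = BOUND + candidate + BOUND
    score = 0.0
    for a, b in zip(chars, chars[1:]):
        # add-one smoothing over a small character inventory
        score += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + 30))
    return score / len(candidate)                           # length-normalized

candidates = ["dable", "xqzvk"]
print(sorted(candidates, key=log_prob, reverse=True))       # word-like candidate first
```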

Teaching Words in Context: Code-Switching Method for English and Japanese Vocabulary Acquisition Systems

Teaching vocabulary is one of the essential parts of a second-language curriculum. Many existing techniques have tried to facilitate word acquisition, but one method that has received less attention is code-switching. In this paper, we present an experimental system for computer-assisted vocabulary learning in context using a code-switching-based method, focusing on teaching Japanese vocabulary to foreign language learners. First, we briefly introduce our Co-MIX method for vocabulary teaching systems, which uses the code-switching phenomenon to support vocabulary acquisition. Next, we show how we utilize an incidental learning technique with graded readers to facilitate vocabulary learning. We present the system's architecture, the underlying technologies and the initial evaluation of the system's performance using a semantic differential scale. Finally, we discuss the evaluation results and compare them with our English vocabulary teaching system.

Michal Mazur, Rafal Rzepka, Kenji Araki

Emotions, Decisions and Opinions

Automatic Extraction of Harmful Sentence Patterns with Application in Cyberbullying Detection

The problem of humiliating and slandering people through the Internet, generally defined as cyberbullying (later: CB), has recently been recognized as a serious social problem disturbing the mental health of Internet users. In Japan, to deal with the problem, members of the Parent-Teacher Association (PTA) perform Internet Patrol, voluntary work that consists of reading through Web content to spot cyberbullying entries. To help PTA members, we propose a novel method for the automatic detection of malicious content on the Internet. The method is based on a combinatorial approach to language modeling inspired by brute-force search algorithms. It automatically extracts sophisticated sentence patterns and uses them in classification. We tested the method on actual data containing cyberbullying provided by the Human Rights Center. The results show our method outperformed previous methods. It is also more efficient, as it requires minimal human effort.

Michal Ptaszynski, Fumito Masui, Yasutomo Kimura, Rafal Rzepka, Kenji Araki
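
As a rough illustration of the combinatorial approach, the sketch below treats ordered combinations of a sentence's tokens as candidate patterns and scores them by how much more often they occur in harmful than in non-harmful sentences; the toy data, threshold and scoring are assumptions, not the authors' exact weighting scheme.

```python
# Sketch only: extract ordered token combinations as patterns and score sentences.
from itertools import combinations
from collections import Counter

def patterns(sentence, max_len=3):
    tokens = sentence.split()
    for n in range(1, max_len + 1):
        for combo in combinations(tokens, n):   # order-preserving token subsequences
            yield combo

harmful = ["you are so stupid", "nobody likes you"]       # toy training sentences
neutral = ["you are so kind", "everybody likes cake"]

pos, neg = Counter(), Counter()
for s in harmful:
    pos.update(patterns(s))
for s in neutral:
    neg.update(patterns(s))

# pattern score = harmful occurrences minus non-harmful occurrences
scores = {p: pos[p] - neg[p] for p in set(pos) | set(neg)}

def classify(sentence, threshold=1):
    return sum(scores.get(p, 0) for p in patterns(sentence)) >= threshold

print(classify("you are stupid"))   # True with this toy data
print(classify("you are kind"))     # False with this toy data
```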

Sentiment Analysis in Polish Web-Political Discussions

The article presents an analysis of Polish Internet political discussion forums, which are characterized by significant polarization and high levels of emotion. The study compares samples of discussions gathered from Internet comments concerning the candidates in the last Polish election. The authors compare three dictionary-based sentiment analysis methods (built using different sentiment lexicons) with two machine learning ones, and explore methods using word embeddings to enhance sentiment analysis with dictionary-based algorithms. The best-performing algorithm gives results closely corresponding to human evaluations.

Antoni Sobkowicz, Marek Kozłowski
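
The hedged sketch below combines a dictionary-based sentiment score with an embedding-based expansion of the lexicon, in the spirit of the methods compared in the paper; the seed lexicon, vocabulary and random vectors are toy stand-ins for a real Polish sentiment lexicon and pretrained embeddings.

```python
# Sketch only: lexicon sentiment scoring plus embedding-based lexicon expansion.
import numpy as np

rng = np.random.default_rng(0)
lexicon = {"dobry": 1.0, "zly": -1.0}                  # seed polarities (toy)
vocab = ["dobry", "zly", "swietny", "okropny", "stol"]
vectors = {w: rng.normal(size=20) for w in vocab}
# Toy assumption: near-synonyms get vectors close to their seed word.
vectors["swietny"] = vectors["dobry"] + 0.05 * rng.normal(size=20)
vectors["okropny"] = vectors["zly"] + 0.05 * rng.normal(size=20)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand(lexicon, threshold=0.8):
    expanded = dict(lexicon)
    for w, v in vectors.items():
        if w in lexicon:
            continue
        for seed, polarity in lexicon.items():
            if cosine(v, vectors[seed]) > threshold:
                expanded[w] = polarity                 # inherit the seed's polarity
    return expanded

def score(text, lex):
    return sum(lex.get(w, 0.0) for w in text.lower().split())

lex = expand(lexicon)
print(score("swietny film", lex), score("okropny film", lex))
```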

Saturation Tests in Application to Validation of Opinion Corpora: A Tool for Corpora Processing

Opinion processing has recently gained much interest among computational linguists, public relations experts, marketing companies, and politicians. Studies of the natural language expression of opinions, desires, emotions, and related phenomena require appropriate tools and methodologies. We propose tools for the collection of empirical data in the form of a corpus, limiting our research field to customers' written opinions about widely used on-line booking services in the area of hotel reservations (via Booking.com). In this paper, we present the corpus acquisition procedure and our data acquisition tool, and discuss our decisions about the selection of the source data. We also present some limitations of our proposal and propose a validation methodology for the resulting corpora.

Zygmunt Vetulani, Marta Witkowska, Suleyman Menken, Umut Canbolat

Less-Resourced Languages

Issues and Challenges in Developing Statistical POS Taggers for Sambalpuri

Low-density languages are also known as lesser-known, poorly described, less-resourced, minority or less-computerized languages because they have fewer resources available. The collection and annotation of a voluminous corpus for NLP applications in these languages proves to be quite challenging. For the development of any NLP application for a low-density language, one needs an annotated corpus and a standard annotation scheme. Because of their non-standard usage in text and other linguistic nuances, these languages pose significant challenges that are both linguistic and technical in nature. The present paper highlights some of the underlying issues and challenges in developing statistical POS taggers using SVM and CRF++ for Sambalpuri, a less-resourced Eastern Indo-Aryan language. A corpus of approximately 121 k is collected from the web and converted into Unicode encoding. The whole corpus is annotated under the BIS (Bureau of Indian Standards) annotation scheme devised for Odia under the ILCI (Indian Languages Corpora Initiative) project. Both taggers are trained and tested with approximately 80 k and 13 k respectively. The SVM tagger provides 83% accuracy while the CRF++ tagger reaches 71.56%, which is lower than the former.

Pitambar Behera, Atul Kr. Ojha, Girish Nath Jha

Cross-Linguistic Projection for French-Vietnamese Named Entity Translation

High-quality translation is a time-consuming and expensive process. Named entity (NE) translation, including proper names, remains a very important task for multilingual natural language processing. Most gold-standard corpora are available for English but not for under-resourced languages such as Vietnamese, and for Asian languages this task remains problematic. This paper focuses on a named entity translation approach based on cross-linguistic projection for French-Vietnamese, a poorly resourced language pair. We incrementally apply a cross-projection method using a small annotated parallel corpus, combining surface string matching measures based on probabilistic string edit distance similarity with an additional score for the consistency of syllables between the source term and the target term, obtained through a syllabification process. Evaluations on the French-Vietnamese pair show good accuracy, with a BLEU gain of more than 4 points when translating bilingual named entity pairs.

Ngoc Tan Le, Fatiha Sadat
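
A minimal sketch of ranking Vietnamese candidates for a French named entity with a normalized edit-distance similarity plus a syllable-consistency score is given below; the syllable counting, weighting and example strings are simplifying assumptions rather than the paper's exact measures.

```python
# Sketch only: score NE translation candidates by string similarity and syllable match.
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def string_similarity(a, b):
    return 1 - edit_distance(a, b) / max(len(a), len(b))

def syllable_consistency(src, tgt):
    # crude source-syllable estimate vs. space-separated target syllables
    src_syl = max(1, sum(ch in "aeiouy" for ch in src.lower()))
    tgt_syl = len(tgt.split())
    return min(src_syl, tgt_syl) / max(src_syl, tgt_syl)

def score(src, tgt, alpha=0.7):
    return alpha * string_similarity(src.lower(), tgt.lower().replace(" ", "")) \
           + (1 - alpha) * syllable_consistency(src, tgt)

candidates = ["Pa ri", "Luan Don"]          # toy ASCII approximations of candidates
print(sorted(candidates, key=lambda c: score("Paris", c), reverse=True))
```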

National Language Technologies Portals for LRLs: A Case Study

The new Welsh National Language Technologies Portal is an extensive resource for researchers, developers in the ICT and digital media spheres, open source enthusiasts and code clubs who may have a limited understanding of language technologies but who nevertheless need to incorporate linguistic data and capabilities into their own projects, products, processes and services in order to better serve their wider LRL community. It includes a repository of free, simple and accessible resources with documentation, tutorials, example code and projects. This paper describes the rationale for and process of building the Portal, the novel resource dissemination mechanisms employed, such as online APIs and Docker, as well as the lessons learnt and the applicability to other similar linguistic situations and communities.

Delyth Prys, Dewi Bryn Jones

Challenges for and Perspectives on the Malagasy Language in the Digital Age

This paper is a revised and extended version of my LTC 2015 contribution [26]. It describes the current state of the Malagasy language and deals with challenges and perspectives. To enter the digital age, a language must have resources and tools. The creation of useful tools such as spell checkers or machine translation systems would introduce Malagasy into the era of new technology and encourage users to use the language more. However, building such tools is usually the work of specialists in Natural Language Processing (NLP), and for Malagasy, an agglutinative language, collaboration between NLP specialists and linguists is required. This paper surveys tools and resources that have been constructed for Malagasy, among others a project [24, 27] based on the DELA framework [14] to construct NLP dictionaries of Malagasy by using conventional dictionaries and converting them into a structured, but readable and manually updatable, resource usable by Unitex [18]. We report on the ongoing construction of NLP dictionaries of verbs, nouns derived from verbs, and grammatical words with the same DELA methodology, and we discuss the dictionaries of simple words and multi-word units.

Joro Ny Aina Ranaivoarison

Backmatter
