
2017 | Book

Text, Speech, and Dialogue

20th International Conference, TSD 2017, Prague, Czech Republic, August 27-31, 2017, Proceedings

About this Book

This book constitutes the proceedings of the 20th International Conference on Text, Speech, and Dialogue, TSD 2017, held in Prague, Czech Republic, in August 2017.
The 56 regular papers presented together with 3 abstracts of keynote talks were carefully reviewed and selected from 117 submissions. They focus on topics such as corpora and language resources; speech recognition; tagging, classification and parsing of text and speech; speech and spoken language generation; semantic processing of text and speech; integrating applications of text and speech processing; automatic dialogue systems; as well as multimodal techniques and modelling.

Table of Contents

Frontmatter

Keynote Talk

Frontmatter
A Glimpse Under the Surface: Language Understanding May Need Deep Syntactic Structure

Language understanding is one of the crucial issues both for the theoretical study of language as well as for applications developed in the domain of natural language processing. As Katz (1969, p. 100) puts it “to understand the ability of natural languages to serve as instrument to the communication of thoughts and ideas we must understand what it is that permits those who speak them consistently to connect the right sounds with the right meanings.” The proper task of linguistics consists then in the description (and) explanation of the relation between the set of the semantic representations and that of the phonetic forms of utterances; at the same time, among the principal difficulties there belongs “a specification of the set of semantic representations” (Sgall and Hajičová 1970, p. 5). In our contribution, we present arguments for the approach that follows the tradition of European structuralism, which attempted to give an account of linguistic meaning whose elements are understood as “points of intersection” of conceptual contents (as a reflection of reality) and the organizing principle of the grammar of the individual language (Dokulil and Daneš 1958). In other words, we examine how “deep” the semantic representations have to be in order (i) to give an appropriate account of synonymy, and (ii) to help to distinguish semantic differences in cases of ambiguity (homonymy).

Eva Hajičová

Conference Papers

Frontmatter
Robust Automatic Evaluation of Intelligibility in Voice Rehabilitation Using Prosodic Analysis

Speech intelligibility for voice rehabilitation has been successfully evaluated by automatic prosodic analysis. In this paper, the influence of reading errors and the selection of certain words for the computation of prosodic features (nouns only, nouns and verbs, beginning of each sentence, beginnings of sentences and subclauses) are examined. 73 hoarse patients (48.3 ± 16.8 years) read the German version of the text “The North Wind and the Sun”. Their intelligibility was evaluated perceptually by 5 trained experts according to a 5-point scale. Eight prosodic features showed human-machine correlations of r ≥ 0.4. The normalized energy in a word-pause-word interval, computed from all words (r = 0.69 for the full speaker set), the mean of jitter in nouns and verbs (r = 0.67), and the pause duration before a word (r = 0.66) were the most robust features. However, reading errors can significantly influence these results.
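
To illustrate the kind of human-machine correlation reported above, the following minimal Python sketch correlates one prosodic feature with perceptual intelligibility ratings; the feature values, expert scores, and the use of scipy are illustrative assumptions, not the authors' actual setup.

    from scipy.stats import pearsonr

    # Mean expert intelligibility score per speaker (5-point scale) -- made-up values.
    expert_scores = [1.8, 3.2, 2.4, 4.6, 3.9, 1.2, 2.9, 4.1]
    # Automatically computed prosodic feature per speaker (e.g. normalized
    # energy in a word-pause-word interval) -- made-up values.
    feature_values = [0.31, 0.55, 0.42, 0.88, 0.71, 0.22, 0.49, 0.80]

    r, p = pearsonr(feature_values, expert_scores)
    print(f"human-machine correlation r = {r:.2f} (p = {p:.3f})")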

Tino Haderlein, Anne Schützenberger, Michael Döllinger, Elmar Nöth
Personality-Dependent Referring Expression Generation

This paper addresses the issue of how Big Five personality traits may influence the content selection task in Referring Expression Generation (REG). To this end, we build a corpus of referring expressions annotated with personality information, and then use it as the input to a machine learning approach to REG that takes the personality of the target speakers into account. Results show that personality-dependent REG outperforms standard REG algorithms, and that it may be a viable alternative to speaker-dependent approaches that require examples of descriptions produced by every individual under consideration.

Ivandré Paraboni, Danielle Sampaio Monteiro, Alex Gwo Jen Lan
Big Five Personality Recognition from Multiple Text Genres

This paper investigates which Big Five personality traits are best predicted by different text genres, and how much text is actually needed for the task. To this end, we compare the use of ‘free’ Facebook text with controlled text elicited from visual stimuli in descriptive and referential tasks. Preliminary results suggest that certain text genres may be more revealing of personality traits than others, and that some traits are recognisable even from short pieces of text. These insights may aid the future design of more accurate models of personality based on highly focused tasks for both language production and interpretation.

Vitor Garcia dos Santos, Ivandré Paraboni, Barbara Barbosa Claudino Silva
Automatic Classification of Types of Artefacts Arising During the Unit Selection Speech Synthesis

The paper describes an experiment with automatic classification of the basic types of artefacts in the synthetic speech produced by the Czech text-to-speech system using the unit selection synthesis method. The developed classifier, based on Gaussian mixture models (GMM), is ultimately formulated as an open-set classification task due to the limited database of speech artefacts resulting from incorrectly chosen or exchanged speech units during the synthesis process. The experiments show that the accuracy with which the speech artefact section is determined has a principal impact on the final precision of artefact type classification. Auxiliary investigations further show that the number of mixtures and the type of covariance matrix have a relatively large influence on the artefact classification error rate as well as on the computational complexity.

Jiří Přibil, Anna Přibilová, Jindřich Matoušek
A Comparison of Lithuanian Morphological Analyzers

In this paper we present comparative research disclosing the strengths and weaknesses of the two most popular and publicly available Lithuanian morphological analyzers, namely Lemuoklis and Semantika.lt. Their performance on lemmatization, part-of-speech tagging, and fine-grained annotation of morphological categories (such as case, gender, tense, etc.) was evaluated on a morphologically annotated gold standard corpus composed of four domains: administrative, fiction, scientific and periodical texts. Semantika.lt significantly outperformed Lemuoklis by ~1.7%, ~2.5%, and ~8.1% on the lemmatization, part-of-speech tagging, and fine-grained annotation tasks, achieving ~98.0%, ~95.3% and ~86.8% accuracy, respectively. Semantika.lt was also superior on the administrative, fiction, and periodical texts; however, Lemuoklis yielded similar performance on the scientific texts and even surpassed Semantika.lt in the fine-grained annotation task.

Jurgita Kapočiūtė-Dzikienė, Erika Rimkutė, Loic Boizou
Constrained Deep Answer Sentence Selection

In this paper, we propose the Constrained Deep Neural Network (CDNN), a simple deep neural model for answer sentence selection. CDNN makes its predictions based on neural reasoning combined with some symbolic constraints. It integrates a pattern matching technique into sentence vector learning. When trained using enough samples, CDNN outperforms regular models. We show how using other sources of training data as a means of transfer learning can enhance the performance of the network. On a well-studied dataset for answer sentence selection, our network significantly improves the state of the art in answer sentence selection.

Ahmad Aghaebrahimian
Quora Question Answer Dataset

We report on work in progress on compiling the Quora Question Answer dataset. The dataset is composed of questions posed on the Quora question answering site. It is the only dataset which provides sentence-level and word-level answers at the same time. Moreover, the questions in the dataset are authentic, which makes it much more realistic for Question Answering systems. We test the performance of a state-of-the-art Question Answering system on the dataset and compare it with human performance to establish an upper bound.

Ahmad Aghaebrahimian
Sentiment Analysis with Tree-Structured Gated Recurrent Units

Advances in neural network models and deep learning have had a great impact on sentiment analysis, where models based on recursive or convolutional neural networks show state-of-the-art results, leaving behind non-neural models such as SVMs or traditional lexicon-based approaches. We present the Tree-Structured Gated Recurrent Unit network, which is simpler than the current state of the art in sentiment analysis, the Tree-Structured LSTM model.

Marcin Kuta, Mikołaj Morawiec, Jacek Kitowski
Synthetic Speech in Therapy of Auditory Hallucinations

In this article we propose using speech synthesis in the therapy of auditory verbal hallucinations, which are sometimes called “voices”. During a therapeutic session a patient converses with an avatar, which is controlled by a therapist. The avatar, based on the XFace model and commercial text-to-speech systems, uses a high quality synthetic voice synchronized with lip movements. A proof-of-concept is demonstrated, as well as the results of preliminary experiments with six patients. The initial results are highly encouraging – all the patients claimed that the therapy helped them, and they also gave high ratings to the quality of the avatar’s speech and its synchronization with the animations.

Kamil Sorokosz, Izabela Stefaniak, Artur Janicki
Statistical Pronunciation Adaptation for Spontaneous Speech Synthesis

To bring more expressiveness into text-to-speech systems, this paper presents a new pronunciation variant generation method which works by adapting standard, i.e., dictionary-based, pronunciations to a spontaneous style. Its strength and originality lie in exploiting a wide range of linguistic, articulatory and prosodic features, and in using a probabilistic machine learning framework, namely conditional random fields and phoneme-based n-gram models. Extensive experiments on the Buckeye corpus of English conversational speech demonstrate the effectiveness of the approach through objective and perceptual evaluations.

Raheel Qader, Gwénolé Lecorvé, Damien Lolive, Marie Tahon, Pascale Sébillot
Machine Learning Approach to the Process of Question Generation

In this paper, we introduce an interactive approach to the generation of factual questions from unstructured text. Our proposed framework transforms input text into a structured set of features and uses them for question generation. Its learning process is based on a combination of machine learning techniques known as reinforcement learning and supervised learning. The learning process starts with an initial set of pairs formed by declarative sentences and assigned questions, and it continuously learns how to transform sentences into questions. The process is also improved by feedback from users regarding already generated questions. We evaluated our approach, and the comparison with state-of-the-art systems shows that it is a promising direction for research.

Miroslav Blšták, Viera Rozinajová
Automatic Extraction of Typological Linguistic Features from Descriptive Grammars

The present paper describes experiments on automatically extracting typological linguistic features of natural languages from traditional written descriptive grammars. The feature-extraction task has high potential value in typological, genealogical, historical, and other related areas of linguistics that make use of databases of structural features of languages. Until now, extraction of such features from grammars has been done manually, which is highly time and labor consuming and becomes prohibitive when extended to the thousands of languages for which linguistic descriptions are available. The system we describe here starts from semantically parsed text over which a set of rules are applied in order to extract feature values. We evaluate the system’s performance on the manually curated Grambank database as the gold standard and report the first measures of precision and recall for this problem.

Shafqat Mumtaz Virk, Lars Borin, Anju Saxena, Harald Hammarström
Text Punctuation: An Inter-annotator Agreement Study

Spoken language is a phenomenon which is hard to annotate accurately. One of the most ambiguous tasks is to fill in the punctuation marks in a spoken language transcription. The punctuation marks used often depend on how annotators understand the content of the transcription. This may differ, as spoken language often lacks the clear structure inherent to written language, due to utterance spontaneity or skipping between ideas. Therefore we suspect that filling commas into a spoken language transcription is a very ambiguous task with low inter-annotator agreement (IAA). Low IAA also means that the application of Gold Truth (GT) annotations for automatic algorithm evaluation is questionable, as already discussed in [7, 8]. In this paper we analyze the IAA within a group of annotators and we propose methods to increase it. We also propose and evaluate a reformulation of classical GT annotations for cases with multiple annotations available.

Marek Boháč, Michal Rott, Vojtěch Kovář
PDTSC 2.0 - Spoken Corpus with Rich Multi-layer Structural Annotation

We present a richly annotated spoken language resource, the Prague Dependency Treebank of Spoken Czech 2.0, the primary purpose of which is to serve for speech-related NLP tasks. The treebank features several novel annotation schemas close to the audio and transcript, and the morphological, syntactic and semantic annotation corresponds to the family of Prague Dependency Treebanks; it could thus be used also for linguistic studies, including comparative studies regarding text and speech. The most unique and novel feature is our approach to syntactic annotation, which differs from other similar corpora such as Treebank-3 [8] in that it does not attempt to impose syntactic structure over input, but it includes one more layer which edits the literal transcript to fluent Czech while keeping the original transcript explicitly aligned with the edited version. This allows the morphological, syntactic and semantic annotation to be deterministically and fully mapped back to the transcript and audio. It brings new possibilities for modeling morphology, syntax and semantics in spoken language – either at the original transcript with mapped annotation, or at the new layer after (automatic) editing. The corpus is publicly and freely available.

Marie Mikulová, Jiří Mírovský, Anja Nedoluzhko, Petr Pajas, Jan Štěpánek, Jan Hajič
Automatic Phonetic Segmentation Using the Kaldi Toolkit

In this paper we explore the possibilities of hidden Markov model based automatic phonetic segmentation with the Kaldi toolkit. We compare the Kaldi toolkit and the Hidden Markov Model Toolkit (HTK) in terms of segmentation accuracy. The well-tuned HTK-based phonetic segmentation framework was taken as the baseline and compared to a newly proposed segmentation framework built from the default examples and recipes available in the Kaldi repository. Since the segmentation accuracy of the HTK-based system was significantly higher than that of the Kaldi-based system, the default Kaldi setting was modified with respect to pause model topology, the way of generating phonetic questions for clustering, and the number of Gaussian mixtures used during modeling. The modified Kaldi-based system achieved results comparable to those obtained by HTK—slightly worse for small segmentation errors but better for gross segmentation errors. We also confirmed that, for both toolkits, the standard three-state left-to-right model topology was significantly outperformed by a modified five-state left-to-right topology, especially with respect to small segmentation errors.

Jindřich Matoušek, Michal Klíma
Language Independent Assessment of Motor Impairments of Patients with Parkinson’s Disease Using i-Vectors

Speech disorders are among the most common symptoms in patients with Parkinson’s disease. In recent years, several studies have aimed to analyze speech signals to detect and to monitor the progression of the disease. Most studies have analyzed speakers of a single language, and even in that scenario the problem remains open. In this study, a cross-language experiment is performed to evaluate the motor impairments of the patients in three different languages: Czech, German and Spanish. The i-vector approach is used for the evaluation due to its capability to model speaker traits. The cosine distance between the i-vector of a test speaker and a reference i-vector that represents either healthy controls or patients is computed. This distance is used to perform two analyses: classification between patients and healthy speakers, and the prediction of the neurological state of the patients according to the MDS-UPDRS score. Classification accuracies of up to 72% and Spearman’s correlations of up to 0.41 are obtained between the cosine distance and the MDS-UPDRS score. This study is a step towards a language independent assessment of patients with neuro-degenerative disorders.
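
The cosine-distance scoring described above can be sketched as follows; the i-vector dimensionality and values are made up for illustration, and real i-vectors would come from a trained extractor.

    import numpy as np

    def cosine_distance(a, b):
        """Cosine distance between two i-vectors (1 - cosine similarity)."""
        a, b = np.asarray(a, float), np.asarray(b, float)
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Hypothetical 4-dimensional i-vectors (real i-vectors typically have hundreds of dimensions).
    reference_healthy = np.array([0.12, -0.40, 0.33, 0.05])   # mean i-vector of healthy controls
    test_speaker      = np.array([0.40, -0.10, 0.20, 0.30])   # i-vector of a test speaker

    score = cosine_distance(test_speaker, reference_healthy)
    print(f"distance to healthy reference: {score:.3f}")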

N. Garcia, J. C. Vásquez-Correa, J. R. Orozco-Arroyave, N. Dehak, E. Nöth
ParCoLab: A Parallel Corpus for Serbian, French and English

ParCoLab is a trilingual parallel corpus containing texts in Serbian, French and English. It is developed at the CLLE-ERSS research unit (UMR 5263 CNRS) at the University of Toulouse, France, in collaboration with the Department of Romance Studies at the University of Belgrade, Serbia. Serbian being one of the less-resourced European languages, this is an important step towards the creation of freely accessible corpora and NLP tools for this language. Our main goal is to provide the scientific community with a high-quality resource that can be used in a wide range of applications, such as contrastive linguistic studies, NLP research, machine and computer assisted translation, translation studies, second language learning and teaching, and applied lexicography. The corpus currently contains 7.1M tokens mainly from literary works, but corpus extension and diversification efforts are ongoing. ParCoLab can be queried online and a part of it is available for download.

Aleksandra Miletic, Dejan Stosic, Saša Marjanović
Prosodic Phrase Boundary Classification Based on Czech Speech Corpora

The correct usage of phrase boundaries is an important issue for ensuring natural-sounding and easily intelligible speech. Therefore, it is not surprising that boundary detection is also a part of text-to-speech systems. In the presented paper, large speech corpora are used for a classification-based approach in order to improve the phrasing of synthesized sentences. The paper compares the results of different classifiers to deterministic approaches based on punctuation and conjunctions and shows that they are able to outperform the simple algorithms.

Markéta Jůzová
Parliament Archives Used for Automatic Training of Multi-lingual Automatic Speech Recognition Systems

In the paper we present a fully automated process capable of creating the speech databases needed for training acoustic models for speech recognition systems. We show that the archives of national parliaments are perfect sources of speech and text data suited for a lightly supervised training scheme, which does not require human intervention. We describe the process and its procedures in detail and demonstrate its usage on three Slavic languages (Polish, Russian and Bulgarian). Practical evaluation is done on a broadcast news task and yields better results than those obtained on some established speech databases.

Jan Nouza, Radek Safarik
Recent Results in Speech Recognition for the Tatar Language

This paper presents a comparative study of several different systems for speech recognition for the Tatar language, including systems for very large and unlimited vocabularies. All the compared systems use a corpus-based approach, so recent results in speech and text corpora creation are also shown. The recognition systems differ in acoustic modelling algorithms, basic acoustic units, and language modelling techniques. The DNN-based system with the sub-word based language model shows the best recognition result, obtained on the test part of the speech corpus.

Aidar Khusainov
Data-Driven Identification of German Phrasal Compounds

We present a method to identify and document a phenomenon on which there is very little empirical data: German phrasal compounds occurring as a single token (without punctuation between their components). Relying on linguistic criteria, our approach requires an operational notion of compounds that can be applied systematically, as well as (web) corpora which are large and diverse enough to contain rarely seen phenomena. The method is based on word segmentation and morphological analysis and takes advantage of a data-driven learning process. Our results show that coarse-grained identification of phrasal compounds is best performed with empirical data, whereas fine-grained detection could be improved with a combination of rule-based and frequency-based word lists. Along with the characteristics of web texts, the orthographic realizations seem to be linked to the degree of expressivity.

Adrien Barbaresi, Katrin Hein
Automatic Preparation of Standard Arabic Phonetically Rich Written Corpora with Different Linguistic Units

Phonetically rich and balanced speech corpora are essential components in state-of-the-art automatic speech recognition (ASR) and text-to-speech (TTS) systems. The written form of speech corpora must be prepared carefully to represent the richness and balance in the linguistic content. There is a lack of this type of spoken and written corpora for Standard Arabic (SA), and the only one available was prepared manually by expert linguists and phoneticians. In this work, we address the task of automatic preparation of written corpora with rich linguistic units. Our work depends on a comprehensive statistical linguistic study of SA based on automatic phonetic transcription of texts with more than 5 million words. We prepared two written corpora: the first corpus contains all allophones in SA with at least 3 occurrences of each allophone and 17 occurrences of each phoneme. The second corpus contains, in addition to all allophones, 90.72% of the diphones in SA.

Fadi Sindran, Firas Mualla, Tino Haderlein, Khaled Daqrouq, Elmar Nöth
Adding Thesaurus Information into Probabilistic Topic Models

In this paper we present an approach for introducing thesaurus information into probabilistic topic models. The main idea of the approach is based on the assumption that the frequencies of semantically related words and phrases which occur in the same texts should be enhanced, leading to their larger contribution to the topics found in these texts. The experiments demonstrate that a direct implementation of this idea using WordNet synonyms or direct relations leads to great degradation of the initial model. However, a correction of the initial assumption improves the model and makes it better than the initial model on several measures. Adding n-grams in a similar manner further improves the model.
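
A rough illustration of the underlying idea, boosting the counts of words whose WordNet relatives co-occur in the same text, might look like the sketch below; the boost value, tokenization, and use of NLTK's WordNet interface are assumptions, and the paper's actual corrected formulation is more involved.

    from collections import Counter
    from nltk.corpus import wordnet as wn   # requires the NLTK WordNet data to be installed

    def boost_related_words(doc_tokens, boost=1):
        """Increase the count of words whose WordNet synonyms also occur in the document."""
        counts = Counter(doc_tokens)
        vocab = set(counts)
        for word in list(counts):
            synonyms = {l.lower() for s in wn.synsets(word) for l in s.lemma_names()}
            if (synonyms & vocab) - {word}:    # a related word co-occurs in the same text
                counts[word] += boost
        return counts

    print(boost_related_words(["car", "automobile", "road", "driver"]))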

Natalia Loukachevitch, Michael Nokel
Dialogue Modelling in Multi-party Social Media Conversation

Social media is a rich source of human-human interactions on a vast range of topics. Although dialogue modeling from human-human interactions is not new, to the best of our knowledge there is no previous work attempting to model dialogues from social media data. This paper implements and compares multiple supervised and unsupervised approaches for dialogue modelling from social media conversation, each approach exploiting and unfolding special properties of informal conversations in social media. A new frequency measure is proposed especially for the text classification problem in this type of data.

Subhabrata Dutta, Dipankar Das
Markov Text Generator for Basque Poetry

Poetry generation is a challenging field in the area of natural language processing. A poem is a text structured according to predefined formal rules, whose parts are semantically related. In this work we present a novel automated system to generate poetry in the Basque language conditioned by non-local constraints. From a given corpus, two Markov chains representing forward and backward 2-grams are built. From these Markov chains and a semantic model, a system able to generate poems conforming to a given metric and following semantic cues has been designed. The user is prompted to input a theme for the poem and also a seed word to start the generating process. The system produces several poems in less than a minute, enough for use in live events.
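
A minimal sketch of the forward 2-gram Markov chain mentioned above is shown below; it omits the backward chain, the metric constraints, and the semantic model, and the corpus tokens and seed word are made-up examples.

    import random
    from collections import defaultdict

    def build_bigram_chain(tokens):
        """Forward 2-gram Markov chain: word -> list of observed successors."""
        chain = defaultdict(list)
        for w1, w2 in zip(tokens, tokens[1:]):
            chain[w1].append(w2)
        return chain

    def generate_line(chain, seed, max_words=8):
        """Generate one line from a seed word (no metric or semantic constraints here)."""
        line = [seed]
        while len(line) < max_words and chain[line[-1]]:
            line.append(random.choice(chain[line[-1]]))
        return " ".join(line)

    corpus = "itsasoa urdin da eta zerua ere urdin da gauean".split()  # made-up example tokens
    chain = build_bigram_chain(corpus)
    print(generate_line(chain, seed="urdin"))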

Aitzol Astigarraga, José María Martínez-Otzeta, Igor Rodriguez, Basilio Sierra, Elena Lazkano
Neural Machine Translation for Morphologically Rich Languages with Improved Sub-word Units and Synthetic Data

This paper analyses issues of rare and unknown word splitting with byte pair encoding for neural machine translation and proposes two methods that allow improving the quality of word splitting. The first method linguistically guides byte pair encoding and the second method limits splitting of unknown words. We also evaluate corpus re-translation for a new language pair – English-Latvian. We show a significant improvement in translation quality over baseline systems in all reported experiments. We envision that the proposed methods will allow improving the translation of named entities and technical texts in production systems that often receive data not represented in the training corpus.

Mārcis Pinnis, Rihards Krišlauks, Daiga Deksne, Toms Miks
Evaluation of Dictionary Creating Methods for Under-Resourced Languages

In this paper, we present several bilingual dictionary building methods applied to the Northern Saami–{English, Finnish, Hungarian, Russian} language pairs. Since Northern Saami is an under-resourced language and standard dictionary building methods require a large amount of pre-processed data, we had to find alternative methods. In a thorough evaluation, we compared the results for each method, which confirmed our expectation that the precision of standard lexicon building methods is quite low. The most precise method is the one utilizing Wikipedia title pairs extracted via inter-language links, but Wiktionary-based methods also provided useful results.

Eszter Simon, Iván Mittelholcz
Comparative Evaluation and Integration of Collocation Extraction Metrics

The paper deals with collocation extraction from corpus data. A number of formulae have been created to integrate the different factors that determine the association between collocation components. We describe experiments whose objective was to study collocation extraction based on statistical association measures. The work is focused on bigram collocations. The obtained data on measure precision allow us to establish, to some degree, that some measures are more precise than others. No measure is ideal, which is why various options for their integration are desirable and useful. We propose a number of parameters that allow ranking collocates in a combined list, namely an average rank, a normalized rank and an optimized rank.
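
As a rough illustration of the kind of association measure and rank combination discussed above, the sketch below computes pointwise mutual information for a bigram and merges two measure-specific rankings by average rank; the counts, bigrams, and the second measure are hypothetical examples, not the paper's data.

    import math

    def pmi(count_xy, count_x, count_y, n_tokens):
        """Pointwise mutual information of a bigram, one common association measure."""
        p_xy = count_xy / n_tokens
        p_x, p_y = count_x / n_tokens, count_y / n_tokens
        return math.log2(p_xy / (p_x * p_y))

    def average_rank(rankings):
        """Combine several measure-specific rankings into one list by average rank."""
        bigrams = rankings[0]
        avg = {b: sum(r.index(b) + 1 for r in rankings) / len(rankings) for b in bigrams}
        return sorted(bigrams, key=avg.get)

    print(round(pmi(count_xy=30, count_x=1200, count_y=900, n_tokens=1_000_000), 2))

    # Hypothetical rankings of three bigrams produced by two different measures.
    by_pmi    = [("strong", "tea"), ("heavy", "rain"), ("red", "car")]
    by_tscore = [("heavy", "rain"), ("strong", "tea"), ("red", "car")]
    print(average_rank([by_pmi, by_tscore]))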

Victor Zakharov
Errors in Inflection in Czech as a Second Language and Their Automatic Classification

When analyzing language acquisition of inflective languages like Czech, it is necessary to distinguish between errors in word stems and errors in inflection. We use the data of the learner corpus CzeSL, but we propose a simpler error classification based on levels of language description (orthography, morphonology, morphology, syntax, lexicon), which takes into account the uncertainty about the causes of an error. We present a rule-based automatic annotation tool, which can assist both manual error classification and stochastic automatic error annotation, together with preliminary results on the types of errors in relation to the language proficiency of the text authors.

Tomáš Jelínek
Speaker Model to Monitor the Neurological State and the Dysarthria Level of Patients with Parkinson’s Disease

The progression of the disease in Parkinson’s patients is commonly evaluated with the unified Parkinson’s disease rating scale (UPDRS), which contains several items to assess motor and non–motor impairments. The patients develop speech impairments that can be assessed with a scale to evaluate dysarthria. Continuous monitoring of the patients is desirable for updating the medication or the therapy. In this study, a robust speaker model based on the GMM–UBM approach is proposed for the continuous monitoring of the state of Parkinson’s patients. The model is trained with phonation, articulation, and prosody features with the aim of evaluating deficits in each speech dimension. The performance of the model is evaluated in two scenarios: the monitoring of the UPDRS score and the prediction of the dysarthria level of the speakers. The results indicate that the speaker models are suitable to track the disease progression, especially in terms of the evaluation of the dysarthria level of the speakers.

J. C. Vásquez-Correa, R. Castrillón, T. Arias-Vergara, J. R. Orozco-Arroyave, E. Nöth
A Lightweight Regression Method to Infer Psycholinguistic Properties for Brazilian Portuguese

Psycholinguistic properties of words have been used in various approaches to Natural Language Processing tasks, such as text simplification and readability assessment. Most of these properties are subjective, involving costly and time-consuming surveys to be gathered. Recent approaches use the limited datasets of psycholinguistic properties to extend them automatically to large lexicons. However, some of the resources used by such approaches are not available to most languages. This study presents a method to infer psycholinguistic properties for Brazilian Portuguese (BP) using regressors built with a light set of features usually available for less resourced languages: word length, frequency lists, lexical databases composed of school dictionaries and word embedding models. The correlations between the properties inferred are close to those obtained by related works. The resulting resource contains 26,874 words in BP annotated with concreteness, age of acquisition, imageability and subjective frequency.
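
A minimal sketch of inferring one psycholinguistic property from light features, in the spirit of the approach described above, could look like this; the features, training values, and the choice of a plain linear regressor are illustrative assumptions rather than the authors' exact setup.

    from sklearn.linear_model import LinearRegression

    # Hypothetical training data: one row per word with light features
    # [word length, log corpus frequency, school-dictionary flag] and a
    # human-rated concreteness score (all values made up).
    X_train = [[3, 5.2, 1], [11, 2.1, 0], [5, 4.0, 1], [9, 1.5, 0]]
    y_train = [4.8, 2.1, 4.2, 1.9]   # concreteness on a 1-5 scale

    model = LinearRegression().fit(X_train, y_train)

    # Infer the property for unseen words described by the same features.
    X_new = [[4, 4.7, 1], [12, 1.0, 0]]
    print(model.predict(X_new))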

Leandro Borges dos Santos, Magali Sanches Duran, Nathan Siegle Hartmann, Arnaldo Candido Jr., Gustavo Henrique Paetzold, Sandra Maria Aluisio
Open-Domain Non-factoid Question Answering

We present an end-to-end system for open-domain non-factoid question answering. We leverage the information on the ever-growing World Wide Web, and the capabilities of modern search engines, to find the relevant information. Our QA system is composed of three components: (i) a query formulation module (QFM), (ii) a candidate answer generation module (CAGM), and (iii) an answer selection module (ASM). A thorough empirical evaluation using two datasets demonstrates that the proposed approach is highly competitive.

Maria Khvalchik, Anagha Kulkarni
Linguistic Features as Evidence for Historical Context Interpretation

Inspired by the great potential of linguistic features for preserving and revealing writers’ states of mind and conceptions in a certain space and time, we use linguistic features as a vehicle to extract pieces of significant information from a large set of texts of known origin, so as to construct a context for personal inspection of the writer(s). In this research, we choose a set of linguistic features, each of a grammatical function or a grammatical association pattern, and each representing a different perspective of contextual annotation. In particular, the selected grammatical items include personal pronouns, negation, and noun chunks, and are used as text-slicing tubes for extracting a certain aspect of information. The initial results show that some of the selected grammatical constructions are effective in extracting descriptive evidence for construing historical context. Our study contributes to exploring an effective avenue for innovative history studies by means of examining linguistic evidence.

Jyi-Shane Liu, Ching-Ying Lee, Hua-Yuan Hsueh
Morphosyntactic Annotation of Historical Texts. The Making of the Baroque Corpus of Polish

In the paper, we present some technical issues concerning processing 17th & 18th century texts for the purpose of building a corpus of that period. We describe a chain of procedures leading from transliterated source texts to morphological annotation of text samples that was implemented for building the Baroque Corpus of Polish, a relatively large historical corpus of Polish texts from 17th & 18th c. The described procedure consists of: automatic transliteration from original spelling to modern one, morphological analysis (including the construction of an inflectional dataset for Baroque Polish) and a tool for manual morphosyntactic annotation. The toolchain is being used to create a small manually validated subcorpus, which will serve as training data for a stochastic tagger. Then a larger corpus will be annotated automatically and made available via the Poliqarp corpus search tool.

Witold Kieraś, Dorota Komosińska, Emanuel Modrzejewski, Marcin Woliński
Last Syllable Unit Penalization in Unit Selection TTS

While unit selection speech synthesis tries to avoid speech modifications, it strongly depends on the placement of units into the correct position. Usually, the position is tightly coupled with a distance from the beginning/end of some prosodic or rhythmic units like phrases or words. The present paper shows, however, that it is not necessary to follow position requirements, when the phonetic knowledge of the perception of prosodic patterns (mostly durational in our case) is considered. In particular, we focus on the effects of using word-final units in word-internal positions in synthesized speech, which are often perceived negatively by listeners, due to disruptions in local timing.

Markéta Jůzová, Daniel Tihelka, Radek Skarnitzl
On Multilingual Training of Neural Dependency Parsers

We show that a recently proposed neural dependency parser can be improved by joint training on multiple languages from the same family. The parser is implemented as a deep neural network whose only input is orthographic representations of words. In order to successfully parse, the network has to discover how linguistically relevant concepts can be inferred from word spellings. We analyze the representations of characters and words that are learned by the network to establish which properties of languages were accounted for. In particular we show that the parser has approximately learned to associate Latin characters with their Cyrillic counterparts and that it can group Polish and Russian words that have a similar grammatical function. Finally, we evaluate the parser on selected languages from the Universal Dependencies dataset and show that it is competitive with other recently proposed state-of-the art methods, while having a simple structure.

Michał Zapotoczny, Paweł Rychlikowski, Jan Chorowski
Meaning Extensions, Word Component Structures and Their Distribution: Linguistic Usages Containing Body-Part Terms Liǎn/Miàn, Yǎn/Mù and Zuǐ/Kǒu in Taiwan Mandarin

This study analyzes and compares the linguistic expressions of three sets of body-part terms extracted from the largest, balanced and widely-used Mandarin Chinese corpus, and aims to find their actual usage patterns in the real-world context of Mandarin Chinese. It is found that PERSON and EMOTION are the most prevalent metonymic meanings of the six body-part terms. As for the metonymic and metaphorical meanings of the six body-part terms and their corresponding word component structures, it is found that when the body-part terms denote PERSON, the most dominant word component structure is [NN]_N; when they denote EMOTION, [NN]_N and [VN]_V are the most dominant structures. In addition, the [NN]_N structure shows the highest frequency of occurrence across all six body-part terms when they are used metaphorically.

Hsiao-Ling Hsu, Huei-ling Lai, Jyi-Shane Liu
An Unification-Based Model for Attitude Prediction

Attitude prediction strives to determine whether an opinion holder is positive or negative towards a given target. We cast this problem as a lexicon engineering task in the context of deep linguistic grammar formalisms such as LFG or HPSG. Moreover, we demonstrate that attitude prediction can be accomplished solely through unification of lexical feature structures. It is thus possible to use our model without altering existing grammars; only the lexicon needs to be adapted. In this paper, we also show how our model can be combined with dependency parsers. This makes our model independent of the availability of deep grammars; only unification as a processing means is needed.

Manfred Klenner
Optimal Number of States in HMM-Based Speech Synthesis

This paper deals with using models with a variable number of states in an HMM-based speech synthesis system. The paper also includes some implementation details on how to use these models in systems based on the HTS toolkit, which cannot handle models with an unequal number of states directly; a workaround to enable this functionality is proposed. A data-based method for determining the optimal number of states for particular models is proposed and experimentally tested on 4 large speech corpora. The preference listening test, focused on local differences, showed a preference for the proposed system over the traditional system with 5-state models, while the size of the proposed system (the total number of states) is lower.

Zdeněk Hanzlíček
Temporal Feature Space for Text Classification

In supervised learning algorithms for text classification, the text content is usually represented using the frequencies of the words it contains, ignoring their semantics and their relationships. Words within temporal expressions such as “today” or “last February” are particularly affected by this simplification: the same expression can have different semantics in documents with different timestamps, while different expressions could refer to the same time. After extracting temporal expressions in documents, we model a set of temporal features derived from the time mentioned in the document, showing the relation between these features and the document’s category. We test our temporal approach on a subset of the New York Times corpus, showing a significant improvement over the text-only baseline.
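
The temporal features described above can be illustrated with a small sketch that relates a time mentioned in the text to the document timestamp; the specific features and dates below are hypothetical, not the feature set used in the paper.

    from datetime import date

    def temporal_features(mentioned: date, timestamp: date):
        """Relate a time mentioned in the text to the document timestamp."""
        delta = (mentioned - timestamp).days
        return {
            "days_offset": delta,          # negative: the mention points to the past
            "same_month": mentioned.year == timestamp.year and mentioned.month == timestamp.month,
            "mentioned_month": mentioned.month,
        }

    # "last February" in an article published on 2017-08-30 (made-up example).
    print(temporal_features(date(2017, 2, 1), date(2017, 8, 30)))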

Stefano Giovanni Rizzo, Danilo Montesi
Parkinson’s Disease Progression Assessment from Speech Using a Mobile Device-Based Application

This paper presents preliminary results of individual speaker models for monitoring Parkinson’s disease from speech using a smart-phone. The aim of this study is to evaluate the suitability of mobile devices to perform robust speech analysis. Speech recordings from 68 PD patients were captured from 2012 to 2016 in four recording sessions. The performance of the speaker models is evaluated according to two clinical rating scales: the Unified Parkinson’s Disease Rating Scale (UPDRS) and a modified version of the Frenchay Dysarthria Assessment (m-FDA) scale. According to the results, it is possible to assess the disease progression from speech with Pearson’s correlations of up to r = 0.51. This study suggests that it is worthwhile to continue working on the development of mobile-based tools for the continuous and unobtrusive monitoring of Parkinson’s patients.

T. Arias-Vergara, P. Klumpp, J. C. Vásquez-Correa, J. R. Orozco-Arroyave, E. Nöth
A New Corpus of Collaborative Dialogue Produced Under Cognitive Load Using a Driving Simulator

We present an experiment designed to collect both monologue and dialogue speech, produced under conditions that will tax the attentional resources of the speaker. Pairs of participants (native speakers of French) were asked to perform tasks involving use of auditory memory, complex memorisation and recall, and collaborative exchange of information. Following the dual-task paradigm, we induced continuous attentional load in one of the participants, using the Continuous Tracking and Reaction (ConTRe) task, in a driving simulator. In this article, we present the corpus and an initial analysis of the prosodic characteristics (silent pauses, filled pauses, disfluencies) of speech produced under cognitive load.

George Christodoulides
Phonetic Segmentation Using Knowledge from Visual and Perceptual Domain

Accurate and automatic phonetic segmentation is crucial for several speech-based applications such as phone-level articulation analysis and error detection, speech synthesis, annotation, speech recognition and emotion recognition. In this paper we examine the effectiveness of using visual features, obtained by processing the image spectrogram of a speech utterance, for phonetic segmentation. Further, we propose a mechanism to combine knowledge from the visual and perceptual domains for automatic phonetic segmentation. This process can be considered analogous to manual phonetic segmentation. The technique was evaluated on the TIMIT American English corpus. Experimental results show significant improvements in phonetic segmentation, especially for lower tolerances of 5, 10 and 15 ms; an absolute improvement of 8.29% is observed on the TIMIT database for a 10 ms tolerance.

Bhavik Vachhani, Chitralekha Bhat, Sunil Kopparapu
The Impact of Inaccurate Phonetic Annotations on Speech Recognition Performance

This paper focuses on the impact of phonetic inaccuracies in acoustic training data on the performance of an automatic speech recognition system. This is especially important if the training data is created in an automated way. In this case, the data often contains errors in the form of wrong phonetic transcriptions. A series of experiments simulating various common errors in phonetic transcriptions, based on parts of the GlobalPhone data set (for Croatian, Czech and Russian), is conducted. These experiments show the influence of various errors on different languages and acoustic models (Gaussian mixture models, deep neural networks). The impact of errors is also shown for real data obtained by our automated ASR creation process for Belarusian. The results show that the best performance is achieved by using the most accurate data; however, a certain amount of errors (up to 5%) has a relatively small impact on speech recognition accuracy.
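
The kind of error simulation described above can be sketched as a random perturbation of phone labels at a chosen rate; the phone inventory, error model, and example transcript below are simplifications assumed for illustration.

    import random

    def corrupt_transcription(phones, error_rate=0.05, inventory=None):
        """Randomly replace a fraction of phone labels to simulate transcription errors."""
        inventory = inventory or sorted(set(phones))
        corrupted = []
        for p in phones:
            if random.random() < error_rate:
                corrupted.append(random.choice([q for q in inventory if q != p]))
            else:
                corrupted.append(p)
        return corrupted

    transcript = ["d", "o", "b", "r", "i", "d", "e", "n"]   # made-up phone sequence
    print(corrupt_transcription(transcript, error_rate=0.05))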

Radek Safarik, Lukas Mateju
Automatic Detection of Parkinson’s Disease: An Experimental Analysis of Common Speech Production Tasks Used for Diagnosis

Parkinson’s disease (PD) is the second most common neurodegenerative disorder of mid-to-late life after Alzheimer’s disease. During the progression of the disease, most individuals with PD report impairments in speech due to deficits in phonation, articulation, prosody, and fluency. In the literature, several studies perform automatic classification of the speech of people with PD considering various types of acoustic information extracted from different speech tasks. Nevertheless, it is unclear which tasks are more important for an automatic classification of the disease. In this work, we compare the discriminant capabilities of eight verbal tasks designed to capture the major symptoms affecting speech. To this end, we introduce a new database of Portuguese speakers consisting of 65 healthy control subjects and 75 PD subjects. For each task, an automatic classifier is built using feature sets and modeling approaches in compliance with the current state of the art. Experimental results identify reading prosodic sentences aloud and story-telling as the most useful tasks for the automatic detection of PD.

Anna Pompili, Alberto Abad, Paolo Romano, Isabel P. Martins, Rita Cardoso, Helena Santos, Joana Carvalho, Isabel Guimarães, Joaquim J. Ferreira
Unified Simplified Grapheme Acoustic Modeling for Medieval Latin LVCSR

A large vocabulary continuous speech recognition (LVCSR) system designed for the dictation of medieval Latin language documents is introduced. Such a language technology tool can be of great help for preserving Latin language charters from this era, as optical character recognition systems are often challenged by these historical materials. As the corresponding historical research focuses on the Visegrad region, our primary aim is to make medieval Latin dictation available for texts and speakers of this region, concentrating on Czech, Hungarian and Polish. The baseline acoustic models we start with are monolingual grapheme-based ones. On one hand, the application of medieval Latin knowledge-based grapheme-to-phoneme (G2P) mapping from the source language to the target language resulted in significant improvement, reducing the Word Error Rate (WER) by 13.3%. On the other hand, applying a Unified Simplified Grapheme (USG) inventory set for the three-language acoustic data set, complemented with Romanian speech data, resulted in a further 0.7% WER reduction, without using any target or source language G2P rules.

Lili Szabó, Péter Mihajlik, András Balog, Tibor Fegyó
Experiments with Segmentation in an Online Speaker Diarization System

In offline speaker diarization systems, particularly those aimed at telephone speech, the accuracy of the initial segmentation of a conversation is often a secondary concern. Imprecise segment boundaries are typically corrected during resegmentation, which is performed as the final step of the diarization process. However, such resegmentation is generally not possible in online systems, where past decisions are usually unchangeable. In such situations, correct segmentation becomes critical. In this paper, we evaluate several different segmentation approaches in the context of online diarization by comparing the overall performance of an i-vector-based diarization system set to operate in a sequential manner.

Marie Kunešová, Zbyněk Zajíc, Vlasta Radová
Spatiotemporal Convolutional Features for Lipreading

We propose a visual parametrization method for the task of lipreading and audiovisual speech recognition from frontal face videos. The presented features utilize learned spatiotemporal convolutions in a deep neural network that is trained to predict phonemes on a frame level. The network is trained on a manually transcribed moderate size dataset of Czech television broadcast, but we show that the resulting features generalize well to other languages as well. On a publicly available OuluVS dataset, a result of 91% word accuracy was achieved using vanilla convolutional features, and 97.2% after fine tuning – substantial state of the art improvements in this popular benchmark. Contrary to most of the work on lipreading, we also demonstrate usefulness of the proposed parametrization in the task of continuous audiovisual speech recognition.

Karel Paleček
Could Emotions Be Beneficial for Interaction Quality Modelling in Human-Human Conversations?

There are different metrics which are used in call centres or Spoken Dialogue Systems (SDSs) as indicators for problem detection during the dialogue. One such metric is emotional state. Measurements of emotions can be a powerful indicator in different task-oriented services. Besides emotional state, there is another widely used metric: customer satisfaction (CS), which has a modification called Interaction Quality (IQ). Both the CS and IQ models may include emotional state as a feature. However, is it actually a necessary feature? Some users/customers can be very emotional, while others can be insufficiently emotional in different satisfaction categories. That is why emotional state may not be an informative feature for IQ/CS modelling. Our research is dedicated to determining the role of emotion measurements in the IQ modelling task.

Anastasiia Spirina, Wolfgang Minker, Maxim Sidorov
Multipoint Neighbor Embedding

Dimensionality reduction methods for visualization attempt to preserve in the embedding as much of the original information as possible. However, projection to 2-D or 3-D heavily distorts the data. Instead, we propose a multipoint extension to neighbor embedding methods, which allows expressing datapoints from a high-dimensional space as sets of datapoints in a low-dimensional space. The cardinality of those sets is not assumed a priori. Using the gradient of the cost function, we derive an expression which, for every datapoint, indicates its remote area of attraction. We use it as a heuristic that guides the selection and placement of additional datapoints. We demonstrate the approach with multipoint t-SNE, and adapt the O(N log N) approximation for computing the gradient of t-SNE to our setting. Experiments show that the approach brings qualitative and quantitative gains, i.e., it expresses more pairwise similarities and multi-group memberships of individual datapoints, better preserving the local structure of the data.

Adrian Lancucki, Jan Chorowski
Significance of Interaction Parameter Levels in Interaction Quality Modelling for Human-Human Conversation

The Interaction Quality (IQ) metric, which originally was designed for spoken dialogue systems (SDSs) to assess human-computer spoken interaction (HCSI) and then adapted to human-human conversation (HHC), is based on features from three interaction parameter levels: an exchange, a window, and a dialogue level. To determine the significance of the window and dialogue interaction parameter levels, as well as their combination, computations, based on different data sets, have been performed using several classification algorithms. The obtained results may be used for further improvement of the IQ model for HHC in terms of the computational complexity.

Anastasiia Spirina, Alina Skorokhod, Tatiana Karaseva, Iana Polonskaia, Maxim Sidorov
Ship-LemmaTagger: Building an NLP Toolkit for a Peruvian Native Language

Natural Language Processing deals with the understanding and generation of texts through computer programs. There are many different functionalities used in this area, but among them there are some core functions that support the remaining ones. These methods are related to the core processing of the morphology of the language (such as lemmatization) and the automatic identification of part-of-speech tags. This paper therefore describes the implementation of a basic NLP toolkit for a new language, focusing on the features mentioned above and testing them on a corpus built for this purpose. The obtained results exceeded expectations and could be used for more complex tasks such as machine translation.

José Pereira-Noriega, Rodolfo Mercado-Gonzales, Andrés Melgar, Marco Sobrevilla-Cabezudo, Arturo Oncevay-Marcos
A Study of Abstractive Summarization Using Semantic Representations and Discourse Level Information

The present work proposes an exploratory study of abstractive summarization integrating semantic analysis and discursive information. Firstly, we built a conceptual graph using some lexical resources and Abstract Meaning Representation (AMR). Secondly, we applied the PageRank algorithm to obtain the most relevant concepts. We also incorporated discursive information from Rhetorical Structure Theory (RST) into PageRank to improve the identification of relevant concepts. Finally, we applied some rules over the relevant concepts and used SimpleNLG to generate the summaries. This study was performed on the DUC 2002 corpus and the results showed an F1-measure of 24% in ROUGE-1 when AMR and RST were used, proving their usefulness in this task.
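
A bare-bones version of the PageRank step mentioned above, applied to a toy concept graph, might look like the sketch below; the graph, damping factor, and iteration count are illustrative, and the paper's AMR- and RST-based graph construction is not reproduced here.

    def pagerank(graph, damping=0.85, iterations=50):
        """Plain PageRank by power iteration over an adjacency dict {node: [neighbours]}."""
        nodes = list(graph)
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
            for n, neighbours in graph.items():
                if not neighbours:
                    continue
                share = damping * rank[n] / len(neighbours)
                for m in neighbours:
                    new_rank[m] += share
            rank = new_rank
        return rank

    # Hypothetical tiny concept graph built from semantic parses (edges are made up).
    concepts = {"dog": ["animal", "run"], "run": ["dog"], "animal": ["dog"], "park": ["run"]}
    ranks = pagerank(concepts)
    print(sorted(ranks, key=ranks.get, reverse=True))   # most relevant concepts first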

Gregory César Valderrama Vilca, Marco Antonio Sobrevilla Cabezudo
Development and Integration of Natural Brazilian Portuguese Synthetic Voices to Framework FIVE

The Framework FIVE is a multiplatform tool that assists the development of voice user interfaces applied in different technological environments. Several works have been carried out in order to provide increasingly natural synthetic voices to FIVE; however, experiments with users have revealed the need for more friendly voices integrated into the framework. This paper describes the development and integration of natural synthetic voices in Brazilian Portuguese into the Framework FIVE. For this, a private audio and phonetic database was used to develop two voices (male and female) using the unit selection technique available on the MaryTTS platform. For the integration process, a specific web service was developed. For comparison purposes, experiments were conducted to evaluate the naturalness and intelligibility of the voices; the results show that the constructed voices are more friendly, although there is not a great difference when compared with the HMM-based technique.

Danilo S. Barbosa, Byron L. D. Bezerra, Alexandre M. A. Maciel
Fine-Tuning Word Embeddings for Aspect-Based Sentiment Analysis

Nowadays word embeddings, also known as word vectors, play an important role in many natural language processing (NLP) tasks. In general, these word embeddings are learned by unsupervised learning models (e.g. Word2Vec, GloVe) from a large unannotated corpus, and they are independent of the task they are applied to. In this paper we aim to enrich word embeddings by adding more information from a specific task, namely aspect-based sentiment analysis. We propose a model using a convolutional neural network that takes as input a labeled data set and the word embeddings learned by an unsupervised model (e.g. Word2Vec), and fine-tunes the word embeddings to capture aspect category and sentiment information. We conduct experiments on restaurant review data (http://spidr-ursa.rutgers.edu/datasets/). Experimental results show that the fine-tuned word embeddings outperform the unsupervisedly learned word embeddings.

Duc-Hong Pham, Thi-Thanh-Tan Nguyen, Anh-Cuong Le
Recognition of the Electrolaryngeal Speech: Comparison Between Human and Machine

Automatic recognition of electrolaryngeal speech is usually a hard task due to the fact that all phonemes tend to be voiced. However, using a strong language model (LM) for a continuous speech recognition task, we can achieve satisfactory recognition accuracy. On the other hand, the recognition of isolated words or short phrases containing only several words poses a problem, as in this case the LM does not have a chance to properly support the recognition. At the same time, the recognition of short phrases has great practical potential. In this paper, we discuss the poor performance of automatic speech recognition (ASR) on electrolaryngeal speech, especially for isolated words. By comparing the results achieved by humans and the ASR system, we attempt to show that even humans are unable to always correctly distinguish the identity of words differing only in voicing. We describe three experiments: the first represents blind recognition, i.e., the ability to correctly recognize an isolated word selected from a vocabulary of more than a million words. The second experiment shows the results achieved when there is some additional knowledge about the task, specifically, when the recognition vocabulary is reduced to only the words that actually appear in the test. The third test evaluates the ability to distinguish two similar words (differing only in voicing) for both the human and the ASR system.

Petr Stanislav, Josef V. Psutka, Josef Psutka
Backmatter
Metadata
Title
Text, Speech, and Dialogue
Edited by
Kamil Ekštein
Václav Matoušek
Copyright Year
2017
Electronic ISBN
978-3-319-64206-2
Print ISBN
978-3-319-64205-5
DOI
https://doi.org/10.1007/978-3-319-64206-2