2021 | Book

Text, Speech, and Dialogue

24th International Conference, TSD 2021, Olomouc, Czech Republic, September 6–9, 2021, Proceedings

About this book

This book constitutes the proceedings of the 24th International Conference on Text, Speech, and Dialogue, TSD 2021, held in Olomouc, Czech Republic, in September 2021.*
The 2 keynote speeches and 46 papers presented in this volume were carefully reviewed and selected from 101 submissions. The topical sections "Text", "Speech", and "Dialogue" deal with the following issues: speech recognition; corpora and language resources; speech and spoken language generation; tagging, classification and parsing of text and speech; semantic processing of text and speech; integrating applications of text and speech processing; automatic dialogue systems; multimodal techniques and modelling, and others.
* Due to the COVID-19 pandemic the conference was held in a "hybrid" mode.

Table of Contents

Frontmatter

Keynote Talks

Frontmatter
Towards User-Centric Text-to-Text Generation: A Survey

Natural Language Generation (NLG) has received much attention with rapidly developing models and ever-more available data. As a result, a growing amount of work attempts to personalize these systems for a better human interaction experience. Still, diverse sets of research across multiple dimensions and numerous levels of depth exist and are scattered across various communities. In this work, we survey the ongoing research efforts and introduce a categorization of these under the umbrella of user-centric natural language generation. We further discuss some of the challenges and opportunities in NLG personalization.

Diyi Yang, Lucie Flek
Wasserstein Autoencoders with Mixture of Gaussian Priors for Stylized Text Generation

Probabilistic autoencoders are effective for text generation. However, they are unable to control the style of the generated text, even when the training samples are explicitly labeled with different styles. We present a Wasserstein autoencoder with a Gaussian mixture prior for style-aware sentence generation. Our model is trained on a multi-class dataset and generates sentences in the style of the desired class. It is also capable of interpolating between multiple classes. Moreover, we can train our model on relatively small datasets. While a regular WAE or VAE cannot generate diverse sentences with few training samples, our approach generates diverse sentences and preserves the style of the desired classes.

Amirpasha Ghabussi, Lili Mou, Olga Vechtomova
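
The class-conditional sampling idea can be illustrated with a minimal sketch: once a WAE is trained with one Gaussian component per style class, generating in a given style amounts to sampling from that component and decoding. All names, dimensions, and parameters below are illustrative assumptions, not the authors' released code.

```python
import torch

# Hypothetical mixture parameters learned during WAE training:
# one Gaussian (mean, log-variance) per style class in the latent space.
num_classes, latent_dim = 4, 64
means = torch.randn(num_classes, latent_dim)     # stand-in for learned means
log_vars = torch.zeros(num_classes, latent_dim)  # stand-in for learned log-variances

def sample_latent(class_id: int, n: int = 1) -> torch.Tensor:
    """Draw latent codes from the Gaussian component of the desired style class."""
    std = (0.5 * log_vars[class_id]).exp()
    return means[class_id] + std * torch.randn(n, latent_dim)

def interpolate(class_a: int, class_b: int, alpha: float) -> torch.Tensor:
    """Blend two class components to interpolate between styles."""
    mean = alpha * means[class_a] + (1 - alpha) * means[class_b]
    log_var = alpha * log_vars[class_a] + (1 - alpha) * log_vars[class_b]
    return mean + (0.5 * log_var).exp() * torch.randn(1, latent_dim)

z = sample_latent(class_id=2, n=8)  # these codes would be fed to the WAE decoder
```
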

Text

Frontmatter
Evaluating Semantic Similarity Methods to Build Semantic Predictability Norms of Reading Data

Predictability corpora built via the Cloze task generally accompany eye-tracking data for the study of the processing costs of linguistic structures in tasks of reading for comprehension. Two semantic measures are commonly calculated to evaluate expectations about forthcoming words: (i) the semantic fit of the target word with the previous context of a sentence, and (ii) semantic similarity scores between the target word and the Cloze task responses for it. For Brazilian Portuguese (BP), no large eye-tracking corpus with predictability norms existed. The goal of this paper is to present a method to calculate the two semantic measures used in the first BP corpus of eye movements during silent reading of short paragraphs by undergraduate students. The method was informed by a large evaluation of both static and contextualized word embeddings, trained on large corpora of texts. Here, we make publicly available: (i) a BP corpus for a sentence-completion task to evaluate semantic similarity, (ii) a new methodology to build this corpus based on the scores of Cloze data taken from our project, and (iii) a hybrid method to compute the two semantic measures in order to build predictability corpora in BP.

Sidney Leal, Edresson Casanova, Gustavo Paetzold, Sandra Aluísio
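
For readers wanting to see what the two measures look like operationally, here is a minimal sketch using static word embeddings; `emb` is a hypothetical word-to-vector lookup, and the exact aggregation in the paper's hybrid method may differ.

```python
import numpy as np

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_fit(target, context_words, emb):
    """Measure (i): cosine between the target word and the mean vector
    of the sentence context preceding it."""
    ctx = np.mean([emb[w] for w in context_words if w in emb], axis=0)
    return cos(emb[target], ctx)

def cloze_similarity(target, responses, emb):
    """Measure (ii): mean cosine between the target word and the
    Cloze-task responses collected for its position."""
    sims = [cos(emb[target], emb[r]) for r in responses if r in emb]
    return float(np.mean(sims)) if sims else 0.0
```
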
SYN2020: A New Corpus of Czech with an Innovated Annotation

The paper introduces the SYN2020 corpus, a newly released representative corpus of written Czech following the tradition of the Czech National Corpus SYN series. The design of SYN2020 incorporates several substantial new features in the area of segmentation, lemmatization and morphological tagging, such as a new treatment of lemma variants, a new system for identifying morphological categories of verbs or a new treatment of multiword tokens. The annotation process, including data and tools used, is described, and the tools and accuracy of the annotation are discussed as well.

Tomáš Jelínek, Jan Křivan, Vladimír Petkevič, Hana Skoumalová, Jana Šindlerová
Deep Bag-of-Sub-Emotions for Depression Detection in Social Media

This paper presents DeepBoSE, a novel deep learning model for depression detection in social media. The model is formulated such that it internally computes a differentiable Bag-of-Features (BoF) representation that incorporates emotional information. This is achieved by reinterpreting classical weighting schemes like tf-idf as probabilistic deep learning operations. An advantage of the proposed method is that it can be trained under the transfer learning paradigm, which is useful to enhance conventional BoF models that cannot be directly integrated into deep learning architectures. Experiments on the eRisk17 and eRisk18 datasets for the depression detection task show that DeepBoSE outperforms conventional BoF representations and is competitive with state-of-the-art methods.

Juan S. Lara, Mario Ezra Aragón, Fabio A. González, Manuel Montes-y-Gómez
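
One common way to make a Bag-of-Features differentiable, in the spirit of the abstract above, is soft assignment of token embeddings to a learnable codebook followed by an idf-like reweighting. The sketch below illustrates that general pattern, not the exact DeepBoSE formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftBagOfFeatures(nn.Module):
    """A minimal differentiable Bag-of-Features layer: tokens are softly
    assigned to learnable codewords, producing a histogram that is then
    reweighted by a learnable idf-like vector."""
    def __init__(self, dim: int, num_codewords: int, temperature: float = 1.0):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codewords, dim))
        self.idf = nn.Parameter(torch.ones(num_codewords))  # idf-like weights
        self.temperature = temperature

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) token embeddings
        book = self.codebook.unsqueeze(0).expand(tokens.size(0), -1, -1)
        dists = torch.cdist(tokens, book)                      # (batch, seq, k)
        assign = F.softmax(-dists / self.temperature, dim=-1)  # soft memberships
        tf = assign.sum(dim=1)                                 # term-frequency histogram
        bof = tf * self.idf                                    # tf-idf-like reweighting
        return F.normalize(bof, p=1, dim=-1)
```
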
Rewriting Fictional Texts Using Pivot Paraphrase Generation and Character Modification

Gender bias in natural language is pervasive but easily overlooked. Current research mostly focuses on using statistical methods to uncover patterns of gender bias in textual corpora. In order to study gender bias in a more controlled manner, we propose to build a parallel corpus in which the gender and other characteristics of the characters in the same story switch between their opposite alternatives. In this paper, we present a two-step fiction rewriting model to automatically construct such a parallel corpus at scale. In the first step, we paraphrase the original text, i.e., the same storyline is expressed differently, in order to ensure linguistic diversity in the corpus. In the second step, we replace the gender of the characters with their opposites and modify their characteristics using either synonyms or antonyms. We evaluate our fiction rewriting model by checking the readability of the rewritten texts and measuring readers’ acceptance in a user study. Results show that rewriting with antonyms and synonyms barely changes the original readability level, and human readers perceive synonymously rewritten texts as mostly reasonable. Antonymously rewritten texts were perceived as less reasonable in the user study, and a post-hoc evaluation indicates that this might be mostly due to grammar and spelling issues introduced by the rewriting. Hence, our proposed approach allows the automated generation of a synonymous parallel corpus to study bias in a controlled way, but needs improvement for antonymously rewritten texts.

Dou Liu, Tingting Zhu, Jörg Schlötterer, Christin Seifert, Shenghui Wang
Transformer-Based Automatic Punctuation Prediction and Word Casing Reconstruction of the ASR Output

The paper proposes a module for automatic punctuation prediction and casing reconstruction based on Transformer architectures (BERT/T5), which constitute the current state of the art in many similar NLP tasks. The main motivation for our work was to increase the readability of the ASR output. The ASR output is usually in the form of a continuous stream of text, without punctuation marks and with all words in lowercase. The resulting punctuation and casing reconstruction module is evaluated on both written text and actual ASR output in three languages (English, Czech and Slovak).

Jan Švec, Jan Lehečka, Luboš Šmídl, Pavel Ircing
A Database and Visualization of the Similarity of Contemporary Lexicons

Lexical similarity data, quantifying the “proximity” of languages based on the similarity of their lexicons, has been increasingly used to estimate the cross-lingual reusability of language resources, for tasks such as bilingual lexicon induction or cross-lingual transfer. Existing similarity data, however, originates from the field of comparative linguistics, computed from very small expert-curated vocabularies that are not meant to be representative of modern lexicons. We explore a different, fully automated approach to lexical similarity computation, based on an existing 8-million-entry cognate database created from online lexicons orders of magnitude larger than the word lists typically used in linguistics. We compare our results to earlier efforts, and automatically produce intuitive visualizations that have traditionally been hand-crafted. With a new, freely available database of over 27 thousand language pairs across 331 languages, we hope to provide more relevant data to cross-lingual NLP applications, as well as material for the synchronic study of contemporary lexicons.

Gábor Bella, Khuyagbaatar Batsuren, Fausto Giunchiglia
The Detection of Actors for German

In this short paper, we discuss a straightforward approach for the identification of noun phrases denoting actors (agents). We use a multilayer perceptron applied to the word embeddings of the head nouns in order to learn a model. A list of 9,000 actors together with 11,000 non-actors generated from a newspaper corpus is used as a silver standard. An evaluation of the results suggests that the model generalises well to unseen data.

Manfred Klenner, Anne Göhring
Verbal Autopsy: First Steps Towards Questionnaire Reduction

Verbal Autopsy (VA) is the instrument used to collect Causes of Death (CoD) in places where access to health services is out of reach. It consists of a questionnaire addressed to the caregiver of the deceased and involves closed questions (CQ) about signs and symptoms prior to death. There is a global effort to reduce the number of questions in the questionnaire to the minimum information essential to ascertain a CoD. To this end, we took two courses of action. On the one hand, the relation of the responses to the CoD was considered by means of entropy in a supervised feature subset selection (FSS) approach. On the other hand, we inspected the questions themselves by means of semantic similarity, leading to an unsupervised approach based on semantic similarity (SFSS). To quantitatively assess the impact of reducing the questionnaire, we evaluated the use of these FSS approaches on the CoD predictive capability of a classifier. Experimental results showed that the unsupervised semantic similarity feature subset selection (SFSS) approach was competitive at identifying similar questions. Nevertheless, naturally, supervised FSS based on the entropy of the responses performed better for CoD prediction. To sum up, the necessity of reviewing the VA questionnaire was accompanied by quantitative evidence.

Ander Cejudo, Owen Trigueros, Alicia Pérez, Arantza Casillas, Daniel Cobos
Effective FAQ Retrieval and Question Matching Tasks with Unsupervised Knowledge Injection

Frequently asked question (FAQ) retrieval, with the purpose of providing information on frequent questions or concerns, has far-reaching applications in many areas like e-commerce services and online forums, where a collection of question-answer (Q-A) pairs compiled a priori can be employed to retrieve an appropriate answer in response to a user’s query that is likely to reoccur frequently. To this end, predominant approaches to FAQ retrieval typically rank question-answer pairs by considering either the similarity between the query and a question (q-Q), the relevance between the query and the associated answer of a question (q-A), or a combination of the clues gathered from the q-Q similarity measure and the q-A relevance measure. In this paper, we extend this line of research by combining the clues gathered from the q-Q similarity measure and the q-A relevance measure, while injecting extra word-interaction information, distilled from a generic (open-domain) knowledge base, into a contextual language model for inferring the q-A relevance. Furthermore, we also explore capitalizing on domain-specific, topically relevant relations between words in an unsupervised manner, acting as a surrogate for supervised domain-specific knowledge base information. This enables the model to equip sentence representations with knowledge about domain-specific and topically relevant relations among words, thereby providing a better q-A relevance measure. We evaluate variants of our approach on a publicly available Chinese FAQ dataset (viz. TaipeiQA), and further apply and contextualize it to a large-scale question-matching task (viz. LCQMC), which aims to search a QA dataset for questions that have a similar intent to an input query. Extensive experimental results on these two datasets confirm the promising performance of the proposed approach in relation to some state-of-the-art ones.

Wen-Ting Tseng, Yung-Chang Hsu, Berlin Chen
Exploring Conditional Language Model Based Data Augmentation Approaches for Hate Speech Classification

Deep Neural Network (DNN) based classifiers have gained increased attention in hate speech classification. However, the performance of DNN classifiers increases with the quantity of available training data, and in reality hate speech datasets consist of only a small amount of labeled data. To counter this, Data Augmentation (DA) techniques are often used to increase the number of labeled samples and thereby improve the classifier’s performance. In this article, we explore the augmentation of training samples using a conditional language model. Our approach uses a single class-conditioned Generative Pre-Trained Transformer-2 (GPT-2) language model for DA, avoiding the need for multiple class-specific GPT-2 models. We study the effect of increasing the quantity of augmented data and show that adding a few hundred samples significantly improves the classifier’s performance. Furthermore, we evaluate the effect of filtering the generated data used for DA. Our approach demonstrates up to 7.3% and up to 25.0% relative improvements in macro-averaged F1 on two widely used hate speech corpora.

Ashwin Geet D’Sa, Irina Illina, Dominique Fohr, Dietrich Klakow, Dana Ruiter
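
A single class-conditioned GPT-2 can be realized by prepending a class control token during fine-tuning and generation. The sketch below shows the generation side with Hugging Face transformers; the control tokens, decoding parameters, and the assumption that the model was previously fine-tuned with these tokens are all illustrative, not the authors' setup.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Hypothetical control tokens marking the class of each training sample.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["<hate>", "<neutral>"]})
model.resize_token_embeddings(len(tokenizer))
# ... fine-tune on labeled samples prefixed with their class token ...

def augment(class_token: str, n: int = 5, max_length: int = 40):
    """Generate n synthetic samples conditioned on a class control token."""
    ids = tokenizer.encode(class_token, return_tensors="pt")
    outputs = model.generate(ids, do_sample=True, top_k=50, top_p=0.95,
                             max_length=max_length, num_return_sequences=n,
                             pad_token_id=tokenizer.eos_token_id)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```
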
Generating Empathetic Responses with a Pre-trained Conversational Model

Conversational agents can be perceived as more human-like if they possess empathy, or the ability to understand and share feelings with their users. Studies have also shown improvements in user engagement with systems that can exhibit emotional skills. Empathy can be expressed through language. In this paper, a pre-trained neural conversational language model named DialoGPT and a new collection of empathetic dialogues tagged with emotions are used to investigate the ability of the model to learn and generate more empathetic responses. Using the small DialoGPT model, the model was fine-tuned on the EmpatheticDialogues dataset, which was intentionally collected from emotional situations. Automatic evaluation using the perplexity metric, and manual evaluation based on performance and user preference, were conducted. The fine-tuned model achieved good performance in generating empathetic responses, with a perplexity value of 12.59, which correlated with the ratings from human evaluators.

Jackylyn Beredo, Carlo Migel Bautista, Macario Cordel, Ethel Ong
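
The perplexity figure reported above is typically computed as the exponential of the average token-level cross-entropy on held-out dialogues. A rough sketch follows; the model name and the per-sequence averaging are assumptions, not the paper's exact protocol.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small").eval()

@torch.no_grad()
def perplexity(texts):
    """Approximate corpus perplexity: token-weighted mean cross-entropy,
    exponentiated (weighting by sequence length is a simplification)."""
    total_loss, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
        total_loss += loss.item() * ids.size(1)
        total_tokens += ids.size(1)
    return math.exp(total_loss / total_tokens)
```
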
Adaptation of Classic Readability Metrics to Czech

We have fitted four classic readability metrics to Czech, using InterCorp (a parallel corpus with manual sentence alignment), CzEng 2.0 (a large parallel corpus of crawled web texts), and the curve_fit routine from SciPy’s optimize module. The adapted metrics are: Flesch Reading Ease, Flesch-Kincaid Grade Level, Coleman-Liau Index, and Automated Readability Index. We describe the details of the procedure and present satisfactory results. In addition, we discuss the sensitivity of these metrics to text paraphrases and the correlation of readability scores with empirically observed reading comprehension, as well as the adaptation of Flesch Reading Ease to Czech from Russian.

Klára Bendová, Silvie Cinková
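
As a sketch of the fitting setup, the Flesch Reading Ease formula can be refitted with curve_fit, starting from the original English coefficients and using target scores taken from the aligned English sides of the parallel corpora. The feature values below are made-up placeholders, not data from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def flesch(X, a, b, c):
    """Flesch Reading Ease with free coefficients."""
    words_per_sentence, syllables_per_word = X
    return a - b * words_per_sentence - c * syllables_per_word

# Hypothetical per-text features of the Czech sides of the parallel corpora
# (average sentence length in words, average word length in syllables) ...
cz_features = np.array([
    [18.2, 2.1], [12.5, 1.8], [25.0, 2.4], [15.3, 2.0], [9.8, 1.6],
]).T
# ... and target FRE scores computed on the aligned English sides.
en_scores = np.array([42.0, 68.0, 28.0, 55.0, 80.0])

# Start the optimization from the original English coefficients.
(a, b, c), _ = curve_fit(flesch, cz_features, en_scores, p0=[206.835, 1.015, 84.6])
print(f"Czech FRE ~ {a:.1f} - {b:.2f}*(words/sentence) - {c:.1f}*(syllables/word)")
```
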
New Parallel Corpora of Baltic and Slavic Languages — Assumptions of Corpus Construction

In this article, we describe the design principles of the ten newly published CLARIN-PL corpora of Slavic and Baltic languages. In relation to other non-commercial online corpora, we highlight the distinctive features of these CLARIN-PL corpora: resource selection, preprocessing, manual segmentation at the sentence level, lemmatisation, annotation and metadata. We also present current and planned work on the development of the CLARIN-PL Balto–Slavic corpora.

Maksim Duszkin, Danuta Roszko, Roman Roszko
Use of Augmentation and Distant Supervision for Sentiment Analysis in Russian

In this study, we test several augmentation and distant supervision techniques for enlarging sentiment datasets in Russian. We use a transfer learning approach, pre-training on the additionally created data, to improve performance. We compare our proposed approach based on distant supervision with existing augmentation methods. The best results were achieved using a three-step approach of sequential training on general, thematic and original train samples. Using data automatically annotated with the distant supervision technique, the results improved on the current state-of-the-art methods by more than 3% for most of the benchmarks.

Anton Golubev, Natalia Loukachevitch
RobeCzech: Czech RoBERTa, a Monolingual Contextualized Language Representation Model

We present RobeCzech, a monolingual RoBERTa language representation model trained on Czech data. RoBERTa is a robustly optimized Transformer-based pretraining approach. We show that RobeCzech considerably outperforms equally-sized multilingual and Czech-trained contextualized language representation models, surpasses current state of the art in all five evaluated NLP tasks and reaches state-of-the-art results in four of them. The RobeCzech model is released publicly at https://hdl.handle.net/11234/1-3691 and https://huggingface.co/ufal/robeczech-base .

Milan Straka, Jakub Náplava, Jana Straková, David Samuel
Labelled EPIE: A Dataset for Idiom Sense Disambiguation

Natural Language Understanding has made recent advancements whereby context-aware token representation and word disambiguation have become possible to a large extent. In this scenario, comprehension of phrasal semantics, particularly in the context of multi-word expressions (MWEs) and idioms, is the next task to be addressed. Word-level metaphor detection is unable to handle phrases or MWEs which occur in both literal and idiomatic contexts. State-of-the-art Transformer architectures can be useful in this context, but the absence of a large comprehensive dataset is a bottleneck. In this paper, we present a labelled EPIE dataset containing 3136 occurrences of 358 formal idioms. To prove the efficacy of our dataset, we also train a sequence classification model and perform cross-dataset evaluation on three independent datasets. Our method achieves good results on all datasets, with an F1 score of 96% on our test data, and F1 scores of 82%, 74% and 76% on the SemEval All Words, SemEval Lex Sample, and PIE Corpus datasets respectively.

Prateek Saxena, Soma Paul
Introducing NYTK-NerKor, A Gold Standard Hungarian Named Entity Annotated Corpus

Here we present NYTK-NerKor, a gold standard Hungarian named entity annotated corpus containing 1 million tokens. It is the largest corpus of its kind to date. It contains a balanced text selection from five genres: fiction, legal, news, web, and Wikipedia. A ca. 200,000-token subcorpus contains gold standard morphological annotation besides the NE labels. We provide official train, development and test datasets in a proportion of 80%-10%-10%. All sets provide a balanced selection from all genres and sources, and the morphologically annotated subcorpus is also represented in all sets in a balanced way. The data files are in the CoNLL-U Plus format, in which the NE annotation follows the CoNLL2002 labelling standard, while morphological information is encoded using the well-known Universal Dependencies POS tags and morphosyntactic features. The novelty of NYTK-NerKor as opposed to similar existing corpora is that it is larger by an order of magnitude, is freely available for any purpose, contains text material from different genres and sources, and follows international standards in its format and tagset. The corpus is available under the CC-BY-SA 4.0 license from its GitHub repository: https://github.com/nytud/NYTK-NerKor .

Eszter Simon, Noémi Vadász
Semantic Templates for Generating Long-Form Technical Questions

Question generation (QG) from technical text has multiple important applications, such as the creation of question banks for examinations and interviews, as well as use in intelligent tutoring systems. However, much of the existing work on QG has focused on the open domain and not specifically on technical domains. We propose to generate technical questions using semantic templates. We also focus on ensuring that a large fraction of the generated questions are long-form, i.e., they require longer answers spanning multiple sentences. This is in contrast with existing work, which has predominantly focused on generating factoid questions that have a few words or phrases as answers. Using technical topics selected from undergraduate and graduate-level courses in Computer Science, we show that the proposed approach is able to generate questions with a high acceptance rate. Further, we also show that the proposed template-based approach can be effectively leveraged via the distant supervision paradigm to finetune and significantly improve existing sequence-to-sequence deep learning models for generating long-form technical questions.

Samiran Pal, Avinash Singh, Soham Datta, Sangameshwar Patil, Indrajit Bhattacharya, Girish Palshikar
Rethinking Adversarial Training for Language Adaptation

Recent advances in pre-trained language models revolutionized the field of natural language processing. However, these approaches require large-scale annotated resources that are only available for some languages. Collecting data in every language is unrealistic, hence the growing interest in cross-lingual methods that can leverage the knowledge acquired in one language in different target languages. To address these challenges, Adversarial Training has been successfully employed in a variety of tasks and languages. Empirical analysis for the task of natural language inference suggests that, with the advent of neural language models, more challenging auxiliary tasks should be formulated to further improve the transfer of knowledge via Adversarial Training. We propose alternative formulations for the adversarial component, which we believe to be promising in different cross-lingual scenarios.

Gil Rocha, Henrique Lopes Cardoso
A Corpus with Wavesurfer and TEI: Speech and Video in TEITOK

In this paper, we demonstrate how TEITOK provides a full online interface for speech and even video corpora: they are fully searchable using the CQL query language, can contain all speech-related annotation such as repetitions, gaps, and mispronunciations, and are displayed with time-aligned annotations scrolling below the waveform, showing the video if there is any. Corpora are stored in the TEI/XML standard, with import and export functions for other established standards like ELAN, Praat, or Transcriber. It is even possible to directly annotate corpora in TEITOK.

Maarten Janssen
Exploiting Subjectivity Knowledge Transfer for End-to-End Aspect-Based Sentiment Analysis

While classic aspect-based sentiment analysis typically includes three sub-tasks (aspect extraction, opinion extraction, and aspect-level sentiment classification), recent studies explore possibilities of sharing knowledge from other tasks, such as document-level sentiment analysis or document-level domain classification, that are less demanding on dataset resources. Several recent studies have proposed frameworks for solving nearly complete end-to-end aspect-based sentiment analysis in a unified manner. However, none of them studied the possibility of transferring knowledge about subjectivity or opinion typology between sub-tasks. In this work, we propose subjectivity-aware learning as a novel auxiliary task for aspect-based sentiment analysis. Besides, we also propose another novel task defined as opinion type detection. We performed extensive experiments on a state-of-the-art dataset that show improved model performance when employing subjectivity learning. All models show an improvement in overall F1 score for aspect-based sentiment analysis. In addition, we also set new benchmark results for the separate tasks of subjectivity detection and opinion type detection on the restaurant domain of the SemEval 2015 dataset.

Samuel Pecar, Marian Simko
Towards Personal Data Anonymization for Social Messaging

We present a method for building text corpora for the supervised learning of text-to-text anonymization while maintaining a strict privacy policy. In our solution, personal data entities are detected, classified, and anonymized. We use available machine-learning methods, like named-entity recognition, and improve their performance by grouping multiple entities into larger units based on the theory of tabular data anonymization. Experimental results on annotated Czech Facebook Messenger conversations reveal that our solution has recall comparable to human annotators. On the other hand, precision is much lower because of the low efficiency of the named entity recognition in the domain of social messaging conversations. The resulting anonymized text is of high utility because of the replacement methods that produce natural text.

Ondřej Sotolář, Jaromír Plhák, David Šmahel
ParCzech 3.0: A Large Czech Speech Corpus with Rich Metadata

We present ParCzech 3.0, a speech corpus of Czech parliamentary speeches from the Czech Chamber of Deputies that took place from 25th November 2013 to 1st April 2021. Unlike previous speech corpora of Czech, we preserve not just the orthography but also all the available metadata (speaker identities, gender, web page links, affiliations, committees, political groups, etc.) and complement this with automatic morphological and syntactic annotation and named entity recognition. The corpus is encoded in the TEI format, which allows for straightforward and versatile exploitation. The rich metadata and annotation make the corpus relevant to a wide audience of researchers, ranging from engineers in the speech community to theoretical linguists studying rhetorical patterns at scale.

Matyáš Kopp, Vladislav Stankov, Jan Oldřich Krůza, Pavel Straňák, Ondřej Bojar
Using Zero-Shot Transfer to Initialize azWikiNER, a Gold Standard Named Entity Corpus for the Azerbaijani Language

Named Entity Recognition (NER) is one of the primary fields of Natural Language Processing, focused on analyzing and determining the entities in a given text. In this paper, we present a gold standard named entity dataset for Azerbaijani created from the Azerbaijani portion of WikiAnn, a ‘silver standard’ NER dataset generated from Wikipedia. In a zero-shot cross-lingual transfer scenario, we used an M-BERT-based NER model trained on the English OntoNotes corpus to add new entity types to the corpus. The output of the model was then hand-corrected. We evaluate the accuracy of the original WikiAnn corpus, the zero-shot performance of two models trained on the OntoNotes corpus, and two transformer-based NER models trained on the training part of the final corpus: one based on M-BERT and another based on XLM-RoBERTa. We release the corpus and the trained models to the public.

Kamran Ibiyev, Attila Novak
Using BERT Encoding and Sentence-Level Language Model for Sentence Ordering

Discovering the logical sequence of events is one of the cornerstones of Natural Language Understanding. One approach to learning the sequence of events is to study the order of sentences in a coherent text. Sentence ordering can be applied in various tasks such as retrieval-based Question Answering, document summarization, storytelling, text generation, and dialogue systems. Furthermore, we can learn to model text coherence by learning how to order a set of shuffled sentences. Previous research has relied on RNN, LSTM, and BiLSTM architectures for learning text language models. However, these networks have performed poorly due to the lack of attention mechanisms. We propose an algorithm for sentence ordering in a corpus of short stories. Our proposed method uses a language model based on Universal Transformers (UT) that captures sentences’ dependencies by employing an attention mechanism. Our method improves on the previous state of the art in terms of the Perfect Match Ratio (PMR) score on the ROCStories dataset, a corpus of nearly 100K short human-made stories. The proposed model includes three components: a Sentence Encoder, a Language Model, and Sentence Arrangement with Brute Force Search. The first component generates sentence embeddings using the SBERT-WK pre-trained model fine-tuned on the ROCStories data. Then a Universal Transformer network generates a sentence-level language model. For decoding, the network generates a candidate embedding for the sentence following the current one. We use cosine similarity as a scoring function to score the candidate embedding against the embeddings of the other sentences in the shuffled set. Then a Brute Force Search is employed to maximize the sum of similarities between pairs of consecutive sentences.

Melika Golestani, Seyedeh Zahra Razavi, Zeinab Borhanifard, Farnaz Tahmasebian, Hesham Faili
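
The final search step is straightforward to sketch: given a pairwise score matrix (in the paper, cosine similarity between the language model's predicted-next embedding for sentence i and the embedding of sentence j), brute force enumerates all orderings, which is feasible for five-sentence ROCStories (5! = 120 permutations). The score construction here is a simplified stand-in.

```python
import itertools
import numpy as np

def best_order(score: np.ndarray):
    """Return the permutation maximizing the sum of consecutive-pair scores.
    score[i, j] rates how well sentence j follows sentence i (e.g. cosine
    similarity between an LM-predicted next-sentence embedding for i and
    the actual embedding of j)."""
    n = score.shape[0]
    best_perm, best_total = None, -np.inf
    for perm in itertools.permutations(range(n)):
        total = sum(score[perm[k], perm[k + 1]] for k in range(n - 1))
        if total > best_total:
            best_perm, best_total = perm, total
    return best_perm

# Example with a random 5x5 score matrix (five shuffled sentences):
print(best_order(np.random.rand(5, 5)))
```
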
Using Presentation Slides and Adjacent Utterances for Post-editing of Speech Recognition Results for Meeting Recordings

In recent years, the use of automatic speech recognition (ASR) systems in meetings has been increasing, for purposes such as minutes generation and speaker diarization. The problem is that ASR systems often misrecognize words because meetings contain domain-specific content. In this paper, we propose a novel method for automatically post-editing ASR results using the presentation slides that meeting participants use and the utterances adjacent to a target utterance. We focus on automatic post-editing rather than domain adaptation because of the ease of incorporating external information, and because the method can be used with arbitrary speech recognition engines. In experiments, we found that our method can significantly improve the recognition accuracy of domain-specific words (proper nouns). We also found an improvement in the word error rate (WER).

Kentaro Kamiya, Takuya Kawase, Ryuichiro Higashinaka, Katashi Nagao
Leveraging Inter-step Dependencies for Information Extraction from Procedural Task Instructions

Written instructions are among the most prevalent means of transferring procedural knowledge. Hence, enabling computers to obtain information from textual instructions is crucial for future AI agents. Extracting information from a step of a multi-part instruction is usually performed by considering only the semantic and syntactic information of the step itself. In procedural task instructions, however, there is a sequential dependency across entities throughout the entire task, which would be of value for optimal information extraction. Yet conventional language models such as transformers have difficulty processing long text, i.e., the entire instruction text from the first step to the last one, since their scope of attention is limited to a relatively short chunk of text. As a result, the dependencies among the steps of a longer procedure are often overlooked. This paper suggests a BERT-GRU model for leveraging sequential dependencies among all steps in a procedure. We present experiments on annotated datasets of text instructions in two different domains, i.e., repairing electronics and cooking, showing our model’s advantage compared to standard transformer models. Moreover, we employ a sequence prediction model to show the correlation between the predictability of tags and the performance benefit achieved by leveraging inter-step dependencies.

Nima Nabizadeh, Heiko Wersing, Dorothea Kolossa
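
A plausible reading of the BERT-GRU idea is: encode each instruction step independently with BERT, then run a GRU over the per-step vectors so that predictions can condition on the whole procedure. The sketch below uses per-step classification for brevity; the paper's token-level extraction would attach a richer head, and all names and sizes are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BertGRUTagger(nn.Module):
    """Sketch: BERT encodes each step, a bidirectional GRU over the per-step
    [CLS] vectors carries sequential dependencies across the procedure."""
    def __init__(self, num_labels: int, hidden: int = 256):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.gru = nn.GRU(self.bert.config.hidden_size, hidden,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, step_input_ids, step_attention_mask):
        # step_input_ids: (num_steps, seq_len), one row per instruction step
        cls = self.bert(step_input_ids, attention_mask=step_attention_mask
                        ).last_hidden_state[:, 0]   # (num_steps, bert_dim)
        ctx, _ = self.gru(cls.unsqueeze(0))         # (1, num_steps, 2*hidden)
        return self.classifier(ctx.squeeze(0))      # per-step label logits
```
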

Speech

Frontmatter
DNN-Based Semantic Rescoring Models for Speech Recognition

In this work, we address the problem of improving an automatic speech recognition (ASR) system. We want to efficiently model long-term semantic relations between words and introduce this information through a semantic model. We propose neural network (NN) semantic models for rescoring the N-best hypothesis list. These models use two types of representations as part of the DNN input features: static word embeddings (from word2vec) and dynamic contextual embeddings (from BERT). Semantic information is computed from these representations and used in a hypothesis pair comparison mode. We perform experiments on the publicly available TED-LIUM dataset. Both clean speech and speech mixed with real noise are tested, in line with our industrial project context. The proposed BERT-based rescoring approach gives a significant improvement in word error rate (WER) over the ASR system without rescoring semantic models under all experimented conditions, with both n-gram and recurrent NN language models (Long Short-Term Memory, LSTM).

Irina Illina, Dominique Fohr
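
The hypothesis pair comparison mode can be sketched as a small network that, given the semantic embeddings of two N-best hypotheses, predicts which one is more plausible; the winner of a round-robin over all pairs is selected. The feature extraction and architecture details below are assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class PairComparator(nn.Module):
    """Sketch: an MLP receives the semantic embeddings of two hypotheses
    (e.g. mean word2vec vectors or BERT sentence vectors) and outputs the
    probability that the first is the better transcription."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, emb_a, emb_b):
        return torch.sigmoid(self.net(torch.cat([emb_a, emb_b], dim=-1)))

def rescore(hyp_embeddings, comparator):
    """Round-robin comparison: each hypothesis scores a point for every
    pairwise win; the index of the hypothesis with most wins is returned."""
    n = len(hyp_embeddings)
    wins = [0] * n
    with torch.no_grad():
        for i in range(n):
            for j in range(n):
                if i != j and comparator(hyp_embeddings[i], hyp_embeddings[j]) > 0.5:
                    wins[i] += 1
    return max(range(n), key=wins.__getitem__)
```
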
Identification of Scandinavian Languages from Speech Using Bottleneck Features and X-Vectors

This work deals with identification of the three main Scandinavian languages (Swedish, Danish and Norwegian) from spoken data. For this purpose, various state-of-the-art approaches are adopted, compared and combined, including i-vectors, deep neural networks (DNNs), bottleneck features (BTNs) as well as x-vectors. The best resulting approaches take advantage of multilingual BTNs and allow us to identify the target languages in speech segments lasting 5 s with a very low error rate around 1%. Therefore, they have many practical applications, such as in systems for transcription of Scandinavian TV and radio programs, where different persons speaking any of the target languages may occur. Within identification of Norwegian, we also focus on an unexplored sub-task of distinguishing between Bokmål and Nynorsk. Our results show that this problem is much harder to solve since these two language variants are acoustically very similar to each other: the best error rate achieved in this case is around 20%.

Petr Cerva, Lukas Mateju, Frantisek Kynych, Jindrich Zdansky, Jan Nouza
LSTM-XL: Attention Enhanced Long-Term Memory for LSTM Cells

Long Short-Term Memory (LSTM) cells, frequently used in state-of-the-art language models, struggle with long sequences of inputs. One major problem in their design is that they try to summarize long-term information into a single vector, which is difficult. The attention mechanism aims to alleviate this problem by accumulating the relevant outputs more efficiently. One very successful attention-based model is the Transformer, but it also has issues with long sentences. As a solution, the latest version of the Transformer incorporates recurrence into the model. The success of these recurrent attention-based models inspired us to revise LSTM cells by incorporating the attention mechanism. Our goal is to improve their long-term memory by attending to past outputs. The main advantage of our proposed approach is that it directly accesses the stored preceding vectors, making it more effective for long sentences. Using this method, we can also avoid the undesired resetting of the long-term vector by the forget gate. We evaluated our new cells on two speech recognition tasks and found that it is more beneficial to use attention inside the cells than after them.

Tamás Grósz, Mikko Kurimo
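
To illustrate the general idea of letting an LSTM attend over its own past outputs (rather than the paper's exact LSTM-XL cell), here is a sketch of a recurrent layer that, at every step, computes attention over the stored previous outputs and feeds the attended summary back into the cell input:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveLSTM(nn.Module):
    """Illustrative variant: an LSTM cell whose input is augmented with an
    attention-weighted summary of all of its previous outputs."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim + hidden_dim, hidden_dim)
        self.query = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, inputs):                     # inputs: (seq, batch, dim)
        batch = inputs.size(1)
        h = inputs.new_zeros(batch, self.cell.hidden_size)
        c = inputs.new_zeros(batch, self.cell.hidden_size)
        memory, outputs = [], []
        for x in inputs:
            if memory:                             # attend over past outputs
                past = torch.stack(memory)         # (t, batch, hidden)
                scores = (past * self.query(h)).sum(-1)      # (t, batch)
                attn = F.softmax(scores, dim=0).unsqueeze(-1)
                summary = (attn * past).sum(0)     # (batch, hidden)
            else:
                summary = h
            h, c = self.cell(torch.cat([x, summary], dim=-1), (h, c))
            memory.append(h)
            outputs.append(h)
        return torch.stack(outputs)
```
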
Improving RNN-T ASR Performance with Date-Time and Location Awareness

In this paper, we explore the benefits of incorporating context into a Recurrent Neural Network Transducer (RNN-T) based Automatic Speech Recognition (ASR) model to improve speech recognition for virtual assistants. Specifically, we use meta information extracted from the time at which the utterance is spoken and from approximate location information to make the ASR context aware. We show that these contextual signals, when used individually, improve overall performance by as much as 3.48% relative to the baseline, and when the contexts are combined, the model learns complementary features and recognition improves by 4.62%. On specific domains, these contextual signals show improvements as high as 11.5%, without any significant degradation on others. We ran experiments with models trained on datasets of 30K hours and 10K hours. We show that the scale of improvement with the 10K-hour dataset is much higher than that obtained with the 30K-hour dataset. Our results indicate that with limited data to train the ASR model, contextual signals can improve performance significantly.

Swayambhu Nath Ray, Soumyajit Mitra, Raghavendra Bilgi, Sri Garimella
BrAgriSpeech: A Corpus of Brazilian-Portuguese Agricultural Reported Speech

Agriculture is one of Brazil’s largest industries. In Brazil, the price of crops such as sugarcane is driven not only by production levels but also by speculation and rumour. In addition, some crop derivatives, such as ethanol, have their prices regulated by the government. Reported comments from influential speakers such as government ministers and agricultural-business leaders can impact the prices, and in some cases the level of production, of food products. Currently, there are no corpora in Brazilian Portuguese that contain agriculture-related reported speech together with the speakers and their employers. BrAgriSpeech is a corpus built using linguistic rules and pre-trained models to extract reported speech, the speaker and, where available, the speaker’s employer, as well as the discourse connector that links the speaker with the quote. The resource contains 6982 quotes in JSONL format. A sample of 50 quotes was manually evaluated and showed an accuracy of 0.77 for quote identification, 0.82 for identification of the speaker and 0.87 for identification of the discourse connector. The resource is publicly available to encourage further research in the area.

Brett Drury, Samuel Morais Drury
Exploiting Large-Scale Teacher-Student Training for On-Device Acoustic Models

We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AM), with experiments spanning over 3000 h of GPU time, making our study one of the largest of its kind. We discuss SSL for AMs in a small-footprint setting, showing that a smaller-capacity model trained with 1 million hours of unsupervised data can outperform a baseline supervised system by a 14.3% word error rate reduction (WERR). When the supervised data is increased seven-fold, our gains diminish to 7.1% WERR; to improve SSL efficiency in larger supervised data regimes, we employ a step-wise distillation into a smaller model, obtaining a WERR of 14.4%. We then switch to SSL using larger student models in low data regimes; while learning efficiency with unsupervised data is higher, student models may outperform teacher models in such a setting. We develop a theoretical sketch to explain this behavior.

Jing Liu, Rupak Vignesh Swaminathan, Sree Hari Krishnan Parthasarathi, Chunchuan Lyu, Athanasios Mouchtaris, Siegfried Kunzmann
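
The distillation step can be sketched with the standard soft-label objective, where the student matches the teacher's temperature-smoothed output distribution. This is the generic formulation, not necessarily the exact loss used by the authors.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Soft-label knowledge distillation: KL divergence between the
    temperature-smoothed teacher and student output distributions.
    The T^2 factor keeps gradient magnitudes comparable across temperatures."""
    t = temperature
    log_student = F.log_softmax(student_logits / t, dim=-1)
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```
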
An AI-Based Detection System for Mudrabharati: A Novel Unified Fingerspelling System for Indic Scripts

Sign Language (SL) is a potential tool for communication in the hearing- and speech-impaired community. As individual words cannot be communicated accurately using SL gestures, fingerspelling is adopted to spell out the names of people and places. Due to the rich vocabulary and diversity of Indic scripts, and their abugida nature, which distinguishes them from a prominent world script like the Roman script, it is cumbersome to use the American Sign Language (ASL) convention for fingerspelling in Indian languages. Moreover, with 10 major scripts in India, it would be a futile task to develop a separate fingerspelling convention for each individual Indic script based on the geometry of the characters. In this paper, we propose a novel and unified fingerspelling system known as Mudrabharati for Indic scripts. The gestures of Mudrabharati are constructed based on the phonetics of Indian scripts rather than the geometry of the glyphs that compose the individual characters. Unlike ASL, which utilizes just one hand, Mudrabharati uses both hands - one for consonants and the other for vowels; swarayukta aksharas (consonant-vowel combinations) are gestured using both hands. An Artificial Intelligence (AI) based recognition system for Mudrabharati that returns the character in the Devanagari and Tamil scripts is developed.

F. Amal Jude Ashwin, V. Srinivasa Chakravarthy, Sunil Kumar Kopparapu
Is There Any Additional Information in a Neural Network Trained for Pathological Speech Classification?

Speech is a biomarker extensively explored by the scientific community for different health-care applications because of its reduced cost and non-intrusiveness. Specifically, in Parkinson’s disease, speech signals and deep learning methods have been explored for the automatic assessment and monitoring of patients. Related studies have proven very accurate at discriminating pathological from healthy speech. In spite of the high accuracies observed when detecting the presence of diseases from speech, it is not clear which additional information about the speakers or the environment is implicitly learned by the deep learning systems. This study proposes a methodology to evaluate intermediate representations of a neural network in order to find out which other speaker traits and aspects are learned by the system during the training process. We trained models to detect the presence of Parkinson’s disease from speech. Then, we used intermediate representations of the network to classify additional speaker traits such as gender, age, and native language. Detecting which information is available inside the neural network can help to open the black box and to uncover possible algorithmic biases. The results indicate that the network, in addition to adjusting its parameters for disease classification, also acquires knowledge about the gender of the speakers in the first layers, and about speech tasks and the native language in the last layers of the network.

C. D. Rios-Urrego, J. C. Vásquez-Correa, J. R. Orozco-Arroyave, E. Nöth
On Comparison of XGBoost and Convolutional Neural Networks for Glottal Closure Instant Detection

In this paper, we progress further in the development of an automatic GCI detection model. In previous papers, we compared XGBoost with other supervised learning models as well as with a deep one-dimensional convolutional neural network. Here we aim to compare a deep one-dimensional convolutional neural network, more precisely the InceptionV3 model, with XGBoost and context-aware XGBoost models trained on datasets of the same size. Afterward, we reveal the influence of dataset consistency and size on XGBoost performance. All newly created models are compared on our custom test dataset. On the publicly available databases, XGBoost and context-aware XGBoost with a context of length 7 show similar or better performance than the InceptionV3 model. Also, a more consistent training dataset brings a significant performance improvement in comparison to the older models.

Michal Vraštil, Jindřich Matoušek
Emotional State Modeling for the Assessment of Depression in Parkinson’s Disease

Parkinson’s disease (PD) results from the degeneration of dopaminergic neurons in the substantia nigra, which plays a role in motor control, mood, and cognitive functions. Some processes in the brain of a PD patient can overlap with non-motor functions, where some of the same brain circuitry related to mood regulation is also affected. Commonly, most patients experience motor symptoms such as speech impairments, bradykinesia, or resting tremor, while non-motor symptoms such as sleep disorders or depression may also appear in PD. Depression is one of the most common non-motor symptoms developed by patients and is also associated with the rapid progression of motor impairments. This study proposes the use of the “Pleasure, Arousal, and Dominance Emotional State Model” (PAD) to capture similar aspects related to mood and affective states in PD patients. The PAD representation is commonly used to quantify and represent emotions in a multidimensional space. Acoustic information is used as input to a deep learning model based on convolutional and recurrent neural networks, which is trained to model the PAD representation. The proposed approach consists of transferring knowledge from the PAD model to the classification and assessment of depression in PD. F1-scores of up to 0.69 are obtained for the classification of PD patients vs. healthy controls, and of up to 0.85 for the discrimination between depressive and non-depressive PD patients, which confirms that there is information embedded in the PAD model that can be used to detect depression in PD.

P. A. Pérez-Toro, J. C. Vasquez-Correa, T. Arias-Vergara, P. Klumpp, M. Schuster, E. Nöth, J. R. Orozco-Arroyave
Attention-Based End-to-End Named Entity Recognition from Speech

Named entities are heavily used in the field of spoken language understanding, which uses speech as an input. The standard way of doing named entity recognition from speech involves a pipeline of two systems, where first the automatic speech recognition system generates the transcripts, and then the named entity recognition system produces the named entity tags from the transcripts. In such cases, automatic speech recognition and named entity recognition systems are trained independently, resulting in the automatic speech recognition branch not being optimized for named entity recognition and vice versa. In this paper, we propose two attention-based approaches for extracting named entities from speech in an end-to-end manner, that show promising results. We compare both attention-based approaches on Finnish, Swedish, and English data sets, underlining their strengths and weaknesses.

Dejan Porjazovski, Juho Leinonen, Mikko Kurimo
Incorporation of Iterative Self-supervised Pre-training in the Creation of the ASR System for the Tatar Language

In this paper, we study an iterative self-supervised pre-training procedure for a Tatar speech recognition system. The complete recipe starts from a base pre-trained model (the multilingual XLSR model or the Librispeech (English) Wav2Vec 2.0 Base model); the next step is “source” self-supervised pre-training on collected unlabeled Tatar data (mostly broadcast audio); the resulting model is then used for additional “target” self-supervised pre-training on the annotated corpus (target domain, without using the labels); and the final step is fine-tuning the model on the annotated corpus with labels. To conduct the experiments, we prepared a 328-h unlabeled and a 129-h annotated audio corpus. Experiments on three datasets (two proprietary, plus the publicly available Common Voice as the third) showed that the first “source” pre-training step allows ASR models to achieve on average 24.3% lower WER, and both source and target pre-training 33.3% lower WER, than a simply fine-tuned base model. The resulting WER is 5.37% on the Common Voice (read speech) test dataset, 4.65% on the private TatarCorpus (clean read speech), and 22.6% on the spontaneous speech dataset collected from TV shows; all are the best published results on these datasets. Additionally, we show that using a multilingual base model can be beneficial in the fine-tuning-only case (10.5% lower WER), but applying the self-supervised pre-training steps eliminates this difference.

Aidar Khusainov, Dzhavdet Suleymanov, Ilnur Muhametzyanov
Speakers Talking Foreign Languages in a Multi-lingual TTS System

This paper presents experiments with a multi-lingual multi-speaker TTS synthesis system jointly trained on English, German, Russian, and Czech speech data. The experimental LSTM-based TTS system with a trainable neural vocoder utilizes the International Phonetic Alphabet (IPA), which allows a straightforward combination of different languages. We analyzed whether the joint model is capable of generalizing and mixing the information contained in the training data, and whether particular voices can be used for the synthesis of different languages, including language-specific phonemes. The intelligibility of the generated speech was assessed by SUS (Semantically Unpredictable Sentences) listening tests containing Czech sentences spoken by non-Czech speakers. The performance of the joint multi-lingual model was also compared with independent single-voice models where the missing non-native phonemes were mapped to the most similar native phonemes. Besides the Czech sentences, the preference test also contained English sentences spoken by Czech voices. The multi-lingual model was preferred for all evaluated voices. Although the generated speech did not sound like a native speaker, the phonetic and prosodic features were definitely better.

Zdeněk Hanzlíček, Jakub Vít, Markéta Řezáčková
Voice Activity Detection for Ultrasound-Based Silent Speech Interfaces Using Convolutional Neural Networks

Voice Activity Detection (VAD) is not an easy task when the input audio signal is noisy, and it is even more complicated when the input is not even an audio recording. This is the case with Silent Speech Interfaces (SSI), where we record the movement of the articulatory organs during speech and aim to reconstruct the speech signal from this recording. Our SSI system synthesizes speech from ultrasonic videos of tongue movement, and the quality of the resulting speech signals is evaluated by metrics such as the mean squared error loss function of the underlying neural network and the Mel-Cepstral Distortion (MCD) of the reconstructed speech compared to the original. Here, we first demonstrate that the amount of silence in the training data can influence both the MCD evaluation metric and the performance of the neural network model. Then, we train a convolutional neural network classifier to separate silent and speech-containing ultrasound tongue images, using a conventional VAD algorithm to create the training labels from the corresponding speech signal. In the experiments, our ultrasound-based speech/silence separator achieved a classification accuracy of about 85% and an AUC score of around 86%.

Amin Honarmandi Shandiz, László Tóth
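
The labeling step can be approximated with even a very simple energy-based VAD run over the parallel audio track, producing one speech/silence target per ultrasound video frame. The paper uses a conventional VAD algorithm, so the thresholding below is only an illustrative stand-in.

```python
import numpy as np

def energy_vad_labels(signal, sample_rate, frame_rate, threshold_db=-40.0):
    """Derive per-video-frame speech/silence labels from the audio track:
    one RMS-energy measurement per ultrasound frame, thresholded in dB
    relative to the peak level (threshold value is an assumption)."""
    hop = int(sample_rate / frame_rate)          # audio samples per video frame
    n_frames = len(signal) // hop
    frames = signal[:n_frames * hop].reshape(n_frames, hop)
    rms = np.sqrt((frames.astype(np.float64) ** 2).mean(axis=1)) + 1e-12
    level_db = 20 * np.log10(rms / (np.abs(signal).max() + 1e-12))
    return (level_db > threshold_db).astype(int)  # 1 = speech, 0 = silence
```
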
How Much End-to-End is Tacotron 2 End-to-End TTS System

In recent years, the concept of end-to-end text-to-speech synthesis has begun to attract the attention of researchers. The motivation is simple – replacing the individual modules that TTS systems were traditionally built on with a powerful deep neural network simplifies the architecture of the entire system. However, how capable are such end-to-end systems of dealing with classic tasks such as G2P, text normalisation, homograph disambiguation and other issues inseparably linked to text-to-speech systems? In the present paper, we explore three free implementations of Tacotron 2-based speech synthesizers, focusing on their ability to transform the input text into the correct pronunciation, not only in terms of G2P conversion but also in handling issues related to text analysis and the prosodic patterns used.

Daniel Tihelka, Jindřich Matoušek, Alice Tihelková
CNN-TDNN-Based Architecture for Speech Recognition Using Grapheme Models in Bilingual Czech-Slovak Task

The Czech and Slovak languages are very similar, not only in writing but also in phonetic form. This work aims to find a suitable combination of these two languages that yields better recognition results. We demonstrate this contribution on the Malach project. The Malach speech of Holocaust survivors is highly emotional, filled with many disfluencies, heavy accents, age-related coarticulation, and many non-speech events. Due to the nature of the corpus, it is very difficult to find other appropriate data for acoustic modeling, so such a combination can significantly increase the amount of training data. We discuss the differences between the phoneme and grapheme ways of combining Czech with Slovak. We also compare different architectures of deep neural networks (TDNN, TDNNF, CNN-TDNNF) and tune the optimal topology. The proposed bilingual ASR approach provides a slight improvement over monolingual ASR systems, not only at the phoneme level but also at the grapheme level.

Josef V. Psutka, Jan Švec, Aleš Pražák

Dialogue

Frontmatter
A Multimodal Model for Predicting Conversational Feedbacks

In this paper, we propose a statistical model for predicting listener feedback in a conversation. The first contribution of the paper is a study of the prediction of all feedback, including feedback that overlaps with the speaker, with good accuracy. Existing models are good at predicting feedback during a pause but reach a very low success level for all feedback. This paper takes a first step towards this complex problem. The second contribution is a model that precisely predicts the type of feedback (generic vs. specific) as well as other specific features (valence, expectation), useful in particular for generating feedback in dialogue systems. This work relies on an original corpus.

Auriane Boudin, Roxane Bertrand, Stéphane Rauzy, Magalie Ochs, Philippe Blache
Estimating Social Distance Between Interlocutors with MFCC-Based Acoustic Models for Vowels

The present study is devoted to measuring speech entrainment between interlocutors of varying social distance, based on the characteristics of their vowels. Five degrees of social distance were taken into consideration: siblings, friends, strangers of the same and opposite gender, and strangers of significantly different age and social status. Speaker-dependent acoustic models of the cardinal Russian vowels /i/, /a/, and /u/ were constructed and compared. We hypothesized that entrainment would be strongest between siblings and decrease with increasing social distance. However, it was found that while entrainment is indeed strong for siblings, friends actually show less entrainment than strangers. Same-gender pairs showed stronger entrainment than opposite-gender pairs. Entrainment was also found to be vowel-dependent, with /a/ exhibiting the most variation.

Pavel Kholiavin, Alla Menshikova, Tatiana Kachkovskaia, Daniil Kocharov
Remote Learning of Speaking in Syntactic Forms with Robot-Avatar-Assisted Language Learning System

To help second language (L2) learners acquire oral communication skills, dialogue-based computer-assisted language learning (DB-CALL) systems are attracting more interest than ever. When robot-assisted language learning (RALL) is used for realizing such systems, L2 learners are provided with a sense of reality and tension similar to that in a real L2 conversation. At the same time, there are increasing demands for remote learning, accelerated in part by the spread of the novel coronavirus. We have therefore developed a robot-avatar-assisted language learning system that simulates a trialogue in English with two robot avatars and a learner for remote learning. The conversation scenarios deal with various daily topics to keep the learner’s interest and the system prompts the learner to acquire oral skills by using specific syntactic forms in conversation. We conducted a six-day remote learning experiment with ten Japanese university students to evaluate the learning effect, using eye gaze as an index of the learners’ degree of concentration. Our findings demonstrated the effectiveness of our system for remote learning and showed that the learners’ eye gaze activities changed between question answering and repeating tasks.

Taisei Najima, Tsuneo Kato, Akihiro Tamura, Seiichi Yamamoto
Backmatter
Metadata
Title
Text, Speech, and Dialogue
Editors
Kamil Ekštein
František Pártl
Miloslav Konopík
Copyright Year
2021
Electronic ISBN
978-3-030-83527-9
Print ISBN
978-3-030-83526-2
DOI
https://doi.org/10.1007/978-3-030-83527-9
