2007 | Book

Text, Speech and Dialogue

10th International Conference, TSD 2007, Pilsen, Czech Republic, September 3-7, 2007. Proceedings

Edited by: Václav Matoušek, Pavel Mautner

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

Table of Contents

Frontmatter

Invited Talks

Language Modeling with Linguistic Cluster Constraints

In the past, Maximum Entropy based language models were constrained by training data n-gram counts, topic estimates, and triggers. We will investigate the obtainable gains from imposing additional constraints related to linguistic clusters, such as parts of speech, semantic/syntactic word clusters, and semantic labels. It will be shown that substantial profit is available, provided the estimates use Gaussian a priori statistics.

Frederick Jelinek, Jia Cui
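
As a generic illustration of the model family this abstract refers to (a sketch of the standard formulation, not the authors' exact feature set or prior variances), a maximum-entropy language model with cluster-constraint features and a Gaussian prior on the weights can be written as:

```latex
% Maximum-entropy LM with constraint features f_i (n-grams, topics, linguistic clusters)
% and a Gaussian prior on the weights; the variances \sigma_i^2 are assumed hyperparameters.
\begin{align}
  P_\Lambda(w \mid h) &= \frac{1}{Z_\Lambda(h)} \exp\Big(\sum_i \lambda_i f_i(w, h)\Big),
  \qquad Z_\Lambda(h) = \sum_{w'} \exp\Big(\sum_i \lambda_i f_i(w', h)\Big) \\
  \Lambda^* &= \arg\max_\Lambda \sum_{(w,h)} \log P_\Lambda(w \mid h)
  \;-\; \sum_i \frac{\lambda_i^2}{2\sigma_i^2}
\end{align}
```
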
Some of Our Best Friends Are Statisticians

In his LREC 2004 invited talk, given when he was awarded the first-ever Antonio Zampolli Prize for his essential contributions to the use of spoken and written language resources, Frederick Jelinek used the title “Some of My Best Friends Are Linguists”. He did so for many reasons, one of them being that he wanted to dispel the perception that he dislikes linguists and linguistics, after so many people had cited his famous line from an old presentation at a Natural Language Processing Evaluation workshop in 1988, in which he said “Whenever I fire a linguist our system performance improves.”

Jan Hajič, Eva Hajičová
Some Special Problems of Speech Communication

We start with a brief overview of our work in speech recognition and understanding, which led from monomodal (speech only) human-machine dialog to multimodal human-machine interaction and assistance. Our work in speech communication initially had the goal of developing a complete system for question answering by spoken dialog [7,15]. This goal was achieved in various projects funded by the German Research Foundation [14] and the German Federal Ministry of Education and Research [16]. Problems of multilingual communication were considered in projects supported by the European Union [2,4,10]. In the Verbmobil project the speech-to-speech translation problem was investigated, and it turned out that prosody and the recognition of emotion were important and extremely useful – if not indispensable – to disambiguate utterances and to influence the dialog strategy [3,17]. Multimodal and multimedia aspects of human-machine communication became a topic in the follow-up projects Embassi [11], SmartKom [1], FORSIP [12], and SmartWeb [9].

The SmartWeb project [19], which involves 17 partners from companies, research institutes, and universities, has the general goal of providing the foundations for multimodal human-machine communication with distributed semantic web services using different mobile devices – hand-held, mounted in a car, or mounted on a motorcycle. It uses speech and video signals as well as signals from other sensors, e.g. ECG or skin resistance. A special problem in human-machine interaction and assistance is the question of whether the user is speaking to the machine or not, that is, the distinction of on- and off-talk. It is shown how on-/off-talk can be classified by a combination of prosodic and image features. Using additional sensors, the user state in general is estimated to give further cues for dialog control. This may be used, for example, to avoid input from the dialog system in a situation where a driver is under stress.

In other projects the special problem of children’s speech processing was considered [20]. Among others, it was investigated whether a manual correction of automatically computed fundamental frequency F0 and word boundaries might have a positive effect on the automatic classification of the four classes anger, motherese, emphatic, and neutral; this was not the case, leading to the conclusion that presently there is no need for improved F0 algorithms in emotion recognition. The word accuracy (WA) of native and non-native English-speaking children was investigated; it was shown that non-native speakers (age 10 – 15) achieve about the same WA as children aged 6 – 7 using a speech recognizer trained with native children’s speech. The recognizer was also used to develop an automatic scoring of the pronunciation quality of children learning English.

A special problem is posed by speech impairments, which may be congenital (e.g. cleft lip and palate) or acquired through disease (e.g. cancer of the larynx). Impairments are, among others, treated with speech training by speech therapists, who score the speech quality subjectively according to various criteria. The idea is that the WA of an automatic speech recognizer should be highly correlated with the human rating. Using speech samples from laryngectomees, it is shown that the machine rating is about as good as the rating of five human experts and can also be done via telephone. This opens up the possibility of an objective and standardized rating of speech quality.

Heinrich Niemann
Recent Advances in Spoken Language Understanding

This presentation will review the state of the art in spoken language understanding. After a brief introduction to conceptual structures, early approaches to spoken language understanding (SLU) followed in the seventies are described. They are based on augmented grammars and non-stochastic parsers for interpretation.

Renato De Mori

Text

Transformation-Based Tectogrammatical Dependency Analysis of English

We present experiments with automatic annotation of English texts, taken from the Penn Treebank, at the dependency-based tectogrammatical layer, as it is defined in the Prague Dependency Treebank. The proposed analyzer, which is based on machine-learning techniques, outperforms the hand-written-rule tool currently used for partial tectogrammatical annotation of English in the most important characteristics of tectogrammatical annotation. Moreover, the two tools were combined, and their combination gives the best results.

Václav Klimeš
Multilingual Name Disambiguation with Semantic Information

This paper studies the problem of name ambiguity, which concerns the discovery of the different underlying meanings behind a name. We have developed a semantic approach on the basis of which a graph-based clustering algorithm determines the sets of semantically related sentences that talk about the same name. Our approach is evaluated on Bulgarian, Romanian, Spanish and English for various pairs of city, country, person and organization names. The results significantly outperform a majority-based classifier and are compared to a bigram co-occurrence approach.

Zornitsa Kozareva, Sonia Vàzquez, Andrés Montoyo
Inducing Classes of Terms from Text

This paper describes a clustering method for organizing a list of terms into semantic classes. The experiments were made using a POS-annotated corpus, the ACL Anthology, which consists of technical articles in the field of Computational Linguistics. The method, mainly based on some assumptions of Formal Concept Analysis, consists in building bi-dimensional clusters of both terms and their lexico-syntactic contexts. Each generated cluster is defined as a semantic class with a set of terms describing the extension of the class and a set of contexts perceived as the intensional attributes (or properties) valid for all the terms in the extension. The clustering process relies on two restrictive operations: abstraction and specification. The result is a concept lattice that describes a domain-specific ontology of terms.

Pablo Gamallo, Gabriel P. Lopes, Alexandre Agustini
Accurate Unlexicalized Parsing for Modern Hebrew

Many state-of-the-art statistical parsers for English can be viewed as Probabilistic Context-Free Grammars (PCFGs) acquired from treebanks consisting of phrase-structure trees enriched with a variety of contextual, derivational (e.g., markovization) and lexical information. In this paper we empirically investigate the applicability and adequacy of the unlexicalized variety of such parsing models to Modern Hebrew, a Semitic language that differs in structure and characteristics from English. We show that contrary to experience with parsing the WSJ, the markovized, head-driven unlexicalized variety does not necessarily outperform plain PCFGs for Semitic languages. We demonstrate that enriching unlexicalized PCFGs with morphologically marked agreement features percolated up the parse tree (e.g., definiteness) outperforms plain PCFGs as well as a simple head-driven variation on the MH treebank. We further show that an (unlexicalized) head-driven variety enriched with the same features achieves even better performance. We conclude that morphologically rich languages introduce an additional dimension of parametrization that is orthogonal to the horizontal/vertical dimensions proposed before [1] and its contribution is essential and complementary.

Reut Tsarfaty, Khalil Sima’an
Disambiguation of the Neuter Pronoun and Its Effect on Pronominal Coreference Resolution

Coreference resolution, determining the appropriate discourse referent for an anaphoric expression, is an essential but difficult task in natural language processing. It has been observed that an important source of errors in machine-learning based approaches to this task is the wrong disambiguation of the third person singular neuter pronoun as either referential or non-referential. In this paper, we investigate whether a machine learning based approach can be successfully applied to the disambiguation of the neuter pronoun in Dutch and show a modest potential effect of this disambiguation on the results of a machine learning based coreference resolution system for Dutch.

Véronique Hoste, Iris Hendrickx, Walter Daelemans
Constructing a Large Scale Text Corpus Based on the Grid and Trustworthiness

The construction of a large-scale corpus is a hard task. We design a novel approach to automatically building a large-scale text corpus with low cost and a short building period, based on trustworthiness. It mainly solves two problems: how to automatically build a large-scale text corpus from the Web and how to correct mistakes in the corpus. As the Grid provides the infrastructure for processing large-scale data, our approach uses the Grid to collect and process language materials from the Web in the first stage. It then picks out untrustworthy language materials in the corpus according to their trustworthiness and has them checked manually by users. After the check finishes, our approach computes the trustworthiness of each checked result and selects those with the highest trustworthiness as the correct results.

Peifeng Li, Qiaoming Zhu, Peide Qian, Geoffrey C. Fox
Disambiguating Hypernym Relations for Roget’s Thesaurus

Roget’s Thesaurus is a lexical resource which groups terms by semantic relatedness. It is Roget’s shortcoming that the relations are ambiguous, in that it does not name them; it only shows that there is a relation between terms. Our work focuses on disambiguating hypernym relations within Roget’s Thesaurus. Several techniques of identifying hypernym relations are compared and contrasted in this paper, and a total of over 50,000 hypernym relations have been disambiguated within Roget’s. Human judges have evaluated the quality of our disambiguation techniques, and we have demonstrated on several applications the usefulness of the disambiguated relations.

Alistair Kennedy, Stanisław Szpakowicz
Dependency and Phrasal Parsers of the Czech Language: A Comparison

In the paper, we present the results of an experiment comparing the effectiveness of real-text parsers of the Czech language based on completely different approaches – stochastic parsers that provide dependency trees as their outputs and a meta-grammar parser that generates a resulting chart structure representing a packed forest of phrasal derivation trees.

We describe and formulate the main questions and problems accompanying such an experiment, try to offer answers to these questions, and finally also present the factual results of the tests, measured on 10 thousand Czech sentences.

Aleš Horák, Tomáš Holan, Vladimír Kadlec, Vojtěch Kovář
Automatic Word Clustering in Russian Texts

The paper deals with the development and application of an automatic word clustering (AWC) tool aimed at processing Russian texts of various types, which should satisfy the requirements of flexibility and compatibility with other linguistic resources. The construction of the AWC tool requires a computer implementation of latent semantic analysis (LSA) combined with clustering algorithms. To meet this need, Python-based software has been developed. The major procedures performed by the AWC tool are segmentation of input texts and context analysis, co-occurrence matrix construction, and agglomerative and K-means clustering. Special attention is drawn to experimental results on clustering words in raw texts with changing parameters.

Olga Mitrofanova, Anton Mukhin, Polina Panicheva, Vyacheslav Savitsky
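
A minimal sketch of the pipeline this abstract outlines – co-occurrence matrix, LSA via truncated SVD, then K-means – assuming a pre-built count matrix; corpus segmentation and the agglomerative variant are omitted, and the parameter values are illustrative only.

```python
# LSA + K-means word clustering sketch (assumed inputs; not the authors' actual tool).
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

def cluster_words(cooc, vocab, n_dims=100, n_clusters=20):
    """cooc: (|vocabulary| x |contexts|) co-occurrence count matrix; vocab: list of words."""
    # LSA: project words into a low-dimensional latent semantic space.
    word_vectors = TruncatedSVD(n_components=n_dims, random_state=0).fit_transform(cooc)
    # K-means clustering of the latent word vectors.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(word_vectors)
    return {word: int(label) for word, label in zip(vocab, labels)}
```
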
Feature Engineering in Maximum Spanning Tree Dependency Parser

In this paper we present the results of our experiments with modifications of the feature set used in the Czech mutation of the Maximum Spanning Tree parser. First, we show how new feature templates improve the parsing accuracy, and second, we decrease the dimensionality of the feature space to make the parsing process more efficient without sacrificing accuracy.

Václav Novák, Zdeněk Žabokrtský
Automatic Selection of Heterogeneous Syntactic Features in Semantic Similarity of Polish Nouns

We present experiments with a variety of corpus-based measures applied to the problem of constructing semantic similarity functions for Polish nouns. Rich inflection in Polish allows us to acquire useful syntactic features without parsing; morphosyntactic restrictions checked in a large enough window provide sufficiently useful data. A novel feature selection method gives an accuracy of 86% on the WordNet-based synonymy test, an improvement of 5% over the previous results.

Maciej Piasecki, Stanisław Szpakowicz, Bartosz Broda
Bilingual News Clustering Using Named Entities and Fuzzy Similarity

This paper focuses on discovering bilingual news clusters in a comparable corpus. In particular, we deal with the news representation and with the calculation of the similarity between documents. As representative features of the news we use the cognate named entities they contain. One of our main goals is to test whether the use of only named entities is a good source of knowledge for multilingual news clustering. In the vectorial news representation we take into account the category of the named entities. In order to determine the similarity between two documents, we propose a new approach based on a fuzzy system whose knowledge base tries to incorporate human knowledge about the importance of the named entities’ category in the news. We have compared our approach with a traditional one, obtaining better results on a comparable corpus with news in Spanish and English.

Soto Montalvo, Raquel Martínez, Arantza Casillas, Víctor Fresno
Extractive Summarization of Broadcast News: Comparing Strategies for European Portuguese

This paper presents the comparison between three methods for extractive summarization of Portuguese broadcast news: feature-based, Maximal Marginal Relevance, and Latent Semantic Analysis. The main goal is to understand the level of agreement among the automatic summaries and how they compare to summaries produced by non-professional human summarizers. Results were evaluated using the ROUGE-L metric. Maximal Marginal Relevance performed close to human summarizers. Both feature-based and Latent Semantic Analysis automatic summarizers performed close to each other and worse than Maximal Marginal Relevance, when compared to the summaries done by the human summarizers.

Ricardo Ribeiro, David Martins de Matos
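
For reference, the ROUGE-L metric used in this evaluation scores a candidate summary against a reference by their longest common subsequence of words. The sketch below is a simplified single-reference version; tokenization and the exact scoring configuration used in the paper are assumptions.

```python
# Simplified ROUGE-L (longest-common-subsequence based) score for one candidate/reference pair.
def lcs_length(a, b):
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.0):
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```
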
On the Evaluation of Korean WordNet

WordNet has become an important and useful resource for the natural language processing field. Recently, many countries have been developing their own WordNets. In this paper we present an evaluation of the Korean WordNet (U-WIN). The purpose of the work is to study how well the manually created lexical taxonomy U-WIN is built. The evaluation is done level by level, and words are selected from each level because we want to compare the levels and find relations between them. As a result, the words at a certain level (level 6) give the best score, from which we can conclude that the words at this level are better organized than those at other levels. The score decreases as the level goes up or down from this particular level.

Altangerel Chagnaa, Ho-Seop Choe, Cheol-Young Ock, Hwa-Mook Yoon
An Adaptive Keyboard with Personalized Language-Based Features

Our research concerns an adaptive keyboard which autonomously adjusts its predictive features and key displays to the current user input. We used personalized word prediction to improve the performance of such a system. Prediction using a common English dictionary (represented by the British National Corpus) is compared with prediction using personal data, such as personal documents, chat logs, and personal emails. A user study was also conducted to gather requirements for a new keyboard design. Based on these studies, we developed a personalized and adaptive on-screen keyboard for both single-handed and zero-handed users. It combines tapping-based and motion-based text input with language-based acceleration techniques, including a personalized and adaptive task-based dictionary, frequent-character prompting, word completion, and a grammar checker with suffix completion.

Siska Fitrianie, Leon J. M. Rothkrantz
An All-Path Parsing Algorithm for Constraint-Based Dependency Grammars of CF-Power

An all-path parsing algorithm for a constraint-based dependency grammar of context-free power is presented. The grammar specifies possible dependencies between words together with a number of constraints. The algorithm builds a packed representation of ambiguous syntactic structure in the form of a dependency graph. For certain types of ambiguities the graph grows slower than the chart or parse forest.

Tomasz Obrêbski
Word Distribution Based Methods for Minimizing Segment Overlaps

Dividing coherent text into a sequence of coherent segments is a challenging task, since different topics/subtopics are often related to common themes. Based on lexical cohesion, we can keep track of words and their repetitions and break text into segments at points where the lexical chains are weak. However, there exist words that are more or less evenly distributed across a document (called document-dependent or distributional stopwords), making it difficult to separate one segment from another. To minimize the overlaps between segments, we propose two new measures for removing distributional stopwords based on word distribution. Our experimental results show that the new measures are both efficient to compute and effective for improving the segmentation performance on expository text and transcribed lecture text.

Joe Vasak, Fei Song
On the Relative Hardness of Clustering Corpora

Clustering is often considered the most important unsupervised learning problem, and several clustering algorithms have been proposed over the years. Many of these algorithms have been tested on classical clustering corpora such as Reuters and 20 Newsgroups in order to determine their quality. However, up to now the relative hardness of those corpora has not been determined. The relative clustering hardness of a given corpus may be of high interest, since it would help to determine whether the usual corpora used to benchmark clustering algorithms are hard enough. Moreover, if it is possible to find a set of features involved in the hardness of the clustering task itself, specific clustering techniques may be used instead of general ones in order to improve the quality of the obtained clusters. In this paper, we present a study of one specific feature, the vocabulary overlap among documents of a given corpus. Our preliminary experiments were carried out on three different corpora: the train and test versions of the R8 subset of the Reuters collection and a reduced version of the 20 Newsgroups (Mini20Newsgroups). We found that a possible relation between the vocabulary overlap and the F-measure may exist.

David Pinto, Paolo Rosso
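
One simple way to quantify the vocabulary overlap studied here is the average pairwise Jaccard overlap between document vocabularies; the sketch below is an illustrative measure, not necessarily the statistic the authors compute.

```python
# Average pairwise vocabulary (Jaccard) overlap of a document collection -- illustrative only.
from itertools import combinations

def average_vocabulary_overlap(documents):
    vocabularies = [set(doc.lower().split()) for doc in documents]
    pairs = list(combinations(vocabularies, 2))
    if not pairs:
        return 0.0
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)
```
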
Indexing and Retrieval Scheme for Content-Based Multimedia Applications

The rapid increase in the amount of digital audio collections demands a generic framework for robust and efficient indexing and retrieval based on aural content. In this paper we focus our efforts on developing a generic and robust audio-based multimedia indexing and retrieval framework. First, an overview of audio indexing and retrieval schemes with their major limitations and drawbacks is presented. Then the basic innovative properties of the proposed method are justified accordingly. Finally, the experimental results and concluding remarks about the proposed scheme are reported.

Martynov Dmitry, Eugenij Bovbel
Automatic Diacritic Restoration for Resource-Scarce Languages

The orthography of many resource-scarce languages includes diacritically marked characters. Falling outside the scope of the standard Latin encoding, these characters are often represented in digital language resources as their unmarked equivalents. This renders corpus compilation more difficult, as these languages typically do not have the benefit of large electronic dictionaries to perform diacritic restoration. This paper describes experiments with a machine learning approach that is able to automatically restore diacritics on the basis of local graphemic context. We apply the method to the African languages of Cilubà, Gĩkũyũ, Kĩkamba, Maa, Sesotho sa Leboa, Tshivenda and Yoruba and contrast it with experiments on Czech, Dutch, French, German and Romanian, as well as Vietnamese and Chinese Pinyin.

Guy De Pauw, Peter W. Wagacha, Gilles-Maurice de Schryver
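
A hedged sketch of the general approach described: classify each unmarked character from its local graphemic window. The window size, the learner (a decision tree here, standing in for whatever classifier the authors used) and the strip function are assumptions.

```python
# Diacritic restoration from local graphemic context (sketch; learner and window size assumed).
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def character_windows(text, size=3):
    padded = "_" * size + text + "_" * size
    for i in range(len(text)):
        center = i + size
        yield {f"c{k}": padded[center + k] for k in range(-size, size + 1)}

def train_restorer(marked_texts, strip_diacritics):
    """strip_diacritics: maps a diacritically marked character to its unmarked equivalent."""
    X, y = [], []
    for text in marked_texts:
        unmarked = "".join(strip_diacritics(c) for c in text)
        for features, marked_char in zip(character_windows(unmarked), text):
            X.append(features)
            y.append(marked_char)   # predict the marked form of the centre character
    return make_pipeline(DictVectorizer(), DecisionTreeClassifier()).fit(X, y)
```
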
Lexical and Perceptual Grounding of a Sound Ontology

Sound ontologies need to incorporate source-unidentifiable sounds in an adequate and consistent manner. Computational lexical resources like WordNet have either inserted descriptions of these sounds into conceptual categories or made no attempt to organize the terms for them. This work attempts to add structure to linguistic terms for source-unidentifiable sounds. Through an analysis of WordNet and a psycho-acoustic experiment, we make some preliminary proposals about which features are highly salient for sound classification. This work is essential for interfacing between source-unidentifiable sounds and linguistic descriptions of those sounds in computational applications, such as the Semantic Web and robotics.

Anna Lobanova, Jennifer Spenader, Bea Valkenier
Named Entities in Czech: Annotating Data and Developing NE Tagger

This paper deals with the treatment of Named Entities (NEs) in Czech. We introduce a two-level NE classification. We have used this classification for manual annotation of two thousand sentences, gaining more than 11,000 NE instances. Employing the annotated data and Machine-Learning techniques (namely the top-down induction of decision trees), we have developed and evaluated a software system aimed at automatic detection and classification of NEs in Czech texts.

Magda Ševčíková, Zdeněk Žabokrtský, Oldřich Krůza
Identifying Expressions of Emotion in Text

Finding emotions in text is an area of research with wide-ranging applications. We describe an emotion annotation task of identifying emotion category, emotion intensity and the words/phrases that indicate emotion in text. We introduce the annotation scheme and present results of an annotation agreement study on a corpus of blog posts. The average inter-annotator agreement on labeling a sentence as emotion or non-emotion was 0.76. The agreement on emotion categories was in the range 0.6 to 0.79; for emotion indicators, it was 0.66. Preliminary results of emotion classification experiments show an accuracy of 73.89%, significantly above the baseline.

Saima Aman, Stan Szpakowicz
ECAF: Authoring Language for Embodied Conversational Agents

The Embodied Conversational Agent (ECA) is a user interface metaphor that allows information to be communicated naturally during human-computer interaction in synergic modality dimensions, including voice, gesture, emotion, text, etc. Due to their anthropomorphic representation and ability to express human-like behavior, ECAs are becoming popular interface front-ends for dialog and conversational applications. One important prerequisite for efficient authoring of such ECA-based applications is the existence of a suitable programming language that exploits the expressive possibilities of multimodally blended messages conveyed to the user. In this paper, we present the ECAF architecture and interaction language, which we used for authoring several ECA-based applications. We also provide feedback from the usability testing we carried out on user acceptance of several multimodal blending strategies.

Ladislav Kunc, Jan Kleindienst

Speech

Dynamic Adaptation of Language Models in Speech Driven Information Retrieval

This paper reports on the evaluation of a system that allows the use of spoken queries to retrieve information from a textual document collection. First, a large vocabulary continuous speech recognizer transcribes the spoken query into text. Then, an information retrieval engine retrieves the documents relevant to that query. The system works for the Spanish language. In order to increase performance, we propose a two-pass approach based on dynamic adaptation of language models. The system was evaluated using a standard IR test suite from CLEF. Spoken queries were recorded by 10 different speakers. Results showed that the proposed approach outperforms the baseline system: a relative gain in retrieval precision of 5.74%, with a language model of 60,000 words.

César González-Ferreras, Valentín Cardeñoso-Payo
Whitening-Based Feature Space Transformations in a Speech Impediment Therapy System

It is quite common to use feature extraction methods prior to classification. Here we deal with three algorithms defining uncorrelated features. The first one is the so-called whitening method, which transforms the data so that the covariance matrix becomes an identity matrix. The second method, the well-known Fast Independent Component Analysis (FastICA) searches for orthogonal directions along which the value of the non-Gaussianity measure is large in the whitened data space. The third one, the Whitening-based Springy Discriminant Analysis (WSDA) is a novel method combination, which provides orthogonal directions for better class separation. We compare the effects of the above methods on a real-time vowel classification task. Based on the results we conclude that the WSDA transformation is especially suitable for this task.

András Kocsor, Róbert Busa-Fekete, András Bánhalmi
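
The whitening step mentioned above has a direct formulation: transform the data with the inverse square root of its covariance so that the transformed covariance becomes the identity. A small numpy sketch of just this step follows (FastICA and WSDA are not reproduced here).

```python
# Whitening: linearly transform X so that its covariance matrix becomes (approximately) the identity.
import numpy as np

def whiten(X, eps=1e-10):
    """X: (n_samples, n_features). Returns the whitened data and the whitening matrix."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T   # cov^(-1/2)
    return X_centered @ W, W
```
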
Spanish-Basque Parallel Corpus Structure: Linguistic Annotations and Translation Units

In this paper we propose a corpus structure which represents and manages an aligned parallel corpus. The corpus structure is based on a stand-off annotation model, which is composed of several XML documents. A bilingual parallel corpus represented in the proposed structure will contain: (1) the entire corpus together with its corresponding linguistic information, and (2) translation units and alignment relations between units of the two languages: paragraphs, sentences and named entities. The proposed structure makes it possible to work with the corpus both as a corpus annotated with linguistic information and as a translation memory.

A. Casillas, A. Díaz de Illarraza, J. Igartua, R. Martínez, K. Sarasola, A. Sologaistoa
An Automatic Version of the Post-Laryngectomy Telephone Test

Tracheoesophageal (TE) speech is a possibility to restore the ability to speak after total laryngectomy, i.e. the removal of the larynx. The quality of the substitute voice has to be evaluated during therapy. For the intelligibility evaluation of German speakers over telephone, the Post-Laryngectomy Telephone Test (PLTT) was defined. Each patient reads out 20 of 400 different monosyllabic words and 5 out of 100 sentences. A human listener writes down the words and sentences understood and computes an overall score. This paper presents a means of objective and automatic evaluation that can replace the subjective method. The scores of 11 naïve raters for a set of 31 test speakers were compared to the word recognition rate of speech recognizers. Correlation values of about 0.9 were reached.

Tino Haderlein, Korbinian Riedhammer, Andreas Maier, Elmar Nöth, Hikmet Toy, Frank Rosanowski
Speaker Normalization Via Springy Discriminant Analysis and Pitch Estimation

Speaker normalization techniques are widely used to improve the accuracy of speaker-independent speech recognition. One of the most popular groups of such methods is Vocal Tract Length Normalization (VTLN). These methods try to reduce the inter-speaker variability by transforming the input feature vectors into a more compact domain to achieve better separation between the phonetic classes. Among others, two algorithms are commonly applied: the Maximum Likelihood criterion-based and the Linear Discriminant criterion-based normalization algorithms. Here we propose the use of the Springy Discriminant criterion for the normalization task. In addition, we propose a method for VTLN parameter determination that is based on pitch estimation. In the experiments this proves to be an efficient and swift way to initialize the normalization parameters for training and to estimate them for the voice samples of new test speakers.

Dénes Paczolay, András Bánhalmi, András Kocsor
A Study on Speech with Manifest Emotions

We present a study of the prosody – seen in a broader sense – that supports the theory of the interrelationship function of speech. “Pure emotions” are meant to show a relationship of the speaker with the general context. The analysis goes beyond the basic prosody, as related to pitch trajectory; namely, the analysis also aims to determine the change in higher formants. The refinement in the analysis asks for improved tools. Methodological aspects are discussed, including a discussion of the limitations of the currently available tools. Some conclusions are drawn.

Horia-Nicolai Teodorescu, Silvia Monica Feraru
Speech Recognition Supported by Prosodic Information for Fixed Stress Languages

In our paper we examine the usage of prosodic features in speech recognition, with special attention paid to agglutinating and fixed-stress languages. The prosodic features used, the acoustic-prosodic pre-processing, and the segmentation in terms of prosodic units are presented in detail. We use the expression “prosodic unit” to distinguish them from prosodic phrases, which are longer. We trained an HMM-based prosodic segmenter relying on the fundamental frequency and intensity of speech. The output of the prosodic segmenter is used for N-best lattice rescoring in parallel with a simplified bigram language model in a continuous speech recognizer, in order to improve speech recognition performance. Experiments for the Hungarian language show a WER reduction of about 4% using simple lattice rescoring.

György Szaszák, Klára Vicsi
TRAP-Based Techniques for Recognition of Noisy Speech

This paper presents a systematic study of the performance of TempoRAl Patterns (TRAP) based features and their proposed modifications and combinations for speech recognition in a noisy environment. The experimental results are obtained on the AURORA 2 database with clean training data. We observed a large dependency of the performance of different TRAP modifications on the noise level. Earlier proposed TRAP system modifications help in clean conditions but degrade the system performance in the presence of noise. The combination techniques, on the other hand, can bring a large improvement in the case of weak noise and degrade only slightly in strong-noise cases. The vector concatenation combination technique improves the system performance up to the strong-noise cases.

František Grézl, Jan Černocký
Intelligibility Is More Than a Single Word: Quantification of Speech Intelligibility by ASR and Prosody

In this paper we examine the quality of the prediction of intelligibility scores of human experts. Furthermore, we investigate the differences between subjective expert raters who evaluated speech disorders of laryngectomees and children with cleft lip and palate. We use the recognition rate of a word recognizer and prosodic features to predict the intelligibility score of each individual expert. For each expert and the mean opinion of all experts we present the best features to model their scoring behavior according to the mean rank obtained during a 10-fold cross-validation. In this manner all individual speech experts were modeled with a correlation coefficient of at least r > .75. The mean opinion of all raters is predicted with a correlation of r = .90 for the laryngectomees and r = .86 for the children.

Andreas Maier, Tino Haderlein, Maria Schuster, Emeka Nkenke, Elmar Nöth
Appositions Versus Double Subject Sentences – What Information the Speech Analysis Brings to a Grammar Debate

We propose a method based on spoken language analysis to deal with controversial syntactic issues; we apply the method to the problem of double subject sentences in the Romanian language. The double subject construction is a controversial linguistic phenomenon in Romanian. While some researchers accept it as a language ‘curiosity’ (specific only to Asian languages, but not to European ones), others consider it an apposition-type structure, in order to fit its behavior into already existing theories. This paper brings a fresh gleam of light to the debate by presenting what we believe to be the first study of the phonetic analysis of double-subject sentences, carried out in order to account for their difference from appositional constructions.

Horia-Nicolai Teodorescu, Diana Trandabăţ
Automatic Evaluation of Pathologic Speech – from Research to Routine Clinical Use

Previously we have shown that ASR technology can be used to objectively evaluate pathologic speech. Here we report on progress for routine clinical use: 1) We introduce an easy-to-use recording and evaluation environment. 2) We confirm our previous results for a larger group of patients. 3) We show that telephone speech can be analyzed with the same methods with only a small loss of agreement with human experts. 4) We show that prosodic information leads to more robust results. 5) We show that text reference instead of transliteration can be used for evaluation. Using word accuracy of a speech recognizer and prosodic features as features for SVM regression, we achieve a correlation of .90 between the automatic analysis and human experts.

Elmar Nöth, Andreas Maier, Tino Haderlein, Korbinian Riedhammer, Frank Rosanowski, Maria Schuster
The LIA Speech Recognition System: From 10xRT to 1xRT

The LIA developed a speech recognition toolkit providing most of the components required by speech-to-text systems. This toolbox allowed us to build a Broadcast News (BN) transcription system that was involved in the ESTER evaluation campaign ([11]), on the unconstrained transcription and real-time transcription tasks. In this paper, we describe the techniques we used to reach real time, starting from our baseline 10xRT system. We focus on some aspects of the A* search algorithm which are critical for both efficiency and accuracy. Then, we evaluate the impact of the different system components (lexicon, language models and acoustic models) on the trade-off between efficiency and accuracy. Experiments are carried out in the framework of the ESTER evaluation campaign. Our results show that the real-time system performs about 5.6% absolute WER worse than the standard 10xRT system, with an absolute WER (Word Error Rate) of about 26.8%.

G. Linarès, P. Nocera, D. Massonié, D. Matrouf
Logic-Based Rhetorical Structuring for Natural Language Generation in Human-Computer Dialogue

Rhetorical structuring is a field approached mostly by research in natural language (pragmatic) interpretation. However, in natural language generation (NLG) the rhetorical structure plays an important part, in monologues and dialogues alike. Hence, several approaches in this direction exist. In most of these, the rhetorical structure is calculated and built in the framework of Rhetorical Structure Theory (RST) or Centering Theory [7], [5]. In language interpretation, a more recent formal account of rhetorical structuring has emerged, namely Segmented Discourse Representation Theory (SDRT), which alleviates some of the issues and weaknesses inherent in previous theories [1]. Research has been initiated in rhetorical structuring for NLG using SDRT, mostly concerning monologues [3]. Most of the approaches using and/or approximating SDRT in computer implementations lean on dynamic semantics, derived from Discourse Representation Theory (DRT), in order to compute rhetorical relations [9]. Some efforts exist in approximating SDRT using less expressive (and expensive) logics, such as First Order Logic (FOL) or Dynamic Predicate Logic (DPL), but these efforts concern language interpretation [10]. This paper describes a rhetorical structuring component of a natural language generator for human-computer dialogue, using SDRT, approximated via the usage of FOL, doubled by a domain-independent discourse ontology. Thus, the paper is structured as follows: the first section situates the research in context and motivates the approach; the second section describes the discourse ontology; the third section describes the approximations done on vanilla SDRT, in order for it to be used for language generation purposes; the fourth section describes an algorithm for updating the discourse structure for a current dialogue; the fifth section provides a detailed example of rhetorical relation computation. The sixth section concludes the paper and gives pointers to future research and improvements.

Vladimir Popescu, Jean Caelen, Corneliu Burileanu
Text-Independent Speaker Identification Using Temporal Patterns

In this work we present an approach for text-independent speaker recognition. As features we used Mel Frequency Cepstrum Coefficients (MFCCs) and Temporal Patterns (TRAPs). For each speaker we trained Gaussian Mixture Models (GMMs) with different numbers of densities. The database used was a 36-speaker database with very noisy close-talking recordings. For training, a Universal Background Model (UBM) is built with the EM algorithm on all available training data. This UBM is then used to create speaker-dependent models for each speaker. This can be done in two ways: taking the UBM as an initial model for EM training, or Maximum A Posteriori (MAP) adaptation. For the 36-speaker database, the use of TRAPs instead of MFCCs leads to a frame-wise recognition improvement of 12.0 %. The adaptation with MAP enhanced the recognition rate by another 14.2 %.

Tobias Bocklet, Andreas Maier, Elmar Nöth
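
A hedged sketch of the GMM-UBM scheme described in the abstract above: a UBM is trained with EM on all training data and the component means are then MAP-adapted per speaker. The relevance factor and the scikit-learn-based setup are illustrative assumptions.

```python
# GMM-UBM with MAP adaptation of the component means (Reynolds-style sketch; parameters assumed).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(all_training_features, n_components=64):
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag", random_state=0)
    return ubm.fit(all_training_features)

def map_adapt_means(ubm, speaker_features, relevance=16.0):
    """Return speaker-adapted component means; weights and covariances are kept from the UBM."""
    posteriors = ubm.predict_proba(speaker_features)        # (T, M) frame/component posteriors
    soft_counts = posteriors.sum(axis=0)                     # (M,)
    weighted_sums = posteriors.T @ speaker_features          # (M, D)
    data_means = weighted_sums / np.maximum(soft_counts, 1e-8)[:, None]
    alpha = (soft_counts / (soft_counts + relevance))[:, None]
    return alpha * data_means + (1.0 - alpha) * ubm.means_
```
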
Recording and Annotation of Speech Corpus for Czech Unit Selection Speech Synthesis

The paper gives a brief summary of the preparation and recording of a phonetically and prosodically rich speech corpus for Czech unit selection text-to-speech synthesis. Special attention is paid to the process of two-phase orthographic annotation of the recorded sentences with regard to their coherence.

Jindřich Matoušek, Jan Romportl
Sk-ToBI Scheme for Phonological Prosody Annotation in Slovak

Research and development in speech synthesis and recognition calls for a phonological intonation annotation scheme for the particular language. Inspired by the successful ToBI (Tones and Break Indices) for American English [1] and GToBI [2] for German, this paper introduces a new intonation annotation scheme for Slovak, Sk-ToBI. In spite of the fact that Slovak prosodic rules differ from those of English or German, we decided to follow the main principles of ToBI and to define a special Slovak version of the Tones and Break Indices annotation scheme. The speech material belonging to different styles, which was used for the preliminary study of accents in Slovak, is briefly described and the conventions of Sk-ToBI annotation are presented.

Milan Rusko, Róbert Sabo, Martin Dzúr
Towards Automatic Transcription of Large Spoken Archives in Agglutinating Languages – Hungarian ASR for the MALACH Project

The paper describes automatic speech recognition experiments and results on the spontaneous Hungarian MALACH speech corpus. A novel morph-based lexical modeling approach is compared to the traditional word-based one and to another, previously best performing morph-based one in terms of word and letter error rates. The applied language and acoustic modeling techniques are also detailed. Using unsupervised speaker adaptation along with morph-based lexical models, absolute word error rate reductions of 14.4%–8.1% have been achieved on a 2-speaker, 2-hour test set, compared to the speaker-independent baseline results.

Péter Mihajlik, Tibor Fegyó, Bottyán Németh, Zoltán Tüske, Viktor Trón
Non-uniform Speech/Audio Coding Exploiting Predictability of Temporal Evolution of Spectral Envelopes

We describe a novel speech/audio coding technique designed to operate at medium bit-rates. Unlike classical state-of-the-art coders that are based on short-term spectra, our approach uses relatively long temporal segments of the audio signal in critical-band-sized sub-bands. We apply an auto-regressive model to approximate Hilbert envelopes in frequency sub-bands. Residual signals (Hilbert carriers) are demodulated and thresholding functions are applied in the spectral domain. The Hilbert envelopes and carriers are quantized and transmitted to the decoder. Our experiments focused on designing a speech/audio coder to provide broadcast-radio-like quality audio at around 15–25 kbps. The obtained objective quality measures, carried out on standard speech recordings, were compared to the state-of-the-art 3GPP-AMR speech coding system.

Petr Motlicek, Hynek Hermansky, Sriram Ganapathy, Harinath Garudadri
Filled Pauses in Speech Synthesis: Towards Conversational Speech

Speech synthesis techniques have already reached a high level of naturalness. However, they are often evaluated on text reading tasks. New applications will require conversational speech instead, and disfluencies are crucial in such a style. The present paper presents a system to predict filled pauses and synthesise them. Objective results show that they can be inserted with 96% precision and 58% recall. Perceptual results even show that their insertion increases the naturalness of synthetic speech.

Jordi Adell, Antonio Bonafonte, David Escudero
Exploratory Analysis of Word Use and Sentence Length in the Spoken Dutch Corpus

We present an analysis of word use and sentence length in different types of Dutch speech, ranging from conversations over discussions and formal speech to read speech. We find that the distributions of sentence length and personal pronouns are characteristic of the type of speech. In addition, we analyzed differences in word use between male and female speakers and between speakers with high and low education levels. We find that male speakers use more fillers, while women use more pronouns and adverbs. Furthermore, gender-specific differences turn out to be stronger than differences in language use between groups with different education levels.

Pascal Wiggers, Leon J. M. Rothkrantz
Design of Tandem Architecture Using Segmental Trend Features

This paper investigates the tandem architecture (TA) based on segmental features. Segmental-feature-based recognition systems have been reported to show better results than conventional feature-based systems in previous studies. In this paper we tried to merge segmental features with the tandem architecture, which uses both hidden Markov models and neural networks. In general, segmental features can be separated into trend and location. Since the trend describes the variation of segmental features and occupies a large portion of them, the trend information was used as an independent or additional feature for the speech recognition system. We applied the trend information of segmental features to the TA and used posterior probabilities, which are the output of the neural network, as inputs to the recognition system. Experiments were performed on the Aurora2 database to examine the potential of the trend-feature-based TA. The results of our experiments verified that the proposed system outperforms the conventional system in very low SNR environments. These findings led us to conclude that the trend information in the TA can be used in addition to the traditional MFCC features.

Young-Sun Yun, Yunkeun Lee
An Automatic Retraining Method for Speaker Independent Hidden Markov Models

When training speaker-independent HMM-based acoustic models, a lot of manually transcribed acoustic training data must be available from a good many different speakers. These training databases have a great variation in the pitch of the speakers, articulation and the speed of talking. In practice, the speaker-independent models are used for bootstrapping the speaker-dependent models built by speaker adaptation methods. Thus the performance of the adaptation methods is strongly influenced by the performance of the speaker-independent model and by the accuracy of the automatic segmentation, which also depends on the base model. In practice, the performance of the speaker-independent models can vary a great deal across the test speakers. Here our goal is to reduce this performance variability by increasing the performance for the speakers with low values, at the price of allowing a small drop in the highest performance values. For this purpose we propose a new method for the automatic retraining of speaker-independent HMMs.

András Bánhalmi, Róbert Busa-Fekete, András Kocsor
User Modeling to Support the Development of an Auditory Help System

The implementations of online help in most commercial computing applications deployed today have a number of well documented limitations. Speech technology can be used to complement traditional online help systems and mitigate some of these problems. This paper describes a model used to guide the design and implementation of an experimental auditory help system, and presents results from a pilot test of that system.

Flaithrí Neff, Aidan Kehoe, Ian Pitt
Fast Discriminant Training of Semi-continuous HMM

In this paper, we introduce a fast estimate algorithm for discriminant training of semi-continuous HMM (Hidden Markov Models).

We first present the Frame Discrimination (FD) method proposed in [1] for weight re-estimation. Then, the weight update equation is formulated in the specific framework of semi-continuous models. Finally, we propose an approximated update function which requires a very low level of computational resources.

The first experiments validate this method by comparing our fast discriminant weighting (FDW) to the original one. We observe that, on a digit recognition task, FDW and FD estimation obtain similar results, while our method significantly decreases the computational time.

A second experiment evaluates FDW on a Large Vocabulary Continuous Speech Recognition (LVCSR) task. We incorporate semi-continuous FDW models into a Broadcast News (BN) transcription system. Experiments are carried out in the framework of the ESTER evaluation campaign ([12]). Results show that in the particular context of very compact acoustic models, discriminant weights improve the system performance compared to both a baseline continuous system and an SCHMM trained by the MLE algorithm.

G. Linarès, C. Lévy
Speech/Music Discrimination Using Mel-Cepstrum Modulation Energy

In this paper, we propose Mel-cepstrum modulation energy (MCME) as an extension of modulation energy (ME) for a feature to discriminate speech and music data. MCME is extracted from the time trajectory of Mel-frequency cepstral coefficients (MFCC), while ME is based on the spectrum. As cepstral coefficients are mutually uncorrelated, we expect MCME to perform better than ME. To find the best modulation frequency for MCME, we perform experiments with modulation frequencies from 4 Hz to 20 Hz, and we compare the results with those obtained from ME and from the MFCC-based cepstral flux. In the experiments, 8 Hz MCME shows the best discrimination performance, yielding a discrimination error reduction rate of 71% compared with 4 Hz ME. Compared with the cepstral flux (CF), it shows an error reduction rate of 53%.

Bong-Wan Kim, Dae-Lim Choi, Yong-Ju Lee
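
A sketch of how a Mel-cepstrum modulation energy feature can be computed from an MFCC trajectory: take the modulation spectrum of each cepstral coefficient over time and sum the energy around the target modulation frequency (e.g. 8 Hz). Frame rate, bandwidth and coefficient handling below are assumptions, not the paper's exact settings.

```python
# Mel-cepstrum modulation energy sketch: modulation-spectrum energy of MFCC trajectories
# around a target modulation frequency. Parameter values are assumptions.
import numpy as np

def mel_cepstrum_modulation_energy(mfcc, frame_rate=100.0, mod_freq=8.0, bandwidth=1.0):
    """mfcc: (n_frames, n_coefficients) matrix; returns one modulation energy per coefficient."""
    trajectories = mfcc - mfcc.mean(axis=0)                  # remove the DC component per coefficient
    power = np.abs(np.fft.rfft(trajectories, axis=0)) ** 2   # modulation power spectrum
    freqs = np.fft.rfftfreq(trajectories.shape[0], d=1.0 / frame_rate)
    band = (freqs >= mod_freq - bandwidth) & (freqs <= mod_freq + bandwidth)
    return power[band].sum(axis=0)
```
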
Parameterization of the Input in Training the HVS Semantic Parser

The aim of this paper is to present an extension of the hidden vector state semantic parser. First, we describe the statistical semantic parsing and its decomposition into the semantic and the lexical model. Subsequently, we present the original hidden vector state parser. Then, we modify its lexical model so that it supports the use of the input sequence of feature vectors instead of the sequence of words. We compose the feature vector from the automatically generated linguistic features (lemma form and morphological tag of the original word). We also examine the effect of including the original word into the feature vector. Finally, we evaluate the modified semantic parser on the Czech Human-Human train timetable corpus. We found that the performance of the semantic parser improved significantly compared with the baseline hidden vector state parser.

Jan Švec, Filip Jurčíček, Luděk Müller
A Comparison Using Different Speech Parameters in the Automatic Emotion Recognition Using Feature Subset Selection Based on Evolutionary Algorithms

The study of emotions in human-computer interaction is a growing research area. Focusing on automatic emotion recognition, work is being performed to achieve good results, particularly in speech and facial gesture recognition. This paper presents a study in which, using a wide range of speech parameters, the improvement in emotion recognition rates is analyzed. Using an emotional multimodal bilingual database for Spanish and Basque, emotion recognition rates in speech have improved significantly for both languages compared with previous studies. In this particular case, as in previous studies, machine learning techniques based on evolutionary algorithms (EDA) have proven to be the best emotion recognition rate optimizers.

Aitor Álvarez, Idoia Cearreta, Juan Miguel López, Andoni Arruti, Elena Lazkano, Basilio Sierra, Nestor Garay
Benefit of Maximum Likelihood Linear Transform (MLLT) Used at Different Levels of Covariance Matrices Clustering in ASR Systems

The paper discusses the benefit of a Maximum Likelihood Linear Transform (MLLT) applied on selected groups of covariance matrices. The matrices were chosen and clustered using phonetic knowledge. Results of experiments are compared with outcomes obtained for diagonal and full covariance matrices of a baseline system and also for widely used transforms based on Linear Discriminant Analysis (LDA), Heteroscedastic LDA (HLDA) and Smoothed HLDA (SHLDA).

Josef V. Psutka
Information Retrieval Test Collection for Searching Spontaneous Czech Speech

This paper describes the design of the first large-scale IR test collection built for the Czech language. The creation of this collection also happens to be very challenging, as it is based on a continuous text stream from automatic transcription of spontaneous speech and thus lacks clearly defined document boundaries. All aspects of the collection building are presented, together with some general findings of initial experiments.

Pavel Ircing, Pavel Pecina, Douglas W. Oard, Jianqiang Wang, Ryen W. White, Jan Hoidekr
Inter-speaker Synchronization in Audiovisual Database for Lip-Readable Speech to Animation Conversion

The present study proposes an inter-speaker audiovisual synchronization method to decrease the speaker dependency of our direct speech-to-animation conversion system. Our aim is to convert an everyday speaker’s voice to lip-readable facial animation for hearing-impaired users. This conversion needs mixed training data: acoustic features from normal speakers coupled with visual features from professional lip-speakers. Audio and video data of normal and professional speakers were synchronized with the Dynamic Time Warping method. The quality and usefulness of the synchronization were investigated in a subjective test measuring noticeable conflicts between the audio and visual parts of speech stimuli. An objective test was also performed, training a neural network on the synchronized audiovisual data with an increasing number of speakers.

Gergely Feldhoffer, Balázs Oroszi, György Takács, Attila Tihanyi, Tamás Bárdi
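
A minimal dynamic time warping sketch of the alignment step described above (aligning two speakers' feature sequences frame by frame); the Euclidean local distance and the plain step pattern are assumptions.

```python
# Plain DTW between two feature sequences (e.g. per-frame acoustic features), returning the path.
import numpy as np

def dtw_align(seq_a, seq_b):
    """seq_a: (n, d), seq_b: (m, d). Returns a list of aligned (i, j) frame index pairs."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end of both sequences.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```
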
Constructing Empirical Models for Automatic Dialog Parameterization

Automatic classification of dialogues between clients and a service center needs a preliminary dialogue parameterization. Such a parameterization usually faces essential difficulties when we deal with politeness, competence, satisfaction, and other similar characteristics of clients. In the paper, we show how to avoid these difficulties using empirical formulae based on lexical-grammatical properties of a text. Such formulae are trained on a given set of examples, which are evaluated manually by an expert (or experts), and the best formula is selected by the Ivakhnenko Method of Model Self-Organization. We test the suggested methodology on a real set of dialogues from the Barcelona railway directory inquiries for the estimation of passengers’ politeness.

Mikhail Alexandrov, Xavier Blanco, Natalia Ponomareva, Paolo Rosso
The Effect of Lexicon Composition in Pronunciation by Analogy

Pronunciation by analogy (PbA) is a data-driven approach to phonetic transcription that generates pronunciations for unknown words by exploiting the phonological knowledge implicit in the dictionary that provides the primary source of pronunciations. Unknown words typically include low-frequency ‘common’ words, proper names or neologisms that have not yet been listed in the lexicon. It is received wisdom in the field that knowledge of the class of a word (common versus proper name) is necessary for correct transcription, but in a practical text-to-speech system, we do not know the class of the unknown word a priori. So if we have a dictionary of common words and another of proper names, we do not know which one to use for analogy unless we attempt to infer the class of unknown words. Such inference is likely to be error prone. Hence it is of interest to know the cost of such errors (if we are using separate dictionaries) and/or the cost of simply using a single, undivided dictionary, effectively ignoring the problem. Here, we investigate the effect of lexicon composition: common words only, proper names only or a mixture. Results suggest that high transcription accuracy may be achievable without prior classification.

Tasanawan Soonklang, R. I. Damper, Yannick Marchand
Festival-si: A Sinhala Text-to-Speech System

This paper brings together the development of the first Text-to-Speech (TTS) system for Sinhala using the Festival framework and practical applications of it. The construction of a diphone database and the implementation of the natural language processing modules are described. The paper also presents the development methodology for direct Sinhala Unicode text input by rewriting letter-to-sound rules in Festival’s context-sensitive rule format and the implementation of a Sinhala syllabification algorithm. A Modified Rhyme Test (MRT) was conducted to evaluate the intelligibility of the synthesized speech and yielded a score of 71.5% for the TTS system described.

Ruvan Weerasinghe, Asanka Wasala, Viraj Welgama, Kumudu Gamage
Voice Conversion Based on Probabilistic Parameter Transformation and Extended Inter-speaker Residual Prediction

Voice conversion is a process which modifies speech produced by one speaker so that it sounds as if it were uttered by another speaker. In this paper a new voice conversion system is presented. The system requires parallel training data. By using linear prediction analysis, speech is described with line spectral frequencies and the corresponding residua. LSFs are converted together with instantaneous F0 by a joint probabilistic function. The residua are transformed by employing residual prediction. In this paper, a new modification of residual prediction is introduced which uses information on the desired target F0 to determine a proper residuum and also allows an efficient control of F0 in the resulting speech.

Zdeněk Hanzlíček, Jindřich Matoušek
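
The joint probabilistic transformation of LSFs mentioned above is commonly realized as a joint-density GMM mapping; the sketch below shows that general idea, assuming time-aligned parallel LSF frames are already available. The data, dimensionality and number of mixtures are placeholders, and the residual prediction and F0 handling of the paper are not covered.

    import numpy as np
    from scipy.stats import multivariate_normal
    from sklearn.mixture import GaussianMixture

    D = 10                                  # LSF order (placeholder)
    src = np.random.rand(2000, D)           # aligned source LSF frames (placeholder)
    tgt = np.random.rand(2000, D)           # parallel target LSF frames (placeholder)

    # Fit a joint-density GMM on stacked [source, target] vectors.
    gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
    gmm.fit(np.hstack([src, tgt]))

    def convert(x):
        """Map one source LSF frame to the target space via the conditional mean."""
        resp = np.empty(gmm.n_components)
        cond = np.empty((gmm.n_components, D))
        for k in range(gmm.n_components):
            mu_x, mu_y = gmm.means_[k, :D], gmm.means_[k, D:]
            Sxx = gmm.covariances_[k, :D, :D]
            Syx = gmm.covariances_[k, D:, :D]
            resp[k] = gmm.weights_[k] * multivariate_normal.pdf(x, mu_x, Sxx)
            cond[k] = mu_y + Syx @ np.linalg.solve(Sxx, x - mu_x)
        resp /= resp.sum()
        return resp @ cond

    print(convert(src[0]).shape)            # -> (10,)
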
Automatic Czech – Sign Speech Translation

This paper is devoted to the problem of automatic translation between Czech and SC in both directions. We introduce our simple monotone phrase-based decoder, SiMPaD, suitable for fast translation, and compare its results with those of the state-of-the-art phrase-based decoder MOSES. We compare the translation accuracy of handcrafted and automatically derived phrases, and introduce a "class-based" language model and a post-processing step in order to increase translation accuracy according to several criteria. Finally, we use the described methods and decoding techniques in the task of SC-to-Czech automatic translation and report the first results for this direction.

Jakub Kanis, Luděk Müller
Maximum Likelihood and Maximum Mutual Information Training in Gender and Age Recognition System

Gender and age estimation based on Gaussian Mixture Models (GMMs) is introduced. Telephone recordings from the Czech SpeechDat-East database are used as the training and test data sets. Mel-Frequency Cepstral Coefficients (MFCCs) are extracted from the speech recordings. Maximum Likelihood (ML) training is applied to estimate the GMM parameters. Subsequently, these estimates are used as the baseline for Maximum Mutual Information (MMI) training. Results achieved with both ML and MMI training are presented and discussed.

Valiantsina Hubeika, Igor Szöke, Lukáš Burget, Jan Černocký
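
A minimal sketch of the ML baseline described in the abstract, assuming MFCC matrices have already been extracted; the discriminative MMI re-estimation step is not shown, and the class labels and data below are placeholders.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Placeholder MFCC matrices (frames x 13) per training utterance, grouped by class.
    train = {
        "female_young": [np.random.randn(300, 13) for _ in range(5)],
        "male_adult":   [np.random.randn(300, 13) for _ in range(5)],
    }

    # Maximum-likelihood training: one GMM per gender/age class (EM inside fit()).
    models = {
        label: GaussianMixture(n_components=32, covariance_type="diag").fit(np.vstack(utts))
        for label, utts in train.items()
    }

    def classify(mfcc):
        """Pick the class whose GMM gives the highest average frame log-likelihood."""
        return max(models, key=lambda label: models[label].score(mfcc))

    print(classify(np.random.randn(200, 13)))
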
Pitch Marks at Peaks or Valleys?

This paper deals with the problem of speech waveform polarity. As the polarity of the speech waveform can influence the performance of pitch marking algorithms (see Sec. 4), a simple method for determining the polarity of the speech signal is presented. We call this problem peak/valley decision making, i.e. deciding whether pitch marks should be placed at peaks (local maxima) or at valleys (local minima) of the speech waveform. In addition, the proposed method can be used to check the polarity consistency of a speech corpus, which is important for the concatenation of speech units in speech synthesis.

Milan Legát, Daniel Tihelka, Jindřich Matoušek
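
The paper's exact decision rule is not given in the abstract; the naive sketch below illustrates one way such a peak/valley decision could be made, by comparing the magnitude of the strongest positive excursions of the waveform with that of the strongest negative ones.

    import numpy as np

    def peaks_or_valleys(signal, frac=0.001):
        """Return 'peaks' if the largest positive samples dominate the largest
        negative ones in magnitude, else 'valleys' (a naive polarity heuristic)."""
        n = max(1, int(len(signal) * frac))
        top = np.sort(signal)[-n:]          # strongest positive excursions
        bottom = np.sort(signal)[:n]        # strongest negative excursions
        return "peaks" if top.mean() >= -bottom.mean() else "valleys"

    # Toy example: a waveform whose negative excursions are stronger than its
    # positive ones.
    t = np.linspace(0, 1, 16000)
    s = np.sin(2 * np.pi * 120 * t)
    wave = 0.3 * s - 0.7 * np.maximum(0, s) ** 8
    print(peaks_or_valleys(wave))           # -> 'valleys'
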
Quality Deterioration Factors in Unit Selection Speech Synthesis

The purpose of the present paper is to examine the relationship between target and concatenation costs and the quality (with a focus on naturalness) of generated speech. Several synthetic phrases were examined by listeners with the aim of finding unnatural artefacts in them, and the relation between these artefacts and the behaviour of the features used in the given unit selection algorithm was examined.

Daniel Tihelka, Jindřich Matoušek, Jiří Kala
Topic-Focus Articulation Algorithm on the Syntax-Prosody Interface of Romanian

We propose in this paper an implementation of the Prague School's TFA (Topic-Focus Articulation) algorithm to support Romanian prosody design, relying on the experience with FDG (Functional Dependency Grammar) and SCD (Segmentation-Cohesion-Dependency) parsing strategies for classical, i.e. predication-driven but Information Structure (IS) independent, syntax. The contributions worth mentioning are: (a) outlining the functional and hierarchical organization of linguistic markers and structures within SCD and FDG local-global parsing, on both sides of the syntax-prosody interface of Romanian; (b) pointing out the relationship between classical (IS-free) syntactic structures, IS-dependent (topic-focus, communicative dynamism) textual spans, and the corresponding prosodic intonational units; (c) adapting and implementing the TFA algorithm, for the first time, for Romanian prosodic structures, to be continued with TFA sentence-level refinements, its rhetorical-level extension, and embedding into local-global linking algorithms.

Neculai Curteanu, Diana Trandabăţ, Mihai Alex Moruz
Translation and Conversion for Czech Sign Speech Synthesis

Recent research progress in the development of a Czech Sign Speech synthesizer is presented. The current goal is to improve the automatic synthesis system so that it produces accurate Sign Speech. The synthesis system converts written text into an animation of an artificial human model. This includes translation of the text into sign phrases and their conversion into the animation of an avatar. The animation is composed of movements and deformations of segments of the hands, the head, and the face. The system has been evaluated by two initial perceptual tests, which indicate that the designed synthesis system is capable of producing intelligible Sign Speech.

Zdeněk Krňoul, Miloš Železný

Dialog

A Wizard-of-Oz System Evaluation Study

In order to evaluate the performance of the dialogue-manager component of a Slovenian and Croatian spoken dialogue system under development, two Wizard-of-Oz experiments were performed. The only difference between the two experimental settings was the manner of dialogue management: in the first experiment dialogue management was performed by a human, the wizard, while in the second it was performed by the newly implemented dialogue-manager component. The data from both Wizard-of-Oz experiments were evaluated with the PARADISE evaluation framework, a potential general methodology for evaluating and comparing different versions of spoken-language dialogue systems. The study finds a remarkable difference in the performance functions when different satisfaction-measure sums, or even individual scores, are taken as the target to be predicted; it demonstrates the indispensability of the recently introduced database parameters when evaluating information-providing dialogue systems; and it confirms the dialogue manager's cooperativity with respect to the incorporated knowledge representation.

Melita Hajdinjak, France Mihelič
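
The PARADISE framework mentioned above derives its performance function by regressing user satisfaction on a task-success measure and a set of dialogue-cost measures; the sketch below illustrates that regression step with invented metric names and toy data.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # One row per evaluated dialogue: kappa (task success), number of turns,
    # elapsed time in seconds, ASR error rate -- all values are invented.
    metrics = np.array([
        [0.90, 12,  95.0, 0.10],
        [0.70, 20, 180.0, 0.25],
        [0.95, 10,  80.0, 0.05],
        [0.50, 25, 240.0, 0.40],
    ])
    satisfaction = np.array([31, 24, 33, 18])    # summed questionnaire scores

    # Fit the performance function on z-normalised success and cost measures.
    z = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0)
    model = LinearRegression().fit(z, satisfaction)
    print(dict(zip(["kappa", "turns", "time", "asr_err"], model.coef_.round(2))))
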
New Measures for Open-Domain Question Answering Evaluation Within a Time Constraint

Previous work on evaluating the performance of Question Answering (QA) systems has focused on precision. In this paper, we develop a mathematical procedure in order to explore new evaluation measures for QA systems that take the answer time into account. We also carried out an exercise for the evaluation of QA systems within a time constraint in the CLEF-2006 campaign, using the proposed measures. The main conclusion is that real-time evaluation can be a new scenario for the evaluation of QA systems.

Elisa Noguera, Fernando Llopis, Antonio Ferrández, Alberto Escapa
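
The abstract does not spell out the proposed measures; purely as an illustration of the underlying idea, a hypothetical measure might discount each correct answer by the time taken to produce it, as in the sketch below (the function and its parameters are invented, not those of the paper).

    def time_discounted_score(answers, t_max=60.0):
        """Hypothetical measure: each correctly answered question earns a credit
        decaying linearly from 1 (instant answer) to 0 (answer at or beyond
        t_max seconds); this is NOT the exact measure proposed in the paper."""
        credit = sum(max(0.0, 1.0 - seconds / t_max)
                     for correct, seconds in answers if correct)
        return credit / len(answers)

    # (correct?, answer time in seconds) per question -- toy data.
    print(time_discounted_score([(True, 5.0), (True, 40.0), (False, 10.0)]))
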
A Methodology for Domain Dialogue Engineering with the Midiki Dialogue Manager

Implementing robust Dialogue Systems (DS) supporting natural interaction with humans still presents challenging problems and difficulties. At the centre of any DS is its Dialogue Manager (DM), providing the functionalities which permit a dialogue to move forward towards some common goal in a cooperative interaction. Unfortunately, there are few authoring tools providing an easy and intuitive implementation of such dialogues. In this paper we present a methodology for dialogue engineering with the MIDIKI DM. This methodology bridges the gap, since it is supported by an authoring tool which generates XML and compilable Java representations of a dialogue.

Lúcio M. M. Quintal, Paulo N. M. Sampaio
The Intonational Realization of Requests in Polish Task-Oriented Dialogues

In the present paper, the intonational realization of Request Action dialogue acts in Polish map task dialogues is analyzed. The study is focused on the Request External Action acts realized as single, well-formed intonational phrases. Basic pitch-related parameters are measured and discussed. Nuclear melodies are described and categorized, and some generalizations are formulated about their common realizations. Certain aspects of grammatical form and phrase placement in the dialogue flow are also taken into account. The results will be employed in comparative studies and in the preparation of glottodidactic materials, but they may also prove useful in the field of speech synthesis or recognition.

Maciej Karpinski
Analysis of Changes in Dialogue Rhythm Due to Dialogue Acts in Task-Oriented Dialogues

We consider that factors such as the prosody of a system's utterances and dialogue rhythm are important for attaining natural human-machine dialogue. However, the relations between dialogue rhythm and the speaker's various states in task-oriented dialogue have not yet been revealed. In this study, we collected task-oriented dialogues and analyzed the relations between "dialogue structures, kinds of dialogue acts (contents of utterances), Aizuchi (backchannel/acknowledgment), Repeat and interjection" and "dialogue rhythm (response timing, F0, and speech rate)".

Noriki Fujiwara, Toshihiko Itoh, Kenji Araki
Recognition and Understanding Simulation for a Spoken Dialog Corpus Acquisition

Since the design and acquisition of a new dialog corpus is a complex task, new methods to facilitate this task are necessary. In this paper, we present a methodology to make use of our previous work within the framework of dialog systems in order to acquire a dialog corpus for a new domain. The main idea is the simulation of recognition and understanding errors in the acquisition of the new dialog corpus. This simulation is based on the analysis of such errors in a previously acquired corpus and the definition of a correspondence table among the concepts and attributes of both tasks. This correspondence table is based on the similarity of semantic meaning and frequencies. Finally, the application of this methodology is illustrated in some examples.

F. Garcia, L. F. Hurtado, D. Griol, M. Castro, E. Segarra, E. Sanchis
First Approach in the Development of Multimedia Information Retrieval Resources for the Basque Context

Information Retrieval (IR) applications require appropriate multimodal resources for developing all of their components. The work described in this paper is one of the main steps of a broader project that consists of developing a Multimodal Index System for Information Retrieval. The final goal of this part of the project is to create a robust Automatic Speech Recognition system for Basque that also covers the other languages spoken in the Basque Country: Spanish and French. It is widely accepted that the robustness of these systems is directly related to the quality of the resources used during training. Hence, the digital resources needed for Multilingual Continuous Speech Recognition systems for the three official languages of the Basque Country are described.

N. Barroso, A. Ezeiza, N. Gilisagasti, K. López de Ipiña, A. López, J. M. López
The Weakest Link

In this paper we discuss the phenomenon of grounding in dialogue using a context-change approach to the interpretation of dialogue utterances. We formulate an empirically motivated principle for the strengthening of weak mutual beliefs, and show that with this principle, the building of common ground in dialogue can be explained through ordinary mechanisms of understanding and cooperation.

Harry Bunt, Roser Morante
A Spoken Dialog System for Chat-Like Conversations Considering Response Timing

If a dialog system can respond to a user as naturally as a human, the interaction will be smoother. In this research, we aim to develop a dialog system that emulates human behavior in chat-like dialog. In this paper, we developed a dialog system which generates chat-like responses and their timing using a decision tree. The system can perform "collaborative completion", "aizuchi" (back-channel) and so on. The decision tree uses the pitch and power contours of the user's utterance, recognition hypotheses, and the response preparation status of the response generator at every time segment as features to generate response timing.

Ryota Nishimura, Norihide Kitaoka, Seiichi Nakagawa
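
A minimal sketch of the kind of decision-tree timing classifier described above, assuming per-time-segment prosodic and system-state features have already been extracted; feature names, labels, and data are placeholders.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Per time segment: mean F0, F0 slope, mean power, hypothesis stability,
    # response-ready flag -- all values are invented placeholders.
    X = np.random.rand(500, 5)
    # Action at that segment: 0 = wait, 1 = aizuchi (back-channel), 2 = respond.
    y = np.random.randint(0, 3, size=500)

    timing_tree = DecisionTreeClassifier(max_depth=5).fit(X, y)
    print(timing_tree.predict(X[:3]))       # decisions for the first three segments
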
Digitisation and Automatic Alignment of the DIALOG Corpus: A Prosodically Annotated Corpus of Czech Television Debates

This article describes the development and automatic processing of the audio-visual DIALOG corpus. The DIALOG corpus is a prosodically annotated corpus of Czech television debates that has been recorded and annotated at the Czech Language Institute of the Academy of Sciences of the Czech Republic. It has recently grown to more than 400 four-hour VHS tapes and 375 transcribed TV debates. The described digitisation process and automatic alignment enable an easily accessible and user-friendly research environment, supporting the exploration of Czech prosody and its analysis and modelling. This project has been carried out in cooperation with the Institute of Formal and Applied Linguistics of the Faculty of Mathematics and Physics, Charles University, Prague. Currently the first version of the DIALOG corpus is available to the public (version 0.1, http://ujc.dialogy.cz). It includes 10 selected and revised hour-long talk shows.

Nino Peterek, Petr Kaderka, Zdeňka Svobodová, Eva Havlová, Martin Havlík, Jana Klímová, Patricie Kubáčková
Setting Layout in Dialogue Generating Web Pages

Setting layout of a two-dimensional domain is one of the key tasks of our ongoing project aiming at a dialogue system which should give the blind the opportunity to create web pages and graphics by means of dialogue. We present an approach that enables active dialogue strategies in natural language. This approach is based on a procedure that checks the correctness of the given task, analyses the user’s requirements and proposes a consistent solution. An example illustrating the approach is presented.

Luděk Bártek, Ivan Kopeček, Radek Ošlejšek
Graph-Based Answer Fusion in Multilingual Question Answering

One major problem in multilingual Question Answering (QA) is the combination of answers obtained from different languages into one single ranked list. This paper proposes a new method for tackling this problem. The method is founded on a graph-based ranking approach inspired by Google's popular PageRank algorithm. Experimental results demonstrate that the proposed method outperforms other current techniques for answer fusion, and also evidence the advantages of multilingual QA over the traditional monolingual approach.

Rita M. Aceves-Pérez, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda
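
A minimal sketch of a PageRank-style ranking over an answer similarity graph, in the spirit of the approach described above; a trivial token-overlap score stands in for whatever answer similarity the paper actually uses.

    import numpy as np

    def overlap(a, b):
        """Toy similarity: token overlap between two candidate answers."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(1, len(ta | tb))

    def pagerank_fusion(answers, damping=0.85, iters=50):
        """Rank answers by a PageRank-style power iteration on the similarity graph."""
        n = len(answers)
        W = np.array([[overlap(a, b) if i != j else 0.0
                       for j, b in enumerate(answers)] for i, a in enumerate(answers)])
        W = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)   # row-normalise
        r = np.full(n, 1.0 / n)
        for _ in range(iters):
            r = (1 - damping) / n + damping * (W.T @ r)
        return sorted(zip(answers, r), key=lambda p: -p[1])

    candidates = ["Paris", "Paris France", "the city of Paris", "Lyon"]
    for answer, score in pagerank_fusion(candidates):
        print(round(score, 3), answer)
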
Using Query-Relevant Document Pairs for Cross-Lingual Information Retrieval

The world wide web is a natural setting for cross-lingual information retrieval. The European Union is a typical example of a multilingual scenario, where users have to deal with information published in at least 20 languages. Given queries in some source language and a target corpus in another language, the typical approach consists of translating either the query or the target dataset into the other language. Other approaches use parallel corpora to obtain a statistical dictionary of words across the different languages. In this work, we propose to use a training corpus made up of a set of Query-Relevant Document Pairs (QRDP) in a probabilistic cross-lingual information retrieval approach based on IBM alignment model 1 for statistical machine translation. Our approach has two main advantages over those that use direct translation or parallel corpora. First, we do not obtain a translation of the query, but a set of associated words that share its meaning in some way; the obtained dictionary is therefore, in a broad sense, more semantic than a translation dictionary. Second, since the queries are supervised, we work in a more restricted domain than when using a general parallel corpus (it is well known that results in such a restricted context are better than those obtained in a general one). In order to assess the quality of our approach, we compared its results with those obtained by direct translation of the queries with a query translation system, observing promising results.

David Pinto, Alfons Juan, Paolo Rosso
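
IBM alignment model 1, on which the described approach is based, can be estimated with a few lines of EM; the sketch below uses toy word pairs in place of the real query-relevant document pairs.

    from collections import defaultdict
    from itertools import product

    # Toy "parallel" data: (source query words, target document words).
    pairs = [
        (["train", "schedule"], ["horario", "tren"]),
        (["train", "ticket"],   ["billete", "tren"]),
    ]
    src_vocab = {w for s, _ in pairs for w in s}
    tgt_vocab = {w for _, t in pairs for w in t}

    # Uniform initialisation of t(target word | source word).
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))

    for _ in range(10):                     # EM iterations
        count, total = defaultdict(float), defaultdict(float)
        for s_words, t_words in pairs:      # E-step: fractional co-occurrence counts
            for tw in t_words:
                norm = sum(t[(sw, tw)] for sw in s_words)
                for sw in s_words:
                    count[(sw, tw)] += t[(sw, tw)] / norm
                    total[sw] += t[(sw, tw)] / norm
        for sw, tw in product(src_vocab, tgt_vocab):   # M-step: re-normalise
            if total[sw]:
                t[(sw, tw)] = count[(sw, tw)] / total[sw]

    print(round(t[("train", "tren")], 3))   # dominates the other entries for "train"
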
Detection of Dialogue Acts Using Perplexity-Based Word Clustering

In the present work we used a word clustering algorithm based on the perplexity criterion within a Dialogue Act detection framework, in order to model the structure of the speech of a user interacting with a dialogue system. Specifically, we constructed an n-gram based model for each target Dialogue Act, computed over the word classes. We then evaluated the performance of our dialogue system on ten different types of dialogue acts, using an annotated database which contains 1,403,985 unique words. The results were very promising, since we achieved about 70% accuracy using trigram-based models.

Iosif Mporas, Dimitrios P. Lyras, Kyriakos N. Sgarbas, Nikos Fakotakis
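
A minimal sketch of the detection step described above, assuming word-class sequences are already available from the clustering stage: one smoothed n-gram model per dialogue act (a bigram model here, for brevity) is trained, and an utterance is labelled with the act whose model yields the lowest perplexity. The class inventories and sequences are invented.

    import math
    from collections import Counter

    def train_bigram(sequences):
        """Add-one smoothed bigram model over word-class sequences."""
        uni, bi, vocab = Counter(), Counter(), set()
        for seq in sequences:
            seq = ["<s>"] + seq + ["</s>"]
            vocab.update(seq)
            uni.update(seq[:-1])
            bi.update(zip(seq[:-1], seq[1:]))
        return uni, bi, len(vocab)

    def perplexity(model, seq):
        uni, bi, v = model
        seq = ["<s>"] + seq + ["</s>"]
        logp = sum(math.log((bi[(a, b)] + 1) / (uni[a] + v))
                   for a, b in zip(seq[:-1], seq[1:]))
        return math.exp(-logp / (len(seq) - 1))

    # Toy word-class sequences per dialogue act (placeholders for the real clusters).
    training = {
        "QUESTION":  [["WH", "AUX", "PRON", "VERB"], ["AUX", "PRON", "VERB"]],
        "STATEMENT": [["PRON", "VERB", "DET", "NOUN"], ["DET", "NOUN", "VERB"]],
    }
    models = {act: train_bigram(seqs) for act, seqs in training.items()}

    utterance = ["WH", "AUX", "PRON", "VERB", "NOUN"]
    print(min(models, key=lambda act: perplexity(models[act], utterance)))  # QUESTION
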
Dialogue Management for Intelligent TV Based on Statistical Learning Method

In this paper, we introduce a practical spoken dialogue interface for intelligent TV based on goal-oriented dialogue modeling. It uses a frame structure for representing the user intention and determining the next action. To analyze discourse context, we employ several statistical learning techniques and devise an incremental dialogue strategy learning method that learns from a training corpus. Empirical experiments demonstrate the efficiency of the proposed system. In the subjective evaluation, we obtained a 73% user satisfaction ratio, while the objective evaluation result was over 90% in a restricted situation aimed at commercialization.

Hyo-Jung Oh, Chung-Hee Lee, Yi-Gyu Hwang, Myung-Gil Jang
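
The frame structure mentioned above can be pictured as a slot-filling loop in which the next system action is determined by the first unfilled required slot; the sketch below uses invented slot names for an EPG-style TV query.

    # Hypothetical intent frame for a TV programme query; slot names are invented.
    REQUIRED = ["channel", "genre", "time"]

    def next_action(frame):
        """Ask for the first unfilled required slot, otherwise run the search."""
        for slot in REQUIRED:
            if slot not in frame:
                return "request({})".format(slot)
        return "search_programmes(frame)"

    frame = {"genre": "news"}               # filled from the user's first utterance
    print(next_action(frame))               # -> request(channel)
    frame.update(channel="BBC", time="21:00")
    print(next_action(frame))               # -> search_programmes(frame)
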
Multiple-Taxonomy Question Classification for Category Search on Faceted Information

In this paper we present a novel multiple-taxonomy question classification system, facing the challenge of assigning categories in multiple taxonomies to natural language questions. We applied our system to category search on faceted information. The system provides a natural language interface to faceted information, detecting the categories requested by the user and narrowing down the document search space to those documents pertaining to the facet values identified. The system was developed in the framework of language modeling, and the models to detect categories are inferred directly from the corpus of documents.

David Tomás, José L. Vicedo
Indexing and Retrieval Scheme for Content-Based Multimedia Applications
Martynov Dmitry, Eugenij Bovbel
Backmatter
Metadata
Title
Text, Speech and Dialogue
Edited by
Václav Matoušek
Pavel Mautner
Copyright year
2007
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-74628-7
Print ISBN
978-3-540-74627-0
DOI
https://doi.org/10.1007/978-3-540-74628-7