
2009 | Book

Text, Speech and Dialogue

12th International Conference, TSD 2009, Pilsen, Czech Republic, September 13-17, 2009. Proceedings

Edited by: Václav Matoušek, Pavel Mautner

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Invited Talks

Code Breaking for Automatic Speech Recognition

Practical automatic speech recognition is of necessity a (near) real-time activity performed by a system whose structure is fixed and whose parameters, once trained, may be adapted on the basis of the speech that the system has observed during recognition.

However, in especially important situations (e.g., recovery of out-of-vocabulary words) the recognition task could be viewed as an activity akin to code breaking, to whose accomplishment an essentially infinite amount of effort can be devoted. In such a case everything would be fair, including, for instance, the retraining of a language and/or acoustic model on the basis of newly acquired data (from the Internet!) or even a complete change of the recognizer paradigm.

An obvious way to proceed is to use the basic recognizer to produce a lattice or confusion network and then do the utmost to eliminate ambiguity. Another possibility is to create a list of frequent confusions (for instance the pair IN and AND) and prepare appropriate individual decision processes to resolve each when it occurs in test data. We will report on our initial code-breaking effort.

Frederick Jelinek
The Semantics of Semantics in Language Processing

In speech and language research, the semantics of an utterance always corresponds to the meaning of the utterance. Meaning, however, is a concept that has been debated by philosophers for centuries, so in Language Processing, semantics has come to be used very differently in different applications. One can even make the case that although we believe we must program computers to represent the “semantics” of an utterance, we often have great difficulty as humans in defining exactly what we want. The talk will give an overview of formal semantics, lexical semantics and conceptual semantics and then focus on how semantics is used in several application areas of Language Processing.

Louise Guthrie
Are We There Yet? Research in Commercial Spoken Dialog Systems

In this paper we discuss the recent evolution of spoken dialog systems in commercial deployments. Though still based on a simple finite-state-machine design paradigm, dialog systems have today reached a higher level of complexity. The availability of massive amounts of data during deployment has led to the development of continuous optimization strategies, pushing the design and development of spoken dialog applications from an art to a science. At the same time, new methods for evaluating the subjective caller experience have become available. Finally, we describe the inevitable evolution of spoken dialog applications from speech-only to multimodal interaction.

Roberto Pieraccini, David Suendermann, Krishna Dayanidhi, Jackson Liscombe
Semantic Information Processing for Multi-party Interaction

We present ongoing research efforts using semantic representations and processing, combined with machine learning approaches to structure, understand, summarize etc. the multimodal information available from multi-party meeting recordings.

Tilman Becker
Communication Disorders and Speech Technology

In this talk we will give an overview of the different kinds of communication disorders. We will concentrate on communication disorders related to language and speech (i.e., not look at disorders like blindness or deafness). Speech and language disorders can range from simple sound substitution to the inability to understand or use language. Thus, a disorder may affect one or several linguistic levels: a patient with an articulation disorder cannot correctly produce speech sounds (phonemes) because of imprecise placement, timing, pressure, speed, or flow of movement of the lips, tongue, or throat. His speech may be acoustically unintelligible, yet the syntactic, semantic, and pragmatic levels are not affected. With other pathologies, e.g. Wernicke’s aphasia, the acoustics of the speech signal might be intelligible, yet the patient is unintelligible due to a mix-up of words (semantic paraphasia) or sounds (phonematic paraphasia).

We will look at what linguistic knowledge has to be modeled in order to analyze different pathologies with speech technology, how difficult the task is, and how speech technology can support the speech therapist in the tasks of diagnosis, therapy control, comparison of therapies, and screening.

Elmar Nöth, Stefan Steidl, Maria Schuster

Text

A Gradual Combination of Features for Building Automatic Summarisation Systems

This paper presents a Text Summarisation approach which combines three different features (word frequency, Textual Entailment, and the Code Quantity Principle) in order to produce extracts from newswire documents in English. Experiments show that the proposed combination is appropriate for generating summaries, improving the system’s performance by 10% over the best DUC 2002 participant. Moreover, a preliminary analysis of the suitability of these features for domain-independent documents has been carried out, also obtaining encouraging results.

Elena Lloret, Manuel Palomar
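The paper combines three features; as a rough illustration of only the first of them, the Python sketch below scores sentences by normalised content-word frequency and picks the top-ranked ones. The stopword list and the scoring scheme are illustrative assumptions, not the authors’ implementation.

# Minimal sketch: scoring sentences by normalised content-word frequency.
# Only the word-frequency feature is shown; the paper additionally combines
# Textual Entailment and the Code Quantity Principle.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "that"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def summarize(document, n_sentences=3):
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    words = [w for w in tokenize(document) if w not in STOPWORDS]
    freq = Counter(words)
    top = freq.most_common(1)[0][1]
    def score(sentence):
        tokens = [w for w in tokenize(sentence) if w not in STOPWORDS]
        return sum(freq[w] / top for w in tokens) / (len(tokens) or 1)
    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Return the selected sentences in their original order.
    return [s for s in sentences if s in ranked]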
Combining Text Vector Representations for Information Retrieval

This paper suggests a novel representation for documents that is intended to improve precision. This representation is generated by combining two central techniques: Random Indexing; and Holographic Reduced Representations (HRRs). Random indexing uses co-occurrence information among words to generate semantic context vectors that are the sum of randomly generated term identity vectors. HRRs are used to encode textual structure which can directly capture relations between words (e.g., compound terms, subject-verb, and verb-object). By using the random vectors to capture semantic information, and then employing HRRs to capture structural relations extracted from the text, document vectors are generated by summing all such representations in a document. In this paper, we show that these representations can be successfully used in information retrieval, can effectively incorporate relations, and can reduce the dimensionality of the traditional vector space model (VSM). The results of our experiments show that, when a representation that uses random index vectors is combined with different contexts, such as document occurrence representation (DOR), term co-occurrence representation (TCOR) and HRRs, the VSM representation is outperformed when employed in information retrieval tasks.

Maya Carrillo, Chris Eliasmith, A. López-López
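A minimal sketch of the two ingredients named in the abstract, under simplifying assumptions (fixed dimensionality, word adjacency used as a stand-in for the syntactic relations extracted in the paper): sparse random index vectors are summed for the semantic part, and HRR binding is realised as circular convolution.

# Sketch of Random Indexing plus HRR binding; not the authors' configuration.
import numpy as np
from collections import defaultdict

DIM = 512
rng = np.random.default_rng(0)

def index_vector():
    # Sparse ternary random vector, as commonly used in Random Indexing.
    v = np.zeros(DIM)
    idx = rng.choice(DIM, size=10, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=10)
    return v

def hrr_bind(a, b):
    # Circular convolution, the HRR binding operation, computed via FFT.
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def document_vector(sentences, index_vecs):
    doc = np.zeros(DIM)
    for sent in sentences:
        for w in sent:                      # semantic part: sum of index vectors
            doc += index_vecs[w]
        for w1, w2 in zip(sent, sent[1:]):  # structural part: bound word pairs
            doc += hrr_bind(index_vecs[w1], index_vecs[w2])
    return doc / np.linalg.norm(doc)

index_vecs = defaultdict(index_vector)
doc = document_vector([["holographic", "reduced", "representations"]], index_vecs)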
Detecting and Correcting Errors in an English Tectogrammatical Annotation

We present our first experiments with detecting and correcting errors in a manual annotation of English texts, taken from the Penn Treebank, at the dependency-based tectogrammatical layer, as it is defined in the Prague Dependency Treebank. The main idea is that errors in the annotation usually result in an inconsistency, i.e. the state where a phenomenon is annotated in different ways at several places in a corpus. We describe our algorithm for detecting inconsistencies (which received positive feedback from annotators) and we present some statistics on the manually corrected data and the results of a tectogrammatical analyzer which uses these data for its operation. The corrections have improved the data only slightly so far, but we outline some ways towards more significant improvement.

Václav Klimeš
Improving the Clustering of Blogosphere with a Self-term Enriching Technique

The analysis of blogs is emerging as an exciting new area in the text processing field which attempts to harness and exploit the vast quantity of information being published by individuals. However, their particular characteristics (shortness, vocabulary size and nature, etc.) make it difficult to achieve good results using automated clustering techniques. Moreover, the fact that many blogs may be considered to be narrow domain means that exploiting external linguistic resources can have limited value. In this paper, we present a methodology to improve the performance of clustering techniques on blogs, which does not rely on external resources. Our results show that this technique can produce significant improvements in the quality of clusters produced.

Fernando Perez-Tellez, David Pinto, John Cardiff, Paolo Rosso
Advances in Czech – Signed Speech Translation

This article describes advances in Czech – Signed Speech translation. A method using a new criterion based on the minimal loss principle for log-linear model phrase extraction was introduced and evaluated against two other criteria. The performance of the phrase table extracted with the introduced method was compared with the performance of two other phrase tables (manually and automatically extracted). A new criterion for evaluating the semantic agreement of translations was also introduced.

Jakub Kanis, Luděk Müller
Improving Word Alignment Using Alignment of Deep Structures

In this paper, we describe differences between classical word alignment on the surface (word-layer alignment) and an alignment of deep syntactic sentence representations (tectogrammatical alignment). The deep structures we use are dependency trees containing content (autosemantic) words as their nodes. Most other function words, such as prepositions, articles, and auxiliary verbs, are hidden. We introduce an algorithm which aligns such trees using a perceptron-based scoring function. For evaluation purposes, a set of parallel sentences was manually aligned. We show that using statistical word alignment (GIZA++) can improve the tectogrammatical alignment. Surprisingly, we also show that the tectogrammatical alignment can then be used to significantly improve the original word alignment.

David Mareček
Trdlo, an Open Source Tool for Building Transducing Dictionary

This paper describes the development of an open-source tool named Trdlo. Trdlo was developed as part of our effort to build a machine translation system between very close languages. These languages usually do not have available pre-processed linguistic resources or dictionaries suitable for computer processing. Bilingual dictionaries have a big impact on translation quality. The methods proposed in this paper attempt to extend existing dictionaries with inferable translation pairs. Our approach requires only ‘cheap’ resources: a list of lemmata for each language and rules for inferring words from one language to another. It is also possible to use other resources like annotated corpora or Wikipedia. Results show that this approach greatly improves the efficiency of building a Czech–Slovak dictionary.

Marek Grác
Improving Patient Opinion Mining through Multi-step Classification

Automatically tracking attitudes, feelings and reactions in on-line forums, blogs and news is a desirable instrument to support statistical analyses by companies, the government, and even individuals. In this paper, we present a novel approach to polarity classification of short text snippets, which takes into account the way data are naturally distributed into several topics in order to obtain better classification models for polarity. Our approach is multi-step, where in the initial step a standard topic classifier is learned from the data and the topic labels, and in the ensuing step several polarity classifiers, one per topic, are learned from the data and the polarity labels. We empirically show that our approach improves classification accuracy over a real-world dataset by over 10%, when compared against a standard single-step approach using the same feature sets. The approach is applicable whenever training material is available for building both topic and polarity learning models.

Lei Xia, Anna Lisa Gentile, James Munro, José Iria
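A hedged sketch of the two-step idea with scikit-learn: a topic classifier is trained first, then one polarity classifier per topic. The features and classifiers here are placeholders, not the feature sets used in the paper.

# Multi-step classification sketch: topic first, then per-topic polarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_multistep(texts, topics, polarities):
    # Assumes every topic has training examples of each polarity label.
    topic_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    topic_clf.fit(texts, topics)
    polarity_clfs = {}
    for topic in set(topics):
        subset = [(t, p) for t, tp, p in zip(texts, topics, polarities) if tp == topic]
        X, y = zip(*subset)
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(list(X), list(y))
        polarity_clfs[topic] = clf
    return topic_clf, polarity_clfs

def predict_polarity(text, topic_clf, polarity_clfs):
    topic = topic_clf.predict([text])[0]
    return polarity_clfs[topic].predict([text])[0]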
Update Summarization Based on Latent Semantic Analysis

This paper deals with our recent research in text summarization. We went from single-document summarization through multi-document summarization to update summarization. We describe the development of our summarizer which is based on latent semantic analysis (LSA) and propose the update summarization component which determines the redundancy and novelty of each topic discovered by LSA. The final part of this paper presents the results of our participation in the experiment of Text Analysis Conference 2008.

Josef Steinberger, Karel Ježek
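A minimal sketch of the LSA core only (the update component that scores redundancy and novelty is not shown): build a term-by-sentence matrix, take its SVD, and pick the most salient sentence per latent topic.

# LSA-based sentence selection sketch.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def lsa_summary(sentences, n_topics=2):
    A = CountVectorizer().fit_transform(sentences).T.toarray()  # terms x sentences
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    chosen = []
    for k in range(min(n_topics, Vt.shape[0])):
        best = int(np.argmax(np.abs(Vt[k])))  # most salient sentence for topic k
        if best not in chosen:
            chosen.append(best)
    return [sentences[i] for i in sorted(chosen)]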
WEBSOM Method - Word Categories in Czech Written Documents

We applied the well-known WEBSOM method (based on a two-layer architecture) to the categorization of Czech written documents. Our research focused on the syntactic and semantic relationships within word categories of the word category map (WCM). The document classification system was tested on a subset of 100 documents (manual work was necessary) from the corpus of Czech News Agency documents. The results confirmed that the WEBSOM method is hard to evaluate because humans have problems with natural language semantics and with determining semantic domains from word categories.

Roman Mouček, Pavel Mautner
Opinion Target Network: A Two-Layer Directed Graph for Opinion Target Extraction

Unknown opinion targets lead to low coverage in opinion mining. To deal with this, previous opinion target extraction methods consider human-compiled opinion targets as seeds and adopt syntactic/statistic patterns to extract new opinion targets. Three problems are notable: 1) manually compiled opinion targets are too large to be sound seeds; 2) an array that maintains seeds is less effective at representing relations between seeds; 3) opinion target extraction can hardly achieve satisfactory performance in merely one cycle. The opinion target network (OTN) is proposed in this paper to organize atom opinion targets of component and attribute in a two-layer directed graph. With multiple cycles of OTN construction, higher coverage of opinion target extraction is achieved via generalization and propagation. Experiments on Chinese opinion target extraction show that the OTN is promising in handling unknown opinion targets.

Yunqing Xia, Boyi Hao
The Czech Broadcast Conversation Corpus

This paper presents the final version of the Czech Broadcast Conversation Corpus released at the Linguistic Data Consortium (LDC). The corpus contains 72 recordings of a radio discussion program, which yield about 33 hours of transcribed conversational speech from 128 speakers. The release not only includes verbatim transcripts and speaker information, but also structural metadata (MDE) annotation that involves labeling of sentence-like unit boundaries, marking of non-content words like filled pauses and discourse markers, and annotation of speech disfluencies. The annotation is based on the LDC’s MDE annotation standard for English, with changes applied to accommodate phenomena that are specific for Czech. In addition to its importance to speech recognition, speaker diarization, and structural metadata extraction research, the corpus is also useful for linguistic analysis of conversational Czech.

Jáchym Kolář, Jan Švec
Vector-Based Unsupervised Word Sense Disambiguation for Large Number of Contexts

This paper presents a possible improvement of unsupervised word sense disambiguation (WSD) systems by extending the number of contexts applied by the discrimination algorithms. We carried out an experiment for several WSD algorithms based on the vector space model with the help of the SenseClusters toolkit [1]. The performance of the algorithms was evaluated on a standard benchmark, the nouns of the Senseval-3 English lexical-sample task [2]. Paragraphs from the British National Corpus were added to the contexts of the Senseval-3 data in order to increase the number of contexts used by the discrimination algorithms. After parameter optimization on the Senseval-2 English lexical-sample data, the performance measures show a slight improvement, and the optimized algorithm is competitive with the best unsupervised WSD systems evaluated on the same data, such as [3].

Gyula Papp
Chinese Pinyin-Text Conversion on Segmented Text

Most current research and applications on Pinyin to Chinese word conversion employ hidden Markov models (HMMs), which in turn use a character-based language model. The reason is that Chinese texts are written without word boundaries. However, in some tasks that involve Pinyin to Chinese conversion, such as Chinese text proofreading, the original Chinese text is known. This enables us to extract the words, and a word-based language model can be developed. In this paper we compare the two models and conclude that a word-based bi-gram language model achieves higher conversion accuracy than a character-based bi-gram language model.

Wei Liu, Louise Guthrie
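The abstract contrasts character-based and word-based bi-gram models; the sketch below shows word-based bi-gram Viterbi decoding over already-segmented Pinyin, with a toy lexicon and log-probability table standing in for models estimated from segmented text.

# Word-based bi-gram Viterbi decoding sketch for Pinyin-to-Chinese conversion.
def convert(pinyin_words, lexicon, bigram, unk=-20.0):
    # pinyin_words: already-segmented Pinyin, e.g. ["zhong1guo2", "ren2"]
    # lexicon: Pinyin word -> list of candidate Chinese words
    # bigram:  (previous word, word) -> log probability
    best = {"<s>": 0.0}          # best score ending in each candidate word
    backpointers = [{}]
    for pw in pinyin_words:
        new_best, pointers = {}, {}
        for cand in lexicon.get(pw, ["<unk>"]):
            scores = {prev: s + bigram.get((prev, cand), unk)
                      for prev, s in best.items()}
            prev = max(scores, key=scores.get)
            new_best[cand], pointers[cand] = scores[prev], prev
        best, backpointers = new_best, backpointers + [pointers]
    # Trace back the best path.
    word = max(best, key=best.get)
    path = [word]
    for pointers in reversed(backpointers[1:]):
        word = pointers[word]
        path.append(word)
    return list(reversed(path))[1:]   # drop the "<s>" start marker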
Mining Phrases from Syntactic Analysis

In this paper we describe the exploitation of the syntactic parser synt to obtain information about syntactic structures (such as noun or verb phrases) of common sentences in Czech. From the analysis point of view, these phrases/structures are usually identical to nonterminals in the grammar used by the parser to find possible valid derivations of the given sentence. The parser has been extended in such a way that its highly ambiguous output can be used for mining those phrases unambiguously, and it offers several ways to identify them. To achieve this, some previously unused results of syntactic analysis have been exploited, leading to more precise morphological analysis and hence also to a deeper distinction among various syntactic (sub)structures. Finally, an application for shallow valency extraction and punctuation correction is presented.

Miloš Jakubíček, Aleš Horák, Vojtěch Kovář
Problems with Pruning in Automatic Creation of Semantic Valence Dictionary for Polish

In this paper we present the first step towards the automatic creation of a semantic valence dictionary of Polish verbs. First, the resources used in the process are listed. Second, the way of gathering corpus-based observations into a semantic valence dictionary and pruning them is discussed. Finally, an experiment in the application of the method is presented and evaluated.

Elżbieta Hajnicz

Speech

Disambiguating Tags in Blogs

Blog users enjoy tagging for better document organization, while ambiguity in tags leads to inaccuracy in tag-based applications, such as retrieval, visualization or trend discovery. The dynamic nature of tag meanings makes current word sense disambiguation (WSD) methods inapplicable. In this paper, we propose an unsupervised method for disambiguating tags in blogs. We first cluster the tags by their context words using Spectral Clustering. Then we compare a tag with these clusters to find the most suitable meaning. We use Normalized Google Distance to measure word similarity, which can be computed by querying search engines and thus reflects the up-to-date meaning of words. No human labeling effort or dictionary is needed in our method. Evaluation using crawled blog data showed a promising micro-average precision of 0.842.

Xiance Si, Maosong Sun
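Normalized Google Distance has a closed form over page-hit counts; a small illustration follows (the counts below are made up, and N is an assumed index size).

# Normalized Google Distance between two terms, from page-hit counts.
import math

def ngd(x_hits, y_hits, xy_hits, total_pages):
    """NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y))
                   / (log N - min(log f(x), log f(y)))"""
    fx, fy, fxy = map(math.log, (x_hits, y_hits, xy_hits))
    n = math.log(total_pages)
    return (max(fx, fy) - fxy) / (n - min(fx, fy))

# Terms that almost always co-occur get a distance near 0,
# unrelated terms a larger value.
print(ngd(x_hits=1_000_000, y_hits=800_000, xy_hits=600_000,
          total_pages=50_000_000_000))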
Intraclausal Coordination and Clause Detection as a Preprocessing Step to Dependency Parsing

The impact of clause and intraclausal coordination detection on dependency parsing of Slovene is examined. New methods based on machine learning and heuristic rules are proposed for clause and intraclausal coordination detection. They were included in a new dependency parsing algorithm, PACID. For evaluation, the Slovene Dependency Treebank was used. At parsing, 6.4% and 9.2% relative error reduction was achieved, compared to the dependency parsers MSTP and Malt, respectively.

Domen Marinčič, Matjaž Gams, Tomaž Šef
Transcription of Catalan Broadcast Conversation

The paper describes aspects, methods and results of the development of an automatic transcription system for Catalan broadcast conversation by means of speech recognition. Emphasis is given to the Catalan language, acoustic and language modelling methods, and recognition. Results are discussed in the context of phenomena and challenges in spontaneous speech, in particular regarding phoneme duration and feature space reduction.

Henrik Schulz, José A. R. Fonollosa, David Rybach
An Analysis of the Impact of Ambiguity on Automatic Humour Recognition

One of the most amazing characteristics that defines the human being is humour. Its analysis implies a set of subjective and fuzzy factors, such as the linguistic, psychological or sociological variables that produce it. This is one of the reasons why its automatic processing is not straightforward. However, recent research in the Natural Language Processing area has shown that humour can be automatically generated and recognised with success. On the basis of those achievements, in this study we present the experiments we have carried out on a collection of Italian texts in order to investigate how to characterize humour through the study of ambiguity, especially morphosyntactic and syntactic ambiguity. The results we have obtained show that it is possible to differentiate humorous from non-humorous data through features like perplexity or sentence complexity.

Antonio Reyes, Davide Buscaldi, Paolo Rosso
Objective vs. Subjective Evaluation of Speakers with and without Complete Dentures

For dento-oral rehabilitation of edentulous (toothless) patients, speech intelligibility is an important criterion. 28 persons read a standardized text once with and once without wearing complete dentures. Six experienced raters evaluated the intelligibility subjectively on a 5-point scale and the voice on the 4-point Roughness-Breathiness-Hoarseness (RBH) scales. Objective evaluation was performed by Support Vector Regression (SVR) on the word accuracy (WA) and word recognition rate (WR) of a speech recognition system, and on a set of 95 word-based prosodic features. The word accuracy combined with selected prosodic features showed a correlation of up to r = 0.65 to the subjective ratings for patients with dentures and r = 0.72 for patients without dentures. For the RBH scales, however, the average correlation of the feature subsets to the subjective ratings for both types of recordings was r < 0.4.

Tino Haderlein, Tobias Bocklet, Andreas Maier, Elmar Nöth, Christian Knipfer, Florian Stelzle
Automatic Pitch-Synchronous Phonetic Segmentation with Context-Independent HMMs

This paper deals with an HMM-based automatic phonetic segmentation (APS) system. In particular, the use of a pitch-synchronous (PS) coding scheme within the context-independent (CI) HMM-based APS system is examined and compared to the “more traditional” pitch-asynchronous (PA) coding schemes for a given Czech male voice. For bootstrap-initialised CI-HMMs, exploited when some (manually) pre-segmented data are available, the proposed PS coding scheme performed best, especially in combination with CART-based refinement of the automatically segmented boundaries. For flat-start-initialised CI-HMMs, an inferior initialisation method used when no pre-segmented data are available, standard PA coding schemes with longer parameterization shifts yielded better results. The results are also compared to those obtained for APS systems with context-dependent (CD) HMMs. It was shown that, at least for the researched male voice, multiple-mixture CI-HMMs outperform CD-HMMs in the APS task.

Jindřich Matoušek
First Experiments on Text-to-Speech System Personification

In the present paper, several experiments on text-to-speech system personification are described. The personification enables the TTS system to produce new voices by employing voice conversion methods. The baseline speech synthesizer is a concatenative corpus-based TTS system which utilizes the unit selection method. The voice identity change is performed by the transformation of spectral envelope, spectral detail and pitch. Two different personification approaches are compared in this paper. The former is based on the transformation of the original speech corpus, the latter transforms the output of the synthesizer. Specific advantages and disadvantages of both approaches are discussed and their performance is compared in listening tests.

Zdeněk Hanzlíček, Jindřich Matoušek, Daniel Tihelka
Parsing with Agreement

Shallow parsing has been proposed as a means of arriving at practically useful structures while avoiding the difficulties of full syntactic analysis. According to Abney’s principles, it is preferable to leave an ambiguity pending than to make a likely wrong decision. We show that continuous phrase chunking as well as shallow constituency parsing display evident drawbacks when faced with freer word order languages. Those drawbacks may lead to unnecessary data loss as a result of decisions forced by the formalism and therefore diminish the practical value of shallow parsers for Slavic languages.

We present an alternative approach to shallow parsing of noun phrases for Slavic languages which follows Abney’s original principles. The proposed approach to parsing is decomposed into several stages, some of which allow for marking discontinuous phrases.

Adam Radziszewski
An Adaptive BIC Approach for Robust Speaker Change Detection in Continuous Audio Streams

In this paper we focus on audio segmentation. We present a novel method for robust and accurate detection of acoustic change points in continuous audio streams. The presented segmentation procedure was developed as part of an audio diarization system for broadcast news audio indexing. In the presented approach, we tried to remove the need for pre-determined decision thresholds for detecting segment boundaries, which are usually required in standard segmentation procedures. The proposed segmentation aims to estimate decision thresholds directly from the currently processed audio data and thus reduces the need for additional threshold tuning on development data. It employs change-detection methods from two well-established audio segmentation approaches based on the Bayesian Information Criterion. Combining methods from both approaches enabled us to adaptively tune boundary-detection thresholds from the underlying processing data. All three segmentation procedures are tested and compared on a broadcast news audio database, where our proposed audio segmentation procedure shows its potential.

Janez Žibert, Andrej Brodnik, France Mihelič
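The underlying change-point test is the standard delta-BIC comparison between one full-covariance Gaussian and a two-Gaussian split; a compact sketch follows (the adaptive threshold estimation that is the paper’s contribution is not shown).

# Delta-BIC change detection sketch: positive values favour a change point.
import numpy as np

def delta_bic(X, split, lam=1.0):
    # X: (N, d) array of frame features (e.g. MFCCs); split: candidate index.
    N, d = X.shape
    X1, X2 = X[:split], X[split:]
    def logdet_cov(Y):
        _, logdet = np.linalg.slogdet(np.cov(Y, rowvar=False))
        return logdet
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    return (0.5 * N * logdet_cov(X)
            - 0.5 * len(X1) * logdet_cov(X1)
            - 0.5 * len(X2) * logdet_cov(X2)
            - lam * penalty)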
Fusion of Acoustic and Prosodic Features for Speaker Clustering

This work focuses on speaker clustering methods that are used in speaker diarization systems. The purpose of speaker clustering is to associate segments that belong to the same speakers. It is usually applied in the last stage of the speaker-diarization process. We concentrate on developing proper representations of speaker segments for clustering and explore different similarity measures for joining speaker segments together. We realize two different competitive systems. The first is a standard approach using a bottom-up agglomerative clustering principle with the Bayesian Information Criterion (BIC) as a merging criterion. In the second approach a fusion speaker clustering system is developed, where the speaker segments are modeled by acoustic and prosodic representations. The idea here is to additionally model the speaker’s prosodic characteristics and add them to the basic acoustic information estimated from the speaker segments. We construct 10 basic prosodic features derived from the energy of the audio signals, the estimated pitch contours, and the recognized voiced and unvoiced regions in speech. In this way we impose higher-level information on the representations of the speaker segments, which leads to improved clustering of the segments in the case of similar speaker acoustic characteristics or poor acoustic conditions.

Janez Žibert, France Mihelič
Combining Topic Information and Structure Information in a Dynamic Language Model

We present a language model implemented with dynamic Bayesian networks that combines topic information and structure information to capture long-distance dependencies between the words in a text while maintaining the robustness of standard n-gram models. We show that the model is an extension of sentence-level mixture models, thereby providing a Bayesian explanation for these models. We describe a procedure for unsupervised training of the model. Experiments show that it reduces perplexity by 13% compared to an interpolated trigram.

Pascal Wiggers, Leon Rothkrantz
Expanding Topic-Focus Articulation with Boundary and Accent Assignment Rules for Romanian Sentence

The present paper, maintaining the interest in applying the Prague School’s Topic-Focus Articulation (TFA) algorithm to Romanian, takes advantage of an experiment investigating intonational focus assignment in the Romanian sentence. Following two lines of research, a previous study showed that TFA behaves better than an inter-clausal selection procedure for assigning pitch accents to the Background-Kontrast (Topic-Focus) entities, while the Inference Boundary algorithm for computing the Theme-Rheme is more reliable and more easily extendable towards boundary and contour tone assignment rules, leading to our novel proposal of Sentence Boundary Assignment Rules (SBAR). The main contributions of this paper are: (a) the TFA algorithm applied to Romanian is extended to the inter-clause level and embedded into a discursive approach for computing the Background-Kontrast entities; (b) the Inference Boundary algorithm, applied to Romanian for clause-level Theme-Rheme span computing, is extended with a set of SBAR rules, which rely on different Communicative Dynamism (CD) degrees of the clause constituents; (c) on each intonational unit, the extended TFA algorithm is further refined, in order to eliminate ambiguities, with a filter derived from Gussenhoven’s SAAR (Sentence Accent Assignment Rule). Remarkably, TFA and CD degrees again proved to be resourceful, especially as technical procedures for a new and systematic development of an Information Structure Discourse Theory (ISDT).

Neculai Curteanu, Diana Trandabăţ, Mihai Alex Moruz
Lexical Affinity Measure between Words

In this paper we investigate the lexical affinity between words. Our goal is to define a distance measure that corresponds to the semantic affinity between words. This measure is based on WordNet, where two words/concepts can be connected by a chain of stepwise synonyms. The number of hops defines the distance between words. In addition we present a natural language processing toolbox, developed in the course of this work, that combines a number of existing tools and adds a number of analysis tools.

Ivar van Willegen, Leon Rothkrantz, Pascal Wiggers
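A possible reading of the hop-based measure, sketched with NLTK’s WordNet interface (an assumption about tooling, not the authors’ toolbox): breadth-first search over synonym links, counting hops.

# Synonym-chain hop distance sketch; requires the NLTK WordNet data
# (nltk.download("wordnet")) to be installed.
from collections import deque
from nltk.corpus import wordnet as wn

def synonyms(word):
    return {lemma.lower() for syn in wn.synsets(word)
            for lemma in syn.lemma_names()}

def hop_distance(source, target, max_hops=5):
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        word, hops = frontier.popleft()
        if word == target:
            return hops
        if hops == max_hops:
            continue
        for nxt in synonyms(word) - seen:
            seen.add(nxt)
            frontier.append((nxt, hops + 1))
    return None  # no synonym chain found within max_hops

print(hop_distance("car", "machine"))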
Multimodal Labeling

This paper is about the automated labeling of emotions. Prototypes have been developed for the extraction of features from facial expressions and speech. To train such systems, data is needed. In this paper we report on recordings of semi-spontaneous emotions. Multimodal emotional reactions were evoked in 21 controlled contexts. The purpose of this database is to serve as a benchmark for current and future emotion recognition studies in order to compare the results from different research groups. Validation of the recorded data is done online. Over 60 users scored the apex images (1.272 ratings), audio clips (201 ratings) and video clips (503 ratings) on the valence and arousal scales. Textual validation is done based on Whissell’s Dictionary of Affect in Language. A comparison is made between the scores of all four validation methods, and the results show some clusters for distinct emotions.

Leon Rothkrantz, Pascal Wiggers
The ORD Speech Corpus of Russian Everyday Communication “One Speaker’s Day”: Creation Principles and Annotation

The main aim of the ORD speech corpus is to capture Russian spontaneous speech in natural communicative situations. The corpus presents unique linguistic material, allowing fundamental research in many scientific aspects and the solution of various practical tasks, especially in speech technologies. The paper concerns the methodology and description of the ORD corpus creation and presents the system of annotations.

Alexander Asinovsky, Natalia Bogdanova, Marina Rusakova, Anastassia Ryko, Svetlana Stepanova, Tatiana Sherstinova
The Structure of the ORD Speech Corpus of Russian Everyday Communication

The paper presents the structure of the ORD speech corpus of Russian everyday communication, which contains recordings of all spoken episodes recorded during twenty-four hours by a demographically balanced group of people in St. Petersburg. The paper describes the structure of the corpus, consisting of audio files, annotation files and an information system, and reviews the main communicative episodes presented in the corpus.

Tatiana Sherstinova
Analysis and Assessment of AvID: Multi-Modal Emotional Database

The paper deals with the recording and the evaluation of a multimodal (audio/video) database of spontaneous emotions. Firstly, the motivation for this work is given and the different recording strategies used are described. Special attention is given to the process of evaluating the emotional database. Different kappa statistics normally used in measuring the agreement between annotators are discussed. Owing to the problems of standard kappa coefficients when used in emotional database assessment, a new time-weighted free-marginal kappa is presented. It differs from the other kappa statistics in that it weights each utterance’s particular agreement score by the duration of the utterance. The new method is evaluated and its superiority over the standard kappa, when dealing with a database of spontaneous emotions, is demonstrated.

Rok Gajšek, Vitomir Štruc, Boštjan Vesnicer, Anja Podlesek, Luka Komidar, France Mihelič
Refinement Approach for Adaptation Based on Combination of MAP and fMLLR

This paper deals with a combination of basic adaptation techniques for Hidden Markov Models used in speech recognition. The adaptation methods approach the data only through their statistics, which have to be accumulated before the adaptation process. When performing two adaptations subsequently, the data statistics have to be accumulated twice, once in each of the adaptation passes. However, when the adaptation methods are chosen with care, the data statistics may be accumulated only once, as proposed in this paper. This significantly reduces time consumption and avoids the need to store all the adaptation data. A combination of Maximum A-Posteriori Probability and feature Maximum Likelihood Linear Regression adaptation is considered. A motivation for such an approach is on-line adaptation, where time consumption is of great importance.

Zbyněk Zajíc, Lukáš Machlica, Luděk Müller
Towards the Automatic Classification of Reading Disorders in Continuous Text Passages

In this paper, we present an automatic classification approach to identify reading disorders in children. The identification is based on a standardized test. In the original setup the test is performed by a human supervisor who measures the reading duration and at the same time notes down all reading errors of the child. In this manner we recorded tests of 38 children who were suspected to have reading disorders. The data were then processed by an automatic system which employs speech recognition and prosodic analysis to identify the reading errors. In a subsequent classification experiment, based on the speech recognizer’s output, the duration of the test, and prosodic features, 94.7% of the children could be classified correctly.

Andreas Maier, Tobias Bocklet, Florian Hönig, Stefanie Horndasch, Elmar Nöth
A Comparison of Acoustic Models Based on Neural Networks and Gaussian Mixtures

This article compares the performance of neural network and Gaussian mixture acoustic models (GMMs). We have carried out tests which match up various models in terms of speed and achieved recognition accuracy. Since the speed-accuracy trade-off depends not only on the acoustic model itself but also on the settings of decoder parameters, we suggest a comparison based on an equal number of active states during the decoding search. Statistical significance measures are also discussed and a new method for confidence interval computation is introduced.

Tomáš Pavelka, Kamil Ekštein
Semantic Annotation for the LingvoSemantics Project

In this paper, a methodology for semantic annotation of the LingvoSemantics corpus is presented. Semantic annotation is usually a time-consuming and expensive process. We thus developed a methodology that significantly reduces the demands of the process. The methodology consists of a set of techniques and computer tools designed to simplify the process as much as possible. We claim that in this way it is possible to obtain a sufficient amount of annotated data in a reasonable time frame. The LingvoSemantics project focuses on semantic analysis of user questions to an Internet information retrieval system. The semantic representation approach is based on an abstract semantic annotation methodology. However, we advanced the annotation process: the bootstrapping method was used during the corpus annotation. The resulting annotated corpus consists of 20292 annotated sentences. In comparison to a straightforward style of annotation, our approach significantly improved the efficiency of the annotation. The results, as well as a set of recommendations for creating the annotated data, are presented at the end of the paper.

Ivan Habernal, Miloslav Konopík
Hybrid Semantic Analysis

This article focuses on the problem of meaning recognition in written utterances. The goal is to find a computer algorithm capable of constructing a meaning description of a given utterance. An original system for meaning recognition is described in this paper. The key idea of the system is the hybrid combination of expert and machine-learning approaches to meaning recognition. The system utilizes a novel algorithm for semantic parsing. The algorithm is based upon extended context-free grammars. The grammars are automatically inferred from the data.

Miloslav Konopík, Ivan Habernal
On a Computational Model for Language Acquisition: Modeling Cross-Speaker Generalisation

The discovery of words by young infants involves two interrelated processes: (a) the detection of recurrent word-like acoustic patterns in the speech signal, and (b) cross-modal association between auditory and visual information. This paper describes experimental results obtained by a computational model that simulates these two processes. The model is able to build word-like representations on the basis of multimodal input data (stimuli) without the help of an a priori specified lexicon. Each input stimulus consists of a speech signal accompanied by an abstract visual representation of the concepts referred to in the speech signal. In this paper we investigate how internal representations generalize across speakers. In doing so, we also analyze the cognitive plausibility of the model.

Louis ten Bosch, Joris Driesen, Hugo Van hamme, Lou Boves
Efficient Parsing of Romanian Language for Text-to-Speech Purposes

This paper presents the design of the text analysis component of a TTS system for the Romanian language. Our text analysis is performed in two steps: document structure detection and text normalization. The output is a tree-based representation of the processed data. Parsing is made efficient with the help of the Boost Spirit LL parser [1]; the usage of this tool allows for greater flexibility in the source code and in the output representation.

Andrei Şaupe, Lucian Radu Teodorescu, Mihai Alexandru Ordean, Răzvan Boldizsar, Mihaela Ordean, Gheorghe Cosmin Silaghi
Discriminative Training of Gender-Dependent Acoustic Models

The main goal of this paper is to explore methods of gender-dependent acoustic modeling that take the possibility of imperfect functioning of a gender detector into consideration. Such methods are beneficial in real-time recognition tasks (e.g., real-time subtitling of meetings) when the automatic gender detection is delayed or incorrect. The goal is to minimize the impact on the correct functioning of the recognizer. The paper also describes a technique of unsupervised splitting of training data, which can improve gender-dependent acoustic models trained on the basis of manual markers (male/female). The idea of this approach is grounded in the fact that a significant number of "masculine" female and "feminine" male voices occur in training corpora, and also in frequent errors in the manual markers.

Jan Vaněk, Josef V. Psutka, Jan Zelinka, Aleš Pražák, Josef Psutka
Design of the Test Stimuli for the Evaluation of Concatenation Cost Functions

A large number of methods for measuring audible discontinuities, which occur at concatenation points in synthesized speech, have been proposed in recent years. However, none of them has proved to be consistently better than the others across all languages and recording conditions, and the presented results have sometimes even been contradictory. What is more, none of the tested concatenation cost functions seems to reliably reflect the human perception of such discontinuities. Thus, the design of concatenation cost functions is still an open issue, and a lot of work remains to be done. In this paper, we deal with the problem of preparing test stimuli for evaluating the performance of these functions, which is, in our opinion, one of the key aspects in this field.

Milan Legát, Jindřich Matoušek
Towards an Intelligent User Interface: Strategies of Giving and Receiving Phone Numbers

Strategies of giving and receiving phone numbers in Estonian institutional calls are considered, with the further aim of developing a telephone-based user interface to databases that enables interaction in natural Estonian. The analysis is based on the Estonian dialogue corpus. Human operators give long phone numbers in several parts, making pauses after each part. The pitch contour works as a signal of continuation or of finishing the process. Clients give feedback during the process (repetition, particles, and pauses). Special strategies are used by clients to finish receiving the number as well as to initiate repairs in the case of communication problems.

Tiit Hennoste, Olga Gerassimenko, Riina Kasterpalu, Mare Koit, Andriela Rääbis, Krista Strandson
Error Resilient Speech Coding Using Sub-band Hilbert Envelopes

Frequency Domain Linear Prediction (FDLP) represents a technique for auto-regressive modelling of Hilbert envelopes of a signal. In this paper, we propose a speech coding technique that uses FDLP in Quadrature Mirror Filter (QMF) sub-bands of short segments of the speech signal (25 ms). Line Spectral Frequency parameters related to autoregressive models and the spectral components of the residual signals are transmitted. For simulating the effects of lossy transmission channels, bit-packets are dropped randomly. In the objective and subjective quality evaluations, the proposed FDLP speech codec is judged to be more resilient to bit-packet losses compared to the state-of-the-art Adaptive Multi-Rate Wide-Band (AMR-WB) codec at 12 kbps.

Sriram Ganapathy, Petr Motlicek, Hynek Hermansky

Dialog

Prototyping Dialogues with Midiki: An Information State Update Dialogue Manager

The context of the work presented in this paper is related to the design of systems and interfaces which enable natural language interaction between a person and a machine, through the use of Dialogue Managers (DM). Any practical DM must incorporate a model of a dialogue for the domain or task being addressed. Unfortunately, there are few methodologies and tools that allow authors to carry out an intuitive and easy modelling of such dialogues. This paper presents the proposal and application of a methodology and a tool for the authoring of generic dialogues for the MIDIKI DM, illustrated by the development of a concrete model of a dialogue.

Lúcio M. M. Quintal, Paulo N. M. Sampaio
Experiments with Automatic Query Formulation in the Extended Boolean Model

This paper concentrates on experiments with automatic creation of queries from natural language topics, suitable for use in the Extended Boolean information retrieval system. Because of the lack and/or inadequacy of the available methods, we propose a new method, based on pairing terms into a binary tree structure. The results of this method are compared with the results achieved by our implementation of the known method proposed by Salton and also with the results obtained with manually created queries. All experiments were performed on the same collection that was used in the CLEF 2007 campaign.

Lucie Skorkovská, Pavel Ircing
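For context, the Extended Boolean model scores a tree-structured query with p-norm AND/OR operators; a small sketch follows (p = 2 and the toy query are arbitrary choices, not the paper’s setup).

# p-norm scoring sketch for the Extended Boolean model; a query is a binary
# tree of ("AND"/"OR", left, right) nodes over term weights in [0, 1].
def score(node, weights, p=2.0):
    op, left, right = node if isinstance(node, tuple) else ("TERM", node, None)
    if op == "TERM":
        return weights.get(left, 0.0)
    a, b = score(left, weights, p), score(right, weights, p)
    if op == "OR":
        return ((a ** p + b ** p) / 2) ** (1 / p)
    if op == "AND":
        return 1 - (((1 - a) ** p + (1 - b) ** p) / 2) ** (1 / p)
    raise ValueError(op)

query = ("AND", ("OR", "prague", "czech"), "treebank")
print(score(query, {"prague": 0.8, "czech": 0.3, "treebank": 0.6}))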
Daisie: Information State Dialogues for Situated Systems

In this paper, we report on an information-state update (ISU) based dialogue management framework developed specifically for the class of situated systems. Advantages and limitations of the underlying ISU methodology and its supporting tools are first discussed, followed by an overview of the situated dialogue framework. Notable features of the new framework include its ISU basis, an assumption of agency in domain applications, a tightly-coupled, plugin-based integration mechanism, and a function-based contextualization process. In addition to reporting on these features, we also compare the framework to existing works both inside and outside of the situated dialogue domain.

Robert J. Ross, John Bateman
Linguistic Models Construction and Analysis for Satisfaction Estimation

Automatic analysis of customer conversations would be beneficial for service companies to improve service quality. In this context, such customer characteristics as satisfaction or competence are of special interest. Unfortunately, their manual estimation is very laborious and highly subjective. In this work, we aim at a parameterization of dialogues for formal (automatic) assessment of customer satisfaction. We elaborate a set of linguistic indicators represented both by lexico-syntactic patterns and rules, and introduce their classification by kind, location and sign. We propose several linear regression models for satisfaction estimation and check them and their parameters for statistical significance. The best of the models demonstrates a rather high level of concordance between automatic and manual assessments.

Natalia Ponomareva, Angels Catena
Shallow Features for Differentiating Disease-Treatment Relations Using Supervised Learning: A Pilot Study

Clinical narratives provide an information-rich, nearly unexplored corpus of evidential knowledge that is considered a challenge for practitioners in the language technology field, particularly because of the nature of the texts (excessive use of terminology, abbreviations, orthographic term variation), the significant opportunities for clinical research that such material can provide, and the potentially broad impact that clinical findings may have on everyday life. It is therefore recognized that the capability to automatically extract key concepts and their relationships from such data will allow systems to properly understand the content and knowledge embedded in the free text, which can be of great value for applications such as information extraction and question answering. This paper gives a brief presentation of such textual data and its semantic annotation, and discusses the set of semantic relations that can be observed between diseases and treatments in the sample. The problem is then cast as a supervised machine learning task in which the relations are learned from pre-annotated data. The challenges in designing the problem and empirical results are presented.

Dimitrios Kokkinakis
Extended Hidden Vector State Parser

The key component of a spoken dialogue system is the spoken language understanding module. There are many approaches to the design of the understanding module, and one of the most promising is statistically based semantic parsing. This paper presents a combination of a set of modifications of the hidden vector state (HVS) parser, a very popular method for statistical semantic parsing. This paper describes the combination of three modifications of the basic HVS parser and shows that these changes are almost independent. The proposed changes to the HVS parser form the extended hidden vector state parser (EHVS). The performance of the parser increases from 47.7% to 63.1% under exact match between the reference and the hypothesis semantic trees, evaluated on the Human-Human Train Timetable corpus. In spite of the increased performance, the complexity of the EHVS parser increases only linearly. Therefore the EHVS parser preserves the simplicity and robustness of the baseline HVS parser.

Jan Švec, Filip Jurčíček
Semantic Annotation of City Transportation Information Dialogues Using CRF Method

The article presents the results of an experiment in automatic concept annotation of transliterated spontaneous human-human dialogues in the city transportation domain. The data source was a corpus of dialogues collected at a Warsaw call center and annotated with about 200 concept types. The machine learning technique we used is the linear-chain Conditional Random Fields (CRF) sequence labeling approach. The model based on word lemmas in a window of length 5 gave concept recognition results with an F-measure of 0.85.

Agnieszka Mykowiecka, Jakub Waszczuk
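A hedged sketch of the feature layout implied by the abstract (word lemmas in a window of length 5), using the sklearn-crfsuite package as one possible linear-chain CRF implementation; variable names for the annotated dialogue data are hypothetical.

# Window-of-5 lemma features for linear-chain CRF concept labeling.
import sklearn_crfsuite

def token_features(lemmas, i):
    feats = {"lemma": lemmas[i]}
    for offset in (-2, -1, 1, 2):          # window of length 5 around token i
        j = i + offset
        if 0 <= j < len(lemmas):
            feats[f"lemma[{offset:+d}]"] = lemmas[j]
    return feats

def to_features(utterances):
    return [[token_features(lemmas, i) for i in range(len(lemmas))]
            for lemmas in utterances]

# train_lemmas: list of lemma sequences; train_labels: list of concept-tag
# sequences (hypothetical names for the annotated data).
# crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
# crf.fit(to_features(train_lemmas), train_labels)
# predicted = crf.predict(to_features(test_lemmas))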
Towards Flexible Dialogue Management Using Frames

This article focuses on our approach to dialogue management using frames. As we show, even when dealing with this simple technique, the manager is able to provide complex behaviour, for example maintenance of context causality. Our research goal is to create a domain-independent dialogue manager accompanied by an easy-to-use dialogue flow editor. At the end of this paper, future work is outlined, as the manager is still under development.

Tomáš Nestorovič
Backmatter
Metadata
Title
Text, Speech and Dialogue
Edited by
Václav Matoušek
Pavel Mautner
Copyright year
2009
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-04208-9
Print ISBN
978-3-642-04207-2
DOI
https://doi.org/10.1007/978-3-642-04208-9
