2006 | Book

Chinese Spoken Language Processing

5th International Symposium, ISCSLP 2006, Singapore, December 13-16, 2006. Proceedings

Edited by: Qiang Huo, Bin Ma, Eng-Siong Chng, Haizhou Li

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

Table of Contents

Frontmatter

Plenary

Interactive Computer Aids for Acquiring Proficiency in Mandarin

It is widely recognized that one of the best ways to learn a foreign language is through spoken dialogue with a native speaker. However, this is not a practical method in the classroom due to the one-to-one student/teacher ratio it implies. A potential solution to this problem is to rely on computer spoken dialogue systems to role-play a conversational partner. This paper describes several multilingual dialogue systems specifically designed to address this need. Students can engage in dialogue with the computer either over the telephone or through audio/typed input at a Web page. Several different domains are being developed, in which a student’s conversational interaction is assisted by a software agent functioning as a “tutor,” which can provide translation assistance at any time. Thus, two recognizers run in parallel, one for English and one for Chinese. Some of the research issues surrounding high-quality spoken language translation and dialogue interaction with a non-native speaker are discussed.

Stephanie Seneff
The Affective and Pragmatic Coding of Prosody

Prosody or intonation is a prime carrier of affective information, a function that has often been neglected in speech research. Most work on prosody has been informed by linguistic models of sentence intonation that focus on accent structure and which are based on widely differing theoretical assumptions. Human speech production includes both prosodic coding of emotions, such as anger or happiness, and pragmatic intonations, such as interrogative or affirmative modes, as part of the language codes. The differentiation between these two types of prosody still presents a major problem to speech researchers. It is argued that this distinction becomes more feasible when it is acknowledged that these two types of prosody are differently affected by the so-called “push” and “pull” effects. Push effects, influenced by psychophysiological activation, strongly affect emotional prosody, whereas pull effects, influenced by cultural rules of expression, predominantly affect intonation or pragmatic prosody, even though both processes influence all prosodic production. The push-pull distinction implies that biological marking (push) is directly externalized in motor expression, whereas pull effects (based on socio-cultural norms or desirable, esteemed reference persons) will require the shaping of the expression to conform to these models. Given that the underlying biological processes are likely to be dependent on both the idiosyncratic nature of the individual and the specific nature of the situation, we would expect relatively strong inter-individual differences in the expressive patterns resulting from push effects. This is not the case for pull effects. Here, because of the very nature of the models that pull the expression, we would expect a very high degree of symbolization and conventionalization, in other words comparatively few and small individual differences. With respect to cross-cultural comparison, we would expect the opposite: very few differences between cultures for push effects, large differences for pull effects.

Klaus R. Scherer
Challenges in Machine Translation

In recent years there has been an enormous boom in MT research. There has been not only an increase in the number of research groups in the field and in the amount of funding, but there is now also optimism for the future of the field and for achieving even better quality. The major reason for this change has been a paradigm shift away from linguistic/rule-based methods towards empirical/data-driven methods in MT. This has been made possible by the availability of large amounts of training data and large computational resources. This paradigm shift towards empirical methods has fundamentally changed the way MT research is done. The field faces new challenges. For achieving optimal MT quality, we want to train models on as much data as possible, ideally language models trained on hundreds of billions of words and translation models trained on hundreds of millions of words. Doing that requires very large computational resources, a corresponding software infrastructure, and a focus on systems building and engineering. In addition to discussing those challenges in MT research, the talk will also give specific examples on how some of the data challenges are being dealt with at Google Research.

Franz Josef Och
Automatic Indexing and Retrieval of Large Broadcast News Video Collections – The TRECVID Experience

Most existing operational systems rely purely on automatic speech recognition (ASR) text as the basis for news video indexing and retrieval. While current research shows that ASR text has been the most influential component, results of large scale news video processing experiments indicate that the use of other modality features and external information sources such as the Web is essential in various situations. This talk reviews the frameworks and machine learning techniques used to fuse the ASR text with multi-modal and multi-source information to tackle the challenging problems of story segmentation, concept detection and retrieval in broadcast news video. This paper also points the way towards the development of scalable technology to process large news video archives.

Tat-Seng Chua

Tutorial

An HMM-Based Approach to Flexible Speech Synthesis

The increasing availability of large speech databases makes it possible to construct speech synthesis systems, referred to as corpus-based, data-driven, speaker-driven, or trainable approaches, by applying statistical learning algorithms. These systems, which can be automatically trained, not only generate natural and high-quality synthetic speech but can also reproduce the voice characteristics of the original speaker. This talk presents one of these approaches, HMM-based speech synthesis. The basic idea of the approach is very simple: just train HMMs (hidden Markov models) and generate speech directly from them. To realize such a speech synthesis system, however, we need some tricks: algorithms for speech parameter generation from HMMs and a mel-cepstrum based vocoding technique are reviewed, and an approach to simultaneous modeling of phonetic and prosodic parameters (spectrum, F0, and duration) is also presented. The main feature of the system is the use of dynamic features: by including dynamic coefficients in the feature vector, the speech parameter sequence generated in synthesis is constrained to be realistic, as defined by the parameters of the HMMs. The attraction of this approach is that the voice characteristics of synthesized speech can easily be changed by transforming HMM parameters. Indeed, it has been shown that we can change the voice characteristics of synthetic speech by applying a speaker adaptation technique originally used in speech recognition systems. The relationship between the HMM-based approach and other concatenative speech synthesis approaches is also discussed. The talk presents not only the technical description but also recent results and demos.

Keiichi Tokuda
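
The parameter-generation step named above has a widely cited closed-form core, restated here in our own notation as a reminder rather than a transcript of the talk: for a fixed state sequence q, with W the window matrix that appends delta and delta-delta coefficients to the static feature sequence c, and μ_q, Σ_q the stacked state means and covariances, maximizing the output likelihood over c reduces to a linear system.

```latex
% Sketch (our notation): ML generation of the static trajectory c,
% assuming a fixed state sequence q; o = Wc stacks statics with deltas.
\hat{c} = \arg\max_{c}\, \mathcal{N}\!\left(Wc;\ \mu_{q},\ \Sigma_{q}\right)
\quad\Longrightarrow\quad
\left(W^{\top}\Sigma_{q}^{-1}W\right)\hat{c} \;=\; W^{\top}\Sigma_{q}^{-1}\mu_{q}
```

It is this coupling through W that keeps the generated trajectory smooth and realistic, as the abstract notes.
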
Text Information Extraction and Retrieval

Every day people spend much time creating, processing, and accessing information. In fact, most information exists in the form of "text", contained in books, emails, web pages, newspaper articles, blogs, and reports. How to help people quickly find information in text data, and how to help people discover new knowledge from it, have become enormously important issues. Many research efforts have been devoted to text information extraction, retrieval, and mining, and significant progress has been made in recent years. A large number of new methods have been proposed, and many systems have been developed and put into practical use. This tutorial aims to give an overview of two central topics of the area: Information Extraction (IE) and Information Retrieval (IR). Key technologies for each will be introduced. Specifically, models for IE such as the Maximum Entropy Markov Model and Conditional Random Fields will be explained, and models for IR such as the Language Model and Learning to Rank will be described. A brief survey of recent work on both IE and IR will be given. Finally, some recent work on the combined use of IE and IR technologies will also be introduced.

Hang Li

Topics in Speech Science

Mechanisms of Question Intonation in Mandarin

This study investigates mechanisms of question intonation in Mandarin Chinese. Three mechanisms of question intonation have been proposed: an overall higher phrase curve, higher strengths of sentence final tones, and a tone-dependent mechanism that flattens the falling slope of the final falling tone and steepens the rising slope of the final rising tone. The phrase curve and strength mechanisms were revealed by a computational modeling study and verified by the acoustic analyses as well as the perception experiments. The tone-dependent mechanism was suggested by a result from the perceptual study: question intonation is easier to identify if the sentence-final tone is falling whereas it is harder to identify if the sentence-final tone is rising, and was revealed by the acoustic analyses on the final Tone2 and Tone4.

Jiahong Yuan
Comparison of Perceived Prosodic Boundaries and Global Characteristics of Voice Fundamental Frequency Contours in Mandarin Speech

Although there have been many studies on the prosodic structure of spoken Mandarin as well as many proposals for labeling the prosody of spoken Mandarin, the labeling of prosodic boundaries in all the existing annotation systems relies on auditory perception, and lacks a direct relation to the acoustic process of prosody generation. Moreover, perception-based annotation cannot ensure a high degree of consistency and reliability. In the present study, we investigate the phrasing of spoken Mandarin from the production point of view, by using an acoustic model for generating F0 contours. The relationship between perceived prosodic boundaries at various layers and phrase commands derived from the model-based analysis of F0 contours is then revealed. The results indicate that a perception-based prosody labeling system cannot describe the prosodic structure as accurately as the model for F0 contour generation.

Wentao Gu, Keikichi Hirose, Hiroya Fujisaki
Linguistic Markings of Units in Spontaneous Mandarin

Spontaneous speech is produced, and probably also perceived, in some kind of units. This paper applies perceptually defined intonation units to segment spontaneous Mandarin data. The main aim is to examine spontaneous data to see whether linguistic cues marking the unit boundaries exist. If the production of spontaneous speech is a kind of concatenation of these "chunks", we can deepen our understanding of human language processing, and the related knowledge about boundary markings can be applied to improve language models used in automatic speech recognizers. Our results clearly show that discourse items and repair resumptions, which are typical phenomena in spontaneous speech, are mostly located at intonation unit boundaries. Moreover, temporal marking of items at unit boundaries is empirically identified through a series of analyses making use of the segmentation of intonation units and measurements of syllable durations.

Shu-Chuan Tseng
Phonetic and Phonological Analysis of Focal Accents of Disyllabic Words in Standard Chinese

The article investigates the phonetic and phonological properties of focal accents conveyed by disyllabic focused words with various tonal combinations in Standard Chinese. Phonetically, the effect of focal accents upon f0 resides in two aspects: the manner and the condition of focal accents. Phonologically, the analysis is mainly concerned with the distribution of focal accents. Acoustic and perceptual experiments and the underlying tonal targets of focused constituents are employed in both the phonetic realization and the phonological analysis. The major findings are that: f0 ranges of focused words are expanded as the H tones of both focused syllables are raised; the f0 of post-focus syllables is compressed markedly, in that the H tones of Tone1 and Tone2 are lowered; the realization of accents is closely related to the tonal target of the focused words; specifically, accents influence the acoustic performance of tones; furthermore, the combination of H/L determines the distribution of accents.

Yuan Jia, Ziyu Xiong, Aijun Li
Focus, Lexical Stress and Boundary Tone: Interaction of Three Prosodic Features

This paper studies how focus, lexical stress and rising boundary tone act on the F0 of the last preboundary word. We find that when the word is not focused, the rising boundary tone takes control almost from the beginning of the word and flattens the F0 peak of the lexical stress. When the word is focused, the rising boundary tone is only dominant after the F0 peak of the lexical stress is formed. This peak is even higher than the F0 height required by the rising boundary tone at the end of the word. Furthermore, the location of lexical stress constrains the height that the F0 peak and the high end can reach. The interaction of these three factors on a single word leads to F0 competition due to limited articulatory dimensions. The study helps to build prosodic models for high-quality speech synthesis.

Lu Zhang, Yi-Qing Zu, Run-Qiang Yan

Speech Analysis

A Robust Voice Activity Detection Based on Noise Eigenspace Projection

A robust voice activity detector (VAD) is expected to increase the accuracy of ASR in noisy environments. This study focuses on how to extract robust information for designing a robust VAD. To do so, we construct a noise eigenspace by principal component analysis of the noise covariance matrix. Projecting noisy speech onto the eigenspace, we find that information with higher SNR is generally located in the channels with smaller eigenvalues. According to this finding, the available components of the speech are obtained by sorting the noise eigenspace. Based on the extracted high-SNR components, we propose a robust voice activity detector. The threshold for deciding the available channels is determined using a histogram method. A probability-weighted speech presence is used to increase the reliability of the VAD. The proposed VAD is evaluated on the TIMIT database mixed with a number of noises. Experiments show that our algorithm performs better than traditional VAD algorithms.

Dongwen Ying, Yu Shi, Frank Soong, Jianwu Dang, Xugang Lu
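
To make the eigenspace-projection idea concrete, here is a minimal NumPy sketch under simplifying assumptions (stationary noise, frame-wise spectral feature vectors); the function names and toy data are ours, not the paper's.

```python
# Minimal sketch of noise-eigenspace projection for VAD, assuming
# stationary noise; names and toy data are illustrative only.
import numpy as np

def noise_eigenspace(noise_frames):
    """PCA of the noise covariance; np.linalg.eigh returns eigenvalues
    in ascending order, so leading columns have the least noise energy."""
    cov = np.cov(noise_frames, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvals, eigvecs

def high_snr_projection(frames, eigvecs, n_keep):
    """Keep the n_keep directions with the smallest noise eigenvalues,
    where the paper finds the higher-SNR speech information resides."""
    return frames @ eigvecs[:, :n_keep]

# Toy usage: score frames by energy in the retained subspace,
# then threshold (the paper uses a histogram method for this step).
rng = np.random.default_rng(0)
noise = rng.normal(size=(500, 24))
speech = noise + np.outer(np.sin(np.linspace(0, 20, 500)), np.ones(24))
_, vecs = noise_eigenspace(noise)
proj = high_snr_projection(speech, vecs, n_keep=8)
vad_score = (proj ** 2).sum(axis=1)
```
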
Pitch Mean Based Frequency Warping

In this paper, a novel pitch mean based frequency warping (PMFW) method is proposed to reduce pitch variability in speech signals at the front end of speech recognition. The warp factors used in this process are calculated from the average pitch of a speech segment. Two functions describing the relation between the frequency warping factor and the pitch mean are defined and compared. We use a simple method to perform frequency warping on the Mel-filter bank frequencies based on different warping factors. To solve the problem of bandwidth mismatch between the original and the warped spectra, a Mel-filter selection strategy is proposed. Finally, the PMFW mel-frequency cepstral coefficients (MFCCs) are extracted based on the regular MFCC procedure with several modifications. Experimental results show that the new PMFW MFCCs are more distinctive than the regular MFCCs.

Jian Liu, Thomas Fang Zheng, Wenhu Wu
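
A hedged sketch of where the warp factor plugs in: the paper defines two specific functions mapping the pitch mean to a warping factor, which are not reproduced here; the linear map below is an invented placeholder.

```python
# Sketch of pitch-mean based frequency warping; the factor mapping is an
# assumed linear placeholder, not one of the paper's two functions.
import numpy as np

def warp_factor(pitch_mean_hz, ref_pitch_hz=160.0, slope=0.3):
    """Map a segment's mean pitch to a warp factor near 1.0 (assumed form)."""
    return 1.0 + slope * (pitch_mean_hz - ref_pitch_hz) / ref_pitch_hz

def warp_mel_centers(center_freqs_hz, alpha):
    """Warp Mel-filter center frequencies by the factor alpha; a filter
    re-selection step would then handle the bandwidth mismatch."""
    return np.asarray(center_freqs_hz) * alpha

centers = np.linspace(100, 3800, 24)       # illustrative Mel-filter centers
alpha = warp_factor(pitch_mean_hz=220.0)   # higher pitch -> stretched filters
warped = warp_mel_centers(centers, alpha)
```
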
A Study of Knowledge-Based Features for Obstruent Detection and Classification in Continuous Mandarin Speech

A study of knowledge-based acoustic-phonetic features for obstruent detection and classification in continuous Mandarin speech is presented. The Seneff auditory model is used as the front-end processor for extracting acoustic-phonetic features. These features are rich in information content for a hierarchical decision process that detects and classifies Mandarin obstruents. Preliminary experiments showed that the accuracy of obstruent detection is about 84%. An algorithm based on the information of feature distribution is applied to further classify the obstruents into stops, fricatives, and affricates. The average accuracy of obstruent classification is about 80%. The proposed approach based on feature distribution is simple and effective, and could be a very promising method for improving phone detection in continuous speech recognition.

Kuang-Ting Sung, Hsiao-Chuan Wang
Speaker-and-Environment Change Detection in Broadcast News Using Maximum Divergence Common Component GMM

In this paper, the supervised maximum-divergence common component GMM (MD-CCGMM) model is applied to speaker-and-environment change detection in broadcast news signals. In order to discriminate speaker-and-environment changes in broadcast news, the MD-CCGMM signal model maximizes the likelihood of CCGMM signal modeling and the divergence measure of different audio signal segments simultaneously. Performance of the MD-CCGMM model was examined using a four-hour TV broadcast news database. An Equal Error Rate (EER) of 16.0% was achieved using the divergence measure of the CCGMM model; with the supervised MD-CCGMM model, a 14.6% Equal Error Rate can be achieved.

Yih-Ru Wang
UBM Based Speaker Segmentation and Clustering for 2-Speaker Detection

In this paper, a speaker segmentation method based on the log-likelihood ratio score (LLRS) over a universal background model (UBM) and a speaker clustering method based on the difference of log-likelihood scores between two speaker models are proposed. During the segmentation process, the LLRS between two adjacent speech segments over the UBM is used as a distance measure, while during the clustering process, the difference of log-likelihood scores between two speaker models is used as a speaker classification criterion. A complete system for the NIST 2002 2-speaker task is presented using the methods mentioned above. Experimental results on the NIST 2002 Switchboard Cellular speaker segmentation corpus, 1-speaker evaluation corpus and 2-speaker evaluation corpus show the potential of the proposed algorithms.

Jing Deng, Thomas Fang Zheng, Wenhu Wu
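
The abstract does not spell out the exact LLRS computation, so the sketch below shows one plausible instantiation: score a segment under a GMM fit to its neighbor versus under the UBM. Helper names, model orders, and the scikit-learn shortcut are ours.

```python
# One plausible LLRS-style distance between adjacent segments over a UBM;
# window sizes, model orders, and the fit-from-scratch step are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def llrs_distance(seg_a, seg_b, ubm):
    """Mean log-likelihood ratio of seg_b under a model of seg_a vs. the
    UBM; a strongly negative value suggests a change point."""
    model_a = GaussianMixture(n_components=4, covariance_type="diag",
                              reg_covar=1e-3).fit(seg_a)
    return model_a.score(seg_b) - ubm.score(seg_b)

rng = np.random.default_rng(1)
background = rng.normal(size=(2000, 13))   # stand-in for UBM training data
ubm = GaussianMixture(n_components=16, covariance_type="diag").fit(background)
seg1 = rng.normal(size=(200, 13))
seg2 = rng.normal(loc=0.5, size=(200, 13))
print(llrs_distance(seg1, seg2, ubm))
```
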
Design of Cubic Spline Wavelet for Open Set Speaker Classification in Marathi

In this paper, a new method of feature extraction based on the design of a cubic spline wavelet is described. Dialectal-zone-based speaker classification in the Marathi language has been attempted in the open-set mode using a polynomial classifier. The method consists of dividing the speech signal into non-uniform subbands in approximate Mel-scale using an admissible wavelet packet filterbank, and modeling each dialectal zone with the 2nd and 3rd order polynomial expansions of the feature vector. Confusion matrices are also shown for the different dialectal zones.

Hemant A. Patil, T. K. Basu

Speech Synthesis and Generation

Rhythmic Organization of Mandarin Utterances — A Two-Stage Process

This paper investigates the rhythmic organization of Mandarin utterances through both corpus analyses and experimental studies. We propose to add a new prosodic unit, the principle prosodic unit (PPU), to the prosodic hierarchy of Mandarin utterances. The key characteristic of the PPU is that inner-unit words normally have to be spoken closely, while inter-unit grouping is rather flexible. Because of this, we further suggest that the rhythmic organization of Mandarin utterances is a two-stage process. In the first stage, syllables are grouped into prosodic words, and then into PPUs. The forming of PPUs is restricted by local syntactic constraints and the length constraint. In the second stage, though the rhythmic constraint still has influence, the grouping of PPUs into phrases is rather flexible; normally, multiple equally good solutions exist for a sentence at this stage.

Min Chu, Yunjia Wang
Prosodic Boundary Prediction Based on Maximum Entropy Model with Error-Driven Modification

Prosodic boundary prediction is key to improving the intelligibility and naturalness of synthetic speech in a TTS system. This paper investigates the problem of automatic segmentation of prosodic words and prosodic phrases, which are two fundamental layers in the hierarchical prosodic structure of Mandarin Chinese. A Maximum Entropy (ME) model is used at the front end for both prosodic word and prosodic phrase prediction, but with different feature selection schemes, and a multi-pass prediction approach is adopted. In addition, an error-driven rule-based modification module is introduced at the back end to amend the initial prediction. Experiments showed that this combined approach outperformed other methods such as C4.5 and TBL.

Xiaonan Zhang, Jun Xu, Lianhong Cai
Prosodic Words Prediction from Lexicon Words with CRF and TBL Joint Method

Predicting prosodic word boundaries directly influences the naturalness of synthetic speech, because the prosodic word is at the lowest level of the prosody hierarchy. In this paper, a Chinese prosodic phrasing method based on CRF and TBL models is proposed. First, a CRF model is trained to predict prosodic word boundaries from lexicon words. After that, we apply a TBL-based error-driven learning approach to refine the results. The experiments show that this joint method performs much better than HMM.

Heng Kang, Wenju Liu
Prosodic Word Prediction Using a Maximum Entropy Approach

As the basic prosodic unit, the prosodic word greatly influences naturalness and intelligibility. Although research shows that lexicon words differ greatly from prosodic words, the lexicon word still provides important cues for prosodic word formation. The rhythm constraint is another important factor in prosodic word prediction: some lexicon word length patterns tend to be combined together. Based on the mapping relationship and the differences between lexicon words and prosodic words, the process of prosodic word prediction is divided into two parts: grouping lexicon words into a prosodic word, and splitting a lexicon word into prosodic words. This paper proposes a maximum entropy method to model these two parts, respectively. The experimental results show that this maximum entropy model is competent for the prosodic word prediction task. In the word grouping model, a feature selection algorithm is used to induce more efficient features for the model, which not only decreases the feature count greatly, but also improves model performance at the same time. The splitting model can correctly detect prosodic word boundaries within lexicon words. The f-score of prosodic word boundary prediction reaches 95.55%.

Honghui Dong, Jianhua Tao, Bo Xu
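
As a toy illustration of a maximum-entropy boundary model of this general kind, the sketch below trains a logistic classifier (the binary form of MaxEnt) on invented juncture features; the feature set and data are not the paper's.

```python
# Toy MaxEnt-style prosodic-word boundary classifier over invented
# juncture features; logistic regression is the binary MaxEnt model.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each sample: features at a lexicon-word juncture; label 1 = boundary.
train_feats = [
    {"left_len": 2, "right_len": 1, "left_pos": "n", "right_pos": "u"},
    {"left_len": 1, "right_len": 2, "left_pos": "v", "right_pos": "n"},
    {"left_len": 2, "right_len": 2, "left_pos": "n", "right_pos": "v"},
    {"left_len": 1, "right_len": 1, "left_pos": "d", "right_pos": "v"},
]
train_labels = [1, 0, 1, 0]

maxent = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
maxent.fit(train_feats, train_labels)
print(maxent.predict([{"left_len": 2, "right_len": 1,
                       "left_pos": "n", "right_pos": "u"}]))
```
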
Predicting Prosody from Text

In order to improve unlimited TTS, a framework that organizes multiple perceived units into discourse was proposed in [1]. To build an unlimited TTS system, we must transform the original text into text with corresponding boundary breaks, so in this paper we describe how we predict prosody from text. We use corpora with boundary breaks that follow the prosody framework, and then use lexical and syntactic information to predict prosody from text. The results show that the weighted precision of our model is better than that of some human speakers. We have shown that our model can predict a reasonable prosody from text.

Keh-Jiann Chen, Chiu-yu Tseng, Chia-hung Tai
Nonlinear Emotional Prosody Generation and Annotation

Emotion is an important element in expressive speech synthesis. This paper briefly analyzes prosodic parameters, stresses, rhythms and paralinguistic information in different emotional speech, and labels the speech with rich annotation information in multiple layers. Then, a CART model is used for emotional prosody generation. Unlike the traditional linear modification method, which directly modifies F0 contours and syllabic durations from acoustic distributions of emotional speech (such as F0 topline, F0 baseline, durations and intensities), the CART models try to map the subtle prosody distributions between neutral and emotional speech within various context information. Experiments show that, with the CART model, the traditional context information is able to generate good emotional prosody outputs; however, the results could be improved if richer information, such as stresses, breaks and jitter, were integrated into the context information.

Jianhua Tao, Jian Yu, Yongguo Kang
A Unified Framework for Text Analysis in Chinese TTS

This paper presents a robust text analysis system for Chinese text-to-speech synthesis. In this study, a lexicon word or a continuum of non-hanzi characters of the same category (e.g. a digit string) is defined as a morpheme, which is the basic unit forming a Chinese word. Based on this definition, the three key issues concerning the interpretation of real Chinese text, namely lexical disambiguation, unknown word resolution and non-standard word (NSW) normalization, can be unified in a single framework and reformulated as a two-pass tagging task on a sequence of morphemes. Our system consists of four main components: (1) a pre-segmenter for sentence segmentation and morpheme segmentation; (2) a lexicalized HMM-based chunker for identifying unknown words and guessing their part-of-speech categories; (3) an HMM-based tagger for converting orthographic morphemes to their Chinese phonetic representation (viz. pinyin), given their word-formation patterns and part-of-speech information; and (4) a post-processing module for interpreting phonetic tags and fine-tuning pronunciation order for some special NSWs where necessary. The evaluation on a pinyin-notated corpus built from the Peking University corpus shows that our system can achieve correct interpretation for most words.

Guohong Fu, Min Zhang, GuoDong Zhou, Kang-Kuong Luke
Speech Synthesis Based on a Physiological Articulatory Model

In this paper, a framework for speech synthesis based on a physiological articulatory model is proposed to replicate the process of human speech production. Within this framework, synthesis begins with given articulatory targets; muscle activation patterns are then estimated according to the targets by accounting for both the equilibrium characteristics and muscle dynamics; the articulatory model is consequently driven to generate a time-varying vocal tract shape corresponding to the targets by contracting the corresponding muscles. Thereafter, a transmission line model is applied to the time-varying vocal tract to produce the speech sound. Finally, a preliminary experiment is carried out to synthesize the single vowels and diphthongs of Chinese with the physiological-articulatory-model-based synthesizer. The results show that the spectra of the synthesized single vowels are consistent with those of real speech, and proper acoustic characteristics are obtained in most cases for diphthongs.

Qiang Fang, Jianwu Dang
An HMM-Based Mandarin Chinese Text-To-Speech System

In this paper we present our Hidden Markov Model (HMM)-based, Mandarin Chinese Text-to-Speech (TTS) system. Mandarin Chinese, or Putonghua ("the common spoken language"), is a tone language where each of the 400-plus base syllables can have up to 5 different lexical tone patterns. Their segmental and supra-segmental information is first modeled by 3 corresponding HMMs, covering: (1) spectral envelope and gain; (2) voiced/unvoiced decision and fundamental frequency; and (3) segment duration. The corresponding HMMs are trained on a read speech database of 1,000 sentences recorded by a female speaker. Specifically, the spectral information is derived from short-time LPC spectral analysis. Among all LPC parameters, the Line Spectrum Pair (LSP) has the closest relevance to the natural resonances, or "formants", of a speech sound, and it is selected to parameterize the spectral information. Furthermore, the property of LSPs clustering around a spectral peak justifies augmenting LSPs with their dynamic counterparts, both in time and frequency, in both HMM modeling and parameter trajectory synthesis. One hundred sentences synthesized by 4 LSP-based systems were subjectively evaluated with an AB comparison test. The listening test results show that LSP and its dynamic counterparts, both in time and frequency, are preferred for the resulting higher synthesized speech quality.

Yao Qian, Frank Soong, Yining Chen, Min Chu
HMM-Based Emotional Speech Synthesis Using Average Emotion Model

This paper presents a technique for synthesizing emotional speech based on an emotion-independent model which is called “average emotion” model. The average emotion model is trained using a multi-emotion speech database. Applying a MLLR-based model adaptation method, we can transform the average emotion model to present the target emotion which is not included in the training data. A multi-emotion speech database including four emotions, “neutral”, “happiness”, “sadness”, and “anger”, is used in our experiment. The results of subjective tests show that the average emotion model can effectively synthesize neutral speech and can be adapted to the target emotion model using very limited training data.

Long Qin, Zhen-Hua Ling, Yi-Jian Wu, Bu-Fan Zhang, Ren-Hua Wang
A Hakka Text-To-Speech System

In this paper, the implementation of a Hakka text-to-speech (TTS) system is presented. The system is designed on the same principles as the Mandarin and Min-Nan TTS systems proposed previously. It takes 671 base syllables as basic synthesis units and uses a recurrent neural network (RNN)-based prosody generator to generate proper prosodic parameters for synthesizing natural output speech. The whole system is implemented in software and runs in real time on a PC. An informal subjective listening test confirmed that the system performs well: all synthesized speech sounded good for well-tokenized texts and fair for texts with automatic tokenization.

Hsiu-Min Yu, Hsin-Te Hwang, Dong-Yi Lin, Sin-Horng Chen

Speech Enhancement

Adaptive Null-Forming Algorithm with Auditory Sub-bands

This paper presents a modified noise reduction algorithm for speech enhancement based on the scheme of null-forming. A fixed infinite-duration impulse response (IIR) filter is designed to calibrate the mismatch of the microphone pair. To reduce the performance degradation caused by the narrow-band effect, the signal is decomposed into several specified sub-bands with auditory characteristics. This increases the signal-to-noise ratio (SNR) considerably while preserving the auditory effect. Experiments are carried out to show the effectiveness of these processes.

Heng Zhang, Qiang Fu, Yonghong Yan
Multi-channel Noise Reduction in Noisy Environments

Multi-channel noise reduction has been widely researched as a way to reduce acoustic noise and to improve the performance of many speech applications in noisy environments. In this paper, we first introduce the state-of-the-art multi-channel noise reduction methods, especially beamforming-based methods, and discuss their performance limitations. Subsequently, we present a multi-channel noise reduction system we are developing that consists of localized noise suppression by a microphone array and non-localized noise suppression by post-filtering. Experimental results are also presented to show the benefits of our noise reduction system with respect to traditional algorithms in terms of speech recognition rate. Some suggestions are finally presented for further research.

Junfeng Li, Masato Akagi, Yôiti Suzuki

Acoustic Modeling for Automatic Speech Recognition

Minimum Phone Error (MPE) Model and Feature Training on Mandarin Broadcast News Task

The Minimum Phone Error (MPE) criterion for discriminative training was shown to be able to offer acoustic models with significantly improved performance. This concept was then further extended to Feature-space Minimum Phone Error (fMPE) and offset fMPE for training feature parameters as well. This paper reviews the concept of MPE and reports the experiments and results in performing MPE, fMPE and offset fMPE on the task of Mandarin Broadcast News, and significant improvements were obtained similar to the results reported for other languages and other tasks by other sites. In addition, a new concept of dimension-weighted offset fMPE is proposed in this work and even better performance than offset fMPE was obtained.

Jia-Yu Chen, Chia-Yu Wan, Yi Chen, Berlin Chen, Lin-shan Lee
State-Dependent Phoneme-Based Model Merging for Dialectal Chinese Speech Recognition

Aiming at building a dialectal Chinese speech recognizer from a standard Chinese speech recognizer with a small amount of dialectal Chinese speech, a novel, simple but effective acoustic modeling method, named state-dependent phoneme-based model merging (SDPBMM), is proposed and evaluated, in which a tied state of standard triphone(s) is merged with a state of the dialectal monophone that is identical to the central phoneme in the triphone(s). The proposed method performs well, but it introduces a Gaussian mixture expansion problem. To deal with this, an acoustic model distance measure, named the pseudo-divergence based distance measure, is proposed based on the difference measurement of Gaussian mixture models and then applied to downsize the model almost without causing any performance degradation for dialectal speech. With a small amount of only 40 minutes of Shanghai-dialectal Chinese speech, the proposed SDPBMM achieves a significant absolute syllable error rate (SER) reduction of 5.9% for dialectal Chinese and almost no performance degradation for standard Chinese. In combination with an existing adaptation method, a further absolute SER reduction of 1.9% can be achieved.

Linquan Liu, Thomas Fang Zheng, Wenhu Wu
Non-uniform Kernel Allocation Based Parsimonious HMM

In a conventional Gaussian mixture based Hidden Markov Model (HMM), all states are usually modeled with a uniform, fixed number of Gaussian kernels. In this paper, we propose to allocate kernels non-uniformly to construct a more parsimonious HMM. Different numbers of Gaussian kernels are allocated to states in a non-uniform and parsimonious way so as to optimize the Minimum Description Length (MDL) criterion, which is a combination of data likelihood and a model complexity penalty. Using the likelihoods obtained in Baum-Welch training, we develop an efficient backward kernel pruning algorithm, which is shown to be optimal under two mild assumptions. Two databases, Resource Management and Microsoft Mandarin Speech Toolbox, are used to test the proposed parsimonious modeling algorithm. The new parsimonious models improve the baseline word recognition error rate by 11.1% and 5.7%, relative. Alternatively, at the same performance level, a 35-50% model compression can be obtained.

Peng Liu, Jian-Lai Zhou, Frank Soong
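
A minimal numeric sketch of the MDL trade-off driving the kernel allocation, assuming diagonal-covariance Gaussians; the per-state likelihood values are placeholders, not real Baum-Welch output.

```python
# MDL = -log-likelihood + 0.5 * (#free parameters) * log(#frames);
# backward pruning keeps the kernel count that minimizes this score.
import numpy as np

def mdl_score(log_likelihood, n_kernels, dim, n_frames):
    """Diagonal Gaussians: each kernel has 2*dim parameters plus a weight."""
    n_params = n_kernels * (2 * dim + 1)
    return -log_likelihood + 0.5 * n_params * np.log(n_frames)

# Placeholder likelihoods for one state at decreasing kernel counts.
loglik_by_k = {8: -41200.0, 7: -41290.0, 6: -41450.0, 5: -41900.0}
scores = {k: mdl_score(ll, k, dim=39, n_frames=5000)
          for k, ll in loglik_by_k.items()}
best_k = min(scores, key=scores.get)   # non-uniform allocation per state
```
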
Consistent Modeling of the Static and Time-Derivative Cepstrums for Speech Recognition Using HSPTM

Most speech models represent the static and derivative cepstral features with separate models that can be inconsistent with each other. In our previous work, we proposed the hidden spectral peak trajectory model (HSPTM) in which the static cepstral trajectories are derived from a set of hidden trajectories of the spectral peaks (captured as spectral poles) in the time-frequency domain. In this work, the HSPTM is generalized such that both the static and derivative features are derived from a single set of hidden pole trajectories using the well-known relationship between the spectral poles and cepstral coefficients. As the pole trajectories represent the resonance frequencies across time, they can be interpreted as formant tracks in voiced speech which have been shown to contain important cues for phonemic identification. To preserve the common recognition framework, the likelihood functions are still defined in the cepstral domain with the acoustic models defined by the static and derivative cepstral trajectories. However, these trajectories are no longer separately estimated but jointly derived, and thus are ensured to be consistent with each other. Vowel classification experiments were performed on the TIMIT corpus, using low complexity models (2-mixture). They showed 3% (absolute) classification error reduction compared to the standard HMM of the same complexity.

Yiu-Pong Lai, Man-Hung Siu

Robust Speech Recognition

Vector Autoregressive Model for Missing Feature Reconstruction

This paper proposes the Vector Autoregressive (VAR) model as a new technique for missing feature reconstruction in ASR. We model the spectral features using multiple VAR models; a VAR model predicts missing features as a linear function of a block of feature frames. We also propose two schemes for VAR training and testing. Experiments on the AURORA-2 database have validated the modeling methodology and shown that the proposed schemes are especially effective for low-SNR speech signals. The best setting achieved a recognition accuracy of 88.2% at -5 dB SNR on the subway noise task when the oracle data mask is used.

Xiong Xiao, Haizhou Li, Eng Siong Chng
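
The core VAR prediction is easy to sketch: a frame is regressed on its p predecessors, with weights fit by least squares on clean data. Order p = 2 and the single-model setup below are simplifications of the paper's multiple-model scheme.

```python
# Least-squares sketch of a VAR predictor for missing feature frames;
# a single global model of order p = 2 is an illustrative simplification.
import numpy as np

def fit_var(clean_frames, p=2):
    """Fit x_t ~ W^T [x_{t-1}; ...; x_{t-p}; 1] by least squares."""
    T, _ = clean_frames.shape
    X = np.hstack([clean_frames[p - i - 1:T - i - 1] for i in range(p)])
    X = np.hstack([X, np.ones((T - p, 1))])      # bias column
    Y = clean_frames[p:]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def predict_frame(history, W):
    """Predict the next frame from the last p frames (oldest first)."""
    x = np.concatenate(list(history[::-1]) + [np.ones(1)])
    return x @ W

rng = np.random.default_rng(2)
clean = rng.normal(size=(1000, 23))              # e.g., log-spectral features
W = fit_var(clean, p=2)
frame_12 = predict_frame([clean[10], clean[11]], W)   # reconstruct frame 12
```
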
Auditory Contrast Spectrum for Robust Speech Recognition

Traditional speech representations are based on the power spectrum, which is obtained by energy integration over many frequency bands. Such representations are sensitive to noise, since noise energy distributed over a wide frequency band may deteriorate the representation. Inspired by the contrast-sensitive mechanism in auditory neural processing, in this paper we propose an auditory contrast spectrum extraction algorithm, a relative representation of the auditory temporal and frequency spectrum. In this algorithm, speech is first processed by temporal contrast processing, which enhances the temporal modulation envelopes in each auditory filter band and suppresses steady low-contrast envelopes. The temporal-contrast-enhanced speech is then integrated to form a speech spectrum, termed the temporal contrast spectrum. The temporal contrast spectrum is then analyzed in spectral scale spaces. Since speech and noise spectral profiles differ, we apply a lateral inhibition function to choose a spectral profile subspace in which the noise component is reduced while the speech component is not deteriorated. We project the temporal contrast spectrum onto the optimal scale space, in which a cepstral feature is extracted. We apply this cepstral feature to robust speech recognition experiments on the AURORA-2J corpus. The recognition results show a 61.12% relative performance improvement for clean training and a 27.45% relative improvement for multi-condition training.

Xugang Lu, Jianwu Dang
Signal Trajectory Based Noise Compensation for Robust Speech Recognition

This paper presents a novel signal trajectory based noise compensation algorithm for robust speech recognition. Its performance is evaluated on the Aurora 2 database. The algorithm consists of two processing stages: 1) the noise spectrum is estimated using trajectory auto-segmentation and clustering, so that spectral subtraction can be performed to roughly estimate the clean speech trajectories; 2) these trajectories are regenerated using trajectory HMMs, where the constraint between static and dynamic spectral information is imposed to refine the noise-subtracted trajectories in both “level” and “shape”. Experimental results show that recognition performance after spectral subtraction is improved with or without trajectory regeneration, but the HMM-regenerated trajectories yield the best improvement. After spectral subtraction, the average relative error rate reductions for clean and multi-condition training are 23.21% and 5.58%, respectively, and the proposed trajectory regeneration algorithm further improves them to 42.59% and 15.80%.

Zhi-Jie Yan, Jian-Lai Zhou, Frank Soong, Ren-Hua Wang
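
The first stage above ends in a conventional spectral subtraction; a bare-bones version is sketched below (the trajectory auto-segmentation/clustering and the HMM-based regeneration of the second stage are beyond a short sketch). The over-subtraction and flooring constants are assumptions.

```python
# Bare-bones spectral subtraction with flooring; constants are assumed.
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, over_sub=1.0, floor=0.02):
    """Subtract an estimated noise magnitude spectrum frame by frame,
    flooring the result to avoid negative values."""
    est = noisy_mag - over_sub * noise_mag
    return np.maximum(est, floor * noisy_mag)

rng = np.random.default_rng(3)
noisy = np.abs(rng.normal(size=(100, 129)))   # |STFT| frames (toy values)
noise_est = noisy[:10].mean(axis=0)           # e.g., from non-speech frames
clean_est = spectral_subtract(noisy, noise_est)
```
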
An HMM Compensation Approach Using Unscented Transformation for Noisy Speech Recognition

The performance of current HMM-based automatic speech recognition (ASR) systems degrades significantly in real-world applications where there are mismatches between training and testing conditions, caused by factors such as mismatched signal capturing and transmission channels and additive environmental noises. Among the many approaches proposed previously to cope with this robust ASR problem, two notable HMM compensation approaches are the so-called Parallel Model Combination (PMC) and Vector Taylor Series (VTS) approaches. In this paper, we introduce a new HMM compensation approach using a technique called the Unscented Transformation (UT). As a first step, we have studied three implementations of the UT approach with different computational complexities for noisy speech recognition, and evaluated their performance on the Aurora2 connected digits database. The UT approaches achieve significant improvements in recognition accuracy compared to log-normal-approximation-based PMC and first-order-approximation-based VTS approaches.

Yu Hu, Qiang Huo
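
To show what an unscented transformation does in this setting, the sketch below propagates a log-spectral speech Gaussian through the classic additive-noise mismatch function y = log(exp(x) + exp(n)), holding the noise fixed for brevity; the paper's three implementations are more elaborate.

```python
# Unscented transformation of a Gaussian through the log-add mismatch
# function; the fixed noise vector is a simplification for illustration.
import numpy as np

def unscented_moments(mean, cov, fn, kappa=1.0):
    """Propagate N(mean, cov) through fn via 2d+1 sigma points and
    return the transformed mean and covariance."""
    d = mean.size
    scale = np.linalg.cholesky((d + kappa) * cov)
    sigma = [mean] + [mean + scale[:, i] for i in range(d)] \
                   + [mean - scale[:, i] for i in range(d)]
    w = np.full(2 * d + 1, 1.0 / (2.0 * (d + kappa)))
    w[0] = kappa / (d + kappa)
    ys = np.array([fn(s) for s in sigma])
    y_mean = w @ ys
    diff = ys - y_mean
    return y_mean, (w[:, None] * diff).T @ diff

log_noise = np.full(3, -2.0)                 # fixed log-spectral noise
mismatch = lambda x: np.log(np.exp(x) + np.exp(log_noise))
mu_y, cov_y = unscented_moments(np.zeros(3), 0.1 * np.eye(3), mismatch)
```
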
Noisy Speech Recognition Performance of Discriminative HMMs

Discriminatively trained HMMs are investigated in both clean and noisy environments in this study. First, recognition error is defined at different levels, including string, word, phone and acoustics. A high-resolution error measure in terms of minimum divergence (MD) is specifically proposed and investigated along with other error measures. Using two speaker-independent continuous digit databases, Aurora2 (English) and CNDigits (Mandarin Chinese), the performance of recognizers trained with different error measures and training modes is evaluated under different noise and SNR conditions. Experimental results show that discriminatively trained models perform better than the maximum likelihood baseline systems. Specifically, for MD-trained systems, relative error reductions of 17.62% and 18.52% were obtained applying multi-training on Aurora2 and CNDigits, respectively.

Jun Du, Peng Liu, Frank Soong, Jian-Lai Zhou, Ren-Hua Wang
Distributed Speech Recognition of Mandarin Digits String

In this paper, the performance of the pitch detection algorithm in the ETSI ES-202-212 XAFE standard is evaluated on a Mandarin digit string recognition task. Experimental results showed that the performance of the pitch detection algorithm degraded seriously when the SNR of the speech signal was lower than 10 dB. This makes a recognizer using pitch information perform worse than the original recognizer without pitch information in low-SNR environments. A modification of the pitch detection algorithm is therefore proposed to improve pitch detection in low-SNR environments. Recognition performance can be improved at most SNR levels by integrating the recognizers with and without pitch information. Overall recognition rates of 82.1% and 86.8% were achieved for the clean and multi-condition training cases.

Yih-Ru Wang, Bo-Xuan Lu, Yuan-Fu Liao, Sin-Horng Chen

Speech Adaptation/Normalization

Unsupervised Speaker Adaptation Using Reference Speaker Weighting

Recently, we revisited the fast adaptation method called reference speaker weighting (RSW), and suggested a few modifications. We then showed that the algorithmically simplest technique actually outperformed conventional adaptation techniques like MAP and MLLR for 5- or 10-second supervised adaptation on the Wall Street Journal 5K task. In this paper, we further investigate the performance of RSW in unsupervised adaptation mode, which is the more natural way of doing adaptation in practice. Moreover, various analyses were carried out on the reference speakers computed by the method.

Tsz-Chung Lai, Brian Mak
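
The RSW constraint itself is compact: the test speaker's mean supervector is a weighted combination of reference-speaker supervectors. The least-squares fit below is an illustrative stand-in for the maximum-likelihood weight estimate.

```python
# Sketch of reference speaker weighting: adapted means constrained to the
# span of reference speakers; least squares stands in for the ML estimate.
import numpy as np

def rsw_weights(ref_supervectors, adapt_supervector):
    """Solve min_w || R^T w - s ||^2, rows of R being reference speakers."""
    w, *_ = np.linalg.lstsq(ref_supervectors.T, adapt_supervector, rcond=None)
    return w

rng = np.random.default_rng(4)
R = rng.normal(size=(50, 1170))    # 50 reference speakers' stacked means
s = R[:3].mean(axis=0) + 0.01 * rng.normal(size=1170)  # adaptation statistic
w = rsw_weights(R, s)
adapted = w @ R                    # new speaker's mean supervector
```
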
Automatic Construction of Regression Class Tree for MLLR Via Model-Based Hierarchical Clustering

In this paper, we propose a model-based hierarchical clustering algorithm that automatically builds a regression class tree for the well-known speaker adaptation technique – Maximum Likelihood Linear Regression (MLLR). When building a regression class tree, the mean vectors of the Gaussian components of the model set of a speaker independent CDHMM-based speech recognition system are collected as the input data for clustering. The proposed algorithm comprises two stages. First, the input data (i.e., all the Gaussian mean vectors of the CDHMMs) is iteratively partitioned by a divisive hierarchical clustering strategy, and the Bayesian Information Criterion (BIC) is applied to determine the number of clusters (i.e., the base classes of the regression class tree). Then, the regression class tree is built by iteratively merging these base clusters using an agglomerative hierarchical clustering strategy, which also uses BIC as the merging criterion. We evaluated the proposed regression class tree construction algorithm on a Mandarin Chinese continuous speech recognition task. Compared to the regression class tree implementation in HTK, the proposed algorithm is more effective in building the regression class tree and can determine the number of regression classes automatically.

Shih-Sian Cheng, Yeong-Yuh Xu, Hsin-Min Wang, Hsin-Chia Fu
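
A minimal sketch of BIC-driven selection of the number of base classes, using scikit-learn's built-in BIC (-2 log-likelihood plus a parameter-count penalty); the divisive and agglomerative tree-building stages themselves are omitted.

```python
# BIC-based choice of cluster count for Gaussian mean vectors; the
# tree-building (divisive then agglomerative) stages are not shown.
import numpy as np
from sklearn.mixture import GaussianMixture

def bic_for_k(X, k):
    gmm = GaussianMixture(n_components=k, covariance_type="diag",
                          random_state=0).fit(X)
    return gmm.bic(X)   # lower is better

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=m, size=(200, 10)) for m in (0.0, 3.0, 6.0)])
best_k = min(range(1, 7), key=lambda k: bic_for_k(X, k))   # expect 3
```
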

General Topics in Speech Recognition

A Minimum Boundary Error Framework for Automatic Phonetic Segmentation

This paper presents a novel framework for HMM-based automatic phonetic segmentation that improves the accuracy of placing phone boundaries. In the framework, both training and segmentation approaches are proposed according to the minimum boundary error (MBE) criterion, which tries to minimize the expected boundary errors over a set of possible phonetic alignments. This framework is inspired by the recently proposed minimum phone error (MPE) training approach and the minimum Bayes risk decoding algorithm for automatic speech recognition. To evaluate the proposed MBE framework, we conduct automatic phonetic segmentation experiments on the TIMIT acoustic-phonetic continuous speech corpus. MBE segmentation with MBE-trained models can identify 80.53% of human-labeled phone boundaries within a tolerance of 10 ms, compared to 71.10% identified by conventional ML segmentation with ML-trained models. Moreover, by using the MBE framework, only 7.15% of automatically labeled phone boundaries have errors larger than 20 ms.

Jen-Wei Kuo, Hsin-Min Wang
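
In our notation (not the paper's), the MBE decoding rule can be summarized as an expected-loss minimization over candidate alignments:

```latex
% Pick the alignment s minimizing the expected boundary error, where
% BE(s, s') accumulates boundary placement errors against a competing
% alignment s' weighted by its posterior given the observations O.
\hat{s} = \arg\min_{s}\ \sum_{s'} P(s' \mid O)\,\mathrm{BE}(s, s')
```
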

Large Vocabulary Continuous Speech Recognition

Advances in Mandarin Broadcast Speech Transcription at IBM Under the DARPA GALE Program

This paper describes the technical and system building advances in the automatic transcription of Mandarin broadcast speech made at IBM in the first year of the DARPA GALE program. In particular, we discuss the application of minimum phone error (MPE) discriminative training and a new topic-adaptive language modeling technique. We present results on both the RT04 evaluation data and two larger community-defined test sets designed to cover both the broadcast news and the broadcast conversation domain. It is shown that with the described advances, the new transcription system achieves a 26.3% relative reduction in character error rate over our previous best-performing system, and is competitive with published numbers on these datasets. The results are further analyzed to give a comprehensive account of the relationship between the errors and the properties of the test data.

Yong Qin, Qin Shi, Yi Y. Liu, Hagai Aronowitz, Stephen M. Chu, Hong-Kwang Kuo, Geoffrey Zweig
Improved Large Vocabulary Continuous Chinese Speech Recognition by Character-Based Consensus Networks

Word-based consensus networks have been verified to be very useful in minimizing word error rates (WER) in large vocabulary continuous speech recognition for Western languages. Considering the special structure of the Chinese language, this paper points out that character-based rather than word-based consensus networks should work better for Chinese. This was verified by extensive experimental results, also reported in the paper.

Yi-Sheng Fu, Yi-Cheng Pan, Lin-shan Lee
All-Path Decoding Algorithm for Segmental Based Speech Recognition

In conventional speech processing, researchers adopt a dividable assumption: the speech utterance can be divided into non-overlapping feature sequences, with each segment representing an acoustic event or a label, and the probability of a label sequence for an utterance is approximated by the probability of the best utterance segmentation for that label sequence. In reality, however, the feature sequences of acoustic events may partially overlap, especially for neighboring phonemes within a syllable, and the best-segmentation approximation reinforces the distortion introduced by the dividable assumption. In this paper, we propose an all-path decoding algorithm, which can fuse the information obtained by different segmentations (or paths) without obvious additional computation load, so that the weakness of the dividable assumption can be alleviated. Our experiments show that the new decoding algorithm can improve system performance effectively in tasks with heavy insertion and deletion errors.

Yun Tang, Wenju Liu, Bo Xu
Improved Mandarin Speech Recognition by Lattice Rescoring with Enhanced Tone Models

Tone plays an important lexical role in spoken tonal languages like Mandarin Chinese. In this paper we propose a two-pass search strategy for improving tonal syllable recognition performance. In the first pass, instantaneous F0 information is employed along with corresponding cepstral information in a 2-stream HMM based decoding. The F0 stream, which incorporates both discrete voiced/unvoiced information and continuous F0 contour, is modeled with a multi-space distribution. With just the first-pass decoding, we recently reported a relative improvement of 24% reduction of tonal syllable recognition errors on a Mandarin Chinese database [5]. In the second pass, F0 information over a horizontal, longer time span is used to build explicit tone models for rescoring the lattice generated in the first pass. Experimental results on the same Mandarin database show that an additional 8% relative error reduction of tonal syllable recognition is obtained by the second-pass search, lattice rescoring with enhanced tone models.

Huanliang Wang, Yao Qian, Frank Soong, Jian-Lai Zhou, Jiqing Han
On Using Entropy Information to Improve Posterior Probability-Based Confidence Measures

In this paper, we propose a novel approach that reduces the confidence error rate of traditional posterior probability-based confidence measures in large vocabulary continuous speech recognition systems. The method enhances the discriminability of confidence measures by applying entropy information to the posterior probability-based confidence measures of word hypotheses. The experiments conducted on the Chinese Mandarin broadcast news database MATBN show that entropy-based confidence measures outperform traditional posterior probability-based confidence measures. The relative reductions in the confidence error rate are 14.11% and 9.17% for experiments conducted on field reporter speech and interviewee speech, respectively.

Tzan-Hwei Chen, Berlin Chen, Hsin-Min Wang
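
A toy sketch of modulating a posterior-based confidence by entropy: when the posterior mass over competing hypotheses is spread out (high entropy), the confidence is shrunk. The exact combination rule below is assumed, not taken from the paper.

```python
# Entropy-weighted confidence: shrink a word posterior when the
# competing-hypothesis distribution is near-uniform (assumed rule).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def entropy_weighted_confidence(word_posterior, competitor_posteriors):
    h = entropy(competitor_posteriors)
    h_max = np.log(len(competitor_posteriors))   # uniform-case entropy
    return word_posterior * (1.0 - h / h_max)

print(entropy_weighted_confidence(0.8, [0.8, 0.1, 0.05, 0.05]))  # confident
print(entropy_weighted_confidence(0.4, [0.4, 0.3, 0.2, 0.1]))    # shrunk
```
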
Vietnamese Automatic Speech Recognition: The FLaVoR Approach

Automatic speech recognition for languages in Southeast Asia, including Chinese, Thai and Vietnamese, typically models both acoustics and language at the syllable level. This paper presents a new approach to recognizing these languages by exploiting information at the word level. The new approach, adapted from our FLaVoR architecture [1], consists of two layers. In the first layer, a pure acoustic-phonemic search generates a dense phoneme network enriched with meta-data. In the second layer, word decoding is performed by composing a series of finite state transducers (FSTs), combining various knowledge sources across sub-lexical, word-lexical and word-based language models. Experimental results on the Vietnamese Broadcast News corpus showed that our approach is both effective and flexible.

Quan Vu, Kris Demuynck, Dirk Van Compernolle

Multilingual Recognition and Identification

Language Identification by Using Syllable-Based Duration Classification on Code-Switching Speech

Many approaches to automatic spoken language identification (LID) on monolingual speech are successful, but LID on code-switching speech, which requires identifying at least two languages from one acoustic utterance, challenges these approaches. In [6], we successfully used a one-pass approach to recognize Chinese characters in Mandarin-Taiwanese code-switching speech. In this paper, we introduce a classification method (named syllable-based duration classification) based on three clues: the recognized common tonal syllable, the corresponding duration, and the speech signal, to identify a specific language in code-switching speech. Experimental results show that the performance of the proposed LID approach on code-switching speech is close to that of a parallel tonal syllable recognition LID system on monolingual speech.

Dau-cheng Lyu, Ren-yuan Lyu, Yuang-chin Chiang, Chun-nan Hsu

Speaker Recognition and Characterization

CCC Speaker Recognition Evaluation 2006: Overview, Methods, Data, Results and Perspective

For the special session on speaker recognition of the 5th International Symposium on Chinese Spoken Language Processing (ISCSLP 2006), the Chinese Corpus Consortium (CCC), the session organizer, developed a speaker recognition evaluation (SRE) to act as a platform for developers in this field to evaluate their speaker recognition systems using two databases provided by the CCC. In this paper, the objective of the evaluation, and the methods and the data used, are described. The results of the evaluation are also presented.

Thomas Fang Zheng, Zhanjiang Song, Lihong Zhang, Michael Brasser, Wei Wu, Jing Deng
The IIR Submission to CSLP 2006 Speaker Recognition Evaluation

This paper describes the design and implementation of a practical automatic speaker recognition system for the CSLP speaker recognition evaluation (SRE). The speaker recognition system is built upon four subsystems using speaker information from acoustic spectral features. In addition to the conventional spectral features, a novel temporal discrete cosine transform (TDCT) feature is introduced in order to capture long-term speech dynamics. The speaker information is modeled using two complementary speaker modeling techniques, namely the Gaussian mixture model (GMM) and the support vector machine (SVM). The resulting subsystems are then integrated at the score level through a multilayer perceptron (MLP) neural network. Evaluation results confirm that the feature selection, classifier design, and fusion strategy are successful, giving rise to an effective speaker recognition system.

Kong-Aik Lee, Hanwu Sun, Rong Tong, Bin Ma, Minghui Dong, Changhuai You, Donglai Zhu, Chin-Wei Eugene Koh, Lei Wang, Tomi Kinnunen, Eng-Siong Chng, Haizhou Li
A Novel Alternative Hypothesis Characterization Using Kernel Classifiers for LLR-Based Speaker Verification

In a log-likelihood ratio (LLR)-based speaker verification system, the alternative hypothesis is usually ill-defined and hard to characterize a priori, since it should cover the space of all possible impostors. In this paper, we propose a new LLR measure in an attempt to characterize the alternative hypothesis in a more effective and robust way than conventional methods. This LLR measure can be further formulated as a non-linear discriminant classifier and solved by kernel-based techniques, such as the Kernel Fisher Discriminant (KFD) and Support Vector Machine (SVM). The results of experiments on two speaker verification tasks show that the proposed methods outperform classical LLR-based approaches.

Yi-Hsiang Chao, Hsin-Min Wang, Ruei-Chuan Chang
Speaker Verification Using Complementary Information from Vocal Source and Vocal Tract

This paper describes a speaker verification system which uses two complementary acoustic features: Mel-frequency cepstral coefficients (MFCC) and wavelet octave coefficients of residues (WOCOR). While MFCC characterizes mainly the spectral envelope, or the formant structure of the vocal tract system, WOCOR aims at representing the spectro-temporal characteristics of the vocal source excitation. Speaker verification experiments carried out on the ISCSLP 2006 SRE database demonstrate the complementary contributions of MFCC and WOCOR to speaker verification. In particular, WOCOR performs even better than MFCC in the single-channel speaker verification task. Combining MFCC and WOCOR achieves higher performance than using MFCC alone in both single- and cross-channel speaker verification tasks.

Nengheng Zheng, Ning Wang, Tan Lee, P. C. Ching
ISCSLP SR Evaluation, UVA–CS_es System Description. A System Based on ANNs

This paper describes the system used in the ISCSLP06 Speaker Recognition Evaluation, text-independent cross-channel speaker verification task. It is a discriminative Artificial Neural Network-based system, using the Non-Target Incremental Learning method to select world representatives. Two different training strategies have been followed: (i) using world representative samples with the same channel type as the true model, and (ii) selecting the world representatives from a pool of samples without channel type identification. The best results have been achieved with the first alternative, although it introduces the additional problem of recognizing the true model's channel type. The system used in this task is also described in detail.

Carlos E. Vivaracho
Evaluation of EMD-Based Speaker Recognition Using ISCSLP2006 Chinese Speaker Recognition Evaluation Corpus

In this paper, we present the evaluation results of our proposed text-independent speaker recognition method based on the Earth Mover's Distance (EMD), using the ISCSLP2006 Chinese speaker recognition evaluation corpus developed by the Chinese Corpus Consortium (CCC). EMD-based speaker recognition (EMD-SR) was originally designed for a distributed speaker identification system, in which the feature vectors are compressed by vector quantization at a terminal and sent to a server that executes a pattern matching process. In this structure, speaker models must be trained on quantized data, so we utilized a non-parametric speaker model and EMD. In experiments on a Japanese speech corpus, EMD-SR showed higher robustness to quantized data than the conventional GMM technique, and it achieved higher accuracy than the GMM even when the data were not quantized. Hence, we took up the challenge of the ISCSLP2006 speaker recognition evaluation using EMD-SR. Since the identification tasks defined in the evaluation were on an open-set basis, we introduce a new speaker verification module in this paper. Evaluation results showed that EMD-SR achieves a 99.3% Identification Correctness Rate in the closed-channel speaker identification task.

Shingo Kuroiwa, Satoru Tsuge, Masahiko Kita, Fuji Ren
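
The Earth Mover's Distance between two weighted codebooks can be posed as the classical transportation linear program. The sketch below (illustrative only; it does not reproduce the authors' EMD-SR system) solves it with scipy's LP solver.

import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def emd(codebook_a, weights_a, codebook_b, weights_b):
    """codebook_*: (m, d) and (n, d) codeword arrays; weights sum to 1."""
    cost = cdist(codebook_a, codebook_b)          # ground distances d_ij
    m, n = cost.shape
    # Flow variables f_ij, flattened row-major; the flow marginals must
    # match the codeword weights on both sides.
    a_eq, b_eq = [], []
    for i in range(m):                            # sum_j f_ij = w_a[i]
        row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1
        a_eq.append(row); b_eq.append(weights_a[i])
    for j in range(n):                            # sum_i f_ij = w_b[j]
        row = np.zeros(m * n); row[j::n] = 1
        a_eq.append(row); b_eq.append(weights_b[j])
    res = linprog(cost.ravel(), A_eq=np.array(a_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun                                # minimal transport cost

rng = np.random.default_rng(2)
a, b = rng.normal(size=(16, 12)), rng.normal(size=(16, 12))
w = np.full(16, 1 / 16)
print("EMD between two codebooks:", emd(a, w, b, w))
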
Integrating Complementary Features with a Confidence Measure for Speaker Identification

This paper investigates the effectiveness of integrating complementary acoustic features for improved speaker identification performance. The complementary contributions of two acoustic features, the conventional vocal-tract-related features (MFCC) and the recently proposed vocal-source-related features (WOCOR), to speaker identification are studied. An integrating system, which performs score-level fusion of MFCC and WOCOR with a confidence measure as the weighting parameter, is proposed to take full advantage of the complementarity between the two features. The confidence measure is derived from the speaker discrimination powers of MFCC and WOCOR in each individual identification trial, so as to give more weight to the feature with higher confidence in speaker discrimination. Experiments show that information fusion with such a confidence-based varying weight outperforms fusion with a pre-trained fixed weight in speaker identification.

Nengheng Zheng, P. C. Ching, Ning Wang, Tan Lee
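
A minimal sketch of trial-dependent weighted score fusion, assuming each feature stream produces per-speaker scores for one identification trial. The confidence heuristic used here (separation between the two best scores) is an invented stand-in for the paper's discrimination-power measure.

import numpy as np

def confidence(scores):
    # How decisively a stream separates its best candidate from the rest.
    top2 = np.sort(scores)[-2:]
    return top2[1] - top2[0]

def fuse(mfcc_scores, wocor_scores):
    c_m, c_w = confidence(mfcc_scores), confidence(wocor_scores)
    alpha = c_m / (c_m + c_w)       # more weight to the confident stream
    return alpha * mfcc_scores + (1 - alpha) * wocor_scores

mfcc = np.array([1.2, 3.1, 0.4, 2.9])   # scores for 4 enrolled speakers
wocor = np.array([0.2, 1.8, 0.1, 0.7])
print("identified speaker:", int(np.argmax(fuse(mfcc, wocor))))
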
Discriminative Transformation for Sufficient Adaptation in Text-Independent Speaker Verification

In conventional Gaussian Mixture Model – Universal Background Model (GMM-UBM) text-independent speaker verification applications, the discriminability between speaker models and the universal background model (UBM) is crucial to the system's performance. In this paper, we present a method based on heteroscedastic linear discriminant analysis (HLDA) that can enhance the discriminability between speaker models and the UBM. This technique aims to discriminate the individual Gaussian distributions of the feature space. After the discriminative transformation, the overlap between Gaussian distributions is reduced. As a result, some Gaussian components of a target speaker model can be adapted more sufficiently during Maximum a Posteriori (MAP) adaptation, and these components will have more discriminative capability over the UBM. Results are presented on the NIST 2004 Speaker Recognition data corpus, where it is shown that this method provides significant performance improvements over the baseline system.

Hao Yang, Yuan Dong, Xianyu Zhao, Jian Zhao, Haila Wang
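
For reference, the MAP adaptation step whose sufficiency the HLDA transform is meant to improve follows the standard relevance-MAP formulation for GMM means (a textbook sketch, not the authors' code):

import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, frames, relevance=16.0):
    post = ubm.predict_proba(frames)              # (T, K) responsibilities
    n_k = post.sum(axis=0)                        # soft frame counts
    # Posterior-weighted mean of the adaptation data per component.
    e_k = (post.T @ frames) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]    # data-dependent weight
    # Components with little adaptation data stay close to the UBM mean.
    return alpha * e_k + (1 - alpha) * ubm.means_

rng = np.random.default_rng(3)
ubm = GaussianMixture(n_components=4, random_state=0)
ubm.fit(rng.normal(size=(2000, 12)))
speaker_frames = rng.normal(loc=0.3, size=(200, 12))
print(map_adapt_means(ubm, speaker_frames).shape)  # (4, 12) adapted means
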
Fusion of Acoustic and Tokenization Features for Speaker Recognition

This paper describes our recent efforts in exploring effective discriminative features for speaker recognition. Recent research has indicated that the appropriate fusion of features is critical to improving the performance of speaker recognition systems. In this paper we describe our approaches for the NIST 2006 Speaker Recognition Evaluation. Our system integrates cepstral GMM modeling, cepstral SVM modeling, and tokenization at both the phone level and the frame level. Experimental results on both the NIST 2005 SRE corpus and the NIST 2006 SRE corpus are presented. The fused system achieved an 8.14% equal error rate on the 1conv4w-1conv4w test condition of the NIST 2006 SRE.

Rong Tong, Bin Ma, Kong-Aik Lee, Changhuai You, Donglai Zhu, Tomi Kinnunen, Hanwu Sun, Minghui Dong, Eng-Siong Chng, Haizhou Li
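
The equal error rate quoted above can be computed from target and impostor trial scores as the operating point where false rejections and false acceptances balance. A generic sketch of the metric (not NIST's scoring tool):

import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(frr - far))   # threshold where the rates cross
    return (frr[i] + far[i]) / 2

rng = np.random.default_rng(4)
tgt = rng.normal(loc=2.0, size=1000)   # same-speaker trial scores
imp = rng.normal(loc=0.0, size=10000)  # different-speaker trial scores
print(f"EER ~ {100 * equal_error_rate(tgt, imp):.2f}%")
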

Spoken Language Understanding

Contextual Maximum Entropy Model for Edit Disfluency Detection of Spontaneous Speech

This study describes an approach to edit disfluency detection based on maximum entropy (ME) using contextual features for rich transcription of spontaneous speech. The contextual features contain word-level, chunk-level, and sentence-level features for edit disfluency modeling. Due to the problem of data sparsity, word-level features are determined according to the taxonomy of the primary features of the words defined in Hownet. Chunk-level features are extracted based on mutual information of the words. Sentence-level features are identified according to verbs and their corresponding features. The Improved Iterative Scaling (IIS) algorithm is employed to estimate the optimal weights in the maximum entropy models. Evaluations are conducted on edit disfluency detection and interruption point detection. Experimental results show that the proposed method outperforms the DF-gram approach.

Jui-Feng Yeh, Chung-Hsien Wu, Wei-Yen Wu
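
A maximum entropy classifier over such contextual features is equivalent to multinomial logistic regression. The sketch below uses sklearn's gradient-based fitting in place of the paper's IIS training, and the feature names and labels are invented for illustration.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each token is described by word-, chunk-, and sentence-level features.
train_features = [
    {"word": "I", "chunk_mi": "high", "verb_class": "act"},
    {"word": "uh", "chunk_mi": "low", "verb_class": "none"},
    {"word": "mean", "chunk_mi": "low", "verb_class": "speech"},
]
train_labels = ["fluent", "edit", "edit"]

# DictVectorizer turns the symbolic features into indicator features,
# the usual representation for a maximum entropy model.
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_features, train_labels)
print(model.predict([{"word": "uh", "chunk_mi": "low", "verb_class": "none"}]))
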

Human Language Acquisition, Development and Learning

Automatic Detection of Tone Mispronunciation in Mandarin

In this paper we present our study on detecting tone mispronunciations in Mandarin. Both template and HMM approaches are investigated. Schematic templates of pitch contours are shown to be impractical due to the large pitch range of inter- and even intra-speaker variation. Statistical Hidden Markov Models (HMMs) are instead used to generate a Goodness of Pronunciation (GOP) score for detection with an optimized threshold. To deal with the discontinuity of F0 in speech, multi-space distribution (MSD) modeling is used for building the corresponding HMMs. Under the MSD-HMM framework, the detection performance of different choices of features, HMM types, and GOP measures is evaluated.

Li Zhang, Chao Huang, Min Chu, Frank Soong, Xianda Zhang, Yudong Chen
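
A common form of the GOP score compares the likelihood of the claimed tone against the best competing tone, normalized by segment length. The sketch below assumes the per-tone log-likelihoods have already been computed by MSD-HMMs; the numbers and the decision threshold are invented.

def gop(loglik_per_tone, claimed_tone, n_frames):
    """loglik_per_tone: dict tone -> total log-likelihood of the segment."""
    best_competitor = max(loglik_per_tone.values())
    # GOP = length-normalized log-ratio of claimed tone vs. best tone;
    # it is 0 when the claimed tone is also the best-scoring one.
    return (loglik_per_tone[claimed_tone] - best_competitor) / n_frames

logliks = {1: -210.0, 2: -195.0, 3: -230.0, 4: -240.0}  # toy values
score = gop(logliks, claimed_tone=1, n_frames=40)
print("mispronounced" if score < -0.3 else "accepted", score)
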
Towards Automatic Tone Correction in Non-native Mandarin

Feedback is an important part of foreign language learning and Computer Aided Language Learning (CALL) systems. For pronunciation tutoring, one method to provide feedback is to provide examples of correct speech for the student to imitate. However, this may be frustrating if a student is unable to completely match the example speech. This research advances towards providing feedback using a student’s own voice. Using the case of an American learning Mandarin Chinese, the differences between native and non-native pronunciations of Mandarin tone are highlighted, and a method for correcting tone errors is presented, which uses pitch transformation techniques to alter student tone productions while maintaining other voice characteristics.

Mitchell Peabody, Stephanie Seneff
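
A deliberately simplified sketch of the idea: reshape the student's F0 contour toward a native tone template while re-imposing the student's own pitch level and range, so the corrected example still sounds like the student. The paper's actual transformation is more sophisticated; everything below is illustrative.

import numpy as np

def correct_tone(student_f0, template_f0):
    """Both arguments: voiced-frame F0 contours in Hz, same length."""
    log_s, log_t = np.log(student_f0), np.log(template_f0)
    # Take the template's contour *shape* (zero mean, unit spread), then
    # scale and shift it back into the student's pitch range and mean.
    shape = (log_t - log_t.mean()) / (log_t.std() + 1e-8)
    corrected = log_s.mean() + shape * log_s.std()
    return np.exp(corrected)

student = np.linspace(220, 180, 40)               # erroneous falling contour
template = 200 * np.exp(np.linspace(0, 0.3, 40))  # rising (tone 2) shape
print(correct_tone(student, template)[:5])
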

Spoken and Multimodal Dialog Systems

A Corpus-Based Approach for Cooperative Response Generation in a Dialog System

This paper presents a corpus-based approach for cooperative response generation in a spoken dialog system for the Hong Kong tourism domain. A corpus with 3874 requests and responses is collected using a Wizard-of-Oz framework. The corpus then undergoes a regularization process that simplifies the interactions to ease subsequent modeling. A semi-automatic process is developed to annotate each utterance in the dialog turns in terms of its key concepts (KC), task goal (TG) and dialog acts (DA). TG and DA characterize the informational goal and communicative goal of the utterance respectively. The annotation procedure is integrated with a dialog modeling heuristic and a discourse inheritance strategy to generate a semantic abstraction (SA), in the form of {TG, DA, KC}, for each user request and system response in the dialog. Semantic transitions, i.e. {TG, DA, KC}_user → {TG, DA, KC}_system, may hence be directly derived from the corpus as rules for response message planning. Related verbalization methods may also be derived from the corpus and used as templates for response message realization. All the rules and templates are stored externally in a human-readable text file, which brings the advantage of easy extensibility of the system. Evaluation of this corpus-based approach shows that 83% of the generated responses are coherent with the user's request, and qualitative rating achieves a score of 4.0 on a five-point Likert scale.

Zhiyong Wu, Helen Meng, Hui Ning, Sam C. Tse
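
To make the planning/realization split concrete, here is a toy sketch: corpus-derived transition rules map a user's {TG, DA, KC} abstraction to the system's, and a template then realizes the message. The rules, templates, and domain values below are invented, not taken from the paper's corpus.

# (task_goal, dialog_act) of user -> (task_goal, dialog_act) of system
transition_rules = {
    ("hotel_info", "request"): ("hotel_info", "inform"),
    ("transport", "request"): ("transport", "inform"),
}
# Verbalization templates for response message realization.
templates = {
    ("hotel_info", "inform"): "The {kc} offers rooms from HK$800 per night.",
    ("transport", "inform"): "You can reach {kc} by MTR or taxi.",
}

def respond(user_tg, user_da, key_concept):
    sys_tg, sys_da = transition_rules[(user_tg, user_da)]   # planning
    return templates[(sys_tg, sys_da)].format(kc=key_concept)  # realization

print(respond("hotel_info", "request", "Peninsula Hotel"))
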
A Cantonese Speech-Driven Talking Face Using Translingual Audio-to-Visual Conversion

This paper proposes a novel approach towards a video-realistic, speech-driven talking face for Cantonese. We present a technique that realizes a talking face for a target language (Cantonese) using only audio-visual facial recordings for a base language (English). Given a Cantonese speech input, we first use a Cantonese speech recognizer to generate a Cantonese syllable transcription. Then we map it to an English phoneme transcription via a translingual mapping scheme that involves symbol mapping and time alignment from Cantonese syllables to English phonemes. With the phoneme transcription, the input speech, and the audio-visual models for English, an EM-based conversion algorithm is adopted to generate mouth animation parameters associated with the input Cantonese audio. We have carried out audio-visual syllable recognition experiments to objectively evaluate the proposed talking face. Results show that the visual speech synthesized by the Cantonese talking face can effectively increase the accuracy of Cantonese syllable recognition under noisy acoustic conditions.

Lei Xie, Helen Meng, Zhi-Qiang Liu
The Implementation of Service Enabling with Spoken Language of a Multi-modal System Ozone

In this paper we describe the architecture and key issues of the service enabling layer of Ozone, a multi-modal system oriented toward new technologies and services for emerging nomadic societies. The main objective of the Ozone system is to offer a generic framework to enable consumer-oriented Ambient-Intelligence applications. As a large multi-modal system, Ozone consists of many functional modules. However, spoken language plays an important role in facilitating the use of the system. Hence, we present the design principles of the system architecture, the service enabling layer, and the spoken language processing techniques used in multi-modal interaction.

Sen Zhang, Yves Laprie
Spoken Correction for Chinese Text Entry

With an average of 17 Chinese characters per phonetic syllable, correcting conversion errors with current phonetic input method editors (IMEs) is often painstaking and time consuming. We explore the application of spoken character description as a correction interface for Chinese text entry, in part motivated by the common practice of describing Chinese characters in names for self-introductions. In this work, we analyze typical character descriptions, extend a commercial IME with a spoken correction interface, and evaluate the resulting system in a user study. Preliminary results suggest that although correcting IME conversion errors with spoken character descriptions may not be more effective than traditional techniques for everyone, nearly all users see the potential benefit of such a system and would recommend it to friends.

Bo-June Paul Hsu, James Glass

Speech Data Mining and Document Retrieval

Extractive Chinese Spoken Document Summarization Using Probabilistic Ranking Models

The purpose of extractive summarization is to automatically select indicative sentences, passages, or paragraphs from an original document according to a certain target summarization ratio, and then sequence them to form a concise summary. In this paper, in contrast to conventional approaches, our objective is to deal with the extractive summarization problem under a probabilistic modeling framework. We investigate the use of the hidden Markov model (HMM) for spoken document summarization, in which each sentence of a spoken document is treated as an HMM for generating the document, and the sentences are ranked and selected according to their likelihoods. In addition, the relevance model (RM) of each sentence, estimated from a contemporary text collection, is integrated with the HMM model to improve the representation of the sentence model. The experiments were performed on Chinese broadcast news compiled in Taiwan. The proposed approach achieves noticeable performance gains over conventional summarization approaches.

Yi-Ting Chen, Suhan Yu, Hsin-Min Wang, Berlin Chen
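
A heavily simplified stand-in for the ranking idea: score each sentence by how well a unigram model of that sentence, smoothed with a background collection model, generates the whole document, then extract the top-ranked sentences. The mixture weight and data are invented; the paper's HMM and relevance-model estimation are richer than this.

import math
from collections import Counter

def sentence_score(sentence_words, doc_words, background, lam=0.6):
    sent = Counter(sentence_words)
    total = sum(sent.values())
    score = 0.0
    for w in doc_words:
        p_sent = sent[w] / total if total else 0.0
        p_bg = background.get(w, 1e-6)          # collection smoothing
        score += math.log(lam * p_sent + (1 - lam) * p_bg)
    return score

doc = "市長 今天 宣布 新 的 交通 政策 交通 政策 將 改善 市區".split()
background = {w: 0.01 for w in set(doc)}
sentences = [doc[:7], doc[7:]]
ranked = sorted(sentences, key=lambda s: sentence_score(s, doc, background),
                reverse=True)
print("top sentence:", " ".join(ranked[0]))
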
Meeting Segmentation Using Two-Layer Cascaded Subband Filters

The extraction of information from recorded meetings is a very important yet challenging task. The problem lies in the inability of speech recognition systems to be directly applied to meeting speech data, mainly because meeting participants speak concurrently and head-mounted microphones record more than just their wearers’ utterances – crosstalk from their neighbours is inevitably recorded as well. As a result, a degree of preprocessing of these recordings is needed. The current work presents an approach to segment meetings into four audio classes: single speaker, crosstalk, single speaker plus crosstalk, and silence. For this purpose, we propose Two-Layer Cascaded Subband Filters, which spread according to the pitch and formant frequency scales. These filters are able to detect the presence or absence of pitch and formants in an audio signal. In addition, the filters can determine how many pitches and formants are present in an audio signal based on the output subband energies. Experiments conducted on the ICSI meeting corpus show that although the overall recognition rate was 57%, the rates for the crosstalk and silence classes are as high as 80%. This indicates the positive effect and potential of this subband feature in meeting segmentation tasks.

Manuel Giuliani, Tin Lay Nwe, Haizhou Li
A Multi-layered Summarization System for Multi-media Archives by Understanding and Structuring of Chinese Spoken Documents

Multi-media archives are very difficult to display on the screen, and very difficult to retrieve and browse. It is therefore important to develop technologies that summarize entire archives of network content to help the user in browsing and retrieval. In a recent paper [1] we proposed a complete set of multi-layered technologies to handle at least some of the above issues: (1) Automatic Generation of Titles and Summaries for each of the spoken documents, so that the spoken documents become much easier to browse; (2) Global Semantic Structuring of the entire spoken document archive, offering the user a global picture of the semantic structure of the archive; and (3) Query-based Local Semantic Structuring for the subset of spoken documents retrieved by the user’s query, providing the user with the detailed semantic structure of the relevant spoken documents given the entered query. Probabilistic Latent Semantic Analysis (PLSA) is found to be helpful. This paper presents an initial prototype system for Chinese archives with the functions mentioned above, in which a broadcast news archive in Mandarin Chinese is taken as the example archive.

Lin-shan Lee, Sheng-yi Kong, Yi-cheng Pan, Yi-sheng Fu, Yu-tsun Huang, Chien-Chih Wang
Initial Experiments on Automatic Story Segmentation in Chinese Spoken Documents Using Lexical Cohesion of Extracted Named Entities

Story segmentation plays a critical role in spoken document processing. Spoken documents often come in a continuous audio stream without explicit boundaries related to stories or topics. It is important to be able to automatically segment these audio streams into coherent units. This work is an initial attempt to make use of informative lexical terms (or key terms) in recognition transcripts of Chinese spoken documents for story segmentation. This is because changes in the distribution of informative terms are generally associated with story changes and topic shifts. Our methods of informative lexical term extraction include the extraction of POS-tagged nouns, as well as a named entity identifier that extracts Chinese person names, transliterated person names, location and organization names. We also adopted a lexical chaining approach that links up sentences that are lexically “coherent” with each other. This leads to the definition of a lexical chain score that is used for hypothesizing story boundaries. We conducted experiments on the recognition transcripts of the TDT2 Voice of America Mandarin speech corpus. We compared several methods of story segmentation, including the use of pauses, the use of lexical chains of all lexical entries in the recognition transcripts, the use of lexical chains of nouns tagged by a part-of-speech tagger, and the use of lexical chains of extracted named entities. Lexical chains of informative terms, namely POS-tagged nouns and named entities, were found to give comparable performance (F-measures of 0.71 and 0.73 respectively), which is superior to the use of all lexical entries (F-measure of 0.69).

Devon Li, Wai-Kit Lo, Helen Meng
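
A toy sketch of boundary hypothesis from lexical chains: treat a chain as the span between the first and last mention of an informative term, and prefer boundaries that cut few chains. This is a simplification of the paper's chain scoring; the terms and sentences are invented.

def chain_spans(sentences, informative_terms):
    first, last = {}, {}
    for i, sent in enumerate(sentences):
        for term in sent:
            if term in informative_terms:
                first.setdefault(term, i)
                last[term] = i
    return [(first[t], last[t]) for t in first]

def boundary_scores(sentences, informative_terms):
    spans = chain_spans(sentences, informative_terms)
    # Score a boundary after sentence i by how many chains it would cut;
    # a story boundary is plausible where this count is low.
    return [sum(s <= i < e for s, e in spans)
            for i in range(len(sentences) - 1)]

docs = [["克林頓", "訪問"], ["克林頓", "北京"], ["颱風", "警報"], ["颱風"]]
print(boundary_scores(docs, {"克林頓", "颱風"}))  # lowest count => boundary
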

Machine Translation of Speech

Some Improvements in Phrase-Based Statistical Machine Translation

In statistical machine translation, many of the top-performing systems are phrase-based. This paper describes a phrase-based translation system and some improvements. We use more information to compute translation probabilities. The scaling factors of the log-linear models are estimated by minimum error rate training, using an evaluation criterion that balances BLEU and NIST scores. We extract phrase templates from initial phrases to deal with data sparseness and the distortion problem during decoding. The system produces its final output by re-ranking the n-best list of translations generated in the first pass. Experiments show that all these refinements yield better results.

Zhendong Yang, Wei Pang, Jinhua Du, Wei Wei, Bo Xu
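
The log-linear model at the core of such systems scores a candidate translation as a weighted sum of feature-function values, with the weights tuned by minimum error rate training. A minimal sketch with invented feature names and toy numbers:

def loglinear_score(features, weights):
    # score(e|f) = sum_i lambda_i * h_i(e, f)
    return sum(weights[name] * value for name, value in features.items())

# Weights as MERT might leave them (illustrative values only).
weights = {"log_p_translation": 1.0, "log_lm": 0.8,
           "distortion": -0.4, "length_penalty": -0.2}

candidates = [
    {"log_p_translation": -4.1, "log_lm": -9.2,
     "distortion": 2, "length_penalty": 6},
    {"log_p_translation": -3.5, "log_lm": -10.8,
     "distortion": 4, "length_penalty": 7},
]
best = max(candidates, key=lambda c: loglinear_score(c, weights))
print("best candidate score:", loglinear_score(best, weights))
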
Automatic Spoken Language Translation Template Acquisition Based on Boosting Structure Extraction and Alignment

In this paper, we propose a new approach for acquiring translation templates automatically from unannotated bilingual spoken language corpora. Two basic algorithms are adopted: a grammar induction algorithm, and an alignment algorithm using Bracketing Transduction Grammar. The approach is unsupervised, statistical, data-driven, and employs no parsing procedure. The acquisition procedure consists of two steps. First, semantic groups and phrase structure groups are extracted from both the source language and the target language through a boosting procedure, in which a synonym dictionary is used to generate the seed groups of the semantic groups. Second, an alignment algorithm based on Bracketing Transduction Grammar aligns the phrase structure groups. The aligned phrase structure groups are post-processed, yielding translation templates. Preliminary experimental results show that the algorithm is effective.

Rile Hu, Xia Wang

Spoken Language Resources and Annotation

HKUST/MTS: A Very Large Scale Mandarin Telephone Speech Corpus

The paper describes the design, collection, transcription and analysis of the 200-hour HKUST Mandarin Telephone Speech Corpus (HKUST/MTS), collected from over 2100 Mandarin speakers in mainland China under the DARPA EARS framework. The corpus includes speech data, transcriptions and speaker demographic information. The speech data comprise 1206 ten-minute natural Mandarin conversations between either strangers or friends. Each conversation focuses on a single topic. All calls are recorded over public telephone networks. All calls are manually annotated with standard Chinese characters (GBK) as well as specific mark-ups for spontaneous speech. A file with speaker demographic information is also provided. The corpus is the first and largest of its kind for Mandarin conversational telephone speech, providing abundant and diversified samples for Mandarin speech recognition and other application-dependent tasks, such as topic detection, information retrieval, keyword spotting, speaker recognition, etc. In a 2004 evaluation test by NIST, the corpus was found to improve system performance quite significantly.

Yi Liu, Pascale Fung, Yongsheng Yang, Christopher Cieri, Shudong Huang, David Graff
The Paradigm for Creating Multi-lingual Text-To-Speech Voice Databases

The voice database is one of the most important parts of a TTS system. However, creating a high-quality new TTS voice is not an easy task even for a professional team. The whole process is rather complicated and contains plenty of minutiae that must be handled carefully. In fact, at many stages, human intervention such as manual checking or labeling is necessary. In multi-lingual situations, it is even more challenging to find qualified people to do this kind of work. That is why most state-of-the-art TTS systems can provide only a few voices. In this paper, we outline a uniform paradigm for creating multi-lingual TTS voice databases. It focuses on technologies that can either improve the scalability of data collection or reduce human intervention such as manual checking or labeling. With this paradigm, we decrease the complexity and workload of the task.

Min Chu, Yong Zhao, Yining Chen, Lijuan Wang, Frank Soong
Multilingual Speech Corpora for TTS System Development

In this paper, four speech corpora collected in the Speech Lab of NCTU in recent years are discussed. They include a Mandarin tree-bank speech corpus, a Min-Nan speech corpus, a Hakka speech corpus, and a Chinese-English mixed speech corpus. Currently, they are used separately to develop a corpus-based Mandarin TTS system, a Min-Nan TTS system, a Hakka TTS system, and a Chinese-English bilingual TTS system. These systems will be integrated in the future to construct a multilingual TTS system covering the four primary languages used in Taiwan.

Hsi-Chun Hsiao, Hsiu-Min Yu, Yih-Ru Wang, Sin-Horng Chen
Construct Trilingual Parallel Corpus on Demand

This paper describes the effort of constructing the Olympic Oriented Trilingual Corpus for the development of NLP applications for Beijing 2008. Designed to support real NLP applications rather than pure research, this corpus must meet multilingual, multi-domain and multi-system requirements in its construction. The key issue, however, lies in determining the proper corpus scale in relation to the time and cost allowed. To solve this problem, this paper proposes to treat better system performance within a sub-domain than over the whole corpus as the signal that the minimum corpus size has been reached. The hypothesis is that a multi-domain corpus should at least be sufficient to reveal domain features. So far a Chinese-English-Japanese trilingual corpus totaling 2.4 million words has been completed as the first-stage result, in which information on the domains, locations and topics of the language materials has been annotated in XML.

Muyun Yang, Hongfei Jiang, Tiejun Zhao, Sheng Li
The Contribution of Lexical Resources to Natural Language Processing of CJK Languages

The role of lexical resources is often understated in NLP research. The complexity of Chinese, Japanese and Korean (CJK) poses special challenges to developers of NLP tools, especially in the areas of word segmentation (WS), information retrieval (IR), named entity extraction (NER), and machine translation (MT). These difficulties are exacerbated by the lack of comprehensive lexical resources, especially for proper nouns, and the lack of a standardized orthography, especially in Japanese. This paper summarizes some of the major linguistic issues in the development of NLP applications that are dependent on lexical resources, and discusses the central role such resources should play in enhancing the accuracy of NLP tools.

Jack Halpern
Multilingual Spoken Language Corpus Development for Communication Research

A multilingual spoken language corpus is indispensable for spoken language communication research such as speech-to-speech translation. To promote multilingual spoken language research and development, unified structure and annotation, such as tagging, is indispensable for both speech and natural language processing. We describe our experience with multilingual spoken language corpus development at our research institution, focusing in particular on speech recognition and natural language processing for speech translation of travel conversations.

Toshiyuki Takezawa
Development of Multi-lingual Spoken Corpora of Indian Languages

This paper describes a recently initiated effort to collect and transcribe read as well as spontaneous speech data in four Indian languages. The completed preparatory work includes the design of phonetically rich sentences, a data acquisition setup for recording speech data over a telephone channel, and a Wizard-of-Oz setup for acquiring speech data of a caller's spoken dialogue with the machine in the context of a remote information retrieval task. An account is given of the care taken to collect speech data that is as close to real-world conditions as possible. The current status of the programme and the set of actions planned to achieve its goal are also given.

K. Samudravijaya
Backmatter
Metadata
Title: Chinese Spoken Language Processing
Edited by: Qiang Huo, Bin Ma, Eng-Siong Chng, Haizhou Li
Copyright Year: 2006
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-540-49666-3
Print ISBN: 978-3-540-49665-6
DOI: https://doi.org/10.1007/11939993