About this book

This book constitutes the refereed post-conference proceedings of the 11th International Seminar on Speech Production, ISSP 2017, held in Tianjin, China, in October 2017.
The 20 revised full papers included in this volume were carefully reviewed and selected from 68 submissions. They cover a wide range of speech science fields including phonology, phonetics, prosody, mechanics, acoustics, physiology, motor control, neuroscience, computer science and human interaction. The papers are organized in the following topical sections: emotional speech analysis and recognition; articulatory speech synthesis; speech acquisition; phonetics; speech planning and comprehension, and speech disorder.

Table of contents

Frontmatter

Emotional Speech Analysis and Recognition

Frontmatter

Personality Judgments Based on Speaker’s Social Affective Expressions

This paper describes some of the acoustic characteristics that influence people’s judgments about others. The database used was the multilingual corpus recorded with speakers in communicative dialogue contexts (e.g., Rilliard et al. 2013). The acoustic measurements were F0, intensity, HNR, H1-H2, and formant frequencies (F1, F2, and F3). The personality assessment was based on that proposed by Costa and McCrae (1992). A Multiple Factor Analysis (MFA) related the acoustic measures, the performance scores for each attitude, and the number of high, high-medium, low-medium and low ratings in the 5 personality traits, for audio-only and for audio-visual modalities. The results show that the most expressive speakers, those who produced the widest range in acoustic changes, were perceived as more EXTROVERTED and CONSCIENTIOUS. Speakers with high noise levels in the voice were judged with low AGREEABLENESS, and produced the best expressions involving an imposition on the interlocutor. Speakers judged as having high NEUROTICISM and low OPENNESS were perceived as the best performers for expressions with strong social constraints.

Donna Erickson, Albert Rilliard, João de Moraes, Takaaki Shochi

Speech Emotion Recognition Considering Local Dynamic Features

Recently, increasing attention has been directed to the study of speech emotion recognition, in which global acoustic features of an utterance are mostly used to eliminate content differences. However, the expression of speech emotion is a dynamic process, which is reflected through dynamic durations, energies, and other prosodic information when one speaks. In this paper, a novel local dynamic pitch probability distribution feature, obtained from a histogram of local pitch values, is proposed to improve the accuracy of speech emotion recognition. Compared with most previous works using global features, the proposed method takes advantage of the local dynamic information conveyed by the emotional speech. Several experiments on the Berlin Database of Emotional Speech are conducted to verify the effectiveness of the proposed method. The experimental results demonstrate that the local dynamic information obtained with the proposed method is more effective for speech emotion recognition than the traditional global features.

Haotian Guan, Zhilei Liu, Longbiao Wang, Jianwu Dang, Ruiguo Yu
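
As an illustration of the general idea behind such a histogram-based pitch feature (not the authors’ exact implementation), the sketch below assumes a precomputed frame-level F0 track and turns the voiced pitch values of an utterance into a normalized histogram that can serve as a fixed-length local dynamic feature vector; the bin count and F0 range are illustrative assumptions.

```python
import numpy as np

def pitch_histogram_feature(f0_track, n_bins=24, f0_range=(60.0, 400.0)):
    """Turn a frame-level F0 track (Hz, 0 for unvoiced frames) into a
    normalized pitch-probability-distribution feature vector.
    Minimal sketch; bin count and range are illustrative, not from the paper."""
    f0 = np.asarray(f0_track, dtype=float)
    voiced = f0[f0 > 0]                      # keep voiced frames only
    if voiced.size == 0:
        return np.zeros(n_bins)
    hist, _ = np.histogram(voiced, bins=n_bins, range=f0_range)
    return hist / hist.sum()                 # normalize to a probability distribution

# Example: a toy F0 track with unvoiced gaps (zeros)
toy_f0 = [0, 0, 110, 115, 120, 0, 0, 180, 185, 190, 0]
print(pitch_histogram_feature(toy_f0, n_bins=8))
```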

Commonalities of Glottal Sources and Vocal Tract Shapes Among Speakers in Emotional Speech

This paper explores the commonalities of the glottal source waves and vocal tract shapes among four speakers in emotional speech (vowel: /a/, neutral, joy, anger, and sadness) based on a source-filter model with the proposed precise estimation scheme. The results are as follows. When compared with the spectral tilts of glottal source waves of neutral, (1) those of anger and joy increased, and those of sadness decreased in the 200- to 700-Hz frequency range; (2) those of anger increased, but those of joy decreased, and those of sadness were the same as those of neutral in the 700- to 2000-Hz range; and (3) all spectral tilts had the same tendency over 2000 Hz. For front vocal tract shapes, the area function of anger was the largest, that of sadness was the smallest, and those of joy and neutral were in the middle.

Yongwei Li, Ken-Ichi Sakakibara, Daisuke Morikawa, Masato Akagi
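
Spectral tilt within a frequency band can be quantified, for example, as the slope of a straight line fitted to the log-magnitude spectrum over that band. The sketch below is a generic illustration of that measurement with numpy for a band such as 200–700 Hz; it does not reproduce the paper’s glottal-source and vocal-tract estimation scheme.

```python
import numpy as np

def band_spectral_tilt(signal, fs, f_lo=200.0, f_hi=700.0):
    """Estimate spectral tilt (dB per kHz) within [f_lo, f_hi] by fitting a
    line to the log-magnitude spectrum in that band (generic sketch)."""
    spec = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    mag_db = 20.0 * np.log10(spec[band] + 1e-12)
    slope, _ = np.polyfit(freqs[band] / 1000.0, mag_db, 1)  # dB per kHz
    return slope

# Example: tilt of a decaying harmonic series sampled at 16 kHz
fs = 16000
t = np.arange(0, 0.05, 1.0 / fs)
sig = sum((1.0 / k) * np.sin(2 * np.pi * 100 * k * t) for k in range(1, 20))
print(band_spectral_tilt(sig, fs, 200.0, 700.0))
```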

Articulatory Speech Synthesis

Frontmatter

Articulatory Speech Synthesis from Static Context-Aware Articulatory Targets

The aim of this work is to develop an algorithm for controlling the articulators (the jaw, the tongue, the lips, the velum, the larynx and the epiglottis) to produce given speech sounds, syllables and phrases. This control has to take into account coarticulation and be flexible enough to be able to vary strategies for speech production. The data for the algorithm are 97 static MRI images capturing the articulation of French vowels and blocked consonant-vowel syllables. The results of this synthesis are evaluated visually, acoustically and perceptually, and the problems encountered are broken down by their origin: the dataset, its modeling, the algorithm for managing the vocal tract shapes, their translation to the area functions, and the acoustic simulation. We conclude that, among our test examples, the articulatory strategies for vowels and stops are most correct, followed by those of nasals and fricatives. Improving timing strategies with dynamic data is suggested as an avenue for future work.

Anastasiia Tsukanova, Benjamin Elie, Yves Laprie
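
A very simplified picture of driving articulators from static targets is to interpolate each articulatory parameter between successive context-dependent target vectors over time. The sketch below is only a generic illustration of such interpolation, with hypothetical parameter names; it is not the authors’ control algorithm and ignores timing strategies and coarticulatory overlap.

```python
import numpy as np

# Hypothetical articulatory parameters (names are illustrative only)
PARAMS = ["jaw_opening", "tongue_body_x", "tongue_body_y", "lip_protrusion", "velum"]

def interpolate_targets(targets, frames_per_transition=20):
    """Linearly interpolate between successive static articulatory target
    vectors to obtain a frame-by-frame articulatory trajectory.
    targets: list of dicts mapping parameter name -> target value."""
    points = np.array([[t[k] for k in PARAMS] for t in targets])
    segments = []
    for a, b in zip(points[:-1], points[1:]):
        w = np.linspace(0.0, 1.0, frames_per_transition)[:, None]
        segments.append((1.0 - w) * a + w * b)    # straight-line transition a -> b
    return np.vstack(segments)

# Example: a vowel target followed by a context-adjusted consonant target
vowel = dict(zip(PARAMS, [0.8, 0.1, -0.2, 0.0, 0.0]))
consonant = dict(zip(PARAMS, [0.2, 0.4, 0.5, 0.1, 0.0]))
trajectory = interpolate_targets([vowel, consonant, vowel])
print(trajectory.shape)   # (frames, parameters)
```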

Particle Interaction Adaptivity and Absorbing Boundary Conditions in the Lagrangian Particle Aeroacoustic Model

The recently developed Lagrangian particle aeroacoustic model has shown its capability for simulating acoustic wave propagation in flowing fluids. It also has high potential for solving transient acoustics in domains with moving boundaries; a typical application is sound wave propagation in continuous speech production. When fluid flow or moving boundaries are taken into account, initially evenly distributed particles become irregularly spaced. For irregular particle distributions, the smoothed particle hydrodynamics (SPH) method with a constant smoothing length suffers from low accuracy, phase error and instability. To tackle these problems, SPH with particle interaction adaptivity may be more efficient, analogous to mesh-based methods with adaptive grids. When the wave reaches an open boundary, absorbing boundary conditions also have to be applied. Therefore, the main task of this work is to incorporate a variable smoothing length and absorbing boundary conditions into the Lagrangian particle aeroacoustic model. The extended model is successfully validated against three typical one- and two-dimensional sound wave propagation problems.

Futang Wang, Qingzhi Hou, Jie Deng, Song Wang, Jianwu Dang
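
In SPH, field values are reconstructed as kernel-weighted sums over neighboring particles; with particle interaction adaptivity the smoothing length varies per particle, and a common convention (assumed here) is to symmetrize it as the average of the two particles’ lengths. The 1D sketch below illustrates such a kernel summation on irregularly spaced particles; it is not the extended aeroacoustic model from the paper.

```python
import numpy as np

def cubic_spline_kernel(r, h):
    """Standard 1D cubic spline SPH kernel W(r, h)."""
    q = np.abs(r) / h
    sigma = 2.0 / (3.0 * h)                            # 1D normalization constant
    return sigma * np.where(q <= 1.0, 1.0 - 1.5 * q**2 + 0.75 * q**3,
                   np.where(q <= 2.0, 0.25 * (2.0 - q)**3, 0.0))

def sph_field_at_particles(x, f, m, rho, h):
    """Reconstruct field f at every particle position via SPH summation,
    with per-particle smoothing lengths h and the symmetrized pairwise
    length h_ij = (h_i + h_j) / 2 (an assumed, common convention)."""
    f_hat = np.zeros(len(x))
    for i in range(len(x)):
        h_ij = 0.5 * (h[i] + h)                        # pairwise smoothing lengths
        w = cubic_spline_kernel(x[i] - x, h_ij)
        f_hat[i] = np.sum(m / rho * f * w)
    return f_hat

# Example: irregularly spaced particles sampling a sine wave
x = np.sort(np.random.default_rng(0).uniform(0.0, 2 * np.pi, 200))
f = np.sin(x)
m = np.full_like(x, (2 * np.pi) / len(x))              # unit density -> mass = mean spacing
rho = np.ones_like(x)
h = 2.0 * np.gradient(x)                               # smoothing length tied to local spacing
print(np.max(np.abs(sph_field_at_particles(x, f, m, rho, h) - f)))
```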

Prediction of F0 Based on Articulatory Features Using DNN

In this paper, articulatory-to-F0 prediction consists of two parts: articulatory-to-voiced/unvoiced flag classification and articulatory-to-F0 mapping for voiced frames. The paper explores several types of articulatory features to determine the most suitable one for F0 prediction using deep neural networks (DNNs) and long short-term memory (LSTM) networks. Furthermore, whereas the conventional method for articulatory-to-F0 mapping trains the model on interpolated F0 values, here only the F0 values at voiced frames are used for training. Experimental results on the test set of the MNGU0 database show that: (1) the velocity and acceleration of articulatory movements are quite effective for articulatory-to-F0 prediction; (2) acoustic features estimated from articulatory features with neural networks perform slightly better for F0 prediction than their fusion with the articulatory features; (3) LSTM models achieve better articulatory-to-F0 prediction than DNNs; (4) the voiced-only training method outperforms the conventional method.

Cenxi Zhao, Longbiao Wang, Jianwu Dang, Ruiguo Yu
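
One way to train an articulatory-to-F0 model only on voiced frames, as the paper proposes, is to mask unvoiced frames out of the regression loss. The PyTorch sketch below shows an LSTM regressor with such a masked loss; the architecture, layer sizes and feature dimensions are illustrative assumptions, not the paper’s configuration.

```python
import torch
import torch.nn as nn

class ArtToF0LSTM(nn.Module):
    """LSTM mapping articulatory feature sequences to frame-level F0.
    Layer sizes are illustrative, not taken from the paper."""
    def __init__(self, art_dim=36, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(art_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                       # x: (batch, frames, art_dim)
        h, _ = self.lstm(x)
        return self.out(h).squeeze(-1)          # (batch, frames)

def voiced_only_mse(pred_f0, target_f0, voiced_mask):
    """MSE computed only over voiced frames (voiced_mask: 1 = voiced, 0 = unvoiced)."""
    mask = voiced_mask.float()
    return torch.sum(mask * (pred_f0 - target_f0) ** 2) / mask.sum().clamp(min=1.0)

# Toy usage with random tensors standing in for real articulatory/F0 data
model = ArtToF0LSTM()
x = torch.randn(4, 100, 36)                     # 4 utterances, 100 frames, 36 articulatory dims
f0 = torch.rand(4, 100) * 200 + 80
voiced = (torch.rand(4, 100) > 0.3).float()
loss = voiced_only_mse(model(x), f0, voiced)
loss.backward()
print(loss.item())
```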

A Hybrid Method for Acoustic Analysis of the Vocal Tract During Vowel Production

A hybrid method for vocal-tract acoustic simulation is proposed to handle the complex and moving geometries during speech production by combining the finite-difference time-domain (FDTD) method and the immersed boundary method (IBM). In this method, two distinct discrete point systems are employed for discretization. The fluid field is discretized by regular Eulerian grid points, while the wall boundary is represented by a series of Lagrangian points. A direct body force is calculated on the Lagrangian points and then interpolated to the neighboring Eulerian points. To validate the proposed hybrid method, a 2D vocal tract model was set up by extracting the area function from MRI data obtained for the Mandarin vowel /a/. By simulating acoustic wave propagation in this model, the synthesized vowel was analyzed and the obtained formant frequencies were compared to those of real speech sounds. The mean absolute error of the formant frequencies was 8.17%, better than results reported in the literature. To show the ability of the hybrid method to solve acoustic problems with moving geometry, a pseudo moving boundary problem was designed, and the results agree well with acoustic theory.

Futang Wang, Qingzhi Hou, Dingyi Pan, Jianguo Wei, Jianwu Dang
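
The core of an FDTD acoustic simulation is a leapfrog update of pressure and particle velocity on staggered grids. The 1D sketch below shows that basic scheme only; the paper’s contribution, the immersed boundary treatment with Lagrangian forcing points, is not reproduced here, and all grid parameters are illustrative.

```python
import numpy as np

# 1D FDTD for the linear acoustic equations (basic scheme only, no IBM)
c, rho = 343.0, 1.2                 # speed of sound (m/s), air density (kg/m^3)
nx, dx = 400, 1e-3                  # number of grid points and spacing (m)
dt = 0.5 * dx / c                   # time step satisfying the CFL condition

p = np.zeros(nx)                    # pressure at integer grid points
u = np.zeros(nx + 1)                # velocity at staggered (half) points
p[nx // 2] = 1.0                    # initial pressure pulse in the middle

for _ in range(300):
    # update velocity from the pressure gradient
    u[1:-1] -= dt / (rho * dx) * (p[1:] - p[:-1])
    # update pressure from the velocity divergence
    p -= dt * rho * c**2 / dx * (u[1:] - u[:-1])

print(p.max(), p.min())             # the pulse has split and propagated outward
```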

Considering Lip Geometry in One-Dimensional Tube Models of the Vocal Tract

One-dimensional tube models are an effective representation of the vocal tract for acoustic simulations. However, the conversion of a 3D vocal tract shape into such a 1D tube model raises the question of how to account for the lips, because between the corners of the mouth and the most anterior points of the lips, the cross sections of the vocal tract are open at the sides and hence not well-defined. Here it was examined to what extent simplified tube models of the vocal tract with notches as representations of the lips are acoustically similar to corresponding unnotched models with reduced lengths at the lip end, both with and without teeth. To this end, 3D-printed models of /a, ae, e/ and schwa with different notches and reduced lengths were created. For these, the formant frequencies were measured and analyzed. The results indicate that notched resonators are acoustically most similar to their unnotched counterparts when the length of the unnotched tubes is anteriorly reduced by 50% of the notch depth. However, depending on the formant, vowel, and notch depth, the optimal length reduction can vary between 20% and 90%.

Peter Birkholz, Elisabeth Venus
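
For a rough intuition of why shortening the tube at the lip end shifts the formants, the quarter-wavelength resonances of a uniform tube closed at the glottis and open at the lips are F_n = (2n − 1)·c / (4L). The sketch below uses this textbook formula to compare length reductions of 20%, 50% and 90% of a hypothetical notch depth; it is an idealization for intuition only, not the paper’s 3D-printed measurement setup, and the tube length and notch depth are assumed values.

```python
# Quarter-wavelength resonances of a uniform closed-open tube: F_n = (2n - 1) * c / (4 * L)
C = 35000.0          # speed of sound in warm, humid air (cm/s), approximate
L_FULL = 17.5        # unreduced tube length (cm), illustrative
NOTCH_DEPTH = 1.0    # hypothetical notch depth at the lip end (cm)

def resonances(length_cm, n_formants=3):
    return [(2 * n - 1) * C / (4.0 * length_cm) for n in range(1, n_formants + 1)]

for reduction in (0.2, 0.5, 0.9):                 # fraction of the notch depth removed
    length = L_FULL - reduction * NOTCH_DEPTH
    formants = resonances(length)
    print(f"reduction {reduction:.0%}: L = {length:.2f} cm, "
          f"F1-F3 = {', '.join(f'{x:.0f}' for x in formants)} Hz")
```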

Speech Acquisition

Frontmatter

Production of Neutral Tone on Disyllabic Words by Two-Year-Old Mandarin-Speaking Children

This study examined the production of the neutral tone in disyllabic words by two-year-old Mandarin-speaking children. The results showed that children were fully aware of the neutral tone sandhi rule phonologically at the age of two. However, they could not yet produce the neutral tone well phonetically. In particular, children’s off-standard productions had a higher pitch register, wider pitch range and longer duration, while their correct productions had the right pitch pattern but a slightly larger duration ratio between the initial and final syllables than in adults’ speech. The difficulty of neutral tone production is closely related to the type of the preceding tone and to the coordination of articulation in disyllabic neutral-tone words.

Jun Gao, Aijun Li

English Lexical Stress Production by Native Speakers of Tibetan and Uyghur

This study examines the production of English lexical stress by native speakers of Tibetan and Uyghur, and the factors that may affect stress assignment. Thirty subjects in their twenties participated, with 10 native speakers (gender balanced) in each group: native speakers of Uyghur (NSUs), native speakers of Tibetan (NSTs) and native speakers of American English (NSAs). A total of 4,000 tokens were collected, judged and analyzed. Results indicate that: (1) Consistent with the prediction of the Stress Typology Model, less negative transfer is observed in NSTs than in NSUs in stress production. (2) Compared with NSAs, NSUs and NSTs employ different acoustic features when assigning stress. (3) Stress position affects the accuracy of stress production by NSUs, as well as the acoustic features of NSTs and NSUs when assigning stress; a speech-final lengthening effect is observed. (4) Syllable structure has little effect on the accuracy of stress production.

Dan Hu, Hui Feng, Yingjie Zhao, Jie Lian

Phonetics

Frontmatter

C-V and V-C Co-articulation in Cantonese

The present study investigates the co-articulation strength (CS) between the initial/final consonants and the neighboring vowels in Cantonese CV, VC, and CVC monosyllables, where C = [p t k] and V = [i a u]. An EMA AG500 was used to record the articulatory actions of the tongue and the lips during the test syllables. The findings, based on the articulatory data collected from two male Cantonese speakers, are as follows. First, CS is strong (i) between the initial [p-] and the following [i] or [u], but not [a], and (ii) between the final [-p] and a preceding vowel of any type. Second, CS is weak (i) between the initial [t-] and the following vowel of any type and (ii) between the final [-t] and the preceding [u], but not [i] or [a]. Third, CS is weak between the initial [k-] and the following [a], but not [i] or [u], yet strong between the final [-k] and the preceding [a]. In general, (i) the order of decreasing CS for both C-V and V-C co-articulation is C = [p] > C = [k] > C = [t], and (ii) the degree of CS is higher in the VC than in the CV context. The findings support the phonological structuring of the syllable, wherein the final consonant (or syllable coda), but not the initial consonant (or syllable onset), forms the phonological unit of the rhyme with the preceding vowel (or syllable nucleus).

Wai-Sum Lee

Speech Style Effects on Local and Non-local Coarticulation in French

Interactions between speech style and coarticulation are investigated by examining local, non-local, anticipatory and carryover contextual effects on vowels in two French corpora of conversational and journalistic speech. C-to-V coarticulation is analyzed on 22k tokens of /i, E, a, u, ɔ/ (/E/ = /e, ɛ/) that are 50 to 80 ms long. Contextual effects are measured as F2 changes in relation to the adjacent consonant (alveolar vs. uvular) in CV1 and V1C sequences. V-to-V coarticulation is analyzed on 33k V1C(C)V2 sequences with V1 = /e, ɛ, o, ɔ, a/ falling within the same range of duration, and V2 either high/mid-high or low/mid-low. Contextual effects are measured as F1 changes as a function of V2 height. Results show more local C-to-V coarticulation in conversational than in journalistic speech, as previously found for other languages. Interestingly, this interaction is clearer for all vowels in V1C, whereas coarticulation in CV1 is affected by style for non-high vowels only. V-to-V coarticulation is also found in both corpora but is modulated by style only for mid-front vowels and in the opposite direction (i.e. more overlap in journalistic than in conversational speech). Findings are interpreted in light of dynamic models of speech production and of a phonological account of French V-to-V harmony.

Giuseppina Turco, Fanny Guitard-Ivent, Cécile Fougeron

Word-Initial Irregular Phonation as a Function of Speech Rate and Vowel Quality in Hungarian

We examined vowel-initial irregular phonation in real words as a function of vowel quality (backness and height) and speech rate in Hungarian. We analyzed two types of irregular phonation: glottalization and glottal stops. We found that open vowels elicited more irregular phonation than mid and close ones, but we found no effect of backness. The frequency of irregular phonation was lower in fast than in slow speech. Contrary to the claims of earlier studies, the relative frequency of glottalization to glottal stops was not influenced by speech rate in general. However, while /i/ was produced with a relatively higher ratio of glottal stops in fast speech, the open vowels showed the widely documented tendency of being realized with relatively fewer glottal stops under the same conditions.

Alexandra Markó, Andrea Deme, Márton Bartók, Tekla Etelka Gráczi, Tamás Gábor Csapó

Effects of Entering Tone on Vowel Duration and Formants in Nanjing Dialect

There are five tones in Nanjing Dialect, including four open-syllable tones and one entering tone with a syllable-final glottal stop, which is not found in Beijing Mandarin. Each checked syllable normally corresponds to an open syllable with the same vowel but a different tone. This study focuses on acoustic comparisons of the two kinds of syllables to examine the acoustic discrepancies between checked and open syllables in Nanjing Dialect, specifically with regard to vowel quality. The acoustic parameters include the duration, the first formant (F1), the second formant (F2), and the third formant (F3) of the vowels. The results for vowel duration indicate that the entering tone is still the shortest among the five tones. The relatively higher F1, F2 and F3 in the entering tone suggest an effect of the glottal stop coda on vowel quality.

Yongkai Yang, Ying Chen

Which Factors Can Explain Individual Outcome Differences When Learning a New Articulatory-to-Acoustic Mapping?

Speech motor learning is characterized by inter-speaker outcome differences where some speakers fail to compensate for articulatory and/or auditory perturbations. Hypotheses put forward to explain these differences entertain the idea that speakers employ auditory and sensorimotor feedback differently depending on their predispositions or different acuity traits. A related idea implies that individual speakers’ traits may further interact with the amount of auditory and somatosensory feedback involved in the production of a specific speech sound, e.g. with the degree of the tongue-palate contact. To investigate these hypotheses, we performed two experiments with an identical group of Russian native speakers where we perturbed vowel and fricative spectra employing identical experimental designs. In both experiments we observe compensatory efforts for all participants. However, among our participants we find neither compelling evidence for individual feedback preferences nor for consistent speaker-internal patterns of the learning outcomes in the context of vowels and fricatives. We suggest that a more plausible explanation for our results is provided by the idea that fricatives and vowels exhibit different degrees of complexity of the articulatory-to-acoustic mapping.

Eugen Klein, Jana Brunner, Phil Hoole

Speech Planning and Comprehension

Frontmatter

Investigation of Speech-Planning Mechanism Based on Eye Movement

Eye movements can reflect the brain activities involved in word recognition and speech planning during reading and spontaneous speech comprehension. Most previous studies used isolated words alone to investigate the latency of speech planning. However, this makes it difficult to explore the mechanism of speech planning in realistic situations. In this paper, we used continuous speech to investigate the mechanism of speech planning by matching eye movements with the utterances produced while reading Chinese sentences. The planning units of the read speech were estimated using the fixations of the eye movements, and the planning latency was measured for each unit in the read sentences. We found that the majority of planning units were single grammatical words, while most four-syllable words were planned as two disyllabic units. The latency of the planning units decreases gradually along the time axis of a sentence. When a sentence consists of two sub-sentences, the declining tendency of the latency is reset at the boundary between the two sub-sentences. When the meaning of the sentence is removed by randomizing the word order, however, this declining tendency disappears. These phenomena suggest that later words in a sentence require less time for semantic comprehension because the preceding words provide semantic information for them. We conclude that semantic comprehension is an important part of speech planning, alongside the design of motor commands, even though comprehension is not an explicit task in uttering.

Jinfeng Huang, Di Zhou, Jianwu Dang

Global Monitoring of Dynamic Functional Interactions in the Brain During Chinese Verbs Perception

Previous studies suggested that during speech perception and processing, auditory analyses clearly take place bilaterally in the auditory cortices of the temporal lobes, semantic processing is supported by a temporo-frontal network that is strongly lateralized to the left hemisphere, and a prosodic processing network is mainly located in the right hemisphere. However, some studies have proposed that linguistic abilities such as phonology and semantics recruit regions in both hemispheres. To understand the neural mechanism underlying speech perception and processing, it is important to uncover the dynamic functional interactions in the brain. Our aim is to investigate whether a prevalent human brain network exists during the perception of Chinese verbs and how the effective connectivity changes in the spatial and temporal domains. An auditory listening experiment was carried out to monitor brain activations over the full time scale by recording electroencephalography (EEG) signals while native subjects perceived Mandarin verbs. By performing Granger causality analysis and statistical analysis, six connection patterns in different time periods were constructed under global monitoring of dynamic brain activities. The results showed different connections and inter-regional information flows in the six time intervals. This study indicates that the two hemispheres are not only both involved in speech perception and processing, but also exchange information with each other. These results provide a basis for more detailed bilateral brain-network analyses with full-time monitoring in future studies.

Yuke Si, Jianwu Dang, Gaoyan Zhang, Longbiao Wang
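
Pairwise Granger causality between two EEG channel (or source) time series can be tested, for example, with statsmodels; the sketch below shows this on synthetic data in which one series drives the other with a two-sample lag. It illustrates the statistical test only, not the paper’s full connectivity and statistical analysis pipeline.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# Synthetic example: y depends on x with a lag of 2 samples, plus noise
rng = np.random.default_rng(0)
n = 1000
x = rng.standard_normal(n)
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.6 * x[t - 2] + 0.2 * y[t - 1] + 0.1 * rng.standard_normal()

# grangercausalitytests checks whether the SECOND column helps predict the FIRST
data = np.column_stack([y, x])
results = grangercausalitytests(data, maxlag=4, verbose=False)

for lag, (tests, _) in results.items():
    f_stat, p_value, _, _ = tests["ssr_ftest"]
    print(f"lag {lag}: F = {f_stat:.1f}, p = {p_value:.3g}")
```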

Interactions Between Modal and Amodal Semantic Areas in Spoken Word Comprehension

In neurolinguistics, the controversy about whether word semantics are stored in an amodal language-specific center or distributed across modality-specific sensory-motor systems stems from two inconsistent pieces of evidence: (i) Semantic Dementia (SD) patients with focal brain damage in the anterior temporal lobe (ATL) exhibit a general loss of conceptual knowledge across all word categories; (ii) fMRI examinations of semantic memory found no clues in the ATL but broad activation in the sensory and motor regions (SMR) that represent the visual and motor features of words. To settle this dispute, the current study examines the whole-range brain dynamics during word processing using (i) 2-D ERP-image analysis, (ii) independent component clustering and (iii) EEG source reconstruction methods. It was found that both the ATL and the SMR participated in spoken word processing through recurrent interaction, and that the visual and motor cortices exhibited specific activation patterns for nouns and verbs, respectively. These results suggest a hierarchical organization of word semantics that combines the amodal ATL and the modal SMR to form a complete concept.

Bin Zhao, Gaoyan Zhang, Jianwu Dang

Speech Disorder

Frontmatter

Acoustic Analysis of Mandarin Speech in Parkinson’s Disease with the Effects of Levodopa

This study investigated prosodic and articulatory characteristics of parkinsonian speech by acoustic analysis of read speech from 21 Mandarin-speaking Parkinson’s Disease (PD) patients before and after administration of levodopa medication, and from 21 age- and gender-matched healthy controls (HC). Compared with HC, PD patients exhibited reduced F0 variability and increased minimum intensity during stop closures. Female PD patients also showed a smaller vowel space area and vowel articulation index than HC. Administration of levodopa increased the mean, maximum, and range of F0, bringing them closer to those of HC, but had little effect on other acoustic parameters. Correlation analysis between the acoustic parameters and the physiological/pathological indices of PD showed that the only significant positive correlation was between pause ratio and UPDRS III score. The findings on acoustic differences between PD and HC can potentially be applied to diagnosis and speech therapy for PD.

Wentao Gu, Ping Fan, Weiguo Liu
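
Vowel space area is commonly computed as the area of the polygon spanned by the mean (F1, F2) of the corner vowels, e.g. with the shoelace formula. The sketch below shows that computation on illustrative formant values; it is a generic measure, and the vowel set and formant values are assumptions, not the paper’s data.

```python
import numpy as np

def polygon_area(points):
    """Shoelace formula for the area of a polygon given as (x, y) vertices in order."""
    x, y = np.asarray(points, dtype=float).T
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Illustrative mean (F1, F2) values in Hz for corner vowels /a, i, u/
corner_vowels = {"a": (850, 1300), "i": (300, 2300), "u": (320, 750)}
vsa = polygon_area(list(corner_vowels.values()))
print(f"Vowel space area: {vsa:.0f} Hz^2")
```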

Dynamic Acoustic Evidence of Nasalization as a Compensatory Mechanism for Voicing in Spanish Apraxic Speech

This paper is concerned with the phonetic realization of the voicing contrast by two Spanish speakers with surgery-related apraxia of speech and two matched control speakers. Specifically, it examines whether speakers with AOS, widely reported to have a deficit in laryngeal control, use nasal leak as a compensatory mechanism aimed at facilitating the initiation of voicing in word-initial stops. The results show that the two apraxic speakers produced prevoicing in /b d g/ in only one third of the cases (correctly identified as ‘voiced’). In these cases, however, they exhibited significantly longer prevoicing than control subjects, and this longer voiced portion was closely related to a longer nasal murmur. These results shed light on the compensation strategies used by apraxic subjects to achieve voicing. Differences in the intensity patterns of nasal and voiced stops indicate that apraxic speakers control the timing of velopharyngeal gesture, suggesting that apraxia is a selective impairment.

Anna K. Marczyk, Yohann Meynadier, Yulia Gaydina, Maria-Josep Solé

Backmatter
