
2021 | OriginalPaper | Book chapter

Delay Mitigation for Backchannel Prediction in Spoken Dialog System

Written by: Amalia Istiqlali Adiba, Takeshi Homma, Dario Bertero, Takashi Sumiyoshi, Kenji Nagamatsu

Published in: Conversational Dialogue Systems for the Next Decade

Publisher: Springer Singapore


Abstract

Backchannel feedback can make dialogues between spoken dialog systems and users more natural. Many related studies combine acoustic and lexical features in a single model to improve prediction. However, extracting lexical features introduces a delay caused by the automatic speech recognition (ASR) process. A system should respond without delay, since delays reduce the naturalness of the conversation and leave the user dissatisfied. In this work, we present a prior prediction model for reducing response delay in backchannel prediction. We first train acoustic-based and lexical-based backchannel prediction models independently. In the lexical-based model, prior prediction is necessary to account for the ASR delay. The prior prediction model is trained with a weighting value that gradually increases as a sequence gets closer to a suitable response timing. The backchannel probability is calculated from the outputs of both the acoustic-based and lexical-based models. Evaluation results show that, under a 2.0-s delay condition, the prior prediction model improves the F1 score of backchannel prediction by 8% over the current state-of-the-art algorithm.
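The abstract does not give the weighting function or the rule for combining the two models, so the following is only a rough sketch of the general idea under stated assumptions. The linear ramp in `ramp_weights`, the `floor` value, the normalized weighted cross-entropy in `weighted_bce`, and the weighted average in `fuse` (with its `alpha` parameter) are all hypothetical choices made for illustration, not the authors' formulation.

```python
import math


def ramp_weights(num_steps, floor=0.1):
    """Hypothetical linear ramp: the weight grows from `floor` at the first
    step to 1.0 at the last step, i.e. the step closest to the response timing."""
    if num_steps == 1:
        return [1.0]
    return [floor + (1.0 - floor) * t / (num_steps - 1) for t in range(num_steps)]


def weighted_bce(probs, label, weights, eps=1e-7):
    """Weighted binary cross-entropy over one sequence of per-step predictions.

    `probs` are predicted backchannel probabilities, `label` is 0 or 1 for the
    whole sequence. Dividing by the weight sum is an assumption of this sketch.
    """
    total = 0.0
    for p, w in zip(probs, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -w * (label * math.log(p) + (1 - label) * math.log(1.0 - p))
    return total / sum(weights)


def fuse(p_acoustic, p_lexical, alpha=0.5):
    """Assumed late fusion: a weighted average of the two model outputs."""
    return alpha * p_acoustic + (1.0 - alpha) * p_lexical


weights = ramp_weights(3)
# A confident mistake at the last step is penalized more than the same mistake early on.
late_error = weighted_bce([0.9, 0.9, 0.1], 1, weights)
early_error = weighted_bce([0.1, 0.9, 0.9], 1, weights)
print(late_error > early_error)
```

With a ramp like this, errors made far ahead of the response timing are penalized lightly, while errors near the response timing dominate the loss. That matches the stated intent of a weight that increases toward a suitable response timing, whatever exact function the chapter uses.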


Footnotes
3
Each metric (precision, recall, or F1) is calculated for both positive and negative classes. Each macro-averaged metric is calculated by averaging the corresponding metrics for the positive class and the negative class.
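The macro-averaging described in this footnote can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names `class_metrics` and `macro_metrics` are hypothetical, labels are assumed to be 1 (backchannel) and 0 (no backchannel), and a metric with a zero denominator is set to 0.0 by assumption.

```python
def class_metrics(y_true, y_pred, target):
    """Precision, recall, and F1 when `target` is treated as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == target and p == target)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != target and p == target)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == target and p != target)
    # Assumption: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_metrics(y_true, y_pred):
    """Average each metric over the positive class (1) and the negative class (0)."""
    positive = class_metrics(y_true, y_pred, 1)
    negative = class_metrics(y_true, y_pred, 0)
    return tuple((a + b) / 2 for a, b in zip(positive, negative))


y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
print(macro_metrics(y_true, y_pred))  # (0.625, 0.625, 0.625)
```

In this toy example the positive class scores 0.5 on all three metrics and the negative class scores 0.75, so each macro-averaged metric is 0.625. Because both classes count equally, the frequent negative class cannot dominate the score.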
 
Metadata
Title
Delay Mitigation for Backchannel Prediction in Spoken Dialog System
Written by
Amalia Istiqlali Adiba
Takeshi Homma
Dario Bertero
Takashi Sumiyoshi
Kenji Nagamatsu
Copyright year
2021
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-15-8395-7_10
