
11.01.2017

Cluster-based approach to discriminate the user’s state whether a user is embarrassed or thinking to an answer to a prompt

Authors: Yuya Chiba, Takashi Nose, Akinori Ito

Published in: Journal on Multimodal User Interfaces | Issue 2/2017

Abstract

Spoken dialog systems are employed in various devices to help users operate them. An advantage of a spoken dialog system is that the user can phrase input utterances freely, but this very freedom sometimes makes it difficult for the user to speak to the system. The system should therefore estimate the state of a user who encounters a problem when starting a dialog and give appropriate help before the user abandons the dialog. Based on this assumption, our research aims to construct a system that responds to a user who does not reply to the system's prompt. In this paper, we propose a method of discriminating the user's state based on vector quantization of non-verbal information such as prosodic features, facial feature points, and gaze. The experimental results showed that the proposed method outperforms conventional approaches, achieving a discrimination ratio of 72.0%. We then examined sequential discrimination so that the system can respond to the user at the appropriate time. The results indicate that the discrimination ratio reached the same level as that obtained at the end of the session after around 6.0 s.
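The pipeline the abstract describes (frame-level non-verbal features quantized against a codebook, then classified per dialog turn) can be pictured with a minimal sketch. This is an illustrative assumption, not the authors' exact implementation: it assumes k-means for the vector quantizer, a normalized code-word histogram per turn, and an SVM as the two-class discriminator; all names and parameter values (make_histograms, n_codes=64) are hypothetical.

```python
# Minimal sketch: vector-quantize multimodal frames (prosody, facial
# feature points, gaze stacked per frame), summarize each dialog turn
# as a codebook histogram, and discriminate the two user states
# (embarrassed vs. thinking of an answer) with an SVM.
# Hypothetical pipeline; parameter values are illustrative, not from the paper.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def make_histograms(turns, kmeans):
    """Map each turn (an n_frames x n_dims array) to a codebook histogram."""
    n_codes = kmeans.n_clusters
    hists = np.zeros((len(turns), n_codes))
    for i, frames in enumerate(turns):
        codes = kmeans.predict(frames)            # quantize each frame
        counts = np.bincount(codes, minlength=n_codes)
        hists[i] = counts / max(len(frames), 1)   # normalize by turn length
    return hists

def train(train_turns, labels, n_codes=64):
    """train_turns: list of per-turn feature matrices;
    labels: 0 = embarrassed, 1 = thinking of an answer."""
    all_frames = np.vstack(train_turns)           # pool frames to learn the codebook
    kmeans = KMeans(n_clusters=n_codes, init="k-means++", n_init=10).fit(all_frames)
    clf = SVC(kernel="rbf")
    clf.fit(make_histograms(train_turns, kmeans), labels)
    return kmeans, clf

def predict(kmeans, clf, turns):
    """Discriminate the state for each (possibly partial) dialog turn."""
    return clf.predict(make_histograms(turns, kmeans))
```

Under these assumptions, the sequential discrimination examined in the paper would amount to building the histogram only from the frames observed so far (e.g. the first 6.0 s of the turn) and classifying that partial histogram, so the system can decide when to offer help before the turn ends.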


Metadata
Title
Cluster-based approach to discriminate the user’s state whether a user is embarrassed or thinking to an answer to a prompt
Authors
Yuya Chiba
Takashi Nose
Akinori Ito
Publication date
11.01.2017
Publisher
Springer International Publishing
Published in
Journal on Multimodal User Interfaces / Issue 2/2017
Print ISSN: 1783-7677
Electronic ISSN: 1783-8738
DOI
https://doi.org/10.1007/s12193-017-0238-y
