Abstract
Humans are known to use a wide range of non-verbal behaviour while speaking. Generating naturalistic embodied speech for an artificial agent is therefore an application where techniques that draw directly on recorded human motions can be helpful. We present a system that uses corpus-based selection strategies to specify the head and eyebrow motion of an animated talking head. We first describe how a domain-specific corpus of facial displays was recorded and annotated, and outline the regularities that were found in the data. We then present two different methods of selecting motions for the talking head based on the corpus data: one that chooses the majority option in all cases, and one that makes a weighted choice among all of the options. We compare these methods to each other in two ways: through cross-validation against the corpus, and by asking human judges to rate the output. The results of the two evaluation studies differ: the cross-validation study favoured the majority strategy, while the human judges preferred schedules generated using weighted choice. The judges in the second study also showed a preference for the original corpus data over the output of either of the generation strategies.
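The two selection strategies contrasted in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the context key, the display labels, and the corpus counts are all hypothetical, standing in for the annotated frequencies of head and eyebrow displays in the recorded corpus.

```python
import random
from collections import Counter

def majority_choice(counts: Counter) -> str:
    """Majority strategy: always pick the display that occurred
    most often in this context in the corpus."""
    return counts.most_common(1)[0][0]

def weighted_choice(counts: Counter, rng: random.Random) -> str:
    """Weighted strategy: sample a display with probability
    proportional to its corpus frequency, so rarer options
    are also produced some of the time."""
    options, weights = zip(*counts.items())
    return rng.choices(options, weights=weights, k=1)[0]

# Hypothetical counts for one linguistic context: how often each
# head/eyebrow display co-occurred with that context in the data.
counts = Counter({"nod": 12, "brow-raise": 5, "none": 3})

print(majority_choice(counts))       # deterministic: always "nod"
rng = random.Random(0)
print(weighted_choice(counts, rng))  # varies across samples
```

The majority strategy is deterministic and maximally faithful to the corpus mode (which helps it under cross-validation), while weighted choice reproduces the corpus distribution and so yields more varied output, which is one plausible reading of why human judges preferred it.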
Notes
No sentence in the script had more than two clauses.
We did not select any motions on words for which the speech-synthesiser output was very short, such as "but" and "is", because the synthesiser could not make those words long enough to make any motion sensible.
A baseline system that never proposes any motion scores 0.79 on this measure.
The corpus schedules were modified to remove motions on short words such as "but" and "is", for the reasons discussed in Sect. 4.
References
Artstein, R., & Poesio, M. (2005). Kappa3 = alpha (or beta). Technical Report CSM-437, University of Essex Department of Computer Science.
Bangalore, S., Rambow, O., & Whittaker, S. (2000). Evaluation metrics for generation. In Proceedings of INLG 2000.
Belz, A., Gatt, A., Reiter, E., & Viethen, J. (2007). First NLG shared task and evaluation challenge on attribute selection for referring expression generation. http://www.csd.abdn.ac.uk/research/evaluation/
Belz, A., & Reiter, E. (2006). Comparing automatic and human evaluation of NLG systems. In Proceedings of EACL 2006 (pp. 313–320).
Belz, A., & Varges, S. (Eds.) (2005). Corpus linguistics 2005 workshop on using corpora for natural language generation.
Cassell, J., Bickmore, T., Vilhjálmsson, H., & Yan, H. (2001a). More than just a pretty face: Conversational protocols and the affordances of embodiment. Knowledge-Based Systems, 14(1–2), 55–64.
Cassell, J., Nakano, Y., Bickmore, T. W., Sidner, C. L., & Rich, C. (2001b). Non-verbal cues for discourse structure. In Proceedings of ACL 2001.
Cassell, J., Sullivan, J., Prevost, S., & Churchill, E. (2000). Embodied conversational agents. MIT Press.
Clark, R. A. J., Richmond, K., & King, S. (2004). Festival 2 – Build your own general purpose unit selection speech synthesiser. In Proceedings of the 5th ISCA Workshop on Speech Synthesis.
de Carolis, B., Carofiglio, V., & Pelachaud, C. (2002). From discourse plans to believable behavior generation. In Proceedings of INLG 2002.
DeCarlo, D., Stone, M., Revilla, C., & Venditti, J. (2004). Specifying and animating facial signals for discourse in embodied conversational agents. Computer Animation and Virtual Worlds, 15(1), 27–38.
Ekman, P. (1979). About brows: Emotional and conversational signals. In M. von Cranach, K. Foppa, W. Lepenies, & D. Ploog (Eds.), Human ethology: Claims and limits of a new discipline. Cambridge University Press.
Foster, M. E. (2007). Evaluating the impact of variation in automatically generated embodied object descriptions. Ph.D. thesis, School of Informatics, University of Edinburgh.
Foster, M. E., & Oberlander, J. (2006). Data-driven generation of emphatic facial displays. In Proceedings of EACL 2006 (pp. 353–360).
Foster, M. E., White, M., Setzer, A., & Catizone, R. (2005). Multimodal generation in the COMIC dialogue system. In Proceedings of the ACL 2005 Demo Session.
Fox, J. (2002). An R and S-Plus companion to applied regression. Sage Publications.
Graf, H., Cosatto, E., Strom, V., & Huang, F. (2002). Visual prosody: Facial movements accompanying speech. In Proceedings of FG 2002 (pp. 397–401).
Kipp, M. (2004). Gesture generation by imitation – From human behavior to computer character animation. Dissertation.com.
Krahmer, E., & Swerts, M. (2005). How children and adults produce and perceive uncertainty in audiovisual speech. Language and Speech, 48(1), 29–53.
Langkilde, I., & Knight, K. (1998). Generation that exploits corpus-based statistical knowledge. In Proceedings of COLING-ACL 1998.
Langkilde-Geary, I. (2002). An empirical verification of coverage and correctness for a general-purpose sentence generator. In Proceedings of INLG 2002.
Mana, N., & Pianesi, F. (2006). HMM-based synthesis of emotional facial expressions during speech in synthetic talking heads. In Proceedings of ICMI 2006.
Martin, J.-C., Kühnlein, P., Paggio, P., Stiefelhagen, R., & Pianesi, F. (Eds.) (2006). LREC 2006 workshop on multimodal corpora: From multimodal behaviour theories to usable models.
McNeill, D. (Ed.) (2000). Language and gesture: Window into thought and action. Cambridge University Press.
Passonneau, R. J. (2004). Computing reliability for coreference annotation. In Proceedings of LREC 2004 (Vol. 4, pp. 1503–1506).
Rehm, M., & André, E. (2005). Catch me if you can – Exploring lying agents in social settings. In Proceedings of AAMAS 2005 (pp. 937–944).
Steedman, M. (2000). Information structure and the syntax-phonology interface. Linguistic Inquiry, 31(4), 649–689.
Stone, M., DeCarlo, D., Oh, I., Rodriguez, C., Lees, A., Stere, A., & Bregler, C. (2004). Speaking with hands: Creating animated conversational characters from recordings of human performance. ACM Transactions on Graphics, 23(3), 506–513.
White, M. (2006). Efficient realization of coordinate structures in combinatory categorial grammar. Research on Language and Computation, 4(1), 39–75.
Acknowledgements
This work was supported by the EU projects COMIC (IST-2001-32311) and JAST (FP6-003747-IP). An initial version of this study was published as Foster and Oberlander (2006).
Cite this article
Foster, M.E., Oberlander, J. Corpus-based generation of head and eyebrow motion for an embodied conversational agent. Lang Resources & Evaluation 41, 305–323 (2007). https://doi.org/10.1007/s10579-007-9055-3