Corpus-based generation of head and eyebrow motion for an embodied conversational agent

Published in: Language Resources and Evaluation

Abstract

Humans are known to use a wide range of non-verbal behaviour while speaking. Generating naturalistic embodied speech for an artificial agent is therefore an application where techniques that draw directly on recorded human motions can be helpful. We present a system that uses corpus-based selection strategies to specify the head and eyebrow motion of an animated talking head. We first describe how a domain-specific corpus of facial displays was recorded and annotated, and outline the regularities that were found in the data. We then present two different methods of selecting motions for the talking head based on the corpus data: one that chooses the majority option in all cases, and one that makes a weighted choice among all of the options. We compare these methods to each other in two ways: through cross-validation against the corpus, and by asking human judges to rate the output. The results of the two evaluation studies differ: the cross-validation study favoured the majority strategy, while the human judges preferred schedules generated using weighted choice. The judges in the second study also showed a preference for the original corpus data over the output of either of the generation strategies.
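To make the two selection strategies concrete, here is a minimal Python sketch (not the authors' implementation; the display labels and counts are invented for illustration). It contrasts majority selection, which always picks the most frequent display recorded for a context, with frequency-weighted selection, which samples among all displays in proportion to their corpus counts:

```python
import random
from collections import Counter

def majority_choice(display_counts):
    """Always return the most frequent display for this context."""
    return display_counts.most_common(1)[0][0]

def weighted_choice(display_counts, rng=random):
    """Sample a display with probability proportional to its corpus frequency."""
    displays = list(display_counts)
    weights = [display_counts[d] for d in displays]
    return rng.choices(displays, weights=weights, k=1)[0]

# Hypothetical counts for one linguistic context (invented for illustration):
counts = Counter({"none": 23, "nod": 12, "nod+brow_raise": 5})

print(majority_choice(counts))   # always "none"
print(weighted_choice(counts))   # "none" about 57% of the time, "nod" about 30%
```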


Notes

  1. No sentence in the script had more than two clauses.

  2. We did not select any motions on words for which the speech-synthesiser output was very short, such as but and is, because the synthesiser could not make those words long enough for any accompanying motion to be sensible.

  3. A baseline system that never proposes any motion scores 0.79 on this measure.

  4. The corpus schedules were modified to remove motions on short words such as but and is, for the reasons discussed in Sect. 4.

References

  • Artstein, R., & Poesio, M. (2005). Kappa3 = alpha (or beta). Technical Report CSM-437, University of Essex Department of Computer Science.

  • Bangalore, S., Rambow, O., & Whittaker, S. (2000). Evaluation metrics for generation. In Proceedings of INLG 2000.

  • Belz, A., Gatt, A., Reiter, E., & Viethen, J. (2007). First NLG shared task and evaluation challenge on attribute selection for referring expression generation. http://www.csd.abdn.ac.uk/research/evaluation/

  • Belz, A., & Reiter, E. (2006). Comparing automatic and human evaluation of NLG systems. In Proceedings of EACL 2006 (pp. 313–320).

  • Belz, A., & Varges, S. (Eds.) (2005). Corpus Linguistics 2005 workshop on using corpora for natural language generation.

  • Cassell, J., Bickmore, T., Vilhjálmsson H., & Yan, H. (2001a). More than just a pretty face: Conversational protocols and the affordances of embodiment. Knowledge-Based Systems, 14(1–2), 55–64.

  • Cassell, J., Nakano, Y., Bickmore, T. W., Sidner, C. L., & Rich, C. (2001b). Non-verbal cues for discourse structure. In Proceedings of ACL 2001.

  • Cassell, J., Sullivan, J., Prevost, S., & Churchill, E. (2000). Embodied conversational agents. MIT Press.

  • Clark, R. A. J., Richmond, K., & King, S. (2004). Festival 2 – Build your own general purpose unit selection speech synthesiser. In Proceedings of the 5th ISCA Workshop on Speech Synthesis.

  • de Carolis, B., Carofiglio, V., & Pelachaud, C. (2002). From discourse plans to believable behavior generation. In Proceedings of INLG 2002.

  • DeCarlo, D., Stone, M., Revilla, C., & Venditti, J. (2004). Specifying and animating facial signals for discourse in embodied conversational agents. Computer Animation and Virtual Worlds, 15(1), 27–38.

  • Ekman, P. (1979). About brows: Emotional and conversational signals. In M. von Cranach, K. Foppa, W. Lepenies, & D. Ploog (Eds.), Human ethology: Claims and limits of a new discipline. Cambridge University Press.

  • Foster, M. E. (2007). Evaluating the impact of variation in automatically generated embodied object descriptions. Ph.D. thesis, School of Informatics, University of Edinburgh.

  • Foster, M. E., & Oberlander, J. (2006). Data-driven generation of emphatic facial displays. In Proceedings of EACL 2006 (pp. 353–360).

  • Foster, M. E., White, M., Setzer, A., & Catizone, R. (2005). Multimodal generation in the COMIC dialogue system. In Proceedings of the ACL 2005 Demo Session.

  • Fox, J. (2002). An R and S-Plus companion to applied regression. Sage Publications.

  • Graf, H., Cosatto, E., Strom, V., & Huang, F. (2002). Visual prosody: Facial movements accompanying speech. In Proceedings of FG 2002 (pp. 397–401).

  • Kipp, M. (2004). Gesture generation by imitation – From human behavior to computer character animation. Dissertation.com.

  • Krahmer, E., & Swerts, M. (2005). How children and adults produce and perceive uncertainty in audiovisual speech. Language and Speech, 48(1), 29–53.

  • Langkilde, I., & Knight, K. (1998). Generation that exploits corpus-based statistical knowledge. In Proceedings of COLING-ACL 1998.

  • Langkilde-Geary, I. (2002). An empirical verification of coverage and correctness for a general-purpose sentence generator. In Proceedings of INLG 2002.

  • Mana, N., & Pianesi, F. (2006). HMM-based synthesis of emotional facial expressions during speech in synthetic talking heads. In Proceedings of ICMI 2006.

  • Martin, J.-C., Kühnlein, P., Paggio, P., Stiefelhagen, R., & Pianesi, F. (Eds.) (2006). LREC 2006 workshop on multimodal corpora: From multimodal behaviour theories to usable models.

  • McNeill, D. (Ed.) (2000). Language and gesture: Window into thought and action. Cambridge University Press.

  • Passonneau, R. J. (2004). Computing reliability for coreference annotation. In Proceedings of LREC 2004 (Vol. 4, pp. 1503–1506).

  • Rehm, M., & André, E. (2005). Catch me if you can – Exploring lying agents in social settings. In Proceedings of AAMAS 2005 (pp. 937–944).

  • Steedman, M. (2000). Information structure and the syntax-phonology interface. Linguistic Inquiry, 31(4), 649–689.

  • Stone, M., DeCarlo, D., Oh, I., Rodriguez, C., Lees, A., Stere, A., & Bregler, C. (2004). Speaking with hands: Creating animated conversational characters from recordings of human performance. ACM Transactions on Graphics, 23(3), 506–513.

  • White, M. (2006). Efficient realization of coordinate structures in combinatory categorial grammar. Research on Language and Computation, 4(1), 39–75.


Acknowledgements

This work was supported by the EU projects COMIC (IST-2001-32311) and JAST (FP6-003747-IP). An initial version of this study was published as Foster and Oberlander (2006).

Author information

Correspondence to Mary Ellen Foster.

Cite this article

Foster, M.E., Oberlander, J. Corpus-based generation of head and eyebrow motion for an embodied conversational agent. Lang Resources & Evaluation 41, 305–323 (2007). https://doi.org/10.1007/s10579-007-9055-3
