ABSTRACT
It has been suggested that during face-to-face communication as much as 70% of what people convey when talking directly with others is carried by paralanguage, which combines multiple modalities (e.g. voice tone and volume, body language). In an attempt to make human-computer interaction more similar to human-human communication and to enhance its naturalness, research on sensory acquisition and interpretation of single modalities of human expression has seen steady progress over the last decade. This progress makes artificial sensor fusion of multiple modalities an increasingly important research domain: fusion can improve recognition accuracy for congruent messages on the one hand, and can detect incongruent messages across multiple modalities on the other (incongruence being itself a message about the nature of the information being conveyed). Accurate interpretation of emotional signals, which are quintessentially multimodal, would hence particularly benefit from multimodal sensor fusion and interpretation algorithms. In this paper we review the state of the art in multimodal fusion and describe one way to implement a generic framework for multimodal emotion recognition. The system is developed within the MAUI framework [31] and Scherer's Component Process Theory (CPT) [49, 50, 51, 24, 52], with the goal of being modular and adaptive: the framework should accept different single- and multi-modality recognition systems and automatically adapt the fusion algorithm to find optimal solutions. The system also aims to be adaptive to channel (and system) reliability.
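To make the notion of fusion that adapts to channel reliability concrete, the sketch below shows one simple instance: reliability-weighted, decision-level fusion of per-modality emotion estimates. It is purely illustrative; the emotion labels, function names, and weighting scheme are our assumptions, not the paper's actual algorithm.

```python
# A minimal sketch of reliability-weighted decision-level fusion,
# assuming each modality recognizer outputs a posterior distribution
# over a shared set of emotion labels. Illustrative only.
from typing import Dict, List

EMOTIONS: List[str] = ["anger", "fear", "joy", "sadness", "surprise"]

def fuse(
    modality_scores: Dict[str, List[float]],  # per-modality posterior over EMOTIONS
    reliability: Dict[str, float],            # per-channel reliability in [0, 1]
) -> List[float]:
    """Weight each modality's distribution by the current reliability
    of its channel, then renormalize the combined estimate."""
    fused = [0.0] * len(EMOTIONS)
    total_weight = sum(reliability.get(m, 0.0) for m in modality_scores)
    for modality, scores in modality_scores.items():
        w = reliability.get(modality, 0.0)
        for i, s in enumerate(scores):
            fused[i] += w * s
    # Renormalize so the fused scores again form a distribution.
    return [f / total_weight for f in fused] if total_weight > 0 else fused

# Example: a noisy audio channel (low reliability) contributes less
# to the fused estimate than a clean video channel.
print(fuse(
    {"video": [0.70, 0.10, 0.10, 0.05, 0.05],
     "audio": [0.20, 0.50, 0.10, 0.10, 0.10]},
    {"video": 0.9, "audio": 0.3},
))
```

Under this scheme, dropping or degrading a channel simply lowers its weight, which is one way a fusion layer can remain robust to unreliable sensors or recognizers.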
- P. Aleksic and A. Katsaggelos. Product HMMs for audio-visual continuous speech recognition using facial animation parameters, 2003.
- C. Bartneck, J. Reichenbach, and A. van Breemen. In your face, robot! The influence of a character's embodiment on how users perceive its emotional expressions. In Proceedings of the Fourth International Conference on Design & Emotion, Ankara, Turkey, July 2004.
- C. Besson, D. Graf, I. Hartung, B. Kropfhusser, and S. Voisard. The importance of non-verbal communication in professional interpretation, 2004.
- R. A. Bolt. Put-that-there: Voice and gesture at the graphics interface. In SIGGRAPH '80: Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Techniques, pages 262--270, New York, NY, USA, 1980. ACM Press.
- C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan. Analysis of emotion recognition using facial expressions, speech and multimodal information. In Proceedings of the 6th International Conference on Multimodal Interfaces (ICMI '04), pages 205--211, State College, PA, USA, 2004. ACM Press.
- L. Chen, H. Tao, T. Huang, T. Miyasato, and R. Nakatsu. Emotion recognition from audiovisual information. In Proceedings of the IEEE Workshop on Multimedia Signal Processing, pages 83--88, Los Angeles, CA, USA, 1998.
- A. Colmenarez, B. Frey, and T. Huang. Embedded face and facial expression recognition. In Proceedings of ICIP 1999, volume 1, pages 633--637, 1999.
- A. Corradini, M. Mehta, N. Bernsen, and J.-C. Martin. Multimodal input fusion in human-computer interaction on the example of the on-going NICE project. In Proceedings of the NATO-ASI Conference on Data Fusion for Situation Monitoring, Incident Detection, Alert and Response Management, Yerevan, Armenia, August 2003.
- A. Duminuco, C. Liu, D. Kryze, and L. Rigazio. Flexible feature spaces based on generalized heteroscedastic linear discriminant analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2006.
- P. Ekman. Universals and cultural differences in facial expressions of emotion. In J. K. Cole, editor, Proceedings of the Nebraska Symposium on Motivation, volume 19, pages 207--283, Lincoln, NE, 1971. University of Nebraska Press.
- P. Ekman, W. V. Friesen, and J. C. Hager. Facial Action Coding System Investigator's Guide. A Human Face, 2002.
- P. Ekman and W. V. Friesen. Facial Action Coding System. Consulting Psychologists Press, Palo Alto, CA, 1978.
- T. Fong, I. Nourbakhsh, and K. Dautenhahn. A survey of socially interactive robots. Robotics and Autonomous Systems, 42, 2002.
- A. Grizard, M. Paleari, and C. Lisetti. Adapting psychologically grounded facial emotional expressions to different platforms. In Proceedings of KI06, 26th German Annual Conference on Artificial Intelligence, Bremen, Germany, 2006.
- A. Haag, S. Goronzy, P. Schaich, and J. Williams. Emotion recognition using bio-sensors: First steps towards an automatic system. In Affective Dialogue Systems, volume 3068 of Lecture Notes in Computer Science, pages 36--48, 2004.
- R. Huber, A. Batliner, J. Buckow, E. Nöth, V. Warnke, and H. Niemann. Recognition of emotions in a realistic dialog scenario. In Proceedings of ICSLP 2000, pages 665--668, 2000.
- I. Poggi, C. Pelachaud, F. de Rosis, V. Carofiglio, and B. De Carolis. Multimodal Intelligent Information Presentation, chapter GRETA: A Believable Embodied Conversational Agent. Kluwer, 2005.
- H. Ishiguro. 2006-2056 projects and vision in robotics. In Proceedings of the 50 Years AI Symposium at KI06, 26th German Annual Conference on Artificial Intelligence, Bremen, Germany, 2006.
- T. Kang, C. Han, S. Lee, D. Youn, and C. Lee. Speaker dependent emotion recognition using speech signals. In Proceedings of ICSLP 2000, pages 383--386, 2000.
- S. Kettebekov and R. Sharma. Toward multimodal interpretation in a natural speech/gesture interface. In Proceedings of the IEEE Symposium on Image, Speech, and Natural Language Systems, pages 328--335. IEEE, November 1999.
- K. Kim, S. Bang, and S. Kim. Emotion recognition system using short-term monitoring of physiological signals. Medical and Biological Engineering and Computing, 42, 2004.
- B. J. A. Kröse, J. M. Porta, A. J. N. van Breemen, K. Crucq, M. Nuttin, and E. Demeester. Lino, the user-interface robot. In EUSAI, pages 264--274, 2003.
- H. Leventhal. A perceptual-motor theory of emotion. Advances in Experimental Social Psychology, 17:117--182, 1984.
- H. Leventhal and K. R. Scherer. The relationship of emotion to cognition: A functional approach to a semantic controversy. Cognition and Emotion, 1:3--28, 1987.
- X. Li and Q. Ji. Active affective state detection and user assistance with dynamic Bayesian networks. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 35:93--105, January 2005.
- Y. Li and Y. Zhao. Recognition of emotions in speech using short term and long term features. In Proceedings of ICSLP 1998, pages 2255--2258, 1998.
- H. Liao. Multimodal Fusion. Master's thesis, University of Cambridge, July 2002.
- C. Lisetti and F. Nasoz. Using noninvasive wearable computers to recognize human emotions from physiological signals. EURASIP Journal on Applied Signal Processing, 11:1672--1687, 2004.
- C. L. Lisetti and P. J. Gmytrasiewicz. Emotions and personality in agent design. In Proceedings of AAMAS 2002, 2002.
- C. L. Lisetti and A. Marpaung. BDI+E framework: An affective cognitive modeling for autonomous agents based on Scherer's emotion theory. In Proceedings of KI06, 26th German Annual Conference on Artificial Intelligence, Bremen, Germany, 2006.
- C. L. Lisetti and F. Nasoz. MAUI: A multimodal affective user interface. In Proceedings of the ACM Multimedia International Conference 2002, Juan les Pins, December 2002.
- C. Mallauran, J.-L. Dugelay, F. Perronnin, and C. Garcia. Online face detection and user authentication. In Proceedings of the ACM Multimedia Conference 2005, Singapore, November 2005.
- K. Mase. Recognition of facial expression from optical flow. IEICE Transactions, E74:3474--3483, 1991.
- F. Matta and J. Dugelay. Towards person recognition using head dynamics. In Proceedings of ISPA 2005, 4th International Symposium on Image and Signal Processing and Analysis, Zagreb, Croatia, September 2005.
- A. Mehrabian. Silent Messages. Wadsworth Publishing Company, Inc., Belmont, CA, 1971.
- A. Mehrabian. Nonverbal Communication. Aldine-Atherton, Chicago, 1972.
- G. Merola and I. Poggi. Multimodality and gestures in the teacher's communication. In Lecture Notes in Computer Science, volume 2915, pages 101--111, February 2004.
- R. Murphy, C. Lisetti, L. Irish, R. Tardif, and A. Gage. Emotion-based control of cooperating heterogeneous mobile robots. IEEE Transactions on Robotics and Automation, Special Issue on Multi-Robot Systems, 2001.
- M. Paleari and C. Lisetti. Psychologically grounded avatar expressions. In Proceedings of KI06, 26th German Annual Conference on Artificial Intelligence, Bremen, Germany, 2006.
- M. Pantic and L. Rothkrantz. Automatic analysis of facial expression: The state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:1424--1445, 2000.
- M. Pantic and L. Rothkrantz. Expert systems for automatic analysis of facial expression. Image and Vision Computing, 18:881--905, 2000.
- M. Pantic and L. Rothkrantz. Toward an affect-sensitive multimodal human-computer interaction. Proceedings of the IEEE, 91:1370--1390, September 2003.
- C. Pelachaud, V. Carofiglio, and I. Poggi. Embodied contextual agent in information delivering application. In Proceedings of the First International Joint Conference on Autonomous Agents & Multi-Agent Systems, Bologna, Italy, 2002.
- R. Picard. Affective Computing. MIT Press, Cambridge, MA, 1997.
- L. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ, 1993.
- A. S. Rao and M. P. Georgeff. Modeling rational agents within a BDI-architecture. In J. Allen, R. Fikes, and E. Sandewall, editors, Proceedings of the Second International Conference on Principles of Knowledge Representation and Reasoning (KR'91), pages 473--484, 1991.
- A. S. Rao and M. P. Georgeff. BDI agents: From theory to practice. In Proceedings of the First International Conference on Multi-Agent Systems (ICMAS-95), pages 312--319, San Francisco, CA, 1995.
- H. Sato, Y. Mitsukura, M. Fukumi, and N. Akamatsu. Emotional speech classification with prosodic parameters using neural networks. In Proceedings of the Australian and New Zealand Intelligent Information Systems Conference, pages 395--398, 2001.
- K. R. Scherer. Emotion as a process: Function, origin and regulation. Social Science Information, 21:555--570, 1982.
- K. R. Scherer. Emotions can be rational. Social Science Information, 24(2):331--335, 1985.
- K. R. Scherer. Toward a dynamic theory of emotion: The component process model of affective states. Geneva Studies in Emotion and Communication, 1(1):1--98, 1987.
- K. R. Scherer. Appraisal Processes in Emotion: Theory, Methods, Research, chapter Appraisal Considered as a Process of Multilevel Sequential Checking, pages 92--120. Oxford University Press, New York, NY, 2001.
- N. Sebe, I. Cohen, and T. Huang. Multimodal emotion recognition. World Scientific, 2005.
- N. Sebe, M. Lew, I. Cohen, A. Garg, and T. Huang. Emotion recognition using a Cauchy naive Bayes classifier. In Proceedings of ICPR 2002, volume 1, pages 17--20, 2002.
- R. Sharma, V. Pavlovic, and T. Huang. Toward multimodal human-computer interface. Proceedings of the IEEE, 86(5):853--869, 1998.
- V. Tyagi and C. Wellekens. Adaptive enhancement of speech signals for robust ASR. In ASIDE 2005, COST278 Final Workshop and ISCA Tutorial and Research Workshop, Aalborg, Denmark, November 2005.
- Haptek website: www.haptek.com, 2006.
- iCat website at Philips: www.research.philips.com/robotics, 2006.
- A. van Breemen. Animation engine for believable interactive user-interface robots. In Proceedings of IROS 2004, IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, Japan, September 2004.
- A. van Breemen. Bringing robots to life: Applying principles of animation to robots. In Proceedings of the Shaping Human-Robot Interaction Workshop held at CHI 2004, Vienna, Austria, 2004.
- A. van Breemen. iCat: Experimenting with animabotics. In Proceedings of AISB 2005, pages 27--32, 2005.
- O. Villon and C. L. Lisetti. Toward building adaptive users' psycho-physiological maps of emotions using bio-sensors. In Proceedings of KI06, 26th German Annual Conference on Artificial Intelligence, Bremen, Germany, 2006.
- Y. Wu and T. Huang. Vision-based gesture recognition: A review. Lecture Notes in Computer Science, 1739:103--115, 1999.