ABSTRACT
We analyze how fusing features obtained from multiple multimodal data streams, such as speech, face, body movement, and emotion tracks, can aid the scoring of multimodal presentations. We compute both time-aggregated and time-series-based features from these streams: the former are statistical functionals and other cumulative features computed over each entire time series, while the latter, dubbed histograms of co-occurrences, capture how often different prototypical body-posture or facial configurations co-occur within various time lags of each other over the evolution of the multimodal, multivariate time series. We examine the relative utility of these features, along with curated speech-stream features, in predicting human-rated scores for multiple aspects of presentation proficiency. We find that different modalities are useful for predicting different aspects, and even outperform a naive human inter-rater agreement baseline for a subset of the aspects analyzed.
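To make the two feature families concrete, below is a minimal Python sketch of how they could be computed for a single multivariate time series (e.g., a body-movement track). This is an illustration under assumed choices, not the paper's implementation: the prototype count, the lag set, the use of k-means to derive prototypical configurations, and all function names are hypothetical.

```python
# Illustrative sketch (not the authors' exact pipeline) of the two feature
# types described in the abstract: time-aggregated statistical functionals
# and lag-based histograms of co-occurrences. The number of prototypes and
# the lag set are assumed values chosen for illustration.
import numpy as np
from sklearn.cluster import KMeans

def statistical_functionals(ts):
    """Time-aggregated features: functionals computed over an entire
    (n_frames, n_dims) multivariate time series."""
    funcs = [np.mean, np.std, np.min, np.max, np.median]
    return np.concatenate([f(ts, axis=0) for f in funcs])

def histogram_of_cooccurrences(ts, n_prototypes=8, lags=(1, 5, 10), seed=0):
    """Lag-based co-occurrence features: quantize each frame to its nearest
    prototype configuration, then count how often prototype i is followed
    by prototype j exactly `lag` frames later, for each lag."""
    # 1. Derive prototypical configurations (e.g., body postures)
    #    by clustering the frames.
    labels = KMeans(n_clusters=n_prototypes, n_init=10,
                    random_state=seed).fit_predict(ts)
    # 2. For each lag, build an n_prototypes x n_prototypes matrix of
    #    co-occurrence counts and normalize it into a histogram.
    feats = []
    for lag in lags:
        counts = np.zeros((n_prototypes, n_prototypes))
        for a, b in zip(labels[:-lag], labels[lag:]):
            counts[a, b] += 1
        feats.append(counts.ravel() / max(counts.sum(), 1))
    return np.concatenate(feats)

# Toy usage: 300 frames of a 4-dimensional movement track.
rng = np.random.default_rng(0)
ts = rng.normal(size=(300, 4))
x = np.concatenate([statistical_functionals(ts),
                    histogram_of_cooccurrences(ts)])
print(x.shape)  # fused feature vector for one presentation
```

In practice the prototypes would presumably be learned once over the whole corpus rather than per recording, so that co-occurrence histograms are comparable across presentations; the per-recording clustering above is only to keep the sketch self-contained.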
Figure 2: Example figure showing all data streams.