ABSTRACT
The accuracy of Automatic Speech Recognition (ASR) technology has improved, but it is still imperfect in many settings. Researchers who evaluate ASR performance often focus on improving the Word Error Rate (WER) metric, but WER has been found to have little correlation with human-subject performance in many applications. We propose a new captioning-focused evaluation metric that better predicts the impact of ASR recognition errors on the usability of automatically generated captions for people who are Deaf or Hard of Hearing (DHH). Through a user study with 30 DHH users, we compared our new metric with the traditional WER metric on a caption usability evaluation task. In side-by-side comparisons of pairs of ASR text outputs with identical WER, the texts rated higher by our new metric were also preferred by DHH participants. Further, our metric had significantly higher correlation with DHH participants' subjective scores on the usability of a caption, as compared to the correlation between the WER metric and participants' subjective scores. This new metric could be used to select ASR systems for captioning applications, and it may be a better metric for ASR researchers to consider when optimizing ASR systems.
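For readers unfamiliar with the baseline metric discussed above, the sketch below computes the conventional WER: the minimum word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and an ASR hypothesis, normalized by the reference length. This is only an illustration of the standard metric the abstract critiques; the captioning-focused metric proposed by the authors is not defined in this abstract and is not reproduced here, and the example sentences are hypothetical.

```python
# Minimal sketch of the standard Word Error Rate (WER) metric.
# This is NOT the captioning-focused metric proposed in the paper;
# it only illustrates the baseline that WER-based evaluation uses.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Dynamic-programming table: d[i][j] is the edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match

    return d[len(ref)][len(hyp)] / max(len(ref), 1)


if __name__ == "__main__":
    ref = "the quick brown fox jumps over the lazy dog"   # hypothetical reference
    hyp = "the quick brown box jumps over lazy dog"       # hypothetical ASR output
    # 1 substitution + 1 deletion over 9 reference words -> WER ~ 0.22
    print(f"WER = {word_error_rate(ref, hyp):.2f}")
```

Note that, as the abstract argues, two hypotheses with the same WER can differ greatly in how usable they are as captions, which is the gap the proposed metric is designed to address.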