ABSTRACT
The accuracy of Automatic Speech Recognition (ASR) technology has improved, but it is still imperfect in many settings. Researchers who evaluate ASR performance often focus on improving the Word Error Rate (WER) metric, but WER has been found to have little correlation with human-subject performance in many applications. We propose a new captioning-focused evaluation metric that better predicts the impact of ASR recognition errors on the usability of automatically generated captions for people who are Deaf or Hard of Hearing (DHH). Through a user study with 30 DHH users, we compared our new metric with the traditional WER metric on a caption usability evaluation task. In side-by-side comparisons of pairs of ASR text outputs with identical WER, the texts rated higher by our new metric were also preferred by DHH participants. Further, our metric had significantly higher correlation with DHH participants' subjective scores on the usability of a caption, as compared to the correlation between the WER metric and participants' subjective scores. This new metric could be used to select ASR systems for captioning applications, and it may be a better metric for ASR researchers to consider when optimizing ASR systems.
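For readers unfamiliar with the baseline metric discussed above, the sketch below computes the conventional WER: the minimum word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and an ASR hypothesis, normalized by the reference length. This is only an illustration of the standard metric the abstract critiques; the captioning-focused metric proposed by the authors is not defined in this abstract and is not reproduced here, and the example sentences are hypothetical.

```python
# Minimal sketch of the standard Word Error Rate (WER) metric.
# This is NOT the captioning-focused metric proposed in the paper;
# it only illustrates the baseline that WER-based evaluation uses.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Dynamic-programming table: d[i][j] is the edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match

    return d[len(ref)][len(hyp)] / max(len(ref), 1)


if __name__ == "__main__":
    ref = "the quick brown fox jumps over the lazy dog"   # hypothetical reference
    hyp = "the quick brown box jumps over lazy dog"       # hypothetical ASR output
    # 1 substitution + 1 deletion over 9 reference words -> WER ~ 0.22
    print(f"WER = {word_error_rate(ref, hyp):.2f}")
```

Note that, as the abstract argues, two hypotheses with the same WER can differ greatly in how usable they are as captions, which is the gap the proposed metric is designed to address.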