ABSTRACT
Speech transcription is an expensive service with high turnaround time for audio files containing languages spoken in developing countries and regional accents of well-represented languages. We present Respeak, a voice-based, crowd-powered system that capitalizes on the strengths of crowdsourcing and automatic speech recognition to transcribe such audio files: workers respeak the audio rather than typing it. We created Respeak and optimized its design through a series of cognitive experiments. We deployed it with 25 university students in India, who completed 5464 micro-transcription tasks, transcribing 55 minutes of widely varied audio content and collectively earning USD 46 as mobile airtime. By aligning the transcripts generated by five randomly selected users, the Respeak engine transcribed Hindi and Indian English audio files with a word error rate (WER) of 8.6% and 15.2%, respectively. The cost of speech transcription was USD 0.83 per minute with a turnaround time of 39.8 hours, substantially less than industry standards. Using a mixed-methods analysis of cognitive experiments, system performance, and qualitative interviews, we evaluate Respeak's design, user experience, strengths, and weaknesses. Our findings suggest that Respeak improves the quality of speech transcription while enhancing the earning potential of low-income populations in resource-constrained settings.
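For readers unfamiliar with the metric, the sketch below shows the standard definition of word error rate: the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and a hypothesis, divided by the number of reference words. It illustrates the metric only. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example, and it is not the Respeak engine's alignment or scoring code, which the abstract does not describe.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is a plain whitespace split (an
    assumption), with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a WER of 8.6% corresponds to roughly 9 word-level errors per 100 reference words.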
Index Terms
- Respeak: A Voice-based, Crowd-powered Speech Transcription System