
Respeak: A Voice-based, Crowd-powered Speech Transcription System

Authors:
Aditya Vashistha, University of Washington, Seattle, WA, USA
Pooja Sethi, University of Washington, Seattle, WA, USA
Richard Anderson, University of Washington, Seattle, WA, USA
CHI '17: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, May 2017, Pages 1855–1866. https://doi.org/10.1145/3025453.3025640

Published: 2 May 2017

ABSTRACT

Speech transcription is an expensive service with high turnaround time for audio files containing languages spoken in developing countries and regional accents of well-represented languages. We present Respeak, a voice-based, crowd-powered system that capitalizes on the strengths of crowdsourcing and automatic speech recognition (instead of typing) to transcribe such audio files. We created Respeak and optimized its design through a series of cognitive experiments. We deployed it with 25 university students in India who completed 5464 micro-transcription tasks, transcribing 55 minutes of widely varied audio content, and collectively earning USD 46 as mobile airtime. The Respeak engine aligned the transcripts generated by five randomly selected users to transcribe Hindi and Indian English audio files with a word error rate (WER) of 8.6% and 15.2%, respectively. The cost of speech transcription was USD 0.83 per minute with a turnaround time of 39.8 hours, substantially less than industry standards. Using a mixed-methods analysis of cognitive experiments, system performance, and qualitative interviews, we evaluate Respeak's design, user experience, strengths, and weaknesses. Our findings suggest that Respeak improves the quality of speech transcription while enhancing the earning potential of low-income populations in resource-constrained settings.
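The word error rate quoted above is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the Respeak engine's own code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far (minus the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 8.6% corresponds to roughly 9 word-level edits per 100 reference words. Real evaluations usually normalize case and punctuation before scoring; that step is omitted here.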

Supplemental Material

pn1920p.mp4 (mp4, 2 MB)
pn1920-file4.zip (zip, 542 B)

Index Terms

Human-centered computing

Published in
CHI '17: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems
May 2017
7138 pages
ISBN: 9781450346559
DOI: 10.1145/3025453
General Chairs: Gloria Mark (University of California Irvine), Susan Fussell (Cornell University)
Program Chairs: Cliff Lampe (University of Michigan), m.c. schraefel (University of Southampton), Juan Pablo Hourcade (University of Iowa), Caroline Appert (Université Paris-Sud), Daniel Wigdor (University of Toronto)
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Honorable Mention
Author Tags
HCI4D
India
crowdsourcing
speech
transcription
Qualifiers
- research-article
Acceptance Rates
CHI '17 Paper Acceptance Rate: 600 of 2,400 submissions, 25%
Overall Acceptance Rate: 6,199 of 26,314 submissions, 24%