Article

Searching in audio: the utility of transcripts, dichotic presentation, and time-compression

Authors:
Abhishek Ranjan, University of Toronto
Ravin Balakrishnan, University of Toronto
Mark Chignell, University of Toronto
CHI '06: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, April 2006, Pages 721–730, https://doi.org/10.1145/1124772.1124879

Published: 22 April 2006

ABSTRACT

Searching audio data can potentially be facilitated by using automatic speech recognition (ASR) technology to generate text transcripts that can then be easily queried. However, since current ASR technology cannot reliably generate 100% accurate transcripts, additional techniques for fluid browsing and searching of the audio itself are required. We explore the impact of transcripts of various qualities, dichotic presentation, and time-compression on an audio search task. Results show that dichotic presentation and reasonably accurate transcripts can assist in the search process, but suggest that time-compression and low-accuracy transcripts should be used carefully.


Index Terms

1. Human-centered computing
  1. Human computer interaction (HCI)


Published in
CHI '06: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
April 2006
1353 pages
ISBN:1595933727
DOI:10.1145/1124772
Editors:
Rebecca Grinter, Georgia Institute of Technology, USA
Thomas Rodden, University of Nottingham, UK
Paul Aoki, Intel, USA
Ed Cutrell, Microsoft, USA
Robin Jeffries, Google, USA
Gary Olson, University of Michigan, USA
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
audio time-compression
dichotic listening
transcripts
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 6,199 of 26,314 submissions, 24%

Article Metrics
- Total Citations: 9
- Total Downloads: 656
- Downloads (Last 12 months): 8
- Downloads (Last 6 weeks): 1