ABSTRACT
Searching audio data can potentially be facilitated by using automatic speech recognition (ASR) technology to generate text transcripts that can then be easily queried. However, since current ASR technology cannot reliably generate 100% accurate transcripts, additional techniques for fluid browsing and searching of the audio itself are required. We explore the impact of transcripts of varying quality, dichotic presentation, and time-compression on an audio search task. Results show that dichotic presentation and reasonably accurate transcripts can assist in the search process, but suggest that time-compression and low-accuracy transcripts should be used carefully.
Index Terms
- Searching in audio: the utility of transcripts, dichotic presentation, and time-compression