Article
DOI: 10.1145/1124772.1124879

Searching in audio: the utility of transcripts, dichotic presentation, and time-compression

Published: 22 April 2006

ABSTRACT

Searching audio data can potentially be facilitated by using automatic speech recognition (ASR) technology to generate text transcripts, which can then be easily queried. However, since current ASR technology cannot reliably generate 100% accurate transcripts, additional techniques for fluid browsing and searching of the audio itself are required. We explore the impact of transcripts of varying quality, dichotic presentation, and time-compression on an audio search task. Results show that dichotic presentation and reasonably accurate transcripts can assist in the search process, but suggest that time-compression and low-accuracy transcripts should be used carefully.
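The pipeline the abstract describes — ASR produces a time-aligned transcript, and a text query is then mapped back to playback positions in the audio — can be sketched as follows. This is a minimal illustration, not the paper's actual system: the `(word, start_time)` transcript format and the `search_transcript` helper are hypothetical stand-ins for whatever an ASR engine would emit.

```python
# Hypothetical sketch of transcript-based audio search.
# Assumption: an ASR engine emits (word, start_seconds) pairs; searching
# the transcript then maps a text query to seek positions in the audio.

def search_transcript(words, query):
    """Return the start times of every occurrence of the query phrase."""
    terms = query.lower().split()
    hits = []
    for i in range(len(words) - len(terms) + 1):
        window = [w.lower() for w, _ in words[i:i + len(terms)]]
        if window == terms:
            hits.append(words[i][1])  # seek position for audio playback
    return hits

# Toy transcript: (word, start_time_in_seconds) pairs.
transcript = [("the", 0.0), ("budget", 0.4), ("meeting", 0.9),
              ("covered", 1.5), ("the", 2.0), ("budget", 2.3)]

print(search_transcript(transcript, "budget"))          # two hits
print(search_transcript(transcript, "budget meeting"))  # one hit
```

With imperfect ASR, a misrecognized word breaks exact matching in precisely this way, which is why the paper pairs transcripts with techniques for browsing the audio itself.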


Published in

CHI '06: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
April 2006, 1353 pages
ISBN: 1595933727
DOI: 10.1145/1124772

Copyright © 2006 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 6,199 of 26,314 submissions (24%)
