Brain Research

Volume 1111, Issue 1, 21 September 2006, Pages 134-142

Research Report
Audiovisual synchrony perception for music, speech, and object actions

https://doi.org/10.1016/j.brainres.2006.05.078

Abstract

We investigated the perception of synchrony for complex audiovisual events. In Experiment 1, a series of music (guitar and piano), speech (sentences), and object action video clips were presented at a range of stimulus onset asynchronies (SOAs) using the method of constant stimuli. Participants made unspeeded temporal order judgments (TOJs) regarding which stream (auditory or visual) appeared to have been presented first. Temporal discrimination accuracy was significantly better for the object actions than for the speech video clips, and both were significantly better than for the music video clips. In order to investigate whether or not these differences in TOJ performance were driven by differences in stimulus familiarity, we conducted a second experiment using brief speech (syllables), music (guitar), and object action video clips of fixed duration, together with temporally reversed (i.e., less familiar) versions of the same stimuli. The results showed no main effect of stimulus type on temporal discrimination accuracy. Interestingly, however, reversing the video clips resulted in a significant decrement in temporal discrimination accuracy, as compared to normal presentation, for the music and object action clips, but not for the speech stimuli. Overall, our results suggest that cross-modal temporal discrimination performance is better for audiovisual stimuli of lower complexity as compared to stimuli having continuously varying properties (e.g., syllables versus words and/or sentences).

Introduction

The study of the perception of synchrony dates back to the very beginnings of the study of Experimental Psychology (e.g., see Mollon and Perkins (1996)). However, researchers have yet to uncover the fundamental psychological or neurophysiological processes underlying the perception of synchrony for speech or other complex non-speech stimuli, such as music or object actions (Dixon and Spitz, 1980). The recent renewal of interest in the multisensory perception of synchrony (e.g., King, 2005, Spence and Squire, 2003) has tended to focus instead on the perception of very simple stimuli (such as brief noise bursts, light flashes, and punctate tactile stimuli; i.e., Hirsh and Sherrick, 1961, Spence et al., 2001, Zampini et al., 2003). While such studies have successfully identified a number of the key factors modulating the perception of synchrony for simple stimuli (see Spence et al., 2001, Spence and Squire, 2003 for reviews), it would now seem appropriate to start investigating the extent to which these factors also influence the perception of more complex (and ecologically valid) multisensory stimuli (De Gelder and Bertelson, 2003).

To date, such research has primarily been focused on the perception of synchrony for audiovisual speech stimuli (e.g., Conrey and Pisoni, in press, Dixon and Spitz, 1980, Grant and Greenberg, 2001, Steinmetz, 1996). However, speech represents a highly overlearned stimulus for most people and it has even been argued by some researchers that it may represent a ‘special’ class of sensory event (e.g., Bernstein et al., 2004, Massaro, 2004, Munhall and Vatikiotis-Bateson, 2004, Tuomainen et al., 2005). One might therefore wonder whether other kinds of complex naturalistic audiovisual stimuli should also be used to investigate audiovisual synchrony perception such as, for example, object actions and the playing of music (that have typically been neglected by researchers in this field).

At present, only a few psychophysical studies have attempted to investigate and compare the temporal perception of speech with that of other kinds of complex non-speech stimuli (e.g., Dixon and Spitz, 1980, Hollier and Rimell, 1998). For example, Dixon and Spitz (1980) reported one of the first studies to compare the perception of synchrony in continuous audiovisual speech with that for complex audiovisual object actions. The participants in Dixon and Spitz's study monitored continuous videos consisting of either an audiovisual speech stream or an object action event (a hammer repeatedly hitting a peg) that started off in synchrony and was then gradually desynchronized at a constant rate of 51 ms/s, up to a maximum asynchrony of 500 ms. The participants were instructed to respond as soon as they noticed the asynchrony. When participants monitored the continuous speech stream, the auditory stream had to lag by 258 ms, or lead by 131 ms, before the discrepancy was detected. By contrast, an auditory lag of only 188 ms or a lead of 75 ms was sufficient for participants to report the asynchrony in the object action video. That is, participants in Dixon and Spitz's study were significantly more sensitive to asynchrony in the object action video than in the continuous speech video stream.

It is important to note, however, that for a number of reasons, the results of Dixon and Spitz's (1980) seminal study may not provide an accurate estimate of people's sensitivity to asynchrony for complex stimuli. First, the auditory stimuli were presented over headphones while the visual stimuli were presented from directly in front of the participants. It has recently been demonstrated that the use of such a spatially disparate set-up can introduce confounds, since the spatial separation of stimulus sources often impairs multisensory integration (e.g., Spence and Driver, 1997, Spence and Squire, 2003, Zampini et al., 2003, Soto-Faraco et al., 2002; although see also Recanzone, 2003, Fujisaki and Nishida, 2005, Noesselt et al., 2005, Vroomen and Keetels, in press, Teder-Salejarvi et al., 2005). Second, the gradual desynchronization of the auditory and visual streams in Dixon and Spitz's study might inadvertently have presented participants with subtle auditory pitch-shifting cues that could also have facilitated their performance (Reeves and Voelker, 1993). Third, the magnitude of the asynchrony was always proportional to the length of time for which the video had been presented, thus potentially confounding asynchrony with the amount of time for which participants had been watching the video (and hence possibly introducing temporal adaptation effects; see Navarra et al., 2005). Fourth, the fact that no catch trials were presented in Dixon and Spitz's study means that the influence of criterion shifting on performance cannot be assessed (cf. Spence and Driver, 1997). These factors, should they prove important, mean that people's ability to discriminate the temporal order of continuous speech stimuli and object actions may actually be far worse than Dixon and Spitz's results suggest.

More recently, Grant et al. (2004) reported that participants only noticed the asynchrony in a continuous stream of audiovisual speech when the speech sounds led the visual lip movements by at least 50 ms or else lagged by 220 ms or more (note that headphones were again used in this study to present the auditory stimuli; i.e., the auditory and visual stimuli were presented from different spatial positions). Meanwhile, Hollier and Rimell (1998) reported that participants were more sensitive to asynchrony in the case of audition leading vision when monitoring speech as opposed to non-speech stimuli (i.e., an axe hitting a piece of wood). However, given that no detailed results or stimulus parameter information (such as the duration of the clips) were provided, Hollier and Rimell's results cannot be compared to those from the other two studies described above.

The limitations of previous studies (regarding people's sensitivity to asynchrony for speech stimuli), and the fact that no studies have as yet been conducted to investigate the perception of synchrony for musical stimuli (though see Summerfield, 1992 for anecdotal findings), motivated the present study. We built on the previous research on speech and object actions by assessing people's sensitivity to asynchrony for the perception of speech (complete sentences) and object action videos. We circumvented the limitations identified in previous research by presenting the auditory and visual stimuli from the same spatial location (to avoid the potential spatial confound), by using a variety of fixed SOAs (to avoid introducing any pitch-shifting cues while the participant was viewing the video clip), and by assessing performance using a TOJ task (to distinguish genuine perceptual effects from response biases; e.g., when testing the ‘unity assumption’, any bias to assume that the visual and auditory signals were, or were not, matched should not influence the participant preferentially toward making either a ‘visual first’ or ‘auditory first’ response; see Bertelson and de Gelder, 2004, Vatakis and Spence, submitted for publication). Specifically, in our first experiment, we examined the extent to which the nature of the auditory and visual stimuli being monitored affects the perception of synchrony, using audiovisual video clips of speech, music, and object actions. We also investigated how the effects observed for speech relate to those observed for these other kinds of complex non-speech events. Utilizing a series of naturalistic audiovisual events in a short video clip format, we assessed the limits on the perception of synchrony using a TOJ task. A wide range of auditory and visual SOA leads/lags were randomly presented using the method of constant stimuli.
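TOJ data collected with the method of constant stimuli are conventionally analyzed by fitting a cumulative Gaussian to the proportion of "vision first" responses as a function of SOA: the fitted mean gives the point of subjective simultaneity (PSS), and the slope gives the just noticeable difference (JND), i.e., the SOA change needed to move from 50% to 75% "vision first" responses. The sketch below illustrates this analysis with a simple grid-search fit; the SOA values and response proportions are hypothetical, not data from the present study.

```python
import numpy as np
from math import erf, sqrt

# Hypothetical TOJ data: SOAs in ms (negative = auditory stream leads),
# and the proportion of "vision first" responses at each SOA.
soas = np.array([-300, -200, -100, -50, 0, 50, 100, 200, 300])
p_vis_first = np.array([0.05, 0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 0.95, 0.98])

def cum_gauss(x, mu, sigma):
    """Cumulative Gaussian psychometric function with mean mu and SD sigma."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Least-squares grid search over candidate PSS (mu) and slope (sigma) values.
best = None
for mu in np.arange(-100, 101, 1.0):
    for sigma in np.arange(10, 301, 1.0):
        pred = np.array([cum_gauss(x, mu, sigma) for x in soas])
        sse = float(np.sum((pred - p_vis_first) ** 2))
        if best is None or sse < best[0]:
            best = (sse, mu, sigma)

sse, pss, sigma = best
jnd = 0.6745 * sigma  # z(0.75) * sigma: SOA shift from 50% to 75% accuracy
print(f"PSS = {pss:.0f} ms, JND = {jnd:.0f} ms")
```

In practice, a maximum-likelihood fit (e.g., probit analysis; Finney, 1964) would be preferred over this simple sum-of-squares grid search, but the PSS and JND estimates it yields are interpreted in the same way.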

Participants

Twenty-eight participants (14 male and 14 female) aged between 21 and 34 years (mean age of 24 years) were given a 5 pound (U.K. Sterling) gift voucher, or course credit, in return for taking part in this study. All of the participants were naïve as to the purpose of the study, and all reported normal hearing and normal or corrected-to-normal visual acuity. The experiment took approximately 50 min to complete.

Apparatus and materials

The experiment was conducted in a completely dark sound-attenuated booth. During the experiment,

Materials and methods

Eighteen new participants (8 male and 10 female) aged between 18 and 29 years (mean age of 23 years) took part in this experiment. The experiment took approximately 30 min to complete. The apparatus, stimuli, design, and procedure were exactly the same as for Experiment 1, with the sole exception that the audiovisual speech stimuli now consisted of (a) a close-up view of the face of a British male uttering the CV syllable /ba/; (b) a close-up view of a person's fingers on a classical guitar playing

General discussion

Overall, the two experiments reported in the present study provide the first empirical evidence concerning how familiarity with a stimulus (in this case normal versus reversed presentation of short video clips) can affect people's sensitivity to audiovisual asynchrony. Additionally, our findings also provide more convincing evidence regarding the limits on the perception of synchrony for speech, musical, and object action events, by eliminating the potential limitations inherent in previous

Acknowledgments

A.V. was supported by a Newton Abraham Studentship from the Medical Sciences Division, University of Oxford.

References

  • A. Vatakis et al., Audiovisual synchrony perception for speech and music using a temporal order judgment task, Neurosci. Lett. (2006)
  • M. Zampini et al., Multisensory temporal order judgements: the role of hemispheric redundancy, Int. J. Psychophysiol. (2003)
  • R.J. Zatorre et al., Structure and function of auditory cortex: music and speech, Trends Cogn. Sci. (2002)
  • K.C. Armel et al., Projecting sensations to external objects: evidence from skin conductance response, Proc. R. Soc., B (2003)
  • L.E. Bernstein et al., Audiovisual speech binding: convergence or association?
  • P. Bertelson et al., The psychology of multimodal perception
  • K.O. Bushara et al., Neural correlates of auditory–visual stimulus onset asynchrony detection, J. Neurosci. (2001)
  • G.A. Calvert et al., Activation of auditory cortex during silent lipreading, Science (1997)
  • B. Conrey et al., Auditory-visual speech perception and synchrony detection for speech and nonspeech... (in press)
  • S. Coren et al., Sensation and Perception (2004)
  • N.F. Dixon et al., The detection of auditory visual desynchrony, Perception (1980)
  • D.J. Finney, Probit Analysis: Statistical Treatment of the Sigmoid Response Curve (1964)
  • W. Fujisaki et al., Temporal frequency characteristics of synchrony-asynchrony discrimination of audio-visual signals, Exp. Brain Res. (2005)