The Animal Ability to Identify and Produce Phonemes
Much research has addressed phoneme identification, i.e., fine-tuned perceptual discrimination of vowel- and consonant-like sounds in animals. In this regard, a study reported that one chimpanzee, who had long been exposed to spoken English before being tested, was able to recognize spoken words, even when spectrally degraded (Heimbauer et al. 2011). Further work shows that animals can learn categorical discrimination of distinct phonemes along an acoustic continuum. For instance, macaques (Macaca mulatta) (Kuhl and Padden 1983) and budgerigars (Melopsittacus undulatus) (Dooling and Brown 1990) can learn to discriminate between voiced and voiceless consonants in the pairs /ba/-/pa/, /da/-/ta/, /ga/-/ka/. Similarly, chinchillas (Chinchilla laniger), a mammalian species with auditory abilities similar to those of humans, can be trained to discriminate a voiced plosive consonant, /d/, from a voiceless one, /t/, in the initial position of a syllable (Kuhl and Miller 1975).
In addition to research on animals’ ability to learn to identify fine-grained differences between phonemes, extensive research has addressed their ability to produce phoneme-like sounds. A fundamental theory in the study of animal vocal production is the so-called source-filter theory, which identifies two main factors affecting the vocal output: the “source” and the “filter” (Fitch 2000; Titze 1994). The source of vocal sound production is the larynx in mammals, amphibians, and reptiles, and the syrinx in birds. Specifically, vocal sound is generated in the source by tissue vibrations driven by the passage of air through the vocal folds. The rate of the vocal folds’ opening-closing cycles determines the fundamental frequency of the vocal sound (F0), which corresponds to the perceived pitch of the voice. Subsequently, the sound reaches the supralaryngeal vocal tract, i.e., the filter, where certain frequencies are enhanced while others are attenuated by the articulation of various parts of the filter, e.g., the lips or the tongue. This results in concentrations of acoustic energy in particular frequency bands (called “formants”), which are perceivable in vowels and consonants (Fant 1960). For instance, if you produce a sequence of different vowels that are equal in duration, F0, and amplitude, the perceived acoustic variation results from differences in formant frequencies.
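The logic of the source-filter account can be made concrete with a small numerical sketch. The Python example below is an illustration only, not a model taken from the studies cited above. It assumes a source whose harmonics fall off as 1/k, a filter built from simple resonance peaks with a hypothetical bandwidth of 100 Hz, and approximate textbook-style values for the first two formants of three vowels. All names are hypothetical.

```python
# Approximate first and second formant frequencies in Hz (assumed,
# illustrative values; they are not reported in the text above).
FORMANTS = {"i": (270.0, 2290.0), "a": (730.0, 1090.0), "u": (300.0, 870.0)}


def source_harmonics(f0, max_freq=4000.0):
    """Source: harmonics at integer multiples of F0, amplitude 1/k (assumed)."""
    n = int(max_freq // f0)
    return [(k * f0, 1.0 / k) for k in range(1, n + 1)]


def filter_gain(freq, formants, bandwidth=100.0):
    """Filter: sum of simple resonance peaks centred on each formant."""
    half = bandwidth / 2.0
    return sum(half ** 2 / ((freq - fc) ** 2 + half ** 2) for fc in formants)


def vowel_spectrum(f0, formants):
    """Output: each source harmonic scaled by the filter gain at its frequency."""
    return [(f, amp * filter_gain(f, formants)) for f, amp in source_harmonics(f0)]


def peak_frequency(spectrum):
    """Frequency of the harmonic that carries the most energy."""
    return max(spectrum, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    for vowel, formants in FORMANTS.items():
        spectrum = vowel_spectrum(100.0, formants)
        # The lowest harmonic (F0) is the same for every vowel;
        # only the distribution of energy across harmonics changes.
        print(vowel, spectrum[0][0], peak_frequency(spectrum))
```

With F0 held constant, the harmonic frequencies are identical across vowels, and only the harmonic that carries the most energy changes with the formant values. This mirrors the point made above: vowels matched in F0 still differ in how the filter shapes the spectrum.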
Following Lieberman et al. (1969), until the last two decades it was commonly assumed that nonhuman mammals (including primates) are not able to articulate the sounds of human speech due to an anatomical limitation in the filter, namely a larynx positioned high in the vocal tract. This has been argued to limit the range of articulatory movements in the vocal tract, and hence the formants that could be produced. However, a growing body of converging data from empirical studies and computer models of animal vocal production has been undermining Lieberman’s hypothesis. For instance, research shows that, when resting, the larynx of red and fallow deer (Cervus elaphus and Dama dama, respectively) is in a position comparable to that of humans, and retracts even lower during vocalization (Fitch and Reby 2001). Furthermore, Boë et al. (2017) reported that vocalizations of baboons (Papio papio) have the formant structure of human [ɨ æ ɑ ɔ u] vowels. This finding suggests that, unless the ability to produce these vowels emerged independently in humans and baboons, the ability to articulate vowel-like sounds may be traced to the last common ancestor of humans and Cercopithecoidea, about 25 million years ago (Stevens et al. 2013). Consistent with this work, a study adopting a computer model based on vocal tract configurations of living rhesus macaques (Macaca mulatta) confirmed that the primate vocal apparatus is potentially capable of producing human-like vowel sounds, as well as a variety of consonants that are widely shared across languages (e.g., /h/, /m/, /w/, and the stop consonants /p/, /b/, /k/, and /g/) (Fitch et al. 2016). This study implies that the human ability for speech required the evolution of specific neural connections between the forebrain and the laryngeal muscles, rather than anatomical changes in the vocal apparatus. Importantly, this research is consistent with previous studies hypothesizing that, in humans, direct neural connections between the laryngeal motor cortex (LMC) and the brainstem laryngeal motoneurons (which are, in turn, connected to the laryngeal muscles), as well as the location of the LMC in the primary motor cortex (as opposed to its location in the premotor cortex in nonhuman primates), might have been key evolutionary steps enabling the control of the complex laryngeal movements involved in producing learned vocal utterances (Jürgens 2002, 2009; Simonyan 2014; Simonyan and Horwitz 2011). In monkeys and, presumably, other nonhuman primates, the LMC is linked only indirectly, namely through the reticular formation, to the laryngeal motoneurons in the brainstem (Simonyan 2014). Critically, their innate vocal production seems to be enabled by a specific voice control system in the brain, involving only the sensorimotor phonatory nuclei of the brainstem and spinal cord (Simonyan and Horwitz 2011). This might explain why the destruction of the LMC region in monkeys does not affect their innate vocal production (Simonyan 2014), which can take place without voluntary coordination and control of the laryngeal muscles. Comparative studies on primate vocal production are a clear example of how research can help shed light on which traits were already present (the anatomy of the vocal tract) and which were still missing (e.g., direct neural connections from motor cortical regions onto laryngeal motoneurons) before full-blown speech evolved.
Strikingly, this picture could be placed on a broader evolutionary scale to gain a wider perspective on the selective pressures enabling the emergence of the neural connections necessary for articulating human speech sounds. Indeed, although much research reports on animals’ ability to produce novel sounds and sound combinations by imitation (see section below), only a few species of mammals and birds seem to be able to learn to modulate their vocal tract to imitate words and sentences in existing human languages, e.g., Asian elephants, Elephas maximus (Stoeger et al. 2012); captive harbor seals, Phoca vitulina (Ralls et al. 1985); gray seals, Halichoerus grypus (Stansbury and Janik 2019); and birds (Grey parrot, Psittacus erithacus, Pepperberg 2010; mynah bird, Acridotheres tristis, Stefanski and Klatt 1974). Social bonding can be identified as a potential selective factor boosting the ability to learn to produce novel sounds that are not included in a given species’ vocal repertoire (Stoeger et al. 2012). Hence, this research provides crucial insights into the key role of social pressures in language evolution and is consistent with work suggesting that social bonding (which is likely closely connected to the use of vocal emotional intonation in inter-individual communication) might have promoted the evolution of the neural connections enabling the production of human spoken language (Dunbar 2003).
Generally, although it is plausible that species that are able to produce speech sounds can equally discriminate them at a perceptual level (cf. Pulvermüller 2005), more research is needed on this topic within a cross-species perspective. This will favor a broader understanding of the evolutionary pressures behind neural and anatomical predispositions for identifying and producing phonemes.
The Animal Ability to Process Compositional Rules in Vocal Utterances
A number of cross-species studies revolve around the ability to process vocal sequences according to compositional rules. This line of research aims at understanding the evolutionary precursors and selective pressures that led from the ability to parse simple forms of compositionality, which has been demonstrated in multiple species, to the human ability to parse the fully-fledged syntactic systems of languages (Russell and Townsend 2017). Although much research on this topic is still ongoing, our understanding of the evolution of the human ability for syntax has significantly advanced in the last two decades (Collier et al. 2014; Engesser and Townsend 2019; Townsend et al. 2018; Zuberbühler 2020). For instance, it has been shown that Campbell’s monkeys (Cercopithecus campbelli) add an acoustic modifier (i.e., a sort of affix) to predator-specific alarm calls (Ouattara et al. 2009). In this way, the meaning of the alarm call is no longer perceived as linked to a specific predator, but to the presence of a general disturbance. Intriguingly, the ability to process compositional structures was also found in birds, providing insights for comparative studies on its evolutionary origins. For example, Engesser et al. (2016) showed that southern pied babblers (Turdoides bicolor) respond to combinations of alert and recruitment calls with mobbing-like behavior, while no obvious reaction is elicited by control combinations of foraging and recruitment calls. In a similar study, Suzuki et al. (2016) report that, in Japanese tits (Parus minor), the combination of two calls (namely, a call typically eliciting scanning behavior in the listeners, and a call typically eliciting approach to the caller) results in the combination of these two behaviors, i.e., scanning and approach. In a control experiment, reversing the order of these two calls did not elicit any behavior, suggesting that these birds process the call combination according to a specific order. Therefore, these studies suggest that southern pied babblers and Japanese tits are sensitive to the compositional properties of call sequences, and that structural changes impair signal perception. These systems can be fruitfully compared with compositional structures in language, where changing the words within a sequence or their order (e.g., changing “gimme a break” into “apple a break” or “break a gimme”) can turn a well-formed and meaningful spoken utterance into an ill-formed and meaningless sequence of words.
Further cross-species studies on the ability to discriminate syntactical structures have typically adopted artificial grammars that are created following specific formal rules. For instance, Spierings and ten Cate (2016) found that zebra finches (Taeniopygia guttata) are able to discriminate units of their own vocal repertoire arranged in an XYX or XXY structure, and that budgerigars (Melopsittacus undulatus) can discriminate these structures and generalize the underlying grammatical rule to novel elements that were not used during the preceding rule-learning phase.
A fundamental strand of comparative research on animals’ ability to process compositional structures has attempted to identify the cognitive abilities that enable humans (but not other animal species) to process more complex compositional structures in language. This research builds on the assumption that the human-specific ability to express an open-ended number of thoughts using a finite set of linguistic units relies on recursion (Everaert et al. 2015), i.e., the operation of embedding constituents within constituents of the same kind (Pinker and Jackendoff 2005; cf. Martins 2012). Building on this assumption, Hauser et al. (2002) proposed that the ability to use recursion might be the key computational ability that differentiates the syntactical competence of humans from combinatorial abilities found in animals (cf. Bolhuis et al. 2018). Within this conceptual framework, much research has relied on the so-called “Chomsky hierarchy” (Chomsky 1956, 1959) as a way to guide empirical work. This hierarchy provides a theoretical structure to identify and classify different levels of computational power, each corresponding to a specific class of “grammar”. Each grammar includes a finite number of symbols, rules, and operators to apply to these symbols. One of the aims of this classification is to identify the level of computational power that enables an automaton to process natural languages on a purely mathematical and abstract level, i.e., excluding aspects such as lexical semantics, interactional dynamics, or context. This highly formal character of grammars favors well-controlled cross-species investigations of computational abilities that are foundational to language (O’Donnell et al. 2005). Hence, this research framework enables the investigation of animals’ computational capacities along a complexity axis, which includes the computational capacity underpinning natural language processing.
As Fitch and Friederici (2012) explain in their exhaustive and, at the same time, intuitive overview of the formal language theory at the base of Chomsky’s hierarchy, a crucial distinction within this hierarchy is between “regular” and “supra-regular” grammars. This distinction is important because it provides a line of demarcation between the computational abilities that are necessary to process very simple structures and those that are necessary to process hierarchical syntactic structures in natural languages. Regular grammars can be computed by the simplest class of automata (called “finite state automata”), using basic computational rules, namely, transition probabilities between a finite number of “states” (e.g., phonemes, syllables, or words). Examples of strings that can be processed by regular grammars are “(AB)ⁿ”, where the automaton has to accept n repetitions of the “AB” bigram, or “AB*A”, where any number of B units can occur between the A units at the edges. These basic rules are not enough to process the structural complexity of natural languages (Jäger and Rogers 2012). However, they might suffice to process phonological sequences, an ability that humans might share with other animals (Fitch 2018a). In contrast, “supra-regular” grammars, which include multiple subsets of grammars, rely on more complex rules and more computational power than that of a finite state automaton. An example of a supra-regular grammar is a context-free grammar, which can be computed by a “pushdown automaton”. For instance, the AⁿBⁿ sequence, where a number of B elements follows the same number of A elements, can be processed by this type of automaton, but not by a finite state one (Fitch and Friederici 2012; O’Donnell et al. 2005), which is not able to count and compare (Jäger and Rogers 2012). Crucially, supra-regular grammars vary in the amount of computational power required to process dependencies between the constitutive elements of an expression. Importantly, this set includes grammars that can process dependencies within recursive structures, such as A₁A₂A₃B₃B₂B₁, where the same pattern AB is nested within itself, following a center-embedded structure (Chomsky 1956, 1959). As Jäger and Rogers (2012) explain, an example of nested dependencies is given by the English construction “neither-nor”, repeated multiple times within the same sentence, as in “Neither did Mary think she would neither go to the cinema nor eat pizza, nor did I”.
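The contrast between the two classes of automata can be sketched in a few lines of Python. The example is a simplified illustration under stated assumptions: strings are written with the letters A and B, at least one repetition is required (n ≥ 1), and the function names are hypothetical. The first recognizer only tracks which of two states it is in, as a finite state automaton does. The second uses a stack to check that the number of B elements matches the number of A elements, as a pushdown automaton does.

```python
def accepts_ab_n(sequence):
    """Finite-state recognizer for (AB)^n with n of at least 1.

    It only remembers which symbol it expects next, so no counting is needed.
    """
    state = "expect_A"
    for symbol in sequence:
        if state == "expect_A" and symbol == "A":
            state = "expect_B"
        elif state == "expect_B" and symbol == "B":
            state = "expect_A"
        else:
            return False
    # Accept only non-empty strings that end right after a complete AB pair.
    return bool(sequence) and state == "expect_A"


def accepts_a_n_b_n(sequence):
    """Pushdown-style recognizer for A^n B^n with n of at least 1.

    Every A is pushed onto a stack and every B pops one A, so the string is
    accepted only if the two counts match and no A follows a B.
    """
    stack = []
    seen_b = False
    for symbol in sequence:
        if symbol == "A" and not seen_b:
            stack.append(symbol)
        elif symbol == "B" and stack:
            seen_b = True
            stack.pop()
        else:
            return False
    return seen_b and not stack


if __name__ == "__main__":
    for s in ["ABAB", "AABB", "AAB", "ABB"]:
        print(s, accepts_ab_n(s), accepts_a_n_b_n(s))
```

The two-state recognizer cannot be adapted to AⁿBⁿ without adding some form of memory that grows with n, which is the sense in which a finite state automaton “is not able to count and compare”.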
The first study to use the distinction between regular and supra-regular grammars to compare humans and animals (specifically, cotton-top tamarins) was conducted by Fitch and Hauser (2004). In this study, the authors found that cotton-top tamarins are able to process (AB)ⁿ sequences, i.e., regular grammars, but fail to process AⁿBⁿ sequences, i.e., supra-regular grammars, while humans, as predicted, succeeded in processing both grammars. Following up on this work, a number of studies have probed how phylogenetically widespread the ability to process regular and supra-regular grammars is. To date, the majority of studies have found that multiple species of animals are able to process regular grammars, specifically (AB)ⁿ sequences (ravens, Corvus corax, Reber et al. 2016; kea, Nestor notabilis, and pigeons, Columba livia, Stobbe et al. 2012; cf. ten Cate and Okanoya 2012), and perceptual dependencies between edge stimuli in ABⁿA sequences, both in the visual domain (chimpanzees, Pan troglodytes, Sonnweber et al. 2015; cotton-top tamarins, Saguinus oedipus, Versace et al. 2019) and in the auditory domain (squirrel monkeys, Saimiri sciureus, Ravignani et al. 2013; cotton-top tamarins, Saguinus oedipus, Newport et al. 2004; common marmosets, Callithrix jacchus, Reber et al. 2019). In addition, although some research has suggested that birds are able to process supra-regular grammars (Abe and Watanabe 2011; Gentner et al. 2006), subsequent studies have shown that these birds might have used simple strategies, which do not require the computational power of supra-regular automata, to parse these structures (Ravignani et al. 2015; Van Heijningen et al. 2009). However, in a recent study, Jiang et al. (2018) provided, for the first time, compelling evidence that an animal species, specifically the macaque monkey (Macaca mulatta), is able not only to parse, but also to produce sequences according to a supra-regular grammar, namely, a “mirror” (context-free) grammar of the form ABCCBA. Here, the second part of the string is a mirror image of the first part, and the string thus has a center-embedded organization. The authors tested pre-school children on the same task, and found that, compared to monkeys, who needed a massive amount of training to learn the grammar, humans mastered the grammar with only a little training. These findings suggest that monkeys possess these computational competences, although they do not have the same inclination as humans to use them (Fitch 2018b).
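The structure of such a “mirror” sequence can be illustrated with a short Python sketch. It is not a description of the task or the analysis used by Jiang et al. (2018): the function names are hypothetical, and the checker assumes that the midpoint of the sequence is known in advance, which is a simplification. The point is only that each element of the second half must match the most recently stored element of the first half, so the last item stored is the first one retrieved.

```python
def make_mirror(prefix):
    """Build a mirror sequence: the prefix followed by its reversal."""
    items = list(prefix)
    return items + items[::-1]


def is_mirror(sequence):
    """Check a mirror sequence using a stack (last in, first out).

    Assumption: the sequence has even length and the midpoint is known.
    """
    items = list(sequence)
    if not items or len(items) % 2 != 0:
        return False
    half = len(items) // 2
    stack = items[:half]          # store the first half in order
    for symbol in items[half:]:   # then match the second half against it
        if stack.pop() != symbol:
            return False
    return True


if __name__ == "__main__":
    print(make_mirror("ABC"))
    print(is_mirror("ABCCBA"), is_mirror("ABCABC"))
```

A plain repetition such as ABCABC fails the check because its dependencies cross rather than nest, which is what distinguishes it from the center-embedded organization of ABCCBA.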
Here, it is important to stress that much debate is currently ongoing regarding the assumption that recursion is the defining computational system of language (Christiansen and Chater 2015; Evans and Levinson 2009; Parker 2007; Perruchet and Rey 2005). Nevertheless, comparative research relying on grammars defined within Chomsky’s hierarchy is effective for a systematic investigation of the ability of animals to process different levels of structural complexity in the vocal domain. This, in turn, may provide key insights into the evolution of the human ability to parse compositional patterns.
But what is the evolutionary advantage of animals’ ability to produce and process compositional structures? Crucially, in animal communication systems, higher levels of structural complexity in compositional structures allow for the transmission of more complex information than vocalizations with simpler structures do (Nowicki and Searcy 2014). In this regard, research indicates that higher levels of vocal complexity typically co-occur with the predisposition to learn to articulate a signal by imitating (and modifying) someone else’s signal (Nottebohm 2002). Hence, the tendency to learn vocally might have been a key factor in the evolution of the human ability to identify and produce syntactical structures in language. Extensive research has addressed the phylogenetic path of the ability for vocal learning, and the selective pressures underpinning its evolution (cf. Martins and Boeckx 2020). In particular, animal research on this topic has mainly focused on three groups of birds (parrots, hummingbirds, and songbirds) (Beecher and Brenowitz 2005; Jarvis 2006). Recently, this line of research has been complemented by studies on phylogenetically distant mammalian species, including terrestrial and marine mammals (e.g., African elephants, Loxodonta africana, Poole et al. 2005; Egyptian fruit bat, Rousettus aegyptiacus, Prat et al. 2015; humpback whale, Megaptera novaeangliae, Cerchio et al. 2001; California sea lion, Zalophus californianus, Reichmuth and Casey 2014).
The Animal Ability to Associate Vocal Utterances with Meanings
Much research aimed at pinpointing the evolutionary precursors of the human ability for word-meaning association in animal communication systems has focused on animals’ ability to understand the link between vocal utterances and their meaning, i.e., the information they express or refer to (Dawkins and Krebs 1978; Macedonia and Evans 1993; Marler et al. 1992; Wiley 1983). For instance, studies indicate a strong link between acoustic features of the signal and information related to the body size and the emotional state of the signaler (Owren and Rendall 2001). Body size has been demonstrated to be reliably cued by the formant structure of mammalian vocalizations. Specifically, individuals with bigger bodies have lower formant frequencies than smaller individuals (domestic piglets, Sus scrofa domesticus, Garcia et al. 2016; koala, Phascolarctos cinereus, Charlton et al. 2011; rhesus macaques, Macaca mulatta, Fitch 1997; humans, Pisanski et al. 2014; for cross-species studies, see: Bowling et al. 2017; Charlton and Reby 2016; Taylor and Reby 2010). In accordance with these studies, research on the perception of vocal indicators of body size suggests that formants are also the most reliable acoustic parameters for the perception of size-related variation in animals, both within species (e.g., whooping cranes, Grus americana, Fitch and Kelley 2000; red deer, Cervus elaphus, Charlton et al. 2007a, b; dog, Canis lupus familiaris, Faragó et al. 2010) and between species (Taylor et al. 2008). Similar mechanisms seem to be at play in the perception of body size and related information through acoustic features of the voice in humans. Indeed, research shows that formants are linked to the perception of size (Ohala 1984; Pisanski et al. 2014; Rendall et al. 2009) and dominance (Puts et al. 2006) in humans, and suggests that back vowels (e.g., /o/, /a/) are associated with big objects and front vowels (e.g., /i/, /e/) are associated with small objects (see Lockwood and Dingemanse 2015a for a review). In addition, Auracher (2017) reports that human participants associate back vowels with larger sizes, aggression, strength, and social dominance, and front vowels with small sizes, weakness, fearfulness, and social subordination. Interestingly, the author found that, in this association process, the semantic content of the pictures (e.g., elephant vs. rabbit) overrides the actual size of the depicted objects, as when, for instance, the image of the elephant is smaller than the image of the rabbit.
In addition, Bowling et al. (2017) showed that body size correlates inversely with F0 in a wide variety of mammalian species. This study is consistent with Morton’s (1977) “motivational-structural rules” hypothesis, which states that in mammals and birds, harsh, low-frequency vocalizations are used in competitive contexts to signal physical dominance, whereas more tonal, high-frequency vocalizations are used in fearful or appeasing contexts to signal submission. Recent research has extended these findings, suggesting that larynx size (in particular, vocal fold length), which might not be proportional to body size, predicts F0 better than body size (Garcia et al. 2017).
Critically, research has found evidence for the ability to process simple spoken sound-meaning associations in animals. Dogs (Kaminski et al. 2004), parrots (Pepperberg 2006), and chimpanzees (Savage-Rumbaugh et al. 1993) have all been shown to be able to infer which specific object a word refers to. Finally, comparative research on animal communication has described animal calls as “word-like” vocal units, in that these calls are associated with specific objects or events, akin to the referential nature of human words. For instance, in a very influential study, Seyfarth et al. (1980) suggested that vervet monkeys (Chlorocebus pygerythrus) have three distinct alarm calls, associated with ‘snake’, ‘eagle’, and ‘leopard’, respectively. These calls elicit appropriate behaviors in the listeners, such as looking up upon hearing the call emitted by the signaler in response to the presence of an eagle. More recently, research has revisited these original findings and adopted state-of-the-art techniques for acoustic data analyses (Fischer and Price 2017; Price et al. 2015). These studies highlight that animal calls do not “carry” information on the basis of an arbitrary association between sounds and meanings, as in the case of human words. On the contrary, in nonhuman primates, vocalizations are genetically determined and are triggered by emotional and cognitive states of the signalers, which are reflected in specific acoustic features of the signal. The perception of these acoustic features, combined with contextual cues, allows listeners to associate the signal with its eliciting stimuli, and subsequently select the appropriate responses (Wheeler and Fischer 2012).
Within the comparative approach proposed here, studies on emotional expression through voice intonation are particularly relevant to the study of the evolution of the ability to associate arbitrary vocal utterances with their meaning. Indeed, as I will describe in the next sections, emotional expressions are found across a wide variety of vocalizing animal species (Darwin 1872), and, within humans, across cultures (Barrett and Bryant 2008; Sauter et al. 2015; Scherer et al. 2001). This makes emotional expressions a good candidate for enhancing our understanding of the dynamics underpinning the evolution of the human ability for speech processing and word-meaning associations.
Vocal Emotional Expression: A Cross-Species Comparative Approach
The study of emotional expression through voice intonation in animals may provide crucial insights for reconstructing the dynamics underpinning language evolution (Darwin 1871; Filippi 2016; Filippi and Gingras 2018; Filippi et al. 2019). Across animal species, emotions serve adaptive functions, favoring actions that promote survival, such as a fight-or-flight response to an attacking predator in the surroundings (Nesse 1990). In addition, emotional stimuli engage selective attention (Kret et al. 2016) and favor associative learning in animals (McGaugh 2004; Seymour and Dolan 2008).
Importantly, changes in emotional states may create tension in the muscles involved in phonation, such as those involved in respiration (diaphragm and intercostal muscles) and, notably, those of the vocal folds (Ladefoged 1996; Titze 1994). These changes affect vocal sound production, generating audible differences between vocalizations emitted in intense emotional states and those emitted in less intense ones. In line with Darwin (1871, 1872), I assume that the mechanisms of production of emotional vocalizations might be evolutionarily conserved across species (Filippi et al. 2017a), and were, presumably, in place at the time the first hominids diverged from the last common ancestor of humans and chimpanzees.
Multiple studies have addressed emotional communication in animals, focusing on discrete emotions, such as fear or rage (Camperio Ciani 2000; Forkman et al. 2007). However, as Mendl et al. (2010) observe, this approach may narrow down the range of emotions that can be assessed in animals within a comparative approach. In fact, a research framework that is better suited for comparative analyses is offered by the dimensional approach (Russell 1980), in which emotions are described according to two dimensions: arousal (low/calm or high/excited) and valence (positive or negative). Crucially, the investigation of arousal, which relies on quantitative measures of physiological correlates of emotional activation in signalers, serves cross-species comparison very well (Briefer 2012). In addition, this quantitative approach allows researchers to identify vocal indicators of arousal levels in the vocalizing animals. In an extensive review, Briefer (2012) reports that across the vast majority of studied mammalian species (including humans, see Banse and Scherer 1996; Johnstone and Scherer 2000), heightened levels of arousal are expressed through a shift in energy distribution towards higher frequencies, higher values of frequency-related parameters, amplitude contour, and vocalization rate, and shorter inter-vocalization intervals. In addition to studies on emotional arousal expression in mammals, research reports that a songbird species, the black-capped chickadee (Poecile atricapillus), encodes in its calls the degree of threat posed by predators of different sizes, which presumably trigger emotional states of different arousal levels (Templeton et al. 2005). Specifically, the higher the threat, the higher the number of D notes at the end of the call.
The ability to identify emotional states in vocal signals, which may be produced within social interactions (Altenmüller et al. 2013; Bryant 2013), favors the survival of conspecifics in contexts such as territory defense or predation (Cross and Rogers 2006; Desrochers et al. 2002; Owings and Morton 1998). In addition, survival chances may be favored by “eavesdropping” on another species’ alarm calls, even when these are acoustically different from the listener’s own (de Boer et al. 2015; Fallow et al. 2011; Kitchen et al. 2010; Lea et al. 2008; Magrath et al. 2009). In line with these studies, recent work has found that humans and black-capped chickadees can discriminate high versus low arousal calls across a large variety of vocalizing species, spanning all classes of vocalizing vertebrates (Congdon et al. 2019; Filippi et al. 2017a, b).
Finally, the ability to identify emotional activation in the signaler (conspecific or heterospecific) may determine the survival of newborns, who can express their needs very effectively through voice intonation, thus enabling their caregivers to respond appropriately (marmoset monkey, Callithrix jacchus, Tchernichovski and Oller 2016; Zhang and Ghazanfar 2016; human, Fernald 1992). Interestingly, Lingle and Riede (2014) found that mule deer (Odocoileus hemionus) and white-tailed deer (Odocoileus virginianus) mothers are sensitive to high arousal, negatively-valenced vocalizations of infants of a variety of mammalian species (e.g., mule deer, Odocoileus hemionus; bighorn sheep, Ovis canadensis; marmots, Marmota flaviventris; bats, Lasionycteris noctivagans; Australian sea lions, Neophoca cinerea; and Subantarctic fur seals, Arctocephalus tropicalis), if the F0 values are within the frequency range produced by infants of their own species.
Taken together, these studies are consistent with Darwin’s (1871) hypothesis that emotional communication in animals is produced through mechanisms underpinning voice production that are conserved across phylogenetically distant species. This hypothesis is in line with a growing body of studies attesting to the human ability to identify vocal emotions across widely different cultures (Barrett and Bryant 2008; Sauter et al. 2015; Scherer et al. 2001). Emotional communication might, therefore, be biologically ancient and immune to the influence of cultural dynamics.
In light of the evidence reviewed in this section, it is worth addressing how emotional intonation, as a communication code used across a wide variety of animal species, affects language processing in humans. This line of investigation will provide insights into the dynamics underlying the emergence of language from nonhuman animal communication systems.