Infants initially have little understanding of what is being said around them, and yet at approximately 9 months old are able to produce their first words. When they start producing their first multi-word utterances around 18 months, they can already produce about 45 words and comprehend many more [1, 2]. One of the challenges infants face is that speech does not contain neat breaks between words, which would allow them to segment the utterance into words. To complicate things further, words might be embedded in longer words (e.g. ham in hamster) and, furthermore, no two realisations of the same spoken word are ever the same due to speaker differences, accents, co-articulation, speaking rate, etc. [3]. In this study, we investigate whether a computational model of speech recognition inspired by infant learning processes can learn to recognise words without prior linguistic knowledge.
Cognitive science has long tried to explain our capacity for speech comprehension through computational models (see [4] for an overview). Models such as Trace [5], Cohort [6], Shortlist [7], Shortlist B [8] and FineTracker [9] attempt to explain how variable and continuous acoustic signals are mapped onto a discrete and limited-size mental lexicon. These models all assume that the speech signal is first mapped to a set of pre-lexical units (e.g. phones, articulatory features) and then to a set of lexical units (words). The exact set of units is predetermined by the model developer, avoiding the issue of learning what these units are in the first place. Even the recently introduced DIANA model [10], which does away with fixed pre-lexical units, uses a set of predetermined lexical units.
While all these models have proven successful at explaining behavioural data from listening experiments, they all require prior lexical knowledge in the form of a fully specified set of (pre-)lexical units. In contrast, infants learn words without prior lexical knowledge (or, arguably, any other linguistic knowledge) as well as without explicit supervision. A viable computational model should simulate word learning in a similar manner.
We take inspiration from the way infants learn language in order to model human word learning and recognition in a more cognitively plausible and ‘human-like’ manner. While learning language, children are exposed to a wide range of sensory experiences beyond purely linguistic input. In contrast, current computational models of word learning and recognition are often limited to linguistic input. Using a multi-modal model, we aim to show that it is possible to learn to recognise words without prior lexical knowledge and explicit supervision if the model is exposed to sensory experiences beyond speech. While there are many sensory experiences that could contribute to language learning, we focus on the most prominent of the human senses: vision. The model that we investigate in the current work exploits visual context in order to learn to recognise words in speech without supervision or prior lexical knowledge.
Visually Grounded Speech
Humans have access to multiple streams of sensory information besides the speech signal, perhaps most prominently the visual stream. It has been suggested that infants learn to extract words from speech by repeatedly hearing words while seeing the associated objects or actions [11], and indeed speech is often used to refer to and describe the world around us. For instance, parents might say ‘the ball is on the table’ and ‘there’s a ball on the floor’ etc., while consistently pointing towards a ball.
Visually Grounded Speech (VGS) models are speech recognition models inspired by this learning process. The basic idea behind VGS models (e.g. [12–14]) is to make use of co-occurrences between the visual and auditory streams. For instance, from the sentences ‘a dog playing with a stick’ and ‘a dog running through a field’ along with images of these scenes, a model could learn to link the auditory signal for ‘dog’ to the visual representation of a dog because they are common to both image-sentence pairs. This allows the model to discover words, that is, to learn which utterance constituents are meaningful linguistic units. While there is a wide variety of VGS models, they all share the idea of combining visual and auditory information in a common multi-modal representational space in which the similarity between matching image-sentence pairs is maximised while the similarity between mismatched pairs is minimised.
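Although this passage does not commit to a particular training objective, a common choice in the VGS literature, given here only as an illustrative sketch, is a margin-based triplet loss over the shared embedding space:

\[ \mathcal{L} = \sum_{(i,s)} \Big[ \max\big(0,\ \alpha - S(i,s) + S(i,s')\big) + \max\big(0,\ \alpha - S(i,s) + S(i',s)\big) \Big], \]

where \(i\) and \(s\) are a matching image and spoken caption, \(i'\) and \(s'\) are mismatched images and captions sampled from the same training batch, \(S(\cdot,\cdot)\) is a similarity measure (e.g. cosine similarity) between the two embeddings, and \(\alpha\) is a margin. Minimising this loss pushes matching image-sentence pairs closer together in the multi-modal space than mismatched pairs.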
The potential of visual input for modelling the learning of linguistic units has long been recognised. In 1998, Roy and Pentland introduced their model of early word learning [15]. While many models at the time (and even today) relied on phonetic transcripts or written words, they implemented a model that learns solely from co-occurrences between the visual and auditory inputs. This model builds an ‘audio-visual lexicon’ by finding clusters in the visual input and looking for recurring units in the acoustic signal. It performs many tasks that are still the focus of research today: unsupervised discovery of linguistic units, retrieval of relevant images, and generation of relevant utterances. However, the model was limited to colours and shapes (utterances such as ‘this is a blue ball’) and has not been shown to learn from more natural, less restricted input.
The tasks performed by Roy and Pentland’s model involve challenges for both computer vision and natural language processing. Advances in both fields have led to renewed interest in multi-modal learning, and with it an increased need for multi-modal datasets. In 2013, Hodosh, Young and Hockenmaier introduced Flickr8k [16], a database of images accompanied by written captions describing their contents, which was quickly followed by similar databases such as MSCOCO Captions [17]. These datasets are now widely used for image-caption retrieval models (e.g. [18–24]) and caption generation (e.g. [19, 25]).
Harwath and Glass collected spoken captions for the Flickr8k database and used these to train the first neural network-based VGS model [26]. Since then, there have been many improvements to the model architecture ([27–33]), as well as new applications of VGS models such as semantic keyword spotting ([14, 34, 35]), image generation [36], recovery of masked speech [37], and even the combination of speech and video [38].
Many studies have since investigated the properties of the representations learned by such VGS models (e.g. [13, 39–42]). Perhaps the most prominent question is whether words are encoded in these utterance embeddings even though VGS models are not explicitly trained to encode words and are only exposed to complete sentences. The VGS model presented in [31] showed that representations of a speech unit and a visual patch are often most similar when the visual patch contains the speech unit’s visual referent. In [28, 29], the authors show that the presence of individual words can reliably be detected in the resulting sentence representations.
Räsänen and Khorrami [43] developed a VGS model that was able to discover words from even more naturalistic input than image captions: recordings from head-mounted cameras worn by infants during child-parent interaction. The authors showed that their model was able to learn utterance representations in which several words (e.g. ‘doggy’, ‘ball’) could reliably be detected. Even though their model used visual labels indicating the objects the infants were paying attention to rather than the actual video input, this study is an important step towards showing that VGS models can acquire linguistic units from actual child-directed speech.
While the presence of individual words is encoded in the representations of a VGS model, the model does not explicitly yield any segmentation or discrete linguistic units. A technique which allows for the unsupervised acquisition of such discrete units is Vector Quantisation (VQ). VQ layers were recently popularised by [44], who showed that these layers can efficiently learn a discrete latent representational space. Harwath, Hsu and Glass [13] recently applied these layers in a VGS model and showed that the model learned to encode phones and words in its VQ layers.
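To illustrate the quantisation step itself, and only that step, the following minimal sketch maps each continuous frame vector onto its nearest entry in a codebook. It is not the architecture of [13] or [44]; the codebook size and dimensionality are arbitrary, and the codebook learning and straight-through gradient used during training are omitted.

```python
import numpy as np

def vector_quantise(frames, codebook):
    """Replace each continuous frame vector by its nearest codebook entry.

    frames:   (T, D) array of continuous representations, one per time step
    codebook: (K, D) array of discrete unit embeddings
    Returns the quantised frames (T, D) and the chosen unit indices (T,).
    """
    # Squared Euclidean distance between every frame and every codebook entry
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)      # the discrete unit assigned to each frame
    return codebook[indices], indices

# Toy usage: 10 frames of dimension 64, a codebook of 32 hypothetical units
rng = np.random.default_rng(0)
quantised, units = vector_quantise(rng.normal(size=(10, 64)),
                                   rng.normal(size=(32, 64)))
```

The resulting sequence of unit indices is the kind of discrete representation that can subsequently be inspected for correspondences with phones or words.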
Havard and colleagues went beyond simply detecting the presence of words in sentence representations: they presented isolated nouns to a VGS model trained on whole utterances, and showed that the model was able to retrieve images of the nouns’ visual referents [45]. This shows that their model does not merely encode the presence of these nouns in the sentence representations, but actually ‘recognises’ individual words and learns to map them onto their visual referents. In terms of the example above, the model learned to link the auditory signal for ‘dog’ to the visual representation of a dog.
However, the model by Havard and colleagues [45] was trained on synthetic speech. Word recognition in natural speech is known to be more challenging, as shown for instance by a large performance gap between VGS models trained on synthetic and real speech [28]. Dealing with the variability of speech is an important aspect of human speech recognition. If VGS models are to be plausible as computational models of speech recognition, it is important that these models implicitly learn to extract words from natural speech.
Current Study
The goal of this study is to investigate whether a VGS model discovers and recognises words from natural, as opposed to synthetic, speech. We furthermore go beyond earlier work by investigating the model’s cognitive plausibility: we test whether its word recognition performance is affected by the word competition known to take place during human speech comprehension. We aim to answer the following questions:
1. Does a VGS model trained on natural speech learn to recognise words, and does this generalise to isolated words?
2. Is the model’s word recognition process affected by word competition?
3. Does the model learn the difference between singular and plural nouns?
4. Does the introduction of VQ layers for learning discrete linguistic units aid word recognition?
Our first experiment is a continuation of our previous work [46] and the work by Havard et al. [45]. As in [45], we present isolated target words to the VGS model and measure its word recognition performance by looking at the proportion of retrieved images containing the target word’s visual referent. If the model is indeed able to recognise a word in isolation, it should be able to retrieve images depicting the word’s visual referent, indicating that the model has learned a representation of the word from the multi-modal input. Whereas previous work focused on the recognition of nouns, we also include verbs as our target words.
For this experiment, we collect new speech data, consisting of words pronounced in isolation. On the one hand, such data can be thought of as ‘cleaner’ than words extracted from sentences (as in [46]) due to the absence of co-articulation. On the other hand, the model was trained on words in their sentence context, co-articulation included, and might have learned to rely on this contextual information too heavily to also recognise words in isolation. Thus, to answer our first research question, we investigate whether our VGS model learns to recognise words independently of their context. Furthermore, we investigate whether linguistic and acoustic factors affect the model’s recognition performance similarly to human performance. For instance, we know that a faster speaking rate negatively impacts human word recognition (e.g. [47]).
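To make the recognition measure described above concrete, the sketch below shows one way such a score could be computed for a single target word. The embeddings are assumed to come from the model’s speech and image encoders, and the choice of cosine similarity and a top-10 cut-off are illustrative assumptions rather than a description of our exact set-up.

```python
import numpy as np

def recognition_at_n(word_embedding, image_embeddings, contains_referent, n=10):
    """Proportion of the top-n retrieved images that depict the word's visual referent.

    word_embedding:    (D,) embedding of the isolated spoken word
    image_embeddings:  (M, D) embeddings of the candidate images
    contains_referent: (M,) boolean array, True if an image depicts the referent
    """
    # Cosine similarity between the spoken word and every candidate image
    sims = image_embeddings @ word_embedding
    sims = sims / (np.linalg.norm(image_embeddings, axis=1)
                   * np.linalg.norm(word_embedding))
    top_n = np.argsort(-sims)[:n]            # indices of the n most similar images
    return contains_referent[top_n].mean()   # precision@n for this word
```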
In our second experiment we investigate the time course of word recognition in our VGS model. This allows us to test whether the model’s word recognition performance is affected by word competition as is known to take place during human speech comprehension. For this experiment, we look at two measures of word competition: word-initial cohort size and neighbourhood density. In the Cohort model of human speech recognition [6], the incoming speech signal is mapped onto phone representations. These activated phone representations activate every word in which they appear. As more speech information becomes available, activation reduces for words that no longer match the input. The word that best matches the speech input is recognised. The number of activated or competing words is called the word-initial cohort size and plays a role in human speech processing: the larger the cohort size (i.e. the more competitors there are), the longer it takes to recognise a word [48]. Words with a denser neighbourhood of similar-sounding words are also harder to recognise as they compete with more words [49].
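For concreteness, both competition measures can be computed from a phonemically transcribed lexicon roughly as follows; the toy lexicon, the use of letters as stand-in phones, and the single-edit definition of a neighbour are illustrative assumptions, not a description of the exact counts used later in this paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (pa != pb)))   # substitution
        prev = curr
    return prev[-1]

def cohort_size(prefix, lexicon):
    """Number of lexicon words starting with the phones heard so far."""
    return sum(1 for phones in lexicon.values() if phones[:len(prefix)] == prefix)

def neighbourhood_density(word, lexicon):
    """Number of words differing from `word` by one phone substitution,
    insertion, or deletion."""
    target = lexicon[word]
    return sum(1 for other, phones in lexicon.items()
               if other != word and edit_distance(target, phones) == 1)

# Toy lexicon with letters standing in for phones
lexicon = {'dog': ('d', 'o', 'g'), 'dot': ('d', 'o', 't'),
           'dock': ('d', 'o', 'k'), 'cat': ('k', 'a', 't')}
print(cohort_size(('d', 'o'), lexicon))       # 3: dog, dot and dock remain active
print(neighbourhood_density('dog', lexicon))  # 2: dot and dock are neighbours
```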
We also use our model to test the interaction between neighbourhood density and word frequency. Several studies have investigated this interaction, with inconclusive results. In a gating study, Metsala [50] found an interaction where recognition was facilitated by a dense neighbourhood for low-frequency words and by a sparse neighbourhood for high-frequency words. Goh et al. [51] found that response latencies in word recognition were shorter for words with sparser neighbourhoods. They furthermore found a higher recognition accuracy for sparse-neighbourhood high-frequency words as opposed to the other conditions (i.e. sparse-low, dense-high, dense-low). This means that, unlike Metsala, they found no facilitatory effect of neighbourhood density for low-frequency words. Others found no interaction between lexical frequency and neighbourhood density at all [52, 53].
For this experiment, we use a gating paradigm, a well-known technique borrowed from human speech processing research (e.g. [54, 55]). In the gating experiment, a word is presented to the VGS model in speech segments of increasing duration, that is, with an increasing number of phones, and the model is asked to retrieve an image of the correct visual referent on the basis of the speech signal available so far. We then analyse the effects of word competition and several control factors on word recognition performance.
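A minimal sketch of this gating procedure is given below. The `embed_prefix` and `rank_images` arguments are hypothetical placeholders for the model’s speech encoder and image retrieval step, passed in as functions precisely because their exact form is not specified here, and the top-10 cut-off is again an illustrative assumption.

```python
import numpy as np

def gating_curve(waveform, phone_ends, embed_prefix, rank_images,
                 contains_referent, n=10):
    """Recognition score after each gate (each additional phone) of a spoken word.

    waveform:          audio samples of the isolated word
    phone_ends:        sample index at which each successive phone ends
    embed_prefix:      function mapping a speech prefix to an embedding
    rank_images:       function mapping an embedding to image indices sorted by similarity
    contains_referent: boolean array, True for images depicting the word's visual referent
    """
    scores = []
    for end in phone_ends:
        prefix = waveform[:end]                      # speech heard up to this gate
        ranking = rank_images(embed_prefix(prefix))  # retrieve images for the partial word
        scores.append(np.mean(contains_referent[ranking[:n]]))  # precision@n at this gate
    return scores
```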
In our third experiment we investigate whether our VGS model learns to differentiate between singular and plural instances of nouns. By the same principle of co-occurrences between the visual and auditory streams that allows the model to discover and recognise nouns, it may also be able to differentiate between their singular and plural forms. We test this by presenting both forms of all nouns to the model, and analysing whether the retrieved images contain single or multiple visual referents of that noun.
Our fourth question concerns VQ, a technique recently first applied to VGS models by Harwath, Hsu and Glass [13]. Their model acquired discrete linguistic units, including words. However, it remains an open question whether such VQ-induced word units also aid the recognition of words in isolation. If they do, the addition of VQ layers should improve the word recognition results of our VGS model. Havard, Chevrot and Besacier [30] improved the retrieval performance of their VGS model by providing explicit word boundary information, thereby showing that knowledge of the linguistic units is indeed beneficial to the model. Rather than explicitly providing word boundary information, VQ layers allow units to emerge in an end-to-end fashion. Because prior knowledge of word boundaries is not cognitively plausible, VQ layers are a more suitable approach for our cognitive model. To investigate whether the introduction of VQ layers indeed aids word recognition, all our experiments compare the baseline VGS model to a VGS model with added VQ layers.
To foreshadow our results, we find that (1) our VGS model does learn to recognise words in isolation but performance is much higher on nouns than on verbs; (2) word recognition in the model is affected by competition similarly to humans; (3) the model can distinguish between singular and plural nouns to a limited extent; and (4) the use of VQ layers does not improve the model’s recognition performance.