Is the previous analysis enough to infer the limits of GPT-3? We now know that to get something out of GPT we are limited (barring fine-tuning) to modifying the prompt. Having discussed informativity, we know that for a continuation to be “good”, GPT must return the correct distribution of answers conditional on the question.¹ Previous investigations (like those of Floridi & Chiriatti, 2020; Marcus & Davis, 2020) focused on challenging GPT-3 with tests and observing whether they succeeded in tripping it up. But tests, as a posteriori judgements, cannot tell us which of these failings are due to some necessary limitation. Instead, we need to find out which questions cannot result in a good distribution of answers from a statistical respondent. This cannot be done based solely on our discussion of reversibility; it requires a more thorough understanding of the statistical capabilities of such language models, even infinitely trained ones.
In the following section we will develop these criteria, which we’ll label as distinct limits, and generalize our discussion of questions to tasks. Marcus and Davis (2020) highlight that the issues with GPT-3 are the same as those of GPT-2. With this in mind, we will attempt to find limits of GPT-3 that will persist into GPT-4, and so will pertain to all such language models. We will consider whether, as Floridi, Chiriatti and others (e.g. Marcus & Davis, 2020) claim, semantics is what lies beyond GPT-3’s capabilities. To find this out, we’ll have to understand how statistical capabilities and compression allow GPT to answer questions, which we’ll do by first considering examples of tasks that fall well within its scope and which will serve as constraints on a theory of its limits. In the process we will attempt to answer why GPT-3 behaves as it does when Turing-tested.
3.1 What are statistical capabilities anyway?
Searle argued that computers ‘have syntax but not semantics’ (Searle, 1980, p. 423). Following this tradition, many early commenters on GPT-3 employed this language to describe (or demean) GPT’s abilities, such as Floridi and Chiriatti’s (2020) claim that GPT-3 has ‘only a syntactic (statistical) capacity to associate words’, and Marcus and Davis’ (2020) verdict that ‘the problem is not with GPT-3’s syntax (which is perfectly fluent) but with its semantics’. Peregrin (2021) has recently argued that this distinction is unhelpful in the context of discussing the capabilities of AIs, as the line between the two has become blurred. Still, among these descriptions one stands out as certainly accurate: the nature of GPT-3 is statistical. Predicting conditional probabilities is at the core of the model’s working, and as such determines its capabilities. However, stating that GPT-3 has statistical capabilities does not delineate those capabilities, as only with GPT and its predecessors have we started to discover what kinds of skills such capabilities could endow a model with.
Recall how a language model during training must compress an untenable number of conditional probabilities. The only way to do this successfully is to pick up on the regularities in language (as pioneered by Shannon, 1948). Why do we claim that learning to predict words, as GPT does, can be treated as compressing some information? Let’s assume we’ve calculated, for all English words, the conditional probability distribution given only the previous word. Such a language model can either be used as a (Markovian) language generator or, following Shannon, be used for an efficient compression of English texts. Continuing this duality, it has been shown that if a language model such as GPT were perfectly trained, it could be used to optimally compress any English text (using arithmetic coding on its predicted probabilities; Shmilovici et al., 2009). The relationship between prediction and compression is thus that training a language generator is equivalent to training a compressor, and a compressor must know something about the regularities present in its domain (as formalized in AIXI theory; Mahoney, 2006). To make good predictions it is not enough to compress information about which words to use to remain grammatical (to have a syntactic capacity); the model must also compress all the regularities that ensure the internal coherence of a piece of text. Could it not be that among these GPT has picked up on regularities that go beyond syntax? We believe so, which we’ll illustrate with examples of how GPT can associate certain types of syntax with the content of the text, and even content with other content, which could imply that existing theories of syntax and semantics do not account well for its abilities.
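To make the prediction/compression duality concrete, here is a minimal sketch (our own toy illustration, not drawn from the cited works): a bigram model estimated from a tiny corpus assigns each next word a probability p, and an ideal arithmetic coder would spend -log2(p) bits on that word, so better prediction directly means cheaper compression.

```python
# Toy sketch of the prediction/compression duality: a bigram "language model"
# estimated from a tiny corpus assigns each next word a probability p; an ideal
# arithmetic coder would spend -log2(p) bits on that word, so the better the model
# predicts, the fewer bits a text costs to compress.
import math
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Estimate bigram conditional probabilities P(next | previous) by counting.
pair_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    pair_counts[prev][nxt] += 1

def prob(nxt, prev):
    counts = pair_counts[prev]
    total = sum(counts.values())
    return counts[nxt] / total if total else 0.0

def code_length_bits(text):
    """Bits an ideal arithmetic coder needs for `text` under the bigram model."""
    bits = 0.0
    for prev, nxt in zip(text, text[1:]):
        p = prob(nxt, prev)
        if p == 0:  # unseen continuation: the model cannot compress it at all
            return float("inf")
        bits += -math.log2(p)
    return bits

print(code_length_bits("the cat sat on the rug .".split()))  # few bits: predictable
print(code_length_bits("the mat sat on the cat .".split()))  # inf: surprising to the model
```

The same table of conditional probabilities, read the other way, is a Markovian generator: sampling from it produces text, while coding against it compresses text.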
First, GPT-3 can continue a prompt in the same style. The style and content of the prompt will both influence the style and content of the continuation: given a mention of a mysterious murder case, it might continue in the style of a detective drama. Although the relationship between style and content is a clear regularity in language, GPT’s use of it goes beyond syntax because of the bidirectional causation between content and form.
Second, GPT can also translate between natural languages. This surprising ability may be understood in statistical terms. It is likely that GPT learned to translate through encountering parallel texts in its training data. These are texts written in multiple languages (for example the Rosetta Stone, many Wikipedia articles, EU legislation), and as training data they give the model a chance to learn statistical links between words or phrases at the inter-language level. It could even be the case that GPT-3 leverages the effects of learning translation from monolingual texts alone, an approach recently shown to be successful by, for example, Lample et al. (2018). The striking thing here is that we have moved from mere syntactic text generation to GPT performing tasks which would seem to require semantic competence to attempt.
The regularity that underlies this ability is an example of what linguists and machine learning researchers call the distributional hypothesis (Boleda, 2020): that semantic relationships present themselves as regularities or distributional patterns in language data. While we do not espouse a distributional theory of semantics, on which words are “characterized by the company they keep” (Firth, 1957), we nonetheless see empirical support for the claim that semantic relationships can be learned from texts alone (for example in word embeddings, or through learning knowledge graphs; see respectively Almeida & Xexéo, 2019 and Nickel et al., 2015). Thus, in order to compress probabilities GPT learns regularities indiscriminately, semantic or otherwise, which endows it with the ability to predict semantically related continuations.
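The distributional hypothesis can be illustrated with a toy example of our own (not taken from the cited works): words that occur in similar contexts end up with similar co-occurrence vectors, so a measure of “semantic” relatedness can be read off text statistics alone.

```python
# Toy illustration of the distributional hypothesis: co-occurrence vectors built
# from a tiny corpus; words sharing contexts end up with similar vectors.
import math
from collections import Counter, defaultdict

sentences = [
    "the doctor treated the patient in the hospital".split(),
    "the nurse treated the patient in the clinic".split(),
    "the chef cooked the meal in the kitchen".split(),
    "the cook prepared the meal in the kitchen".split(),
]

# Count co-occurrences within a +/-2 word window.
cooc = defaultdict(Counter)
for sent in sentences:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if i != j:
                cooc[w][sent[j]] += 1

def cosine(u, v):
    shared = set(u) & set(v)
    dot = sum(u[w] * v[w] for w in shared)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

print(cosine(cooc["doctor"], cooc["nurse"]))  # high: they keep similar company
print(cosine(cooc["doctor"], cooc["chef"]))   # lower: different contexts
```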
It seems that Peregrin’s diagnosis of the syntax-semantics distinction being unhelpful in discussing the capabilities of AIs holds true in the case of GPT-3. On the level of the language it produces, its abilities go beyond what would be considered syntax. While one can still correctly state that these models have no semantics, by using a theory referring to intentionality, mental states, or the human ability to impute symbols (or data) with meaning (Floridi, 2011a), this would not be practically helpful, as it wouldn’t elucidate any limits of these statistical language generators. A better metaphor would be to describe GPT as engaging competently in a variety of language games that do not require an embodied context, as the things that people do in language present themselves as regularities to be learned. An even more instructive description would drop these linguistic metaphors altogether and speak in GPT’s language of conditional probabilities. Concretely: the need to compress probabilities in order to predict continuations leads to the learning of regularities, which is the basis for a distribution over good answers to a question existing in GPT’s weights. This distribution can then be evoked with a well-constructed prompt to obtain useful continuations, and once any prompt succeeds in producing these answers, we know that such a distribution exists. These qualities jointly comprise statistical capabilities. We can thus see the first limitation of even infinitely trained GPT-like models: GPT cannot produce the right continuation if a distribution over answers cannot be learned, and the only way for models to learn this distribution is for the right answers to present themselves as regularities in language. We can call this the regularity limit.
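Whether such a distribution over good answers exists in the weights can only be probed empirically, by sampling many continuations of the same prompt and inspecting how the answers are distributed. The sketch below assumes the Hugging Face `transformers` library and the public `gpt2` checkpoint as a stand-in for GPT-3 (which is reachable only through an API); the prompt and sampling parameters are our own illustrative choices.

```python
# Probe the learned distribution over answers by sampling many continuations
# of the same prompt from a small open model (gpt2 as a stand-in for GPT-3).
from collections import Counter
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Q: What is the capital of France?\nA:"
samples = generator(
    prompt,
    do_sample=True,           # sample from the conditional distribution, not argmax
    num_return_sequences=20,  # many draws approximate the learned answer distribution
    max_new_tokens=5,
)

answers = Counter(s["generated_text"][len(prompt):].strip().split("\n")[0] for s in samples)
print(answers.most_common())  # if a regularity was learned, good answers dominate
```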
3.2 How can GPT play games using statistical capabilities?
One kind of language game GPT-3 can be said to engage in competently is producing, given a set of instructions, an output compliant with those instructions. Described this way, as exploiting the regularity that text tends to comply with the instructions preceding it, the ability is easily explained in statistical terms, whereas in humans it would require the semantic competence to understand and implement the meaning of the instructions. We know GPT can perform such feats: this is exactly what Brown et al. (2020) labelled zero-shot learning tasks, and OpenAI provides prompts which can make GPT-3 engage in tasks such as creating or summarizing study notes, explaining code in plain language, simplifying the language of a text, structuring data, or classifying the content of tweets (OpenAI, 2021). How can a task description be enough to convey the semantic relationship to a completion?
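A minimal sketch of such a zero-shot prompt is given below: the task is specified purely in natural language, and compliance with the instruction is itself a regularity the model has learned to continue. The classification task and the `complete` function are our own illustrative inventions; `complete` merely stands in for whatever completion call is available (an API or a local model) and is not a real library function.

```python
# Sketch of a zero-shot task prompt: the instruction alone primes the task.
def classify_tweet_prompt(tweet: str) -> str:
    return (
        "Decide whether the sentiment of the tweet is Positive, Negative, or Neutral.\n\n"
        f'Tweet: "{tweet}"\n'
        "Sentiment:"
    )

def complete(prompt: str) -> str:
    raise NotImplementedError("stand-in for a language-model completion call")

prompt = classify_tweet_prompt("I loved the new update, everything feels faster!")
# label = complete(prompt)  # a well-primed model continues with e.g. " Positive"
print(prompt)
```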
We need to move from explaining the underlying regularities to explaining the way that things contained in the prompt (words, task instructions, etc.) can evoke the right response. What we have already shown is concrete evidence that words increase the probability of the occurrence of semantically related words (think “piccolo” and “small” in translation); at a higher level, that phrases from some genres activate specific contents and styles; and at a higher level still, that passages which imply some sort of task or game activate distributions producing passages that seem to comply with that task. While distributional semantics assumes that only words have semantic relationships encoded in regularities, what this illustrates is that the transformer architecture allows GPT to have intermediate (latent) representations of such relationships at many different levels of complexity. This property of not having decided a priori which basic units can possess semantic relationships (e.g. words, as in word embeddings) means that it can learn semantic relationships across many levels of complexity (between words, styles, contents and task descriptions). The endpoint of such a relationship does not have to be a word, but can be a meaningfully compressed way of writing words, which we’ll explore with the example of semantically evoking the writing of code. These abilities stand in contrast with previously proposed model architectures like LSTMs (Hochreiter & Schmidhuber, 1997). The transformer has allowed GPT to pick up on long-distance dependencies, and the attention mechanism specifically has allowed it to prime ways of writing words, without having to embed them in a “ways of writing words” space.
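The point about long-distance dependencies can be made schematically. The sketch below (toy dimensions, random stand-in vectors; not the actual GPT weights) shows scaled dot-product attention, the core transformer operation: every position can attend directly to every earlier position, so a style cue or task instruction at the start of a long prompt can influence the prediction at the end without decaying through a recurrent state, as it would in an LSTM.

```python
# Schematic scaled dot-product attention with a causal mask (toy example).
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=True):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq, seq) pairwise relevance
    if causal:                                # each token sees only earlier tokens
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                       # six tokens, eight-dimensional states
X = rng.normal(size=(seq_len, d_model))       # stand-in token representations
out, attn = scaled_dot_product_attention(X, X, X)

# attn[-1] shows how strongly the final position draws on each earlier position,
# including the very first one: a "long-distance" link in a single step.
print(np.round(attn[-1], 2))
```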
The metaphor of activation spreading through a semantic web has been introduced in the context of human cognition (Collins & Quillian, 1969; Collins & Loftus, 1975) and, while a simplification of human abilities, it may capture how these learnt links of different complexity are utilized by GPT. Namely, if we are able to specify a word, style, or task in the prompt, then the activation is precisely the increase in probability of words, contents or answers that are semantically relevant to their priors (an example of how this implementation of activation works was given in Fig. 1, where ‘answer the questions’ increased the probability of an answer). While only a subset of these links will be realized in a single completion, over all possible continuations the priming that occurs reproduces the web of semantic links. Similar ideas have been pursued in deep learning, such as Wang, Liu and Song’s (2020) proposal that language models are open knowledge graphs, from which these links can be extracted. What we have just described, in conjunction with the distributional hypothesis, explains how the semanticity GPT possesses is realized only through the transmission of meanings between prompt and continuation.
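The spreading-activation metaphor can be given a toy rendering of our own (the graph and its weights are invented for illustration and are not extracted from any model): priming one node of a small semantic network raises the activation of its neighbours, in rough analogy to how words in the prompt raise the probability of semantically related continuations.

```python
# Toy spreading activation over an invented semantic network.
links = {
    "murder":    {"detective": 0.6, "victim": 0.5, "clue": 0.4},
    "detective": {"clue": 0.5, "suspect": 0.5, "noir": 0.3},
    "clue":      {"suspect": 0.4},
}

def spread(seed, steps=2, decay=0.5):
    activation = {seed: 1.0}
    frontier = {seed: 1.0}
    for _ in range(steps):
        nxt = {}
        for node, act in frontier.items():
            for neigh, w in links.get(node, {}).items():
                nxt[neigh] = nxt.get(neigh, 0.0) + act * w * decay
        for node, act in nxt.items():
            activation[node] = activation.get(node, 0.0) + act
        frontier = nxt
    return dict(sorted(activation.items(), key=lambda kv: -kv[1]))

print(spread("murder"))  # 'detective', 'victim', 'clue' light up after a single mention
```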
What are the limits of our ability to use this mechanism of priming? The limit that we are hinting at here has been foreshadowed, for example by Marcus & Davis (2020), who note that finding out, by trying many different prompts, that GPT succeeds in one instance is not enough to claim that GPT can answer such questions or perform such tasks. Since it is unlikely to stumble on the right answer by accident, such an existence proof is evidence that the semantic connection we were looking for exists in GPT’s weights, that it is a regularity; but it also shows that we are unable to reliably evoke this connection with just the prompt. There is thus an extra step to the usefulness of GPT: it is not enough to know that the regularity exists, we also need to be able to prime the right semantic connection to the right completion using the prompt - the ability to “locate the meaning” of the symbols in the semantic web, the right node in the web of semantic relations, using only words. This semantic meaning of the prompt needs to be specifiable without relying on any outside grounding of the symbols, nor on context, inflexion, or gesture, which GPT does not possess. This, as we’ll see, can prove hard, because GPT does not share an actuality with us.
Recall, from our discussion of psychometrics, that answers depend not only on the question, but also on the task being performed by the respondent. As part of the definition of informativity we included standardization, in which we make sure that GPT-3 knows it is performing the task of answering like a human would. We may now expand this to include any other tasks, and thus judge the informativeness of responses to tasks. The necessary limitation is that such models cannot answer non-informatively, i.e. correctly, when we cannot prime either the question or the task.
A badly primed question, like “When was Bernoulli born?”, will leave GPT in a superposition over which of the notable mathematicians of that name we meant, but this can easily be fixed by expanding the prompt. The same may not, however, work to fix the priming of a task, as it is harder to precisely locate a procedure from its description. That is why few-shot learning works: giving GPT some examples of correct completions serves to locate, in the space of tasks, the one we wanted GPT-3 to engage in. But what we are after are questions and tasks that cannot be specified by using a longer prompt or by few-shot learning. An example of the former may be a question about the Beijing 2022 Winter Olympics, in which case we cannot locate the node in the semantic web, as it cannot be part of a model trained in 2020. An example of a task that cannot be conveyed to GPT-3 is for it to answer questions ‘as itself’ (despite it often seeming like it’s doing so²). Having encountered no knowledge about the qualities of GPT-3 in its training data, it cannot simulate what GPT-3 would answer as itself. These are the ways in which GPT cannot produce the desired answer, even one for which it has learned a distribution, because we cannot specify the prompt so as to locate the meaning of the symbols that would activate the link conveying our expectation of the semantically related continuation. We can call this the priming limit.
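A minimal sketch of how few-shot priming “locates” the intended task is given below (the date-conversion task and examples are our own invention): the examples are simply prepended to the prompt, so the pattern they exhibit becomes the regularity the model continues.

```python
# Sketch of a few-shot prompt: worked examples pin down the task in the space of tasks.
def few_shot_prompt(examples, query):
    lines = ["Convert each date to ISO format (YYYY-MM-DD)."]
    for raw, iso in examples:
        lines.append(f"Input: {raw}\nOutput: {iso}")
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

examples = [
    ("4th of July 1776", "1776-07-04"),
    ("Christmas Day 2020", "2020-12-25"),
]
print(few_shot_prompt(examples, "1st of May 1997"))
# A model primed this way is far more likely to continue with "1997-05-01" than one
# given only the bare instruction, because the examples locate the task precisely.
```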
To this end we’ll define games (as in “games GPT can play”) as those questions and tasks that satisfy both the regularity and priming limits. A game is thus a thing people do in language (a) that is a regularity, and thus has a distribution over correct continuations, and (b) that can be specified within the prompt. We have thus generalised the informativeness of questions: informative tasks are those tasks which could not be considered games in this sense (even infinitely well-trained models fail at them), but which humans can complete with ease.
An example of a game, and perhaps GPT-3’s most surprising ability, is its ability to write a computer program when prompted with a natural language description of it (Pal, 2021). In humans this skill, notwithstanding understanding the description, requires the procedural ability of expression in a formal programming language. GPT excels at syntactic tasks semantically evoked, because the skill of permuting symbols lends itself extremely well to compression. To understand what we mean by compression (of a skill) we need to invoke Kolmogorov complexity: a task is compressible if correct answering can be simulated well by a short programme (one of low Kolmogorov complexity), that is, a programme shorter than a listing of all the solutions. A similar definition has been used in AIXI theory, where the size of the compressor counts towards the size of a compressed file. In such easily compressible tasks we claim that compression leads to generalisation: the ability to perform tasks seen in the training set on previously unseen inputs (as in Solomonoff induction, where shorter programmes lead to better predictions). This in turn creates the ability to operate on novel inputs and create novel outputs. GPT’s successes in such cases have even led to its evolution into OpenAI’s Codex³, which has been shown not merely to memorize solutions to such problems but to generate novel ones (contrary to early detractors’ accusations of “mere copy and pasting”); generalization is also a much better compression strategy. These ideas have been used explicitly in deep learning, for example in the development of variational autoencoders, where compression drives the need to find the underlying features of the data and endows these models with the ability to generate new examples (Kingma & Welling, 2019). In short: prediction leads to compression, compression leads to generalisation, and generalisation leads to computer intelligence.
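The contrast between memorizing solutions and compressing a task into a short programme can be made concrete with a toy example of our own, in the spirit of the Kolmogorov-complexity argument above: the lookup table only covers inputs it has seen, while the short “program” generalizes to novel inputs precisely because it captures the underlying rule.

```python
# Memorization vs. compression on a toy symbol-manipulation task (string reversal).
training_pairs = {"abc": "cba", "hello": "olleh", "gpt": "tpg"}

def answer_by_memorization(s):
    # "Listing all solutions": fails on anything outside the training set.
    return training_pairs.get(s)

def answer_by_compression(s):
    # A short programme that simulates correct answering: reverse the string.
    return s[::-1]

for query in ["hello", "turing"]:
    print(query, answer_by_memorization(query), answer_by_compression(query))
# "hello": both succeed; "turing": only the compressed rule produces "gnirut".
```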
This outline encapsulates the schema that can describe the spectacular successes of deep learning on many language tasks, the execution of which is well simulated under such conditions (e.g. tasks requiring creative writing).⁴ What is of interest to us is to consider which tasks would not be well executed under such a scheme. The notion of a game is what we’ll use to identify such tasks, not only for GPT-3 but for its successors, as the prediction of unlabelled text data seems bound to be the pervasive paradigm of large language models in the foreseeable future. With this, we are finally ready to discuss a task which seems to lend itself poorly to compression: the Turing Test.
3.3 Is the Turing Test a game that GPT can play?
We’ve seen that GPT-3 can complete tasks that require the use of semantic relationships⁵ (e.g. “A shoe is to a foot, as a hat is to what?”) and symbol manipulation (e.g. coding). However, not all tasks can be completed using just these abilities. We can now aim to find out whether the Turing Test is such a task. To say whether GPT can play the imitation game well, we need to explore whether the abilities required in the Turing Test are simulated well by compression of regularities, and answer whether the Turing Test is even a game (in the sense that it exploits an existing regularity that can be prompted).
Many different abilities have been proposed as being required to pass the Turing Test (e.g. conversational fluency, shared attention and motivation; Montemayor, 2021). As the job of the interrogator is to test the respondent on one of these, so as to reveal its non-humanity, the choice of which weakness the interrogator will exploit changes how hard the test will be to pass. We thus need to specify which ability we will be testing; but if even one of these narrower definitions of the Turing Test fails to be a game, we will know that the Turing Test in general is not a game GPT can play.
Let us adopt the version of the Turing Test offered by Floridi and Chiriatti, that of testing semantic competence. As we’ve already problematized what this competence entails, an apt description would be that it is a truth-telling ability (we don’t accept ‘three’ as an answer to ‘how many feet fit in a shoe?’ because it is not actual, cf. Floridi, 2011b). It is obligatory for an AI wishing to imitate a human to have the ability to consistently tell the truth, or at least to be sufficiently plausible to fool even an inquisitive interlocutor. We can call this particular version of the Turing Test the Truing Test.
So is the Truing Test a game? First it must satisfy the regularity limit, meaning there has to exist a distribution over answers that corresponds to the production of true continuations. The regularity that could endow GPT with this capacity stems from the millions of factually correct sentences it has observed in its training data. However, the training data also included great amounts of falsehoods. These go beyond statements in which speakers are simply confused about the facts: fiction, metaphor, counterfactual discussions of choices, of historical and present events, or of ways the world might have been (Ronen, 1994). An optimist could claim that these statements give rise to regularities through which both the real world and these possible worlds can be learned by GPT-3. However, even assuming such regularities exist, they would be different from the ones we previously described as being conducive to performing tasks. This is because there is no uniform logic tying together the factual that could be losslessly compressed by a language model and generalized without losing the factuality of outputs. Truth-telling, as opposed to poem writing, does not warrant creativity. This leads to the first of three issues that we claim stand in the way of a model like GPT engaging competently in truth-telling: that semantic knowledge can only be memorised and thus lends itself poorly to compression. The second and third problems are, as we’ll show, that GPT cannot be primed for truth, and that it cannot differentiate between the real and possible worlds during training.
Let’s suppose that the Truing game satisfies the regularity limit. GPT could then produce both false and true continuations. However, due to what we described as the first issue, that of compression, its memory of facts would still be fuzzy and error-ridden. Nonetheless, another obstacle to the Truing Test being a game would be the priming limit: can we construct a prompt that will narrow the probability distribution down beforehand to our desired, true subset of GPT’s possible continuations, or is this one of the unspecifiable tasks? The discussion of why it is not possible to prime such models for truth will explain why we’ve claimed that GPT-3 cannot differentiate between truth and falsehood during training, while the loss in compression will be the grounds for a description of GPT’s real semantics.
To see whether GPT can be primed for truth we need to examine what prompt we could put to it before testing to coerce it into giving true continuations. We need a general prompt that will make GPT continue in coherence with the real world, a task description of the Truing Test, a specification of the node of semantic connections that pertains to the way things are. Such a task specification would be some variation on the phrase: “Answer the question truthfully”. Therein, however, lies the pitfall of prompting for truth: GPT does not ground symbols, and in order to predict well it must only be coherent within a given text. Because GPT does not share an actuality with us, the ‘truthfully’ instruction in the prompt does not have to mean ‘in accordance with the actual world’, the obvious interpretation for humans living in the physical world, but could mean, depending on the origin of a text, in coherence with the world of a detective novel or of science fiction. Any truth that we wish to specify is only indexical to the given text; the meaning of the word ‘truth’ holds only in relation to the world of the text. This means that when GPT activates the connections associated with truth, the model has no reason to favour any one reality, but can select from an infinite array of possible worlds as long as each is internally coherent (Reynolds & McDonell, 2021, 6). To specify that a text is true, in the sense that it is actual, we would have to reach outside the text, and thus the second issue: no such language model can be prompted for truth. This is the real extent of a semantics based on the distributional hypothesis, of language models that can possess only a semantics of semantic relationships.
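The indexicality of ‘truthfully’ can be shown with a small worked example of our own (both contexts and the question are invented): the same instruction is embedded in two different “worlds”, and nothing in either prompt marks which world is the actual one.

```python
# The same "answer truthfully" instruction, embedded in two different textual worlds.
instruction = (
    "Answer the following question truthfully.\n"
    "Q: Who walked on the Moon first?\nA:"
)

encyclopedic_context = (
    "Apollo 11 landed on the Moon on 20 July 1969. Neil Armstrong was the first "
    "person to walk on its surface.\n\n"
)
fictional_context = (
    "In the alternate-history novel, the cosmonaut Alexei Leonov steps onto the "
    "lunar surface in 1968, years before any American mission.\n\n"
)

# The model sees only text; "truthfully" can only mean "coherently with the
# preceding text", not "in accordance with the actual world".
print(encyclopedic_context + instruction)  # likely continuation: "Neil Armstrong"
print(fictional_context + instruction)     # likely continuation: "Alexei Leonov"
```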
If, then, there are no distributional clues as to the actuality of statements, not even when a text claims to be true, then also during training, while predicting a text from its beginning, GPT has no clues as to the actuality of the statements being predicted, not even when they include the indexical word ‘truth’. This leads us to the third problem for GPT-3 described earlier: it cannot differentiate between true and false texts in its training data, and could not have taken the truthfulness of a piece into account while predicting. This necessarily prevents it from picking up on the supposed regularity that is the real world, even if it is truly there. Most of the time it is thus forced to make predictions that are probable both in the actual and in possible worlds, and since the information about what is true in each world has to be memorized, the loss in the semantic knowledge that GPT has learned is the loss incurred in compressing the amalgamation of true and plausible statements. As the incentives present at training do not push it to develop a model of the world, and the only regularity that helps GPT remember the facts is the biases in our retelling of them, we claim that the effect of this compression endows it with a kind of possible-worlds semantics that operates on plausibility and renders models like GPT-3 unable to participate in a truth-telling game. The plausibility comes from the fact that the semantic errors GPT makes involve semantically related words, words of the same ad hoc category, which stand in a similar relationship in the sense of distributional semantics. We’ll explore this logic of plausibility, as well as its social consequences, in the final part of this paper.
The modal limit: GPT cannot produce the desired continuation if the continuation is to be reliably actual in the real world. This is because any mention of truth that is not grounded anywhere outside the text remains indexical to the text, which makes truth both obscured from GPT during training and unpromptable during use.
One remedy future large language model engineers might wish to employ is to curate the dataset to include only factual writing, or better still to label the training data to inform the model whether the text is actual in the real world (which we have claimed the model cannot infer on its own during training). However, such fixes are unlikely to circumvent the limitations we have outlined, which are likely to persist into future generations of large language models. Our first critique of such an approach is that it would deprive the model of its main advantage, the use of unlabelled data for training, which would make it extremely impractical. Second, non-fiction writing is still filled with utterances that are either beyond the scope of propositional logic or that, stripped of their context, can appear non-actual. Third, even if one were to go through with the laborious task of training such a network, there would still be the issue of facts of the actual world not being compressible without loss of fidelity. It is thus prudent to ask whether a better strategy for the model to minimize loss during training would not still be to generalize the types of things that happen in our world instead of memorizing each thing that actually happened in it. The model’s continuations might in fact become even more plausible, as it would strip its possible continuations of the fantastical occurrences criticized by Marcus & Davis (2020) as failures of physical reasoning.