What’s wrong with blockhead? After all, the system is able to master a conversation. But it seems clear that it has no understanding of what it is talking about. It cannot grasp the meaning of the words. In fact, it cannot possibly talk about the world: it does not refer to the world, as it has never had any contact with the world. In short: it has no semantic grounding. Genuine intelligence, however, requires some sort of grounding at some point. Semantic grounding will be our third candidate for a dimension of the AI state space.
4.1 The Symbol Grounding Problem and the Dimension of Functional Role Grounding
As argued in Sect. 2.2, DeepBlue doesn’t really play chess, as it essentially won by basic symbol manipulation and calculation without any self-driven exploration of the game. By contrast, what about a genuinely self-learning system like AlphaGo (or AlphaGoZero)? Does AlphaGo actually play Go? As an easier-to-grasp example, let us consider DQN, already briefly described in Sect. 3.1. It plays Atari Breakout at a super-human level. As is typical for DL systems, DQN draws on a gigantic number of training examples. In fact, the number of training games exceeds human training, and thus the experience of human Atari gamers, by orders of magnitude. This already suggests that the DL algorithms and network architectures do not strictly correspond to those at work in humans. But this makes them no less successful with regard to certain capabilities. And they can nevertheless be regarded as a “proof of principle” for a biologically inspired connectionism (cf. Hassabis et al. 2017 for faster deep reinforcement learning models inspired by neuroscience).
Consider the training data of DQN. How can they be understood from the perspective of the system? For a human player, Breakout’s block-like pixel world, despite its minimalism, looks like a world of rackets, balls, walls and bricks. There is nothing to suggest that this is the case for DQN. The system never had contact with real rackets, balls, walls or bricks. From DQN’s perspective, the only things that exist, as it were, are vast numbers of bare pixel distributions. With this in mind, the system’s superhuman playing abilities, and especially its additional ability to develop creative long-term solution strategies (like digging a tunnel), seem almost scary. At least it might seem that way. And the same seems to apply to AlphaGo or AlphaGoZero.
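For illustration, here is a minimal sketch in the style of the published DQN value network (not DeepMind’s code; the layer sizes and action count are assumptions made for the example). Its only input is a stack of raw pixel frames; nothing in the architecture encodes “racket”, “ball” or “wall” as such.

```python
# A hedged sketch of a DQN-style Q-network: the world, as the system
# receives it, is nothing but a tensor of pixel intensities.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions: int = 4, n_frames: int = 4):
        super().__init__()
        # Convolutions see the game purely as spatial pixel statistics.
        self.features = nn.Sequential(
            nn.Conv2d(n_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # one value estimate per joystick action
        )

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # pixels: (batch, n_frames, 84, 84) grey values in [0, 1]
        return self.head(self.features(pixels))

q = QNetwork()
frame_stack = torch.rand(1, 4, 84, 84)   # "the world", from DQN's side
print(q(frame_stack))                    # estimated value of each action
```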
The problem can be seen as an instance of the symbol grounding problem, although that problem, in its original form, is aimed at classical symbolism (cf. Taddeo and Floridi 2005 for a review). A symbol is a physical token that is individuated by its physical form and that can be linked to other symbols according to syntactic rules. Symbols are therefore elements of symbol systems. Symbolism regards the manipulation of physical symbols as necessary and sufficient for intelligence and cognition. This is compatible with a computational theory of mind, according to which the brain, as the vehicle of cognition, is to be regarded as a computing device or Turing machine. According to Stevan Harnad, the symbol grounding problem consists in the following: “Suppose you had to learn Chinese as a first language and the only source… you had was a Chinese/Chinese dictionary! … How is symbol meaning to be grounded in something other than just more meaningless symbols? This is the symbol grounding problem” (Harnad 1990, pp. 339–340).
In contrast to symbolism, connectionism emphasizes not only the network architecture of cognitive systems but also a “subsymbolism” in place of symbol-based information processing. At first glance, neural networks do not operate on symbols but on inputs that represent (micro-)features. In the case of DQN or AlphaGo, these are pixels with different gray or color values. However, since important classes of neural networks, such as recurrent networks, can be shown to be Turing complete (cf. Siegelmann and Sontag 1995), these systems ultimately also operate symbolically insofar as they can be mapped to the symbolic operations of Turing machines. The question of whether and how the input pixel distributions are meaningful for DQN or AlphaGo amounts to the question of the extent to which these distributions have a grounding or anchoring in the world. And on the face of it, it seems as if they have no such grounding. Therefore, neither DQN nor AlphaGo operates meaningfully; they do not understand what they are doing.
But this conclusion falls short because it overlooks an important distinction regarding meaning. Consider chess. How do the chess pieces get their meaning? What makes a knight a knight? Two things. First, it must be distinguishable in shape from all other types of pieces, such as the rook or the bishop. Second, it acquires its meaning in the game through exactly the role it plays in the game, which in turn is unambiguously assigned to it by the rules of the game. The knight (or any other chess piece) works like a physical symbol that is manipulated according to rules, in other words, according to a syntax. And the semantics of the physical symbols, the chess pieces, originates from this syntax. In contrast, consider the words of a spoken language. They can be combined into sentences according to grammatical rules. However, the question of what a particular word, such as the word “tree,” refers to is in no way determined by the grammar. Semantics in the sense of reference does not originate from syntax. We must therefore distinguish between meaning as functional role, which is determined by internal rules, and meaning in the sense of external reference. The symbol grounding problem primarily concerns meaning in the second sense: how can system-internal symbols be grounded in the external world so that they acquire meaning in the sense of reference?
Consider again the example of Atari Breakout. DQN does not seem to command the meaning of the terms racket, ball, etc. in the sense of reference. However, it is by no means excluded that DQN’s learning performance consists essentially in recognizing certain stable and recurring patterns in the pixel distributions and linking them to regular behavior. A sufficient XAI analysis could provide exactly this kind of information: it could show that DQN represents stable pixel configurations in higher layers and thus acquires the concepts racket, ball, etc. in the sense of a functional role semantics (FRS); a simple sketch of such an analysis is given below. It is, moreover, reasonable to assume that, as far as meaning within the game is concerned, everything essential is already captured by an FRS framework. If we look, for instance, at AlphaGo, the question of semantics in the sense of reference does not arise at all, since Go pieces, just like chess pieces, do not refer to things or states of affairs in the world but only possess an internal functional role within the system, i.e. a meaning in the game.
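One simple form such an XAI analysis could take is a linear “probe”: if, say, the ball position can be read off linearly from a higher layer’s activations, the network plausibly represents the ball as a stable internal feature, i.e. as a functional role, without any reference to real balls. The sketch below uses synthetic activations as a stand-in for real network data; it only illustrates the method.

```python
# A hedged sketch of a linear probing analysis; activations and ball
# positions are synthetic placeholders, not data from a trained DQN.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_units = 1000, 512

ball_x = rng.uniform(0, 84, size=n_states)                 # "true" ball column
activations = np.outer(ball_x, rng.normal(size=n_units))   # fake hidden features
activations += rng.normal(scale=5.0, size=activations.shape)

# Fit a linear probe: predict the ball position from the activations.
X = np.hstack([activations, np.ones((n_states, 1))])
w, *_ = np.linalg.lstsq(X, ball_x, rcond=None)
pred = X @ w
r2 = 1 - np.sum((ball_x - pred) ** 2) / np.sum((ball_x - ball_x.mean()) ** 2)
print(f"linear probe R^2 for ball position: {r2:.3f}")
```

A high probe score would indicate that the “ball” concept is implicitly present in the network’s internal states, in the functional-role sense discussed above.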
To conclude: we must expect the dimension of semantic grounding to decompose into at least two parts: grounding on the basis of meaning as functional role, call it functional role grounding, and a more genuine form of grounding by means of reference to the world. The latter will be our concern in the following two sections. The former is now identified as the FRS grounding dimension of the AI state space.
4.2 The Chinese Room and the Dimension of Causal Grounding
Harnad’s symbol grounding problem was inspired by Searle’s related and well-known Chinese room argument (Searle 1980, 1990; cf. Harnad 1989, 2001). In a way, Harnad’s argument makes the deeper core of the Chinese room argument explicit. The latter argument aims to show that syntax is not sufficient for semantics and that the human brain is not a computer in the sense of symbolism. To this end, Searle conceives the Chinese room as the caricature of a Turing machine in which he himself takes over the role of the tape head for reading and writing: he sits in an otherwise empty room and uses a set of rules (the machine table) provided to him to manipulate Chinese characters that he receives in the room as input and passes back out as output. Since Searle does not understand Chinese, and since to him Chinese symbols look like “meaningless squiggles”, he insists that he can never attain the meaning of Chinese symbols in this way, i.e. by pure syntactic symbol manipulation.
In the 1980s, the Chinese room argument triggered a flood of reactions and discussions. Among the most common objections to the argument are the connectionist critique (Searle 1990) and the criticisms referred to by Searle as the systems reply and the robot reply (Searle 1980). Searle counters all objections with essentially the same strategy, by showing that they just provide “more of the same”. According to the systems reply, it is the whole room rather than the internal operator that is proficient in Chinese. Searle, however, argues that he could just as well internalize the whole room (in particular by learning the rule book) and still do nothing but mere syntactic symbol manipulation. According to the connectionist variant of the systems reply, we are asked to consider an entire network of operators rather than a single operator. Searle, again, argues that we may just as well imagine a “Chinese gym” with lots of operators manipulating symbols according to rules, but that still neither any of the operators nor the gym as a whole would thereby acquire the meaning of Chinese symbols.
Of particular importance is the robot reply. Would not a robot equipped with the rules of Chinese and operating in the Beijing marketplace gain the meaning of the previously merely syntactic symbols? Wouldn’t a system in this way establish the necessary reference-to-world grounding? According to Searle, this is not the case, since the “computer inside the robot” (Searle 1980, p. 420) is still an analogue of the Chinese room. Even if the input stems from an external camera and the output is used to control the robot’s arms, at the level of the internal computer that controls the robot, both input and output still consist of nothing but meaningless symbols. This answer is reminiscent of a strange homunculus conception, and the question also arises as to whether a combination of the systems and robot replies could not be reinforced by further arguments from the areas of embodied and situated cognition (cf. Robbins and Aydede 2009). But we do not need to pursue this further here. Since Searle’s argument is directed against GOFAI’s symbolism (for Searle: “strong AI”), it merely aims to show that meaningful thinking goes beyond algorithmic symbol manipulation. But if embodiment, social interaction and situatedness are crucial for attaining semantics, this ultimately even strengthens the argument. And it shows, in essence, that the Chinese room argument boils down to the problem of grounding in the sense of reference.
Strangely, the DL technologies now available in the area of machine translation seem to realize the Chinese room scenario, at least in part. Freely available systems such as Google Translate or DeepL have shown a breathtaking improvement in their translation performance in recent years. And yet: one would hardly want to assume that any of these systems truly understands the texts it translates (sometimes with excellent quality). The new systems go beyond earlier forms of either rule-based or statistics-based machine translation. They extract rules of word selection, word order, etc. by self-learning on the basis of voluminous bilingual text corpora.
All this suggests the following: syntax is ‘almost sufficient’ to produce the linguistic behavior that corresponds to the behavior of speakers who are truly semantically grounded. Although syntax is not completely sufficient for semantics, it is ‘almost sufficient’ in the sense that it is sufficient for all practical purposes (though still insufficient from a strict Searlean point of view). This means that, in effect, a syntactic machine can be indistinguishable in its translation performance from a human speaker.
In the light of the above considerations, DL translation systems do not appear to have any semantic grounding. Nevertheless, these systems acquire rule knowledge and meaning in an FRS sense through self-learning. This alone is remarkable. The crucial step, however, is yet to come. How can it even be possible that a mere symbol-based FRS becomes effectively indistinguishable (regarding language behavior) from a truly referential semantics? Could it be that, when it comes to semantics, a pure FRS alone suffices and any appeal to reference can be dismissed? On the face of it, this amazing possibility seems to be suggested by the translation capabilities of systems such as Google Translate and DeepL. But on closer inspection, it is not. The systems do indeed go beyond a pure FRS. The text corpora used in learning allow for a kind of indirect reference-to-the-world. They were created by human speakers who possess a semantics in the full referential sense. Hence, Google Translate and DeepL have no direct but an indirect grounding. They refer to the world indirectly. The regularities that can be extracted from comparisons of text corpora go beyond grammatical regularities; they provide world regularities, since the texts deal with worldly circumstances. This allows a decent amount of structural information about the world to be extracted.
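To make the point tangible, here is a deliberately crude, hypothetical sketch (nothing like the actual neural architectures of Google Translate or DeepL; the miniature corpus is made up): even bare co-occurrence statistics over a parallel corpus yield translation-like behavior, and whatever “meaning” the words have for such a procedure is exhausted by their distributional profile in texts produced by grounded speakers.

```python
# A toy sketch: "translation" from co-occurrence counts over a tiny,
# made-up parallel corpus, with no contact with the world at all.
from collections import defaultdict

parallel_corpus = [
    ("the tree is green", "der baum ist gruen"),
    ("the house is green", "das haus ist gruen"),
    ("the tree is tall",  "der baum ist hoch"),
    ("the dog is tall",   "der hund ist hoch"),
]

cooc = defaultdict(lambda: defaultdict(int))   # cooc[english][german]
de_count = defaultdict(int)                    # frequency of each German word

for en, de in parallel_corpus:
    for d in de.split():
        de_count[d] += 1
    for e in en.split():
        for d in de.split():
            cooc[e][d] += 1

def translate(word):
    """Pick the German word that co-occurs most reliably with `word`."""
    candidates = cooc[word]
    return max(candidates, key=lambda d: candidates[d] / de_count[d])

print(translate("tree"))   # -> "baum", extracted from text statistics alone
```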
All of this shows that semantic grounding in the sense of reference, whether achieved indirectly as in the foregoing example or, most decisively, through direct causal contact with the world, is of utmost importance for acquiring meaning. It is thus of utmost importance for intelligence, whether artificial or natural. Grounding in the sense of causal reference is a crucial dimension of the AI state space. Note that postulating this dimension does not commit us to any particular theory of meaning or mental representation. Nor does it imply either a realist or an anti-realist commitment to content or representation. The criterion of grounding in the sense of causal reference is equally fulfilled by programs of naturalized semantics, such as causal theories (Fodor 1987) or teleosemantics (Millikan 1984), and by recent accounts of structural representation (Cummins 1996; Ramsey 2007; Shea 2018) that are more in tune with instrumentalism. All such accounts entail causal world connections as central elements, and this is what counts for grounding by causal reference-to-the-world.
4.3 Meaning as Use and the Dimension of Social Grounding
While causal grounding is certainly an element in almost all theories of meaning and representation, there are accounts that downplay the role of reference, such as conventionalism and use theories of meaning. Core ideas of the latter were introduced by Ludwig Wittgenstein (Wittgenstein 1953). The central idea is to trace meaning back to linguistic use and social practice. According to Wittgenstein, the diversity of language can be seen in the variety of ways in which it is used. The focus is on the concept of rules. A classical, rule-based conception of language sees language as regulated by some unambiguous syntax. This applies all the more to formal languages or mathematics (and has tacitly been assumed in our considerations on the relationship between syntax and semantics). As Wittgenstein aims to show in his “rule following” considerations, such a strict Platonic conception of rules leads to an infinite regress. In order to set up the syntactic rules of, say, a certain Turing machine, other rules are required governing the former rules. But these, too, require further rules; hence the regress.
According to Wittgenstein, language is governed by rules, but these rules presuppose a public practice and only become apparent in use. The rule following problem consists in the fact that language use and practice are always finite, but that no finite number of cases determines the “rules” of language use, and thus the meaning of linguistic expressions, under all, hence infinitely many, circumstances. Language rules are by no means rigid, but depend on the social context. Wittgenstein’s bizarre thought experiment of the two-minute man drastically demonstrates the consequences of his conception:
Let us imagine a god creating a country instantaneously in the middle of the wilderness, which exists for two minutes and is an exact reproduction of a part of England, with everything that is going on there in two minutes. Just like those in England, the people are pursuing a variety of occupations. Children are in school. Some people are doing mathematics. Now let us contemplate the activity of some human beings during these two minutes. One of these people is doing exactly what a mathematician in England is doing, who is just doing a calculation. - Ought we to say that this two minute man is calculating? Could we for example not imagine a past and a continuation of these two minutes, which would make us call the process something quite different? (Wittgenstein 1956, VI §34).
Wittgenstein’s answer is obvious: the two-minute man “does not calculate” because he is not embedded in the practice and context of mathematics. Against this backdrop, let us consider the question of whether AlphaGo actually plays Go. Games, like language, are governed by rules. Wittgenstein suggests a tight analogy between games and language and between the corresponding roles of rules and rule use; indeed, he speaks of language as a “language game”. Just as there is no mathematics and no linguistic meaning without a social context, there are no games without it either. Hence, on a Wittgensteinian understanding of use and practice, AlphaGo does not play Go, since it lacks the social context: the shared and public practice of the game of Go.
In Sect. 4.1 our conclusion was that the functional roles comprise everything that is essential in terms of meaning in the game. The main reason for this was that the meaning of moves and pieces is not referential; chess pieces, for instance, do not refer to anything in the world. Following Wittgenstein, however, the meaning of games still has a kind of grounding, even if not in the referential sense. Instead, it is a kind of social grounding. Without a public practice, the rules of games are not subject to external control and are therefore no rules at all.
Wittgenstein’s reflections on rule following are undoubtedly radical (and accordingly controversial), as the two-minute man scenario drastically demonstrates. Saul Kripke saw himself prompted to a radical rule skepticism, which infects not only the rules of games or language but even the rules of mathematics and logic (Kripke 1982). An in-depth discussion of these questions is far beyond the scope of the present paper. We shall assume that, for all practical purposes, rule knowledge can be set up in an AI machine modulo “Kripkensteinian” doubts.
Thus, for each version of the Alpha series, from AlphaGo to AlphaZero, the respective rules of the games to be learned were unambiguously implemented (Silver et al. 2017; a toy illustration of this setup is given below). The machine then develops a functional role semantics for the elements and the overall setup of the game, constrained by these pre-determined rules. The systems of the Alpha series have no further grounding. Google Translate and DeepL, on the other hand, already have a rudimentary form of socially anchored semantics, because these systems acquire an indirect social grounding in the course of their translation learning. After all, the text corpora on the basis of which the systems learn were generated by socially situated speakers, and the systems are therefore parasitic on the social practices of those speakers. A future AI that combines, for example, the external performance of Google Duplex with an indirect grounding of world knowledge on the basis of Internet data could ultimately become a real part of our social practice of language and, hence, a real part of the language community. There is no convincing reason to assume that such a system would still lack a proper semantic grounding.
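The division of labor just described for the Alpha series, hard-coded rules plus learning confined to roles within those rules, can be illustrated by a deliberately simple toy sketch. Tic-tac-toe and random self-play stand in here for Go and the actual training loop; this is not DeepMind’s code.

```python
# A toy sketch: the rules are hard-coded by the programmers as a
# legal-move generator, and the self-playing system only ever encounters
# positions and moves whose "meaning" is their role under these rules.
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),      # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),      # columns
         (0, 4, 8), (2, 4, 6)]                 # diagonals

def legal_moves(board):
    """The implemented rules: a legal move is just an empty cell index."""
    return [i for i, cell in enumerate(board) if cell == " "]

def winner(board):
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def random_selfplay():
    """One self-play game; the 'players' see only board states and rules."""
    board, player = [" "] * 9, "X"
    while legal_moves(board) and winner(board) is None:
        board[random.choice(legal_moves(board))] = player
        player = "O" if player == "X" else "X"
    return winner(board) or "draw"

print(random_selfplay())
```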
To conclude: social grounding is as important as causal grounding. To acquire meaning, intelligent systems must not only be coupled to the world but must also share social practices. The debate about the ultimate theory of meaning and representation is still open in the philosophy of mind and language, but for the time being it seems reasonable to treat both types of grounding as independent dimensions of the AI state space.