1 Introduction and Motivation

Impressive successes in artificial intelligence (AI) and machine learning (ML) have been achieved in the last two decades, including: (1) IBM Deep Blue [6] defeating the World Chess Champion Garry Kasparov in 1997, (2) the success of IBM Watson [10] in 2011 in defeating the Jeopardy! players Brad Rutter and Ken Jennings, and (3) the sensation of DeepMind's AlphaGo [42] in defeating the Go masters Fan Hui in 2015 and Lee Sedol in 2016.

Such successes are often seen as milestones of, and "measurements" for, AI. We argue, however, that these successes are achieved on very specific tasks and are therefore not appropriate for evaluating the "intelligence" of machines.

The ML community is now becoming aware that human IQ tests are a more robust approach to machine intelligence evaluation than such very specific tasks [9]. In this paper we (1) provide some background on testing intelligence, (2) report on preliminary results from the 271 participants of our online study on explainability, and (3) propose to use our Kandinsky Patterns [32] as an IQ test for machines.

2 Background

A fundamental problem for AI is the vague and widely differing definitions of the notion of intelligence, and this is particularly acute when considering artificial systems which are significantly different from humans [28]. Consequently, intelligence testing for AI in general, and ML in particular, has generally not been a focus of extensive research in the AI community. The evaluation of approaches and algorithms has primarily been carried out against certain benchmarks (cf. [33, 34]).

The most popular approach is the one proposed by Alan Turing in 1950 [45], which holds that an algorithm can be considered intelligent (enough) for a certain kind of task if and only if it can complete all possible tasks of that kind. The shortcoming of this approach, however, is that it is heavily task-centric and requires a-priori knowledge of all possible tasks as well as the possibility to define these tasks. The latter, in turn, raises the problem of the granularity and precision of such definitions. An indicative example is the evaluation, or in other terms the "intelligence testing", of autonomously driving cars [29]; another example is the CAPTCHA (completely automated public Turing test to tell computers and humans apart), which is simple for humans but hard for machines and is therefore used in security applications [1]. Such CAPTCHAs use either text or images of different complexity and reveal individual differences in cognitive processing [3].

In cognitive science, the testing of human aptitude – intelligence being a form of cognitive aptitude – has a very long tradition. Basically, the idea of psychological measurement stems from the general developments in 19th-century science, particularly physics, which put substantial focus on the accurate measurement of variables.

This view was the beginning of so-called anthropometry [36] and subsequently of psychological measurement. Intelligence testing itself began around 1900, when the French government passed a law requiring all French children to go to school. Consequently, the government regarded it as important to find a way to identify children who would not be capable of following school education. Alfred Binet (1857–1911) [11] started the development of assessment questions to identify such children. Remarkably, Binet focused not only on aspects which were explicitly taught in schools but also on more general and perhaps more abstract capabilities, including attention span, memory, and problem-solving skills. Binet and his colleagues found that the children's capacity to answer the questions and solve the tasks was not necessarily a matter of physical age. Based on this observation, Binet proposed a "mental age" – which actually was the first intelligence measure [4]. The level of aptitude was seen relative to the average aptitude of the entire population. In this context, Charles Spearman (1863–1945) coined the term g-factor in 1904 [43], denoting a general, higher-level factor of intelligence.

This very early example of an intelligence test already makes the fundamental difference from the task-centric evaluation of later AI very clear. Human intelligence was not seen as the capability to solve one particular task, such as a pure classification task; it was considered a much wider construct. Moreover, human intelligence was generally not measured in an isolated way but always in relation to an underlying population. In the example of self-driving cars, the question would be whether one car can drive "better" than all the other cars, or even whether and to what extent the car does better than human drivers. In the 1950s, the American psychologist David Wechsler (1896–1981) extended the ideas of Binet and colleagues and published the Wechsler Adult Intelligence Scale (WAIS), which, in its fourth revision, is a quasi-standard test battery today [48]. The WAIS-IV contains essentially ten subtests and provides scores in four major areas of intelligence, that is, verbal comprehension, perceptual reasoning, working memory, and processing speed. Moreover, the test provides two broad scores that can be used as a summary of overall intelligence. The overall full-scale intelligence value (the term IQ had already been coined by William Stern in 1912 as an abbreviation of the German Intelligenzquotient) uses the popular metric with a mean of 100 and a standard deviation of 15.
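As a brief worked illustration (our own addition, using the standard deviation-score conversion), a raw test score \(x\) from a population with mean \(\mu\) and standard deviation \(\sigma\) is mapped onto this scale as

\[
\mathrm{IQ} = 100 + 15 \cdot \frac{x - \mu}{\sigma},
\]

so a person scoring exactly one standard deviation above the population mean obtains an IQ of 115.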

In advancing Spearman's g-factor idea, Horn and Cattell [17] argued that intelligence is determined by about 100 interplaying factors and proposed two different levels of human intelligence, fluid and crystallized intelligence. The former includes general cognitive abilities such as pattern recognition, abstract reasoning, and problem solving. The latter is based on experience, learning, and acculturation; it includes general knowledge or the use of language. In addition to Wechsler's WAIS-IV, among the most commonly used tests is, for example, Raven's Progressive Matrices [37], a non-verbal multiple-choice measure of the reasoning component of Spearman's g, more exactly of the two components (i) "thinking clearly" and "making sense of complexity", and (ii) the "ability to store and reproduce information". The test was originally developed by John Raven in 1936 [37]. The task is to continue a visual pattern (cf. Fig. 1). Other tests are the Reynolds Intellectual Assessment Scales, the Multidimensional Aptitude Battery II, the Naglieri Nonverbal Ability Test (cf. Urbina [46]), and, in German-speaking countries, the IST-2000R [2] or the Berlin Intelligence Structure Test (BIS; [20]).

There exists a large number of classifications and sub-classifications of sub-factors of intelligence. The Cattell-Horn [17] classification includes, for example:

  • Quantitative knowledge (the ability to understand and work with mathematical concepts)

  • Reading and writing

  • Comprehension-Knowledge (the ability to understand and produce language)

  • Fluid reasoning (incl. inductive and deductive reasoning and reasoning speed)

  • Short term memory

  • Long term storage and retrieval

  • Visual processing (including closure of patterns and rotation of elements)

  • Auditory processing (including musical capabilities)

  • General processing speed.

A classification that appears similar at first sight was introduced by Gardner [12], based on his theory of multiple intelligences. As opposed to the prior classifications, his theory includes a much broader understanding of intelligence as human aptitude. Gardner's theory, therefore, was a starting point for an (often discussed as inflationary) increase in the number of types of intelligence, for example in the direction of emotional, social, and artistic "intelligence" [30]. Over the past 120 years, the 20th-century ideas of human intelligence have been further developed and new models have been proposed. These new models tend to interpret general intelligence as an emergent construct reflecting the patterns of correlations between different test scores, not as a causal latent variable. The models aim to bridge correlational and experimental psychology and to account for inter-individual differences in terms of intra-individual psychological processes; the approaches therefore look into neuronal correlates of performance [7]. One of these new approaches is, for example, process overlap theory, a novel sampling account based upon cognitive process models, specifically models of working memory [22].

When explaining predictions of deep learning models we apply an explanation method, e.g. simple sensitivity analysis, to understand the prediction in terms of the input variables. The result of such an explainability method can be a heatmap. This visualization indicates which pixels need to be changed to make the image look (from the AI system's perspective!) more or less like the predicted class [40]. On the other hand, there are the corresponding human concepts, and "contextual understanding" requires an effective mapping between the two [24]; this is among the future grand goals of human-centered AI [13].
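As a minimal, hedged sketch of such a gradient-based sensitivity analysis (not the authors' implementation; the model, the input, and the library choice are our own assumptions), a saliency heatmap could be computed as follows:

```python
# Minimal sketch of simple sensitivity analysis (gradient-based saliency).
# Assumptions: a generic, untrained torchvision classifier stands in for the
# model under study, and a random tensor stands in for the input image.
import torch
import torchvision.models as models

model = models.resnet18(weights=None)          # placeholder image classifier
model.eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)   # stand-in input

scores = model(image)                          # class scores (logits)
top_class = scores.argmax(dim=1).item()        # predicted class index
scores[0, top_class].backward()                # d(score) / d(input pixels)

# Heatmap: pixels with large absolute gradients influence the prediction most.
heatmap = image.grad.abs().max(dim=1).values.squeeze(0)   # collapse channels
print(heatmap.shape)                           # torch.Size([224, 224])
```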

For a detailed description of the KANDINSKY Patterns please refer to [32].

When talking about explainable AI it is important from the very beginning to differentiate between explainability and causability: by explainability we understand the property of the AI system to generate machine explanations, whilst causability is the property of the human to understand the machine explanations [15]. Consequently, the key to effective human-AI interaction is an efficient mapping of explainability to causability. In terms of the map metaphor, this is about establishing connections and relations - not drawing a new map. It is about identifying the same areas in two completely different maps.

3 Related Work

Within the machine learning community there is an intensive debate about whether, e.g., neural networks can learn abstract reasoning or whether they merely rely on pure correlation. In a recent paper the authors [41] propose a data set and a challenge for investigating abstract reasoning, inspired by a well-known human IQ test: the Raven test, or more specifically Raven's Progressive Matrices (RPM) and the Mill Hill Vocabulary Scales, which were developed in 1936 for use in fundamental research into both the genetic and the environmental determinants of "intelligence" [37]. The premise behind RPMs is simple: one must reason about the relationships between perceptually obvious visual features – such as shape positions or line colors – to choose an image that completes the matrix. For example, perhaps the size of squares increases along the rows, and the correct image is the one that adheres to this size relation (see Fig. 1). RPMs are strongly diagnostic of abstract verbal, spatial and mathematical reasoning ability. To succeed at the challenge, models must cope with various generalisation 'regimes' in which the training and test data differ in clearly-defined ways.

Fig. 1. Raven-style Progressive Matrices. In (a) the underlying abstract rule is an arithmetic progression on the number of shapes along the columns. In (b) there is an XOR relation on the shape positions along the rows (panel 3 = XOR(panel 1, panel 2)). Other features such as shape type do not factor in. A is the correct choice for both. Figure taken from [41].

The rapidly advancing field of AI and ML technologies adds another dimension to the discourse of intelligence testing, that is, the evaluation of artificial intelligence as opposed to human intelligence. Human intelligence tends to focus on adapting to the environment based on various cognitive, neuronal processes. The field of AI, in turn, very much focuses on designing algorithms that can mimic human behavior (weak or narrow AI). This is specifically true in applied genres such as autonomously driving cars, robotics, or games. This also leads to distinct differences in what we consider intelligent. Humans have a consciousness, they can improvise, and the human physiology exhibits plasticity that leads to "real" learning by altering the brain itself. Although humans tend to make more errors, human intelligence as such is usually more reliable and robust against catastrophic errors, whereas AI is vulnerable to software, hardware and energy failures. Human intelligence develops based on infinite interactions with an infinite environment, while AI is limited to the small world of a particular task.

The development of intelligence, therefore, is the result of the incremental interplay between a challenge/task, a conceptual change (physiological as well as mental) of the system, and the assessment of the effects of the conceptual change. To advance AI, specifically in the direction of explainable AI, we suggest bridging human strengths and human assessment methods with those of AI. In other words, we suggest introducing principles of human intelligence testing as an innovative benchmark for artificial systems.

We want to exemplify this idea by the challenge of identifying and interpreting/explaining visual patterns. In essence, this refers to the human ability to make sense of the world (e.g., by identifying the nature of a series of visual patterns that needs to be continued). Sensemaking is an active processing of sensations to achieve an understanding of the outside world; it involves the acquisition of information, learning about new domains, solving problems, acquiring situation awareness, and participating in social exchanges of knowledge [35]. This ability can be applied to concrete domains such as various HCI acts [35] but also to abstract domains such as pattern recognition.

This topic has specifically been a focus of medical research. Kundel and Nodine [23], for example, investigated gaze paths in medical images (a sonogram, a tomogram, and two standard radiographic images). The observers were asked to summarize each of the images in one sentence. The results of this study revealed that correct interpretations of the images were related to attending to the relevant areas of the images, as opposed to attending to visually dominant areas. The authors also found a strong relation between the explanations and prior experience with such images.

A fundamental principle in the perception and interpretation of visual patterns is the likelihood principle, originally formulated by Helmholtz, which states that the preferred perceptual organization of an abstract visual pattern is based on the likelihood of specific objects [27]. A, to a certain degree, competing explanation is the minimum principle, proposed by Gestalt psychology, which claims that humans perceive a visual pattern according to the simplest possible interpretation. The role of experience is also reflected in studies on the perception of abstract versus representational visual art; [47] demonstrated distinct differences between art experts and laymen in their perception of and preferences for visual art. Psychological research has demonstrated that the nature of perceiving and interpreting visual patterns is therefore a function of expectations [50]. On the one hand, this often leads to misinterpretations or premature interpretations; on the other hand, it increases the "explainability" of interpretations, since visual perception is determined by existing conceptualizations.

Fig. 2. Visual patterns to be explained by humans.

4 How Do Humans Explain? How Do Machines Explain?

In a recent online study [14], we asked (human) participants to explain random visual patterns (Fig. 2). We recorded and classified the free verbal explanations of a total of 271 participants. Figure 3 summarizes the results. The results show that the majority of explanations were made based on the properties of individual elements in an image (i.e., shape, color, size) and the appearance of individual objects (number). Comparisons of elements (e.g., more, less, bigger, smaller, etc.) were significantly less likely, and the location of objects, interestingly, played almost no role in the explanation of the images.

Fig. 3. Summary of the study results: classification of the free verbal explanations given by the 271 participants.

In a natural language statement about a Kandinsky Figure humans use a series of basic concepts which are combined through logical operators. The following (incomplete) examples illustrate some concepts of increasing complexity.

  • Basic concepts given by the definition of a Kandinsky Figure: a set of objects, described by shape, color, size and position, see Fig. 4(A) for color and (B) for shapes.

  • Existence, numbers, set-relations (number, quantity or quantity ratios of objects), e.g. “a Kandinsky Figure contains 4 red triangles and more yellow objects than circles”, see Fig. 4(C).

  • Spatial concepts describing the arrangement of objects, either absolute (upper, lower, left, right, ...) or relative (below, above, on top, touching, ...), e.g. “in a Kandinsky Figure red objects are on the left side, blue objects on the right side, and yellow objects are below blue squares”, see Fig. 4(D).

  • Gestalt concepts (see below) e.g. closure, symmetry, continuity, proximity, similarity, e.g. “in a Kandinsky Figure objects are grouped in a circular manner”, see Fig. 4(E).

  • Domain concepts, e.g. “a group of objects is perceived as a “flower””, see Fig. 4(F).

Fig. 4. Kandinsky Patterns showing concepts such as color (A), shape (B), numeric relations (C), spatial relations (D), Gestalt concepts (E) and domain concepts (F) (Color figure online)

These basic concepts can be used to select groups of objects, e.g. ‘all red circles in the upper left corner’, and to further combine single objects and groups in a statement with logical operators, e.g. ‘if there is a red circle in the upper left corner, there exists no blue object’, or with complex domain-specific rules, e.g. ‘if the size of a red circle is smaller than the size of a yellow circle, red circles are arranged circularly around yellow circles’.
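To make this concrete, the following is a minimal sketch (our own illustrative encoding, not the official Kandinsky Patterns implementation) of how such a figure and a statement about it could be represented and evaluated in Python:

```python
# Hedged sketch: one possible encoding of a Kandinsky Figure as a set of
# objects with shape, color, size and position, plus a statement such as
# "contains 4 red triangles and more yellow objects than circles".
# The KandinskyObject structure is our own illustration, not the official API.
from dataclasses import dataclass

@dataclass
class KandinskyObject:
    shape: str      # "circle", "square", "triangle"
    color: str      # "red", "blue", "yellow"
    size: float     # relative size in [0, 1]
    x: float        # position in [0, 1]
    y: float

figure = [
    KandinskyObject("triangle", "red", 0.2, 0.1, 0.2),
    KandinskyObject("triangle", "red", 0.3, 0.4, 0.7),
    KandinskyObject("triangle", "red", 0.2, 0.8, 0.3),
    KandinskyObject("triangle", "red", 0.1, 0.6, 0.6),
    KandinskyObject("circle",   "yellow", 0.2, 0.2, 0.8),
    KandinskyObject("square",   "yellow", 0.2, 0.9, 0.9),
]

def statement_holds(objs):
    red_triangles = sum(o.shape == "triangle" and o.color == "red" for o in objs)
    yellow = sum(o.color == "yellow" for o in objs)
    circles = sum(o.shape == "circle" for o in objs)
    return red_triangles == 4 and yellow > circles

print(statement_holds(figure))  # True for the example figure above
```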

In their experiments, [18] discovered, among other things, that the visual system builds an image from very simple stimuli into more complex representations. This inspired the neural network community to see their so-called “deep learning” models as a cascading model of cell types, which always follows similar simple rules: at first lines are learned, then shapes, then objects are formed, eventually leading to concept representations.

By use of back-propagation such a model is able to discover intricate structures in large data sets: back-propagation indicates how the internal parameters should be adapted, which are then used to compute the representation in each layer from the representation in the previous layer [26]. Building concept representations refers to the human ability to learn categories for objects and to recognize new instances of those categories. In machine learning, concept learning is defined as the inference of a Boolean-valued function from training examples of its inputs and outputs [31]; in other words, it is training an algorithm to distinguish between examples and non-examples (we call the latter counterfactuals).
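A minimal sketch of this definition (our own illustration, using scikit-learn and a hand-made attribute encoding) could look as follows:

```python
# Minimal sketch of concept learning as inference of a Boolean-valued function:
# the concept "red triangle" over illustrative attribute vectors, learned from
# examples and counterfactuals. The feature encoding is our own assumption.
from sklearn.tree import DecisionTreeClassifier

# Attribute encoding: [is_triangle, is_circle, is_red, is_blue]
X = [
    [1, 0, 1, 0],  # red triangle   -> example (concept holds)
    [1, 0, 1, 0],  # red triangle   -> example
    [1, 0, 0, 1],  # blue triangle  -> counterfactual
    [0, 1, 1, 0],  # red circle     -> counterfactual
    [0, 1, 0, 1],  # blue circle    -> counterfactual
]
y = [1, 1, 0, 0, 0]  # Boolean-valued target

concept = DecisionTreeClassifier().fit(X, y)
print(concept.predict([[1, 0, 1, 0]]))  # [1]: a new red triangle is recognised
```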

Concept learning has been a relevant research area in machine learning for a long time and has its origins in cognitive science, where it was defined as the search for attributes that can be used to distinguish exemplars from non-exemplars of various categories [5]. The ability to think in abstractions is one of the most powerful tools humans possess. Technically, humans order their experience into coherent categories by defining a given situation as a member of that collection of situations for which responses x, y, etc. are most likely appropriate. This classification is not a passive process, and understanding how humans learn abstractions is essential not only to the understanding of human thought, but also to building artificial intelligence machines [19].

In computer vision an important task is to find a likely interpretation W for an observed image I, where W includes information about the spatial location and extent of objects, their boundaries, etc. Let \(S^W\) be a function associated with an interpretation W that encodes the spatial location and extent of a component of interest, where \(S^W_{(i,j)} = 1\) for each image location (i, j) that belongs to the component and 0 elsewhere. Given an image, obtaining an optimal or even likely interpretation W, or the associated \(S^W\), can be difficult. For example, in edge detection, previous work [8] asked what the probability is of a given location in a given image belonging to the component of interest.
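As a small illustration of the notation (our own toy example; the "component of interest" and the threshold are arbitrary stand-ins), \(S^W\) can be represented as a binary map:

```python
# Toy illustration of S^W: 1 where an image location (i, j) belongs to the
# component of interest, 0 elsewhere. The "interpretation" here is simply a
# threshold on a random image, which is an arbitrary stand-in for W.
import numpy as np

image = np.random.rand(8, 8)          # toy observed image I
S_W = (image > 0.5).astype(int)       # S^W_{(i,j)} in {0, 1}
print(S_W)
```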

[44] presented a model of concept learning that is both computationally grounded and able to fit human behaviour. He argued that two apparently distinct modes of generalizing concepts – abstracting rules and computing similarity to exemplars – should both be seen as special cases of a more general Bayesian learning framework. The Bayesian framework (and more specifically [25]) explains the specific workings of these two modes, i.e. which rules are abstracted, how similarity is measured, and why generalization should appear differently in different situations. This analysis also suggests why the rules/similarity distinction, even if not computationally fundamental, may still be useful at the algorithmic level as part of a principled approximation to fully Bayesian learning.
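A hedged sketch of this Bayesian view of generalization (our own toy "number game" with a hand-picked hypothesis space, uniform prior, and invented observations, inspired by [44]) is given below:

```python
# Toy Bayesian concept learning: which hypothesis explains the observed
# examples? Hypotheses, priors and data are illustrative only.
hypotheses = {
    "even numbers":    {n for n in range(1, 101) if n % 2 == 0},
    "multiples of 10": {n for n in range(1, 101) if n % 10 == 0},
    "powers of two":   {1, 2, 4, 8, 16, 32, 64},
}
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}
data = [16, 8, 2, 64]

# Likelihood via the size principle: each example is drawn uniformly from the
# concept's extension, so smaller consistent hypotheses receive higher weight.
posterior = {}
for h, extension in hypotheses.items():
    consistent = all(d in extension for d in data)
    likelihood = (1.0 / len(extension)) ** len(data) if consistent else 0.0
    posterior[h] = prior[h] * likelihood

total = sum(posterior.values())
posterior = {h: p / total for h, p in posterior.items()}
print(posterior)  # "powers of two" dominates after seeing 16, 8, 2, 64
```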

Gestalt principles (“Gestalt” = German for shape) are a set of empirical laws describing how humans gain meaningful perceptions and make sense of the chaotic stimuli of the real world. As so-called Gestalt cues they have been used in machine learning for a long time. In particular, when learning classification models for segmentation, the task is to classify between “good” segmentations and “bad” segmentations and to use the Gestalt cues as features (the priors) to train the learning model. Images segmented manually by humans are used as examples of “good” segmentations (ground truth), and “bad” segmentations are constructed by randomly matching a human segmentation to a different image [39]. Gestalt principles [21] can be seen as rules, i.e. they discriminate between competing segmentations only when everything else is equal; therefore we speak more generally of Gestalt laws. One particular group of Gestalt laws are the Gestalt laws of grouping, called Prägnanz [49], which include the law of proximity (objects that are close to one another appear to form groups, even if they are completely different), the law of similarity (similar objects are grouped together), and the law of closure (objects can be perceived as such even if they are incomplete or hidden by other objects).
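The following is a minimal, hedged sketch of this idea (our own synthetic feature values and a simple logistic-regression classifier; the cues and models actually used in [39] differ):

```python
# Illustrative sketch: a classifier over Gestalt-cue features separating
# "good" (human-made) from "bad" (randomly mismatched) segmentations.
# The feature values below are synthetic placeholders, not real measurements.
from sklearn.linear_model import LogisticRegression

# Each row: [proximity_score, similarity_score, continuity_score, closure_score]
X = [
    [0.9, 0.8, 0.7, 0.9],  # human-made segmentation  -> "good"
    [0.8, 0.9, 0.8, 0.7],  # human-made segmentation  -> "good"
    [0.2, 0.3, 0.4, 0.1],  # randomly mismatched      -> "bad"
    [0.3, 0.2, 0.1, 0.3],  # randomly mismatched      -> "bad"
]
y = [1, 1, 0, 0]

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.85, 0.75, 0.8, 0.8]]))  # likely classified as "good"
```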

Unfortunately, the currently best-performing machine learning methods have a number of disadvantages, and one is of particular relevance: neural networks (“deep learning”) are difficult to interpret due to their complexity and are therefore considered “black-box” models [16]. Image classifiers operate on low-level features (e.g. lines, circles, etc.) rather than on high-level or domain concepts (e.g. images containing a storefront). This makes their inner workings difficult to interpret and understand. However, the “why” would often be much more useful than the simple classification result.

5 Conclusion

By comparing the strengths of machine intelligence and human intelligence it is possible to solve problems for which we currently lack appropriate methods. One grand general question is “How can we perform a task by exploiting knowledge extracted while solving previous tasks?” To answer such questions it is necessary to gain insight into human behavior, not with the goal of mimicking human behavior, but rather to contrast human learning methods with machine learning methods. We hope that our Kandinsky Patterns challenge the international machine learning community, and we look forward to receiving many comments and results. Updated information can be found on the accompanying Web page.

A single Kandinsky Pattern may serve as an “intelligence (IQ) test” for an AI system. To take the step towards a more human-like and more in-depth assessment of an AI system, we propose to apply the principles of human intelligence tests, as outlined in this paper. In relation to the Kandinsky Patterns we suggest applying the principle of Raven’s Progressive Matrices. This test is strongly related to the identification of a “meaning” in complex visual patterns [38]. The underlying complex pattern, however, is not based on a single image; the meaning only arises from the sequential combination of multiple images. To assess AI, a set of Kandinsky Patterns, each of which is complex in itself, can be used. A “real” intelligent achievement would be identifying the concepts - and therefore the meaning! - of sequences of multiple Kandinsky Patterns. At the same time, this approach solves one key problem of testing “strong AI”: the language component. With this approach it is not necessary to verbalize the insights of the AI system. By definition, the identification of the right visual pattern that “traverses” the Kandinsky Patterns (analogous to Raven’s matrices) indicates the identification of an underlying meaning. Much further experimental and theoretical work is needed here.