Cognition

Volume 118, Issue 3, March 2011, Pages 306-338

The learnability of abstract syntactic principles

https://doi.org/10.1016/j.cognition.2010.11.001

Abstract

Children acquiring language infer the correct form of syntactic constructions for which they appear to have little or no direct evidence, avoiding simple but incorrect generalizations that would be consistent with the data they receive. These generalizations must be guided by some inductive bias – some abstract knowledge – that leads them to prefer the correct hypotheses even in the absence of directly supporting evidence. What form do these inductive constraints take? It is often argued or assumed that they reflect innately specified knowledge of language. A classic example of such an argument moves from the phenomenon of auxiliary fronting in English interrogatives to the conclusion that children must innately know that syntactic rules are defined over hierarchical phrase structures rather than linear sequences of words (e.g., Chomsky, 1965, Chomsky, 1971, Chomsky, 1980, Crain and Nakayama, 1987). Here we use a Bayesian framework for grammar induction to address a version of this argument and show that, given typical child-directed speech and certain innate domain-general capacities, an ideal learner could recognize the hierarchical phrase structure of language without having this knowledge innately specified as part of the language faculty. We discuss the implications of this analysis for accounts of human language acquisition.

Introduction

Nature, or nurture? To what extent is human mental capacity a result of innate domain-specific predispositions, and to what extent does it result from domain-general learning based on data in the environment? One of the tasks of modern cognitive science is to move past this classic nature/nurture dichotomy and elucidate just how innate biases and domain-general learning might interact to guide development in different domains of knowledge.

Scientific inquiry in one domain, language, was influenced by Chomsky’s observation that language learners make grammatical generalizations that appear to go beyond what is immediately justified by the evidence in the input (Chomsky, 1965, Chomsky, 1980). One such class of generalizations concerns the hierarchical phrase structure of language: children appear to favor hierarchical rules that operate on grammatical constructs such as phrases and clauses over linear rules that operate only on the sequence of words, even in the apparent absence of direct evidence supporting this preference. Such a preference, in the absence of direct supporting evidence, may suggest that human learners innately know a deep organizing principle of natural language, that syntax is organized in terms of hierarchical phrase structures.

In outline form, this is one version of the “Poverty of the Stimulus” (or PoS) argument for innate knowledge. It is a classic move in cognitive science, but in some version this style of reasoning is as old as the Western philosophical tradition. Plato’s argument for innate principles of geometry or morality, Leibniz’s argument for an innate ability to understand necessary truths, and Kant’s argument for an innate spatiotemporal ordering of experience all infer the prior existence of certain mental capacities from an apparent absence of support for acquiring them through learning.

Our goal in this paper is to reevaluate the modern PoS argument for innate language-specific knowledge by formalizing the problem of language acquisition within a Bayesian framework for rational inductive inference. We consider an ideal learner who comes equipped with two powerful but domain-general capacities. First, the learner has the capacity to represent structured grammars of various forms, including hierarchical phrase-structure grammars and various alternatives. Second, the learner has access to a Bayesian engine for statistical inference that can operate over these structured grammatical representations and compute their relative probabilities given observed data. We will argue that a certain core aspect of linguistic knowledge – that syntactic representations are organized in terms of hierarchical phrase structure – can be inferred by a learner with these capabilities but without a language-specific innate bias favoring this conclusion.
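In outline, and using generic notation for illustration rather than the specific formulation developed later in the paper, the learner’s inference can be expressed as a comparison of posterior probabilities over candidate grammars:

P(G | D) = P(D | G) P(G) / P(D) ∝ P(D | G) P(G)

Here G ranges over candidate grammars of whatever representational type, D is the observed corpus, the prior P(G) encodes domain-general preferences such as a bias toward simpler grammars, and the likelihood P(D | G) measures how well G accounts for the corpus; the learner favors the grammar, and by extension the representational framework, with the higher posterior probability.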

There have been many different framings of stimulus poverty questions over the years, and ours differs from both Chomsky’s original framing and recent alternatives in some subtle ways that we will clarify over the course of this article. Berwick and Chomsky (2008, submitted for publication) have argued that much recent work on the poverty of the stimulus misses the original intention of the argument in generative linguistic theory. This may be true; it is certainly not for us to debate with Chomsky the original intentions of generative theory. Yet the stimulus poverty debate has taken on a larger life of its own in cognitive science more generally, and our goal here is to explore what we see as a basic issue at the heart of language learning – the origins of hierarchical phrase structure in syntactic representation – as an instance of the more general question of what kinds of structure must be innate in cognitive development. In our view, the argument about innateness is primarily about the role of domain-specificity in the learner’s innate endowment. Because language acquisition presents a problem of induction, it is clear that learners must have some constraints limiting the hypotheses they consider. The question is whether a certain feature of language – such as hierarchical phrase structure in syntax – must be assumed to be specified innately as part of a language-specific “acquisition device”, rather than derived from more general-purpose representational capacities and inductive biases.

Note also that our focus is on the issue of what kind of knowledge must be assumed as an innate constraint on the learner’s inductive hypotheses, rather than on what kind of representational machinery must be available to the learner. We are not arguing that a learner lacking a potential to represent hierarchical phrase structures can somehow acquire this potential; we accept here for the sake of argument the traditional view that a learning system can only learn structures that it is capable of representing. The question is whether a learner who is capable of representing grammars based on hierarchical phrase structure, as well as other kinds of structure, can infer that hierarchical phrase structure is indeed the best way to describe natural language syntax – without requiring specific innate knowledge that language is structured in this way. Some traditional nativist arguments equate these ideas: prior knowledge concerning the hypotheses that the learner considers takes the form of limitations on the class of hypotheses that the learner is capable of representing. This assumption makes sense if the learning mechanism is very simple – if learners can only select hypotheses based on their consistency with the observed data. By positing a more powerful Bayesian learning engine, we are able to relax this assumption and study how learners can select from among multiple a priori possible representational frameworks the one that best describes the data they observe – for instance, between regular grammars and context-free grammars, where the latter more naturally captures hierarchical phrase structure in syntax.
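To make this representational contrast concrete, here is a small illustrative sketch in Python (the rules and vocabulary are toy inventions for exposition, not the grammars evaluated in this paper): a context-free grammar whose phrases can nest inside one another, alongside a right-linear (regular) grammar that can only chain categories from left to right.

import random

# Toy context-free grammar: phrases may contain other phrases, so derivations
# are hierarchically nested trees. (Illustrative only.)
CFG = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["Det", "N", "PP"]],   # an NP may contain a PP ...
    "PP":  [["P", "NP"]],                        # ... which contains another NP
    "VP":  [["V"], ["V", "NP"]],
    "Det": [["the"]],
    "N":   [["dog"], ["park"], ["man"]],
    "V":   [["sleeps"], ["sees"]],
    "P":   [["in"]],
}

# Toy right-linear (regular) grammar: each rule emits a word and passes control
# to at most one following state, so structure is a flat left-to-right chain.
REGULAR = {
    "S":  [["the", "N1"]],
    "N1": [["dog", "V1"], ["man", "V1"], ["park", "V1"]],
    "V1": [["sleeps"], ["sees", "S"]],
}

def generate(symbol, rules):
    """Recursively expand a symbol; anything without a rule is a terminal word."""
    if symbol not in rules:
        return [symbol]
    expansion = random.choice(rules[symbol])
    return [word for part in expansion for word in generate(part, rules)]

print(" ".join(generate("S", CFG)))      # e.g. "the dog in the park sees the man"
print(" ".join(generate("S", REGULAR)))  # e.g. "the man sees the dog sleeps"

The same generator runs over both rule sets; only the context-free grammar yields derivations in which one phrase is embedded inside another.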

We introduce PoS arguments in the context of a specific example that has sparked many discussions of innateness, from Chomsky’s original discussions to present-day debates (Laurence and Margolis, 2001, Legate and Yang, 2002, Lewis and Elman, 2001, Pullum and Scholz, 2002, Reali and Christiansen, 2005): the phenomenon of auxiliary fronting in constructing English interrogative sentences. We begin by introducing this example and then lay out the abstract logic of the PoS argument of which this example is a special case. This logic will motivate the form of our Bayesian analysis, but our focus is on one of the abstract questions that emerged from the original example: the learnability of hierarchical phrase structure.

Before moving into the argument itself, we should highlight and clarify two aspects of our approach that contrast with other recent analyses of PoS arguments in language, and analyses of auxiliary-fronting in particular (Laurence and Margolis, 2001, Legate and Yang, 2002, Lewis and Elman, 2001, Pullum and Scholz, 2002, Reali and Christiansen, 2005). First, our analysis should not be seen as an attempt to explain the learnability of auxiliary fronting (or any specific linguistic rule) per se. Rather, the goal is to explore how and whether learners can infer deeper and more abstract principles of linguistic structure, such as the hierarchical phrase-structure basis for syntax. This principle (in conjunction with many other aspects of linguistic knowledge) supports an entire class of specific generalizations that include the auxiliary-fronting rule but also many other phenomena surrounding agreement, movement, and extraction. We take as data a corpus of child-directed speech and evaluate hypotheses about candidate grammars that could account for the corpus as a whole. Our findings suggest that it is vital to consider the learnability of entire candidate grammars holistically. While crucial data that would independently support any one generalization (such as the auxiliary-fronting rule) may be very sparse or even nonexistent, there may be extensive data supporting other, related generalizations; this can bias a rational learner towards making the correct inferences about the cases for which the data is very sparse. To put this point another way, while it may be sensible to ask what a rational learner can infer about language as a whole without any language-specific biases, it is less sensible to ask what a rational learner can infer about any single specific linguistic rule (such as auxiliary-fronting). The need to acquire a whole system of linguistic rules together imposes constraints among the rules, so that an a priori unbiased learner may acquire constraints that are based on the other linguistic rules it must learn at the same time.

Second, our approach offers a way to tease apart two fundamental dimensions of linguistic knowledge that are often conflated in the language acquisition literature. The question of whether human learners have (innate) language-specific knowledge is logically separable from the question of whether and to what extent human linguistic knowledge is based on structured symbolic representations like generative phrase-structure grammars. Different approaches to language acquisition correspond to different answers to these questions, which we can visualize along a two-dimensional space of possibilities (Fig. 1). However, the best known approaches have explored only two corners in this space: domain-general learning accounts in the emergentist tradition, which seeks to explain language as arising from non-linguistic cognitive bases, have been studied primarily using simple representations that avoid explicit symbolic structure, such as n-grams or recurrent neural networks (e.g., Elman et al., 1996, Reali and Christiansen, 2005, Rumelhart and McClelland, 1986). By contrast, structured symbolic representations have been explored primarily in the context of accounts based on innate language-specific knowledge that largely eschew general-purpose learning mechanisms (e.g., Chomsky, 1965, Chomsky, 1980, Pinker, 1984). Few cognitive scientists have explored the possibility that explicitly structured mental representations might be constructed or learned via domain-general learning mechanisms. Despite this, there are compelling reasons to believe that the human mind has available both powerful general-purpose learning abilities and powerful representational capacities. Our framework offers a way to explore this relatively uncharted territory in the context of language acquisition. We will argue that domain-general learning of structured symbolic representations provides a valuable way to think about aspects of language acquisition (and potentially other areas of cognitive development) where data are sparse but the learner’s generalizations are rich.

We hope that the position we lay out here may help to bridge the two more standard diagonally-opposed views in Fig. 1. For an emergentist audience, we suggest that one may retain the core of emergentism – namely the focus on domain-general bases of language – while considering explicitly structured representations. Such a broadening of scope follows naturally from the observation that structured representations are themselves domain-general – family trees, organizational hierarchies, and plan hierarchies all rely on representations similar in some ways to those of language, but used for very different purposes. Moreover, we argue that domain-general learning mechanisms may suffice to determine what form of structured representation provides the best account of the child’s linguistic input. Thus, our approach is fundamentally consistent in spirit with an emergentist view, even though the representations we consider are different from those traditionally adopted by many emergentists. At the same time, for a nativist audience, we hope that our adoption of structured representations helps to cast the claims of domain-generality in terms that are more recognizable and more obviously relevant to traditional analyses of what cognitive structures must be innate based on the linguistic input children receive and the final knowledge state they achieve.

At the core of modern linguistics is the insight that sentences, although they might appear to be simply linear sequences of words or sounds, are built up in a hierarchical fashion from phrases nested in tree structures (Chomsky, 1965, Chomsky, 1980). The rules of syntax are defined over linguistic elements corresponding to phrases that can be represented hierarchically with respect to one another in a tree structure: for instance, a noun phrase might itself contain a prepositional phrase or another noun phrase. Henceforth, when we say that “language has hierarchical phrase structure” we mean, more precisely, that the basic representations over which syntactic rules operate are defined in terms of abstract phrases which may be nested hierarchically in arbitrary tree-structured topologies, and do not simply consist of linear sequences of words or linearly branching phrases (i.e., purely right-branching or purely left-branching structures).1 Is the knowledge that language is organized in this way innate? In other words, is it a part of the initial state of the language acquisition system and thus a necessary feature of any possible hypothesis that the learner will consider?

A similar question has been the target of stimulus poverty arguments in the context of a number of different syntactic phenomena, but perhaps most famously auxiliary-fronted interrogatives in English (Chomsky, 1965, Chomsky, 1971, Chomsky, 1980, Crain, 1991, Crain and Nakayama, 1987, Laurence and Margolis, 2001, Legate and Yang, 2002, Lewis and Elman, 2001, Pullum and Scholz, 2002, Reali and Christiansen, 2005). Different authors have framed this challenge in different ways, so we first lay out the classic analysis of auxiliary fronting and then discuss variants, including ours.

There appears to be a strong structural regularity in English, relating simple declaratives like (1a) and (2a) with corresponding interrogative forms (1b) and (2b):

A traditional way to describe this regularity is in terms of “movement”: between corresponding declarative and interrogative forms, the auxiliary verbs was and is in (1a) and (2a) appear to move to the front of the sentences in (1b) and (2b). A language learner who grasps this regularity could extend it to produce and comprehend an infinite range of new utterances. These new cases may be more complex than the simple cases above, yet the extension of this syntactic pattern still appears straightforward:

Consider (4a), however, in which the declarative form contains two identical auxiliary verbs.

(4)
a. The man who is hungry is ordering dinner.
b. Is the man who hungry is ordering dinner?
c. Is the man who is hungry ordering dinner?

Which is the correct way to form the corresponding interrogative: (4b), in which the first is from (4a) appears to move to the front, or (4c), in which the second is appears to move? Such cases suggest that there is not a unique logical way to characterize the relation between the simple declarative and interrogative forms in (1) and (2). Any regularity is an inductive inference, and there will be different ways of analyzing these sentences as linguistic objects that make different inductive hypotheses more or less natural.

This ambiguity presents a challenge for the language learner. A learner who analyzes sentences as linear structures of words, and who assumes that any linguistic rules would be consistent with that underlying structure, might characterize the patterns in (1) and (2) as something like (5), while one who analyzes sentences in terms of hierarchical structures of phrases might characterize these same patterns more like (6):

(5) Linear rule: to form the interrogative, move the leftmost (first) auxiliary in the declarative to the front of the sentence.
(6) Hierarchical rule: to form the interrogative, move the auxiliary that follows the subject noun phrase of the declarative (the main-clause auxiliary) to the front of the sentence.

These two ways of describing the observed patterns suggest different inductive generalizations for complex utterances with two or more occurrences of the same auxiliary: the inference in (5) suggests that (4b) would be correct, while the inference in (6) suggests that (4c) would be correct. We know that only (4c) is acceptable in English, and that the actual grammar of English follows rules that are more like (6) than (5), but how is a child to know which inference is correct?

One possibility is that simple observation could show the child that (5) is wrong and (6) is (more or less) right. If children learning language hear a sufficient sample of sentences like (4c) and few or no sentences like (4b), they might reasonably infer that English follows the pattern in (6) rather than (5). The poverty of the stimulus argument turns on precisely this point. It has been argued that complex interrogative sentences such as (4c) do not exist in sufficient quantity in child-directed speech to make this inference. For instance, Chomsky (1971) suggests that “it is quite possible for a person to go through life without having heard any of the relevant examples that would choose between the two principles.” In spite of this paucity of evidence, children three to five years old can form correct complex interrogative sentences like (4c) but appear not to produce incorrect forms such as (4b) (Crain & Nakayama, 1987; but see also Ambridge, Rowland, & Pine, 2008).

Another possibility is that the generalization expressed by (6) is somehow a priori simpler or better than that in (5). But it is hard to see how to justify such a preference, at least if one does not assume a priori that language has hierarchical phrase structure. If anything, a general-purpose learning agent who knows nothing specifically about human natural languages might take (5) to be the simpler induction, because it does not assume the existence of hidden objects (e.g., syntactic phrases) structured according to some unobservable relations (e.g., hierarchical phrase structures). If the correct generalization is not directly indicated by the data and is also not preferred on the grounds of a general inductive bias favoring simplicity, a natural conclusion is that children come equipped with some innate constraint or knowledge that biases them to induce the correct generalization rather than the incorrect one.

What is the nature of that innate bias? In the most famous framing of this argument, which has led to decades of intense controversy among a broad range of researchers, Chomsky (1980) appeared to suggest that it is an innate restriction on the kinds of representations that the language faculty can consider. We quote at some length from one of Chomsky’s most accessible statements of this argument:

The issue is, in this case, do we look at sentences in a linear or a hierarchical manner in order to carry out the induction? … There are cases in which people deal with properties like leftmost (they may regard an array of elements as linear and consider the physical arrangement of the elements), whereas there are other cases where people take into account all kinds of hierarchical structures in visual space or whatever. What we have to ask is what is the property in the initial state S0 [of the language faculty] that forces us, in this specific linguistic case, always to go to the hierarchical abstract rule and always to neglect the more elementary linear physical rule? Several answers have been proposed to this question: the right one, I think, is the one which is implicit in the theory of transformational grammar, which in effect asserts that there is a notation available for describing linguistic rules that does not permit the formulation of the property leftmost… There is a very specific theory of representations in terms of which follows the first noun-phrase is a more elementary property than leftmost; but that happens to be a property of this specific concrete theory and not a consequence of any general theory of representations and structures. Of course this property has many consequences elsewhere; it has vast consequences for grammar, where, applied to other linguistic structures, the use of the category leftmost… should always be less accessible than properties like follows the first noun-phrase. This hypothesis is one that is rich in empirical consequences and to my knowledge true. (Chomsky, 1980, pp. 115–116)

Although this argument for innate language-specific constraints on syntactic rules is clearly stated, its interpretation is subtle and depends on what sort of rules we take as the basis for syntax, or the focus of interest for a cognitive theory of language. Generative linguistic theorists have long been interested in auxiliary-fronting as an example of syntactic movement. To explain these phenomena, they posit rules like those expressed informally in (5) and (6) that invoke explicit “fronting” or “raising” of words or phrases, as part of what a speaker knows when they grasp the structure of complex utterances such as (4a) or (4c). We, along with many cognitive scientists, are less sure about whether explicit movement rules provide the right framework for representing people’s knowledge of language, and specifically whether they are the best way to account for how a child comes to understand and produce utterances like (4c) but not like (4b). We agree with the more general insight from linguistic theory, however, that only by defining syntactic rules (in whatever form they take) over hierarchical phrase structure representations is a child likely to be able to understand that (4c) expresses a certain complex thought while (4b) expresses no well-formed thought. Hence our focus here is on the more basic question of how a learner can come to know that language should be represented in terms of hierarchical phrase structure.

Our goal in this paper is to show that a disposition to represent syntax in terms of hierarchical phrase structure rather than linear structures need not be innately specified as part of the language faculty, but instead could be inferred using domain-general learning and representational capacities. The basis for this inference is implicitly anticipated by Chomsky’s characterization of the phrase-structure hypothesis, in the lines quoted above, as “rich in empirical consequences” throughout language, not just for a single linguistic structure. While a child may not receive direct evidence about the correctness of a particular hierarchical phrase structure rule for analyzing some particular set of sentences such as the aux-fronting examples, there is vast indirect evidence for the general superiority of syntax with that structure throughout language. A learner who adopts a hierarchical phrase structure framework for describing the syntax of English will arrive at a much simpler, more explanatory account of her observations than a learner who adopts a linear framework.

We formalize this argument in Bayesian terms, where the “simpler, more explanatory” account becomes the more probable hypothesis. Linguists in the generative grammar tradition came to this inference early on. When one looks at the structure of natural language and considers the possibility of a framework with hierarchical phrase structure as opposed to a linear description, the superiority of the formal system quickly becomes apparent due to its “rich empirical consequences.” Indeed, Berwick and Chomsky (2008, submitted for publication) have suggested recently that hierarchical phrase structure in syntactic representations was always taken for granted by generative theorists – but it is nonetheless a significant inductive leap. As Chomsky observes in the quotation above, many other domains of human activity besides language unfold sequentially. In some cases these activities are structured based on linear order properties, while in others they are not. A learner or a linguist at some point must decide to view language in one or the other of these ways, even if that decision occurs quickly, unconsciously and automatically. Our Bayesian analysis could apply just as well to formalizing this inductive leap inside the minds of human learners or linguists. Where our proposal differs from the standard view in generative linguistics is in the suggestion that children may receive sufficient language data to make this inference to hierarchical phrase structure, and hence need not have this assumption given as part of the innate state of a language acquisition device. We are not arguing that children necessarily do learn about the hierarchical phrase structure of syntax in this way, but rather that there exists a plausible learning framework which could allow them to do so from the data they observe.

The PoS argument is, of course, not merely a point about auxiliary fronting in interrogative formation. We can formulate the general PoS argument in more precise and abstract terms as follows:

(7)
(i) Children show a particular pattern of linguistic behavior B.
(ii) Producing B requires having made a particular generalization G.
(iii) The data D available to children are not sufficient, on their own, to single out G from among simpler alternative generalizations that are also consistent with D.
Therefore, the learner must bring to the task some more abstract knowledge T that constrains the generalizations considered and favors G over the alternatives.

This form of the PoS argument, also shown schematically in Fig. 2, is applicable to a variety of domains and datasets. Unlike other standard treatments (Laurence and Margolis, 2001, Pullum and Scholz, 2002), it makes explicit the distinction between multiple levels of knowledge (G and T); this distinction is necessary to see what is really at stake in arguments about innateness in language and other cognitive domains. In the case of auxiliary fronting, the specific generalization G refers to the hierarchical rule (6) that governs the formation of interrogative sentences. The learning challenge is to explain how children come to produce only the correct forms for complex interrogatives (B), apparently following a rule like (6), when the data they observe (D) comprise only simple interrogatives (such as “Is the man hungry?”) that do not discriminate between the correct generalization and simpler but incorrect alternatives such as (5).

But the interesting claim of innateness here is not about the rule for producing interrogatives (G) per se; rather, it concerns some more abstract knowledge T. Note that nothing in the logical structure of the argument requires that T be specific to the domain of language – constraints due to domain-general processing, memory, or learning factors could also limit which generalizations are considered. Nevertheless, many versions of the PoS argument assume that T is language-specific: in particular, that T is the knowledge that linguistic rules are defined over hierarchical phrase structures. This knowledge constrains the specific rules of grammar that children may posit and therefore licenses the inference to G. Constraints on grammatical generalizations at the level of T may be seen as one aspect of, or as playing the role of, “universal grammar” (Chomsky, 1965).

An advantage of this logical schema is to clarify that the correct conclusion given the premises is not that the higher-level knowledge T is innate – only that it is necessary. The following corollary is required to conclude that T is innate:

(8)
(i) Some abstract knowledge T is necessary to explain how children arrive at the correct generalization G given the data D.
(ii) T itself could not have been acquired from the data by domain-general learning operating over domain-general representations.
Therefore, T must be innate.

Given this schema, our argument here can be construed in two different ways. On one view, we are arguing against premise (8.ii); we suggest that the abstract linguistic knowledge T – that language has hierarchical phrase structure – might be learnable using domain-general mechanisms and representational machinery. Given some observed data D, we evaluate knowledge at both levels (T and G) together by drawing on the methods of hierarchical Bayesian models and Bayesian model selection (Gelman, Carlin, Stern, & Rubin, 2004). Interestingly, our results suggest that less data is required to learn T than to learn the specific grammar G.
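Schematically, and again in generic notation rather than any formulation specific to this paper, evaluating both levels together amounts to scoring a grammar type T and a specific grammar G jointly against the data, and scoring T on its own by summing over the specific grammars it makes available:

P(T, G | D) ∝ P(D | G) P(G | T) P(T)

P(T | D) ∝ P(T) Σ_G P(D | G) P(G | T)

Here P(G | T) reflects how naturally a particular grammar can be stated within a given representational framework (for example, regular versus context-free), so evidence for T can accumulate across many constructions at once even when the data bearing on any single rule within G are sparse.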

On another view, we are not arguing with the form of the PoS argument, but merely clarifying what content the knowledge T must have. We argue that phenomena such as children’s ability to correctly front the auxiliary in polar interrogatives are not sufficient to require that the innate knowledge constraining generalization in language acquisition be language-specific. Rather it could be based on more general-purpose systems of representation and inductive biases that favor the construction of simpler representations over more complex ones.

Other critiques of the innateness claim dispute the three premises of the original argument, arguing either:

(9)
(i) that children do not in fact show the behavior B;
(ii) that showing the behavior B does not require having made the generalization G; or
(iii) that the data D are in fact sufficient to support the inference to G.

In the case of auxiliary fronting, one example of the first response (9.i) is the claim that children do not in fact always avoid errors that would be best explained under a linear rule rather than a hierarchical rule. Although Crain and Nakayama (1987) demonstrated that children do not spontaneously form incorrect complex interrogatives such as (4b), they make other mistakes that are not so easily interpretable. For instance, a child might utter a sentence like “Is the man who is hungry is ordering dinner?”, which is not immediately compatible with the correct hierarchical phrase-structure grammar but might be consistent with a linear rule. Additionally, recent research by Ambridge et al. (2008) suggests that 6–7 year-old children presented with auxiliaries other than is do indeed occasionally form incorrect sentences like (4b), such as “Can the boy who run fast can jump high?”

A different response (9.iii) accepts that children have inferred the correct hierarchical rule for auxiliary fronting (6), but maintains that the input data is sufficient to support this inference. If children observe sufficiently many complex interrogative sentences like (4c) while observing no sentences like (4b), then perhaps they could learn directly that the hierarchical rule (6) is correct, or at least better supported than simple linear alternatives. The force of this response depends on how many sentences like (4c) children actually hear. While it is an exaggeration to say that there are no complex interrogatives in typical child-directed speech, they are certainly rare: Legate and Yang (2002) estimate based on two CHILDES corpora2 that between 0.045% and 0.068% of all sentences are complex interrogative forms. Is this enough? Unfortunately, in the absence of a specific learning mechanism, it is difficult to develop an objective standard about what would constitute “enough.” Legate and Yang attempt to establish one by comparing how much evidence is needed to learn other generalizations that are acquired at around the same age; they conclude on this basis that the evidence is probably insufficient. However, such a comparison overlooks the role of indirect evidence, which has been suggested to contribute to learning in a variety of contexts (Foraker et al., 2009, Landauer and Dumais, 1997, Reali and Christiansen, 2005, Regier and Gahl, 2004).

Indirect evidence also plays a role in the second type of reply, (9.ii), which is probably the most currently popular line of response to the PoS argument. The claim is that children could still show the correct pattern of linguistic behavior – acceptance or production of sentences like (4c) but not (4b) – even without having learned any grammatical rules like (5) or (6) at all. Perhaps the data, while poor with respect to complex interrogative forms, are rich in distributional and statistical regularities that would distinguish (4c) from (4b). If children pick up on these regularities, that could be sufficient to explain why they avoid incorrect complex interrogative sentences like (4b), without any need to posit the kinds of grammatical rules that others have claimed to be essential (Redington et al., 1998, Lewis and Elman, 2001, Reali and Christiansen, 2004, Reali and Christiansen, 2005).

For instance, Lewis and Elman (2001) trained a simple recurrent network to produce sequences generated by an artificial grammar that contained sentences of the form AUX NP ADJ? and Ai NP Bi, where Ai and Bi stand for inputs of random content and length. They found that the trained network predicted sentences like “Is the boy who is smoking hungry?” with higher probability than similar but incorrect sequences, despite never having received that type of sentence as input. In related work, Reali and Christiansen (2005) showed that the statistics of actual child-directed speech support such predictions (though see Kam, Stoyneshka, Tornyova, Fodor, and Sakas (2008) for a critique). They demonstrated that simple bigram and trigram models applied to a corpus of child-directed speech gave higher likelihood to correct complex interrogatives than to incorrect interrogatives, and that the n-gram models correctly classified the grammaticality of 96% of test sentences like (4b) and (4c). They also argued that simple recurrent networks could distinguish grammatical from ungrammatical test sentences because they were able to pick up on the implicit statistical regularities between lexical classes in the corpus.
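The flavor of this kind of distributional evidence can be conveyed with a minimal sketch (not Reali and Christiansen’s actual implementation; the toy corpus and the add-one smoothing are assumptions made purely for illustration). A bigram model trained only on simple sentences still assigns a higher probability to the grammatical complex interrogative than to the ungrammatical one, because it has encountered fragments such as “who is” but never fragments such as “who hungry”.

import math
from collections import Counter

# Toy corpus of simple declaratives and interrogatives (illustrative only; not
# the child-directed corpus analyzed in the paper).
corpus = [
    "the man is hungry", "is the man hungry",
    "the boy who is smiling is happy", "the dog that is barking is loud",
    "is the boy happy", "is the dog loud",
]

bigrams, contexts = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    contexts.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

vocab = {w for s in corpus for w in s.split()} | {"<s>", "</s>"}

def log_prob(sentence):
    """Add-one-smoothed bigram log probability of a word sequence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log((bigrams[(a, b)] + 1) / (contexts[a] + len(vocab)))
               for a, b in zip(words[:-1], words[1:]))

# Neither test sentence occurs in the corpus, but the grammatical form scores
# higher because its local transitions ("who is", "is hungry") have been observed.
print(log_prob("is the man who is hungry ordering dinner"))   # less negative (preferred)
print(log_prob("is the man who hungry is ordering dinner"))   # more negative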

Though these statistical-learning responses to the PoS argument are important and interesting, they have two significant disadvantages. First, the behavior of connectionist models tends to be difficult to understand analytically. For instance, the networks used by Reali and Christiansen (2005) and Lewis and Elman (2001) measure success by whether they predict the next word in a sequence or by comparing the prediction error for grammatical and ungrammatical sentences. These networks lack not only a grammar-like representation but any kind of explicitly articulated representation of the knowledge they have learned. It is thus difficult to say what exactly they have learned about linguistic structure – despite their interesting linguistic behavior once trained.

Second, by denying that explicit structured representations play an important role in children’s linguistic knowledge, these statistical-learning models fail to engage with the motivation at the heart of the PoS arguments and much of contemporary linguistics. PoS arguments begin with the assumption – taken by most linguists as self-evident – that language does have explicit hierarchical phrase structure, and that linguistic knowledge must at some level be based on representations of syntactic categories and phrases that are hierarchically organized within sentences. The PoS arguments are about whether and to what extent children’s knowledge about this structure is learned via domain-general mechanisms, or is innate in some language-specific system. Critiques based on the premise that this explicit structure is not represented as such in the minds of language users do not really address this argument – although they may be valuable in their own right by calling into question the broader assumption that linguistic knowledge is structured and symbolic. Our work here is premised on taking seriously the claim that knowledge of language is based on structured symbolic representations. We can then investigate whether the principle that these linguistic representations are hierarchically organized might be learned. We do not claim that linguistic representations must have explicit structure, but assuming such a representation allows us to engage with PoS arguments more directly on their own terms.

One place where our analysis does make a significant simplification (relative to the standard linguistic treatment of aux-fronting and related phenomena) is that we – like Reali and Christiansen (2005) and Lewis and Elman (2001) – do not attempt to explain these phenomena in terms of movement or transformation rules. Chomsky’s (1980) formulation of “linear” and “hierarchical” hypotheses for forming complex interrogatives, (5) and (6) respectively, framed these as alternative rules for moving elements of a base declarative form, and the question was whether these rules should be defined over a hierarchical phrase-structure analysis or the linear sequence of words in the declarative form. Instead, as we explain above, we focus on the more basic question of how and whether a learner could infer that representations with hierarchical phrase structure provide the best way to characterize the set of syntactic forms found in a language. We see this question as the simplest way to get at the essence of the core inductive problem of language acquisition posed in the generative tradition. It is also relevant to the original phenomenon we began with: the best hierarchically phrase-structured grammars we find do indeed generate correct aux-fronted complex interrogative forms like (4c) and not incorrect forms like (4b), while the best linear grammars do not. An important direction for future work would be to link our learnability analysis more tightly to standard syntactic analyses, by extending it to grammars based on explicit movement rules or other means to the same end. Even without doing so, it seems a reasonable premise that any such extension would naturally involve rules defined over the constituents of the grammar, and thus the identification of those constituents – the problem we address here – is important and relevant: if the hierarchical nature of phrase structure can be inferred, then any reasonable approach to inducing rules defined over constituent structure should result in appropriate structure-dependent rules.

We present two main results. First, we demonstrate that a learner equipped with the capacity to explicitly represent both linear and hierarchical phrase-structure grammars – but without any initial bias to prefer either in the domain of language – can infer that the hierarchical phrase-structure grammar is a better fit to typical child-directed input, even on the basis of as little as a few hours of conversation. Our results suggest that at least in this particular case, it may be possible to acquire domain-specific knowledge about the form of structured representations via domain-general learning mechanisms operating on data from that domain. Second, we show that the hierarchical phrase-structure grammar favored by the model – unlike the other grammars it considers – succeeds in one important auxiliary fronting task, even when no direct evidence to that effect is available in the input data. This second point is simply a byproduct of the main result, but it provides a valuable connection to the literature and makes concrete the benefits of learning abstract linguistic principles.

These results emerge because an ideal learner must trade off simplicity and goodness-of-fit in evaluating hypotheses. The notion that inductive learning should be constrained by a preference for simplicity is widely shared among scientists, philosophers of science, and linguists. Chomsky himself concluded, on the basis of informal simplicity considerations, that natural language is not finite-state (1956, 1957), and suggested that human learners rely on an evaluation procedure that incorporates simplicity constraints (1965). The tradeoff between simplicity and goodness-of-fit can be understood in domain-general terms. Consider the hypothetical data set illustrated in Fig. 3. We imagine that data is generated by processes occupying different subsets of space. Models correspond to different theories about which subset of space the data is drawn from; three are shown in A–C. These models fit the data increasingly precisely, but they attain this precision at the cost of additional complexity. Intuitively, the model in B appears to offer the optimal balance, and this intuition can be formalized mathematically using techniques sometimes known as the Bayesian Occam’s Razor (e.g., MacKay, 2003). In a similar way, we will argue, a hierarchical phrase-structure grammar yields a better tradeoff than linear grammars between simplicity of the grammar and fit to typical child-directed speech.
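In Bayesian terms (stated here in generic form, as an illustration of the tradeoff rather than the paper’s specific scoring function), the two pressures correspond to the two terms of the log posterior score:

log P(G | D) = log P(G) + log P(D | G) + constant

If the prior penalizes a grammar’s description length, then a grammar with more symbols and productions receives a lower log P(G); conversely, a grammar loose enough to generate very many word sequences must spread its probability mass thinly, lowering log P(D | G) for the sentences actually observed. The preferred grammar is the one that best balances the two terms, as model B does in Fig. 3.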

Though our findings suggest that the specific feature of hierarchical phrase structure can be learned without an innate language-specific bias, we do not argue that all interesting aspects of language will have this characteristic. Because our approach combines structured representation and statistical inductive inference, it provides a method to investigate the unexplored regions of Fig. 1 for a wide range of other linguistic phenomena, as has recently been studied in other domains (e.g., Griffiths et al., 2004, Kemp and Tenenbaum, 2008, Yuille and Kersten, 2006).

One finding of our work is that it may require less data to learn a higher-order principle T – such as the hierarchical nature of linguistic rules – than to learn every correct generalization G at a lower level, e.g., every specific rule of English. Though our model does not explicitly use inferences about the higher-order knowledge T to constrain inferences about specific generalizations G, in theory T could provide effective and early-available constraints on G, even if T is not itself innately specified. In the discussion, we will consider what drives this perhaps counterintuitive result and discuss its implications for language acquisition and cognitive development more generally.

Section snippets

Method

We cast the problem of grammar induction within a hierarchical Bayesian framework

Results

The posterior probability of a grammar G is the product of the likelihood and the prior. All scores are presented as log probabilities and thus are negative; smaller absolute values correspond to higher probabilities.
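For concreteness, with purely hypothetical numbers: a grammar scoring −10,000 is preferred to one scoring −10,100, since a difference of 100 in log probability corresponds to a factor of e^100 (roughly 10^43) in relative probability.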

Discussion

Our model of language learning suggests that there may be sufficient evidence in the input for an ideal rational learner to conclude that language has hierarchical phrase structure without having an innate language-specific bias to do so. The best-performing grammars form grammatically correct English interrogatives, even though the input contains none of the crucial data Chomsky identified. In this discussion, we consider the implications of these results for more general questions of

Conclusion

We have demonstrated that an ideal learner equipped with the resources to represent a range of symbolic grammars that differ qualitatively in structure, as well as the ability to find the best fitting grammars of various types according to a Bayesian score, can in principle infer the appropriateness of hierarchical phrase-structure grammars without the need for innate language-specific biases to that effect. If an ideal learner can make this inference from actual child-directed speech, it is

Acknowledgements

For many helpful discussions, we thank Virginia Savova, Jeff Elman, Jay McClelland, Steven Pinker, Lila Gleitman, Tim O’Donnell, Adam Albright, Robert Berwick, Cedric Boeckx, Edward Gibson, Ken Wexler, Danny Fox, Fei Xu, Michael Frank, Morten Christiansen, Daniel Everett, Noah Goodman, Vikash Mansinghka, Charles Kemp, and Susanne Gahl. Special thanks to Virginia Savova for hand-parsing the Eve corpus and Gideon Borensztajn for sharing the constituency-annotated parses for that corpus. This work

References (90)

  • S. Pinker et al. The faculty of language: What’s special about it? Cognition (2005)
  • M. Redington et al. Distributional information: A powerful cue for acquiring syntactic categories. Cognitive Science (1998)
  • T. Regier et al. Learning the unlearnable: The role of missing evidence. Cognition (2004)
  • R. Solomonoff. A formal theory of inductive inference. Information and Control (1964)
  • M. Tomasello. The item based nature of children’s early syntactic development. Trends in Cognitive Sciences (2000)
  • Alishahi, A., & Stevenson, S. (2005). A probabilistic model of early argument structure acquisition. In Proceedings of...
  • B. Ambridge et al. Is structure dependence an innate constraint? New experimental evidence from children’s complex-question production. Cognitive Science (2008)
  • Angluin, D. (1988). Identifying languages from stochastic examples. Tech report, Yale University....
  • J. Anderson. The adaptive nature of human categorization. Psychological Review (1991)
  • R. Berwick. Learning from positive-only examples: The subset principle and three case studies. Machine Learning (1986)
  • Berwick, R., & Chomsky, N. (2008). ‘Poverty of the stimulus’ revisited: Recent challenges reconsidered. In Proceedings...
  • Berwick, R., & Chomsky, N. (submitted for publication). Poverty of the stimulus revisited: Recent challenges...
  • A. Booth et al. Mapping words to the world in infancy: Infants’ expectations for count nouns and adjectives. Journal of Cognition and Development (2003)
  • Borensztajn, G., Zuidema, W., & Bod, R. (2008). Children’s grammars grow more abstract with age - evidence from an...
  • A. Borovsky et al. Language input and semantic categories: A relation between cognition and early word learning. Journal of Child Language (2006)
  • Briscoe, E. (2006). Language learning, power laws, and sexual selection. In 6th international conference on the...
  • R. Brown. A first language: The early stages (1973)
  • E. Charniak. Statistical language learning (1993)
  • N. Chomsky. Three models for the description of language. IRE Transactions on Information Theory (1956)
  • N. Chomsky. Syntactic structures (1957)
  • N. Chomsky. Aspects of the theory of syntax (1965)
  • N. Chomsky. Problems of knowledge and freedom (1971)
  • N. Chomsky. Rules and representations (1980)
  • Clark, A., & Eyraud, R. (2006). Learning auxiliary fronting with grammatical inference. In Proceedings of the 28th...
  • Collins, M. (1999). Head-driven statistical models for natural language parsing. Unpublished doctoral dissertation,...
  • S. Crain et al. Structure dependence in grammar formation. Language (1987)
  • S. Crain. Language acquisition in the absence of experience. Behavioral and Brain Sciences (1991)
  • Dowman, M. (1998). A cross-linguistic computational investigation of the learnability of syntactic, morphosyntactic,...
  • Dowman, M. (2000). Addressing the learnability of verb subcategorizations with Bayesian inference. In Proceedings of...
  • H. Edelsbrunner et al. Edgewise subdivision of a simplex. Discrete Computational Geometry (2000)
  • J. Elman et al. Rethinking innateness: A connectionist perspective on development (1996)
  • D. Everett. Cultural constraints on grammar and cognition in Pirahã: Another look at the design features of human language. Current Anthropology (2005)
  • Feldman, J., Gips, J., Horning, J., & Reder, S. (1969). Grammatical complexity and inference (Tech. Rep. CS-TR-69-125)....
  • S. Foraker et al. Indirect evidence and the poverty of the stimulus: The case of anaphoric one. Cognitive Science (2009)
  • A. Gelman et al. Bayesian data analysis (2004)