It started with a blog post.

And a response:

What ensued was a Twitter discussion sometimes referred to as the 2018 Semantics Megathread.

In the time since, the claim that “language meaning cannot be learned from form alone” has been a subject of interest, debate, and sometimes outright confusion. So Emily M. Bender and Alexander Koller (henceforth B&K) authored a paper in ACL 2020 articulating their argument, entitled Climbing Towards NLU: On Meaning, Form, and Understanding in the Age of Data. It generated a lively discussion on Twitter, in the virtual ACL Rocket Chat, and in the live Q&A sessions during the conference, and won the Best Thematic Paper award for the ACL 2020 Theme, Taking Stock of Where We’ve Been and Where We’re Going.

What can we take away from the paper and ensuing discussion? Fully understanding the claim and its implications requires deconstructing what it means to learn (or understand) meaning—and attitudes toward how to do so may vary. B&K’s core argument (Thesis) is based on a fairly simple idea. Bender has argued that attributing an understanding of meaning to a system which has only seen form (e.g., language text) is a “category error,” form and meaning being two totally different types of things. To some AI researchers, this may seem like a quibble about definitions, so the most common response to B&K’s argument (Antithesis) works within the behavioral testing tradition of AI originally advocated by Turing, which attempts to sidestep such problems. Approaching B&K’s argument and their proposed octopus test strictly from within Turing’s framing, critics have argued that their claim is dubious, narrow, or has few practical implications. I think this argument is valid, and I will flesh it out in detail. But I think what it implies is that we should take a step back from the relatively narrow behavioral testing interpretation of B&K’s claim. When interpreted more carefully (Synthesis), I think B&K’s argument holds a lot of value for AI researchers, both from the scientific perspective and for real system design and use.

Credit for most of the ideas in this post goes of course to Bender & Koller and the vibrant discussion among tweeters and ACL attendees. In my analysis I will draw on quotes, with attribution, from the Rocket Chat instance for the paper (“the chat”), which has unfortunately since been deleted. If your name appears here and you prefer it didn’t, please let me know privately and I will remove it.


Form, Meaning, and Intent

The core claim is that “a system trained only on form has a priori no way to learn meaning” (B&K, Abstract). In the introduction, B&K point out how loose use of words like “understanding” and “comprehension” in recent publications describing machine learning systems (particularly, those trained on traditional language modeling or masked language modeling objectives) leads to mischaracterization and unwarranted hype in popular media. It stands to reason, though they don’t explicitly say so, that this also indicates or facilitates sloppy thinking and argumentation among researchers. We’d be better served by facing the issue head-on to clear our thinking.

In order to articulate their core argument, the authors highlight communicative intent as an important component of language meaning. Crucially, this pertains to language-external goals of communication between people—such as assertions or desires pertaining to the real world (or abstract worlds). They also highlight conventional (or standing) meaning as an intermediary between form and communicative intent, representing what is common in meaning across all contexts of an expression’s use. Conventional meanings, furthermore, must be interpreted with respect to some external system (such as a model theory in the Montagovian tradition of formal semantics, or referents and sensory inputs in the context of Stevan Harnad’s Symbol Grounding) in order to connect to the world and form communicative intents. Furthermore, humans bring a great deal of language-external context to bear in order to disambiguate conventional meanings and infer intents. None of this external context—crucial both to interpretation and grounding of communicative intents—is contained in forms, and meanings are not “contained” in forms in any sense for a system that doesn’t already have mastery of the language. Therefore, they argue, meaning cannot be learned from form alone.

Understanding the Octopus Test

To illustrate their point, B&K propose a thought experiment, which they call the octopus test: two people A and B live alone on remote islands, but communicate with each other in English text messages through a trans-oceanic cable. A hyper-intelligent Octopus O, sitting on the ocean floor, taps the cable and listens in on their conversations for an indeterminate period. Eventually, O gets lonely and decides to interpose on the conversation, cutting off B and impersonating their replies to A. The question is whether O can keep up the charade indefinitely without eventually raising A’s suspicions.

B&K propose several scenarios inviting the reader to reject the idea that O could pass the test:

  • Suppose A constructs a coconut catapult, sends detailed instructions over the wire, and asks for B’s thoughts.
  • Suppose A is suddenly chased by an angry bear, grabs a couple of sticks to defend herself, and asks B for advice.

They ask: will O be able to provide meaningful feedback in either case, having no grounded experience with coconuts, catapults, bear attacks, or any of the physical, social, and embodied processes and situations involved? Their answer is no—that he has no way of reasoning about such situations, since he has never encountered them (or anything like them, or even encountered anything besides linguistic form, for that matter). He would likely be forced to, e.g., regress to vague, high-probability replies instead of meaningful engagement. They make this point more concretely in their appendix, where they run several prompts related to the bear attack and simple arithmetic problems through GPT-2, showing that it does not produce correct or useful responses in either case.

A crucial point about the octopus test and B&K’s proposed scenarios is that they are illustrative, not diagnostic. Their claim that the octopus won’t pass the test is not meant to imply the main thesis—indeed, they state the converse: that “a system that is trained only on form would fail a sufficiently sensitive test, because it lacks the ability to connect its utterances to the world” (Section 3, emphasis mine). Furthermore, the test points to a mere existence claim, that there is some behavioral capability which cannot be recovered from form alone. The test is not precise enough to say which capabilities exactly are lacking, at least when it comes to manipulating form. Instead, it serves as an intuition pump for the kind of phenomena at play. As Bender wrote in the chat,

The point isn’t really whether O could fool A under what circumstances, but rather to use that thought experiment to show what is missing in O’s (and thus modern LM’s) input.

I think it’s important to stress this. There were many questions in the chat about what kind of behavior might be acceptable on the part of the octopus, along the lines of:

  • What if B hadn’t heard of bears before?

  • Aren’t real humans also sometimes vague and unhelpful?

  • Doesn’t misunderstanding and misattribution of meaning happen between humans as well?

  • When O produces a vague response that might otherwise also be produced by B, does this count as a failure?

However, these questions miss the forest for the trees. Ultimately, you can find many ways that B may respond which could be believably replicated by O. But the question of interest is not so much whether O would be competent (or lucky) enough to pass in any particular situation, but why B behaves the way they do. The point is not to establish precise expectations of O, but to highlight the differences between the processes that O and B must be executing, with a particular focus on the context available to B that O is missing.

Subjective Judgments and the ELIZA Effect

This perspective is important for understanding criticisms of the octopus test: some challenged B&K’s claim on the basis of their proposed scenarios, suggesting that GPT-3, for example, can pass them reasonably. Blogger Gwern Branwen ran several of B&K’s suggested prompts through GPT-3 and found that it got arithmetic questions right and, with some cajoling, could produce reasonable-sounding advice for dealing with a bear. However, this also misses the point: again, these were not meant as concrete diagnostics. The bear example in particular is meant to illustrate B’s presumed ability to deal with new concepts, project situations into the future, anticipate A’s needs, and develop creative solutions. Approximately recalling National Park Service advice on bears has little to do with these basic capabilities. In fact, Gwern in some sense made B&K’s point: he had to carefully engineer the prompt in order to get it to produce the kind of material he was looking for. GPT-3 could hardly be considered a helpful interlocutor in this light.

Evaluating a system’s understanding of meaning on the basis of such concrete results presents a conundrum: as B&K point out, humans have a very strong tendency to impute meaning onto language, even if they know the language is produced by simple rules—a phenomenon known as the ELIZA effect. This is at least in part because, as Bender argues in the chat, “every time we perceive linguistic form in a language we are competent in, we also perceive meaning.” So the mere subjective appearance of meaning is not a reliable indicator of a system’s understanding. Indeed, Turing’s Imitation Game was once regarded with great interest as a way of gauging progress towards general intelligence. But when the test was (arguably) passed by systems like Eugene Goostman—designed more around deception than intelligence, and fairly transparent to those who know its strategies—it became ever more clear that quantifiably, objectively evaluating general intelligence is not straightforward.

There’s another side to this coin. The unreliability of subjective judgments can potentially become a grand excuse, allowing a skeptic to rule out any behavior as “not exhibiting true understanding,” an example of the No True Scotsman fallacy. As an example of the complexity of this issue, consider the Winograd Schema Challenge (WSC), proposed to mitigate some of the problems with the Imitation Game while also being “Google-proof,” i.e., not solvable by “simple” statistical methods but requiring actual common-sense knowledge. In 2015, I attended the Symposium on Logical Formalisms of Commonsense Reasoning, which had a panel on the WSC. Leora Morgenstern, one of its proposers, was asked: “what should we make of it if a statistical learning system passes the WSC?” Instead of replying “then such a system must have common-sense knowledge,” she said: “then the WSC is not the test we thought it was.” Morgenstern’s reply indeed proved prescient—as Trichelair et al. demonstrated the existence of artifacts which make it easier than hoped, and recent systems manage to achieve over 90% on the WSC as evaluated in the SuperGLUE benchmark, while still seemingly lacking a general understanding of language meaning, as indicated, e.g., by other tests showing chance performance or reliance on shallow heuristics under careful evaluation. So it makes sense to take quantitative results with a grain of salt: when a system trounces a new test on understanding language meaning, it seems more likely that the test misses something important than that the problem of language meaning is solved. However, to always take this position would close oneself off to all evidence, and a more reasonable position would probably attempt to place such results within a broader spectrum of meaning-associated capabilities, as suggested by Tom Dietterich (we’ll revisit his point later).

These complexities, I believe, are why B&K pose their thought experiment for illustrative and not diagnostic purposes. Instead of providing a new metric for everyone to optimize, their proposals provide opportunities to ask why: in highlighting what O is missing, we gain perspective on the reasons and guiding factors behind B’s behavior, and what kind of knowledge and context B leverages. This class of questions falls under the scientific challenge of the study of language meaning in NLP. I highly recommend On our best behaviour, by Hector J. Levesque, as an exploration of this issue focused on the Winograd Schema Challenge (plus Zhang et al. from this year’s ACL as a bonus follow-up). B&K expound on the issue in Section 8, arguing that a top-down perspective on which problems to solve is crucial for long-term progress in NLU. As Bender wrote in the chat, the broader point is a call “to explore additional hills to climb, as it were, that have contours that suggest they will bring your research closer to the end goal.”

Antithesis: an AI Perspective

The rest of B&K’s paper supports the plausibility of the argument through more illustrative examples and pointers to fields such as language acquisition research and distributional semantics, where researchers are finding that language meaning seems difficult (if not impossible) to learn without grounding, e.g., in joint attention with an interlocutor or in sensory inputs. However, the core of the argument remains the same, and most readers focused on the octopus test. Many weren’t convinced. If the test only requires manipulation of forms, why should it be impossible in principle for O to learn the exact function describing B’s behavior? And if that’s the test to determine if O understands meaning, does it not seem reasonable to say it’s at least possible that he does?

In this section, I will carefully develop this point of view, show why it might be a natural one for an AI researcher to take, and show how this framing reduces B&K’s thesis to a claim that is rather narrow, with few practical implications.

From Octopus Test to Imitation Game

First, we have to address what it means for a machine to “learn” or “understand” something. Similar questions were at the heart of Alan Turing’s seminal paper Computing Machinery and Intelligence, published in 1950. B&K write:

Turing (1950) argued that a machine can be said to “think” if a human judge cannot distinguish it from a human interlocutor after having an arbitrary written conversation with each. (Section 3.2)

I would characterize his argument differently. In fact, Turing said he believes the question “Can machines think?” is “too meaningless to deserve discussion” (p. 442), and instead of trying to define the terms “machine” and “think,” he would “replace the question by another” (p. 443), in particular that of his famous Imitation Game. His point was that scientifically speaking, it doesn’t matter what thinking (or meaning, for that matter) is—what matters is the testable behavior it implies. Turing’s approach has been widely adopted by the AI community.

This raises two issues related to Bender & Koller’s thesis.

  1. Turing’s move would replace the question of whether a system understands language meaning with whether it can exhibit behaviors which hinge on what we think of as language meaning. Crucially, many such tests (including both the Imitation Game and B&K’s octopus test) only require the system to produce linguistic form—any external grounding is implicit in the choice of what to say. It may not seem obvious that a system trained on form alone cannot learn meaning, if the test of having “learned meaning” only requires the system to manipulate forms in a way that is indistinguishable from humans—even if humans themselves mentally ground those forms to representations of the language-external world.

  2. Behaviors exist on a continuum—see Tom Dietterich’s blog post entitled What does it mean for a machine to “understand”?, where he argues for a capabilities-based notion of understanding. In particular, understanding of language meaning may include many facets and be exhibited in a wide variety of behaviors. So the empirical approach to investigating understanding should not treat meaning as a monolith: there are many aspects of meaning and many ways to test against them—some are surely trivial, while some are likely AI-complete. This implies that the proper question with regard to the form/meaning distinction is which capabilities may be acquired from form alone, and which cannot.

In light of these issues, we may choose to view B&K’s thesis as an empirical claim about what is theoretically possible. Monojit Choudhury summarized the argument in the chat as follows:

Systems that only work at the level of forms will never be able to achieve a state of indistinguishability from agents that understand meaning… In other words, there will always be some test, say the octopus test, which will be able to distinguish the form-based system from the grounded meaning based systems (or humans).

Koller agreed and elaborated: “unrestricted communication will eventually expose that [the form-only learner] didn’t learn meaning.” An implicit but important part of the claim is the form restriction: that it applies only to tests of the agent’s ability to manipulate form. Otherwise, it would be trivial: a language model has no API through which to process an image, let alone demonstrate “understanding” of it.

It is important to distinguish this point from John Searle’s Chinese Room Argument, about which questions have come up repeatedly. Searle’s argument is that a machine operating on a set of rules cannot be considered to “think” even if its behavior is indistinguishable from a human’s. This is exactly the kind of question in which Turing had no interest. The octopus test, on the other hand, is meant to show that if such a machine were the product of a learning system, the system’s supervision would need to contain meaning, which is not present in form alone (i.e., available from self-supervised training on large amounts of natural language text).

It is also important to note how weak this claim is: that there exists some test which the system will fail, or that it will eventually be exposed. It is partly due to this essential narrowness of the claim that neither side seems to be able to convince the other with evidence, as we’ll see in the next section.

Out of Task, Out of Mind: Assessing the Evidence

In B&K’s appendix, they present GPT-2 completions of a simple arithmetic problem and some prompts related to the bear chase from the octopus test. The responses generally don’t form coherent bear-fighting advice, and they get the arithmetic wrong (or don’t answer it). B&K argue that this is “because GPT-2 does not know the meaning of the generated sentences, and thus cannot ground them in reality” (Appendix A) and state that simple arithmetic is “beyond the current capability of GPT-2 and, we would argue, any pure LM” (Appendix B).

However, there’s a problem with these experiments, which becomes clear by comparison to the octopus test. In the octopus test, O can observe communications between A and B indefinitely until it decides to interpose, and its goal is to exactly imitate B’s side. Its test task is perfectly aligned with its training signal (notwithstanding an imperfect O’s mistakes sending it out-of-distribution). On the other hand, a language model like GPT-2 is trained to maximize the likelihood of a certain subset of web text. So if we sample continuations from a prompt, we would expect GPT-2 to sample plausible web text which began with that prompt. When we prompt GPT-2 with, e.g., arithmetic problems and expect correct results, we’re not only going out-of-distribution: we’re going out-of-task.

Consider “fact recall.” As part of a large experimental suite, Allyson Ettinger tested how BERT’s mask-filling behavior varies on sentence pairs with and without negation, with respect to the truth of the resulting statement. BERT’s top five predictions for one example are listed below (Ettinger, Table 13):

Context                     Predictions
A robin is a _____.         bird, robin, person, hunter, pigeon
A robin is not a _____.     robin, bird, penguin, man, fly

From her experiments, she notes that for a certain class of sentences, “BERT shows a complete inability to prefer true over false completions for negative sentences” (Section 9). However, this may be more charitably characterized as a lack of desire rather than a lack of ability (though both are metaphorical: terms like desire and ability risk anthropomorphizing an algorithm which simply does what it does). Predicting what word comes next is not the same as predicting what forms a true statement—in fact, it may involve the opposite. Unusual or even false situations or properties may be disproportionately reported in comparison to obvious ones (which aren’t worth mentioning). This phenomenon is often called reporting bias, and examples include a black sheep (due to Meg Mitchell) or the blue banana (from personal experience). Consider also whether you are more likely to see the phrase “pigs fly” or “pigs walk.”
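To make the gap between likelihood and truth concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the four-line corpus, the `true_in_world` table, and the `next_word_probs` helper are assumptions, not data or code from Ettinger’s experiments. It only shows that a count-based next-word predictor can prefer the false completion when the false statement is the one people bother to write down.

```python
from collections import Counter

# Hypothetical toy corpus: the remarkable gets written down, the obvious does not.
corpus = [
    "pigs fly",   # idiom, written often
    "pigs fly",
    "pigs fly",
    "pigs walk",  # true, but rarely worth saying
]

# Hypothetical world knowledge. Note that it never appears in the corpus itself.
true_in_world = {"pigs fly": False, "pigs walk": True}


def next_word_probs(lines, prefix):
    """Maximum-likelihood estimate of P(next word | prefix) from raw counts."""
    counts = Counter(
        line.split()[1] for line in lines if line.split()[0] == prefix
    )
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


probs = next_word_probs(corpus, "pigs")
most_likely = max(probs, key=probs.get)
print(probs, most_likely, true_in_world["pigs " + most_likely])
```

The predictor is doing its job perfectly here: it was asked what comes next, not what is true.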

It’s conceivable that a model which has reverse-engineered reporting bias could learn an encoding of facts so it knows not to report them, or that properties of words are encoded indirectly by pairwise similarities (as formalized, for example, by Katrin Erk). By this logic, just because a model doesn’t produce factual text (or correct answers to arithmetic problems, or helpful advice for staving off a bear) doesn’t mean it lacks the requisite language understanding capabilities. It may just be that it hasn’t been properly asked to use them. (This also seemed to be Gwern’s attitude in his response to B&K.)

Aside: One might argue that this is precisely the point—because behavior can only be queried in terms of next word probabilities and not the real world uses that are actually associated with meanings, then it doesn’t make sense to think of language models as having “meaning.” I suspect B&K may agree with this position, and it’s an important point which we will revisit in the Synthesis section on meaning and use. But if we take the view that a language model’s generated text can be used as a diagnostic for the LM understanding language meaning, then we have to address the out-of-task problem: even a model which somehow reverse-engineered a human atom-for-atom would fail a test of the real-world factuality of its predicted text, so it seems like an unfair requirement—especially as the expectations we have of the model may change or even contradict each other between test settings.

A fairer test of the difference between a language model’s capabilities and a human’s may be what the authors of GPT-3 did when they generated news articles from the model and asked humans to guess whether they were real or fake (Section 3.9.4). Humans barely beat chance, at 52% accuracy. However, while arguably fair, their test is very weak: it lacks the generality, interactivity, element of novelty, and long time horizon possessed by the octopus test. “Surely,” Bender comments, “a human’s capabilities extend far beyond generating journalistic prose” (personal communication).

So, supposing a system can understand what a human understands without behaving exactly like a human, is it possible to more fairly and precisely “ask” a language model to do a task, or probe its understanding of desired aspects of meaning? There is a veritable cottage industry around these questions, and approaches include:

Language model pre-training has produced positive results in all of these cases, and many tests are at least relevant to language meaning, if not proper “meaning-associated capabilities.” But because B&K’s claim, in this framing, reduces to the existence of an unsolvable test, none of these results can quite touch the thesis—though they might shift some people’s assessment of its plausibility. If they don’t affect your assessment at all, that’s perfectly reasonable: but make sure you treat examples such as those in B&K’s appendix with similar skepticism. Beware of grand excuses that allow you to ignore evidence you don’t like.

It seems that, in the near future, evidence will not help us solve the octopus’s mystery, or perhaps change many minds. So instead of exploring more evidence, next we’ll take a theoretical look at our AI-style framing of B&K’s thesis.

Climbing Towards Meaninglessness: Latent Variables and Laundered Meaning

The core of B&K’s argument may seem inherently suspicious to some readers: “Meaning is not there, therefore it cannot be learned.” But couldn’t you say the same for any latent variable? We learn those all the time. In the chat, Graham Neubig writes:

One thing from the twitter thread that it doesn’t seem made it into the paper… is the idea of how pre-training on form might learn something like an “isomorphic transform” onto meaning space. In other words, it will make it much easier to ground form to meaning with a minimal amount of grounding. There are also concrete ways to measure this, e.g. through work by Lena Voita or Dani Yogatama… This actually seems like an important point to me, and saying “training only on form cannot surface meaning,” while true, might be a little bit too harsh—something like “training on form makes it easier to surface meaning, but at least a little bit of grounding is necessary to do so” may be a bit more fair.

Consider Hidden Markov Models. Even if the hidden states are never observed, they may still potentially be recovered from observations alone. Then to relate the learned model to a “ground-truth” definition of hidden states, one would only need to learn the simple isomorphism between its hidden states and the ground-truth ones instead of the whole model. One may concretely say how much the model learned about the ground-truth states in each step by counting the bits needed to represent its learned parameters and the bits needed to represent the final isomorphism—information-theoretic description length style measures, as suggested by Voita’s and Yogatama’s work that Neubig referenced. (Note, though, that this example lets the learner constrain the hypothesis space to HMMs—a very strong assumption.)
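As a rough sketch of that bookkeeping, the Python below aligns a learner’s arbitrary state labels with ground-truth labels and compares the cost of naming the relabeling against the cost of the model’s tables. This is a crude parameter count under assumptions of my own (a fixed 32 bits per parameter, a brute-force search over permutations, hand-written state sequences), and it is not the measure used by Voita or Yogatama; the function names are hypothetical.

```python
import math
from itertools import permutations


def best_relabeling(learned, truth, num_states):
    """Brute-force the relabeling of learned states that best matches the truth."""
    best_perm, best_hits = None, -1
    for perm in permutations(range(num_states)):
        hits = sum(perm[s] == t for s, t in zip(learned, truth))
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    return best_perm, best_hits / len(truth)


def isomorphism_bits(num_states):
    """Bits needed to single out one of the K! relabelings of K states."""
    return math.log2(math.factorial(num_states))


def parameter_bits(num_states, vocab_size, bits_per_param=32):
    """Bits for transition and emission tables at an assumed fixed precision."""
    return (num_states * num_states + num_states * vocab_size) * bits_per_param


# Hypothetical example: the learner found the right structure under different names.
truth = [0, 0, 1, 2, 2, 1, 0, 2]
learned = [2, 2, 0, 1, 1, 0, 2, 1]

perm, accuracy = best_relabeling(learned, truth, num_states=3)
print(perm, accuracy)            # relabeling indexed by learned state id
print(isomorphism_bits(3))       # cost of the final "grounding" step
print(parameter_bits(3, 1000))   # cost of everything learned before it
```

With three states and a 1,000-word vocabulary, the relabeling costs under 3 bits while the tables cost tens of thousands, which is the sense in which almost all of the learning could happen before any ground-truth state is seen.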

This is related to the out-of-task testing issue from the previous section. You certainly can’t expect the octopus to “tell apart a coconut and a mango,” as Guy Emerson suggests in the chat, without giving him some way of relating his new sensory stimuli to his existing knowledge, i.e., grounding. But suppose that his knowledge is so vast, from a near-eternity of listening in on A and B, that with only a minuscule amount of grounding signal, he can reconstruct an entire grounded lexicon. Then did he really not learn any meaning until it all came rushing in at the end?

Running the Gamut of Grounding

As a concrete example, consider an extension to the octopus test concerning color—a grounded concept if there ever was one. Suppose our octopus O is still underwater, and he:

  • Understands where all color words lie on a spectrum from light to dark… But he doesn’t know what light or dark mean.

  • Understands where all color words lie on a spectrum from warm to cool… But he doesn’t understand what warm or cool mean.

  • Understands where all color words lie on a spectrum of saturated to washed out… But he doesn’t understand what saturated or washed out mean.

Et cetera, for however many scalar concepts you think are necessary to span color space with sufficient fidelity. A while after interposing on A and B, O gets fed up with his benthic, meaningless existence and decides to meet A face-to-face. He follows the cable to the surface, meets A, and asks her to demonstrate what it means for a color to be light, warm, saturated, etc., and similarly for their opposites. After grounding these words, it stands to reason that O can immediately ground all color terms—a much larger subset of his lexicon. He can now demonstrate full, meaningful use of words like green and lavender, even if he never saw them used in a grounded context. This raises the question: When, or from where, did O learn the meaning of the word “lavender”?
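Here is a minimal sketch of that scenario in Python, under heavy simplifying assumptions: three scales, made-up positions for two color words, invented “sensor readings” for the six pole words, and plain linear interpolation between the poles. None of these numbers or names come from B&K or the chat.

```python
# Each scale is named by its two pole words, which O knows only as words.
AXES = {
    "lightness": ("dark", "light"),
    "warmth": ("cool", "warm"),
    "saturation": ("washed out", "saturated"),
}

# Hypothetical form-derived knowledge: where each color word sits between the
# poles of each scale (0 = low pole, 1 = high pole), learned underwater.
positions = {
    "lavender": {"lightness": 0.75, "warmth": 0.25, "saturation": 0.5},
    "green": {"lightness": 0.5, "warmth": 0.25, "saturation": 0.75},
}

# The grounding toehold: A demonstrates the six pole words on land.
# The values stand in for perceptual readings and are entirely made up.
toehold = {
    "dark": 0.0, "light": 100.0,
    "cool": -1.0, "warm": 1.0,
    "washed out": 0.0, "saturated": 1.0,
}


def ground(word):
    """Place a color word in perceptual space by interpolating between poles."""
    grounded = {}
    for axis, (low, high) in AXES.items():
        t = positions[word][axis]
        grounded[axis] = toehold[low] + t * (toehold[high] - toehold[low])
    return grounded


print(ground("lavender"))
print(ground("green"))
```

The `positions` table is fixed before O ever surfaces, and `toehold` has six entries no matter how many color words `positions` contains. The toy lexicon here is smaller than the toehold, but the same six demonstrations would ground a lexicon of any size.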

It’s hard for me to accept any answer other than “partly underwater, and partly on land.” Bender acknowledges this issue in the chat as well:

The thing about language is that it is not unstructured or random, there is a lot of information there in the patterns. So as soon as you can get a toe hold somewhere, then you can (in principle, though I don’t want to say it’s easy or that such systems exist), combine the toe hold + the structure to get a long ways.

But once we acknowledge that an understanding of meaning can be produced from the combination of a grounding toehold and form-derived structure, that changes the game. (Admittedly, it’s not clear that Bender agrees with this; “a long ways” doesn’t necessarily mean “understanding lots of meaning.”) In particular, if O is hyper-intelligent, his observations should be exchangeable with respect to his conclusions about latent meaning; hearing “lavender” after he has knowledge of grounding cannot teach him any more than hearing the word before. A human may infer the meaning of a word when reading it for the first time in a book, on the basis of their prior understanding of the meaning of its linguistic context. A hyper-intelligent O should be able to do the same thing in the opposite order; the total amount of information learned about meaning is the same.

Asserting a Toehold

So now, as Jesse Dunietz points out in the chat: “The important question is… how much of the grounding information can be derived from a very small grounding toehold plus reams of form data and very clever statistical inference.” In this light, we can take B&K’s claim to imply that there is a significant chunk of meaning (i.e., the necessary grounding toehold) which cannot be learned from form alone (and furthermore that this chunk is necessary in order to convincingly manipulate form), rather than that no meaning can be learned from form alone.

This leaves us with another way to phrase Dunietz’s question: given an unlimited amount of form data, how large a grounding toehold is necessary to reconstruct all of meaning? It seems like the answer could plausibly be quite small. B&K’s own Java example illustrates this. They propose:

Imagine that we were to train an LM on all of the well-formed Java code published on Github. The input is only the code. It is not paired with bytecode, nor a compiler, nor sample inputs and outputs for any specific program. We can use any type of LM we like and train it for as long as we like. We then ask the model to execute a sample program, and expect correct program output. (Section 5)

They remark that this test is plainly impossible to pass: “The form of Java programs, to a system that has not observed the inputs and outputs of these programs, does not include information on how to execute them.” However, later they add:

It has been pointed out to us that the sum of all Java code on Github (cf. § 5) contains unit tests, which specify input-output pairs for Java code. Thus a learner could have access to a weak form of interaction data, from which the meaning of Java could conceivably be learned. This is true, but requires a learner which has been equipped by its human developer with the ability to identify and interpret unit tests. This learner thus has access to partial grounding in addition to the form. (Section 9)

But the operative part of unit tests is easily syntactically identifiable: the argument of every assert is expected to evaluate to true (and nearly always will in practice, if you’re looking at all code on Github). Perhaps more evocatively, if a next-token-predictor is prompted with the prefix “assertEquals(f(x), ”, its task can be fairly characterized in most cases as evaluating f on x. In this way, arbitrary details of Java evaluation semantics may potentially show up in the task and be queried via asserts. The programmer does not have to teach the system to pay special attention to these cases. They simply come up as a subproblem of predicting the next token.

It’s not unreasonable to think a learning system may pick up on the fact that in cases like assertEquals(f(x), _), the value on the right side of the assertion is more or less conditionally independent of the rest of the code given x and f’s implementation. And there may also be any number of less obvious cues in the form of Java code which indirectly encode its execution behavior. For example, indexing into an array is almost always done with ints that, at runtime, have values within the bounds of the array. From this and other constraints, a model may theoretically be able to reverse-engineer runtime arithmetic operations on ints.

There are certainly more possibilities I can’t think of, and the details don’t matter so much as the clear fact that some potentially powerful clues about execution semantics are there. Powerful learning systems are liable to pick up on them and leverage them as a toehold. For the learner to be “equipped… with the ability to identify and interpret unit tests” may indeed require rather minor biases, like “thou shalt seek useful conditional independence assumptions,” or frankly just being able to notice when an assertEquals has come along. Would this really qualify as “partial grounding”?

Furthermore, in ML and NLP we often think of such biases as imposed by a prior over models. As Matt Richardson suggests in the chat:

It seems plausible to me that with the right priors connecting the two, a system could still learn meaning from surface forms. Such a prior might be e.g., different objects in the world have different surface forms, or text tends to reflect the temporal order of things, or etc. Perhaps this is considered cheating because the model is being given information about meaning, but there might be only very little such prior needed to still form a meaningful interpretation into meaning.

I would add on to this: nearly any machine learning system may be accurately thought of as having some prior, explicit or implicit. So the key “partial grounding” signal, especially if it takes the form of a fairly general prior, may be an unavoidable part of any learning system—it could be in the octopus’s DNA, so to speak.

This leaves our AI formulation of B&K’s thesis even weaker: it leaves room for all but a very small component of meaning (in particular, a perhaps-very-general prior and the “final isomorphism” to sensory inputs / grounded situations) to be learnable from form alone. If this is the case, and especially if we acknowledge the octopus to have some prior baked into its DNA, then it’s not so obvious why the octopus would necessarily fail its test, or what it means for NLP researchers if it does.


The previous section shows why the octopus test, when taken literally as a diagnostic test of understanding language meaning, seems to render the original question—whether meaning can be learned from form alone—basically meaningless. However, that doesn’t mean B&K’s thesis is meaningless. Instead, this may be taken to mean that the octopus test is not appropriate as a diagnostic. In this section, I’ll explore some alternative angles.

Generalization from What to How to Why

Consider B&K’s bear attack example. As given, it is framed as a novel situation: something the octopus will have never observed B deal with before. But when Gwern coaxes an arguably useful response out of GPT-3, his approach is to try to trigger approximate recall of advice on handling wildlife, the likes of which almost certainly appeared somewhere in GPT-3’s training data. This is much less interesting, but, to some, seems to pass the test.

This relates to the “big data” objection B&K address in Section 9: perhaps O sees so much form that it has covered just about every possible new situation, or enough that new situations can be successfully treated, more-or-less, via simple interpolation between the ones it has seen. Then it can pass the test primarily through recall. In fact, Khandelwal et al. constructed a neural language model that worked exactly this way—by explicit recall—and got performance gains on language modeling (though it’s not clear whether a huge model like GPT-3 is doing something analogous and not qualitatively different). But B&K reject that such a model would learn meaning, on the basis of the “infinitely many stimulus-response pairs” that would need to be memorized to deal with the “new communicative intents [constantly generated by interlocutors] to talk about their constantly evolving inner and outer worlds,” saying that even if such a system could score highly on benchmarks, it wouldn’t be doing “human-analogous NLU.”
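For readers who want a picture of what prediction "by explicit recall" can look like, here is a toy sketch in Python in the spirit of nearest-neighbor language modeling. It is not Khandelwal et al.’s implementation. The two-dimensional context vectors, the three-word vocabulary, the base distribution p_lm, and the mixing weight lam are all made up for illustration. A real system would take context vectors from a trained model and search a very large datastore.

```python
import math
from collections import defaultdict

# Toy datastore of (context vector, observed next token) pairs. All values are invented.
DATASTORE = [
    ((1.0, 0.0), "bear"),
    ((0.9, 0.1), "bear"),
    ((0.0, 1.0), "coconut"),
]


def knn_distribution(query, datastore, k=2):
    """Next-token distribution from the k nearest stored contexts.

    Each neighbor votes for its token with weight exp(-squared distance).
    """
    scored = sorted(
        (sum((q - c) ** 2 for q, c in zip(query, context)), token)
        for context, token in datastore
    )[:k]
    weights = defaultdict(float)
    for distance, token in scored:
        weights[token] += math.exp(-distance)
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}


def interpolate(p_lm, p_knn, lam=0.5):
    """Mix the recalled distribution with a base LM distribution."""
    vocab = set(p_lm) | set(p_knn)
    return {
        token: lam * p_knn.get(token, 0.0) + (1 - lam) * p_lm.get(token, 0.0)
        for token in vocab
    }


# Hypothetical base LM prediction for a context that resembles the stored "bear" contexts.
p_lm = {"bear": 0.2, "coconut": 0.5, "stick": 0.3}
p_knn = knn_distribution((0.95, 0.05), DATASTORE, k=2)
p_final = interpolate(p_lm, p_knn, lam=0.5)
print(max(p_final, key=p_final.get))
```

The recalled neighbors pull the prediction toward what followed similar contexts before. That helps exactly when the new situation resembles stored ones, which is the limitation B&K’s "infinitely many stimulus-response pairs" objection targets.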

The How: Generalization

What this tells me is that “learning meaning” may not be about what the model does after all, but how it does it. My interpretation of B&K here is that understanding meaning is what’s required for O to generalize correctly. This is why anything which can be solved by repeating behaviors O has seen before seems like cheating: it requires only recall, not understanding. In this light, B&K’s position seems to follow from the perspective that associative learning and recall of patterns in linguistic form, even if potentially extremely powerful, are not how language meaning actually works, at least if you listen to researchers studying human language acquisition (B&K, Section 6) or distributional semantics (B&K, Section 7).

In this framing, a better diagnostic than the octopus test may probe generalization ability or systematicity, measuring how well subsystems of language meaning can be learned from limited or controlled training data. It’s not totally clear how such a test should be constructed, or if there’s a single test which can capture the whole idea. But this is currently an active field of research in NLP. Much of this work focuses on learning in grounded, multi-modal environments that have language-external entities, goals or execution semantics, in line with B&K’s arguments about what it takes to learn meaning.

The Why: Causation

This also relates to an issue that came up in the live Q&A session I attended. A participant (whose name I unfortunately have forgotten) brought up Judea Pearl’s Book of Why, pointing out a well-known theoretical truth: correlation does not imply causation. Our friendly octopus O, no matter how long he observes A and B converse, may never be able to pick up the causal structure at work behind their words. This means that his generalizations may be incorrect under covariate shift, that is, when the distribution of his inputs changes and previously rare or unseen events (such as bear attacks) become common.

Consider the Java example. Above I said the model may discover that the right side of an assertion like assertEquals(f(x), _) is more or less conditionally independent of the rest of the code given x and f’s implementation. The “more or less” here is crucial: other factors in how the code is written may bias (or reflect biases in) how programmers choose which outputs are tested against, and how. If these biases are systematic, there may be no way for a learner to distinguish between them and the actual runtime semantics of Java, i.e., to learn the true causal structure at work.
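One way to picture this is a small Python sketch under invented assumptions. Suppose programmers only ever write assertions about small integers. Then a learner cannot tell apart two candidate semantics for addition from those assertions alone: ordinary unbounded arithmetic, and Java’s 32-bit int arithmetic, which wraps around on overflow. The list OBSERVED_INPUTS and both function names are hypothetical. Only the wraparound behavior of Java’s int is a fact about the language.

```python
def unbounded_add(a, b):
    """Hypothesis A: mathematical integer addition with no overflow."""
    return a + b


def java_int_add(a, b):
    """Hypothesis B: Java's 32-bit int addition, which wraps around on overflow."""
    return (a + b + 2**31) % 2**32 - 2**31


# Hypothetical inputs seen in unit tests, systematically biased toward small values.
OBSERVED_INPUTS = [(1, 2), (10, -4), (300, 700), (-5, -5)]

# On everything the learner has observed, the two hypotheses are indistinguishable.
agree_on_observed = all(
    unbounded_add(a, b) == java_int_add(a, b) for a, b in OBSERVED_INPUTS
)

# Under covariate shift (large values become common), they come apart.
INT_MAX = 2**31 - 1
shifted_input = (INT_MAX, 1)

print(agree_on_observed)
print(unbounded_add(*shifted_input), java_int_add(*shifted_input))
```

Both hypotheses fit the observed form equally well, so nothing in that data favors the true runtime semantics. Telling them apart takes an input from outside the biased distribution, which is what an intervention would supply.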

I don’t know enough about causality to go deeper on this issue, but it seems important. My understanding is that generally, in order to establish causation, an intervention is needed, which requires a learner to interact and experiment with its environment (though it might be possible in some cases to identify natural experiments). Other work conceives of causal learning as detecting invariances across environments, though this seems to require some way of separating and parameterizing the environments. It seems to me that an understanding of causation may be necessary to run the sort of mental simulations that Josh Tenenbaum, in his ACL 2020 keynote, suggested may be involved in language comprehension. Bisk, Holtzman, and Thomason et al., referenced by B&K, also provide a careful exposition of related issues and a vision for the “world scopes” in which such action dynamics may be learned. In the chat, Marti Hearst said:

The conversations “back in the day” were all about how embodiment and being a child learning in the world was necessary for intelligence and understanding the way people do.

Thankfully, those conversations are continuing today.

What Use is Meaning, Anyway?

I think there’s another point worth making about what is meant by “understanding.”

Consider fact recall. A while after releasing T5, its creators showed that fine-tuning a pre-trained language model to answer factoid questions yields surprisingly good results, producing a system which can answer questions correctly using facts and entities it never saw in fine-tuning. One may take results like this to mean that a language model is like a knowledge base, storing factual knowledge that we can learn how to query by fine-tuning. (Or, in GPT-3’s case, we query it by running its forward pass with a given textual prefix.) But is it really storing facts? One could similarly say it is storing and retrieving:

  • Claims commonly reported in encyclopedic text.
  • Commonly held beliefs and common-sense knowledge.
  • Teachings of the United States public school system.
  • Events and claims reported on the news.

All of these things are highly correlated with each other and with true facts, but they also disagree with one another in places. There is no reason to believe that a model draws clear distinctions between these things, or that real-world true facts have special status in it. As a user or fine-tuner of a language model, you don’t have the ability to reach into its stored knowledge and query the latent notion of a “fact.” The entire job of defining a fact is up to you. But the notion of a fact, and the semantics of the resulting system, i.e., that it outputs true facts, is exactly what is critical to the use of that system in the real world.

I think this is just another way of saying B&K’s point: if we take “understanding language meaning” to mean understanding the relationship between language form and communicative intent, or use, then language models aren’t doing it. They’re not helping us find the nail, so to speak, so much as giving us a fancier hammer.

The upshot is that having “smart enough” language models is not going to solve any problem which is already bottlenecked by our ability to precisely formulate it in terms of data (and sufficiently far from the language modeling task itself). In fact, real-world facts may be one of the easiest cases of this, since they’re easy to objectively verify, they can be sourced at very large scale, and they don’t require annotators to be creative, which can lead to annotation artifacts. For other tasks, like reading comprehension, natural language inference, or many tasks involving language generation, this has proven extremely difficult. Bias, vulnerability to adversarial inputs, and reliance on shallow heuristics and surface correlations all remain problems. So it may be an easy mistake to over-value work in scaling up what we’re already doing over more careful, slower work investigating what the heck we’re trying to do in the first place. Again, see B&K’s Section 8 on climbing the right hills.

Anthropomorphism, Bias, and Use

Besides understanding language meaning, there are other qualities that are often attributed, in possible acts of anthropomorphism, to language models. One such quality is bias.

The latest Twitter debacle about bias in machine learning saw no shortage of comments on where bias may reside (and thus potentially be mitigated) in ML models: is it in the data? Loss function? Learned weights? Input features? Inductive bias?

These questions must be approached carefully: an algorithm or set of model weights cannot in and of itself be biased. Rather, more essential operative questions for investigating bias in ML are, as Timnit Gebru writes, who is benefiting, and who is being harmed? Benefits and harms accrue through the real-world processes which unfold between the creators, users, and subjects of machine learning models. We may trace harms incurred through, for example, dataset curation, affordance of inequitable inaccuracies in models deployed commercially or by law enforcement, or the use of machine learning in a myriad of other situations in ways that perpetuate or reinforce existing inequalities, often based on shaky or false assumptions about social truths.

B&K’s argument implies that meaning, by its status as a way of relating language form to an external system, should not be imputed to models which are not exposed to that system. I would say a corollary is that language processed or generated by these models is assigned meaning through its use. Similarly, to separate bias in machine learning models from how those models are used is a reductive mistake. On one hand, this subtle form of anthropomorphism invites dismissals from those who remark that math cannot be biased. On the other, it can lead to complacency with model debiasing, which, while good, merely scratches the surface of the issues of bias (and sometimes only barely).

To make the connection clearer: you cannot “debias” a language model by excising its harmful beliefs (nor can it hide such beliefs), because it doesn’t have beliefs. It simply models its training data. If you ignore this and instead pitch your language model as—for example—a general-purpose few-shot learner or language understander, you might find yourself in very unproductive conversations about whether your AI is evil or holds reprehensible opinions, rather than productive ones about bias in the systems that produced it and govern its use.


In this post, I have tried to take a nuanced deep dive into Bender & Koller’s paper and the responses to it. I tried to clarify how evidence factors into the debate and synthesize the main critiques that came up during the conference, as well as provide some perspective on what these criticisms may be missing. Hopefully this analysis makes the debate more accessible to those who were not in the Rocket Chat and aren’t constantly using Twitter. And hopefully it makes it a little less confusing and frustrating for those who are.

The paper’s core thesis is that language meaning cannot be learned from form alone. In line with some of the paper’s critics, I’ve argued that if we use the octopus test as a guide to interpret this claim straightforwardly in the traditional paradigm of AI research, then the claim seems to reduce to something quite narrow with few practical implications. However, viewed under a broader lens, I think the argument has important ramifications for the way that we conceive of and evaluate language understanding, pointing towards systematicity, causality, careful scrutiny of our training objectives, data, and model outputs, and thoughtful separation between what is being modeled and how a model is being used.

The word “understanding” is heavily loaded: it carries strong connotations of a system which has the right systems and causal structures in place, or has preconceived notions of truth, falsehood, entailment, belief, or other human-familiar (and human-useful) concepts. We have no reason to think language models contain such things in useful or recognizable form; all we know is that they are powerful statistical summaries of their training data. Loosely saying our models “understand language” slips these implications under the table. As Phil Resnik said in the chat:

For me the huge contribution here is that you are getting (at least more) people to think about these issues and discuss them again, whether or not they agree with the specifics of your argument. For what it’s worth, for many years I have issued the following as a standard warning (particularly in entrepreneurial contexts): “If someone tells you that their system ‘understands’ language, put your hand on your wallet and keep it there”. Doubly so if they say “like people do” and quadruply so if they make any reference whatsoever to the way children learn language.

Couldn’t agree more.

Huge thanks to Emily Bender, who provided valuable feedback on an early version of this post and corrected some important misunderstandings I had about their argument. Thanks also to Amandalynne Paullada for valuable feedback on the framing, Alexander Koller for useful criticisms of an earlier version of the Java argument, Jonathan Kummerfeld and Sanxing Chen for comments, Max Forbes for encouragement, and the many participants at ACL and on Twitter who have contributed to this discussion.


  1. Their argument is clearly not specific to English, but it seems prudent to specify the language as English since they use it for their examples—a generous application of the Bender rule to avoid reinforcing the assumption of English as the default language. 

  2. Before going to Gwern’s response, some readers might want to know that he’s the type of blogger that may do a tremendous amount of research to produce things like graphs relating the number of human IVF embryos to the expected profit from selecting among them for IQ. Yes, you’re entering that part of the internet.