Transcript Episode 98: Helping computers decode sentences - Interview with Emily M. Bender
This is a transcript for Lingthusiasm episode ‘Helping computers decode sentences - Interview with Emily M. Bender’. It’s been lightly edited for readability. Listen to the episode here or wherever you get your podcasts. Links to studies mentioned and further reading can be found on the episode show notes page.
[Music]
Lauren: Welcome to Lingthusiasm, a podcast that’s enthusiastic about linguistics! I’m Lauren Gawne. Today, we’re getting enthusiastic about computers and linguistics with Professor Emily M. Bender.
But first, November is our traditional anniversary month! This year, we’re celebrating eight years of Lingthusiasm. Thank you for sharing your enthusiasm for linguistics with us. We’re also running a Lingthusiasm listener survey for the third and final time. As part of our anniversary celebrations, we’re running the survey as a way to learn more about our listeners, get your suggestions for topics, and to run some linguistics experiments. If you did the survey in a previous year, there’re new questions, so you can totally participate again this year. There’s also a spot for asking us your linguistics advice questions, since our first linguistics advice bonus episode was so popular.
You can hear about the results of the previous surveys in two bonus episodes, which we’ll link to in the show notes. We’ll have the results from this year’s survey in an episode for you next year. To do the survey or read more details, go to bit.ly/lingthusiasmsurvey24 – that’s bit.ly/lingthusiasmsurvey24 (the numbers 2 and 4) – before December 15 anywhere on Earth. This project has ethics board approval from La Trobe University, and we’re already incorporating results from previous surveys into some academic papers. You, too, could be part of science if you do the survey.
Our most recent bonus episode was a linguistics travelogue. We discuss Gretchen’s recent trip to Europe where she saw cool language museums, and what she did to prepare for encountering several different languages on the way, as well as planning our fantasy linguistic excursion to Martha’s Vineyard. Go to patreon.com/lingthusiasm to hear this and many more bonus episodes and to help keep the show running ad-free.
Also, very exciting news from Patreon, which is that they’re finally adding the ability to buy Patreon memberships as a gift for someone else. If you’d be excited to receive a Patreon membership to Lingthusiasm as a gift, we’ll have a link in the show notes for you to forward to your friends and/or family with a little wink wink, nudge nudge. We also have lots of Lingthusiasm merch that makes a great gift for the linguistics enthusiast in your life.
[Music]
Lauren: Today, I am delighted to be joined by Emily M. Bender who is a professor at the University of Washington in the Department of Linguistics. She is the director of the Computational Linguistics Laboratory there. Emily’s research and teaching expertise is in multilingual grammar engineering and societal impacts of language technologies. She runs the live-streaming podcast Mystery AI Hype Theater 3000 with sociologist Dr. Alex Hanna. Welcome to the show, Emily!
Emily: I am so enthusiastic to be on Lingthusiasm.
Lauren: We are so delighted to have you here today. Before we ask you about some of your current work with computational linguistics, how did you get into linguistics?
Emily: It was a while ago. Back when I was in high school, we didn’t have things like the Lingthusiasm podcast – or podcasts for that matter – to spread the word about what linguistics was. I actually hadn’t heard about linguistics until I got to university. Someone gave me the excellent advice to get the course catalogue ahead of time – it was a physical book in those days – and just flip through it and circle anything that looked interesting. There was this one class called “An Introduction to Language.” In my second term, I was looking for a class that would fulfil some kind of requirements, and it did, and I took it. Let me tell you, I was hooked on the first day. Even though the first day was actually about the bee dance and other animal communication, I just fell in love with it immediately. I think, honestly, I had always been a linguist. I loved studying languages. My ideal undergraduate course of study would’ve been, like, take the first year of all the languages I could.
Lauren: That would be an amazing degree. Just like, “I have a bachelor’s in introductory language.”
Emily: Yeah, I mean, speaking now as a university educator, I think there’s some things missing from that, but as a linguist, how much fun would that be. I didn’t know there was a way to study how languages work without studying all the languages. When I found it, I was just thrilled.
Lauren: Excellent. I think that’s such a typical experience of a lot of people who get to university, and they’re intrigued by something that’s like, “How can it be an intro to language when I’ve learnt a bunch of languages?” And then you discover there’s linguistics, which brings you into the whole systematic nature of things.
Emily: Absolutely. My other favourite story to tell about this is I have a memory of being 11 or 12 and day dreaming and trying to figure out what the difference was between a consonant and a vowel.
Lauren: Amazing.
Emily: Because we were taught the alphabet. There’s five vowels and sometimes Y, and the other ones are consonants. What’s the difference? My regret with this story is that I didn’t record what it was that I came up with. I have no idea if I was anywhere near the right track. But I don’t think that your average non-linguist does things like that.
Lauren: That’s extremely proto-linguist behaviour. I love it. I’m sad we don’t have a record of 11-year-old Emily figuring out the IPA from first principles.
Emily: Emily who definitely went on to be a syntax / semantics side linguist and not a phonetics / phonology side linguist.
Lauren: How did you become a syntax-semantics linguist? How did you get into your research topic of interest?
Emily: In undergrad, it was definitely the syntax class that I connected with the most. I got to study Construction Grammar with Chuck Fillmore and Paul Kay at UC Berkeley, which was amazing, and sort of was aware at the time that at Stanford there was work going on on two other frameworks called Lexical-Functional Grammar and Head-Driven Phrase-Structure Grammar. These are different ways of building up representations of language. I went to grad school at Stanford with the idea that I was going to create a generalised Bay Area grammar and bring together everything that was best about each of the frameworks. They are similar in spirit. They’re sometimes described as “cousins.” Then I got to Stanford, and I took a class with Joan Bresnan on Lexical-Functional Grammar and a class with Ivan Sag on Head-Driven Phrase-Structure Grammar. I realised that it’s actually really valuable to have different toolkits because they help you focus on different aspects of the grammars of languages. Merging them all together really wasn’t gonna be a valuable thing to do.
Lauren: It’s good that you could see what each of them was bringing to – that we have syntax, and there’s structure, but different ways of explaining it give different perspectives on things.
Emily: Exactly, and lead linguists to want to go explore different things about different languages. If you’re working with Lexical-Functional Grammar, then languages that do radical things with their word order, like some of the languages of Australia, are particularly interesting, and languages that put a lot of information into the morphology – so the parts of the words – are really interesting. If you’re doing Head-Driven Phrase-Structure Grammar, then it’s things like getting deep into the idiosyncrasies of particular languages – the idioms and the sub-patterns – and making them work together with the major patterns is a big focus of HPSG. You’re just gonna work on different problems using the different frameworks.
Lauren: I love that. An incredibly annoying undergraduate proto-linguist behaviour I still remember in my syntax class – because you learn to draw syntax trees. One of my fellow students and I were like, “Trees are fine, but we need to keep extending them down because they only go as far as words,” and there’s all this stuff happening in the morphology. We thought we were very clever for having this very clever thought. We were very lucky that our syntax professor was Rachel Nordlinger, who is another person who works with Lexical-Functional Grammar, which, as you said, is really interested in morphology. You could tell she was just like, “You guys are gonna be so happy when we get to advanced syntax, but just hold on. We’re just doing trees for now.” That’s how I got introduced to different forms of syntax helping answer different questions. It’s like, “Oh, this is one that accounts for the things that are happening inside words as well.” It’s really cool.
Emily: One of the things about both LFG and HPSG is that they’re associated with these long-term computational projects where people aren’t just working out the grammars of languages with pen and paper but actually codifying them in rules that both people and computers can deal with. I got involved with the HPSG project like that as a graduate student at Stanford, and then later on, at my first job – actually, that’s not true. My first job out of grad school was teaching for a year at UC Berkeley, but then I had a year after that where I was working in industry at a startup called “YY Technologies” that was using a large-scale grammar of English to create automated customer service responses. You’ve got an email coming in, and the idea is that we parse the email, get some representation of what’s being asked, look up in a database what an appropriate answer would be, and then send that answer back. The goal was to do it on the easy cases so that the harder cases that the automated system couldn’t handle would get passed through to a representative. The startup was doing that for English, and they wanted to expand to Japanese. I had been working on the English grammar, actually, as a graduate student at Stanford because it’s an open-source grammar, and I speak Japanese, and so I got to do this job where it was literally my job to build a grammar of Japanese on a computer. It was so cool. That was a fantastic job. In the course of that year, there was a project starting up in Europe that was interested in building more of these grammars for more languages. I picked up the task of saying, “How can we abstract out of this big grammar for English,” which at that point was about seven years old, still under development. It is quite a bit older now, quite a bit bigger.
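A minimal sketch, in Python, of the parse-then-answer pipeline Emily describes: parse the incoming email, look up an answer, and pass anything the system can’t handle to a human. The “parser” and answer database here are toy stand-ins, not YY Technologies’ actual HPSG-based system:

```python
# Toy stand-ins: the real system parsed email with a large-scale
# grammar of English; here "parsing" is a crude keyword match.
ANSWERS = {
    ("reset", "password"): "You can reset your password from the account page.",
    ("cancel", "order"): "Orders can be cancelled within 24 hours of purchase.",
}

def parse(email_text):
    """Return a (verb, object) representation if the email matches a
    known easy pattern, or None if the 'grammar' can't handle it."""
    words = set(email_text.lower().replace("?", "").replace(".", "").split())
    for verb, obj in ANSWERS:
        if verb in words and obj in words:
            return (verb, obj)
    return None

def respond(email_text):
    representation = parse(email_text)
    if representation is None:
        # The hard cases fall through to a human representative.
        return "ESCALATE: route this email to a customer service agent."
    return ANSWERS[representation]

print(respond("How do I reset my password?"))
print(respond("My widget arrived bent and also somehow on fire."))
```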
Lauren: Amazing.
Emily: “How can we take what we’ve learned about doing this for English and make it available for people to build grammars more quickly of other languages?” I took that English grammar and held it up next to the Japanese grammar I was working on and basically just stripped out everything that the Japanese made look English-specific and said, “Okay, here’s a starter kit. This is the start of the grammar matrix that you can use to build a new grammar.” That’s the beginning of that project. I have since been developing that – we can talk more about what “developing it” means – together with students, now, for 23 years. It’s a really long-standing project.
Lauren: Amazing. That is – in terms of linguistics research projects and, especially, computational linguistics projects – a really long time. It speaks to the fact that computers don’t process language the same way we do. A human by the age of 23 is fully fluent in a language and can be sharing that language with other people, but for a computer, you’re finding more and more – I assume at this point it’s really specific rules or factors or edge cases.
Emily: For the English grammar that I was describing, yes, it’s basically that. The grammar matrix grows when people add facilities to it for handling new things that happen across languages. For example, in some languages, you have a situation where, instead of having just one verb to say something like “bathe,” it requires two words together. You might have a verb like “take” that doesn’t mean very much on its own and then the noun “bath,” and “take a bath” means the same thing as “bathe.” This phenomenon, which is called “light verb constructions,” shows up in many different languages around the world in slightly different ways. One of my students is working on this phenomenon right now; when she’s done with her master’s thesis, you’ll be able to go to the grammar matrix website and enter in a description of light verb constructions in a language and have a grammar come out that can handle them.
Lauren: So excellent. And not something, if we were only working in English, that we would think about, but light verbs show up across different language families and across the grammars of languages that you want to build computational resources for, so it makes sense to add this kind of functionality.
Emily: Exactly. And light verbs do happen in English, but they happen in different ways and more intensively in other languages. You can kind of ignore them in English and get pretty far, but in a language like Bardi, for example, in Australia, you aren’t gonna be able to do very much if you don’t handle the light verbs.
Lauren: And now, hopefully at the end of this MA, we’ll be able to.
Emily: Yes, exactly.
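As a toy illustration of the phenomenon itself (not the Grammar Matrix machinery), a light verb construction rule maps a verb-noun pair like “take” + “bath” onto the single predicate “bathe,” rather than composing independent meanings for the two words. The rule entries below are made up for the example:

```python
# Hypothetical toy rules: a light verb construction collapses into one
# predicate; ordinary verb + object pairs keep their separate meanings.
LIGHT_VERB_CONSTRUCTIONS = {
    ("take", "bath"): "bathe",
    ("take", "walk"): "walk",
    ("give", "hug"): "hug",
}

def predicates(verb, noun):
    """Return the semantic predicates for a verb + object noun pair."""
    if (verb, noun) in LIGHT_VERB_CONSTRUCTIONS:
        return [LIGHT_VERB_CONSTRUCTIONS[(verb, noun)]]
    return [verb, noun]

print(predicates("take", "bath"))  # ['bathe'] -- one event of bathing
print(predicates("take", "book"))  # ['take', 'book'] -- ordinary object
```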
Lauren: Why is it useful to have resources and grammars that can be used for computers for languages like Bardi or, I mean, even large languages like Japanese?
Emily: Why would you want to build a grammar like this? Sometimes, it’s because you want to build a practical application where you can say, “Okay, I’m gonna take in this Japanese string, and I’m going to check it for grammatical errors,” or “I’m going to come up with a very precise representation of what it means that I can then use to do better question answering,” or things like that. But sometimes, what you’re really interested in is just what’s going on in that language. The cool thing about building grammars in a computer is that your analysis of light verb constructions has to work together with your analysis of coordination and your analysis of negation and your analysis of adverbs because they aren’t separate things, they’re all part of one grammar.
Lauren: And so, if we can make computers understand it, it’s a good way of validating that we have understood it and that we’ve described the phenomenon sufficiently.
Emily: And on top of that, if you have a collection of texts in the language, and you’ve got your grammar that you’ve built, and you wanna find what you haven’t yet understood about the language, you try running that text through your grammar and find all of the places where the grammar can’t process the sentence. That’s indicative of something new to look into.
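The coverage-testing loop Emily describes can be sketched in a few lines. The “grammar” below is a toy stand-in that only checks vocabulary, where a real grammar would also require a well-formed syntactic analysis:

```python
TOY_LEXICON = {"the", "dog", "cat", "sees", "sleeps"}

def parses(sentence):
    """Toy 'grammar': accept a sentence only if every word is known.
    A real grammar would also check the syntactic structure."""
    return all(word in TOY_LEXICON for word in sentence.lower().split())

corpus = [
    "the dog sees the cat",
    "the cat sleeps",
    "the dog takes a bath",  # a light verb construction not yet covered
]

# Every parse failure points at something new to investigate.
gaps = [s for s in corpus if not parses(s)]
print("Sentences the grammar can't process yet:", gaps)
```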
Lauren: It’s thanks to this kind of computational linguistics that all those blue squiggles turn up on my word processing, and I don’t make major syntactic mess ups while I’m writing.
Emily: That’s actually an interesting case. Historically, yes, the blue squiggles came from grammar engineering. I believe they are now done with the large language models. We can talk about that some if you want.
Lauren: Okay, sure. But it was that kind of grammar engineering that led to those initial developments in spell checkers and those kind of things.
Emily: Yes, exactly.
Lauren: Amazing. Attempting to get computers to understand human language has been part of the interest of computational scientists since the early days of 20th-century computing. I feel like a question that keeps popping up when you read the history of this is like, “And then someone figured something out, and they figured we’d solve language in five years.” Why haven’t we solved getting computers to understand language yet?
Emily: I think part of it is that getting computers to understand language is a very imprecise goal, and it is one where, if you really want the computer to behave the same way that a person would behave if they heard something and understood it, then you need way more than linguistics. You need something – and I really hate the term “artificial intelligence” – but you basically need to solve all of the problems that building artificial intelligence – if that were a worthy goal – would require solving. You can ask much narrower questions and build useful language technology – so grammar checkers, spell checkers – that is computers processing natural languages to good effect. Machine translation – it’s not the case that the computer has understood and then is giving you a rendition in the output language. Machine translation is just “Well, we’re gonna take this string of characters and turn it into that string of characters because, according to all of the data that was used to develop the system, those patterns relate to each other.”
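A toy sketch of that string-to-string view of machine translation: a phrase table (the kind of pattern mined from parallel text) maps substrings of the input to substrings of the output, with no understanding in between. The entries here are made up for illustration, and real systems are far more sophisticated, but the pattern-matching principle is the same:

```python
PHRASE_TABLE = {  # toy stand-in for patterns mined from parallel text
    "good morning": "bonjour",
    "thank you": "merci",
    "the cat": "le chat",
    "sleeps": "dort",
}

def translate(sentence):
    """Replace substrings of the input with substrings of the output,
    greedily trying two-word phrases before single words."""
    out, words, i = [], sentence.lower().split(), 0
    while i < len(words):
        two = " ".join(words[i:i + 2])
        if two in PHRASE_TABLE:
            out.append(PHRASE_TABLE[two]); i += 2
        elif words[i] in PHRASE_TABLE:
            out.append(PHRASE_TABLE[words[i]]); i += 1
        else:
            out.append(words[i]); i += 1  # pass unknown strings through
    return " ".join(out)

print(translate("The cat sleeps"))  # "le chat dort"
```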
Lauren: I think it’s also easier to understand from a linguistic perspective that when people say, “solve language,” they have this idea of language as a single, unified thing, but so far, we’ve only been talking about written things and the issues that are around syntax and meaning. But dealing with understanding or processing written language versus processing voice going in versus creating voice – they’re all different skills. They require different linguistic and computational skills to do well. Solving language involves solving, actually, hundreds and thousands of tiny different problems.
Emily: Many, many different problems, and they’re problems that, you say, involve different skills. So, are you dealing with sound files? Are you dealing with if you actually wanted to process something more like what a person is doing? Do you have video going on? Are you capturing the gesture and figuring out what shades of meaning the gesture is adding?
Lauren: Nodding vigorously here.
Emily: I know I don’t need to tell you that. [Laughs] But also pragmatics, right, we can get to a pretty clear representation for English at least of the “Who did what to whom?” in a sentence – the bare bones meaning in the form of semantics. But if we want to get to “Okay, but what did the person mean by saying that? How does that fit in with what we’ve been discussing so far and the best understanding possible of what the person is trying to do with those words?” that’s a whole other set of problems – that’s called “pragmatics” – that is well beyond anything that’s going on right now. There’s tiny little forays into computational pragmatics, but if you really want to understand language – a language, right, most of this work happens in English. We have a pretty good idea about how languages vary in their syntax. Variation at the level of semantics, less well studied. Variation in pragmatics, even less so. If we were going to solve language, we need to say which language.
Lauren: Which raises a very important point. As you’ve said, most of this work happens in English. In terms of computational linguistics, there’s been the sense that people are very pleased that we’ve now got maybe a few hundred languages that we have pretty good models for, but there’s still thousands of languages that we don’t have any good computational models for. What is required to make that happen? If you had a very large budget and a great deal many computational linguists to train at your disposal, what’s the first thing you would need to start doing?
Emily: The very first thing that I would start doing, I think, is engaging with communities and seeing which communities actually want computational work done on their languages. And then my ideal use of those resources would be to find the communities that want to do that, find the people in those communities who want to be computational linguists, and train them up rather than what’s usually a much more extractive, “We’re gonna grab your data and build something” kind of a thing. And then it becomes a question of “Okay, well, what do you want computers to be able to do with your language?” – a question to the community. Do you want to be able to translate in and out of, maybe, English or French or some other world or colonial language? Do you want a spell checker? Do you want a grammar checker? Do you want a dialogue partner for people who are learning the language? Do you want a dictionary that makes it easier to look up words? If your language is the kind of language that has a whole bunch of prefixes, just alphabetical order, you know, the words, isn’t gonna be very helpful. What’s needed? And then it depends – do you want automatic transcription? Do you want text-to-speech? Then depending on what the community is trying to build, you have different data requirements. If you wanna build a dictionary like that, that’s a question of sitting down and writing the rules of morphology for the language and collecting a big lexicon. If you want text-to-speech, you need lots and lots of recordings that have been transcribed in the language. If you want machine translation, you need lots and lots of parallel text between that language and the language you’re translating into.
Lauren: And so, a lot of that will use the same computational grammar models but will have slightly different takes on what those models are and will need different data to help those models do their job.
Emily: In some cases, the same models, in some cases, different. I think if we’re talking speech processing, automatic transcription, or speech-to-text, we’re definitely in machine learning territory, and so that’s one kind of model. Machine translation can be done in a model of the grammar mapped to semantics form, or it can be done with machine learning. The spell checker, especially if you’re dealing with a language that doesn’t have enormous amounts of texts to start with, you definitely want to do that in a someone-writes-down-the-rules kind of a fashion. That’s a kind of grammar engineering, but it’s distinct from the kind that I do with syntax.
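A minimal sketch of that someone-writes-down-the-rules kind of spell checker: a hand-built lexicon of stems plus a few affix rules, with no statistics involved. The stems and suffixes below are a toy stand-in for a real morphological description of a language:

```python
STEMS = {"walk", "talk", "play"}
SUFFIXES = {"", "s", "ed", "ing"}

def is_well_formed(word):
    """Accept a word only if it decomposes into a known stem + suffix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[: len(word) - len(suffix)] in STEMS:
            return True
    return False

for w in ["walked", "playing", "walkt"]:
    print(w, "->", "ok" if is_well_formed(w) else "flag as misspelling")
```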
Lauren: And so, it just starts to unpack how complicated this idea of “Computers do language” is because they’re doing lots of different things, and they need lots of different data. Obviously, we say “data” as though it’s some kind of objective, general pot of things, but when we say “data,” we mean maybe people’s recordings, maybe people’s stories, maybe knowledge and language that they don’t want people outside of their community to have. That creates different imperatives around whether these models are gonna be a way forward or useful for people.
Emily: And at the moment, we don’t have very many great models for collecting data and then handling it respectfully. There are some great models, and then there’s a lot of energy behind not doing that. The best example that I like to point to is the work of Te Hiku Media in Aotearoa (New Zealand). This is an organisation that grew out of a radio project for Te Reo Māori. They were at a community level collecting transcriptions of radio shows in Te Reo Māori, which is the Indigenous language of Aotearoa (New Zealand). Forgive my pronunciation; I’m trying my best. They had been approached over the years many, many times by big tech saying, “Give us that data. We’d like to buy that data,” and they said, “No, this belongs to the community.” They have developed something called the “Kaitiakitanga License,” which is a way that works for them of granting access to the data and keeping data sovereignty – basically keeping community control of the data. There are ways of thinking about this, but it really requires strength in community against the interests of big tech, which takes a very extractivist view of data.
Lauren: It’s good that there are some models being developed, and that this is being normalised as one possible way of going forward. As you’ve said, you’ve spent a lot of time working to build a grammar matrix for lots of different languages. This goes against a general trend of focusing on technologies for major languages where there’re clear commercial and large-audience imperatives. Part of this work has been making visible the fact that English is very much a default language in the computational linguistics space. Can you give us an introduction to the way that you started going about making the English-centric nature of computational linguistics more visible?
Emily: I think that this really came to a head in 2019 when I was getting very fed up with people writing about English as if it weren’t a language. They would say, “Here’s an algorithm for doing machine reading comprehension,” or “Here’s an algorithm for doing spell checking,” or whatever it is, and if the work was on English, they wouldn’t say so. It seems like, “Well, that’s a general solution,” and then anybody working on any other language would have to say, “Well, here’s a system for doing spell checking in Bardi,” or “Here’s a system for doing spell checking in Swahili,” or whatever it is. Those papers tended to get read as, “Well, that’s only for Bardi,” or “That’s only for Swahili,” where the English ones – because English was treated as default – were taken as general. I made a pest of myself at a conference in 2019 – the conference is called “NAACL” – where I basically just, after every talk where people didn’t mention the name of the language, went to the microphone, introduced myself, and said, “Excuse me, what language was this on?” which is a ridiculous question, right, because it’s obvious that it’s English. It’s sort of face-threatening. It’s impolite because it’s “Why are you asking this question?” but it’s also embarrassing for the asker. Like, “Why would you ask this silly question?” But I was just making a point. Somewhere along the line, people dubbed that the “Bender Rule,” that you have to name the language that you’re working on, especially if it’s English.
Lauren: I really appreciate your persistence, and I appreciate people who codified it into the Bender Rule because now it’s actually less threatening for me to say, “I’m just gonna invoke the Bender Rule and just check if this was just on English.” You’ve given us a very clear model where we can all very politely make pests of ourselves to remind people that solving something for English or improving a process for English doesn’t automatically translate to that working for other languages as well.
Emily: Exactly. And I like to think that, basically, by lending my name to it, I’m allowing people to ask that question while blaming it on me.
Lauren: Great. Thank you very much. I do blame it on you all the time in the nicest possible way.
Emily: Excellent.
Lauren: This seems to be part of a larger process you’ve been working on. Obviously, there’s people working on computational processes for English, and you’re trying to be very much a linguist at them, but it seems like you also are spending a lot of time, especially in terms of ethical use of computational processes, trying to explain linguistics to computer scientists as well. How is that work going? Are computer scientists receptive to what linguistics has to offer?
Emily: Computer scientists are a large and diverse group in terms of their attitudes. They are an unfortunately un-diverse group in other ways. It’s an area of research and development that has a lot of money in it right now. There’s always new people coming in, and so it feels like no matter how much teaching of linguistics I do, there are still just as many people who don’t know about it as there ever were because new people are coming in. That said, I think it’s going well. I have written two books that I call, informally, “The 100 Things” books because they started off as tutorials at these computational linguistics conferences with the title, “100 Things You Always Wanted to Know About Linguistics But Were Afraid to Ask” and then the subtitle, “For Fear of Being Told 1,000 More.” [Laughter]
Lauren: I mean, it’s not a mischaracterisation of linguists, that’s for sure.
Emily: We’re gonna keep linguisting at you, right. The first one is about morphology and syntax. I basically just wrote down, literally, 100 things that I wish that people working in natural language processing in general knew about how language works because they tend to see language as just strings of words without structure. Worse than that, they tend to see language as directly being the information they’re interested in. I used to have really confusing conversations with colleagues in computer science here – people who were interested in gathering information from large collections of texts, like the web (this is a process called “information extraction”) – and when I finally realised that we were focusing on different things – I was interested in the language, and they were interested in the information that was expressed in the language – the conversations started making sense. I came up with a metaphor to help myself, which is: if you live somewhere rainy, picture you’ve got a rain-splattered window. You can focus on the raindrops, or you can focus on the scene through the window distorted by the raindrops. Language and its structures are the raindrops, which have an effect on what it is that you can see through the window, but it is very easy to look right through them and imagine you’re just seeing the information of the world outside. When I realised that, as a computational linguist, I’m interested in the raindrops, but some of these people working in computer language processing are just staring straight through them at the stuff outside, it helped me communicate a lot better.
Lauren: I feel like I’ve had a lot of conversations with computational scientists where they’re like, “Ah, we did a big semantic analysis of –” so there’s a process you can apply where a whole bunch of algorithms run over some set of texts – I think they were usually pulling things from Reddit, which you could do easily – and it says something like, “80% of people in this thread hate chocolate ice cream.” I’d always be like, “Okay, but did you account for the person who’s like, ‘Oh my god, I hate how delicious this ice cream is’?” And they’re just like, “Ah…well, no, because we just used – ‘hate’ was negative so… ‘delicious’ was positive, so this person probably came out in the wash.” I’m like, “No, this is a person who extremely likes this ice cream,” and it’s also a very idiomatic, informal kind of English. I certainly wouldn’t write that in a professional reference for someone – “I hate how amazing this person is. You should hire them.” As a linguist, I’m really interested in these nuanced, novel edge cases, and the computational scientists are like, “Oh, we just hope we get enough data that they disappear in the noise.”
Emily: And the words are the data. The words are the meaning. There’s no separation there. There’s no structure to the raindrops. “If I have the words, I have the meaning” seems to be the attitude.
Lauren: Well, it’s great that you’re doing the work of slowly letting them down from that assumption.
Emily: We’re trying. Oh, one other thing about these books. The first one is morphology and syntax, the second one is semantics and pragmatics. In both of them – the second one is co-authored with Alex Lascarides – in both of them I have the concept index and the index of languages. Every time we have an example sentence, it shows up as an entry in the index for languages. There’s an index entry for English. Even though it indexes almost every single page in the book, it’s in there because English is a language.
Lauren: There’s this thing called the “Bender Rule.” I don’t know if you’ve heard of it, but I’m really glad that you’re following its principles. A lot of the work you’ve been doing is with a type of computational linguistics where you are building rules to process language and create useful computational outputs, but there are other models for how people can use language computationally.
Emily: I tend to do symbolic or rule-based computational linguistics. I’m really interested in “What are the rules of grammar for this language or for this phenomenon across languages? How can I encode them so that I can get the machine to test them, but also, I can still read them?” But a lot of work in computational linguistics, instead, uses statistical models, so building models that can represent patterns across large bodies of text.
Lauren: Oh, so that’s like predictive text on my mobile phone where it’s so used to reading all of the data that it has from other people’s text messages and my text messages that sometimes it can just predict its way through a whole message for me.
Emily: Yes, exactly. And in fact, I don’t know if this is so true anymore, but for a while, you could see that the models were all different on different phones. Remember we used to play that game where you typed in, “Sorry I’m late, I…” and then just picked the middle option over and over again, and people would get different, fun answers.
Lauren: Yes, and you’d get wildly different answers.
Emily: That reflects local statistics being gathered based on how you’ve been using that phone versus a model that it may have started with that was based on something more generic. That is, yes, an example of statistical patterns. You also see these – and this is fun in automatic transcriptions, like the closed captioning in TV shows if you’re thinking about live news or something where it wasn’t done ahead of time, and they get to a name of a person or a place which clearly wasn’t in the training data already represented in the model, and ridiculous, funny things come out because the system has to fall back to statistical patterns about what that word might have been, and it reveals interesting things about the training data.
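A minimal sketch of the local statistics behind that game: a bigram model counts which word follows which in the messages on one phone, then always suggests the most frequent continuation. Different message histories give different chains, which is why everyone’s answers diverged. The messages below are made up for illustration:

```python
from collections import Counter, defaultdict

messages = [
    "sorry i'm late i missed the bus",
    "sorry i'm late i lost track of time",
    "i missed the call",
]

# Count which word follows which in this phone's own messages.
follows = defaultdict(Counter)
for msg in messages:
    words = msg.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

def suggest(word):
    """Most frequent word observed after `word`, if any."""
    options = follows[word]
    return options.most_common(1)[0][0] if options else None

# Play the "keep picking the top suggestion" game:
word = "sorry"
for _ in range(5):
    print(word, end=" ")
    word = suggest(word)
    if word is None:
        break
print()  # -> sorry i'm late i missed
```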
Lauren: We used to always put the show through a first pass on YouTube, where Lingthusiasm is also hosted, before Sarah Dopierala came in and transformed our lives by being an amazing transcriptionist. For years, YouTube would transcribe “Lingthusiasm” – a word it has never encountered before, in its defence, as a computer – it would come up with “Link Susy I am” most often. We still occasionally refer to “Link Susy I am.” It was interesting when it finally, clearly, had enough episodes with Lingthusiasm with our manually updated transcripts that it got the hang of it, but that was definitely a case where it needed to learn. We definitely have a much higher success rate of perfect, first-time transcripts with Sarah.
Emily: That pattern that you saw happening with YouTube, that change, shows you that Google was absolutely taking your data and using it to train their models. In the podcast that I run, Mystery AI Hype Theater 3000, we have some phrases that are uncommon, and we do use a first-pass auto-transcriber. For example, we refer to the so-called AI models as “Mathy Maths.”
Lauren: “Mathy Maths,” yeah.
Emily: That’ll come out as like, “Matthew Math.”
Lauren: Oh, my good friend Matthew Math.
Emily: [Laughs] And the phrase “stochastic parrots” sometimes comes out as like, “sarcastic parrots” or things like that.
Lauren: And you and Alex both have, I would say, relatively standard North American English accents, which is really important for these models because, so far, we’ve just been talking about data where it’s found, and like, we’re linguists working with it and processing it before the computer gets to it. But with a lot of these new statistical models, it’s just taking what you give it. That means, as an Australian English speaker, I’m relatively okay, but it’s not as good for me as it is for a Brit or an American. And then if you’re a Singaporean English or Indian English speaker, even as a native English speaker, the models aren’t trained with you in mind as the default user. It just gets more and more challenging.
Emily: Exactly. Some of that is a question of “What could the companies training these models easily get their hands on?” But some of it is also a question of “Who were they designing for in the first instance? Whose data did they think of as ‘normal data’ that they wanted to collect?”
Lauren: These are deliberate choices that are being made.
Emily: Absolutely.
Lauren: With these statistical models, how do they differ from the grammars that you’ve created?
Emily: In a rule-based grammar system, somebody is sitting down and actually writing all the rules. Then when you try a sentence, and it doesn’t work as expected, you can trace through “What rule was used and shouldn’t have been used?” “What rule did you expect to have showing up in that analysis that wasn’t there?” and you can debug like that. The statistical models, instead, you build the model that’s the receptacle for the statistics. You gather a whole bunch of data, and then you use this receptacle model to process the data item by item and have it output according to its current statistics, likely answers, and then compare them to what’s actually there, and then update the statistics every time it’s wrong. You do that over and over and over again, and it becomes more and more effective at closely modelling the patterns in the data, but you can’t open it up and say, “Okay, this part is why it gives that output, and I want to change that.” It’s much more amorphous, in a sense, much more of a “black box” is the terminology that gets used a lot.
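A minimal sketch of that guess-compare-update loop, using simple word-pair counts as the “receptacle” for the statistics. Real language models do the same thing with billions of numerical parameters, which is what makes them a black box; here the statistics are still small enough to inspect:

```python
from collections import Counter, defaultdict

text = "the cat sat on the mat the cat sat".split()

counts = defaultdict(Counter)  # the receptacle for the statistics
wrong = 0
for prev, actual in zip(text, text[1:]):
    guess = counts[prev].most_common(1)     # likely answer so far
    if not guess or guess[0][0] != actual:
        wrong += 1                          # the model's guess was wrong
    counts[prev][actual] += 1               # update the statistics either way

print("wrong guesses while training:", wrong)
print("most likely word after 'the':", counts["the"].most_common(1))
```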
Lauren: In 2020, we were really lucky to have Janelle Shane join us on the show and walk us through one of these generative statistical models from that era. She generated some Lingthusiasm transcripts based on the first 40 or so episodes of transcripts that we had. When it generated transcripts, the model had this real fixation on soup. It got the intro to Lingthusiasm right, because we say that 40 times across 40 episodes, but then it would be like, “Today, we’re talking about soup.” And we were like, “Janelle, what’s with the soup?” and she’s like, “I can’t tell you. It’s a black box in there” – the inner parts of these models are literally referred to as hidden layers. So, because we don’t know why it was fixated on soup, there’s some great fake Lingthusiasm transcripts that we read – very soup-focused, and very focused on a couple of major pieces of fan fiction, which, again, is classic fan fiction favourite IP, because it read a bunch of fan fiction as well. You can make some guesses about why it’s talking about wizards a whole bunch, but you can’t make many guesses about why it’s talking about soup a whole bunch, and that makes it hard to debug that issue.
Emily: Hard to debug, yeah. But also, if you don’t know the original training data – so it sounds like she took a model that had been trained on some collection of data –
Lauren: Yes, so that it could be coherent with only those 40 transcripts.
Emily: Exactly, yeah. But if you don’t know what’s in that training data, then you are even more poorly placed to figure out “Why soup?”
Lauren: And since we did that episode, I think the big thing that’s changed is that the models are being given enough extra data that they’re no longer fixated on soup, but they’ve also just become easier for everyday people to use. Part of why we were really grateful for her to come on the show is that she walked us through the fact that she was still using scripting language to ingest those transcripts and to generate the new fabricated text. It all looked very straightforward if you’re a computer person, but you need to be a person who’s comfortable with scripting languages. That’s no longer the case with these new chat-based interfaces. That’s really changed the extent to which people interact with these models.
Emily: Yes, exactly. There’s a few things that have changed. One is there’s been some engineering that allowed companies to make models that could actually take advantage of very large data sets. There has been the collection of very large data sets in a not very consent-based fashion. Then there has been the establishment of these chat interfaces, as you say, where you can just go and poke at it and get something back. Honestly, the biggest thing that happened – the reason that all of a sudden everybody’s talking about ChatGPT and so-called “AI” – was that OpenAI set up this interface where anybody could go poke at it, and then they had a million people sharing their favourite examples. It was this marketing win for OpenAI and a big loss for the rest of us.
Lauren: I think the sharing of examples is really important as well because people don’t talk very often about the human curation that goes into picking funny or coherent or relevant examples. We had to junk so many of those fake transcripts to find the handful that were funny enough to pretend-read and give a rendition of. When people are sharing their favourite things that come out of these machines, that’s a level of human interaction with them that I think is often missing from the conversation. But making it very easy for people to generate a whole bunch of content and then pick their favourite and share it has really normalised the use of these large language models as ways of playing with language.
Emily: Exactly. If you were someone who’s not playing with it, or even if you are, most of the output you’re going to see is other people sharing their favourites. You get a very distorted view of what it’s doing.
Lauren: In terms of what it is doing, you know, we talked before about when a computer is doing translation between two languages, it’s not that it’s understanding, it’s replacing one string of texts with another string of text with these generative models that are creating this text that, on an initial read, reads like English. What are some of the limitations of these models?
Emily: Just like with machine translation, it’s not understanding. The chat interface encourages you to think that you are asking the chat bot a question, and it is answering you. This isn’t what’s happening. You are inputting a string, and then the model is programmed to come up with a likely continuation of that string. But a lot of its training data is dialogues, and so something that takes the form of a question provokes as a likely continuation an answer. But it hasn’t understood. It doesn’t have a database that it’s consulting. It doesn’t have access to factual information. It’s just coming out with a likely next string given what you put in. Any time it seems to make sense, it’s because the person using it is the one making sense of it.
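A toy sketch of “likely continuation”: given a prompt, the model extends it word by word from co-occurrence statistics. With the made-up training data below, the continuation ends in “paris” or “lyon” purely according to the counts; whether the output is true is an accident of the data, since there is no fact lookup anywhere:

```python
import random
from collections import Counter, defaultdict

training = [
    "what is the capital of france ? the capital of france is paris",
    "what is the capital of france ? the capital of france is lyon",
]

# Count co-occurrences: which word follows which in the training text.
follows = defaultdict(Counter)
for line in training:
    words = line.split()
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1

def continue_string(prompt, n_words=8):
    """Extend the prompt with likely next words: no understanding,
    no database, just sampling proportional to observed counts."""
    words = prompt.split()
    for _ in range(n_words):
        options = follows[words[-1]]
        if not options:
            break
        words.append(random.choices(list(options), weights=options.values())[0])
    return " ".join(words)

print(continue_string("what is the capital of france ?"))
```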
Lauren: And because it’s had enough input – it basically took large chunks of the English-speaking internet – there’s a statistical likelihood it is going to say something that is correct, but that is only a statistical chance. It doesn’t actually have the ability to verify its own factual information.
Emily: Exactly. I really dislike this term, but people talk about “hallucinations” with these models to describe cases where it outputs something that is not factually correct.
Lauren: Okay, why is “hallucination” not an appropriate word for you?
Emily: There’s two problems with it. One speaks to what you were just talking about which is if it says something that is factually correct, that is also just by chance. It’s always doing the same thing; it’s just that sometimes it corresponds to something we take to be true and sometimes it doesn’t. But also, if you think about the term “hallucination,” it refers to perceiving things that aren’t there. That suggests that these chat bots are perceiving things, which they very much aren’t. That’s why I don’t like the term.
Lauren: Fair enough. It’s a bit too human for what they’re actually doing, which is a pretty cool party trick, but it is just a party trick. One thing I’ve really appreciated about your critiquing of these systems is that you situate the linguistic issues around lack of actual understanding and real pragmatic capability, but you also talk about it in terms of these larger systems issues in terms of problems with the data and problems with the amount of computer processing it takes to perform this party trick, which are a combination of alarming issues. Can you talk to some of those issues and maybe some of the other issues that you’ve seen crop up with these models?
Emily: It’s so vexed. So, one place to start is a paper that I wrote with six other people in late 2020 called “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜” and then the parrot emoji is part of the title.
Lauren: Excellent.
Emily: This paper became famous in large part because five of the co-authors were at Google, and Google decided, after approving it for submission to a conference, that, in fact, it should be either retracted or have their names taken off of it, and, ultimately, three of the authors took their names off, and two others got fired over it.
Lauren: Right, okay. That is big impact for a conference paper.
Emily: In the aftermath of that, the paper’s impact was enhanced by the fact that the first author to get fired, Dr. Timnit Gebru, was masterful at taking the ensuing media attention and using it to shine a light on the mistreatment of Black women in tech. She did an amazing job. Dr. Margaret Mitchell was the other one who got fired. It took a couple more months in her case.
Lauren: Oh, you mean, her name is not “Shmargaret Shmitchell”? [Laughter] That was a pseudonym?
Emily: That was a pseudonym, yeah. Who would’ve thought?
Lauren: I can’t believe it.
Emily: We wrote that paper because Dr. Gebru came to me in a Twitter DM in September of 2020 saying, “Hey, has anyone written about the problems with these large language models and what we should be considering?” because she was a research scientist in AI ethics at Google. It was literally her job to research this stuff and write about it. She had seen people around her pushing for ever bigger language models. This is 2020. So, the 2020 language models are small compared to the ones that we have now. Doing her job, she said, “Hey, we should be looking into what to look out for down this path.” I wrote back saying, “I don’t know of any such papers, but off the top of my head, here are the issues that I would expect to find in one based on independent papers,” so looking at things one by one in the literature. That was things like environmental impact, like the fact that they pick up biases and systems of oppression from the training data, like the fact that if you have a system that can output plausible-looking synthetic text that nobody is accountable for, that can cause various problems down the road when people believe it to be a real text. Then a beat or so later, I said, “Hey, this looks like a paper outline. Do you wanna write it?” That’s how the paper came to be. There are two really important things that we didn’t realise at the time. One is the extent to which creating these systems relies on exploitative labour practices. That is both basically stealing everybody’s text without consent, but then also, in order to keep the systems from routinely outputting bigoted garbage, there’s this extra layer of so-called training where poorly paid workers, working long hours without psychological support, have to look at all the awful stuff and say, “That’s bad. That’s bad. This one’s okay,” and so on. This tends to be outsourced. There are famously workers in Kenya who have been doing this. We didn’t know about that at the time, though some of the information was available, so we could have.
Lauren: And it keeps outputting highly bigoted, disgusting text because it’s been trained on the internet, which as we all know is a bastion of enlightened and equal opportunity conversation.
Emily: Yes. But even if you go with only, for example, scientific papers, which are supposed to not be awful, guess what? There’s such a thing as scientific racism, and it is well embedded in the scientific literature. There was a large language model that Meta put together called “Galactica.” It came out right before ChatGPT. It was built as a way to access the world’s scientific knowledge, which of course it isn’t because if you take a whole bunch of scientific text, chop it up, and turn it into papier mâché, what you get out is not science but papier mâché, right. But anyway, people were poking at this and very quickly got it to say racist and otherwise terrible things in the guise of being scientific. I think it was the linguist Rikker Dockum who asked it something about stigmatisation of language varieties, and it came out with something about how African Americans don’t have a language of their own.
Lauren: Oh. A thing that we don’t even need to fact check because that is incorrect.
Emily: Anyway, you can certainly get to bigoted stuff starting with things less awful than the stuff that’s out there on the internet, but also, these models are trained on what’s out there on the internet. Labour exploitation was one thing that we missed. The other thing that we missed in the stochastic parrots paper was we had no idea that people were gonna get so excited about synthetic text. In the section where we actually introduce the term “stochastic parrot” to describe these machines that are outputting text with no understanding and no accountability, we thought we were going out on thin ice. Like, “People aren’t really gonna do this.” But now, it’s all over the place, and everyone is trying to sell it to you as something you might pay for.
Lauren: Yes, in many ways it’s a paper that was very prescient about a technology that has really become very quickly normalised, which creates a compounding effect in terms of data because now everyone’s sharing the synthetic text that they’re creating for fun, but people are also using it to populate webpages, and heaven knows a lot of spam in my inbox is getting longer because it can just be generated with these machines and processes as well. The data these models were trained on used to be human-created; now, if you try to scrape the internet, there’d be all of this synthetic machine-created language as well. They’ll just start training on their own output, which – I’m not a computational linguist, but that just sounds like it’s not a great idea.
Emily: If you think about what it is that you want to use these for, then ultimately, data quality really, really matters and, ideally, data quality that is not only good data but well-documented data, so you can decide, “Hey, is this good for my use case?” The ability to use the web as corpus to do linguistic studies is rapidly degrading. In fact, there’s a computational linguist named Robyn Speer who used to maintain a project called “wordfreq” which counted frequencies of words in web text over time. She has discontinued it because she says, “There’s too much synthetic garbage out there anymore. I can’t actually do anything reliable here. So, this is done.”
Lauren: So, it’s bad for computational linguistics. It’s bad for linguistics. And just to be clear, with these models, there’s no magic tweak that we can make to make them be factual.
Emily: No. Not at all. Because they’re not representing facts. They’re representing co-occurrences of words in text. Does this spelling happen a lot next to that spelling? Do they happen in the same places? Then they’re likely to be output in the same places. That sometimes reflects things that happen in the world, because sometimes the training text is things that people said because they were describing the actual world, but if it outputs something factual, it’s just by accident.
Lauren: So, your work on the stochastic parrots paper really set the tone for this conversation in linguistics. And you’ve been continuing to talk about the issues and challenges with these large language models and other kinds of generative models because, obviously, similar processes are used for image creation, and we’ve only really talked about the text-based stuff, and there’s a whole bunch of things happening with audio and spoken language as well. But there’ll be heaps more of that on Mystery AI Hype Theater 3000, and also in your book The AI Con, which is coming out in spring 2025.
Emily: Yes, I am super excited for this book. It was a delight to work with Dr. Alex Hanna, who is my co-host on Mystery AI Hype Theater 3000, to put together a book that is for popular audiences. One of the things that I think worked really well is that she’s a sociologist, and I’m a linguist, and so we have different technical terms. We were able to basically catch each other, it’s like, “I don’t really know what that word means,” and so the general audience isn’t gonna know what that word means. Hopefully, it will be nice and accessible. The subtitle, by the way – so the title, The AI Con, and the subtitle is “How to Fight Big Tech’s Hype and Create the Future We Want.” It’ll be out in May of 2025.
Lauren: And it seems like, given the limitations of these big models, there’s still lots of space for the kind of symbolic grammar-processing work that you do.
Emily: Yes, there’s definitely space for symbolic grammar-based work, especially if you’re interested in something that will get a correct answer, if it gets an answer at all. And you’re in a scenario where it’s okay to say, “No possibility here. Let’s send this on to a human,” for example. But also, there’s a lot of room for linguistics in designing better statistical natural language processing in understanding what it is that the person is going to be doing with the computer and how people relate to language so that we can design systems that are not misleading but, in fact, are useful tools.
Lauren: If you could leave people knowing one thing about linguistics, what would it be?
Emily: In light of this conversation, the thing that I would want people to know is that linguistics is the area that lets us zoom in on language and pick apart the rain drops and understand their structure so that we can then zoom back out and have a better idea of what’s going on with the language in the world.
Lauren: Thank you so much for joining us today, Emily.
Emily: It’s been an absolute pleasure.
[Music]
Lauren: For more Lingthusiasm and links to all the things mentioned in this episode, go to lingthusiasm.com. You can listen to us on all of the podcast platforms or lingthusiasm.com. You can get transcripts of every episode on lingthusiasm.com/transcripts. You can follow @lingthusiasm on all social media sites. You can get scarves with lots of linguistics patterns on them including IPA, branching tree diagrams, bouba and kiki, and our favourite esoteric Unicode symbols, plus other Lingthusiasm merch – like our “Etymology isn’t Destiny” t-shirts and Gavagai pin buttons – at lingthusiasm.com/merch.
My social media and blog is Superlinguo. Links to Gretchen’s social media can be found at gretchenmcculloch.com. Her blog is AllThingsLinguistic.com. Her book about internet language is called Because Internet.
Lingthusiasm is able to keep existing thanks to the support of our patrons. If you want to get an extra Lingthusiasm episode to listen to every month, our entire archive of bonus episodes to listen to right now, or if you just want to help keep the show running ad-free, go to patreon.com/lingthusiasm or follow the links from our website. Patrons can also get access to our Discord chatroom to talk with other linguistics fans and be the first to find out about new merch and other announcements. Recent bonus topics include behind-the-scenes on the Tom Scott Language Files with Tom and team, linguistics travel, and also xenolinguistics and what alien languages might be like. If you can’t afford to pledge, that’s okay, too. We really appreciate it if you can recommend Lingthusiasm to anyone in your life who’s curious about language.
Lingthusiasm is created and produced by Gretchen McCulloch and Lauren Gawne. Our Senior Producer is Claire Gawne, our Editorial Producer is Sarah Dopierala, our Production Assistant is Martha Tsutsui-Billins, and our Editorial Assistant is Jon Kruk. Our music is “Ancient City” by The Triangles.
Emily: Stay lingthusiastic!
[Music]

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.