Speech and language technology: People are not dictionaries

with Mark Liberman


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

As a researcher and research manager at AT&T Bell Labs from 1975 to 1990, as a member of Penn's faculty since 1990, and as founder and director of the Linguistic Data Consortium since 1992, Liberman has participated actively in the evolution of speech and language research towards a model of quantitative, replicable studies based on published datasets. In its 26 years of existence, the LDC has distributed more than 160,000 copies of nearly 3,000 datasets to more than 5,600 research organizations in 92 countries.

In his personal research, a key focus has been the scientific application of techniques from machine learning and human language technology to very large speech collections, in the range of dozens to thousands of hours and involving up to tens of thousands of speakers. Scientific application areas include phonetics, psychology of language, sociolinguistics, and clinical diagnosis and monitoring. The last category includes current collaborations on speech, language, and communicative interaction in Autism Spectrum Disorder, Frontotemporal Degeneration, and Alzheimer's Disease.

During this conversation, I ask Mark how the marriage between linguistics and computer science works today, and how it has worked since the early days of the field, before it even had a name. We also talk about what skills young students are equipped with and what applications computational linguistics has today. I ask seemingly trivial questions like "how many languages are there in the world?", and you never get a trivial answer from a world-class expert like Mark. I have learnt so much from this conversation and I hope you will too! My new favourite quote is: "A language is a dialect with an army and a navy".

More detailed info can be found here: https://www.ling.upenn.edu/~myl/LibermanCV.html

Fig. 1 - The original interview for this podcast was recorded over Skype. I finally had the pleasure of meeting Prof. Liberman during a visit to Philadelphia in November 2019. This picture was taken in his office at the LDC headquarters.

Highlights from this episode

At minute 1:45: "Applications of digital processing technology to linguistic problems of language, and then also to problems in the humanities, is pretty much as old as the existence of computing machinery itself."

At minute 7:35: "It's not that there was computer science and there was linguistics, and then at a certain point they say, oh let's form computational linguistics. Rather people on both sides had their own problems that they were trying to solve in ways that involve computational methods."

At minute 14:03: "One analogy that's been made is between these modern [artificial intelligence] techniques and alchemy, because the alchemists were actually extremely effective practical chemists. You know, they knew how to extract acids, they knew how to create flavors, and so on, but almost every process was kind of a thing in itself. So, to become a good alchemist you had to engage in a very very long apprenticeship and even then when you approached a new problem it wasn't really clear how to do it, because the theory behind all of this was either non-existent or nonsensical."

At minute 15:09: I ask Mark whether the main difference between linguists and computational linguists lies in the questions that are asked, or in the methods applied to answer those questions.

At minute 16:36: Modern computational tools developed by engineers "allow us to do literally in hours what previously would have taken decades."

At minute 18:32: Noam Chomsky was on Mark's dissertation committee at MIT in 1975.

At minute 19:30: Mark talks about the clinical applications of speech and language analysis, especially about autism and the early detection of Alzheimer's disease.

At minute 20:27: "I think anyone who looks at this problem [the definition of autism] quickly realizes that it's not just one dimension, that it's not a spectrum but a space, and furthermore it's a space that we all live in. It's just that some corners of the space, because they interfere with people's ability to carry out ordinary life have been sort of medicalised. So, one scientific question is what the dimensions of this space really are."

At minute 24:14: Mark talks about the Linguistic Data Consortium (LDC), its aims and how it is doing today.

At minute 26:15: "There was a period of a decade or so that people sometimes referred to as the AI desert."

At minute 27:13: "Glamour and deceit, otherwise known as BS:" Mark quotes John Pierce about research in AI in the mid-1970s. Pierce was the founder of the center where Mark used to work at Bell Labs.

At minute 31:15: Talking about work at the LDC, Mark says: "Sometimes people send us a cardboard box full of analog tapes, [and] we arrange for normalization of formatting, and then we do curation, [...] we handle intellectual property, rights negotiations, and make sure that privacy and confidentiality and human subjects constraints are obeyed, and so on."

At minute 32:00: When did audio start being an important part of the material analysed by linguists? Mark says, "There was already digital audio in process in the late 1930s, early 1940s."

At minute 33:17: Churchill and Roosevelt used an encrypted radio transatlantic telephone system involving a room full of complicated and power-hungry apparatus in London and one in Washington.

At minute 34:03: "The novel [by Solzhenitsyn] is not primarily about technology, but the technology is there in the background, and I think it is actually quite accurately portrayed."

At minute 33:17: What is the most remarkable advancement in the field of technology that Mark has witnessed during his career? "The most remarkable changes [for linguistics] are the result of the same developments and forces that are changing everything else in modern life. Namely, the development of ubiquitous inexpensive high bandwidth digital networking, the exponential improvement in the cost performance of various kinds of computational devices, including computers and various kinds of wearable devices, as well as 'the cloud' as we call it now."

At minute 36:14: "For anyone interested in analysis of speech and language, using digital means is, you know, like walking into an amazing magical garden." [Talking about the abundance of material available on platforms like YouTube today.]

At minute 36:38: "I think the most important, the most impressive, the most valuable, the most interesting and insightful [technological] developments are actually still in the future."

At minute 38:58: "People are not dictionaries."

At minute 45:10: What is the difference between language and speech?

At minute 48:45: 2018 was the European Year of Cultural Heritage (EYCH), and heritage has a lot to do with identity. Language also has a lot to do with identity. Mark talks about language preservation and documentation and how audio recordings are valued and preserved.

At minute 49:54: "One could almost argue that the density of endangered [...] languages is actually greater in Europe than almost anywhere else in the world."

At minute 51:07: "It has been said that there was a time when you could walk from the English Channel to the Straits of Gibraltar, through France and Spain, or from the English Channel to Sicily, through France and Italy, and never pass a point where the people in one village couldn't speak with the people in the next village."
Because "the state of nature," especially in settled agricultural civilizations, is generally for a kind of geographical dialect continuum - that is, until "the nation-states come along to change this."

At minute 53:03: How many languages are there in the world? And who keeps count? We take this information for granted, we just go online and ask Google. But who keeps track of the languages that are dying out? Who labels the endangered ones as endangered?

At minute 53:46: "Obviously, the question of what's a distinct language and what's a variety is not an easy question to answer. In fact, it's not a question that has a coherent answer. As the famous saying goes, 'a language is a dialect with an army and a navy.'"

At minute 55:10: How does a linguistic variety become a language? In order to exist, a language needs to be recognized. Mark talks about the role of political movements in this process of recognition.

Fig. 2 - At the LDC in Philadelphia, PA, on November 8th, 2019.

People, places and organizations mentioned in the interview

Timed links are available in the description of this episode on YouTube.

  • The Enigma decryption project at minute 2:03 and Alan Turing at minute 2:53
  • American linguist Zellig Harris at minute 5:08
  • Noam Chomsky and the Chomsky Hierarchy at minute 6:05
  • Marcel-Paul Schützenberger and the Chomsky-Schützenberger hierarchy at minute 6:07
  • League of European Research Universities (LERU), mentioned at minute 8:28: https://www.leru.org
  • The Linguistic Data Consortium (LDC) at minute 24:14: https://www.ldc.upenn.edu
  • The book "Giant Brains; or, Machines That Think" by Edmund Callis Berkeley published in 1949, mentioned at minute 25:24
  • John Pierce, founder of the center at Bell Labs where Mark used to work, at minute 27:13
  • Charles Wayne, former Director of Speech and Language Research at the federal Defense Advanced Research Projects Agency (DARPA), at minute 28:51
  • IBM and Google, at minute 30:46
  • Aleksandr Solzhenitsyn's novel "In the First Circle" published in 1968, at minute 32:12
  • Alan Turing (already mentioned at minute 2:53), at minute 32:05
  • Bell Labs, at minute 33:08
  • Claude Shannon, at minute 33:09
  • Winston Churchill and Franklin D. Roosevelt, at minute 33:15
  • The sharashka that Solzhenitsyn writes about (see quote above), at minute 33:36
  • Joseph Stalin (in the context of Solzhenitsyn's book), at minute 33:41
  • Audiobooks, at minute 35:58
  • YouTube, at minute 36:01
  • National Heart, Lung, and Blood Institute (NHLBI), at minute 40:02
  • The Framingham heart study at Framingham, Massachusetts, at minute 40:14: https://www.framinghamheartstudy.org
  • The Netherlands, for the numerous local varieties of Dutch that are at risk of dying out, at minute 49:34
  • The French Revolution, at minute 51:59
  • SIL International (Summer Institute of Linguistics), at minute 53:07: https://www.sil.org
  • "Ethnologue: Languages of the World," a publication that provides statistics on the living languages of the world, at minute 53:12: https://www.ethnologue.com
  • ISO 639-3 - SIL International, standard for the representation of names of languages, at minute 53:32
  • Former Yugoslavia, and its linguistic situation after the national breakup, at minute 55:15


Episode transcript


Host: Federica Bressan [Federica]
Guest: Mark Liberman [Mark]

[Federica]: Welcome to a new episode of Technoculture. I'm your host, Federica Bressan, and today my guest is Mark Liberman, an American linguist who has a dual appointment at the University of Pennsylvania, both at the Department of Linguistics and at the Department of Computer and Information Sciences. Mark is also founder and director of the LDC, the Linguistic Data Consortium. Welcome, Mark.

[Mark]: Glad to be here.

[Federica]: So what fascinates me about computational linguistics is that it's a field where two different disciplines — computer science and linguistics — have encountered each other quite a long time ago. That means that today, computational linguistics has some history. We can see how these two disciplines merged and now operate together, and that fascinates me because my current research is placed in the digital humanities, which is not technically such a new thing, but it's recent, and people still debate how to, in fact, put together these different disciplines and see them operate together with a scientifically credible methodology.

So I would like to ask you, to begin with: How does this marriage between linguistics and computer science work, and when did it start?

[Mark]: You know, I think there are two different questions here. One is the roots of computational linguistics (and for that matter digital humanities), and the other is the development of the nomenclature (calling it "computational linguistics" or calling it "digital humanities"). And in fact, applications of digital processing technology to linguistic, to problems of language and then also to problems in the humanities, is pretty much as old as the existence of computing machinery itself. So during the Second World War, the Enigma decryption project involved the construction of both special purpose and a movement in the direction of general-purpose computers, and although the goal of that technology was to decrypt transmissions, the transmissions involved were textual transmissions, and the calculations done had very much to do with the differential probability of different sequences of letters and words in German military transmissions. And so the problems involved, for example, the estimation of what we call N-gram models. The techniques commonly used there go back to ideas that were developed by Alan Turing in that project, so, you know, that was already in the early 1940s. So, you know, the term "computational linguistics" didn't really exist then, but the same algorithmic ideas were implicit in that very early work. And then, similarly, in the digital humanities, even back in the era of, you know, punched cards, sorting computing, one of the early applications that people worked on was the calculation of concordances of texts. I don't know exactly when that was first done, but it must have been done in the relatively early 1950s, I'm pretty sure.
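The N-gram estimation Mark mentions can be illustrated with a minimal sketch. It is not a reconstruction of the wartime methods: it only counts adjacent letter pairs in a made-up sample string and turns the counts into conditional probabilities. The function name `bigram_probabilities` and the sample text are assumptions introduced here for illustration.

```python
from collections import Counter


def bigram_probabilities(text):
    """Estimate P(next_letter | letter) by relative frequency of letter pairs.

    Non-letters (including spaces) are dropped, so pairs may span word
    boundaries. That is a simplification made for this toy example.
    """
    letters = [c for c in text.lower() if c.isalpha()]
    pair_counts = Counter(zip(letters, letters[1:]))
    # Count how often each letter appears as the first element of a pair.
    first_counts = Counter(letters[:-1])
    return {
        (a, b): count / first_counts[a]
        for (a, b), count in pair_counts.items()
    }


# Hypothetical toy corpus, not real historical traffic.
sample = "wetter wetter wind"
probs = bigram_probabilities(sample)

# After "w", the sample continues with "e" twice and "i" once.
print(probs[("w", "e")], probs[("w", "i")])
```

With more text, the same table shows which letter sequences are likely and which are rare, which is the kind of differential probability Mark refers to.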

[Federica]: So if I understand this correctly, it was computer scientists who first got interested in processing texts, rather than linguists got interested in the new technology and said, "How could we use this in our studies?"

[Mark]: I'm not sure that that... Again, I think there are... there's the question of content versus the question of nomenclature and [organization?]. So one thing to keep in mind is that computer science as a discipline is actually relatively recent, and when... I mean, even as recently (not that it was that recently) as the time that I was an undergraduate, what we would now call "computer science" rather was carried out in departments of applied mathematics or in departments of electrical engineering. There were no departments of computer science; that came later. Similarly, linguists... So there was a lot of work in the, really starting in the 1930s and ’40s, but becoming really more intense in the 1950s, along two lines that became and remain quite important in computational linguistics really begun by linguists.

So one was the idea of distributional analysis. So this is an idea that goes back to the structuralist linguists in America and Europe. It was particularly strong in the work of Zellig Harris. One of the slogans of that work in the area of syntax and semantics was that "You shall know a word by the company it keeps." That is, by looking at the distribution of word occurrences in large bodies of text, you should learn something. In fact, that would be the initial evidence from which you would infer the abstract analysis of morphology, syntax, and semantics. And as soon as computers became available for use, some linguists and computer scientists (actually, I suppose, electrical engineers and mathematicians in those days) began cooperating on trying to implement those ideas.
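The idea of learning about a word from the distribution of its neighbours can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure used by the structuralist linguists: the four-sentence corpus is invented, and the helper names `context_vectors` and `cosine` are hypothetical. Each word is described by counts of the words that occur next to it, and two words are compared by the cosine of the angle between their count vectors.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=1):
    """Map each word to a Counter of words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented toy corpus for illustration only.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat sleeps",
    "the dog sleeps",
]
vecs = context_vectors(corpus)

# "cat" and "dog" keep the same company here, so they score as similar;
# "cat" and "milk" share far less context.
print(cosine(vecs["cat"], vecs["dog"]), cosine(vecs["cat"], vecs["milk"]))
```

On a corpus this small the numbers mean little; the point is only that similarity is inferred from distribution alone, without any dictionary.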

And then the second strand of work (again relatively early) was the work by Noam Chomsky and Pierre-Paul Schützenberger on what has come to be known as the Chomsky hierarchy or the Chomsky-Schützenberger hierarchy about a hierarchy of mathematical types of languages, algorithmic processing automata, and types of rewriting rules in recursive function theory. So again, that was not... That was mathematics. It was presented as mathematics rather than as computation but has become a central part of the underpinnings of computational linguistics, although the people who devised it were a linguist and a mathematician, neither of whom actually worked on computers in those days. So I think there's a sense in which from the point of view of content, the marriage of linguistics and computer science and the marriage of computational analysis and humanistic processing of text and, for that matter, of audio, that existed as soon as it became possible to conceive of the relationship, and the nomenclature and the academic and industrial organization, the existence of scientific societies, engineering societies and so on — that developed on top, so to speak, rather than the other way around. It wasn't that there was computer science and there was linguistics, and then at a certain point they say, "Oh, let's form computational linguistics." Rather, people on both sides had their own problems that they were trying to solve in ways that involved computational methods. I hope that's not too complicated.

[Federica]: No, no, absolutely. I would just maybe like to ask a similar question from another perspective, and that is: Today, computational linguistics is an established field, so new generations of students can choose their programs at university and enroll in a program in computational linguistics — so how do you present the field to them today? How do you contextualize it in a historical perspective but yet give them a definition of what the field is today and also what kind of competences the students must acquire? How much tech savvy, how much of the traditional linguistics must they acquire? How is the profile of today's researcher or just expert in computational linguistics?

[Mark]: Well, there have been roughly three stages of development, and the first stage involved programming grammars and analyzers or parsers. So the idea was that we would figure out what the grammatical patterns of English, or French, or German, or Russian, or Chinese are, we would either write those down in a mathematical formal way, and then have ideally some kind of computer program that could interpret that formalism to either analyze or generate the relevant patterns or perhaps to do translation or something of that kind — or if that didn't work, we would write a parser as a specific kind of code that wouldn't interpret a grammar, but would look for certain patterns in strings of letters or would generate appropriate patterns. So that was stage one, which is human beings, computational linguists (linguists or computer scientists; it doesn't matter) writing grammars and parsers.

The next stage was the machine learning stage where, rather than writing the grammar and/or writing the code that would do the analysis, instead you would attempt to create a system that would learn the grammar and learn how to do the analysis using, for example, techniques of stochastic grammar, so it could be a stochastic finite state grammar or a stochastic context-free grammar, that is a grammatical formalism of a kind that computational linguists had developed, and it would go through lots of material and attempt by some iterative process to figure out what the rules are and what probability should be associated with various options when things are ambiguous, which they always are. So that was stage two, stochastic machine learning.
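To make "what probability should be associated with various options" concrete, here is a minimal sketch of a stochastic context-free grammar. It uses the simplest possible estimate, relative frequency over rule uses that are assumed to be already observed, rather than the iterative learning procedure Mark describes. The rule list, the two candidate derivations, and the function names are all assumptions made for illustration.

```python
from collections import Counter


def estimate_rule_probabilities(observed_rules):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(observed_rules)
    lhs_counts = Counter(lhs for lhs, _ in observed_rules)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


def derivation_probability(derivation, probs):
    """Probability of a derivation: the product of its rule probabilities."""
    p = 1.0
    for rule in derivation:
        p *= probs.get(rule, 0.0)  # unseen rules get probability zero
    return p


# Hypothetical rule uses, as if read off a small set of parsed sentences.
observed = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("NP", ("Det", "N")),
    ("NP", ("Det", "N")),
    ("NP", ("NP", "PP")),
    ("VP", ("V", "NP")),
    ("VP", ("V", "NP")),
    ("VP", ("VP", "PP")),
    ("VP", ("VP", "PP")),
]
probs = estimate_rule_probabilities(observed)

# Two readings of an ambiguous phrase such as "saw the man with the telescope":
# the prepositional phrase attaches either to the verb phrase or to the noun phrase.
# Rules shared by both readings are left out because they cancel in the comparison.
vp_attachment = [("VP", ("VP", "PP")), ("VP", ("V", "NP")), ("NP", ("Det", "N"))]
np_attachment = [("VP", ("V", "NP")), ("NP", ("NP", "PP")), ("NP", ("Det", "N"))]

print(derivation_probability(vp_attachment, probs))
print(derivation_probability(np_attachment, probs))
```

The grammar does not remove the ambiguity; it ranks the competing analyses, which is what lets such a system pick one when, as Mark says, things are always ambiguous.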

And then stage three, which we're sort of still in the middle of now and the outcome is not so clear, is the applications of so-called deep learning, or deep neural nets, or pseudo-neural nets or whatever, which is a very general approach to pattern learning where typically very large amounts of material are put in, and rather than giving the system fairly detailed instructions about what it's supposed to learn (i.e., a grammar of a particular kind of form), instead you leave it open to the system (to some extent, at least) to form its own ideas (if we can call it that) not only about what the grammar is, so to speak, but even what a grammar is. That is, what kinds of patterns it's looking for. And pretty much the same kind of architecture might be used for recognizing pictures, or for recognizing speech, or for analyzing text, and that's kind of where we are now.

Now, I would say most students going into computational linguistics (whether relative to text, or to speech, or to both) actually probably learn things from all three stages, but the most common, practical, effective systems are now of the third kind, which, you know... There will undoubtedly be later stages, and one of the complaints that people have about this third stage, the deep learning stage, the most important complaint I would say, is: The systems that result are end-to-end black boxes. That is, you put in text or you put in audio and you get out the system's judgment about what the analysis is, but if you want to know why, there's nothing much to say other than, "This is what the system did." Whereas the earlier stages, the system processes the input and produces the output, but it also produces lots of humanly interpretable intermediate waypoints and structures.

A second problem with the current deep learning approach is that even though the same general kind of architecture can be used for image recognition, or for autonomous vehicle control, or for text parsing, or for speech recognition, or whatever, in fact, there are lots and lots and lots and lots of ways to modify that architecture that could be applied in any of those domains. Is it a recursive network? Is it a long short-term memory network? Is there an attentional mechanism? Is it a convolutional network? What's the batch size? What's the momentum? And so on. The meaning of those terms doesn't really matter. The point is that there are lots of piece parts, and knobs, and switches that have to be assembled and set in order to set up one of these projects, and it will work better or worse or maybe not at all depending on the choices you make, and there is no theory, mathematical or otherwise — no really effective theory — about how to make those choices.

One analogy that's been made is between these modern techniques and alchemy, because the alchemists were actually extremely effective practical chemists. You know, they knew how to extract acids, they knew how to create flavors and so on, but almost every process was kind of a thing in itself, so to become a good alchemist you had to engage in a very, very long apprenticeship, and even then when you approached a new problem, it wasn't really clear how to do it, because the theory behind all of this was either nonexistent or nonsensical.

[Federica]: That's a very nice analogy. I think I have one more kind of general question, so I could ask: What does a linguist do, and what does a computational linguist do that's different? But also, I add to that, is it more specifically the questions that are asked that are different or just the methods applied that are different?

[Mark]: Well, let's first make a differentiation between, on the one hand, science and maybe scholarship and on the other hand engineering and the creation of applications. And those, obviously, are different motives. So someone interested in science, for example, or in scholarship might be interested in, I don't know, how the verbal system of English developed from 1,000 years ago to today. No, that's not an engineering problem. An engineer might be interested in: How do we find names? How do we do what's sometimes called "entity tagging"? That is, how do we find in text names of people, places, organizations, dates and other sort of semantically well-defined — more or less well-defined — categories?

Someone who is interested in the history of English syntax, for example, people have been studying that before computers came into the picture by reading lots of texts and writing down lots of examples on file cards, and arranging the file cards, and counting things, and making either qualitative or quantitative judgments about how things changed. More recently, because we have almost the entire history of Middle English and Early Modern English available as digital text, we can use computational methods originally devised by engineers to go through and allow us to do literally in hours what previously would have taken decades.

And I could, you know, give you some specific examples — for example, work on the history of so-called "do support" in English, which was originally done by a linguist who did not attempt to use computers but just used old-fashioned, you might say almost philological, methods: reading texts and making notes — and more recently, a graduate student here at Penn who was able to use early English books online and some other sources of digital text and parsing technology in order to obtain several orders of magnitude more historical data not in a decade (as the previous work had involved) but rather literally over a period of days. And the value of the additional data is that he was able to look at the development over time of patterns with different verbs because he was able to get enough examples, decade by decade, of particular verbs in order to follow their trajectories separately. So the details there are complicated, and I won't go into them, but, I guess, again and again we find that there are matters of scholarship or science where we can use the techniques that engineers have developed and maybe extend them as well in order to address questions that are not, would not be of interest to the engineers qua engineers.

[Federica]: You have received your training in linguistics at MIT. You have your master's degree and PhD in linguistics there. During those years, Noam Chomsky was most active there. Was he one of your professors? Was he an important figure in your training?

[Mark]: He was certainly one of the people I took courses from, and I spoke with him quite a bit, and I think he was on my dissertation committee, for what that's worth.

[Federica]: You're an expert in speech and language technology. Speaking of technology, I would like to ask you if there is any special — well, piece of technology, a machine, something — that is only known or mostly known by the experts at the cutting edge of your research field that we non-experts might not know about that is involved in extracting data, processing data, some sensor maybe, and also if any of this is used in your research which I find very fascinating on the early diagnosis of Alzheimer's disease, so something that has an impact on health starting from the analysis of how a person talks. So I guess it's audio analysis.

[Mark]: Yes. Well, audio and/or text. So, again, there are both scientific questions and technological questions in the area of clinical applications of language (speech and language) analysis. So let's take another case which I think is even clearer, namely autism, or, as it's sometimes called, the autism spectrum. And this is an area whose definition is constantly changing. The most recent DSM made some very serious changes in what counts as autism and what doesn't, and there are a bunch of other related characteristics, related categories, like social anxiety disorder, for example, or ADHD. And people, for quite a while, rather than talking about autism as a well-defined box you can put people in, have started talking about a spectrum where people can be arranged along a line in some sense. I think anyone who looks at this problem quickly realizes that it's not just one dimension, that it's not a spectrum but a space, and furthermore, it's a space that we all live in. It's just that some corners of the space, because they interfere with people's ability to carry out ordinary life, have been sort of medicalized.

So one scientific question is what the dimensions of the space really are, and some have to do with language and communication, and some have to do with the ability to understand other people's goals, and intentions, and beliefs, and knowledge, and others have to do with interests and preferences and so on. There are many aspects. And it's likely, I think, that if we had a large amount of the right kind of evidence where computational linguistic and speech analysis could help us with that, that we could really start figuring out what the true latent dimensions are as well as, importantly but I think scientifically less interestingly, developing diagnostic techniques for placing people in this space.

And I've done some work along with colleagues at the Center for Autism Research at Children's Hospital here in Philadelphia aimed toward that kind of goal. Autism, obviously, applies to people throughout the life cycle. There, the issue that's especially important is, how early is it possible to make a diagnosis of one kind of problem or another in young children? Because there are interventions, mostly behavioral interventions, that really can help, and you want to start them as early as possible, but you don't want to start them unnecessarily. So one of the questions is sort of what are the dimensions of variation that are relevant, and what are manifestations that might appear in young children, maybe even before the age of one year?

At the other end of the life cycle, of course, there are the neurodegenerative disorders like Alzheimer's disease, which pretty much all of us are very likely to suffer from if we live long enough, and there again, there's a hope, at least, that if we could detect the onset earlier, then there might be therapies that would help. At present, there don't seem to be any, but early diagnosis is, again, a sort of interesting possibility.

Even more important, I think, is tracking the time course. So suppose that we have a drug that we believe might, in some or all cases, slow the onset of Alzheimer's or even reverse the syndrome. How do we test that drug? Well, we pick perhaps a few hundred subjects who are at risk of the disease and we divide them into the clinical group and the placebo group, and we give them the placebo or the medicine, and now we do something over a period of six months, or a year, or two years. But what is it that we do? How do we determine whether they're getting worse, or getting worse at some rate, or getting better, or staying the same? But we need some metric, and we would like a metric that is relatively non-invasive, that is relatively inexpensive, that is relatively low-stress for everyone concerned, and it seems quite likely that a speech-and-language-based metric will actually fit those needs very well.

[Federica]: In the early ’90s, you founded and you're still director of the LDC, the Linguistic Data Consortium. What is this about, what is it for, and how is it doing today?

[Mark]: Well, taking those questions in reverse order, it's doing fine. What is it for? The original issue was that, starting in the mid-to-late 1980s, some people in the U.S. government, in U.S. government funding agencies, decided that it was important to try to fund engineering research in speech and language areas. The two most important ones, though there have been many others, were automatic speech recognition and machine translation. Others include things like language recognition, speaker recognition, and a bunch of other things as well, information extraction from text or from speech, document retrieval and so on. And that kind of work, as of 1985 let's say, had a very, very bad reputation because people had begun working on it with something in between naive optimism and maybe a little bit of dishonesty since the late 1940s, early 1950s, making big promises about what they were going to achieve. There was a book published in the early 1950s called Giant Brains which projected, for example, you know, things like the early ENIAC computer and so forth, that within a decade or two computers would be housed in buildings the size of the Empire State Building and would [unclear] the entire electrical output of Niagara Falls. But also they felt that within a decade or so, a voice typewriter would exist. And of course, the computer size thing turned out to be totally and completely in the wrong direction, and the voice typewriter, one might say, has arrived, but it took, you know, 70 years, 60 or 70 years, not 10. And there were many, there had been many failures, people who promised that they could achieve machine translation or they could achieve speech to text and who really didn't deliver much. And so there was a period of a decade or so that people sometimes refer to as the AI desert when at least U.S. government funding in those areas was almost completely withdrawn, and in fact most commercial, most industrial labs also backed away or entered very gingerly into those ideas. I experienced that because I got my PhD in 1975 and went to work for AT&T Bell Laboratories, and the idea of working on speech recognition was something that people were very leery about, and they would make very small promises and try simple things like, say, isolated digits or maybe digit string recognition, not a general voice typewriter.
So anyway, as of the late 1980s, the government decided that in order to get into this area, they would need to carefully guard themselves against what the founder of my center at Bell Labs, John Pierce, called "glamour and deceit, otherwise known as BS." And in order to do that, they wanted to have a series of very well-defined tasks with well-defined examples eventually becoming training material, very well-defined test material, and automatic, well-defined evaluation programs that could be carried out, implemented by the National Bureau of Standards, National Institute of Standards and Technology now. So that was the way things were set up.
The problem that immediately arose was how to organize the creation, curation, and distribution of these large bodies of digital audio, and text, and video, and images to some extent as well. Remember, this was in the late 1980s. Google didn't exist. The internet didn't really exist. The main way to send things around was on... I don't know if you've ever seen a 9-track tape, but these are large 12-inch platters of magnetic tape that maybe contain a megabyte or so of information each. So it was a different world. But there also were problems of intellectual property rights, of human subjects' permissions, and things of that kind. And anyway, they started trying to do this through the National Archives and other government agencies, and in those days those, at least those agencies were not really set up to do this kind of thing and not all that interested in doing it. So anyway, Charles Wayne, who was then the Director of Speech and Language Research at DARPA, decided that it would be a good idea to have an organization housed in an academic context that could take care of those things and maybe also encourage a kind of marketplace, you might almost say, or, you know, open publication of materials not just from these government programs, but from other kinds of research entities. And so he got some funding from Congress to set this up, and I joined with some other people in 1987 or 8 in drawing up a white paper about how this might be done, never thinking that I would be involved in actually doing it.

In 1990, I left AT&T and came to Penn as a faculty member, and I was very happy to put research administration behind me and return to research and teaching, but Charles asked me to apply on behalf of Penn to be the home of this Linguistic Data Consortium, and I was only able, effectively, to say "no" to him for about three months and eventually gave in and did it. And this was, what started out as the top right-hand drawer of my desk has now turned into the floor of an office building with 50-odd employees and lots of computers.

[Federica]: Basically, this consortium provides linguists all over the world with materials and I would imagine today large datasets for their experiments and analysis, like a coherent, common, shared, open, public corpus of data?

[Mark]: Yeah. The datasets are of different sizes. Some we create. About half or a little more we publish on behalf of other people. Sometimes we publish things on behalf of IBM and Google and other companies, but we've also published on behalf of academic institutions all around the world — in the United States, in Europe, in Asia. And what we do, whether for our own material or for other people's material, is, we create and maintain documentation and catalog entries, we do quality control. In some cases, we have to do more. Sometimes, what people send us is, you know, a cardboard box full of analog tapes and some typescripts or something of that kind. We arrange for normalization of formatting, and then we do curation. That is, we produce later editions if there are corrections to be made or additions. We handle intellectual property rights, negotiations, and make sure that privacy and confidentiality and human subjects constraints are obeyed and so on. Or... I say "we." I mean, it's, of course, the people who work there who do it.

[Federica]: In the early days of computational linguistics, although it was not called so, text processing was the main thing. When did sound processing become a thing, like speech, etc.?

[Mark]: Well, I don't know for sure, but there was already digital audio in process in the late 1930s, early 1940s. Actually, there's a literary connection. So if you read Solzhenitsyn's novel The First Circle, it's about a laboratory near Moscow which is a fictionalized version of a laboratory he actually participated in which is staffed by political prisoners and whose goal is to do what I guess we would call computational speech and language research, mostly speech. And in particular, the novel deals with two technologies. One is speaker recognition from audio recordings — obviously, in that context, intended for, you know, purposes of political repression — and the other being encrypted telephony, encrypted voice transmission. And the encrypted voice transmission is something that was developed jointly in England by people that included Alan Turing and in the United States at Bell Labs by people including Claude Shannon, and they collaborated on this. And it was actually implemented in such a way that Churchill and Roosevelt could use an encrypted radio transatlantic telephone conversation, basically. Now, this involved, you know, a room full of complicated and power-hungry apparatus in London and one in Washington, but it did involve digital voice, as I understand it. And, anyway, in the sharashka that Solzhenitsyn writes about, they were attempting to build something like that for Stalin because Stalin was jealous of the fact that Roosevelt and Churchill had it and he didn't. And one of the things that's clearly there is how to develop technologies for producing speech to bits, doing things with the bits to provide encryption, decrypting on the other end, and then reconstituting the speech. Now, the novel is not primarily about technology, but the technology is there in the background and I think is actually quite accurately portrayed.

[Federica]: You've been in this field for many years, and I'd like to ask you something about technological evolution. Have you witnessed something during your career that came from the technological front that was remarkable, that had an impact on the way you do research in this field, an advancement that's significant enough to share?

[Mark]: Well, frankly, the most remarkable changes are the result of the same developments and forces that are changing everything else in modern life — namely, the development of ubiquitous, inexpensive, high-bandwidth digital networking, the exponential improvement in the cost performance of various kinds of computational devices, including computers and various kinds of wearable devices, as well as the cloud (as we call it now), and also from the point of view of speech and language, especially the incredible changes in the cost performance of mass storage. So it's been a while now that arbitrary amounts of text are in effect trivial or free to store. That is, you could take what were 20 or 30 years ago unimaginable amounts of digital text and put them in your shirt pocket. And speech is rapidly moving in that direction. That is, again, for a relatively small amount of money, you can buy a mass storage device on which you can put enormous amounts of audio, along with transcripts if they're available, and so on, and, you know, you can download tens of thousands of hours of audiobooks from the web, and you can go to YouTube or other places and see millions — probably many more than, many millions — tens of millions — of hours of speeches and songs and readings and so on. And, you know, for anyone interested in analysis of speech and language using digital means, it's like walking into an amazing magical garden.

[Federica]: That's a very nice image, again. So it's nothing in particular, but the development of technology itself that is what's fascinating.

[Mark]: And I think the most important, the most impressive, the most valuable, the most interesting and insightful developments are actually still in the future.

[Federica]: Speaking of the future, I would like to ask you to share a vision for computational linguistics in the future — not necessarily a realistic one, just an optimistic vision that you have of where this research field could go. In the best-case scenario, how do you see computational linguistics in 5 to 10 to 20 years?

[Mark]: Well, there are many kinds of speech analysis where we start either just with the audio or perhaps with the audio and a transcript, but it's much, much easier, obviously, to create audio than to create audio with transcripts, just because transcription is a somewhat labor-intensive process and fairly expensive. So at some point in the next, I would say, 20 years or so, speech recognition will get good enough that across a wide variety of kinds of input for a wide variety of languages, we'll have speech-to-text which is good enough to be able to do away with most cases of human transcription. There have been some claims that we're there now. Several companies have claimed human parity (as the expression goes) in speech-to-text transcription, and they have actually achieved something like that in particular domains, but it's not the case that across arbitrary kinds of content, arbitrary kinds of recording conditions, arbitrary kinds of modes of interaction and so on, backgrounds and whatever, that automatic methods can reliably work. They work very well sometimes and they fail completely in other cases, but 20 years from now, that won't be true.

The next factor, the next thing there is that once we have the transcript and the audio, we can do what's called forced alignment, and we can figure out quite accurately which words occur where, but we do that despite the fact that people are not dictionaries. That is, even someone who speaks the standard variety of whatever language we're looking at doesn't pronounce words in spontaneous speech the way the dictionary says they should. Sometimes they do, but more often there are various forms of lenition, or reduction, or modification, and we have specialized ways of looking for those effects in particular cases in particular languages for particular kinds of material, but we don't have sort of what one might call an automated phonetician.
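For readers who have not met forced alignment before, the idea can be sketched in a few lines of Python. This is only a toy illustration, not the method used by any particular aligner: real systems score audio frames against phone-level acoustic models and a pronunciation dictionary, whereas here the per-frame mismatch scores (`toy_cost`) are made up by hand, and the function name `forced_align` is hypothetical. Given the words of a transcript in order, the sketch finds the cheapest way to cut the audio frames into one contiguous stretch per word.

```python
from typing import List, Tuple


def forced_align(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Toy forced alignment by dynamic programming.

    cost[t][k] is an assumed mismatch score between audio frame t and
    transcript word k (lower means a better match). Words must appear in
    transcript order and each word must cover at least one frame.
    Returns one (first_frame, last_frame) pair per word, inclusive.
    """
    n_frames, n_words = len(cost), len(cost[0])
    if n_frames < n_words:
        raise ValueError("need at least one frame per word")

    inf = float("inf")
    # best[t][k]: cheapest total cost of frames 0..t with frame t in word k.
    best = [[inf] * n_words for _ in range(n_frames)]
    # advanced[t][k]: True if word k begins at frame t on the best path.
    advanced = [[False] * n_words for _ in range(n_frames)]
    best[0][0] = cost[0][0]  # the first frame must belong to the first word

    for t in range(1, n_frames):
        for k in range(n_words):
            stay = best[t - 1][k]                          # same word continues
            move = best[t - 1][k - 1] if k > 0 else inf    # next word starts
            if move < stay:
                best[t][k] = move + cost[t][k]
                advanced[t][k] = True
            else:
                best[t][k] = stay + cost[t][k]

    # Walk back from the last frame of the last word to recover boundaries.
    spans: List[Tuple[int, int]] = []
    k, end = n_words - 1, n_frames - 1
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][k]:
            spans.append((t, end))
            end, k = t - 1, k - 1
    spans.append((0, end))
    spans.reverse()
    return spans


# Hand-made scores for five frames and the two-word transcript "the cat".
words = ["the", "cat"]
toy_cost = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.9, 0.1],
    [0.8, 0.2],
]
for word, (start, end) in zip(words, forced_align(toy_cost)):
    print(f"{word}: frames {start}-{end}")
```

With these made-up scores, the first two frames go to "the" and the remaining three to "cat". The sketch assumes each word has a single expected sound pattern, which is exactly the dictionary-style assumption that spontaneous speech tends to break.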

[Federica]: I ask for your forgiveness, but I'm not quite following. What's so extraordinary in speech-to-text recognition? It seems to me that I missed the application. What happens next?

[Mark]: Oh, okay. Well, let's take an example. We're working in this Alzheimer's and other analysis area, or at least that's the goal, on a body of material from the Framingham Heart Study, which is something that the national center for, National Heart, Lung, and Blood Institute in the United States began in around 1950. They recruited pretty much as much as they could of the entire adult population of a small town west of Boston — Framingham, Massachusetts — and they began keeping track of medical history issues and lifestyle factors and other things for all of those people over time in order to try to figure out what the factors were that influenced coronary disease. Around 20 years ago, they broadened their scope to include neurodegenerative disorders including stroke but also neurodegenerative diseases, and so they began giving a battery of neuropsychological tests which lasts about one to two hours to each of the members of their original cohort and the cohorts that they've recruited since then. [To read the rest of this transcript, please download the PDF linked on this page.]

Page created: November 2018
Last updated: July 2021