Speech and language technology: People are not dictionaries

with Mark Liberman


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.




As a researcher and research manager at AT&T Bell Labs from 1975 to 1990, as a member of Penn's faculty since 1990, and as founder and director of the Linguistic Data Consortium since 1992, Liberman has participated actively in the evolution of speech and language research towards a model of quantitative, replicable studies based on published datasets. In its 26 years of existence, the LDC has distributed more than 160,000 copies of nearly 3,000 datasets to more than 5,600 research organizations in 92 countries.

In his personal research, a key focus has been the scientific application of techniques from machine learning and human language technology to very large speech collections, in the range of dozens to thousands of hours and involving up to tens of thousands of speakers. Scientific application areas include phonetics, psychology of language, sociolinguistics, and clinical diagnosis and monitoring. The last category includes current collaborations on speech, language, and communicative interaction in Autism Spectrum Disorder, Frontotemporal Degeneration, and Alzheimer's Disease.

During this conversation, I ask Mark how the marriage between linguistics and computer science works today and how it has worked since the early days of the field, before it even had a name. I also ask what skills young students are equipped with and what applications computational linguistics has today. I even ask seemingly trivial questions like "how many languages are there in the world?", but you never get a trivial answer from a world-class expert like Mark. I have learnt so much from this conversation and I hope you will too! My new favourite quote is: "A language is a dialect with an army and a navy".

More detailed info can be found here: https://www.ling.upenn.edu/~myl/LibermanCV.html


Fig. 1 - The original interview for this podcast was recorded over Skype. I finally had the pleasure of meeting Prof. Liberman during a visit to Philadelphia in November 2019. This picture was taken in his office at the LDC headquarters.


Highlights from this episode


At minute 1:45: "Applications of digital processing technology to linguistic problems of language, and then also to problems in the humanities, is pretty much as old as the existence of computing machinery itself."

At minute 7:35: "It's not that there was computer science and there was linguistics, and then at a certain point they say, oh let's form computational linguistics. Rather people on both sides had their own problems that they were trying to solve in ways that involve computational methods."

At minute 14:03: "One analogy that's been made is between these [artificial intelligence] modern techniques and alchemy, because the alchemists were actually extremely effective practical chemists. You know, they knew how to extract acids, they knew how to create flavors, and so on, but almost every process was kind of a thing in itself. So, to become a good alchemist you had to engage in a very very long apprenticeship and even then when you approached a new problem it wasn't really clear how to do it, because the theory behind all of this was either non-existent or nonsensical."

At minute 15:09: I ask Mark whether the main difference between linguists and computational linguists lies in the questions that are asked, or in the methods applied to answer those questions.

At minute 16:36: Modern computational tools developed by engineers "allow us to do literally in hours what previously would have taken decades."

At minute 18:32: Noam Chomsky was on Mark's dissertation committee at MIT in 1975.

At minute 19:30: Mark talks about the clinical applications of speech and language analysis, especially in autism and in the early detection of Alzheimer's disease.

At minute 20:27: "I think anyone who looks at this problem [the definition of autism] quickly realizes that it's not just one dimension, that it's not a spectrum but a space, and furthermore it's a space that we all live in. It's just that some corners of the space, because they interfere with people's ability to carry out ordinary life have been sort of medicalised. So, one scientific question is what the dimensions of this space really are."

At minute 24:14: Mark talks about the Linguistic Data Consortium (LDC), its aims and how it is doing today.

At minute 26:15: "There was a period of a decade or so that people sometimes referred to as the AI desert."

At minute 27:13: "Glamour and deceit, otherwise known as BS": Mark quotes John Pierce on research in AI in the mid-1970s. Pierce was the founder of the center where Mark used to work at Bell Labs.

At minute 31:15: Talking about work at the LDC, Mark says: "Sometimes people send us a cardboard box full of analog tapes, [and] we arrange for normalization of formatting, and then we do curation, [...] we handle intellectual property, rights negotiations, and make sure that privacy and confidentiality and human subjects constraints are obeyed, and so on."

At minute 32:00: When did audio start being an important part of the material analysed by linguists? Mark says, "There was already digital audio in process in the late 1930s, early 1940s."

At minute 33:17: Churchill and Roosevelt used an encrypted transatlantic radio-telephone system that required a room full of complicated and power-hungry apparatus in London and another in Washington.

At minute 34:03: "The novel [by Solzhenitsyn] is not primarily about technology, but the technology is there in the background, and I think it is actually quite accurately portrayed."

At minute 33:17: What is the most remarkable advancement in the field of technology that Mark has witnessed during his career? "The most remarkable changes [for linguistics] are the result of the same developments and forces that are changing everything else in modern life. Namely, the development of ubiquitous inexpensive high bandwidth digital networking, the exponential improvement in the cost performance of various kinds of computational devices, including computers and various kinds of wearable devices, as well as 'the cloud' as we call it now."

At minute 36:14: "For anyone interested in analysis of speech and language, using digital means is, you know, like walking into an amazing magical garden." [Talking about the abundance of material available on platforms like YouTube today.]

At minute 36:38: "I think the most important, the most impressive, the most valuable, the most interesting and insightful [technological] developments are actually still in the future."

At minute 38:58: "People are not dictionaries."

At minute 45:10: What is the difference between language and speech?

At minute 48:45: 2018 was the European Year of Cultural Heritage (EYCH), and heritage has a lot to do with identity. Language also has a lot to do with identity. Mark talks about language preservation and documentation and how audio recordings are valued and preserved.

At minute 49:54: "One could almost argue that the density of endangered [...] languages is actually greater in Europe than almost anywhere else in the world."

At minute 51:07: "It has been said that there was a time when you could walk from the English Channel to the Straits of Gibraltar, through France and Spain, or from the English Channel to Sicily, through France and Italy, and never pass a point where the people in one village couldn't speak with the people in the next village."
This is because "the state of nature," especially in settled agricultural civilizations, is generally a kind of geographical dialect continuum, at least until "the nation-states come along to change this."

At minute 53:03: How many languages are there in the world? And who keeps count? We take this information for granted: we just go online and ask Google. But who keeps track of the languages that are dying out? Who labels the endangered ones as endangered?

At minute 53:46: "Obviously, the question of what's a distinct language and what's a variety is not an easy question to answer. In fact, it's not a question that has a coherent answer. As the famous saying goes, 'a language is a dialect with an army and a navy.'"

At minute 55:10: How does a linguistic variety become a language? In order to exist, a language needs to be recognized. Mark talks about the role of political movements in this process of recognition.

Fig. 2 - At the LDC in Philadelphia, PA, on November 8th, 2019.

People, places and organizations mentioned in the interview


Timed links are available in the description of this episode on YouTube.

  • The Enigma decryption project at minute 2:03 and Alan Turing at minute 2:53
  • American linguist Zellig Harris at minute 5:08
  • Noam Chomsky and the Chomsky Hierarchy at minute 6:05
  • Marcel-Paul Schützenberger and the Chomsky-Schützenberger hierarchy at minute 6:07
  • League of European Research Universities (LERU), mentioned at minute 8:28: https://www.leru.org
  • The Linguistic Data Consortium (LDC) at minute 24:14: https://www.ldc.upenn.edu
  • The book "Giant Brains, or Machines That Think" by Edmund Callis Berkeley, published in 1949, mentioned at minute 25:24
  • John Pierce, founder of the center at Bell Labs where Mark used to work, at minute 27:13
  • Charles Wayne, former Director of Speech and Language Research at the federal Defense Advanced Research Projects Agency (DARPA), at minute 28:51
  • IBM and Google, at minute 30:46
  • Aleksandr Solzhenitsyn's novel "In the First Circle," published in 1968, at minute 32:12
  • Alan Turing (already mentioned at minute 2:53), at minute 32:05
  • Bell Labs, at minute 33:08
  • Claude Shannon, at minute 33:09
  • Winston Churchill and Franklin D. Roosevelt, at minute 33:15
  • The sharashka that Solzhenitsyn writes about (see quote above), at minute 33:36
  • Joseph Stalin (in the context of Solzhenitsyn's book), at minute 33:41
  • Audiobooks, at minute 35:58
  • YouTube, at minute 36:01
  • National Heart, Lung, and Blood Institute (NHLBI), at minute 40:02
  • The Framingham Heart Study in Framingham, Massachusetts, at minute 40:14: https://www.framinghamheartstudy.org
  • The Netherlands, for the numerous local varieties of Dutch that are at risk of dying out, at minute 49:34
  • The French Revolution, at minute 51:59
  • SIL International (Summer Institute of Linguistics), at minute 53:07: https://www.sil.org
  • "Ethnologue: Languages of the World," a publication that provides statistics on the living languages of the world, at minute 53:12: https://www.ethnologue.com
  • ISO 639-3, the standard for the representation of names of languages, for which SIL International is the registration authority, at minute 53:32
  • Former Yugoslavia, and its linguistic situation after the national breakup, at minute 55:15