Mispredicted Words, Mispredicted Futures
The accuracy of computer speech recognition flat-lined in 2001, before reaching human levels. The funding plug was pulled, but no funeral, no text-to-speech eulogy followed. Words never meant very much to computers—which made them ten times more error-prone than humans. Humans expected that computer understanding of language would lead to artificially intelligent machines, inevitably and quickly. But the mispredicted words of speech recognition have rewritten that narrative. We just haven’t recognized it yet.
After a long gestation period in academia, speech recognition bore twins in 1982: the suggestively-named Kurzweil Applied Intelligence and sibling rival Dragon Systems. Kurzweil’s software, by age three, could understand all of a thousand words—but only when spoken one painstakingly-articulated word at a time. Two years later, in 1987, the computer’s lexicon reached 20,000 words, entering the realm of human vocabularies which range from 10,000 to 150,000 words. But recognition accuracy was horrific: 90% wrong in 1993. Another two years, however, and the error rate pushed below 50%. More importantly, Dragon Systems unveiled its Naturally Speaking software in 1997 which recognized normal human speech. Years of talking to the computer like a speech therapist seemingly paid off.
However, the core language machinery that crushed sounds into words actually dated to the 1950s and ‘60s and had not changed. Progress mainly came from freakishly faster computers and a burgeoning profusion of digital text.
Speech recognizers make educated guesses at what is being said. They play the odds. For example, the phrase “serve as the inspiration,” is ten times more likely than “serve as the installation,” which sounds similar. Such statistical models become more precise given more data. Helpfully, the digital word supply leapt from essentially zero to about a million words in the 1980s when a body of literary text called the Brown Corpus became available. Millions turned to billions as the Internet grew in the 1990s. Inevitably, Google published a trillion-word corpus in 2006. Speech recognition accuracy, borne aloft by exponential trends in text and transistors, rose skyward. But it couldn’t reach human heights.
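The odds-playing described above can be sketched as a toy n-gram count. Everything here is illustrative: a three-sentence invented corpus stands in for the Brown Corpus and its web-scale successors.

```python
from collections import Counter

# A tiny invented corpus standing in for the millions-to-trillions of
# words of real text the article describes.
corpus = (
    "serve as the inspiration for the design . "
    "serve as the inspiration for this work . "
    "serve as the installation site . "
).split()

# Count every 4-word phrase (4-gram) in the corpus.
ngrams = Counter(tuple(corpus[i:i + 4]) for i in range(len(corpus) - 3))

# Two acoustically similar candidates; pick the one the text
# statistics make more likely.
best = max(["inspiration", "installation"],
           key=lambda w: ngrams[("serve", "as", "the", w)])
print(best)  # inspiration
```

Real systems refine this with smoothing for unseen phrases and weights from acoustic models, but the principle is the same: more data makes the counts, and therefore the guesses, more precise.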
[Chart: conversational speech recognition accuracy over time. Source: National Institute of Standards and Technology Benchmark Test History]
“I’m sorry, Dave. I can’t do that.”
In 2001 recognition accuracy topped out at 80%, far short of HAL-like levels of comprehension. Adding data or computing power made no difference. Researchers at Carnegie Mellon University checked again in 2006 and found the situation unchanged. With human discrimination as high as 98%, the unclosed gap left little basis for conversation. But sticking to a few topics, like numbers, helped. Saying “one” into the phone works about as well as pressing a button, approaching 100% accuracy. But loosen the vocabulary constraint and recognition begins to drift, turning to vertigo in the wide-open vastness of linguistic space.
The language universe is large, Google’s trillion words a mere scrawl on its surface. One estimate puts the number of possible sentences at 10^570. Through constant talking and writing, more of the possibilities of language enter into our possession. But plenty of unanticipated combinations remain which force speech recognizers into risky guesses. Even where data are lush, picking what’s most likely can be a mistake because meaning often pools in a key word or two. Recognition systems, by going with the “best” bet, are prone to interpret the meaning-rich terms as more common but similar-sounding words, draining sense from the sentence.
Statistics veiling ignorance
Many spoken words sound the same. Saying “recognize speech” makes a sound that can be indistinguishable from “wreck a nice beach.” Other laughers include “wreck an eyes peach” and “recondite speech.” But with a little knowledge of word meaning and grammar, it seems like a computer ought to be able to puzzle it out. Ironically, however, much of the progress in speech recognition came from a conscious rejection of the deeper dimensions of language. As an IBM researcher famously put it: “Every time I fire a linguist my system improves.” But pink-slipping all the linguistics PhDs only gets you 80% accuracy, at best.
In practice, current recognition software employs some knowledge of language beyond just the outer surface of word sounds. But efforts to impart human-grade understanding of word meaning and syntax to computers have also fallen short.
We use grammar all the time, but no effort to completely formalize it in a set of rules has succeeded. If such rules exist, computer programs turned loose on great bodies of text haven’t been able to suss them out either. Progress in automatically parsing sentences into their grammatical components has been surprisingly limited. A 1996 look at the state of the art reported that “Despite over three decades of research effort, no practical domain-independent parser of unrestricted text has been developed.” As with speech recognition, parsing works best inside snug linguistic boxes, like medical terminology, but weakens when you take down the fences holding back the untamed wilds. Today’s parsers “very crudely are about 80% right on average on unrestricted text,” according to Cambridge professor Ted Briscoe, author of the 1996 report. Parsers and speech recognition have penetrated language to similar, considerable depths, but without reaching a fundamental understanding.
Researchers have also tried to endow computers with knowledge of word meanings. Words are defined by other words, to state the seemingly obvious. And definitions, of course, live in a dictionary. In the early 1990s, Microsoft Research developed a system called MindNet which “read” the dictionary and traced out a network from each word out to every mention of it in the definitions of other words.
Words have multiple definitions until they are used in a sentence which narrows the possibilities. MindNet deduced the intended definition of a word by combing through the networks of the other words in the sentence, looking for overlap. Consider the sentence, “The driver struck the ball.” To figure out the intended meaning of “driver,” MindNet followed the network to the definition for “golf” which includes the word “ball.” So driver means a kind of golf club. Or does it? Maybe the sentence means a car crashed into a group of people at a party.
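The overlap idea attributed to MindNet above can be sketched in miniature; the version below is closer to the classic Lesk algorithm, and the four-entry dictionary is invented for illustration.

```python
# A drastically simplified definition-overlap disambiguator.
# The mini-dictionary is invented; real systems traced networks
# through an entire dictionary.
definitions = {
    ("driver", "golf"): "a golf club used to hit the ball a long distance",
    ("driver", "vehicle"): "a person who operates a car or other vehicle",
    "struck": "hit with force",
    "ball": "a round object used in games such as golf or baseball",
}

def overlap(gloss, context_words):
    """Count words shared between one sense's definition and the
    definitions of the other words in the sentence."""
    return len(set(gloss.split()) & context_words)

# Context for "The driver struck the ball.": the glosses of the
# other content words.
context = set(definitions["struck"].split()) | set(definitions["ball"].split())

senses = [k for k in definitions if isinstance(k, tuple)]
best_sense = max(senses, key=lambda k: overlap(definitions[k], context))
print(best_sense)  # ('driver', 'golf')
```

The golf sense wins because its definition shares words like “hit” and “golf” with the context. But nothing stops a differently worded dictionary, or a sentence about a car crash, from tipping the overlap the other way, which is exactly the fragility the article goes on to describe.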
To guess meanings more accurately, MindNet expanded the data on which it based its statistics, much as speech recognizers did. The program ingested encyclopedias and other online texts, carefully assigning probabilistic weights based on what it learned. But that wasn’t enough. MindNet’s goal of “resolving semantic ambiguities in text” remains unattained. The project, the first undertaken by Microsoft Research after it was founded in 1991, was shelved in 2005.
Can’t get there from here
We have learned that speech is not just sounds. The acoustic signal doesn’t carry enough information for reliable interpretation, even when boosted by statistical analysis of terabytes of example phrases. As the leading lights of speech recognition acknowledged last May, “it is not possible to predict and collect separate data for any and all types of speech…” The approach of the last two decades has hit a dead end. Similarly, the meaning of a word is not fully captured just by pointing to other words as in MindNet’s approach. Grammar likewise escapes crisp formalization.
To some, these developments are no surprise. In 1986, Terry Winograd and Fernando Flores audaciously concluded that “computers cannot understand language.” In their book, Understanding Computers and Cognition, the authors argued from biology and philosophy rather than producing a proof like Einstein’s demonstration that nothing can travel faster than light. So not everyone agreed. Bill Gates described it as “a complete horseshit book” shortly after it appeared, but acknowledged that “it has to be read,” a wise amendment given the balance of evidence from the last quarter century.
Fortunately, the question of whether computers are subject to fundamental limits doesn’t need to be answered. Progress in conversational speech recognition accuracy has clearly halted and we have abandoned further frontal assaults. The research arm of the Pentagon, DARPA, declared victory and withdrew. Many decades ago, DARPA funded the basic research behind both the Internet and today’s mouse-and-menus computer interface. More recently, the agency financed investigations into conversational speech recognition but shifted priorities and money after accuracy plateaued. Microsoft Research persisted longer in its pursuit of a seeing, talking computer. But that vision became increasingly spectral, and today none of the Speech Technology group’s projects aspire to push speech recognition to human levels.
We are surrounded by unceasing, rapid technological advance, especially in information technology. It is impossible for something to be unattainable. There has to be another way. Right? Yes—but it’s more difficult than the approach that didn’t work. In place of simple speech recognition, researchers last year proposed “cognition-derived recognition” in a paper authored by leading academics, a scientist from Microsoft Research and a co-founder of Dragon Systems. The project entails research to “understand and emulate relevant human capabilities” as well as understanding how the brain processes language. The researchers, with that particularly human talent for euphemism, are actually saying that we need artificial intelligence if computers are going to understand us.
Originally, however, speech recognition was going to lead to artificial intelligence. Computing pioneer Alan Turing suggested in 1950 that we “provide the machine with the best sense organs that money can buy, and then teach it to understand and speak English.” Over half a century later, artificial intelligence has become prerequisite to understanding speech. We have neither the chicken nor the egg.
Speech recognition pioneer Ray Kurzweil piloted computing a long way down the path toward artificial intelligence. His software programs first recognized printed characters, then images and finally spoken words. Quite reasonably, Kurzweil looked at the trajectory he had helped carve and prophesied that machines would inevitably become intelligent and then spiritual. However, because we are no longer banging away at speech recognition, this new great chain of being has a missing link.
That void and its potential implications have gone unremarked, the greatest recognition error of all. Perhaps no one much noticed when the National Institute of Standards and Technology simply stopped benchmarking the accuracy of conversational speech recognition. And no one, speech researchers included, broadcasts their own bad news. So conventional belief remains that speech recognition and even artificial intelligence will arrive someday, somehow. Similar beliefs cling to manned space travel. Wisely, when President Obama cancelled the Ares program, he made provisions for research into “game-changing new technology,” as an advisor put it. Rather than challenge a cherished belief, perhaps the President knew to scale it back until it fades away.
Speech recognition seems to be following a similar pattern, signal blending into background noise. News mentions of Dragon Systems’ Naturally Speaking software peaked at the same time as recognition accuracy, 1999, and declined thereafter. “Speech recognition” shows a broadly similar pattern, with peak mentions coming in 2002, the last year in which NIST benchmarked conversational speech recognition.
With the flattening of recognition accuracy comes the flattening of a great story arc of our age: the imminent arrival of artificial intelligence. Mispredicted words have cascaded into mispredictions of the future. Protean language leaves the future unauthored.