Why there will never be a model of a cell

Future imperfect: computer models cannot attain life-like fidelity

Biology’s holy grail, a full mechanistic understanding of the workings of life, is beyond reach according to two recent papers. Computer models that closely replicate the phenomena of a single cell are not possible, and the goal has been dropped.

Over the last decade, researchers have tried to grapple with biological complexity by modeling less complicated organisms. Yeast proved too complex and was replaced by organisms with smaller and smaller genomes, all the way down to tiny Mycoplasma pneumoniae. Unable to reduce genomes any further, scientists have radically reduced expectations for models instead.

In Science last month, researchers described the “popular view,” in which “we progress linearly, from conceptual to ever more detailed models.” The popular, linear view is no more. From now on, models “should be judged by how useful they are and what we can learn from them,” according to the paper’s authors, “not by how close we are to the elusive ‘whole cell model’.”

Alex Mogilner, one of the authors and a professor at UC Davis, believes some future discovery might make the whole cell model again possible. “Never say never,” he advised. However, a paper from the Institute for Systems Biology forecloses the possibility for all time:

[N]o practically conceivable model will ever represent all possible physical parameters in a system, nor will enough data ever exist to fully constrain them all. It is also experimentally infeasible to measure, and technically prohibitive to model all possible phenomena in a cell, all possible environmental contexts, and all possible genetic perturbations.

There will be no in silico model of a cell, one that fully recapitulates cell behavior and substitutes for wet lab experiments. “Anyone who thinks we can ever obtain a completely deterministic view of an organism will have a hard job to convince me,” said Marc Kirschner, chair of the systems biology department at Harvard. “It is probably true that the number of equations to describe the events in a single cell is so large that this approach will never work,” according to Kirschner. He does hope to be able to predict “to some accuracy” particular responses of a system.

The implications for the future have yet to be worked out, although Mogilner and colleagues observed that such models were envisioned as enabling personalized medicine. For historical purposes, however, these papers bring an end to a monumentally successful, physics-based program for biology that began roughly a century ago.

Biologist Thomas Hunt Morgan successfully pioneered the methods of physics in biology, elucidating the role of chromosomes in heredity. This “turned out to be extraordinarily simple,” as he wrote in 1919, and nature was entirely approachable. “[I]f the world in which we live were as complicated as some of our friends would have us believe,” Morgan wrote, “we might well despair that biology could ever become an exact science.”

Shortly thereafter, physics underwent a crisis of faith as the discipline moved from an intuitive, mechanistic basis into a new and unsettling quantum era which renounced the Newtonian ideal of causally linking everything in space and time. When the structure of DNA was discovered decades later, the Newtonian paradise was regained. As theoretical physicist turned biologist Max Delbrück said in his Nobel Prize lecture:

It might be said that Watson and Crick’s discovery of the DNA double helix in 1953 did for biology what many physicists hoped in vain could be done for atomic physics: it solved all the mysteries in terms of classical models and theories, without forcing us to abandon our intuitive notions about truth and reality.

Not long after, Lee Hood decided to become a biologist after reading an article by Francis Crick in Scientific American. Crick wrote how “the sequence of the bases acts as a kind of genetic code…” which was unknown. Many years later, Hood expressed the belief that “the core of biology is ultimately knowable, and hence, we start with a certainty that is not possible in the other disciplines,” like physics. He forecast being able to predict the behavior of a biological system “given any perturbation.” His lab at Caltech invented the DNA sequencer.

A draft sequence of the human genome was published in 2000 and Hood founded his Institute for Systems Biology (ISB). The same year, Matt Ridley published his best-selling Genome which predicted a leap from knowing “almost nothing about our genes to knowing everything,” which he described as “the greatest intellectual moment in history. Bar none.”

For the next dozen years, researchers from ISB heaved with might and main to realize Hood’s vision. Instead, they now say it is unattainable.

Undoubtedly, there will be disbelief. But Robert Millikan, a founding father of Caltech, didn’t want to believe in Einstein’s photoelectric effect. He won a Nobel Prize for being wrong and proving Einstein right.

This may still be one of the greatest intellectual moments in history, just not what we expected.


Image of yeast adapted from Nelson et al. DOI: 10.1073/pnas.0910874107

IBM’s Watson: Portent or Pretense?

Game over.

IBM’s Watson, with a 15 terabyte chunk of human knowledge loaded into it like a Game Boy cartridge and set to hair trigger, poured out a high-precision fact fusillade that left no humans standing in the (aptly-named) Jeopardy. Is machine omniscience upon us?

“I, for one, welcome our new computer overlords,” said defeated Jeopardy champion Ken Jennings. But Jennings has buzzed in too quickly. Watson might point not to the inevitability of artificial intelligence but its unattainability. Barring unexpected revelations from IBM, Watson represents exquisite engineering work-arounds substituting for fundamental advance.

Questionable progress
In 1999, early question answering systems, including one from IBM, began competing against each other under the auspices of the National Institute of Standards and Technology (NIST). At the time, researchers envisioned a four-step progression, starting with systems batting back facts to simple questions such as “When was Queen Victoria born?” A few tall steps later, programs would stand atop the podium and hold forth on matters requiring a human expert like “What are the opinions of the Danes on the Euro?”

Getting past step one proved difficult. While naming the grape variety in Chateau Petrus Bordeaux posed little difficulty, programs flailed on follow-up questions like “Where did the winery’s owner go to college?” even though the answer resided in the knowledgebase provided. These contextual questions were deemed “not suitable” by NIST and dropped. The focus remained on simple, unconnected factoids. In 2006, context questions returned—only to be cut again the following year. In the consensus view, such questions were “too hard,” according to Jimmy Lin, a computer science professor at the University of Maryland and a coordinator of the competition.

The entire contest was dropped after 2007. “NIST decides what to push,” explained Lin, “and we were not getting that much out of this…” Progress had turned incremental, like “trying to build a better gas engine,” according to Lin. Question answering wasn’t finished, but it was done. Although James T. Kirk asked the Star Trek computer questions like whether a storm could cause inter-dimensional contact with a parallel universe, actual question answering systems like Watson would be hard pressed to answer NIST level one questions like “Where is Pfizer doing business?”

ABC easy as 1, 2, 3
Kirk spoke to the computer. Watson’s designers opted for text messages—which says a lot. Speech recognition software accuracy reaches only around 80% whereas humans hover in the nineties. Speech software treats language not as words and sentences carrying meaning but as strings of characters following statistical patterns. Seemingly, Moore’s law and more language data should eventually yield accuracy at or conceivably beyond human levels. But although chips have sped up and data abound, recognition accuracy plateaued around 1999 and NIST stopped benchmarking in 2001 for lack of progress to measure. (See my Rest in Peas: the Unrecognized Death of Speech Recognition.)

Much of speech recognition’s considerable success derives from consciously rejecting the deeper dimensions of language. This source of success is now a cause of failure. Ironically, as Watson triggers existential crisis among humans, computers are struggling to find meaning.

Words are important in language. We’ve had dictionaries for a quarter millennium, and these became machine readable in the last quarter century. But the fundamental difficulties of word meanings have not changed. For his 1755 English dictionary, Samuel Johnson hoped that words, “these fundamental atoms of our speech might obtain the firmness and immutability of… particles of matter…” After nine years’ effort, Johnson published his dictionary but abandoned the idea of a periodic table of words. Concerning language, he wrote: “naked science is too delicate for the purposes of life.”

Echoing Johnson 250 years later, lexicographer Adam Kilgarriff wrote: “The scientific study of language should not include word senses as objects…” The sense of a word depends on context. For example, if Ken Jennings calculates that Oreos and crosswords originated in the 1920s, does calculate mean mathematical computation or judge to be probable? Well, both. Those senses are too fine-grained and need to be lumped together. But senses can also be too coarse. If I buy a vowel on Wheel of Fortune, do I own the letter A? No. This context calls for a finer, even micro-sense.

The decade of origin for Oreos and crosswords was actually the 1910s—as Watson correctly answered. In what decade will computers understand word meanings? Not soon; perhaps never. Theoretical underpinnings are absent: “[T]he various attempts to provide the concept ‘word sense’ with secure foundations over the last thirty years have all been unsuccessful,” as Kilgarriff wrote more than a decade ago, in 1997. Empirical, philosophy-be-damned approaches were tried the following year.

In 1998, at Senseval-1, researchers tackled a set of 35 ambiguous words. The best system attained 78% accuracy. The next Senseval, in 2001, used more nuanced word definitions which knocked accuracy down below 70%. It didn’t get up: “[I]t seems that the best systems have hit a wall,” organizers wrote. The wall wasn’t very high. Systems struggled to do better than mindlessly picking the first dictionary sense of a word every time. Organizers acknowledged that most of the efforts “seem to be of little use…” and had “not yet demonstrated real benefits in human language technology applications.”
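The baseline that systems struggled to beat is easy to state in code. The sketch below uses a tiny hand-made sense inventory; the entries and the function name `first_sense` are invented for illustration and are not drawn from Senseval. It ignores context entirely and always returns the first listed sense.

```python
# Hypothetical mini-dictionary: senses listed in dictionary order.
SENSES = {
    "calculate": ["compute mathematically", "judge to be probable"],
    "buy": ["acquire by payment", "pay to reveal (game-show sense)"],
}

def first_sense(word, context, senses=SENSES):
    """Naive baseline: ignore the context and return the first listed sense."""
    return senses[word][0]

# The context argument is accepted but never consulted.
print(first_sense("buy", "I buy a vowel on Wheel of Fortune"))
```

On the Wheel of Fortune sentence this baseline returns the ordinary purchase sense, which is exactly the kind of miss a context-aware system is supposed to avoid.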

Disambiguation was dustbinned. Senseval was renamed Semeval and semantic tasks subsumed word sense disambiguation by 2010. Today no hardware/software system can reliably determine the meaning of a given word in a sentence.

That’s imparsable
Belief continues unfazed, however, regarding whether language can be solved with statistical methods. “The answer to that question is ‘yes,’” declares an unabashedly partial Eugene Charniak, professor of computer science at Brown University. Whatever the trouble with word meanings, at the sentence level, computer comprehension is quite impressive, thanks to probabilistic models. Charniak has written a parsing program that unfurls a delicate mobile of syntax from the structure of a sentence.

Mobile meaning, hanging in the balance (Diagram: phpSyntaxTree)

Such state-of-the-art parsers spin accurate mobiles about 80% of the time when given sentences from The Wall Street Journal. But feed in a piece of literature, biomedical text or a patent and the parses tangle. Nouns are mistaken for finite verbs in patents; in literary texts, different kinks and knots tug accuracy down to 70%.

Performance droops because the best parsers don’t apply universal rules of grammar. We don’t know them or if they exist. Instead parsers try to reverse engineer grammar by examining huge numbers of example sentences and generating a statistical model that substitutes for the ineffable principles.

That strategy hasn’t worked. Accuracy invariably declines when parsers confront an unfamiliar body of text. The machine learning approach finds patterns no human realistically could, but these aren’t universal. Change the text and the patterns change. That means current parsing technology performs poorly on highly diverse sources like the web.

Progress has gone extinct: parsing accuracy gained perhaps a few tenths of one percent in the last decade. “Squeezing more out of the current approaches is not going to work,” says Charniak. Instead, he concludes, “we need to squeeze more out of meaning.”

Surface features don’t provide a reliable grip on sentence syntax. Word order and parts of speech often aren’t enough. Regular sentences can be slippery, like:

  • President F.W. de Klerk released the ANC men along with one of the founding members of the Pan Africanist Congress (PAC).

Not knowing about apartheid, the parser must guess whether de Klerk and a PAC member together released the ANC men—although the PAC figure was also in prison. The program has no basis for deciding where in the mobile (pictured above) to hang up the phrase “along with…”

Notice that winning Jeopardy is easier than correctly diagramming some sentences. And Watson provides no help to a parser in need. Questions like, “What are the chances of PAC releasing members of the ANC?” are far too hard, the reasoning power and information required too vast. Watson’s designers likened organizing all knowledge to “boiling the ocean.” They didn’t try. Others are.

But it’s called mining the World Wide Web and aims to penetrate to the inner core, the sanctum sanctorum, of meaning.

In the beginning was the word. But the problem of meaning arises in word senses and spreads, as we have seen, to sentences. Errors and misprisions accumulate, fouling higher level processing. In a paragraph referencing “Mr. Clinton,” “Clinton,” and “she,” programs cannot reliably figure out if “Clinton” refers to Bill Clinton or Hillary Clinton, even after 15 years of effort. Perhaps because of this problem, Watson once answered “Richard Nixon” to a clue asking for a first lady.

Evading this error requires understanding the senses of nearby words, that is, solving the unsolved word disambiguation problem. Finding entry into this loop of meaning has been elusive, the tape roll seamless, thumbnail never catching at the beginning.

Structuring web-based human knowledge promises to break through today’s dreary performance ceiling. Tom Mitchell at Carnegie Mellon University seeks “growing competence without asymptote,” a language learning system which spirals upward limitlessly. Mitchell’s Never Ending Language Learning (NELL) reads huge tracts of the web, every day. NELL’s blade servers grind half a billion pages of text into atoms of knowledge, extracting simple facts like “Microsoft is a company.” Initially, facts are seeded onto a human-built knowledge scaffold. But the idea is to train NELL in this process and enable automatic accretion of facts into ever-growing crystals of knowledge, adding “headquartered in the city Redmond” to facts about Microsoft, for example. Iterate like only computers can and such simple crystals should complexify and propagate.

But instead NELL slumps lifelessly soon after human hands tip it on its feet. Accuracy of facts extracted drops from an estimated 90% to 57%. Human intervention became necessary: “We had to do something,” Mitchell told fellow researchers last year. The interventions became routine and NELL dependent on humans. NELL employs machine learning but knowledge acquisition might not be machine learnable: “NELL makes mistakes that lead to learning to make additional mistakes,” as NELL’s creators observed.

The program came to believe that F.W. de Klerk is a scientist not a former president of South Africa—providing little help in resolving ambiguous parsing problems. At the same time, NELL needs better parsing to mine knowledge more accurately: “[W]e know NELL needs to learn to parse,” Mitchell wrote in email. This particular Catch-22 might not be fundamentally blocking. But if NELL can’t enhance the performance of lower-level components, those components might clamp a weighty asymptote on NELL’s progress.

Represent, represent
An older, less surmountable, perhaps impossible problem faced by NELL is how to arrange facts, assuming they can be made immaculate. Facts gleaned by NELL must be pigeon-holed into one of just 270 categories—tight confines for all of knowledge. Mitchell wants NELL to be able to expand these categories. However, while incorrect individual facts might compromise NELL, getting categories wrong would be fatal.

But no one knows how to write a kind of forensic program that accurately reconstructs a taxonomy from its faint imprint in text. Humans manage, but only with bickering. Even organizing knowledge in relatively narrow, scientific domains poses challenges, small molecules in biology, for example. Some labs just isolate and name the different species. Other researchers with different interests represent a molecule with its weight and the weights of its component parts, information essential to studying metabolism. However, 2D representations are needed for yet another set of purposes (reasoning about reaction mechanisms) whereas docking studies call for 3D representations, etc.

What a thing is or, more specifically, how you represent it, depends on what you are trying to do—just as the quest for word senses discovered. So even for the apparently simple task of representing a type of molecule, “there is not one absolute answer,” according to Fabien Campagne, research professor of computational biomedicine at Weill Medical College. The implication is that representation isn’t fixed, pre-defined. And new lines of inquiry, wrote Campagne, “may require totally new representations of the same entity.”

One of NELL’s biological conceptions is that “blood is a subpart of the body within the right ventricle.” Perhaps this and a complement of many other facts cut in a similar shape can represent blood in a way that answers some purpose or purposes. But it will not apply in discussions of fish blood. (Fish have no right ventricle.) And when it comes to human transfusion, blood is more a subpart of a bag.

The difficulties of representation represented: Marcel Duchamp’s Nude Descending Staircase. NELL’s representation of Marcel Duchamp

Particular regions of knowledge can be tamed by effort or imposition of a scheme by raw exercise of authority. But these fiefdoms resist unification and generally conflict. After millennia of effort, humans have yet to devise a giant plan which would harmonize all knowledge. The Wikipedia folksonomy works well for people but badly for automating reasoning. Blood diamonds and political parties in Africa, for example, share a category but clearly require different handling. One knowledge project, YAGO, simply lops off the Wikipedia taxonomy.

The dream of a database of everything is very much alive. Microsoft Research, from its Beijing lab, recently unveiled a project named Probase which its creators say contains 2.7 million concepts sifted from 1.68 billion web pages. “[P]robably it already includes most, if not all, concepts of worldly facts that human beings have formed in their mind,” claim the researchers with refreshing idealism. Leaving aside the contention that everything ever thought has been registered on the Internet, there still are no universal injection molds—categories—ready to be blown full of knowledge.

A much earlier, equally ambitious effort called Cyc failed for a number of reasons, but insouciance about knowledge engineering, about what to put where, contributed to Cyc’s collapse. Human beings tried to build Cyc’s knowledgebase by hand, assembling a Jenga stack of about one million facts before giving up.

NELL may be an automated version of Cyc. And it might succeed less. NELL’s minders already have their hands full tweaking the program’s learning habits to keep fact accuracy up. NELL is inferior to Cyc when it comes to the complexity of knowledge each system can handle. Unless NELL can learn to create categories, people will have to do it, entailing a monumental knowledge engineering effort and one not guaranteed to succeed. Machine learning relies on examples which simply might not work for elucidating categories and taxonomies. Undoubtedly, it is far harder than extracting facts.

NELL may also represent a kind of inverse of IBM’s Watson. NELL arguably is creating a huge Jeopardy clue database full of facts like “Microsoft is a company headquartered in Redmond.” NELL and Watson attack essentially the same problem of knowledge, just from different directions. But it will be difficult for NELL to reach even Watson’s level of performance. Watson left untouched the texts among its 15 terabytes of data. NELL eviscerates text, centrifuging the slurry to separate out facts and reassembling them into a formalized structure. That is harder.

And Watson is confined to the wading pool, factoid shallows of knowledge. The program is out of its depth on questions that require reasoning and understanding of categories. That may be why, in Final Jeopardy, Watson answered “Toronto” not “Chicago” for the United States city whose largest airport is named for a World War II hero and second largest for a World War II battle. Watson likely could have separately identified O’Hare and Midway if asked sequential questions. And pegging Chicago as the city served by both airports also presumably would be automatic for the computer. But decomposing and then answering the series appears to have been too hard. NIST dropped such questions, twice, for their perceived insuperability. And yet they are trivial compared to answering questions about the relations between F.W. de Klerk, the African National Congress, and the Pan Africanist Congress, questions of the kind which have stalled progress in parsing.

Google vs. language
Google contends with language constantly—and prevails. Most Google queries are actually noun phrases like “washed baby carrots.” To return relevant results, Google needs to know if the query is about a clean baby or clean carrots. Last year, a team of researchers crushed this problem under a trillion-word heap of text harvested from the many-acred Google server farms. Statistically, the two words “baby carrots” show up together more than “washed baby.” Problem solved. Well, mostly.
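The counting idea can be sketched in a few lines of Python. This is only an illustration, not the researchers' actual system: the counts are invented for the example, and the function name `bracket` is hypothetical.

```python
# Toy bigram counts; the numbers are invented for illustration only.
BIGRAM_COUNTS = {
    ("baby", "carrots"): 120_000,
    ("washed", "baby"): 900,
}

def bracket(w1, w2, w3, counts):
    """Group a three-word noun phrase around its more frequent adjacent pair."""
    left = counts.get((w1, w2), 0)    # how often "w1 w2" occurs together
    right = counts.get((w2, w3), 0)   # how often "w2 w3" occurs together
    if right >= left:
        return f"[{w1} [{w2} {w3}]]"  # right-branching: washed [baby carrots]
    return f"[[{w1} {w2}] {w3}]"      # left-branching: [washed baby] carrots

print(bracket("washed", "baby", "carrots", BIGRAM_COUNTS))
```

With these toy counts the phrase is grouped as clean carrots rather than a clean baby. Nothing in the function knows what a carrot is; the decision rests entirely on which pair of words is more common.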

The method works an impressive 95.4% of the time, at least on sentences from The Wall Street Journal. Perhaps as important, accuracy muscled up as the system ingested ever-larger amounts of data. “Web-scale data improves performance,” trumpeted researchers. “That is, there is no data like more data.” And more data are inevitable. So will the growing deluge wash away the inconveniences of parsing and other language processing problems?

Performance did increase with data, but bang for the byte still dropped—precipitously. Torqueing accuracy up just 0.1% required an order of magnitude increase in leverage, to four billion unique word sequences. Powering an ascent to 96% accuracy would require four quadrillion, assuming no further diminution of returns. To reach 97%, begin looking for 40 septillion text specimens. 
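The extrapolation is simple arithmetic, reproduced here as a sketch. It assumes, as the paragraph does, that each further 0.1 percentage point costs a tenfold increase in data, starting from 95.4% at four billion sequences. The function name `sequences_needed` is hypothetical.

```python
from fractions import Fraction

BASE_ACCURACY = Fraction("95.4")  # percent, the reported result
BASE_SEQUENCES = 4 * 10**9        # four billion unique word sequences
STEP = Fraction("0.1")            # each 0.1-point gain costs 10x the data

def sequences_needed(target_accuracy):
    """Extrapolate the data required, assuming returns diminish no further."""
    steps = (Fraction(target_accuracy) - BASE_ACCURACY) / STEP
    return BASE_SEQUENCES * 10 ** int(steps)

print(f"{sequences_needed('96'):.0e}")  # four quadrillion
print(f"{sequences_needed('97'):.0e}")  # 40 septillion
```

Exact fractions are used instead of floats so that the number of 0.1-point steps (6 and 16) comes out as a whole number.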

Mine the gap: Does the Internet have enough words to solve noun phrases? (Adapted from Pitler et al., “Using Web-scale N-grams to Improve Base NP Parsing Performance”)

More data yielding ever better results is the exception not the rule. The problem of word senses, for example, is relatively impervious to data-based assaults. “The learning curve moves upward fairly quickly with a few hundred or a few thousand examples,” according to Ted Pedersen, computer science professor at the University of Minnesota, Duluth, “but after a few thousand examples there's usually no further gain/learning, it just all gets very noisy.”

Conceivably, we are now witnessing the data wave in language processing. And it may pass over without sweeping away the problems.

Let the data speak, or silence please
In speech recognition too, according to MIT’s Jim Glass, “There is no data like more data.” Glass, head of MIT’s Spoken Language System Group, continued in email: “Everyone has been wondering where the asymptote will be for years but we are still eking out gains from more data.” However, evidence for continuing advance toward human levels of recognition accuracy is scarce, possibly non-existent.

Nova’s Watson documentary asserts that recognition accuracy is “getting better all the time” (~34:00) but doesn’t substantiate the claim. Replying to an email inquiry, a Nova producer re-asserted that programs like Dragon Naturally Speaking from Nuance “are clearly more accurate and continuing to improve,” but again adduced no evidence.

Guido Gallopyn, vice president of research and development at Nuance, has worked on Dragon Naturally Speaking for over a decade. He says Dragon’s error rate had been cut “more than in half.” But Gallopyn begged off providing actual figures, saying accuracy was “complicated.” He did acknowledge that there was still “a long way to go.” And while Gallopyn has faith that human-level performance can be attained, astonishingly, it is not a goal for which Nuance strives: “We don’t do that,” he stated flatly.

Slate also recently talked up speech recognition, specifically Google Voice. The article claims that programs like Dragon “tend to be slow and use up a lot of your computer's power when deciphering your words,” in contrast to Google’s powerful servers. In the Google cloud, 70 processor-years of data mashing can be squeezed into a single day. Accurate speech recognition then springs from the “magic of data,” but exactly how magic goes unmeasured. Google too is mum: “We don't have specific metrics to share on accuracy,” a spokesperson for the company said.

By contrast, The Wall Street Journal recently reported on how Google Voice is laughably mistake-prone, serving as the butt of jokes in a new comedic sub-genre.

There is no need for debate or speculation: the NIST benchmarks, gathering dust for a decade, can definitively answer the question of accuracy. The results would be suggestive for the prospects of web-scale data to overcome obstacles in language processing. Computer understanding of language, in turn, has substantial implications for machine intelligence. In any event, claims about recognition accuracy should come with data.

Today, all that can be said is this:

Progress in voice recognition: the sound of one hand clapping since 2001 (Adapted from NIST, “The History of Automatic Speech Recognition Evaluations at NIST”)

To be || not to be
That is the question about machine intelligence.

When Garry Kasparov was asked how IBM might improve Deep Blue, its chess playing computer, he answered tartly: “Teach it to resign earlier.” Kasparov, then world chess champion, had just soundly defeated Deep Blue. Rather than follow this advice, IBMers put some faster chips in, tweaked the software and then utterly destroyed Kasparov not long after, in 1997. It was IBM’s turn to vaunt: “One hundred years from now, people will say this day was the beginning of the Information Age,” claimed the head of the IBM team. Instead, apart from chess, Deep Blue has had no effect.

If Deep Blue represented an effort to rise above human intelligence by brute computational force, Watson represents the data wave. But we have been inundated by data for some time. Google released its trillion word torrent of text five years ago. Today the evidence may suggest that the problems of language will remain after the deluge. If the rising tide of world-wide data can’t float computing’s boat to human levels, “What’s the alternative?” demands Eugene Charniak. He perhaps means there is no alternative.

A somewhat radical idea is to revise the parts of speech, as Stanford University’s Christopher Manning has proposed. Disturbingly, Manning asks: “Are part-of-speech labels well-defined discrete properties enabling us to assign each word a single symbolic label?” Recall that words don’t cleanly map to discrete senses, and similarly that things in the world don’t fit into obvious, finite, universal categories. Now the parts of speech seem to be breaking down.

Tagging accuracy: time for new parts of speech? (Source: Flickr, Tone Ranger)

Manning is skeptical that machine learning could conjure even 0.2% more accuracy in the tagging of words with their part of speech. Achim Hoffmann, at the University of New South Wales, believes more generally that machine learning now bumps against a ceiling. “New techniques,” he adds, “are not going to substantially change that.” Hoffmann points out that relatively old techniques “are still among the most successful learning techniques today, despite the fact that many thousand new papers have been written in the field since then.”

For Hoffmann, the alternative is to approach intelligence not through language or knowledge but algorithm. Arguably, however, this is just a return to the very origins of artificial intelligence. John McCarthy, inventor of the term “artificial intelligence,” tried to find a formal logic and mathematical semantics that would lead to human-like reasoning. This project failed. It led to Cyc. As Cyc founder Doug Lenat wrote in 1990: “We don’t believe there is any shortcut to being intelligent, any yet-to-be-discovered Maxwell’s equations of thought.” Forget algorithm. Knowledge would pave the way to commonsense. Cyc, of course, also did not work.

Are we just turning circles, or is the noose cinching tighter with repeated exertions? There is something viscerally compelling—disturbing—about Watson and its triumph. “Cast your mind back 20 years,” as AI researcher Edward Feigenbaum recently said in the pages of The New York Times, “and who would have thought this was possible?” But 20 years ago, Feigenbaum published a paper with Doug Lenat about a project called Cyc. Cyc aimed at full blown artificial intelligence. Watson stands in relation to a completely realized Cyc the way J. Craig Venter’s synthetic cell stands to the original vision of genetic engineering: a toy.

John McCarthy derided the Kasparov-Deep Blue spectacle, calling it “AI as sport.” Jimmy Lin, the former NIST coordinator, is not derisive but more ho-hum, worldly-wise about Watson: “Like a lot of things,” he says, “it’s a publicity stunt.” Perhaps an artificially intelligent computer wouldn’t fall for it, but people have. The New Yorker sees the triumphs of Deep Blue and Watson as forcing would-be defenders of humanity to move the goalposts back, to re-define the boundaries of intelligence and leave behind the fields recently annexed by computers. But the goalposts arguably have been moved up, so that weak artificial intelligence (artificial AI) can put it through the uprights.

The New York Times contends that Watson means “rethinking what it means to be human.” Actually what needs redefinition may be humanity’s relationship to dreams of technological transcendence.

Rest in Peas: The Unrecognized Death of Speech Recognition

Pushing up daisies (Photo courtesy of Creative Coffins)

Mispredicted Words, Mispredicted Futures

The accuracy of computer speech recognition flat-lined in 2001, before reaching human levels. The funding plug was pulled, but no funeral, no text-to-speech eulogy followed. Words never meant very much to computers—which made them ten times more error-prone than humans. Humans expected that computer understanding of language would lead to artificially intelligent machines, inevitably and quickly. But the mispredicted words of speech recognition have rewritten that narrative. We just haven’t recognized it yet.

After a long gestation period in academia, speech recognition bore twins in 1982: the suggestively named Kurzweil Applied Intelligence and sibling rival Dragon Systems. Kurzweil’s software, by age three, could understand all of a thousand words—but only when spoken one painstakingly articulated word at a time. Two years later, in 1987, the computer’s lexicon reached 20,000 words, entering the realm of human vocabularies, which range from 10,000 to 150,000 words. But recognition accuracy was horrific: 90% wrong in 1993. Another two years, however, and the error rate pushed below 50%. More importantly, in 1997 Dragon Systems unveiled its NaturallySpeaking software, which recognized normal human speech. Years of talking to the computer like a speech therapist seemingly paid off.

However, the core language machinery that crushed sounds into words actually dated to the 1950s and ’60s and had not changed. Progress mainly came from freakishly faster computers and a burgeoning profusion of digital text.

Speech recognizers make educated guesses at what is being said. They play the odds. For example, the phrase “serve as the inspiration” is ten times more likely than “serve as the installation,” which sounds similar. Such statistical models become more precise given more data. Helpfully, the digital word supply leapt from essentially zero to about a million words in the 1980s when a body of literary text called the Brown Corpus became available. Millions turned to billions as the Internet grew in the 1990s. Inevitably, Google published a trillion-word corpus in 2006. Speech recognition accuracy, borne aloft by exponential trends in text and transistors, rose skyward. But it couldn’t reach human heights.
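
To make "playing the odds" concrete, here is a minimal sketch of the idea in Python. It is not the code of any real recognizer: the toy corpus, the function names and the counts are all invented for illustration, and the corpus is rigged so that "inspiration" follows "serve as the" ten times as often as "installation," echoing the ratio above.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every run of n consecutive words in the text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def best_guess(candidates, counts):
    """Pick the candidate phrase seen most often in the reference text."""
    return max(candidates, key=lambda phrase: counts[tuple(phrase.lower().split())])

# Invented toy corpus (an assumption for this sketch, not real data):
# ten sentences ending in "inspiration", one ending in "installation".
TOY_CORPUS = " ".join(
    ["it will serve as the inspiration"] * 10
    + ["it will serve as the installation"]
)

counts = ngram_counts(TOY_CORPUS, 4)
candidates = ["serve as the inspiration", "serve as the installation"]

guess = best_guess(candidates, counts)
odds = counts[tuple(candidates[0].split())] / counts[tuple(candidates[1].split())]
print(guess, odds)
```

The sketch always goes with the more frequent phrase, which is exactly the behavior at issue: a speaker who really did say "installation" loses the bet every time, no matter how much text is counted.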

Source: National Institute of Standards and Technology Benchmark Test History 

“I’m sorry, Dave. I can’t do that.”

In 2001 recognition accuracy topped out at 80%, far short of HAL-like levels of comprehension. Adding data or computing power made no difference. Researchers at Carnegie Mellon University checked again in 2006 and found the situation unchanged. With human discrimination as high as 98%, the unclosed gap left little basis for conversation. But sticking to a few topics, like numbers, helped. Saying “one” into the phone works about as well as pressing a button, approaching 100% accuracy. But loosen the vocabulary constraint and recognition begins to drift, turning to vertigo in the wide-open vastness of linguistic space.
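
The size of that gap is easier to see as error rates than as accuracies. The short calculation below uses only the two figures quoted above (80% for machines, 98% for humans); the function name is just a label for this sketch.

```python
def error_rate(accuracy_percent):
    """Convert an accuracy percentage into an error percentage."""
    return 100 - accuracy_percent

machine_error = error_rate(80)  # recognition software at its 2001 plateau
human_error = error_rate(98)    # upper end of human performance

ratio = machine_error / human_error
print(machine_error, human_error, ratio)
```

Twenty errors per hundred words against two: this is the sense in which computers are ten times more error-prone than humans.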

The language universe is large, Google’s trillion words a mere scrawl on its surface. One estimate puts the number of possible sentences at 10^570. Through constant talking and writing, more of the possibilities of language enter into our possession. But plenty of unanticipated combinations remain which force speech recognizers into risky guesses. Even where data are lush, picking what’s most likely can be a mistake because meaning often pools in a key word or two. Recognition systems, by going with the “best” bet, are prone to interpret the meaning-rich terms as more common but similar-sounding words, draining sense from the sentence.

Strings, heavy with meaning. (Photo credit: t_a_i_s)

Statistics veiling ignorance

Many spoken words sound the same. Saying “recognize speech” makes a sound that can be indistinguishable from “wreck a nice beach.” Other laughers include “wreck an eyes peach” and “recondite speech.” But with a little knowledge of word meaning and grammar, it seems like a computer ought to be able to puzzle it out. Ironically, however, much of the progress in speech recognition came from a conscious rejection of the deeper dimensions of language. As an IBM researcher famously put it: “Every time I fire a linguist my system improves.” But pink-slipping all the linguistics PhDs only gets you 80% accuracy, at best.

In practice, current recognition software employs some knowledge of language beyond just the outer surface of word sounds. But efforts to impart human-grade understanding of word meaning and syntax to computers have also fallen short.

We use grammar all the time, but no effort to completely formalize it in a set of rules has succeeded. If such rules exist, computer programs turned loose on great bodies of text haven’t been able to suss them out either. Progress in automatically parsing sentences into their grammatical components has been surprisingly limited. A 1996 look at the state of the art reported that “Despite over three decades of research effort, no practical domain-independent parser of unrestricted text has been developed.” As with speech recognition, parsing works best inside snug linguistic boxes, like medical terminology, but weakens when you take down the fences holding back the untamed wilds. Today’s parsers “very crudely are about 80% right on average on unrestricted text,” according to Cambridge professor Ted Briscoe, author of the 1996 report. Parsers and speech recognition have penetrated language to similar, considerable depths, but without reaching a fundamental understanding.

Researchers have also tried to endow computers with knowledge of word meanings. Words are defined by other words, to state the seemingly obvious. And definitions, of course, live in a dictionary. In the early 1990s, Microsoft Research developed a system called MindNet which “read” the dictionary and traced out a network from each word out to every mention of it in the definitions of other words.

Words have multiple definitions until they are used in a sentence which narrows the possibilities. MindNet deduced the intended definition of a word by combing through the networks of the other words in the sentence, looking for overlap. Consider the sentence, “The driver struck the ball.” To figure out the intended meaning of “driver,” MindNet followed the network to the definition for “golf” which includes the word “ball.” So driver means a kind of golf club. Or does it? Maybe the sentence means a car crashed into a group of people at a party.
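
The overlap idea can be sketched in a few lines of Python. This is a simplified, single-step version of definition-overlap disambiguation, not MindNet's actual implementation, which followed a network across many definitions. The mini-dictionary, the stopword list and the function names are all invented for illustration.

```python
# Invented two-sense dictionary entry (an assumption for this sketch,
# not MindNet's data).
SENSES = {
    "driver": {
        "golf club": "a golf club used to hit the ball a long distance from the tee",
        "motorist": "a person who operates a car or other motor vehicle",
    },
}

# Common words that carry no information about which sense is meant.
STOPWORDS = {"a", "the", "to", "or", "of", "who", "used", "from", "other"}

def content_words(text):
    """Lowercase the text, strip basic punctuation and drop stopwords."""
    return {w.strip(".,").lower() for w in text.split()} - STOPWORDS

def disambiguate(word, sentence, senses=SENSES):
    """Score each sense by how many of its definition words appear in the sentence."""
    context = content_words(sentence) - {word}
    scores = {
        label: len(context & content_words(definition))
        for label, definition in senses[word].items()
    }
    return max(scores, key=scores.get), scores

label, scores = disambiguate("driver", "The driver struck the ball.")
print(label, scores)
```

The sketch settles on the golf club because "ball" appears in that definition. It has no way to notice that a ball can also be a party, so it is confident for the same reason it might be wrong.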

To guess meanings more accurately, MindNet expanded the data on which it based its statistics, much as speech recognizers did. The program ingested encyclopedias and other online texts, carefully assigning probabilistic weights based on what it learned. But that wasn’t enough. MindNet’s goal of “resolving semantic ambiguities in text” remains unattained. The project, the first undertaken by Microsoft Research after it was founded in 1991, was shelved in 2005.

Can’t get there from here

We have learned that speech is not just sounds. The acoustic signal doesn’t carry enough information for reliable interpretation, even when boosted by statistical analysis of terabytes of example phrases. As the leading lights of speech recognition acknowledged last May, “it is not possible to predict and collect separate data for any and all types of speech…” The approach of the last two decades has hit a dead end. Similarly, the meaning of a word is not fully captured just by pointing to other words as in MindNet’s approach. Grammar likewise escapes crisp formalization.  

To some, these developments are no surprise. In 1986, Terry Winograd and Fernando Flores audaciously concluded that “computers cannot understand language.” In their book, Understanding Computers and Cognition, the authors argued from biology and philosophy rather than producing a proof like Einstein’s demonstration that nothing can travel faster than light. So not everyone agreed. Bill Gates described it as “a complete horseshit book” shortly after it appeared, but acknowledged that “it has to be read,” a wise amendment given the balance of evidence from the last quarter century.

Fortunately, the question of whether computers are subject to fundamental limits doesn’t need to be answered. Progress in conversational speech recognition accuracy has clearly halted and we have abandoned further frontal assaults. The research arm of the Pentagon, DARPA, declared victory and withdrew. Many decades ago, DARPA funded the basic research behind both the Internet and today’s mouse-and-menus computer interface. More recently, the agency financed investigations into conversational speech recognition but shifted priorities and money after accuracy plateaued. Microsoft Research persisted longer in its pursuit of a seeing, talking computer. But that vision became increasingly spectral, and today none of the Speech Technology group’s projects aspire to push speech recognition to human levels.

Cognitive dissonance

We are surrounded by unceasing, rapid technological advance, especially in information technology. It is impossible for something to be unattainable. There has to be another way. Right? Yes—but it’s more difficult than the approach that didn’t work. In place of simple speech recognition, researchers last year proposed “cognition-derived recognition” in a paper authored by leading academics, a scientist from Microsoft Research and a co-founder of Dragon Systems. The project entails research to “understand and emulate relevant human capabilities” as well as understanding how the brain processes language. The researchers, with that particularly human talent for euphemism, are actually saying that we need artificial intelligence if computers are going to understand us.

Originally, however, speech recognition was going to lead to artificial intelligence. Computing pioneer Alan Turing suggested in 1950 that we “provide the machine with the best sense organs that money can buy, and then teach it to understand and speak English.” Over half a century later, artificial intelligence has become prerequisite to understanding speech. We have neither the chicken nor the egg.

Speech recognition pioneer Ray Kurzweil piloted computing a long way down the path toward artificial intelligence. His software programs first recognized printed characters, then images and finally spoken words. Quite reasonably, Kurzweil looked at the trajectory he had helped carve and prophesied that machines would inevitably become intelligent and then spiritual. However, because we are no longer banging away at speech recognition, this new great chain of being has a missing link.

That void and its potential implications have gone unremarked, the greatest recognition error of all. Perhaps no one much noticed when the National Institute of Standards and Technology simply stopped benchmarking the accuracy of conversational speech recognition. And no one, speech researchers included, broadcasts their own bad news. So conventional belief remains that speech recognition and even artificial intelligence will arrive someday, somehow. Similar beliefs cling to manned space travel. Wisely, when President Obama cancelled the Ares program, he made provisions for research into “game-changing new technology,” as an advisor put it. Rather than challenge a cherished belief, perhaps the President knew to scale it back until it fades away.

Source: Google

Speech recognition seems to be following a similar pattern, signal blending into background noise. News mentions of Dragon Systems’ NaturallySpeaking software peaked at the same time as recognition accuracy, 1999, and declined thereafter. “Speech recognition” shows a broadly similar pattern, with peak mentions coming in 2002, the last year in which NIST benchmarked conversational speech recognition.

With the flattening of recognition accuracy comes the flattening of a great story arc of our age: the imminent arrival of artificial intelligence. Mispredicted words have cascaded into mispredictions of the future. Protean language leaves the future unauthored.


Ray Kurzweil does not understand the brain