IBM’s Watson: Portent or Pretense?

Game over.

IBM’s Watson, with a 15-terabyte chunk of human knowledge loaded into it like a Game Boy cartridge and set on a hair trigger, poured out a high-precision fact fusillade that left no humans standing in the (aptly named) Jeopardy. Is machine omniscience upon us?

“I, for one, welcome our new computer overlords,” said defeated Jeopardy champion Ken Jennings. But Jennings has buzzed in too quickly. Watson might point not to the inevitability of artificial intelligence but to its unattainability. Barring unexpected revelations from IBM, Watson represents exquisite engineering work-arounds substituting for fundamental advance.

Questionable progress
In 1999, early question answering systems, including one from IBM, began competing against each other under the auspices of the National Institute of Standards and Technology (NIST). At the time, researchers envisioned a four-step progression, starting with systems batting back facts in response to simple questions such as “When was Queen Victoria born?” A few tall steps later, programs would stand atop the podium and hold forth on matters requiring a human expert, like “What are the opinions of the Danes on the Euro?”

Getting past step one proved difficult. While naming the grape variety in Chateau Petrus Bordeaux posed little difficulty, programs flailed on follow-up questions like “Where did the winery’s owner go to college?” even though the answer resided in the knowledgebase provided. These contextual questions were deemed “not suitable” by NIST and dropped. The focus remained on simple, unconnected factoids. In 2006, context questions returned—only to be cut again the following year. In the consensus view, such questions were “too hard,” according to Jimmy Lin, a computer science professor at the University of Maryland and a coordinator of the competition.

The entire contest was dropped after 2007. “NIST decides what to push,” explained Lin, “and we were not getting that much out of this…” Progress had turned incremental, like “trying to build a better gas engine,” according to Lin. Question answering wasn’t finished, but it was done. Although James T. Kirk asked the Star Trek computer questions like whether a storm could cause inter-dimensional contact with a parallel universe, actual question answering systems like Watson would be hard pressed to answer NIST level one questions like “Where is Pfizer doing business?”

ABC easy as 1, 2, 3
Kirk spoke to the computer. Watson’s designers opted for text messages—which says a lot. Speech recognition software accuracy reaches only around 80% whereas humans hover in the nineties. Speech software treats language not as words and sentences carrying meaning but as strings of characters following statistical patterns. Seemingly, Moore’s law and more language data should eventually yield accuracy at or conceivably beyond human levels. But although chips have sped up and data abound, recognition accuracy plateaued around 1999 and NIST stopped benchmarking in 2001 for lack of progress to measure. (See my Rest in Peas: the Unrecognized Death of Speech Recognition.)
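To see what those “statistical patterns” look like in practice, here is a bare-bones sketch of the n-gram language modeling that underlies recognizers. The toy corpus is my own invention, standing in for the billions of words a production system ingests; the point is only that the model ranks word strings by counts, with no notion of meaning.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the billions of words a real recognizer is trained on.
corpus = "recognize speech recognize speech wreck a nice beach".split()

# Count bigrams: how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word_prob(prev, nxt):
    """P(next | prev), estimated purely from co-occurrence counts -- no meaning involved."""
    total = sum(bigrams[prev].values())
    return bigrams[prev][nxt] / total if total else 0.0

# The model prefers whichever word string it has seen more often, regardless of
# whether the speaker meant "recognize speech" or "wreck a nice beach".
print(next_word_prob("recognize", "speech"))  # 1.0 in this toy corpus
print(next_word_prob("wreck", "a"))           # 1.0 in this toy corpus
```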

Much of speech recognition’s considerable success derives from consciously rejecting the deeper dimensions of language. This source of success is now a cause of failure. Ironically, as Watson triggers an existential crisis among humans, computers are struggling to find meaning.

Words are important in language. We’ve had dictionaries for a quarter millennium, and these became machine readable in the last quarter century. But the fundamental difficulties of word meanings have not changed. For his 1755 English dictionary, Samuel Johnson hoped that words, “these fundamental atoms of our speech might obtain the firmness and immutability of… particles of matter…” After nine years’ effort, Johnson published his dictionary but abandoned the idea of a periodic table of words. Concerning language, he wrote: “naked science is too delicate for the purposes of life.”

Echoing Johnson 250 years later, lexicographer Adam Kilgarriff wrote: “The scientific study of language should not include word senses as objects…” The sense of a word depends on context. For example, if Ken Jennings calculates that Oreos and crosswords originated in the 1920s, does calculate mean mathematical computation or judge to be probable? Well, both. Those senses are too fine-grained and need to be lumped together. But senses can also be too coarse. If I buy a vowel on Wheel of Fortune, do I own the letter A? No. This context calls for a finer, even micro-sense.
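A machine-readable dictionary makes the granularity problem easy to see. The sketch below, which assumes NLTK and its WordNet data are installed, simply lists WordNet’s sense inventory for the verb “calculate”: computation and judging-to-be-probable sit there as separate entries, whether or not the context wants them lumped.

```python
# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# WordNet's inventory of verb senses for "calculate" -- several closely related
# entries, including mathematical computation and judging something to be probable.
for i, synset in enumerate(wn.synsets("calculate", pos=wn.VERB), start=1):
    print(i, synset.name(), "-", synset.definition())
```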

The decade of origin for Oreos and crosswords was actually the 1910s—as Watson correctly answered. In what decade will computers understand word meanings? Not soon; perhaps never. Theoretical underpinnings are absent: “[T]he various attempts to provide the concept ‘word sense’ with secure foundations over the last thirty years have all been unsuccessful,” as Kilgarriff wrote more than a decade ago, in 1997. Empirical, philosophy-be-damned approaches were tried the following year.

In 1998, at Senseval-1, researchers tackled a set of 35 ambiguous words. The best system attained 78% accuracy. The next Senseval, in 2001, used more nuanced word definitions, which knocked accuracy down below 70%. It didn’t get up: “[I]t seems that the best systems have hit a wall,” organizers wrote. The wall wasn’t very high. Systems struggled to do better than mindlessly picking the first dictionary sense of a word every time. Organizers acknowledged that most of the efforts “seem to be of little use…” and had “not yet demonstrated real benefits in human language technology applications.”
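The “mindless” baseline is a one-liner. Here is a sketch, again assuming NLTK and its WordNet data, of the first-sense baseline alongside the classic Lesk gloss-overlap heuristic that many early systems built on; the example sentence is mine.

```python
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

sentence = "Jennings calculated that Oreos and crosswords originated in the 1920s".split()

# The baseline Senseval systems struggled to beat: always pick the first
# (most frequent) dictionary sense, ignoring context entirely.
first_sense = wn.synsets("calculate", pos=wn.VERB)[0]
print("first-sense baseline:", first_sense.name(), "-", first_sense.definition())

# A simple context-sensitive alternative: Lesk picks the sense whose dictionary
# gloss overlaps most with the surrounding words. It is only a little less mindless.
lesk_sense = lesk(sentence, "calculated", pos="v")
print("lesk:", lesk_sense.name() if lesk_sense else None)
```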

Disambiguation was dustbinned. Senseval was renamed SemEval, and by 2010 broader semantic tasks had subsumed word sense disambiguation. Today no hardware/software system can reliably determine the meaning of a given word in a sentence.

That’s imparsable
Belief nevertheless persists that language can be solved with statistical methods. “The answer to that question is ‘yes,’” declares an unabashedly partial Eugene Charniak, professor of computer science at Brown University. Whatever the trouble with word meanings, at the sentence level computer comprehension is quite impressive—thanks to probabilistic models. Charniak has written a parsing program that unfurls a delicate mobile of syntax from the structure of a sentence.

Mobile meaning, hanging in the balance (Diagram: phpSyntaxTree)

Such state-of-the-art parsers spin accurate mobiles about 80% of the time when given sentences from The Wall Street Journal. But feed in a piece of literature, biomedical text or a patent and the parses tangle. Nouns are mistaken for finite verbs in patents; in literary texts, different kinks and knots tug accuracy down to 70%.

Performance droops because the best parsers don’t apply universal rules of grammar. We don’t know those rules, or even whether they exist. Instead parsers try to reverse engineer grammar by examining huge numbers of example sentences and generating a statistical model that substitutes for the ineffable principles.
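The reverse engineering looks roughly like the sketch below, which uses the small Penn Treebank sample bundled with NLTK. Real parsers such as Charniak’s add lexicalization and heavy smoothing, but the basic move, counting tree fragments and turning the counts into probabilities, is the same.

```python
# Requires nltk with the 'treebank' corpus sample downloaded.
import nltk
from nltk.corpus import treebank

# Collect grammar productions (e.g. NP -> DT NN) from hand-annotated parse trees.
productions = []
for tree in treebank.parsed_sents()[:200]:
    productions += tree.productions()

# Estimate a probabilistic grammar: each rule's probability is its relative
# frequency among rules sharing the same left-hand side.
grammar = nltk.induce_pcfg(nltk.Nonterminal("S"), productions)

# A few of the learned "rules" -- statistical stand-ins for the universal
# principles of grammar that nobody has been able to write down.
for production in grammar.productions(lhs=nltk.Nonterminal("NP"))[:10]:
    print(production)
```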

That strategy hasn’t worked. Accuracy invariably declines when parsers confront an unfamiliar body of text. The machine learning approach finds patterns no human realistically could, but these aren’t universal. Change the text and the patterns change. That means current parsing technology performs poorly on highly diverse sources like the web.

Progress has gone extinct: parsing accuracy gained perhaps a few tenths of one percent in the last decade. “Squeezing more out of the current approaches is not going to work,” says Charniak. Instead, he concludes, “we need to squeeze more out of meaning.”

Surface features don’t provide a reliable grip on sentence syntax. Word order and parts of speech often aren’t enough. Regular sentences can be slippery, like:

  • President F.W. de Klerk released the ANC men along with one of the founding members of the Pan Africanist Congress (PAC).

Not knowing about apartheid, the parser must guess whether de Klerk and a PAC member together released the ANC men—although the PAC figure was also in prison. The program has no basis for deciding where in the mobile (pictured above) to hang the phrase “along with…”
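To make the guesswork concrete, here is a toy sketch: a simplified version of the sentence and a hand-written probabilistic grammar whose rule probabilities I invented. The parser hangs “with…” wherever the numbers point, history notwithstanding.

```python
import nltk

# A toy probabilistic grammar (rule probabilities are invented for illustration;
# a real parser estimates them from a treebank).
grammar = nltk.PCFG.fromstring("""
    S   -> NP VP        [1.0]
    VP  -> V NP         [0.6]
    VP  -> VP PP        [0.4]
    NP  -> Det N        [0.4]
    NP  -> NP PP        [0.3]
    NP  -> 'de_Klerk'   [0.2]
    NP  -> N            [0.1]
    PP  -> P NP         [1.0]
    Det -> 'the'        [0.5]
    Det -> 'a'          [0.5]
    N   -> 'men'        [0.5]
    N   -> 'member'     [0.5]
    V   -> 'released'   [1.0]
    P   -> 'with'       [1.0]
""")

sentence = "de_Klerk released the men with a member".split()

# The Viterbi parser returns the single highest-probability tree. Whether
# "with a member" hangs from the verb phrase (the member helped release them)
# or from "the men" (the member was among those released) is settled by rule
# probabilities alone; the parser knows nothing about apartheid.
parser = nltk.ViterbiParser(grammar)
for tree in parser.parse(sentence):
    print(tree)
```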

Notice that winning Jeopardy is easier than correctly diagramming some sentences. And Watson provides no help to a parser in need. Questions like, “What are the chances of PAC releasing members of the ANC?” are far too hard, the reasoning power and information required too vast. Watson’s designers likened organizing all knowledge to “boiling the ocean.” They didn’t try. Others are.

Unobtanium
They call it mining the World Wide Web, though, and the aim is to penetrate to the inner core, the sanctum sanctorum, of meaning.

In the beginning was the word. But the problem of meaning arises in word senses and spreads, as we have seen, to sentences. Errors and misprisions accumulate, fouling higher-level processing. In a paragraph referencing “Mr. Clinton,” “Clinton,” and “she,” programs cannot reliably figure out whether “Clinton” refers to Bill Clinton or Hillary Clinton—after 15 years of effort. Perhaps because of this problem, Watson once answered “Richard Nixon” to a clue asking for a first lady.

Evading this error requires understanding the senses of nearby words, that is, solving the unsolved word sense disambiguation problem. Finding entry into this loop of meaning has been elusive, the tape roll seamless, the thumbnail never catching at the beginning.
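A deliberately crude sketch shows how the loop bites. The matching rules and the tiny gender list below are my own invention, far simpler than any real coreference system, but they reproduce the same confusion.

```python
# Naive coreference: link each mention to the most recent earlier mention
# that doesn't obviously conflict. Real systems are far more elaborate,
# yet still stumble on exactly this kind of passage.
MALE_TITLES = {"Mr."}
FEMALE_PRONOUNS = {"she", "her"}

mentions = ["Mr. Clinton", "Clinton", "she"]

def compatible(antecedent, mention):
    if mention.lower() in FEMALE_PRONOUNS:
        # "she" should not corefer with a mention explicitly marked male...
        return not any(title in antecedent for title in MALE_TITLES)
    # ...but bare "Clinton" carries no gender, so plain string matching applies.
    return antecedent.split()[-1] == mention.split()[-1]

links = {}
for i, mention in enumerate(mentions):
    for j in range(i - 1, -1, -1):
        if compatible(mentions[j], mention):
            links[mention] = mentions[j]
            break

# {'Clinton': 'Mr. Clinton', 'she': 'Clinton'} -- "she" is chained to "Clinton",
# which is chained to "Mr. Clinton": the very confusion that can put a
# "Richard Nixon" where a first lady belongs.
print(links)
```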

Structuring web-based human knowledge promises to break through today’s dreary performance ceiling. Tom Mitchell at Carnegie Mellon University seeks “growing competence without asymptote,” a language learning system which spirals upward limitlessly. Mitchell’s Never Ending Language Learning (NELL) reads huge tracts of the web, every day. NELL’s blade servers grind half a billion pages of text into atoms of knowledge, extracting simple facts like “Microsoft is a company.” Initially, facts are seeded onto a human-built knowledge scaffold. But the idea is to train NELL in this process and enable automatic accretion of facts into ever-growing crystals of knowledge, adding “headquartered in the city Redmond” to facts about Microsoft, for example. Iterate like only computers can and such simple crystals should complexify and propagate.

But instead NELL slumps lifelessly soon after human hands tip it onto its feet. The accuracy of extracted facts dropped from an estimated 90% to 57%, and human intervention became necessary: “We had to do something,” Mitchell told fellow researchers last year. The interventions became routine, leaving NELL dependent on humans. NELL employs machine learning, but knowledge acquisition might not be machine learnable: “NELL makes mistakes that lead to learning to make additional mistakes,” as NELL’s creators observed.
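A minimal sketch of the bootstrapping idea and its failure mode, using toy sentences and a single hand-seeded pattern of my own devising (nothing like NELL’s actual scale or architecture):

```python
import re

# Toy corpus. NELL reads half a billion pages; a few sentences suffice to show
# the mechanics -- and the failure mode.
corpus = [
    "Microsoft is a company headquartered in Redmond.",
    "Google is a company based in Mountain View.",
    "Enron was a company, as everyone now knows.",
    "My uncle is a company man through and through.",
]

# Seed knowledge supplied by humans, as in NELL's hand-built scaffold.
companies = {"Microsoft"}
patterns = {r"(\w+) is a company"}

# One bootstrapping pass: use patterns to harvest new instances; the full system
# then uses the new instances to harvest new patterns, and iterates.
for pattern in patterns:
    for sentence in corpus:
        for match in re.finditer(pattern, sentence):
            companies.add(match.group(1))

print(companies)
# {'Microsoft', 'Google', 'uncle'} -- "uncle" now counts as a company, and any
# pattern learned from "uncle" will drag in more errors on the next pass:
# mistakes that lead to learning to make additional mistakes.
```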

The program came to believe that F.W. de Klerk is a scientist, not a former president of South Africa—providing little help in resolving ambiguous parsing problems. At the same time, NELL needs better parsing to mine knowledge more accurately: “[W]e know NELL needs to learn to parse,” Mitchell wrote in email. This particular Catch-22 might not be fundamentally blocking. But if NELL can’t enhance the performance of lower-level components, those components might clamp a weighty asymptote on NELL’s progress.

Represent, represent
An older, less surmountable, perhaps impossible problem faced by NELL is how to arrange facts, assuming they can be made immaculate. Facts gleaned by NELL must be pigeon-holed into one of just 270 categories—tight confines for all of knowledge. Mitchell wants NELL to be able to expand these categories. However, while incorrect individual facts might compromise NELL, getting categories wrong would be fatal.

But no one knows how to write a kind of forensic program that accurately reconstructs a taxonomy from its faint imprint in text. Humans manage, but only with bickering. Even organizing knowledge in relatively narrow, scientific domains poses challenges: take small molecules in biology, for example. Some labs just isolate and name the different molecular species. Other researchers with different interests represent a molecule by its weight and the weights of its component parts, information essential to studying metabolism. Still other purposes call for 2D representations (reasoning about reaction mechanisms), whereas docking studies demand 3D representations, and so on.

What a thing is or, more specifically, how you represent it, depends on what you are trying to do—just as the quest for word senses discovered. So even for the apparently simple task of representing a type of molecule, “there is not one absolute answer,” according to Fabien Campagne, research professor of computational biomedicine at Weill Medical College. The implication is that representation isn’t fixed, pre-defined. And new lines of inquiry, wrote Campagne, “may require totally new representations of the same entity.”
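In code, Campagne’s point might look like the sketch below. The classes and numbers are hypothetical, but each representation answers a different question about the “same” molecule, and none subsumes the others.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class NamedSpecies:            # enough for a lab that isolates and names molecules
    name: str

@dataclass
class MassProfile:             # what a metabolism study needs
    total_mass: float
    fragment_masses: List[float]

@dataclass
class Structure3D:             # what a docking study needs
    atoms: List[str]
    coordinates: List[Tuple[float, float, float]]

# Three views of "the same" entity; a new line of inquiry may require a fourth.
glucose_as_name = NamedSpecies("glucose")
glucose_as_masses = MassProfile(180.16, [30.03, 150.13])                      # illustrative numbers
glucose_in_3d = Structure3D(["C", "O"], [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0)])   # stub geometry
print(glucose_as_name, glucose_as_masses, glucose_in_3d, sep="\n")
```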

One of NELL’s biological conceptions is that “blood is a subpart of the body within the right ventricle.” Perhaps this and a complement of many other facts cut in a similar shape can represent blood in a way that answers some purpose or purposes. But it will not apply in discussions of fish blood. (Fish have no right ventricle.) And when it comes to human transfusion, blood is more a subpart of a bag.

The difficulties of representation represented: Marcel Duchamp’s Nude Descending a Staircase, and NELL’s representation of Marcel Duchamp

Particular regions of knowledge can be tamed by effort, or by a scheme imposed through raw exercise of authority. But these fiefdoms resist unification and generally conflict. After millennia of effort, humans have yet to devise a giant plan that would harmonize all knowledge. The Wikipedia folksonomy works well for people but badly for automating reasoning. Blood diamonds and political parties in Africa, for example, share a category but clearly require different handling. One knowledge project, YAGO, simply lops off the Wikipedia taxonomy.

The dream of a database of everything is very much alive. Microsoft Research, from its Beijing lab, recently unveiled a project named Probase, which its creators say contains 2.7 million concepts sifted from 1.68 billion web pages. “[P]robably it already includes most, if not all, concepts of worldly facts that human beings have formed in their mind,” claim the researchers with refreshing idealism. Leaving aside the contention that everything ever thought has been registered on the Internet, there still are no universal injection molds—categories—ready to be blown full of knowledge.

A much earlier, equally ambitious effort called Cyc failed for a number of reasons, but insouciance about knowledge engineering, about what to put where, contributed to Cyc’s collapse. Human beings tried to build Cyc’s knowledgebase by hand, assembling a Jenga stack of about one million facts before giving up.

NELL may be an automated version of Cyc. And it might be even less successful. NELL’s minders already have their hands full tweaking the program’s learning habits to keep fact accuracy up. NELL is inferior to Cyc when it comes to the complexity of knowledge each system can handle. Unless NELL can learn to create categories, people will have to do it, entailing a monumental knowledge engineering effort, and one not guaranteed to succeed. Machine learning relies on examples, which simply might not work for elucidating categories and taxonomies. Undoubtedly, learning categories is far harder than extracting facts.

NELL may also represent a kind of inverse of IBM’s Watson. NELL arguably is creating a huge Jeopardy clue database full of facts like “Microsoft is a company headquartered in Redmond.” NELL and Watson attack essentially the same problem of knowledge, just from different directions. But it will be difficult for NELL to reach even Watson’s level of performance. Watson left untouched the texts among its 15 terabytes of data. NELL eviscerates text, centrifuging the slurry to separate out facts and reassembling them into a formalized structure. That is harder.

And Watson is confined to the wading pool, the factoid shallows of knowledge. The program is out of its depth on questions that require reasoning and understanding of categories. That may be why, in Final Jeopardy, Watson answered “Toronto,” not “Chicago,” for the United States city whose largest airport is named for a World War II hero and second largest for a World War II battle. Watson likely could have separately identified O’Hare and Midway if asked sequential questions. And pegging Chicago as the city served by both airports also presumably would be automatic for the computer. But decomposing and then answering the series appears to have been too hard. NIST dropped such questions—twice—for their perceived insuperability. And yet they are trivial compared to answering questions about the relations between F.W. de Klerk, the African National Congress, and the Pan Africanist Congress, questions of the kind that have stalled progress in parsing.
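The decomposition itself is almost embarrassingly simple once someone (or something) performs it. Here is a sketch, with stub lookup tables standing in for the sub-questions Watson could plausibly answer one at a time.

```python
# Two sub-questions, stubbed as tiny lookup tables mapping airport to city
# (the extra entries are made-up distractors).
city_by_hero_airport = {"O'Hare": "Chicago", "Hypothetical Field": "Toronto"}
city_by_battle_airport = {"Midway": "Chicago", "Another Field": "New York"}

# The clue asks for the city in both answer sets -- a trivial intersection once
# the clue has been decomposed. The decomposition is the part that remains hard.
print(set(city_by_hero_airport.values()) & set(city_by_battle_airport.values()))  # {'Chicago'}
```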

Google vs. language
Google contends with language constantly—and prevails. Most Google queries are actually noun phrases like “washed baby carrots.” To return relevant results, Google needs to know if the query is about a clean baby or clean carrots. Last year, a team of researchers crushed this problem under a trillion-word heap of text harvested from the many-acred Google server farms. Statistically, the two words “baby carrots” show up together more than “washed baby.” Problem solved. Well, mostly.
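The decision rule is simple enough to sketch. The counts below are invented placeholders rather than real Google n-gram counts, but the comparison is the one the researchers describe.

```python
# Deciding the bracketing of "washed baby carrots" by comparing how often the
# competing two-word chunks occur. The counts are invented placeholders.
ngram_counts = {
    ("baby", "carrots"): 500_000,
    ("washed", "baby"): 4_000,
}

def bracket(w1, w2, w3):
    """Return the more likely grouping of a three-word noun phrase."""
    if ngram_counts.get((w2, w3), 0) >= ngram_counts.get((w1, w2), 0):
        return f"({w1} ({w2} {w3}))"   # right-branching: washed (baby carrots)
    return f"(({w1} {w2}) {w3})"       # left-branching: (washed baby) carrots

print(bracket("washed", "baby", "carrots"))  # (washed (baby carrots))
```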

The method works an impressive 95.4% of the time, at least on sentences from The Wall Street Journal. Perhaps as important, accuracy muscled up as the system ingested ever-larger amounts of data. “Web-scale data improves performance,” trumpeted the researchers. “That is, there is no data like more data.” And more data are inevitable. So will the growing deluge wash away the inconveniences of parsing and other language processing problems?

Performance did increase with data, but bang for the byte still dropped—precipitously. Torqueing accuracy up just 0.1% required an order-of-magnitude increase in leverage, to four billion unique word sequences. Powering an ascent to 96% accuracy would require four quadrillion, assuming no further diminution of returns. To reach 97%, begin looking for 40 septillion text specimens.
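The arithmetic behind those numbers, assuming the observed rate of roughly a tenth of a percentage point per tenfold increase in data were to hold indefinitely:

```python
# Extrapolating the reported trend: about +0.1 percentage points of accuracy
# per tenfold increase in unique n-grams, starting from 95.4% at 4 billion.
BASE_ACCURACY = 95.4            # percent
BASE_NGRAMS = 4e9               # unique word sequences
GAIN_PER_TENFOLD = 0.1          # percentage points per 10x more data

def ngrams_needed(target_accuracy):
    orders_of_magnitude = (target_accuracy - BASE_ACCURACY) / GAIN_PER_TENFOLD
    return BASE_NGRAMS * 10 ** orders_of_magnitude

print(f"{ngrams_needed(96.0):.0e}")  # 4e+15 -- four quadrillion
print(f"{ngrams_needed(97.0):.0e}")  # 4e+25 -- forty septillion
```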

Mine the gap: Does the Internet have enough words to solve noun phrases? (Adapted from Pitler et al., “Using Web-scale N-grams to Improve Base NP Parsing Performance”)

More data yielding ever better results is the exception, not the rule. The problem of word senses, for example, is relatively impervious to data-based assaults. “The learning curve moves upward fairly quickly with a few hundred or a few thousand examples,” according to Ted Pedersen, computer science professor at the University of Minnesota, Duluth, “but after a few thousand examples there's usually no further gain/learning, it just all gets very noisy.”

Conceivably, we are now witnessing the data wave in language processing. And it may pass over without sweeping away the problems.

Let the data speak, or silence please
In speech recognition too, according to MIT’s Jim Glass, “There is no data like more data.” Glass, head of MIT’s Spoken Language System Group, continued in email: “Everyone has been wondering where the asymptote will be for years but we are still eking out gains from more data.” However, evidence for continuing advance toward human levels of recognition accuracy is scarce, possibly non-existent.

Nova’s Watson documentary asserts that recognition accuracy is “getting better all the time” (~34:00) but doesn’t substantiate the claim. Replying to an email inquiry, a Nova producer re-asserted that programs like Dragon Naturally Speaking from Nuance “are clearly more accurate and continuing to improve,” but again adduced no evidence.

Guido Gallopyn, vice president of research and development at Nuance, has worked on Dragon Naturally Speaking for over a decade. He says Dragon’s error rate had been cut “more than in half.” But Gallopyn begged off providing actual figures, saying accuracy was “complicated.” He did acknowledge that there was still “a long way to go.” And while Gallopyn has faith that human-level performance can be attained, astonishingly, it is not a goal for which Nuance strives: “We don’t do that,” he stated flatly.

Slate also recently talked up speech recognition, specifically Google Voice. The article claims that programs like Dragon “tend to be slow and use up a lot of your computer's power when deciphering your words,” in contrast to Google’s powerful servers. In the Google cloud, 70 processor-years of data mashing can be squeezed into a single day. Accurate speech recognition then springs from the “magic of data,” but just how magical goes unmeasured. Google too is mum: “We don't have specific metrics to share on accuracy,” a spokesperson for the company said.

By contrast, The Wall Street Journal recently reported on how Google Voice is laughably mistake-prone, serving as the butt of jokes in a new comedic sub-genre.

There is no need for debate or speculation: the NIST benchmarks, gathering dust for a decade, can definitively answer the question of accuracy. The results would be suggestive for the prospects of web-scale data to overcome obstacles in language processing. Computer understanding of language, in turn, has substantial implications for machine intelligence. In any event, claims about recognition accuracy should come with data.

Today, all that can be said is this:

Progress in voice recognition: the sound of one hand clapping since 2001 (Adapted from NIST, “The History of Automatic Speech Recognition Evaluations at NIST”)

To be || not to be
That is the question about machine intelligence.

When Garry Kasparov was asked how IBM might improve Deep Blue, its chess-playing computer, he answered tartly: “Teach it to resign earlier.” Kasparov, then world chess champion, had just soundly defeated Deep Blue. Rather than follow this advice, IBMers put in some faster chips, tweaked the software and then defeated Kasparov in a 1997 rematch. It was IBM’s turn to vaunt: “One hundred years from now, people will say this day was the beginning of the Information Age,” claimed the head of the IBM team. Instead, apart from chess, Deep Blue has had no effect.

If Deep Blue represented an effort to rise above human intelligence by brute computational force, Watson represents the data wave. But we have been inundated by data for some time. Google released its trillion-word torrent of text five years ago. Today the evidence may suggest that the problems of language will remain after the deluge. If the rising tide of world-wide data can’t float computing’s boat to human levels, “What’s the alternative?” demands Eugene Charniak. He perhaps means there is no alternative.

A somewhat radical idea is to revise the parts of speech, as Stanford University’s Christopher Manning has proposed. Disturbingly, Manning asks: “Are part-of-speech labels well-defined discrete properties enabling us to assign each word a single symbolic label?” Recall that words don’t cleanly map to discrete senses, and similarly that things in the world don’t fit into obvious, finite, universal categories. Now the parts of speech seem to be breaking down.
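The discreteness assumption is easy to watch in action. The sketch below, which assumes NLTK and its default tagger model, forces a single label onto words whose category linguists genuinely dispute; the example sentences are mine.

```python
# Requires nltk plus its default tagger model:
#   nltk.download('averaged_perceptron_tagger')
import nltk

# Each word receives exactly one symbolic label, even where the right label is
# disputed (is "fun" in "a fun day" an adjective or a noun? what exactly is "worth"?).
for sentence in ["That was a fun day .", "The painting is worth millions ."]:
    print(nltk.pos_tag(sentence.split()))
```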

Tagging accuracy: time for new parts of speech? (Source: Flickr, Tone Ranger)

Manning is skeptical that machine learning could conjure even 0.2% more accuracy in the tagging of words with their part of speech. Achim Hoffmann, at the University of New South Wales, believes more generally that machine learning now bumps against a ceiling. “New techniques,” he adds, “are not going to substantially change that.” Hoffmann points out that relatively old techniques “are still among the most successful learning techniques today, despite the fact that many thousand new papers have been written in the field since then.”

For Hoffmann, the alternative is to approach intelligence not through language or knowledge but through algorithm. Arguably, however, this is just a return to the very origins of artificial intelligence. John McCarthy, inventor of the term “artificial intelligence,” tried to find a formal logic and mathematical semantics that would lead to human-like reasoning. This project failed. It led to Cyc. As Cyc founder Doug Lenat wrote in 1990: “We don’t believe there is any shortcut to being intelligent, any yet-to-be-discovered Maxwell’s equations of thought.” Forget algorithm. Knowledge would pave the way to common sense. Cyc, of course, also did not work.

Are we just turning in circles, or is the noose cinching tighter with repeated exertions? There is something viscerally compelling—disturbing—about Watson and its triumph. “Cast your mind back 20 years,” as AI researcher Edward Feigenbaum recently said in the pages of The New York Times, “and who would have thought this was possible?” But 20 years ago, Feigenbaum published a paper with Doug Lenat about a project called Cyc. Cyc aimed at full-blown artificial intelligence. Watson stands in relation to a completely realized Cyc the way J. Craig Venter’s synthetic cell stands to the original vision of genetic engineering: a toy.

John McCarthy derided the Kasparov-Deep Blue spectacle, calling it “AI as sport.” Jimmy Lin, the former NIST coordinator, is not derisive but more ho-hum, worldly-wise about Watson: “Like a lot of things,” he says, “it’s a publicity stunt.” Perhaps an artificially intelligent computer wouldn’t fall for it, but people have. The New Yorker sees the triumphs of Deep Blue and Watson as forcing would-be defenders of humanity to move the goalposts back, to re-define the boundaries of intelligence and leave behind the fields recently annexed by computers. But the goalposts arguably have been moved up, so that weak artificial intelligence—artificial AI—can put the ball through the uprights.

The New York Times contends that Watson means “rethinking what it means to be human.” Actually what needs redefinition may be humanity’s relationship to dreams of technological transcendence.

2 responses
It's a brilliant summary of the state of the art and lack of progress in the SCIENCE of language understanding. Please keep in mind though that irrespective of grandiose scientific claims, the underlying TECHNOLOGY is useful and profitable. IBM/Watson, Google, Nuance and others are making it applicable on a large scale. Watson is a next generation search engine, and not a proto-intelligent machine -- the same way an Excel spreadsheet is a better calculator, and not a robotic mathematician.
Again, it's a pleasure to read your prose. It's rare to find generally accessible articles that get AI and NLP right.
Excellent post. I really like it. My respect to the author.